Using BeautifulSoup4 and Python to extract data from an inconsistent HTML page

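I am trying to extract data from this webpage, but I am running into problems because the page's HTML formatting is inconsistent. I have a list of OGAP IDs, and for each OGAP ID I iterate over I want to extract the Gene Name and any literature information (PMID #). Thanks to other questions here and the BeautifulSoup documentation, I can reliably get the gene name for each ID, but I am having trouble with the literature part. Here are a couple of search terms that highlight the inconsistencies.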

Search term that works:

HTML sample: OG00020

Search term that does not work: OG00131

<tr>
    <td colspan="4" bgcolor="#FBFFCC" class="STYLE28">Literature describing O-GlcNAcylation:
    <br>&nbsp;&nbsp;PMID:
    <a href="http://www.ncbi.nlm.nih.gov/pubmed/20068230">20068230</a>
    [CAD, ETD MS/MS]; <br>
    <br>
    </td>
</tr>

HTML sample: 3210

So it seems that some element is throwing the code for a loop. Here is the code I have so far:

import urllib2 
from bs4 import BeautifulSoup 

#define list of genes 

#initialize variables 
gene_list = [] 
literature = [] 
# Test list 
gene_listID = ["OG00894", "OG00980", "OG00769", "OG00834","OG00852", "OG00131","OG00020"] 


for i in range(len(gene_listID)): 
    print gene_listID[i] 
    # Specifies URL, uses the "%" to sub in different ogapIDs based on a list provided 
    dbOGAP = "https://wangj27.u.hpc.mssm.edu/cgi-bin/DB_tb.cgi?textfield=%s&select=Any" % gene_listID[i] 
    # Opens the URL as a page 
    page = urllib2.urlopen(dbOGAP) 
    # Reads the page and parses it through "lxml" format 
    soup = BeautifulSoup(page, "lxml") 

    gene_name = soup.find("td", text="Gene Name").find_next_sibling("td").text 
    print gene_name[1:] 
    gene_list.append(gene_name[1:]) 

    # PubMed IDs are located near the <td> tag with the term "Data and Source" 
    pmid = soup.find("span", text="Data and Source") 

    # Based on inspection of the website, need to move up to the parent <td> tag 
    pmid_p = pmid.parent 

    # Then we move to the next <td> tag, denoted as sibling (since they share parent <tr> (Table row) tag) 
    pmid_s = pmid_p.next_sibling 
    #for child in pmid_s.descendants: 
    # print child 
    # Now we search down the tree to find the next table data (<td>) tag 
    pmid_c = pmid_s.find("td") 
    temp_lit = [] 
    # Next we print the text of the data 
    #print pmid_c.text 
    if "No literature is available" in pmid_c.text:
        temp_lit.append("No literature is available")
        print "Not available"
    else:
        # and then print out a list of urls for each pubmed ID we have
        print "The following is available"
        for link in pmid_c.find_all('a'):
            # the <a> tag includes more than just the link address.
            # for each <a> tag found, print the address (href attribute) and extra bits
            # link.string provides the string that appears to be hyperlinked.
            # In this case, it is the pubmedID
            print link.string
            temp_lit.append("PMID: " + link.string + " URL: " + link.get('href'))
    literature.append(temp_lit) 
    print "\n"

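Is there a way to find the element containing the text "PMID" and return the text that comes after it (and the URL, if there is a PMID number)? If not, should I just check each child, looking for the text I am interested in?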

I am using Python 2.7.10.

Source

2016-12-05 Peter M.

import re

import requests
from bs4 import BeautifulSoup

gene_listID = ["OG00894", "OG00980", "OG00769", "OG00834", "OG00852", "OG00131", "OG00020"]
urls = ('https://wangj27.u.hpc.mssm.edu/cgi-bin/DB_tb.cgi?textfield={}&select=Any'.format(i) for i in gene_listID)

# Matches PubMed links; compiled once, with the dots escaped
regex = re.compile(r'http://www\.ncbi\.nlm\.nih\.gov/pubmed/\d+')

for url in urls:
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')

    # First <a> whose href is a PubMed URL (None if the page has no such link)
    a_tag = soup.find('a', href=regex)
    # The link only counts if the element just before it contains 'PMID'
    has_pmid = a_tag is not None and 'PMID' in a_tag.previous_element

    if has_pmid:
        print(a_tag.text, a_tag.next_sibling, a_tag.get("href"))
    else:
        print("Not available")

Output:

18984734 [GalNAz-Biotin tagging, CAD MS/MS]; http://www.ncbi.nlm.nih.gov/pubmed/18984734 
20068230 [CAD, ETD MS/MS]; http://www.ncbi.nlm.nih.gov/pubmed/20068230 
20068230 [CAD, ETD MS/MS]; http://www.ncbi.nlm.nih.gov/pubmed/20068230 
Not available 
16408927 [Azide-tag, nano-HPLC/tandem MS]; http://www.ncbi.nlm.nih.gov/pubmed/16408927 
Not available 
16408927 [Azide-tag, nano-HPLC/tandem MS] http://www.ncbi.nlm.nih.gov/pubmed/16408927?dopt=Citation

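Find the first a tag whose href matches the target URL ending in a number, then check whether 'PMID' is in its previous element. This site is very inconsistent and it took me many tries; I hope this helps.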
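To see what the `previous_element` check does without hitting the site, here is a minimal offline sketch run against the row from the question's HTML sample. It assumes Python 3 and the built-in `html.parser` backend instead of `lxml`; the helper name `first_pmid` and the wrapping `<table>` are illustrative only and not part of the original code.

```python
import re
from bs4 import BeautifulSoup

# Row copied from the HTML sample in the question (wrapped in a <table> here).
SAMPLE_ROW = """
<table><tr>
<td colspan="4" bgcolor="#FBFFCC" class="STYLE28">Literature describing O-GlcNAcylation:
<br>&nbsp;&nbsp;PMID:
<a href="http://www.ncbi.nlm.nih.gov/pubmed/20068230">20068230</a>
[CAD, ETD MS/MS]; <br>
<br>
</td>
</tr></table>
"""

PUBMED_HREF = re.compile(r"http://www\.ncbi\.nlm\.nih\.gov/pubmed/\d+")


def first_pmid(html):
    """Return (pmid, note, url) for the first PubMed link labelled 'PMID', else None.

    Hypothetical helper for illustration; assumes the label text sits
    directly before the link, as in the sample row.
    """
    soup = BeautifulSoup(html, "html.parser")
    a_tag = soup.find("a", href=PUBMED_HREF)
    if a_tag is None:
        return None  # no PubMed link on the page at all
    # previous_element is whatever was parsed just before the <a> tag,
    # here the text node that contains "PMID:".
    if "PMID" not in str(a_tag.previous_element):
        return None
    # next_sibling is the text node after the link (the method note).
    sibling = a_tag.next_sibling
    note = str(sibling).strip() if sibling is not None else ""
    return a_tag.get_text(strip=True), note, a_tag["href"]


print(first_pmid(SAMPLE_ROW))
```

Returning `None` for pages without a matching link keeps the caller's "Not available" branch simple and avoids an `AttributeError` when `soup.find` finds nothing.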

Source

2016-12-06 01:26:45

Hey, thanks for the help. I will have to play around with this to see whether I can get all of the literature using this method.
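One possible way to collect every PubMed entry on a page rather than only the first is to swap `find` for `find_all`. The sketch below is only an illustration under assumptions: it targets Python 3 with the built-in `html.parser`, the cell markup is made up (the real pages may be laid out differently), and `all_pmids` is a hypothetical helper name.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical cell with two PubMed links and one unrelated link.
# The pairing of these PMIDs on one page is invented for the example.
SAMPLE_CELL = """
<td>Literature describing O-GlcNAcylation:
<br>PMID: <a href="http://www.ncbi.nlm.nih.gov/pubmed/20068230">20068230</a> [CAD, ETD MS/MS];
<br>PMID: <a href="http://www.ncbi.nlm.nih.gov/pubmed/16408927?dopt=Citation">16408927</a> [Azide-tag, nano-HPLC/tandem MS]
<br><a href="http://example.com/other">unrelated link</a>
</td>
"""

# Capture the numeric ID so that suffixes such as "?dopt=Citation" do not matter.
PUBMED_HREF = re.compile(r"ncbi\.nlm\.nih\.gov/pubmed/(\d+)")


def all_pmids(html):
    """Return a list of (pmid, url) pairs for every PubMed link in the markup."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for a_tag in soup.find_all("a", href=PUBMED_HREF):
        pmid = PUBMED_HREF.search(a_tag["href"]).group(1)
        results.append((pmid, a_tag["href"]))
    return results


for pmid, url in all_pmids(SAMPLE_CELL):
    print("PMID: " + pmid + " URL: " + url)
```

Taking the ID from the `href` instead of the link text sidesteps differences in how each page formats the visible label. An empty list then plays the role of "No literature is available".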
