Python web scraping: special characters

from urllib.request import urlopen  # Python 2: from urllib2 import urlopen
from bs4 import BeautifulSoup

cityURL = 'https://en.wikipedia.org/wiki/Elko,_Nevada'

def createObj(url):
    # Download the page and parse it with the lxml parser.
    html = urlopen(url)
    bsObj = BeautifulSoup(html, 'lxml')
    return bsObj

bsObj1 = createObj(cityURL)

table1 = bsObj1.find("table", {"class": "infobox geography vcard"})
incorporated = table1.find("th", text='Incorporated (city)').findNext('td').get_text()

table1.find("th", text='. Total')  # Problem here: because of the special dot, I cannot identify the "th"

I want to extract data from the page, and I expect to get the following:

Total, 17.6 Land, 17.6 Water, 0.0

Source

2017-03-17 zhan2383

Please accept the answer if it worked for you, as it may help others too. Thanks.

The "•" on the page is not a "dot". It is the Unicode character BULLET (\u2022).

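To check which character a scraped string really starts with, the standard unicodedata module can name it. This is a minimal sketch: the string `label` is an assumed stand-in for the header text, not something fetched from the page.

```python
import unicodedata

label = "\u2022 Total"  # assumed header text, standing in for the scraped <th>
typed = ". Total"       # what the question's code searched for

def describe(ch):
    # Return the code point and official Unicode name of one character.
    return "U+{:04X} {}".format(ord(ch), unicodedata.name(ch))

first_scraped = describe(label[0])
first_typed = describe(typed[0])

print(first_scraped)   # U+2022 BULLET
print(first_typed)     # U+002E FULL STOP
print(label == typed)  # False, so an exact text match cannot succeed
```
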
To handle this, you can use Python's regex (re) module.

The updated code looks like this:

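import re
from urllib.request import urlopen  # Python 2: from urllib2 import urlopen
from bs4 import BeautifulSoup

cityURL = 'https://en.wikipedia.org/wiki/Elko,_Nevada'

def createObj(url):
    # Download the page and parse it with the lxml parser.
    html = urlopen(url)
    bsObj = BeautifulSoup(html, 'lxml')
    return bsObj

bsObj1 = createObj(cityURL)

table1 = bsObj1.find("table", {"class": "infobox geography vcard"})
incorporated = table1.find("th", text='Incorporated (city)').findNext('td').get_text()

# Match any <th> whose text contains "Total", whatever character precedes it.
pattern = re.compile(r'Total')
table1.find("th", text=pattern)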
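The reason the pattern helps can be shown without fetching the page. As far as I know, BeautifulSoup tests a compiled pattern with search(), which looks anywhere in the string, so the leading bullet no longer matters. The sketch below uses only the re module, and the `headers` list is a hypothetical stand-in for the infobox header texts.

```python
import re

# Hypothetical header strings, standing in for the text of the infobox <th> cells.
headers = ["Incorporated (city)", "\u2022 Total", "\u2022 Land", "\u2022 Water"]

pattern = re.compile(r'Total')

# search() looks for the pattern anywhere in the string,
# so the character in front of "Total" is irrelevant.
matched = [h for h in headers if pattern.search(h)]

# Exact-match alternative: write the bullet as an escape instead of typing a dot.
exact = [h for h in headers if h == "\u2022 Total"]

print(matched == exact)  # True
```

The exact-match form only works if the whitespace after the bullet on the real page is a plain space, which is an assumption here; the regex is more forgiving. Note also that find() returns only the first matching element.
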

You can also use the lxml module, which is much faster than BeautifulSoup.

import requests 
from lxml import html 

cityURL='https://en.wikipedia.org/wiki/Elko,_Nevada' 
r = requests.get(cityURL) 
root = html.fromstring(r.content) 

def normalize(text):
    # Replace non-ASCII characters (such as the bullet) with spaces, then keep the first word.
    return ''.join([i if ord(i) < 128 else ' ' for i in text]).strip().split()[0]

# XPath for the text nodes of the n-th row following the "Area" header row.
row_xpath = '//table[@class="infobox geography vcard"]//tr[./th/text()="Area"]/following-sibling::tr[{}]//text()'

val_list = []
for val in range(1, 4):
    texts = root.xpath(row_xpath.format(val))
    # texts[1] holds the label and texts[3] the value for this row.
    val_list.append((normalize(texts[1]), normalize(texts[3])))
print(val_list)

The code above prints the following (the u prefixes appear under Python 2):

[(u'Total', u'17.6'), (u'Land', u'17.6'), (u'Water', u'0.0')]
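To see what normalize does to a single cell, here is a small offline sketch. The two sample strings are assumptions about what the label and value cells might contain (a bullet before the label, non-breaking spaces in the value); they are not copied from the page.

```python
def normalize(text):
    # Replace non-ASCII characters with spaces, then keep the first word.
    return ''.join([i if ord(i) < 128 else ' ' for i in text]).strip().split()[0]

label_cell = "\u2022 Total"        # assumed label text with a leading bullet
value_cell = "17.6\xa0sq\xa0mi"    # assumed value text with non-breaking spaces

label = normalize(label_cell)
value = normalize(value_cell)

print((label, value))  # ('Total', '17.6')
```

One caveat: a string made up only of non-ASCII characters or whitespace leaves nothing to split, so split()[0] raises IndexError. Real rows may need a guard for that case.
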

Source

2017-03-17 08:47:11
