2016-08-28 26 views

Python BeautifulSoup parses table incorrectly

I am trying to parse a table and write its data to a CSV file, but BeautifulSoup does not parse the table correctly.

import os
from urllib.parse import urlparse

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'http://projects.fivethirtyeight.com/2016-election-forecast/florida/'
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
the_table = soup.find("table", attrs={"class": "t-desktop t-polls"})

# Column names for cells 2 through 11 of each poll row.
columns = ['date', 'pollster', 'grade', 'sample', 'weight',
           'clinton', 'trump', 'johnson', 'leader', 'adjusted']

records = []
for row in the_table.tbody.find_all('tr'):
    # Only rows that carry a data-created attribute are poll rows.
    if 'data-created' in row.attrs:
        text_cols = [ele.text.strip() for ele in row.find_all('td')]
        records.append(text_cols[2:12])

df = pd.DataFrame(records, columns=columns)

# Name the CSV after the state segment of the URL path, e.g. florida.csv.
s = urlparse(url)
f = os.getcwd() + "/" + s.path.split('/')[-2] + '.csv'
df.to_csv(f)

That is the code I am using. The page is http://projects.fivethirtyeight.com/2016-election-forecast/arizona/

It saves a CSV file with the wrong data:

,date  ,pollster    ,grade,sample ,weight,clinton,trump,johnson,leader  ,adjusted 
0,Aug. 21-27,USC Dornsife/LA Times,  ,"2,545",LV ,44% ,44% ,  ,Clinton +1 ,Clinton +4 
1,Aug. 24-26,Morning Consult  ,  ,"2,007",RV ,39% ,37% ,8%  ,Clinton +2 ,Clinton +2 
2,Aug. 20-26,USC Dornsife/LA Times,  ,"2,460",LV ,45% ,43% ,  ,Clinton +1 ,Clinton +5 
3,Aug. 19-25,Ipsos    ,A- ,334 ,LV ,50% ,43% ,  ,Clinton +7 ,Clinton +7 
4,Aug. 19-25,Ipsos    ,A- ,500 ,LV ,53% ,31% ,  ,Clinton +22,Clinton +22 
5,Aug. 19-25,Ipsos    ,A- ,443 ,LV ,32% ,45% ,  ,Trump +13 ,Trump +13 
6,Aug. 19-25,Ipsos    ,A- ,518 ,LV ,61% ,25% ,  ,Clinton +36,Clinton +36 
7,Aug. 19-25,Ipsos    ,A- ,392 ,LV ,47% ,41% ,  ,Clinton +7 ,Clinton +7 
8,Aug. 19-25,Ipsos    ,A- ,666 ,LV ,49% ,42% ,  ,Clinton +7 ,Clinton +7 
and so on..... 

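Changing the BeautifulSoup parser makes no difference; the table is still parsed incorrectly. If I manually save the table as copied from the Chrome inspector or Firefox Firebug, it works. Here is the CSV with the correct data generated that way: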

,date       ,pollster,grade ,sample,weight,clinton,trump,johnson,leader ,adjusted 
0 ,Ipsos      ,A-  ,362 ,LV ,0.67 ,43% ,46% ,  ,Trump +3 ,Trump +3 
1 ,CNN/Opinion Research Corp. ,A-  ,809 ,LV ,1.40 ,38% ,45% ,12% ,Trump +7 ,Trump +7 
2 ,Ipsos      ,A-  ,438 ,LV ,0.25 ,39% ,47% ,  ,Trump +8 ,Trump +8 
3 ,YouGov      ,B  ,"1,095",LV ,0.65 ,42% ,44% ,5%  ,Trump +2 ,Trump +1 
4 ,OH Predictive Insights/MBQF,C+  ,996 ,LV ,0.44 ,45% ,42% ,4%  ,Clinton +3,Clinton +2 
5 ,Integrated Web Strategy  ,  ,679 ,LV ,0.35 ,41% ,49% ,3%  ,Trump +8 ,Trump +5 
6 ,Public Policy Polling  ,B+  ,691 ,V  ,0.49 ,40% ,44% ,  ,Trump +4 ,Trump +1 
7 ,OH Predictive Insights/MBQF,C+  ,"1,060",LV ,0.16 ,47% ,42% ,  ,Clinton +4,Clinton +4 
8 ,Greenberg Quinlan Rosner  ,B-  ,300 ,LV ,0.23 ,39% ,45% ,10% ,Trump +6 ,Trump +6 
9 ,Public Policy Polling  ,B+  ,896 ,V  ,0.20 ,38% ,40% ,6%  ,Trump +2 ,Tie 
10,Behavior Research Center  ,A  ,564 ,RV ,0.16 ,42% ,35% ,  ,Clinton +7,Clinton +5 
11,Merrill Poll     ,B  ,701 ,LV ,0.11 ,38% ,38% ,  ,Tie  ,Tie 
12,Strategies 360    ,B  ,504 ,LV ,0.03 ,42% ,44% ,  ,Trump +2 ,Tie 

Why does BeautifulSoup parse the full HTML fetched from the web incorrectly?
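One way to narrow this down, offered as a sketch rather than a confirmed diagnosis: list the poll rows in the raw HTML that `requests` receives and compare them with the rows in the markup copied from the browser inspector. If the page rewrites the table with JavaScript after loading (an assumption here, not something the page source has been checked for), the two lists differ, and switching parsers cannot change that. The helper `poll_row_keys` and both sample HTML strings are hypothetical, and the standard-library `html.parser` is used only to keep the example dependency-free.

```python
from html.parser import HTMLParser


class PollRowCollector(HTMLParser):
    """Collects the data-created value of every <tr> that has one."""

    def __init__(self):
        super().__init__()
        self.created = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "tr":
            attr_map = dict(attrs)
            if "data-created" in attr_map:
                self.created.append(attr_map["data-created"])


def poll_row_keys(html):
    """Return the data-created values of the poll rows in an HTML string."""
    collector = PollRowCollector()
    collector.feed(html)
    collector.close()
    return collector.created


# Hypothetical samples: markup as the server might send it, and markup as the
# browser might show it after scripts have run. Real pages will differ.
served_html = (
    "<table class='t-desktop t-polls'><tbody>"
    "<tr data-created='2016-08-27'><td>USC Dornsife/LA Times</td></tr>"
    "<tr data-created='2016-08-26'><td>Morning Consult</td></tr>"
    "</tbody></table>"
)
rendered_html = (
    "<table class='t-desktop t-polls'><tbody>"
    "<tr data-created='2016-08-25'><td>Ipsos</td></tr>"
    "</tbody></table>"
)

served_keys = poll_row_keys(served_html)
rendered_keys = poll_row_keys(rendered_html)
print("served:", served_keys)
print("rendered:", rendered_keys)
print("same rows:", served_keys == rendered_keys)
```

With real data, `served_html` would be `r.text` from the request and `rendered_html` the table saved by hand from the inspector.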

[EDIT: SOLVED] This code extracts the JSON object race.stateData from the script tag using a regular expression. With that, the data is finally parsed.

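import json
import re

import requests
from bs4 import BeautifulSoup

# url is the state page URL defined in the first script.
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
script = soup.body.script.text
script = script.replace("\n", "")
# Capture the JSON assigned to race.stateData, up to the race.path assignment.
re_match = re.match(r'.*race\.stateData = (.*);race\.path', script)
str_json = re_match.group(1)
j = json.loads(str_json)
# parsing data code not relevant..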
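The extraction step can be tried without any network access. The following is a minimal sketch under stated assumptions: `sample_script` is a made-up stand-in for the page's inline script, its keys (`state`, `polls`, `pollster`, and so on) are invented for illustration, and `extract_state_data` is a hypothetical helper name. It uses `re.search` with a non-greedy group and `re.DOTALL`, a slight variation on the snippet above, so newlines do not have to be stripped first.

```python
import json
import re

# Hypothetical stand-in for the page's inline script. The real payload is
# larger and its actual keys are not shown in this post.
sample_script = """
var race = {};
race.stateData = {"state": "arizona", "polls": [{"pollster": "Ipsos", "clinton": 43, "trump": 46}]};
race.path = "/2016-election-forecast/arizona/";
"""


def extract_state_data(script_text):
    """Return the object assigned to race.stateData, or None if not found."""
    # Non-greedy capture that stops at the following race.path assignment.
    # DOTALL lets the JSON span several lines.
    match = re.search(
        r"race\.stateData\s*=\s*(.*?);\s*race\.path",
        script_text,
        re.DOTALL,
    )
    if match is None:
        return None
    return json.loads(match.group(1))


data = extract_state_data(sample_script)
print(data["state"])
print(data["polls"][0]["pollster"])
```

Returning `None` when the pattern is missing avoids the `AttributeError` that calling `.group(1)` on a failed match would raise if the page layout changes.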

