Scraping a table from myneta with R

I am trying to scrape the table from http://myneta.info/uttarpradesh2017/index.php?action=summary&subAction=candidates_analyzed&sort=candidate#summary into RStudio.

Here is the code:

library(rvest)

url <- 'http://myneta.info/uttarpradesh2017/index.php?action=summary&subAction=candidates_analyzed&sort=candidate#summary'
webpage <- read_html(url)
# XPath copied from Chrome's inspector
candidate_info <- html_nodes(webpage, xpath = '//*[@id="main"]/div/div[2]/div[2]/table')
candidate_info <- html_table(candidate_info)
head(candidate_info)

But I am not getting any output. Any suggestions as to what I am doing wrong?

Source

2017-12-30 Prakhar Agarwal

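What led you to that XPath? – hrbrmstr

I inspected the page in Chrome, hovered over the tags until the table was highlighted, and copied that HTML tag's XPath. –

I provided a different XPath in the answer. The problem with the copy method (and one big problem for Selector Gadget users) is that the _rendered_ HTML can be quite different from the _source_ HTML, and `rvest` (`xml2`, really) works with the _source_ HTML. – hrbrmstr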

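This site has some very broken HTML. But it is doable.

I find it better to target nodes in a slightly less fragile way. The XPath below finds the table by its contents.

`html_table()` did not work for me here, so I ended up building the table "by hand".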
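To make the rendered-versus-source point concrete, here is a minimal offline sketch. The HTML string is a hypothetical stand-in, not the real myneta markup, and the implied `<tbody>` is only one common way the browser's DOM differs from the source; it is not confirmed to be what goes wrong on the myneta page. It shows a browser-style path matching nothing while a content-based XPath still finds the table.

```r
library(rvest)

# Hypothetical source HTML (assumption, not the real page): no <tbody> in the markup
src <- '<div id="main"><table>
  <thead><tr><th>Candidate</th><th>Liabilities</th></tr></thead>
  <tr><td>A Hasiv</td><td>Rs 58,46,335</td></tr>
</table></div>'
doc <- read_html(src)

# A browser shows an implied <tbody> in its rendered DOM, so a copied
# XPath often includes it; the parsed source has no such element
copied <- html_nodes(doc, xpath = '//*[@id="main"]/table/tbody/tr')
length(copied)

# Targeting the table by its header text does not depend on the exact nesting
by_content <- html_nodes(doc, xpath = ".//table[contains(thead, 'Liabilities')]")
length(by_content)
```

Comparing the two lengths against the page source is a quick way to check whether a copied path is the problem before reaching for `html_table()`.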

library(rvest) 

# helper to clean column names 
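mcga <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:][:space:]]+", "_", x) # punctuation/whitespace runs -> "_"
  x <- gsub("_+", "_", x)                    # collapse repeated underscores
  x <- gsub("(^_|_$)", "", x)                # drop leading/trailing underscores
  make.unique(x, sep = "_")
}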

pg <- read_html("http://myneta.info/uttarpradesh2017/index.php?action=summary&subAction=candidates_analyzed&sort=candidate#summary") 

# target the table 
tab <- html_node(pg, xpath=".//table[contains(thead, 'Liabilities')]") 

# get the rows so we can target columns 
rows <- html_nodes(tab, xpath=".//tr[td[not(@colspan)]]") 

# make a data frame 
do.call(
    cbind.data.frame, 
    c(lapply(1:8, function(i) { 
    html_text(html_nodes(rows, xpath=sprintf(".//td[%s]", i)), trim=TRUE) 
    }), list(stringsAsFactors=FALSE)) 
) -> xdf 

# make nicer names 
xdf <- setNames(xdf, mcga(html_text(html_nodes(tab, "th")))) # get the header to get column names 

str(xdf) 
## 'data.frame': 4823 obs. of 8 variables: 
## $ sno   : chr "1" "2" "3" "4" ... 
## $ candidate : chr "A Hasiv" "A Wahid" "Aan Shikhar Shrivastava" "Aaptab Urf Aftab" ... 
## $ constituency : chr "ARYA NAGAR" "GAINSARI" "GOSHAINGANJ" "MUBARAKPUR" ... 
## $ party  : chr "BSP" "IND" "Satya Shikhar Party" "Islam Party Hind" ... 
## $ criminal_case: chr "0" "0" "0" "0" ... 
## $ education : chr "12th Pass" "10th Pass" "Graduate" "Illiterate" ... 
## $ total_assets : chr "Rs 3,94,24,827 ~ 3 Crore+" "Rs 75,106 ~ 75 Thou+" "Rs 41,000 ~ 41 Thou+" "Rs 20,000 ~ 20 Thou+" ... 
## $ liabilities : chr "Rs 58,46,335 ~ 58 Lacs+" "Rs 0 ~" "Rs 0 ~" "Rs 0 ~" ...
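Since the live page may change or be unavailable, the same column-by-column approach can be tried offline. The sketch below uses a hypothetical three-column miniature of the table (the markup, the `colspan` filler row, and the name `mini` are assumptions for illustration), with the `mcga()` helper repeated so the block runs on its own.

```r
library(rvest)

# Hypothetical miniature of the table (not the real myneta markup)
html <- '<table>
  <thead><tr><th>Sno</th><th>Candidate</th><th>Total Assets</th></tr></thead>
  <tr><td colspan="3">filler row spanning all columns</td></tr>
  <tr><td>1</td><td>A Hasiv</td><td>Rs 3,94,24,827 ~ 3 Crore+</td></tr>
  <tr><td>2</td><td>A Wahid</td><td>Rs 75,106 ~ 75 Thou+</td></tr>
</table>'

# same column-name cleaner as in the answer
mcga <- function(x) {
  x <- gsub("[[:punct:][:space:]]+", "_", tolower(x))
  make.unique(gsub("(^_|_$)", "", gsub("_+", "_", x)), sep = "_")
}

# find the table by header text, then keep only rows with plain <td> cells
tab  <- html_node(read_html(html), xpath = ".//table[contains(thead, 'Assets')]")
rows <- html_nodes(tab, xpath = ".//tr[td[not(@colspan)]]")

# pull each column's text separately, then bind the columns together
cols <- lapply(1:3, function(i) {
  html_text(html_nodes(rows, xpath = sprintf(".//td[%s]", i)), trim = TRUE)
})
mini <- do.call(cbind.data.frame, c(cols, list(stringsAsFactors = FALSE)))
mini <- setNames(mini, mcga(html_text(html_nodes(tab, "th"))))

str(mini)
```

The `colspan` filter is what skips the filler row, and the header row drops out because it contains `<th>` rather than `<td>` cells.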

Source

2017-12-30 18:17:26 hrbrmstr
