How to read HTML lists from a web page in R


Consider a website that has four or more lists made up of <li> HTML elements. For example, this one: https://www.cprd.com/bibliography/bibliography.html

Using xml2 (or another approach, although xml2 with piping is preferred), what is the best way to extract the lists into character vectors?

url <- 'https://www.cprd.com/bibliography/bibliography.html' 
library(xml2) 
page <- read_html(url)

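The output should be a list of the <li> lists. (There is one list for each year.)

The first item of the first list should correspond to "Assessment of channeling bias among initiators of glucose-lowering drugs: A UK cohort study. Ankarfeldt MZ, Thorsted BL, Groenwold RH, Adalsteinsson E, Ali MS, Klungel OH. Clin Epidemiol. 2017;9:19-30."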

EDIT: Using rvest, as suggested in the comments (a slight improvement over xml2):

library(rvest) 
output<-page %>% html_nodes('ol') %>% lapply(html_nodes, 'li') %>% lapply(html_text, trim = TRUE) 
output[[1]][1] 

[1] "Assessment of channeling bias among initiators of glucose-lowering drugs: A UK cohort study. \r\n  Ankarfeldt MZ, Thorsted BL, Groenwold RH, Adalsteinsson E, Ali MS, Klungel OH. Clin Epidemiol. 2017;9:19<96>30."
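
For comparison, the same extraction can be sketched with xml2 alone, since the question asks for it. This is a minimal sketch, not a tested solution for the live page: `sample_html` is a hypothetical stand-in so that the code runs without network access, and it assumes the real page uses plain <ol>/<li> markup in the same way.

```r
library(xml2)

# Hypothetical stand-in for the real page, so the sketch runs offline
sample_html <- '<html><body>
  <ol><li> Paper A. \r\n Author X. 2017 </li><li>Paper B. Author Y. 2017</li></ol>
  <ol><li>Paper C. Author Z. 2016</li></ol>
</body></html>'

page <- read_html(sample_html)  # for the real page: read_html(url)

# Find every <ol>, then collect the text of the <li> nodes inside each one
ols <- xml_find_all(page, ".//ol")
output <- lapply(ols, function(ol) {
  items <- xml_find_all(ol, ".//li")
  xml_text(items, trim = TRUE)  # trim = TRUE strips leading/trailing whitespace
})

output[[2]][1]
```

The result has the same shape as the rvest version: one character vector per <ol>.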

Source

2017-02-15 userJT

Have you tried anything? What problems are you running into? – Jota

You can try using the [`rvest`](http://stat4701.github.io/edav/2015/04/02/rvest_tutorial/) package for this: `library(rvest); read_html('https://www.cprd.com/bibliography/bibliography.html') %>% html_nodes('ol') %>% lapply(., function(x) html_nodes(x, 'li') %>% html_text())` – Abdou

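@Abdou If you call `lapply` twice (or `purrr::map`, which becomes more useful as things get more complicated), the code is easier to read: `h2 %>% html_nodes('ol') %>% lapply(html_nodes, 'li') %>% lapply(html_text, trim = TRUE)`. In terms of timing, the two are about the same. – alistaire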

I would suggest:

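library(rvest)
library(purrr)  # provides map(); rvest does not export it

url <- 'https://www.cprd.com/bibliography/bibliography.html'

# One character vector of cleaned <li> texts per <ol>
page <- read_html(url) %>%
    html_nodes('ol') %>%
    map(~ html_nodes(.x, 'li') %>%
        html_text() %>%
        gsub(pattern = '\\t|\\r|\\n', replacement = ''))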

The gsub call takes care of stripping out special characters such as newlines and tabs.
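
One side effect is worth noting: deleting only the control characters can leave runs of spaces behind where a line break sat between two indented fragments. The sketch below uses base R only and shows an alternative that collapses any run of whitespace into a single space. The sample string `x` is a shortened copy of the first item shown in the question, and the variable names are only illustrative.

```r
# Shortened copy of the first item from the question, with its embedded "\r\n"
x <- "Assessment of channeling bias among initiators of glucose-lowering drugs: A UK cohort study. \r\n  Ankarfeldt MZ, Thorsted BL."

# Deleting only tabs, carriage returns and newlines keeps the surrounding spaces
deleted <- gsub("\\t|\\r|\\n", "", x)

# Collapsing every run of whitespace to one space, then trimming both ends
squished <- trimws(gsub("[[:space:]]+", " ", x))

squished
```

Applied inside the map() or lapply() call, this would replace the gsub step; whether it matters depends on how the text is used afterwards.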

Source

2017-02-15 20:40:50 GGamba
