Using tryCatch in R while scraping

-1

I have a data table (npi1_list) containing ID numbers, and I am scraping a website based on those numbers. The NPI numbers in the list match the ID numbers used by the website.

library("rvest")
library("data.table")
final <- NULL
for (i in 8000:200000) {
    # Build the lookup URL for the i-th NPI number (the URL must be a single unbroken string)
    url <- paste("http://www.npinumberlookup.org/getResultDetails.php?npinum=", npi1_list[i, 1], sep = '')
    webpage <- read_html(url)
    # The name cell sits in either the 8th or the 6th table, depending on the page
    Name <- html_nodes(webpage, 'table:nth-child(8) tr:nth-child(1) td~ td+ td , table:nth-child(6) tr:nth-child(1) td~ td+ td')
    rank_data <- html_text(Name)
    final <- rbind(final, rank_data)
    print(i)
    Sys.sleep(1)
}

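This works fine, but from time to time I get a connection timeout error on port 80. When that happens I have to reset i to the point where the loop stopped and re-run the loop. How can I implement a try/catch option so that the loop runs automatically through all 200000 rows?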

Source

2017-11-05 Dinesh Kumar V

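Why aren't you using the [downloadable full database](http://download.cms.gov/nppes/NPI_Files.html) and/or the weekly updates? Hammering a site, **especially one that explicitly states a minimum [20 second crawl delay](http://www.npinumberlookup.org/robots.txt)**, is not cool, especially for data that can be found in bulk elsewhere. I would also be concerned about data integrity issues, since it is not an authoritative source. – hrbrmstr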

library("rvest")
library("data.table")
final <- NULL
for (i in 8000:200000) {
    # Keep retrying row i until the request succeeds
    repeat {
        successful <- TRUE
        tryCatch({
            url <- paste("http://www.npinumberlookup.org/getResultDetails.php?npinum=", npi1_list[i, 1], sep = '')
            webpage <- read_html(url)
            Name <- html_nodes(webpage, 'table:nth-child(8) tr:nth-child(1) td~ td+ td , table:nth-child(6) tr:nth-child(1) td~ td+ td')
            rank_data <- html_text(Name)
            final <- rbind(final, rank_data)
            print(i)
        }, error = function(e) {
            print(e)
            print(paste0('connection error on ', i))
            # <<- is needed because the handler runs in its own function scope
            successful <<- FALSE
        })
        Sys.sleep(1)
        if (successful)
            break
    }
}
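
The loop above retries a failing row forever, so an ID that always errors would stall the run. As a variation, here is a minimal sketch of a bounded retry, assuming it is acceptable to skip a row after a few failures. The helper name `with_retry` and its parameters `max_attempts` and `wait_seconds` are hypothetical and not part of rvest. The default wait of 20 seconds is an assumption taken from the crawl delay mentioned in the comment on the question.

```r
# Hypothetical helper: call fn() until it succeeds or max_attempts is reached.
# Returns the value of fn(), or NULL if every attempt raised an error.
with_retry <- function(fn, max_attempts = 5, wait_seconds = 20) {
  for (attempt in seq_len(max_attempts)) {
    result <- tryCatch(
      list(ok = TRUE, value = fn()),
      error = function(e) {
        message("attempt ", attempt, " failed: ", conditionMessage(e))
        list(ok = FALSE, value = NULL)
      }
    )
    if (result$ok) return(result$value)
    # Wait before the next attempt, but not after the last one
    if (attempt < max_attempts) Sys.sleep(wait_seconds)
  }
  NULL
}

# Usage inside the scraping loop (not run here; needs rvest and network access):
# webpage <- with_retry(function() read_html(url))
# if (is.null(webpage)) next  # give up on this row and move on
```

Because the handler returns a value instead of assigning with `<<-`, the success flag stays local to the helper, and the loop body only has to check for `NULL`.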

Source

2017-11-05 06:52:13
