R: parsing incomplete text from a web page (HTML)

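I am trying to parse plain text from multiple scientific articles for subsequent text analysis. So far I have been using an R script by Tony Breyal, based on the packages RCurl and XML. This works fine for all targeted journals except those published by http://www.sciencedirect.com. When I try to parse articles from SD (and this is consistent for all tested journals that I need to access through SD), the text object in R stores only the first part of the whole document. Unfortunately, I am not very familiar with HTML, but since it works in all other cases I think the problem must be in the SD HTML code. I am aware that some journals are not open access, but I have access permission, and the problem also occurs with open-access articles (see the example). This is the code from GitHub: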

htmlToText <- function(input, ...) {
  ###--- PACKAGES ---###
  require(RCurl)
  require(XML)


  ###--- LOCAL FUNCTIONS ---###
  # Determine how to grab html for a single input element
  evaluate_input <- function(input) {
    # if input is a .html file
    if (file.exists(input)) {
      char.vec <- readLines(input, warn = FALSE)
      return(paste(char.vec, collapse = ""))
    }

    # if input is html text
    if (grepl("</html>", input, fixed = TRUE)) return(input)

    # if input is a URL, probably should use a regex here instead?
    if (!grepl(" ", input)) {
      # download SSL certificate in case of https problem
      if (!file.exists("cacert.perm")) download.file(url = "http://curl.haxx.se/ca/cacert.pem", destfile = "cacert.perm")
      return(getURL(input, followlocation = TRUE, cainfo = "cacert.perm"))
    }

    # return NULL if none of the conditions above apply
    return(NULL)
  }

  # convert HTML to plain text
  convert_html_to_text <- function(html) {
    doc <- htmlParse(html, asText = TRUE)
    text <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
    return(text)
  }

  # format text vector into one character string
  collapse_text <- function(txt) {
    return(paste(txt, collapse = " "))
  }

  ###--- MAIN ---###
  # STEP 1: Evaluate input
  html.list <- lapply(input, evaluate_input)

  # STEP 2: Extract text from HTML
  text.list <- lapply(html.list, convert_html_to_text)

  # STEP 3: Return text
  text.vector <- sapply(text.list, collapse_text)
  return(text.vector)
}

This is now my code and an example article:

DNA:

target <- "http://www.sciencedirect.com/science/article/pii/S1754504816300319"
temp.text <- htmlToText(target)

The unformatted text stops somewhere in the Methods section: "was extracted using the MasterPure® Yeast DNA Purification Kit (Epicentre, Madison, Wisconsin, USA) following the manufacturer's 210 instructions."

Any suggestions or ideas?

P.S. I also tried html_text from rvest, with the same result.

2016-07-13 user6583482

You can probably use your existing code and just add ?np=y to the end of the URL, but this is a bit more compact:

library(rvest) 
library(stringi) 

target <- "http://www.sciencedirect.com/science/article/pii/S1754504816300319?np=y" 

pg <- read_html(target) 
pg %>% 
    html_nodes(xpath=".//div[@id='centerContent']//child::node()/text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]") %>% 
    stri_trim() %>% 
    paste0(collapse=" ") %>% 
    write(file="output.txt")

A bit of the output for that article (it was >80K in total):

Fungal Ecology Volume 22 , August 2016, Pages 61–72  175394|| Species richness 
influences wine ecosystem function through a dominant species Primrose J. Boynton a , , , 
Duncan Greig a , b a Max Planck Institute for Evolutionary Biology, Plön, 24306, Germany 
b The Galton Laboratory, Department of Genetics, Evolution, and Environment, University 
College London, London, WC1E 6BT, UK Received 9 November 2015, Revised 27 March 2016, 
Accepted 15 April 2016, Available online 1 June 2016 Corresponding editor: Marie Louise 
Davey Abstract Increased species richness does not always cause increased ecosystem function. 
Instead, richness can influence individual species with positive or negative ecosystem effects. 
We investigated richness and function in fermenting wine, and found that richness indirectly 
affects ecosystem function by altering the ecological dominance of Saccharomyces cerevisiae . 
While S. cerevisiae generally dominates fermentations, it cannot dominate extremely species-rich 
communities, probably because antagonistic species prevent it from growing. It is also diluted 
from species-poor communities,

2016-07-13 11:38:20 hrbrmstr
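
To see what the XPath expression in the answer selects, here is a small offline sketch. It is an illustration only: the HTML string is invented for the example (only the `centerContent` id comes from the answer), and it calls xml2 directly (the package rvest builds on), with an explicit `xml_text()` step instead of the piped calls above.

```r
library(xml2)

# Invented sample page; only the "centerContent" id is taken from the answer.
html <- '<html><body>
<div id="centerContent">
  <p>Visible <b>text</b>.</p>
  <script>var hidden = 1;</script>
  <form><p>Search box label</p></form>
</div>
<div id="sidebar"><p>Related articles</p></div>
</body></html>'

doc <- read_html(html)

# Same XPath as in the answer: text nodes under div#centerContent that are
# not inside script, style, noscript or form elements.
xp <- paste0(
  ".//div[@id='centerContent']//child::node()/text()",
  "[not(ancestor::script)][not(ancestor::style)]",
  "[not(ancestor::noscript)][not(ancestor::form)]"
)
nodes <- xml_find_all(doc, xp)

# Extract the text, trim whitespace, drop empty pieces, and join with spaces.
txt <- trimws(xml_text(nodes))
txt <- txt[nzchar(txt)]
result <- paste(txt, collapse = " ")
print(result)
```

Because every text node is joined with a single space, punctuation that follows an inline element ends up separated from the preceding word, which matches the stray spaces before commas and periods in the output excerpt above.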

+1 Thanks for the quick answer. It works perfectly. But could you also tell me what **np=y** means? – user6583482

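From the ScienceDirect site, I gather that the query string tells the web server to load the entire text rather than loading it as the user scrolls (this makes the page accessible to screen readers, which I suspect they are legally required to support). – hrbrmstr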

Thanks again for this explanation! – user6583482
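
If the existing `htmlToText()` function is kept, the `np=y` query string from the answer could be appended to each article URL before it is passed in. The following is a minimal sketch in base R. `add_query_param` is a hypothetical helper name, the second URL's `foo=bar` parameter is made up to show the `&` case, and the sketch assumes that ScienceDirect still honours `np=y` as described in this thread.

```r
# Hypothetical helper: append a query parameter to one or more URLs, using
# "&" if a URL already contains a query string and "?" otherwise.
# It does not URL-encode the value and does not handle "#fragment" parts.
add_query_param <- function(url, name, value) {
  sep <- ifelse(grepl("?", url, fixed = TRUE), "&", "?")
  paste0(url, sep, name, "=", value)
}

targets <- c(
  "http://www.sciencedirect.com/science/article/pii/S1754504816300319",
  # made-up existing query string, for illustration only
  "http://www.sciencedirect.com/science/article/pii/S1754504816300319?foo=bar"
)

full_text_urls <- add_query_param(targets, "np", "y")
print(full_text_urls)
```

The resulting vector can then be passed to the function from the question, for example `htmlToText(full_text_urls[1])`.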
