2017-02-09 6 views

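Question: How can I subset a document-term matrix so that it keeps only the bigrams (terms) from a list I want to keep? In other words, keep only specific bigrams in a document-term matrix in R.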

I want to apply this to a very large document-term matrix. I tried converting the term matrix to an ordinary matrix, but the resulting vector would be larger than 1000 Gb.

Code:

dd <- data.frame(
  id = 10:13,
  text = c("No wonderful, then, that ever",
           "So that in many cases such a ",
           "But there were still other and",
           "Not even at the rationale"),
  stringsAsFactors = FALSE)

library(tm)
library(RWeka)

# map the data frame columns to document content and id
myReader <- readTabular(mapping = list(content = "text", id = "id"))

# create a VCorpus
tm <- VCorpus(DataframeSource(dd), readerControl = list(reader = myReader))

# bigram tokenizer (min = max = 2)
Tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

# create a term-document matrix (terms in rows) using Tokenizer
dtm <- TermDocumentMatrix(tm, control = list(tokenize = Tokenizer))
inspect(dtm)

Output:

                Docs
Terms            10 11 12 13
  at the          0  0  0  1
  but there       0  0  1  0
  cases such      0  1  0  0
  even at         0  0  0  1
  in many         0  1  0  0
  many cases      0  1  0  0
  no wonderful    1  0  0  0
  not even        0  0  0  1
  other and       0  0  1  0
  so that         0  1  0  0
  still other     0  0  1  0
  such a          0  1  0  0
  that ever       1  0  0  0
  that in         0  1  0  0
  the rationale   0  0  0  1
  then that       1  0  0  0
  there were      0  0  1  0
  were still      0  0  1  0
  wonderful then  1  0  0  0

Answers


I had assumed this would be more complicated because it was a DTM.

Problem solved. Indexing the rows by term name works directly:

d_sel <- dtm[c('no wonderful', 'there were'), ]
inspect(d_sel)

              Docs
Terms          10 11 12 13
  no wonderful  1  0  0  0
  there were    0  0  1  0
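
If the list of bigrams to keep may contain terms that are not in the matrix, it is probably safer to intersect the list with the existing terms before indexing. The following is only a minimal sketch, not a tested recipe for the large matrix in the question. It assumes a tokenizer built on ngrams() and words() from the NLP package (swapped in for RWeka so the example does not need Java), a corpus built with VectorSource, and punctuation removal beforehand. The names bigram_tokenizer, keep and keep_found are made up for illustration.

```r
library(tm)  # attaching tm also attaches NLP, which provides ngrams() and words()

texts <- c("No wonderful, then, that ever",
           "So that in many cases such a ",
           "But there were still other and",
           "Not even at the rationale")

# Build a small corpus and strip punctuation so that the whitespace-based
# tokens do not carry trailing commas
corp <- VCorpus(VectorSource(texts))
corp <- tm_map(corp, removePunctuation)

# Assumption: bigram tokenizer based on NLP::ngrams, standing in for RWeka
bigram_tokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}

tdm <- TermDocumentMatrix(corp, control = list(tokenize = bigram_tokenizer))

# Hypothetical keep list; "not present" does not occur in the corpus
keep <- c("no wonderful", "there were", "not present")

# Keep only the bigrams that actually exist as terms (row names)
keep_found <- intersect(keep, Terms(tdm))

# Indexing by name returns a TermDocumentMatrix again, so no as.matrix()
# conversion of the full matrix is needed
tdm_sel <- tdm[keep_found, ]
inspect(tdm_sel)
```

Because the subset is taken on the sparse object itself, the full dense matrix is never created, which is the step that ran out of memory in the question.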