2016-05-13

R regex to remove RIDICULOUSLY large numbers from a character vector

Problem: some elements of my character vector contain absurdly long numbers.

troublesome_tweets 
[1] "Happi Pi Day! Eat a pizza and some blueberry pie a la mode, and then you may calculate: 3.1415926535897932384626433832795028841971693993751" 
[2] "Sick of this yet? ...2847564823378678316527120190914564856692346034861045432 6648213393607260249141273724587006606315588174881520920..."  
[3] "He only did that 939478978398894709409784680708937809648608378964809798369864 x to the Cavs in that one series"        
[4] "Am I a robot? Yes, affirmative. 0111001001101111011000100110111100100000011000100110111101101111011001110110100101100101"     
[5] "Lazy rule # 18460826292036391273639018263920273820183737473920383930; You were too lazy to read the whole number .(:"      
[6] "#thankyoushootusdown I LOVE YOU GUYS <3333333333333333333333333333333333333333333333333333333333333333333333333333333333333"     
[7] "Hvrtujikdsjktrfedwqcvbntrfedsw1123456787654325678876543234567865432456765432345678654214565tredscvbhjutwsdfvghyu. ! How I feel."    
[8] "I want to dock 555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555"        
[9] "x's to the club wit momz here 0 while gone 1000000000000000000000000000000000000000000000000000000000000000"         

Current approach: go through the elements one by one and use gsub with a regex written specifically for each element to strip the long number it contains.

clean_individual_tweets <- function(x){ 
    x <- gsub("[0-9][.][0-9]+", " ", x) 
    x <- gsub("[...][0-9]+", "", x) 
    x <- gsub("[0-9]+[...]", "", x) 
    x <- gsub("[0-9]+[ ][x]", "", x) 
    x <- gsub("[.][ ][0-9]+", " ", x) 
    x <- gsub("[#][ ][0-9]+", " ", x) 
    x <- gsub("[<][0-9]+", " ", x) 
    x <- gsub("[a-zA-Z][0-9]+", " ", x) 
    x <- gsub("555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555", " ", x) 
    x <- gsub("1000000000000000000000000000000000000000000000000000000000000000", " ", x) 
    x  # return the cleaned vector explicitly
} 


cleaned_tweets <- clean_individual_tweets(troublesome_tweets) 

cleaned_tweets 
[1] "Happi Pi Day! Eat a pizza and some blueberry pie a la mode, and then you may calculate: " 
[2] "Sick of this yet? .. .."                 
[3] "He only did that to the Cavs in that one series"           
[4] "Am I a robot? Yes, affirmative "               
[5] "Lazy rule ; You were too lazy to read the whole number .(:"        
[6] "#thankyoushootusdown I LOVE YOU GUYS "             
[7] "Hvrtujikdsjktrfedwqcvbntrfeds tredscvbhjutwsdfvghyu. ! How I feel."      
[8] "I want to dock "                   
[9] "x's to the club wit momz here 0 while gone "  

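Desired approach: a single regex that removes all of these large numbers from a large character vector, for example by replacing any number made up of 10 or more digits. I don't want to remove every digit; that would have been very easy from the start. I specifically want to remove the artifact numbers and keep the non-artifact ones.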
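Before removing anything, it can help to check which elements actually contain a run of 10 or more digits and what those runs are. The following is only a minimal sketch of that check; `sample_text` is a made-up vector for illustration, not the data from the question.

```r
# Hypothetical sample data: one long artifact, one ordinary year, one 10-digit run
sample_text <- c("gone 1000000000000", "born in 1987", "id 1234567890")

# TRUE where an element contains at least 10 consecutive digits
has_artifact <- grepl("[0-9]{10,}", sample_text)
print(has_artifact)  # TRUE FALSE TRUE

# Extract the digit runs themselves so they can be inspected before removal
artifacts <- regmatches(sample_text, gregexpr("[0-9]{10,}", sample_text))
print(artifacts)
```

Short numbers such as the year are left alone, which is the behaviour the question asks for.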

Data (the artifact numbers have at least 10 digits):

troublesome_tweets <- c(
    "Happi Pi Day! Eat a pizza and some blueberry pie a la mode, and then you may calculate: 3.1415926535897932384626433832795028841971693993751" 
, "Sick of this yet? ...2847564823378678316527120190914564856692346034861045432 6648213393607260249141273724587006606315588174881520920..."  
, "He only did that 939478978398894709409784680708937809648608378964809798369864 x to the Cavs in that one series"        
, "Am I a robot? Yes, affirmative. 0111001001101111011000100110111100100000011000100110111101101111011001110110100101100101"     
, "Lazy rule # 18460826292036391273639018263920273820183737473920383930; You were too lazy to read the whole number .(:"      
, "#thankyoushootusdown I LOVE YOU GUYS <3333333333333333333333333333333333333333333333333333333333333333333333333333333333333"     
, "Hvrtujikdsjktrfedwqcvbntrfedsw1123456787654325678876543234567865432456765432345678654214565tredscvbhjutwsdfvghyu. ! How I feel."    
, "I want to dock 555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555"        
, "x's to the club wit momz here 0 while gone 1000000000000000000000000000000000000000000000000000000000000000"         
) 
cleaned_tweets <- clean_individual_tweets(troublesome_tweets) # the closing bracket was missing in the question's code – mansanto

'gsub('\\d{2,}', '', troublesome_tweets)'? Use '\\d{2,}|\\d+\\.\\d{2,}' to get rid of the decimals as well. – rawr

Or '(?:\\d+\\.)?\\d{2,}'

Answers

Try the following:

gsub("[0-9]{10,}","",troublesome_tweets) 

[1] "Happi Pi Day! Eat a pizza and some blueberry pie a la mode, and then you may calculate: 3." 
[2] "Sick of this yet? ... ..."                 
[3] "He only did that x to the Cavs in that one series"           
[4] "Am I a robot? Yes, affirmative. "               
[5] "Lazy rule # ; You were too lazy to read the whole number .(:"        
[6] "#thankyoushootusdown I LOVE YOU GUYS <"              
[7] "Hvrtujikdsjktrfedwqcvbntrfedswtredscvbhjutwsdfvghyu. ! How I feel."       
[8] "I want to dock "                   
[9] "x's to the club wit momz here 0 while gone " 
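The output above still leaves a stray "3." in the first element, because only the digits after the decimal point reach 10 in a row. One possible way to handle that, combining the 10-digit threshold with the optional integer-part prefix suggested in the comments, is sketched below. The names `remove_long_numbers` and `min_digits` are made up for illustration, and `perl = TRUE` is an assumption made so that the non-capturing group is handled by the PCRE engine.

```r
# Hypothetical helper: remove digit runs of at least `min_digits` digits,
# together with a leading "digits." integer part if one is attached.
remove_long_numbers <- function(x, min_digits = 10L) {
  # Builds e.g. "(?:\\d+\\.)?\\d{10,}" for min_digits = 10
  pattern <- sprintf("(?:\\d+\\.)?\\d{%d,}", as.integer(min_digits))
  gsub(pattern, "", x, perl = TRUE)
}

# Made-up sample input, not the full data from the question
sample_text <- c(
  "calculate: 3.1415926535897932384626433832795028841971693993751",
  "here 0 while gone 1000000000000000000000000000000000000000000000000000000000000000",
  "call 555-1234 in 2016"
)

cleaned <- remove_long_numbers(sample_text)
print(cleaned)
```

Short numbers such as the lone 0, the phone-style digits, and the year fall below the threshold and are kept.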

Thank you for your reply. – mansanto