R regex to remove RIDICULOUSLY large numbers from a character vector
Problem: The digits inside the elements of this character vector are a pain.
troublesome_tweets
[1] "Happi Pi Day! Eat a pizza and some blueberry pie a la mode, and then you may calculate: 3.1415926535897932384626433832795028841971693993751"
[2] "Sick of this yet? ...2847564823378678316527120190914564856692346034861045432 6648213393607260249141273724587006606315588174881520920..."
[3] "He only did that 939478978398894709409784680708937809648608378964809798369864 x to the Cavs in that one series"
[4] "Am I a robot? Yes, affirmative. 0111001001101111011000100110111100100000011000100110111101101111011001110110100101100101"
[5] "Lazy rule # 18460826292036391273639018263920273820183737473920383930; You were too lazy to read the whole number .(:"
[6] "#thankyoushootusdown I LOVE YOU GUYS <3333333333333333333333333333333333333333333333333333333333333333333333333333333333333"
[7] "Hvrtujikdsjktrfedwqcvbntrfedsw1123456787654325678876543234567865432456765432345678654214565tredscvbhjutwsdfvghyu. ! How I feel."
[8] "I want to dock 555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555"
[9] "x's to the club wit momz here 0 while gone 1000000000000000000000000000000000000000000000000000000000000000"
Current approach: Go through the elements one by one and use gsub with a regex written specifically for that element to remove its large number.
clean_individual_tweets <- function(x){
x <- gsub("[0-9][.][0-9]+", " ", x)
x <- gsub("[...][0-9]+", "", x)
x <- gsub("[0-9]+[...]", "", x)
x <- gsub("[0-9]+[ ][x]", "", x)
x <- gsub("[.][ ][0-9]+", " ", x)
x <- gsub("[#][ ][0-9]+", " ", x)
x <- gsub("[<][0-9]+", " ", x)
x <- gsub("[a-zA-Z][0-9]+", " ", x)
x <- gsub("555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555", " ", x)
x <- gsub("1000000000000000000000000000000000000000000000000000000000000000", " ", x)}
cleaned_tweets <- clean_individual_tweets(troublesome_tweets)
cleaned_tweets
[1] "Happi Pi Day! Eat a pizza and some blueberry pie a la mode, and then you may calculate: "
[2] "Sick of this yet? .. .."
[3] "He only did that to the Cavs in that one series"
[4] "Am I a robot? Yes, affirmative "
[5] "Lazy rule ; You were too lazy to read the whole number .(:"
[6] "#thankyoushootusdown I LOVE YOU GUYS "
[7] "Hvrtujikdsjktrfedwqcvbntrfeds tredscvbhjutwsdfvghyu. ! How I feel."
[8] "I want to dock "
[9] "x's to the club wit momz here 0 while gone "
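Desired approach: Have a single regex that can remove all of these large numbers from a large character vector, say by replacing any number made up of 10 or more digits. I do not want to remove all digits; that would have been very easy to begin with. I specifically want to remove the artifact numbers and keep the non-artifact ones.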
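One way the single-regex idea could look is sketched below. This is only an illustration under stated assumptions: the helper name remove_long_numbers, its min_digits argument, and the three sample strings are hypothetical and not part of the original post. It deletes any run of at least min_digits consecutive digits, together with an optional leading integer part and decimal point, and uses perl = TRUE so the non-capturing group is handled by the PCRE engine.

```r
# Hypothetical helper (not from the original post): remove any run of
# `min_digits` or more consecutive digits, plus an optional "digits." prefix
# so that the "3." in front of a long decimal expansion is removed too.
remove_long_numbers <- function(x, min_digits = 10) {
  pattern <- sprintf("(?:\\d+\\.)?\\d{%d,}", min_digits)
  gsub(pattern, "", x, perl = TRUE)
}

# Made-up sample strings, shortened from the kind of tweets in the question
sample_tweets <- c(
  "calculate: 3.1415926535897932384626433832795",
  "here 0 while gone 1000000000000000000000000000000",
  "Lazy rule # 18 still applies"
)

cleaned_sample <- remove_long_numbers(sample_tweets)
print(cleaned_sample)
# [1] "calculate: "                  "here 0 while gone "
# [3] "Lazy rule # 18 still applies"
```

Short numbers such as 0 and 18 survive because they never reach the 10-digit threshold; the threshold itself is a judgment call and may need tuning for other data.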
Data (the artifact numbers have at least 10 digits):
troublesome_tweets <- c(
"Happi Pi Day! Eat a pizza and some blueberry pie a la mode, and then you may calculate: 3.1415926535897932384626433832795028841971693993751"
, "Sick of this yet? ...2847564823378678316527120190914564856692346034861045432 6648213393607260249141273724587006606315588174881520920..."
, "He only did that 939478978398894709409784680708937809648608378964809798369864 x to the Cavs in that one series"
, "Am I a robot? Yes, affirmative. 0111001001101111011000100110111100100000011000100110111101101111011001110110100101100101"
, "Lazy rule # 18460826292036391273639018263920273820183737473920383930; You were too lazy to read the whole number .(:"
, "#thankyoushootusdown I LOVE YOU GUYS <3333333333333333333333333333333333333333333333333333333333333333333333333333333333333"
, "Hvrtujikdsjktrfedwqcvbntrfedsw1123456787654325678876543234567865432456765432345678654214565tredscvbhjutwsdfvghyu. ! How I feel."
, "I want to dock 555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555"
, "x's to the club wit momz here 0 while gone 1000000000000000000000000000000000000000000000000000000000000000"
)
– mansanto
cleaned_tweets <- clean_individual_tweets(troublesome_tweets) # the closing bracket was missing in the code above. What about gsub('\\d{2,}', '', troublesome_tweets)? Use '\\d{2,}|\\d+\\.\\d{2,}' to get rid of the decimals as well. – rawr
Or '(?:\\d+\\.)?\\d{2,}' –
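To compare how the patterns from the comments behave, here is a small sketch on made-up strings (the vector x and the result names are assumptions for illustration only). It also tries the same pattern with the lower bound raised from 2 to 10, which seems closer to the "10 or more digits" goal stated in the question.

```r
# Made-up sample strings, not taken from the question's data
x <- c(
  "pi is 3.14159 today",
  "rule 18 and id 12345678901234567890",
  "gone 0 left 7"
)

# Comment suggestion: any run of 2+ digits (leaves the "3." behind)
two_plus <- gsub("\\d{2,}", "", x)

# Comment suggestion: optional "digits." prefix, then 2+ digits
two_plus_decimal <- gsub("(?:\\d+\\.)?\\d{2,}", "", x, perl = TRUE)

# Same shape with the threshold raised to 10 digits (an assumption
# based on the "10 or more digits" wording in the question)
ten_plus_decimal <- gsub("(?:\\d+\\.)?\\d{10,}", "", x, perl = TRUE)

print(two_plus)          # "pi is 3. today"      "rule  and id "    "gone 0 left 7"
print(two_plus_decimal)  # "pi is  today"        "rule  and id "    "gone 0 left 7"
print(ten_plus_decimal)  # "pi is 3.14159 today" "rule 18 and id "  "gone 0 left 7"
```

With a threshold of 2, ordinary numbers such as 18 and 3.14159 are removed along with the artifacts; with a threshold of 10, only the very long run is dropped.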