2016-04-29





Welcome to RegExr v2.1 by gskinner.com, proudly **hosted** by Media Temple! A full Reference & Help is available in the Library, or watch the video Tutorial hosted by Media Temple which are so amazingly awesome that just looking at the name I get a boner instantly, and I am really serious right now, it's that exciting if you didn't get it. 


Welcome to RegExr v2.1 by gskinner.com, proudly **hosted** by Media Temple 



What is the input? – sweaver2112


This is a pretty hard problem, especially without any context. Please provide a sample block of text. An example of the difficulty: abbreviations such as U.S.A. – lmo


This is also hard because R's regex support is quite limited. You would be better off just checking the length of the matches you find. – 4castle




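You can use an NLP library to split the text into sentences, and then keep only the sentences of a given length that contain the word. I took a passage from the Wikipedia article on Ernest Hemingway, extracted the sentences containing the word 1940, and applied the length limit in a second grep. For your case, see the final grep call below, which searches for the word hosted.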

> require(tm) 
> require(openNLP) 
> sentence_token_annotator <- Maxent_Sent_Token_Annotator() 
> text <- as.String("Ernest Hemingway wrote For Whom the Bell Tolls in Havana, Cuba; Key West, Florida; and Sun Valley, Idaho in 1939. In Cuba, he lived in the Hotel Ambos-Mundos where he worked on the manuscript. The novel was finished in July 1940 and published in October.It is based on Hemingway's experiences during the Spanish Civil War and features an American protagonist, named Robert Jordan, who fights with Spanish soldiers for the Republicans.[8] The characters in the novel include those who are purely fictional, those based on real people but fictionalized, and those who were actual figures in the war. Set in the Sierra de Guadarrama mountain range between Madrid and Segovia, the action takes place during four days and three nights. For Whom the Bell Tolls became a Book of the Month Club choice, sold half a million copies within months, was nominated for a Pulitzer Prize, and became a literary triumph for Hemingway. Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75.") 
> sentence.boundaries <- annotate(text, sentence_token_annotator) 
> sentences <- text[sentence.boundaries] 
> sentences 
[1] "Ernest Hemingway wrote For Whom the Bell Tolls in Havana, Cuba; Key West, Florida; and Sun Valley, Idaho in 1939."                                 
[2] "In Cuba, he lived in the Hotel Ambos-Mundos where he worked on the manuscript."                                          
[3] "The novel was finished in July 1940 and published in October.It is based on Hemingway's experiences during the Spanish Civil War and features an American protagonist, named Robert Jordan, who fights with Spanish soldiers for the Republicans.[8]" 
[4] "The characters in the novel include those who are purely fictional, those based on real people but fictionalized, and those who were actual figures in the war."                      
[5] "Set in the Sierra de Guadarrama mountain range between Madrid and Segovia, the action takes place during four days and three nights."                             
[6] "For Whom the Bell Tolls became a Book of the Month Club choice, sold half a million copies within months, was nominated for a Pulitzer Prize, and became a literary triumph for Hemingway."               
[7] "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75."                                       
> with_word = grep("1940", sentences, fixed = TRUE, value = TRUE) 
> with_word 
[1] "The novel was finished in July 1940 and published in October.It is based on Hemingway's experiences during the Spanish Civil War and features an American protagonist, named Robert Jordan, who fights with Spanish soldiers for the Republicans.[8]" 
[2] "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75."                                       
> with_word[grep("^.{30,100}$", with_word)] 
[1] "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75." 
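
The second grep above expresses the length limit as a regex. As a side note, the same filter could probably be written with nchar(), which may be easier to read. This is only a minimal sketch in base R, assuming the text has already been split into sentences; the sentences vector here is a made-up stand-in and keep_by_length is a hypothetical helper, neither is part of the answer above.

```r
# Made-up stand-in for the sentence vector produced by the annotator.
sentences <- c(
  "In 1940.",
  "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75.",
  "Set in the Sierra de Guadarrama mountain range between Madrid and Segovia, the action takes place during four days and three nights."
)

# Hypothetical helper: keep the sentences that contain `word` literally
# and whose length in characters lies within [min_len, max_len].
keep_by_length <- function(x, word, min_len = 30, max_len = 100) {
  hits <- x[grepl(word, x, fixed = TRUE)]
  len <- nchar(hits)
  hits[len >= min_len & len <= max_len]
}

result <- keep_by_length(sentences, "1940")
print(result)
```

The first sample sentence contains the word but is too short, and the third is within neither the word filter nor the length limit, so only the middle one should survive.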


> my_sent <- grep("(?s)^(?=.{30,100}$).*1940.*$", sentences, value = TRUE, perl = TRUE) 
> my_sent 
[1] "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75." 




> with_word = grep("(?s)^(?=.{30,250}$).*\\bhosted\\b.*$", sentences, perl = TRUE, value = TRUE) 
> with_word 
[1] "proudly hosted by Media Temple!" 

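It works together with Wiktor's suggestion. This is great! Some comments on the post had convinced me that R is probably not the best tool for handling natural-language text. But ka-boom! You showed me the right plugin, and a complete solution to my problem. It also showed me that I was nowhere near the result. Thank you so much! I beg your pardon for such a vague description in the post, but you understood it exactly. Btw, installing openNLP took 5 hours of workarounds: I had to downgrade Java and do a lot of other things. Thanks, and have a wonderful life :P – Denis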


Assuming the text you are parsing is English, you can use a positive lookahead.


I beg your pardon, but I really don't know regular expressions very well, and I'm not sure how a positive lookahead could be used in this particular example. What is it that follows in our case? – Denis


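A positive lookahead group asserts that whatever comes next in the string matches the pattern inside the group, without consuming it. Let me take a look at your updated question. –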
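
To make the lookahead idea concrete, here is a minimal sketch in base R. It needs perl = TRUE, since R's default regex engine does not support lookaheads. The sample strings and the 20 to 40 character limits are made up for illustration. Anchored at the start of the string, the group (?=.{20,40}$) checks that the whole string is between 20 and 40 characters long without consuming any of it; the rest of the pattern then looks for the word as usual.

```r
# Made-up sample strings for illustration only.
x <- c(
  "too short hosted",                  # has the word, but under 20 characters
  "proudly hosted by Media Temple!",   # has the word, 31 characters
  "no matching word in this string",   # right length, but no 'hosted'
  "this sentence is hosted somewhere and runs well past the upper limit"
)

# ^(?=.{20,40}$)  lookahead: total length must be 20 to 40 characters
# .*\\bhosted\\b.*$  the string must contain 'hosted' as a whole word
pattern <- "^(?=.{20,40}$).*\\bhosted\\b.*$"

matches <- grep(pattern, x, perl = TRUE, value = TRUE)
print(matches)
```

Only the second string satisfies both the length check and the word check.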


How do you know where the first sentence ends? Are the sentences separated, or are they on the same line? –
