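You can use an NLP library to split the text into sentences, and then use grep to keep only the sentences that contain your word and have a specific length. I took a passage from Wikipedia about Ernest Hemingway, extracted the sentences containing the word 1940, and applied the length restriction only with the second grep: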
> require(tm)
> require(openNLP)
> text <- as.String("Ernest Hemingway wrote For Whom the Bell Tolls in Havana, Cuba; Key West, Florida; and Sun Valley, Idaho in 1939. In Cuba, he lived in the Hotel Ambos-Mundos where he worked on the manuscript. The novel was finished in July 1940 and published in October.It is based on Hemingway's experiences during the Spanish Civil War and features an American protagonist, named Robert Jordan, who fights with Spanish soldiers for the Republicans. The characters in the novel include those who are purely fictional, those based on real people but fictionalized, and those who were actual figures in the war. Set in the Sierra de Guadarrama mountain range between Madrid and Segovia, the action takes place during four days and three nights. For Whom the Bell Tolls became a Book of the Month Club choice, sold half a million copies within months, was nominated for a Pulitzer Prize, and became a literary triumph for Hemingway. Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75.")
> sentence_token_annotator <- Maxent_Sent_Token_Annotator()
> sentence.boundaries <- annotate(text, sentence_token_annotator)
> sentences <- text[sentence.boundaries]
> sentences
[1] "Ernest Hemingway wrote For Whom the Bell Tolls in Havana, Cuba; Key West, Florida; and Sun Valley, Idaho in 1939."
[2] "In Cuba, he lived in the Hotel Ambos-Mundos where he worked on the manuscript."
[3] "The novel was finished in July 1940 and published in October.It is based on Hemingway's experiences during the Spanish Civil War and features an American protagonist, named Robert Jordan, who fights with Spanish soldiers for the Republicans."
[4] "The characters in the novel include those who are purely fictional, those based on real people but fictionalized, and those who were actual figures in the war."
[5] "Set in the Sierra de Guadarrama mountain range between Madrid and Segovia, the action takes place during four days and three nights."
[6] "For Whom the Bell Tolls became a Book of the Month Club choice, sold half a million copies within months, was nominated for a Pulitzer Prize, and became a literary triumph for Hemingway."
[7] "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75."
> with_word = grep("1940", sentences, fixed = TRUE, value = TRUE)
> with_word
[1] "The novel was finished in July 1940 and published in October.It is based on Hemingway's experiences during the Spanish Civil War and features an American protagonist, named Robert Jordan, who fights with Spanish soldiers for the Republicans."
[2] "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75."
> with_word[grep("^.{30,100}$", with_word)]
[1] "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75."
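The two-step filter above can also be written without a regex for the length check. The following is a minimal sketch in base R, assuming the sentences are already in a character vector. The names filter_sentences and sents are hypothetical and not part of tm or openNLP.

```r
# Hypothetical helper (not from tm or openNLP): keep the sentences that
# contain `word` literally and whose length lies in [min_len, max_len].
filter_sentences <- function(sentences, word, min_len = 30, max_len = 100) {
  has_word <- grepl(word, sentences, fixed = TRUE)  # literal match, no regex
  n_chars  <- nchar(sentences)                      # length in characters
  sentences[has_word & n_chars >= min_len & n_chars <= max_len]
}

# Illustrative input; the third sentence is made up to show the length cut-off.
sents <- c(
  "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75.",
  "In Cuba, he lived in the Hotel Ambos-Mundos where he worked on the manuscript.",
  "Finished in 1940."
)

# Only the first sentence has the word and a length between 30 and 100.
filter_sentences(sents, "1940")
```

Because the word is matched with fixed = TRUE, it needs no escaping here, and the limits can be changed per call, for example filter_sentences(sents, "1940", 30, 250).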
Note that you can grep the sentences you need in a single operation, but that takes a more complex PCRE regex with a lookahead:
> my_sent <- grep("(?s)^(?=.{30,100}$).*1940.*$", sentences, value = TRUE, perl = TRUE)
> my_sent
[1] "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75."
The "(?s)^(?=.{30,100}$).*1940.*$" regex requires the string to be 30 to 100 characters long from start to end (set your own limits), and the string must contain the word 1940 (if your word contains special regex metacharacters, you need to escape them with \\).
I just tested this with your data, using your own word and the {30,250} limiting quantifier to get only the sentences you need:
> with_word = grep("(?s)^(?=.{30,250}$).*\\bhosted\\b.*$", sentences, perl = TRUE, value = TRUE)
> with_word
[1] "proudly hosted by Media Temple!"
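For words that contain regex metacharacters, one alternative to escaping each character by hand is PCRE's \Q...\E literal quoting. This is a different technique from the manual \\ escaping mentioned above. The sketch below assumes perl = TRUE and a word that does not itself contain \E. The names length_word_pattern and sents are hypothetical.

```r
# Hypothetical helper: build the single-pass pattern for a literal word.
# \Q...\E is PCRE literal quoting, so the pattern only works with perl = TRUE.
# Assumption: `word` does not contain the two characters "\E".
length_word_pattern <- function(word, min_len = 30, max_len = 100) {
  sprintf("(?s)^(?=.{%d,%d}$).*\\Q%s\\E.*$",
          as.integer(min_len), as.integer(max_len), word)
}

# Illustrative input; the second and third sentences are made up.
sents <- c(
  "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75.",
  "The price was $2975 in a sentence that is still long enough to pass the length check.",
  "Only $2.75."
)

pat <- length_word_pattern("$2.75")
# Only the first sentence matches: the second lacks the literal "$2.75",
# and the third is shorter than 30 characters.
grep(pat, sents, value = TRUE, perl = TRUE)
```

Unquoted, the "$" and "." in "$2.75" would act as an end-of-string anchor and a wildcard, so the word would not be matched as written.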
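What is the input? – sweaver2112
This is a fairly hard problem, especially without context. Please provide a sample block of text. An example of the difficulty: abbreviations such as U.S.A. – lmo
This is also hard because R's regex has very limited features. You would be better off just checking the length of the match you find. – 4castle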