4

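Is there an algorithm for extracting simple sentences from a paragraph? In other words, is there an algorithm for extracting simple sentences from complex (mixed) sentences?

My ultimate goal is to run another algorithm on the resulting simple sentences to determine the author's sentiment.

I have researched this in sources such as Chae-Deug Park, but none of them discuss preparing simple sentences as training data. Thanks in advance.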

+0

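What exactly do you mean by "simple sentences"? Just sentences, as opposed to paragraphs? In that case your question is about sentence boundary detection. Or do you mean sentences that contain only one main predicate (as opposed to complex sentences that include subordinate clauses and so on)? Or something else entirely? – jogojapan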

+0

Hi jogojapan, yes, that is correct: I meant sentences as opposed to paragraphs...

+0

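It is hard for anyone to answer your question because you have not properly defined what you mean by a simple sentence. Perhaps you want to use something like the Stanford Parser to get a parse tree for each sentence and then discard every sentence that is not of the 'NP VP' type (e.g. '[John] [sat on the bench]', '[Mary and Jill] [ate sandwiches]', etc.)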

Answers

1

I used OpenNLP for the same thing.

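import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

// Static method of the SentenceDetector class used in the examples below.
// Splits a paragraph into sentences using the pre-trained English model en-sent.bin.
public static List<String> breakIntoSentencesOpenNlp(String paragraph) throws IOException {
    // try-with-resources closes the model stream even if loading the model throws
    try (InputStream is = new FileInputStream("resources/models/en-sent.bin")) {
        SentenceModel model = new SentenceModel(is);
        SentenceDetectorME sdetector = new SentenceDetectorME(model);
        return Arrays.asList(sdetector.sentDetect(paragraph));
    }
}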

String paragraph;

// Failed at "Hi."
paragraph = "Hi. How are you? This is Mike.";
SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(System.out::println);

// Failed at "Door.Noone": no break, because there is no space after the period
paragraph = "Close the Door.Noone is out there";
SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(System.out::println);

paragraph = "Really!! I cant believe. Mr. Wilson can come any moment to receive mrs. watson.";
SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(System.out::println);

// Failed at "dr.": breaks after the lowercase abbreviation
paragraph = "Radhika, Mohan, and Shaik went to meet dr. Kashyap to raise fund for poor patients.";
SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(System.out::println);

paragraph = "This is how I tried to split a paragraph into a sentence. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.S. and numbers like 2.2. They all got splitted by the above code.";
SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(System.out::println);

paragraph = "www.thinkzarahatke.com is the second site I developed. You can send mail to [email protected]";
SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(System.out::println);

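It failed only where the input contained a human error. For example, the abbreviation "Dr." must be written with a capital D, and there must be at least one space between two sentences.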
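Since the two failures named above come from typing slips in the input, one possible mitigation is to repair those slips before the text reaches the sentence detector. The following is a minimal sketch under stated assumptions, not part of OpenNLP: the class `ParagraphNormalizer`, its `normalize` method, and the short list of honorifics are all hypothetical names chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not an OpenNLP class): repairs common typing slips before sentence detection. */
class ParagraphNormalizer {

    // A period squeezed between a lowercase and an uppercase letter, e.g. "Door.Noone".
    private static final Pattern MISSING_SPACE = Pattern.compile("(?<=[a-z])\\.(?=[A-Z])");

    // Lowercase honorifics followed by a period. The list is an illustrative assumption.
    private static final Pattern LOWERCASE_TITLE = Pattern.compile("\\b(dr|mr|mrs|ms)\\.");

    static String normalize(String paragraph) {
        // Step 1: insert the missing space after a sentence-final period.
        String spaced = MISSING_SPACE.matcher(paragraph).replaceAll(". ");

        // Step 2: capitalize the first letter of each lowercase honorific.
        Matcher m = LOWERCASE_TITLE.matcher(spaced);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String title = m.group(1);
            String fixed = Character.toUpperCase(title.charAt(0)) + title.substring(1) + ".";
            m.appendReplacement(out, Matcher.quoteReplacement(fixed));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("Close the Door.Noone is out there"));
        System.out.println(normalize("Radhika, Mohan, and Shaik went to meet dr. Kashyap."));
    }
}
```

The first rule only fires between a lowercase and an uppercase letter, so tokens such as "www.thinkzarahatke.com", "U.S." and "2.2" pass through unchanged. It can still misfire on mixed-case tokens that legitimately contain a period, so treat it as a heuristic.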

You can also split a paragraph into sentences with a regular expression, as follows.

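import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static List<String> breakIntoSentencesCustomRESplitter(String paragraph) {
    List<String> sentences = new ArrayList<>();
    // A '.', '!' or '?' ends a sentence only when it is followed (optionally after a
    // closing quote) by whitespace or end of line; periods inside tokens such as "2.2"
    // stay within the sentence.
    Pattern re = Pattern.compile("[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?(?=\\s|$)", Pattern.MULTILINE | Pattern.COMMENTS);
    Matcher reMatcher = re.matcher(paragraph);
    while (reMatcher.find()) {
        sentences.add(reMatcher.group());
    }
    return sentences;
}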

paragraph = "Hi. How are you? This is Mike.";
SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(System.out::println);

// Failed at "Door.Noone"
paragraph = "Close the Door.Noone is out there";
SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(System.out::println);

// Failed at "Mr." and "mrs."
paragraph = "Really!! I cant believe. Mr. Wilson can come any moment to receive mrs. watson.";
SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(System.out::println);

// Failed at "dr."
paragraph = "Radhika, Mohan, and Shaik went to meet dr. Kashyap to raise fund for poor patients.";
SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(System.out::println);

// Failed at "U.S."
paragraph = "This is how I tried to split a paragraph into a sentence. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.S. and numbers like 2.2. They all got splitted by the above code.";
SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(System.out::println);

paragraph = "www.thinkzarahatke.com is the second site I developed. You can send mail to [email protected]";
SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(System.out::println);
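Most of the failures noted in the comments above involve abbreviations ("Mr.", "mrs.", "dr.", "U.S."). One way to reduce them, sketched here as an assumption rather than as part of the original answer, is to post-process the splitter's output and re-join any fragment that ends in a known abbreviation with the fragment that follows it. The class `AbbreviationMerger`, its `merge` method, and the abbreviation list are hypothetical and deliberately incomplete.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical post-processor: glues fragments that were split after an abbreviation. */
class AbbreviationMerger {

    // Illustrative list only, stored in lowercase for case-insensitive comparison.
    private static final Set<String> ABBREVIATIONS =
            new HashSet<>(Arrays.asList("mr.", "mrs.", "ms.", "dr.", "u.s."));

    static List<String> merge(List<String> fragments) {
        List<String> merged = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        for (String fragment : fragments) {
            if (pending.length() > 0) {
                pending.append(' ');
            }
            pending.append(fragment.trim());
            // Emit the sentence only when the fragment does not end in an abbreviation.
            if (!endsWithAbbreviation(fragment)) {
                merged.add(pending.toString());
                pending.setLength(0);
            }
        }
        // Flush a trailing fragment that ended in an abbreviation.
        if (pending.length() > 0) {
            merged.add(pending.toString());
        }
        return merged;
    }

    private static boolean endsWithAbbreviation(String fragment) {
        String trimmed = fragment.trim();
        String lastToken = trimmed.substring(trimmed.lastIndexOf(' ') + 1);
        return ABBREVIATIONS.contains(lastToken.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        List<String> fragments = Arrays.asList(
                "Really!!", "I cant believe.", "Mr.",
                "Wilson can come any moment to receive mrs.", "watson.");
        merge(fragments).forEach(System.out::println);
    }
}
```

The trade-off is that a sentence which genuinely ends in an abbreviation (for example one ending in "U.S.") is merged with the next sentence, so the list should be tuned to the text being processed.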

However, it is comparatively error-prone. Another approach is to use BreakIterator.

import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public static List<String> breakIntoSentencesBreakIterator(String paragraph) {
    List<String> sentences = new ArrayList<>();
    // getSentenceInstance is static, so a single call with the desired locale is enough
    BreakIterator sentenceIterator = BreakIterator.getSentenceInstance(Locale.ENGLISH);
    sentenceIterator.setText(paragraph);

    // Walk the boundaries front to back so the sentences come out in reading order
    int start = sentenceIterator.first();
    for (int end = sentenceIterator.next();
         end != BreakIterator.DONE;
         start = end, end = sentenceIterator.next()) {
        sentences.add(paragraph.substring(start, end));
    }

    return sentences;
}

Example:

paragraph = "Hi. How are you? This is Mike.";
SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(System.out::println);

// Failed at "Door.Noone"
paragraph = "Close the Door.Noone is out there";
SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(System.out::println);

// Failed at "Mr."
paragraph = "Really!! I cant believe. Mr. Wilson can come any moment to receive mrs. watson.";
SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(System.out::println);

// Failed at "dr."
paragraph = "Radhika, Mohan, and Shaik went to meet dr. Kashyap to raise fund for poor patients.";
SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(System.out::println);

paragraph = "This is how I tried to split a paragraph into a sentence. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.S. and numbers like 2.2. They all got splitted by the above code.";
SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(System.out::println);

paragraph = "www.thinkzarahatke.com is the second site I developed. You can send mail to [email protected]";
SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(System.out::println);

Benchmark

  • Custom RE: 7 ms
  • BreakIterator: 143 ms
  • OpenNLP: 255 ms
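
The answer does not say how these figures were measured. For readers who want to take comparable measurements, here is a rough sketch of a timing helper, with the caveat that it is an assumption and not the harness behind the numbers above. `SplitterTimer`, `averageNanos` and the stand-in `splitWithBreakIterator` are hypothetical names; a dedicated benchmarking tool would give more trustworthy results than `System.nanoTime` loops.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

/** Hypothetical timing helper; not the harness used for the figures quoted above. */
class SplitterTimer {

    // Accumulates result sizes so the JIT cannot discard the calls as dead code.
    static long sink;

    static long averageNanos(Function<String, List<String>> splitter,
                             String paragraph, int warmupRuns, int timedRuns) {
        if (timedRuns <= 0) {
            throw new IllegalArgumentException("timedRuns must be positive");
        }
        // Warm-up runs give the JIT a chance to compile the splitter first.
        for (int i = 0; i < warmupRuns; i++) {
            sink += splitter.apply(paragraph).size();
        }
        long start = System.nanoTime();
        for (int i = 0; i < timedRuns; i++) {
            sink += splitter.apply(paragraph).size();
        }
        return (System.nanoTime() - start) / timedRuns;
    }

    // Stand-in splitter so the sketch runs without external libraries.
    static List<String> splitWithBreakIterator(String paragraph) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(paragraph);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            sentences.add(paragraph.substring(start, end).trim());
        }
        return sentences;
    }

    public static void main(String[] args) {
        String paragraph = "Hi. How are you? This is Mike.";
        long nanos = averageNanos(SplitterTimer::splitWithBreakIterator, paragraph, 1_000, 10_000);
        System.out.println("BreakIterator: " + nanos + " ns per call");
    }
}
```

Any of the three splitters can be passed as the first argument, provided it is wrapped so that checked exceptions (such as the `IOException` from model loading) are handled. Model loading should also be kept outside the timed loop if only the splitting cost is of interest.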
2

Have a look at Apache OpenNLP; it has a sentence detector module. The manual includes examples of how to use it from the command line and from the API.
