Getting output in the desired format using TokensRegex

I am using TokensRegex for rule-based entity extraction. It works well, but I cannot get the output in the format I want. For the sentence below, the code snippet that follows it gives me the output shown:

Earlier this month Trump targeted Toyota, threatening to impose a hefty fee on the world's largest automaker if it builds its Corolla cars for the U.S. market at a plant in Mexico.

for (CoreMap sentence : sentences) {

    List<MatchedExpression> matched = extractor.extractExpressions(sentence);

    // Only post-process and print when the extractor returned something
    if (matched != null) {

        matched = MatchedExpression.removeNested(matched);
        matched = MatchedExpression.removeNullValues(matched);
        System.out.print("FOR SENTENCE:" + sentence);

        for (MatchedExpression phrase : matched) {
            // Print out matched text and value
            System.out.print("MATCHED ENTITY: " + phrase.getText() + "\t" + "VALUE: " + phrase.getValue());
        }
    }
}

Output:

MATCHED ENTITY: Donald Trump targeted Toyota, threatening to impose a hefty fee on the world's largest automaker if it builds its Corolla cars for the U.S. market 

VALUE: LIST([PERSON])

I know that if I iterate over the tokens using:

for (CoreLabel token : cm.get(TokensAnnotation.class)) {
    String word = token.get(TextAnnotation.class);
    String lemma = token.get(LemmaAnnotation.class);
    String pos = token.get(PartOfSpeechAnnotation.class);
    String ne = token.get(NamedEntityTagAnnotation.class);
    System.out.println("matched token: " + "word=" + word + ", lemma=" + lemma + ", pos=" + pos + ", NE=" + ne);
}

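I can get output that gives the annotation for each token. However, I am using my own rules to detect named entities, and sometimes a single word inside a multi-token entity gets tagged as PERSON (for example, in organization and location names).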

So the output I was expecting is:

MATCHED ENTITY: Donald Trump VALUE: PERSON 
MATCHED ENTITY: Toyota VALUE: ORGANIZATION

How can I modify the code above to get the desired output? Do I need to use custom annotations?

Source

2017-04-20 serendipity

I built a jar of the latest build about a week ago. Use the jar files available from GitHub.

This sample code runs the rules and applies the appropriate ner tags.

package edu.stanford.nlp.examples; 

import edu.stanford.nlp.util.*; 
import edu.stanford.nlp.ling.*; 
import edu.stanford.nlp.pipeline.*; 

import java.util.*; 


public class TokensRegexExampleTwo {

    public static void main(String[] args) {

        // set up properties
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,tokensregex");
        props.setProperty("tokensregex.rules", "multi-step-per-org.rules");
        props.setProperty("tokensregex.caseInsensitive", "true");

        // set up pipeline
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        // set up text to annotate
        Annotation annotation = new Annotation("...text to annotate...");

        // annotate text
        pipeline.annotate(annotation);

        // print out found entities
        for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                System.out.println(token.word() + "\t" + token.ner());
            }
        }
    }
}
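
The loop above prints one line per token. To get one line per entity, as asked in the question, adjacent tokens that share the same ner tag can be merged after the pipeline has run. The following is only a rough sketch of that grouping step: `EntityGrouper` and its methods are hypothetical helpers, not part of CoreNLP, and the two lists stand in for the values of `token.word()` and `token.ner()` so that the example runs without the library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not a CoreNLP class): merges runs of adjacent tokens
// that carry the same NER tag into a single entity line.
class EntityGrouper {

    // words.get(i) and tags.get(i) stand in for token.word() and token.ner().
    static List<String> group(List<String> words, List<String> tags) {
        List<String> entities = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentTag = "O";
        for (int i = 0; i < words.size(); i++) {
            // Treat a missing tag like the "O" (outside any entity) tag
            String tag = tags.get(i) == null ? "O" : tags.get(i);
            if (!tag.equals(currentTag)) {
                // The tag changed, so the entity collected so far is complete
                flush(entities, current, currentTag);
                currentTag = tag;
            }
            if (!"O".equals(tag)) {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(words.get(i));
            }
        }
        flush(entities, current, currentTag);
        return entities;
    }

    // Emits the collected words, if any, as one line and resets the buffer.
    private static void flush(List<String> entities, StringBuilder current, String tag) {
        if (current.length() > 0) {
            entities.add("MATCHED ENTITY: " + current + " VALUE: " + tag);
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Donald", "Trump", "targeted", "Toyota");
        List<String> tags = Arrays.asList("PERSON", "PERSON", "O", "ORGANIZATION");
        for (String line : group(words, tags)) {
            System.out.println(line);
        }
    }
}
```

Note that this simple approach would also merge two different entities of the same type if they happen to be directly adjacent, so it is a starting point rather than a complete solution.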

Source

2017-04-24 03:54:14 StanfordNLPHelp

Exception in thread "main" java.lang.RuntimeException: Error parsing file: multi-step-per-org.rules. Caused by: java.io.IOException: Unable to open "multi-step-per-org.rules" as class path, filename or URL. I can't find this file in the build. Please help. – serendipity

That is the name of my rules file. You need to replace it with the name of your own rules file. – StanfordNLPHelp
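
The error message in the comment above says the name is tried as a class path entry, a filename, or a URL. A quick way to see whether a given rules file name resolves before handing it to the pipeline might look like the sketch below. `RulesFileCheck` and `locate` are hypothetical names introduced for illustration, not CoreNLP API, and the check covers only the filename and classpath cases, not URLs.

```java
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical pre-flight check (not part of CoreNLP): reports where a rules
// file name resolves, looking at the file system first and then the classpath.
class RulesFileCheck {

    // Returns a short description of the location, or null if nothing was found.
    static String locate(String name) {
        if (Files.isRegularFile(Paths.get(name))) {
            return "file: " + Paths.get(name).toAbsolutePath();
        }
        URL onClasspath = RulesFileCheck.class.getClassLoader().getResource(name);
        if (onClasspath != null) {
            return "classpath: " + onClasspath;
        }
        return null;
    }

    public static void main(String[] args) {
        // Pass your own rules file name as the first argument
        String rules = args.length > 0 ? args[0] : "multi-step-per-org.rules";
        String where = locate(rules);
        System.out.println(where == null
                ? "Not found as a file or classpath resource: " + rules
                : "Found " + where);
    }
}
```

If the name is not found, using an absolute path to the rules file is one simple way to rule out working-directory problems.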

I was able to get the output in the desired format.

import edu.stanford.nlp.ling.tokensregex.*;
import java.util.regex.Pattern;

Annotation document = new Annotation("...sentence to annotate...");

// use the pipeline to annotate the document we created
pipeline.annotate(document);
List<CoreMap> sentences = document.get(SentencesAnnotation.class);

// Note: I don't put environment-related settings in the rules file.
Env env = TokenSequencePattern.getNewEnv();
env.setDefaultStringMatchFlags(NodePattern.CASE_INSENSITIVE);
env.setDefaultStringPatternFlags(Pattern.CASE_INSENSITIVE);

CoreMapExpressionExtractor extractor = CoreMapExpressionExtractor
        .createExtractorFromFiles(env, "test_degree.rules");

// iterate over the sentences of the annotated document
for (CoreMap sentence : sentences) {
    List<MatchedExpression> matched = extractor.extractExpressions(sentence);
    for (MatchedExpression phrase : matched) {
        // Print out matched text and value
        System.out.println("MATCHED ENTITY: " + phrase.getText() + " VALUE: " + phrase.getValue().get());
    }
}

Output:

MATCHED ENTITY: Technical Skill VALUE: SKILL

You can have a look at my rule file in this question. I hope this information helps!

Source

2017-05-01 06:10:04

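Thanks for the insight! – serendipity

I am answering my own question for anyone struggling with a similar problem. The key to getting the output in the correct format lies in how you define the rules in the rules file. Here is what I changed in the rule to change the output:

Old rule: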

{ ruleType: "tokens", 
    pattern: (([pos:/NNP.*/ | pos:/NN.*/]+) ($LocWords)), 
    result: Annotate($1, ner, "LOCATION"), 

}

New rule:

{ ruleType: "tokens", 
    pattern: (([pos:/NNP.*/ | pos:/NN.*/]+) ($LocWords)), 
    action: Annotate($1, ner, "LOCATION"), 
    result: "LOCATION" 

}

How you define the result field determines the output format of your data.

Hope this helps.

Source

2017-05-03 06:39:43 serendipity

Could you take a look at my [question](http://stackoverflow.com/questions/43732780/how-to-modify-tokenregex-rule-in-stanfordnlp)? I have been struggling with it since yesterday and have not been able to solve it. –
