TokensRegex: Tokens are null after retokenization

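I'm experimenting with Stanford NLP's TokensRegex and trying to find dimensions (e.g. 100x120) in a text. My plan is to first retokenize the input so that these tokens are split further (using the example provided in retokenize.rules.txt), and then to search for the new pattern. After the retokenization, however, only null values are left in place of the original string: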

The top level annotation 
[Text=100x120 Tokens=[null-1, null-2, null-3] Sentences=[100x120]]

The retokenization itself seems to work (there are 3 tokens in the result), but the values are lost. How can I keep the original values in the list of tokens?

My retokenize.rules.txt file is (as in the demo):

tokens = { type: "CLASS", value:"edu.stanford.nlp.ling.CoreAnnotations$TokensAnnotation" } 
options.matchedExpressionsAnnotationKey = tokens; 
options.extractWithTokens = TRUE; 
options.flatten = TRUE; 
ENV.defaults["ruleType"] = "tokens" 
ENV.defaultStringPatternFlags = 2 
ENV.defaultResultAnnotationKey = tokens 

{ pattern: (/\d+(x|X)\d+/), result: Split($0[0], /x|X/, TRUE) }
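For reference, the split this rule asks for can be reproduced with plain java.util.regex, independent of CoreNLP. This is only a minimal sketch, assuming that the third argument (TRUE) of Split means the matched delimiter is kept as a token of its own, which is consistent with the three tokens in the output above. The names SplitSketch and splitKeepingDelimiters are hypothetical and not part of TokensRegex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of CoreNLP: mimics the assumed behavior of
// Split($0[0], /x|X/, TRUE) on a single token's text.
final class SplitSketch {
    // Same delimiter pattern as in the rule above.
    private static final Pattern DELIMITER = Pattern.compile("x|X");

    // Splits the token text on the delimiter and keeps each delimiter
    // as a separate part (assumption about what TRUE means in the rule).
    static List<String> splitKeepingDelimiters(String token) {
        List<String> parts = new ArrayList<>();
        Matcher m = DELIMITER.matcher(token);
        int last = 0;
        while (m.find()) {
            if (m.start() > last) {
                parts.add(token.substring(last, m.start())); // text before the delimiter
            }
            parts.add(m.group()); // the delimiter itself
            last = m.end();
        }
        if (last < token.length()) {
            parts.add(token.substring(last)); // text after the last delimiter
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitKeepingDelimiters("100x120")); // prints [100, x, 120]
    }
}
```

These three strings are what one would expect to see as the token texts after retokenization, rather than null-1, null-2, null-3.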

Main method:

public static void main(String[] args) throws IOException { 
    //... 
    text = "100x120"; 
    Properties properties = new Properties(); 
    properties.setProperty("tokenize.language", "de"); 
    properties.setProperty("annotators", "tokenize,retokenize,ssplit,pos,lemma,ner");
    properties.setProperty("customAnnotatorClass.retokenize", "edu.stanford.nlp.pipeline.TokensRegexAnnotator"); 
    properties.setProperty("retokenize.rules", "retokenize.rules.txt"); 
    StanfordCoreNLP stanfordPipeline = new StanfordCoreNLP(properties); 
    runPipeline(stanfordPipeline, text);

}

Pipeline:

public static void runPipeline(StanfordCoreNLP pipeline, String text) { 
    Annotation annotation = new Annotation(text); 
    pipeline.annotate(annotation); 
    out.println(); 
    out.println("The top level annotation"); 
    out.println(annotation.toShorterString()); 
    //... 
}

Source

2016-04-27 cferner

Thanks for letting us know. CoreAnnotations.ValueAnnotation is not being populated, and we'll update TokensRegex to populate that field.

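Regardless, you should still be able to use TokensRegex to retokenize as you planned. Most of the pipeline does not depend on ValueAnnotation and uses CoreAnnotations.TextAnnotation instead. You can use CoreAnnotations.TextAnnotation to get the text of the new tokens (each token is a CoreLabel, so you can also access it with token.word()).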

See TokensRegexRetokenizeDemo for example code on how to get the different annotations out.

Source

2016-04-27 19:04:40
