2017-07-25

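Input format for the MaxEnt OpenNLP implementation?

I am trying to use the OpenNLP implementation of the maximum entropy classifier, but the documentation seems quite lacking. Although the library is apparently designed to be easy to use, I cannot find a description of the input file format (i.e., the training set).

Does anyone know where to find this, or a minimal working example of training?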

Answers

3

The format for OpenNLP is quite flexible. If you want to use the MaxEnt classifier in OpenNLP, there are several steps involved. Here is sample code with comments:

package example; 

import java.io.File; 
import java.io.IOException; 
import java.nio.charset.Charset; 
import java.util.Arrays; 
import java.util.HashMap; 
import java.util.Map; 

import opennlp.tools.ml.maxent.GISTrainer; 
import opennlp.tools.ml.model.Event; 
import opennlp.tools.ml.model.MaxentModel; 
import opennlp.tools.tokenize.WhitespaceTokenizer; 
import opennlp.tools.util.FilterObjectStream; 
import opennlp.tools.util.MarkableFileInputStreamFactory; 
import opennlp.tools.util.ObjectStream; 
import opennlp.tools.util.PlainTextByLineStream; 
import opennlp.tools.util.TrainingParameters; 

public class ReadData { 

    public static void main(String[] args) throws Exception { 

     // this is the data file ... 
     // the format is <LIST of FEATURES separated by spaces> <outcome> 
     // change the file to fit your needs 
     File f=new File("football.dat"); 

     // We need to create an ObjectStream of events for the trainer. 
     // First create an InputStreamFactory: given a file it can create an InputStream, which is required for resetting. 
     MarkableFileInputStreamFactory factory=new MarkableFileInputStreamFactory(f); 
     // Create a PlainTextByLineStream. Note: you can write your own stream to handle binary files 
     // or records that span more than one line. 
     ObjectStream<String> stream=new PlainTextByLineStream(factory, Charset.defaultCharset()); 
     // Now that you have a stream of strings, you need to convert it to a stream of events. 
     // I use a custom FilterObjectStream which simply takes a line, breaks it up into tokens, 
     // uses all except the last as the features [context] and the last token as the outcome class. 
     ObjectStream<Event> eventStream=new FilterObjectStream<String, Event>(stream) { 
      @Override 
      public Event read() throws IOException { 
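       String line=samples.read(); 
       // skip blank lines: they tokenize to an empty array and would break the copyOf below 
       while (line!=null && line.trim().isEmpty()) line=samples.read(); 
       if (line==null) return null; 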

       String[] parts=WhitespaceTokenizer.INSTANCE.tokenize(line); 
       String[] context=Arrays.copyOf(parts, parts.length-1); 

       System.out.println(parts[parts.length-1]+" "+Arrays.toString(context)); 
       return new Event(parts[parts.length-1], context); 
      } 
     }; 


     TrainingParameters parameters=new TrainingParameters(); 
     // By default OpenNLP uses a cutoff of 5 (a feature has to occur 5 times before it is used) 
     // use 1 for my small dataset 
     parameters.put(GISTrainer.CUTOFF_PARAM, 1); 

     GISTrainer trainer=new GISTrainer(); 
     // the report map is supposed to mark when default values are assigned... 
     Map<String,String> reportMap=new HashMap<>(); 
     // DON'T FORGET TO INITIALIZE THE TRAINER!!! 
     trainer.init(parameters, reportMap); 
     MaxentModel model=trainer.train(eventStream); 

     // Now we have a model -- you should test on a test set, but 
     // this is a toy example... so I am just resetting the eventstream. 
     eventStream.reset(); 
     Event evt=null; 
     while ((evt=eventStream.read())!=null){ 
      System.out.print(Arrays.toString(evt.getContext())+": "); 
      // Evaluate the context from the event using our model. 
      // you would want to calculate summary statistics.. 
      double[] p=model.eval(evt.getContext()); 
      System.out.print(model.getBestOutcome(p)+" "); 
      if (model.getBestOutcome(p).equals(evt.getOutcome())){ 
       System.out.println("CORRECT"); 
      }else{ 
       System.out.println("INCORRECT");     
      } 
     } 
     // close the stream to release the underlying file 
     eventStream.close(); 

    } 

} 
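
The comment in the evaluation loop says you would want to calculate summary statistics. A minimal sketch of that, using only the JDK, could look like the following. `AccuracyTally` and its methods are hypothetical names introduced for illustration; they are not part of OpenNLP.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper (not part of OpenNLP): tallies accuracy from gold and predicted labels. */
final class AccuracyTally {
    private int total;
    private int correct;
    // key is "gold->predicted", value is how often that pair occurred
    private final Map<String, Integer> confusion = new TreeMap<>();

    /** Records one evaluated example. */
    void add(String gold, String predicted) {
        total++;
        if (gold.equals(predicted)) {
            correct++;
        }
        confusion.merge(gold + "->" + predicted, 1, Integer::sum);
    }

    /** Fraction of examples whose prediction matched the gold label (0.0 if nothing was added). */
    double accuracy() {
        return total == 0 ? 0.0 : (double) correct / total;
    }

    Map<String, Integer> confusion() {
        return confusion;
    }

    public static void main(String[] args) {
        AccuracyTally tally = new AccuracyTally();
        tally.add("arsenal", "arsenal");
        tally.add("man_united", "man_united");
        tally.add("tie", "arsenal");
        tally.add("arsenal", "arsenal");
        System.out.printf(Locale.ROOT, "accuracy=%.2f%n", tally.accuracy()); // prints accuracy=0.75
        System.out.println(tally.confusion());
    }
}
```

In the loop above you would call something like `tally.add(evt.getOutcome(), model.getBestOutcome(p))` instead of printing CORRECT or INCORRECT for each line, then report the totals once at the end.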

football.dat:

home=man_united Beckham=false Scholes=true Neville=true Henry=true Kanu=true Parlour=false Ferguson=confident Wengler=tense arsenal_lost_previous man_united_won_previous arsenal 
home=man_united Beckham=true Scholes=false Neville=true Henry=false Kanu=true Parlour=false Ferguson=tense Wengler=confident arsenal_won_previous man_united_lost_previous man_united 
home=man_united Beckham=false Scholes=true Neville=true Henry=true Kanu=true Parlour=false Ferguson=tense Wengler=tense arsenal_lost_previous man_united_won_previous tie 
home=man_united Beckham=true Scholes=true Neville=false Henry=true Kanu=false Parlour=false Ferguson=confident Wengler=confident arsenal_won_previous man_united_won_previous tie 
home=man_united Beckham=false Scholes=true Neville=true Henry=true Kanu=true Parlour=false Ferguson=confident Wengler=tense arsenal_won_previous man_united_won_previous arsenal 
home=man_united Beckham=false Scholes=true Neville=true Henry=false Kanu=true Parlour=false Ferguson=confident Wengler=confident arsenal_won_previous man_united_won_previous man_united 
home=man_united Beckham=true Scholes=true Neville=false Henry=true Kanu=true Parlour=false Ferguson=confident Wengler=tense arsenal_won_previous man_united_won_previous man_united 
home=arsenal Beckham=false Scholes=true Neville=true Henry=true Kanu=true Parlour=false Ferguson=confident Wengler=tense arsenal_lost_previous man_united_won_previous arsenal 
home=arsenal Beckham=true Scholes=false Neville=true Henry=false Kanu=true Parlour=false Ferguson=tense Wengler=confident arsenal_won_previous man_united_lost_previous arsenal 
home=arsenal Beckham=false Scholes=true Neville=true Henry=true Kanu=true Parlour=false Ferguson=tense Wengler=tense arsenal_lost_previous man_united_won_previous tie 
home=arsenal Beckham=true Scholes=true Neville=false Henry=true Kanu=false Parlour=false Ferguson=confident Wengler=confident arsenal_won_previous man_united_won_previous man_united 
home=arsenal Beckham=false Scholes=true Neville=true Henry=true Kanu=true Parlour=false Ferguson=confident Wengler=tense arsenal_won_previous man_united_won_previous arsenal 
home=arsenal Beckham=false Scholes=true Neville=true Henry=false Kanu=true Parlour=false Ferguson=confident Wengler=confident arsenal_won_previous man_united_won_previous man_united 
home=arsenal Beckham=true Scholes=true Neville=false Henry=true Kanu=true Parlour=false Ferguson=confident Wengler=tense arsenal_won_previous man_united_won_previous arsenal 
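
To make the line format explicit: each line is a whitespace-separated list of feature strings, and the last token is the outcome. The sketch below shows that split with plain JDK calls, without OpenNLP. `EventLineParser` and `ParsedLine` are hypothetical names used only for illustration, and returning null for a blank line is my own assumption about how to handle that case.

```java
import java.util.Arrays;

/** Hypothetical helper (not part of OpenNLP): splits one training line into context and outcome. */
final class EventLineParser {

    /** Simple holder for one parsed line. */
    static final class ParsedLine {
        final String[] context;
        final String outcome;

        ParsedLine(String[] context, String outcome) {
            this.context = context;
            this.outcome = outcome;
        }
    }

    /** Returns null for a blank line; otherwise the last token is taken as the outcome. */
    static ParsedLine parse(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return null;
        }
        String[] parts = trimmed.split("\\s+");
        // everything except the last token is the feature list (the context)
        String[] context = Arrays.copyOf(parts, parts.length - 1);
        return new ParsedLine(context, parts[parts.length - 1]);
    }

    public static void main(String[] args) {
        ParsedLine p = parse("home=arsenal Beckham=false arsenal_won_previous arsenal ");
        // prints: arsenal [home=arsenal, Beckham=false, arsenal_won_previous]
        System.out.println(p.outcome + " " + Arrays.toString(p.context));
    }
}
```

Note that a feature such as `home=arsenal` is treated as one opaque string; the `=` has no special meaning in this sketch.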

Hope that helps.

+0

I had already solved the problem, but thanks for the comprehensive answer! –

+0

Do you also know how to pass both textual and numerical features? That is, how do I tell the system to interpret a number as a number if I want to pass a vector of real values as features? –

+0

(Sorry it's been a while...) I'm not sure whether OpenNLP handles numeric features. Have you considered using logistic regression with categorical and numeric features? – HowYaDoing
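
One general workaround for the numeric-feature question, independent of whatever OpenNLP itself supports, is to discretize each real value into a bucket and pass the bucket label as an ordinary string feature. The sketch below assumes fixed-width buckets; `NumericFeatureBinner`, the feature name `goals_per_game`, and the width are all made up for illustration, and binning loses information compared with a true real-valued feature.

```java
/** Hypothetical helper (not part of OpenNLP): turns a real value into a categorical feature string. */
final class NumericFeatureBinner {

    /**
     * Maps a value to a fixed-width bucket and returns a feature such as "name=bin3".
     * The width must be positive; choosing it is left to the caller.
     */
    static String bin(String name, double value, double width) {
        if (width <= 0) {
            throw new IllegalArgumentException("width must be positive");
        }
        long index = (long) Math.floor(value / width);
        return name + "=bin" + index;
    }

    public static void main(String[] args) {
        // 1.7 / 0.5 = 3.4, so the value falls into bucket 3
        System.out.println(bin("goals_per_game", 1.7, 0.5)); // prints goals_per_game=bin3
    }
}
```

The resulting string can be placed in the context array next to the textual features, so the classifier sees it like any other feature.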
