Training a named entity model in OpenNLP
I am training a model for named entity recognition, but it does not identify person names correctly.
I have a Java file to train the model:
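import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class NamedEntityModel {
    public static void train(String inputfile, String modelfile) throws IOException {
        Charset charset = Charset.forName("UTF-8");
        // Read the annotated training file line by line
        MarkableFileInputStreamFactory factory = new MarkableFileInputStreamFactory(new File(inputfile));
        ObjectStream<String> lineStream = new PlainTextByLineStream(factory, charset);
        ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);
        TokenNameFinderModel model = null;
        try {
            model = NameFinderME.train("en", "person", sampleStream, TrainingParameters.defaultParams(),
                    new TokenNameFinderFactory());
        } finally {
            sampleStream.close();
        }
        // Write the trained model to disk
        BufferedOutputStream modelOut = null;
        try {
            modelOut = new BufferedOutputStream(new FileOutputStream(modelfile));
            model.serialize(modelOut);
        } finally {
            if (modelOut != null)
                modelOut.close();
        }
    }
}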
My training data looks like this:
<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 . A nonexecutive director has many similar responsibilities as an executive director.However, there are no voting rights with this position.`
Mr . <START:person> Vinken <END> is chairman of Elsevier N.V., the Dutch publishing group.
The former chairman of the society <START:person> Rudolph Agnew <END> will be assisting <START:person> Vinken <End> in his activities.
Mr . <START:person> Vinken <END> is the most right person in the industry.
His competitior <START:person> Steve <END> is vice chairman of Himbeldon N.V., the Ericson publishing group.
<START:person> Vinken <END> will also be assisted by <START:person> Angelina Tucci <END> who has been recognized many times For Her Good Work.
<START:person> Juilie <END> vp of Weterwood A.B., THE ZS publishing group also supported him.
Mr . <START:person> Stewart <END> is a recruiter of Metric C.D., the Drishti publishing.
He recruited <START:person> Adam <END> who will work on nlp for <START:person> Vinken <END> .
The lead conference for appointing him as a director was held by <START:person> Daniel Smith <END> at Boston.
And this is what the main class looks like:
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.Span;

public class NameFinder {
    public static void main(String[] args) throws IOException {
        String inputfile = "C:/setup/apache-opennlp-1.7.2/bin/ner_training_data.txt";
        String modelfile = "C:/setup/apache-opennlp-1.7.2/bin/en-tr-ner-person.bin";
        NamedEntityModel.train(inputfile, modelfile);
        String sentence = "Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 . Mr . Vinken is chairman of Elsevier N.V. , the Dutch publishing group. Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields PLC , was named a director of this British industrial conglomerate . Peter is on leave today . "
                + "Steve is his competitor . Daniel Smith lead the ceremony. Kristen is svery happpy to know about it. Thomas will u please look into the matter as Ruby is busy";
        WhitespaceTokenizer whitespaceTokenizer = WhitespaceTokenizer.INSTANCE;
        // Tokenize the given paragraph
        String[] tokens = whitespaceTokenizer.tokenize(sentence);
        for (String str : tokens)
            System.out.println(str);
        // Load the trained model; try-with-resources closes the stream
        TokenNameFinderModel model;
        try (InputStream inputStreamNameFinder = new FileInputStream(modelfile)) {
            model = new TokenNameFinderModel(inputStreamNameFinder);
        }
        NameFinderME nameFinder = new NameFinderME(model);
        Span[] nameSpans = nameFinder.find(tokens);
        System.out.println(Arrays.toString(Span.spansToStrings(nameSpans, tokens)));
        for (Span s : nameSpans)
            System.out.println(s.toString() + " " + tokens[s.getStart()]);
    }
}
And the output is:
[Pierre Vinken, Vinken, Peter, Steve, Daniel Smith, Kristen, Thomas]
The trained model fails to recognize names such as Rudolph Agnew and Ruby. How can I train it so that it recognizes names more accurately?
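One thing worth ruling out before adding more data is a malformed annotation. The training data above contains `<End>` where the other lines use `<END>`, and as far as I know OpenNLP matches the end tag exactly, so that annotation may be lost. Below is a minimal sketch of a sanity check for the training file. `AnnotationChecker` is a hypothetical helper, not part of OpenNLP, and the tag shapes it accepts are assumptions taken from the data shown in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP.
class AnnotationChecker {
    // Assumed tag shapes, based on the training data in the question.
    private static final Pattern START = Pattern.compile("<START(:[^:>\\s]*)?>");
    private static final String END = "<END>";

    /** Returns one message per problem found in a single training line. */
    static List<String> check(String line) {
        List<String> problems = new ArrayList<>();
        boolean insideName = false;
        for (String token : line.trim().split("\\s+")) {
            if (START.matcher(token).matches()) {
                if (insideName) problems.add("nested start tag: " + token);
                insideName = true;
            } else if (token.equals(END)) {
                if (!insideName) problems.add("end tag without start tag");
                insideName = false;
            } else if (token.startsWith("<") && token.endsWith(">")) {
                // Looks like a tag but is neither a valid start nor a valid end.
                problems.add("unrecognized tag: " + token);
            }
        }
        if (insideName) problems.add("start tag never closed");
        return problems;
    }

    public static void main(String[] args) {
        String line = "will be assisting <START:person> Vinken <End> in his activities.";
        for (String problem : check(line)) {
            System.out.println(problem);
        }
    }
}
```

Running `check` over every line of the training file and fixing whatever it reports should at least guarantee that each name you annotated reaches the trainer.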
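You don't need 150k sentences to start. You can begin with something like 500 or 1000 sentences. After that, you can annotate new text with the model, correct the annotations, and add the verified sentences to your corpus. That can speed up your process. – wcolen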
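The "annotate, correct, add to the corpus" loop from the comment could be supported by a small step that writes the model's predictions back out in the training format, so that a person only has to correct them. The following is a rough sketch under assumptions: `PreAnnotator` is a hypothetical helper, and the `int[]{start, end}` pairs stand in for the `Span` objects returned by `nameFinder.find(tokens)` in the main class, with `end` exclusive. OpenNLP may be able to render this format itself, so treat this only as an illustration of the idea.

```java
// Hypothetical helper, not part of OpenNLP.
class PreAnnotator {
    /**
     * Renders tokens with name tags in the training format used in the question.
     * spans: {start, end} token indices, end exclusive, sorted and non-overlapping
     * (an assumption of this sketch).
     */
    static String render(String[] tokens, int[][] spans, String type) {
        StringBuilder sb = new StringBuilder();
        int next = 0; // index of the next span to open or close
        for (int i = 0; i < tokens.length; i++) {
            if (next < spans.length && spans[next][0] == i) {
                sb.append("<START:").append(type).append("> ");
            }
            sb.append(tokens[i]).append(' ');
            if (next < spans.length && spans[next][1] == i + 1) {
                sb.append("<END> ");
                next++;
            }
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String[] tokens = {"Rudolph", "Agnew", "was", "named", "a", "director", "."};
        int[][] predicted = {{0, 2}}; // would come from the trained model
        System.out.println(render(tokens, predicted, "person"));
    }
}
```

The rendered lines can be reviewed by hand, fixed where the model missed or mislabeled a name, and then appended to the training file before retraining.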