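The train() and evaluate() functions take a list of lists of tuples as input: each inner list is one sentence, and each tuple in it is a pair of strings (token, tag).
Consider train.txt and test.txt, where each line holds a token and its tag separated by a tab, and a blank line separates sentences: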
$ cat train.txt
This	foo
is	foo
a	foo
sentence	bar
.	.

That	foo
is	foo
another	foo
sentence	bar
in	foo
conll	bar
format	bar
.	.

$ cat test.txt
What	foo
is	foo
this	foo
sentence	bar
?	?

How	foo
about	foo
that	foo
sentence	bar
?	?
Read the CoNLL-format files into lists of tuples:
# Using https://github.com/alvations/lazyme
>>> from lazyme import per_section
>>> tagged_train_sentences = [[tuple(token.split('\t')) for token in sent] for sent in per_section(open('train.txt'))]
# Or otherwise
>>> def per_section(it, is_delimiter=lambda x: x.isspace()):
...     """
...     From http://stackoverflow.com/a/25226944/610569
...     """
...     ret = []
...     for line in it:
...         if is_delimiter(line):
...             if ret:
...                 yield ret  # OR ''.join(ret)
...                 ret = []
...         else:
...             ret.append(line.rstrip())  # OR ret.append(line)
...     if ret:
...         yield ret
...
>>>
>>> tagged_test_sentences = [[tuple(token.split('\t')) for token in sent] for sent in per_section(open('test.txt'))]
>>> tagged_test_sentences
[[('What', 'foo'), ('is', 'foo'), ('this', 'foo'), ('sentence', 'bar'), ('?', '?')], [('How', 'foo'), ('about', 'foo'), ('that', 'foo'), ('sentence', 'bar'), ('?', '?')]]
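If you would rather not depend on lazyme, here is a minimal stdlib-only sketch of the same idea. The names read_conll and has_tagger_shape are my own (hypothetical, not part of NLTK or lazyme), and it assumes one tab-separated token/tag pair per line with blank lines between sentences. The shape check is just a quick sanity test before handing the data to a tagger.

```python
import os
import tempfile


def read_conll(path, sep="\t"):
    """Read a token<sep>tag file into a list of sentences (lists of tuples).

    Hypothetical helper: assumes exactly two fields per non-blank line.
    """
    sentences, current = [], []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line.strip():
                # A blank line closes the current sentence.
                if current:
                    sentences.append(current)
                    current = []
            else:
                token, tag = line.split(sep)
                current.append((token, tag))
    if current:
        # The file may not end with a blank line.
        sentences.append(current)
    return sentences


def has_tagger_shape(sentences):
    """Return True for a list of lists of (str, str) tuples."""
    return all(
        isinstance(sent, list)
        and all(
            isinstance(pair, tuple)
            and len(pair) == 2
            and all(isinstance(item, str) for item in pair)
            for pair in sent
        )
        for sent in sentences
    )


# Demo on a throwaway copy of test.txt so the sketch runs anywhere.
sample = (
    "What\tfoo\nis\tfoo\nthis\tfoo\nsentence\tbar\n?\t?\n"
    "\n"
    "How\tfoo\nabout\tfoo\nthat\tfoo\nsentence\tbar\n?\t?\n"
)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "test.txt")
    with open(demo_path, "w", encoding="utf-8") as out:
        out.write(sample)
    demo_sentences = read_conll(demo_path)

print(len(demo_sentences), has_tagger_shape(demo_sentences))
```

Using a with block also closes the file handle, which the one-liners above leave open.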
Now you can train and evaluate the tagger:
>>> from lazyme import per_section
>>> tagged_train_sentences = [[tuple(token.split('\t')) for token in sent] for sent in per_section(open('train.txt'))]
>>> from nltk.tag.perceptron import PerceptronTagger
>>> pct = PerceptronTagger(load=False)
>>> pct.train(tagged_train_sentences)
>>> pct.tag('Where do I find a foo bar sentence ?'.split())
[('Where', 'foo'), ('do', 'foo'), ('I', '.'), ('find', 'foo'), ('a', 'foo'), ('foo', 'bar'), ('bar', 'foo'), ('sentence', 'bar'), ('?', '.')]
>>> tagged_test_sentences = [[tuple(token.split('\t')) for token in sent] for sent in per_section(open('test.txt'))]
>>> pct.evaluate(tagged_test_sentences)
0.8
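As far as I understand it, the number returned by evaluate() is token-level accuracy: the fraction of tokens whose predicted tag equals the gold tag. The sketch below recomputes that by hand. token_accuracy is a hypothetical helper of mine, and the predicted tags are made up for illustration (assuming the tagger gets only the two '?' tokens wrong, as the tag() output above suggests); they are not real tagger output.

```python
def token_accuracy(gold_sentences, predicted_sentences):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for gold, predicted in zip(gold_sentences, predicted_sentences):
        for (_, gold_tag), (_, predicted_tag) in zip(gold, predicted):
            correct += gold_tag == predicted_tag
            total += 1
    return correct / total if total else 0.0


# Gold annotations, as read from test.txt.
gold = [
    [("What", "foo"), ("is", "foo"), ("this", "foo"), ("sentence", "bar"), ("?", "?")],
    [("How", "foo"), ("about", "foo"), ("that", "foo"), ("sentence", "bar"), ("?", "?")],
]

# ASSUMPTION: illustrative predictions where only the '?' tokens are mistagged.
predicted = [
    [("What", "foo"), ("is", "foo"), ("this", "foo"), ("sentence", "bar"), ("?", ".")],
    [("How", "foo"), ("about", "foo"), ("that", "foo"), ("sentence", "bar"), ("?", ".")],
]

print(token_accuracy(gold, predicted))  # 8 of 10 tokens match -> 0.8
```

The tag '?' never appears in train.txt, so a tagger trained on it cannot predict that tag, which would explain a score below 1.0 on this test set.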
Thank you! This was an excellent explanation. – ellen