Reading my own dataset for part-of-speech tagging with NLTK's PerceptronTagger

I am fairly new to NLTK and still pretty new to Python. I would like to train and test NLTK's perceptron tagger on my own dataset. The training and test data have the following format (they are simply saved in txt files):

Pierre NNP
Vinken NNP
,  ,
61  CD
years NNS
old  JJ
,  ,
will MD
join VB
the  DT
board NN
as  IN
a  DT
nonexecutive JJ
director  NN
Nov. NNP
29  CD
.  .

I would like to call these functions on the data:

perceptron_tagger = nltk.tag.perceptron.PerceptronTagger(load=False) 
perceptron_tagger.train(train_data) 
accuracy = perceptron_tagger.evaluate(test_data)

I have tried a few things, but I just can't figure out what format the data is expected to be in. Any help is appreciated! Thanks.

Source

2017-12-03 ellen

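The train() and evaluate() functions of PerceptronTagger expect a list of lists of tuples as input. Each inner list is one sentence, and each tuple in it is a pair of strings: the word and its tag.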
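To make that shape concrete, here is a minimal sketch using tokens from the question's example. The sentence is split in two purely for illustration, and `is_valid_tagged_corpus` is a hypothetical helper written for this example, not part of NLTK.

```python
# A corpus is a list of sentences; a sentence is a list of (word, tag) tuples.
# Tokens come from the question's example, split into two "sentences"
# only to show the nesting (assumption for illustration).
train_data = [
    [("Pierre", "NNP"), ("Vinken", "NNP"), (",", ","), ("61", "CD"), ("years", "NNS")],
    [("will", "MD"), ("join", "VB"), ("the", "DT"), ("board", "NN"), (".", ".")],
]


def is_valid_tagged_corpus(corpus):
    """Hypothetical sanity check: list of lists of (str, str) tuples."""
    if not isinstance(corpus, list):
        return False
    for sentence in corpus:
        if not isinstance(sentence, list):
            return False
        for token in sentence:
            if not (isinstance(token, tuple) and len(token) == 2):
                return False
            if not all(isinstance(part, str) for part in token):
                return False
    return True


print(is_valid_tagged_corpus(train_data))    # nested structure
print(is_valid_tagged_corpus([("a", "b")]))  # flat list of tuples, missing the sentence level
```

A flat list of tuples (the second call) is a common mistake: it lacks the sentence level, so the check rejects it.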

Given train.txt and test.txt:

$ cat train.txt 
This foo 
is foo 
a foo 
sentence bar 
. . 

That foo 
is foo 
another foo 
sentence bar 
in foo 
conll bar 
format bar 
. . 

$ cat test.txt 
What foo 
is foo 
this foo 
sentence bar 
? ? 

How foo 
about foo 
that foo 
sentence bar 
? ?

Read the CoNLL-format files into lists of lists of tuples:

# Using https://github.com/alvations/lazyme 
>>> from lazyme import per_section 
>>> tagged_train_sentences = [[tuple(token.split('\t')) for token in sent] for sent in per_section(open('train.txt'))] 

# Or otherwise 

>>> def per_section(it, is_delimiter=lambda x: x.isspace()):
...     """
...     From http://stackoverflow.com/a/25226944/610569
...     """
...     ret = []
...     for line in it:
...         if is_delimiter(line):
...             if ret:
...                 yield ret  # OR ''.join(ret)
...                 ret = []
...         else:
...             ret.append(line.rstrip())  # OR ret.append(line)
...     if ret:
...         yield ret
...
>>> 
>>> tagged_test_sentences = [[tuple(token.split('\t')) for token in sent] for sent in per_section(open('test.txt'))] 
>>> tagged_test_sentences 
[[('What', 'foo'), ('is', 'foo'), ('this', 'foo'), ('sentence', 'bar'), ('?', '?')], [('How', 'foo'), ('about', 'foo'), ('that', 'foo'), ('sentence', 'bar'), ('?', '?')]]
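The snippets above split each line on a tab character, while the file listings here appear space-separated (possibly just a rendering artifact). As a hedged alternative, the sketch below splits on any run of whitespace, so it should work for either layout. `read_tagged_sentences` is a hypothetical name, and the sketch assumes every non-blank line holds exactly a word followed by a tag.

```python
def read_tagged_sentences(path, encoding="utf-8"):
    """Read a one-token-per-line file into a list of lists of (word, tag) tuples.

    Blank lines separate sentences. Each non-blank line is assumed to be
    "<word><whitespace><tag>"; a line with a single field raises ValueError.
    """
    sentences, current = [], []
    with open(path, encoding=encoding) as f:
        for line in f:
            line = line.strip()
            if not line:
                # A blank line closes the current sentence, if any.
                if current:
                    sentences.append(current)
                    current = []
                continue
            # Split on the last run of whitespace (tabs or spaces).
            word, tag = line.rsplit(None, 1)
            current.append((word, tag))
    # Flush the final sentence when the file does not end with a blank line.
    if current:
        sentences.append(current)
    return sentences
```

With the files shown earlier, `read_tagged_sentences('test.txt')` should produce the same nested structure as `tagged_test_sentences`.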

Now you can train and evaluate the tagger:

>>> from lazyme import per_section 
>>> tagged_train_sentences = [[tuple(token.split('\t')) for token in sent] for sent in per_section(open('train.txt'))] 
>>> from nltk.tag.perceptron import PerceptronTagger 
>>> pct = PerceptronTagger(load=False) 
>>> pct.train(tagged_train_sentences) 
>>> pct.tag('Where do I find a foo bar sentence ?'.split()) 
[('Where', 'foo'), ('do', 'foo'), ('I', '.'), ('find', 'foo'), ('a', 'foo'), ('foo', 'bar'), ('bar', 'foo'), ('sentence', 'bar'), ('?', '.')] 
>>> tagged_test_sentences = [[tuple(token.split('\t')) for token in sent] for sent in per_section(open('test.txt'))] 
>>> pct.evaluate(tagged_test_sentences) 
0.8
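As far as I can tell, the number returned by evaluate() is token-level accuracy: the tagger re-tags the words of each gold sentence, and the score is the fraction of tokens whose predicted tag matches the gold tag. The sketch below reimplements that idea in plain Python to show what is being measured. `token_accuracy` and `always_foo` are hypothetical names for this illustration, not NLTK functions.

```python
def token_accuracy(tag_fn, gold_sentences):
    """Fraction of tokens for which tag_fn predicts the gold tag.

    tag_fn takes a list of words and returns a list of (word, tag) tuples,
    which is the same contract as PerceptronTagger.tag.
    """
    correct = total = 0
    for gold in gold_sentences:
        words = [word for word, _ in gold]
        predicted = tag_fn(words)
        for (_, gold_tag), (_, predicted_tag) in zip(gold, predicted):
            correct += gold_tag == predicted_tag
            total += 1
    return correct / total if total else 0.0


def always_foo(words):
    """Dummy stand-in tagger (assumption): labels every word 'foo'."""
    return [(word, "foo") for word in words]


# Gold data copied from the test.txt example above.
gold = [
    [("What", "foo"), ("is", "foo"), ("this", "foo"), ("sentence", "bar"), ("?", "?")],
    [("How", "foo"), ("about", "foo"), ("that", "foo"), ("sentence", "bar"), ("?", "?")],
]

print(token_accuracy(always_foo, gold))  # 6 of 10 tokens are 'foo' -> 0.6
```

Passing the trained tagger's bound method, as in `token_accuracy(pct.tag, tagged_test_sentences)`, should give a comparable figure. I believe recent NLTK releases deprecate evaluate() in favor of an accuracy() method with the same argument, so check the version you have installed.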

Source

2017-12-04 06:42:14 alvas

Thank you! That was an excellent explanation. – ellen
