First, set up the Stanford tools and NLTK correctly, e.g. on Linux:
$ cd
$ wget http://nlp.stanford.edu/software/stanford-parser-full-2015-12-09.zip
$ unzip stanford-parser-full-2015-12-09.zip
$ ls stanford-parser-full-2015-12-09
bin ejml-0.23.jar lexparser-gui.sh LICENSE.txt README_dependencies.txt StanfordDependenciesManual.pdf
build.xml ejml-0.23-src.zip lexparser_lang.def Makefile README.txt stanford-parser-3.6.0-javadoc.jar
conf lexparser.bat lexparser-lang.sh ParserDemo2.java ShiftReduceDemo.java stanford-parser-3.6.0-models.jar
data lexparser-gui.bat lexparser-lang-train-test.sh ParserDemo.java slf4j-api.jar stanford-parser-3.6.0-sources.jar
DependencyParserDemo.java lexparser-gui.command lexparser.sh pom.xml slf4j-simple.jar stanford-parser.jar
$ export STANFORDTOOLSDIR=$HOME
$ export CLASSPATH=$STANFORDTOOLSDIR/stanford-parser-full-2015-12-09/stanford-parser.jar:$STANFORDTOOLSDIR/stanford-parser-full-2015-12-09/stanford-parser-3.6.0-models.jar
(See https://gist.github.com/alvations/e1df0ba227e542955a8a for details, and https://gist.github.com/alvations/0ed8641d7d2e1941b9f9 for Windows instructions.)
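If you would rather not export these variables in your shell, the same class path can be set from inside Python before the parser is created. A minimal sketch, assuming the zip was unpacked into $HOME as above (NLTK's Stanford wrappers consult the CLASSPATH environment variable when locating the jars):
>>> import os
>>> stanford_dir = os.path.join(os.environ['HOME'], 'stanford-parser-full-2015-12-09')
>>> os.environ['CLASSPATH'] = ':'.join([
...     os.path.join(stanford_dir, 'stanford-parser.jar'),
...     os.path.join(stanford_dir, 'stanford-parser-3.6.0-models.jar')])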
Then tokenize your text into a list of strings, where each item in the list is one sentence, using the Kiss and Strunk (2006) sentence tokenizer:
>>> from nltk import sent_tokenize, word_tokenize
>>> sentences = 'This is the first sentence. This is the second. And this is the third'
>>> sent_tokenize(sentences)
['This is the first sentence.', 'This is the second.', 'And this is the third']
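word_tokenize is imported above but not used in this example; for completeness, this is what NLTK's default Treebank-style word tokenizer returns on the first sentence:
>>> word_tokenize(sent_tokenize(sentences)[0])
['This', 'is', 'the', 'first', 'sentence', '.']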
Then instantiate the Stanford parser and feed it the stream of sentences:
>>> from nltk.parse.stanford import StanfordParser
>>> parser = StanfordParser(model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")
>>> list(list(parsed_sent) for parsed_sent in parser.raw_parse_sents(sent_tokenize(sentences)))
[[Tree('ROOT', [Tree('S', [Tree('NP', [Tree('DT', ['This'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['first']), Tree('NN', ['sentence'])])]), Tree('.', ['.'])])])], [Tree('ROOT', [Tree('S', [Tree('NP', [Tree('DT', ['This'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('NP', [Tree('DT', ['the']), Tree('NN', ['second'])])]), Tree('.', ['.'])])])], [Tree('ROOT', [Tree('S', [Tree('CC', ['And']), Tree('NP', [Tree('DT', ['this'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['third'])])])])])]]
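Each inner element is an nltk.tree.Tree, so the usual Tree methods apply. A small sketch, with the result above stored in a hypothetical variable parses:
>>> parses = list(list(p) for p in parser.raw_parse_sents(sent_tokenize(sentences)))
>>> first_tree = parses[0][0]   # parse of the first sentence
>>> first_tree.label()
'ROOT'
>>> first_tree.leaves()         # tokens of that sentence
['This', 'is', 'the', 'first', 'sentence', '.']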
Alternatively, instead of using the pre-built 'sent_tokenize', you can also train your own Punkt tokenizer; see http://stackoverflow.com/questions/21160310/training-data-format-for-nltk-punkt – alvas
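The Punkt trainer is unsupervised, so it only needs raw text from your target domain. A minimal sketch (my_corpus.txt and training_text are hypothetical placeholders):
>>> from nltk.tokenize.punkt import PunktSentenceTokenizer
>>> training_text = open('my_corpus.txt').read()    # hypothetical: raw text from your domain
>>> punkt = PunktSentenceTokenizer(training_text)   # learns abbreviation/collocation statistics unsupervised
>>> punkt.tokenize(sentences)                       # splits text using the learned model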