2012-03-21

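Extracting information from the Stanford Parser's context-free phrase structure output

The Stanford Parser (http://nlp.stanford.edu/software/lex-parser.shtml) produces context-free phrase structure trees like the one below. What is the best way to extract things such as all the noun phrases (NP) and verb phrases (VP) in the tree? Is there a Python (or Java) library that can read structures like these? Thank you.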

(ROOT 
    (S 
    (S 
     (NP 
     (NP (DT The) (JJS strongest) (NN rain)) 
     (VP 
      (ADVP (RB ever)) 
      (VBN recorded) 
      (PP (IN in) 
      (NP (NNP India))))) 
     (VP 
     (VP (VBD shut) 
      (PRT (RP down)) 
      (NP 
      (NP (DT the) (JJ financial) (NN hub)) 
      (PP (IN of) 
       (NP (NNP Mumbai))))) 
     (, ,) 
     (VP (VBD snapped) 
      (NP (NN communication) (NNS lines))) 
     (, ,) 
     (VP (VBD closed) 
      (NP (NNS airports))) 
     (CC and) 
     (VP (VBD forced) 
      (NP 
      (NP (NNS thousands)) 
      (PP (IN of) 
       (NP (NNS people)))) 
      (S 
      (VP (TO to) 
       (VP 
       (VP (VB sleep) 
        (PP (IN in) 
        (NP (PRP$ their) (NNS offices)))) 
       (CC or) 
       (VP (VB walk) 
        (NP (NN home)) 
        (PP (IN during) 
        (NP (DT the) (NN night)))))))))) 
    (, ,) 
    (NP (NNS officials)) 
    (VP (VBD said) 
     (NP-TMP (NN today))) 
    (. .))) 

Answers


Check out the Natural Language Toolkit (NLTK).

The toolkit is written in Python and provides code for reading exactly these kinds of trees (along with plenty of other things).
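As a rough sketch of the NLTK route, assuming NLTK 3 is installed: `Tree.fromstring` reads a bracketed parse, `subtrees` walks the nodes, and `leaves` returns the words under a node. The short sentence and the helper name `phrases` are made up for illustration and are not part of the question.

```python
from nltk.tree import Tree

# A shortened, made-up parse in the same bracketed format the parser prints.
s = """
(ROOT
  (S
    (NP (NNS officials))
    (VP (VBD closed)
      (NP (DT the) (NNS airports)))
    (. .)))
"""

tree = Tree.fromstring(s)

def phrases(tree, label):
    """Return the text of every subtree whose label equals `label`."""
    matches = tree.subtrees(filter=lambda t: t.label() == label)
    return [" ".join(t.leaves()) for t in matches]

print(phrases(tree, "NP"))
print(phrases(tree, "VP"))
```

The comparison is an exact match on the label, so a node such as `NP-TMP` would not be returned when asking for `NP`.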

Alternatively, you could write your own recursive function to do this; it is fairly straightforward. Just for fun:


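import re

def parse(s):
    # Pad every parenthesis with spaces so each one becomes its own token.
    padded = s.replace('(', ' ( ').replace(')', ' ) ')
    itr = iter(filter(None, re.split(r"\s+", padded)))

    def _parse():
        # Collect items until the ')' that closes the current node.
        stuff = []
        for x in itr:
            if x == ')':
                return stuff
            elif x == '(':
                stuff.append(_parse())
            else:
                stuff.append(x)
        return stuff

    return _parse()[0]

def find(parsed, tag):
    # Leaves are plain strings (words and tags); only lists are tree nodes.
    if not isinstance(parsed, list):
        return
    if parsed[0] == tag:
        yield parsed
    for x in parsed[1:]:
        for y in find(x, tag):
            yield y

# `s` must hold the bracketed tree text shown in the question.
p = parse(s)
np = find(p, 'NP')
for x in np:
    print(x)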

This yields:

['NP', ['NP', ['DT', 'The'], ['JJS', 'strongest'], ['NN', 'rain']], ['VP', ['ADVP', ['RB', 'ever']], ['VBN', 'recorded'], ['PP', ['IN', 'in'], ['NP', ['NNP', 'India']]]]] 
['NP', ['DT', 'The'], ['JJS', 'strongest'], ['NN', 'rain']] 
['NP', ['NNP', 'India']] 
['NP', ['NP', ['DT', 'the'], ['JJ', 'financial'], ['NN', 'hub']], ['PP', ['IN', 'of'], ['NP', ['NNP', 'Mumbai']]]]
['NP', ['DT', 'the'], ['JJ', 'financial'], ['NN', 'hub']] 
['NP', ['NNP', 'Mumbai']] 
['NP', ['NN', 'communication'], ['NNS', 'lines']] 
['NP', ['NNS', 'airports']] 
['NP', ['NP', ['NNS', 'thousands']], ['PP', ['IN', 'of'], ['NP', ['NNS', 'people']]]] 
['NP', ['NNS', 'thousands']] 
['NP', ['NNS', 'people']] 
['NP', ['PRP$', 'their'], ['NNS', 'offices']] 
['NP', ['NN', 'home']] 
['NP', ['DT', 'the'], ['NN', 'night']] 
['NP', ['NNS', 'officials']] 
The code above is a super simple implementation of what you want.
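If the phrase text is wanted rather than the nested lists, one possible follow-up is to flatten a node back into its words. This is only a sketch: the helper name `leaves` is hypothetical, and it assumes the node layout printed above, where the first element of each list is the label and words are plain strings.

```python
def leaves(node):
    """Return the words under a nested-list node, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:  # node[0] is the label, so skip it
        words.extend(leaves(child))
    return words

# One of the NP nodes from the output above.
np = ['NP', ['NP', ['DT', 'the'], ['JJ', 'financial'], ['NN', 'hub']],
      ['PP', ['IN', 'of'], ['NP', ['NNP', 'Mumbai']]]]

print(" ".join(leaves(np)))  # the financial hub of Mumbai
```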