Is there a way to make pycorenlp's `nlp.annotate()` always return the same type of result?

I am using pycorenlp to tokenize text that contains non-ASCII characters. Sometimes nlp.annotate() returns a dictionary, and sometimes it returns a string. For example:

''' 
From https://github.com/smilli/py-corenlp/blob/master/example.py 
''' 
from pycorenlp import StanfordCoreNLP 
import pprint 
import re 

if __name__ == '__main__': 
    nlp = StanfordCoreNLP('http://localhost:9000') 
    text = u"tab with good effect, denies pain".encode('utf-8') 
    print('type(text): {0}'.format(type(text))) 

    output = nlp.annotate(text, properties={ 
     'annotators': 'tokenize,ssplit', 
     'outputFormat': 'json' 
    }) 
    #pp = pprint.PrettyPrinter(indent=4) 
    #pp.pprint(output) 
    print('type(output): {0}'.format(type(output))) 

    text = u"tab with good effect\u0013\u0013, denies pain".encode('utf-8') 
    print('\ntype(text): {0}'.format(type(text))) 
    output = nlp.annotate(text, properties={ 
     'annotators': 'tokenize,ssplit', 
     'outputFormat': 'json' 
    }) 
    print('type(output): {0}'.format(type(output)))

Output:

type(text): <type 'str'> 
type(output): <type 'dict'> 

type(text): <type 'str'> 
type(output): <type 'unicode'>

I notice that whenever type(output) is <type 'unicode'>, I get this warning on the Stanford CoreNLP server:

WARNING: Untokenizable: ‼ (U+13, decimal: 19)

Is there any way to make nlp.annotate() always return the same type of result?

The Stanford CoreNLP server was started using:

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer 9000

I use Stanford CoreNLP 3.6.0, pycorenlp 0.3.0, and Python 3.5 x64 on Windows 7 SP1 x64 Ultimate.

Source

2016-09-21 Franck Dernoncourt

Quick fix:

import json 
# to place right after `output = nlp.annotate(text, properties={…})` 
if type(output) is str or type(output) is unicode: 
    output = json.loads(output, strict=False)

I use strict=False because of the problem described in "Python json.loads fails with ValueError: Invalid control character at: line 1 column 33 (char 33)".
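The quick fix above could also be wrapped in a small helper that runs on both Python 2 and Python 3, since the name `unicode` does not exist in Python 3. This is only a sketch: `ensure_dict` is a hypothetical name, not part of pycorenlp, and it assumes the string is returned because the server's JSON response contains a raw control character that strict parsing rejects.

```python
import json


def ensure_dict(output):
    """Return `output` as a dict, parsing it first if it arrived as JSON text.

    Hypothetical helper, not part of pycorenlp.
    """
    if isinstance(output, dict):
        return output
    if isinstance(output, bytes):
        # Assumption: the server response is UTF-8 encoded.
        output = output.decode('utf-8')
    # strict=False allows raw control characters (such as U+0013)
    # inside JSON strings, which the default strict mode rejects.
    return json.loads(output, strict=False)


# Stand-in for a server response containing a raw U+0013 character.
raw = u'{"sentences": [{"tokens": [{"word": "effect\u0013"}]}]}'
parsed = ensure_dict(raw)
print(type(parsed))
```

Calling `output = ensure_dict(nlp.annotate(text, properties={...}))` would then give a dict in both cases shown in the question.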

Source

2017-01-25 23:21:52
