NLTK stemmer: string index out of range

I have a pickled text document that I would like to stem using nltk's PorterStemmer. For reasons specific to my project, I want to do the stemming inside a Django app view.

However, when I stem the document inside the Django view, PorterStemmer().stem() raises an exception for the string 'oed': IndexError: string index out of range. As a result, running the following:

# xkcd_project/search/views.py
from django.shortcuts import render
from nltk.stem.porter import PorterStemmer

def get_results(request):
    s = PorterStemmer()
    s.stem('oed')  # raises IndexError: string index out of range
    return render(request, 'list.html')

raises the following error:

Traceback (most recent call last): 
    File "//anaconda/envs/xkcd/lib/python2.7/site-packages/django/core/handlers/exception.py", line 39, in inner 
    response = get_response(request) 
    File "//anaconda/envs/xkcd/lib/python2.7/site-packages/django/core/handlers/base.py", line 187, in _get_response 
    response = self.process_exception_by_middleware(e, request) 
    File "//anaconda/envs/xkcd/lib/python2.7/site-packages/django/core/handlers/base.py", line 185, in _get_response 
    response = wrapped_callback(request, *callback_args, **callback_kwargs) 
    File "/Users/jkarimi91/Projects/xkcd_search/xkcd_project/search/views.py", line 15, in get_results 
    s.stem('oed') 
    File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 665, in stem 
    stem = self._step1b(stem) 
    File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 376, in _step1b 
    lambda stem: (self._measure(stem) == 1 and 
    File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 258, in _apply_rule_list 
    if suffix == '*d' and self._ends_double_consonant(word): 
    File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 214, in _ends_double_consonant 
    word[-1] == word[-2] and 
IndexError: string index out of range

What is really strange is that running the same stemmer on the same string outside of Django (in a separate Python file or in an interactive Python console) produces no error. In other words:

# test.py 
from nltk.stem.porter import PorterStemmer 
s = PorterStemmer() 
print s.stem('oed')

followed by:

python test.py 
# successfully prints 'o'

What could be causing this problem?

Source

2017-01-07 jkarimi

Are you using Python 2? It might be a character set difference, but that is only a guess. – alexis

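Which version of NLTK are you using? After importing it, you can check with 'nltk.__version__'. Perhaps Django and your external Python are using two different versions. Also, can you check which Python version Django uses and which one runs the external script? Given the 'print' statement, I assume it is '2.7' in both cases. –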
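The check suggested in this comment can be sketched as a small helper that is called once from the Django view and once from the standalone script, so the two outputs can be compared. The function name `describe_environment` is hypothetical and not part of NLTK or Django; this is only one way to collect the information.

```python
import importlib
import sys


def describe_environment(module_name):
    """Report which interpreter and which copy of a module this process uses.

    `describe_environment` is a hypothetical helper for illustration only.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        module = None  # module is not installed for this interpreter
    return {
        "python_version": sys.version.split()[0],
        "executable": sys.executable,
        "module_version": getattr(module, "__version__", None),
        "module_file": getattr(module, "__file__", None),
    }


if __name__ == "__main__":
    # Run this both inside the view and in test.py, then compare the output.
    for key, value in sorted(describe_environment("nltk").items()):
        print("{}: {}".format(key, value))
```

If the two runs report different values for `module_version` or `module_file`, the view and the script are not using the same NLTK installation.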

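This is mostly unrelated to the problem, but 's = PorterStemmer()' should be placed somewhere at global (module) scope. Putting it inside the view means creating a new 'PorterStemmer' object for every page that calls this view function. – alvas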

This is an NLTK bug specific to NLTK version 3.2.2, and one for which I am responsible. It was introduced by PR https://github.com/nltk/nltk/pull/1261, which rewrote the Porter stemmer.

I wrote a fix, which was released in NLTK 3.2.3. If you are using version 3.2.2 and need the fix, please upgrade.
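As a rough sketch of how to tell whether an installation is the affected one, the version string can be compared against 3.2.2, the only release this answer names as broken. The helpers `parse_version` and `is_affected` are hypothetical names introduced for illustration; they are not part of NLTK.

```python
import re

# The only release this answer identifies as affected.
AFFECTED_VERSION = (3, 2, 2)


def parse_version(version):
    """Turn a dotted version string such as '3.2.2' into a tuple of ints."""
    numbers = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component
        numbers.append(int(match.group()))
    return tuple(numbers)


def is_affected(version):
    """Return True if the version string denotes the affected release."""
    return parse_version(version)[:3] == AFFECTED_VERSION


if __name__ == "__main__":
    for candidate in ("3.2.1", "3.2.2", "3.2.3"):
        print("{} affected: {}".format(candidate, is_affected(candidate)))
```

In practice the argument would be the value of `nltk.__version__` from the environment that raises the error.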

Source

2017-01-07 20:45:19

As it stands, this answer is at +20, which means I have effectively received 200 Stack Overflow reputation as a reward for breaking an open-source library. I feel rather guilty about that. –

No need to feel guilty, this is one of the ways OSS gets incentivized =) – alvas

I debugged the nltk.stem.porter module using pdb. After a few iterations, in _apply_rule_list() you get:

>>> rule 
(u'at', u'ate', None) 
>>> word 
u'o'

At this point the _ends_double_consonant() method tries to evaluate word[-1] == word[-2], and it fails because word has only one character.

If I am not mistaken, in NLTK 3.2 the corresponding method was the following:

def _doublec(self, word):
    """doublec(word) is TRUE <=> word ends with a double consonant"""
    if len(word) < 2:
        return False
    if (word[-1] != word[-2]):
        return False
    return self._cons(word, len(word)-1)

As far as I can see, the len(word) < 2 check is missing in the new version.

Changing _ends_double_consonant() to something like the following should work (I have just proposed this change in the related NLTK issue):

def _ends_double_consonant(self, word):
    """Implements condition *d from the paper

    Returns True if word ends with a double consonant
    """
    if len(word) < 2:
        return False
    return (
        word[-1] == word[-2] and
        self._is_consonant(word, len(word)-1)
    )

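To see the effect of the length check in isolation, here is a minimal standalone sketch that does not import NLTK. The function names are hypothetical, and `is_consonant` is a deliberate simplification (it treats only a, e, i, o, u as vowels and ignores the stemmer's special handling of 'y'), so this illustrates the indexing problem rather than reproducing NLTK's behaviour.

```python
# Simplification: NLTK's context-dependent treatment of 'y' is ignored here.
VOWELS = frozenset("aeiou")


def is_consonant(word, i):
    """Simplified stand-in for the stemmer's consonant test."""
    return word[i] not in VOWELS


def ends_double_consonant_unguarded(word):
    # Same expression as in the traceback: word[-2] does not exist
    # when the word has fewer than two characters.
    return word[-1] == word[-2] and is_consonant(word, len(word) - 1)


def ends_double_consonant_guarded(word):
    # The proposed fix: short words cannot end in a double consonant.
    if len(word) < 2:
        return False
    return word[-1] == word[-2] and is_consonant(word, len(word) - 1)


if __name__ == "__main__":
    try:
        ends_double_consonant_unguarded("o")
    except IndexError as exc:
        print("unguarded check failed: {}".format(exc))
    print("guarded check: {}".format(ends_double_consonant_guarded("o")))
```

The one-letter input 'o' is the intermediate value shown in the pdb session above, which is why the unguarded version fails on the original word 'oed'.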

Source

2017-01-07 19:35:54

This worked for me! –
