CountVectorizer ignores 'I'

Why is sklearn's CountVectorizer ignoring the pronoun "I"?

from sklearn.feature_extraction.text import CountVectorizer
ngram_vectorizer = CountVectorizer(analyzer="word", ngram_range=(2, 2), min_df=1)
ngram_vectorizer.fit_transform(['HE GAVE IT TO I']) 
<1x3 sparse matrix of type '<class 'numpy.int64'>' 
ngram_vectorizer.get_feature_names() 
['gave it', 'he gave', 'it to']

Source

2015-10-21 Alex

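The default tokenizer only considers words of two or more characters, so the single-letter word "I" is dropped before any bigrams are built.

You can change this behavior by passing an appropriate token_pattern to CountVectorizer.

The default pattern is (see the signature in the docs):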

'token_pattern': u'(?u)\\b\\w\\w+\\b'
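
To see why this pattern drops "I", the regular expression can be tried on its own with Python's re module. This is only a rough sketch, not scikit-learn's actual code path: the variable names are made up for illustration, and the text is lowercased by hand to mimic CountVectorizer's default lowercase=True.

```python
import re

# Default CountVectorizer token pattern: at least two word characters
# between word boundaries, so one-letter words never match.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Relaxed pattern (illustrative): one or more word characters.
RELAXED_PATTERN = r"(?u)\b\w+\b"

# Lowercase by hand, assuming the default lowercase=True behavior.
text = "HE GAVE IT TO I".lower()

default_tokens = re.findall(DEFAULT_PATTERN, text)
relaxed_tokens = re.findall(RELAXED_PATTERN, text)

print(default_tokens)  # ['he', 'gave', 'it', 'to']
print(relaxed_tokens)  # ['he', 'gave', 'it', 'to', 'i']
```

With the default pattern "i" never becomes a token, so no bigram can contain it.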

For example, by overriding the default you can get a CountVectorizer that does not drop single-character words:

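from sklearn.feature_extraction.text import CountVectorizer
# \w+ (instead of the default \w\w+) keeps one-letter tokens such as "i"
ngram_vectorizer = CountVectorizer(analyzer="word", ngram_range=(2, 2),
                                   token_pattern=u"(?u)\\b\\w+\\b", min_df=1)
ngram_vectorizer.fit_transform(['HE GAVE IT TO I'])
# scikit-learn >= 1.0 provides get_feature_names_out(); get_feature_names() was removed in 1.2
print(ngram_vectorizer.get_feature_names())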

This gives:

['gave it', 'he gave', 'it to', 'to i']
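
The bigram lists in the question and in this answer can be reproduced without scikit-learn, which may help in seeing where 'to i' comes from. The helper word_bigrams below is hypothetical and only approximates what CountVectorizer does for ngram_range=(2,2), assuming lowercasing and a sorted vocabulary.

```python
import re

def word_bigrams(text, token_pattern):
    """Return the sorted, de-duplicated word bigrams of text (illustrative helper)."""
    tokens = re.findall(token_pattern, text.lower())
    # Pair each token with its successor, then de-duplicate and sort.
    return sorted({" ".join(pair) for pair in zip(tokens, tokens[1:])})

default_bigrams = word_bigrams("HE GAVE IT TO I", r"(?u)\b\w\w+\b")
relaxed_bigrams = word_bigrams("HE GAVE IT TO I", r"(?u)\b\w+\b")

print(default_bigrams)  # ['gave it', 'he gave', 'it to']
print(relaxed_bigrams)  # ['gave it', 'he gave', 'it to', 'to i']
```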

Source

2015-10-21 16:34:55 ldirer
