Providing a vocabulary containing spaces to scikit-learn's CountVectorizer

Following on from this post, I am wondering how we can provide a vocabulary containing spaces to the CountVectorizer model, e.g. distributed systems or machine learning? Here is an example:

import numpy as np 
from itertools import chain 

tags = [ 
    "python, tools", 
    "linux, tools, ubuntu", 
    "distributed systems, linux, networking, tools", 
] 

vocabulary = list(map(lambda x: x.split(', '), tags)) 
vocabulary = list(np.unique(list(chain(*vocabulary))))

We can provide this vocabulary list to the model here:

from sklearn.feature_extraction.text import CountVectorizer 
vec = CountVectorizer(vocabulary=vocabulary) 
print(vec.fit_transform(tags).toarray())

With this, I lose the count for the term distributed systems (the first column). The result looks like this:

[[0 0 0 1 1 0]
 [0 1 0 0 1 1]
 [0 1 1 0 1 0]]

Do I need to change token_pattern, or something else?

Source

2016-06-17 titipata

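I think that, essentially, you have already predefined the vocabulary to analyze and you want to tokenize your tags by splitting on ', '. You can trick CountVectorizer into doing that by passing a custom tokenizer: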

from sklearn.feature_extraction.text import CountVectorizer 
vec = CountVectorizer(vocabulary=vocabulary, tokenizer=lambda x: x.split(', ')) 
print(vec.fit_transform(tags).toarray())

which gives:

[[0 0 0 1 1 0]
 [0 1 0 0 1 1]
 [1 1 1 0 1 0]]
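
To see why the tokenizer makes the difference, it can help to look at the tokens each configuration produces for a single tag string. The sketch below is only an illustration and not part of the original answer: split_tags is a hypothetical named stand-in for the lambda above, and build_analyzer() is used to inspect the tokens without fitting anything.

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = "distributed systems, linux, networking, tools"

# Default analyzer: the default token_pattern keeps runs of two or more
# word characters, so the space breaks "distributed systems" into two tokens.
default_tokens = CountVectorizer().build_analyzer()(doc)
print(default_tokens)


def split_tags(text):
    # Hypothetical helper, equivalent to the lambda in the answer:
    # treat every comma-separated tag as one token.
    return text.split(", ")


# Analyzer with the custom tokenizer: the multi-word tag stays intact,
# so it can match the vocabulary entry "distributed systems".
custom_tokens = CountVectorizer(tokenizer=split_tags).build_analyzer()(doc)
print(custom_tokens)
```

With the default settings the vocabulary entry distributed systems can never match, because no single token contains a space. With the tokenizer, the whole tag is one token.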

Source

2016-06-17 14:26:48

Thank you so much @Zichen, this is exactly what I was looking for. Using 'tokenizer' makes the problem very convenient. – titipata
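
On the original token_pattern question: changing token_pattern also appears to work, although nobody in the thread suggested it. The sketch below is an assumption-laden alternative, not the accepted approach. The regular expression tag_pattern is my own and assumes tags are separated by commas and are at least two characters long; the vocabulary is written out by hand to keep the example self-contained.

```python
from sklearn.feature_extraction.text import CountVectorizer

tags = [
    "python, tools",
    "linux, tools, ubuntu",
    "distributed systems, linux, networking, tools",
]

# Same sorted vocabulary as in the question, written out explicitly.
vocabulary = ["distributed systems", "linux", "networking",
              "python", "tools", "ubuntu"]

# Assumed pattern (not from the thread): a token starts and ends with a
# character that is neither a comma nor whitespace, and may contain
# anything except a comma in between, so inner spaces are kept.
tag_pattern = r"[^,\s][^,]*[^,\s]"

vec = CountVectorizer(vocabulary=vocabulary, token_pattern=tag_pattern)
counts = vec.fit_transform(tags).toarray()
print(counts)
```

This should reproduce the matrix from the tokenizer approach, including the count in the first column. The tokenizer version is arguably easier to read, so the pattern is mainly useful if you would rather not pass a callable.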
