Document clustering using Mean Shift


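I collected a large number of documents, computed tf * idf for every token in each document, and built a vector for each document (n-dimensional, where n is the number of unique words in the corpus). I can't figure out how to create clusters from these vectors using sklearn.cluster.MeanShift.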

Source

2017-09-12 Mourya Vamsi

After computing tf-idf, do you have a numeric matrix (i.e., a table of rows and columns)? Is it sparse or dense? What type is it in general? Did you use TfidfVectorizer() from sklearn? – Jarad

Yes, I used TfidfVectorizer() and ended up with a sparse matrix. I don't understand how to pass it as input to sklearn.cluster.MeanShift –

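TfidfVectorizer converts documents into a numeric "sparse matrix". MeanShift requires the data passed to it to be "dense". Below, I show how to do the conversion inside a pipeline (credit), but if memory allows, you can simply convert the sparse matrix to dense with toarray() or todense(). Of the two, toarray() is generally the safer choice: it returns a plain NumPy array, whereas todense() returns a numpy.matrix, which newer scikit-learn releases no longer accept.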

from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.cluster import MeanShift 
from sklearn.pipeline import Pipeline 
from sklearn.preprocessing import FunctionTransformer 

documents = ['this is document one', 
      'this is document two', 
      'document one is fun', 
      'document two is mean', 
      'document is really short', 
      'how fun is document one?', 
      'mean shift... what is that'] 

pipeline = Pipeline(
    steps=[ 
    ('tfidf', TfidfVectorizer()), 
    ('trans', FunctionTransformer(lambda x: x.toarray(), accept_sparse=True)),  # densify: MeanShift needs a dense array
    ('clust', MeanShift()) 
    ]) 

pipeline.fit(documents) 
pipeline.named_steps['clust'].labels_ 

result = [(label,doc) for doc,label in zip(documents, pipeline.named_steps['clust'].labels_)] 

for label,doc in sorted(result): 
    print(label, doc)

Output:

0 document two is mean 
0 this is document one 
0 this is document two 
1 document one is fun 
1 how fun is document one? 
2 mean shift... what is that 
3 document is really short
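
For a small corpus, the direct conversion mentioned above (without a pipeline) might look like the following minimal sketch. It assumes the dense matrix fits in memory; the variable names `X_sparse`, `X_dense`, and `labels` are illustrative and not from the original answer, and the cluster assignments may differ between scikit-learn versions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MeanShift

documents = ['this is document one',
             'this is document two',
             'document one is fun',
             'document two is mean',
             'document is really short',
             'how fun is document one?',
             'mean shift... what is that']

# Sparse TF-IDF matrix: one row per document, one column per unique term.
X_sparse = TfidfVectorizer().fit_transform(documents)

# toarray() returns a plain numpy.ndarray, which MeanShift accepts.
# Assumption: the corpus is small enough for the dense matrix to fit in memory.
X_dense = X_sparse.toarray()

# fit_predict() fits the model and returns one cluster label per document.
labels = MeanShift().fit_predict(X_dense)

for label, doc in sorted(zip(labels, documents)):
    print(label, doc)
```

For a large vocabulary the dense matrix can become very big, so checking `X_sparse.shape` before converting is a reasonable precaution.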

You can tune the hyperparameters, but I think this gives you the general idea.
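
As one possible way to tune those hyperparameters, the sketch below sets MeanShift's `bandwidth` explicitly using scikit-learn's `estimate_bandwidth` helper. The value `quantile=0.5` is only an illustrative assumption, not a recommendation from the original answer; a suitable value depends on the corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MeanShift, estimate_bandwidth

documents = ['this is document one',
             'this is document two',
             'document one is fun',
             'document two is mean',
             'document is really short',
             'how fun is document one?',
             'mean shift... what is that']

# Dense TF-IDF matrix, as MeanShift requires.
X = TfidfVectorizer().fit_transform(documents).toarray()

# Estimate a kernel bandwidth from nearest-neighbor distances.
# quantile=0.5 is an illustrative choice (assumption); the default is 0.3.
bandwidth = estimate_bandwidth(X, quantile=0.5)

# A larger bandwidth tends to produce fewer, broader clusters.
clusterer = MeanShift(bandwidth=bandwidth)
labels = clusterer.fit_predict(X)

print("bandwidth:", bandwidth)
print("number of clusters:", len(set(labels)))
```

Trying a few quantile values and comparing the resulting number of clusters is a simple way to see how sensitive the grouping is to the bandwidth.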

Source

2017-09-13 04:22:16 Jarad
