13

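I am classifying text using a bag of words. It works well, but I am wondering how to add a feature that is not a word. How can I add another feature (the length of the text) to my current bag-of-words classification in scikit-learn?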

Here is my sample code.

import numpy as np 
from sklearn.pipeline import Pipeline 
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.svm import LinearSVC 
from sklearn.feature_extraction.text import TfidfTransformer 
from sklearn.multiclass import OneVsRestClassifier 

X_train = np.array(["new york is a hell of a town", 
        "new york was originally dutch", 
        "new york is also called the big apple", 
        "nyc is nice", 
        "the capital of great britain is london. london is a huge metropolis which has a great many number of people living in it. london is also a very old town with a rich and vibrant cultural history.", 
        "london is in the uk. they speak english there. london is a sprawling big city where it's super easy to get lost and i've got lost many times.", 
        "london is in england, which is a part of great britain. some cool things to check out in london are the museum and buckingham palace.", 
        "london is in great britain. it rains a lot in britain and london's fogs are a constant theme in books based in london, such as sherlock holmes. the weather is really bad there.",]) 
y_train = [[0],[0],[0],[0],[1],[1],[1],[1]] 

X_test = np.array(["it's a nice day in nyc", 
        'i loved the time i spent in london, the weather was great, though there was a nip in the air and i had to wear a jacket.' 
        ]) 
target_names = ['Class 1', 'Class 2'] 

classifier = Pipeline([ 
    ('vectorizer', CountVectorizer(min_df=1,max_df=2)), 
    ('tfidf', TfidfTransformer()), 
    ('clf', OneVsRestClassifier(LinearSVC()))]) 
classifier.fit(X_train, y_train) 
predicted = classifier.predict(X_test) 
for item, labels in zip(X_test, predicted): 
    print '%s => %s' % (item, ', '.join(target_names[x] for x in labels)) 

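It is clear that the texts about London tend to be much longer than the texts about New York. How can I add the length of the text as a feature? Do I need to use a different classification method and then combine the two predictions? Is there a way to do it alongside the bag of words? Some sample code would be great; I am very new to machine learning and scikit-learn.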

+0

Your code does not run, namely because you are using OneVsRestClassifier when there is only one target. – joc

+4

This does almost exactly what you are after, using sklearn's FeatureUnion: http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html – joc

+0

Take a look at the answer to this question: http://stackoverflow.com/questions/39001956/sklearn-pipeline-transformation-on-only-certain-features/39009125#39009125 – maxymoo

Answers

3

As shown in the comments, this can be done with a combination of FunctionTransformer, Pipeline and FeatureUnion.

import numpy as np 
from sklearn.pipeline import Pipeline, FeatureUnion 
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.svm import LinearSVC 
from sklearn.feature_extraction.text import TfidfTransformer 
from sklearn.multiclass import OneVsRestClassifier 
from sklearn.preprocessing import FunctionTransformer 

X_train = np.array(["new york is a hell of a town", 
        "new york was originally dutch", 
        "new york is also called the big apple", 
        "nyc is nice", 
        "the capital of great britain is london. london is a huge metropolis which has a great many number of people living in it. london is also a very old town with a rich and vibrant cultural history.", 
        "london is in the uk. they speak english there. london is a sprawling big city where it's super easy to get lost and i've got lost many times.", 
        "london is in england, which is a part of great britain. some cool things to check out in london are the museum and buckingham palace.", 
        "london is in great britain. it rains a lot in britain and london's fogs are a constant theme in books based in london, such as sherlock holmes. the weather is really bad there.",]) 
# One label per document as a 1-D array (there is only a single target)
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])

X_test = np.array(["it's a nice day in nyc", 
        'i loved the time i spent in london, the weather was great, though there was a nip in the air and i had to wear a jacket.' 
        ]) 
target_names = ['Class 1', 'Class 2'] 


def get_text_length(x):
    # Return the character count of each text as a single-column 2-D array
    return np.array([len(t) for t in x]).reshape(-1, 1)

classifier = Pipeline([ 
    ('features', FeatureUnion([ 
     ('text', Pipeline([ 
      ('vectorizer', CountVectorizer(min_df=1,max_df=2)), 
      ('tfidf', TfidfTransformer()), 
     ])), 
     ('length', Pipeline([ 
      ('count', FunctionTransformer(get_text_length, validate=False)), 
     ])) 
    ])), 
    ('clf', OneVsRestClassifier(LinearSVC()))]) 

classifier.fit(X_train, y_train) 
predicted = classifier.predict(X_test) 
print(predicted)

This adds the length of the text to the features used by the classifier.
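To check that the length really ends up as an extra column, something like the following sketch may help. It is not part of the answer above and makes a few assumptions of its own: the training set is cut down to four short documents, TfidfVectorizer stands in for the CountVectorizer + TfidfTransformer pair, and a StandardScaler step is added after the length so that raw character counts (tens or hundreds) do not dwarf the tf-idf values (between 0 and 1). Whether scaling actually helps will depend on your data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.svm import LinearSVC

# Reduced toy data (an assumption for this sketch, not the original training set)
docs = np.array([
    "nyc is nice",
    "new york was originally dutch",
    "london is in the uk. they speak english there.",
    "london is in great britain. it rains a lot in britain.",
])
labels = np.array([0, 0, 1, 1])


def get_text_length(texts):
    # One column holding the character count of each document
    return np.array([len(t) for t in texts], dtype=float).reshape(-1, 1)


features = FeatureUnion([
    ("text", TfidfVectorizer()),
    ("length", Pipeline([
        ("count", FunctionTransformer(get_text_length, validate=False)),
        # Assumption: standardise the length so it is on a scale comparable to tf-idf
        ("scale", StandardScaler()),
    ])),
])

X = features.fit_transform(docs)

# The combined matrix has one column per vocabulary word plus one for the length
n_words = len(dict(features.transformer_list)["text"].vocabulary_)
print(X.shape, n_words)

# The last column is the (standardised) length feature
length_column = X.toarray()[:, -1]
print(length_column)

# The classifier simply sees the widened feature matrix
clf = LinearSVC().fit(X, labels)
print(clf.predict(features.transform(np.array(["it's a nice day in nyc"]))))
```

If the shape printed for X has exactly one more column than the vocabulary size, the length feature has been appended as intended.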
