feature_names mismatch in xgboost sklearn during multi-class text classification

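I am trying to perform multi-class text classification using xgboost in Python (the sklearn wrapper), but it sometimes fails with an error saying there is a mismatch in the feature names. The odd thing is that it does work some of the time (perhaps one run in four), but that unpredictability makes it hard for me to rely on this solution for now.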

I have provided some sample data along with code similar to what I am using. The code I currently have is as follows:

Updated code reflecting maxymoo's suggestion:

import xgboost as xgb 
import numpy as np 
from sklearn.cross_validation import KFold, train_test_split 
from sklearn.metrics import accuracy_score 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline  # needed for make_pipeline() below

rng = np.random.RandomState(31337)  

y = np.array([0, 1, 2, 1, 0, 3, 1, 2, 3, 0]) 
X = np.array(['milk honey bear bear honey tigger', 
      'tom jerry cartoon mouse cat cat WB', 
      'peppa pig mommy daddy george peppa pig pig', 
      'cartoon jerry tom silly', 
      'bear honey hundred year woods', 
      'ben holly elves fairies gaston fairy fairies castle king', 
      'tom and jerry mouse WB', 
      'peppa pig daddy pig rebecca rabit', 
      'elves ben holly little kingdom king big people', 
      'pot pot pot pot jar winnie pooh disney tigger bear']) 

xgb_model = make_pipeline(CountVectorizer(), xgb.XGBClassifier()) 

kf = KFold(y.shape[0], n_folds=2, shuffle=True, random_state=rng) 
for train_index, test_index in kf: 
    xgb_model.fit(X[train_index],y[train_index]) 
    predictions = xgb_model.predict(X[test_index]) 
    actuals = y[test_index] 
    accuracy = accuracy_score(actuals, predictions) 
    print(accuracy)

The error I tend to get is as follows:

Traceback (most recent call last): 
    File "main.py", line 95, in <module> 
    predictions = xgb_model.predict(X[test_index]) 
    File "//anaconda/lib/python2.7/site-packages/xgboost-0.6-py2.7.egg/xgboost/sklearn.py", line 465, in predict 
    ntree_limit=ntree_limit) 
    File "//anaconda/lib/python2.7/site-packages/xgboost-0.6-py2.7.egg/xgboost/core.py", line 939, in predict 
    self._validate_features(data) 
    File "//anaconda/lib/python2.7/site-packages/xgboost-0.6-py2.7.egg/xgboost/core.py", line 1179, in _validate_features 
    data.feature_names)) 
ValueError: feature_names mismatch: ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19', 'f20', 'f21', 'f22', 'f23', 'f24', 'f25', 'f26'] ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19', 'f20', 'f21', 'f22', 'f23', 'f24'] 
expected f26, f25 in input data

Any pointers would be really appreciated!

Source

2016-08-18 koend

You don't have f338 in your training data. – user1157751

My answer here deals with exactly the same problem: http://stackoverflow.com/questions/38740885/xgboost-difference-in-train-and-test-features-after-converting-to-dmatrix/38887112#38887112 – abhiieor

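You need to make sure that you only score the model with the features it was trained on. The usual way to do this is to use a Pipeline to package the vectorizer and the model together. That way they are both trained at the same time, and if the test data contains new features, the vectorizer will ignore them. (Also note that you don't need to recreate the model at each stage of cross-validation; you initialize it just once and refit it on each fold.)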

from sklearn.pipeline import make_pipeline  

xgb_model = make_pipeline(CountVectorizer(), xgb.XGBClassifier())
for train_index, test_index in kf: 
    xgb_model.fit(X[train_index],y[train_index]) 
    predictions = xgb_model.predict(X[test_index]) 
    actuals = y[test_index] 
    accuracy = accuracy_score(actuals, predictions) 
    print(accuracy)

Source

2016-08-19 04:00:43 maxymoo
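
The point made in the answer above can be checked without xgboost at all. The following is only a minimal sketch, assuming a recent scikit-learn where KFold lives in sklearn.model_selection (the question uses the older sklearn.cross_validation module); the helper name fold_feature_counts is made up for illustration. For each fold it compares the number of columns produced by a vectorizer fitted on the training documents with the number produced when a fresh vectorizer is fitted on the test documents alone.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import KFold

# Same sample documents as in the question.
docs = np.array(['milk honey bear bear honey tigger',
                 'tom jerry cartoon mouse cat cat WB',
                 'peppa pig mommy daddy george peppa pig pig',
                 'cartoon jerry tom silly',
                 'bear honey hundred year woods',
                 'ben holly elves fairies gaston fairy fairies castle king',
                 'tom and jerry mouse WB',
                 'peppa pig daddy pig rebecca rabit',
                 'elves ben holly little kingdom king big people',
                 'pot pot pot pot jar winnie pooh disney tigger bear'])


def fold_feature_counts(docs, n_splits=2, seed=31337):
    """Hypothetical helper: per fold, return the column counts for
    (train fit, test transformed with the train vocabulary, test refit)."""
    counts = []
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_index, test_index in kf.split(docs):
        vectorizer = CountVectorizer()
        # The vocabulary is learned from the training documents only.
        train_matrix = vectorizer.fit_transform(docs[train_index])
        # transform() reuses that vocabulary; unseen words are dropped.
        test_matrix = vectorizer.transform(docs[test_index])
        # Fitting a new vectorizer on the test documents learns a different
        # vocabulary, so the column count generally differs.
        refit_matrix = CountVectorizer().fit_transform(docs[test_index])
        counts.append((train_matrix.shape[1],
                       test_matrix.shape[1],
                       refit_matrix.shape[1]))
    return counts


for train_cols, test_cols, refit_cols in fold_feature_counts(docs):
    print(train_cols, test_cols, refit_cols)
```

The first two numbers in each row always agree, which is what a Pipeline guarantees by calling fit_transform on the training fold and transform on the test fold. The third number is what you would get if the vectorizer were refitted on the test data, and it is the kind of mismatch the model cannot score.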

I was hoping that would work, but for some reason it didn't. I have added a working example that shows the problem to the existing question. – koend

Could it be that CountVectorizer() returns a sparse matrix, whereas XGBClassifier() needs a dense one? Changing it to dense seems to eat up my memory, though... – koend

Your code doesn't throw an error for me... maybe update to the latest version of sklearn? I'm using Python 3.4, scikit 0.17.1 – maxymoo
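
koend's question about sparse matrices can be looked at from the scikit-learn side. The sketch below only inspects what CountVectorizer produces. It shows that a transformed test document keeps the full declared column count even when its trailing columns hold no non-zero entries. Whether the xgboost release in the traceback inferred the feature count from the stored entries rather than from the declared shape is an assumption here. Nothing in this thread confirms it.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two training documents and one test document taken from the question.
train_docs = ['milk honey bear bear honey tigger',
              'tom jerry cartoon mouse cat cat WB']
test_docs = ['bear honey hundred year woods']

vectorizer = CountVectorizer()
train_matrix = vectorizer.fit_transform(train_docs)
test_matrix = vectorizer.transform(test_docs)  # SciPy sparse (CSR) matrix

# Number of columns the sparse matrix declares in its shape.
declared_columns = test_matrix.shape[1]

# One past the highest column index that actually stores a non-zero count.
# A consumer that sized itself from the stored entries (assumption, see the
# text above) would see this smaller number instead of declared_columns.
highest_used_column = int(test_matrix.indices.max()) + 1 if test_matrix.nnz else 0

print(declared_columns, highest_used_column)
```

Because the declared shape already carries the full vocabulary size, checking the two numbers does not require converting the matrix to a dense array.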
