Python feature selection in a pipeline: how do I determine the feature names?

I used a pipeline and grid_search to select the best parameters, and then used those parameters to fit the best pipeline ('best_pipe'). However, because the feature_selection step (SelectKBest) is inside the pipeline, no fit has been applied to SelectKBest on its own.

I need to know the feature names of the 'k' selected features. Any ideas on how to retrieve them? Thanks in advance.

# Written for Python 3 and scikit-learn 0.20 or later, where cross_validation and
# grid_search live in sklearn.model_selection.
# X is assumed to be a pandas DataFrame of features and y a pandas Series of labels.
from sklearn import feature_selection, linear_model, pipeline, preprocessing
from sklearn.model_selection import GridSearchCV, StratifiedKFold

folds = 5
split = StratifiedKFold(n_splits=folds, shuffle=False)

scores = []
for k, (train, test) in enumerate(split.split(X, y)):

    # split() yields positional indices, so index with .iloc (.ix was removed from pandas)
    X_train, X_test, y_train, y_test = X.iloc[train], X.iloc[test], y.iloc[train], y.iloc[test]

    top_feat = feature_selection.SelectKBest()

    # liblinear is chosen here because it supports both penalties searched below
    pipe = pipeline.Pipeline([('scaler', preprocessing.StandardScaler()),
                              ('feat', top_feat),
                              ('clf', linear_model.LogisticRegression(solver='liblinear'))])

    K = [40, 60, 80, 100]
    C = [1.0, 0.1, 0.01, 0.001, 0.0001, 0.00001]
    penalty = ['l1', 'l2']

    param_grid = [{'feat__k': K,
                   'clf__C': C,
                   'clf__penalty': penalty}]

    scoring = 'precision'

    gs = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring=scoring)
    gs.fit(X_train, y_train)

    best_score = gs.best_score_
    scores.append(best_score)

    print("Fold: {} {} {:.4f}".format(k + 1, scoring, best_score))
    print(gs.best_params_)

# Rebuild the pipeline by hand with the best parameters found above
best_pipe = pipeline.Pipeline([('scale', preprocessing.StandardScaler()),
                               ('feat', feature_selection.SelectKBest(k=80)),
                               ('clf', linear_model.LogisticRegression(C=.0001, penalty='l2'))])

best_pipe.fit(X_train, y_train)
best_pipe.predict(X_test)

Source

2015-10-27 figgy

You can access the feature selector by name in best_pipe:

features = best_pipe.named_steps['feat']

You can then call transform() on an index array to get the names of the selected columns:

import numpy as np
X.columns[features.transform(np.arange(len(X.columns)).reshape(1, -1))[0]]  # transform() expects a 2-D array

The output here will be the 80 column names selected in the pipeline.
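To make the index-array trick concrete, here is a minimal sketch on synthetic data. The dataset, the column names (col_0 ... col_9) and k=3 are assumptions made purely for illustration; only named_steps and transform() come from the answer above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the question's data; the column names are hypothetical.
data, labels = make_classification(n_samples=200, n_features=10,
                                   n_informative=4, random_state=0)
X = pd.DataFrame(data, columns=["col_{}".format(i) for i in range(10)])

best_pipe = Pipeline([("scale", StandardScaler()),
                      ("feat", SelectKBest(k=3)),
                      ("clf", LogisticRegression())])
best_pipe.fit(X, labels)

# Fetch the fitted selector by its step name.
features = best_pipe.named_steps["feat"]

# Push one row of column positions through the selector: only the positions
# of the kept columns come out. transform() needs a 2-D array, hence reshape.
positions = np.arange(len(X.columns)).reshape(1, -1)
kept = features.transform(positions)[0]

selected_names = list(X.columns[kept])
print(selected_names)
```

Passing integer positions rather than the column names themselves keeps string labels out of transform(); the positions are mapped back to names only at the very end.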

Source

2015-10-27 20:54:52 jakevdp

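A real treat to receive a solution from you; your PyCon tutorial videos actually helped me learn Python. However, I am getting the error "could not convert string to float: score_575-600" (score_575-600 is the name of one of the columns). How can I resolve that? – figgy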

Ah, I forgot that the feature selector does not work with strings. Try the updated version above. Glad to hear the videos were helpful! – jakevdp

I am still not sure how to avoid the error above, but this two-step solution at least got me the column names of the k best features: features = best_pipe.named_steps['feat'].get_support() followed by x_cols = X.columns.values[features == True] – figgy

This may be a useful alternative. I ran into a need similar to the one the OP asked about. If you want to get the indices of the k best features directly from GridSearchCV:

finalFeatureIndices = gs.best_estimator_.named_steps["feat"].get_support(indices=True)

Then you can get your finalFeatureList via index manipulation:

finalFeatureList = [initialFeatureList[i] for i in finalFeatureIndices]
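The two lines above can be tried end to end with a sketch like the following. The synthetic data, the column names and the small parameter grid are assumptions for illustration only. With the default refit=True, GridSearchCV refits the best pipeline on the full training data, so the selector inside best_estimator_ is already fitted and no hand-built best_pipe is needed.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; initialFeatureList holds hypothetical column names.
data, labels = make_classification(n_samples=200, n_features=10,
                                   n_informative=4, random_state=0)
initialFeatureList = ["col_{}".format(i) for i in range(10)]
X = pd.DataFrame(data, columns=initialFeatureList)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("feat", SelectKBest()),
                 ("clf", LogisticRegression())])

# A deliberately small grid so the sketch runs quickly.
param_grid = {"feat__k": [2, 4, 6], "clf__C": [1.0, 0.1]}
gs = GridSearchCV(estimator=pipe, param_grid=param_grid, cv=3)
gs.fit(X, labels)

# The refitted best pipeline exposes its fitted selector by step name.
finalFeatureIndices = gs.best_estimator_.named_steps["feat"].get_support(indices=True)
finalFeatureList = [initialFeatureList[i] for i in finalFeatureIndices]

print(gs.best_params_["feat__k"], finalFeatureList)
```

The number of names returned should match whichever feat__k the search selected.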

Source

2016-04-18 21:23:06 ximiki

Jake's answer works perfectly. But depending on which feature selector you are using, there is another option that I think is cleaner. This worked for me:

X.columns[features.get_support()]

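It gave me the same answer as Jake's. You can see more in the docs, but get_support returns an array of true/false values indicating whether each column was used. Also note that X must have the same shape (the same columns, in the same order) as the training data used to fit the feature selector.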
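As a rough illustration of what get_support returns, the sketch below fits a standalone SelectKBest on synthetic data; the data, the column names a to f and k=2 are assumptions. It shows the boolean mask next to the integer form from get_support(indices=True).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest

# Synthetic stand-in data with hypothetical column names.
data, labels = make_classification(n_samples=100, n_features=6,
                                   n_informative=3, random_state=0)
X = pd.DataFrame(data, columns=list("abcdef"))

features = SelectKBest(k=2).fit(X, labels)

mask = features.get_support()              # boolean array, one entry per input column
idx = features.get_support(indices=True)   # integer positions of the kept columns

names_from_mask = list(X.columns[mask])
names_from_idx = list(X.columns[idx])

# The mask only lines up with a frame that has the same columns as the fit data.
mask_matches_columns = len(mask) == len(X.columns)

print(mask, idx, names_from_mask)
```

Both forms give the same names; the mask is convenient for direct indexing into X.columns, while the integer positions are easier to use with a plain Python list of names.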

Source

2017-04-03 16:21:24 bwest87

I definitely prefer this answer; 'features.transform(np.arange(len(X.columns)))' is basically longhand for 'features.get_support()'. – andrew
