How to add a training algorithm such as GradientBoostedTrees to a pipeline

I have working code, but the model training step is not part of the pipeline. How can I add a training algorithm such as GradientBoostedTrees to the pipeline?

Note: char_cols and num_cols are lists containing the names of the string columns and the numeric columns, respectively.

Here is the code:

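from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import GradientBoostedTrees
from pyspark.sql.functions import col

# One StringIndexer per string column; each writes an "int_<column>" column
string_indexers = [
    StringIndexer(inputCol=x, outputCol="int_{0}".format(x))
    for x in char_cols]

# Indexed string columns come first, followed by the numeric columns
assembler = VectorAssembler(
    inputCols=["int_" + x for x in char_cols] + num_cols,
    outputCol="features"
)
pipeline = Pipeline(stages=string_indexers + [assembler])
features_model = pipeline.fit(df)
indexed = features_model.transform(df)

# Convert each row to a LabeledPoint (.rdd is needed before .map on newer Spark versions)
ml_df = indexed.select(col("OutputVar").cast("int").alias("label"), col("features")).rdd.map(lambda row: LabeledPoint(row.label, row.features))

gbm = GradientBoostedTrees.trainRegressor(sc.parallelize(ml_df.collect()), categoricalFeaturesInfo={0:24,1:3,2:4,3:5,4:107}, numIterations=3, maxBins=120)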

This works fine, but I also want to include the trained model (gbm) in the pipeline, and GradientBoostedTrees does not seem to offer a direct way to do that. I need something like this:

pipeline = Pipeline(stages=string_indexers + [assembler] + [gbm])

and then run it directly:

model = pipeline.fit(trainingData)
predictions = model.transform(testData)

pyspark.ml.regression does have GBTRegressor, which takes inputs such as "labelCol" and "featuresCol" that would help here, but I cannot find a way to do the same with GradientBoostedTrees. Can the "LabeledPoint" generation step be made part of the pipeline? Or is there some other approach?

Regards

Source

2016-05-13 Abhishek

No. The ML algorithms support only binary labels, not regression. – zero323

@zero323 I did not quite follow you. Could you explain a bit? – Abhishek

Just add it directly to the "stages" param:

from pyspark.ml.regression import GBTRegressor 

gr = GBTRegressor() 
pipeline = Pipeline(stages=string_indexers + [assembler, gr])

Source

2018-03-09 15:43:58
