How to extract the vocabulary from a Pipeline

I can extract the vocabulary from a CountVectorizerModel in the following way:

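from pyspark.ml.feature import CountVectorizer, StopWordsRemover

# Drop stop words, then fit a CountVectorizer on the filtered tokens
fl = StopWordsRemover(inputCol="words", outputCol="filtered")
df = fl.transform(df)
cv = CountVectorizer(inputCol="filtered", outputCol="rawFeatures")
model = cv.fit(df)

print(model.vocabulary)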

The code above prints the vocabulary as a list, where each term's index serves as its id.

Now I have turned the code above into a pipeline, as follows:

from pyspark.ml import Pipeline

rm_stop_words = StopWordsRemover(inputCol="words", outputCol="filtered")
count_freq = CountVectorizer(inputCol=rm_stop_words.getOutputCol(), outputCol="rawFeatures")

pipeline = Pipeline(stages=[rm_stop_words, count_freq])
model = pipeline.fit(dfm)
df = model.transform(dfm)

print(model.vocabulary) # This won't work as it's not CountVectorizerModel

Accessing the vocabulary on the fitted pipeline throws the following error:

print(len(model.vocabulary)) 
AttributeError: 'PipelineModel' object has no attribute 'vocabulary'

So how do I extract a model attribute from a pipeline?


2017-10-12 Abdullah

The same way as any other stage attribute. Extract the stages:

stages = model.stages

Find the one(s) you are interested in:

from pyspark.ml.feature import CountVectorizerModel 

vectorizers = [s for s in stages if isinstance(s, CountVectorizerModel)]

And get the desired field:

[v.vocabulary for v in vectorizers]
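
The three steps above can be put together as a rough sketch. The helper name `stages_of_type`, the `demo` function, and the toy data are all made up for illustration and are not part of PySpark; the sketch assumes only that a fitted `PipelineModel` exposes its fitted stages through `stages`, as the answer describes.

```python
def stages_of_type(pipeline_model, stage_type):
    """Return every fitted stage of pipeline_model that is an instance of stage_type.

    Hypothetical helper: it only relies on the `stages` attribute of the model.
    """
    return [stage for stage in pipeline_model.stages if isinstance(stage, stage_type)]


def demo():
    """End-to-end run on invented toy data; needs a working PySpark installation."""
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import CountVectorizer, CountVectorizerModel, StopWordsRemover
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("vocab-demo").getOrCreate()

    # Toy input with an already tokenized "words" column (assumption for the sketch)
    dfm = spark.createDataFrame(
        [(["the", "quick", "brown", "fox"],), (["the", "lazy", "dog"],)],
        ["words"],
    )

    rm_stop_words = StopWordsRemover(inputCol="words", outputCol="filtered")
    count_freq = CountVectorizer(inputCol=rm_stop_words.getOutputCol(), outputCol="rawFeatures")
    model = Pipeline(stages=[rm_stop_words, count_freq]).fit(dfm)

    # The vocabulary lives on the fitted CountVectorizerModel stage, not on the PipelineModel
    for vectorizer in stages_of_type(model, CountVectorizerModel):
        print(vectorizer.vocabulary)

    spark.stop()
```

Calling `demo()` should print one vocabulary list per `CountVectorizerModel` stage. If you already know the position of the stage, indexing it directly (for example `model.stages[-1].vocabulary` for the pipeline in the question) is a shorter alternative, at the cost of breaking when the stage order changes.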


2017-10-12 17:38:57 user6910411
