Add a PySpark RDD as a new column to a pyspark.sql.dataframe

I have a pyspark.sql.dataframe in which each row is a news article. I also have an RDD that holds the words contained in each article. I want to add the RDD of words to my dataframe of articles as a new column named 'words'. I tried the following:

df.withColumn('words', words_rdd)

but I get this error:

AssertionError: col should be Column

The dataframe looks something like this:

Articles 
the cat and dog ran 
we went to the park 
today it will rain

In practice, though, I have about 3k news articles.

I applied a function that cleans the text, for example by removing stop words, and now have an RDD that looks like this:

[[cat, dog, ran],[we, went, park],[today, will, rain]]

I am trying to get my dataframe to look like this:

Articles     Words 
the cat and dog ran  [cat, dog, ran] 
we went to the park  [we, went, park] 
today it will rain  [today, will, rain]

Source

2017-02-08 jakko

Please share some example data. You will probably need a join. – mtoto

How do the two line up? Why do 'cat', 'dog', 'ran' match the article 'the cat and dog ran'? –

Why do you want to join the RDD back to the dataframe? I would rather create the new column directly from 'Articles'. There are several ways to do that; here are my 5 cents:

from pyspark.sql import Row 
from pyspark.sql.context import SQLContext 
sqlCtx = SQLContext(sc) # sc is the sparkcontext 

x = [Row(Articles='the cat and dog ran'),Row(Articles='we went to the park'),Row(Articles='today it will rain')] 
df = sqlCtx.createDataFrame(x) 

# DataFrame.map was removed in Spark 2.0, so map over the underlying RDD instead
df2 = df.rdd.map(lambda x: (x.Articles, x.Articles.split(' '))).toDF(['Articles', 'words'])
df2.show()

You will get output like the following:

Articles     words 
the cat and dog ran  [the, cat, and, dog, ran] 
we went to the park  [we, went, to, the, park] 
today it will rain  [today, it, will, rain]

Let me know if you were looking to achieve something else.

Source

2017-02-09 08:32:36

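This is what I want, but I have 3k articles. I want to apply a function to each of these articles (rather than just splitting it) and put the result into a dataframe like the one above. This is my first time using pyspark, so I am not sure what the best approach is. – jakko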

Can you provide a sample file of the actual data? You can apply any function with the help of a udf. –

I got it working with a udf: newdf = df.withColumn("words", udf_clean_text("articles")). – jakko
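The comment above does not show how udf_clean_text was defined. A minimal sketch of one way it might look, assuming a hypothetical stop-word list (STOP_WORDS) and hypothetical helper names (clean_text, add_words_column) that are not from the original post; the column name Articles follows the example dataframe in the question:

```python
# Hypothetical stop-word list, chosen only to reproduce the question's example.
STOP_WORDS = {"the", "and", "to", "it"}


def clean_text(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


def add_words_column(df, source_col="Articles", target_col="words"):
    """Return df with an extra array<string> column built by clean_text."""
    # Imported here so that clean_text can be tried without Spark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, StringType

    udf_clean_text = udf(clean_text, ArrayType(StringType()))
    return df.withColumn(target_col, udf_clean_text(df[source_col]))


# The plain function can be checked on its own:
print(clean_text("the cat and dog ran"))  # ['cat', 'dog', 'ran']
```

Because the udf is evaluated row by row inside the dataframe, there is no separate RDD to match back to the articles, and withColumn receives a Column rather than an RDD, which is what the AssertionError in the question complains about.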

rdd1 = spark.sparkContext.parallelize([1, 2, 3, 5]) 
# make some transformation on rdd1: 
rdd2 = rdd1.map(lambda n: True if n % 2 else False) 
# Append each row in rdd2 to those in rdd1. 
rdd1.zip(rdd2).collect()

Source

2017-08-03 07:45:32

Please explain how and why your code solves the problem. –
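One possible reading of the zip answer, applied to the question's data: if the words already exist as a separate RDD, that RDD can be zipped with the dataframe's underlying RDD and the pairs turned back into a dataframe. This is only a sketch. The helper name attach_words is hypothetical, and it assumes words_rdd was derived from df.rdd with map, because RDD.zip requires both RDDs to have the same number of partitions and the same number of elements in each partition.

```python
def attach_words(df, words_rdd):
    """Pair each article row with its word list and rebuild a dataframe.

    Assumes df has an 'Articles' column and that words_rdd lines up with
    df.rdd element by element (for example, words_rdd = df.rdd.map(...)).
    """
    paired = df.rdd.zip(words_rdd)  # elements are (Row, list_of_words) pairs
    return paired.map(lambda pair: (pair[0].Articles, pair[1])).toDF(["Articles", "Words"])
```

If the word RDD was built some other way, so that order and partitioning are not guaranteed to match, a join on an explicit key (as the first comment on the question suggests) is the safer route.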
