PySpark: extract multiple columns into a separate table

I have a csv file with one column named id and another named genre. The genre column can hold any number of genres, separated by |:

1,Action|Horror|Adventure 
2,Action|Adventure

Is it possible to go through each row and, for each genre, insert the current id and that genre into another dataframe?

1,Action 
1,Horror 
1,Adventure 
2,Action 
2,Adventure

Source

2017-03-27 Bato-Bair Tsyrenov

You can use a udf to split the genre data and then use the explode function.

from pyspark.sql.functions import explode, udf
from pyspark.sql.types import ArrayType, StringType

# sc and sqlContext are the objects predefined in the pyspark shell
s = [('1', 'Action|Adventure'), ('2', 'Comedy|Action')]
rdd = sc.parallelize(s)
df = sqlContext.createDataFrame(rdd, ['id', 'Col'])
df.show()
+---+----------------+
| id|             Col|
+---+----------------+
|  1|Action|Adventure|
|  2|   Comedy|Action|
+---+----------------+

# udf that splits the pipe-delimited string into an array of strings
newcol = udf(lambda x: x.split('|'), ArrayType(StringType()))
# explode produces one row per array element; the original column is dropped
df1 = df.withColumn('Genre', explode(newcol('Col'))).drop('Col')
df1.show()
+---+---------+
| id|    Genre|
+---+---------+
|  1|   Action|
|  1|Adventure|
|  2|   Comedy|
|  2|   Action|
+---+---------+

Source

2017-03-27 10:55:04 Suresh

In addition to Suresh's solution, you can also use flatMap after splitting your string to achieve the same result.

# Read csv from file (works in Spark 2.x and onwards)
df_csv = sqlContext.read.csv("genre.csv")

# Split the genre (row[1]) on the character |, but leave the id (row[0]) as is
# (indexing is used because lambda (x, y): ... is Python 2 only syntax)
rdd_split = df_csv.rdd.map(lambda row: (row[0], row[1].split('|')))

# Use a list comprehension to pair the id with each genre in the list
rdd_explode = rdd_split.flatMap(lambda row: [(row[0], k) for k in row[1]])

# Convert the resulting RDD back to a dataframe
df_final = rdd_explode.toDF(['id', 'Genre'])

df_final.show() returns this as output:

+---+---------+
| id|    Genre|
+---+---------+
|  1|   Action|
|  1|   Horror|
|  1|Adventure|
|  2|   Action|
|  2|Adventure|
+---+---------+
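
To make the map and flatMap steps easier to follow, the same logic can be sketched in plain Python without Spark. This is only an illustration of what each step does to the rows, not how Spark executes it; the variable names rows, split_rows and exploded are assumptions.

```python
from itertools import chain

# Sample rows from the question, as (id, genre) pairs
rows = [("1", "Action|Horror|Adventure"), ("2", "Action|Adventure")]

# Counterpart of rdd.map: exactly one output element per input row,
# here the id together with a list of genres
split_rows = [(id_, genres.split("|")) for id_, genres in rows]

# Counterpart of rdd.flatMap: every row produces a list of (id, genre) pairs,
# and all of those lists are concatenated into one flat sequence
exploded = list(
    chain.from_iterable(
        [(id_, genre) for genre in genres] for id_, genres in split_rows
    )
)

print(exploded)
```

The difference between the two steps is the shape of the result: map keeps two rows, each holding a list, while flatMap flattens them into five rows of one id and one genre each.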

Source

2017-03-27 12:21:41 Jaco

Could you comment on what exactly each line is doing?
