2017-10-19 10 views
from pyspark.sql import Row, functions as F

row = Row("UK_1", "UK_2", "Date", "Cat", "Combined")
agg = ''
agg = 'Cat'
tdf = sc.parallelize([
    row(1, 1, '12/10/2016', 'A', 'Water^World'),
    row(1, 2, None, 'A', 'Sea^Born'),
    row(2, 1, '14/10/2016', 'B', 'Germ^Any'),
    row(3, 3, '!~2016/2/276', 'B', 'Fin^Land'),
    row(None, 1, '26/09/2016', 'A', 'South^Korea'),
    row(1, 1, '12/10/2016', 'A', 'North^America'),
    row(1, 2, None, 'A', 'South^America'),
    row(2, 1, '14/10/2016', 'B', 'New^Zealand'),
    row(None, None, '!~2016/2/276', 'B', 'South^Africa'),
    row(None, 1, '26/09/2016', 'A', 'Saudi^Arabia')
]).toDF()
cols = F.split(tdf['Combined'], '^')
tdf = tdf.withColumn('column1', cols.getItem(0))
tdf = tdf.withColumn('column2', cols.getItem(1))
tdf.show(truncate=False)

The above is my sample code for splitting a PySpark column.

For some reason, it is not splitting the column on the ^ character.

Any advice?

Answer

The pattern argument is a regular expression (see split). In a regex, ^ is an anchor that matches the beginning of the string, so to match a literal caret you need to escape it:

cols = F.split(tdf['Combined'], r'\^')
tdf = tdf.withColumn('column1', cols.getItem(0))
tdf = tdf.withColumn('column2', cols.getItem(1))
tdf.show(truncate=False)

+----+----+------------+---+-------------+-------+-------+
|UK_1|UK_2|Date        |Cat|Combined     |column1|column2|
+----+----+------------+---+-------------+-------+-------+
|1   |1   |12/10/2016  |A  |Water^World  |Water  |World  |
|1   |2   |null        |A  |Sea^Born     |Sea    |Born   |
|2   |1   |14/10/2016  |B  |Germ^Any    |Germ   |Any    |
|3   |3   |!~2016/2/276|B  |Fin^Land    |Fin    |Land   |
|null|1   |26/09/2016  |A  |South^Korea  |South  |Korea  |
|1   |1   |12/10/2016  |A  |North^America|North  |America|
|1   |2   |null        |A  |South^America|South  |America|
|2   |1   |14/10/2016  |B  |New^Zealand |New    |Zealand|
|null|null|!~2016/2/276|B  |South^Africa |South  |Africa |
|null|1   |26/09/2016  |A  |Saudi^Arabia |Saudi  |Arabia |
+----+----+------------+---+-------------+-------+-------+
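The anchor behaviour can be checked outside of Spark with Python's standard re module, which treats ^ the same way (a standalone sketch with my own variable names, not part of the original post; Spark itself uses Java regexes, but the escaping rule is the same):

```python
import re

s = 'Water^World'

# Unescaped: '^' is a zero-width anchor that matches only at the start
# of the string, so the caret character is never split on.
print(re.split('^', s))

# Escaped: r'\^' matches the literal caret, giving the intended two parts.
print(re.split(r'\^', s))   # ['Water', 'World']

# re.escape() builds the escaped pattern for you; handy when the
# delimiter comes from elsewhere and may contain regex metacharacters.
print(re.split(re.escape('^'), s))  # ['Water', 'World']
```

The same reasoning applies to other regex metacharacters used as delimiters, such as ., |, or $.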