PySpark does not convert timestamp

I have a very simple CSV, call it test.csv:

name,timestamp,action 
A,2012-10-12 00:30:00.0000000,1 
B,2012-10-12 01:00:00.0000000,2 
C,2012-10-12 01:30:00.0000000,2 
D,2012-10-12 02:00:00.0000000,3 
E,2012-10-12 02:30:00.0000000,1

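I am reading it with PySpark and trying to add a new column that shows the month.

First I read in the data, and everything looks fine.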

df = spark.read.csv('test.csv', inferSchema=True, header=True) 
df.printSchema() 
df.show()

Output:

root 
|-- name: string (nullable = true) 
|-- timestamp: timestamp (nullable = true) 
|-- action: double (nullable = true) 

+----+-------------------+------+
|name|          timestamp|action|
+----+-------------------+------+
|   A|2012-10-12 00:30:00|   1.0|
|   B|2012-10-12 01:00:00|   2.0|
|   C|2012-10-12 01:30:00|   2.0|
|   D|2012-10-12 02:00:00|   3.0|
|   E|2012-10-12 02:30:00|   1.0|
+----+-------------------+------+

But when I try to add my column, the format option does not seem to do anything.

from pyspark.sql.functions import col, to_date

df.withColumn('month', to_date(col('timestamp'), format='MMM')).show()

This outputs:

+----+-------------------+------+----------+
|name|          timestamp|action|     month|
+----+-------------------+------+----------+
|   A|2012-10-12 00:30:00|   1.0|2012-10-12|
|   B|2012-10-12 01:00:00|   2.0|2012-10-12|
|   C|2012-10-12 01:30:00|   2.0|2012-10-12|
|   D|2012-10-12 02:00:00|   3.0|2012-10-12|
|   E|2012-10-12 02:30:00|   1.0|2012-10-12|
+----+-------------------+------+----------+

What is going on here?

Source

2017-12-09 Mark Dunne

What do you want to convert it to? The month?

Yes. According to the documentation on the Oracle page, MMM should achieve that, but none of the formats I try has any effect. https://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html

There is a built-in function called month. https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html

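to_date with a format is meant for parsing string-typed columns into dates. What you need here is date_format, which renders a date or timestamp column as a string in the given pattern:

from pyspark.sql.functions import col, date_format

df.withColumn('month', date_format(col('timestamp'), format='MMM')).show()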

# +----+-------------------+------+-----+
# |name|          timestamp|action|month|
# +----+-------------------+------+-----+
# |   A|2012-10-12 00:30:00|   1.0|  Oct|
# |   B|2012-10-12 01:00:00|   2.0|  Oct|
# |   C|2012-10-12 01:30:00|   2.0|  Oct|
# |   D|2012-10-12 02:00:00|   3.0|  Oct|
# |   E|2012-10-12 02:30:00|   1.0|  Oct|
# +----+-------------------+------+-----+

Source

2017-12-09 16:43:35 user8371915
