Preserving the Pandas timestamp type in a PySpark dataframe

I am writing the contents of a Pandas dataframe t to a Hive table using PySpark.

t has one column, Request_time_local, of type pandas.tslib.Timestamp:

In: print t.loc[0,'Request_time_local'] 
Out: 2016-12-09 13:01:27

The Hive table has a column request_time_local of type timestamp:

col_name    | data_type 
request_time_local | timestamp

I convert t to a PySpark dataframe in order to write it to Hive:

t_rdd = spark.createDataFrame(t) 
t_rdd.registerTempTable("temp_result")

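The request_time_local column is not populated in the table, although all the other fields are. It appears that on conversion to a PySpark dataframe the column type becomes bigint: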

spark.createDataFrame(t) 
DataFrame[request_time_local: bigint, ...]

So after conversion to a PySpark dataframe, request_time_local is a bigint Unix timestamp. I confirmed this by converting the PySpark dataframe back to Pandas:

t_check = t_rdd.toPandas() 
In: print t_check.loc[0,'Request_time_local'] 
Out: 1481288487000000000
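The value above looks like the original timestamp expressed as nanoseconds since the Unix epoch. A minimal sketch that checks this with the standard library only, assuming the naive Pandas timestamp is interpreted as UTC (the variable names here are illustrative, not from the original post):

```python
from datetime import datetime, timezone

# Assumption: the naive timestamp 2016-12-09 13:01:27 is treated as UTC.
original = datetime(2016, 12, 9, 13, 1, 27, tzinfo=timezone.utc)

# Whole seconds since the Unix epoch, then scaled to nanoseconds.
epoch_seconds = int(original.timestamp())
epoch_nanos = epoch_seconds * 10**9

print(epoch_nanos)  # 1481288487000000000
```

The result matches the bigint seen after the round trip, which suggests the column carries the same instant and has only lost its type.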

My questions are:

1) Does request_time_local fail to populate because I am writing a bigint from the PySpark dataframe into a timestamp column of the Hive table?

2) Is there a way to preserve the timestamp type in the PySpark dataframe so that it is compatible with the Hive table's column type?

(I realize that one solution here is to change the Hive column to int and write the Unix timestamp.)

2016-12-10 lmart999

You can try:

from pyspark.sql.functions import col 

# Scale nanoseconds down to seconds, then cast the result to timestamp
spark.createDataFrame(t) \
    .withColumn("parsed", (col("Request_time_local")/1000**3).cast("timestamp"))
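The division by 1000**3 is there because the bigint holds nanoseconds, while a numeric value cast to timestamp in Spark is read as seconds since the epoch. A rough standard-library sketch of the same arithmetic, where nanos_to_utc is a hypothetical helper name and UTC is assumed:

```python
from datetime import datetime, timezone

NANOS_PER_SECOND = 1000 ** 3  # same divisor as in the snippet above


def nanos_to_utc(nanos):
    """Convert nanoseconds since the Unix epoch to a UTC datetime."""
    seconds, leftover_nanos = divmod(nanos, NANOS_PER_SECOND)
    whole = datetime.fromtimestamp(seconds, tz=timezone.utc)
    # datetime only resolves microseconds, so finer precision is dropped.
    return whole.replace(microsecond=leftover_nanos // 1000)


print(nanos_to_utc(1481288487000000000))  # 2016-12-09 13:01:27+00:00
```

This only illustrates the unit conversion; in the actual pipeline the cast happens inside Spark, and the displayed time may depend on the session time zone.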

2016-12-10 22:27:19

Thanks. This solved the problem. – lmart999
