SparkのDataset APIは、Dataframeと比較して異なる結果を示します

Spark 2.1を使用していて、orc形式のハイブテーブルが1つあります。スキーマは次のとおりです。SparkのDataset APIは、Dataframeと比較して異なる結果を示します

col_name data_type 
tuid  string 
puid  string 
ts   string 
dt   string 
source  string 
peer  string 
# Partition Information 
# col_name data_type 
dt   string 
source  string 
peer  string 

# Detailed Table Information  
Database:   test 
Owner:    test 
Create Time:  Tue Nov 22 15:25:53 GMT 2016 
Last Access Time: Thu Jan 01 00:00:00 GMT 1970 
Location:   hdfs://apps/hive/warehouse/nis.db/dmp_puid_tuid 
Table Type:   MANAGED 
Table Parameters: 
    transient_lastDdlTime 1479828353 
    SORTBUCKETCOLSPREFIX TRUE 

# Storage Information 
SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde 
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat 
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat 
Compressed: No 
Storage Desc Parameters:  
    serialization.format 1

パーティションカラムを使用してこのテーブルの上にフィルタを適用しているときに、正常に動作し、特定のパーティションのみを読み込みます。

val puid = spark.read.table("nis.dmp_puid_tuid") 
    .as(Encoders.bean(classOf[DmpPuidTuid])) 
    .filter("""peer = "AggregateKnowledge" and dt = "20170403"""")

と私はその上記のデータフレーム用スパーク

val puid = spark.read.table("nis.dmp_puid_tuid") 
    .as(Encoders.bean(classOf[DmpPuidTuid])) 
    .filter(tp => tp.getPeer().equals("AggregateKnowledge") && Integer.valueOf(tp.getDt()) >= 20170403)

物理計画に全データを読み込み、コードの下に使用していたときに、これは、このクエリ

== Physical Plan == 
HiveTableScan [tuid#1025, puid#1026, ts#1027, dt#1022, source#1023, peer#1024], MetastoreRelation nis, dmp_puid_tuid, [isnotnull(peer#1024), isnotnull(dt#1022), 
(peer#1024 = AggregateKnowledge), (dt#1022 = 20170403)]

のための私の物理的な計画ですが、

== Physical Plan == 
*Filter <function1>.apply 
+- HiveTableScan [tuid#1058, puid#1059, ts#1060, dt#1055, source#1056, peer#1057], MetastoreRelation nis, dmp_puid_tuid

注： - あなたはfilterにScalaの関数を渡すときDmpPuidTuidは、オプティマイザが見えるようにしようとしないので、あなたは（データセットの列は、実際に使用されて見てからスパークオプティマイザを防ぐためのJava Beanクラス

出典

2017-06-02 Kaushal

です関数のコンパイルされたコード内）。 col("peer") === "AggregateKnowledge" && col("dt").cast(IntegerType) >= 20170403などの列式を渡すと、オプティマイザは実際に必要な列を確認し、それに応じて計画を調整できます。

出典

2017-06-03 04:51:25

ありがとう@joe今後、データセットやサポートのtypesefe機能を実現するための方法はありますか？ – Kaushal

コンパイル時の型チェックを意味するなら、私が知っているのは[frameless]（https://github.com/typelevel/frameless）プロジェクトだけです。しかし、私はこのことに関する専門家ではない。 –

SparkのDataset APIは、Dataframeと比較して異なる結果を示します

答えて

関連する問題