How can I extract a specific element from a column for each row?

I have the following DataFrame in Spark 2.2.0 and Scala 2.11.8:

+----------+-------------------------------+
|item      |other_items                    |
+----------+-------------------------------+
|111       |[[444,1.0],[333,0.5],[666,0.4]]|
|222       |[[444,1.0],[333,0.5]]          |
|333       |[]                             |
|444       |[[111,2.0],[555,0.5],[777,0.2]]|
+----------+-------------------------------+

I want to obtain the following DataFrame:

+----------+-------------+
|item      |other_items  |
+----------+-------------+
|111       |444          |
|222       |444          |
|444       |111          |
+----------+-------------+

So, basically, I need to extract the first item from other_items for each row. I also need to ignore the rows whose other_items array is empty ([]).

How can I do this?

I tried the following approach, but did not get the expected result:

val result = df.withColumn("other_items", $"other_items"(0))

printSchema gives the following output:

|-- item: string (nullable = true)
|-- other_items: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: string (nullable = true)
| | |-- _2: double (nullable = true)

Source

2017-11-22 Markus

Like this. The first apply ($"other_items"(0)) selects the first element of the array:

val df = Seq(
    ("111", Seq(("111", 1.0), ("333", 0.5), ("666", 0.4))), ("333", Seq()) 
).toDF("item", "other_items") 


df.select($"item", $"other_items"(0)("_1").alias("other_items")) 
    .na.drop(Seq("other_items")).show

The second apply (_("_1")) selects the _1 field, and na.drop removes the nulls introduced by empty arrays.

+----+-----------+
|item|other_items|
+----+-----------+
| 111|        111|
+----+-----------+
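
The same idea can be applied to the data from the question. The following is a minimal sketch, assuming a local SparkSession; the object name FirstOtherItem, the helper firstItem, and the explicit size filter (used here in place of na.drop) are illustrative assumptions, not taken from the code above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, size}

// Hypothetical helper object, named for illustration only.
object FirstOtherItem {

  // Keep rows whose array has at least one element, then take the
  // _1 field of the first struct in the array.
  def firstItem(df: DataFrame): DataFrame =
    df.filter(size(col("other_items")) > 0)
      .select(col("item"), col("other_items")(0)("_1").alias("other_items"))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, suitable for experimentation only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("first-other-item")
      .getOrCreate()
    import spark.implicits._

    // Sample data mirroring the table in the question.
    val df = Seq(
      ("111", Seq(("444", 1.0), ("333", 0.5), ("666", 0.4))),
      ("222", Seq(("444", 1.0), ("333", 0.5))),
      ("333", Seq.empty[(String, Double)]),
      ("444", Seq(("111", 2.0), ("555", 0.5), ("777", 0.2)))
    ).toDF("item", "other_items")

    firstItem(df).show()
    spark.stop()
  }
}
```

With this input, the result should contain the pairs (111, 444), (222, 444) and (444, 111), while item 333 is dropped because its array is empty. Filtering on size before selecting makes the intent explicit; na.drop on the extracted column, as used above, reaches the same rows by removing the nulls afterwards.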

Source

2017-11-22 16:22:32 user8991934
