
Each line of my CSV file is structured as follows:

u001, 2013-11, 0, 1, 2, ... , 99

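Here u001 and 2013-11 are the UID and the date, and the numbers from 0 to 99 are the data values. I want to load this CSV file into a Spark DataFrame with the following structure, where dataVector is an Array[Int] whose length is the same for every UID and date: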

+----+-------+----------------+
| uid|   date|      dataVector|
+----+-------+----------------+
|u001|2013-11| [0,1,...,98,99]|
|u002|2013-11|[1,2,...,99,100]|
+----+-------+----------------+

root
 |-- uid: string (nullable = true)
 |-- date: string (nullable = true)
 |-- dataVector: array (nullable = true)
 |    |-- element: integer (containsNull = true)

I have tried several ways to solve this. The first was to use the schema

import org.apache.spark.sql.types._

val attributes = Array("uid", "date", "dataVector")
val schema = StructType(
  StructField(attributes(0), StringType, true) ::
  StructField(attributes(1), StringType, true) ::
  StructField(attributes(2), ArrayType(IntegerType), true) ::
  Nil)

but this approach did not work. Since my later datasets have more than 100 data columns, I also think it would be inconvenient to write out by hand a schema that lists every column of dataVector.

I also tried loading the CSV file directly without a schema and then combining the data columns with the method from "concatenate multiple columns into single columns", but the resulting schema looks like this:

root
 |-- uid: string (nullable = true)
 |-- date: string (nullable = true)
 |-- dataVector: struct (nullable = true)
 |    |-- _c3: string (containsNull = true)
 |    |-- _c4: string (containsNull = true)
 .
 .
 .
 |    |-- _c101: string (containsNull = true)

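This is still different from what I need, and I could not find a way to convert this struct into the structure I want. So my question is: how can I load the CSV file into the structure I need?

Source

2017-12-15 agonized

Load it without any additions

val df = spark.read.csv(path)

and select: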

import org.apache.spark.sql.functions._ 
import org.apache.spark.sql.Column 

// Combine data into array 
val dataVector: Column = array(
    df.columns.drop(2).map(col): _* // Skip first 2 columns 
).cast("array<int>") // Cast to the required type 
val cols: Array[Column] = df.columns.take(2).map(col) :+ dataVector 

df.select(cols: _*).toDF("uid", "date", "dataVector")

Source

2017-12-15 02:32:36 user8371915
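
To make the column slicing in the answer above concrete, here is a minimal plain-Scala sketch of the same per-row idea: the first two fields are kept as strings and the remaining fields are collected into a single integer sequence. It does not use Spark. The names `ParseLineSketch`, `Record` and `parseLine` are hypothetical and introduced only for illustration, and the sample line is a shortened version of the format shown in the question.

```scala
object ParseLineSketch {
  // Hypothetical record type mirroring the desired (uid, date, dataVector) layout.
  final case class Record(uid: String, date: String, dataVector: Vector[Int])

  // Split one CSV line, keep the first two fields, convert the rest to Int.
  def parseLine(line: String): Record = {
    val fields = line.split(",").map(_.trim).toVector // strip the padding after each comma
    Record(
      fields(0),                 // uid
      fields(1),                 // date
      fields.drop(2).map(_.toInt) // skip the first 2 fields, like df.columns.drop(2)
    )
  }

  def main(args: Array[String]): Unit = {
    // Shortened sample line; a real line in the question carries 100 data values.
    println(parseLine("u001, 2013-11, 0, 1, 2, 3"))
  }
}
```

Because the split is positional (`drop(2)`), nothing depends on how many data values follow, which is the same reason the answer's select does not need a hand-written schema. Note that `toInt` in this sketch throws a `NumberFormatException` on a non-numeric field.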
