I successfully joined userID and match with the query below. How can I get string values into an RDD while implementing Spark FP-Growth?
var queryToGroupCustomers = "SELECT yt.userID as player," +
var queryToGroupCustomers = "SELECT yt.userID as player," +
" concat_ws(\",\", collect_set(match)) AS matchesPlayedOn" + //concat_ws()
" FROM recommendationengine.sportsbookbets_orc yt" +
" where yt.userID is not null " + leagueCondition + "'" +
" GROUP BY yt.userID"
Now I want to pass that column into an RDD for use in the algorithm. My implementation uses the generic Row format:
val transactions: RDD[Array[String]] = results.rdd.map(row => row.get(2).toString.split(","))
but it gives me the following error:
17/03/27 23:28:51 ERROR Executor: Exception in task 3.0 in stage 1.0 (TID 29)
java.lang.ArrayIndexOutOfBoundsException: 2
at org.apache.spark.sql.catalyst.expressions.GenericRow.get(rows.scala:200)
Here is an example row of the joined dataset:
ff6e96d4-e243-4046-8e02-ce3d4b459a5d Napoli - Crotone, AC Milan - Juventus, Torino - Juventus, AS Roma - AC Milan, Empoli - Bologna, AC Milan - Internazionale, Genoa - AC Milan, Sassuolo - Chievo Verona, Sassuolo - Genoa
Here is the complete algorithm so far:
// Has all customers and their bets
var queryToGroupCustomers = "SELECT yt.userID as player," +
" concat_ws(\",\", collect_set(match)) AS matchesPlayedOn" + //concat_ws()
" FROM recommendationengine.sportsbookbets_orc yt" +
" where yt.userID is not null " + leagueCondition + "'" +
" GROUP BY yt.userID"
println("Executing query: \n\n" + queryToGroupCustomers)
var results = hc.sql(queryToGroupCustomers).cache()
val transactions: RDD[Array[String]] = results.rdd.map(row => row.get(2).toString.split(","))
// Set configurations for FP-Growth
val fpg = new FPGrowth()
.setMinSupport(0.5)
.setNumPartitions(10)
// Generate model
val model = fpg.run(transactions);
println("\n\n Starting FPGrowth\n\n")
model.freqItemsets.collect().foreach { itemset =>
println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
}
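A note on the likely cause: the SELECT produces only two columns, so a Spark Row here has indices 0 (player) and 1 (matchesPlayedOn), and `row.get(2)` addresses a third column that does not exist, hence `ArrayIndexOutOfBoundsException: 2`. The fix would presumably be `results.rdd.map(_.getString(1).split(","))`. A minimal sketch of the zero-based indexing, with a plain Seq standing in for a Spark Row (`RowIndexSketch` and `itemsFrom` are illustrative names, not Spark API):

```scala
object RowIndexSketch {
  // Stand-in for row.get(1).toString.split(","); a trim is added so that
  // items such as " AC Milan - Juventus" do not keep their leading space.
  def itemsFrom(row: Seq[String]): Array[String] =
    row(1).split(",").map(_.trim)

  def main(args: Array[String]): Unit = {
    // Two-column row, matching the SELECT: index 0 = player, index 1 = matchesPlayedOn.
    val row = Seq(
      "ff6e96d4-e243-4046-8e02-ce3d4b459a5d",
      "Napoli - Crotone, AC Milan - Juventus, Torino - Juventus"
    )
    // row(2) would throw; row(1) is the last valid index.
    println(itemsFrom(row).mkString("[", "|", "]"))
  }
}
```

Note that `concat_ws(",", ...)` joins with `", "`-free commas only when the source values lack spaces; splitting on `","` alone leaves leading spaces on each item, which is why the trim matters before feeding transactions to FPGrowth.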
Any suggestions would be appreciated... thanks.
Yes, that's it... thank you – EP89