I successfully joined userID and match with the query below. How can I get string values into an RDD while implementing Spark FP-Growth?
var queryToGroupCustomers = "SELECT yt.userID as player," +
var queryToGroupCustomers = "SELECT yt.userID as player," +
" concat_ws(\",\", collect_set(match)) AS matchesPlayedOn" + //concat_ws()
" FROM recommendationengine.sportsbookbets_orc yt" +
" where yt.userID is not null " + leagueCondition + "'" +
" GROUP BY yt.userID"
Now I want to pass that column into an RDD for use in the algorithm. My implementation uses the generic Row format:
val transactions: RDD[Array[String]] = results.rdd.map(row => row.get(2).toString.split(","))
but it gives me the following error:
17/03/27 23:28:51 ERROR Executor: Exception in task 3.0 in stage 1.0 (TID 29)
java.lang.ArrayIndexOutOfBoundsException: 2
at org.apache.spark.sql.catalyst.expressions.GenericRow.get(rows.scala:200)
Here is an example row of the joined dataset:
ff6e96d4-e243-4046-8e02-ce3d4b459a5d Napoli - Crotone, AC Milan - Juventus, Torino - Juventus, AS Roma - AC Milan, Empoli - Bologna, AC Milan - Internazionale, Genoa - AC Milan, Sassuolo - Chievo Verona, Sassuolo - Genoa
Here is the complete algorithm so far:
// Has all customers and their bets
var queryToGroupCustomers = "SELECT yt.userID as player," +
" concat_ws(\",\", collect_set(match)) AS matchesPlayedOn" + //concat_ws()
" FROM recommendationengine.sportsbookbets_orc yt" +
" where yt.userID is not null " + leagueCondition + "'" +
" GROUP BY yt.userID"
println("Executing query: \n\n" + queryToGroupCustomers)
var results = hc.sql(queryToGroupCustomers).cache()
val transactions: RDD[Array[String]] = results.rdd.map(row => row.get(2).toString.split(","))
// Set configurations for FP-Growth
val fpg = new FPGrowth()
.setMinSupport(0.5)
.setNumPartitions(10)
// Generate model
val model = fpg.run(transactions);
println("\n\n Starting FPGrowth\n\n")
model.freqItemsets.collect().foreach { itemset =>
println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
}
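A note on the likely cause: the SELECT produces only two columns, so a Spark Row here has indices 0 (player) and 1 (matchesPlayedOn), and `row.get(2)` addresses a third column that does not exist, hence `ArrayIndexOutOfBoundsException: 2`. The fix would presumably be `results.rdd.map(_.getString(1).split(","))`. A minimal sketch of the zero-based indexing, with a plain Seq standing in for a Spark Row (`RowIndexSketch` and `itemsFrom` are illustrative names, not Spark API):

```scala
object RowIndexSketch {
  // Stand-in for row.get(1).toString.split(","); a trim is added so that
  // items such as " AC Milan - Juventus" do not keep their leading space.
  def itemsFrom(row: Seq[String]): Array[String] =
    row(1).split(",").map(_.trim)

  def main(args: Array[String]): Unit = {
    // Two-column row, matching the SELECT: index 0 = player, index 1 = matchesPlayedOn.
    val row = Seq(
      "ff6e96d4-e243-4046-8e02-ce3d4b459a5d",
      "Napoli - Crotone, AC Milan - Juventus, Torino - Juventus"
    )
    // row(2) would throw; row(1) is the last valid index.
    println(itemsFrom(row).mkString("[", "|", "]"))
  }
}
```

Note that `concat_ws(",", ...)` joins with `", "`-free commas only when the source values lack spaces; splitting on `","` alone leaves leading spaces on each item, which is why the trim matters before feeding transactions to FPGrowth.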
Any suggestions would be appreciated... thanks.
Yes, that's it... thank you – EP89