Data distribution in Spark when repartitioning an RDD

Consider the following snippet (running Spark 2.1 on Python 2.7), which repartitions an RDD:

from pyspark import SparkContext

nums = range(0, 10)

with SparkContext("local[2]") as sc:
    # Initial RDD: the partition count follows the local[2] master setting
    rdd = sc.parallelize(nums)
    print("Number of partitions: {}".format(rdd.getNumPartitions()))
    print("Partitions structure: {}".format(rdd.glom().collect()))

    # Repartition into 5 partitions and inspect the layout again
    rdd2 = rdd.repartition(5)
    print("Number of partitions: {}".format(rdd2.getNumPartitions()))
    print("Partitions structure: {}".format(rdd2.glom().collect()))

It outputs:

Number of partitions: 2 
Partitions structure: [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]] 

Number of partitions: 5 
Partitions structure: [[], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [], [], []]

Why isn't the data spread across all of the partitions after repartitioning in pyspark?

Source

2017-04-21 Khozzy

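repartition internally calls coalesce(numPartitions, shuffle=True) (see core code here). That is, all the data is shuffled across the network and partitioning is done in a round-robin manner: the first record goes to the first processing node, the second record goes to the second processing node, and so on. However, since you specified local[2], which means only 2 virtual nodes are allocated, my guess is that Spark was only able to get one core from the local machine, so all the values are stored on the one node where the task was executed.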
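For reference, the idealized round-robin assignment described above can be sketched in plain Python. This is only an illustration of the idea, not Spark's actual shuffle code; the function name round_robin is hypothetical, and it assumes records are dealt out one at a time starting from the first partition.

```python
def round_robin(records, num_partitions):
    """Deal records out to partitions one at a time, cycling through them."""
    # Hypothetical helper for illustration; not part of the Spark API.
    partitions = [[] for _ in range(num_partitions)]
    for i, record in enumerate(records):
        partitions[i % num_partitions].append(record)
    return partitions


print(round_robin(range(0, 10), 5))
# → [[0, 5], [1, 6], [2, 7], [3, 8], [4, 9]]
```

With a strict per-record round-robin, every one of the 5 partitions would receive two of the ten values, which is the even layout the question expected from rdd.repartition(5) and did not get.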

Source

2017-04-21 14:15:14 Pushkr

Thanks for your input. I don't think that is the case. This approach works when using DataFrames (see https://hackernoon.com/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4), but it fails with plain RDDs – Khozzy
