How do I create a DataFrame with a single header (one row, many columns) and update values in that DataFrame in PySpark?

I want to create a DataFrame in PySpark that looks like the table below:

 
category | category_id | bucket | prop_count | event_count | accum_prop_count | accum_event_count
-------------------------------------------------------------------------------------------------
nation   | nation      | 1      | 222        | 444         | 555              | 6677

So this is the code I tried:

schema = StructType([])
df = sqlContext.createDataFrame(sc.emptyRDD(), schema)
df = df.withColumn("category",F.lit('nation')).withColumn("category_id",F.lit('nation')).withColumn("bucket",bucket)
df = df.withColumn("prop_count",prop_count).withColumn("event_count",event_count).withColumn("accum_prop_count",accum_prop_count).withColumn("accum_event_count",accum_event_count)
df.show()

This gives the error:

AssertionError: col should be Column

I also need to update the column values, and the update will apply to that single row.

How can I do this?

Source

2017-08-19 Viv

Are 'bucket', 'event_count', 'accum_prop_count' and 'accum_event_count' integers or something similar? If so, they cannot be used as columns directly, and you need to wrap them in 'F.lit()'. – MaFF

Why do you need an empty DataFrame? You already have the values for the whole row, so you can use them to create the DataFrame directly. – Suresh

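I think the problem in your code is in the lines where you pass plain variables, such as .withColumn("bucket",bucket). You are trying to create a new column from an integer value, but withColumn expects a Column, not a single integer.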

from pyspark.sql import functions as F

# bucket, prop_count, etc. are plain Python integers, so wrap each one in F.lit()
df = (df
      .withColumn("category", F.lit('nation'))
      .withColumn("category_id", F.lit('nation'))
      .withColumn("bucket", F.lit(bucket))
      .withColumn("prop_count", F.lit(prop_count))
      .withColumn("event_count", F.lit(event_count))
      .withColumn("accum_prop_count", F.lit(accum_prop_count))
      .withColumn("accum_event_count", F.lit(accum_event_count)))

The snippet above fixes the error by wrapping each value in F.lit, just as you are already doing for 'nation' and so on.

Another simple and clean way to write it is like this:

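from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# create schema (one field per column of the table in the question)
fields = [StructField("category", StringType(), True),
          StructField("category_id", StringType(), True),
          StructField("bucket", IntegerType(), True),
          StructField("prop_count", IntegerType(), True),
          StructField("event_count", IntegerType(), True),
          StructField("accum_prop_count", IntegerType(), True),
          StructField("accum_event_count", IntegerType(), True)]
schema = StructType(fields)

# load data: a single row, in the same order as the schema fields
data = [["nation", "nation", 1, 222, 444, 555, 6677]]

# spark is assumed to be an existing SparkSession
df = spark.createDataFrame(data, schema)
df.show()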

Source

2017-08-21 00:55:10 Pushkr
