How to count the number of missing values per row in a DataFrame

I have this sample DataFrame:

from pyspark.sql.types import * 

schema = StructType([ 
StructField("ClientId", IntegerType(), True), 
StructField("m_ant21", IntegerType(), True), 
StructField("m_ant22", IntegerType(), True), 
StructField("m_ant23", IntegerType(), True), 
StructField("m_ant24", IntegerType(), True)]) 

df = sqlContext.createDataFrame(
         data=[(0, None, None, None, None), 
           (1, 23, 13, 17, 99), 
           (2, 0, 0, 0, 1), 
           (3, 0, None, 1, 0), 
           (4, None, None, None, None)], 
           schema=schema)

That gives me this DataFrame:

+--------+-------+-------+-------+-------+ 
|ClientId|m_ant21|m_ant22|m_ant23|m_ant24| 
+--------+-------+-------+-------+-------+ 
|       0|   null|   null|   null|   null|
|       1|     23|     13|     17|     99|
|       2|      0|      0|      0|      1|
|       3|      0|   null|      1|      0|
|       4|   null|   null|   null|   null|
+--------+-------+-------+-------+-------+

I need to solve the following: I would like to create a new variable that counts how many missing values each row has. For example:

ClientId 0 should be 4
ClientId 1 should be 0
ClientId 3 should be 1

Note that df is a pyspark.sql.dataframe.DataFrame.

Source

2017-07-11 Juan David

Here is one option:

from pyspark.sql import Row 

# add the column schema to the original schema 
schema.add(StructField("count_null", IntegerType(), True)) 

# convert data frame to rdd and append an element to each row to count the number of nulls 
df.rdd.map(lambda row: row + Row(sum(x is None for x in row))).toDF(schema).show() 

+--------+-------+-------+-------+-------+----------+ 
|ClientId|m_ant21|m_ant22|m_ant23|m_ant24|count_null| 
+--------+-------+-------+-------+-------+----------+ 
|       0|   null|   null|   null|   null|         4|
|       1|     23|     13|     17|     99|         0|
|       2|      0|      0|      0|      1|         0|
|       3|      0|   null|      1|      0|         1|
|       4|   null|   null|   null|   null|         4|
+--------+-------+-------+-------+-------+----------+
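
The per-row logic inside the lambda can be checked without Spark. The following is only a minimal sketch in plain Python, assuming the rows are ordinary tuples with the same values as the sample data; the names `rows`, `count_nulls` and `extended` are made up for illustration and are not part of the answer.

```python
# Same values as the sample DataFrame, as plain tuples (no Spark needed).
rows = [
    (0, None, None, None, None),
    (1, 23, 13, 17, 99),
    (2, 0, 0, 0, 1),
    (3, 0, None, 1, 0),
    (4, None, None, None, None),
]


def count_nulls(row):
    # Each comparison yields True or False; True counts as 1 when summed,
    # so the result is the number of None entries in the row.
    return sum(value is None for value in row)


# Append the count to every row, like `row + Row(...)` does in the RDD version.
extended = [row + (count_nulls(row),) for row in rows]
counts = [row[-1] for row in extended]
print(counts)
```

Note that a value of 0 is not counted as missing, because the test is `is None` rather than truthiness.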

Another option, if you don't want to deal with the schema, is:

from pyspark.sql.functions import col, when 

df.withColumn("count_null", sum([when(col(x).isNull(),1).otherwise(0) for x in df.columns])).show() 

+--------+-------+-------+-------+-------+----------+ 
|ClientId|m_ant21|m_ant22|m_ant23|m_ant24|count_null| 
+--------+-------+-------+-------+-------+----------+ 
|       0|   null|   null|   null|   null|         4|
|       1|     23|     13|     17|     99|         0|
|       2|      0|      0|      0|      1|         0|
|       3|      0|   null|      1|      0|         1|
|       4|   null|   null|   null|   null|         4|
+--------+-------+-------+-------+-------+----------+
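
It may help to see why Python's built-in `sum` can be used on a list of column expressions here: `sum` starts from 0 and folds the list with `+`, and Spark's `Column` overloads `+` on both sides, so the result is one combined expression rather than a number. The sketch below imitates that in plain Python; `Expr`, `_text` and `is_null(...)` are hypothetical stand-ins invented for illustration, not Spark API.

```python
class Expr:
    """Hypothetical stand-in for a Spark Column: '+' builds a bigger expression."""

    def __init__(self, text):
        self.text = text

    def __add__(self, other):
        # Expr + something
        return Expr(f"({self.text} + {_text(other)})")

    def __radd__(self, other):
        # something + Expr, used for the initial `0 + first_item` step of sum()
        return Expr(f"({_text(other)} + {self.text})")


def _text(value):
    # Render either another Expr or a plain Python value.
    return value.text if isinstance(value, Expr) else str(value)


columns = ["ClientId", "m_ant21", "m_ant22"]
flags = [Expr(f"is_null({name})") for name in columns]

# The built-in sum folds the list left to right, starting from 0.
total = sum(flags)
print(total.text)
```

One practical consequence: if `sum` has been shadowed, for example by `from pyspark.sql.functions import *`, the one-liner above would call Spark's aggregate `sum` instead of the built-in and would not behave this way.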

Source

2017-07-11 17:18:17 Psidom

It works, thanks, but I have a question: how can I add the new variable to a DataFrame without the schema code? I ask because I only used schema = StructType([...]) as an example, and you use schema.add(StructField("count_null", IntegerType(), True)). In practice I read the DataFrame with df = sqlContext.sql(""" select * from Base_ut """) –

How about reading the schema from the DataFrame first, modifying it, and then applying it to the new DataFrame? Something like 'schema = df.schema; schema.add(...); ...' – Psidom

Great!! Thanks! –
