2017-03-10 4 views
Casting string to int: null issue

I have a Spark DataFrame, results, with two string columns that I want to cast to a numeric type:

>>> results.show() 
+--------------------+-----------------+------------------------+
|       Hospital Name|HCAHPS Base Score|HCAHPS Consistency Score|
+--------------------+-----------------+------------------------+
|"ADIRONDACK MEDIC...|             "43"|                    "20"|
|"BAYLOR MEDICAL C...|             "32"|                    "20"|
|"GOOD SHEPHERD ME...|             "25"|                    "20"|
|"GOOD SHEPHERD ME...|             "25"|                    "20"|
|"MASONIC HOME AND...|  "Not Available"|         "Not Available"|
|"ST HELENA HOSPITAL"|             "41"|                    "20"|
|   "TOURO INFIRMARY"|             "15"|                    "18"|
|"WAHIAWA GENERAL ...|             "17"|                    "10"|
|"ANNA JAQUES HOSP...|             "27"|                    "18"|
|    "CMC-BLUE RIDGE"|             "31"|                    "18"|
|"EVANSTON REGIONA...|             "15"|                    "15"|
|"OKLAHOMA SPINE H...|             "79"|                    "20"|
|"PICKENS COUNTY M...|  "Not Available"|         "Not Available"|
|"PORTNEUF MEDICAL...|             "11"|                    "17"|
|"PRESENCE SAINT J...|             "20"|                    "17"|
|"RIVERSIDE MEDICA...|             "39"|                    "20"|
|"RIVERSIDE MEDICA...|             "39"|                    "20"|
|"RIVERSIDE MEDICA...|             "39"|                    "20"|
|"SOUTH GEORGIA ME...|    "3 out of 10"|                    "24"|
|"TAMPA GENERAL HO...|             "23"|                    "16"|
+--------------------+-----------------+------------------------+

Trying to cast them like this gives me a table of null values:

>>> results2 = results.select(results["Hospital Name"], results["HCAHPS Base Score"].cast(IntegerType()).alias("HCAHPS Base Score"), results["HCAHPS Consistency Score"].cast(IntegerType()).alias("HCAHPS Consistency Score"))
>>> results2.show() 
+--------------------+-----------------+------------------------+
|       Hospital Name|HCAHPS Base Score|HCAHPS Consistency Score|
+--------------------+-----------------+------------------------+
|"ADIRONDACK MEDIC...|             null|                    null|
|"BAYLOR MEDICAL C...|             null|                    null|
|"GOOD SHEPHERD ME...|             null|                    null|
|"GOOD SHEPHERD ME...|             null|                    null|
|"MASONIC HOME AND...|             null|                    null|
|"ST HELENA HOSPITAL"|             null|                    null|
|   "TOURO INFIRMARY"|             null|                    null|
|"WAHIAWA GENERAL ...|             null|                    null|
|"ANNA JAQUES HOSP...|             null|                    null|
|    "CMC-BLUE RIDGE"|             null|                    null|
|"EVANSTON REGIONA...|             null|                    null|
|"OKLAHOMA SPINE H...|             null|                    null|
|"PICKENS COUNTY M...|             null|                    null|
|"PORTNEUF MEDICAL...|             null|                    null|
|"PRESENCE SAINT J...|             null|                    null|
|"RIVERSIDE MEDICA...|             null|                    null|
|"RIVERSIDE MEDICA...|             null|                    null|
|"RIVERSIDE MEDICA...|             null|                    null|
|"SOUTH GEORGIA ME...|             null|                    null|
|"TAMPA GENERAL HO...|             null|                    null|
+--------------------+-----------------+------------------------+

only showing top 20 rows 

Is it not possible to convert a string to an integer in PySpark?

Answers

First, you need to strip the double quotes so that the values can be cast to IntegerType. You can use the following UDF to do that:

>>> def stripDQ(string):
...     return string.replace('"', "")
...
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import StringType, IntegerType
>>> udf_stripDQ = udf(stripDQ, StringType())
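
One caveat with stripDQ: Spark hands null column values to a Python UDF as None, and None has no replace method, so a null in either column would make the job fail. A minimal null-safe sketch of the helper, written as plain Python so it can be checked without a Spark session (the name strip_dq is my own, not something from the question):

```python
from typing import Optional


def strip_dq(value: Optional[str]) -> Optional[str]:
    """Remove every double quote from value; pass None through unchanged."""
    if value is None:
        # A null cell arrives as None; keep it null instead of raising.
        return None
    return value.replace('"', "")


print(strip_dq('"43"'))             # 43
print(strip_dq('"Not Available"'))  # Not Available
print(strip_dq(None))               # None
```

It can be wrapped the same way as above, e.g. udf(strip_dq, StringType()).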

Your actual DataFrame:

>>> results.show() 
+------------------+-----------------+------------------------+
|     Hospital Name|HCAHPS Base Score|HCAHPS Consistency Score|
+------------------+-----------------+------------------------+
|"ADIRONDACK MEDIC"|             "43"|                    "20"|
|"BAYLOR MEDICAL C"|             "32"|                    "20"|
|"GOOD SHEPHERD ME"|             "25"|                    "20"|
|"GOOD SHEPHERD ME"|             "25"|                    "20"|
|"MASONIC HOME AND"|  "Not Available"|         "Not Available"|
+------------------+-----------------+------------------------+

Now we use our UDF to strip the double quotes from both columns:

>>> results1 = results.withColumn("HCAHPS Base Score", udf_stripDQ(results["HCAHPS Base Score"])).withColumn("HCAHPS Consistency Score", udf_stripDQ(results["HCAHPS Consistency Score"]))
>>> results1.show()
+------------------+-----------------+------------------------+
|     Hospital Name|HCAHPS Base Score|HCAHPS Consistency Score|
+------------------+-----------------+------------------------+
|"ADIRONDACK MEDIC"|               43|                      20|
|"BAYLOR MEDICAL C"|               32|                      20|
|"GOOD SHEPHERD ME"|               25|                      20|
|"GOOD SHEPHERD ME"|               25|                      20|
|"MASONIC HOME AND"|    Not Available|           Not Available|
+------------------+-----------------+------------------------+

Now cast to integer:

>>> results2 = results1.select(results1["Hospital Name"], results1["HCAHPS Base Score"].cast(IntegerType()).alias("HCAHPS Base Score"), results1["HCAHPS Consistency Score"].cast(IntegerType()).alias("HPS Consistency Score"))
>>> results2.show()
+------------------+-----------------+---------------------+
|     Hospital Name|HCAHPS Base Score|HPS Consistency Score|
+------------------+-----------------+---------------------+
|"ADIRONDACK MEDIC"|               43|                   20|
|"BAYLOR MEDICAL C"|               32|                   20|
|"GOOD SHEPHERD ME"|               25|                   20|
|"GOOD SHEPHERD ME"|               25|                   20|
|"MASONIC HOME AND"|             null|                 null|
+------------------+-----------------+---------------------+
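
The null left for MASONIC HOME AND in the cast output is expected: Not Available is not a number. As a rough plain-Python analogy of what the outputs in this thread show (this is only an illustration with a hypothetical helper, to_int_or_none, and not how Spark implements cast; Spark's parsing rules differ in details):

```python
from typing import Optional


def to_int_or_none(text: Optional[str]) -> Optional[int]:
    """Return text parsed as an integer, or None when it is not a plain integer."""
    if text is None:
        return None
    try:
        return int(text.strip())
    except ValueError:
        # Anything that is not a bare integer literal ends up as None (null).
        return None


# Sample values taken from the tables in the question.
for raw in ['"43"', '43', 'Not Available', '3 out of 10']:
    print(repr(raw), '->', to_int_or_none(raw))
# '"43"' -> None
# '43' -> 43
# 'Not Available' -> None
# '3 out of 10' -> None
```

The quoted value fails to parse just like the text values do, which is why every row came back null before the quotes were stripped.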
