Apache Pig group by function gives unexpected output
I have data in CSV format, as shown below. The data has the following format:
"first_name","last_name","company_name","address","city","county","postal","phone1","phone2","email","web"
I have sample data in a file named User.csv. The file contains the following data:
"Aleshia","Tomkiewicz","Alan D Rosenburg Cpa Pc","14, Taylor St","St. Stephens Ward","Kent","CT2 7PP","01835-703597","01944-369967","[email protected]","http://www.alandrosenburgcpapc.co.uk"
"Evan","Zigomalas","Cap Gemini America","5, Binney St","Abbey Ward","Buckinghamshire","HP11 2AX","01937-864715","01714-737668","[email protected]","http://www.capgeminiamerica.co.uk"
"France","Andrade","Elliott, John W Esq","8 Moor Place","East Southbourne and Tuckton W","Bournemouth","BH6 3BE","01347-368222","01935-821636","[email protected]","http://www.elliottjohnwesq.co.uk"
I am trying to load this file using PigStorage:
user = LOAD '/home/abhijit/Downloads/User.csv' USING PigStorage(',');
DUMP user;
The output is as follows:
("Aleshia","Tomkiewicz","Alan D Rosenburg Cpa Pc","14 Taylor St","St. Stephens Ward","Kent","CT2 7PP","01835-703597","01944-369967","[email protected]","http://www.alandrosenburgcpapc.co.uk")
("Evan","Zigomalas","Cap Gemini America","5, Binney St","Abbey Ward","Buckinghamshire","HP11 2AX","01937-864715","01714-737668","[email protected]","http://www.capgeminiamerica.co.uk")
("France","Andrade","Elliott, John W Esq","8 Moor Place","East Southbourne and Tuckton W","Bournemouth","BH6 3BE","01347-368222","01935-821636","[email protected]","http://www.elliottjohnwesq.co.uk")
I want to group by city, so I wrote:
grp = group user by $4;
dump grp;
I get this output:
(Binney St",{("Evan","Zigomalas","Cap Gemini America","5, Binney St","Abbey Ward","Buckinghamshire","HP11 2AX","01937-864715","01714-737668","[email protected]","http://www.capgeminiamerica.co.uk")})
("8 Moor Place",{("France","Andrade","Elliott, John W Esq","8 Moor Place","East Southbourne and Tuckton W","Bournemouth","BH6 3BE","01347-368222","01935-821636","[email protected]","http://www.elliottjohnwesq.co.uk")})
("St. Stephens Ward",{("Aleshia","Tomkiewicz","Alan D Rosenburg Cpa Pc","14 Taylor St","St. Stephens Ward","Kent","CT2 7PP","01835-703597","01944-369967","[email protected]","http://www.alandrosenburgcpapc.co.uk")})
The company_name and address fields are causing the problem, because they contain ',' as part of their values. For example, the address contains
"14, Taylor St"
and the company name contains
"Elliott, John W Esq"
So my $4 is treated as "Taylor St" instead of "St. Stephens Ward". Because of the extra delimiter inside the address or company_name data, the rows are not split into fields correctly, and the group by function does not give the correct result.
How can I achieve a group by that produces the following output?
("Abbey Ward",{("Evan","Zigomalas","Cap Gemini America","5, Binney St","Abbey Ward","Buckinghamshire","HP11 2AX","01937-864715","01714-737668","[email protected]","http://www.capgeminiamerica.co.uk")})
("St. Stephens Ward",{("Aleshia","Tomkiewicz","Alan D Rosenburg Cpa Pc","14, Taylor St","St. Stephens Ward","Kent","CT2 7PP","01835-703597","01944-369967","[email protected]","http://www.alandrosenburgcpapc.co.uk")})
("East Southbourne and Tuckton W",{("France","Andrade","Elliott, John W Esq","8 Moor Place","East Southbourne and Tuckton W","Bournemouth","BH6 3BE","01347-368222","01935-821636","[email protected]","http://www.elliottjohnwesq.co.uk")})
Note that
grp = group a by $5 ;
is not a solution for me. I have already considered that.
Try using CSVExcelStorage to load the data. It should respect the escaping and load the data properly. – LiMuBei
Try the same –
@LiMuBei: Thank you. Using 'CSVExcelStorage' did the job for me. Now I get the correct data after grouping... –
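For anyone landing here later, the suggestion above could look roughly like the sketch below. It is only a sketch under a few assumptions: CSVExcelStorage ships in piggybank rather than in core Pig, so the jar has to be registered first, and the jar path shown is hypothetical and depends on your installation. The input path and the grouping on $4 are taken from the question.

```pig
-- Assumption: piggybank.jar is at this (hypothetical) path; adjust it to your installation.
REGISTER /usr/lib/pig/piggybank.jar;

-- CSVExcelStorage reads a double-quoted value as one field, so the comma
-- inside "14, Taylor St" or "Elliott, John W Esq" does not start a new field.
user = LOAD '/home/abhijit/Downloads/User.csv'
       USING org.apache.pig.piggybank.storage.CSVExcelStorage(',');

-- With the quoted fields kept intact, the city is field $4 in every row.
grp = GROUP user BY $4;
DUMP grp;
```

The difference from the original script is only the loader: PigStorage(',') splits on every comma regardless of quotes, whereas CSVExcelStorage is quote-aware, which is presumably why grouping on $4 starts working without shifting to $5 for some rows.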