Following on from another question I asked, I am analyzing the Amazon Electronics reviews 5-core dataset (1,689,188 reviews) from http://jmcauley.ucsd.edu/data/amazon/ with Apache Pig on my Cloudera VM.
Example review:
{ "reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", "reviewerName": "J. McDonald", "helpful": [2, 3], "reviewText": "I bought this for my husband who plays the piano. He is having a wonderful time playing these old hymns. The music is at times hard to read because we think the book was published for singing from more than playing from. Great purchase though!", "overall": 5.0, "summary": "Heavenly Highway Hymns", "unixReviewTime": 1252800000, "reviewTime": "09 13, 2009" }
grunt> reviews = LOAD 'amazon/amazon-pro/reviews.json' USING org.apache.pig.builtin.JsonLoader('id:chararray, asin:int, reviewerName: chararray, helpful:(int), reviewText:chararray, overall:float, summary:chararray, time:int, reviewTime:chararray');
grunt> viewReview = LIMIT reviews 1;
grunt> DUMP viewReview;
I get the following error, and I think the problem is with the schema definition for helpful:
2016-11-17 08:05:33,797 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: LIMIT
2016-11-17 08:05:35,897 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2016-11-17 08:05:36,531 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 2
2016-11-17 08:05:36,532 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 2
2016-11-17 08:05:37,577 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2016-11-17 08:05:38,183 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2016-11-17 08:05:38,225 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 1
2016-11-17 08:05:38,230 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job974442700781595171.jar
2016-11-17 08:05:57,665 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job974442700781595171.jar created
2016-11-17 08:05:57,754 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2016-11-17 08:05:58,090 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2016-11-17 08:05:58,347 [JobControl] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2016-11-17 08:05:58,614 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2016-11-17 08:06:00,041 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.df.interval is deprecated. Instead, use fs.df.interval
2016-11-17 08:06:00,041 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.max.objects is deprecated. Instead, use dfs.namenode.max.objects
2016-11-17 08:06:00,041 [JobControl] WARN org.apache.hadoop.conf.Configuration - hadoop.native.lib is deprecated. Instead, use io.native.lib.available
2016-11-17 08:06:00,041 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.data.dir is deprecated. Instead, use dfs.datanode.data.dir
2016-11-17 08:06:00,041 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.name.dir is deprecated. Instead, use dfs.namenode.name.dir
2016-11-17 08:06:00,041 [JobControl] WARN org.apache.hadoop.conf.Configuration - fs.default.name is deprecated. Instead, use fs.defaultFS
2016-11-17 08:06:00,041 [JobControl] WARN org.apache.hadoop.conf.Configuration - fs.checkpoint.dir is deprecated. Instead, use dfs.namenode.checkpoint.dir
2016-11-17 08:06:00,041 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.block.size is deprecated. Instead, use dfs.blocksize
2016-11-17 08:06:00,041 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.access.time.precision is deprecated. Instead, use dfs.namenode.accesstime.precision
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.replication.min is deprecated. Instead, use dfs.namenode.replication.min
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.name.edits.dir is deprecated. Instead, use dfs.namenode.edits.dir
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.replication.considerLoad is deprecated. Instead, use dfs.namenode.replication.considerLoad
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.balance.bandwidthPerSec is deprecated. Instead, use dfs.datanode.balance.bandwidthPerSec
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.safemode.threshold.pct is deprecated. Instead, use dfs.namenode.safemode.threshold-pct
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.http.address is deprecated. Instead, use dfs.namenode.http-address
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.name.dir.restore is deprecated. Instead, use dfs.namenode.name.dir.restore
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.https.client.keystore.resource is deprecated. Instead, use dfs.client.https.keystore.resource
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.backup.address is deprecated. Instead, use dfs.namenode.backup.address
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.backup.http.address is deprecated. Instead, use dfs.namenode.backup.http-address
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.permissions is deprecated. Instead, use dfs.permissions.enabled
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.safemode.extension is deprecated. Instead, use dfs.namenode.safemode.extension
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.datanode.max.xcievers is deprecated. Instead, use dfs.datanode.max.transfer.threads
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.https.need.client.auth is deprecated. Instead, use dfs.client.https.need-auth
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.https.address is deprecated. Instead, use dfs.namenode.https-address
2016-11-17 08:06:00,043 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.replication.interval is deprecated. Instead, use dfs.namenode.replication.interval
2016-11-17 08:06:00,043 [JobControl] WARN org.apache.hadoop.conf.Configuration - fs.checkpoint.edits.dir is deprecated. Instead, use dfs.namenode.checkpoint.edits.dir
2016-11-17 08:06:00,043 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.write.packet.size is deprecated. Instead, use dfs.client-write-packet-size
2016-11-17 08:06:00,043 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.permissions.supergroup is deprecated. Instead, use dfs.permissions.superusergroup
2016-11-17 08:06:00,043 [JobControl] WARN org.apache.hadoop.conf.Configuration - topology.script.number.args is deprecated. Instead, use net.topology.script.number.args
2016-11-17 08:06:00,043 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.umaskmode is deprecated. Instead, use fs.permissions.umask-mode
2016-11-17 08:06:00,043 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.secondary.http.address is deprecated. Instead, use dfs.namenode.secondary.http-address
2016-11-17 08:06:00,045 [JobControl] WARN org.apache.hadoop.conf.Configuration - fs.checkpoint.period is deprecated. Instead, use dfs.namenode.checkpoint.period
2016-11-17 08:06:00,045 [JobControl] WARN org.apache.hadoop.conf.Configuration - topology.node.switch.mapping.impl is deprecated. Instead, use net.topology.node.switch.mapping.impl
2016-11-17 08:06:00,045 [JobControl] WARN org.apache.hadoop.conf.Configuration - io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
2016-11-17 08:06:00,217 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2016-11-17 08:06:00,270 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 11
2016-11-17 08:06:01,755 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201611170800_0001
2016-11-17 08:06:01,755 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases r,reviews
2016-11-17 08:06:01,755 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: reviews[1,10],r[2,4] C: R:
2016-11-17 08:06:01,755 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://localhost.localdomain:50030/jobdetails.jsp?jobid=job_201611170800_0001
2016-11-17 08:09:30,985 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2016-11-17 08:09:31,500 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201611170800_0001 has failed! Stop running all dependent jobs
2016-11-17 08:09:31,538 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2016-11-17 08:09:31,596 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: org.codehaus.jackson.JsonParseException: Current token (VALUE_STRING) not numeric, can not use numeric value accessors
 at [Source: [email protected]; line: 1, column: 43]
    at org.codehaus.jackson.JsonParser._constructError(JsonParser.java:1291)
    at org.codehaus.jackson.impl.JsonParserMinimalBase._reportError(JsonParserMinimalBase.java:385)
    at org.codehaus.jackson.impl.JsonNumericParserBase._parseNumericValue(JsonNumericParserBase.java:399)
    at org.codehaus.jackson.impl.JsonNumericParserBase.getIntValue(JsonNumericParserBase.java:254)
    at org.apache.pig.builtin.JsonLoader.readField(JsonLoader.java:189)
    at org.apache.pig.builtin.JsonLoader.getNext(JsonLoader.java:157)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:211)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:483)
    at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:76)
    at org.apache.hadoop.map
2016-11-17 08:09:31,597 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2016-11-17 08:09:31,602 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:

HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.0.0-cdh4.7.0 0.11.0-cdh4.7.0 cloudera 2016-11-17 08:05:37 2016-11-17 08:09:31 LIMIT

Failed!

Failed Jobs:
JobId Alias Feature Message Outputs
job_201611170800_0001 r,reviews Message: Job failed!

Input(s):
Failed to read data from "hdfs://localhost.localdomain:8020/user/cloudera/amazon/amazon-pro/reviews.json"

Output(s):

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_201611170800_0001 -> null, null

2016-11-17 08:09:31,602 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2016-11-17 08:09:31,635 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias r
Details at logfile: /home/cloudera/pig_1479349681179.log
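The stack trace shows JsonLoader.readField calling getIntValue on a string token, so at least one field declared as int holds a quoted string in the JSON. A quick way to see which fields disagree with the declared schema is to check the sample record outside Pig. This is only a rough Python sketch: the SCHEMA list and the mismatches helper are made up for illustration, the reviewText value is shortened, and it assumes the loader matches fields by position rather than by name.

```python
import json

# Sample record from the question (reviewText shortened for brevity).
record = json.loads(
    '{"reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", '
    '"reviewerName": "J. McDonald", "helpful": [2, 3], '
    '"reviewText": "I bought this for my husband who plays the piano.", '
    '"overall": 5.0, "summary": "Heavenly Highway Hymns", '
    '"unixReviewTime": 1252800000, "reviewTime": "09 13, 2009"}'
)

# Declared Pig schema from the LOAD statement, in order (hypothetical helper data).
SCHEMA = [
    ("id", "chararray"), ("asin", "int"), ("reviewerName", "chararray"),
    ("helpful", "(int)"), ("reviewText", "chararray"), ("overall", "float"),
    ("summary", "chararray"), ("time", "int"), ("reviewTime", "chararray"),
]

# Python types that correspond to the scalar Pig types used above.
SCALARS = {"chararray": str, "int": int, "float": float}


def mismatches(record, schema):
    """Return (json_key, declared_pig_type, actual_python_type) per suspect field.

    Fields are paired by position. Non-scalar declarations such as "(int)"
    are always reported so they can be checked by hand.
    """
    suspects = []
    for (_, pig_type), (key, value) in zip(schema, record.items()):
        expected = SCALARS.get(pig_type)
        if expected is None or not isinstance(value, expected):
            suspects.append((key, pig_type, type(value).__name__))
    return suspects


for key, declared, actual in mismatches(record, SCHEMA):
    print(f"{key}: declared {declared}, JSON value is {actual}")
```

For the sample record this flags asin (declared int, but a quoted string in the JSON, which would fit the "VALUE_STRING not numeric" message near the start of the line) and helpful (declared as a one-int tuple, but a two-element array in the JSON).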
Thanks Brian. That works.
I am now querying the tuple values in the bag. Any ideas on this?
reviewHelpPercent = FOREACH reviews GENERATE helpful, helpful.(helpful)$0 / helpful$1 * 100 AS helpPercent, reviewTime;
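For reference, the arithmetic that expression is after can be sketched in Python, reading helpful as [helpful votes, total votes] as the $0 / $1 in the comment implies. The help_percent function is a hypothetical name, not part of Pig. Two things to watch for: a review with no votes would divide by zero, and dividing two integers with integer division gives 0 before the multiplication by 100 (as far as I know Pig also does integer division on two int values, so a cast to float or double may be needed there).

```python
def help_percent(helpful):
    """Return helpful votes as a percentage of total votes, or None if there are no votes."""
    up, total = helpful
    if total == 0:
        # No votes at all: the percentage is undefined rather than 0.
        return None
    # True (floating-point) division, then scale to a percentage.
    return up / total * 100


# The sample review has helpful = [2, 3].
print(help_percent([2, 3]))

# Integer division truncates first, so the result collapses to 0.
print(2 // 3 * 100)
```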