
Spark S3 access denied when using a wildcard: I'd like to know why the result differs when I try to read data from S3 with Spark using a glob pattern.

I have some files in the bucket "test":

/test/logs/2016-07-01/a.gz 
/test/logs/2016-07-02/a.gz 
/test/logs/2016-07-03/a.gz 

These two work:

val logRDD = sqlContext.read.json("s3a://test/logs/2016-07-01/*.gz") 
or 
val logRDD = sqlContext.read.json("s3n://test/logs/2016-07-01/*.gz") 

But when I do this:

val logRDD = sqlContext.read.json("s3a://test/logs/2016-07-0*/*.gz") 

I get this:

16/09/29 04:35:13 ERROR ApplicationMaster: User class threw exception: com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: xxxx, AWS Error Code: null, AWS Error Message: Forbidden, S3 Extended Request ID: xxx= 
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: xxx, AWS Error Code: null, AWS Error Message: Forbidden, S3 Extended Request ID: xxx= 
    at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798) 
    at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421) 
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232) 
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528) 
    at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:976) 
    at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:956) 
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:952) 
    at org.apache.hadoop.fs.s3a.S3AFileSystem.listStatus(S3AFileSystem.java:794) 
    at org.apache.hadoop.fs.Globber.listStatus(Globber.java:69) 
    at org.apache.hadoop.fs.Globber.glob(Globber.java:217) 
    at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1655) 
    at org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:276) 
    at org.apache.spark.deploy.SparkHadoopUtil.globPathIfNecessary(SparkHadoopUtil.scala:283) 
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$11.apply(ResolvedDataSource.scala:173) 
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$11.apply(ResolvedDataSource.scala:169) 
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) 
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) 
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) 
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) 
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) 
    at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:108) 
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:169) 
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119) 
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109) 
    at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:244) 
    at com.test.LogParser$.main(LogParser.scala:295) 
    at com.test.LogParser.main(LogParser.scala) 
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
    at java.lang.reflect.Method.invoke(Method.java:497) 
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:559) 

Or if I use this:

val logRDD = sqlContext.read.json("s3n://test/logs/2016-07-0*/*.gz") 

then I get this:

16/09/29 04:08:57 ERROR ApplicationMaster: User class threw exception: org.apache.hadoop.security.AccessControlException: Permission denied: s3n://test/logs 
org.apache.hadoop.security.AccessControlException: Permission denied: s3n://test/logs 
    at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:449) 
    at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:427) 
    at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.handleException(Jets3tNativeFileSystemStore.java:411) 
    at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:181) 
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
    at java.lang.reflect.Method.invoke(Method.java:497) 
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) 
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) 
    at org.apache.hadoop.fs.s3native.$Proxy42.retrieveMetadata(Unknown Source) 
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem.listStatus(NativeS3FileSystem.java:530) 
    at org.apache.hadoop.fs.Globber.listStatus(Globber.java:69) 
    at org.apache.hadoop.fs.Globber.glob(Globber.java:217) 
    at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1674) 
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:259) 
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229) 
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315) 
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203) 
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) 
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) 
    at scala.Option.getOrElse(Option.scala:120) 
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) 
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) 
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) 
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) 
    at scala.Option.getOrElse(Option.scala:120) 
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) 
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) 
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) 
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) 
    at scala.Option.getOrElse(Option.scala:120) 
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) 
    at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1136) 
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) 
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) 
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:323) 
    at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1134) 
    at org.apache.spark.sql.execution.datasources.json.InferSchema$.infer(InferSchema.scala:65) 
    at org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4.apply(JSONRelation.scala:114) 
    at org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4.apply(JSONRelation.scala:109) 
    at scala.Option.getOrElse(Option.scala:120) 
    at org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$lzycompute(JSONRelation.scala:109) 
    at org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(JSONRelation.scala:108) 
    at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:636) 
    at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:635) 
    at org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:37) 
    at org.apache.spark.sql.SQLContext.baseRelationToDataFrame(SQLContext.scala:442) 
    at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:288) 
    at com.test.LogParser$.main(LogParser.scala:294) 
    at com.test.LogParser.main(LogParser.scala) 
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
    at java.lang.reflect.Method.invoke(Method.java:497) 
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:559) 
Caused by: org.jets3t.service.impl.rest.HttpException 
    at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:423) 
    at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:277) 
    at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestHead(RestStorageService.java:1038) 
    at org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectImpl(RestStorageService.java:2250) 
    at org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectDetailsImpl(RestStorageService.java:2179) 
    at org.jets3t.service.StorageService.getObjectDetails(StorageService.java:1120) 
    at org.jets3t.service.StorageService.getObjectDetails(StorageService.java:575) 
    at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:174) 
    ... 52 more 

Why are they different?

Answers

Check whether you have list permission on this bucket.
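Both failing traces end up in org.apache.hadoop.fs.Globber.listStatus: expanding 2016-07-0* requires a LIST at the logs/ level, whereas the working reads only list inside one specific date directory, so a prefix-restricted bucket policy may allow the latter and deny the former. Below is a minimal standalone check, as a sketch only: it assumes the AWS SDK v1 that the s3a connector uses, the default credential chain, and the bucket/prefix names from the question.

import com.amazonaws.services.s3.AmazonS3Client
import scala.collection.JavaConverters._

object ListPermissionCheck {
  def main(args: Array[String]): Unit = {
    // No-arg constructor uses the default credential chain
    // (env vars, system properties, instance profile).
    val s3 = new AmazonS3Client()
    // This is the same kind of LIST call Hadoop's Globber needs in order
    // to expand 2016-07-0*; it throws an AmazonS3Exception with status 403
    // (as in the s3a trace above) if s3:ListBucket is not granted for
    // this prefix.
    val listing = s3.listObjects("test", "logs/")
    listing.getObjectSummaries.asScala.foreach(o => println(o.getKey))
  }
}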

Provide a base path like this:

spark.read.option("basePath", basePath2).json(paths.toSeq:_*) 

The base path is the longest prefix that stays identical across all the paths you want to read.
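A minimal sketch of that advice against the question's layout; basePath and paths are illustrative names, and spark is assumed to be a Spark 2.x SparkSession as in the snippet above (the question itself used sqlContext):

// The longest prefix that is identical in every input path; only the
// date directories below it vary.
val basePath = "s3a://test/logs/"

// Enumerate the date directories explicitly instead of globbing over them.
val paths = Seq(
  "s3a://test/logs/2016-07-01/",
  "s3a://test/logs/2016-07-02/",
  "s3a://test/logs/2016-07-03/"
)

// basePath tells Spark where the dataset root is, so the listed
// directories are resolved as parts of one dataset rather than through
// a top-level wildcard expansion.
val logDF = spark.read
  .option("basePath", basePath)
  .json(paths: _*)

Note that Spark still has to list each of these prefixes, so this helps only if the credentials allow listing under the individual date directories (see the first answer).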
