0

PySparkでいくつかの結合を行い、結果をHiveに保存しようとしています。スニペットは、小さなデータセットで動作しますが、データサイズが大きくなるが、私はエラーの下に取得する場合は、以下大きなRDDをHiveに書き込む - アンロードメモリをストレージメモリに転送できませんでした

は、ソフトウェアバージョン

  • HDFS 2.7.3
  • ハイブ1000年2月1日
  • YARN 2.7.3です
  • MapReduce2 2.7.3
  • Spark2 2.1.1
  • のPython 3.5.2
  • Hortonworks 2.6.2.0-205

コード:

hdfs_df.write.mode("append").format("orc").save("HIVE PATH") 

例外:

File "/grid/0/hadoop/yarn/local/usercache/pentaho/appcache/application_1512030580416_0001/container_e16_1512030580416_0001_01_000001/pyspark.zip/pyspark/sql/readwriter.py", line 550, in save 
    File "/grid/0/hadoop/yarn/local/usercache/pentaho/appcache/application_1512030580416_0001/container_e16_1512030580416_0001_01_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__ 
    File "/grid/0/hadoop/yarn/local/usercache/pentaho/appcache/application_1512030580416_0001/container_e16_1512030580416_0001_01_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco 
    File "/grid/0/hadoop/yarn/local/usercache/pentaho/appcache/application_1512030580416_0001/container_e16_1512030580416_0001_01_000001/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value 
py4j.protocol.Py4JJavaError: An error occurred while calling o439.save. 
: org.apache.spark.SparkException: Job aborted. 
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:147) 
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121) 
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121) 
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) 
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:121) 
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:101) 
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) 
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) 
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) 
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) 
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) 
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138) 
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) 
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135) 
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) 
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92) 
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92) 
    at org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:484) 
    at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:520) 
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215) 
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:198) 
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
    at java.lang.reflect.Method.invoke(Method.java:498) 
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) 
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) 
    at py4j.Gateway.invoke(Gateway.java:280) 
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) 
    at py4j.commands.CallCommand.execute(CallCommand.java:79) 
    at py4j.GatewayConnection.run(GatewayConnection.java:214) 
    at java.lang.Thread.run(Thread.java:745) 
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 21 in stage 25.0 failed 4 times, most recent failure: Lost task 21.3 in stage 25.0 (TID 6957, dgsddevhdp14.mcs.local, executor 2): java.lang.AssertionError: assertion failed: transferring unroll memory to storage memory failed 
    at scala.Predef$.assert(Predef.scala:170) 
    at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:382) 
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032) 
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1007) 
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:947) 
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1007) 
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:711) 
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:285) 
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) 
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) 
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) 
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) 
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) 
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) 
    at org.apache.spark.scheduler.Task.run(Task.scala:99) 
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) 
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
    at java.lang.Thread.run(Thread.java:745) 

答えて

1

あなたが保存する前に、たぶん(200)(または他の数).coalesceしようそれ。

+0

hdfs_df.coalesce(100).write.mode( "append")。format( "orc")。save( "HIVE_PATH")いいえ運:( –

+0

あなたはこれを持っています:原因:..... java.lang.AssertionError:アサーションが失敗しました:ストレージメモリへのメモリのアンロードが失敗しました。ここでこの問題を確認してください:https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/ spark/storage/memory/MemoryStore.scala。あなたが労働者のためのより多くのメモリを必要とするように見える... – user3689574

関連する問題