0
5つのジョブを実行している3ノードクラスタでDSE Sparkを使用しています。私たちはSIGTERMコマンドが/var/log/spark/worker/worker-0/worker.logに来て、私たちの仕事を止めているのを見ています。これらの時間に対応するメモリやプロセッサの制約はなく、手動でこれらの呼び出しを行った人はいません。Datastaxスパークジョブが理由なしで殺されました
私はYARNまたはMesosでヒープサイズの問題が発生するいくつかの同様の問題を経験しましたが、DSEを使用しているため、これらは関連性がないようです。
以下ジョブの2を実行していた1つのサーバーからのログ情報のサンプルです:
ERROR [SIGTERM handler] 2016-03-26 00:43:28,780 SignalLogger.scala:57 - RECEIVED SIGNAL 15: SIGTERM
ERROR [SIGHUP handler] 2016-03-26 00:43:28,788 SignalLogger.scala:57 - RECEIVED SIGNAL 1: SIGHUP
INFO [Spark Shutdown Hook] 2016-03-26 00:43:28,795 Logging.scala:59 - Killing process!
ERROR [File appending thread for /var/lib/spark/worker/worker-0/app-20160325131848-0001/0/stderr] 2016-03-26 00:43:28,848 Logging.scala:96 - Error writing stream to file /var/lib/spark/worker/worker-0/app-20160325131848-0001/0/stderr
java.io.IOException: Stream closed
at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170) ~[na:1.8.0_71]
at java.io.BufferedInputStream.read1(BufferedInputStream.java:283) ~[na:1.8.0_71]
at java.io.BufferedInputStream.read(BufferedInputStream.java:345) ~[na:1.8.0_71]
at java.io.FilterInputStream.read(FilterInputStream.java:107) ~[na:1.8.0_71]
at org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70) ~[spark-core_2.10-1.4.1.3.jar:1.4.1.3]
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39) [spark-core_2.10-1.4.1.3.jar:1.4.1.3]
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) [spark-core_2.10-1.4.1.3.jar:1.4.1.3]
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) [spark-core_2.10-1.4.1.3.jar:1.4.1.3]
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772) [spark-core_2.10-1.4.1.3.jar:1.4.1.3]
at org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38) [spark-core_2.10-1.4.1.3.jar:1.4.1.3]
ERROR [File appending thread for /var/lib/spark/worker/worker-0/app-20160325131848-0001/0/stdout] 2016-03-26 00:43:28,892 Logging.scala:96 - Error writing stream to file /var/lib/spark/worker/worker-0/app-20160325131848-0001/0/stdout
java.io.IOException: Stream closed
at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170) ~[na:1.8.0_71]
at java.io.BufferedInputStream.read1(BufferedInputStream.java:283) ~[na:1.8.0_71]
at java.io.BufferedInputStream.read(BufferedInputStream.java:345) ~[na:1.8.0_71]
at java.io.FilterInputStream.read(FilterInputStream.java:107) ~[na:1.8.0_71]
at org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70) ~[spark-core_2.10-1.4.1.3.jar:1.4.1.3]
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39) [spark-core_2.10-1.4.1.3.jar:1.4.1.3]
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) [spark-core_2.10-1.4.1.3.jar:1.4.1.3]
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) [spark-core_2.10-1.4.1.3.jar:1.4.1.3]
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772) [spark-core_2.10-1.4.1.3.jar:1.4.1.3]
at org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38) [spark-core_2.10-1.4.1.3.jar:1.4.1.3]
ERROR [SIGTERM handler] 2016-03-26 00:43:29,070 SignalLogger.scala:57 - RECEIVED SIGNAL 15: SIGTERM
INFO [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,079 Logging.scala:59 - Disassociated [akka.tcp://[email protected]:44131] -> [akka.tcp://[email protected]:7077] Disassociated !
ERROR [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,080 Logging.scala:75 - Connection to master failed! Waiting for master to reconnect...
INFO [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,081 Logging.scala:59 - Connecting to master akka.tcp://[email protected]:7077/user/Master...
WARN [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,091 Slf4jLogger.scala:71 - Association with remote system [akka.tcp://[email protected]:7077] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
INFO [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,101 Logging.scala:59 - Disassociated [akka.tcp://[email protected]:44131] -> [akka.tcp://[email protected]:7077] Disassociated !
ERROR [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,102 Logging.scala:75 - Connection to master failed! Waiting for master to reconnect...
INFO [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,102 Logging.scala:59 - Not spawning another attempt to register with the master, since there is an attempt scheduled already.
WARN [sparkWorker-akka.actor.default-dispatcher-4] 2016-03-26 00:43:29,323 Slf4jLogger.scala:71 - Association with remote system [akka.tcp://[email protected]:49943] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
INFO [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,330 Logging.scala:59 - Executor app-20160325132151-0004/0 finished with state EXITED message Command exited with code 129 exitStatus 129
INFO [Spark Shutdown Hook] 2016-03-26 00:43:29,414 Logging.scala:59 - Killing process!
INFO [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,415 Logging.scala:59 - Executor app-20160325131848-0001/0 finished with state EXITED message Command exited with code 129 exitStatus 129
INFO [Spark Shutdown Hook] 2016-03-26 00:43:29,417 Logging.scala:59 - Killing process!
INFO [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,422 Logging.scala:59 - Unknown Executor app-20160325132151-0004/0 finished with state EXITED message Worker shutting down exitStatus 129
WARN [sparkWorker-akka.actor.default-dispatcher-4] 2016-03-26 00:43:29,425 Slf4jLogger.scala:71 - Association with remote system [akka.tcp://[email protected]:32874] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
WARN [sparkWorker-akka.actor.default-dispatcher-4] 2016-03-26 00:43:29,433 Slf4jLogger.scala:71 - Association with remote system [akka.tcp://[email protected]:56212] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
INFO [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,441 Logging.scala:59 - Executor app-20160325131918-0002/1 finished with state EXITED message Command exited with code 129 exitStatus 129
INFO [sparkWorker-akka.actor.default-dispatcher-4] 2016-03-26 00:43:29,448 Logging.scala:59 - Unknown Executor app-20160325131918-0002/1 finished with state EXITED message Worker shutting down exitStatus 129
INFO [Spark Shutdown Hook] 2016-03-26 00:43:29,448 Logging.scala:59 - Shutdown hook called
INFO [Spark Shutdown Hook] 2016-03-26 00:43:29,449 Logging.scala:59 - Deleting directory /var/lib/spark/rdd/spark-28fa2f73-d2aa-44c0-ad4e-3ccfd07a95d2