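Permission denied in distributed MXNet training
I am working with MXNet, a deep learning library. It can be run on a single machine or distributed across several CPU machines. I followed the tutorial on the official MXNet site. The single-machine version ran without any problems and produced results.
Next, I tried distributed training across multiple CPU machines. I created an AWS account and launched three t2.micro Ubuntu instances. Then I entered the following command line: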
../../tools/launch.py -n 2 python train_mnist.py --kv-store dist_sync
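As I understand it, this command line should run the distributed version of the training with two workers and one server.
Unfortunately, I got an error. I understand that it means access is being denied, but when I connect from the server to the other two workers with the following command, it works: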
ssh -i key.pem [email protected] number.
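One difference I can see is that the ssh command shown in the traceback below (`ssh -o StrictHostKeyChecking=no ip -p 22 ...`) does not pass `-i key.pem`, while my manual test does. The following is only a minimal sketch for checking both forms from the launching machine. The helper names `build_ssh_check` and `can_login` are hypothetical, and I am assuming that `BatchMode=yes` is acceptable here so that ssh fails instead of prompting.

```python
import subprocess


def build_ssh_check(host, key_path=None, port=22):
    """Build an ssh argv that only tests whether a non-interactive login works.

    With key_path=None the command resembles the one in the traceback
    (no -i option); with key_path set it resembles the manual test.
    """
    cmd = ["ssh", "-o", "StrictHostKeyChecking=no", "-o", "BatchMode=yes"]
    if key_path is not None:
        cmd += ["-i", key_path]  # explicit identity file, as in the manual test
    # "true" is a no-op remote command: exit status 0 means the login succeeded
    cmd += ["-p", str(port), host, "true"]
    return cmd


def can_login(host, key_path=None, port=22):
    """Return True if ssh exits with status 0 for the given host."""
    return subprocess.call(build_ssh_check(host, key_path, port)) == 0
```

If `can_login(host, key_path="key.pem")` returns True but `can_login(host)` returns False for the same host, that would suggest the key is only found when it is given explicitly.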
Here is the error I am seeing:
Permission denied (publickey).
Exception in thread Thread-3:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(*self.__args, **self.__kwargs)
File "/home/ubuntu/Research/code/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/ssh.py", line 60, in run
subprocess.check_call(prog, shell = True)
File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
raise CalledProcessError(retcode, cmd)
CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no ip -p 22 'export LD_LIBRARY_PATH=:/usr/local/cuda/lib64; export DMLC_SERVER_ID=0; export DMLC_WORKER_ID=0; export DMLC_PS_ROOT_URI=ip; export DMLC_ROLE=worker; export DMLC_PS_ROOT_PORT=9091; export DMLC_NUM_WORKER=1; export DMLC_NUM_SERVER=1; cd /home/ubuntu/Research/code/mxnet/example/image-classification/; python train_mnist.py --network lenet --kv-store dist_sync'' returned non-zero exit status 255
Permission denied (publickey).
Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(*self.__args, **self.__kwargs)
File "/home/ubuntu/Research/code/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/ssh.py", line 60, in run
subprocess.check_call(prog, shell = True)
File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
raise CalledProcessError(retcode, cmd)
CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no ip -p 22 'export LD_LIBRARY_PATH=:/usr/local/cuda/lib64; export DMLC_SERVER_ID=0; export DMLC_PS_ROOT_URI=ip; export DMLC_ROLE=server; export DMLC_PS_ROOT_PORT=9091; export DMLC_NUM_WORKER=1; export DMLC_NUM_SERVER=1; cd /home/ubuntu/Research/code/mxnet/example/image-classification/; python train_mnist.py --network lenet --kv-store dist_sync'' returned non-zero exit status 255
[03:48:39] /home/ubuntu/mxnet/dmlc-core/include/dmlc/./logging.h:300: [03:48:39] src/kvstore/kvstore.cc:37: compile with USE_DIST_KVSTORE=1 to use dist_sync
Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7ff5b87a156c]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet7KVStore6CreateEPKc+0x5f4) [0x7ff5b91323d4]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(MXKVStoreCreate+0xd) [0x7ff5b905b14d]
[bt] (3) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7ff5bb86dadc]
[bt] (4) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x1fc) [0x7ff5bb86d40c]
[bt] (5) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48e) [0x7ff5bba845fe]
[bt] (6) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(+0x15f9e) [0x7ff5bba85f9e]
[bt] (7) python(PyEval_EvalFrameEx+0x98d) [0x5244dd]
[bt] (8) python(PyEval_EvalCodeEx+0x2b1) [0x555551]
[bt] (9) python(PyEval_EvalFrameEx+0x7e8) [0x524338]
Traceback (most recent call last):
File "train_mnist.py", line 76, in <module>
fit.fit(args, sym, get_mnist_iter)
File "/home/ubuntu/Research/code/mxnet/example/image-classification/common/fit.py", line 97, in fit
kv = mx.kvstore.create(args.kv_store)
File "/usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/kvstore.py", line 403, in create
ctypes.byref(handle)))
File "/usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/base.py", line 77, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [03:48:39] src/kvstore/kvstore.cc:37: compile with USE_DIST_KVSTORE=1 to use dist_sync
Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7ff5b87a156c]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet7KVStore6CreateEPKc+0x5f4) [0x7ff5b91323d4]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(MXKVStoreCreate+0xd) [0x7ff5b905b14d]
[bt] (3) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7ff5bb86dadc]
[bt] (4) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x1fc) [0x7ff5bb86d40c]
[bt] (5) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48e) [0x7ff5bba845fe]
[bt] (6) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(+0x15f9e) [0x7ff5bba85f9e]
[bt] (7) python(PyEval_EvalFrameEx+0x98d) [0x5244dd]
[bt] (8) python(PyEval_EvalCodeEx+0x2b1) [0x555551]
[bt] (9) python(PyEval_EvalFrameEx+0x7e8) [0x524338]
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(*self.__args, **self.__kwargs)
File "/home/ubuntu/Research/code/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/tracker.py", line 363, in <lambda>
target=(lambda: subprocess.check_call(self.cmd, env=env, shell=True)), args=())
File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
raise CalledProcessError(retcode, cmd)
CalledProcessError: Command 'python train_mnist.py --network lenet --kv-store dist_sync' returned non-zero exit status 1
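Reading the output above, two different messages seem to appear: `Permission denied (publickey).` from ssh, and `compile with USE_DIST_KVSTORE=1 to use dist_sync` from MXNet. To keep them apart when re-running, here is a minimal sketch that scans saved launcher output for those two strings. The function name and the cause labels are hypothetical, and treating these as two independent causes is my assumption, not something I have confirmed.

```python
def classify_failures(log_text):
    """Return the set of distinct failure messages found in launcher output.

    The labels are illustrative only; the matched strings are copied
    from the error output in this post.
    """
    causes = set()
    for line in log_text.splitlines():
        if "Permission denied (publickey)" in line:
            causes.add("ssh_publickey")
        if "compile with USE_DIST_KVSTORE=1" in line:
            causes.add("missing_dist_kvstore")
    return causes
```

An empty set only means that neither of these two strings occurred, not that the run succeeded.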