2016-04-01

Tensorflow Adam Multigpu gradients

I am trying to implement a multi-GPU version of my network in TensorFlow, using the Adam optimizer.

I am working from the cifar10_multigpu example code, but when the gradients are computed for the second tower, the entries for the first tower's variables come back as None, and averaging over the two towers then raises an error. The code for the two towers is this:

for i, d in enumerate(devs): 
    with tf.device(d): 
     with tf.name_scope('%s_%d' % (tf_model.TOWER_NAME, i)) as scope: 
      loss = tower_loss(scope) 
      tf.get_variable_scope().reuse_variables() 
      summaries = tf.get_collection(tf.GraphKeys.SUMMARIES, scope) 
      grads = opt.compute_gradients(loss) 
      print('\n'.join('{}: {}'.format(*k) for k in enumerate(grads))) 
      tower_grads.append(grads) 

Looking at the gradients, the first tower produces this:

0: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca11d0ae10>) 
1: (<tf.Tensor 'tower_0/gradients/tower_0/conv1/Conv2D_grad/tuple/control_dependency_1:0' shape=(1, 1, 8, 16) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c351b10>) 
2: (<tf.Tensor 'tower_0/gradients/tower_0/conv1/BiasAdd_grad/tuple/control_dependency_1:0' shape=(16,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c380dd0>) 
3: (<tf.Tensor 'tower_0/gradients/tower_0/conv2/Conv2D_grad/tuple/control_dependency_1:0' shape=(45, 4, 16, 16) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c351a10>) 
4: (<tf.Tensor 'tower_0/gradients/tower_0/conv2/BiasAdd_grad/tuple/control_dependency_1:0' shape=(16,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c3a6dd0>) 
5: (<tf.Tensor 'tower_0/gradients/tower_0/conv3/Conv2D_grad/tuple/control_dependency_1:0' shape=(45, 4, 16, 32) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c3a6490>) 
6: (<tf.Tensor 'tower_0/gradients/tower_0/conv3/BiasAdd_grad/tuple/control_dependency_1:0' shape=(32,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c351990>) 
7: (<tf.Tensor 'tower_0/gradients/tower_0/conv4/Conv2D_grad/tuple/control_dependency_1:0' shape=(45, 4, 32, 64) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c351890>) 
8: (<tf.Tensor 'tower_0/gradients/tower_0/conv4/BiasAdd_grad/tuple/control_dependency_1:0' shape=(64,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c3b7790>) 
9: (<tf.Tensor 'tower_0/gradients/tower_0/conv5/Conv2D_grad/tuple/control_dependency_1:0' shape=(45, 4, 64, 128) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c2d9110>) 
10: (<tf.Tensor 'tower_0/gradients/tower_0/conv5/BiasAdd_grad/tuple/control_dependency_1:0' shape=(128,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c2849d0>) 
11: (<tf.Tensor 'tower_0/gradients/tower_0/conv6/Conv2D_grad/tuple/control_dependency_1:0' shape=(45, 4, 128, 256) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c2e6f10>) 
12: (<tf.Tensor 'tower_0/gradients/tower_0/conv6/BiasAdd_grad/tuple/control_dependency_1:0' shape=(256,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c2afed0>) 
13: (<tf.Tensor 'tower_0/gradients/tower_0/fc1/MatMul_grad/tuple/control_dependency_1:0' shape=(18944, 4096) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c1f9550>) 
14: (<tf.Tensor 'tower_0/gradients/tower_0/fc1/add_grad/tuple/control_dependency_1:0' shape=(4096,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c214a10>) 
15: (<tf.Tensor 'tower_0/gradients/tower_0/fc1_1/MatMul_grad/tuple/control_dependency_1:0' shape=(4096, 1024) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c23dfd0>) 
16: (<tf.Tensor 'tower_0/gradients/tower_0/fc1_1/add_grad/tuple/control_dependency_1:0' shape=(1024,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c269bd0>) 
17: (<tf.Tensor 'tower_0/gradients/tower_0/softmax_linear/MatMul_grad/tuple/control_dependency_1:0' shape=(1024, 360) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c1d1a50>) 
18: (<tf.Tensor 'tower_0/gradients/tower_0/softmax_linear/softmax_linear_grad/tuple/control_dependency_1:0' shape=(360,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c1def50>) 

where each tower is built by:

stream, target = placeholder_inputs(FLAGS.batch_size * tf_model.ANGLES/FLAGS.num_gpus) 
logits = tf_model.inference_noisy_simulate(stream) 
_ = tf_model.loss(logits, target) 
losses = tf.get_collection('losses', scope) 
total_loss = tf.add_n(losses, name='total_loss') 

and the second tower produces this:

0: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca11d0ae10>) 
1: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c351b10>) 
2: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c380dd0>) 
3: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c351a10>) 
4: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c3a6dd0>) 
5: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c3a6490>) 
6: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c351990>) 
7: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c351890>) 
8: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c3b7790>) 
9: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c2d9110>) 
10: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c2849d0>) 
11: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c2e6f10>) 
12: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c2afed0>) 
13: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c1f9550>) 
14: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c214a10>) 
15: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c23dfd0>) 
16: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c269bd0>) 
17: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c1d1a50>) 
18: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c1def50>) 
19: (<tf.Tensor 'tower_1/gradients/tower_1/conv1/Conv2D_grad/tuple/control_dependency_1:0' shape=(1, 1, 8, 16) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c178c50>) 
20: (<tf.Tensor 'tower_1/gradients/tower_1/conv1/BiasAdd_grad/tuple/control_dependency_1:0' shape=(16,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bfbb490>) 
21: (<tf.Tensor 'tower_1/gradients/tower_1/conv2/Conv2D_grad/tuple/control_dependency_1:0' shape=(45, 4, 16, 16) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bfda950>) 
22: (<tf.Tensor 'tower_1/gradients/tower_1/conv2/BiasAdd_grad/tuple/control_dependency_1:0' shape=(16,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bf91bd0>) 
23: (<tf.Tensor 'tower_1/gradients/tower_1/conv3/Conv2D_grad/tuple/control_dependency_1:0' shape=(45, 4, 16, 32) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bfcb590>) 
24: (<tf.Tensor 'tower_1/gradients/tower_1/conv3/BiasAdd_grad/tuple/control_dependency_1:0' shape=(32,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bf39e90>) 
25: (<tf.Tensor 'tower_1/gradients/tower_1/conv4/Conv2D_grad/tuple/control_dependency_1:0' shape=(45, 4, 32, 64) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bf499d0>) 
26: (<tf.Tensor 'tower_1/gradients/tower_1/conv4/BiasAdd_grad/tuple/control_dependency_1:0' shape=(64,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bf14fd0>) 
27: (<tf.Tensor 'tower_1/gradients/tower_1/conv5/Conv2D_grad/tuple/control_dependency_1:0' shape=(45, 4, 64, 128) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bf39150>) 
28: (<tf.Tensor 'tower_1/gradients/tower_1/conv5/BiasAdd_grad/tuple/control_dependency_1:0' shape=(128,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bebd8d0>) 
29: (<tf.Tensor 'tower_1/gradients/tower_1/conv6/Conv2D_grad/tuple/control_dependency_1:0' shape=(45, 4, 128, 256) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bf23110>) 
30: (<tf.Tensor 'tower_1/gradients/tower_1/conv6/BiasAdd_grad/tuple/control_dependency_1:0' shape=(256,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bf04610>) 
31: (<tf.Tensor 'tower_1/gradients/tower_1/fc1/MatMul_grad/tuple/control_dependency_1:0' shape=(18944, 4096) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bebdc50>) 
32: (<tf.Tensor 'tower_1/gradients/tower_1/fc1/add_grad/tuple/control_dependency_1:0' shape=(4096,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bebd310>) 
33: (<tf.Tensor 'tower_1/gradients/tower_1/fc1_1/MatMul_grad/tuple/control_dependency_1:0' shape=(4096, 1024) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0be96e10>) 
34: (<tf.Tensor 'tower_1/gradients/tower_1/fc1_1/add_grad/tuple/control_dependency_1:0' shape=(1024,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0be96990>) 
35: (<tf.Tensor 'tower_1/gradients/tower_1/softmax_linear/MatMul_grad/tuple/control_dependency_1:0' shape=(1024, 360) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0be52c90>) 
36: (<tf.Tensor 'tower_1/gradients/tower_1/softmax_linear/softmax_linear_grad/tuple/control_dependency_1:0' shape=(360,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bf56f50>) 

I would like to know how to get rid of the Nones belonging to the first tower's variables without hard-coding indices, so that the code scales to more towers.
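For context, compute_gradients pairs every trainable variable with a gradient and uses None for any variable the loss does not depend on. One generic way to drop those pairs without indexing by position, sketched here with plain Python values standing in for real tensors and variables:

```python
def drop_none_grads(grads_and_vars):
    """Keep only (gradient, variable) pairs whose gradient is not None.

    compute_gradients returns None for variables the loss does not
    depend on; filtering avoids hard-coding per-tower indices.
    """
    return [(g, v) for g, v in grads_and_vars if g is not None]

# Mock pairs standing in for compute_gradients output: the first two
# variables belong to another tower, so their gradients are None.
pairs = [(None, 'conv1/w'), (None, 'conv1/b'), (0.5, 'conv2/w'), (1.5, 'conv2/b')]
print(drop_none_grads(pairs))  # [(0.5, 'conv2/w'), (1.5, 'conv2/b')]
```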

Answer

I already found the error. I was using a trainable variable for the learning rate (I wanted to be able to track the lr, but that is not possible this way), and I also passed the list of variables in the current scope to the Adam optimizer's compute_gradients. I am not sure this is the right way, but it seems to work.

with tf.Graph().as_default(), tf.device('/cpu:0'): 
    devs = ['/job:prs/task:0/gpu:0', '/job:worker/task:0/gpu:0'] 
    global_step = tf.get_variable('global_step', [], 
            initializer=tf.constant_initializer(0), trainable=False) 
    num_batches_per_epoch = dt_fdr.FLS_PER_ANGLE/FLAGS.batch_size 
    #lr = tf.Variable(tf.constant(FLAGS.learning_rate, dtype=tf.float32)) 
    opt = tf.train.AdamOptimizer(FLAGS.learning_rate) 
    tower_grads = [] 
    for i in xrange(FLAGS.num_gpus): 
     with tf.device(devs[i]): 
      with tf.name_scope('%s_%d' % (tf_model.TOWER_NAME, i)) as scope: 
       loss = tower_loss(scope) 
       tf.get_variable_scope().reuse_variables() 
       summaries = tf.get_collection(tf.GraphKeys.SUMMARIES, scope) 
       # restrict the gradients to the variables of this tower's scope 
       grads = opt.compute_gradients(loss, 
           tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope)) 
       #print('\n'.join('{}: {}'.format(*k) for k in enumerate(grads))) 
       tower_grads.append(grads) 
    grads = average_gradients(tower_grads) 
    #summaries.append(tf.scalar_summary('learning_rate', lr)) 
    for grad, var in grads: 
     if grad is not None: 
      summaries.append(
       tf.histogram_summary(var.op.name + '/gradients', grad)) 
    apply_gradient_op = opt.apply_gradients(grads, global_step=global_step) 
    for var in tf.trainable_variables(): 
     summaries.append(tf.histogram_summary(var.op.name, var)) 

    train_op = apply_gradient_op 

    saver = tf.train.Saver(tf.all_variables()) 

    summary_op = tf.merge_summary(summaries) 

    init = tf.initialize_all_variables() 

    sess = tf.Session("grpc://nelson-lab:2500", config=tf.ConfigProto(
     allow_soft_placement=True, 
     log_device_placement=FLAGS.log_device_placement)) 
    sess.run(init) 
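average_gradients is not shown above; in the CIFAR-10 multi-GPU tutorial it stacks each variable's per-tower gradients and takes the element-wise mean. The reduction itself, sketched in NumPy with illustrative names and scalar "gradients" in place of real tensors:

```python
import numpy as np

def average_tower_grads(tower_grads):
    """Average per-tower (gradient, variable) lists element-wise.

    tower_grads[t] is the list of (gradient, variable) pairs for tower
    t; every tower must list the same variables in the same order.
    Returns one averaged list, mirroring what average_gradients does.
    """
    averaged = []
    for grads_and_vars in zip(*tower_grads):
        grads = np.stack([g for g, _ in grads_and_vars])
        var = grads_and_vars[0][1]  # the variable is shared across towers
        averaged.append((grads.mean(axis=0), var))
    return averaged

# Two towers, two variables.
tower0 = [(np.array(1.0), 'w'), (np.array(3.0), 'b')]
tower1 = [(np.array(3.0), 'w'), (np.array(5.0), 'b')]
print(average_tower_grads([tower0, tower1]))  # w -> 2.0, b -> 4.0
```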

I wonder if anyone else has tried doing some double-GPU training with Adam.

Regards

Thanks for sharing your implementation. I am also looking into parallelizing training over multiple GPUs. Did you find success with yours? –

If you plan to update the learning rate during the training phase, declare it like this:

lr = tf.Variable(FLAGS.learning_rate, trainable=False) 
opt = tf.train.AdamOptimizer(lr) 

and then update it during training with:

sess.run(tf.assign(lr, new_lr)) 
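One typical way to produce new_lr for that assign is a step-wise exponential decay; the helper name and constants below are illustrative, not from the answer:

```python
def decayed_lr(base_lr, decay_rate, epoch, decay_every):
    """Multiply the base learning rate by decay_rate every decay_every epochs."""
    return base_lr * (decay_rate ** (epoch // decay_every))

# In the training loop one would then run, e.g.:
#   sess.run(tf.assign(lr, decayed_lr(0.1, 0.5, epoch, 5)))
for epoch in (0, 5, 10):
    print(epoch, decayed_lr(0.1, 0.5, epoch, 5))
```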