I figured out that the problem was that my qloss() function pulled values out of the tensors, performed operations on those values, and returned the result. The values depended on the tensors, but because they weren't themselves wrapped in tensors, TensorFlow couldn't tell that they depended on the tensors in the graph (a minimal illustration of this failure mode follows the function below). I fixed this by changing qloss() to operate directly on the tensors and return a tensor. Here is the new function:
import tensorflow as tf

# BATCH_SIZE and DISCOUNT are module-level constants defined elsewhere.
def qloss(actions, rewards, target_Qs, pred_Qs):
    """
    Q-function loss with target freezing - the difference between the observed
    Q value, taking into account the recently received r (while holding future
    Qs at target), and the predicted Q value the agent had for (s, a) at the
    time of the update.

    Params:
    actions - The action for each experience in the minibatch
    rewards - The reward for each experience in the minibatch
    target_Qs - The target Q value from s' for each experience in the minibatch
    pred_Qs - The Q values predicted by the model network

    Returns:
    A tensor with the Q-function loss for each experience, where each error
    is clipped to [-1, 1] and then squared.
    """
    ys = rewards + DISCOUNT * target_Qs
    # For each row of pred_Qs in the batch, we want the predicted Q for the
    # action taken at that experience, so we build a 2D tensor of indices
    # [experience#, action#] to filter the pred_Qs tensor.
    gather_is = tf.stack([tf.range(BATCH_SIZE), actions], axis=1)
    action_Qs = tf.gather_nd(pred_Qs, gather_is)
    losses = ys - action_Qs
    clipped_squared_losses = tf.square(tf.minimum(tf.abs(losses), 1.0))
    return clipped_squared_losses
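
To see why the original version failed, here is a minimal sketch of the difference between computing with values pulled out of tensors and computing with the tensors themselves. This is my own illustration, not code from the original program, and it assumes TF 1.x graph mode:

import tensorflow as tf

sess = tf.Session()
x = tf.Variable(3.0)
sess.run(tf.global_variables_initializer())

# Broken pattern: sess.run() pulls a plain Python float out of the graph;
# arithmetic on it produces another float with no link back to the graph,
# so tf.gradients() has nothing to differentiate through.
value_loss = sess.run(x) ** 2          # just the number 9.0

# Tensor pattern: the op stays in the graph, so TensorFlow can trace the
# dependency on x and compute a gradient.
tensor_loss = tf.square(x)
grad = tf.gradients(tensor_loss, [x])  # works: d(x^2)/dx = 2x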
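
And because qloss() now returns a tensor that depends on pred_Qs, the loss can be wired straight into an optimizer and TensorFlow can backpropagate through it. A rough sketch of how it might be hooked up, continuing from the qloss() definition above - the placeholder names, NUM_ACTIONS, the stand-in pred_Qs, and the RMSProp hyperparameter are all my assumptions, not taken from the original code:

BATCH_SIZE = 32       # assumed minibatch size
DISCOUNT = 0.99       # assumed discount factor
NUM_ACTIONS = 4       # assumed size of the action space

# Placeholders for one minibatch of experiences (names are illustrative).
actions = tf.placeholder(tf.int32, [BATCH_SIZE])
rewards = tf.placeholder(tf.float32, [BATCH_SIZE])
target_Qs = tf.placeholder(tf.float32, [BATCH_SIZE])
# pred_Qs would normally be the model network's output; a variable stands
# in for it here so the sketch is self-contained.
pred_Qs = tf.Variable(tf.zeros([BATCH_SIZE, NUM_ACTIONS]))

losses = qloss(actions, rewards, target_Qs, pred_Qs)
# Since losses is a tensor that depends on pred_Qs, gradients flow back
# through gather_nd and the optimizer can minimize the mean loss.
train_op = tf.train.RMSPropOptimizer(0.00025).minimize(tf.reduce_mean(losses))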