Efficient kernel implementation in Theano

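I implemented a Gaussian kernel in Theano. However, when I test it as part of a neural network, it takes too long. The kernel subtraction does not seem to be parallelized: training the whole network uses a single processing core. Is there a way to properly guide Theano so that it splits up the kernel operation?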

import theano.tensor as T 
import numpy 
import theano 

batch_s=5 
dims=10 
hidd_s=3 
out_s=2 

missing_param = None #"ignore" 

rng = numpy.random.RandomState(1234) 
input = T.matrix("input") 
X = numpy.asarray(rng.uniform(low=-2.1, high=5.0, size=(batch_s, dims))) 

def layer(x): 

    W=theano.shared(
     value=numpy.asarray(
      rng.uniform(low=0.001, high=1.0, size=(dims, hidd_s)), 
       dtype=theano.config.floatX), 
     name='W', borrow=True) 

    S=theano.shared(
     value=numpy.asarray(
      rng.uniform(low=10.0, high=100.0, size=(hidd_s,)), 
       dtype=theano.config.floatX), 
     name='S', borrow=True) 

    dot_H = theano.shared(
     value=numpy.zeros((batch_s, hidd_s), 
      dtype=theano.config.floatX), 
     name='dot_H', borrow=True) 
    # This is the kernel operation. I have tested with a single scan as well 
    # as with two nested scans, but the operations are not split up as in the 
    # case of the usual dot product T.dot(). 
    for i in range(batch_s): 
     for j in range(hidd_s): 
      dot_H = T.set_subtensor(dot_H[i,j], 
        T.exp(-(W.T[j] - x[i]).norm(2) ** 2)/2 * S[j] ** 2) 
    return dot_H 

layer_out = theano.function(
          inputs=[input], 
          outputs=layer(input), 
          on_unused_input=missing_param 
          ) 
print(layer_out(X))

Thank you very much.

2017-01-11 Nacho

If you are building a neural network, try [Intel Theano](https://github.com/intel/Theano). It makes the CPU much faster, with optimized convolution, relu and other primitives. – Patric

If you remove the loops, Theano will be able to optimize the parallelization.

First, you can avoid the inner loop by doing:

for i in range(batch_s): 
    T.exp(-(W.T - X[i]).norm(2, axis=1) ** 2) / 2 * S ** 2

Then you can use map for the outer loop:

import theano.tensor as T 
import numpy 
import theano 
import timeit 

start = timeit.default_timer() 
batch_s=5 
dims=10 
hidd_s=3 
out_s=2 

missing_param = None #"ignore" 

rng = numpy.random.RandomState(1234) 
input = T.matrix("input") 
X = numpy.asarray(rng.uniform(low=-2.1, high=5.0, size=(batch_s, dims))) 



W=theano.shared(
     value=numpy.asarray(
      rng.uniform(low=0.001, high=1.0, size=(dims, hidd_s)), 
       dtype=theano.config.floatX), 
     name='W', borrow=True) 

S=theano.shared(
     value=numpy.asarray(
      rng.uniform(low=10.0, high=100.0, size=(hidd_s,)), 
       dtype=theano.config.floatX), 
     name='S', borrow=True) 


f_func,f_updates = theano.map(lambda i : T.exp(-(W.T - i).norm(2,axis=1) ** 2)/2 * S ** 2,input,[]) 


layer_out = theano.function([input],               
          f_func, 
          updates=f_updates, 
       on_unused_input=missing_param, 
          allow_input_downcast=True) 


print(layer_out(X.astype('float32')))

stop = timeit.default_timer() 

print("running time: " + str(stop - start))

The output of the original code is:

[[ 1.83701953e-25 1.78982216e-26 9.22911484e-27] 
[ 1.60078639e-17 9.21553384e-17 7.62476155e-14] 
[ 8.13404350e-17 1.88481821e-17 2.44677516e-15] 
[ 3.16093011e-29 1.49698827e-27 2.42876079e-27] 
[ 9.57103818e-09 3.46683533e-12 6.66103154e-12]] 
running time: 1.30477905273

And of the new one:

[[ 1.83701953e-25 1.78982216e-26 9.22911484e-27] 
[ 1.60078639e-17 9.21553384e-17 7.62476155e-14] 
[ 8.13404350e-17 1.88481821e-17 2.44677516e-15] 
[ 3.16093011e-29 1.49698827e-27 2.42876079e-27] 
[ 9.57103818e-09 3.46683533e-12 6.66103154e-12]] 
running time: 0.589275121689
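
As an aside that is not part of the original answer: the same expression can in principle be written with no Python-level loop and no map at all, by broadcasting every input row against every column of W. The sketch below uses plain NumPy rather than Theano so that it can be checked on its own. The function names kernel_loops and kernel_broadcast are made up for illustration, and the formula is copied as written in the question. The same broadcasting pattern would be expected to carry over to Theano tensors (for example via dimshuffle), but that is untested here.

```python
import numpy as np

def kernel_loops(X, W, S):
    """Reference version: the double loop from the question, in NumPy."""
    out = np.zeros((X.shape[0], W.shape[1]))
    for i in range(X.shape[0]):
        for j in range(W.shape[1]):
            out[i, j] = np.exp(-np.linalg.norm(W.T[j] - X[i]) ** 2) / 2 * S[j] ** 2
    return out

def kernel_broadcast(X, W, S):
    """Same values, computed without any Python-level loop."""
    # diff has shape (batch, hidden, dims): each input row minus each column of W.
    diff = X[:, None, :] - W.T[None, :, :]
    sq_dist = (diff ** 2).sum(axis=2)       # squared L2 distances, shape (batch, hidden)
    return np.exp(-sq_dist) / 2 * S ** 2    # S broadcasts over the batch axis

# Same shapes and value ranges as in the question.
rng = np.random.RandomState(1234)
X = rng.uniform(low=-2.1, high=5.0, size=(5, 10))
W = rng.uniform(low=0.001, high=1.0, size=(10, 3))
S = rng.uniform(low=10.0, high=100.0, size=(3,))

# The two versions should agree up to floating-point rounding.
assert np.allclose(kernel_loops(X, W, S), kernel_broadcast(X, W, S), rtol=1e-9, atol=0)
```

The intermediate array has batch × hidden × dims elements, so this trades memory for speed and may not suit large layers.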

2017-01-11 09:02:54 gntoni

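Thank you very much. I have confirmed that 'map' is much faster. However, I have probably made a mistake again, because it still does not seem to run in parallel. I tried replacing '-(W.T - i).norm(2, axis=1)' with '-T.dot(W, i)' and it worked, but that is not what I want... I am using only the CPU, not a GPU; do you think this is the problem? – Nacho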

If your tensors are big enough, you can use OpenMP. Otherwise it will be even slower. See: http://deeplearning.net/software/theano/tutorial/multi_cores.html – gntoni

Thank you @gntoni – Nacho
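
For readers who want to try the OpenMP suggestion from the comments, here is a rough sketch that is not taken from the thread. It relies on the openmp flag described in the linked multi-core tutorial and on the standard OMP_NUM_THREADS environment variable. The thread count of 4 is an arbitrary illustration, and setting the variables from inside Python (rather than in the shell before starting the process) is an assumption that should be verified on your own setup.

```python
import os

# Assumption: both variables must be in place before Theano is imported,
# because Theano parses THEANO_FLAGS at import time.
os.environ["OMP_NUM_THREADS"] = "4"  # illustrative thread count, not from the thread

# Append to any flags that are already set instead of overwriting them.
existing = os.environ.get("THEANO_FLAGS", "")
os.environ["THEANO_FLAGS"] = ",".join(f for f in (existing, "openmp=True") if f)

# import theano  # import only after the variables above are set
```

As the comment above warns, this is only expected to help when the tensors are large; with the 5 x 10 input from the question, the threading overhead would likely dominate.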
