ドットプロダクトPyCUDAと固定メモリ

現在、PyCUDAを使用して固定メモリを持つドットプロダクトを開発中です。そして、大きな配列には問題があります。私が働いているドットプロダクトPyCUDAと固定メモリ

：

NVIDIAのGTXは1060
CUDA 9.1
PyCUDA 2017年1月1日

コードは次のとおりです。

#!/usr/bin/env python 

import numpy as np 
import argparse 
import math 
import pycuda.autoinit 
import pycuda.driver as drv 
from pycuda.compiler import SourceModule 

from time import time 

dot_mod = SourceModule(""" 
__global__ void full_dot(double* v1, double* v2, double* out, int N) { 
    __shared__ double cache[ 1024 ]; 
    int i = blockIdx.x * blockDim.x + threadIdx.x; 
    cache[ threadIdx.x ] = 0.f; 
    while(i < N) { 
     cache[ threadIdx.x ] += v1[ i ] * v2[ i ]; 
     i += gridDim.x * blockDim.x; 
    } 
    __syncthreads(); // required because later on the current thread is accessing 
        // data written by another thread  
    i = 1024/2; 
    while(i > 0) { 
     if(threadIdx.x < i) cache[ threadIdx.x ] += cache[ threadIdx.x + i ]; 
     __syncthreads(); 
     i /= 2; //not sure bitwise operations are actually faster 
    } 

#ifndef NO_SYNC // serialized access to shared data; 
    if(threadIdx.x == 0) atomicAdd(out, cache[ 0 ]); 
#else // no sync, what most likely happens is: 
     // 1) all threads read 0 
     // 2) all threads write concurrently 16 (local block dot product) 
    if(threadIdx.x == 0) *out += cache[ 0 ]; 
#endif     

} 
""") 


def main(args): 
    dot = dot_mod.get_function("full_dot") 
    N = args.number 
    BLOCK_SIZE = 1024 
    BLOCKS = int(math.ceil(N/BLOCK_SIZE)) 
    THREADS_PER_BLOCK = BLOCK_SIZE 

    # Time use of pinned host memory: 
    x = drv.aligned_empty((N), dtype=np.float64, order='C') 
    x = drv.register_host_memory(x, flags=drv.mem_host_register_flags.DEVICEMAP) 
    x_gpu_ptr = np.intp(x.base.get_device_pointer()) 

    # Time use of pinned host memory: 
    y = drv.aligned_empty((N), dtype=np.float64, order='C') 
    y = drv.register_host_memory(y, flags=drv.mem_host_register_flags.DEVICEMAP) 
    y_gpu_ptr = np.intp(y.base.get_device_pointer()) 

    # Time use of pinned host memory: 
    z = drv.aligned_empty((1), dtype=np.float64, order='C') 
    z = drv.register_host_memory(z, flags=drv.mem_host_register_flags.DEVICEMAP) 
    z_gpu_ptr = np.intp(z.base.get_device_pointer()) 

    z[:] = np.zeros(1) 
    x[:] = np.zeros(N) 
    y[:] = np.zeros(N) 

    x[:] = np.random.rand(N) 
    y[:] = x[:] 
    x_orig = x.copy() 
    y_orig = y.copy() 

    start = time() 
    dot(x_gpu_ptr, y_gpu_ptr, z_gpu_ptr, np.uint32(N), block=(THREADS_PER_BLOCK, 1, 1), grid=(BLOCKS,1)) 
    times = time()-start 
    print "Average kernel GPU dot product execution time with pinned memory: %3.7f" % np.mean(times) 

    start = time() 
    ydot=np.dot(x_orig,y_orig) 
    times = time()-start 
    print "Average numpy dot product execution time: %3.7f" % np.mean(times) 

    print N,ydot,z[0] 

if __name__ == "__main__": 

    parser = argparse.ArgumentParser(description=' ') 
    parser.add_argument('-n', dest='number', type=long, help="Number of samples ", required=True) 

    args = parser.parse_args() 

    main(args)

私が持っていますサンプル配列があるときにうまくいくこのコードを書いた誰にでも

➜ ./test_dot_pinned.py -n 16384 
Average kernel GPU dot product execution time with pinned memory: 0.0001669 
Average numpy dot product execution time: 0.0000119 
16384 5468.09590706 5468.09590706 
SIZE np.dot() GPU-dot-pinned 

➜ ./test_dot_pinned.py -n 1048576 
Average kernel GPU dot product execution time with pinned memory: 0.0002351 
Average numpy dot product execution time: 0.0010922 
1048576 349324.532564 258321.148593 
SIZE np.dot() GPU-dot-pinned

おかげでおよそ10mlの液体フッ化水素を圧入* 12 1024未満が、* 1024×1024のような大きな配列の値が間違っresult.-を与える大きさの、私は誰かが私を助けることができる願っています。

出典

2017-12-18 acancio

pycudaは、カーネルの起動後に同期を強制しません。通常、カーネルの起動後にデバイスのホストコピーを行うと、強制的に同期が強制されます。つまり、強制的にカーネルが完了します。

しかし、コードにこのような同期はありません。ピンメモリを使用しているので、最終的にz[0]を印刷するときにカーネルの実行時間が長くなる（結果的に作業量が大きくなる）ため、カーネルはその時点で終了していないので部分的な結果しか得られません。

この副作用は、カーネルの時間測定が正確でないこともあります。

あなたはあなたの時間の計測を終了する前に完了するために、カーネルを強制することにより、これらの両方を修正することができます。

dot(x_gpu_ptr, y_gpu_ptr, z_gpu_ptr, np.uint32(N), block=(THREADS_PER_BLOCK, 1, 1), grid=(BLOCKS,1)) 
#add the next line of code: 
drv.Context.synchronize() 
times = time()-start

出典

2017-12-18 21:26:52

うわー、たくさんの迅速な答えのためのおかげで、とても役に立ちました。宜しくお願いします。 – acancio

ドットプロダクトPyCUDAと固定メモリ

答えて

関連する問題