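Is synchronization always required when CUDA kernel launch parameters depend on a previous kernel?

I have this pattern in a lot of my code. It consists of a first kernel that filters a large dataset, where the selected entries are very sparse, followed by a second kernel that performs a much more involved computation on the greatly reduced dataset.

The cudaStreamSynchronize call seems almost superfluous, but I can't see any way around it.

Is there an alternative pattern that avoids the synchronization between the kernels?
Would CUDA dynamic parallelism help in any way?

Example code: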

/* Pseudocode. Won't Compile */ 
/* Please ignore silly mistakes/syntax and inefficient/incorrect simplifications */ 

__global__ void bar(const float * dataIn, float * dataOut, unsigned int * counter_ptr) 
{ 
    < do some computation > 
    if (bConditionalComputedAboveIsTrue) 
    { 
     const unsigned int ind = atomicInc(counter_ptr, (unsigned int)(-1)); 
     dataOut[ ind ] = resultOfAboveComputation; 
    } 
} 

int foo(float * d_datain, float* d_tempbuffer, float* d_output, cudaStream_t stream ){  
    /* Initialize a counter that will be updated by the bar kernel */ 
    unsigned int * counter_ptr; 
    cudaMalloc(&counter_ptr, sizeof(unsigned int)); //< Create a Counter 
    cudaMemsetAsync(counter_ptr, 0, sizeof(unsigned int), stream); //<Initially Set the Counter to 0 
    dim3 threadsInit(16,16,1); 
    dim3 gridInit(256, 1, 1); 
    /* Launch the Filtering Kernel. This will update the value in counter_ptr*/ 
    bar<<< gridInit, threadsInit, 0, stream >>>(d_datain, d_tempbuffer, counter_ptr); 
    /* Download the count and synchronize the stream */ 
    unsigned int count; 
    cudaMemcpyAsync(&count, counter_ptr, sizeof(unsigned int), cudaMemcpyDeviceToHost, stream); 
    cudaStreamSynchronize(stream); //< Is there any way around this synchronize? 
    /* Compute the grid parameters and launch a second kernel */ 
    dim3 bazThreads(128,1,1); 
    dim3 bazGrid(count/128 + 1, 1, 1); //< Here I use the counter modified in the prior kernel to set the grid parameters 
    baz<<< bazGrid, bazThreads, 0, stream >>>(d_tempbuffer, d_output); 
    /* cleanup */ 
    cudaFree(counter_ptr); 
    return 0; 
}

Source

2017-06-04 Samuel Martin

Instead of varying the number of blocks in the second kernel, you could use a fixed block count and have the blocks adapt the amount of work they do.

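For example, launch a larger number of blocks and have them exit early if there is no work left for them. Or launch just enough blocks to fill the device and have each block loop over the work. Grid-stride loops are a good way of doing this.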
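The fixed-grid idea could look roughly like the sketch below. It is only an illustration under assumptions: the kernel names `filterPositive` and `scaleSelected`, the host helper `runPipeline`, the "keep positive values" filter, the "multiply by two" computation, and the grid size of 64 blocks are all made up for this example and are not from the question. The point it shows is that the second kernel reads the count from device memory itself, so the host never needs the value and only synchronizes once at the end.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Stage 1 (placeholder filter): keep positive values, compacted via an atomic counter.
__global__ void filterPositive(const float* in, unsigned int n, float* out, unsigned int* counter)
{
    const unsigned int stride = blockDim.x * gridDim.x;
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        if (in[i] > 0.0f) {
            const unsigned int ind = atomicAdd(counter, 1u);
            out[ind] = in[i];
        }
    }
}

// Stage 2: fixed grid, grid-stride loop. The count is read on the device,
// so the launch configuration does not depend on it.
__global__ void scaleSelected(const float* in, float* out, const unsigned int* counter)
{
    const unsigned int count = *counter;  // written by stage 1 earlier in the same stream
    const unsigned int stride = blockDim.x * gridDim.x;
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < count; i += stride) {
        out[i] = 2.0f * in[i];  // placeholder for the real computation
    }
}

// Hypothetical host driver: runs both stages back to back and returns the sum of the results.
float runPipeline(const std::vector<float>& h_in)
{
    const unsigned int n = static_cast<unsigned int>(h_in.size());
    const size_t bytes = n * sizeof(float);
    float *d_in, *d_tmp, *d_out;
    unsigned int* d_counter;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_tmp, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMalloc(&d_counter, sizeof(unsigned int));

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemcpyAsync(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice, stream);
    cudaMemsetAsync(d_counter, 0, sizeof(unsigned int), stream);
    cudaMemsetAsync(d_out, 0, bytes, stream);  // unused slots stay zero

    // Assumed fixed launch configuration; it could instead be derived from the SM count.
    const dim3 grid(64), block(128);
    filterPositive<<<grid, block, 0, stream>>>(d_in, n, d_tmp, d_counter);
    scaleSelected<<<grid, block, 0, stream>>>(d_tmp, d_out, d_counter);  // no sync in between

    std::vector<float> h_out(n);
    cudaMemcpyAsync(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);  // single sync, only to read back the final results

    float sum = 0.0f;
    for (float v : h_out) sum += v;

    cudaStreamDestroy(stream);
    cudaFree(d_in); cudaFree(d_tmp); cudaFree(d_out); cudaFree(d_counter);
    return sum;
}
```

Stream ordering guarantees that `scaleSelected` sees the final counter value, because both kernels are issued to the same stream. The early-exit variant mentioned above would replace the loop with a single `if (i < count)` check and a grid sized for the worst case.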

There is also the option of using dynamic parallelism to move the kernel launch itself (and thus the decision on the grid size) onto the device.
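A minimal sketch of the dynamic-parallelism route, again under assumptions: `bazChild`, `bazLauncher`, and `runDeviceLaunch` are hypothetical names, the "add one" computation is a placeholder, and the counter is written from the host here only to keep the example short (in the question's pipeline it would be produced by the filter kernel). A single-thread launcher kernel reads the count on the device and sizes the child grid from it.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Stage 2 worker: one thread per selected element (placeholder computation).
__global__ void bazChild(const float* dataIn, float* dataOut, unsigned int count)
{
    const unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) dataOut[i] = dataIn[i] + 1.0f;
}

// Launcher: intended to run as <<<1, 1>>>. It reads the count on the device
// and chooses the child grid size there, so the host never needs the value.
__global__ void bazLauncher(const float* dataIn, float* dataOut, const unsigned int* counter_ptr)
{
    const unsigned int count = *counter_ptr;
    if (count == 0) return;  // a zero-sized grid would be an invalid launch
    const unsigned int threads = 128;
    const unsigned int blocks = (count + threads - 1) / threads;
    bazChild<<<blocks, threads>>>(dataIn, dataOut, count);
}

// Hypothetical host driver: returns the sum of the outputs.
float runDeviceLaunch(const std::vector<float>& h_in)
{
    const unsigned int n = static_cast<unsigned int>(h_in.size());
    const size_t bytes = n * sizeof(float);
    float *d_in, *d_out;
    unsigned int* d_counter;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMalloc(&d_counter, sizeof(unsigned int));

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemcpyAsync(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice, stream);
    // Stand-in for the filter kernel: here the count is simply the input size.
    cudaMemcpyAsync(d_counter, &n, sizeof(unsigned int), cudaMemcpyHostToDevice, stream);
    cudaMemsetAsync(d_out, 0, bytes, stream);

    bazLauncher<<<1, 1, 0, stream>>>(d_in, d_out, d_counter);

    std::vector<float> h_out(n);
    // The parent grid is not complete until its child grid has finished,
    // so this copy is ordered after bazChild.
    cudaMemcpyAsync(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    float sum = 0.0f;
    for (float v : h_out) sum += v;

    cudaStreamDestroy(stream);
    cudaFree(d_in); cudaFree(d_out); cudaFree(d_counter);
    return sum;
}
```

Device-side launches need a GPU of compute capability 3.5 or higher and relocatable device code, for example `nvcc -rdc=true` (older toolkits also need `-lcudadevrt`). Whether the extra launcher kernel is cheaper than a host synchronization depends on the workload, so it is worth measuring both.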

Source

2017-06-04 19:38:48 tera
