Reduction #1 gives the wrong result


I implemented the well-known reduction #1 from Mark Harris's slides, but I get 0 as the result. I filled the input array with the same values shown in the slides. I compiled with CUDA 7.0 using the command nvcc reduction1.cu -o red1. Where is the mistake? Thank you.

#include <stdio.h> 
#include <cuda_runtime.h> 

#define THREADS_PER_BLOCK 16 

__global__ void reduce1(int *g_idata, int *g_odata) { 
    extern __shared__ int sdata[]; 
    // each thread loads one element from global to shared mem 
    unsigned int tid = threadIdx.x; 
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x; 
    sdata[tid] = g_idata[i]; 
    __syncthreads(); 

    // do reduction in shared mem 
    for(unsigned int s=1; s < blockDim.x; s *= 2) 
    { 
     if (tid % (2*s) == 0) sdata[tid] += sdata[tid + s]; 
      __syncthreads(); 
    } 

    // write result for this block to global mem 
    if (tid == 0) g_odata[blockIdx.x] = sdata[0]; 
} 

int main() 
{ 
    int inputLength=16; 
    int hostInput[16]={10,1,8,-1,0,-2,3,5,-2,-3,2,7,0,11,0,2}; 
    int hostOutput=0; 
    int *deviceInput; 
    int *deviceOutput; 

    cudaMalloc((void **)&deviceInput, inputLength * sizeof(int)); 
    cudaMalloc((void **)&deviceOutput, sizeof(int)); 

    cudaMemcpy(deviceInput, hostInput, inputLength * sizeof(int),cudaMemcpyHostToDevice); 

    reduce1<<<1,THREADS_PER_BLOCK>>>(deviceInput, deviceOutput); 

    cudaDeviceSynchronize(); 

    cudaMemcpy(&hostOutput, deviceOutput,sizeof(int), cudaMemcpyDeviceToHost); 

    printf("%d\n",hostOutput); 

    cudaFree(deviceInput); 
    cudaFree(deviceOutput); 

    return 0; 
}


2017-05-22 horus

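You have not specified a size for the dynamic shared memory allocation. I explained exactly how this works in my answer to your last question. If you had bothered to add error checking, you would see that the kernel is failing with a memory access violation. – talonmies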

You should implement [proper CUDA error checking](https://stackoverflow.com/questions/14038589/what-is-the-canonicalway-to-check-for-errors-using-the-cuda-runtime-api) **before asking others for help**. It may help you work out the problem yourself, and even if it does not, the error output will be useful to the people trying to help you.
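A minimal sketch of the kind of error checking the comments ask for. The names `CUDA_CHECK` and `checkKernel` are hypothetical helpers introduced here for illustration; they do not appear in the original post. Only the runtime calls `cudaGetLastError`, `cudaDeviceSynchronize`, and `cudaGetErrorString` come from the CUDA runtime API.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper macro (not from the original post): print a readable
// message and exit if a CUDA runtime call returns anything but cudaSuccess.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err_));    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical helper: call immediately after a kernel launch.
static void checkKernel(void)
{
    CUDA_CHECK(cudaGetLastError());       // errors from the launch itself
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran
}
```

In the question's `main`, each `cudaMalloc` and `cudaMemcpy` call could be wrapped in `CUDA_CHECK(...)`, with `checkKernel()` placed right after the `reduce1<<<...>>>` launch, so that a failing kernel is reported instead of silently leaving `hostOutput` at 0.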

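As the comment above says, you are using dynamic shared memory but you are not allocating any space for it. The size of this memory, in bytes, must be given as the third argument of the kernel execution configuration. For 16 threads that each store one int, that is 16 * sizeof(int) = 64 bytes.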

reduce1<<<1, THREADS_PER_BLOCK, 64>>>(deviceInput, deviceOutput); 
           ^^
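Rather than hard-coding 64, the byte count can be computed from the block size. The sketch below reuses the kernel from the question unchanged; the wrapper `sumOneBlock` is a hypothetical helper added for illustration. It assumes `n` is a power of two and does not exceed the device's maximum block size, and it omits error checking for brevity.

```cuda
#include <cuda_runtime.h>

__global__ void reduce1(int *g_idata, int *g_odata) {
    extern __shared__ int sdata[];  // size comes from the third launch argument
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();

    // interleaved-addressing reduction in shared memory
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

// Hypothetical wrapper (not from the original post): sums n ints in one block.
// Assumes n is a power of two and fits in a single thread block.
int sumOneBlock(const int *hostInput, int n)
{
    int *deviceInput = NULL, *deviceOutput = NULL;
    int hostOutput = 0;

    cudaMalloc((void **)&deviceInput, n * sizeof(int));
    cudaMalloc((void **)&deviceOutput, sizeof(int));
    cudaMemcpy(deviceInput, hostInput, n * sizeof(int), cudaMemcpyHostToDevice);

    // one int of shared memory per thread, expressed in bytes
    size_t sharedBytes = n * sizeof(int);
    reduce1<<<1, n, sharedBytes>>>(deviceInput, deviceOutput);

    // this blocking copy waits for the kernel to finish
    cudaMemcpy(&hostOutput, deviceOutput, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(deviceInput);
    cudaFree(deviceOutput);
    return hostOutput;
}
```

Called with the 16 values from the question, `sumOneBlock(hostInput, 16)` should return their sum, 41.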

Another way to fix this code is to use static shared memory. Declare your shared memory like this:

__shared__ int sdata[16];
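For context, here is a sketch of how the static declaration fits into the kernel. The kernel name `reduce1Static` and the wrapper `sumStatic16` are hypothetical names used only for illustration. With a static array the size is fixed at compile time, so the launch needs no third argument, but the block size must not exceed `THREADS_PER_BLOCK`. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 16

// Hypothetical variant of the question's kernel using static shared memory.
__global__ void reduce1Static(int *g_idata, int *g_odata) {
    __shared__ int sdata[THREADS_PER_BLOCK];  // size known at compile time
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();

    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

// Hypothetical wrapper: sums exactly THREADS_PER_BLOCK ints in one block.
int sumStatic16(const int *hostInput)
{
    int *deviceInput = NULL, *deviceOutput = NULL;
    int hostOutput = 0;
    size_t bytes = THREADS_PER_BLOCK * sizeof(int);

    cudaMalloc((void **)&deviceInput, bytes);
    cudaMalloc((void **)&deviceOutput, sizeof(int));
    cudaMemcpy(deviceInput, hostInput, bytes, cudaMemcpyHostToDevice);

    // no third launch argument: the shared array is statically sized
    reduce1Static<<<1, THREADS_PER_BLOCK>>>(deviceInput, deviceOutput);

    cudaMemcpy(&hostOutput, deviceOutput, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(deviceInput);
    cudaFree(deviceOutput);
    return hostOutput;
}
```

The trade-off is flexibility: the dynamic form lets one compiled kernel serve different block sizes, while the static form ties the kernel to a single compile-time size.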

Please read this before asking questions about CUDA.


2017-05-22 13:12:25 nglee
