Passing a constant integer to a CUDA kernel

I have a problem with the following code. In the global kernel loop_d, the integer M has the value 84. When I create the shared array temp and use M as the array size, I get the following error:

error: expression must have a constant value

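I don't understand why. If I declare M as a global variable it works, but the value of M is obtained when a separate Fortran program calls the function d_two, so I don't know how to work around this. I know that replacing temp[M] with temp[84] makes the program work perfectly, but that isn't practical, because different problems may have different values of M.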

The program:

// Parallelized 2D Three-Point Gaussian Quadrature Numerical Integration Method
// This file is one of two linked programs; the other is Integral_2D_Cuda.f.
// This is a CUDA kernel that can be called from the Integral_2D_Cuda.f Fortran code to compute
// the integral of a given 2D function
#include <stdio.h> 
#include <stdlib.h> 
#include <string.h> 
#include <cuda.h> 
#include <cuda_runtime.h> 
// The following is a definition for the atomicAddd function that is called in the loop_d kernel 
// This is needed because the "regular" atomicAdd function only works for floats and integers 
__device__ double atomicAddd(double* address, double val) 
{ 
    unsigned long long int* address_as_ull = (unsigned long long int*)address; 
    unsigned long long int old = *address_as_ull, assumed; 
    do { 
     assumed = old; 
     old = atomicCAS(address_as_ull, assumed, 
      __double_as_longlong(val + __longlong_as_double(assumed))); 
    } while (assumed != old); 
    return __longlong_as_double(old); 
} 
// GPU kernel that computes the function of interest. This is good for a two dimensional problem. 
__global__ void loop_d(double *a_sx, double *b_swx, double *c_sy, double *d_swy, double *e_ans0, int N, int M) 
{ 
    // Declaring a shared array that threads of the same block have access to 
    __shared__ double temp[M]; 
    int idxX = blockIdx.x * blockDim.x + threadIdx.x; // Thread indices responsible for the swx and sx arrays 
    int idxY = threadIdx.y;  // Thread indices responsible for the swy and sy arrays 
    // Computing the multiplication of elements 
    if (idxX < N && idxY < M) 
    { 
     temp[idxY] = a_sx[idxX] * b_swx[idxX] * c_sy[idxY] * d_swy[idxY]; 
    } 
    // Synchronizing all threads before summing all the multiplied elements in the temp array
    __syncthreads(); 
    // Allowing the 0th thread of y to do the summation of the multiplied elements in the temp array of one block 
    if (0 == idxY) 
    { 
     double sum = 0.00; 
     for(int k = 0; k < M; k++) 
     { 
      sum = sum + temp[k]; 
     } 
     // Adding the result of this instance of calculation to the final answer, ans0 
     atomicAddd(e_ans0, sum); 
    } 
} 
extern "C" void d_two_(double *sx, double *swx, int *nptx, double *sy, double *swy, int *npty, double *ans0) 
{ 
    // Assigning GPU pointers 
    double *sx_d, *swx_d; 
    int N = *nptx; 
    double *sy_d, *swy_d; 
    int M = *npty; 
    double *ans0_d; 
    dim3 threadsPerBlock(1,M); // Creating a two-dimensional block with 1 thread in the x dimension and M threads in the y dimension
    dim3 numBlocks(N); // Specifying the number of blocks to use, each of dimension 1xM
    // Allocating GPU Memory 
    cudaMalloc((void **)&sx_d, sizeof(double) * N); 
    cudaMalloc((void **)&swx_d, sizeof(double) * N); 
    cudaMalloc((void **)&sy_d, sizeof(double) * M); 
    cudaMalloc((void **)&swy_d, sizeof(double) * M); 
    cudaMalloc((void **)&ans0_d, sizeof(double)); 
    // Copying information from CPU to GPU
    cudaMemcpy(sx_d, sx, sizeof(double) * N, cudaMemcpyHostToDevice); 
    cudaMemcpy(swx_d, swx, sizeof(double) * N, cudaMemcpyHostToDevice); 
    cudaMemcpy(sy_d, sy, sizeof(double) * M, cudaMemcpyHostToDevice); 
    cudaMemcpy(swy_d, swy, sizeof(double) * M, cudaMemcpyHostToDevice); 
    cudaMemcpy(ans0_d, ans0, sizeof(double), cudaMemcpyHostToDevice); 
    // Calling the function on the GPU 
    loop_d<<< numBlocks, threadsPerBlock >>>(sx_d, swx_d, sy_d, swy_d, ans0_d, N, M); 
    // Copying from GPU to CPU 
    cudaMemcpy(ans0, ans0_d, sizeof(double), cudaMemcpyDeviceToHost); 
    // freeing GPU memory 
    cudaFree(sx_d); 
    cudaFree(swx_d); 
    cudaFree(sy_d); 
    cudaFree(swy_d); 
    cudaFree(ans0_d); 
    return; 
}

Source

2017-06-28 Bassa

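Statically allocated shared memory requires a compile-time constant for the size of the allocation. In your case, `M` is not known at compile time, so the solution is to use *dynamically* allocated shared memory. There are many examples of this [on the CUDA tag](https://stackoverflow.com/a/5531640/1695960) and in the [programming guide](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared). Dynamically allocated shared memory uses the `extern` keyword. –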

@RobertCrovella: You or talonmies always beat me to the first comment/answer... – einpoklum

The compiler requires M to be a compile-time constant. At compile time it cannot determine what M will actually be; it has no way of knowing that you will end up passing 84.

If you want to use shared memory whose size is only known at run time, use dynamic shared memory.

See the Parallel4All blog post on the subject, and this example here on the site.
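As a rough illustration of that advice applied to the kernel in the question, here is a minimal sketch, not a definitive implementation. It assumes the only changes needed are an unsized `extern __shared__` array and a third launch-configuration argument giving the size in bytes. The `main` function, the test values, and the name `sharedBytes` are made up for the example; in the original setup the launch would stay inside `d_two_`.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// CAS-based atomic add for doubles, as in the question.
__device__ double atomicAddd(double *address, double val)
{
    unsigned long long int *address_as_ull = (unsigned long long int *)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}

__global__ void loop_d(const double *a_sx, const double *b_swx,
                       const double *c_sy, const double *d_swy,
                       double *e_ans0, int N, int M)
{
    // Unsized extern array: its size in bytes is supplied by the third
    // launch-configuration argument, so M no longer has to be a constant.
    extern __shared__ double temp[];

    int idxX = blockIdx.x * blockDim.x + threadIdx.x;
    int idxY = threadIdx.y;

    if (idxY < M)
        temp[idxY] = (idxX < N)
            ? a_sx[idxX] * b_swx[idxX] * c_sy[idxY] * d_swy[idxY]
            : 0.0;
    __syncthreads();

    // Thread 0 in y sums this block's products and adds them to the result.
    if (idxY == 0) {
        double sum = 0.0;
        for (int k = 0; k < M; k++)
            sum += temp[k];
        atomicAddd(e_ans0, sum);
    }
}

int main(int argc, char **argv)
{
    // Hypothetical test sizes; M is a run-time value here and must not
    // exceed the device's maximum number of threads per block.
    int N = 4;
    int M = (argc > 1) ? atoi(argv[1]) : 84;

    std::vector<double> sx(N, 1.0), swx(N, 0.5), sy(M, 2.0), swy(M, 0.25);
    double ans0 = 0.0;

    double *sx_d, *swx_d, *sy_d, *swy_d, *ans0_d;
    cudaMalloc((void **)&sx_d, sizeof(double) * N);
    cudaMalloc((void **)&swx_d, sizeof(double) * N);
    cudaMalloc((void **)&sy_d, sizeof(double) * M);
    cudaMalloc((void **)&swy_d, sizeof(double) * M);
    cudaMalloc((void **)&ans0_d, sizeof(double));

    cudaMemcpy(sx_d, sx.data(), sizeof(double) * N, cudaMemcpyHostToDevice);
    cudaMemcpy(swx_d, swx.data(), sizeof(double) * N, cudaMemcpyHostToDevice);
    cudaMemcpy(sy_d, sy.data(), sizeof(double) * M, cudaMemcpyHostToDevice);
    cudaMemcpy(swy_d, swy.data(), sizeof(double) * M, cudaMemcpyHostToDevice);
    cudaMemcpy(ans0_d, &ans0, sizeof(double), cudaMemcpyHostToDevice);

    dim3 threadsPerBlock(1, M);
    dim3 numBlocks(N);
    size_t sharedBytes = M * sizeof(double);  // bytes of dynamic shared memory per block

    loop_d<<<numBlocks, threadsPerBlock, sharedBytes>>>(sx_d, swx_d, sy_d, swy_d,
                                                        ans0_d, N, M);

    cudaMemcpy(&ans0, ans0_d, sizeof(double), cudaMemcpyDeviceToHost);
    // Every product is 1.0 * 0.5 * 2.0 * 0.25 = 0.25, so the sum is N * M * 0.25.
    printf("ans0 = %f\n", ans0);

    cudaFree(sx_d); cudaFree(swx_d); cudaFree(sy_d); cudaFree(swy_d); cudaFree(ans0_d);
    return 0;
}
```

Compared with the question's code, only the `temp` declaration and the kernel launch differ in substance. The `const` qualifiers and the zero fill for out-of-range threads are optional tidying.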

Source

2017-06-28 22:26:05 einpoklum

Thank you for your help! The link on using shared memory in CUDA helped a lot! – Bassa

@Bassa: You're welcome. But I'll give the same advice I have given many times: before asking a question here, first consult the relevant section of the [CUDA Programming Guide](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html). – einpoklum
