Copying from RAM to GPU and from GPU to RAM

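I am trying to introduce some CUDA optimizations into my project, but I think I am doing something wrong. I want to implement a simple matrix-vector multiplication (result = matrix * vector). However, when I try to copy the result back to the host, an error occurs (cudaErrorLaunchFailure). Is there an error in my kernel (matrixVectorMultiplicationKernel), or did I call cudaMemcpy incorrectly? I could not find any useful documentation on this kind of error state. I think it completely corrupts the state of the GPU, because once it has occurred I cannot call any CUDA kernel without getting the same error again.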

Edit #1: I have updated the code following leftaroundabout's advice.

// code 
... 
Eigen::MatrixXf matrix(M, N); // matrix.data() usually should return a float array 
Eigen::VectorXf vector(N); // same here for vector.data() 
Eigen::VectorXf result(M); 
... // fill matrix and vector 
float* matrixOnDevice = copyMatrixToDevice(matrix.data(), matrix.rows(), matrix.cols()); 
matrixVectorMultiplication(matrixOnDevice, vector.data(), result.data(), matrix.rows(), matrix.cols()); 
... // clean up 

// helper functions 
float* copyMatrixToDevice(const float* matrix, int mRows, int mCols) 
{ 
    float* matrixOnDevice; 
    const int length = mRows*mCols; 
    const int size = length * sizeof(float); 
    handleCUDAError(cudaMalloc((void**)&matrixOnDevice, size)); 
    handleCUDAError(cudaMemcpy(matrixOnDevice, matrix, size, cudaMemcpyHostToDevice)); 
    return matrixOnDevice; 
} 

void matrixVectorMultiplication(const float* matrixOnDevice, const float* vector, float* result, int mRows, int mCols) 
{ 
    const int vectorSize = mCols*sizeof(float); 
    const int resultSize = mRows*sizeof(float); 
    const int matrixLength = mRows*mCols; 
    float* deviceVector; 
    float* deviceResult; 
    handleCUDAError(cudaMalloc((void**)&deviceVector, vectorSize)); 
    handleCUDAError(cudaMalloc((void**)&deviceResult, resultSize)); 
    handleCUDAError(cudaMemset(deviceResult, 0, resultSize)); 
    handleCUDAError(cudaMemcpy(deviceVector, vector, vectorSize, cudaMemcpyHostToDevice)); 
    int threadsPerBlock = 256; 
    int blocksPerGrid = (mRows + threadsPerBlock - 1)/threadsPerBlock; 
    matrixVectorMultiplicationKernel<<<blocksPerGrid, threadsPerBlock>>>(matrixOnDevice, vector, result, mRows, mCols, matrixLength); 
    // --- no errors yet --- 
    handleCUDAError(cudaMemcpy(result, deviceResult, resultSize, cudaMemcpyDeviceToHost)); // cudaErrorLaunchFailure 
    handleCUDAError(cudaFree(deviceVector)); // cudaErrorLaunchFailure 
    handleCUDAError(cudaFree(deviceResult)); // cudaErrorLaunchFailure 
} 

__global__ void matrixVectorMultiplicationKernel(const float* matrix, const float* vector, float* result, int mRows, int mCols, int length) 
{ 
    int row = blockDim.x * blockIdx.x + threadIdx.x; 
    if(row < mRows) 
    { 
        for(int col = 0, mIdx = row*mCols; col < mCols; col++, mIdx++) 
            result[row] += matrix[mIdx] * vector[col]; 
    } 
}

Source

2012-04-16 alfa

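It would be reasonable to use CUBLAS rather than writing kernels like this yourself. – leftaroundabout

I think I will get to that soon. But CUBLAS looked very complicated to me, and I wanted to start with something simple. – alfa

In my opinion, CUBLAS is simpler (though also more limited). –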

void copyMatrixToDevice(..., float* matrixOnDevice, ...) takes this pointer by value, so it cannot output the device matrix. You can do it with void copyMatrixToDevice(..., float** matrixOnDevice, ...), called as

copyMatrixToDevice(matrix.data(), &matrixOnDevice, matrix.rows(), matrix.cols());

You have the same problem with result in matrixVectorMultiplication.

In C++, in the long run, you should wrap all of this in a proper class abstraction layer.
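The by-value versus pointer-to-pointer distinction described in this answer can be sketched on the host alone. This is only an illustration under stated assumptions: std::malloc and std::memcpy stand in for cudaMalloc and cudaMemcpy so that it runs without a GPU, and the names copyByValue, copyByPointer and copyAndReturn are hypothetical, not taken from the question.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical host-only stand-ins: malloc/memcpy replace cudaMalloc/cudaMemcpy.

// Takes the pointer BY VALUE: the assignment below changes only the local copy,
// so the caller's pointer is left untouched.
void copyByValue(const float* src, float* dst, int length)
{
    dst = static_cast<float*>(std::malloc(length * sizeof(float)));
    assert(dst != nullptr);
    std::memcpy(dst, src, length * sizeof(float));
    std::free(dst); // freed here because the caller can never reach this buffer
}

// Takes the ADDRESS of the caller's pointer, so the allocation is visible outside.
void copyByPointer(const float* src, float** dst, int length)
{
    *dst = static_cast<float*>(std::malloc(length * sizeof(float)));
    assert(*dst != nullptr);
    std::memcpy(*dst, src, length * sizeof(float));
}

// Alternative: return the new pointer, as copyMatrixToDevice in the
// question does after Edit #1.
float* copyAndReturn(const float* src, int length)
{
    float* dst = nullptr;
    copyByPointer(src, &dst, length);
    return dst;
}
```

After calling copyByValue(src, p, n) with p initialized to nullptr, p is still nullptr, whereas copyByPointer(src, &p, n) and p = copyAndReturn(src, n) both leave p pointing at the copied data. This is also why cudaMalloc takes a void** argument.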

Source

2012-04-16 16:42:54 leftaroundabout

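OK, normally I would have found the first error myself ('**matrixOnDevice'). Thanks! That is why I have to pass (void**) to cudaMalloc. I do not understand the second piece of advice, though. cudaMemcpy does not change the address of 'result'. Why is it not enough to pass it as float*? Anyway, the error is still there. That did not completely solve the problem. – alfa

Right, I had not looked at 'matrixVectorMultiplication' properly. That does indeed work, although it is not particularly consistent. – leftaroundabout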

OK, I found the last error. The kernel has to be called with addresses that are on the device: 'matrixVectorMultiplicationKernel<<<blocksPerGrid, threadsPerBlock>>>(matrixOnDevice, deviceVector, deviceResult, mRows, mCols, matrixLength);' – alfa
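Once the launch uses device pointers, one way to check what comes back from the GPU is to compare it with a plain host loop. The following is a minimal sketch, not part of the original thread: referenceMatVec and nearlyEqual are hypothetical helper names, and the code assumes the same row-major indexing as the kernel (element (row, col) at matrix[row * mCols + col]). As far as I know Eigen::MatrixXf is column-major by default, so it is worth verifying that matrix.data() really has this layout.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Host reference for result = matrix * vector, with the same row-major
// indexing as the kernel: element (row, col) is matrix[row * mCols + col].
std::vector<float> referenceMatVec(const std::vector<float>& matrix,
                                   const std::vector<float>& vector,
                                   int mRows, int mCols)
{
    assert(static_cast<int>(matrix.size()) == mRows * mCols);
    assert(static_cast<int>(vector.size()) == mCols);
    std::vector<float> result(mRows, 0.0f);
    for (int row = 0; row < mRows; ++row)
    {
        float sum = 0.0f; // accumulate locally, write once per row
        for (int col = 0, mIdx = row * mCols; col < mCols; ++col, ++mIdx)
            sum += matrix[mIdx] * vector[col];
        result[row] = sum;
    }
    return result;
}

// Hypothetical helper: true if both vectors have the same size and agree
// element-wise within an absolute tolerance.
bool nearlyEqual(const std::vector<float>& a, const std::vector<float>& b,
                 float tol = 1e-4f)
{
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (std::fabs(a[i] - b[i]) > tol) return false;
    return true;
}
```

The values copied back with cudaMemcpy can then be placed in a std::vector<float> and passed to nearlyEqual together with the output of referenceMatVec. The tolerance is an arbitrary choice for this sketch and may need adjusting for large mCols.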
