3.10. Examples · Numba 0.44 Chinese Documentation

# 3.10. Examples

> Original: [http://numba.pydata.org/numba-doc/latest/cuda/examples.html](http://numba.pydata.org/numba-doc/latest/cuda/examples.html)

## 3.10.1. Matrix multiplication

Here is a naive implementation of matrix multiplication using a CUDA kernel:

```py
from numba import cuda

@cuda.jit
def matmul(A, B, C):
    """Perform square matrix multiplication of C = A * B
    """
    # Absolute position of this thread in the 2D grid
    i, j = cuda.grid(2)
    if i < C.shape[0] and j < C.shape[1]:
        tmp = 0.
        for k in range(A.shape[1]):
            tmp += A[i, k] * B[k, j]
        C[i, j] = tmp
```

This implementation is straightforward and intuitive but performs poorly, because the same matrix elements are loaded from device memory many times, which is slow (some devices may have transparent data caches, but they may not be large enough to hold the entire inputs at once).

It will be faster if we use a blocked algorithm to reduce accesses to device memory. CUDA provides fast [shared memory](memory.html#cuda-shared-memory) for the threads in a block to cooperate on a task. The following implements a faster version of square matrix multiplication using shared memory:

```py
from numba import cuda, float32

# Controls threads per block and shared memory usage.
# The computation will be done on blocks of TPBxTPB elements.
TPB = 16

@cuda.jit
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    # The preload loop below has no bounds checks, so this version
    # assumes square matrices whose size is a multiple of TPB.
    if x >= C.shape[0] or y >= C.shape[1]:
        # Quit if (x, y) is outside of valid C boundary
        return

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = 0.
    for i in range(bpg):
        # Preload data into shared memory
        sA[tx, ty] = A[x, ty + i * TPB]
        sB[tx, ty] = B[tx + i * TPB, y]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[tx, j] * sB[j, ty]

        # Wait until all threads finish computing
        cuda.syncthreads()

    C[x, y] = tmp
```

Because shared memory is a limited resource, the code preloads one small block at a time from the input arrays. It then calls [`syncthreads()`](../cuda-reference/kernel.html#numba.cuda.syncthreads "numba.cuda.syncthreads") to wait until all threads have finished preloading before doing the computation on the shared memory. It synchronizes again after the computation to ensure that all threads have finished with the data in shared memory before it is overwritten in the next loop iteration.
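To make the benefit of blocking concrete, the following is a minimal CPU-only sketch in plain Python (no GPU or Numba required). It is not part of the Numba documentation: the function names `naive_matmul` and `tiled_matmul`, the `loads` counter, and the small tile size are all hypothetical and chosen for illustration. Nested lists stand in for device memory, the tile copies `sA` and `sB` stand in for shared memory, and the sketch assumes square matrices whose size is a multiple of the tile size.

```python
def naive_matmul(A, B):
    """Row-by-column product; every multiply reads both inputs directly."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    loads = 0  # hypothetical counter: element reads from the input matrices
    for i in range(n):
        for j in range(n):
            tmp = 0.0
            for k in range(n):
                tmp += A[i][k] * B[k][j]
                loads += 2  # one read from A, one from B
            C[i][j] = tmp
    return C, loads


def tiled_matmul(A, B, tpb):
    """Blocked product: each output tile is built from tpb x tpb input tiles."""
    n = len(A)
    assert n % tpb == 0, "sketch assumes n is a multiple of tpb"
    C = [[0.0] * n for _ in range(n)]
    loads = 0
    for bx in range(0, n, tpb):            # one "thread block" per output tile
        for by in range(0, n, tpb):
            for i in range(0, n, tpb):     # walk tiles along the shared dimension
                # "Preload": copy one tile of A and one tile of B
                sA = [row[i:i + tpb] for row in A[bx:bx + tpb]]
                sB = [row[by:by + tpb] for row in B[i:i + tpb]]
                loads += 2 * tpb * tpb
                # Partial products use only the tile copies
                for tx in range(tpb):
                    for ty in range(tpb):
                        acc = 0.0
                        for j in range(tpb):
                            acc += sA[tx][j] * sB[j][ty]
                        C[bx + tx][by + ty] += acc
    return C, loads


if __name__ == "__main__":
    n, tpb = 4, 2
    A = [[float(n * i + j) for j in range(n)] for i in range(n)]
    B = [[float((i + 2 * j) % 5) for j in range(n)] for i in range(n)]
    C_naive, loads_naive = naive_matmul(A, B)
    C_tiled, loads_tiled = tiled_matmul(A, B, tpb)
    print(C_naive == C_tiled, loads_naive, loads_tiled)
```

Under these assumptions the naive version performs 2·n³ input reads while the tiled version performs 2·n³/tpb, so the number of reads from the slow memory falls by a factor equal to the tile edge length. On a real device the actual speedup also depends on caches and memory access patterns, which this sketch does not model.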