# 5.8. Examples

> Original: [http://numba.pydata.org/numba-doc/latest/roc/examples.html](http://numba.pydata.org/numba-doc/latest/roc/examples.html)

## 5.8.1. Matrix multiplication

Here is a naive implementation of matrix multiplication using an HSA kernel:

```py
@roc.jit
def matmul(A, B, C):
    i = roc.get_global_id(0)
    j = roc.get_global_id(1)

    if i >= C.shape[0] or j >= C.shape[1]:
        return

    tmp = 0
    for k in range(A.shape[1]):
        tmp += A[i, k] * B[k, j]

    C[i, j] = tmp
```

This implementation is straightforward and intuitive but performs poorly, because the same matrix elements are loaded from device memory multiple times, which is slow (some devices may have a transparent data cache, but it may not be large enough to hold the entire inputs at once).

It will be faster if we use a blocked algorithm to reduce accesses to device memory. HSA provides fast [shared memory](memory.html#roc-shared-memory) for work-items in a group to cooperatively compute on a task. The following implements a faster version of square matrix multiplication using shared memory:

```py
import numpy as np
from numba import roc
from numba import float32
from time import time as timer

blocksize = 16
gridsize = 16

@roc.jit('(float32[:,:], float32[:,:], float32[:,:])')
def matmulfast(A, B, C):
    x = roc.get_global_id(0)
    y = roc.get_global_id(1)

    tx = roc.get_local_id(0)
    ty = roc.get_local_id(1)

    sA = roc.shared.array(shape=(blocksize, blocksize), dtype=float32)
    sB = roc.shared.array(shape=(blocksize, blocksize), dtype=float32)

    if x >= C.shape[0] or y >= C.shape[1]:
        return

    tmp = 0

    for i in range(gridsize):
        # preload
        sA[tx, ty] = A[x, ty + i * blocksize]
        sB[tx, ty] = B[tx + i * blocksize, y]
        # wait for preload to end
        roc.barrier(1)
        # compute loop
        for j in range(blocksize):
            tmp += sA[tx, j] * sB[j, ty]
        # wait for compute to end
        roc.barrier(1)

    C[x, y] = tmp

N = gridsize * blocksize
A = np.random.random((N, N)).astype(np.float32)
B = np.random.random((N, N)).astype(np.float32)
C = np.zeros_like(A)

griddim = gridsize, gridsize
blockdim = blocksize, blocksize

with roc.register(A, B, C):
    ts = timer()
    matmulfast[griddim, blockdim](A, B, C)
    te = timer()
    print("1st GPU time:", te - ts)

with roc.register(A, B, C):
    ts = timer()
    matmulfast[griddim, blockdim](A, B, C)
    te = timer()
    print("2nd GPU time:", te - ts)

ts = timer()
ans = np.dot(A, B)
te = timer()
print("CPU time:", te - ts)

np.testing.assert_allclose(ans, C, rtol=1e-5)
```

Because shared memory is a limited resource, the code preloads one small block of the input arrays at a time. It then calls [`barrier()`](memory.html#numba.roc.barrier "numba.roc.barrier") to wait until all threads have finished preloading before computing on the shared memory. It synchronizes again after the computation to ensure all threads have finished with the data in shared memory before it is overwritten in the next loop iteration.
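For reference, the naive `matmul` kernel from the first listing can be driven with the same host-side pattern as the blocked version. The sketch below is an assumption modeled on that driver code (reusing its 16×16 grid and block shape); it is not part of the original page:

```py
import numpy as np
from numba import roc

# Naive kernel from the first listing in this section.
@roc.jit
def matmul(A, B, C):
    i = roc.get_global_id(0)
    j = roc.get_global_id(1)

    if i >= C.shape[0] or j >= C.shape[1]:
        return

    tmp = 0
    for k in range(A.shape[1]):
        tmp += A[i, k] * B[k, j]
    C[i, j] = tmp

# Assumed problem size and launch shape, mirroring the blocked example above.
blocksize = 16
gridsize = 16
N = gridsize * blocksize

A = np.random.random((N, N)).astype(np.float32)
B = np.random.random((N, N)).astype(np.float32)
C = np.zeros_like(A)

griddim = gridsize, gridsize
blockdim = blocksize, blocksize

# Register the host arrays and launch, following the blocked example's pattern.
with roc.register(A, B, C):
    matmul[griddim, blockdim](A, B, C)

np.testing.assert_allclose(np.dot(A, B), C, rtol=1e-5)
```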