5.7. ROC Ufuncs and Generalized Ufuncs · Numba 0.44 Chinese Documentation

# 5.7. ROC Ufuncs and Generalized Ufuncs

> Original: [http://numba.pydata.org/numba-doc/latest/roc/ufunc.html](http://numba.pydata.org/numba-doc/latest/roc/ufunc.html)

This page describes the ROC ufunc-like object.

To support the programming pattern of ROC programs, ROC Vectorize and GUVectorize cannot produce a conventional ufunc. Instead, they return a ufunc-like object. This object closely mimics a regular NumPy ufunc but is not fully compatible with it. The ROC ufunc adds support for passing device arrays (arrays already on the GPU device) to reduce traffic over the PCI-express bus. It also accepts a `stream` keyword for launching in asynchronous mode.

## 5.7.1. Basic ROC UFunc Example

```py
import math
from numba import vectorize
import numpy as np

@vectorize(['float32(float32, float32, float32)',
            'float64(float64, float64, float64)'],
           target='roc')
def roc_discriminant(a, b, c):
    return math.sqrt(b ** 2 - 4 * a * c)

N = 10000
dtype = np.float32

# prepare the input
A = np.array(np.random.sample(N), dtype=dtype)
B = np.array(np.random.sample(N) + 10, dtype=dtype)
C = np.array(np.random.sample(N), dtype=dtype)

D = roc_discriminant(A, B, C)

print(D)  # print result
```

## 5.7.2. Calling Device Functions from ROC UFuncs

All ROC ufunc kernels can call other ROC device functions:

```py
from numba import vectorize, roc

# define a device function
@roc.jit('float32(float32, float32, float32)', device=True)
def roc_device_fn(x, y, z):
    return x ** y / z

# define a ufunc that calls our device function
@vectorize(['float32(float32, float32, float32)'], target='roc')
def roc_ufunc(x, y, z):
    return roc_device_fn(x, y, z)
```

## 5.7.3. Generalized ROC ufuncs

Generalized ufuncs can be executed on the GPU using ROC, analogous to the ROC ufunc functionality. This can be done as follows:

```py
from numba import guvectorize

@guvectorize(['void(float32[:,:], float32[:,:], float32[:,:])'],
             '(m,n),(n,p)->(m,p)', target='roc')
def matmulcore(A, B, C):
    ...
```

See also the [matrix multiplication example](examples.html#roc-matmul).

## 5.7.4. Async execution: A Chunk at a Time

Partitioning your data into chunks allows computation and memory transfer to overlap. This can increase the throughput of your ufunc and lets it operate on data larger than the memory capacity of your GPU. For example:

```py
import math
from numba import vectorize, roc
import numpy as np

# the ufunc kernel
def discriminant(a, b, c):
    return math.sqrt(b ** 2 - 4 * a * c)

roc_discriminant = vectorize(['float32(float32, float32, float32)'],
                             target='roc')(discriminant)

N = int(1e+8)
dtype = np.float32

# prepare the input
A = np.array(np.random.sample(N), dtype=dtype)
B = np.array(np.random.sample(N) + 10, dtype=dtype)
C = np.array(np.random.sample(N), dtype=dtype)
D = np.zeros(A.shape, dtype=A.dtype)

# create a ROC stream
stream = roc.stream()

# keep both values integral so np.split receives an integer section count
chunksize = int(1e+6)
chunkcount = N // chunksize

# partition numpy arrays into chunks
# no copying is performed
sA = np.split(A, chunkcount)
sB = np.split(B, chunkcount)
sC = np.split(C, chunkcount)
sD = np.split(D, chunkcount)

device_ptrs = []

# helper function, async requires operation on coarsegrain memory regions
def async_array(arr):
    coarse_arr = roc.coarsegrain_array(shape=arr.shape, dtype=arr.dtype)
    coarse_arr[:] = arr
    return coarse_arr

with stream.auto_synchronize():
    # every operation in this context will be launched asynchronously
    # by using the ROC stream

    dchunks = []  # holds the result chunks

    # for each chunk
    for a, b, c, d in zip(sA, sB, sC, sD):
        # create coarse grain arrays
        asyncA = async_array(a)
        asyncB = async_array(b)
        asyncC = async_array(c)
        asyncD = async_array(d)

        # transfer to device
        dA = roc.to_device(asyncA, stream=stream)
        dB = roc.to_device(asyncB, stream=stream)
        dC = roc.to_device(asyncC, stream=stream)
        dD = roc.to_device(asyncD, stream=stream, copy=False)  # no copying

        # launch kernel
        roc_discriminant(dA, dB, dC, out=dD, stream=stream)

        # retrieve result
        dD.copy_to_host(asyncD, stream=stream)

        # store device pointers to prevent them from freeing before
        # the kernel is scheduled
        device_ptrs.extend([dA, dB, dC, dD])

        # store result reference
        dchunks.append(asyncD)

# put result chunks into the output array 'D'
for i, result in enumerate(dchunks):
    sD[i][:] = result[:]

# data is ready at this point inside D
print(D)
```
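
The chunking logic in the example above can be checked on the CPU without a ROC device. The following is a minimal sketch using only NumPy: `discriminant_reference` is a hypothetical helper (not part of Numba or the original example) that stands in for the ROC kernel, and the array sizes are deliberately small. It illustrates two points the example relies on: `np.split` returns views rather than copies, so writing into a chunk of `D` updates `D` itself, and processing chunk by chunk should give the same result as processing the whole array at once.

```python
import numpy as np


def discriminant_reference(a, b, c):
    # Hypothetical CPU stand-in for the ROC ufunc kernel (assumption, not Numba API).
    return np.sqrt(b ** 2 - 4 * a * c)


N = 1000
chunksize = 100
chunkcount = N // chunksize  # integer number of equal-sized chunks

rng = np.random.default_rng(0)
A = rng.random(N).astype(np.float32)
B = (rng.random(N) + 10).astype(np.float32)  # b**2 dominates, so the root is real
C = rng.random(N).astype(np.float32)
D = np.zeros_like(A)

# np.split returns views into the original arrays; no data is copied
sA = np.split(A, chunkcount)
sB = np.split(B, chunkcount)
sC = np.split(C, chunkcount)
sD = np.split(D, chunkcount)

# process one chunk at a time, writing through the views into D
for a, b, c, d in zip(sA, sB, sC, sD):
    d[:] = discriminant_reference(a, b, c)

# whole-array result for comparison
expected = discriminant_reference(A, B, C)
chunks_are_views = all(np.shares_memory(chunk, D) for chunk in sD)
```

A comparison such as `np.allclose(D, expected)` can then serve as a sanity check before moving the same chunk layout onto the GPU stream.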