The following sections describe some techniques for making Julia code run faster.

### Avoid global variables

A global variable's value, and therefore its type, can change at any time. This makes it hard for the compiler to optimize code that uses global variables. Prefer local variables, or pass variables as arguments to functions.

Performance-critical code should be placed inside functions.

Declaring a global variable constant can significantly improve performance:

`const DEFAULT_VAL = 0`

When using a non-constant global variable, it helps the compiler to annotate its type at the point of use:

```
global x
y = f(x::Int + 1)
```

Writing functions is better style in general. It leads to more reusable code, and makes it clearer what is being done and what the inputs and outputs are.

### Measure performance with `@time` and pay attention to memory allocation

The most useful tool for measuring performance is the `@time` macro. The following example shows good practice:

```
julia> function f(n)
           s = 0
           for i = 1:n
               s += i/2
           end
           s
       end
f (generic function with 1 method)

julia> @time f(1)
elapsed time: 0.008217942 seconds (93784 bytes allocated)
0.5

julia> @time f(10^6)
elapsed time: 0.063418472 seconds (32002136 bytes allocated)
2.5000025e11
```

On the first call (`@time f(1)`), `f` gets compiled. (If you have not yet used `@time` in this session, the timing functions are compiled too.) The results of this call are not so important. On the second call, besides printing the elapsed time, note that a large amount of memory was allocated during the run. This is the biggest advantage of the `@time` macro over the `tic` and `toc` functions.

Unexpectedly large memory allocation is often a sign that some part of your program has a problem, usually related to type stability. Consequently, beyond the concern about allocation itself, it is very likely that the code Julia generated for your function has serious performance problems. Take such signs seriously and follow the advice below.

As a teaser, the function above can be improved to allocate no memory (apart from returning the result to the REPL) and to run 30 times faster:

```
julia> @time f_improved(10^6)
elapsed time: 0.00253829 seconds (112 bytes allocated)
2.5000025e11
```

The following sections will show you how to identify the problem with `f` and fix it.

In some situations, your function may legitimately need to allocate memory for its own operations, which can complicate the picture. In such cases, consider using one of the tools below to diagnose the problem, or split the function into two parts: one that handles allocation and one that implements the algorithm (see Pre-allocating outputs).

### Tools

Julia provides several tools for identifying the sources of performance problems:

* Profiling can be used to measure the performance of your code and identify bottlenecks. For complex projects, the [ProfileView](https://github.com/timholy/ProfileView.jl) package can display the profiling results graphically.
* Unexpectedly large memory allocations, as reported by `@time`, `@allocated`, or the profiler, suggest that your code may have a problem. If the allocations themselves do not reveal the problem, the type system may be at fault. You can also start Julia with the `--track-allocation=user` option and examine the resulting `*.mem` files to see where the allocations occur.
* [TypeCheck](https://github.com/astrieanna/TypeCheck.jl) can help catch certain kinds of type-related problems. A more laborious but more comprehensive tool is `code_typed`. Pay particular attention to variables of type `Any` or of `Union` types; such problems can be fixed using the advice below.
* The Lint package can point out some problems in your program.

### Avoid containers with abstract type parameters

When working with parameterized types, including arrays, it is best to avoid parameterizing with abstract types where possible.
Consider the following:

```
a = Real[]    # typeof(a) = Array{Real,1}
if (f = rand()) < .8
    push!(a, f)
end
```

Because `a` is an array of the abstract type `Real`, it must be able to hold any `Real` value. Since `Real` objects can be of arbitrary size and structure, `a` must be represented as an array of pointers to individually allocated `Real` objects. Because `f` will always be a `Float64`, we should instead use:

```
a = Float64[]    # typeof(a) = Array{Float64,1}
```

which will create a contiguous block of 64-bit floating-point values that can be manipulated efficiently. See also the discussion of parametric types below.

### Type declarations

In Julia, the compiler can infer the types of all function arguments and local variables, so declaring variable types does not improve performance. In some specific cases, however, type declarations are quite useful.

### Declare the types of fields in composite types

Suppose we have a custom type like this:

```
type Foo
    field
end
```

The compiler cannot infer the type of `foo.field`, because its type changes whenever it is assigned a value of a different type. It is better to declare a concrete type for the field, such as `field::Float64` or `field::Array{Int64,1}`.

### Explicitly declare the types of values whose type is not provided

We often work with data structures that can contain values of any type, such as the `Foo` type above or cell arrays (arrays of type `Array{Any}`). If you know the type of an element, it helps to tell the compiler:

```
function foo(a::Array{Any,1})
    x = a[1]::Int32
    b = x+1
    ...
end
```

Here we know that the first element of `a` is an `Int32`, so we add a type declaration saying so. If the element is not of that type, a runtime error is raised, which also helps in debugging.

### Declare the types of keyword arguments

Keyword arguments can have declared types:

```
function with_keyword(x; name::Int = 1)
    ...
end
```

Functions only handle keyword arguments of the declared types, so these declarations do not affect the performance of code inside the function. They do, however, reduce the overhead of calls to functions that accept keyword arguments.

Compared to functions that use only positional argument lists, calls to functions with keyword arguments add very little overhead, essentially none. Passing a dynamic list of keyword arguments, as in `f(x; keywords...)`, can be slow, however, and should be avoided in performance-sensitive code.

### Break functions into multiple definitions

Writing a function as many small definitions helps the compiler call the most applicable code, or even inline it.

Here is an example of a "compound function" that should really be written as multiple definitions:

```
function norm(A)
    if isa(A, Vector)
        return sqrt(real(dot(A,A)))
    elseif isa(A, Matrix)
        return max(svd(A)[2])
    else
        error("norm: invalid argument")
    end
end
```

This can be written more concisely and efficiently as:

```
norm(x::Vector) = sqrt(real(dot(x,x)))
norm(A::Matrix) = max(svd(A)[2])
```

### Write "type-stable" functions

When possible, make sure a function always returns values of the same type. Consider the following definition:

`pos(x) = x < 0 ? 0 : x`

Although this looks fine, the problem is that `0` is an integer (of type `Int`) while `x` can be of any type. Therefore the function may return values of either of two types. This is allowed, and is sometimes useful, but it is better to rewrite it as:

`pos(x) = x < 0 ? zero(x) : x`

Julia also has a `one` function, as well as the more general `oftype(x,y)` function, which converts `y` to the type of `x`. The first argument of any of these three functions can be either a value or a type.

### Avoid changing the type of a variable

An analogous "type-stability" problem exists for variables that are reused inside a function:

```
function foo()
    x = 1
    for i = 1:10
        x = x/bar()
    end
    return x
end
```

The local variable `x` starts as an integer, but after one loop iteration it becomes a floating-point number (the result of the `/` operator). This makes it harder for the compiler to optimize the body of the loop. Any of the following fixes will work:

* Initialize `x` with `x = 1.0`
* Declare the type of `x`: `x::Float64 = 1`
* Use an explicit conversion: `x = one(T)`

### Separate kernel functions

Many functions follow the pattern of performing some setup work and then running many iterations of a core computation. Where possible, put the core computation in a separate function. For example, the following function returns an array of a randomly chosen type:

```
function strange_twos(n)
    a = Array(randbool() ? Int64 : Float64, n)
    for i = 1:n
        a[i] = 2
    end
    return a
end
```

This should be written as:

```
function fill_twos!(a)
    for i=1:length(a)
        a[i] = 2
    end
end

function strange_twos(n)
    a = Array(randbool() ? Int64 : Float64, n)
    fill_twos!(a)
    return a
end
```

Julia's compiler specializes code on argument types. In the first implementation, the compiler does not know the type of `a` while the loop runs (since the type is chosen at random). In the second implementation, the inner loop in `fill_twos!` is recompiled for each type of `a`, so it runs much faster. The second form is also better style and lends itself to more code reuse.

This pattern is used in several places in the standard library, for example in the `hvcat_fill` and `fill!` functions in [abstractarray.jl](https://github.com/JuliaLang/julia/blob/master/base/abstractarray.jl). We could have used `fill!` instead of writing our own `fill_twos!` here.

Functions like `strange_twos` often arise when dealing with data of uncertain type, for example data loaded from a file that might contain integers, floats, strings, or something else.

### Access arrays in memory order, along columns

Multidimensional arrays in Julia are stored in column-major order: arrays are stacked one column at a time. This can be verified using the `vec` function or the syntax `[:]`, as shown below (notice that the array is ordered `[1 3 2 4]`, not `[1 2 3 4]`):

```
julia> x = [1 2; 3 4]
2x2 Array{Int64,2}:
 1  2
 3  4

julia> x[:]
4-element Array{Int64,1}:
 1
 3
 2
 4
```

This convention for ordering arrays is common in many languages, such as Fortran, Matlab, and R (to name a few). The alternative to column-major ordering is row-major ordering, which is the convention adopted by C and Python (`numpy`), among other languages. Remembering the ordering of arrays can have significant performance effects when looping over arrays. A rule of thumb to keep in mind is that with column-major arrays, the first index changes most rapidly.
Essentially this means that looping will be faster if the inner-most loop index is the first to appear in a slice expression.

Consider the following contrived example. Imagine we wanted to write a function that accepts a `Vector` and returns a square `Matrix` with either the rows or the columns filled with copies of the input vector. Assume that it is not important whether the rows or the columns are filled with these copies (perhaps the rest of the code can easily be adapted accordingly). We could conceivably do this in at least four ways (in addition to the recommended call to the built-in function `repmat`):

```
function copy_cols{T}(x::Vector{T})
    n = size(x, 1)
    out = Array(eltype(x), n, n)
    for i=1:n
        out[:, i] = x
    end
    out
end

function copy_rows{T}(x::Vector{T})
    n = size(x, 1)
    out = Array(eltype(x), n, n)
    for i=1:n
        out[i, :] = x
    end
    out
end

function copy_col_row{T}(x::Vector{T})
    n = size(x, 1)
    out = Array(T, n, n)
    for col=1:n, row=1:n
        out[row, col] = x[row]
    end
    out
end

function copy_row_col{T}(x::Vector{T})
    n = size(x, 1)
    out = Array(T, n, n)
    for row=1:n, col=1:n
        out[row, col] = x[col]
    end
    out
end
```

Now we will time each of these functions using the same random `10000` by `1` input vector:

```
julia> x = randn(10000);

julia> fmt(f) = println(rpad(string(f)*": ", 14, ' '), @elapsed f(x))

julia> map(fmt, {copy_cols, copy_rows, copy_col_row, copy_row_col});
copy_cols:    0.331706323
copy_rows:    1.799009911
copy_col_row: 0.415630047
copy_row_col: 1.721531501
```

Notice that `copy_cols` is much faster than `copy_rows`. This is expected because `copy_cols` respects the column-based memory layout of the `Matrix` and fills it one column at a time. Additionally, `copy_col_row` is much faster than `copy_row_col` because it follows our rule of thumb that the first element to appear in a slice expression should be coupled with the inner-most loop.
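The rule of thumb can also be seen with plain nested loops, without any slice expressions. The following is a minimal sketch (the function names are illustrative, not from the manual): both functions compute the same sum, but the first keeps the row index in the innermost loop, matching the column-major layout, while the second strides across columns in its inner loop.

```julia
# Innermost loop varies the first (row) index: contiguous memory access.
function sum_cols_first(A)
    s = zero(eltype(A))
    for j = 1:size(A, 2), i = 1:size(A, 1)
        s += A[i, j]
    end
    s
end

# Loops swapped: the inner loop now jumps a whole column's length between
# consecutive accesses, which is slower for large matrices.
function sum_rows_first(A)
    s = zero(eltype(A))
    for i = 1:size(A, 1), j = 1:size(A, 2)
        s += A[i, j]
    end
    s
end
```

Both return identical results; only the memory access pattern, and hence the speed on large inputs, differs.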
### Pre-allocating outputs

If your function returns an `Array` or some other complex type, it may have to allocate memory. Unfortunately, oftentimes allocation and its converse, garbage collection, are substantial bottlenecks.

Sometimes you can circumvent the need to allocate memory on each function call by pre-allocating the output. As a trivial example, compare

```
function xinc(x)
    return [x, x+1, x+2]
end

function loopinc()
    y = 0
    for i = 1:10^7
        ret = xinc(i)
        y += ret[2]
    end
    y
end
```

with

```
function xinc!{T}(ret::AbstractVector{T}, x::T)
    ret[1] = x
    ret[2] = x+1
    ret[3] = x+2
    nothing
end

function loopinc_prealloc()
    ret = Array(Int, 3)
    y = 0
    for i = 1:10^7
        xinc!(ret, i)
        y += ret[2]
    end
    y
end
```

Timing results:

```
julia> @time loopinc()
elapsed time: 1.955026528 seconds (1279975584 bytes allocated)
50000015000000

julia> @time loopinc_prealloc()
elapsed time: 0.078639163 seconds (144 bytes allocated)
50000015000000
```

Pre-allocation has other advantages, for example by allowing the caller to control the "output" type from an algorithm. In the example above, we could have passed a `SubArray` rather than an `Array`, had we so desired. Taken to its extreme, pre-allocation can make your code uglier, so performance measurements and some judgment may be required.

### Avoid string interpolation for I/O

When writing data to a file (or other I/O device), forming extra intermediate strings is a source of overhead. Instead of:

`println(file, "$a $b")`

use:

`println(file, a, " ", b)`

The first version of the code forms a string, then writes it to the file, while the second version writes the values directly to the file. Also notice that in some cases string interpolation can be harder to read.
Consider:

`println(file, "$(f(a))$(f(b))")`

versus:

`println(file, f(a), f(b))`

### Fix deprecation warnings

A deprecated function performs a lookup and prints a warning once, and this hurts performance. It is best to make the changes that the warnings suggest.

### Tweaks

There are some minor points to be aware of that can make inner loops tighter:

* Avoid unnecessary arrays. For example, instead of `sum([x,y,z])`, use `x+y+z`.
* For small integer powers, `*` is better. For example, `x*x*x` is better than `x^3`.
* For a complex number `z`, use `abs2(z)` instead of `abs(z)^2`. In general, for complex arguments, prefer `abs2` to `abs`.
* For integer division, use `div(x,y)` instead of `trunc(x/y)`, `fld(x,y)` instead of `floor(x/y)`, and `cld(x,y)` instead of `ceil(x/y)`.

### Performance Annotations

Sometimes you can enable better optimization by promising certain program properties.

* Use `@inbounds` to eliminate array bounds checking within expressions. Be certain before doing this. If the subscripts are ever out of bounds, you may suffer crashes or silent corruption.
* Write `@simd` in front of `for` loops that are amenable to vectorization. This feature is experimental and could change or disappear in future versions of Julia.

Here is an example with both forms of markup:

```
function inner( x, y )
    s = zero(eltype(x))
    for i=1:length(x)
        @inbounds s += x[i]*y[i]
    end
    s
end

function innersimd( x, y )
    s = zero(eltype(x))
    @simd for i=1:length(x)
        @inbounds s += x[i]*y[i]
    end
    s
end

function timeit( n, reps )
    x = rand(Float32,n)
    y = rand(Float32,n)
    s = zero(Float64)
    time = @elapsed for j in 1:reps
        s += inner(x,y)
    end
    println("GFlop        = ", 2.0*n*reps/time*1E-9)
    time = @elapsed for j in 1:reps
        s += innersimd(x,y)
    end
    println("GFlop (SIMD) = ", 2.0*n*reps/time*1E-9)
end
```

Calling `timeit(1000,1000)` on a computer with a 2.4GHz Intel Core i5 processor produces:

```
GFlop        = 1.9467069505224963
GFlop (SIMD) = 17.578554163920018
```

The range for a `@simd` for loop should be a one-dimensional range. A variable used for accumulating, such as `s` in the example, is called a reduction variable. By using `@simd`, you are asserting several properties of the loop:

* It is safe to execute iterations in arbitrary or overlapping order, with special consideration for reduction variables.
* Floating-point operations on reduction variables can be reordered, possibly causing different results than without `@simd`.
* No iteration ever waits on another iteration to make forward progress.

Using `@simd` merely gives the compiler license to vectorize. Whether it actually does so depends on the compiler. To actually benefit from the current implementation, your loop should have the following additional properties:

* The loop must be an innermost loop.
* The loop body must be straight-line code. This is why `@inbounds` is currently needed for all array accesses.
* Accesses must have a stride pattern and cannot be "gathers" (random-index reads) or "scatters" (random-index writes).
* The stride should be unit stride.
* In some simple cases, for example with 2-3 arrays accessed in a loop, the LLVM auto-vectorization may kick in automatically, leading to no further speedup with `@simd`.
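Returning to the integer-division shortcuts listed under "Tweaks" above, the claimed equivalences are easy to verify directly. This sketch uses the current `trunc(Int, x)`-style syntax for float-to-integer rounding, and the sample values are arbitrary:

```julia
x, y = 17, 5

# Each left-hand form stays in integer arithmetic; the right-hand form
# takes a detour through floating point.
@assert div(x, y) == trunc(Int, x / y)   # truncated division
@assert fld(x, y) == floor(Int, x / y)   # floored division
@assert cld(x, y) == ceil(Int, x / y)    # ceiling division

# For complex z, abs2(z) avoids the square root inside abs(z).
z = 3 + 4im
@assert abs2(z) == abs(z)^2
```

The integer forms matter in inner loops because they avoid allocating or rounding intermediate floating-point values.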