6.4 調度循環 · golan基礎

# 6.4 調度循環所有的初始化工作都已經完成了，是時候啟動運行時調度器了。我們已經知道，當所有準備工作都完成后，最后一個開始執行的引導調用就是`runtime.mstart`了。現在我們來研究一下它在干什么。 ``` TEXT runtime·rt0_go(SB),NOSPLIT,$0 (...) CALL runtime·newproc(SB) // G 初始化 POPQ AX POPQ AX // 啟動 M CALL runtime·mstart(SB) // 開始執行 RET DATA runtime·mainPC+0(SB)/8,$runtime·main(SB) GLOBL runtime·mainPC(SB),RODATA,$8 ``` ## 6.4.1 執行前的準備 ### `mstart`與`mstart1` 在啟動前，在[6.2 初始化](https://golang.design/under-the-hood/zh-cn/part2runtime/ch06sched/init)中我們已經了解到 G 的棧邊界是還沒有初始化的。因此我們必須在開始前計算棧邊界，因此在`mstart1`之前，就是一些確定執行棧邊界的工作。當`mstart1`結束后，會執行`mexit`退出 M。`mstart`也是所有新創建的 M 的起點。 ``` //go:nosplit //go:nowritebarrierrec func mstart() { _g_ := getg() // 終于開始確定執行棧的邊界了 // 通過檢查 g 執行占的邊界來確定是否為系統棧 osStack := _g_.stack.lo == 0 if osStack { // 根據系統棧初始化執行棧的邊界 // cgo 可能會離開 stack.hi // minit 可能會更新棧的邊界 size := _g_.stack.hi if size == 0 { size = 8192 * sys.StackGuardMultiplier } _g_.stack.hi = uintptr(noescape(unsafe.Pointer(&size))) _g_.stack.lo = _g_.stack.hi - size + 1024 } // 初始化棧 guard，進而可以同時調用 Go 或 C 函數。 _g_.stackguard0 = _g_.stack.lo + _StackGuard _g_.stackguard1 = _g_.stackguard0 // 啟動！ mstart1() // 退出線程 if GOOS == "windows" || GOOS == "solaris" || GOOS == "plan9" || GOOS == "darwin" || GOOS == "aix" { // 由于 windows, solaris, darwin, aix 和 plan9 總是系統分配的棧，在在 mstart 之前放進 _g_.stack 的 // 因此上面的邏輯還沒有設置 osStack。 osStack = true } // 退出線程 mexit(osStack) } ``` 再來看`mstart1`。 ``` func mstart1() { _g_ := getg() (...) // 為了在 mcall 的棧頂使用調用方來結束當前線程，做記錄 // 當進入 schedule 之后，我們再也不會回到 mstart1，所以其他調用可以復用當前幀。 save(getcallerpc(), getcallersp()) (...) minit() // 設置信號 handler；在 minit 之后，因為 minit 可以準備處理信號的的線程 if _g_.m == &m0 { mstartm0() } // 執行啟動函數 if fn := _g_.m.mstartfn; fn != nil { fn() } // 如果當前 m 并非 m0，則要求綁定 p if _g_.m != &m0 { // 綁定 p acquirep(_g_.m.nextp.ptr()) _g_.m.nextp = 0 } // 徹底準備好，開始調度，永不返回 schedule() } ``` 幾個需要注意的細節： 1. `mstart`除了在程序引導階段會被運行之外，也可能在每個 m 被創建時運行（本節稍后討論）； 2. `mstart`進入`mstart1`之后，會初始化自身用于信號處理的 g，在`mstartfn`指定時將其執行； 3. 調度循環`schedule`無法返回，因此最后一個`mexit`目前還不會被執行，因此當下所有的 Go 程序創建的線程都無法被釋放（只有一個特例，當使用`runtime.LockOSThread`鎖住的 G 退出時會使用`gogo`退出 M，在本節稍后討論）。關于運行時信號處理，以及 note 同步機制，我們分別在[6.5 信號處理機制](https://golang.design/under-the-hood/zh-cn/part2runtime/ch06sched/signal)和[6.8 同步原語](https://golang.design/under-the-hood/zh-cn/part2runtime/ch06sched/sync)詳細分析。 ### M 與 P 的綁定 M 與 P 的綁定過程只是簡單的將 P 鏈表中的 P ，保存到 M 中的 P 指針上。綁定前，P 的狀態一定是`_Pidle`，綁定后 P 的狀態一定為`_Prunning`。 ``` //go:yeswritebarrierrec func acquirep(_p_ *p) { // 此處不允許 write barrier wirep(_p_) (...) } //go:nowritebarrierrec //go:nosplit func wirep(_p_ *p) { _g_ := getg() (...) // 檢查 m 是否正常，并檢查要獲取的 p 的狀態 if _p_.m != 0 || _p_.status != _Pidle { (...) throw("wirep: invalid p state") } // 將 p 綁定到 m，p 和 m 互相引用 _g_.m.p.set(_p_) // *_g_.m.p = _p_ _p_.m.set(_g_.m) // *_p_.m = _g_.m // 修改 p 的狀態 _p_.status = _Prunning } ``` ### M 的暫止和復始無論出于什么原因，當 M 需要被暫止時，可能（因為還有其他暫止 M 的方法）會執行該調用。此調用會將 M 進行暫止，并阻塞到它被復始時。這一過程就是工作線程的暫止和復始。 ``` // 停止當前 m 的執行，直到新的 work 有效 // 在包含要求的 P 下返回 func stopm() { _g_ := getg() (...) // 將 m 放回到空閑列表中，因為我們馬上就要暫止了 lock(&sched.lock) mput(_g_.m) unlock(&sched.lock) // 暫止當前的 M，在此阻塞，直到被喚醒 notesleep(&_g_.m.park) // 清除暫止的 note noteclear(&_g_.m.park) // 此時已經被復始，說明有任務要執行 // 立即 acquire P acquirep(_g_.m.nextp.ptr()) _g_.m.nextp = 0 } ``` 它的流程也非常簡單，將 m 放回至空閑列表中，而后使用 note 注冊一個暫止通知，阻塞到它重新被復始。 ## 6.4.2 核心調度千辛萬苦，我們終于來到了核心的調度邏輯。 ``` // 調度器的一輪：找到 runnable Goroutine 并進行執行且永不返回 func schedule() { _g_ := getg() (...) // m.lockedg 會在 LockOSThread 下變為非零 if _g_.m.lockedg != 0 { stoplockedm() execute(_g_.m.lockedg.ptr(), false) // 永不返回 } (...) top: if sched.gcwaiting != 0 { // 如果需要 GC，不再進行調度 gcstopm() goto top } if _g_.m.p.ptr().runSafePointFn != 0 { runSafePointFn() } var gp *g var inheritTime bool (...) // 正在 GC，去找 GC 的 g if gp == nil && gcBlackenEnabled != 0 { gp = gcController.findRunnableGCWorker(_g_.m.p.ptr()) } if gp == nil { // 說明不在 GC // // 每調度 61 次，就檢查一次全局隊列，保證公平性 // 否則兩個 Goroutine 可以通過互相 respawn 一直占領本地的 runqueue if _g_.m.p.ptr().schedtick%61 == 0 && sched.runqsize > 0 { lock(&sched.lock) // 從全局隊列中偷 g gp = globrunqget(_g_.m.p.ptr(), 1) unlock(&sched.lock) } } if gp == nil { // 說明不在 gc // 兩種情況： // 1. 普通取 // 2. 全局隊列中偷不到的取 // 從本地隊列中取 gp, inheritTime = runqget(_g_.m.p.ptr()) (...) } if gp == nil { // 如果偷都偷不到，則休眠，在此阻塞 gp, inheritTime = findrunnable() } // 這個時候一定取到 g 了 if _g_.m.spinning { // 如果 m 是自旋狀態，則 // 1. 從自旋到非自旋 // 2. 在沒有自旋狀態的 m 的情況下，再多創建一個新的自旋狀態的 m resetspinning() } if sched.disable.user && !schedEnabled(gp) { // Scheduling of this goroutine is disabled. Put it on // the list of pending runnable goroutines for when we // re-enable user scheduling and look again. lock(&sched.lock) if schedEnabled(gp) { // Something re-enabled scheduling while we // were acquiring the lock. unlock(&sched.lock) } else { sched.disable.runnable.pushBack(gp) sched.disable.n++ unlock(&sched.lock) goto top } } if gp.lockedm != 0 { // 如果 g 需要 lock 到 m 上，則會將當前的 p // 給這個要 lock 的 g // 然后阻塞等待一個新的 p startlockedm(gp) goto top } // 開始執行 execute(gp, inheritTime) } ``` 先不管上面究竟做了什么，我們直接看最后一句的`execute`。 ``` // 在當前 M 上調度 gp。 // 如果 inheritTime 為 true，則 gp 繼承剩余的時間片。否則從一個新的時間片開始 // 永不返回。 // //go:yeswritebarrierrec func execute(gp *g, inheritTime bool) { _g_ := getg() // 將 g 正式切換為 _Grunning 狀態 casgstatus(gp, _Grunnable, _Grunning) gp.waitsince = 0 // 搶占信號 gp.preempt = false gp.stackguard0 = gp.stack.lo + _StackGuard if !inheritTime { _g_.m.p.ptr().schedtick++ } _g_.m.curg = gp gp.m = _g_.m // profiling 相關 hz := sched.profilehz if _g_.m.profilehz != hz { setThreadCPUProfiler(hz) } (...) // 終于開始執行了 gogo(&gp.sched) } ``` 當開始執行`execute`后，g 會被切換到`_Grunning`狀態。設置自身的搶占信號，將 m 和 g 進行綁定。最終調用`gogo`開始執行。在 amd64 平臺下的實現： ``` // void gogo(Gobuf*) // 從 Gobuf 恢復狀態; longjmp TEXT runtime·gogo(SB), NOSPLIT, $16-8 MOVQ buf+0(FP), BX // 運行現場 MOVQ gobuf_g(BX), DX MOVQ 0(DX), CX // 確認 g != nil get_tls(CX) MOVQ DX, g(CX) MOVQ gobuf_sp(BX), SP // 恢復 SP MOVQ gobuf_ret(BX), AX MOVQ gobuf_ctxt(BX), DX MOVQ gobuf_bp(BX), BP MOVQ $0, gobuf_sp(BX) // 清理，輔助 GC MOVQ $0, gobuf_ret(BX) MOVQ $0, gobuf_ctxt(BX) MOVQ $0, gobuf_bp(BX) MOVQ gobuf_pc(BX), BX // 獲取 g 要執行的函數的入口地址 JMP BX // 開始執行 ``` 這個`gogo`的實現真是非常巧妙。初次閱讀時，看到`JMP BX`開始執行 Goroutine 函數體后就沒了，簡直一臉疑惑，就這么沒了？后續調用怎么回到調度器呢？事實上我們已經在[6.2 初始化](https://golang.design/under-the-hood/zh-cn/part2runtime/ch06sched/init)一節中看到過相關操作了： ``` func newproc1(fn *funcval, argp *uint8, narg int32, callergp *g, callerpc uintptr) { siz := narg siz = (siz + 7) &^ 7 (...) totalSize := 4*sys.RegSize + uintptr(siz) + sys.MinFrameSize totalSize += -totalSize & (sys.SpAlign - 1) sp := newg.stack.hi - totalSize spArg := sp (...) memclrNoHeapPointers(unsafe.Pointer(&newg.sched), unsafe.Sizeof(newg.sched)) newg.sched.sp = sp newg.stktopsp = sp newg.sched.pc = funcPC(goexit) + sys.PCQuantum newg.sched.g = guintptr(unsafe.Pointer(newg)) gostartcallfn(&newg.sched, fn) (...) } ``` 在執行`gostartcallfn`之前，棧幀狀態為： ~~~ +--------+ | | --- --- newg.stack.hi +--------+ | | | | | | +--------+ | | | | | | siz +--------+ | | | | | | +--------+ | | | | | --- +--------+ | | | | +--------+ | totalSize = 4*sys.PtrSize + siz | | | +--------+ | | | | +--------+ | | | --- +--------+ 高地址 SP | | 假想的調用方棧幀 +--------+ --------------------------------------------- | | fn 棧幀 +--------+ | | 低地址 .... +--------+ PC | goexit | +--------+ ~~~ 當執行`gostartcallfn`后： ``` func gostartcallfn(gobuf *gobuf, fv *funcval) { var fn unsafe.Pointer if fv != nil { fn = unsafe.Pointer(fv.fn) } else { fn = unsafe.Pointer(funcPC(nilfunc)) } gostartcall(gobuf, fn, unsafe.Pointer(fv)) } func gostartcall(buf *gobuf, fn, ctxt unsafe.Pointer) { sp := buf.sp if sys.RegSize > sys.PtrSize { sp -= sys.PtrSize *(*uintptr)(unsafe.Pointer(sp)) = 0 } sp -= sys.PtrSize *(*uintptr)(unsafe.Pointer(sp)) = buf.pc buf.sp = sp buf.pc = uintptr(fn) buf.ctxt = ctxt } ``` 此時保存的堆棧的情況如下： ~~~ +--------+ | | --- --- newg.stack.hi +--------+ | | | | | | +--------+ | | | | | | siz +--------+ | | | | | | +--------+ | | | | | --- +--------+ | | | | +--------+ | totalSize = 4*sys.PtrSize + siz | | | +--------+ | | | | +--------+ | | | --- +--------+ 高地址 | goexit | 假想的調用方棧幀 +--------+ --------------------------------------------- SP | | fn 棧幀 +--------+ | | 低地址 .... +--------+ PC | fn | +--------+ ~~~ 可以看到，在執行現場`sched.sp`保存的其實是`goexit`的地址。那么也就是`JMP`跳轉到 PC 寄存器處，開始執行`fn`。當`fn`執行完畢后，會將（假想的）調用方`goexit`的地址恢復到 PC，從而達到執行`goexit`的目的： ``` // 在 Goroutine 返回 goexit + PCQuantum 時運行的最頂層函數。 TEXT runtime·goexit(SB),NOSPLIT,$0-0 BYTE $0x90 // NOP CALL runtime·goexit1(SB) // 不會返回 // traceback from goexit1 must hit code range of goexit BYTE $0x90 // NOP ``` 那么接下來就是去執行`goexit1`了： ``` // 完成當前 Goroutine 的執行 func goexit1() { (...) // 開始收尾工作 mcall(goexit0) } ``` 通過`mcall`完成`goexit0`的調用： ``` // func mcall(fn func(*g)) // 切換到 m->g0 棧, 并調用 fn(g). // Fn 必須永不返回. 它應該使用 gogo(&g->sched) 來持續運行 g TEXT runtime·mcall(SB), NOSPLIT, $0-4 MOVL fn+0(FP), DI get_tls(DX) MOVL g(DX), AX // 在 g->sched 中保存狀態 MOVL 0(SP), BX // 調用方 PC MOVL BX, (g_sched+gobuf_pc)(AX) LEAL fn+0(FP), BX // 調用方 SP MOVL BX, (g_sched+gobuf_sp)(AX) MOVL AX, (g_sched+gobuf_g)(AX) // 切換到 m->g0 及其棧，調用 fn MOVL g(DX), BX MOVL g_m(BX), BX MOVL m_g0(BX), SI CMPL SI, AX // 如果 g == m->g0 要調用 badmcall JNE 3(PC) MOVL $runtime·badmcall(SB), AX JMP AX MOVL SI, g(DX) // g = m->g0 MOVL (g_sched+gobuf_sp)(SI), SP // sp = m->g0->sched.sp PUSHL AX MOVL DI, DX MOVL 0(DI), DI CALL DI // 好了，開始調用 fn POPL AX MOVL $runtime·badmcall2(SB), AX JMP AX RET ``` 于是最終開始`goexit0`： ``` // goexit 繼續在 g0 上執行 func goexit0(gp *g) { _g_ := getg() // 切換當前的 g 為 _Gdead casgstatus(gp, _Grunning, _Gdead) if isSystemGoroutine(gp, false) { atomic.Xadd(&sched.ngsys, -1) } // 清理 gp.m = nil locked := gp.lockedm != 0 gp.lockedm = 0 _g_.m.lockedg = 0 gp.paniconfault = false gp._defer = nil // 應該已經為 true，但以防萬一 gp._panic = nil // Goexit 中 panic 則不為 nil，指向棧分配的數據 gp.writebuf = nil gp.waitreason = 0 gp.param = nil gp.labels = nil gp.timer = nil if gcBlackenEnabled != 0 && gp.gcAssistBytes > 0 { // 刷新 assist credit 到全局池。 // 如果應用在快速創建 Goroutine，這可以為 pacing 提供更好的信息。 scanCredit := int64(gcController.assistWorkPerByte * float64(gp.gcAssistBytes)) atomic.Xaddint64(&gcController.bgScanCredit, scanCredit) gp.gcAssistBytes = 0 } // 注意 gp 的棧 scan 目前開始變為 valid，因為它沒有棧了 gp.gcscanvalid = true // 解綁 m 和 g dropg() if GOARCH == "wasm" { // wasm 目前還沒有線程支持 // 將 g 扔進 gfree 鏈表中等待復用 gfput(_g_.m.p.ptr(), gp) // 再次進行調度 schedule() // 永不返回 } (...) // 將 g 扔進 gfree 鏈表中等待復用 gfput(_g_.m.p.ptr(), gp) if locked { // 該 Goroutine 可能在當前線程上鎖住，因為它可能導致了不正常的內核狀態 // 這時候 kill 該線程，而非將 m 放回到線程池。 // 此舉會返回到 mstart，從而釋放當前的 P 并退出該線程 if GOOS != "plan9" { // See golang.org/issue/22227. gogo(&_g_.m.g0.sched) } else { // 因為我們可能已重用此線程結束，在 plan9 上清除 lockedExt _g_.m.lockedExt = 0 } } // 再次進行調度 schedule() } ``` 退出的善后工作也相對簡單，無非就是復位 g 的狀態、解綁 m 和 g，將其放入 gfree 鏈表中等待其他的 go 語句創建新的 g。如果 Goroutine 將自身鎖在同一個 OS 線程中且沒有自行解綁則 m 會退出，而不會被放回到線程池中。相反，會再次調用 gogo 切換到 g0 執行現場中，這也是目前唯一的退出 m 的機會，在本節最后解釋。 ### 偷取工作全局 g 鏈式隊列中取 max 個 g ，其中第一個用于執行，max-1 個放入本地隊列。如果放不下，則只在本地隊列中放下能放的。過程比較簡單： ``` // 從全局隊列中偷取，調用時必須鎖住調度器 func globrunqget(_p_ *p, max int32) *g { // 如果全局隊列中沒有 g 直接返回 if sched.runqsize == 0 { return nil } // per-P 的部分，如果只有一個 P 的全部取 n := sched.runqsize/gomaxprocs + 1 if n > sched.runqsize { n = sched.runqsize } // 不能超過取的最大個數 if max > 0 && n > max { n = max } // 計算能不能在本地隊列中放下 n 個 if n > int32(len(_p_.runq))/2 { n = int32(len(_p_.runq)) / 2 } // 修改本地隊列的剩余空間 sched.runqsize -= n // 拿到全局隊列隊頭 g gp := sched.runq.pop() // 計數 n-- // 繼續取剩下的 n-1 個全局隊列放入本地隊列 for ; n > 0; n-- { gp1 := sched.runq.pop() runqput(_p_, gp1, false) } return gp } ``` 從本地隊列中取，首先看 next 是否有已經安排要運行的 g ，如果有，則返回下一個要運行的 g 否則，以 cas 的方式從本地隊列中取一個 g。如果是已經安排要運行的 g，則繼承剩余的可運行時間片進行運行；否則以一個新的時間片來運行。 ``` // 從本地可運行隊列中獲取 g // 如果 inheritTime 為 true，則 g 繼承剩余的時間片 // 否則開始一個新的時間片。在所有者 P 上執行 func runqget(_p_ *p) (gp *g, inheritTime bool) { // 如果有 runnext，則為下一個要運行的 g for { // 下一個 g next := _p_.runnext // 沒有，break if next == 0 { break } // 如果 cas 成功，則 g 繼承剩余時間片執行 if _p_.runnext.cas(next, 0) { return next.ptr(), true } } // 沒有 next for { // 本地隊列是空，返回 nil h := atomic.LoadAcq(&_p_.runqhead) // load-acquire, 與其他消費者同步 t := _p_.runqtail if t == h { return nil, false } // 從本地隊列中以 cas 方式拿一個 gp := _p_.runq[h%uint32(len(_p_.runq))].ptr() if atomic.CasRel(&_p_.runqhead, h, h+1) { // cas-release, 提交消費 return gp, false } } } ``` 偷取（steal）的實現是一個非常復雜的過程。這個過程來源于我們需要仔細的思考什么時候對調度器進行加鎖、什么時候對 m 進行暫止、什么時候將 m 從自旋向非自旋切換等等。 ``` // 尋找一個可運行的 Goroutine 來執行。 // 嘗試從其他的 P 偷取、從全局隊列中獲取、poll 網絡 func findrunnable() (gp *g, inheritTime bool) { _g_ := getg() // 這里的條件與 handoffp 中的條件必須一致： // 如果 findrunnable 將返回 G 運行，handoffp 必須啟動 M. top: _p_ := _g_.m.p.ptr() // 如果在 gc，則暫止當前 m，直到復始后回到 top if sched.gcwaiting != 0 { gcstopm() goto top } if _p_.runSafePointFn != 0 { runSafePointFn() } if fingwait && fingwake { if gp := wakefing(); gp != nil { ready(gp, 0, true) } } // cgo 調用被終止，繼續進入 if *cgo_yield != nil { asmcgocall(*cgo_yield, nil) } // 取本地隊列 local runq，如果已經拿到，立刻返回 if gp, inheritTime := runqget(_p_); gp != nil { return gp, inheritTime } // 全局隊列 global runq，如果已經拿到，立刻返回 if sched.runqsize != 0 { lock(&sched.lock) gp := globrunqget(_p_, 0) unlock(&sched.lock) if gp != nil { return gp, false } } // Poll 網絡，優先級比從其他 P 中偷要高。 // 在我們嘗試去其他 P 偷之前，這個 netpoll 只是一個優化。 // 如果沒有 waiter 或 netpoll 中的線程已被阻塞，則可以安全地跳過它。 // 如果有任何類型的邏輯競爭與被阻塞的線程（例如它已經從 netpoll 返回，但尚未設置 lastpoll） // 該線程無論如何都將阻塞 netpoll。 if netpollinited() && atomic.Load(&netpollWaiters) > 0 && atomic.Load64(&sched.lastpoll) != 0 { if list := netpoll(false); !list.empty() { // 無阻塞 gp := list.pop() injectglist(gp.schedlink.ptr()) casgstatus(gp, _Gwaiting, _Grunnable) (...) return gp, false } } // 從其他 P 中偷 work procs := uint32(gomaxprocs) // 獲得 p 的數量 if atomic.Load(&sched.npidle) == procs-1 { // GOMAXPROCS = 1 或除了我們之外的所有人都已經 idle 了。 // 新的 work 可能出現在 syscall/cgocall/網絡/timer返回時 // 它們均沒有提交到本地運行隊列，因此偷取沒有任何意義。 goto stop } // 如果自旋狀態下 m 的數量 >= busy 狀態下 p 的數量，直接進入阻塞 // 該步驟是有必要的，它用于當 GOMAXPROCS>>1 時但程序的并行機制很慢時 // 昂貴的 CPU 消耗。 if !_g_.m.spinning && 2*atomic.Load(&sched.nmspinning) >= procs-atomic.Load(&sched.npidle) { goto stop } // 如果 m 是非自旋狀態，切換為自旋 if !_g_.m.spinning { _g_.m.spinning = true atomic.Xadd(&sched.nmspinning, 1) } for i := 0; i < 4; i++ { // 隨機偷 for enum := stealOrder.start(fastrand()); !enum.done(); enum.next() { // 已經進入了 GC? 回到頂部，暫止當前的 m if sched.gcwaiting != 0 { goto top } stealRunNextG := i > 2 // 如果偷了兩輪都偷不到，便優先查找 ready 隊列 if gp := runqsteal(_p_, allp[enum.position()], stealRunNextG); gp != nil { // 總算偷到了，立即返回 return gp, false } } } stop: // 沒有任何 work 可做。 // 如果我們在 GC mark 階段，則可以安全的掃描并 blacken 對象 // 然后便有 work 可做，運行 idle-time 標記而非直接放棄當前的 P。 if gcBlackenEnabled != 0 && _p_.gcBgMarkWorker != 0 && gcMarkWorkAvailable(_p_) { _p_.gcMarkWorkerMode = gcMarkWorkerIdleMode gp := _p_.gcBgMarkWorker.ptr() casgstatus(gp, _Gwaiting, _Grunnable) (...) return gp, false } // 僅限于 wasm // 如果一個回調返回后沒有其他 Goroutine 是蘇醒的 // 則暫停執行直到回調被觸發。 if beforeIdle() { // 至少一個 Goroutine 被喚醒 goto top } // 放棄當前的 P 之前，對 allp 做一個快照 // 一旦我們不再阻塞在 safe-point 時候，可以立刻在下面進行修改 allpSnapshot := allp // 準備歸還 p，對調度器加鎖 lock(&sched.lock) // 進入了 gc，回到頂部暫止 m if sched.gcwaiting != 0 || _p_.runSafePointFn != 0 { unlock(&sched.lock) goto top } // 全局隊列中又發現了任務 if sched.runqsize != 0 { gp := globrunqget(_p_, 0) unlock(&sched.lock) // 趕緊偷掉返回 return gp, false } // 歸還當前的 p if releasep() != _p_ { throw("findrunnable: wrong p") } // 將 p 放入 idle 鏈表 pidleput(_p_) // 完成歸還，解鎖 unlock(&sched.lock) // 這里要非常小心: // 線程從自旋到非自旋狀態的轉換，可能與新 Goroutine 的提交同時發生。 // 我們必須首先丟棄 nmspinning，然后再次檢查所有的 per-P 隊列（并在期間伴隨 #StoreLoad 內存屏障） // 如果反過來，其他線程可以在我們檢查了所有的隊列、然后提交一個 Goroutine、再丟棄了 nmspinning // 進而導致無法復始一個線程來運行那個 Goroutine 了。 // 如果我們發現下面的新 work，我們需要恢復 m.spinning 作為重置的信號， // 以取消暫止新的工作線程（因為可能有多個 starving 的 Goroutine）。 // 但是，如果在發現新 work 后我們也觀察到沒有空閑 P，可以暫停當前線程 // 因為系統已滿載，因此不需要自旋線程。 wasSpinning := _g_.m.spinning if _g_.m.spinning { _g_.m.spinning = false if int32(atomic.Xadd(&sched.nmspinning, -1)) < 0 { throw("findrunnable: negative nmspinning") } } // 再次檢查所有的 runqueue for _, _p_ := range allpSnapshot { // 如果這時本地隊列不空 if !runqempty(_p_) { // 重新獲取 p lock(&sched.lock) _p_ = pidleget() unlock(&sched.lock) // 如果能獲取到 p if _p_ != nil { // 綁定 p acquirep(_p_) // 如果此前已經被切換為自旋 if wasSpinning { // 重新切換回非自旋 _g_.m.spinning = true atomic.Xadd(&sched.nmspinning, 1) } // 這時候是有 work 的，回到頂部重新 find g goto top } // 看來沒有 idle 的 p，不需要重新 find g 了 break } } // 再次檢查 idle-priority GC work // 和上面重新找 runqueue 的邏輯類似 if gcBlackenEnabled != 0 && gcMarkWorkAvailable(nil) { lock(&sched.lock) _p_ = pidleget() if _p_ != nil && _p_.gcBgMarkWorker == 0 { pidleput(_p_) _p_ = nil } unlock(&sched.lock) if _p_ != nil { acquirep(_p_) if wasSpinning { _g_.m.spinning = true atomic.Xadd(&sched.nmspinning, 1) } // Go back to idle GC check. goto stop } } // poll 網絡 // 和上面重新找 runqueue 的邏輯類似 if netpollinited() && atomic.Load(&netpollWaiters) > 0 && atomic.Xchg64(&sched.lastpoll, 0) != 0 { (...) gp := netpoll(true) // 阻塞到新的 work 有效為止 atomic.Store64(&sched.lastpoll, uint64(nanotime())) if gp != nil { lock(&sched.lock) _p_ = pidleget() unlock(&sched.lock) if _p_ != nil { acquirep(_p_) injectglist(gp.schedlink.ptr()) casgstatus(gp, _Gwaiting, _Grunnable) (...) return gp, false } injectglist(gp) } } // 真的什么都沒找到 // 暫止當前的 m stopm() goto top } ``` 在`findrunnable`這個過程中，我們： * 首先檢查是是否正在進行 GC，如果是則暫止當前的 m 并阻塞休眠； * 嘗試從本地隊列中取 g，如果取到，則直接返回，否則繼續從全局隊列中找 g，如果找到則直接返回； * 檢查是否存在 poll 網絡的 g，如果有，則直接返回； * 如果此時仍然無法找到 g，則從其他 P 的本地隊列中偷取； * 從其他 P 本地隊列偷取的工作會執行四輪，在前兩輪中只會查找 runnable 隊列，后兩輪則會優先查找 ready 隊列，如果找到，則直接返回； * 所有的可能性都嘗試過了，在準備暫止 m 之前，還要進行額外的檢查； * 首先檢查此時是否是 GC mark 階段，如果是，則直接返回 mark 階段的 g； * 如果仍然沒有，則對當前的 p 進行快照，準備對調度器進行加鎖； * 當調度器被鎖住后，我們仍然還需再次檢查這段時間里是否有進入 GC，如果已經進入了 GC，則回到第一步，阻塞 m 并休眠； * 當調度器被鎖住后，如果我們又在全局隊列中發現了 g，則直接返回； * 當調度器被鎖住后，我們徹底找不到任務了，則歸還釋放當前的 P，將其放入 idle 鏈表中，并解鎖調度器； * 當 M/P 已經解綁后，我們需要將 m 的狀態切換出自旋狀態，并減少 nmspinning； * 此時我們仍然需要重新檢查所有的隊列； * 如果此時我們發現有一個 P 隊列不空，則立刻嘗試獲取一個 P，如果獲取到，則回到第一步，重新執行偷取工作，如果取不到，則說明系統已經滿載，無需繼續進行調度； * 同樣，我們還需要再檢查是否有 GC mark 的 g 出現，如果有，獲取 P 并回到第一步，重新執行偷取工作； * 同樣，我們還需要再檢查是否存在 poll 網絡的 g，如果有，則直接返回； * 終于，我們什么也沒找到，暫止當前的 m 并阻塞休眠。 ### M 的喚醒我們已經看到了 M 的暫止和復始的過程，那么 M 的自旋到非自旋的過程如何發生？ ``` func resetspinning() { _g_ := getg() (...) _g_.m.spinning = false nmspinning := atomic.Xadd(&sched.nmspinning, -1) (...) // M wakeup policy is deliberately somewhat conservative, so check if we // need to wakeup another P here. See "Worker thread parking/unparking" // comment at the top of the file for details. if nmspinning == 0 && atomic.Load(&sched.npidle) > 0 { wakep() } } // 嘗試將一個或多個 P 喚醒來執行 G // 當 G 可能運行時（newproc, ready）時調用該函數 func wakep() { // 對自旋線程保守一些，必要時只增加一個 // 如果失敗，則立即返回 if !atomic.Cas(&sched.nmspinning, 0, 1) { return } startm(nil, true) } // Schedules some M to run the p (creates an M if necessary). // If p==nil, tries to get an idle P, if no idle P's does nothing. // May run with m.p==nil, so write barriers are not allowed. // If spinning is set, the caller has incremented nmspinning and startm will // either decrement nmspinning or set m.spinning in the newly started M. //go:nowritebarrierrec func startm(_p_ *p, spinning bool) { lock(&sched.lock) if _p_ == nil { _p_ = pidleget() if _p_ == nil { unlock(&sched.lock) if spinning { // The caller incremented nmspinning, but there are no idle Ps, // so it's okay to just undo the increment and give up. if int32(atomic.Xadd(&sched.nmspinning, -1)) < 0 { throw("startm: negative nmspinning") } } return } } mp := mget() unlock(&sched.lock) if mp == nil { var fn func() if spinning { // The caller incremented nmspinning, so set m.spinning in the new M. fn = mspinning } newm(fn, _p_) return } (...) if spinning && !runqempty(_p_) { throw("startm: p has runnable gs") } // The caller incremented nmspinning, so set m.spinning in the new M. mp.spinning = spinning mp.nextp.set(_p_) notewakeup(&mp.park) } // 嘗試從 midel 列表中獲取一個 M // 調度器必須鎖住 // 可能在 STW 期間運行，故不允許 write barrier //go:nowritebarrierrec func mget() *m { mp := sched.midle.ptr() if mp != nil { sched.midle = mp.schedlink sched.nmidle-- } return mp } ``` ### M 的創生 M 是通過`newm`來創生的，一般情況下，能夠非常簡單的創建，某些特殊情況（線程狀態被污染），M 的創建需要一個叫做模板線程的功能加以配合，我們在[6.4 線程管理](https://golang.design/under-the-hood/zh-cn/part2runtime/ch06sched/thread)中詳細討論： ``` // 創建一個新的 m. 它會啟動并調用 fn 或調度器 // fn 必須是靜態、非堆上分配的閉包 // 它可能在 m.p==nil 時運行，因此不允許 write barrier //go:nowritebarrierrec func newm(fn func(), _p_ *p) { // 分配一個 m mp := allocm(_p_, fn) // 設置 p 用于后續綁定 mp.nextp.set(_p_) // 設置 signal mask mp.sigmask = initSigmask if gp := getg(); gp != nil && gp.m != nil && (gp.m.lockedExt != 0 || gp.m.incgo) && GOOS != "plan9" { // 我們處于一個鎖定的 M 或可能由 C 啟動的線程。這個線程的內核狀態可能 // 很奇怪（用戶可能已將其鎖定）。我們不想將其克隆到另一個線程。 // 相反，請求一個已知狀態良好的線程來創建給我們的線程。 // // 在 plan9 上禁用，見 golang.org/issue/22227 lock(&newmHandoff.lock) if newmHandoff.haveTemplateThread == 0 { throw("on a locked thread with no template thread") } mp.schedlink = newmHandoff.newm newmHandoff.newm.set(mp) if newmHandoff.waiting { newmHandoff.waiting = false // 喚醒 m, 自旋到非自旋 notewakeup(&newmHandoff.wake) } unlock(&newmHandoff.lock) return } newm1(mp) } ``` ``` // Allocate a new m unassociated with any thread. // Can use p for allocation context if needed. // fn is recorded as the new m's m.mstartfn. // // This function is allowed to have write barriers even if the caller // isn't because it borrows _p_. // //go:yeswritebarrierrec func allocm(_p_ *p, fn func()) *m { _g_ := getg() _g_.m.locks++ // disable GC because it can be called from sysmon if _g_.m.p == 0 { acquirep(_p_) // temporarily borrow p for mallocs in this function } // Release the free M list. We need to do this somewhere and // this may free up a stack we can use. if sched.freem != nil { lock(&sched.lock) var newList *m for freem := sched.freem; freem != nil; { if freem.freeWait != 0 { next := freem.freelink freem.freelink = newList newList = freem freem = next continue } stackfree(freem.g0.stack) freem = freem.freelink } sched.freem = newList unlock(&sched.lock) } mp := new(m) mp.mstartfn = fn mcommoninit(mp) // In case of cgo or Solaris or Darwin, pthread_create will make us a stack. // Windows and Plan 9 will layout sched stack on OS stack. if iscgo || GOOS == "solaris" || GOOS == "windows" || GOOS == "plan9" || GOOS == "darwin" { mp.g0 = malg(-1) } else { mp.g0 = malg(8192 * sys.StackGuardMultiplier) } mp.g0.m = mp if _p_ == _g_.m.p.ptr() { releasep() } _g_.m.locks-- if _g_.m.locks == 0 && _g_.preempt { // restore the preemption request in case we've cleared it in newstack _g_.stackguard0 = stackPreempt } return mp } ``` ``` func newm1(mp *m) { if iscgo { var ts cgothreadstart if _cgo_thread_start == nil { throw("_cgo_thread_start missing") } ts.g.set(mp.g0) ts.tls = (*uint64)(unsafe.Pointer(&mp.tls[0])) ts.fn = unsafe.Pointer(funcPC(mstart)) if msanenabled { msanwrite(unsafe.Pointer(&ts), unsafe.Sizeof(ts)) } execLock.rlock() // Prevent process clone. asmcgocall(_cgo_thread_start, unsafe.Pointer(&ts)) execLock.runlock() return } execLock.rlock() // Prevent process clone. newosproc(mp) execLock.runlock() } ``` 當 m 被創建時，會轉去運行`mstart`： * 如果當前程序為 cgo 程序，則會通過`asmcgocall`來創建線程并調用`mstart`（在[10.4 cgo](https://golang.design/under-the-hood/zh-cn/part2runtime/ch10abi/cgo)中討論） * 否則會調用`newosproc`來創建線程，從而調用`mstart`。既然是`newosproc`，我們此刻仍在 Go 的空間中，那么實現就是操作系統特定的了， ##### 系統線程的創建 (darwin) ``` // 可能在 m.p==nil 情況下運行，因此不允許 write barrier //go:nowritebarrierrec func newosproc(mp *m) { stk := unsafe.Pointer(mp.g0.stack.hi) (...) // 初始化 attribute 對象 var attr pthreadattr var err int32 err = pthread_attr_init(&attr) if err != 0 { write(2, unsafe.Pointer(&failthreadcreate[0]), int32(len(failthreadcreate))) exit(1) } // 設置想要使用的棧大小。目前為 64KB const stackSize = 1 << 16 if pthread_attr_setstacksize(&attr, stackSize) != 0 { write(2, unsafe.Pointer(&failthreadcreate[0]), int32(len(failthreadcreate))) exit(1) } // 通知 pthread 庫不會 join 這個線程。 if pthread_attr_setdetachstate(&attr, _PTHREAD_CREATE_DETACHED) != 0 { write(2, unsafe.Pointer(&failthreadcreate[0]), int32(len(failthreadcreate))) exit(1) } // 最后創建線程，在 mstart_stub 開始，進行底層設置并調用 mstart var oset sigset sigprocmask(_SIG_SETMASK, &sigset_all, &oset) err = pthread_create(&attr, funcPC(mstart_stub), unsafe.Pointer(mp)) sigprocmask(_SIG_SETMASK, &oset, nil) if err != 0 { write(2, unsafe.Pointer(&failthreadcreate[0]), int32(len(failthreadcreate))) exit(1) } } ``` `pthread_create`在[10.1 參與運行時的系統調用](https://golang.design/under-the-hood/zh-cn/part2runtime/ch10abi/syscallrt)中討論。 ##### 系統線程的創建 (linux) 而 linux 上的情況就樂觀的多了： ``` // May run with m.p==nil, so write barriers are not allowed. //go:nowritebarrier func newosproc(mp *m) { stk := unsafe.Pointer(mp.g0.stack.hi) (...) // 在 clone 期間禁用信號，以便新線程啟動時信號被禁止。 // 他們會在 minit 中重新啟用。 var oset sigset sigprocmask(_SIG_SETMASK, &sigset_all, &oset) ret := clone(cloneFlags, stk, unsafe.Pointer(mp), unsafe.Pointer(mp.g0), unsafe.Pointer(funcPC(mstart))) sigprocmask(_SIG_SETMASK, &oset, nil) (...) } ``` `clone`是系統調用，我們在[10.1 參與運行時的系統調用](https://golang.design/under-the-hood/zh-cn/part2runtime/ch10abi/syscallrt)中討論這些系統調用在 Go 中的實現方式。 ### M/G 解綁 `dropg`聽起來很玄乎，但實際上就是指將當前 g 的 m 置空、將當前 m 的 g 置空，從而完成解綁： ``` // dropg 移除 m 與當前 Goroutine m->curg（簡稱 gp ）之間的關聯。 // 通常，調用方將 gp 的狀態設置為非 _Grunning 后立即調用 dropg 完成工作。 // 調用方也有責任在 gp 將使用 ready 時重新啟動時進行相關安排。 // 在調用 dropg 并安排 gp ready 好后，調用者可以做其他工作，但最終應該 // 調用 schedule 來重新啟動此 m 上的 Goroutine 的調度。 func dropg() { _g_ := getg() setMNoWB(&_g_.m.curg.m, nil) setGNoWB(&_g_.m.curg, nil) } // setMNoWB 當使用 muintptr 不可行時，在沒有 write barrier 下執行 *mp = new //go:nosplit //go:nowritebarrier func setMNoWB(mp **m, new *m) { (*muintptr)(unsafe.Pointer(mp)).set(new) } // setGNoWB 當使用 guintptr 不可行時，在沒有 write barrier 下執行 *gp = new //go:nosplit //go:nowritebarrier func setGNoWB(gp **g, new *g) { (*guintptr)(unsafe.Pointer(gp)).set(new) } ``` ### M 的死亡我們已經多次提到過 m 當且僅當它所運行的 Goroutine 被鎖定在該 m 且 Goroutine 退出后， m 才會退出。我們來看一看它的原因。首先，我們已經知道調度循環會一直進行下去永遠不會返回了： ``` func mstart() { (...) mstart1() // 永不返回 (...) mexit(osStack) } ``` 那`mexit`究竟什么時候會被執行？事實上，在`mstart1`中： ``` func mstart1() { (...) // 為了在 mcall 的棧頂使用調用方來結束當前線程，做記錄 // 當進入 schedule 之后，我們再也不會回到 mstart1，所以其他調用可以復用當前幀。 save(getcallerpc(), getcallersp()) (...) } ``` `save`記錄了調用方的 pc 和 sp，而對于`save`： ``` // getcallerpc 返回它調用方的調用方程序計數器 PC program conter // getcallersp 返回它調用方的調用方的棧指針 SP stack pointer // 實現由編譯器內建，在任何平臺上都沒有實現它的代碼 // // 例如: // // func f(arg1, arg2, arg3 int) { // pc := getcallerpc() // sp := getcallersp() // } // // 這兩行會尋找調用 f 的 PC 和 SP // // 調用 getcallerpc 和 getcallersp 必須被詢問的幀中完成 // // getcallersp 的結果在返回時是正確的，但是它可能會被任何隨后調用的函數無效， // 因為它可能會重新定位堆棧，以使其增長或縮小。一般規則是，getcallersp 的結果 // 應該立即使用，并且只能傳遞給 nosplit 函數。 //go:noescape func getcallerpc() uintptr //go:noescape func getcallersp() uintptr // implemented as an intrinsic on all platforms // save 更新了 getg().sched 的 pc 和 sp 的指向，并允許 gogo 能夠恢復到 pc 和 sp // // save 不允許 write barrier 因為 write barrier 會破壞 getg().sched // //go:nosplit //go:nowritebarrierrec func save(pc, sp uintptr) { _g_ := getg() // 保存當前運行現場 _g_.sched.pc = pc _g_.sched.sp = sp _g_.sched.lr = 0 _g_.sched.ret = 0 // 保存 g _g_.sched.g = guintptr(unsafe.Pointer(_g_)) // 我們必須確保 ctxt 為零，但這里不允許 write barrier。 // 所以這里只是做一個斷言 if _g_.sched.ctxt != nil { badctxt() } } ``` 由于`mstart/mstart1`是運行在 g0 上的，因此`save`將保存`mstart`的運行現場保存到`g0.sched`中。當調度循環執行到`goexit0`時，會檢查 m 與 g 之間是否被鎖住： ``` func goexit0(gp *g) { (...) gfput(_g_.m.p.ptr(), gp) if locked { if GOOS != "plan9" { gogo(&_g_.m.g0.sched) } } schedule() } ``` 如果 g 鎖在當前 m 上，則調用`gogo`恢復到`g0.sched`的執行現場，從而恢復到`mexit`調用。最后來看`mexit`： ``` // mexit 銷毀并退出當前線程 // // 請不要直接調用來退出線程，因為它必須在線程棧頂上運行。 // 相反，請使用 gogo(&_g_.m.g0.sched) 來解除棧并退出線程。 // // 當調用時，m.p != nil。因此可以使用 write barrier。 // 在退出前它會釋放當前綁定的 P。 // //go:yeswritebarrierrec func mexit(osStack bool) { g := getg() m := g.m if m == &m0 { // 主線程 // // 在 linux 中，退出主線程會導致進程變為僵尸進程。 // 在 plan 9 中，退出主線程將取消阻塞等待，即使其他線程仍在運行。 // 在 Solaris 中我們既不能 exitThread 也不能返回到 mstart 中。 // 其他系統上可能發生別的糟糕的事情。 // // 我們可以嘗試退出之前清理當前 M ，但信號處理非常復雜 handoffp(releasep()) // 讓出 P lock(&sched.lock) // 鎖住調度器 sched.nmfreed++ checkdead() unlock(&sched.lock) notesleep(&m.park) // 暫止主線程，在此阻塞 throw("locked m0 woke up") } sigblock() unminit() // 釋放 gsignal 棧 if m.gsignal != nil { stackfree(m.gsignal.stack) } // 將 m 從 allm 中移除 lock(&sched.lock) for pprev := &allm; *pprev != nil; pprev = &(*pprev).alllink { if *pprev == m { *pprev = m.alllink goto found } } // 如果沒找到則是異常狀態，說明 allm 管理出錯 throw("m not found in allm") found: if !osStack { // Delay reaping m until it's done with the stack. // // If this is using an OS stack, the OS will free it // so there's no need for reaping. atomic.Store(&m.freeWait, 1) // Put m on the free list, though it will not be reaped until // freeWait is 0. Note that the free list must not be linked // through alllink because some functions walk allm without // locking, so may be using alllink. m.freelink = sched.freem sched.freem = m } unlock(&sched.lock) // Release the P. handoffp(releasep()) // After this point we must not have write barriers. // Invoke the deadlock detector. This must happen after // handoffp because it may have started a new M to take our // P's work. lock(&sched.lock) sched.nmfreed++ checkdead() unlock(&sched.lock) if osStack { // Return from mstart and let the system thread // library free the g0 stack and terminate the thread. return } // mstart is the thread's entry point, so there's nothing to // return to. Exit the thread directly. exitThread will clear // m.freeWait when it's done with the stack and the m can be // reaped. exitThread(&m.freeWait) } ``` 可惜`exitThread`在 darwin 上還是沒有定義： ``` // 未在 darwin 上使用，但必須定義 func exitThread(wait *uint32) { } ``` 在 Linux amd64 上： ``` // func exitThread(wait *uint32) TEXT runtime·exitThread(SB),NOSPLIT,$0-8 MOVQ wait+0(FP), AX // 棧使用完畢 MOVL $0, (AX) MOVL $0, DI // exit code MOVL $SYS_exit, AX SYSCALL // 甚至連棧都沒有了 INT $3 JMP 0(PC) ``` 從實現上可以看出，只有 linux 中才可能正常的退出一個棧，而 darwin 只能保持暫止了。而如果是主線程，則會始終保持 park。 ## 小結我們已經看過了整個調度器的設計，圖 1 縱觀了整個調度循環： ![](https://golang.design/under-the-hood/assets/schedule.png)**圖 1：調度器調度循環縱覽** 那么，很自然的能夠想到這個流程中存在兩個問題： 1. `findRunnableGCWorker`在干什么？ 2. 調度循環看似合理，但如果 G 執行時間過長，難道要等到 G 執行完后再調度其他的 G？顯然不符合實際情況，那么到底會發生什么事情？