6.2 工作竊取式調度 · golan基礎

# 6.2 工作竊取式調度 ## 性能模型 * 性能方程 * memory wall * Amdahl 定律 * Gustafson 定律 * 經典例子：矩陣乘法的瓶頸, tiling 技巧 * 性能測量 TODO: 討論與 work tealing 相關的調度理論，可能的討論的主題包括： 0. original work stealing scheduling 1. NUMA-aware scheduling 2. non-cooperative preemption 3. work stealing with latency 4. thread scheduling 5. nested parallel scheduling 6. power-aware scheduling 7. distributed work stealing 8. … # 基于工作竊取的多線程計算調度 Robert D. Blumofe, Charles E. Leiserson Journal of the ACM, Vol. 46, No.5, Spectember 1999, pp. 720-748 ## 摘要本文研究了并行計算機上有效調度完全嚴格（即結構良好）的多線程計算問題。調度此類動態多指令多數據流式（MIMD-style）計算的一種流行的方法是**工作竊取**，其中需要工作的處理器從運行其他計算線程的處理器中竊取任務。本文首次證明了有依賴的多線程計算中一個有效的工作竊取式調度器。具體而言，我們的分析顯示，使用我們的工作竊取調度器在PP個處理器上執行完全嚴格計算的期望時間為T1/P+O(T∞)T1/P+O(T∞)，其中T1T1為多線程計算中最少串行執行時間，T∞T∞為無窮多個處理器的最少執行時間。此外，執行所需的空間至多為S1PS1P，其中S1S1為要求的最小連續空間。我們還指出算法的總通信時間期望至多為O(PT∞(1+nd)Smax)O(PT∞(1+nd)Smax)，其中SmaxSmax為任意線程的最大活躍記錄的大小，ndnd則為任意線程與其父線程同步的次數。此上界證明了民間智慧的正確性：工作竊取調度器與\*\*工作共享\*\*調度器相比通信效率更高。以上三個上界在常數因子內均存在且為最優。 ## 1 引言為了在 MIMD 型并行計算機上有效執行動態增長的多線程計算，調度算法必須確保足夠的線程同時處于活躍狀態以保持處理器繁忙。同時，它應確保并發活躍線程的數量保持在合理的限制范圍內，以便內存要求不會過大。此外，如果可能，調度程序還應嘗試在同一處理器上維護相關線程，以便最小化它們之間的通信。毫無疑問，同時實現所有這些目標可能很困難。目前已經出現了兩種調度范式來解決調度多線程計算的問題：**工作共享**（work sharing）和**工作竊取**（work stealing）。在工作共享中，只要處理器生成新線程，調度程序就會嘗試將其中一些線程遷移到其他處理器，以期將工作分配給未充分利用的處理器。然而，在工作竊取中，未充分利用的處理器采取主動：它們試圖從其他處理器「竊取」線程。直觀地說，線程遷移在工作竊取中的發生頻率低于工作共享，因為當所有處理器都有工作要做時，工作竊取調度程序不會遷移任何線程，但線程總是受到工作共享調度程序遷移。工作竊取的想法至少可以追溯到 \[Burton and Sleep, 1981\] 關于函數式程序并行執行的研究和 \[Halstead, 1984\] 對 Multilisp 的實現。這些作者指出了工作偷竊在空間和通信方面的啟發式好處。從那時起，許多研究人員就這一策略實現了諸多變體（參見，例如，\[Blumofe and Lisiecki, 1997\]，\[Feldmann et al. 1993\]，\[Finkel and Manber, 1987\]，\[Halbherr et al. 1994\]，\[Kuszmaul, 1994\]，\[Mohr et al. 1991\] 和 \[Vandevoorde and Roberts, 1988\]）。\[Rudolph et al. 1991\] 分析了在并行計算機上獨立工作負載均衡的隨機工作竊取策略，\[Karp and Zhang, 1993\] 分析了并行回溯搜索的隨機工作竊取策略。\[Zhang and Ortynski, 1994\] 已經對這個算法的通信要求有了很好的界限。在本文中，我們提出并分析了一種用于調度「完全嚴格」（結構良好）多線程計算的工作竊取算法。這類計算包括回溯搜索計算 \[Karp and Zhang, 1993; Zhang and Ortynski, 1994\] 和分治計算 \[Wu and Kung 1991\]，以及由于數據依賴性，線程可能會停頓的數據流計算 \[Arvind et al. 1989\]。類似于 \[Liu et al. 1993\] 的對同一數據結構通過對手串行排隊的并發訪問的原子消息傳遞（message passing）模型，我們在一個嚴格的原子訪問模型中分析我們的算法。我們的主要貢獻是用于完全嚴格的多線程計算的隨機工作竊取調度器，它在時間、空間和通信方面證明是有效的。我們證明使用我們的工作竊取調度算法在PP個處理器上執行完全嚴格計算的期望時間是T1/P+O(T∞)T1/P+O(T∞)，其中T1T1是多線程計算的最小串行執行時間，T∞T∞?? 是具有無限數量處理器的最少執行時間。此外，執行所需的空間最多為S1PS1P，其中S1S1為最小連續空間要求。這些界限優于以前的工作共享調度器 \[Blumofe and Leiserson, 1998\]，并且工作竊取調度器更加簡單且非常實用。與 \[Blumofe and Leiserson, 1998\] 研究的（一般）嚴格計算相比，這種改進的部分原因在于我們專注于完全嚴格的計算。我們還證明了執行的總通信期望最多為O(PT∞(1+nd)Smax)O(PT∞(1+nd)Smax)，其中SmaxSmax是任何線程的最大活躍記錄的大小，ndnd則為任意線程與其父線程同步的次數。這種界限存在于一個常數因子內，與 \[Wu and Kung, 1991\] 中并行分治通信的下界相同。相比之下，工作共享調度器具有幾乎最壞情況的通信行為。因此，我們的結果支持了民間智慧，即工作竊取優于工作共享。其他人已經研究并繼續研究有效管理并行計算的空間要求的問題。\[Culler and Arvind, 1988\] 以及\[Ruggiero and Sargeant, 1987\] 給出了限制數據流程序所需空間的啟發式方法。\[Burton, 1988\] 展示了如何在不導致死鎖的情況下限制某些并行計算中的空間。最近，\[Burton, 1996\] 開發并分析了一種具有可證明的良好時間和空間界限的調度算法。\[Blelloch et al. 1995; 1997\] 還開發并分析了具有可證明的良好時間和空間界限的調度算法。目前尚不清楚這些算法中的任何算法是否與偷工作一樣實用。本文的其余部分組織如下：在第 2 節中，我們回顧了 \[Blumofe and Leiserson, 1998\] 中介紹的多線程計算的圖論模型，它為分析調度程序提供了理論基礎。第 3 節給出了一個使用中央隊列的簡單調度算法。這種 Busy-leaves 算法構成了我們隨機化工作竊取算法的基礎，并在第 4 節中介紹了這一算法。在第 5 節中，我們介紹了用于分析工作竊取算法執行時間和通信成本的原子訪問模型，提出并分析一個組合的「ball and bins」博弈，我們用它來推導隨機工作竊取中出現的爭用的界限。然后，我們在第 6 節中使用此接線以及延遲序列參數 \[Ranade, 1987\] 來分析工作竊取算法的執行時間和通信成本。最后，在第 7 節中，我們簡要討論了如何本文中的理論思想應用于 Cilk 編程語言和運行時系統 \[Blumofe et al. 1996\] 和 \[Frigo et al. 1998\] ，并進行了總結性陳述。 ## 2 多線程計算模型 TODO: **定理1 （貪心調度定理）**：對于具有工作T1T1和臨界路徑長度T∞T∞??的任何多線程計算，對于任何數量為PP的處理器，任何貪婪的PP處理器執行調度XX達到T(X)≤T1/P+T∞T(X)≤T1/P+T∞。 ## 3 Busy-Leaves 屬性 TODO: **引理2**：對于棧深度為S1S1的任何多線程計算，維護 Busy-leaves 屬性的任何PP處理器執行調度XX的空間界限為S(X)≤S1PS(X)≤S1P。 **定理3**：對于任意數量的 P 處理器和任何具有工作T1T1，關鍵路徑長度T∞T∞和堆棧深度S1S1的嚴格多線程計算，Busy-Leaves 算法計算PP處理器執行調度XX，其執行時間滿足T(X)≤T1/P+T∞T(X)≤T1/P+T∞并且其空間滿足S(X)≤S1PS(X)≤S1P。 ## 4 隨機化工作竊取算法 TODO: **引理4**：在通過工作竊取算法執行任何完全嚴格的多線程計算時，考慮任何處理器pp以及pp在線程上工作的任何給定時間步長。設Γ0Γ0是pp正在處理的線程，令kk為pp的就緒隊列的線程數，令Γ0Γ0，Γ1Γ1，ΓkΓk表示從下到上排序的pp的就緒隊列線程，使得Γ??1Γ??1在最底端，而ΓkΓk在最頂端。如果我們有k\>0k\>0，則 p 的就緒隊列中的線程滿足以下性質： 1. 對于i\=1,2,…,ki\=1,2,…,k，線程ΓiΓi為Γi?1Γi?1的父線程。 2. 如果有k\>1k\>1，則對于i\=1,2,…,k?1i\=1,2,…,k?1，線程ΓiΓi沒有執行工作因為它尚未被Γi?1Γi?1創建。定理5.對于堆棧深度為S1的任何完全嚴格的多線程計算，工作竊取算法在具有P處理器的計算機上運行，最多使用S1P空間。 ## 5 原子訪問與回收博弈 TODO: ## 6 工作竊取算法的分析 TODO: **定理13**：考慮在具有 P 處理器的并行計算機上通過工作竊取算法執行任何完全嚴格的多線程計算，其中工作T1T1和臨界路徑長度為T∞T∞，則包含調度開銷的期望運行時間為T1/P+O(T∞)T1/P+O(T∞)。進一步，對于任意?\>0?\>0，以至少1??1??的概率，在PP處理器上的執行時間為T1/P+O(T∞+logP+log(1/?))3T1/P+O(T∞+log?P+log?(1/?))3。 **定理14**：考慮在具有 P 處理器的并行計算機上通過工作竊取算法執行任何具有臨界路徑長度為T∞T∞的完全嚴格的多線程計算。則傳送總字節數的期望為O(PT∞(1+nd)Smax)O(PT∞(1+nd)Smax)，其中ndnd為最大活躍幀的字節大小。進一步，對于任意?\>0?\>0，以至少1??1??的概率，遭遇的通信數為O(P(T∞+log(1/?))(1+nd)Smax)O(P(T∞+log?(1/?))(1+nd)Smax)。 ## 7 結論本文分析的方法有多實用？我們一直積極致力于構建一種基于 C 語言的名為 Cilk（發音為 “silk”）的編程語言，并用于編程多線程計算（參見，例如，\[Blumofe 1995\]，\[Blumofe et al. 1996\]，\[Frigo et al. 1998\]，\[Joerg, 1996\] 和 \[Randall, 1998\]）。Cilk 源自 PCM（Parallel Continuation Machine）系統 \[Halbherr et al. 1994\]，其本身一部分受到此處研究報告的啟發。Cilk 運行時系統采用本文中描述的工作竊取算法。由于 Cilk 采用了可證明有效的調度算法，因此 Cilk 為用戶應用程序提供了有保證的性能。具體來說，我們已經憑經驗發現使用模型T1/P+T∞T1/P+T∞可以準確預測用 Cilk 語言編寫的應用程序在具有 P 個處理器上的運行性能。 Cilk 系統目前運行在現代共享內存多處理器上，例如 Sun Enterprise，Silicon Graphics Origin，Intel Quad Pentium 和 DEC Alphaserver（早期版本的 Cilk 運行在 Thinking Machines CM-5 MPP，Intel Paragon MPP 和 IBM SP-2 上）。迄今為止，用 Cilk 編寫的應用程序包括蛋白質折疊 \[Pande et al. 1994\]，圖形渲染 \[Stark 1998\]，回溯搜索和 \*??Socrates 國際象棋程序 \[Joerg and Kuszmaul, 1994\]，它在 1995 年 ICCA 世界計算機國際象棋錦標賽中獲得二等獎，該錦標賽在桑迪亞國家實驗室的 1824-node Paragon上運行。我們最近的國際象棋程序 Cilkchess 贏得了 1996 年荷蘭公開賽計算機國際象棋錦標賽。 Cilk 的團隊編程在國際功能編程大會主辦的 ICFP'98 編程競賽中獲得了一等獎（不敗）。作為我們研究的一部分，我們在工作站網絡上為 Cilk 實現了原型運行時系統。這個運行時系統叫做 Cilk-NOW \[Blumofe 1995; Blumofe and Lisiecki 1997; Lisiecki 1996\] 支持自適應并行，其中工作站環境中的處理器可以加入用戶的計算，如果它們在其他方面處于空閑狀態，并且可以立即使用，以便在需要時由其所有者再次離開計算。 Cilk-NOW 還支持透明容錯，這意味著即使面對程序員以完全錯誤的方式編寫的代碼導致處理器崩潰，計算也可以繼續進行。\[Randall, 1998\] 描述了更新的 SMP 集群分布式實現。我們還調查了與 Cilk 相關的其他主題，包括分布式共享內存和調試工具（有關分布式共享內存的示例，請參閱 \[Blumofe et al, 1996a; 1996b\]，\[Frigo, 1998\]，\[Frigo and Luchangco, 1998\]。有關調試工具的示例，請參見 \[Cheng, 1998\]，\[Cheng et al. 1998\]，\[Feng and Leiserson, 1997\]，和 \[Stark, 1998\]）可以在[http://supertech.lcs.mit.edu/cilk](http://supertech.lcs.mit.edu/cilk)上找到論文和軟件版本的最新信息。對于共享內存多處理器的情況，我們最近推廣了兩個維度的時間界限（但不是空間或通信邊界）\[Arora et al. 1998\]。首先，我們已經證明，對于任意（不一定是完全嚴格甚至嚴格的）多線程計算，預期的執行時間是O(T1/P+T∞)O(T1/P+T∞)??。該界限基于新的結構引理和使用潛在函數的攤銷分析。其次，我們開發了工作竊取算法的非阻塞式實現，并且我們分析了它在多程序環境中的執行時間，其中計算在一組數量隨時間增長和縮小的處理器上執行，這種增長和縮小是由爭用對手控制。如果攻擊者選擇不增加或縮小處理器數量，則綁定專門用于匹配我們之前的綁定。非阻塞工作竊取調度器已在 Hood 用戶級線程庫中實現 \[Blumofe and Papadopoulos 1998; Papadopoulos 1998\]。可以在萬維網上找到論文和軟件版本的最新的信息，網址為[http://www.cs.utexas.edu/users/hood](https://www.cs.utexas.edu/users/hood)。 ## 參考文獻 * ARORA, N. S., BLUMOFE, R. D., AND PLAXTON, C. G. 1998.**Thread scheduling for multiprogrammed multiprocessors**. In Proceedings of the 10th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA’98) (Puerto Vallarta, Mexico, June 28–July 2). ACM, New York, pp. 119–129. * ARVIND, NIKHIL, R. S., AND PINGALI, K. K. 1989. I-structures:**Data structures for parallel computing**. ACM Trans. Program. Lang. Syst. 11, 4 (Oct.), 598–632. * BLELLOCH, G. E., GIBBONS, P. B., AND MATIAS, Y. 1995.**Provably efficient scheduling for languages with fine-grained parallelism**. In Proceedings of the 7th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA’95) (Santa Barbara, Calif., July 17–19). ACM, New York, pp. 1–12. * BLELLOCH, G. E., GIBBONS, P. B., MATIAS, Y., AND NARLIKAR, G. J. 1997.**Space-efficient scheduling of parallelism with synchronization variables**. In Proceedings of the 9th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA’97) (Newport, R.I., June 22–25). ACM, New York, pp. 12–23. * BLUMOFE, R. D. 1995.**Executing multithreaded programs efficiently**. Ph.D. thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology. Also avail- able as MIT Laboratory for Computer Science Technical Report MIT/LCS/TR-677. * BLUMOFE, R. D., FRIGO, M., JOERG, C. F., LEISERSON, C. E., AND RANDALL, K. H. 1996a.**An analysis of dag-consistent distributed shared-memory algorithms**. In Proceedings of the 8th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA’96) (Padua, Italy, June 24–26). ACM, New York, pp. 297–308. * BLUMOFE, R. D., FRIGO, M., JOERG, C. F., LEISERSON, C. E., AND RANDALL, K. H. 1996b.**Dag-consistent distributed shared memory**. In Proceedings of the 10th International Parallel Process- ing Symposium (IPPS) (Honolulu, Hawaii, April). IEEE Computer Society Press, Los Alamitos, Calif., pp. 132–141. * BLUMOFE, R. D., JOERG, C. F., KUSZMAUL, B. C., LEISERSON, C. E., RANDALL, K. H., AND ZHOU, Y. 1996c.**Cilk: An efficient multithreaded runtime system**. J. Paral. Dist. Comput. 37, 1 (Aug.), 55– 69. * BLUMOFE, R. D., AND LEISERSON, C. E. 1998.**Space-efficient scheduling of multithreaded computations**. SIAM J. Comput. 27, 1 (Feb.), 202–229. * BLUMOFE, R. D., AND LISIECKI, P. A. 1997.**Adaptive and reliable parallel computing on networks of workstations**. In Proceedings of the USENIX 1997 Annual Technical Conference on UNIX and Advanced Computing Systems (Anaheim, Calif., Jan.). USENIX Associates, Berkeley, Calif., pp. 133–147. * BLUMOFE, R. D., AND PAPADOPOULOS, D. 1998.**The performance of work stealing in multiprogrammed environments**. Tech. Rep. TR-98-13 (May). Dept. Computer Sciences, The University of Texas at Austin, Austin, Tex. * BRENT, R. P. 1974.**The parallel evaluation of general arithmetic expressions**. J. ACM 21, 2 (Apr.), 201–206. * BURTON, F. W. 1988.**Storage management in virtual tree machines**. IEEE Trans. Comput. 37, 3 (Mar.), 321–328. * BURTON, F. W. 1996.**Guaranteeing good memory bounds for parallel programs**. IEEE Trans. Softw. Eng. 22, 10 (Oct.), 762–773. * BURTON, F. W., AND SLEEP, M. R. 1981.**Executing functional programs on a virtual tree of processors**. In Proceedings of the 1981 Conference on Functional Programming Languages and Computer Architecture (Portsmouth, N.H., Oct.). ACM, New York, N.Y., pp. 187–194. * CHENG, G.-I. 1998.**Algorithms for data-race detection in multithreaded programs**. Master’s thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technol- ogy. * CHENG, G.-I., FENG, M., LEISERSON, C. E., RANDALL, K. H., AND STARK, A. F. 1998.**Detecting data races in Cilk programs that use locks**. In Proceedings of the 10th ACM Symposium on Parallel Algorithms and Architectures (SPAA’98) (Puerto Vallarta, Mexico, June 28–July 2). ACM, New York, pp. 298–309. * CULLER, D. E., AND ARVIND. 1988.**Resource requirements of dataflow programs**. In Proceedings of the 15th Annual International Symposium on Computer Architecture (ISCA) (Honolulu, Hawaii, May). IEEE Computer Society Press, Los Alamitos, Calif., pp. 141–150. Also available as MIT Laboratory for Computer Science, Computation Structures Group Memo 280. * EAGER, D. L., ZAHORJAN, J., AND LAZOWSKA, E. D. 1989.**Speedup versus efficiency in parallel systems**. IEEE Trans. Comput. 38, 3 (Mar.), 408–423. * FELDMANN, R., MYSLIWIETZ, P., AND MONIEN, B. 1993.**Game tree search on a massively parallel system**. Adv. Comput. Chess 7, 203–219. * FENG, M., AND LEISERSON, C. E. 1997.**Efficient detection of determinacy races in Cilk programs**. In Proceedings of the 9th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA’97) (Newport, R.I., June 22–25). ACM, New York, pp. 1–11. * FINKEL, R., AND MANBER, U. 1987. DIB—**A distributed implementation of backtracking**. ACM Trans. Program. Lang. Syst. 9, 2 (Apr.), 235–256. * FRIGO, M. 1998.**The weakest reasonable memory model**. Master’s thesis, Dept. Electrical Engi- neering and Computer Science, Massachusetts Institute of Technology, Cambridge, Mass. * FRIGO, M., LEISERSON, C. E., AND RANDALL, K. H. 1998.**The implementation of the Cilk-5 multithreaded language**. In Proceedings of the 1998 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’98) (Montreal, Canada, June 17–19). ACM, New York. * FRIGO, M., AND LUCHANGCO, V. 1998.**Computation-centric memory models**. In Proceedings of the 10th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA’98) (Puerto Vallarta, Mexico, June 28–July 2). ACM, New York, pp. 240–249. * GRAHAM, R. L. 1966.**Bounds for certain multiprocessing anomalies**. Bell Syst. Tech. J. 45, 1563–1581. * GRAHAM, R. L. 1969.**Bounds on multiprocessing timing anomalies**. SIAM J. Appl. Math. 17, 2 (Mar.), 416–429. * HALBHERR, M., ZHOU, Y., AND JOERG, C. F. 1994.**MIMD-style parallel programming with continuation-passing threads**. In Proceedings of the 2nd International Workshop on Massive Parallel- ism: Hardware, Software, and Applications (Capri, Italy, Sept.). World Scientific, Singapore. (Also available as MIT Laboratory for Computer Science Computation Structures, Group Memo 355, March 1994. MIT, Cambridge, Mass. * HALSTEAD, R. H., JR. 1984.**Implementation of Multilisp: Lisp on a multiprocessor**. In Conference Record of the 1984 ACM Symposium on LISP and Functional Programming (Austin, Tex., Aug.) ACM, New York, pp. 9–17. * JOERG, C. F. 1996.**The Cilk System for Parallel Multithreaded Computing**. Ph.D. dissertation. Dept. Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Mass. * JOERG, C., AND KUSZMAUL, B. C. 1994.**Massively parallel chess**. In Proceedings of the 3rd DIMACS Parallel Implementation Challenge (Rutgers University, New Jersey, Oct. 1994). * KARP, R. M., AND ZHANG, Y. 1993.**Randomized parallel algorithms for backtrack search and branch-and-bound computation**. J. ACM 40, 3 (July), 765–789. * KUSZMAUL, B. C. 1994.**Synchronized MIMD computing**. Ph.D. thesis, Dept. Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Mass. Also available as MIT Laboratory for Computer Science Technical Report MIT/LCS/TR-645. * LISIECKI, P. 1996.**Macroscheduling in the Cilk network of workstations environment**. Master’s thesis, Dept. Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Mass. * LIU, P., AIELLO, W., AND BHATT, S. 1993.**An atomic model for message-passing**. In Proceedings of the 5th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA’93) (Velen, Germany, June 30–July 2). ACM, New York, pp. 154–163. * MOHR, E., KRANZ, D. A., AND HALSTEAD, R. H., JR. 1991.**Lazy task creation: A technique for increasing the granularity of parallel programs**. IEEE Trans. Parall. Dist. Syst. 2, 3 (July), 264–280. PANDE, V. S., JOERG, C. F., GROSBERG, A. Y., AND TANAKA, T. 1994. Enumerations of the * Hamiltonian walks on a cubic sublattice. J. Phys. A 27. PAPADOPOULOS, D. P. 1998.**Hood: A user-level thread library for multiprogramming multiprocessors**. Master’s thesis, Dept. Computer Sciences, The University of Texas at Austin, Austin, Tex. RANADE, A. 1987. How to emulate shared memory. In Proceedings of the 28th Annual Symposium on Foundations of Computer Science (FOCS) (Los Angeles, Calif., Oct.). IEEE Computer Society * Press, Los Alamitos, Calif., pp. 185–194. RANDALL, K. H. 1998.**Cilk: Efficient multithreaded computing**. Ph.D. dissertation. Dept. Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Mass. RUDOLPH, L., SLIVKIN-ALLALOUF, M., AND UPFAL, E. 1991. A simple load balancing scheme for task allocation in parallel machines. In Proceedings of the 3rd Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA’91) (Hilton Head, S.C., July 21–24). ACM, New York, pp. 237–245. * RUGGIERO, C. A., AND SARGEANT, J. 1987.**Control of parallelism in the Manchester dataflow machine**. In Functional Programming Languages and Computer Architecture, Number 274 in Lecture Notes in Computer Science. Springer-Verlag, New York, pp. 1–15. * STARK, A. F. 1998.**Debugging multithreaded programs that incorporate user-level locking**. Master’s thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Mass. * VANDEVOORDE, M. T., AND ROBERTS, E. S. 1988.**WorkCrews: An abstraction for controlling parallelism**. International Journal of Parallel Programming 17, 4 (Aug.), 347–366. * WU, I.-C., AND KUNG, H. T. 1991.**Communication complexity for parallel divide-and-conquer**. In Proceedings of the 32nd Annual Symposium on Foundations of Computer Science (FOCS) (San Juan, Puerto Rico, Oct. 1991). IEEE Computer Society Press, Los Alamitos, Calif., pp. 151–162. * ZHANG, Y., AND ORTYNSKI, A. 1994.**The efficiency of randomized parallel backtrack search**. In Proceedings of the 6th IEEE Symposium on Parallel and Distributed Processing (Dallas, Texas, Oct. 1994). IEEE Computer Society Press, Los Alamitos, Calif. ## 進一步閱讀的參考文獻 * [runtime: tight loops should be preemptible](https://github.com/golang/go/issues/10958) * [Non-cooperative goroutine preemption](https://github.com/golang/proposal/blob/master/design/24543-non-cooperative-preemption) * [NUMA-aware scheduler for Go](https://docs.google.com/document/u/0/d/1d3iI2QWURgDIsSR6G2275vMeQ_X7w-qxM2Vp7iGwwuM/pub)