MySQL · TokuDB · Cachetable Worker Threads and Thread Pools · Database Kernel Monthly Report

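## Introduction

TokuDB has a counterpart to InnoDB's buffer pool called the cachetable. It stores data nodes (both leaf nodes and internal nodes) and rollback segments. For simplicity, this article refers to leaf nodes, internal nodes, and rollback segments collectively as data nodes. The cachetable is globally unique: there is exactly one per MySQL instance. Instead of the usual B-tree variants (B+ tree, B* tree), TokuDB represents indexes with a Fractal Tree, FT for short. Like a B+ tree, an FT maintains an ordered tree structure in which internal nodes store pivots (TokuDB internal nodes also carry a message buffer) and leaf nodes store data.

The cachetable is initialized when the database starts. When a client thread (the thread that owns the calling stack context) needs a data node, it first looks for it in the cachetable and returns immediately on a hit. On a miss, it allocates a cache entry in the cachetable and loads the data from disk into that entry. The data structure that represents a cache entry in TokuDB is called a pair; it records the mapping (node block number / page number, data node). In InnoDB, MySQL's default engine, data and indexes are stored in a single file, whereas in TokuDB each index has its own file on disk.

The cachetable is a hash table with 1024*1024 buckets, each of which holds multiple pairs. Pairs that belong to the same index are managed by a cachefile. TokuDB has an optimization that comes up again later, so it is worth a brief mention here. When the server layer explicitly closes a TokuDB table, the FT layer calls `toku_cachefile_close` to close the table or index and removes the cached data nodes from the cachetable. Those data nodes nevertheless stay in the cachefile, that is, in memory. Such a cachefile is added to the stale list, and its data nodes remain in memory for a while. When the index is accessed again soon afterwards, the lookup first searches the active list for the matching cachefile. If it is not there, the lookup tries the stale list and, on a hit, adds the cachefile's data nodes back into the cachetable. A repeated access to the same data set therefore does not have to reload it from disk.

## Cachetable worker threads

The cachetable creates three worker threads:

1. evictor thread: frees part of the cachetable's memory;
2. cleaner thread: flushes the message buffers of internal nodes down to leaf nodes;
3. checkpointer thread: writes back dirty data.

## Cachetable thread pools

The cachetable creates three thread pools:

1. client thread pool: helps the cleaner thread flush the message buffers of internal nodes;
2. cachetable thread pool:
   * helps client threads fetch / partial fetch data nodes
   * helps the evictor thread evict / partial evict data nodes
   * deletes data nodes in the background when they are removed from the cachetable
3. checkpoint thread pool: helps client threads write back data nodes that are in the checkpoint_pending state.

## The main cachetable queues

1. m_clock_head: a newly loaded data node is added to the hash table for fast lookup and also to this queue. It can be thought of as the cachetable's LRU queue;
2. m_cleaner_head: points into the LRU queue described by m_clock_head. The cleaner thread starts scanning here to find the internal node with the highest memory pressure and starts a message buffer flush on it;
3. m_checkpoint_head: points into the LRU queue described by m_clock_head. During the begin checkpoint phase the checkpointer thread starts scanning here and adds every data node to the m_pending_head queue;
4. m_pending_head: during the end checkpoint phase the checkpointer thread starts scanning here and writes dirty data nodes back to disk.

## Evictor thread

As data is loaded into the cachetable, its memory footprint keeps growing. Once it reaches a certain level, the evictor worker thread is woken up to try to free some data nodes. The evictor thread also runs periodically (every 1 second by default). The evictor defines four watermarks to judge how much memory the cachetable is currently using:

1. m_low_size_watermark: once the cachetable size is back down to this watermark, the evictor thread stops freeing memory. Informally, this is the upper limit on the memory the cachetable consumes;
2. m_low_size_hysteresis: once this watermark is reached, client threads (that is, server-layer threads) wake the evictor thread to free memory. Typically 1.1 times m_low_size_watermark;
3. m_high_size_hysteresis: once the size falls back to this watermark, blocked client threads are woken up. Typically 1.2 times m_low_size_watermark;
4. m_high_size_watermark: once this watermark is reached, client threads block on the m_flow_control_cond condition variable and wait for the evictor thread to free memory. Typically 1.5 times m_low_size_watermark.

### When the evictor thread is woken up

1. When a new pair is added;
2. On a get pair that needs to fetch or partial fetch a data node;
3. When the evictor is destroyed, so that waiting client threads are woken up;
4. After some data nodes have been freed, when the evictor decides whether to keep running.

With that background in place, here is the evictor thread's main function, `run_eviction`. `run_eviction` is a while loop that calls `eviction_needed` to decide whether to evict. In the code below, m_size_current is the current size of the cachetable and m_size_evicting is the memory held by data nodes that are currently being evicted. Their difference is the size the cachetable will eventually reach, as seen before this round of eviction runs. The pseudocode is as follows:

~~~
bool eviction_needed() {
    // size the cachetable will reach once in-flight evictions finish
    return (m_size_current - m_size_evicting) > m_low_size_watermark;
}

void run_eviction() {
    uint32_t num_pairs_examined_without_evicting = 0;
    while (eviction_needed()) {
        if (m_num_sleepers > 0 && should_sleeping_clients_wakeup()) {
            /* signal the waiting client threads */
        }
        // first try the cachefiles on the stale list
        bool some_eviction_ran = evict_some_stale_pair();
        if (!some_eviction_ran) {
            get m_pl->read_list_lock;
            if (!curr_in_clock) {
                /* nothing to evict */
                release m_pl->read_list_lock;
                break;
            }
            if (num_pairs_examined_without_evicting > m_pl->m_n_in_table) {
                /* everything is in use */
                release m_pl->read_list_lock;
                break;
            }
            bool eviction_run = run_eviction_on_pair(curr_in_clock);
            if (eviction_run) {
                // reset the count
                num_pairs_examined_without_evicting = 0;
            } else {
                num_pairs_examined_without_evicting++;
            }
            release m_pl->read_list_lock;
        }
    }
}
~~~

When `eviction_needed` returns true, the evictor tries to free memory. It first checks whether the cachetable has dropped below m_high_size_hysteresis and, if so, wakes the client threads waiting on the m_flow_control_cond condition variable. Next, it tries to reclaim data nodes from the cachefiles on the stale list. If the stale list has nothing to reclaim, it starts reclaiming from m_clock_head. For a data node that has not been accessed recently it calls `try_evict_pair` to attempt reclamation; otherwise it ages the node gradually and attempts a partial evict. If a full scan of the m_clock_head queue finds no reclaimable data node, this round of the evictor's work is done, and it tries again the next time it is woken up.

## Cleaner thread

The cleaner is another worker thread that runs periodically (every 1 second by default). Starting at m_cleaner_head, it scans at most 8 data nodes and picks the one with the highest cache pressure, skipping nodes that other threads are currently accessing. Because leaf nodes and rollback segments have a cache pressure of 0, the node it picks is always an internal node. If that node has the checkpoint_pending flag set, the cleaner must first call `write_locked_pair_for_checkpoint` to write the data back and only then call `cleaner_callback` to flush the internal node's message buffer down to the leaf nodes. During the write-back, if the node has a `clone_callback`, the write is done by the checkpoint thread pool; without a `clone_callback`, the cleaner thread does the write itself. Flushing an internal node's message buffer is a complex process that involves message apply, merge, and other operations, and is planned as the subject of a separate article. The pseudocode is as follows:

~~~
run_cleaner() {
    uint32_t num_iterations = get_iterations(); // by default, iteration == 1
    for (uint32_t i = 0; i < num_iterations; ++i) {
        get m_pl->read_list_lock;
        PAIR best_pair = NULL;
        int n_seen = 0;
        long best_score = 0;
        const PAIR first_pair = m_cleaner_head;
        if (first_pair == NULL) {
            /* nothing to clean */
            release m_pl->read_list_lock;
            break;
        }
        /* pick up best_pair */
        do {
            get m_cleaner_head pair lock;
            skip m_cleaner_head if it is being referenced by other threads
            n_seen++;
            long score = 0;
            bool need_unlock = false;
            score = m_cleaner_head cache pressure
            if (best_score < score) {
                best_score = score;
                if (best_pair) {
                    // a better candidate was found: drop the previous one
                    release best_pair pair lock;
                }
                best_pair = m_cleaner_head; // keep its pair lock held
            } else {
                need_unlock = true;
            }
            if (need_unlock) {
                release m_cleaner_head pair lock;
            }
            m_cleaner_head = m_cleaner_head->clock_next;
        } while (m_cleaner_head != first_pair && n_seen < 8);
        release m_pl->read_list_lock;

        if (best_pair) {
            get best_pair->value_rwlock;
            if (best_pair->checkpoint_pending) {
                // write the node back before flushing its message buffer
                write_locked_pair_for_checkpoint(ct, best_pair, true);
            }
            bool cleaner_callback_called = false;
            if (best_pair cache pressure > 0) {
                r = best_pair->cleaner_callback(best_pair->value_data,
                                                best_pair->key,
                                                best_pair->fullhash,
                                                best_pair->write_extraargs);
                cleaner_callback_called = true;
            }
            if (!cleaner_callback_called) {
                // the lock is released here only if the callback did not run
                release best_pair->value_rwlock;
            }
        }
    }
}
~~~

## Checkpointer thread

The cachetable's dirty data is flushed to disk periodically (every 60 seconds by default) by the checkpointer thread. The checkpointer thread works in two phases.

begin checkpoint phase:

1. Mark every active cachefile with the for_checkpoint flag;
2. Write the log;
3. Mark every data node with the checkpoint_pending flag and add it to the m_pending_head queue;
4. Clone checkpoint_header: the in-memory data structure for FT metadata is FT_HEADER, and this header has two versions:
   * h is the current version
   * checkpoint_header is the version currently being checkpointed, a copy of h taken at the moment the checkpoint started
5. Clone the BTT (block translation table): TokuDB uses the BTT to record the mapping from logical page number (blocknum) to file offset. Each time a data node is flushed, an unused offset is allocated and the dirty page is written at that new offset, so the old data is not overwritten. The BTT itself is mapped to different offsets in the FT file by a similar mechanism. The starting address of the BTT is recorded in FT_HEADER. When the checkpoint completes, FT_HEADER is updated so that the new data takes effect. Users can use the checkpoint mechanism to produce a backup, which speeds up rebuilding a database. The BTT has three versions:
   * the current version (_current)
   * the version being checkpointed (_inprogress)
   * the version from the last checkpoint (_checkpointed)

end checkpoint phase:

1. Write the data nodes in the m_pending_head queue back to disk one by one. Each write first checks whether a `clone_callback` method is set. If it is, `clone_callback` is called to produce a cloned node (`clone_callback` may rebalance leaf nodes), and once the clone is complete `cachetable_only_write_locked_data` is called to write the cloned pair back. If no `clone_callback` is set, `cachetable_write_locked_pair` is called directly to write the node back. The pseudocode is as follows:

   ~~~
   void write_pair_for_checkpoint_thread(evictor* ev, PAIR p) {
       get p->value_rwlock.write_lock;
       if (p->dirty && p->checkpoint_pending) {
           if (p->clone_callback) {
               // clone under the lock; the clone is written after unlocking
               get p->disk_nb_mutex;
               clone_pair(ev, p);
           } else {
               cachetable_write_locked_pair(ev, p, true /* for_checkpoint */);
           }
       }
       p->checkpoint_pending = false;
       put p->value_rwlock.write_lock;
       if (p->clone_callback) {
           cachetable_only_write_locked_data(ev, p, true /* for_checkpoint */,
                                             &attr, true /* is_clone */);
       }
   }
   ~~~

2. Call `checkpoint_userdata`:
   * write back the _inprogress version of the BTT
   * write back the checkpoint_header version of FT_HEADER; checkpoint_header is released afterwards
3. Call `end_checkpoint_userdata`:
   * free the address space occupied by the _checkpointed version of the BTT
   * switch the _inprogress version to _checkpointed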
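
The interplay of the three BTT versions during a checkpoint can be sketched with a small toy model. This is only an illustration under stated assumptions, not TokuDB's actual code: the class `ToyBlockTable`, its methods, and the rule used to decide which offsets become reusable are all hypothetical, and offsets are handed out from a simple counter rather than a real block allocator.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>

// Hypothetical, simplified model of a block translation table (BTT).
// All names here are illustrative; they are not TokuDB identifiers.
using Blocknum = std::int64_t;
using Offset = std::int64_t;
using Translation = std::map<Blocknum, Offset>;

class ToyBlockTable {
public:
    // Flushing a node always takes a fresh offset, so the data that the
    // last checkpoint points at is never overwritten.
    Offset write_node(Blocknum b) {
        Offset off = next_offset_++;
        current_[b] = off;
        return off;
    }

    // begin checkpoint: snapshot _current into _inprogress.
    void begin_checkpoint() {
        inprogress_ = current_;
        checkpoint_running_ = true;
    }

    // end checkpoint: offsets referenced only by the old _checkpointed
    // version become reusable (an assumption of this sketch), and
    // _inprogress becomes the new _checkpointed.
    std::set<Offset> end_checkpoint() {
        assert(checkpoint_running_);
        std::set<Offset> freed;
        for (const auto& kv : checkpointed_) {
            if (!references(inprogress_, kv.second) &&
                !references(current_, kv.second)) {
                freed.insert(kv.second);
            }
        }
        checkpointed_ = inprogress_;
        inprogress_.clear();
        checkpoint_running_ = false;
        return freed;
    }

    const Translation& current() const { return current_; }
    const Translation& checkpointed() const { return checkpointed_; }

private:
    // Linear scan: does any blocknum in t map to this offset?
    static bool references(const Translation& t, Offset off) {
        for (const auto& kv : t) {
            if (kv.second == off) return true;
        }
        return false;
    }

    Translation current_;       // _current
    Translation inprogress_;    // _inprogress
    Translation checkpointed_;  // _checkpointed
    bool checkpoint_running_ = false;
    Offset next_offset_ = 0;
};
```

In this model, writing a block again after a checkpoint gives it a new offset while the checkpointed translation still points at the old one, so a crash before the next checkpoint completes would still find a consistent image. Only when the next `end_checkpoint` runs is the old offset reported as free.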