MySQL · TokuDB · The checkpoint process · Database Kernel Monthly Report

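TokuDB's buffer pool (called the cachetable in TokuDB) maintains several background worker threads that periodically handle housekeeping tasks. One of them, the checkpointer thread, wakes up every 60 seconds and writes all dirty pages in the cachetable back to disk.

TokuDB supports only this one kind of checkpoint; in MySQL terms, it is a sharp checkpoint. A single checkpoint may write a large number of dirty pages, and the node's read-write lock has to be held for the whole write-back, so index access performance suffers to some degree while a checkpoint is running. To reduce this impact, TokuDB clones each dirty page and writes back the clone: the node's read-write lock is held while the page is being cloned and is released as soon as cloning finishes.

The TokuDB checkpoint process consists of the following five steps:

* Acquire the global checkpoint lock
* Begin checkpoint
* End checkpoint
* Clean up the redo log
* Release the global checkpoint lock

Let's look at begin checkpoint and end checkpoint in detail.

## Begin checkpoint

Some preparation is done at the moment a checkpoint begins:

1. Pin the FT. The FT that corresponds to each CACHEFILE is marked pinned_by_checkpoint, which guarantees that the CACHEFILE is not removed from memory. A CACHEFILE records the list of data nodes that an index contains, together with information describing the file that backs the index.
2. Mark each CACHEFILE as for_checkpoint, indicating that the data of this CACHEFILE belongs to the current checkpoint.
3. Write the redo log, recording the LSN at the moment the checkpoint begins. The entries written are: the begin checkpoint entry, the entries for the index files that are open at the checkpoint, and the live txn entries.
4. Mark each PAIR (data page) as checkpoint_pending. Every data page in the cachetable is visited; if the index file (CACHEFILE) that the page belongs to is part of the checkpoint, the page is marked checkpoint_pending and added to the global m_pending_head doubly linked list.
5. Update the checkpoint header. The FT header is cloned as ft->checkpoint_header, which records the location of the BTT (Block Translation Table) at the moment the checkpoint begins. TokuDB writes data to a new location on every checkpoint, and the mapping from an index's logical page number (or block number) to its offset in the index file is kept in the BTT. The type of ft->checkpoint_header is FT_CHECKPOINT_INPROGRESS, and its LSN is the LSN at the moment the checkpoint begins.
6. Clone the BTT. The BTT contains a translation table that records the mapping from logical page numbers to offsets in the index file. This table has three versions:
   * _current (the current one, of type TRANSLATION_CURRENT)
   * _inprogress (the one taken when the checkpoint began, of type TRANSLATION_INPROGRESS)
   * _checkpointed (the one from the previous checkpoint, of type TRANSLATION_CHECKPOINTED)

   Cloning the BTT simply means copying TRANSLATION_CURRENT and setting the type of the copy to TRANSLATION_INPROGRESS.

Notes:

* Steps 1 and 2 run under the protection of m_cf_list->read_lock.
* Steps 4 and 5 run under the protection of the pair list locks and m_cf_list->read_lock (in the code below, update_cachefiles clones both the FT header and the BTT). Note that all of the locks on the pair list are taken: the m_list_lock read lock, the m_pending_lock_expensive write lock, and the m_pending_lock_cheap write lock. This guarantees that no data page can be added to or removed from the pair list and that no data page on the pair list can be evicted from memory; it also prevents the client thread pool from helping to write back dirty pages that belong to the checkpoint during get_and_pin. All three locks protect the pair list; they are split into three according to function.

~~~
void checkpointer::begin_checkpoint() {
    // 1. Initialize the accountability counters.
    m_checkpoint_num_txns = 0;

    // 2. Make list of cachefiles to be included in the checkpoint.
    m_cf_list->read_lock();
    m_cf_list->m_active_fileid.iterate<void *, iterate_note_pin::fn>(nullptr);
    m_checkpoint_num_files = m_cf_list->m_active_fileid.size();
    m_cf_list->read_unlock();

    // 3. Create log entries for this checkpoint.
    if (m_logger) {
        this->log_begin_checkpoint();
    }

    bjm_reset(m_checkpoint_clones_bjm);

    m_list->write_pending_exp_lock();
    m_list->read_list_lock();
    m_cf_list->read_lock(); // needed for update_cachefiles
    m_list->write_pending_cheap_lock();

    // 4. Turn on all the relevant checkpoint pending bits.
    this->turn_on_pending_bits();

    // 5. Clone BTT and FT header
    this->update_cachefiles();

    m_list->write_pending_cheap_unlock();
    m_cf_list->read_unlock();
    m_list->read_list_unlock();
    m_list->write_pending_exp_unlock();
}
~~~

## End checkpoint

The end checkpoint phase does the following:

* Record all CACHEFILEs in the checkpoint_cfs array, in preparation for the later steps.
* Call checkpoint_pending_pairs to write the data pages on the m_pending_head doubly linked list back to disk. checkpoint_pending_pairs walks the m_pending_head list and decides for each data page whether it really needs to be written back. Because a checkpoint takes a fairly long time, some pages may already have been written back with the help of the client thread pool, in which case no second write-back is needed. If a page does need to be written back, clone_callback is called to clone it. During cloning, the page's read-write lock and disk_nb_mutex (which has mutex semantics and indicates that I/O is in progress) are both held. When cloning finishes, the read-write lock is released and only disk_nb_mutex remains held while the checkpointer thread writes the page (the cloned copy) back. When the write-back finishes, disk_nb_mutex is released. If a page has no clone_callback (by default every page has one), the checkpointer thread writes back the page itself (note: the page itself, not a copy), holding both the read-write lock and disk_nb_mutex during the write-back. After the write-back, the checkpoint_pending flag and the dirty flag are cleared. Once checkpoint_pending_pairs has written all data pages to disk, what remains is to update the metadata.
* For each CACHEFILE in the checkpoint_cfs array, call the checkpoint_userdata callback (which is actually ft_checkpoint) to serialize the BTT (Block Translation Table) and ft->checkpoint_header to disk (BTT rootnum). An FT index file has two locations where ft->header can be stored: offset 0 and offset 4096. TokuDB alternates between them in round-robin fashion: the header of odd-numbered checkpoints (1, 3, 5, …) is stored at offset 0, and the header of even-numbered checkpoints (2, 4, 6, …) is stored at offset 4096. Afterwards, ft->h->checkpoint_lsn is updated to the LSN at the moment the checkpoint began.
* Write the redo log: the end checkpoint entry.
* Notify the logger subsystem by setting logger->last_completed_checkpoint_lsn to the LSN at the moment the checkpoint began.
* For each CACHEFILE saved in the checkpoint_cfs array, call the end_checkpoint_userdata callback (which is actually ft_end_checkpoint) to free the space occupied by the data pages that the previous checkpoint wrote, as recorded in _checkpointed. The BTT of this checkpoint is then saved in _checkpointed and _inprogress is cleared, which means that the checkpoint has ended and no checkpoint is currently in progress. ft_end_checkpoint also frees ft->checkpoint_header and sets it to null. At this point the work of the checkpoint is complete.
* Unpin the FT

~~~
void checkpointer::end_checkpoint(void (*testcallback_f)(void*), void* testextra) {
    toku::scoped_malloc checkpoint_cfs_buf(m_checkpoint_num_files * sizeof(CACHEFILE));
    CACHEFILE *checkpoint_cfs = reinterpret_cast<CACHEFILE *>(checkpoint_cfs_buf.get());

    this->fill_checkpoint_cfs(checkpoint_cfs);
    this->checkpoint_pending_pairs();
    this->checkpoint_userdata(checkpoint_cfs);

    // For testing purposes only. Dictionary has been fsync-ed to disk but log has not yet been written.
    if (testcallback_f) {
        testcallback_f(testextra);
    }

    this->log_end_checkpoint();
    this->end_checkpoint_userdata(checkpoint_cfs);

    // Delete list of cachefiles in the checkpoint,
    this->remove_cachefiles(checkpoint_cfs);
}
~~~

## Redo log of a checkpoint

Let's look at the redo log entries recorded during a checkpoint:

* Begin_checkpoint: the entry that marks begin checkpoint
* Fassociate: the entry for an open index
* End_checkpoint: the entry that marks end checkpoint

~~~
./tdb_logprint < data/log000000000002.tokulog27
begin_checkpoint 'x': lsn=88 timestamp=1455623796540257 last_xid=153 crc=470dd9ea len=37
fassociate 'f': lsn=89 filenum=0 treeflags=0 iname={len=15 data="tokudb.rollback"} unlink_on_close=0 crc=8606e9b1 len=49
fassociate 'f': lsn=90 filenum=1 treeflags=4 iname={len=18 data="tokudb.environment"} unlink_on_close=0 crc=92dc4c1c len=52
fassociate 'f': lsn=91 filenum=3 treeflags=4 iname={len=16 data="tokudb.directory"} unlink_on_close=0 crc=86323b7e len=50
end_checkpoint 'X': lsn=92 lsn_begin_checkpoint=88 timestamp=1455623796541659 num_fassoc
~~~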
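
The BTT bookkeeping described in the Begin checkpoint and End checkpoint sections can be illustrated with a toy model. What follows is only a minimal sketch under stated assumptions, not TokuDB's code: ToyBlockTable, its members, write_block and header_offset_for are hypothetical names introduced for illustration, each translation is reduced to a plain map, and locking, space reclamation and real I/O are left out. It shows just two things taken from the text: the current translation is copied into an in-progress snapshot at begin checkpoint and promoted to the checkpointed version at end checkpoint, and the header offset alternates between 0 and 4096.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical, simplified translation: logical block number -> file offset.
using Translation = std::map<int64_t, int64_t>;

// Toy stand-in for the three translation versions kept by the BTT.
struct ToyBlockTable {
    Translation current;           // live mapping, changed by ordinary writes
    Translation inprogress;        // snapshot taken at begin checkpoint
    Translation checkpointed;      // mapping of the last completed checkpoint
    bool has_inprogress = false;   // true while a checkpoint is running
    uint64_t checkpoint_count = 0; // number of completed checkpoints

    // Begin checkpoint: copy the current translation into the snapshot.
    void begin_checkpoint() {
        assert(!has_inprogress);
        inprogress = current;
        has_inprogress = true;
    }

    // A write goes to a new offset and only touches the current mapping,
    // so a snapshot taken earlier is unaffected.
    void write_block(int64_t blocknum, int64_t new_offset) {
        current[blocknum] = new_offset;
    }

    // Odd-numbered checkpoints use offset 0, even-numbered ones use 4096.
    static int64_t header_offset_for(uint64_t checkpoint_number) {
        return (checkpoint_number % 2 == 1) ? 0 : 4096;
    }

    // End checkpoint: promote the snapshot, clear the in-progress slot,
    // and return the header offset this checkpoint would use.
    int64_t end_checkpoint() {
        assert(has_inprogress);
        checkpointed = inprogress;
        inprogress.clear();
        has_inprogress = false;
        ++checkpoint_count;
        return header_offset_for(checkpoint_count);
    }
};
```

In this sketch, a block rewritten after begin_checkpoint() shows its new offset only in current; the snapshot, and later checkpointed, keep the offset that was valid when the checkpoint began. The first end_checkpoint() call returns 0 and the second returns 4096.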