本文分為三節,分別介紹clog的fsync頻率,原子操作,與異步提交一致性。
## PostgreSQL pg_clog fsync 頻率分析
分析一下pg_clog是在什么時候需要調用fsync的?
首先引用wiki里的一段[pg_clog](https://wiki.postgresql.org/wiki/Hint_Bits)的介紹
> Some details here are in src/backend/access/transam/README:
> 1\. “pg_clog records the commit status for each transaction that has been assigned an XID.”
> 2\. “Transactions and subtransactions are assigned permanent XIDs only when/if they first do something that requires one — typically, insert/update/delete a tuple, though there are a few other places that need an XID assigned.”
>
> pg_clog is updated only at sub or main transaction end. When the transactionid is assigned the page of the clog that contains that transactionid is checked to see if it already exists and if not, it is initialised.
> pg_clog is allocated in pages of 8kB apiece(和BLOCKSZ一致,所以不一定是8K,見后面的分析).?
> Each transaction needs 2 bits, so on an 8 kB page there is space for 4 transactions/byte * 8k bytes = 32k transactions.
> On allocation, pages are zeroed, which is the bit pattern for “transaction in progress”.?
> So when a transaction starts, it only needs to ensure that the pg_clog page that contains its status is allocated, but it need not write anything to it.?
> In 8.3 and later, this happens not when the transaction starts, but when the Xid is assigned (i.e. when the transaction first calls a read-write command).?
> In previous versions it happens when the first snapshot is taken, normally on the first command of any type with very few exceptions.
>
> This means that one transaction in every 32K writing transactions?does?have to do extra work when it assigns itself an XID, namely create and zero out the next page of pg_clog.?
> And that doesn’t just slow down the transaction in question, but the next few guys that would like an XID but arrive on the scene while the zeroing-out is still in progress.?
> This probably contributes to reported behavior that the transaction execution time is subject to unpredictable spikes.
每隔32K個事務,要擴展一個CLOG PAGE,每次擴展需要填充0,同時需要調用PG_FSYNC,這個相比FSYNC XLOG應該是比較輕量級的。但是也可能出現不可預知的響應延遲,因為如果堵塞在擴展CLOG PAGE,所有等待clog PAGE的會話都會受到影響。
這里指當CLOG buffer沒有空的SLOT時,會從所有的CLOG buffer SLOT選擇一個臟頁,將其刷出,這個時候才會產生pg_fsync。
CLOG pages don’t make their way out to disk until the internal CLOG buffers are filled, at which point the least recently used buffer there is evicted to permanent storage.
下面從代碼中分析一下pg_clog是如何調用pg_fsync刷臟頁的。
每次申請新的事務ID時,都需要調用ExtendCLOG,如果通過事務ID計算得到的CLOG PAGE頁不存在,則需要擴展;但是并不是每次擴展都需要調用pg_fsync,因為checkpoint會將clog buffer刷到磁盤,除非在申請新的CLOG PAGE時所有的clog buffer都沒有刷出臟頁,才需要主動選擇一個page并調用pg_fsync刷出對應的pg_clog/file。
src/backend/access/transam/varsup.c
~~~
/*
* Allocate the next XID for a new transaction or subtransaction.
*
* The new XID is also stored into MyPgXact before returning.
*
* Note: when this is called, we are actually already inside a valid
* transaction, since XIDs are now not allocated until the transaction
* does something. So it is safe to do a database lookup if we want to
* issue a warning about XID wrap.
*/
TransactionId
GetNewTransactionId(bool isSubXact)
{
......
/*
* If we are allocating the first XID of a new page of the commit log,
* zero out that commit-log page before returning. We must do this while
* holding XidGenLock, else another xact could acquire and commit a later
* XID before we zero the page. Fortunately, a page of the commit log
* holds 32K or more transactions, so we don't have to do this very often.
*
* Extend pg_subtrans too.
*/
ExtendCLOG(xid);
ExtendSUBTRANS(xid);
......
~~~
`ExtendCLOG(xid)`擴展clog page,調用`TransactionIdToPgIndex`計算XID和CLOG_XACTS_PER_PAGE的余數,如果不為0,則不需要擴展。
src/backend/access/transam/clog.c
~~~
#define TransactionIdToPgIndex(xid) ((xid) % (TransactionId) CLOG_XACTS_PER_PAGE)
/*
* Make sure that CLOG has room for a newly-allocated XID.
*
* NB: this is called while holding XidGenLock. We want it to be very fast
* most of the time; even when it's not so fast, no actual I/O need happen
* unless we're forced to write out a dirty clog or xlog page to make room
* in shared memory.
*/
void
ExtendCLOG(TransactionId newestXact)
{
int pageno;
/*
* No work except at first XID of a page. But beware: just after
* wraparound, the first XID of page zero is FirstNormalTransactionId.
*/
if (TransactionIdToPgIndex(newestXact) != 0 && // 余數不為0,說明不需要擴展。
!TransactionIdEquals(newestXact, FirstNormalTransactionId))
return;
pageno = TransactionIdToPage(newestXact);
LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
/* Zero the page and make an XLOG entry about it */
ZeroCLOGPage(pageno, true);
LWLockRelease(CLogControlLock);
}
~~~
`ZeroCLOGPage(pageno, true)`,調用`SimpleLruZeroPage`,擴展并初始化CLOG PAGE,寫XLOG日志。
~~~
/*
* Initialize (or reinitialize) a page of CLOG to zeroes.
* If writeXlog is TRUE, also emit an XLOG record saying we did this.
*
* The page is not actually written, just set up in shared memory.
* The slot number of the new page is returned.
*
* Control lock must be held at entry, and will be held at exit.
*/
static int
ZeroCLOGPage(int pageno, bool writeXlog)
{
int slotno;
slotno = SimpleLruZeroPage(ClogCtl, pageno);
if (writeXlog)
WriteZeroPageXlogRec(pageno);
return slotno;
}
~~~
`SimpleLruZeroPage(ClogCtl, pageno)`,調用`SlruSelectLRUPage(ctl, pageno)`,從clog shared buffer中選擇SLOT。
src/backend/access/transam/slru.c
~~~
/*
* Initialize (or reinitialize) a page to zeroes.
*
* The page is not actually written, just set up in shared memory.
* The slot number of the new page is returned.
*
* Control lock must be held at entry, and will be held at exit.
*/
int
SimpleLruZeroPage(SlruCtl ctl, int pageno)
{
SlruShared shared = ctl->shared;
int slotno;
/* Find a suitable buffer slot for the page */
slotno = SlruSelectLRUPage(ctl, pageno);
Assert(shared->page_status[slotno] == SLRU_PAGE_EMPTY ||
(shared->page_status[slotno] == SLRU_PAGE_VALID &&
!shared->page_dirty[slotno]) ||
shared->page_number[slotno] == pageno);
/* Mark the slot as containing this page */
shared->page_number[slotno] = pageno;
shared->page_status[slotno] = SLRU_PAGE_VALID;
shared->page_dirty[slotno] = true;
SlruRecentlyUsed(shared, slotno);
/* Set the buffer to zeroes */
MemSet(shared->page_buffer[slotno], 0, BLCKSZ);
/* Set the LSNs for this new page to zero */
SimpleLruZeroLSNs(ctl, slotno);
/* Assume this page is now the latest active page */
shared->latest_page_number = pageno;
return slotno;
}
~~~
`SlruSelectLRUPage(SlruCtl ctl, int pageno)`,從clog buffer選擇一個空的SLOT,如果沒有空的SLOT,則需要調用`SlruInternalWritePage(ctl, bestvalidslot, NULL)`,寫shared buffer page。
~~~
/*
* Select the slot to re-use when we need a free slot.
*
* The target page number is passed because we need to consider the
* possibility that some other process reads in the target page while
* we are doing I/O to free a slot. Hence, check or recheck to see if
* any slot already holds the target page, and return that slot if so.
* Thus, the returned slot is *either* a slot already holding the pageno
* (could be any state except EMPTY), *or* a freeable slot (state EMPTY
* or CLEAN).
*
* Control lock must be held at entry, and will be held at exit.
*/
static int
SlruSelectLRUPage(SlruCtl ctl, int pageno)
{
......
/* See if page already has a buffer assigned */ 先查看clog buffer中是否有空SLOT,有則返回,不需要調pg_fsync
for (slotno = 0; slotno < shared->num_slots; slotno++)
{
if (shared->page_number[slotno] == pageno &&
shared->page_status[slotno] != SLRU_PAGE_EMPTY)
return slotno;
}
......
/* 如果沒有找到空SLOT,則需要從clog buffer中選擇一個使用最少的PAGE,注意他不會選擇最近臨近的PAGE,優先選擇IO不繁忙的PAGE
* If we find any EMPTY slot, just select that one. Else choose a
* victim page to replace. We normally take the least recently used
* valid page, but we will never take the slot containing
* latest_page_number, even if it appears least recently used. We
* will select a slot that is already I/O busy only if there is no
* other choice: a read-busy slot will not be least recently used once
* the read finishes, and waiting for an I/O on a write-busy slot is
* inferior to just picking some other slot. Testing shows the slot
* we pick instead will often be clean, allowing us to begin a read at
* once.
*
* Normally the page_lru_count values will all be different and so
* there will be a well-defined LRU page. But since we allow
* concurrent execution of SlruRecentlyUsed() within
* SimpleLruReadPage_ReadOnly(), it is possible that multiple pages
* acquire the same lru_count values. In that case we break ties by
* choosing the furthest-back page.
*
* Notice that this next line forcibly advances cur_lru_count to a
* value that is certainly beyond any value that will be in the
* page_lru_count array after the loop finishes. This ensures that
* the next execution of SlruRecentlyUsed will mark the page newly
* used, even if it's for a page that has the current counter value.
* That gets us back on the path to having good data when there are
* multiple pages with the same lru_count.
*/
cur_count = (shared->cur_lru_count)++;
for (slotno = 0; slotno < shared->num_slots; slotno++)
{
int this_delta;
int this_page_number;
if (shared->page_status[slotno] == SLRU_PAGE_EMPTY) // 如果在此期間出現了空SLOT,返回這個slotno
return slotno;
this_delta = cur_count - shared->page_lru_count[slotno];
if (this_delta < 0)
{
/*
* Clean up in case shared updates have caused cur_count
* increments to get "lost". We back off the page counts,
* rather than trying to increase cur_count, to avoid any
* question of infinite loops or failure in the presence of
* wrapped-around counts.
*/
shared->page_lru_count[slotno] = cur_count;
this_delta = 0;
}
this_page_number = shared->page_number[slotno];
if (this_page_number == shared->latest_page_number)
continue;
if (shared->page_status[slotno] == SLRU_PAGE_VALID) // IO不繁忙的臟頁
{
if (this_delta > best_valid_delta ||
(this_delta == best_valid_delta &&
ctl->PagePrecedes(this_page_number,
best_valid_page_number)))
{
bestvalidslot = slotno;
best_valid_delta = this_delta;
best_valid_page_number = this_page_number;
}
}
else
{
if (this_delta > best_invalid_delta ||
(this_delta == best_invalid_delta &&
ctl->PagePrecedes(this_page_number,
best_invalid_page_number)))
{
bestinvalidslot = slotno; // 當所有頁面IO都繁忙時,無奈只能從IO繁忙中選擇一個.
best_invalid_delta = this_delta;
best_invalid_page_number = this_page_number;
}
}
}
/* 如果選擇到的PAGE
* If all pages (except possibly the latest one) are I/O busy, we'll
* have to wait for an I/O to complete and then retry. In that
* unhappy case, we choose to wait for the I/O on the least recently
* used slot, on the assumption that it was likely initiated first of
* all the I/Os in progress and may therefore finish first.
*/
if (best_valid_delta < 0) // 說明沒有找到SLRU_PAGE_VALID的PAGE,所有PAGE都處于IO繁忙的狀態。
{
SimpleLruWaitIO(ctl, bestinvalidslot);
continue;
}
/*
* If the selected page is clean, we're set.
*/
if (!shared->page_dirty[bestvalidslot]) // 如果這個頁面已經不是臟頁(例如被CHECKPOINT刷出了),那么直接返回
return bestvalidslot;
......
僅僅當以上所有的步驟,都沒有找到一個EMPTY SLOT時,才需要主動刷臟頁(在SlruInternalWritePage調用pg_fsync)。
/*
* Write the page. 注意第三個參數為NULL,即fdata
*/
SlruInternalWritePage(ctl, bestvalidslot, NULL);
......
~~~
`SlruInternalWritePage(SlruCtl ctl, int slotno, SlruFlush fdata)`,調用`SlruPhysicalWritePage`,執行write。
~~~
/*
* Write a page from a shared buffer, if necessary.
* Does nothing if the specified slot is not dirty.
*
* NOTE: only one write attempt is made here. Hence, it is possible that
* the page is still dirty at exit (if someone else re-dirtied it during
* the write). However, we *do* attempt a fresh write even if the page
* is already being written; this is for checkpoints.
*
* Control lock must be held at entry, and will be held at exit.
*/
static void
SlruInternalWritePage(SlruCtl ctl, int slotno, SlruFlush fdata)
{
......
/* Do the write */
ok = SlruPhysicalWritePage(ctl, pageno, slotno, fdata);
......
~~~
SLRU PAGE狀態
~~~
/*
* Page status codes. Note that these do not include the "dirty" bit.
* page_dirty can be TRUE only in the VALID or WRITE_IN_PROGRESS states;
* in the latter case it implies that the page has been re-dirtied since
* the write started.
*/
typedef enum
{
SLRU_PAGE_EMPTY, /* buffer is not in use */
SLRU_PAGE_READ_IN_PROGRESS, /* page is being read in */
SLRU_PAGE_VALID, /* page is valid and not being written */
SLRU_PAGE_WRITE_IN_PROGRESS /* page is being written out */
} SlruPageStatus;
~~~
`SlruPhysicalWritePage(ctl, pageno, slotno, fdata)`,這里涉及pg_clog相關的`SlruCtlData`結構,do_fsync=true。
~~~
/*
* Physical write of a page from a buffer slot
*
* On failure, we cannot just ereport(ERROR) since caller has put state in
* shared memory that must be undone. So, we return FALSE and save enough
* info in static variables to let SlruReportIOError make the report.
*
* For now, assume it's not worth keeping a file pointer open across
* independent read/write operations. We do batch operations during
* SimpleLruFlush, though.
*
* fdata is NULL for a standalone write, pointer to open-file info during
* SimpleLruFlush.
*/
static bool
SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno,
SlruFlush fdata);
......
int fd = -1;
......
// 如果文件不存在,自動創建
if (fd < 0)
{
/*
* If the file doesn't already exist, we should create it. It is
* possible for this to need to happen when writing a page that's not
* first in its segment; we assume the OS can cope with that. (Note:
* it might seem that it'd be okay to create files only when
* SimpleLruZeroPage is called for the first page of a segment.
* However, if after a crash and restart the REDO logic elects to
* replay the log from a checkpoint before the latest one, then it's
* possible that we will get commands to set transaction status of
* transactions that have already been truncated from the commit log.
* Easiest way to deal with that is to accept references to
* nonexistent files here and in SlruPhysicalReadPage.)
*
* Note: it is possible for more than one backend to be executing this
* code simultaneously for different pages of the same file. Hence,
* don't use O_EXCL or O_TRUNC or anything like that.
*/
SlruFileName(ctl, path, segno);
fd = OpenTransientFile(path, O_RDWR | O_CREAT | PG_BINARY,
S_IRUSR | S_IWUSR);
......
/*
* If not part of Flush, need to fsync now. We assume this happens
* infrequently enough that it's not a performance issue.
*/
if (!fdata) // 因為傳入的fdata=NULL,并且ctl->do_fsync=true,所以以下pg_fsync被調用。
{
if (ctl->do_fsync && pg_fsync(fd)) // 對于pg_clog和multixact,do_fsync=true。
{
slru_errcause = SLRU_FSYNC_FAILED;
slru_errno = errno;
CloseTransientFile(fd);
return false;
}
if (CloseTransientFile(fd))
{
slru_errcause = SLRU_CLOSE_FAILED;
slru_errno = errno;
return false;
}
}
~~~
`ctl->do_fsync`?&&?`pg_fsync(fd)`涉及的代碼:
src/include/access/slru.h
~~~
/*
* SlruCtlData is an unshared structure that points to the active information
* in shared memory.
*/
typedef struct SlruCtlData
{
SlruShared shared;
/*
* This flag tells whether to fsync writes (true for pg_clog and multixact
* stuff, false for pg_subtrans and pg_notify).
*/
bool do_fsync;
/*
* Decide which of two page numbers is "older" for truncation purposes. We
* need to use comparison of TransactionIds here in order to do the right
* thing with wraparound XID arithmetic.
*/
bool (*PagePrecedes) (int, int);
/*
* Dir is set during SimpleLruInit and does not change thereafter. Since
* it's always the same, it doesn't need to be in shared memory.
*/
char Dir[64];
} SlruCtlData;
typedef SlruCtlData *SlruCtl;
~~~
src/backend/access/transam/slru.c
~~~
......
void
SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
LWLock *ctllock, const char *subdir)
......
ctl->do_fsync = true; /* default behavior */ // 初始化LRU時,do_fsync默認是true的。
......
~~~
以下是clog初始化LRU的調用,可以看到它沒有修改do_fsync,所以是TURE。
src/backend/access/transam/clog.c
~~~
/*
* Number of shared CLOG buffers.
*
* Testing during the PostgreSQL 9.2 development cycle revealed that on a
* large multi-processor system, it was possible to have more CLOG page
* requests in flight at one time than the number of CLOG buffers which existed
* at that time, which was hardcoded to 8\. Further testing revealed that
* performance dropped off with more than 32 CLOG buffers, possibly because
* the linear buffer search algorithm doesn't scale well.
*
* Unconditionally increasing the number of CLOG buffers to 32 did not seem
* like a good idea, because it would increase the minimum amount of shared
* memory required to start, which could be a problem for people running very
* small configurations. The following formula seems to represent a reasonable
* compromise: people with very low values for shared_buffers will get fewer
* CLOG buffers as well, and everyone else will get 32.
*
* It is likely that some further work will be needed here in future releases;
* for example, on a 64-core server, the maximum number of CLOG requests that
* can be simultaneously in flight will be even larger. But that will
* apparently require more than just changing the formula, so for now we take
* the easy way out.
*/
Size
CLOGShmemBuffers(void)
{
return Min(32, Max(4, NBuffers / 512));
}
void
CLOGShmemInit(void)
{
ClogCtl->PagePrecedes = CLOGPagePrecedes;
SimpleLruInit(ClogCtl, "CLOG Ctl", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
CLogControlLock, "pg_clog");
}
~~~
以下是subtrans初始化LRU的調用,看到它修改了do_fsync=false。所以subtrans擴展PAGE時不需要調用pg_fsync。
src/backend/access/transam/subtrans.c
~~~
void
SUBTRANSShmemInit(void)
{
SubTransCtl->PagePrecedes = SubTransPagePrecedes;
SimpleLruInit(SubTransCtl, "SUBTRANS Ctl", NUM_SUBTRANS_BUFFERS, 0,
SubtransControlLock, "pg_subtrans");
/* Override default assumption that writes should be fsync'd */
SubTransCtl->do_fsync = false;
}
~~~
multixact.c也沒有修改do_fsync,所以也是需要fsync的。
MultiXactShmemInit(void)@src/backend/access/transam/multixact.c
pg_fsync代碼:
src/backend/storage/file/fd.c
~~~
/*
* pg_fsync --- do fsync with or without writethrough
*/
int
pg_fsync(int fd)
{
/* #if is to skip the sync_method test if there's no need for it */
#if defined(HAVE_FSYNC_WRITETHROUGH) && !defined(FSYNC_WRITETHROUGH_IS_FSYNC)
if (sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH)
return pg_fsync_writethrough(fd);
else
#endif
return pg_fsync_no_writethrough(fd);
}
/*
* pg_fsync_no_writethrough --- same as fsync except does nothing if
* enableFsync is off
*/
int
pg_fsync_no_writethrough(int fd)
{
if (enableFsync)
return fsync(fd);
else
return 0;
}
/*
* pg_fsync_writethrough
*/
int
pg_fsync_writethrough(int fd)
{
if (enableFsync)
{
#ifdef WIN32
return _commit(fd);
#elif defined(F_FULLFSYNC)
return (fcntl(fd, F_FULLFSYNC, 0) == -1) ? -1 : 0;
#else
errno = ENOSYS;
return -1;
#endif
}
else
return 0;
}
~~~
從上面的代碼分析,擴展clog page時,如果在CLOG BUFFER中沒有EMPTY SLOT,則需要backend process主動刷CLOG PAGE,所以會有調用pg_fsync的動作。
clog page和數據庫BLOCKSZ (database block size)一樣大,默認是8K(如果編譯數據庫軟件時沒有修改的話,默認是8KB),最大可以設置為32KB。每個事務在pg_clog中需要2個比特位來存儲事務信息(xmin commit/abort,xmax commit/abort)。所以8K的clog page可以存儲32K個事務信息,換句話說,每32K個事務,需要擴展一次clog page。
下面的代碼是clog的一些常用宏。
src/backend/access/transam/clog.c
~~~
/*
* Defines for CLOG page sizes. A page is the same BLCKSZ as is used
* everywhere else in Postgres.
*
* Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
* CLOG page numbering also wraps around at 0xFFFFFFFF/CLOG_XACTS_PER_PAGE,
* and CLOG segment numbering at
* 0xFFFFFFFF/CLOG_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT. We need take no
* explicit notice of that fact in this module, except when comparing segment
* and page numbers in TruncateCLOG (see CLOGPagePrecedes).
*/
/* We need two bits per xact, so four xacts fit in a byte */
#define CLOG_BITS_PER_XACT 2
#define CLOG_XACTS_PER_BYTE 4
#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)
#define CLOG_XACT_BITMASK ((1 << CLOG_BITS_PER_XACT) - 1)
#define TransactionIdToPage(xid) ((xid) / (TransactionId) CLOG_XACTS_PER_PAGE)
#define TransactionIdToPgIndex(xid) ((xid) % (TransactionId) CLOG_XACTS_PER_PAGE)
#define TransactionIdToByte(xid) (TransactionIdToPgIndex(xid) / CLOG_XACTS_PER_BYTE)
#define TransactionIdToBIndex(xid) ((xid) % (TransactionId) CLOG_XACTS_PER_BYTE)
~~~
查看數據庫的block size:
~~~
postgres@digoal-> pg_controldata |grep block
Database block size: 8192
WAL block size: 8192
~~~
我們可以使用stap來跟蹤是否調用pg_fsync,如果你要觀察backend process主動刷clog 臟頁,可以把checkpoint間隔開大,同時把clog shared buffer pages。
你就會觀察到backend process主動刷clog 臟頁。
~~~
Size
CLOGShmemBuffers(void)
{
return Min(32, Max(4, NBuffers / 512));
}
~~~
跟蹤
~~~
src/backend/access/transam/slru.c
SlruPhysicalWritePage
......
SlruFileName(ctl, path, segno);
fd = OpenTransientFile(path, O_RDWR | O_CREAT | PG_BINARY,
S_IRUSR | S_IWUSR);
......
src/backend/storage/file/fd.c
OpenTransientFile
pg_fsync(fd)
~~~
stap腳本
~~~
[root@digoal ~]# cat trc.stp
global f_start[999999]
probe process("/opt/pgsql/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c").call {
f_start[execname(), pid(), tid(), cpu()] = gettimeofday_ms()
printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), gettimeofday_ms(), pp(), $$parms$$)
# printf("%s -> time:%d, pp:%s\n", thread_indent(1), f_start[execname(), pid(), tid(), cpu()], pp() )
}
probe process("/opt/pgsql/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c").return {
t=gettimeofday_ms()
a=execname()
b=cpu()
c=pid()
d=pp()
e=tid()
if (f_start[a,c,e,b]) {
printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d, $return$$)
# printf("%s <- time:%d, pp:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d)
}
}
probe process("/opt/pgsql/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c").call {
f_start[execname(), pid(), tid(), cpu()] = gettimeofday_ms()
printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), gettimeofday_ms(), pp(), $$parms$$)
# printf("%s -> time:%d, pp:%s\n", thread_indent(1), f_start[execname(), pid(), tid(), cpu()], pp() )
}
probe process("/opt/pgsql/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c").return {
t=gettimeofday_ms()
a=execname()
b=cpu()
c=pid()
d=pp()
e=tid()
if (f_start[a,c,e,b]) {
printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d, $return$$)
# printf("%s <- time:%d, pp:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d)
}
}
probe process("/opt/pgsql/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c").call {
f_start[execname(), pid(), tid(), cpu()] = gettimeofday_ms()
printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), gettimeofday_ms(), pp(), $$parms$$)
# printf("%s -> time:%d, pp:%s\n", thread_indent(1), f_start[execname(), pid(), tid(), cpu()], pp() )
}
probe process("/opt/pgsql/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c").return {
t=gettimeofday_ms()
a=execname()
b=cpu()
c=pid()
d=pp()
e=tid()
if (f_start[a,c,e,b]) {
printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d, $return$$)
# printf("%s <- time:%d, pp:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d)
}
}
~~~
開啟一個pgbench執行txid_current()函數申請新的事務號。
~~~
postgres@digoal-> cat 7.sql
select txid_current();
~~~
測試,約每秒產生32K左右的請求。
~~~
postgres@digoal-> pgbench -M prepared -n -r -P 1 -f ./7.sql -c 1 -j 1 -T 100000
progress: 240.0 s, 31164.4 tps, lat 0.031 ms stddev 0.183
progress: 241.0 s, 33243.3 tps, lat 0.029 ms stddev 0.127
progress: 242.0 s, 32567.3 tps, lat 0.030 ms stddev 0.179
progress: 243.0 s, 33656.6 tps, lat 0.029 ms stddev 0.038
progress: 244.0 s, 33948.1 tps, lat 0.029 ms stddev 0.021
progress: 245.0 s, 32996.8 tps, lat 0.030 ms stddev 0.046
progress: 246.0 s, 34156.7 tps, lat 0.029 ms stddev 0.015
progress: 247.0 s, 33259.5 tps, lat 0.029 ms stddev 0.074
progress: 248.0 s, 32979.6 tps, lat 0.030 ms stddev 0.043
progress: 249.0 s, 32892.6 tps, lat 0.030 ms stddev 0.039
progress: 250.0 s, 33090.7 tps, lat 0.029 ms stddev 0.020
progress: 251.0 s, 33238.3 tps, lat 0.029 ms stddev 0.017
progress: 252.0 s, 32341.3 tps, lat 0.030 ms stddev 0.045
progress: 253.0 s, 31999.0 tps, lat 0.030 ms stddev 0.167
progress: 254.0 s, 33332.6 tps, lat 0.029 ms stddev 0.056
progress: 255.0 s, 30394.6 tps, lat 0.032 ms stddev 0.027
progress: 256.0 s, 31862.7 tps, lat 0.031 ms stddev 0.023
progress: 257.0 s, 31574.0 tps, lat 0.031 ms stddev 0.112
~~~
跟蹤backend process
~~~
postgres@digoal-> ps -ewf|grep postgres
postgres 2921 1883 29 09:37 pts/1 00:00:05 pgbench -M prepared -n -r -P 1 -f ./7.sql -c 1 -j 1 -T 100000
postgres 2924 1841 66 09:37 ? 00:00:13 postgres: postgres postgres [local] SELECT
~~~
從日志中抽取pg_clog相關的跟蹤結果。
~~~
[root@digoal ~]# stap -vp 5 -DMAXSKIPPED=9999999 -DSTP_NO_OVERLOAD -DMAXTRYLOCK=100 ./trc.stp -x 2924 >./stap.log 2>&1
0 postgres(2924): -> time:1441503927731, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").call, par:ctl={.shared=0x7f74a9fe39c0, .do_fsync='\001', .PagePrecedes=0x4b1960, .Dir="pg_clog"} pageno=12350 slotno=10 fdata=ERROR
31 postgres(2924): -> time:1441503927731, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").call, par:fileName="pg_clog/0181" fileFlags=66 fileMode=384
53 postgres(2924): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").return, par:14
102 postgres(2924): -> time:1441503927731, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").call, par:fd=14
1096 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").return, par:0
1113 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").return, par:'\001'
1105302 postgres(2924): -> time:1441503928836, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").call, par:ctl={.shared=0x7f74a9fe39c0, .do_fsync='\001', .PagePrecedes=0x4b1960, .Dir="pg_clog"} pageno=12351 slotno=11 fdata=ERROR
1105329 postgres(2924): -> time:1441503928836, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").call, par:fileName="pg_clog/0181" fileFlags=66 fileMode=384
1105348 postgres(2924): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").return, par:14
1105405 postgres(2924): -> time:1441503928836, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").call, par:fd=14
1106440 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").return, par:0
1106452 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").return, par:'\001'
2087891 postgres(2924): -> time:1441503929819, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").call, par:ctl={.shared=0x7f74a9fe39c0, .do_fsync='\001', .PagePrecedes=0x4b1960, .Dir="pg_clog"} pageno=12352 slotno=12 fdata=ERROR
2087917 postgres(2924): -> time:1441503929819, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").call, par:fileName="pg_clog/0182" fileFlags=66 fileMode=384
2087958 postgres(2924): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("OpenTransientFile@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:1710").return, par:14
2088013 postgres(2924): -> time:1441503929819, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").call, par:fd=14
2089250 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("pg_fsync@/opt/soft_bak/postgresql-9.4.4/src/backend/storage/file/fd.c:315").return, par:0
2089265 postgres(2924): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SlruPhysicalWritePage@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/slru.c:699").return, par:'\001'
~~~
計算估計,每隔1秒左右會產生一次fsync。
~~~
postgres=# select 1441503928836-1441503927731;
?column?
----------
1105
(1 row)
postgres=# select 1441503929819-1441503928836;
?column?
----------
983
(1 row)
~~~
前面pgbench的輸出看到每秒產生約32000個事務,剛好等于一個clog頁的事務數(本例數據塊大小為8KB)。
每個事務需要2個比特位,每個字節存儲4個事務信息,8192*4=32768。
如果你需要觀察backend process不刷clog buffer臟頁的情況。可以把checkpoint 間隔改小,或者手動執行checkpoint,同時還需要把clog buffer pages改大,例如:
~~~
Size
CLOGShmemBuffers(void)
{
return Min(1024, Max(4, NBuffers / 2));
}
~~~
使用同樣的stap腳本,你就觀察不到backend process主動刷clog dirty page了。
通過以上分析,如果你發現backend process頻繁的clog,可以采取一些優化手段。
1. 因為每次擴展pg_clog文件后,文件大小都會發生變化,此時如果backend process調用pg_fdatasync也會寫文件系統metadata journal(以EXT4為例,假設mount參數data不等于writeback),這個操作是整個文件系統串行的,容易產生堵塞;
所以backend process挑選clog page時,不選擇最近的page number可以起到一定的效果,(最好是不選擇最近的clog file中的pages);
另一種方法是先調用`sync_file_range`, SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER,它不需要寫metadata。將文件寫入后再調用pg_fsync。減少等待data fsync的時間;
2. pg_clog文件預分配,目前pg_clog單個文件的大小是由CLOGShmemBuffers決定的,為BLOCKSZ的32倍。可以嘗試預分配這個文件,而不是每次都擴展,改變它的大小;
3. 延遲backend process 的 fsync請求到checkpoint處理。
[參考]
https://wiki.postgresql.org/wiki/Hint_Bits
http://blog.163.com/digoal@126/blog/static/1638770402015840480734/
src/backend/access/transam/varsup.c
src/backend/access/transam/clog.c
src/backend/access/transam/slru.c
src/include/access/slru.h
src/backend/access/transam/subtrans.c
src/backend/storage/file/fd.c
## pg_clog的原子操作與pg_subtrans(子事務)
如果沒有子事務,其實很容易保證pg_clog的原子操作,但是,如果加入了子事務并為子事務分配了XID,并且某些子事務XID和父事務的XID不在同一個CLOG PAGE時,保證事務一致性就涉及CLOG的原子寫了。
PostgreSQL是通過2PC來實現CLOG的原子寫的:
1. 首先將主事務以外的CLOG PAGE中的子事務設置為sub-committed狀態;
2. 然后將主事務所在的CLOG PAGE中的子事務設置為sub-committed,同時設置主事務為committed狀態,將同頁的子事務設置為committed狀態;
3. 將其他CLOG PAGE中的子事務設置為committed狀態;
src/backend/access/transam/clog.c
~~~
/*
* TransactionIdSetTreeStatus
*
* Record the final state of transaction entries in the commit log for
* a transaction and its subtransaction tree. Take care to ensure this is
* efficient, and as atomic as possible.
*
* xid is a single xid to set status for. This will typically be
* the top level transactionid for a top level commit or abort. It can
* also be a subtransaction when we record transaction aborts.
*
* subxids is an array of xids of length nsubxids, representing subtransactions
* in the tree of xid. In various cases nsubxids may be zero.
*
* lsn must be the WAL location of the commit record when recording an async
* commit. For a synchronous commit it can be InvalidXLogRecPtr, since the
* caller guarantees the commit record is already flushed in that case. It
* should be InvalidXLogRecPtr for abort cases, too.
*
* In the commit case, atomicity is limited by whether all the subxids are in
* the same CLOG page as xid. If they all are, then the lock will be grabbed
* only once, and the status will be set to committed directly. Otherwise
* we must
* 1\. set sub-committed all subxids that are not on the same page as the
* main xid
* 2\. atomically set committed the main xid and the subxids on the same page
* 3\. go over the first bunch again and set them committed
* Note that as far as concurrent checkers are concerned, main transaction
* commit as a whole is still atomic.
*
* Example:
* TransactionId t commits and has subxids t1, t2, t3, t4
* t is on page p1, t1 is also on p1, t2 and t3 are on p2, t4 is on p3
* 1\. update pages2-3:
* page2: set t2,t3 as sub-committed
* page3: set t4 as sub-committed
* 2\. update page1:
* set t1 as sub-committed,
* then set t as committed,
then set t1 as committed
* 3\. update pages2-3:
* page2: set t2,t3 as committed
* page3: set t4 as committed
*
* NB: this is a low-level routine and is NOT the preferred entry point
* for most uses; functions in transam.c are the intended callers.
*
* XXX Think about issuing FADVISE_WILLNEED on pages that we will need,
* but aren't yet in cache, as well as hinting pages not to fall out of
* cache yet.
*/
~~~
實際調用的入口代碼在transam.c,subtrans.c中是一些低級接口。
那么什么是subtrans?
當我們使用savepoint時,會產生子事務。子事務和父事務一樣,可能消耗XID。一旦為子事務分配了XID,那么就涉及CLOG的原子操作了,因為要保證父事務和所有的子事務的CLOG一致性。
當不消耗XID時,需要通過SubTransactionId來區分子事務。
~~~
src/backend/acp:process("/opt/pgsql9.4.4/bin/postgres").function("SubTransSetParent@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:75").return, par:pageno=? entryno=? slotno=607466858 ptr=0
~~~
重新開一個會話,你會發現,子事務也消耗了XID。因為重新分配的XID已經從607466859開始了。
~~~
postgres@digoal-> psql
psql (9.4.4)
Type "help" for help.
postgres=# select txid_current();
txid_current
--------------
607466859
(1 row)
~~~
[參考]
src/backend/access/transam/clog.c
src/backend/access/transam/subtrans.c
src/backend/access/transam/transam.c
src/backend/access/transam/README
src/include/c.hr
## CLOG一致性和異步提交
異步提交是指不需要等待事務對應的wal buffer fsync到磁盤,即返回,而且寫CLOG時也不需要等待XLOG落盤。
而pg_clog和pg_xlog是兩部分存儲的,那么我們想一想,如果一個已提交事務的pg_clog已經落盤,而XLOG沒有落盤,剛好此時數據庫CRASH了。數據庫恢復時,由于該事務對應的XLOG缺失,數據無法恢復到最終狀態,但是PG_CLOG卻顯示該事務已提交,這就出問題了。
所以對于異步事務,CLOG在write前,務必等待該事務對應的XLOG已經FLUSH到磁盤。
PostgreSQL如何記錄事務和它產生的XLOG的LSN的關系呢?
其實不是一一對應的關系,而是記錄了多事務對一個LSN的關系。
src/backend/access/transam/clog.c
LSN組,每32個事務,記錄它們對應的最大LSN。
也就是32個事務,只記錄最大的LSN。節約空間?
~~~
/* We store the latest async LSN for each group of transactions */
#define CLOG_XACTS_PER_LSN_GROUP 32 /* keep this a power of 2 */
每個CLOG頁需要分成多少個LSN組。
#define CLOG_LSNS_PER_PAGE (CLOG_XACTS_PER_PAGE / CLOG_XACTS_PER_LSN_GROUP)
#define GetLSNIndex(slotno, xid) ((slotno) * CLOG_LSNS_PER_PAGE + \
((xid) % (TransactionId) CLOG_XACTS_PER_PAGE) / CLOG_XACTS_PER_LSN_GROUP)
~~~
LSN被存儲在這個數據結構中
src/include/access/slru.h
~~~
/*
* Shared-memory state
*/
typedef struct SlruSharedData
{
......
/*
* Optional array of WAL flush LSNs associated with entries in the SLRU
* pages. If not zero/NULL, we must flush WAL before writing pages (true
* for pg_clog, false for multixact, pg_subtrans, pg_notify). group_lsn[]
* has lsn_groups_per_page entries per buffer slot, each containing the
* highest LSN known for a contiguous group of SLRU entries on that slot's
* page. 僅僅pg_clog需要記錄group_lsn
*/
XLogRecPtr *group_lsn; // 一個數組,存儲32個事務組成的組中最大的LSN號。
int lsn_groups_per_page;
......
~~~
src/backend/access/transam/clog.c
~~~
* lsn must be the WAL location of the commit record when recording an async
* commit. For a synchronous commit it can be InvalidXLogRecPtr, since the
* caller guarantees the commit record is already flushed in that case. It
* should be InvalidXLogRecPtr for abort cases, too.
void
TransactionIdSetTreeStatus(TransactionId xid, int nsubxids,
TransactionId *subxids, XidStatus status, XLogRecPtr lsn)
{
......
~~~
更新事務狀態時,同時更新對應LSN組的LSN為最大LSN值。(CLOG BUFFER中的操作)
~~~
/*
* Sets the commit status of a single transaction.
*
* Must be called with CLogControlLock held
*/
static void
TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn, int slotno)
{
......
/*
* Update the group LSN if the transaction completion LSN is higher.
*
* Note: lsn will be invalid when supplied during InRecovery processing,
* so we don't need to do anything special to avoid LSN updates during
* recovery. After recovery completes the next clog change will set the
* LSN correctly.
*/
if (!XLogR int lsnindex = GetLSNIndex(slotno, xid);
if (ClogCtl->shared->group_lsn[lsnindex] < lsn) // 更新組LSN
ClogCtl->shared->group_lsn[lsnindex] = lsn;
}
......
~~~
將事務標記為commit狀態,對于異步事務,多一個LSN參數,用于修改事務組的最大LSN。
~~~
/*
* TransactionIdCommitTree
* Marks the given transaction and children as committed
*
* "xid" is a toplevel transaction commit, and the xids array contains its
* committed subtransactions.
*
* This commit operation is not guaranteed to be atomic, but if not, subxids
* are correctly marked subcommit first.
*/
void
TransactionIdCommitTree(TransactionId xid, int nxids, TransactionId *xids)
{
TransactionIdSetTreeStatus(xid, nxids, xids,
TRANSACTION_STATUS_COMMITTED,
InvalidXLogRecPtr);
}
/*
* TransactionIdAsyncCommitTree
* Same as above, but for async commits. The commit record LSN is needed.
*/
void
TransactionIdAsyncCommitTree(TransactionId xid, int nxids, TransactionId *xids,
XLogRecPtr lsn)
{
TransactionIdSetTreeStatus(xid, nxids, xids,
TRANSACTION_STATUS_COMMITTED, lsn);
}
/*
* TransactionIdAbortTree
* Marks the given transaction and children as aborted.
*
* "xid" is a toplevel transaction commit, and the xids array contains its
* committed subtransactions.
*
* We don't need to worry about the non-atomic behavior, since any onlookers
* will consider all the xacts as not-yet-committed anyway.
*/
void
TransactionIdAbortTree(TransactionId xid, int nxids, TransactionId *xids)
{
TransactionIdSetTreeStatus(xid, nxids, xids,
TRANSACTION_STATUS_ABORTED, InvalidXLogRecPtr);
}
~~~
從XID號,獲取它對應的LSN,需要注意的是,這個XID如果是一個FROZEN XID,則返回一個(XLogRecPtr) invalid lsn。
src/backend/access/transam/transam.c
~~~
/*
* TransactionIdGetCommitLSN
*
* This function returns an LSN that is late enough to be able
* to guarantee that if we flush up to the LSN returned then we
* will have flushed the transaction's commit record to disk.
*
* The result is not necessarily the exact LSN of the transaction's
* commit record! For example, for long-past transactions (those whose
* clog pages already migrated to disk), we'll return InvalidXLogRecPtr.
* Also, because we group transactions on the same clog page to conserve
* storage, we might return the LSN of a later transaction that falls into
* the same group.
*/
XLogRecPtr
TransactionIdGetCommitLSN(TransactionId xid)
{
XLogRecPtr result;
/*
* Currently, all uses of this function are for xids that were just
* reported to be committed by TransactionLogFetch, so we expect that
* checking TransactionLogFetch's cache will usually succeed and avoid an
* extra trip to shared memory.
*/
if (TransactionIdEquals(xid, cachedFetchXid))
return cachedCommitLSN;
/* Special XIDs are always known committed */
if (!TransactionIdIsNormal(xid))
return InvalidXLogRecPtr;
/*
* Get the transaction status.
*/
(void) TransactionIdGetStatus(xid, &result);
return result;
}
/*
* Interrogate the state of a transaction in the commit log.
*
* Aside from the actual commit status, this function returns (into *lsn)
* an LSN that is late enough to be able to guarantee that if we flush up to
* that LSN then we will have flushed the transaction's commit record to disk.
* The result is not necessarily the exact LSN of the transaction's commit
* record! For example, for long-past transactions (those whose clog pages // long-past事務,指非標準事務號。例), we'll return InvalidXLogRecPtr. Also, because
* we group transactions on the same clog page to conserve storage, we might
* return the LSN of a later transaction that falls into the same group.
*
* NB: this is a low-level routine and is NOT the preferred entry point
* for most uses; TransactionLogFetch() in transam.c is the intended caller.
*/
XidStatus
TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
{
int pageno = TransactionIdToPage(xid);
int byteno = TransactionIdToByte(xid);
int bshift = TransactionIdToBIndex(xid) * CLOG_BITS_PER_XACT;
int slotno;
int lsnindex;
char *byteptr;
XidStatus status;
/* lock is acquired by SimpleLruReadPage_ReadOnly */
slotno = SimpleLruReadPage_ReadOnly(ClogCtl, pageno, xid);
byteptr = ClogCtl->shared->page_buffer[slotno] + byteno;
status = (*byteptr >> bshift) & CLOG_XACT_BITMASK;
lsnindex = GetLSNIndex(slotno, xid);
*lsn = ClogCtl->shared->group_lsn[lsnindex];
LWLockRelease(CLogControlLock);
return status;
}
~~~
前面所涉及的都是CLOG BUFFER中的操作,如果要將buffer寫到磁盤,則真正需要涉及到一致性的問題,即在將CLOG write到磁盤前,必須先確保對應的事務產生的XLOG已經flush到磁盤。那么這里就需要用到前面每個LSN組中記錄的max LSN了。
代碼如下:
src/backend/access/transam/slru.c
~~~
/*
* Physical write of a page from a buffer slot
*
* On failure, we cannot just ereport(ERROR) since caller has put state in
* shared memory that must be undone. So, we return FALSE and save enough
* info in static variables to let SlruReportIOError make the report.
*
* For now, assume it's not worth keeping a file pointer open across
* independent read/write operations. We do batch operations during
* SimpleLruFlush, though.
*
* fdata is NULL for a standalone write, pointer to open-file info during
* SimpleLruFlush.
*/
static bool
SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
{
SlruShared shared = ctl->shared;
int segno = pageno / SLRU_PAGES_PER_SEGMENT;
int rpageno = pageno % SLRU_PAGES_PER_SEGMENT;
int offset = rpageno * BLCKSZ;
char path[MAXPGPATH];
int fd = -1;
/*
* Honor the write-WAL-before-data rule, if appropriate, so that we do not
* write out data before associated WAL records. This is the same action
* performed during FlushBuffer() in the main buffer manager.
*/
if (shared->group_lsn != NULL)
{
/*
* We must determine the largest async-commit LSN for the page. This
* is a bit tedious, but since this entire function is a slow path
* anyway, it seems better to do this here than to maintain a per-page
* LSN variable (which'd need an extra comparison in the
* transaction-commit path).
*/
XLogRecPtr max_lsn;
int lsnindex,
lsnoff;
lsnindex = slotno * shared->lsn_groups_per_page;
max_lsn = shared->group_lsn[lsnindex++];
for (lsnoff = 1; lsnoff < shared->lsn_groups_per_page; lsnoff++)
{
XLogRecPtr this_lsn = shared->group_lsn[lsnindex++];
if (max_lsn < this_lsn)
max_lsn = this_lsn;
}
if (!XLogRecPtrIsInvalid(max_lsn)) // 判斷max_lsn是不是一個有效的LSN,如果是有效的LSN,說明需要先調用xlogflush將wal buffer中小于該LSN以及以前的buffer寫入磁盤。
則。
{
/*
* As noted above, elog(ERROR) is not acceptable here, so if
* XLogFlush were to fail, we must PANIC. This isn't much of a
* restriction because XLogFlush is just about all critical
* section anyway, but let's make sure.
*/
START_CRIT_SECTION();
XLogFlush(max_lsn);
END_CRIT_SECTION();
}
}
......
~~~
小結
對于異步事務,如何保證write-WAL-before-data規則?
pg_clog將32個事務分為一組,存儲這些事務的最大LSN。存儲在SlruSharedData結構中。
在將clog buffer write到磁盤前,需要確保該clog page對應事務的xlog LSN已經flush到磁盤。
[參考]
src/backend/access/transam/clog.c
src/include/access/slru.h
src/backend/access/transam/transam.c
src/backend/access/transam/slru.c
- 數據庫內核月報目錄
- 數據庫內核月報 - 2016/09
- MySQL · 社區貢獻 · AliSQL那些事兒
- PetaData · 架構體系 · PetaData第二代低成本存儲體系
- MySQL · 社區動態 · MariaDB 10.2 前瞻
- MySQL · 特性分析 · 執行計劃緩存設計與實現
- PgSQL · 最佳實踐 · pg_rman源碼淺析與使用
- MySQL · 捉蟲狀態 · bug分析兩例
- PgSQL · 源碼分析 · PG優化器淺析
- MongoDB · 特性分析· Sharding原理與應用
- PgSQL · 源碼分析 · PG中的無鎖算法和原子操作應用一則
- SQLServer · 最佳實踐 · TEMPDB的設計
- 數據庫內核月報 - 2016/08
- MySQL · 特性分析 ·MySQL 5.7新特性系列四
- PgSQL · PostgreSQL 邏輯流復制技術的秘密
- MySQL · 特性分析 · MyRocks簡介
- GPDB · 特性分析· Greenplum 備份架構
- SQLServer · 最佳實踐 · RDS for SQLServer 2012權限限制提升與改善
- TokuDB · 引擎特性 · REPLACE 語句優化
- MySQL · 專家投稿 · InnoDB物理行中null值的存儲的推斷與驗證
- PgSQL · 實戰經驗 · 旋轉門壓縮算法在PostgreSQL中的實現
- MySQL · 源碼分析 · Query Cache并發處理
- PgSQL · 源碼分析· pg_dump分析
- 數據庫內核月報 - 2016/07
- MySQL · 特性分析 ·MySQL 5.7新特性系列三
- MySQL · 特性分析 · 5.7 代價模型淺析
- PgSQL · 實戰經驗 · 分組TOP性能提升44倍
- MySQL · 源碼分析 · 網絡通信模塊淺析
- MongoDB · 特性分析 · 索引原理
- SQLServer · 特性分析 · XML與JSON應用比較
- MySQL · 最佳實戰 · 審計日志實用案例分析
- MySQL · 性能優化 · 條件下推到物化表
- MySQL · 源碼分析 · Query Cache內部剖析
- MySQL · 捉蟲動態 · 備庫1206錯誤問題說明
- 數據庫內核月報 - 2016/06
- MySQL · 特性分析 · innodb 鎖分裂繼承與遷移
- MySQL · 特性分析 ·MySQL 5.7新特性系列二
- PgSQL · 實戰經驗 · 如何預測Freeze IO風暴
- GPDB · 特性分析· Filespace和Tablespace
- MariaDB · 新特性 · 窗口函數
- MySQL · TokuDB · checkpoint過程
- MySQL · 特性分析 · 內部臨時表
- MySQL · 最佳實踐 · 空間優化
- SQLServer · 最佳實踐 · 數據庫實現大容量插入的幾種方式
- 數據庫內核月報 - 2016/05
- MySQL · 引擎特性 · 基于InnoDB的物理復制實現
- MySQL · 特性分析 · MySQL 5.7新特性系列一
- PostgreSQL · 特性分析 · 邏輯結構和權限體系
- MySQL · 特性分析 · innodb buffer pool相關特性
- PG&GP · 特性分析 · 外部數據導入接口實現分析
- SQLServer · 最佳實踐 · 透明數據加密在SQLServer的應用
- MySQL · TokuDB · 日志子系統和崩潰恢復過程
- MongoDB · 特性分析 · Sharded cluster架構原理
- PostgreSQL · 特性分析 · 統計信息計算方法
- MySQL · 捉蟲動態 · left-join多表導致crash
- 數據庫內核月報 - 2016/04
- MySQL · 參數故事 · innodb_additional_mem_pool_size
- GPDB · 特性分析 · Segment事務一致性與異常處理
- GPDB · 特性分析 · Segment 修復指南
- MySQL · 捉蟲動態 · 并行復制外鍵約束問題二
- PgSQL · 性能優化 · 如何瀟灑的處理每天上百TB的數據增量
- Memcached · 最佳實踐 · 熱點 Key 問題解決方案
- MongoDB · 最佳實踐 · 短連接Auth性能優化
- MySQL · 最佳實踐 · RDS 只讀實例延遲分析
- MySQL · TokuDB · TokuDB索引結構--Fractal Tree
- MySQL · TokuDB · Savepoint漫談
- 數據庫內核月報 - 2016/03
- MySQL · TokuDB · 事務子系統和 MVCC 實現
- MongoDB · 特性分析 · MMAPv1 存儲引擎原理
- PgSQL · 源碼分析 · 優化器邏輯推理
- SQLServer · BUG分析 · Agent 鏈接泄露分析
- Redis · 特性分析 · AOF Rewrite 分析
- MySQL · BUG分析 · Rename table 死鎖分析
- MySQL · 物理備份 · Percona XtraBackup 備份原理
- GPDB · 特性分析· GreenPlum FTS 機制
- MySQL · 答疑解惑 · 備庫Seconds_Behind_Master計算
- MySQL · 答疑解惑 · MySQL 鎖問題最佳實踐
- 數據庫內核月報 - 2016/02
- MySQL · 引擎特性 · InnoDB 文件系統之文件物理結構
- MySQL · 引擎特性 · InnoDB 文件系統之IO系統和內存管理
- MySQL · 特性分析 · InnoDB transaction history
- PgSQL · 會議見聞 · PgConf.Russia 2016 大會總結
- PgSQL · 答疑解惑 · PostgreSQL 9.6 并行查詢實現分析
- MySQL · TokuDB · TokuDB之黑科技工具
- PgSQL · 性能優化 · PostgreSQL TPC-C極限優化玩法
- MariaDB · 版本特性 · MariaDB 的 GTID 介紹
- MySQL · 特性分析 · 線程池
- MySQL · 答疑解惑 · mysqldump tips 兩則
- 數據庫內核月報 - 2016/01
- MySQL · 引擎特性 · InnoDB 事務鎖系統簡介
- GPDB · 特性分析· GreenPlum Primary/Mirror 同步機制
- MySQL · 專家投稿 · MySQL5.7 的 JSON 實現
- MySQL · 特性分析 · 優化器 MRR & BKA
- MySQL · 答疑解惑 · 物理備份死鎖分析
- MySQL · TokuDB · Cachetable 的工作線程和線程池
- MySQL · 特性分析 · drop table的優化
- MySQL · 答疑解惑 · GTID不一致分析
- PgSQL · 特性分析 · Plan Hint
- MariaDB · 社區動態 · MariaDB on Power8 (下)
- 數據庫內核月報 - 2015/12
- MySQL · 引擎特性 · InnoDB 事務子系統介紹
- PgSQL · 特性介紹 · 全文搜索介紹
- MongoDB · 捉蟲動態 · Kill Hang問題排查記錄
- MySQL · 參數優化 ·RDS MySQL參數調優最佳實踐
- PgSQL · 特性分析 · 備庫激活過程分析
- MySQL · TokuDB · 讓Hot Backup更完美
- PgSQL · 答疑解惑 · 表膨脹
- MySQL · 特性分析 · Index Condition Pushdown (ICP)
- MariaDB · 社區動態 · MariaDB on Power8
- MySQL · 特性分析 · 企業版特性一覽
- 數據庫內核月報 - 2015/11
- MySQL · 社區見聞 · OOW 2015 總結 MySQL 篇
- MySQL · 特性分析 · Statement Digest
- PgSQL · 答疑解惑 · PostgreSQL 用戶組權限管理
- MySQL · 特性分析 · MDL 實現分析
- PgSQL · 特性分析 · full page write 機制
- MySQL · 捉蟲動態 · MySQL 外鍵異常分析
- MySQL · 答疑解惑 · MySQL 優化器 range 的代價計算
- MySQL · 捉蟲動態 · ORDER/GROUP BY 導致 mysqld crash
- MySQL · TokuDB · TokuDB 中的行鎖
- MySQL · 捉蟲動態 · order by limit 造成優化器選擇索引錯誤
- 數據庫內核月報 - 2015/10
- MySQL · 引擎特性 · InnoDB 全文索引簡介
- MySQL · 特性分析 · 跟蹤Metadata lock
- MySQL · 答疑解惑 · 索引過濾性太差引起CPU飆高分析
- PgSQL · 特性分析 · PG主備流復制機制
- MySQL · 捉蟲動態 · start slave crash 診斷分析
- MySQL · 捉蟲動態 · 刪除索引導致表無法打開
- PgSQL · 特性分析 · PostgreSQL Aurora方案與DEMO
- TokuDB · 捉蟲動態 · CREATE DATABASE 導致crash問題
- PgSQL · 特性分析 · pg_receivexlog工具解析
- MySQL · 特性分析 · MySQL權限存儲與管理
- 數據庫內核月報 - 2015/09
- MySQL · 引擎特性 · InnoDB Adaptive hash index介紹
- PgSQL · 特性分析 · clog異步提交一致性、原子操作與fsync
- MySQL · 捉蟲動態 · BUG 幾例
- PgSQL · 答疑解惑 · 詭異的函數返回值
- MySQL · 捉蟲動態 · 建表過程中crash造成重建表失敗
- PgSQL · 特性分析 · 談談checkpoint的調度
- MySQL · 特性分析 · 5.6 并行復制恢復實現
- MySQL · 備庫優化 · relay fetch 備庫優化
- MySQL · 特性分析 · 5.6并行復制事件分發機制
- MySQL · TokuDB · 文件目錄談
- 數據庫內核月報 - 2015/08
- MySQL · 社區動態 · InnoDB Page Compression
- PgSQL · 答疑解惑 · RDS中的PostgreSQL備庫延遲原因分析
- MySQL · 社區動態 · MySQL5.6.26 Release Note解讀
- PgSQL · 捉蟲動態 · 執行大SQL語句提示無效的內存申請大小
- MySQL · 社區動態 · MariaDB InnoDB表空間碎片整理
- PgSQL · 答疑解惑 · 歸檔進程cp命令的core文件追查
- MySQL · 答疑解惑 · open file limits
- MySQL · TokuDB · 瘋狂的 filenum++
- MySQL · 功能分析 · 5.6 并行復制實現分析
- MySQL · 功能分析 · MySQL表定義緩存
- 數據庫內核月報 - 2015/07
- MySQL · 引擎特性 · Innodb change buffer介紹
- MySQL · TokuDB · TokuDB Checkpoint機制
- PgSQL · 特性分析 · 時間線解析
- PgSQL · 功能分析 · PostGIS 在 O2O應用中的優勢
- MySQL · 引擎特性 · InnoDB index lock前世今生
- MySQL · 社區動態 · MySQL內存分配支持NUMA
- MySQL · 答疑解惑 · 外鍵刪除bug分析
- MySQL · 引擎特性 · MySQL logical read-ahead
- MySQL · 功能介紹 · binlog拉取速度的控制
- MySQL · 答疑解惑 · 浮點型的顯示問題
- 數據庫內核月報 - 2015/06
- MySQL · 引擎特性 · InnoDB 崩潰恢復過程
- MySQL · 捉蟲動態 · 唯一鍵約束失效
- MySQL · 捉蟲動態 · ALTER IGNORE TABLE導致主備不一致
- MySQL · 答疑解惑 · MySQL Sort 分頁
- MySQL · 答疑解惑 · binlog event 中的 error code
- PgSQL · 功能分析 · Listen/Notify 功能
- MySQL · 捉蟲動態 · 任性的 normal shutdown
- PgSQL · 追根究底 · WAL日志空間的意外增長
- MySQL · 社區動態 · MariaDB Role 體系
- MySQL · TokuDB · TokuDB數據文件大小計算
- 數據庫內核月報 - 2015/05
- MySQL · 引擎特性 · InnoDB redo log漫游
- MySQL · 專家投稿 · MySQL數據庫SYS CPU高的可能性分析
- MySQL · 捉蟲動態 · 5.6 與 5.5 InnoDB 不兼容導致 crash
- MySQL · 答疑解惑 · InnoDB 預讀 VS Oracle 多塊讀
- PgSQL · 社區動態 · 9.5 新功能BRIN索引
- MySQL · 捉蟲動態 · MySQL DDL BUG
- MySQL · 答疑解惑 · set names 都做了什么
- MySQL · 捉蟲動態 · 臨時表操作導致主備不一致
- TokuDB · 引擎特性 · zstd壓縮算法
- MySQL · 答疑解惑 · binlog 位點刷新策略
- 數據庫內核月報 - 2015/04
- MySQL · 引擎特性 · InnoDB undo log 漫游
- TokuDB · 產品新聞 · RDS TokuDB小手冊
- PgSQL · 社區動態 · 說一說PgSQL 9.4.1中的那些安全補丁
- MySQL · 捉蟲動態 · 連接斷開導致XA事務丟失
- MySQL · 捉蟲動態 · GTID下slave_net_timeout值太小問題
- MySQL · 捉蟲動態 · Relay log 中 GTID group 完整性檢測
- MySQL · 答疑釋惑 · UPDATE交換列單表和多表的區別
- MySQL · 捉蟲動態 · 刪被引用索引導致crash
- MySQL · 答疑釋惑 · GTID下auto_position=0時數據不一致
- 數據庫內核月報 - 2015/03
- MySQL · 答疑釋惑· 并發Replace into導致的死鎖分析
- MySQL · 性能優化· 5.7.6 InnoDB page flush 優化
- MySQL · 捉蟲動態· pid file丟失問題分析
- MySQL · 答疑釋惑· using filesort VS using temporary
- MySQL · 優化限制· MySQL index_condition_pushdown
- MySQL · 捉蟲動態·DROP DATABASE外鍵約束的GTID BUG
- MySQL · 答疑釋惑· lower_case_table_names 使用問題
- PgSQL · 特性分析· Logical Decoding探索
- PgSQL · 特性分析· jsonb類型解析
- TokuDB ·引擎機制· TokuDB線程池
- 數據庫內核月報 - 2015/02
- MySQL · 性能優化· InnoDB buffer pool flush策略漫談
- MySQL · 社區動態· 5.6.23 InnoDB相關Bugfix
- PgSQL · 特性分析· Replication Slot
- PgSQL · 特性分析· pg_prewarm
- MySQL · 答疑釋惑· InnoDB丟失自增值
- MySQL · 答疑釋惑· 5.5 和 5.6 時間類型兼容問題
- MySQL · 捉蟲動態· 變量修改導致binlog錯誤
- MariaDB · 特性分析· 表/表空間加密
- MariaDB · 特性分析· Per-query variables
- TokuDB · 特性分析· 日志詳解
- 數據庫內核月報 - 2015/01
- MySQL · 性能優化· Group Commit優化
- MySQL · 新增特性· DDL fast fail
- MySQL · 性能優化· 啟用GTID場景的性能問題及優化
- MySQL · 捉蟲動態· InnoDB自增列重復值問題
- MySQL · 優化改進· 復制性能改進過程
- MySQL · 談古論今· key分區算法演變分析
- MySQL · 捉蟲動態· mysql client crash一例
- MySQL · 捉蟲動態· 設置 gtid_purged 破壞AUTO_POSITION復制協議
- MySQL · 捉蟲動態· replicate filter 和 GTID 一起使用的問題
- TokuDB·特性分析· Optimize Table
- 數據庫內核月報 - 2014/12
- MySQL· 性能優化·5.7 Innodb事務系統
- MySQL· 踩過的坑·5.6 GTID 和存儲引擎那會事
- MySQL· 性能優化·thread pool 原理分析
- MySQL· 性能優化·并行復制外建約束問題
- MySQL· 答疑釋惑·binlog event有序性
- MySQL· 答疑釋惑·server_id為0的Rotate
- MySQL· 性能優化·Bulk Load for CREATE INDEX
- MySQL· 捉蟲動態·Opened tables block read only
- MySQL· 優化改進· GTID啟動優化
- TokuDB· Binary Log Group Commit with TokuDB
- 數據庫內核月報 - 2014/11
- MySQL· 捉蟲動態·OPTIMIZE 不存在的表
- MySQL· 捉蟲動態·SIGHUP 導致 binlog 寫錯
- MySQL· 5.7改進·Recovery改進
- MySQL· 5.7特性·高可用支持
- MySQL· 5.7優化·Metadata Lock子系統的優化
- MySQL· 5.7特性·在線Truncate undo log 表空間
- MySQL· 性能優化·hash_scan 算法的實現解析
- TokuDB· 版本優化· 7.5.0
- TokuDB· 引擎特性· FAST UPDATES
- MariaDB· 性能優化·filesort with small LIMIT optimization
- 數據庫內核月報 - 2014/10
- MySQL· 5.7重構·Optimizer Cost Model
- MySQL· 系統限制·text字段數
- MySQL· 捉蟲動態·binlog重放失敗
- MySQL· 捉蟲動態·從庫OOM
- MySQL· 捉蟲動態·崩潰恢復失敗
- MySQL· 功能改進·InnoDB Warmup特性
- MySQL· 文件結構·告別frm文件
- MariaDB· 新鮮特性·ANALYZE statement 語法
- TokuDB· 主備復制·Read Free Replication
- TokuDB· 引擎特性·壓縮
- 數據庫內核月報 - 2014/09
- MySQL· 捉蟲動態·GTID 和 DELAYED
- MySQL· 限制改進·GTID和升級
- MySQL· 捉蟲動態·GTID 和 binlog_checksum
- MySQL· 引擎差異·create_time in status
- MySQL· 參數故事·thread_concurrency
- MySQL· 捉蟲動態·auto_increment
- MariaDB· 性能優化·Extended Keys
- MariaDB·主備復制·CREATE OR REPLACE
- TokuDB· 參數故事·數據安全和性能
- TokuDB· HA方案·TokuDB熱備
- 數據庫內核月報 - 2014/08
- MySQL· 參數故事·timed_mutexes
- MySQL· 參數故事·innodb_flush_log_at_trx_commit
- MySQL· 捉蟲動態·Count(Distinct) ERROR
- MySQL· 捉蟲動態·mysqldump BUFFER OVERFLOW
- MySQL· 捉蟲動態·long semaphore waits
- MariaDB·分支特性·支持大于16K的InnoDB Page Size
- MariaDB·分支特性·FusionIO特性支持
- TokuDB· 性能優化·Bulk Fetch
- TokuDB· 數據結構·Fractal-Trees與LSM-Trees對比
- TokuDB·社區八卦·TokuDB團隊