PostgreSQL · 特性分析 · 統計信息計算方法 · 數據庫內核月報

一條SQL在PG中的執行過程是： ~~~ ----> SQL輸入 ----> 解析SQL，獲取解析后的語法樹 ----> 分析、重寫語法樹，獲取查詢樹 ----> 根據重寫、分析后的查詢樹計算各路徑代價，從而選擇一條成本最優的執行樹 ----> 根據執行樹進行執行 ----> 獲取結果并返回 ~~~ PostgreSQL的SQL優化、執行方式為代價模型。而這里的各路徑的代價計算，則是依賴于系統表中的統計信息。那么這些統計信息如何得來的？就是這里要討論的問題。 PostgreSQL是catalog-driven型的數據庫，引擎運行過程中所有所需的數據、信息都存放在系統表中，統計信息不例外。這些統計信息，則是通過SQL命令vacuum和analyze分別寫入pg_class和pg_statistic中的。 [參考官方文檔ANALYZE](http://www.postgresql.org/docs/9.1/static/sql-analyze.html) ## pg_class && pg_statistic pg_class的表結構如下： ~~~ => \d pg_class Table "pg_catalog.pg_class" Column | Type | Modifiers ----------------+-----------+----------- relname | name | not null ... relpages | integer | not null reltuples | real | not null ... relkind | "char" | not null relnatts | smallint | not null ... Indexes: "pg_class_oid_index" UNIQUE, btree (oid) "pg_class_relname_nsp_index" UNIQUE, btree (relname, relnamespace) "pg_class_tblspc_relfilenode_index" btree (reltablespace, relfilenode) ~~~ 這里比較關注的是relpages和reltuples兩個字段，分別表示這張表占了多少磁盤頁和行數。其中行數是估計值。而這兩個字段的值是通過vacuum、analyze（或create index)來更新的。 [參考官方文檔pg_class](http://www.postgresql.org/docs/9.4/static/catalog-pg-class.html) pg_statistic的表結構如下： ~~~ => \d pg_statistic Table "pg_catalog.pg_statistic" Column | Type | Modifiers -------------+----------+----------- starelid | oid | not null staattnum | smallint | not null stainherit | boolean | not null stanullfrac | real | not null stawidth | integer | not null stadistinct | real | not null stakind1 | smallint | not null stakind2 | smallint | not null stakind3 | smallint | not null stakind4 | smallint | not null stakind5 | smallint | not null staop1 | oid | not null staop2 | oid | not null staop3 | oid | not null staop4 | oid | not null staop5 | oid | not null stanumbers1 | real[] | stanumbers2 | real[] | stanumbers3 | real[] | stanumbers4 | real[] | stanumbers5 | real[] | stavalues1 | anyarray | stavalues2 | anyarray | stavalues3 | anyarray | stavalues4 | anyarray | stavalues5 | anyarray | Indexes: "pg_statistic_relid_att_inh_index" UNIQUE, btree (starelid, staattnum, stainherit) ~~~ 這里的stanullfrac、stadistinct、stakindN、staopN、stanumbersN、stavaluesN等是我們所關注的值。其中： * stakindN 用于表示后面number、values所表示的數據用途，被用于生成pg_stats。如1則表示是MCV的值；2表示直方圖（histogram）的值；3表示相關性（correlation）的值等。kind的取值范圍：1~99，內核占用；100~199，PostGIS占用；200~299，ESRI ST_Geometry幾何系統占用；300~9999，公共占用。 * staopN 用于表示該統計值支持的操作，如’=’或’<’等。 * stanumbersN 如果是MCV類型（即kind=1），那么這里即是下面對應的stavaluesN出現的概率值，即MCF。 * stavaluesN anyarray類型的數據，內核特殊類型，不可更改。是統計信息的值部分，與kind對應。如kind=2的時候，則這里的值表示直方圖。這些值的更新都是通過analyze完成，N的取值是[1, 5]，由PG內核決定的。將來有可能更多。 [參考官方文檔pg_statistic](http://www.postgresql.org/docs/9.4/static/catalog-pg-statistic.html) ## 執行方式 vacuum和analyze的執行可以通過兩種方式來觸發，一種是DB用戶執行，如定時腳本或人工執行；一種是autovacuum。兩個操作持有相同類型的鎖ShareUpdateExclusiveLock，與DDL互斥。 autovacuum是PostgreSQL提供的一個deamon進程，會在一定時間內或者DML多到一定程度時觸發vacuum或analyze。這里的一定時間和一定程度可以通過autovacuum的一系列配置實現，如autovacuum_naptime、autovacuum_max_workers 、autovacuum_vacuum_threshold等；且vacuum和analyze的觸發算法和依賴參數并不盡相同。注：請參考?[autovacuum_vacuum_threshold](http://www.postgresql.org/docs/9.1/static/runtime-config-autovacuum.html#GUC-AUTOVACUUM-VACUUM-THRESHOLD)?和?[The Autovacuum Daemon](http://www.postgresql.org/docs/9.1/static/routine-vacuuming.html#AUTOVACUUM) ## analyze vacuum本身除了負責更新relpages和reltuples等之外，最主要的是： 1. 回收被更新和刪除占用的空間 2. 回收事務id，凍結老的事務id，以防止這部分老數據丟失而analyze則主要是收集統計信息，并存儲到pg_statistic表中。其主要的步驟如下： * 以共享排他鎖（ShareUpdateExclusiveLock）打開表這個鎖會與DDL之上所有的操作互斥，詳細的互斥關系如下，其值越大鎖粒度越大： ~~~ /* * These are the valid values of type LOCKMODE for all the standard lock * methods (both DEFAULT and USER). */ /* NoLock is not a lock mode, but a flag value meaning "don't get a lock" */ #define NoLock 0 #define AccessShareLock 1 /* SELECT */ #define RowShareLock 2 /* SELECT FOR UPDATE/FOR SHARE */ #define RowExclusiveLock 3 /* INSERT, UPDATE, DELETE */ #define ShareUpdateExclusiveLock 4 /* VACUUM (non-FULL),ANALYZE, CREATE * INDEX CONCURRENTLY */ #define ShareLock 5 /* CREATE INDEX (WITHOUT CONCURRENTLY) */ #define ShareRowExclusiveLock 6 /* like EXCLUSIVE MODE, but allows ROW * SHARE */ #define ExclusiveLock 7 /* blocks ROW SHARE/SELECT...FOR * UPDATE */ #define AccessExclusiveLock 8 /* ALTER TABLE, DROP TABLE, VACUUM * FULL, and unqualified LOCK TABLE */ ~~~ * 選擇采樣函數如果是普通表或者物化視圖，則采樣函數采用acquire_sample_rows；如果是外表，那么外表所用的插件需要FDW的實現，如postgres_fdw的postgresAnalyzeForeignTable。 * 檢查表的每個字段在真正開始分析之前，先檢查每個字段，并返回VacAttrStats結構體。后面所有的分析都將在此檢查之上進行。 VacAttrStats結構體如下： ~~~ typedef struct VacAttrStats { /* * These fields are set up by the main ANALYZE code before invoking the * type-specific typanalyze function. * * Note: do not assume that the data being analyzed has the same datatype * shown in attr, ie do not trust attr->atttypid, attlen, etc. This is * because some index opclasses store a different type than the underlying * column/expression. Instead use attrtypid, attrtypmod, and attrtype for * information about the datatype being fed to the typanalyze function. */ Form_pg_attribute attr; /* copy of pg_attribute row for column */ Oid attrtypid; /* type of data being analyzed */ int32 attrtypmod; /* typmod of data being analyzed */ Form_pg_type attrtype; /* copy of pg_type row for attrtypid */ MemoryContext anl_context; /* where to save long-lived data */ /* * These fields must be filled in by the typanalyze routine, unless it * returns FALSE. */ AnalyzeAttrComputeStatsFunc compute_stats; /* function pointer */ int minrows; /* Minimum # of rows wanted for stats */ void *extra_data; /* for extra type-specific data */ /* * These fields are to be filled in by the compute_stats routine. (They * are initialized to zero when the struct is created.) */ bool stats_valid; float4 stanullfrac; /* fraction of entries that are NULL */ int32 stawidth; /* average width of column values */ float4 stadistinct; /* # distinct values */ int16 stakind[STATISTIC_NUM_SLOTS]; Oid staop[STATISTIC_NUM_SLOTS]; int numnumbers[STATISTIC_NUM_SLOTS]; float4 *stanumbers[STATISTIC_NUM_SLOTS]; int numvalues[STATISTIC_NUM_SLOTS]; Datum *stavalues[STATISTIC_NUM_SLOTS]; /* * These fields describe the stavalues[n] element types. They will be * initialized to match attrtypid, but a custom typanalyze function might * want to store an array of something other than the analyzed column's * elements. It should then overwrite these fields. */ Oid statypid[STATISTIC_NUM_SLOTS]; int16 statyplen[STATISTIC_NUM_SLOTS]; bool statypbyval[STATISTIC_NUM_SLOTS]; char statypalign[STATISTIC_NUM_SLOTS]; /* * These fields are private to the main ANALYZE code and should not be * looked at by type-specific functions. */ int tupattnum; /* attribute number within tuples */ HeapTuple *rows; /* access info for std fetch function */ TupleDesc tupDesc; Datum *exprvals; /* access info for index fetch function */ bool *exprnulls; int rowstride; } VacAttrStats; ~~~ 具體的針對字段檢查的步驟如下： * 確定這個字段是否可以分析，如果不可以，則返回NULL。一般有兩種情況致使這個字段不進行分析：字段已被刪除（已刪除的字段還存在于系統表中，只是作了標記）；用戶指定了字段。 * 獲取數據類型，并決定針對該類型的采樣數據量和統計函數不同的類型，其分析函數也不同，比如array_typanalyze。如果該類型沒有對應的分析函數，則采用標準的分析函數std_typanalyze。以標準分析函數為例，其確定了兩個地方：采樣后用于統計的函數（compute_scalar_stats或compute_minimal_stats，和采樣的記錄數（現在默認是300 * 100）。 * 索引索引在PG中，是以與表類似的方式存在的。當analyze沒有指定字段，或者是繼承表的時候，也會對索引進行統計信息的計算。以AccessShareLock打開該表上所有的鎖，同樣的檢查索引的每個字段是否需要統計、如何統計等。 * 采樣選擇表所有字段所需采樣數據量的最大值作為最終采樣的數據量。當前PG采取的[兩階段采樣的算法](http://www.postgresql.org/message-id/1uon60lr3jjndh4o8i9cagd62tead9b0t6@email.aon.at)： * 先獲取所需數據量的文件塊 * 遍歷這些塊，根據Vitter算法，選擇出所需數據量的記錄時以頁為單位，盡量讀取該頁中所有的完整記錄，以減少IO；按照物理存儲的位置排序，后續會用于計算相關性（correlation）。這里的采樣并不會處理事務中的記錄，如正在插入或刪除的記錄。但如果刪除或插入操作是在當前analyze所在的事務執行的，那么插入的是被記為live_tuples并且加入統計的；刪除的會被記為dead_tuples而不加入統計。由此會可能產生兩個問題： * 當有另外一個連接正好也在進行統計的時候，自然會產生不同的統計值，且后來者會直接覆蓋前者。當統計期間有較多的事務在執行，且很快結束，那么結果與實際情況可能有點差別。 * 當有超長的事務出現，當事務結束時，統計信息與實際情況可能有較大的差距。以上兩種情況，重復執行analyze即可。但有可能因統計信息不準確導致的執行計劃異常而造成短時間的性能波動，需要注意！這里也說明了長事務的部分危害。 * 統計、計算在獲取到相應樣本數據后，針對每個字段分別進行分析。首先會依據當前字段的值，對記錄進行排序。因在取出樣本數據的時候，按照tuple在磁盤中的位置順序取出的，因此對值進行排序后即可計算得出相關性。另外，在排序后，也更容易計算統計值的頻率，從而得出MCV和MCF。這里采用的快速排序！之后，會根據每個值進行分析： * 如果是NULL，則計數 NULL概率的計算公式是：stanullfrac = null_number / sample_row_number。 * 如果是變長字段，如text等，則需要計算平均寬度 * 計算出現最多的值，和相應頻率 * 基數的計算該部分計算稍微復雜一些，分為以下三種情況： 1. 當采樣中的值沒有重復的時候，則認為所有的值唯一，stadistinct = -1。 2. 當采樣中的每個值都出現重復的時候，則認為基數有限，則stadistinct = distinct_value_number 3. 當采樣中的值中，存在有唯一值并且存在不唯一值的時候，則依據以下的公式（by Haas and Stokes in IBM Research Report RJ 10025）： ~~~ n * d / (n - f1 + f1 * n/N) ~~~ 其中，N是指所有的記錄數，即pg_class.reltuples；n是指sample_row_number，即采樣的記錄數；f1則是只出現一次的值的數據；d則是采樣中所有的值的數量。 * MCV / MCF 并不是所有采樣的值都會被列入MCV/MCF。首先是如果可以，則將所有采樣的記錄放到MCV中，如表所有的記錄都已經取作采樣的時候；其次，則是選取那些出現頻率超過平均值的值，事實上是超過平均值的25%；那些出現頻率大于直方圖的個數的倒數的時候等。 * 直方圖計算直方圖，會首先排除掉MCV中的值。意思是直方圖中的數據不包含MCV/MCF的部分，兩者的值是補充關系而且不會重合，但不一定互補（兩種加起來未必是全部數據）。這個也與成本的計算方式有關系，[請參考row-estimation-examples](http://www.postgresql.org/docs/9.1/static/row-estimation-examples.html)?。其計算公式相對比較簡單，如下： values[(i * (nvals - 1)) / (num_hist - 1)] i指直方圖中的第幾列；nvals指當前還有多少個值；num_hist則指直方圖中還有多少列。計算完成后，kind的值會被置為2。到此，采樣的統計基本結束。完成采樣的計算后，通過內部函數更新相關的統計信息到pg_statistic，更新relpages和totale rows到pg_class中。即完成了一次統計信息的收集。