PgSQL · 源碼分析 · PG優化器淺析 · 數據庫內核月報

在使用PostgreSQL數據庫過程中，對SQL調優最常用的手段是使用explain查看執行計劃，很多時候我們只關注了執行計劃的結果而未深入了解執行計劃是如何生成的。優化器作為數據庫核心功能之一，也是數據庫的“大腦”，理解優化器將有助于我們更好地優化SQL，下面將會為大家解開PostgreSQL優化器神秘的面紗。 ## SQL執行過程 ![screenshot](http://ata2-img.cn-hangzhou.img-pub.aliyun-inc.com/aa032da191ddcfd61b94e0dd6d39e59f) 在PG數據庫中，對于DDL語句無需進行優化，到utility模塊處理，對于DML語句需要到優化器中處理，一個用戶連接從接收SQL到執行的流程如下: ![screenshot](http://ata2-img.cn-hangzhou.img-pub.aliyun-inc.com/1e81ee7e9f6da232c293c0f02686c799) ## 查詢重寫主要目的是為了消除view、rule等，如下示例，視圖v_t_1_2在執行計劃里面已經被t1、t2替換。 ~~~ create view v_t_1_2 as SELECT t1.a1, t1.b1, t2.a2, t2.b2 FROM t1, t2; postgres=> explain select * from v_t_1_2, t1 where v_t_1_2.a1 = 10 and t1.b1 = 20; QUERY PLAN ------------------------------------------------------------------------------------- Nested Loop (cost=0.55..41.59 rows=1000 width=24) -> Nested Loop (cost=0.55..16.60 rows=1 width=16) -> Index Scan using t1_a1_key on t1 t1_1 (cost=0.28..8.29 rows=1 width=8) Index Cond: (a1 = 10) -> Index Scan using b1_1 on t1 (cost=0.28..8.29 rows=1 width=8) Index Cond: (b1 = 20) -> Seq Scan on t2 (cost=0.00..15.00 rows=1000 width=8) (7 rows) ~~~ ## 提升子鏈目標是將IN和exists子句遞歸提升。 select * from t1 where t1.a1 in (select t2.a2 from t2 where t2.b2 = 10); 假設t2.a2為unique 轉化為: select t1.a1,t1,a2 from t1 join t2 where t1.a1=t2.a2 and t2.b2 = 10; in子鏈接執行計劃如下： ~~~ postgres=> explain select * from t1 where t1.a1 in (select t2.a2 from t2 where t2.b2 = 10); QUERY PLAN -------------------------------------------------------------------------- Nested Loop (cost=0.28..25.80 rows=1 width=8) -> Seq Scan on t2 (cost=0.00..17.50 rows=1 width=4) Filter: (b2 = 10) -> Index Scan using t1_a1_key on t1 (cost=0.28..8.29 rows=1 width=8) Index Cond: (a1 = t2.a2) ~~~ explain select * from t1 where exists (select t2.a2 from t2 where t2.a2 = t1.a1) ; 假設t2.a2為unique 轉化為: select t1.a1, t1.b1 from t1, t2 where t1.a1=t2.a1; exists子鏈接執行計劃如下： ~~~ postgres=> explain select * from t1 where exists (select t2.a2 from t2 where t2.a2 = t1.a1) ; QUERY PLAN ----------------------------------------------------------------- Hash Join (cost=26.42..54.69 rows=952 width=8) Hash Cond: (t2.a2 = t1.a1) -> Seq Scan on t2 (cost=0.00..15.00 rows=1000 width=4) -> Hash (cost=14.52..14.52 rows=952 width=8) -> Seq Scan on t1 (cost=0.00..14.52 rows=952 width=8) (5 rows) ~~~ ## 提升子查詢子查詢和子鏈接區別：子查詢不在表達式中子句，子鏈接在in/exists表達式中的子句。 select * from t1, (select * from t2) as c where t1.a1 = c.a2; 轉化為： select * from t1, t2 where t1.a1 = t2.a2; ~~~ postgres=> explain select * from t1, (select * from t2) as c where t1.a1 = c.a2; QUERY PLAN ----------------------------------------------------------------- Hash Join (cost=26.42..54.69 rows=952 width=16) Hash Cond: (t2.a2 = t1.a1) -> Seq Scan on t2 (cost=0.00..15.00 rows=1000 width=8) -> Hash (cost=14.52..14.52 rows=952 width=8) -> Seq Scan on t1 (cost=0.00..14.52 rows=952 width=8) (5 rows) ~~~ 并不是所有的子查詢都能提升，含有集合操作、聚合操作、sort/limit/with/group、易失函數、from為空等是不支持提升的。如下： ~~~ postgres=> explain select t1.a1 from t1, (select a2 from t2 limit 1) as c where c.a2 = 10; QUERY PLAN ------------------------------------------------------------------------ Nested Loop (cost=0.00..24.07 rows=952 width=4) -> Subquery Scan on c (cost=0.00..0.03 rows=1 width=0) Filter: (c.a2 = 10) -> Limit (cost=0.00..0.01 rows=1 width=4) -> Seq Scan on t2 (cost=0.00..15.00 rows=1000 width=4) -> Seq Scan on t1 (cost=0.00..14.52 rows=952 width=4) (6 rows) ~~~ ## 化簡條件包含邏輯推理、表達式計算等 ![screenshot](http://ata2-img.cn-hangzhou.img-pub.aliyun-inc.com/08b0ff53f11c1feeed103b4cc3d20751) ## 外連接消除(left/right/full join) 以left join為例，left join(左連接) 返回包括左表中的所有記錄和右表中連接字段相等的記錄，如果右表沒有匹配的記錄，那么右表將會以NULL值代替，例如： ~~~ A表 B表 ID1 ID2 1 1 2 select * from A left join B on A.id1 = B.id2; 結果如下： ID1 ID2 1 1 2 NULL ~~~ 存在外連接left join ~~~ postgres=> explain select * from t1 left join t2 on true; QUERY PLAN ------------------------------------------------------------------- Nested Loop Left Join (cost=0.00..11932.02 rows=952000 width=16) -> Seq Scan on t1 (cost=0.00..14.52 rows=952 width=8) -> Materialize (cost=0.00..20.00 rows=1000 width=8) -> Seq Scan on t2 (cost=0.00..15.00 rows=1000 width=8) (4 rows) ~~~ 消除外連接需要where和join條件保證右表不會有NULL值的行產生。 ~~~ postgres=> explain select * from t1 left join t2 on t1.b1 = t2.b2 where t2.b2 is not NULL; QUERY PLAN --------------------------------------------------------------------- Nested Loop (cost=0.28..23.30 rows=1 width=16) -> Seq Scan on t2 (cost=0.00..15.00 rows=1 width=8) Filter: (b2 IS NOT NULL) -> Index Scan using b1_1 on t1 (cost=0.28..8.29 rows=1 width=8) Index Cond: (b1 = t2.b2) (5 rows) ~~~ ## 條件下推條件下推的目的為了連接前，元組數組盡量少，如下示例，條件已經下推到每個表上面了。 ~~~ postgres=> explain select * from t1,t2 where t1.a1 < 10 and t2.a2 > 900; QUERY PLAN --------------------------------------------------------------------------------- Nested Loop (cost=0.55..31.20 rows=1000 width=16) -> Index Scan using t2_a2_key on t2 (cost=0.28..10.03 rows=100 width=8) Index Cond: (a2 > 900) -> Materialize (cost=0.28..8.70 rows=10 width=8) -> Index Scan using t1_a1_key on t1 (cost=0.28..8.65 rows=10 width=8) Index Cond: (a1 < 10) ~~~ ## 語義優化當表中字段存在約束鍵時，PostgreSQL將會對其進行語義優化，因為查詢條件有可能已經隱含滿足或者不滿足，例如： ~~~ create table tt1(id int not null); postgres=> explain select * from tt1 where id is null; QUERY PLAN -------------------------------------------------------- Seq Scan on tt1 (cost=0.00..15407.02 rows=1 width=15) Filter: (id IS NULL) set constraint_exclusion = on; postgres=> explain select * from tt1 where id is null; QUERY PLAN ------------------------------------------ Result (cost=0.00..0.01 rows=1 width=0) One-Time Filter: false ~~~ 表tt1的id字段已經隱含了不為NULL，所以id=null這種條件可以直接返回false，PostgreSQL數據庫默認并沒有開啟約束優化，需要設置constraint_exclusion這個參數。 ## MIN/MAX優化 min/max函數在應用的使用中是非常廣泛的，數據庫有必要對其進行特殊優化，比如索引中已經將數據排好序了，最大最小值可以直接獲取到，所以PostgreSQL對min/max函數做了一步轉化。 select min(a1) from t1 轉化為 select a1 from t1 order by a1 limit 1; 如果a1沒有索引，那么將會是順序掃描，不進行轉化。 ~~~ postgres=> explain select min(a1) from t1; QUERY PLAN ------------------------------------------------------------------------------------------ Result (cost=0.32..0.33 rows=1 width=0) InitPlan 1 (returns $0) -> Limit (cost=0.28..0.32 rows=1 width=4) -> Index Only Scan using t1_a1_key on t1 (cost=0.28..45.09 rows=952 width=4) Index Cond: (a1 IS NOT NULL) ~~~ ## group by優化如果不對group by優化，那么將會需要對結果進行Sort或者Hash，但是如果表中數據已經是排序好的，那么將可以對其進行優化。 ~~~ create index tt1_id_key on tt1 using btree ( id); postgres=> explain select id from tt1 group by id; QUERY PLAN ------------------------------------------------------------------------------------------- Group (cost=0.42..33891.21 rows=1000102 width=4) Group Key: id -> Index Only Scan using tt1_id_key on tt1 (cost=0.42..31390.96 rows=1000102 width=4) postgres=> explain select name from tt1 group by name; QUERY PLAN -------------------------------------------------------------------------- Group (cost=132169.76..137170.27 rows=1000102 width=11) Group Key: name -> Sort (cost=132169.76..134670.02 rows=1000102 width=11) Sort Key: name -> Seq Scan on tt1 (cost=0.00..15407.02 rows=1000102 width=11) ~~~ ## order by優化 1\. 利用索引消除order by ~~~ postgres=> explain select * from t1 order by a1; QUERY PLAN ----------------------------------------------------------------------- Index Scan using t1_a1_key on t1 (cost=0.28..42.71 rows=952 width=8) (1 row) ~~~ 2\. order by下推，利用merge join實現更快的連接 ~~~ postgres=> explain select * from t1,t2 where t1.b1=t2.b2 order by b1; QUERY PLAN ------------------------------------------------------------------ Merge Join (cost=126.45..136.22 rows=1 width=16) Merge Cond: (t1.b1 = t2.b2) -> Sort (cost=61.62..64.00 rows=952 width=8) Sort Key: t1.b1 -> Seq Scan on t1 (cost=0.00..14.52 rows=952 width=8) -> Sort (cost=64.83..67.33 rows=1000 width=8) Sort Key: t2.b2 -> Seq Scan on t2 (cost=0.00..15.00 rows=1000 width=8) (8 rows) ~~~ ## distinct優化類似于group by優化，distinct將會從Sort和Hash中選擇最優的，如果字段中有索引，Sort代價可能會更低。 ~~~ postgres=> explain select distinct(a1) from t1; QUERY PLAN ----------------------------------------------------------- HashAggregate (cost=16.90..26.42 rows=952 width=4) Group Key: a1 -> Seq Scan on t1 (cost=0.00..14.52 rows=952 width=4) (3 rows) postgres=> explain select distinct(name) from tt1; QUERY PLAN -------------------------------------------------------------------------- Unique (cost=132169.76..137170.27 rows=1000102 width=11) -> Sort (cost=132169.76..134670.02 rows=1000102 width=11) Sort Key: name -> Seq Scan on tt1 (cost=0.00..15407.02 rows=1000102 width=11) ~~~ ## 集合操作優化集合操作union被轉換成Append方式。 ~~~ postgres=> explain select a1 from t1 where a1 < 10 union select a2 from t2; QUERY PLAN -------------------------------------------------------------------------------------- HashAggregate (cost=36.28..46.38 rows=1010 width=4) Group Key: t1.a1 -> Append (cost=0.28..33.75 rows=1010 width=4) -> Index Only Scan using t1_a1_key on t1 (cost=0.28..8.65 rows=10 width=4) Index Cond: (a1 < 10) -> Seq Scan on t2 (cost=0.00..15.00 rows=1000 width=4) postgres=> explain select a1 from t1 where a1 < 10 union all select a2 from t2; QUERY PLAN -------------------------------------------------------------------------------- Append (cost=0.28..23.75 rows=1010 width=4) -> Index Only Scan using t1_a1_key on t1 (cost=0.28..8.65 rows=10 width=4) Index Cond: (a1 < 10) -> Seq Scan on t2 (cost=0.00..15.00 rows=1000 width=4) ~~~ ## 總結以上介紹了幾種常見的PostgreSQL優化器對SQL優化的方法，這些方法更著重于SQL邏輯優化，也就是盡量對SQL進行等價或者推倒變換，以達到更有效率的執行計劃。PostgreSQL優化器原理遠不止這些，比如表的掃描方式選擇、多表組合方式、多表組合順序等，這些內容將會在后續的月報中繼續呈現。