分析函數 · k8s原理與實踐

[TOC] # 分析函數 ## 簡介 **分析函數**（某些數據庫下也叫做**窗口函數**）與聚合函數類似，計算總是基于一組行的集合，不同的是，聚合函數一組只能返回一行，而分析函數每組可以返回多行，組內每一行都是基于窗口的邏輯計算的結果。分析函數可以顯著優化需要 self-join 的查詢。 ### 分析函數語法 “窗口”也稱為 FRAME，OceanBase 數據庫同時支持 ROWS 與 RANGE 兩種 FRAME 語義，前者是基于物理行偏移的窗口，后者則是基于邏輯值偏移的窗口。分析函數語法如下： ~~~ analytic_function: analytic_function([ arguments ]) OVER (analytic_clause) analytic_clause: [ query_partition_clause ] [ order_by_clause [ windowing_clause ] ] query_partition_clause: PARTITION BY { expr[, expr ]... | ( expr[, expr ]... ) } order_by_clause: ORDER [ SIBLINGS ] BY{ expr | position | c_alias } [ ASC | DESC ] [ NULLS FIRST | NULLS LAST ] [, { expr | position | c_alias } [ ASC | DESC ][ NULLS FIRST | NULLS LAST ]]... windowing_clause: { ROWS | RANGE } { BETWEEN { UNBOUNDED PRECEDING | CURRENT ROW | value_expr { PRECEDING | FOLLOWING } } AND{ UNBOUNDED FOLLOWING | CURRENT ROW | value_expr { PRECEDING | FOLLOWING } } | { UNBOUNDED PRECEDING | CURRENT ROW| value_expr PRECEDING}} ~~~ ## SUM/MIN/MAX/COUNT/AVG **聲明** SUM 的語法為：`SUM([ DISTINCT | ALL ] expr) [ OVER (analytic_clause) ]` MIN 的語法為：`MIN([ DISTINCT | ALL ] expr) [ OVER (analytic_clause) ]` MAX 的語法為：`MAX([ DISTINCT | ALL ] expr) [ OVER (analytic_clause) ]` COUNT 的語法為：`COUNT({ * | [ DISTINCT | ALL ] expr }) [ OVER (analytic_clause) ]` AVG 的語法為：`AVG([ DISTINCT | ALL ] expr) [ OVER(analytic_clause) ]` **說明** 以上分析函數都有對應的聚合函數，其中，`SUM`返回`expr`的和，`MIN`/`MAX`返回`expr`的最小值/最大值，`COUNT`返回窗口中查詢的行數，`AVG`返回`expr`的平均值。對于`COUNT`函數，如果指定了`expr`，即返回`expr`不為 NULL 的統計個數，如果指定`COUNT(*)`返回所有行的統計數目。 **例子** ~~~ obclient> create table exployees(last_name char(10), salary decimal, job_id char(32)); Query OK, 0 rows affected (0.17 sec) obclient> insert into exployees values('jim', 2000, 'cleaner'); Query OK, 1 row affected (0.03 sec) obclient> insert into exployees values('mike', 12000, 'engineering'); Query OK, 1 row affected (0.00 sec) obclient> insert into exployees values('lily', 13000, 'engineering'); Query OK, 1 row affected (0.01 sec) obclient> insert into exployees values('tom', 11000, 'engineering'); Query OK, 1 row affected (0.00 sec) obclient>select last_name, sum(salary) over(partition by job_id) totol_s, min(salary) over(partition by job_id) min_s, max(salary) over(partition by job_id) max_s, count(*) over(partition by job_id) count_s from exployees; +-----------+---------+-------+-------+---------+ | last_name | totol_s | min_s | max_s | count_s | +-----------+---------+-------+-------+---------+ | jim | 2000 | 2000 | 2000 | 1 | | mike | 36000 | 11000 | 13000 | 3 | | lily | 36000 | 11000 | 13000 | 3 | | tom | 36000 | 11000 | 13000 | 3 | +-----------+---------+-------+-------+---------+ 4 rows in set (0.01 sec) ~~~ ## NTH\_VALUE/FIRST\_VALUE/LAST\_VALUE **聲明** NTH\_VALUE 的語法為：`NTH_VALUE (measure_expr, n) [ FROM { FIRST | LAST } ] [ { RESPECT | IGNORE } NULLS ] OVER (analytic_clause)` FIRST\_VALUE 的語法為：`FIRST_VALUE { (expr) [ {RESPECT | IGNORE} NULLS ] | (expr [ {RESPECT | IGNORE} NULLS ])} OVER (analytic_clause)` LAST\_VALUE 的語法為：`LAST_VALUE { (expr) [ {RESPECT | IGNORE} NULLS ] | (expr [ {RESPECT | IGNORE} NULLS ])} OVER (analytic_clause)` **說明** NTH\_VALUE 函數表示第幾個值，方向由`[ FROM { FIRST | LAST } ]`確定，默認為`FROM FIRST`，含有是否忽略 NULL 值的標志。其窗口為統一的`analytic_clause`。這里`n`應該是正數，如果`n`是 NULL，函數將返回錯誤；如果`n`大于窗口內所有的行數，此函數將返回 NULL。 FIRST\_VALUE 和 LAST\_VALUE 表示從第一個開始計數或者是從最后一個開始計數。 **例子** ~~~ obclient> create table exployees(last_name char(10), salary decimal, job_id char(32)); Query OK, 0 rows affected (0.08 sec) obclient> insert into exployees values('jim', 2000, 'cleaner'); Query OK, 1 row affected (0.11 sec) obclient> insert into exployees values('mike', 12000, 'engineering'); Query OK, 1 row affected (0.00 sec) obclient> insert into exployees values('lily', 13000, 'engineering'); Query OK, 1 row affected (0.00 sec) obclient> insert into exployees values('tom', 11000, 'engineering'); Query OK, 1 row affected (0.01 sec) obclient> select last_name, first_value(salary) over(partition by job_id) totol_s, last_value(salary) over(partition by job_id) min_s, max(salary) over(partition by job_id) max_s from exployees; +-----------+---------+-------+-------+ | last_name | totol_s | min_s | max_s | +-----------+---------+-------+-------+ | jim | 2000 | 2000 | 2000 | | mike | 12000 | 11000 | 13000 | | lily | 12000 | 11000 | 13000 | | tom | 12000 | 11000 | 13000 | +-----------+---------+-------+-------+ 4 rows in set (0.01 sec) ~~~ ## LEAD/LAG **聲明** LEAD 的語法為：`LEAD { ( value_expr [, offset [, default]]) [ { RESPECT | IGNORE } NULLS ] | ( value_expr [ { RESPECT | IGNORE } NULLS ] [, offset [, default]] )} OVER ([ query_partition_clause ] order_by_clause)` LAG 的語法為：`LAG { ( value_expr [, offset [, default]]) [ { RESPECT | IGNORE } NULLS ] | ( value_expr [ { RESPECT | IGNORE } NULLS ] [, offset [, default]] )} OVER ([ query_partition_clause ] order_by_clause)` **說明** LEAD 和 LAG 含義為可以在一次查詢中取出當前行的同一個字段的前面或后面第 N 行的數據，這種操作可以使用相同表的自連接來實現，但 LEAD/LAG 窗口函數有更高的效率。其中，`value_expr`是要做比對的字段，`offset`是`value_expr`的偏移量，`default`參數的默認值為 NULL，即如果在 LEAD/LAG 沒有顯示的設置`default`值的情況下，返回值為 NULL。例如：對 LAG 來說，當前行為 4，`offset`值為 6，這時候所要找的數據就是第 -2 行，不存在此行即返回`default`的值。 `[ { RESPECT | IGNORE } NULLS ]`的語法為是否考慮 NULL 值，默認為`RESPECT`，考慮 NULL 值。注意 LEAD/LAG 兩個函數后必須有`order_by_clause`，數據應該在一個列上排序之后才能有前多少行后多少行的概念。`query_partition_clause`是可選的，如果沒有`query_partition_clause`，就是全局的數據。 **例子** ~~~ obclient> create table exployees(last_name char(10), salary decimal, job_id char(32)); Query OK, 0 rows affected (0.08 sec) obclient> insert into exployees values('jim', 2000, 'cleaner'); Query OK, 1 row affected (0.11 sec) obclient> insert into exployees values('mike', 12000, 'engineering'); Query OK, 1 row affected (0.01 sec) obclient> insert into exployees values('lily', 13000, 'engineering'); Query OK, 1 row affected (0.00 sec) obclient> insert into exployees values('tom', 11000, 'engineering'); Query OK, 1 row affected (0.00 sec) obclient> select last_name, lead(salary) over(order by salary) lead, lag(salary) over(order by salary) lag from exployees; +-----------+-------+-------+ | last_name | lead | lag | +-----------+-------+-------+ | jim | 11000 | NULL | | tom | 12000 | 2000 | | mike | 13000 | 11000 | | lily | NULL | 12000 | +-----------+-------+-------+ 4 rows in set (0.01 sec) ~~~ ## STDDEV/VARIANCE/STDDEV\_SAMP/STDDEV\_POP **聲明** VARIANCE 的語法為：`VARIANCE([ DISTINCT | ALL ] expr) [ OVER (analytic_clause) ]` STDDEV 的語法為：`STDDEV([ DISTINCT | ALL ] expr) [ OVER (analytic_clause) ]` STDDEV\_SAMP 的語法為：`STDDEV_SAMP(expr) [ OVER (analytic_clause) ]` STDDEV\_POP 的語法為：`STDDEV_POP(expr) [ OVER (analytic_clause) ]` **說明** VARIANCE 返回的是`expr`的方差，`expr`可能是數值類型或者可以轉換成數值類型的類型，方差的類型和輸入的值的類型相同。 STDDEV 返回的是`expr`的標準差，參數類型方面和 VARIANCE 的相同。 STDDEV\_SAMP 返回的是樣本標準差。 STDDEV\_POP 返回的是總體標準差。 **例子** ~~~ obclient> create table exployees(last_name char(10), salary decimal, job_id char(32)); Query OK, 0 rows affected (0.08 sec) obclient> insert into exployees values('jim', 2000, 'cleaner'); Query OK, 1 row affected (0.11 sec) obclient> insert into exployees values('mike', 12000, 'engineering'); Query OK, 1 row affected (0.01 sec) obclient> insert into exployees values('lily', 13000, 'engineering'); Query OK, 1 row affected (0.00 sec) obclient> insert into exployees values('tom', 11000, 'engineering'); Query OK, 1 row affected (0.00 sec) obclient> select last_name, stddev(salary) over(order by salary) std, variance(salary) over(order by salary) var, stddev_pop(salary) over() std_pop, stddev_samp(salary) over() from exployees; +-----------+-------------------+--------------------+-------------------+----------------------------+ | last_name | std | var | std_pop | stddev_samp(salary) over() | +-----------+-------------------+--------------------+-------------------+----------------------------+ | jim | 0 | 0 | 4387.482193696061 | 5066.228051190222 | | tom | 4500 | 20250000 | 4387.482193696061 | 5066.228051190222 | | mike | 4496.912521077347 | 20222222.222222224 | 4387.482193696061 | 5066.228051190222 | | lily | 4387.482193696061 | 19250000 | 4387.482193696061 | 5066.228051190222 | +-----------+-------------------+--------------------+-------------------+----------------------------+ 4 rows in set (0.00 sec) ~~~ ## NTILE **聲明** `NTILE(expr) OVER ([ query_partition_clause ] order_by_clause)` **說明** NTILE 函數將分區中已經排序的行劃分為大小盡可能相同的指定數量的分組，并返回給每行組號。`expr`如果是 NULL，則返回 NULL。例子 ~~~ obclient> create table exployees(last_name char(10), salary decimal, job_id char(32)); Query OK, 0 rows affected (0.08 sec) obclient> insert into exployees values('jim', 2000, 'cleaner'); Query OK, 1 row affected (0.11 sec) obclient> insert into exployees values('mike', 12000, 'engineering'); Query OK, 1 row affected (0.01 sec) obclient> insert into exployees values('lily', 13000, 'engineering'); Query OK, 1 row affected (0.00 sec) obclient> insert into exployees values('tom', 11000, 'engineering'); Query OK, 1 row affected (0.00 sec) obclient> select last_name, ntile(10) over(partition by job_id order by salary) ntl from exployees; +-----------+------+ | last_name | ntl | +-----------+------+ | jim | 1 | | tom | 1 | | mike | 2 | | lily | 3 | +-----------+------+ 4 rows in set (0.01 sec) ~~~ ## ROW\_NUMBER 聲明 `ROW_NUMBER( ) OVER ([ query_partition_clause ] order_by_clause)` 說明 ROW\_NUMBER 函數按照`order_by_clause`子句中指定的行的順序，為每一行分配一個編號。例子 ~~~ obclient> create table exployees(last_name char(10), salary decimal, job_id char(32)); Query OK, 0 rows affected (0.08 sec) obclient> insert into exployees values('jim', 2000, 'cleaner'); Query OK, 1 row affected (0.11 sec) obclient> insert into exployees values('mike', 12000, 'engineering'); Query OK, 1 row affected (0.01 sec) obclient> insert into exployees values('lily', 13000, 'engineering'); Query OK, 1 row affected (0.00 sec) obclient> insert into exployees values('tom', 11000, 'engineering'); Query OK, 1 row affected (0.00 sec) obclient> select last_name, row_number() over(partition by job_id order by salary) ntl from exployees; +-----------+------+ | last_name | ntl | +-----------+------+ | jim | 1 | | tom | 1 | | mike | 2 | | lily | 3 | +-----------+------+ 4 rows in set (0.00 sec) ~~~ ## RANK/DENSE\_RANK/PERCENT\_RANK **聲****明** RANK 的語法為：`RANK( ) OVER ([ query_partition_clause ] order_by_clause)` DENSE\_RANK 的語法為：`DENSE_RANK( ) OVER([ query_partition_clause ] order_by_clause)` PERCENT\_RANK 的語法為：`PERCENT_RANK( ) OVER ([ query_partition_clause ] order_by_clause)` **說明** RANK 計算每一行數據在某列上的排序，該列由`order_by_clause`中的列決定。例如，按照 salary 排序可以看出員工的收入排名。 DENSE\_RANK 的語義基本和 RANK 函數相同，但是 RANK 的排序中間會有‘跳過’，但是 DENSE\_RANK 中不會有。 PERCENT\_RANK 的語義基本和 RANK 函數相同，但是 PERCENT\_RANK 排序的結果是百分比，計算的是給定行的百分比。 **例子** ~~~ obclient> create table exployees(last_name char(10), salary decimal, job_id char(32)); Query OK, 0 rows affected (0.10 sec) obclient> insert into exployees values('jim', 2000, 'cleaner'); Query OK, 1 row affected (0.11 sec) obclient> insert into exployees values('mike', 12000, 'engineering'); Query OK, 1 row affected (0.00 sec) obclient> insert into exployees values('lily', 13000, 'engineering'); Query OK, 1 row affected (0.00 sec) obclient> insert into exployees values('tom', 11000, 'engineering'); Query OK, 1 row affected (0.00 sec) obclient> select last_name, rank() over(partition by job_id order by salary) rank, dense_rank() over(partition by job_id order by salary) dense_rank, percent_rank() over(partition by job_id order by salary) percent_rank from exployees; +-----------+------+------------+----------------------------------+ | last_name | rank | dense_rank | percent_rank | +-----------+------+------------+----------------------------------+ | jim | 1 | 1 | 0.000000000000000000000000000000 | | tom | 1 | 1 | 0.000000000000000000000000000000 | | mike | 2 | 2 | 0.500000000000000000000000000000 | | lily | 3 | 3 | 1.000000000000000000000000000000 | +-----------+------+------------+----------------------------------+ 4 rows in set (0.01 sec) ~~~ ## CUME\_DIST **聲明** `CUME_DIST() OVER ([ query_partition_clause ] order_by_clause)` **說明** 該函數計算一個值的分布，返回值為大于 0 小于等于 1 的值。作為一個分析函數，CUME\_DIST 在升序情況下計算比當前行的特定列小的數據的占比。例如如下例子中，按 job\_id 分組并在薪水排序的情況下，每行數據在窗口內的排序列上的占比。 **例子** ~~~ obclient> create table exployees(last_name char(10), salary decimal, job_id char(32)); Query OK, 0 rows affected (0.10 sec) obclient> insert into exployees values('jim', 2000, 'cleaner'); Query OK, 1 row affected (0.11 sec) obclient> insert into exployees values('mike', 12000, 'engineering'); Query OK, 1 row affected (0.00 sec) obclient> insert into exployees values('lily', 13000, 'engineering'); Query OK, 1 row affected (0.00 sec) obclient> insert into exployees values('tom', 11000, 'engineering'); Query OK, 1 row affected (0.00 sec) obclient> select last_name, cume_dist() over(partition by job_id order by salary) cume_dist from exployees; +-----------+----------------------------------+ | last_name | cume_dist | +-----------+----------------------------------+ | jim | 1.000000000000000000000000000000 | | tom | 0.333333333333333333333333333333 | | mike | 0.666666666666666666666666666667 | | lily | 1.000000000000000000000000000000 | +-----------+----------------------------------+ 4 rows in set (0.01 sec) ~~~