Explain the window function of SQL operation in detail

Absrtact: window function is an extension of aggregation function and a higher-level SQL language operation. It is mainly used to analyze, summarize and sort data in AP scenarios.

This article is shared from Huawei cloud community< GaussDB(DWS) SQL window function for advanced SQL operations >Author: two cups of coffee.

The window function is an extension of the aggregation function and a higher-level SQL language operation. It is mainly used to analyze, summarize and sort data in AP scenarios. This article will start with the basic concept of window function, introduce its common SQL syntax and main application scenarios, and talk about some factors that need to be considered when using window function in GaussDB(DWS).

I Role of window function

Window function, also known as analytics function, is usually used when grouping and summarizing data. Therefore, it is similar to aggregation function. However, unlike the aggregation function, the aggregation function can only output grouping summary results by grouping data, while the original data cannot be displayed in the results. The window function can display the original data and the aggregate analysis results at the same time. Through the following example, you can experience the difference.

Given the table score(id, classid, score), each row represents the student id, class id and exam score. The data is shown in the following figure:

If we want to get the total score, average score and number of students of each class, we can aggregate the classid. The query statement is:

select classid, sum(score), avg(score), count(*) from score group by classid order by classid;

The results are as follows:

Through this result, we know the basic information of classes 1 and 2, but we lose the student information at this time, and we don't know how each student ranks in the class. If you want to query this information, you can, of course, Join the aggregated results with the original table, but it is obviously more cumbersome. Through the statements of window functions, we can easily query the information we need.

select classid, id, score,
sum(score) over(partition by classid),
avg(score) over(partition by classid),
count(*) over(partition by classid),
rank() over(partition by classid order by score desc)
from score
order by classid;

The results are as follows:

Through the above information, we can easily make further queries, for example: query the student id of each class that exceeds the average score, the student id of the top 5, etc.

It can be seen that the window function is actually a combined operation for multiple operations such as query and aggregation. However, compared with multiple operations, using the window function to complete the function is also more concise. At the same time, the window function also provides more functions and more aggregation methods to support diversified functions, and supports the sorting function in grouping. By comparing with the aggregation results, we can easily extract the record information that meets certain statistical requirements.

II Introduction to SQL syntax of window functions

The function of window function is so powerful that the supported syntax must be very complex. Don't worry. Through systematic analysis, you can quickly master the essence of window function syntax!

The main syntax of a single window function expression is:

SUM(SCORE) OVER (PARTITION BY CLASSID ORDER BY SCORE ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)

The expression is mainly composed of the following parts (the figure below is an illustration):

  • Window function expression (purple part): Specifies the aggregation function for calculation of this window function. It can be SUM(), COUNT(), AVG(), MIN(), MAX() and other aggregation functions. It can be found in GaussDB(DWS) through the following statements:
SELECT proname FROM pg_proc WHERE proisagg = TRUE;

It can also be other proprietary window functions, which can be found in the GaussDB(DWS) system table through the following statements:

SELECT proname FROM pg_proc WHERE proiswindow = TRUE;

GaussDB(DWS) currently supports 15 proprietary window functions:

The following describes in detail the common window function row_ Number(), rank(), deny_ The difference between rank(). First, compare the results of the following query.

SELECT id, classid, score,
ROW_NUMBER() OVER(ORDER BY score DESC),
RANK() OVER(ORDER BY score DESC),
DENSE_RANK() OVER(ORDER BY score DESC)
FROM score;

It can be seen that the three functions are used to sort rows, and the number of parameters is 0. The row can be seen from ①_ The difference between number() and RANK(): the former, as the name suggests, numbers the rows from 1. No matter whether the data is repeated or not, the result is not repeated; The latter gives the same sorting results for the same data. From ②, we can see RANK() and dense_ The difference between RANK() and row() is that the former has a number and row after the duplicate value_ Number() is the same. Although the same data numbers are the same, they still occupy multiple number positions; The latter only occupies one number for repeated values, and the repetition is followed by numbering.

In practical application, the first two functions are widely used, ROW_NUMBER() is mainly used for line numbering, paging display and other applications; RANK() is mainly used to display the results after sorting.

  • Window function partition column (red part): indicates which column is used for grouping calculation, similar to the GROUP BY clause in the aggregation statement. This part may not be available, which is similar to the aggregate statement, indicating that all statements are divided into the same group for processing.
  • Window function sorting sequence (green part): indicates the sorting sequence before window function calculation after the array is divided into the same group. Multiple columns can be specified. The syntax is similar to ORDER BY. This clause can be omitted when the evaluation result of the aggregate function is independent of the order.
  • Window function move window option (blue part): this option is also called Window Frame Option, which can be omitted by default. It means that aggregate calculation is performed for all rows in each group (when there is no row sequence) or for the starting row in each group to the current row (when there is row sequence). However, when this option is specified, only tuples within the specified window are aggregated.

For the results of all rows in the group, when we need to specify a window, we need to specify the start row and end row, and the aggregation function will calculate the results of all rows in the window. Therefore, the main syntax of the move window option is:

RANGE|ROWS [BETWEEN] <rows_loc> [AND <rows_loc>]
or
RANGE|ROWS <rows_loc>

The first syntax specifies both the start line and the end line. The second syntax specifies only the start line. The end line is the current line by default.

<rows_ Loc> is used to specify a line, and supports the following five usages:

  • UNBOUNDED PRECEDING

Represents the first row of the group

  • UNBOUNDED FOLLOWING

Represents the last row of the group

  • CURRENT ROW

Represents the current row.

  • <expression> PRECEDING

Indicates the number of rows <expression> from the current row, where <expression> cannot contain variables. The RANGE option is disabled.

  • <expression> FOLLOWING

Indicates the number of rows <expression> counted from the current row, where <expression> cannot contain variables. The RANGE option is disabled.

For example:

RANGE BETWEEN unbounded forecasting AND UNBOUNDED FOLLOWING

RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

ROWS BETWEEN 10 predicting AND 5 FOLLOWING the first 10 rows to the last 5 rows of the current row of the group (cannot exceed the start row AND end row)

Example: the table on the left of the figure below contains column X. the value of the calculated expression value sum (x) over (rows 2 predicting and 2 following) is shown in the table on the right. The number on the middle arrow represents the line number of the start and end windows. For example, the window in line 1 is [1-2, 1+2]=[1, 3] (no more than the starting line); The window in line 4 is [4-2, 4+2]=[2, 6].

After understanding the syntax of single window function expression, let's take a look at the usage specification in SQL statements.

1. A single query can contain one or more window function expressions.

2. Window functions can only appear in output columns. If you need to use the value of a window function for conditional filtering, you need to nest the window function in a subquery and use the alias of the window function expression in the outer layer for conditional filtering. For example:

select classid, id, score
from
(select *, avg(score) over(partition by classid) as avg_score from score)
where score >= avg_score;

3. The query block where the window function is located supports the use of GROUP BY expression for grouping and de duplication, but the PARTITION BY clause in the window function must be a subset of the GROUP BY expression to ensure that the window function performs window operations on the results of GROUP BY column de duplication. At the same time, the expression in the ORDER BY clause also needs to be a subset of the GROUP BY expression or an aggregation function for aggregation operations. For example:

select classid,
rank() over(partition by classid order by sum(score)) as avg_score
from score
group by classid, id;

Several statements in the standard benchmark tpc-ds use window functions. Take TPC-DS Q51 as an example:

WITH web_v1 as (
select
  ws_item_sk item_sk, d_date,
  sum(sum(ws_sales_price))
      over (partition by ws_item_sk order by d_date rows between unbounded preceding and current row) cume_sales
from web_sales
    ,date_dim
where ws_sold_date_sk=d_date_sk
  and d_month_seq between 1212 and 1212+11
  and ws_item_sk is not NULL
group by ws_item_sk, d_date),
store_v1 as (
select
  ss_item_sk item_sk, d_date,
  sum(sum(ss_sales_price))
      over (partition by ss_item_sk order by d_date rows between unbounded preceding and current row) cume_sales
from store_sales
    ,date_dim
where ss_sold_date_sk=d_date_sk
  and d_month_seq between 1212 and 1212+11
  and ss_item_sk is not NULL
group by ss_item_sk, d_date)
 select  *
from (select item_sk
     ,d_date
     ,web_sales
     ,store_sales
     ,max(web_sales)
         over (partition by item_sk order by d_date rows between unbounded preceding and current row) web_cumulative
     ,max(store_sales)
         over (partition by item_sk order by d_date rows between unbounded preceding and current row) store_cumulative
     from (select case when web.item_sk is not null then web.item_sk else store.item_sk end item_sk
                 ,case when web.d_date is not null then web.d_date else store.d_date end d_date
                 ,web.cume_sales web_sales
                 ,store.cume_sales store_sales
           from web_v1 web full outer join store_v1 store on (web.item_sk = store.item_sk
                                                          and web.d_date = store.d_date)
          )x )y
where web_cumulative > store_cumulative
order by item_sk
        ,d_date
limit 100;

Four window functions are used in the above query, and CTE Web_ The definition of V1 is as follows:

select
  ws_item_sk item_sk, d_date,
  sum(sum(ws_sales_price))
      over (partition by ws_item_sk order by d_date rows between unbounded preceding and current row) cume_sales
from web_sales
    ,date_dim
where ws_sold_date_sk=d_date_sk
  and d_month_seq between 1212 and 1212+11
  and ws_item_sk is not NULL
group by ws_item_sk, d_date;

This statement block calculates the cumulative sales of goods sold online within a certain period of time.

The statement block is on the web_sales and date_ After the dim table joins, use WS_ Item_ SK, D_ GROUP BY is performed on the date column to calculate sum(ws_sales_price), then the window function is performed on the result of the aggregation function, and sum(ws_sales_price) is performed on the window function. The window function sum(sum(ws_sales_price)) over (partition by ws_item_sk order by d_date rows between unbounded preceding and current row) means that_ Item_ SK is a group, with D_ Date is the order to calculate the cumulative sales from the beginning to the current day.

III Window clause

Through this statement, careful readers can find that window functions are basically in the same format partition by item_sk order by d_date rows between unbounded preceding and current row. Therefore, the SQL statement supports the use of the window clause, which is similar to the definition of a public grouping and sorting window by the WITH clause. In this way, only the corresponding window name can be referenced. For example, the subject sentence of Q51 can be written as:

select  *
from (select item_sk
     ,d_date
     ,web_sales
     ,store_sales
     ,max(web_sales)
         over (s rows between unbounded preceding and current row) web_cumulative
     ,max(store_sales)
         over (s rows between unbounded preceding and current row) store_cumulative
     from (select case when web.item_sk is not null then web.item_sk else store.item_sk end item_sk
                 ,case when web.d_date is not null then web.d_date else store.d_date end d_date
                 ,web.cume_sales web_sales
                 ,store.cume_sales store_sales
           from web_v1 web full outer join store_v1 store on (web.item_sk = store.item_sk
                                                          and web.d_date = store.d_date)
          )x window s as (partition by item_sk order by d_date))y
where web_cumulative > store_cumulative
order by item_sk
        ,d_date
limit 100;

Using the window clause in GaussDB(DWS) requires the following attention:

  1. Window clauses can only be reused in the same query block statements. Different window clauses need to be defined across query blocks.
  2. Window clauses can only contain window clauses after OVER. Window clauses can include: PARTITION BY clause, ORDER BY clause, and move window option. Window does not support move window option.
  3. When a window function reuses a window clause, it can add an ORDER BY clause and a move window clause on the basis of the window clause, similar to inheritance.

For example, the following example:

select classid, id, score,
sum(score) over(s),
avg(score) over(s),
count(*) over(s),
sum(score) over(s rows between 1 preceding and current row)
from score
window s as (partition by classid order by score, id desc)
order by classid;

The query result is:

IV Notes on using window functions in GaussDB(DWS)

  • Different from the mainstream databases such as Teradata, if the expression after the PARTITION BY of the window function in GaussDB(DWS) is a number, it represents the constant value rather than the sequence number of the output column. This is also different from the syntax of GROUP BY. For example:
select *, rank() over(partition by 1 order by id) as avg_score from score;

All data is grouped here.

  • Since the window function generally needs to sort the data and then calculate the window, the Sort + WindowAgg operator is used in GaussDB(DWS) to realize its function. If there are multiple different window clauses, each window clause corresponds to a pair of Sort + WindowAgg operators. The example statement and plan are as follows:
explain
select classid, id, score,
sum(score) over(s),
avg(score) over(s),
count(*) over(s),
sum(score) over(s rows between 1 preceding and current row)
from score
window s as (partition by classid order by score, id desc)
order by classid;

This query contains two types of window clauses, (partition by classid order by score, id desc) and (partition by classid order by score, id desc rows between 1 preceding and current row). Therefore, two pairs of Sort + WindowAgg operators are required. See operators 2-4 (why is there only one SORT in the plan? Please think about it). Since the partition is made according to the classid, it is redistributed according to it first. See operator 5.

  • For the distributed database GaussDB(DWS), the data needs to be executed in each DN for better performance. For window functions, tuples of the same PARTITION BY value will be executed on the same DN. Therefore, the distributed effect will not be fully realized in scenarios where the PARTITION BY clause is missing or the value of PARTITION BY is small.

(1) When there is no PARTITION BY clause, Gauss(DWS) needs to perform global sorting and window calculation. Therefore, it needs to sort all DNS first, and then summarize the data to one DN for consolidated sorting and window calculation. Finally, the calculated DN will become the bottleneck of the whole system. This scenario needs to be avoided as much as possible. (the semantics of MSUM and other functions supported by Teradata in the early stage are global sorting and local summation. This is the case. Since the window function can achieve the same function, MSUM is only provided for forward compatibility.) The example statement and plan of this scenario are shown in the following figure:

SELECT * from (select id, classid, score, ROW_NUMBER() OVER(ORDER BY score DESC) rn from score) where rn <= 5;

The 4th operator Broadcast broadcasts the sorted data of each DN (the 5th operator) to a dn where row is performed_ Number() is calculated, filtered, and output results.

(2) When the PARTITION BY clause exists, the data will be redistributed according to the column of PARTITION BY, and the data will be distributed to each DN for their own grouping calculation. At this time, if the value of PARTITION BY is small, the data will be redistributed to a few DNS, resulting in data skew. Therefore, if you use the proprietary window function of the sort class, and only the top rows can be sorted, you can add corresponding filter conditions. GaussDB(DWS) will perform double-layer WindowAgg calculation according to the filter conditions. Take the Q67 of TPC-DS as an example:

select  *
from (select i_category
            ,i_class
            ,i_brand
            ,i_product_name
            ,d_year
            ,d_qoy
            ,d_moy
            ,s_store_id
            ,sumsales
            ,rank() over (partition by i_category order by sumsales desc) rk
      from (select i_category
                  ,i_class
                  ,i_brand
                  ,i_product_name
                  ,d_year
                  ,d_qoy
                  ,d_moy
                  ,s_store_id
                  ,sum(coalesce(ss_sales_price*ss_quantity,0)) sumsales
            from store_sales
                ,date_dim
                ,store
                ,item
       where  ss_sold_date_sk=d_date_sk
          and ss_item_sk=i_item_sk
          and ss_store_sk = s_store_sk
          and d_month_seq between 1212 and 1212+11
       group by  rollup(i_category, i_class, i_brand, i_product_name, d_year, d_qoy, d_moy,s_store_id))dw1) dw2
where rk <= 100
order by i_category
        ,i_class
        ,i_brand
        ,i_product_name
        ,d_year
        ,d_qoy
        ,d_moy
        ,s_store_id
        ,sumsales
        ,rk
limit 100;

The query contains the following window functions: rank() over (partition by i_category order by sumsales desc), which uses I_ Category to get the data of the top 100 in each group. In the TPC-DS dataset, I_ Category refers to the type of goods and the quantity is small. Therefore, we adopt a two-tier WindowAgg strategy. First, WindowAgg is performed locally and the top 100 data are filtered out. After the amount of data is reduced, global redistribution and global WindowAgg operations are performed. The plan is shown in the following figure:

Operator 10-11 calculates the sorting and WindowAgg for each DN, and the Result node 9 performs conditional filtering for each DN to obtain the data of the top 100 of each DN, and then the operator of layer 8 calculates the sorting and WindowAgg according to I_ The category column is redistributed so that each DN can obtain I_ Perform global calculation for data with the same category value. Operator 6-7 calculates the global WindowAgg results, and operator 5 filters the results and outputs them.

V summary

As an advanced function of SQL, window function is flexible, complex and difficult to master. However, mastering it provides a broader world for us to write SQL and realize various functions with certain technical content. At the same time, GaussDB(DWS) supports relatively complete window functions and their implementation, which can help readers realize their ideal applications!

To learn more about GuassDB(DWS), you are welcome to search "GaussDB DWS" on wechat and follow the wechat official account to share the latest and complete PB level digital warehouse black technology with you. You can also get a lot of learning materials in the background~

The Huawei partners and Developers Conference 2022 is hot, and the heavy content should not be missed!

[wonderful activities]

Forge ahead and become an all-round developer → 12 live technical broadcasts, 8 high-energy output of technical classics, as well as multiple rounds of mysterious tasks such as code chamber and knowledge contest. Break the barrier and open the ultimate prize! Click to embark on the road of universal developer promotion!

[technical topics]

The future has come, 2022 technology exploration → Huawei's cutting-edge technologies, heavy open source projects and innovative application practices in various fields, standing at the entrance of the intelligent world, exploring how the future can be reflected into reality, Full of dry goods Click to learn

 

Click "follow" to learn about Huawei cloud new technologies at the first time~

Posted by mentorbassment on Tue, 31 May 2022 14:20:13 +0530