Window Functions (OVER Clause)—Help Make a Difference
Itzik talks about window functions, explains how powerful they are, and why it’s so important to encourage Microsoft to enhance support for those in SQL Server 11.
September 17, 2010
If I had to name one concept in standard SQL that I thought was the most important one, and that is worth Microsoft’s investment for future versions of SQL Server, I’d say window functions, hands down, without a doubt. Window functions are a subset of what the standard calls set functions, meaning, functions that are applied to a set of rows. The term window is used to describe the set of rows that the function operates on, and the language provides a clause called OVER where you provide the window specification. So what’s the big deal, and what makes window functions more important than other features that are missing in SQL Server? There are so many reasons… But first I’ll give a bit more background about window functions, and then I’ll get to the reasons and demonstrate use cases…
First, to clarify, SQL Server 2005 already introduced some support for window functions—the ranking calculations: ROW_NUMBER, RANK, DENSE_RANK and NTILE, and partial support for window aggregate functions with only the partitioning part implemented. SQL Server 2005 was a great release for developers with so many cool and practical T-SQL features. The number of solutions that I simplified and optimized just with the ROW_NUMBER function and CTEs is amazing. Still, there are many standard features related to window functions that SQL Server didn’t yet implement (as of SQL Server 2008 R2) and that can help address quite a wide variety of business problems with simpler and more efficient solutions.
These days the next major release of Microsoft SQL Server—version 11—is being developed. These are pivotal days for candidate features where decisions are made whether they will or will not make it to the final release. And even though I think that more complete support for window functions is so important to developers and to the success of SQL Server, I’m not sure at all that we will see those in the product. This is time for us as part of the SQL Server community to express our strong opinion. Hopefully Microsoft will realize how important it is for us to have those features in the product, as well as to show that the SQL Server community’s opinion matters.
In this article I will explain some of the key features that are missing in SQL Server and why it’s important to add support for such features. If you share my opinion, and haven’t done so already, you can cast your vote in the following feature request items:
Ordering for aggregates (used to allow subsequent framing options):
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254387
Framing (ROWS and RANGE window sub-clauses):
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254392
DISTINCT clause for aggregate functions:
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254393
LAG and LEAD offset functions:
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254388
PERCENT_RANK and CUME_DIST Distribution Functions
https://connect.microsoft.com/SQLServer/feedback/details/600484
FIRST_VALUE, LAST_VALUE offset functions:
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254395
Reuse of window definition using the WINDOW clause:
https://connect.microsoft.com/SQLServer/feedback/details/600499
QUALIFY filtering clause:
https://connect.microsoft.com/SQLServer/feedback/details/532474
Like with any thing in life that you’re not aware of, you don’t know how it can help you if you don’t know that it exists. My feeling is that many developers are not really aware of the capabilities of the standard window functions and therefore Microsoft doesn’t see a lot of demand for it. Education and raising the topic to people’s awareness is therefore key to the realization of the benefits, and as a consequence, encourage people to ask Microsoft for more support.
The unfortunate part is that all of SQL Server’s leading competitors; including Oracle, DB2 and Teradata for some time now already have a far more complete support for window functions. So even though my focus and passion is for SQL Server, I sometimes find myself in the awkward situation of demoing standard SQL window functions on Oracle when teaching or presenting.
So what’s missing…
The most important missing features are probably ordering and framing options for window aggregate functions. Other key features that are still missing are distribution and offset functions, and reusability of window definitions. More details shortly.
Why are window functions so powerful?
SQL is often referred to as a set-based language. The reason is that the language is based on the relational model, which in turn is based, in part, on mathematical set theory. When writing SQL queries you’re supposed to deal with a table (or relation, which is a set) as a whole, as opposed to the table’s individual rows. Also, since sets have no order, you’re not supposed to make any assumptions in regards to the physical ordering of the data.
The reality is that for many developers set-based thinking is far from being intuitive, and it can take a few good years to truly think in SQL terms. This is why often developers tend to use cursors—because using those feel like an extension to what they already know. Cursors allow you to deal with one row at a time, and also rely on specified order of the data.
Window functions have an ingenious design. They do operate on sets, or windows, while allowing you to indicate ordering as part of the calculation where relevant. Not to confuse with cursors, window functions allow defining ordering for the calculation without making any expectations in regards to ordering of the input data given to the query or the output coming out of the query. In other words, no relational concepts are violated. Ordering is only part of the specification of the calculation. Similarly, other common elements in querying problems, like partitioning, framing of applicable rows, are all intuitive parts of the window specification. So in a sense, I see window functions as bridging the big gap that exists between cursor/iterative and set-based thinking.
Now, that’s a lot of words before showing even one example. So let’s look at a few more concrete examples of some of the missing features…
Sample Data
Most of the examples I’ll show are against a database called InsideTSQL2008. You can find the script creating it here: http://www.InsideTSQL.com/books/source_code/InsideTSQL2008.zip. In addition, the following view will be used in some of the examples:
SET NOCOUNT ON;
USE InsideTSQL2008;
GO
IF OBJECT_ID('Sales.EmpOrders', 'V') IS NOT NULL
DROP VIEW Sales.EmpOrders;
GO
CREATE VIEW Sales.EmpOrders
WITH SCHEMABINDING
AS
SELECT
O.empid,
DATEADD(month, DATEDIFF(month, 0, O.orderdate), 0) AS ordermonth,
SUM(OD.qty) AS qty,
CAST(SUM(OD.qty * OD.unitprice * (1 - discount))
AS NUMERIC(12, 2)) AS val,
COUNT(*) AS numorders
FROM Sales.Orders AS O
JOIN Sales.OrderDetails AS OD
ON OD.orderid = O.orderid
GROUP BY empid, DATEADD(month, DATEDIFF(month, 0, O.orderdate), 0);
GO
Ordering and Framing for Window Aggregate Functions
As mentioned, currently window aggregate functions support only a partitioning element. What’s missing are ordering and framing options. The standard supports an ORDER BY clause to define ordering in the window and ROWS and RANGE clauses that frame the window based on the defined ordering. A classic example that would benefit from ordering and framing is running totals. Consider the following Accounts table definition:
CREATE TABLE dbo.Accounts
(
actid INT NOT NULL, -- partitioning column
tranid INT NOT NULL, -- ordering column
val MONEY NOT NULL -- measure
CONSTRAINT PK_Accounts PRIMARY KEY(actid, tranid)
);
The table represents deposit (positive value) and withdrawal (negative value) transactions in bank accounts. You need to calculate at each point what the account balance was. Like with many querying problems there’s a partitioning element (actid), ordering element (tranid), and a measure that the calculation applies to (val). Window aggregate functions in standard SQL support all three elements. Here’s how you would express the query calculating the balance at each point for each account:
SELECT actid, tranid, val,
SUM(val) OVER(PARTITION BY actid
ORDER BY tranid
ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW) AS balance
FROM dbo.Accounts;
You can achieve such calculations today in SQL Server using a subquery or a join:
-- Set-Based Solution Using Subqueries
SELECT actid, tranid, val,
(SELECT SUM(S2.val)
FROM dbo.Accounts AS S2
WHERE S2.actid = S1.actid
AND S2.tranid <= S1.tranid) AS balance
FROM dbo.Accounts AS S1;
-- Set-Based Solution Using Joins
SELECT S1.actid, S1.tranid, S1.val,
SUM(S2.val) AS balance
FROM dbo.Accounts AS S1
JOIN dbo.Accounts AS S2
ON S2.actid = S1.actid
AND S2.tranid <= S1.tranid
GROUP BY S1.actid, S1.tranid, S1.val;
But besides the fact that these solutions are not as straightforward and intuitive as the one using a window function, there’s a big problem with the way SQL Server currently optimizes the subquery and join solutions. Assuming you defined a covering index on the partitioning column, followed by the ordering column, and including the aggregated measure, for each row SQL Server will scan all rows with the same partitioning value and an ordering value that is less than or equal to the current. Given p partitions with r rows in average, and fairly even distribution of rows in partitions, the total number of rows processed in such a plan is pr + p(r + r^2)/2. This means that in respect to the partition size, the algorithmic complexity, or scaling, of the solution s quadratic (N^2). That’s bad. The window function form lends itself to good optimization, especially with the fast track case like the above (rows between unbounded preceding and current row). It should be straightforward to the optimizer to optimize this query with one ordered scan of the index, translating to simply pr rows being scanned.
Another example for running totals is querying a table called EmpOrders with a row for each employee and month, and calculating the cumulative performance for each employee and month; in other words, the total value for the employee from the beginning of his/her activity until the current month. Here’s how you would express it with a window aggregate:
SELECT empid, ordermonth, qty,
SUM(qty) OVER(PARTITION BY empid
ORDER BY ordermonth
ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW) AS run_qty
FROM Sales.EmpOrders;
empid ordermonth qty run_qty
----------- ----------------------- ----------- -----------
1 2006-07-01 00:00:00.000 121 121
1 2006-08-01 00:00:00.000 247 368
1 2006-09-01 00:00:00.000 255 623
1 2006-10-01 00:00:00.000 143 766
1 2006-11-01 00:00:00.000 318 1084
...
2 2006-07-01 00:00:00.000 50 50
2 2006-08-01 00:00:00.000 94 144
2 2006-09-01 00:00:00.000 137 281
2 2006-10-01 00:00:00.000 248 529
2 2006-11-01 00:00:00.000 237 766
...
There are many business examples where ordering and framing options can be useful besides calculating account balances. Those include inventory, running totals for reporting, moving averages, and so on. Here’s an example for a query calculating the average of the last three recorded periods:
SELECT empid, ordermonth,
AVG(qty) OVER(PARTITION BY empid
ORDER BY ordermonth
ROWS BETWEEN 2 PRECEDING
AND CURRENT ROW) AS avg_last_three
FROM Sales.EmpOrders;
There are also various temporal querying problems where running totals serve part of the solution.
For simplicity I showed examples where framing is based on the ROWS clause where you indicate an offset in terms of number of rows. The standard also supports a RANGE clause that allows indicating an offset in terms of values, such as time intervals, as in the following example returning the average of the last three months:
SELECT empid, ordermonth, qty,
SUM(qty) OVER(PARTITION BY empid
ORDER BY ordermonth
RANGE INTERVAL '2' MONTH PRECEDING) AS sum_3m_qty
FROM Sales.EmpOrders
ORDER BY empid, ordermonth;
Offset Functions
The SQL standard defines several offset functions that would make developers’ life so much easier compared to the tools available today for similar needs. Among the missing offset functions are LAG and LEAD, returning a value from a row in a given offset from the current row based on specified ordering. For example, the following query will return, for each current order, also the order date of the previous and next orders:
SELECT custid, orderdate, orderid,
LAG(orderdate) OVER(PARTITION BY custid
ORDER BY orderdate, orderid) AS prvod,
LEAD(orderdate) OVER(PARTITION BY custid
ORDER BY orderdate, orderid) AS nxtod
FROM Sales.Orders;
custid orderdate orderid prvod nxtod
------- ----------- -------- ----------- -----------
1 2007-08-25 10643 NULL 2007-10-03
1 2007-10-03 10692 2007-08-25 2007-10-13
1 2007-10-13 10702 2007-10-03 2008-01-15
1 2008-01-15 10835 2007-10-13 2008-03-16
1 2008-03-16 10952 2008-01-15 2008-04-09
1 2008-04-09 11011 2008-03-16 NULL
2 2006-09-18 10308 NULL 2007-08-08
2 2007-08-08 10625 2006-09-18 2007-11-28
2 2007-11-28 10759 2007-08-08 2008-03-04
2 2008-03-04 10926 2007-11-28 NULL
...
Notice how elegant and intuitive this form is. The default offset is one row, but you can also be explicit if you need an offset that is other than one row, e.g., three rows:
SELECT custid, orderdate, orderid,
LAG(orderdate, 3) OVER(PARTITION BY custid
ORDER BY orderdate, orderid) AS prv3od
FROM Sales.Orders;
There are lots of business examples for the usefulness of these functions, like recency calculations, trend analysis, and others. Here’s an example for a query addressing recency calculations, returning the difference in terms of days between the current and previous orders:
SELECT custid, orderdate, orderid,
DATEDIFF(day,
LAG(orderdate) OVER(PARTITION BY custid
ORDER BY orderdate, orderid),
orderdate) AS diff
FROM Sales.Orders;
Other missing offset functions are FIRST_VALUE, LAST_VALUE, returning the value from the first or last rows in the partition based on specified ordering. Here’s an example returning the value of the first and last orders per customer with each order:
-- FIRST_VALUE, LAST_VALUE
SELECT custid, orderdate, orderid, val,
FIRST_VALUE(val) OVER(PARTITION BY custid
ORDER BY orderdate, orderid) AS val_firstorder,
LAST_VALUE(val) OVER(PARTITION BY custid
ORDER BY orderdate, ordered
ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING) AS val_lastorder
FROM Sales.OrderValues;
custid orderdate orderid val val_firstorder val_lastorder
------- ----------- -------- ------- --------------- --------------
1 2007-08-25 10643 814.50 814.50 933.50
1 2007-10-03 10692 878.00 814.50 933.50
1 2007-10-13 10702 330.00 814.50 933.50
1 2008-01-15 10835 845.80 814.50 933.50
1 2008-03-16 10952 471.20 814.50 933.50
1 2008-04-09 11011 933.50 814.50 933.50
2 2006-09-18 10308 88.80 88.80 514.40
...
And here’s an example calculating the difference between the current order value and the first and last:
SELECT custid, orderdate, orderid, val,
val - FIRST_VALUE(val) OVER(PARTITION BY custid
ORDER BY orderdate, orderid) AS diff_first,
val - LAST_VALUE(val) OVER(PARTITION BY custid
ORDER BY orderdate, ordered
ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING) AS diff_last
FROM Sales.OrderValues;
Distribution Functions
Standard SQL supports window distribution functions that performing statistical calculations. Specifically it supports the PERCENT_RANK and CUM_DIST functions, calculating a percentile rank and cumulative distribution. These functions give you a relative rank of a row in respect to other rows in the window partition, expressed as ratio/percent. The specific formulas used by the two variants are:
PERCENT_RANK: (RK-1)/(NR-1), where RK = rank, NR = number of rows in partition
CUME_DIST: NP/NR, where NP = number of rows preceding or peer with current row (same as next rank - 1)
Here’s an example using these functions:
SELECT custid, COUNT(*) AS numorders,
PERCENT_RANK() OVER(ORDER BY COUNT(*)) AS percentrank,
CUME_DIST() OVER(ORDER BY COUNT(*)) AS cumedist
FROM Sales.Orders
GROUP BY custid;
custid numorders percentrank cumedist
------- ---------- ------------ ---------
13 1 0.0000 0.0112
33 2 0.0114 0.0337
43 2 0.0114 0.0337
42 3 0.0341 0.1124
53 3 0.0341 0.1124
...
37 19 0.9545 0.9663
24 19 0.9545 0.9663
63 28 0.9773 0.9775
20 30 0.9886 0.9888
71 31 1.0000 1.0000
Reuse of Window Definition using WINDOW Clause
Suppose you need to write several window functions that rely on the same window definition (or part of it). You will end up with a lot of repetition of code. Standard SQL has a clause called WINDOW that allows naming a window definition or part of it, making it reusable. For example, instead of:
SELECT empid, ordermonth, qty,
SUM(qty) OVER ( PARTITION BY empid
ORDER BY ordermonth
ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW ) AS run_sum_qty,
AVG(qty) OVER ( PARTITION BY empid
ORDER BY ordermonth
ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW ) AS run_avg_qty,
FROM Sales.EmpOrders;
You would use:
SELECT empid, ordermonth, qty,
SUM(qty) OVER W1 AS run_sum_qty,
AVG(qty) OVER W1 AS run_avg_qty,
FROM Sales.EmpOrders
WINDOW W1 AS ( PARTITION BY empid
ORDER BY ordermonth
ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW );
As you can see, with the WINDOW clause the code is shorter, more readable, and easier to maintain.
Conclusion
I showed just part of the standard support for window functions that SQL Server is still missing. There’s more, like window frame exclusion. There are also other set functions not implemented, like ordered set functions, and so on. But here I wanted to make a point in hope that Microsoft will realize how important it is to add such support in SQL Server 11. If you feel so as well, help make a difference by voting for the items, write about the topic, talk about it, increasing people’s awareness. Hopefully this request will find open ears. As a reminder, here are the open items for some of the requests for enhancements:
Ordering for aggregates (used to allow subsequent framing options):
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254387
Framing (ROWS and RANGE window sub-clauses):
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254392
DISTINCT clause for aggregate functions:
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254393
LAG and LEAD offset functions:
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254388
PERCENT_RANK and CUME_DIST Distribution Functions
https://connect.microsoft.com/SQLServer/feedback/details/600484
FIRST_VALUE, LAST_VALUE offset functions:
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254395
Reuse of window definition using the WINDOW clause:
https://connect.microsoft.com/SQLServer/feedback/details/600499
QUALIFY filtering clause:
https://connect.microsoft.com/SQLServer/feedback/details/532474
Cheers,
Itzik
About the Author
You May Also Like