Window Functions (OVER Clause)—Help Make a Difference

Itzik talks about window functions, explains how powerful they are, and why it’s so important to encourage Microsoft to enhance support for those in SQL Server 11.

ITPro Today

September 17, 2010

24 Min Read
ITPro Today logo in a gray background | ITPro Today

If I had to name one concept in standard SQL that I thought was the most important one, and that is worth Microsoft’s investment for future versions of SQL Server, I’d say window functions, hands down, without a doubt. Window functions are a subset of what the standard calls set functions, meaning, functions that are applied to a set of rows. The term window is used to describe the set of rows that the function operates on, and the language provides a clause called OVER where you provide the window specification. So what’s the big deal, and what makes window functions more important than other features that are missing in SQL Server? There are so many reasons… But first I’ll give a bit more background about window functions, and then I’ll get to the reasons and demonstrate use cases…

First, to clarify, SQL Server 2005 already introduced some support for window functions—the ranking calculations: ROW_NUMBER, RANK, DENSE_RANK and NTILE, and partial support for window aggregate functions with only the partitioning part implemented. SQL Server 2005 was a great release for developers with so many cool and practical T-SQL features. The number of solutions that I simplified and optimized just with the ROW_NUMBER function and CTEs is amazing. Still, there are many standard features related to window functions that SQL Server didn’t yet implement (as of SQL Server 2008 R2) and that can help address quite a wide variety of business problems with simpler and more efficient solutions.

These days the next major release of Microsoft SQL Server—version 11—is being developed. These are pivotal days for candidate features where decisions are made whether they will or will not make it to the final release. And even though I think that more complete support for window functions is so important to developers and to the success of SQL Server, I’m not sure at all that we will see those in the product. This is time for us as part of the SQL Server community to express our strong opinion. Hopefully Microsoft will realize how important it is for us to have those features in the product, as well as to show that the SQL Server community’s opinion matters.

In this article I will explain some of the key features that are missing in SQL Server and why it’s important to add support for such features. If you share my opinion, and haven’t done so already, you can cast your vote in the following feature request items:

Ordering for aggregates (used to allow subsequent framing options):
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254387

Framing (ROWS and RANGE window sub-clauses):
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254392

DISTINCT clause for aggregate functions:
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254393

LAG and LEAD offset functions:
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254388

PERCENT_RANK and CUME_DIST Distribution Functions
https://connect.microsoft.com/SQLServer/feedback/details/600484

FIRST_VALUE, LAST_VALUE offset functions:
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254395

Reuse of window definition using the WINDOW clause:
https://connect.microsoft.com/SQLServer/feedback/details/600499

QUALIFY filtering clause:
https://connect.microsoft.com/SQLServer/feedback/details/532474

Like with any thing in life that you’re not aware of, you don’t know how it can help you if you don’t know that it exists. My feeling is that many developers are not really aware of the capabilities of the standard window functions and therefore Microsoft doesn’t see a lot of demand for it. Education and raising the topic to people’s awareness is therefore key to the realization of the benefits, and as a consequence, encourage people to ask Microsoft for more support.

The unfortunate part is that all of SQL Server’s leading competitors; including Oracle, DB2 and Teradata for some time now already have a far more complete support for window functions. So even though my focus and passion is for SQL Server, I sometimes find myself in the awkward situation of demoing standard SQL window functions on Oracle when teaching or presenting.

So what’s missing…

The most important missing features are probably ordering and framing options for window aggregate functions. Other key features that are still missing are distribution and offset functions, and reusability of window definitions. More details shortly.

Why are window functions so powerful?

SQL is often referred to as a set-based language. The reason is that the language is based on the relational model, which in turn is based, in part, on mathematical set theory. When writing SQL queries you’re supposed to deal with a table (or relation, which is a set) as a whole, as opposed to the table’s individual rows. Also, since sets have no order, you’re not supposed to make any assumptions in regards to the physical ordering of the data.

The reality is that for many developers set-based thinking is far from being intuitive, and it can take a few good years to truly think in SQL terms. This is why often developers tend to use cursors—because using those feel like an extension to what they already know. Cursors allow you to deal with one row at a time, and also rely on specified order of the data.

Window functions have an ingenious design. They do operate on sets, or windows, while allowing you to indicate ordering as part of the calculation where relevant. Not to confuse with cursors, window functions allow defining ordering for the calculation without making any expectations in regards to ordering of the input data given to the query or the output coming out of the query. In other words, no relational concepts are violated. Ordering is only part of the specification of the calculation. Similarly, other common elements in querying problems, like partitioning, framing of applicable rows, are all intuitive parts of the window specification. So in a sense, I see window functions as bridging the big gap that exists between cursor/iterative and set-based thinking.

Now, that’s a lot of words before showing even one example. So let’s look at a few more concrete examples of some of the missing features…

Sample Data

Most of the examples I’ll show are against a database called InsideTSQL2008. You can find the script creating it here: http://www.InsideTSQL.com/books/source_code/InsideTSQL2008.zip. In addition, the following view will be used in some of the examples:

SET NOCOUNT ON;
USE InsideTSQL2008;
GO

IF OBJECT_ID('Sales.EmpOrders', 'V') IS NOT NULL
  DROP VIEW Sales.EmpOrders;
GO

CREATE VIEW Sales.EmpOrders 
  WITH SCHEMABINDING 
AS 

SELECT 
  O.empid,
  DATEADD(month, DATEDIFF(month, 0, O.orderdate), 0) AS ordermonth, 
  SUM(OD.qty) AS qty,
  CAST(SUM(OD.qty * OD.unitprice * (1 - discount)) 
       AS NUMERIC(12, 2)) AS val,
  COUNT(*) AS numorders 
FROM Sales.Orders AS O 
  JOIN Sales.OrderDetails AS OD 
    ON OD.orderid = O.orderid 
GROUP BY empid, DATEADD(month, DATEDIFF(month, 0, O.orderdate), 0);
GO

Ordering and Framing for Window Aggregate Functions

As mentioned, currently window aggregate functions support only a partitioning element. What’s missing are ordering and framing options. The standard supports an ORDER BY clause to define ordering in the window and ROWS and RANGE clauses that frame the window based on the defined ordering. A classic example that would benefit from ordering and framing is running totals. Consider the following Accounts table definition:

CREATE TABLE dbo.Accounts
(
  actid  INT   NOT NULL,                -- partitioning column
  tranid INT   NOT NULL,                -- ordering column
  val    MONEY NOT NULL                 -- measure
  CONSTRAINT PK_Accounts PRIMARY KEY(actid, tranid)
);

The table represents deposit (positive value) and withdrawal (negative value) transactions in bank accounts. You need to calculate at each point what the account balance was. Like with many querying problems there’s a partitioning element (actid), ordering element (tranid), and a measure that the calculation applies to (val). Window aggregate functions in standard SQL support all three elements. Here’s how you would express the query calculating the balance at each point for each account:

SELECT actid, tranid, val,
  SUM(val) OVER(PARTITION BY actid
                ORDER BY tranid
                ROWS BETWEEN UNBOUNDED PRECEDING
                         AND CURRENT ROW) AS balance
FROM dbo.Accounts;

You can achieve such calculations today in SQL Server using a subquery or a join:

-- Set-Based Solution Using Subqueries
SELECT actid, tranid, val,
  (SELECT SUM(S2.val)
   FROM dbo.Accounts AS S2
   WHERE S2.actid = S1.actid
     AND S2.tranid <= S1.tranid) AS balance
FROM dbo.Accounts AS S1;

-- Set-Based Solution Using Joins
SELECT S1.actid, S1.tranid, S1.val,
  SUM(S2.val) AS balance
FROM dbo.Accounts AS S1
  JOIN dbo.Accounts AS S2
    ON S2.actid = S1.actid
   AND S2.tranid <= S1.tranid
GROUP BY S1.actid, S1.tranid, S1.val;

But besides the fact that these solutions are not as straightforward and intuitive as the one using a window function, there’s a big problem with the way SQL Server currently optimizes the subquery and join solutions. Assuming you defined a covering index on the partitioning column, followed by the ordering column, and including the aggregated measure, for each row SQL Server will scan all rows with the same partitioning value and an ordering value that is less than or equal to the current. Given p partitions with r rows in average, and fairly even distribution of rows in partitions, the total number of rows processed in such a plan is pr + p(r + r^2)/2. This means that in respect to the partition size, the algorithmic complexity, or scaling, of the solution s quadratic (N^2). That’s bad. The window function form lends itself to good optimization, especially with the fast track case like the above (rows between unbounded preceding and current row). It should be straightforward to the optimizer to optimize this query with one ordered scan of the index, translating to simply pr rows being scanned.

Another example for running totals is querying a table called EmpOrders with a row for each employee and month, and calculating the cumulative performance for each employee and month; in other words, the total value for the employee from the beginning of his/her activity until the current month. Here’s how you would express it with a window aggregate:

SELECT empid, ordermonth, qty,
  SUM(qty) OVER(PARTITION BY empid
                ORDER BY ordermonth
                ROWS BETWEEN UNBOUNDED PRECEDING
                         AND CURRENT ROW) AS run_qty
FROM Sales.EmpOrders;

empid       ordermonth              qty         run_qty
----------- ----------------------- ----------- -----------
1           2006-07-01 00:00:00.000 121         121
1           2006-08-01 00:00:00.000 247         368
1           2006-09-01 00:00:00.000 255         623
1           2006-10-01 00:00:00.000 143         766
1           2006-11-01 00:00:00.000 318         1084
...
2           2006-07-01 00:00:00.000 50          50
2           2006-08-01 00:00:00.000 94          144
2           2006-09-01 00:00:00.000 137         281
2           2006-10-01 00:00:00.000 248         529
2           2006-11-01 00:00:00.000 237         766
...

There are many business examples where ordering and framing options can be useful besides calculating account balances. Those include inventory, running totals for reporting, moving averages, and so on. Here’s an example for a query calculating the average of the last three recorded periods:

SELECT empid, ordermonth,
  AVG(qty) OVER(PARTITION BY empid
                ORDER BY ordermonth
                ROWS BETWEEN 2 PRECEDING
                         AND CURRENT ROW) AS avg_last_three
FROM Sales.EmpOrders;

There are also various temporal querying problems where running totals serve part of the solution.

For simplicity I showed examples where framing is based on the ROWS clause where you indicate an offset in terms of number of rows. The standard also supports a RANGE clause that allows indicating an offset in terms of values, such as time intervals, as in the following example returning the average of the last three months:

SELECT empid, ordermonth, qty,
  SUM(qty) OVER(PARTITION BY empid
                ORDER BY ordermonth
                RANGE INTERVAL '2' MONTH PRECEDING) AS sum_3m_qty
FROM Sales.EmpOrders
ORDER BY empid, ordermonth;

Offset Functions

The SQL standard defines several offset functions that would make developers’ life so much easier compared to the tools available today for similar needs. Among the missing offset functions are LAG and LEAD, returning a value from a row in a given offset from the current row based on specified ordering. For example, the following query will return, for each current order, also the order date of the previous and next orders:

SELECT custid, orderdate, orderid,
  LAG(orderdate)  OVER(PARTITION BY custid
                       ORDER BY orderdate, orderid) AS prvod,
  LEAD(orderdate) OVER(PARTITION BY custid
                       ORDER BY orderdate, orderid) AS nxtod
FROM Sales.Orders;

custid  orderdate   orderid  prvod       nxtod
------- ----------- -------- ----------- -----------
1       2007-08-25  10643    NULL        2007-10-03
1       2007-10-03  10692    2007-08-25  2007-10-13
1       2007-10-13  10702    2007-10-03  2008-01-15
1       2008-01-15  10835    2007-10-13  2008-03-16
1       2008-03-16  10952    2008-01-15  2008-04-09
1       2008-04-09  11011    2008-03-16  NULL
2       2006-09-18  10308    NULL        2007-08-08
2       2007-08-08  10625    2006-09-18  2007-11-28
2       2007-11-28  10759    2007-08-08  2008-03-04
2       2008-03-04  10926    2007-11-28  NULL
...

Notice how elegant and intuitive this form is. The default offset is one row, but you can also be explicit if you need an offset that is other than one row, e.g., three rows:

SELECT custid, orderdate, orderid,
  LAG(orderdate, 3) OVER(PARTITION BY custid
                         ORDER BY orderdate, orderid) AS prv3od
FROM Sales.Orders;

There are lots of business examples for the usefulness of these functions, like recency calculations, trend analysis, and others. Here’s an example for a query addressing recency calculations, returning the difference in terms of days between the current and previous orders:

SELECT custid, orderdate, orderid,
  DATEDIFF(day,
    LAG(orderdate) OVER(PARTITION BY custid
                        ORDER BY orderdate, orderid),
    orderdate) AS diff
FROM Sales.Orders;

Other missing offset functions are FIRST_VALUE, LAST_VALUE, returning the value from the first or last rows in the partition based on specified ordering. Here’s an example returning the value of the first and last orders per customer with each order:

-- FIRST_VALUE, LAST_VALUE
SELECT custid, orderdate, orderid, val,
  FIRST_VALUE(val) OVER(PARTITION BY custid
                        ORDER BY orderdate, orderid) AS val_firstorder,
  LAST_VALUE(val)  OVER(PARTITION BY custid
                        ORDER BY orderdate, ordered
                        ROWS BETWEEN UNBOUNDED PRECEDING
                                 AND UNBOUNDED FOLLOWING) AS val_lastorder
FROM Sales.OrderValues;

custid  orderdate   orderid  val     val_firstorder  val_lastorder
------- ----------- -------- ------- --------------- --------------
1       2007-08-25  10643    814.50  814.50          933.50
1       2007-10-03  10692    878.00  814.50          933.50
1       2007-10-13  10702    330.00  814.50          933.50
1       2008-01-15  10835    845.80  814.50          933.50
1       2008-03-16  10952    471.20  814.50          933.50
1       2008-04-09  11011    933.50  814.50          933.50
2       2006-09-18  10308    88.80   88.80           514.40
...

And here’s an example calculating the difference between the current order value and the first and last:

SELECT custid, orderdate, orderid, val,
  val - FIRST_VALUE(val) OVER(PARTITION BY custid
                              ORDER BY orderdate, orderid) AS diff_first,
  val - LAST_VALUE(val)  OVER(PARTITION BY custid
                              ORDER BY orderdate, ordered
                              ROWS BETWEEN UNBOUNDED PRECEDING
                                       AND UNBOUNDED FOLLOWING) AS diff_last
FROM Sales.OrderValues;

Distribution Functions

Standard SQL supports window distribution functions that performing statistical calculations. Specifically it supports the PERCENT_RANK and CUM_DIST functions, calculating a percentile rank and cumulative distribution. These functions give you a relative rank of a row in respect to other rows in the window partition, expressed as ratio/percent. The specific formulas used by the two variants are:

PERCENT_RANK: (RK-1)/(NR-1), where RK = rank, NR = number of rows in partition

CUME_DIST: NP/NR, where NP = number of rows preceding or peer with current row (same as next rank - 1)

Here’s an example using these functions:

SELECT custid, COUNT(*) AS numorders,
  PERCENT_RANK() OVER(ORDER BY COUNT(*)) AS percentrank,
  CUME_DIST() OVER(ORDER BY COUNT(*)) AS cumedist
FROM Sales.Orders
GROUP BY custid;

custid  numorders  percentrank  cumedist
------- ---------- ------------ ---------
13      1          0.0000       0.0112
33      2          0.0114       0.0337
43      2          0.0114       0.0337
42      3          0.0341       0.1124
53      3          0.0341       0.1124
...
37      19         0.9545       0.9663
24      19         0.9545       0.9663
63      28         0.9773       0.9775
20      30         0.9886       0.9888
71      31         1.0000       1.0000

Reuse of Window Definition using WINDOW Clause

Suppose you need to write several window functions that rely on the same window definition (or part of it). You will end up with a lot of repetition of code. Standard SQL has a clause called WINDOW that allows naming a window definition or part of it, making it reusable. For example, instead of:

SELECT empid, ordermonth, qty,
  SUM(qty) OVER ( PARTITION BY empid
                   ORDER BY ordermonth
                   ROWS BETWEEN UNBOUNDED PRECEDING
                           AND CURRENT ROW ) AS run_sum_qty,
  AVG(qty) OVER ( PARTITION BY empid
                   ORDER BY ordermonth
                   ROWS BETWEEN UNBOUNDED PRECEDING
                           AND CURRENT ROW ) AS run_avg_qty,
FROM Sales.EmpOrders;

You would use:

SELECT empid, ordermonth, qty,
  SUM(qty) OVER W1 AS run_sum_qty,
  AVG(qty) OVER W1 AS run_avg_qty,
FROM Sales.EmpOrders
WINDOW W1 AS ( PARTITION BY empid
               ORDER BY ordermonth
               ROWS BETWEEN UNBOUNDED PRECEDING
                          AND CURRENT ROW );

As you can see, with the WINDOW clause the code is shorter, more readable, and easier to maintain.

Conclusion

I showed just part of the standard support for window functions that SQL Server is still missing. There’s more, like window frame exclusion. There are also other set functions not implemented, like ordered set functions, and so on. But here I wanted to make a point in hope that Microsoft will realize how important it is to add such support in SQL Server 11. If you feel so as well, help make a difference by voting for the items, write about the topic, talk about it, increasing people’s awareness. Hopefully this request will find open ears. As a reminder, here are the open items for some of the requests for enhancements:

Ordering for aggregates (used to allow subsequent framing options):
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254387

Framing (ROWS and RANGE window sub-clauses):
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254392

DISTINCT clause for aggregate functions:
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254393

LAG and LEAD offset functions:
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254388

PERCENT_RANK and CUME_DIST Distribution Functions
https://connect.microsoft.com/SQLServer/feedback/details/600484

FIRST_VALUE, LAST_VALUE offset functions:
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254395

Reuse of window definition using the WINDOW clause:
https://connect.microsoft.com/SQLServer/feedback/details/600499

QUALIFY filtering clause:
https://connect.microsoft.com/SQLServer/feedback/details/532474

Cheers,

Itzik

 

Sign up for the ITPro Today newsletter
Stay on top of the IT universe with commentary, news analysis, how-to's, and tips delivered to your inbox daily.

You May Also Like