IS NOT DISTINCT FROM
Itzik encourages you to vote for a proposal for a T-SQL enhancement that has to do with NULL treatment.
July 16, 2007
Treatment of NULLs adds a fair amount of complexity to SQL. Here I’d like to point out a specific issue that has to do with NULL treatment and point out a standard SQL language element that addresses this issue and was not yet implemented in SQL Server. At the end of this article I’ll provide a URL where you can vote for adding it to SQL Server.
When you compare columns/expressions, say in the ON or WHERE clauses of a query, predicates follow what’s called the 3-valued-logic. For example:
SELECT ...FROM dbo.T1 JOIN dbo.T2 ON T1.col1 = T2.col1;
The predicate can evaluate to TRUE, FALSE or UNKNOWN. The predicate evaluates toUNKNOWN whenever NULLs are involved. For example (T1.col1 = 5, T2.col1 = NULL),(T1.col1 = NULL, T1.col2 = 10) and even (T1.col1 = NULL, T2.col1 = NULL). SomeT-SQL language elements will handle UNKNOWN differently than others. Forexample, ON, WHERE, HAVING ,IF and CASE “accept TRUE”, and will treat UNKNOWNsomewhat similar to FALSE, but not exactly like FALSE.
For example, in the aboveexample, when T1.col1 = T2.col1 evaluates to UNKNOWN, like with FALSE, this willbe considered a nonmatch. However, the reason that I said that the treatmenthere is not exactly like with FALSE is that if T1.col1 = T2.col1 evaluates toUNKNOWN, NOT T1.col1 = T2.col1 also evaluates to UNKNOWN. Other languageelements like a CHECK constraint “reject FALSE”, and treat UNKNOWN more similarto TRUE, but again, not exactly like TRUE. For example, an expression such asCHECK(col1 = col2) will reject a row where col1 = 5 and col2 = 10, but willaccept col1 = 5 and col2 = NULL. But as I mentioned earlier, if col1 = col2evaluates to UNKNOWN, NOT col1 = col2 still evaluates to UNKNOWN. So in thiscase, when col1 = 5 and col2 = 10 the row will be accepted this time (becauseNOT FALSE is TRUE), but col1 = 5 and col2 = NULL will also be accepted.
T-SQL already provides an answer for cases where you want to explicitly checkwhen the expression is or is not a NULL via the IS NULL and IS NOT NULLoperators. However, there are cases where you want to comparecolumns/expressions from both sides, and use 2-valued-logic; that is, you wantthe predicate to evaluate to either TRUE or FALSE.
For example, suppose that in certain circumstances you want the predicateT1.col1 = T2.col1 to evaluate to TRUE when both sides have known values and areequal, or when both sides are NULL—in other words, you want to treat NULLs justlike you do known values. For now, you have to add some logic such as:
(T1.col1 = T2.col1 OR (T1.col1 IS NULL AND T2.col1 IS NULL))
This makes the expression awkward, and also you would probably cross yourfingers in hope that this will not hurt optimization. Note that this expressionwill “accept” only the cases you want to treat as a match and “reject” allothers, but it still has 3-valued-logic, e.g., T1.col1 = 10, T2.col1 = NULLevaluates to UNKNOWN. This means
that:
NOT(T1.col1 = T2.col1 OR (T1.col1 IS NULL AND T2.col1 IS NULL))
Will evaluate to UNKNOWN in those cases as well and not to TRUE. So if you wantall non matches of the positive predicate to be accepted, you won’t use thepositive predicate and apply NOT on top, rather rewrite your
logic, e.g.,
T1.col1 T2.col1 OR (T1.col1 IS NULL AND T2.col1 IS NOT NULL) OR (T1.col1 IS NOT NULL AND T2.col1 IS NULL)
In short, there are cases where you want predicates to use 2-valued-logic, andfor now, the solutions are not pretty, and we can only hope that they areoptimized well.
SQL Server 2008 introduces new language elements that add scenarioswhere NULLsare involved and you would want predicates to use 2-valued-logic, e.g., groupingsets related features and the MERGE statement.
I will cover those in detail inmy October and November columns. For now, even though you may not be familiarwith grouping sets and the MERGE statement, focus on the ON clause of thefollowing MERGE statement
handling incremental updates of aggregates:
WITH LastDay AS( SELECT GROUPING_ID( custid, empid, YEAR(orderdate), MONTH(orderdate), DAY(orderdate) ) AS grp_id, custid, empid, YEAR(orderdate) AS orderyear, MONTH(orderdate) AS ordermonth, DAY(orderdate) AS orderday, SUM(qty) AS qty FROM dbo.Orders WHERE orderdate = '20080419' GROUP BY CUBE(custid, empid), ROLLUP(YEAR(orderdate), MONTH(orderdate), DAY(orderdate)))MERGE INTO dbo.MyGroupingSets AS TGTUSING LastDay AS SRC ON (TGT.grp_id = SRC.grp_id) AND (TGT.orderyear = SRC.orderyear OR (TGT.orderyear IS NULL AND SRC.orderyear IS NULL)) AND (TGT.ordermonth = SRC.ordermonth OR (TGT.ordermonth IS NULL AND SRC.ordermonth IS NULL)) AND (TGT.orderday = SRC.orderday OR (TGT.orderday IS NULL AND SRC.orderday IS NULL)) AND (TGT.custid = SRC.custid OR (TGT.custid IS NULL AND SRC.custid IS NULL)) AND (TGT.empid = SRC.empid OR (TGT.empid IS NULL AND SRC.empid IS NULL))WHEN MATCHED THEN UPDATE SET TGT.qty += SRC.qtyWHEN NOT MATCHED THEN INSERT (grp_id, orderyear, ordermonth, orderday, custid, empid) VALUES (SRC.grp_id, SRC.orderyear, SRC.ordermonth, SRC.orderday, SRC.custid, SRC.empid);
Notice how awkward it is.
I had a discussion on the subject with fellow MVPs in a private forum. Steve
Kass did some research and discovered that ANSI SQL has an answer to
thisproblem in the form of operators:
IS [NOT] DISTINCT FROM
It does require getting used to, but once you do, it makes perfect sense.
DISTINCT treats NULLs like known values. That is, one NULL is not
distinct fromanother NULL, but a NULL is distinct from known values. 5 is
distinct from 10,but 5 is not distinct from 5.
Now let’s go back to the example I mentioned earlier:
T1.col1 = T2.col1
To have this predicate use 2-valued-logic, and have NULLs treated like
knownvalues, instead of using:
(T1.col1 = T2.col1 OR (T1.col1 IS NULL AND T2.col1 IS NULL))
You would use:
T1.col1 IS NOT DISTINCT FROM T2.col1
This is ultimately simpler than the awkward forms used nowadays and has muchbetter potential for good optimization in terms of utilizing indexesefficiently. As for the inverse; take the following predicate:
T1.col1 T2.col1
In order to treat NULLs like known values, instead of using:
T1.col1 T2.col1 OR (T1.col1 IS NULL AND T2.col1 IS NOT NULL) OR (T1.col1 IS NOT NULL AND T2.col1 IS NULL)
You would use:
T1.col1 IS DISTINCT FROM T2.col1
Besides being simpler and lending itself to better optimization, the IS [NOT]DISTINCT FROM operator give you pure 2-valued-logic, namely, apply NOT on topand you will get the exact inverse.
I find this standard language element very important, and this will become moreevident in SQL Server 2008. Unfortunately, it was not implemented in T-SQL.Steve Kass who researched and discovered it submitted a
suggestion for thisenhancement in connect.microsoft.com. I encourage you to vote for thissuggestionhere;the more votes it gets, the better are the chances you will see it in a futureversion of SQL Server.
Cheers,
--
BG
About the Author
You May Also Like