IS NOT DISTINCT FROM

Itzik encourages you to vote for a proposal for a T-SQL enhancement that has to do with NULL treatment.

July 16, 2007

5 Min Read

Treatment of NULLs adds a fair amount of complexity to SQL. Here I’d like to point out a specific issue that has to do with NULL treatment and point out a standard SQL language element that addresses this issue and was not yet implemented in SQL Server. At the end of this article I’ll provide a URL where you can vote for adding it to SQL Server.

When you compare columns/expressions, say in the ON or WHERE clauses of a query, predicates follow what’s called the 3-valued-logic. For example:

SELECT ...FROM dbo.T1 JOIN dbo.T2  ON T1.col1 = T2.col1;

The predicate can evaluate to TRUE, FALSE or UNKNOWN. The predicate evaluates toUNKNOWN whenever NULLs are involved. For example (T1.col1 = 5, T2.col1 = NULL),(T1.col1 = NULL, T1.col2 = 10) and even (T1.col1 = NULL, T2.col1 = NULL). SomeT-SQL language elements will handle UNKNOWN differently than others. Forexample, ON, WHERE, HAVING ,IF and CASE “accept TRUE”, and will treat UNKNOWNsomewhat similar to FALSE, but not exactly like FALSE.

For example, in the aboveexample, when T1.col1 = T2.col1 evaluates to UNKNOWN, like with FALSE, this willbe considered a nonmatch. However, the reason that I said that the treatmenthere is not exactly like with FALSE is that if T1.col1 = T2.col1 evaluates toUNKNOWN, NOT T1.col1 = T2.col1 also evaluates to UNKNOWN. Other languageelements like a CHECK constraint “reject FALSE”, and treat UNKNOWN more similarto TRUE, but again, not exactly like TRUE. For example, an expression such asCHECK(col1 = col2) will reject a row where col1 = 5 and col2 = 10, but willaccept col1 = 5 and col2 = NULL. But as I mentioned earlier, if col1 = col2evaluates to UNKNOWN, NOT col1 = col2 still evaluates to UNKNOWN. So in thiscase, when col1 = 5 and col2 = 10 the row will be accepted this time (becauseNOT FALSE is TRUE), but col1 = 5 and col2 = NULL will also be accepted.

T-SQL already provides an answer for cases where you want to explicitly checkwhen the expression is or is not a NULL via the IS NULL and IS NOT NULLoperators. However, there are cases where you want to comparecolumns/expressions from both sides, and use 2-valued-logic; that is, you wantthe predicate to evaluate to either TRUE or FALSE.

For example, suppose that in certain circumstances you want the predicateT1.col1 = T2.col1 to evaluate to TRUE when both sides have known values and areequal, or when both sides are NULL—in other words, you want to treat NULLs justlike you do known values. For now, you have to add some logic such as:

(T1.col1 = T2.col1 OR (T1.col1 IS NULL AND T2.col1 IS NULL))

This makes the expression awkward, and also you would probably cross yourfingers in hope that this will not hurt optimization. Note that this expressionwill “accept” only the cases you want to treat as a match and “reject” allothers, but it still has 3-valued-logic, e.g., T1.col1 = 10, T2.col1 = NULLevaluates to UNKNOWN. This means
that:

NOT(T1.col1 = T2.col1 OR (T1.col1 IS NULL AND T2.col1 IS NULL))

Will evaluate to UNKNOWN in those cases as well and not to TRUE. So if you wantall non matches of the positive predicate to be accepted, you won’t use thepositive predicate and apply NOT on top, rather rewrite your
logic, e.g.,

T1.col1  T2.col1  OR (T1.col1 IS NULL AND T2.col1 IS NOT NULL)  OR (T1.col1 IS NOT NULL AND T2.col1 IS NULL)

In short, there are cases where you want predicates to use 2-valued-logic, andfor now, the solutions are not pretty, and we can only hope that they areoptimized well.

SQL Server 2008 introduces new language elements that add scenarioswhere NULLsare involved and you would want predicates to use 2-valued-logic, e.g., groupingsets related features and the MERGE statement.

I will cover those in detail inmy October and November columns. For now, even though you may not be familiarwith grouping sets and the MERGE statement, focus on the ON clause of thefollowing MERGE statement
handling incremental updates of aggregates:

WITH LastDay AS(  SELECT     GROUPING_ID(      custid, empid,      YEAR(orderdate), MONTH(orderdate), DAY(orderdate) ) AS grp_id,    custid, empid,    YEAR(orderdate) AS orderyear,    MONTH(orderdate) AS ordermonth,    DAY(orderdate) AS orderday,    SUM(qty) AS qty  FROM dbo.Orders  WHERE orderdate = '20080419'  GROUP BY    CUBE(custid, empid),    ROLLUP(YEAR(orderdate), MONTH(orderdate), DAY(orderdate)))MERGE INTO dbo.MyGroupingSets AS TGTUSING LastDay AS SRC  ON     (TGT.grp_id = SRC.grp_id)     AND (TGT.orderyear = SRC.orderyear          OR (TGT.orderyear IS NULL AND SRC.orderyear IS NULL))     AND (TGT.ordermonth = SRC.ordermonth          OR (TGT.ordermonth IS NULL AND SRC.ordermonth IS NULL))     AND (TGT.orderday = SRC.orderday          OR (TGT.orderday IS NULL AND SRC.orderday IS NULL))     AND (TGT.custid = SRC.custid          OR (TGT.custid IS NULL AND SRC.custid IS NULL))     AND (TGT.empid = SRC.empid          OR (TGT.empid IS NULL AND SRC.empid IS NULL))WHEN MATCHED THEN  UPDATE SET    TGT.qty += SRC.qtyWHEN NOT MATCHED THEN  INSERT (grp_id, orderyear, ordermonth, orderday, custid, empid)  VALUES (SRC.grp_id, SRC.orderyear, SRC.ordermonth, SRC.orderday, SRC.custid, SRC.empid);

Notice how awkward it is.
I had a discussion on the subject with fellow MVPs in a private forum. Steve
Kass did some research and discovered that ANSI SQL has an answer to
thisproblem in the form of operators:

IS [NOT] DISTINCT FROM

It does require getting used to, but once you do, it makes perfect sense.
DISTINCT treats NULLs like known values. That is, one NULL is not
distinct fromanother NULL, but a NULL is distinct from known values. 5 is
distinct from 10,but 5 is not distinct from 5.
Now let’s go back to the example I mentioned earlier:

T1.col1 = T2.col1

To have this predicate use 2-valued-logic, and have NULLs treated like
knownvalues, instead of using:

(T1.col1 = T2.col1 OR (T1.col1 IS NULL AND T2.col1 IS NULL))

You would use:

T1.col1 IS NOT DISTINCT FROM T2.col1

This is ultimately simpler than the awkward forms used nowadays and has muchbetter potential for good optimization in terms of utilizing indexesefficiently. As for the inverse; take the following predicate:

T1.col1  T2.col1

In order to treat NULLs like known values, instead of using:

T1.col1  T2.col1  OR (T1.col1 IS NULL AND T2.col1 IS NOT NULL)  OR (T1.col1 IS NOT NULL AND T2.col1 IS NULL)

You would use:

T1.col1 IS DISTINCT FROM T2.col1

Besides being simpler and lending itself to better optimization, the IS [NOT]DISTINCT FROM operator give you pure 2-valued-logic, namely, apply NOT on topand you will get the exact inverse.

I find this standard language element very important, and this will become moreevident in SQL Server 2008. Unfortunately, it was not implemented in T-SQL.Steve Kass who researched and discovered it submitted a
suggestion for thisenhancement in connect.microsoft.com. I encourage you to vote for thissuggestionhere;the more votes it gets, the better are the chances you will see it in a futureversion of SQL Server.

Cheers,
--
BG

About the Author

Itzik Ben-Gan

See more from Itzik Ben-Gan

Related Topics

Recent in Cloud

Related Topics

Recent in OS

Related Topics

Recent in IT Mgmt

Related Topics

Recent in Career

Related Topics

Recent in Storage

Related Topics

Recent in Security

Related Topics

Recent in Dev

Related Topics

Recent in DX

Related Topics

Recent in Infrastructure

Related Topics

IS NOT DISTINCT FROM

About the Author

Editor's Choice

Featured Technical Explainers

Recent What Is

Related Topics

Recent in Cloud

Related Topics

Recent in OS

Related Topics

Recent in IT Mgmt

Related Topics

Recent in Career

Related Topics

Recent in Storage

Related Topics

Recent in Security

Related Topics

Recent in Dev

Related Topics

Recent in DX

Related Topics

Recent in Infrastructure

Related Topics

<span class="ArticleBase-LargeTitle">IS NOT DISTINCT FROM</span>IS NOT DISTINCT FROM

About the Author

Editor's Choice

Featured Technical Explainers

Recent What Is

IS NOT DISTINCT FROM