When to Generate SQL Server Statistics Automatically

In the first article in this series on SQL Server distribution statistics, I covered at a high level what statistics are, why they are important, and the options at the database level for creating and updating statistics automatically. Now it’s time to see what those automatic operations leave in their wake as well as looking at when it makes sense to manually add statistics objects to your tables & indexes in Microsoft SQL Server.

Why Statistics Matter: A Refresher

In part one of this series, I explained that statistics provide your data values' metadata. That metadata is used to create and compile the best execution plans for the queries being submitted by end users against your databases. If you have performance problems with your SQL Server instances, many of them can be linked to the query optimizer making bad decisions when building execution plans. And those plans are inappropriate because they're based on a lack of understanding data values.

Statistics provide insight into what values exist in a domain of data, how discrete (aka different, unique) the values are from one another, and what their storage size and counts are. This all is useful for the query optimizer (QO going forward) to determine and craft the proper seeks, scans, lookups, joins of various kinds, etc., along with the amount of memory to allocate for the processing of results.

Statistics can be automatically generated and updated depending upon certain conditions set at the database level. They are also generated automatically when indexes are created and rebuilt (but not reorganized). More information on those settings and the thresholds governing when statistics are rebuilt to collect current distribution metadata in that first article.

We'll be looking at two things in this article: the desire to look at the objects that are created automatically, and the timing of such things so that we can identify gaps to create statistics objects manually.

Identifying Statistics Objects in the Database

While there are many ways to query SQL Server for details on the statistics objects that exist to get an idea of their composition and what they can tell us of our data distribution, we’re going to focus on the simple identification process in this article. In order to do that I only need to introduce you to a couple of simple SQL Server queries. We will be looking at both the sys.stats system catalog view and the sys.dm_db_stats_properties Dynamic Management Function.

The following queries can be used to return base “does it exist” information as well as other valuable details about your statistics objects in a database:

First the sys.stats approach:

USE Tech_Outbound; GO SELECT O.[name] AS [the_object_name] , S.object_id , S.name AS [the_stats_name] , S.stats_id , S.auto_created , S.user_created , S.no_recompute , S.has_filter , S.filter_definition , S.is_temporary , S.is_incremental

FROM sys.stats AS S INNER JOIN sys.objects AS O ON S.[object_id] = O.[object_id] WHERE O.is_ms_shipped = 0 ORDER BY O.[name];

I do join out to another system catalog view, sys.objects, for the purpose of resolving the name of the object the statistics are tied to (via object_id) as well as to filter the results to only look at objects that are user-created (using the is_ms_shipped column in sys.objects.)

At this point I’ve yet to create any objects in the Tech_Outbound database I’m using for this series. So no results are returned. However I want to build some foundational tables to looking at statistics as we go along so I’m going to first create a numbers table using code from my good friend Itzik Ben-Gan who will be joining us to deliver presentations on t-sql in 2018 on the Alaskan Tech Outbound event slated for August 4-11, 2018. Ben-Gan is a frequent author here and this code is borrowed from his tool kit.

--CREATE A NUMBERS TABLE

CREATE TABLE Numbers (n INT NOT NULL PRIMARY KEY); GO INSERT INTO dbo.Numbers(n) SELECT rn FROM ( SELECT ROW_NUMBER() OVER(ORDER BY CURRENT_TIMESTAMP) AS rn FROM sys.trace_event_bindings AS b1 , sys.trace_event_bindings AS b2 ) AS rd WHERE rn <= 500000;

SELECT * FROM dbo.Numbers;

If we run the discovery query again from earlier in this article we would see the following results.

This is specifically due to the creation of the primary key. How I can tell is the naming convention of the statistics name matches the automatic naming of the primary key. Plus the creation of a primary key creates a clustered index by default and the creation of an index creates the underlying statistics for that index in the process since every page needs to be scanned for the index creation you’ll also end up with a full scan used for the statistics creation. This is the most complete view of your data distribution you can get. Usually you’ll get sampled statistics but since there is already a full scan overhead incurred for an index rebuild it’s a wise time to get a full scan build of your statistics (or update of statistics via full scan if the conditions are appropriate.)

Using a query against sys.dm_db_stats_properties we can gain a few additional insights beyond those offered through sys.stats. For a query against this Dynamic Management Function (aka DMF) we need to supply two parameter values: object_id and stats_id. We know these values through the results returned from sys.stats earlier. Using those values and again joining to sys.objects for getting clarifying information we have the following query below:

SELECT O.[name] AS [the_object_name]
       , S.object_id
       , S.stats_id
       , S.last_updated
       , S.rows
       , S.rows_sampled
       , S.steps
       , S.unfiltered_rows
       , S.modification_counter
FROM sys.dm_db_stats_properties(565577053,1) AS S
       INNER JOIN sys.objects AS O
              ON S.[object_id] = O.[object_id]
WHERE O.is_ms_shipped = 0
       AND O.[name] != 'foo'
ORDER BY O.[name];

The sys.dm_db_stats_properties DMF provides insights into the time the statistics were last updated as well as information about the underlying object’s domain (in this case 50,000 rows) as well as the amount of rows included in the sampling used to create or update the statistics, the number of steps in the statistics historgram (explained in a future article in this series), the amount of unfiltered rows – which come into play when dealing with filtered statistics (not yet covered in this series), and finally the modification_counter for the statistics object.

The modification counter is what keeps track to the number of incremental changes made to the underlying column for the object the statistics are describing. In this case that would be the primary key column of dbo.Numbers. If looking at a multi-columnar statistic this value is only for the leading column as is the metric for “steps” since only the leading column in multi columnar statistics has a histogram associated with it.

The results here back up what I stated about a full scan being employed with an index is created or rebuilt: this is a conclusion of looking at the rows and rows_sampled columns from sys.dm_db_stats_properties. Considering they equal you can conclude 100% of the rows were sampled to build the statistics for the primary key in that table.

Statistics Creation Insights

I want to show a couple more examples of what happens with statistics generation when other commands associated with creation of tables and querying values in tables occur. One thing I highlighted in the first article is the behavior of turning on the AUTO_CREATE_STATISTICS option at the database level. What this enables is the ability for SQL Server to generate statistics objects when needed and they don’t exist. We’ll see this behavior as we continue on.

I want to use the dbo.Numbers table to create another table we’ll be using later in this series to look at statistics in greater depth. I’m going to first create this table (called T0) with the following code:

Now if I re-run the query against sys.stats (the first discovery query above) I get the following results which show that no new statistics objects have been generated simply by the act of creating a table:

CREATE TABLE T0 (c1 INT NOT NULL, c2 NCHAR(200) NOT NULL DEFAULT '#');

At this point I now am going to use the dbo.Numbers table to load 100,000 rows into the table all with the value of 1000 into the c1 column and a single record with the value of 2000. We’ll be using this table later as we look at what cardinality – the uniqueness of data values – has to do with execution plans and statistics.

--INSERT 100000 rows. All rows contain the value of 1000 for column c1

INSERT INTO T0(c1)
SELECT 1000 FROM dbo.Numbers
WHERE n <= 10000;

--INSERT 1 row with value of 2000

INSERT INTO T0(c1) VALUES (2000);

At this point we still have not created any new statistics objects. However, I now want to create a non-clustered index on column c1:

--CREATE NON-CLUSTERED INDEX ON COLUMN c1

CREATE NONCLUSTERED INDEX ix_T0_1 ON T0(c1);

Now, when running the sys.stats discovery query we see an additional stats object in the database:

This is to be expected. It’s similar behavior to what happened with the dbo.Numbers table. What about the other aspect of automatic statistics creation however? That arises when statistics are needed to get an understanding of how joins and search predicates are handled when creating an execution plan. Right now, the only understanding of distribution statistics exist for just the dbo.Numbers.n column (it’s primary key column) and the dbo.T0.c1 column that we manually created a clustered index on above. I’m going to create one more table specifically to show the resulting behavior:

CREATE TABLE T1 (c1 INT NOT NULL, c2 NCHAR(200) NOT NULL DEFAULT 'SQL Cruise')

This statement will not create any associated statistics objects as we’ve seen presented before. There is no index of any kind on the T1 table when I add a row:

INSERT INTO T1(c1) VALUES (2000)

But now what happens if I execute the following query?

SELECT T1.c1 FROM T0 INNER JOIN T1 ON T1.c1 = T0.c1;

We will get a single result returned (2000) but what I want to highlight instead is what happens with statistics. We had no stats objects for T1 until the query was executed (or should I say just before the query was executed because the QO determined it didn’t have any understanding of the data that it needed to perform the join so it created statistics as part of building the execution plan. This is what the sys.stats discovery query returns now:

By setting AUTO_CREATE_STATISTICS = ON you’ll see statistics created for joins where there are no existing leading column statistics created.

Likewise we can see the same behavior on filtering through a WHERE clause predicate. The following two queries will not create any statistics:

SELECT T1.c2
FROM dbo.T1
WHERE c1 = 2000;
SELECT T0.c2
FROM dbo.T0
WHERE c2 IS NULL;

This is because the QO does not need any additional insight into data distribution for either query. This is true for the first query because we already have statistics through the automatic creation behavior from the join we just executed that resulted in the _WA_Sys_... queries to be generated (by the way this is the default naming convention for automatically generated statistics.) The reason the second query doesn’t create statistics is because the QO can determine the results based upon the definition of the table not allowing for NULL values.

What happens however when the QO needs more information as to the distribution of values for the predicate? We get automatic generation of statistics as shown below.

SELECT T0.c2
FROM dbo.T0
WHERE c2 = '1';

This results in creation of the highlighted statistics shown in row 4 below:

Conclusion of Part 2

As we continue through our tour of distribution statistics in Microsoft SQL Server you should now have a good understanding of what statistics are and why they’re important as well as how to query for existence and gain some basic insights about the stats objects in your database. You should also now be able to understand when stats are automatically created – and when they’re not.

In the next article in the series I’ll be visiting the statistics histogram and more options for detailed metadata about your statistics through sys.stats, sys.dm_db_stats_properties, and through the as-yet-to-be-discussed call to DBCC SHOW STATISTICS.

Comments

Plain text