Compute a Trimmed Mean

Learn two methods: the NTILE function and the PERCENT_RANK or CUME_DIST function

December 4, 2014

4 Min Read

My good friend Andy Kelly sent me a puzzle concerning the computation of an average aggregate, but with a twist. The puzzle involves data representing job durations. To try the puzzle, use the code in Listing 1 to create the Jobs table and fill it with some sample data.

SET NOCOUNT ON;USE tempdb;IF OBJECT_ID(N'dbo.Jobs', N'U') IS NOT NULL DROP TABLE dbo.Jobs;GOCREATE TABLE dbo.Jobs(  jobid INT NOT NULL,  starttime DATETIME2(0) NOT NULL,  endtime DATETIME2(0) NOT NULL,  duration AS DATEDIFF(s, starttime, endtime),  CONSTRAINT PK_Jobs PRIMARY KEY(jobid, starttime),  CONSTRAINT CHK_start_end CHECK (endtime >= starttime));GOINSERT INTO dbo.Jobs(jobid, starttime, endtime) VALUES  (1, SYSDATETIME(), DATEADD(s, 60, SYSDATETIME())),  (1, DATEADD(s, 100, SYSDATETIME()), DATEADD(s, 165, SYSDATETIME())),  (1, DATEADD(s, 200, SYSDATETIME()), DATEADD(s, 254, SYSDATETIME())),  (1, DATEADD(s, 300, SYSDATETIME()), DATEADD(s, 362, SYSDATETIME())),  (1, DATEADD(s, 400, SYSDATETIME()), DATEADD(s, 400, SYSDATETIME())), -- failure  (1, DATEADD(s, 500, SYSDATETIME()), DATEADD(s, 570, SYSDATETIME())),  (1, DATEADD(s, 600, SYSDATETIME()), DATEADD(s, 663, SYSDATETIME())),  (1, DATEADD(s, 700, SYSDATETIME()), DATEADD(s, 750, SYSDATETIME())),  (1, DATEADD(s, 800, SYSDATETIME()), DATEADD(s, 2000, SYSDATETIME())), -- stuck  (1, DATEADD(s, 2100, SYSDATETIME()), DATEADD(s, 2158, SYSDATETIME()));

Query the Jobs table as follows to see its contents:

SELECT jobid, durationFROM dbo.Jobs;

Figure 1 shows the output you get.

jobid       duration----------- -----------1       601       651       541       621       01       701       631       501       12001       58

The sample data represents 10 executions of the same job. In reality, you'll have multiple jobs with many more rows per job. The puzzle involves computing the average duration per job. The problem is that outliers exist that need to be excluded. For example, consider a case in which a problem causes a job to run for an abnormally short or long duration. You want to exclude the outliers from consideration.

Different statistical methods are available for excluding outliers. The method you should use depends on your exact needs. Dejan Sarka goes into a lot of detail regarding such computations in his chapter "T-SQL for BI Practitioners" in the book T-SQL Querying (Microsoft Press, 2015).

I'd like to describe one of the classic methods used in such cases, called trimmed mean. The idea in this method is to exclude a certain percent of the lowest and highest values. For example, suppose you want to exclude 10 percent of the lowest and 10 percent of the highest values from consideration. For our sample data, this would mean excluding the first and last samples out of the 10 samples we have, as Figure 2 shows.

Figure 2: Trimmed Mean

Figure 3 shows the desired result for our sample data.

jobid  avgduration------ ------------1      60.250000

One way to achieve the task is to arrange the samples in 10 tiles, using the NTILE(10) function. The first 10 percent of the rows will be assigned with tile number 1 and the last 10 percent with tile number 10. Then you filter only the rows with the tile numbers 2 through 9, excluding 1 and 10. After filtering, you group the remaining rows by the jobid column and return the average duration. Here's the complete solution query:

WITH C AS(  SELECT jobid, duration,    NTILE(10) OVER(PARTITION BY jobid ORDER BY duration) AS tile  FROM dbo.Jobs)SELECT jobid,  AVG(1. * duration) AS avgdurationFROM CWHERE tile BETWEEN 2 AND 9GROUP BY jobid;

Another option is to use the PERCENT_RANK window function. This function computes the relative rank of a row and expresses it as a percent. More specifically, if rk is the rank of the row computed with the RANK function based on job duration order and cnt is the count of rows (job executions), PERCENT_RANK is computed as (rk - 1) / (cnt - 1).

For our trimmed mean calculation, you start by computing the percentile rank of the row (call the column pctrnk). Then, similar to the previous solution, you filter only the rows in the desired percentile range (e.g., between 0.1 and 0.9). Finally, you group the remaining rows by the jobid column and compute the average duration. Here's the complete solution query:

WITH C AS(  SELECT jobid, duration,    PERCENT_RANK() OVER(PARTITION BY jobid ORDER BY duration) AS pctrnk  FROM dbo.Jobs)SELECT jobid,  AVG(1. * duration) AS avgdurationFROM CWHERE pctrnk BETWEEN 0.1 AND 0.9GROUP BY jobid;

Depending on the exact semantics you're looking for, instead of using PERCENTILE_RANK you can use the alternative function CUME_DIST. This function computes the percent with the formula np / cnt, where np is the number of rows that precede or are peers of the current row.

About the Author

Itzik Ben-Gan

See more from Itzik Ben-Gan

Related Topics

Recent in Cloud

Related Topics

Recent in OS

Related Topics

Recent in IT Mgmt

Related Topics

Recent in Career

Related Topics

Recent in Storage

Related Topics

Recent in Security

Related Topics

Recent in Dev

Related Topics

Recent in DX

Related Topics

Recent in Infrastructure

Related Topics

Compute a Trimmed Mean

About the Author

Editor's Choice

Featured Technical Explainers

Recent What Is

Related Topics

Recent in Cloud

Related Topics

Recent in OS

Related Topics

Recent in IT Mgmt

Related Topics

Recent in Career

Related Topics

Recent in Storage

Related Topics

Recent in Security

Related Topics

Recent in Dev

Related Topics

Recent in DX

Related Topics

Recent in Infrastructure

Related Topics

<span class="ArticleBase-LargeTitle">Compute a Trimmed Mean</span>Compute a Trimmed Mean

About the Author

Editor's Choice

Featured Technical Explainers

Recent What Is

Compute a Trimmed Mean