T-SQL 101, Lesson 4
The GROUP BY clause is a powerful feature of T-SQL. Depending on the level of detail you need in your reports, you can use the GROUP BY clause in SELECT statements to group data by values in one or more columns. You can even use the HAVING clause to further refine your reports.
May 27, 2008
You can begin to tap into the true power of T-SQL by using the GROUP BY clause in SELECT statements. Grouping data lets you produce reports that answer complex questions instead of reports that answer only basic questions. For example, with the GROUP BY clause, you can produce a report that answers the question "What is the average size of the bonus paid to each employee over the past 10 years?" instead of "What is the average size of the bonus paid out to employees?"
Depending on the level of detail you need in your reports, you can use the GROUP BY clause to group data by values in one or more columns. You can even use the HAVING clause to further refine your reports. Before I show you how to do so, though, you might want to create and populate a couple of tables so you can follow along.
The Prerequisites
To help demonstrate grouping, I created two tables: Bonus and MovieReview. The Bonus table contains the bonus payments given to eight employees in the past 10 years. This table contains three columns: EmployeeID, Amount, and PaymentDate. The MovieReview table contains the ratings that the five employees have given to movies they’ve watched in their spare time. This table contains four columns: EmployeeID, Genre, MovieName, and Stars. The Stars field specifies the movie’s rating, where 1 star is the worst rating and 5 is the best rating.
You can create and populate these tables by following these steps:
Download the CodeToCreateBonusTable.sql, CodeToPopulateBonusTable.sql, CodeTo CreateMovieReviewTable.sql, and CodeTo PopulateMovieReviewTable.sql files. Click the "Download the Code" hotlink at the top of the article.
Create the Bonus table. Open SQL Server 2005’s SQL Server Management Studio (SSMS) or SQL Server 2000’s Query Analyzer and copy the code in CodeToCreateBonusTable.sql into the query window. In the first line of code, change MyDB to the name of your database. Execute the code.
Populate the Bonus table by running CodeToPopulateBonusTable.sql in SSMS or Query Analyzer.
Create the MovieReview table. Copy the code in CodeToCreateMovieTable.sql into SSMS’s or Query Analyzer’s query window. In the first line of code, change MyDB to the name of your database. Execute the code.
Populate the MovieReview table by running CodeToPopulateMovieReviewTable.sql in SSMS or Query Analyzer.
Grouping Data Using One Column
I remember spending hours writing COBOL programs to produce reports. They took even longer to debug. Today, it takes me only a few minutes to crank out the equivalent T-SQL code to produce similar reports, thanks in part to the GROUP BY clause.
When you use the GROUP BY clause in a SELECT statement, two things happen:
GROUP BY uses the grouping criterion you specify to group the data being returned by the SELECT query. Typically, the grouping criterion is a column, in which case GROUP BY groups the data into the possible values in that column. For example, the EmployeeID column in the Bonus table has eight possible values (1 through 8), so GROUP BY would group the data being returned by the SELECT query into those eight groups. Similarly, the Stars column in the MovieReview table has five possible values (1 through 5), so GROUP BY would group the data into those five groups. GROUP BY returns only one row for each possible group. For example, GROUP BY would return eight rows when you group by the EmployeeID column and five rows when you group by the Stars column, assuming there is data in each group.
After GROUP BY is done grouping the data, the aggregate function specified in the SELECT query is performed on each group rather than on the entire result set. So, for example, if the SELECT query specifies to use the AVG function to get the average of the values in the Amount column in the Bonus table and you grouped the data by the EmployeeID column, individual averages will be calculated for each of the eight groups instead of one overall average.
These concepts can be confusing for people new to T-SQL, so let’s take a look at a few examples. Note that in trying to keep these examples as straightforward as possible, I’m minimizing the use of the T-SQL concepts that I haven’t yet covered. People who are more familiar with T-SQL might question the column definitions I have chosen or the lack of differential integrity in the tables. Those concepts, while important, aren’t relevant to what I’m covering here. I’ll explore those topics in future lessons. Until then, please bear with me.
Suppose you need to query the Bonus table to determine the total bonus amount paid to each employee over the past 10 years. In Lesson 3, I showed you how to use the SUM function in a SELECT statement to obtain the total of all the values in a specified column. If you were to use this function with the Amount column in the Bonus table, you’d get a single dollar figure representing the total amount that the company paid out in bonuses for the past 10 years. To get the total bonus amount paid to each employee over the past 10 years, you can use a GROUP BY clause in which the specified column is EmployeeID:
SELECT EmployeeID,SUM(Amount) AS 'Total Bonus'FROM BonusGROUP BY EmployeeID
As you can see in the results in Figure 1, employee 6 has the lowest total bonus payout. If you cross reference this result with the data in the Employee table created in Lesson 3, you’ll see that employee 6 is Napoleon Lawrence, one of several employees most recently hired in 2006. You’ll also see that Napoleon has the lowest salary of all the employees. Thus, in all likelihood, his total bonus payout is less than the other employees because he hasn’t been with the company that long and his salary is low. (Typically, employees with lower salaries receive lower bonuses.)
Now let’s go a bit further and determine the average bonus per employee along with the number of bonuses paid to each employee. You can do this by using the AVG and COUNT functions and grouping their results by EmployeeID:
SELECT EmployeeID,AVG(Amount) AS 'Average Bonus'COUNT(Amount) AS 'Bonuses Paid'FROM BonusGROUP BY EmployeeID
As the results in Figure 2 show, Napoleon (employee 6) has indeed earned only two bonuses and his average payout is $1,000, which is significantly lower than all the other employees.
Now let’s compare each employee’s average bonus with the overall average paid out by the company. You first need to determine the overall average of the bonuses paid out by the company:
SELECT AVG(Amount)AS 'Corporate Average'FROM Bonus
The result is $7787.50. Next, you need to plug this corporate average into a calculation that determines how each employee’s average bonus compares with the corporate average in terms of percentage:
SELECT EmployeeID,AVG(Amount) AS 'Average Bonus',7787.50 AS 'Corporate Average',(AVG(Amount) - (7787.50)) / (7787.50) * 100AS '% Higher/Lower than Corp Avg'FROM BonusGROUP BY EmployeeID
Figure 3 shows the results. Note that if you wanted to calculate the corporate average on the fly, you could use several subqueries in the SELECT statement. Or even better, you could calculate the corporate average and store the result in a local variable. You could then reference this variable in the SELECT statement instead of hard-coding the corporate average. However, this is an advanced topic that I’ll cover in a later lesson.
Grouping Data by More Than One Column
When selecting data for a query, you can group by more than one column. When you group by more than one column, you should include the ORDER BY clause to specify how you want the returned data sorted. If you don’t use this clause, the data isn’t returned any particular order.
For the next examples, let’s switch to the MovieReview table to see what some employees have been up to in their (ahem) spare time. Suppose you want to find out each employee’s favorite types of movies (i.e., genres) and the average rating he or she has given them. To determine the favorite genres, you need to count the number of times each type of movie was reviewed, which you can do with the COUNT function. You also need to use the AVG function to determine the average rating for each genre. To show the number of movie reviews per employee per genre, you need to group this data by two columns: EmployeeID and Genre. To make it easy to see each employee’s favorite types of movies, you can sort the data first by the returned values in EmployeeID column (column 1) then by the returned values in the Review column (column 3). Here’s what that query looks like:
SELECT EmployeeID, Genre,COUNT(*) AS 'Reviews',AVG(Stars) AS 'Average Rating'FROM MovieReviewGROUP BY EmployeeID, GenreORDER BY 1,3
As you can see from the results in Figure 4, employee 1 has watched many different types of movies and seemed to enjoy the fiction and horror movies the most because they had the highest average rating. You can also see that employees 4 and 5 haven’t reviewed as many movies as the other employees.
Refining Results with the HAVING Clause
The HAVING clause is used to eliminate rows from the result set after the data has been aggregated and grouped. Any column defined in the SELECT list can be referenced in the HAVING clause. You can also reference aggregate functions.
For example, the results in Figure 4 show that not all the employees have reviewed the same number of movies. So, let’s produce a report that lists the average rating of every movie that has been rated by four or more employees.
As output, you want to see each movie’s name, the average number of stars it received, and the number of times it was reviewed. To get this output, you need to:
Use the GROUP BY clause to group the data by the MovieName column.
Use the AVG function to determine the average number of stars for each movie. However, instead of displaying the average number of stars as an integer (boring), you can use the REPLICATE function to display an asterisk (*) for every star the movie received. The REPLICATE function repeats a string value the specified number of times. To make sure this new Stars column is only 10 characters long, you need to use the LEFT function, which returns with the specified number of characters from the left part of the specified string.
Use the COUNT function to determine how many reviews each movie received.
Use the HAVING clause to display only those movies that were rated by at least four employees.
Use the ORDER BY clause to sort the movies by their average rating.
The query would look like
SELECT MovieName,LEFT(REPLICATE('* ',AVG(Stars)),10)AS 'Stars',COUNT(*) AS 'Reviews'FROM MovieReviewGROUP BY MovieNameHAVING COUNT(*) >= 4ORDER BY Stars
Figure 5 shows the results.
Now let’s get a list of the movies with the worst ratings—that is, those movies with an average rating of 3 or fewer stars. You can use the same query, except you need to use a HAVING clause that will display only those movies that received an average of three or fewer stars:
SELECT MovieName,LEFT(REPLICATE('* ',AVG(Stars)),10)AS 'Stars'FROM MovieReviewGROUP BY MovieNameHAVING AVG(Stars) <= 3ORDER BY Stars
As Figure 6 shows, there are some real stinkers.
Finally, let’s see what movies the employees enjoyed the most by changing the HAVING clause so that it displays only those movies that had an average rating of four stars or higher:
SELECT MovieName,LEFT(REPLICATE('* ',AVG(Stars)),10) AS 'Stars'FROM MovieReviewGROUP BY MovieNameHAVING AVG(Stars) >= 4ORDER BY Stars DESC
Notice that the ORDER BY clause sorts the data by the number of stars, but this time it displays them in descending order. Figure 7 shows the result.
A Powerful Feature
As you witnessed in this lesson, the GROUP BY clause is a powerful feature of T-SQL. Grouping data might have been a foreign concept to you prior to this lesson, but now you should be well on your way to writing complex useful reports in record time. Be sure to rub it into any COBOL programmers you might know.
About the Author
You May Also Like