Flattening Hierarchies

Convert hierarchies that are represented as adjacency lists

Itzik Ben-Gan

October 25, 2010

11 Min Read
Flattening Hierarchies

From time to time I get a request to convert a hierarchy represented as an adjacency list (parent-child IDs) to a flattened, or pivoted, representation. The latter representation means that you want the result set to have a row for each node in the hierarchy, with a column for each of the ancestors of that node, holding the ancestor ID.

As an example, the code in Listing 1 creates a table called Employees and populates it with sample data. The Employees table holds an adjacency list representation of an employee hierarchy. The task that’s the focus of this article is to create a flattened representation of the hierarchy, as Table 1 shows.

I cover two main strategies to address this task—one based on joins and another based on recursive queries. As usual, I urge you to first try to come up with your own solutions before looking at mine.

Table 1: Flattened Hierarchy

Empid

lvl

level1

level2

level3

level4

level5

1

1

1

NULL

NULL

NULL

NULL

2

2

1

2

NULL

NULL

NULL

3

2

1

3

NULL

NULL

NULL

7

3

1

3

7

NULL

NULL

9

4

1

3

7

9

NULL

11

4

1

3

7

11

NULL

12

5

1

3

7

9

12

13

5

1

3

7

9

13

14

5

1

3

7

9

14

4

3

1

2

4

NULL

NULL

5

3

1

2

5

NULL

NULL

6

3

1

2

6

NULL

NULL

8

4

1

2

5

8

NULL

10

4

1

2

5

10

NULL

Solution Using a Static Query Based on Joins

The strategy based on joins works reasonably well when the degree of the tree (the maximum depth, or number of levels) is fairly small and known ahead. For example, suppose you know that the employee hierarchy won’t exceed five levels of management in the foreseeable future. You can thus address the task at hand with a static query that uses four joins. Listing 2 contains a possible solution based on the joins strategy.

Most of the logic in the query defining the CTE called C is fairly straightforward—it’s a query with four self-outer-joins, where each instance of the Employees table represents a different ancestor of the initial node. The instance representing the initial node is A, the parent of the initial node B, and so on. The one thing that might be less straightforward is the calculation of the column lvl. This column represents the level of the current node, with the integer 1 representing the root node (the CEO in our case), 2 for children of the root (direct subordinates of the CEO), and so on. To calculate the current node’s level, you need to count how many nodes in the ancestors path leading to the current node aren’t NULL. For example, for the employee Jiru (employee ID 5), you’ll find three non-NULL ancestors in its path—A.empid = 5, B.empid = 2 (Jiru’s manager), and C.empid = 1 (Jiru’s manager’s manager, the CEO). The IDs of the fourth and fifth ancestors of Jiru (D.empid and E.empid) are NULLs.

To count how many non-NULL ancestor IDs exist in each node’s path, I used a new feature in SQL Server 2008—table value constructor—like so:

( SELECT COUNT(empid)  FROM (VALUES(A.empid),(B.empid),(C.empid),(D.empid),(E.empid)) AS D(empid) ) AS lvl

This feature lets you use the VALUES clause to construct a derived table made out of rows constructed from your own expressions. My derived table (named D) consists of five rows, each with one column (named empid) holding one of the ancestor IDs of the current node’s ancestors. The query against the derived table applies a COUNT aggregate to count the non-NULL empid values. For us, this count represents the current node’s level in the tree.

If you’re working with a version of SQL Server prior to 2008, you can use the following expression as an alternative:

(   CASE WHEN A.empid IS NOT NULL THEN 1 ELSE 0 END  + CASE WHEN B.empid IS NOT NULL THEN 1 ELSE 0 END  + CASE WHEN C.empid IS NOT NULL THEN 1 ELSE 0 END  + CASE WHEN D.empid IS NOT NULL THEN 1 ELSE 0 END  + CASE WHEN E.empid IS NOT NULL THEN 1 ELSE 0 END ) AS lvl

As for the outer query in Listing 2, it has a series of CASE expressions responsible for returning the ancestor ID in each of the levels of the path leading to the current node. Recall that each instance in the join represents a certain distance from the starting node—A represents the starting node, B represents the parent, C the grandparent, and so on. But the desired result is supposed to return in each of the columns the ancestor with a certain distance from the root—not the target node. So, for example, the target column level1 is supposed to hold the ID of the current node’s ancestor in the first level of the path. For Didi, who’s in level 5, the ancestor in level 1 in the path appears in the attribute E.empid. But for Gabriel, who’s in level 4, the ancestor in level 1 appears in the attribute D.empid. So the CASE expressions have the logic and the math to figure out for each level in the path which of the attributes needs to be returned based on the current node’s level. This technique works reasonably well when the degree of the tree is fairly small, known, and not planned to increase in the foreseeable future.

Static Solution Using a Recursive Query

Another approach to flattening hierarchies uses recursive queries. I’ll first demonstrate a solution using a static recursive query, assuming the degree of the tree won’t exceed a certain number—for example, five. Then I’ll show how you can enhance the solution for cases in which the degree of the tree can keep changing with an unknown maximum.

The first step in the solution is to write a recursive CTE with the anchor query returning the row for the root node and the recursive query returning the children of the nodes from the previous round. This way the code will query the nodes one whole level at a time, starting with the root. With fairly simple logic you can calculate each node’s level (1 for the root; parent’s level plus 1 for a child). You can also construct a path consisting of the IDs of all ancestors leading to the current node. The path of the root consists of just the root’s ID, converted to a fixed-length 10-character string with leading spaces. The path for a child simply consists of the parent’s path plus the child’s ID, again, converted to a fixed-length character string. Listing 3 contains the code that implements this step.

Table 2 shows the output of the step. As an example, Didi, who has four levels of ancestors, got the lvl (meaning level) value 5, and pth (meaning path) values 1, 3, 7, 9, and 14.

The second step in the solution is to have an outer query against the CTE Tree extracting the individual ancestor IDs from the concatenated paths. The implementation of this step, as part of the complete solution, is shown in Listing 4.

A SUBSTRING function is used to extract the string representation of each of the IDs. Then the expression + 0 is used to force implicit conversion of the character ID to the numeric ID. Finally, NULLIF(, 0) is used to convert an ID of 0 to a NULL.

Table 2: Output of Code in Listing 3

Empid

lvl

pth

1

1

1

2

2

1

3

2

1

7

3

1

9

4

1

11

4

1

12

5

1

13

5

1

14

5

1

4

3

1

5

3

1

6

3

1

8

4

1

10

4

1

 

Dynamic Solution Using a Recursive Query

The solution in Listing 4 is based on the assumption that the degree of the tree is 5. In case a known maximum doesn’t exist and the degree of the tree can keep changing, you can still rely on similar logic—you’ll just need to dynamically construct the query string, then execute it.

The dynamic solution uses a helper function called GetNums that accepts an integer as input and returns a sequence of integers of the requested size from 1 and on. Use the code in Listing 5 to create the GetNums function.

Listing 6 contains the complete dynamic solution for the task. The code in Callout A is used to calculate the current degree of the tree. As you can see, the code uses a simple recursive query very similar to the one I described earlier to calculate the level of each node. The outer query then calculates the maximum level in the tree and stores the value in the variable @degree.

The rest of the code is responsible for constructing the solution query code, storing it in a variable called @sql, then executing it. The first part in the construction of the solution query code is straightforward—it’s the definition of the CTE Tree, which is the same as in Listing 4. The part that’s new is the code in Callout B, which is responsible for constructing the series of expressions in the outer query’s SELECT list—the ones returning the ancestor IDs in the different levels. This is the dynamic part that’s based on the current degree of the tree. Observe that this code queries the GetNums function to get a row for each level. The code constructs the right expression for each level number. The code uses the FOR XML PATH option to concatenate the values from the result rows into one string. Finally, the code following Callout B finalizes the solution query string and executes it.

Learn more: CTEs with Multiple Recursive Members

Strategies Summarized

In this article I demonstrated two main strategies to flatten hierarchies. I discussed a strategy based on joins, which is mainly suitable for trees with a small and nonincreasing number of levels. I also discussed a more dynamic strategy that uses a recursive query to handle trees with a changing number of levels and an unknown maximum. I demonstrated this technique first with a static query, then I showed how to apply similar logic to construct the query string dynamically after calculating the current degree of the tree. These techniques for flattening hierarchies are handy for converting hierarchies that are represented as adjacency lists to a flattened form for purposes such as presentation.

Sign up for the ITPro Today newsletter
Stay on top of the IT universe with commentary, news analysis, how-to's, and tips delivered to your inbox daily.

You May Also Like