LINQ Deep Dive: Part I
Tips for Delving into LINQ
October 30, 2009
Cover Story
Languages: C#
ASP.NET Versions: 3.5
By Zeeshan Hirani
LINQ (language-integrated query) enables developers to work at a higher abstraction layer by making querying a first-class citizen in C#. LINQ provides the same query model for different domains, such as LINQ to Objects, LINQ to XML, and LINQ to SQL.
Although the querying concept is the same at the syntax level, we are dealing with different domains that have totally different behaviors, which LINQ cannot abstract. For instance, when you write a LINQ query that is applied on SQL Server, you rarely consider optimizing the query. SQL Server has a built-in optimizer that executes the query in the most efficient manner to yield maximum performance. However, LINQ to Objects does not have that richness; therefore, it is left to the developer to write queries that execute efficiently. In this two-part series I'll share some of the experiences and pain points I've encountered working with LINQ on a daily basis. I'll also demonstrate some common gotchas to avoid, as well as hidden features in LINQ that can help you write concise and efficient LINQ queries.
Tip One: Apply Filters Before a Join to Improve Query Performance
A LINQ to SQL query is translated into SQL by the LINQ provider and sent to SQL Server for execution. As a developer, you don't have to worry about the order in which the join and filter clauses need to be applied to return data faster. SQL Server has a built-in query optimizer that looks at query execution plans and indexes to determine the fastest possible way to read data, such as doing a table scan or using an index or hash join to retrieve data. LINQ to Objects queries are not endowed with any such optimizer. Therefore, it is essential to understand query behavior and limit the result by applying filters before join clauses, so that the projected output is joined against a smaller subset. Joining against fewer rows offers better performance. Figure 1 shows an example that illustrates the performance benefits gained by putting a where clause before joins.
Figure 1: Applying a where clause before a join makes queries run faster.
Figure 1 contains two integer collections, nums1 and nums2, which are generated using the Range extension method on the Enumerable class. The Range method creates a sequence of numbers within a specified range. The LINQ query in Figure 1 uses a join clause to find common integers in nums1 and nums2. Because I'm only looking for numbers that fall below the 500,000 range, I also applied a where filter to limit my search. To calculate query execution time, I recorded the start and end time of each LINQ query. Both queries in Figure 1 accomplish the same task; the difference being the first query applies the where clause after joining the two collections, and the second query limits the rows to be joined with the nums2 collection by applying the where clause before the join statement. Figure 2 shows the time taken by both queries.
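A minimal sketch of the comparison along the lines of Figure 1 follows; the exact ranges, collection sizes, and timing code here are assumptions, not the article's original listing.

using System;
using System.Diagnostics;
using System.Linq;

class FilterBeforeJoin
{
    static void Main()
    {
        // Hypothetical sizes; Figure 1 may use different ranges.
        var nums1 = Enumerable.Range(1, 1000000).ToList();
        var nums2 = Enumerable.Range(1, 1000000).ToList();

        Stopwatch sw = Stopwatch.StartNew();
        // Filter applied after the join: every element of nums1 participates in the join.
        int slow = (from a in nums1
                    join b in nums2 on a equals b
                    where a < 500000
                    select a).Count();
        Console.WriteLine("Where after join:  " + sw.Elapsed + " (" + slow + " matches)");

        sw = Stopwatch.StartNew();
        // Filter applied before the join: only the subset below 500,000 is joined.
        int fast = (from a in nums1
                    where a < 500000
                    join b in nums2 on a equals b
                    select a).Count();
        Console.WriteLine("Where before join: " + sw.Elapsed + " (" + fast + " matches)");
    }
}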
Figure 2: Results from executing the queries in Figure 1 show an increase in performance when you apply a where clause before the join statement.
The results in Figure 2 confirm that there is a significant difference in speed when you apply the where clause before the join statement. Applying the where clause first took only 6.8 seconds, compared to the original query, which took 16.5 seconds. Remember, when executing queries on in-memory objects, understanding your data is essential so you can limit the results before joining them to another collection. Most of these optimizations are provided for you by SQL Server when you write LINQ to SQL queries, but in LINQ to Objects, developers are responsible for writing queries that perform optimally.
Tip Two: Filter LINQ Queries Using Object Reference Comparison
You are not required to explicitly specify the primary key column in the where clause of the query if the LINQ query is filtered based on a primary key. Comparing object references will ensure that the query you send to the database gets filtered based on the primary key column. This sometimes removes noise from the code and makes it easier to read. Figure 3 demonstrates using object comparison to apply a where clause to retrieve only products in the Beverages category.
Figure 3: Applying a where filter to return products in the Beverages category.
As Figure 3 shows, I returned products that belong to the Beverages category. Instead of filtering products based on CategoryID, I filtered the results by comparing the category on the product object with the category I retrieved from the database. The object reference comparison gets converted to a primary key comparison on SQL Server. This can be confirmed by the SQL capture displayed below the LINQ query in Figure 3. Notice that the Products table is filtered based on the CategoryID column.
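A sketch of the idea in Figure 3 follows, assuming a Northwind-style, designer-generated DataContext (the names NorthwindDataContext, Categories, Products, CategoryName, and ProductName are assumptions).

using (var db = new NorthwindDataContext())   // hypothetical Northwind-style context
{
    db.Log = Console.Out;   // echoes the generated SQL, similar to the capture in Figure 3

    Category beverages = db.Categories.First(c => c.CategoryName == "Beverages");

    // No explicit CategoryID comparison; the object reference comparison
    // is translated into a filter on the primary key column.
    var products = from p in db.Products
                   where p.Category == beverages
                   select p;

    foreach (var p in products)
        Console.WriteLine(p.ProductName);
}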
Tip Three: Apply Aggregates to Empty Collections in LINQ to SQL Queries
LINQ to SQL has support for aggregate operators, such as Sum, Count, and Average. For collections implementing IQueryable, aggregate operations are performed on the database. Certain aggregate operators, such as Sum and Max, can cause LINQ to SQL queries to throw an exception when they are applied to empty results. For instance, to determine the total sales an employee has made, you would group the Orders table by employee and sum the order total for each order to get the employee's TotalSales. But what happens when there are employees in the database who have not created an order? Those employees will not have any records present in the Orders table, and SQL Server will return null for their sum (instead of 0).
The implementation on SQL Server for calculating the sum of empty rows differs from how LINQ to Objects applies Sum on empty collections. For instance, if an employee has not created an Order, getting the employee's Orders would return an empty collection of Orders. When you apply Sum on an empty collection, you end up with a total of 0. However, for the same scenario, SQL Server would return a null value for TotalSales. So be aware that if the aggregation is performed on SQL Server, the type that is going to hold the sum must be a nullable type. If the query is performed in memory, you can be assured that the results will not be null. Let's look at an example comparing both the LINQ to SQL and LINQ to Objects implementations for getting total revenue for each employee (see Figure 4).
Figure 4: An aggregate operation applied on an empty collection on the database requires casting to nullable types.
As Figure 4 shows, I calculated the total revenue generated by each employee. In the first section of the code, I preloaded all the Orders and OrderDetails for all employees by making use of DataLoadOptions. By using DataLoadOptions, I'm telling the LINQ to SQL query engine that while it is retrieving employees, it should also fetch all the orders for each employee and, for each Order, also retrieve its OrderDetails. By preloading all the orders from the database, I can force all the aggregate operations to be performed in memory. The query in Figure 4 displays the employee name, and for each employee I flatten the list of OrderDetails. Because each Order has many OrderDetails, I make use of the SelectMany operator to flatten the hierarchy of OrderDetails into one collection that contains all the OrderDetails for all the orders created by a particular Employee. Once I've obtained all the OrderDetails, I sum the OrderDetails total for each order by multiplying the quantity and price available on the OrderDetails entity.
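A sketch of the two variants in Figure 4 follows. The entity and member names (Employee, Orders, Order_Details, Quantity, UnitPrice, LastName) are assumptions based on the standard Northwind designer output, and the exact placement of the nullable cast may differ from the article's listing; casting inside Sum simply selects the nullable overload so the result type is decimal?.

// In-memory variant: preload orders and details, then aggregate with LINQ to Objects,
// where Sum over an empty collection returns 0.
var options = new DataLoadOptions();
options.LoadWith<Employee>(e => e.Orders);
options.LoadWith<Order>(o => o.Order_Details);
db.LoadOptions = options;   // must be set before any query runs on this context

var inMemory = from e in db.Employees.ToList()
               select new
               {
                   e.LastName,
                   TotalSales = e.Orders
                                 .SelectMany(o => o.Order_Details)
                                 .Sum(d => d.Quantity * d.UnitPrice)
               };

// Database variant (on a fresh DataContext): SQL Server returns NULL for employees
// with no orders, so the sum must be treated as a nullable decimal.
var onServer = from e in db2.Employees
               select new
               {
                   e.LastName,
                   TotalSales = e.Orders
                                 .SelectMany(o => o.Order_Details)
                                 .Sum(d => (decimal?)(d.Quantity * d.UnitPrice))
               };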
Figure 5 shows the output of printing the anonymous type on the screen. Notice that there is an employee named zeeshan who has no orders created; hence, the sales total shows a value of 0. For the same query to work in LINQ to SQL, I must cast the results returned from the Sum operation to a nullable decimal type. SQL Server returns null for TotalSales when there is no order created by an employee. If you do not make TotalSales nullable, LINQ to SQL will throw an exception complaining that null values cannot be assigned to types that are not defined as nullable.
Figure 5: Results from a Sum operation applied on SQL Server using LINQ to SQL and in memory using LINQ to Objects.
Tip Four: Delay Loading a Property in LINQ to SQL
LINQ to SQL by default supports lazy loading of EntityRefs and EntitySets. This lazy-loading behavior allows the developer to load only the data needed for processing. If you don't require additional collections, there is no need to incur the performance degradation and memory consumption of loading them from the database.
But what happens if there are certain columns in your table entity that are expensive to load and that you rarely need in your user interface? It would be beneficial in these scenarios to take advantage of lazy loading on certain columns of the table. One way to achieve this is to write a LINQ query that returns an anonymous type containing only the data you need to display in the presentation layer. However, an anonymous type cannot be passed across tiers in a strongly typed manner. Another option is to partially populate your LINQ entity with only the columns you need to display. With this option, you must at least populate your LINQ entity with the column marked as the primary key, because object tracking uses primary key columns to track entities. Both options have weaknesses, because if you don't load the column at the time you run the query, those columns never get loaded and you don't get a second chance to load the property unless you call Refresh to refresh the entity from the database.
The LINQ to SQL designer provides a Delay Loaded property on every column in the table. When you set the Delay Loaded property to True on a column, the LINQ to SQL engine will exclude that column from the query when it makes a request to the database. Only when you access that column will LINQ make a request to SQL Server to get only that column's value. This is useful in scenarios where you have defined columns as XML data type, varchar(max), or varbinary. Figure 6 shows setting the Delay Loaded property in the LINQ to SQL designer.
Figure 6: Code generated by the LINQ to SQL designer when you set Delay Loaded to True.
Figure 6 shows a preview where I marked the Picture column with Delay Loaded set to True. On changing the setting in the Properties window, the LINQ to SQL designer will create a private variable of type System.Data.Linq.Link<T>, where T is the type of the property (Binary in this case). System.Data.Linq.Link<T> has the smarts to enable lazy loading on the property. After setting the Picture column to be lazy loaded, querying the Categories table will exclude the Picture column. Figure 7 confirms the behavior by printing the SQL sent to the database. Notice that the query does not include any reference to the Picture column.
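The generated member follows a pattern roughly like the sketch below; this is a simplified illustration of the shape of the designer's output, not its exact code (the attribute arguments and change-tracking plumbing are omitted or assumed).

// Inside the generated Category class (simplified sketch):
private System.Data.Linq.Link<System.Data.Linq.Binary> _Picture;

[Column(Storage = "_Picture", DbType = "Image", UpdateCheck = UpdateCheck.Never)]
public System.Data.Linq.Binary Picture
{
    get { return this._Picture.Value; }   // first access loads just this column from the database
    set { this._Picture.Value = value; }  // property-change notification omitted in this sketch
}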
Figure 7: The SQL query sent by LINQ to SQL excluded the Picture column.
Setting lazy loading in the designer is a good option, but it applies the lazy-loading behavior for that property to all the queries in your application. What happens when you have a genuine need for the property not to be lazy loaded in certain queries? In those cases you can use the familiar LoadWith option to load properties. LoadWith not only works with foreign key tables; you can also use it to load lazy-loaded properties. Figure 8 demonstrates loading the Picture property when the Category entity is retrieved. This time, the generated SQL includes the Picture column, despite the fact that the Picture property on the Category entity has deferred loading enabled.
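A sketch of overriding the delay-loaded setting for a particular context, relying on the LoadWith behavior described above (the Category entity and Picture column from Figure 6 are assumed):

var options = new DataLoadOptions();
options.LoadWith<Category>(c => c.Picture);   // per the tip above, LoadWith also accepts a delay-loaded column
db.LoadOptions = options;

var categories = db.Categories.ToList();      // the generated SQL now includes the Picture column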
Figure 8: The Picture column is loaded when we fetch the Category entity from the database.
Tip Five: Use Table-valued Functions with Eager Loading Turned On
User-defined functions provide composability for LINQ to SQL queries compared to stored procedures. When you call a stored procedure from your DataContext, the collection returned is IEnumerable. If further transformations are applied to the query, such as a where or order by clause, they get executed in memory. These kinds of queries are deceiving, because they compile fine and do not throw exceptions at run time. Because part of the query gets executed on SQL Server and any transformations applied are done in memory, you could end up bringing lots of rows to the Web server, which could increase your network traffic and memory consumption. However, table-valued functions return IQueryable and are marked with the IsComposable attribute set to True. This allows us to further compose the query by joining it to another table and adding where, order by, or any other LINQ operators. The entire query then gets sent to SQL Server for execution.
Table-valued functions work well, but they tend to give incorrect results if you turn on eager loading for child collections. For instance, if I have a table-valued function that returns categories and I am eager-loading the products for each category as well, then when I print the categories the query returns the correct number of categories, but they are all the same. Figure 9 shows a table-valued function named BvgAndSeaFood that returns two categories (Beverages and Produce). To use the function, I dragged it onto the Categories table in the LINQ to SQL designer to indicate that my function returns a collection of Category objects. In the query, I made use of DataLoadOptions to eagerly load products for the categories returned from my function named BvgAndSeaFood. Once I've configured my DataLoadOptions, I assigned the object to the DataContext. On printing the categories to the output screen, the Beverages category is printed both times. If you don't use DataLoadOptions to preload products, the results printed on the console will be Beverages and Produce, as dictated by my table-valued function. This leads me to believe there is something wrong with the implementation of table-valued functions when you turn on eager loading.
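A sketch of the scenario in Figure 9 follows. The function mapping shown is the general shape the designer generates for a composable function; the method body, attribute values, and client code here are assumptions rather than the article's exact listing (requires System.Data.Linq.Mapping and System.Reflection).

// Inside a partial class of the generated DataContext (simplified):
[Function(Name = "dbo.BvgAndSeaFood", IsComposable = true)]
public IQueryable<Category> BvgAndSeaFood()
{
    return this.CreateMethodCallQuery<Category>(this, (MethodInfo)MethodBase.GetCurrentMethod());
}

// Client code: eager-load Products for the categories the function returns.
var options = new DataLoadOptions();
options.LoadWith<Category>(c => c.Products);
db.LoadOptions = options;

foreach (var c in db.BvgAndSeaFood())
    Console.WriteLine(c.CategoryName);   // prints "Beverages" twice instead of two distinct categories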
Figure 9: This query returns the Beverages category both times, even though our function dictates returning Beverages and Produce.
As mentioned earlier, one way to get around the problem is to join the Category table and select a category from the Category table. Figure 10 shows the working version, where I get the correct categories with no duplicates printed in the console window.
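A sketch of the workaround in Figure 10: join the function's results back to the Categories table and project the table's entity rather than the function's result (the BvgAndSeaFood mapping and CategoryID key from the previous sketch are assumed).

var query = from f in db.BvgAndSeaFood()
            join c in db.Categories on f.CategoryID equals c.CategoryID
            select c;    // select the Category from the table, not from the function

foreach (var c in query)
    Console.WriteLine(c.CategoryName);   // Beverages and Produce, with no duplicates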
Figure 10: Joining categories returned from the function to the Category table removes duplicates of the Beverages category.
Tip Six: Put Joins in the Correct Order in a LINQ to Objects Query
If you are doing lots of joins in memory using LINQ to Objects, you must be aware of the order in which the joins are applied. For instance, if you have two sequences in memory and you have to join them, always use the smaller sequence on the right side of your query and the larger sequence on the left side. It makes a considerable difference in performance. Figure 11 shows the time taken to join a large collection to a small collection. To create my first array, nums1, I used the Range operator. The Range operator takes the start position from which to begin generating numbers; the second parameter indicates how many numbers to generate. Looking at Figure 11, you can see that nums1 is a large array. I then created my second array using the same Range operator, but gave it a small count, which resulted in a smaller array.
Figure 11: A join is optimized because the smaller array is on the right side and the larger array is on the left side.
I then joined both sequences, with the smaller sequence nums2 on the right side and the larger sequence nums1 on the left side. The time taken to execute the query was 0.18 seconds. If I were to rewrite the query slightly differently by putting the smaller array on the left and the larger array on the right, the query execution time would dramatically increase to 2.6 seconds (see Figure 12). Not only does the query time increase, but memory consumption also goes up, because the LINQ query engine must load the entire collection on the right side; if that happens to be the larger collection, it increases the query time and occupies more memory.
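A sketch of the two orderings from Figures 11 and 12 follows; the array sizes and timing code are assumptions, not the article's exact figures.

using System;
using System.Diagnostics;
using System.Linq;

class JoinOrder
{
    static void Main()
    {
        var nums1 = Enumerable.Range(1, 10000000).ToArray();   // large array (size assumed)
        var nums2 = Enumerable.Range(1, 1000).ToArray();        // small array (size assumed)

        Stopwatch sw = Stopwatch.StartNew();
        // Small array on the right: only nums2 is buffered by the join.
        int fast = (from a in nums1
                    join b in nums2 on a equals b
                    select a).Count();
        Console.WriteLine("Small array on the right: " + sw.Elapsed + " (" + fast + " matches)");

        sw = Stopwatch.StartNew();
        // Large array on the right: the entire nums1 is buffered, costing time and memory.
        int slow = (from b in nums2
                    join a in nums1 on b equals a
                    select b).Count();
        Console.WriteLine("Large array on the right: " + sw.Elapsed + " (" + slow + " matches)");
    }
}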
Figure 12: The larger array on the right side of the join causes the query time to increase.
Tip Seven: Compose a LINQ Query Inside a Loop
LINQ queries can be composed by dynamically building a lambda statement and appending it to the existing query. For instance, you can loop through an array and for each iteration build a lambda statement and add it to an existing query. If the lambda statement makes use of a looping variable, results would be inconsistent if the looping variable is not captured correctly. These side effects return misleading results and introduce bugs that are very difficult to find.
The problem can be explained by a simple demonstration. In Figure 13, I have an array of cities containing London and Paris. I also have a collection of customers, where some customers are from London and some are from Paris. I loop through my array of cities and build a LINQ query. Inside the loop, I create a lambda statement that matches the city in the array and picks the first customer. When I run the query, I expect to get two customers, one from London and one from Paris. However, the output in the console window only shows a customer with CustID 3 from Paris. The result is incorrect because the variable city remains in scope and changes with every iteration of the loop. The last iteration of the loop leaves the value of Paris, so all predicates using the city variable in their lambda expression are affected and use the new value of Paris.
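A sketch of the kind of composition that triggers the problem follows. A hypothetical Customer class with CustID and City properties and an in-memory customers list are assumed; note that the article's behavior reflects the C# 3.0/4.0 compilers, where the foreach variable is shared across iterations (C# 5 changed foreach, but a for loop still exhibits the same pitfall).

string[] cities = { "London", "Paris" };
IEnumerable<Customer> query = Enumerable.Empty<Customer>();

foreach (string city in cities)
{
    // The lambda captures the variable 'city' itself, not its current value.
    query = query.Concat(customers.Where(c => c.City == city).Take(1));
}

// Execution is deferred until here, when 'city' holds its last value ("Paris"),
// so every predicate filters on Paris.
foreach (var c in query)
    Console.WriteLine("{0}: {1}", c.CustID, c.City);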
Figure 13: The side effects that occur if the looping variable is not captured correctly when building LINQ queries dynamically.
To solve the problem, we must declare another variable inside the loop that captures the value of the looping variable. When the looping variable changes as you iterate the loop, our local variable is not affected. The local variable goes out of scope at the end of each iteration of the loop and, hence, is not affected by the changing value of the looping variable. Consequently, with each iteration of the loop we create a lambda statement that captures a different local variable.
To avoid the side effect of the city variable being changed, I can declare another variable, _city, inside the loop, which captures the value of the city at that moment and passes that to the lambda statement (see Figure 14). Once that iteration ends, the local variable _city goes out of scope and does not cause any side effects. The output printed on the console confirms we are getting two customers, one from London and one from Paris.
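The corrected sketch, mirroring the fix described for Figure 14 (same assumed Customer class and customers list as above):

IEnumerable<Customer> query = Enumerable.Empty<Customer>();

foreach (string city in cities)
{
    string _city = city;   // capture the value for this iteration only
    query = query.Concat(customers.Where(c => c.City == _city).Take(1));
}

// Each lambda now holds its own copy, so one London and one Paris customer are returned.
foreach (var c in query)
    Console.WriteLine("{0}: {1}", c.CustID, c.City);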
Figure 14: Declare a local variable to avoid the side effects of looping variables and, thus, return correct customers from London and Paris.
Conclusion
LINQ queries make the code more readable because the code specifies what needs to be done and leaves to the compiler the underlying details of how it needs to be done. Although LINQ abstracts different programming domains, such as LINQ to Objects, LINQ to XML, and LINQ to SQL, it cannot hide the complexity and different behavior of each domain. It is essential that we understand how each domain interprets the LINQ query and ensure that a particular query works optimally for the domain we are targeting.
In this installment we explored different query behaviors specific to each domain, such as how where clauses are executed in memory as compared to being applied on the database, and what performance tradeoffs we must consider. We also learned how aggregate behavior varies in each domain, where LINQ to Objects would return 0 for empty collections and LINQ to SQL would return null. It is important that we understand these differences and ensure that our query returns correct results for the specific domain we are targeting.
To improve query performance, we learned how to delay-load a particular column in the database that is not used often. We also discussed some of the constraints of using table-valued functions when eager loading is turned on, and how to overcome those inconsistencies. We also talked about how to put joins in the proper order to ensure a query runs faster and does not occupy lots of memory. Finally, we learned that when you are composing queries inside a loop, you must take precautions to capture the looping variable correctly; otherwise, your LINQ expressions will filter data based on the last value of the looping variable, resulting in incorrect results.
We'll continue our exploration in Part II by introducing the caching features offered by the object tracking service, the scenarios in which caching is used, and how it ensures object reference consistency. We'll also discuss how LINQ to SQL automatically maps dynamic SQL to entities, and how a non-default constructor can prevent composing queries from other queries. We'll also learn how to understand and identify whether a query is being sent to the database for execution or is being applied in memory.
Source code accompanying this article is available for download.
Zeeshan Hirani is a senior developer at http://CheaperThanDirt.com. He specializes mainly in ASP.NET, AJAX, and leveraging OR-mapping solutions like LINQ and Entity Framework in business applications. He has written several articles for http://CodeProject.com and is an active member in the user-group community in Dallas/Fort Worth. He is a frequent blogger. One of his greatest achievements was writing a 500-page learning guide that walks you through different concepts and modeling scenarios supported by Entity Framework. The learning guide can be downloaded at http://weblogs.asp.net/zeeshanhirani/archive/2008/12/18/my-christmas-present-to-the-entity-framework-community.aspx. You can find his blog at http://weblogs.asp.net/zeeshanhirani. Contact him with any questions at [email protected].