Presenting one of the most interesting topics in SQL performance tuning: the set-based versus the procedural approach in SQL. I would like to share it with you all, as I found this article very useful.
What is "Procedural approach"?
Well, simply speaking, the procedural approach is the "programmatic approach" that we are used to in our daily programming life. In this approach, we tell the system "what to do" along with "how to do" it. We query the database to obtain a result set and write the data manipulation logic using loops, conditions, and processing statements to produce the final result. The runtime does whatever we want it to do, however we want it done.
In T-SQL, any User Defined Function (UDF) or Cursor that executes on a result set row by row is a procedural approach. For example, if you are querying your database to obtain a result set and using a Cursor to navigate the result set to do further processing row by row, you are using a procedural approach. Also, if you are using a User Defined Function in your SQL to process each row in the result set and calculate a scalar output, you are using a procedural approach. Looking for example code? Continue reading.
What is "Set based approach"?
The set-based approach is an approach that lets you specify "what to do", but does not let you specify "how to do" it. That is, you just specify your requirement for a processed result that has to be obtained from a "set of data" (be it a simple table/view, or joins of tables/views), filtered by optional condition(s). Sometimes, the "sets" you would like to retrieve data from are specified using complex joins, subqueries, or conditional CASE expressions, but in the end, there is a set of data from which the resultant set has to be obtained. You never have to specify "how" the data retrieval operation has to be implemented internally. You never have to specify how to implement the "joining operation" internally either. Also, you don't have to specify how to apply a filter condition against the rows. The database engine determines the best possible algorithms or processing logic to do these.
For example, following is SQL that is written using "Set based approach":
SELECT Product.ProductName, Category.CategoryName
FROM Product INNER JOIN Category
ON Product.CategoryID = Category.ID
WHERE Price > 100
In the above SQL, "Product INNER JOIN Category ON Product.CategoryID = Category.ID" is the "set" of data from which ProductName and CategoryName have to be displayed for those products that have a Price > 100. You just specify your requirements and conditions, and the SQL engine does the rest to produce the result.
Why is "Set based approach" better than "Procedural approach"?
Frankly speaking, the internal execution engines of databases are designed and optimized for taking "set based instructions" as input (SQL) and executing these instructions in the best possible way (that varies based on a lot of criteria) to produce an output. That's why "Set based approaches" are almost always the better option.
Let us try to understand this with an example.
We often need to do joins between two or more sets of data (say, two tables) to obtain a desired result set. While joining, we only specify the participating table/view names and join conditions. Now, it's the SQL engine which has to implement the actual "joining operation" based upon the join conditions that we provide.
In the SQL Server execution engine, there are three types of joining techniques. These are:
- Merge join: Generally the cheapest joining algorithm. Takes place when both inputs are joined on indexed columns that are already sorted. In a merge join operation, SQL Server scans once through each sorted input and merges the matching rows together.
- Nested loop join: Moderate-cost joining algorithm. Takes place when one of the inputs contains a small number of records (the outer input) compared to the other participating input (the inner input, ideally indexed on the join column). For each row in the outer input, SQL Server searches the inner input for the matching rows.
- Hash join: Most expensive joining algorithm. Takes place when large, unsorted, non-indexed columns are used in the join condition. The SQL Server engine builds a hash table on the smaller input by applying a hash function to the join column of each row, then scans the larger input and probes the hash table to find the matching rows.
Now, whenever we specify any join predicate in any SQL in SQL Server, based upon the participating columns, data volume, indexing structure, and the set of values in the participating columns, SQL Server determines and uses the best possible joining algorithm and implements the logic to perform the actual joining operation in memory. Note that we don't have to specify the type of joining algorithm in the SQL. SQL Server does that, and does its best to provide the result as fast as possible.
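Although we normally leave the choice to the optimizer, SQL Server does let you inspect it, and even override it with a query hint. A sketch against the Product/Category tables used above (forcing an algorithm is rarely a good idea in production; this is only for experimentation):

```sql
-- Ask SQL Server to return the estimated plan (including the physical
-- join operator it chose) instead of executing the query.
SET SHOWPLAN_ALL ON;
GO
SELECT Product.ProductName, Category.CategoryName
FROM Product INNER JOIN Category
    ON Product.CategoryID = Category.ID
WHERE Price > 100;
GO
SET SHOWPLAN_ALL OFF;
GO

-- Or force a specific algorithm with a query hint, e.g. a hash join,
-- to compare its cost against the optimizer's own choice.
SELECT Product.ProductName, Category.CategoryName
FROM Product INNER JOIN Category
    ON Product.CategoryID = Category.ID
WHERE Price > 100
OPTION (HASH JOIN);
```

Comparing the plans with and without the hint is a quick way to convince yourself that the optimizer usually picks the cheaper algorithm on its own.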
That's where the database engine is in love with "Set based" approaches.
Another important fact is that whenever a SQL statement written in the "Set based approach" is issued to the database, the query optimizer generates an execution plan first, and then the execution engine executes that plan to retrieve data from the physical storage and produce the output efficiently. That is, there is a single execution plan tree for each single SQL statement, be it simple or complex. Executing that single execution plan tree is generally a fast operation.
But, when we specify our own way of processing a result set (obtained by executing a SQL statement) using another SQL statement that works row by row on the result set (perhaps via a UDF), the database engine has to execute an execution plan for each and every row, on top of the plan already executed to obtain the result set. Imagine a row-by-row operation executed over a result set containing 1 million rows. In this case, the initial data retrieval requires one execution plan, and then another execution plan has to be executed 1 million times, once per row. That's what happens when a User Defined Function (UDF) is executed for each row in a result set. On top of that, the function invocation itself adds per-row overhead.
On the other hand, if you use a Cursor to process a result set row by row, the Cursor locks the rows in the corresponding table while executing, and unlocks them when processing is done. This consumes significant server resources and, for large result sets, severely slows down performance.
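For reference, here is a minimal sketch of the kind of row-by-row Cursor processing described above (the Employee table from the example later in this article is reused; the per-row work is just a placeholder):

```sql
-- Row-by-row processing with a cursor: one FETCH (and its associated
-- locking/bookkeeping work) per row -- avoid this for large result sets.
DECLARE @ID int, @Name varchar(50);

DECLARE emp_cursor CURSOR FAST_FORWARD FOR
    SELECT ID, Name FROM Employee;

OPEN emp_cursor;
FETCH NEXT FROM emp_cursor INTO @ID, @Name;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- per-row processing would go here
    PRINT @Name;
    FETCH NEXT FROM emp_cursor INTO @ID, @Name;
END

CLOSE emp_cursor;
DEALLOCATE emp_cursor;
```

Note that every iteration of the WHILE loop is a separate round trip through the execution engine, which is exactly the per-row cost the set-based approach avoids.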
Benchmarks of this kind generally show that using a UDF for row-by-row operations on up to around 1,000 rows may perform within an acceptable range. But as the number of rows increases, the UDF results in dramatically slower performance. The same goes for Cursors.
That is why "Set based SQL" almost always outperforms "Procedural SQL", especially as the result set to process becomes large.
OK, understood. Now tell me how to get rid of "Procedural approaches"?
It's not hard. Just apply these simple tricks and refactor some of your SQL.
Use inline subqueries to replace User Defined Functions
Let's assume that, for a self-referential table Employee(ID, Name, MgrID), there is a query written in the "Procedural approach" using a User Defined Function. The query outputs employee names and the corresponding manager names.
Here is the query:
SELECT Name AS [Employee Name], dbo.fnGetManagerName(MgrID) AS [Manager Name]
FROM Employee
dbo.fnGetManagerName(MgrID) is a UDF that returns the manager's name (which is nothing but another employee in the same Employee table) as follows:
ALTER FUNCTION [dbo].[fnGetManagerName](@ID int) RETURNS VARCHAR(50)
AS
BEGIN
    DECLARE @ManagerName varchar(50)
    SELECT @ManagerName = Name FROM Employee WHERE ID = @ID
    RETURN @ManagerName
END
The above "Procedural SQL" could be rewritten using a subquery in the "Set based approach" as follows:
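A sketch of the set-based rewrite (the correlated subquery looks up the manager's name inline, so no UDF is invoked per row):

```sql
SELECT E.Name AS [Employee Name],
       (SELECT M.Name FROM Employee M WHERE M.ID = E.MgrID) AS [Manager Name]
FROM Employee E
```

Equivalently, a LEFT JOIN of Employee to itself on M.ID = E.MgrID would produce the same output and often performs at least as well; either way, the optimizer handles the lookup inside a single execution plan instead of calling a function once per row.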