posted 5/1/2012 by pkumar3
SQL Query Performance Tips:
Difference between Clustered and Non-Clustered Index Data Structures
When you first create a new table, no index is created by default. In technical terms, a table without a clustered index is called a “heap”. Rows in a heap are stored in no particular order, and although a simple query often appears to return them in insertion order, that order is not guaranteed without an ORDER BY clause. A non-clustered index is a type of index in which the logical order of the index does not match the physical stored order of the rows on disk. The leaf nodes of a non-clustered index do not consist of the data pages. Instead, the leaf nodes contain index rows that point to the data rows.
A clustered index is a type of index that determines the order in which records in the table are physically stored. Therefore a table can have only one clustered index. The leaf nodes of a clustered index contain the data pages.
A clustered index is a good choice when you know in which order you will be returning the records in most cases. If the ORDER BY clause of a query matches the clustered index key, SQL Server can usually read the rows in index order and skip the sort step, which can be much faster. You still need the ORDER BY clause, because it is the only way to guarantee the order of the result. If the order is not important to you and you do not create a clustered index yourself, then the primary key will become the clustered index by default. There is nothing wrong with not having a clustered index; it can speed up inserting rows.
1. Use Clustered Indexes
The primary key is sometimes not the most efficient place for the clustered index. A clustered index is usually the most performant type of index, because the whole table is sorted according to it. If the table is involved in lots of joins based on the primary key, the primary key is probably the right place for it, but if you are continually filtering or grouping on other columns in the table, then you should consider changing the primary key index to non-clustered and putting the clustered index on those filtered or grouped columns.
The following statements remove an existing clustered index on the primary key and replace it with a non-clustered index:
ALTER TABLE MySchema.SalesOrderHeader
DROP CONSTRAINT PK_SalesOrderHeader;
GO
-- ADD CONSTRAINT needs its own ALTER TABLE statement
ALTER TABLE MySchema.SalesOrderHeader
ADD CONSTRAINT PK_SalesOrderHeader
PRIMARY KEY NONCLUSTERED (SalesOrderID);
Then the following statement adds a new clustered index to the table:
CREATE CLUSTERED INDEX MyClusteredIndex
ON MySchema.SalesOrderHeader (OrderID)
Indexed views have been around for a while. A view is like a named query, and you can add indexes to one. If used correctly, indexed views can massively improve execution times, often more than a clustered index with covering columns on the original table. In addition, in SQL Server Developer Edition and Enterprise Edition, the optimiser will automatically use a view index if it is the best index, even if you don’t actually specify the view in your query!
CREATE VIEW MySchema.SalesByCustomer
WITH SCHEMABINDING
AS
SELECT soh.SalesTerritoryID, soh.CustomerID,
-- every column in a view needs a name, so the aggregate gets an alias
SUM(sod.Quantity * sod.UnitPrice) AS TotalSales,
-- COUNT_BIG(*) is required before a view that uses GROUP BY can be indexed
COUNT_BIG(*) AS DetailRowCount
FROM MySchema.SalesOrderHeader soh
INNER JOIN MySchema.SalesOrderDetail sod
ON (soh.SalesOrderID = sod.SalesOrderID)
GROUP BY soh.SalesTerritoryID, soh.CustomerID
Note the use of the SCHEMABINDING attribute. This prevents you from changing the underlying tables in ways that would affect the view while it exists, and it is required if you want to add an index. Some people avoid indexed views for this reason, as maintenance becomes more complicated as further dependencies on the view are created. The following statement adds an index:
CREATE UNIQUE CLUSTERED INDEX IdxSalesOrderView
ON MySchema.SalesByCustomer(
SalesTerritoryID, CustomerID
)
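As a rough sketch of how the indexed view might be queried, the statement below reads the pre-aggregated rows directly. It assumes the view exposes its sum under the name TotalSales (an assumed alias) and uses a territory value chosen only for illustration. On editions where the optimiser does not match indexed views automatically, the NOEXPAND hint asks it to use the view's index instead of expanding the view into its base tables.

```sql
-- Sketch only: TotalSales is an assumed column alias, and 5 is an illustrative value.
SELECT sbc.SalesTerritoryID, sbc.CustomerID, sbc.TotalSales
FROM MySchema.SalesByCustomer AS sbc WITH (NOEXPAND)
WHERE sbc.SalesTerritoryID = 5;
```

Filtering on SalesTerritoryID, the leading column of the view's clustered index, lets the query seek rather than scan.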
Covering indexes became much easier to build in SQL Server 2005, which added the INCLUDE clause. Basically, you can create an index optimised for the query itself based on joins, filters and grouping, and then add additional columns that can be retrieved directly from the index for use in select statements, as follows:
CREATE NONCLUSTERED INDEX TestIndex
ON MySchema.SalesOrderDetail(OrderId)
INCLUDE (Quantity, UnitPrice)
The above statement creates a non-clustered index on the SalesOrderDetail table. If a query filters on the OrderId column, the index will be used, and if the only other columns being retrieved are Quantity and UnitPrice, then the query optimiser doesn’t need to retrieve any extra columns from the underlying table. It can just use the index. Because the query optimiser doesn’t need to touch the original table, performance is improved.
One thing you need to consider when deciding where to put your clustered index is how big the key for that index will be. The clustered index key is also stored in every row of every non-clustered index on the table, where it acts as the pointer back to the data row. So if you have a wide clustered index key on a table with a decent number of rows, the size of the non-clustered indexes could blow out significantly. Having no clustered index on the table does not avoid this cost, because each non-clustered index row then stores a row identifier (RID), which is 8 bytes per row.
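To make the idea concrete, here is a minimal sketch of a query that the index above would cover. The OrderId value is illustrative only.

```sql
-- The filter column (OrderId) is the index key, and the selected columns
-- (Quantity, UnitPrice) are INCLUDEd, so the index alone can answer this query.
SELECT sod.Quantity, sod.UnitPrice
FROM MySchema.SalesOrderDetail AS sod
WHERE sod.OrderId = 1001;  -- illustrative value
```

Selecting any column that is not part of the index would stop the index from covering the query, and the optimiser would have to go back to the table for it.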
Avoiding cursors is a bit of a no-brainer. Cursors perform poorly because every FETCH statement executed is equivalent to another SELECT statement execution that returns a single row. The optimiser can’t optimise a cursor loop as a whole; instead it optimises the queries within each execution of the loop, which is undesirable. Given that most cursor code can be rewritten using set logic, cursors should generally be avoided.
Archiving old data is another no-brainer, so I won’t say much. If you want to improve query performance, give the optimiser less work to do. If you can cut down the number of rows the query has to deal with, then performance will improve. I have no problem with people creating audit triggers to move historical data into other tables for this reason. Alternatively, if you don’t need your data after a certain period of time, back up your database and remove the data.
These days, you don’t actually have to move old data out of a table to improve query performance. You can partition your table into a number of data segments based on a partition function. The query optimiser can use the partition function to read only the partitions that can contain matching rows. To create partitions, you need a partition function and a partition scheme.
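As an illustration, here is a sketch of the same change written both ways. The 10% discount and the CustomerID value are hypothetical, and the two versions are alternatives: run one or the other, not both.

```sql
-- Row-by-row version: one FETCH and one UPDATE per order.
DECLARE @SalesOrderID int;
DECLARE order_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT SalesOrderID
    FROM MySchema.SalesOrderHeader
    WHERE CustomerID = 42;  -- illustrative value
OPEN order_cursor;
FETCH NEXT FROM order_cursor INTO @SalesOrderID;
WHILE @@FETCH_STATUS = 0
BEGIN
    UPDATE MySchema.SalesOrderDetail
    SET UnitPrice = UnitPrice * 0.9
    WHERE SalesOrderID = @SalesOrderID;
    FETCH NEXT FROM order_cursor INTO @SalesOrderID;
END;
CLOSE order_cursor;
DEALLOCATE order_cursor;

-- Set-based version: a single statement that the optimiser can plan as a whole.
UPDATE sod
SET UnitPrice = sod.UnitPrice * 0.9
FROM MySchema.SalesOrderDetail AS sod
INNER JOIN MySchema.SalesOrderHeader AS soh
    ON soh.SalesOrderID = sod.SalesOrderID
WHERE soh.CustomerID = 42;  -- illustrative value
```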
CREATE PARTITION FUNCTION myRangePartitionFunction(int)
AS RANGE RIGHT FOR VALUES (1,100,1000)
Once the partition function is created, you can then use it in a partition scheme, which maps each partition to a filegroup.
CREATE PARTITION SCHEME myRangePartitionScheme
AS PARTITION myRangePartitionFunction
TO (filegrp1, filegrp2, filegrp3, filegrp4)
Then it’s just a matter of creating the table to use the partition scheme on the column you decided to partition on:
CREATE TABLE mySchema.myPartitionTable(
col1 int,
col2 nvarchar(100)
)
ON myRangePartitionScheme(col1)
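To check how values map to partitions, you can call the partition function through $PARTITION. The sketch below assumes the function and table defined above; the column aliases are made up for the example.

```sql
-- With RANGE RIGHT and boundary values (1, 100, 1000), each boundary value
-- belongs to the partition on its right.
SELECT $PARTITION.myRangePartitionFunction(0)    AS PartitionFor0,     -- 1
       $PARTITION.myRangePartitionFunction(1)    AS PartitionFor1,     -- 2
       $PARTITION.myRangePartitionFunction(100)  AS PartitionFor100,   -- 3
       $PARTITION.myRangePartitionFunction(1000) AS PartitionFor1000;  -- 4

-- Count the rows that currently sit in each partition of the table.
SELECT $PARTITION.myRangePartitionFunction(col1) AS PartitionNumber,
       COUNT(*) AS RowsInPartition
FROM mySchema.myPartitionTable
GROUP BY $PARTITION.myRangePartitionFunction(col1);
```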
The APPLY operator was created for the situation where you would otherwise put multiple inline nested queries in one statement. For example, take the following statement:
SELECT soh.SalesOrderID,
Quantity=(SELECT TOP 1 (Quantity)
FROM Sales.SalesOrderDetail
WHERE SalesOrderID = soh.SalesOrderID),
UnitPrice=(SELECT TOP 1 (UnitPrice)
FROM Sales.SalesOrderDetail
WHERE SalesOrderID = soh.SalesOrderID)
FROM Sales.SalesOrderHeader soh
This performs an extra query for each column, retrieving data from the same table using the same criterion. It can now be replaced with the following:
SELECT soh.SalesOrderID, soh.OrderDate, a.*
FROM Sales.SalesOrderHeader soh
CROSS APPLY (
SELECT TOP (1) sod.UnitPrice, sod.Quantity
FROM Sales.SalesOrderDetail sod
WHERE sod.SalesOrderId = soh.SalesOrderId
ORDER BY sod.Quantity DESC
) as a
Computed columns are derived from other columns in a table. By creating and indexing a computed column, you can turn what would otherwise be a scan into a seek. For example, if you needed to filter on SalesPrice and you only had Quantity and UnitPrice columns, multiplying them inline in the query would cause a table scan, as the two columns are multiplied together for every single row. Create a computed column called SalesPrice, then index it, and the query optimiser will no longer need to retrieve the UnitPrice and Quantity data and do the calculation, because it has already been done.
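A minimal sketch of the SalesPrice example follows. The table name is borrowed from the earlier examples, and the index name and filter value are hypothetical.

```sql
-- Add the computed column; PERSISTED stores the result instead of recalculating it on every read.
ALTER TABLE MySchema.SalesOrderDetail
ADD SalesPrice AS (Quantity * UnitPrice) PERSISTED;
GO
-- Index the computed column (the index name is hypothetical).
CREATE NONCLUSTERED INDEX IX_SalesOrderDetail_SalesPrice
ON MySchema.SalesOrderDetail (SalesPrice);
GO
-- A filter on the computed column can now seek on the index
-- instead of multiplying the two columns for every row.
SELECT SalesPrice
FROM MySchema.SalesOrderDetail
WHERE SalesPrice > 1000;  -- illustrative value
```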
• If you have a query that uses ORs and it is not making the best use of indexes, consider rewriting it as a UNION and then testing performance. Only through testing can you be sure that one version of your query will be faster than another.
• Queries that include either the DISTINCT or the GROUP BY clauses can be optimized by including appropriate indexes, for example an index on the columns named in the DISTINCT or GROUP BY clause.
• Ideally a clustered index should be based on a single column (not multiple columns) that is as narrow as possible. This not only reduces the clustered index's physical size, it also reduces the physical size of non-clustered indexes and boosts SQL Server's overall performance.
• When you create a clustered index, try to create it as a unique clustered index, not a non-unique clustered index. If the key is not unique, SQL Server adds a hidden 4-byte uniquifier to duplicate key values, which makes the clustered index and every non-clustered index larger.
• Use SET NOCOUNT ON at the beginning of every stored procedure, trigger, etc. that you write.
• If you are creating a stored procedure to run in a database other than the Master database, don't use the prefix sp_ in its name. This special prefix is reserved for system stored procedures. Although using this prefix will not prevent a user defined stored procedure from working, what it can do is to slow down its execution ever so slightly. Any stored procedures prefixed with 'sp_' are first searched for in the Master database rather than the one it is created in. This will cause a delay in the stored procedure being executed.
• If you use input parameters in your stored procedures, you should validate all of them at the beginning of your stored procedure. This way, if there is a validation problem and the client application needs to be notified of the problem, it happens before any stored procedure processing takes place, preventing wasted effort and boosting performance.
• For best performance, all objects that are called within the same stored procedure should be owned by the same object owner or schema, preferably dbo, and should also be referred to in the format of object_owner.object_name or schema_owner.object_name.
• If you think a stored procedure will return only a single value and not a record set, consider returning the single value as an output parameter.
• Use stored procedures instead of views. They offer better performance.
• Don't be afraid to make liberal use of in-line and block comments in your Transact-SQL code. They will not affect the performance of your application and they will enhance your productivity when you have to come back to the code and try to modify it.
• If possible, avoid using SQL Server cursors. They generally use a lot of SQL Server resources and reduce the performance and scalability of your applications.
• If you have the choice of using a join or a sub-query to perform the same task within a query, generally the join is faster. This is not always the case, however, and you may want to test the query using both methods to determine which is faster for your particular application.
• Instead of using temporary tables, consider using a derived table. A derived table is the result of using a SELECT statement in the FROM clause of an existing SELECT statement. By using derived tables instead of temporary tables, you can reduce I/O and often boost your application's performance.
• For better performance, if you need a temporary table that holds only a small number of rows in your Transact-SQL code, consider using a table variable instead of creating a conventional temporary table.
• Don't repeatedly reuse the same function to calculate the same result over and over within your Transact-SQL code.
• If you need to store large strings of data and they are 8,000 characters or fewer, use a VARCHAR data type instead of a TEXT data type. TEXT data types have extra overhead that drags down performance.
• Don't use the NVARCHAR or NCHAR data types unless you need to store 16-bit character (Unicode) data. They take up twice as much space as VARCHAR or CHAR data types, increasing server I/O and wasting space in your buffer cache.
• Don't use the DATETIME data type as a primary key. From a performance perspective, it is more efficient to use a data type that uses less space. For example, the DATETIME data type uses 8 bytes of space, while the INT data type only takes up 4 bytes. The less space used, the smaller the table and index, and the less I/O overhead that is required to access the primary key.
• When you have a choice of using the IN or the EXISTS clause in your Transact-SQL, you will generally want to use the EXISTS clause, as it is usually more efficient and performs faster.
• When you have a choice of using the IN or the BETWEEN clauses for a contiguous range of values in your Transact-SQL, you will generally want to use the BETWEEN clause, as it is usually more efficient. E.g.
SELECT task_id, task_name
FROM tasks
WHERE task_id in (1000, 1001, 1002, 1003, 1004)
...is usually less efficient than this:
SELECT task_id, task_name
FROM tasks
WHERE task_id BETWEEN 1000 AND 1004
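The IN versus EXISTS tip in the list above can be sketched the same way. The tables come from the earlier examples and the Quantity filter is hypothetical. The optimiser often produces the same plan for both forms, so compare the execution plans on your own data before settling on one.

```sql
-- IN form: the subquery produces a list of SalesOrderID values.
SELECT soh.SalesOrderID
FROM Sales.SalesOrderHeader AS soh
WHERE soh.SalesOrderID IN (SELECT sod.SalesOrderID
                           FROM Sales.SalesOrderDetail AS sod
                           WHERE sod.Quantity > 10);

-- EXISTS form: the correlated subquery only has to find one matching row per order.
SELECT soh.SalesOrderID
FROM Sales.SalesOrderHeader AS soh
WHERE EXISTS (SELECT 1
              FROM Sales.SalesOrderDetail AS sod
              WHERE sod.SalesOrderID = soh.SalesOrderID
                AND sod.Quantity > 10);
```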
Are you using UNION instead of UNION ALL? A UNION statement effectively does a SELECT DISTINCT on the result set. If you know that all the records returned from your union are unique, use UNION ALL instead; it is much quicker. E.g.
DECLARE @Table1 TABLE (Col INT)
INSERT INTO @Table1
SELECT 1
UNION ALL
SELECT 2
UNION ALL
SELECT 3
UNION ALL
SELECT 4
UNION ALL
SELECT 5
DECLARE @Table2 TABLE (Col INT)
INSERT INTO @Table2
SELECT 6
UNION ALL
SELECT 7
UNION ALL
SELECT 8
/* Result of Union operation */
SELECT Col AS [Union]
FROM @Table1
UNION
SELECT Col
FROM @Table2
/* Result of Union All operation: the same rows here, because the tables share no values, but without the duplicate-removal step */
SELECT Col AS [UnionALL]
FROM @Table1
UNION ALL
SELECT Col
FROM @Table2
Is SET NOCOUNT ON being used? By default, every time a statement in a stored procedure is executed, a message is sent from the server to the client indicating the number of rows that were affected by that statement. If you don't need this feature, you can reduce network traffic between the server and the client by adding SET NOCOUNT ON at the beginning of your stored procedure.
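Several of the stored procedure tips above can be combined in one short sketch. The procedure name, parameter names and validation rule are all hypothetical, and the table is borrowed from the earlier examples.

```sql
-- Hypothetical procedure: no sp_ prefix, schema-qualified names throughout.
CREATE PROCEDURE dbo.GetCustomerOrderCount
    @CustomerID int,
    @OrderCount int OUTPUT   -- a single value is returned as an output parameter
AS
BEGIN
    SET NOCOUNT ON;  -- suppress the "rows affected" messages

    -- Validate the input before doing any real work.
    IF @CustomerID IS NULL OR @CustomerID <= 0
    BEGIN
        RAISERROR('CustomerID must be a positive integer.', 16, 1);
        RETURN 1;
    END;

    SELECT @OrderCount = COUNT(*)
    FROM Sales.SalesOrderHeader
    WHERE CustomerID = @CustomerID;

    RETURN 0;
END;
GO

-- Example call (42 is an illustrative value).
DECLARE @Count int;
EXEC dbo.GetCustomerOrderCount @CustomerID = 42, @OrderCount = @Count OUTPUT;
SELECT @Count AS OrderCount;
```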