2012/11/05

Can a Table Really Have a Clustered Index?

The terms 'Clustered Index' and 'Non-Clustered Index' are not very intuitively worded for the concepts they represent. Besides being non-intuitive, they also imply a degree of similarity with each other. Perhaps this leads to a certain level of misunderstanding of these terms by not only those who are new to databases, but also by experienced data professionals who sometimes stumble while explaining them. Here is my attempt at providing a simple explanation of what they mean. I will also use this explanation to argue why a table cannot really "have" a clustered index.

Say you have a table that looks like this:

Employee_ID Employee_Name
----------------------------------------------------------
1 DeeDee
5 Jennie
6   Jeb
2   David
8  Jose
3   Chris
7   Tom
4   Ed

Let's assume there is no clustered index on this table. This makes it a heap, meaning the rows are not physically written to disk in any particular order. When you search for a record in such a table via a select statement, the database must look at each record in the table. This is called a table scan. Not a very good way of doing things, because it is bad for performance. Imagine looking through an unordered phone book for a specific name. You would have to read every name in the book to find your result (you would not stop at the first instance of that name because you could not be sure that you have found all instances of that name). Not very efficient.

Adding a clustered index physically re-orders the records on disk. Let's say we add a clustered index on the Employee_ID field (there already are many articles on which fields to pick as clustered indexes, so I won't revisit that here). This would cause the rows to be ordered thus on disk:

Employee_ID Employee_Name
----------------------------------------------------------
1   DeeDee
2   David
3   Chris
4   Ed
5   Jennie
6 Jeb
7   Tom
8   Jose

By the way, a primary key is a clustered index by default (it doesn't necessarily have to be so, but is by default). Back to our example, our table is no longer a heap. The records are ordered by Employee_ID. So now if you were searching for an employee with a specific ID (select * from Employee where Employee_ID = x), the database would no longer have to scan every record in the table. It would most likely "seek" the record using the clustered index, and find the record much faster. However, if you wanted to find records by name (select * from Employee where Employee_Name = 'x'), you would still be causing a scan because the table is ordered by Employee_ID, not Employee_Name. This is like having a phone directory with all names ordered by some arbitrary ID number, and not by alphabet. Excellent if you are looking for a friend whose ID number you know. Not very good if you only have the name.

This is where a nonclustered index comes in. I like to think of a nonclustered index as an ordered copy of one or more fields. Doesn't "ordered copy of fields" sound much more intuitive than "nonclustered index"? Anyway, adding a nonclustered index to our table on the field Employee_Name would cause a second structure (an ordered copy of Employee_Name) to be created on disk, and it would look something like this:

IDX_Employee_Name
----------------------------------------------------------
Chris   Pointer to 3
David   Pointer to 2
DeeDee   Pointer to 1
Ed   Pointer to 4
Jeb   Pointer to 6
Jennie   Pointer to 5
Jose Pointer to 8
Tom Pointer to 7

Note that the index has records ordered by Employee_Name. Now, when you look for an employee by name, the database engine no longer scans the entire table (or the clustered index). It quickly seeks the name you are looking for in the nonclustered index, which is easy to do because this structure is ordered by name. And you don't even have to read all the records to know you have found all instances of a specific name, because they are all located together, in one place by virtue of being ordered. But there's more work to be done. Once it has found the name you want, the database still needs to find the full record in the table. For this, it uses the pointer, which is simply a pointer to the location of the matching record in the clustered index. So the database does something like this:

1. Find the name in the nonclustered index.

2. Follow the pointer to the full record in the clustered index (this, by the way, is called a bookmark lookup).

3. Return the record from the clustered index.

Note that steps 2 and 3 only happen if your query is looking for fields that don't already exist in the nonclustered index. If all query fields (filters and return values) exist in the nonclustered index, they are "covered" and a bookmark lookup is not required at all. In this situation, the data are simply returned from the nonclustered index.


So, going back to definitions, the terms "Clustered Index" and "Nonclustered Index" sound very similar and imply they are similar things. But besides the fact that they help make searching easier, they don't have very much in common. A nonclustered index is simply an ordered copy of one or more fields in the table, with pointers back to the table. Because it is a copy of table fields, you can have many nonclustered indexes per table. A clustered index causes reordering of the physical records of the table on disk. This is because you can only write the records on disk once for the entire table. You can only order the records one way (any way you choose, but only one way at any given time). The entire contents of the table, all fields, all rows, all its data, become the clustered index. Therefore, there can only be one clustered index per table. And because the clustered index is the entire table, a table cannot so much "have" a clustered index as it "is" a clustered index (unless it's a heap). To say that a table has a clustered index is like saying a phone book has a listing of names and numbers. It does not have a listing, it is the listing.

No comments:

Post a Comment