2012/11/05

Can a Table Really Have a Clustered Index?

The terms 'Clustered Index' and 'Non-Clustered Index' are not very intuitively worded for the concepts they represent. Besides being non-intuitive, they also imply a degree of similarity with each other. Perhaps this leads to a certain level of misunderstanding of these terms by not only those who are new to databases, but also by experienced data professionals who sometimes stumble while explaining them. Here is my attempt at providing a simple explanation of what they mean. I will also use this explanation to argue why a table cannot really "have" a clustered index.

Say you have a table that looks like this:

Employee_ID Employee_Name
----------------------------------------------------------
1 DeeDee
5 Jennie
6   Jeb
2   David
8  Jose
3   Chris
7   Tom
4   Ed

Let's assume there is no clustered index on this table. This makes it a heap, meaning the rows are not physically written to disk in any particular order. When you search for a record in such a table via a select statement, the database must look at each record in the table. This is called a table scan. Not a very good way of doing things, because it is bad for performance. Imagine looking through an unordered phone book for a specific name. You would have to read every name in the book to find your result (you would not stop at the first instance of that name because you could not be sure that you have found all instances of that name). Not very efficient.

Adding a clustered index physically re-orders the records on disk. Let's say we add a clustered index on the Employee_ID field (there already are many articles on which fields to pick as clustered indexes, so I won't revisit that here). This would cause the rows to be ordered thus on disk:

Employee_ID Employee_Name
----------------------------------------------------------
1   DeeDee
2   David
3   Chris
4   Ed
5   Jennie
6 Jeb
7   Tom
8   Jose

By the way, a primary key is a clustered index by default (it doesn't necessarily have to be so, but is by default). Back to our example, our table is no longer a heap. The records are ordered by Employee_ID. So now if you were searching for an employee with a specific ID (select * from Employee where Employee_ID = x), the database would no longer have to scan every record in the table. It would most likely "seek" the record using the clustered index, and find the record much faster. However, if you wanted to find records by name (select * from Employee where Employee_Name = 'x'), you would still be causing a scan because the table is ordered by Employee_ID, not Employee_Name. This is like having a phone directory with all names ordered by some arbitrary ID number, and not by alphabet. Excellent if you are looking for a friend whose ID number you know. Not very good if you only have the name.

This is where a nonclustered index comes in. I like to think of a nonclustered index as an ordered copy of one or more fields. Doesn't "ordered copy of fields" sound much more intuitive than "nonclustered index"? Anyway, adding a nonclustered index to our table on the field Employee_Name would cause a second structure (an ordered copy of Employee_Name) to be created on disk, and it would look something like this:

IDX_Employee_Name
----------------------------------------------------------
Chris   Pointer to 3
David   Pointer to 2
DeeDee   Pointer to 1
Ed   Pointer to 4
Jeb   Pointer to 6
Jennie   Pointer to 5
Jose Pointer to 8
Tom Pointer to 7

Note that the index has records ordered by Employee_Name. Now, when you look for an employee by name, the database engine no longer scans the entire table (or the clustered index). It quickly seeks the name you are looking for in the nonclustered index, which is easy to do because this structure is ordered by name. And you don't even have to read all the records to know you have found all instances of a specific name, because they are all located together, in one place by virtue of being ordered. But there's more work to be done. Once it has found the name you want, the database still needs to find the full record in the table. For this, it uses the pointer, which is simply a pointer to the location of the matching record in the clustered index. So the database does something like this:

1. Find the name in the nonclustered index.

2. Follow the pointer to the full record in the clustered index (this, by the way, is called a bookmark lookup).

3. Return the record from the clustered index.

Note that steps 2 and 3 only happen if your query is looking for fields that don't already exist in the nonclustered index. If all query fields (filters and return values) exist in the nonclustered index, they are "covered" and a bookmark lookup is not required at all. In this situation, the data are simply returned from the nonclustered index.


So, going back to definitions, the terms "Clustered Index" and "Nonclustered Index" sound very similar and imply they are similar things. But besides the fact that they help make searching easier, they don't have very much in common. A nonclustered index is simply an ordered copy of one or more fields in the table, with pointers back to the table. Because it is a copy of table fields, you can have many nonclustered indexes per table. A clustered index causes reordering of the physical records of the table on disk. This is because you can only write the records on disk once for the entire table. You can only order the records one way (any way you choose, but only one way at any given time). The entire contents of the table, all fields, all rows, all its data, become the clustered index. Therefore, there can only be one clustered index per table. And because the clustered index is the entire table, a table cannot so much "have" a clustered index as it "is" a clustered index (unless it's a heap). To say that a table has a clustered index is like saying a phone book has a listing of names and numbers. It does not have a listing, it is the listing.

2012/08/12

The Value of Writing

Part of who I am as a person and as a professional, is being a writer. Writing is something that not only helps me express my perspective on various things, it gives me joy. I think it also helps me think more clearly about various issues, and yes it does make me a better developer by getting in the habit of writing (prose or code) for the sake of consumption by somebody/something else (people or computers).

I recently came across some very nicely written (!) pieces on writing, and am reproducing them below because they are worth reading and spreading the word about. Admittedly, these are self-serving, but they are good nonetheless. These are all taken from the book 'Getting Real' by 37 Signals (you can download it here, I highly recommend it to anybody developing software or managing software development).



Hire Good Writers

If you are trying to decide between a few people to fill a position, always hire the better writer. It doesn't matter if that person is a designer, programmer, marketer, salesperson, or whatever, the writing skills will pay off. Effective, concise writing and editing leads to effective, concise code, design, emails, instant messages, and more.

That's because being a good writer is about more than words. Good writers know how to communicate. They make things easy to understand. They can put themselves in someone else's shoes. They know what to omit. They think clearly. And those are the qualities you need.

- Getting Real by 37 Signals



An Organized Mind

Good writing skills are an indicator of an organized mind which is capable of arranging information and argument in systematic fashion and also helping (not making) other people understand things. It spills over into code, personal communications, instant messaging (for those long-distance collaborations), and even such esoteric concepts as professionalism and reliability.

- Dustin J. Mitchell, developer (and printed in Getting Real by 37 Signals)



Clear Writing Leads to Clear Thinking

Clear writing leads to clear thinking. You don't know what you know until you try to express it. Good writing is partly a matter of character. Instead of doing what's easy for you, do what's easy for your reader.

- Michael A Covington, Professor of Computer Science at The University of Georgia (from How to Write More Clearly, Think More Clearly, and Learn Complex Material More Easily) (and printed in Getting Real by 37 Signals)

2012/06/01

Code Reviews Must Be Brutally Honest

This post was prompted by a code review session I attended recently. The code in question involved T-SQL stored procedures. There were five of us database developers in the room, looking over different components of an application we had divided up and coded amongst ourselves.

The raison d’ĂȘtre of a code review is to improve your code, to find things that the original coder may have missed and suggest ways of improving them. You do not go into a code review looking for an ego boost via validation that everything you did is right. You do not go into a code review fearing that others may find something wrong with your code. And you definitely do not go into code review with a defensive and protective mindset, not willing to be open to critiques. And yet, that is exactly what I encountered from one participant.

When this participant's code was being looked at, and one or two of us pointed out some things that he could have done in a more efficient manner, he started stonewalling and defending a position that we all (himself included, very likely) knew was wrong. Now I am not suggesting at all that you should not defend your position if you truly believe you are right. But when you know you are wrong and you do not accept it, what do you gain by the exercise? Why waste everybody's time, if you are not willing to be reviewed? Our friend made it clear that he was not going to be receptive to constructive criticism, and the atmosphere in the session almost immediately turned from one of camaraderie to sullen silences. I knew the rest of the session would be wasted time.

I look forward to being shown that I did something wrong, it helps me improve as a developer. A code review is a fantastic opportunity to look into other developers' minds, and to have peers suggest ways for you to improve your work. Why would anybody be resistant to that? I would be vastly disappointed if my code were being reviewed and nobody showed me ways of improving it.

The above is largely from the perspective of the "reviewee". From the reviewer's point of view too, brutal honesty is the way to go. If you hold back as a reviewer because you do not want to hurt somebody's feelings, you have probably just done everybody in the room a disservice. You are a participant in the meeting for your technical abilities, to look over other developers' work and find any errors or inefficiencies. If you fail to do that, you have failed in your task to help develop the best product or service you can put out. I am not advocating that you should look to be mean and give other coders a hard time. Just be honest. Brutally honest.

Egos, feelings, false pride - these really should not come into the code review session. Check them at the door, and focus on one thing only: building the best software that can be built with the resources at hand. If you are lucky, somebody may just show you a way of doing something better, and help you find some coding Zen.

2012/05/16

Why It's Good To Be Wrong

Most people don't like to be proven wrong. It seems like a basic instinct: we want to be right all the time. Being wrong is a sign of weakness, and we may be looked down upon by our peers for that weakness. Even when there are no peers involved and the realization of being wrong dawns upon us by way of self discovery, we admonish ourselves.

Perhaps it has something to do with the survival of the fittest as we evolved - if you were right about where the predators were, you could take steps to avoid them. If you were wrong, you died. If you were right about how to find food and shelter, you survived. If you were wrong, you died. Today, if you are right about what decisions to make to earn a good living for yourself and provide for your family, your family does better than if you get it wrong. Whatever the reasons, it seems an almost universal human trait that we do not like being wrong.

And yet, not so much on a life changing level, but more on a day-to-day basis, I look forward to being proven that I was wrong. And my humble opinion is that we all should feel this way. Here is why.

When I find that I have been wrong about something, I have just learned something new, and I have just been corrected. It makes me a better professional, a better practitioner of whatever art I am practicing. For instance, I recently said at my workplace that a database backup created on one version of SQL Server could not be restored onto a different version of SQL Server. A good friend (thanks D2) pointed out the error, and I looked it up, and sure enough I was wrong. Knowing this makes me a better database developer. There is no shame or ego that stands in the way of admitting that I was wrong today, because it makes me right going forward. Every time somebody shows you that you were wrong, and you accept it, you have just made yourself a little bit better educated.

Please do not take this to mean that you should want to be wrong. No, the goal still is to be right. Rather, this is about your attitude when faced with the possibility that you just may have been wrong. Hence, crazy as it sounds, I look forward to being proven wrong. When that happens, I accept it and embrace it, and I am better off for it. If you show me I am wrong, I will thank you for it.

So, have you been wrong today?

2012/05/12

Defrag All Indexes On a Database

I recently had the need to look up all fragmented indexes on a specific database, to determine if they were the reason for less than optimal performance. I wrote the following script, which allows you to pick a specific database to check for fragmented indexes, and decide to just view them or reorganize/rebuild them, and also the fragmentation threshold percentage at which to run these operations. Then I thought it might be useful to others too, so here it is (below).

I am the original developer of this script. I offer it freely to anybody to use, just leave my name in the header where it is.

Thanks,
Hakim Ali


/****************************************************************************************************
Hakim Ali (hakim.ali@sqlzen.com) 2012/05/04
****************************************************************************************************/



-----------------------------------------------------------------------------------------------------
------------------------------------------------------- SET DESIRED VALUES HERE ---------------------
-----------------------------------------------------------------------------------------------------
declare @Database_Name nvarchar(100);                    set @DATABASE_NAME = ''
declare @Rebuild_Threshold_Percent nvarchar(10);        set @REBUILD_THRESHOLD_PERCENT = 30
declare @Reorganize_Threshold_Percent nvarchar(10);        set @REORGANIZE_THRESHOLD_PERCENT = 30
declare @Execute_Rebuild bit;                            set @EXECUTE_REBUILD = 0        -- set to 1 to rebuild
declare @Execute_Reorganize bit;                        set @EXECUTE_REORGANIZE = 0        -- set to 1 to reorganize. If rebuilding, no need to do this.
-----------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------



----------------------------------
-- Initial checking
----------------------------------
declare @Error nvarchar(500)
declare @SQL nvarchar(max)
declare @Online nvarchar(50)

set @Database_Name = ltrim(rtrim(isnull(@Database_Name,'')))

set @Error = 'Database name required.'
if (@Database_Name = '')
begin
    raiserror(@Error,16,1)
    goto the_end
end

set @Error = 'Database "' + @Database_Name + '" does not exist.'
if not exists (select name from sys.databases where name = @Database_Name)
begin
    raiserror(@Error,16,1)
    goto the_end
end

set @Online = ''
if (@@version like '%enterprise edition%')
begin
    set @Online = ' with (online = on)'
end



set @SQL = '
----------------------------------
-- Create table to hold results
----------------------------------
declare @fragmented_indexes table
(            ID                            int identity(1,1)
            ,Database_Name                nvarchar(1000)
            ,[Schema_Name]                nvarchar(1000)
            ,Table_Name                    nvarchar(1000)
            ,Index_Name                    nvarchar(1000)
            ,Fragmentation_Percent        money
            ,Num_Rows                    int
            ,Page_Count                    int
            ,Index_Type                    nvarchar(1000)
            ,Reorganize_SQL                nvarchar(4000)
            ,Rebuild_SQL                nvarchar(4000)
)



----------------------------------
-- Populate table
----------------------------------
insert into @fragmented_indexes
(            Database_Name
            ,[Schema_Name]
            ,Table_Name
            ,Index_Name
            ,Fragmentation_Percent
            ,Num_Rows
            ,Page_Count
            ,Index_Type
)
select        distinct
            Database_Name
                = db_name(database_id)
            ,[Schema_Name]
                = sch.name
            ,Table_Name
                = parent.Name
            ,Index_Name
                = indx.name
            ,Fragmentation_Percent
                = left(isnull(phys_stats.avg_fragmentation_in_percent,0),5)
            ,Num_Rows
                = x.rowcnt
            ,Page_Count
                = phys_stats.page_count
            ,Index_Type
                = phys_stats.index_type_desc
from        sys.dm_db_index_physical_stats(
                db_id('''+@Database_Name+'''),
                default,
                default,
                default,
                ''detailed''
            ) phys_stats
inner join    ['+@Database_Name+'].sys.indexes indx
            on indx.object_id = phys_stats.object_id
            and indx.index_id = phys_stats.index_id
inner join    ['+@Database_Name+'].sys.objects parent
            on parent.object_id = phys_stats.object_id
inner join    ['+@Database_Name+'].dbo.sysindexes x
            on x.id = indx.object_id
inner join    ['+@Database_Name+'].sys.schemas sch
            on sch.schema_id = parent.schema_id
where        1 = 1
and            isnull(indx.name,'''') <> ''''
and            x.rowcnt > 0
and            (phys_stats.avg_fragmentation_in_percent >= '+@Rebuild_Threshold_Percent+'
            or
            phys_stats.avg_fragmentation_in_percent >= '+@Reorganize_Threshold_Percent+')



update        @fragmented_indexes
set            Reorganize_SQL =
                ''alter index [''+Index_Name+''] on [''+Database_Name+''].[''+[Schema_Name]+''].[''+Table_Name+''] reorganize''
            ,Rebuild_SQL =
                ''alter index [''+Index_Name+''] on [''+Database_Name+''].[''+[Schema_Name]+''].[''+Table_Name+''] rebuild'+@Online+'''



----------------------------------
-- View results
----------------------------------
select        ID
            ,Database_Name
            ,[Schema_Name]
            ,Table_Name
            ,Index_Name
            ,Fragmentation_Percent
            ,Num_Rows
            ,Page_Count
            ,Index_Type
from        @fragmented_indexes
order by    Database_Name
            ,[Schema_Name]
            ,convert(money,Fragmentation_Percent) desc
            ,Table_Name
'

----------------------------------
-- If rebuild/reorganize option set...
----------------------------------
if (@Execute_Rebuild = 1)
begin
    set @SQL = @SQL + '
    declare @current_sql nvarchar(max)
    while exists (select top 1 Rebuild_SQL from @fragmented_indexes)
    begin
        set @current_sql = (select top 1 Rebuild_SQL from @fragmented_indexes)
        execute sp_executesql @current_sql
        --select @current_sql
        delete @fragmented_indexes where Rebuild_SQL = @current_sql
    end
    '
end
else if (@Execute_Reorganize = 1)
begin
    set @SQL = @SQL + '
    declare @current_sql nvarchar(max)
    while exists (select top 1 Reorganize_SQL from @fragmented_indexes)
    begin
        set @current_sql = (select top 1 Reorganize_SQL from @fragmented_indexes)
        execute sp_executesql @current_sql
        --select @current_sql
        delete @fragmented_indexes where Reorganize_SQL = @current_sql
    end
    '
end



----------------------------------
-- Go!
----------------------------------
execute sp_executesql @SQL
the_end:


Simplicity is Not Simple!

It is far too easy to build something complex. But simplicity - and a semblance of ease and intuition - now that takes a lot of work. But let that not be a deterrent to try and achieve it. And do not mistake the preference for simplicity as an excuse to shy away from complex problems. Quite the opposite: let's take on difficult problems, and engineer a solution so elegant that it makes the problem seem easy to solve. Let's find some database Zen !