2013/05/07

Thoughts on Database Design and Development

I recently worked on a database of somebody else's design that had me wondering why the designer did certain things. I generally give the designer the benefit of doubt when I don't know their reasons, but there are certain mistakes that are plainly a result of a lack of experience/expertise. So I decided to write up a few guidelines on database design and development. This is a set of general recommendations for putting together a relational database for transactional operations. Every pointer in here may not always be the right thing for your situation, but in general if you understand the reasons behind these guidelines then you should be able to put together a decent database.


Database Design

Understand the Business: Before you even begin, understand the business problem you are trying to solve. This will help you determine the requirements yourself instead of relying on somebody else to give them to you. You will also be able to come up with a more optimal design by cutting out the translator (requirements gatherer) and doing it yourself. You will know "why" you need to do certain things, as opposed to being told "what" to do.

Primary Keys: Generally, all tables should have a primary key which defaults to a clustered index. Avoid heaps. As for the primary key itself, call it "ID", apply data type integer, and give it the identity/auto-increment property. While this works for most tables, a not very uncommon exception is a child table that will almost always be referenced by its parent ID foreign key. Think Sales_Order and Sales_Order_Detail. You will probably never look up detail records directly by detail ID; it is much more likely that you will first find the Sales_Order ID and then look for its matching details (children). In such tables, you're better off creating a composite PK/clustered index on two fields in this order: Parent.ID, Child.ID. (Yes this violates 3NF, but that's another discussion.) Do not use GUIDs as your primary key / clustered index. There is almost never a good reason to do so. An integer (or bigint if necessary) works much better.

Foreign Keys: All table relationships must be enforced via a formal foreign key relationship. Don't use undocumented foreign keys, i.e. using the primary key values of table A as column values in table B without formally establishing the foreign key relationship. I have seen many databases that ignore this, and among other things it causes the problem of orphaned records and makes data discovery very difficult.

Indexes: Indexes are great for read performance, but they come at the expense of making writes slower, and databases larger on disk. So do index, but only as much as necessary. Try and anticipate the queries that will be reading your data most frequently, and the ones on which performance is crucial. Then add indexes to "cover" those queries. Look up what a covering index is, and then build as necessary. Also, have a plan in place to maintain your indexes, to periodically check on and fix fragmentation, and also to remove indexes that are no longer necessary.

Documentation: Take the time to create and maintain a schema description document. Add all the entities your database uses, especially tables, views, stored procedures and functions. Describe what the purpose of the table is, its dependencies and relationships, its columns, foreign keys. Maintain this document, update as necessary. There are tools to automate some of this. If you can also add business process descriptions, this document will be golden.

Triggers: Avoid them. They might seem like a great idea to automatically perform an action when something is done to a set of data, but they cause their own set of headaches. They are difficult to track and troubleshoot, they often get forgotten, and potentially lead to circular operations.

Naming Convention: Settle on one before you start creating database objects, and stick to it. Use this for all your objects: tables, views, stored procedures, functions, indexes, triggers (if you must use them), and the database itself.

Normalization: Normalize normalize normalize. If your app must have a de-normalized view, please put it in a separate view. But as far as your tables go, normalize with the intention of not storing any piece of data more than once. That is how the relational database was meant to be used. Don't get too hung up on which normal form to use and the differences between them. Just stick to the intention of storing any piece of data only once, and you'll be fine.

Data Types: Yes memory is cheap, but performance is still king. Don't use bigint if int will do the job. Don't use varchar(200) if varchar(20) will suffice. When your data volume grows, these little things add up and make a huge difference. Do think of future growth and scaling out/up, but don't overkill and don't try to solve problems that have not come up yet. For instance, if you are developing a new app with a target audience of 100,000 users, don't try to scale it up to 50 million users at this point. When that problem happens, you will be better equipped (in terms of resources, money, personnel) to handle a good problem to have. Right now, it could scuttle your efforts if you try to build and host a database that is much bigger than you'll need in the foreseeable future.

Big Data (I am currently also inclined to include NoSQL with this): This is a fad, a term that is currently "hot". Vendors that have jumped onto the bandwagon will market it to you. Stay away from it. For most business solutions, the relational database still works beautifully.

XML: Storing xml inside a relational database is a questionable idea. It is a compromise. XML and relational data are two very different paradigms designed to address two very different problems, and they don't marry well. I am not against generating xml from relational data for consumption by web services etc, or parsing xml streams out into atomic elements to store in relational format. Just not a big fan of storing xml as is in a table. If you must do this, question why.


Database Development

Row-by-Row Operations: Avoid loops, cursors, and "for" type constructs. If at all possible, use set-based operations. Jeff Moden has written some very good material on the how/why of this, look it up.

Documentation: Repeat after me: code is not its own documentation. Commit this to the foundation of your very being. Your future self will thank you a few months from now, when you have to come back and rework some complex code and you don't have the time to figure out the what/how/why of code. Others who may have to work on your code a few months from now will thank you too. Write comments, introductions, and explanations in your code objects. Over-communicate.

Naming Conventions: Name entities so that the name tells you what they are. If a variable contains an account number, don't call it @x or @an or @acc. Call it @Account_Number. This makes code much easier to read and maintain.

Readability: Code must be neatly organized and easy to read. Prefer readability over "cuteness" (using very smart techniques that are impossible for anybody to read/understand). Be consistent with your code layout. If you write inner joins a certain way, write them that way all the time. Be consistent with upper or lower case, and ask everybody in your organization to follow established standards. This may be painful in the short term, but in the long term it will make code easier for DBAs and developers to read and will help speed development time. Promise.

Source Code Control: Script out your schema. As far as possible, don't actually create any objects through a GUI. Play with the GUI, sure, but when it comes time to execute your changes, script them out and save them. Check the scripts into source control. This way, you will not have to repeat or remember each detail of each object, and deployment on multiple machines will be easier, in addition to giving you a backup. Also script out your configuration table data as csv files and check those into source control too. This will help auto-populate your config tables once you've created them.

Backup and Restore: Definitely have a restore plan, not just a backup plan. If you do not know that you can restore your backups successfully (not guess, but know for sure), you don't have a backup plan at all. The only way to know is to actually restore your backups. Do it. Test it. Plan it. Formally implement it.

Database as a Service: This is something I'd like to put into practice, but for some reason or another have not been able to achieve yet. This is my ideal database consumption method: don't allow SQL code in your application. Put every database operation inside a stored procedure, and give your application permissions to only execute stored procedures and nothing more: no DML (select / insert / update / delete) and definitely no DDL (create / alter / drop). Only give applications execute permission on stored procedures. Why? Among the benefits are:
- Caching the execution plan for performance.
- Using the database server (as opposed to the web server or application server) for data manipulation which is what it was designed to do.
- Not sending massive volumes of data over the network between the database server and other servers.
- Making your application more secure by reducing vulnerability to a SQL injection attack.
- Preventing non-database programmers from writing potentially inefficient and possibly buggy database code.
- Being able to tweak the stored procedure for performance tuning or simple functionality without having to re-release/update the application or its code.


In conclusion, you don't necessarily have to do all of the above, but at least take the time to understand why these are recommended and then make your own informed decision. Your application will be better off for it.

2013/05/06

Could You Live Like a DBA?


Ever wondered how everyday life would pan out if you took the DBA approach to life's events?


Backups: You have a nightly backup routine. This means a differential backup every night, when you tell your spouse everything that has happened during the day in minute detail, and a full backup on weekends when you tell them everything that has happened during the week in your entire life. Don't think either of you would sleep much.

Replication: You are on the phone continuously, telling your twin sibling at a remote location everything that you are experiencing. Everything. What you see, hear, smell, eat, work on etc. Your twin, of course, has no life of her own, and is living vicariously through you, ever prepared for you to drop dead so she can take over. Talk about sibling rivalry.

Restores: Remember that nightly backup you transferred to your spouse earlier? You periodically and randomly ask your spouse to repeat some of those conversations back to you, ensuring no detail is skipped, so you can confirm that the backup is working. Marriage counselor's number handy?

Scans: When you go looking for the soy sauce in your pantry, you read the label on every single item in there: all the spices, all the jars and bottles, every bag of chips, every bottle of juice and soda, everything, because the items are arranged in no particular order. It takes you a few hours to put together one meal, and your kids have gone to bed hungry.

Seeks: After scanning through your pantry multiple times and getting tired of how long it takes each time, you decide to rearrange all the items in order by alphabet. Now you can retrieve the soy sauce in no time. Kids are happy again.

Fragmentation and Re-indexing: Over time, as your spouse adds items to the pantry at random locations (because he is not a DBA) and removes items leaving gaps on the shelves, the ordering gets so out of whack that you revert to going through all items again to find the soy sauce. Eventually, you decide to create an automated re-indexing plan by having your spouse rearrange everything by alphabet once a week. Divorce lawyer's number handy?

More Backups (because, you know, a DBA can never have enough backups): When you prepare a sandwich for your kid's school lunch, you also make a second sandwich. You pack the first one in your child's lunch box, and you courier the second one to arrive at school just in time for lunch. You don't tell anybody that you have a third one in your car, you know, just in case...

Adding Records: When you discuss baby names with your spouse for the upcoming addition to your family, you very briefly consider an integer for the name, and then even more briefly think that a tinyint would be more appropriate, before you think of the divorce lawyer and decide to shut up. You do, however, bring up the Bobby Tables joke because you think it's funny. Your spouse does not.

Searching: You wonder on a daily basis why you can't just enter this into Google's search box: select * from closeby_restaurants where cuisine = 'chinese' and distance_from_home <= 2 miles and deliciousness = 'awesome' and service_time = 'quick'...

Normalization: When your spouse asks you for a copy of your shopping list before he goes shopping, you point him to the location of the original list instead because you do not want to create duplicates. He throws his hands up and leaves without your list. Your items never get bought.

Normalization 2: You do not have a driver's license, or a bank account, or a passport, or a mortgage in your name, because you are incapable of filling out any forms in duplicate, leave alone triplicate.

Set-based Operations: After the kids have gone to bed and you sit down with your spouse to watch some TV, you roll out all the TVs, desktops, laptops, tablets and phones in your house into the living room and fire up a different episode of your favorite show on each one at the same time, because you know that set-based operations are better than episode-by-agonizing-episode. The kids were very entertained when you did this with their cartoons. But the spouse - not so much.


So if your domestic life isn't exactly a picture of happiness and you're getting grief from your spouse, kids, and siblings, now you know why. You're a DBA!