Thoughts on Database Design and Development

I recently worked on a database of somebody else's design that had me wondering why the designer did certain things. I generally give the designer the benefit of doubt when I don't know their reasons, but there are certain mistakes that are plainly a result of a lack of experience/expertise. So I decided to write up a few guidelines on database design and development. This is a set of general recommendations for putting together a relational database for transactional operations. Every pointer in here may not always be the right thing for your situation, but in general if you understand the reasons behind these guidelines then you should be able to put together a decent database.

Database Design

Understand the Business: Before you even begin, understand the business problem you are trying to solve. This will help you determine the requirements yourself instead of relying on somebody else to give them to you. You will also be able to come up with a more optimal design by cutting out the translator (requirements gatherer) and doing it yourself. You will know "why" you need to do certain things, as opposed to being told "what" to do.

Primary Keys: Generally, all tables should have a primary key which defaults to a clustered index. Avoid heaps. As for the primary key itself, call it "ID", apply data type integer, and give it the identity/auto-increment property. While this works for most tables, a not very uncommon exception is a child table that will almost always be referenced by its parent ID foreign key. Think Sales_Order and Sales_Order_Detail. You will probably never look up detail records directly by detail ID; it is much more likely that you will first find the Sales_Order ID and then look for its matching details (children). In such tables, you're better off creating a composite PK/clustered index on two fields in this order: Parent.ID, Child.ID. (Yes this violates 3NF, but that's another discussion.) Do not use GUIDs as your primary key / clustered index. There is almost never a good reason to do so. An integer (or bigint if necessary) works much better.

Foreign Keys: All table relationships must be enforced via a formal foreign key relationship. Don't use undocumented foreign keys, i.e. using the primary key values of table A as column values in table B without formally establishing the foreign key relationship. I have seen many databases that ignore this, and among other things it causes the problem of orphaned records and makes data discovery very difficult.

Indexes: Indexes are great for read performance, but they come at the expense of making writes slower, and databases larger on disk. So do index, but only as much as necessary. Try and anticipate the queries that will be reading your data most frequently, and the ones on which performance is crucial. Then add indexes to "cover" those queries. Look up what a covering index is, and then build as necessary. Also, have a plan in place to maintain your indexes, to periodically check on and fix fragmentation, and also to remove indexes that are no longer necessary.

Documentation: Take the time to create and maintain a schema description document. Add all the entities your database uses, especially tables, views, stored procedures and functions. Describe what the purpose of the table is, its dependencies and relationships, its columns, foreign keys. Maintain this document, update as necessary. There are tools to automate some of this. If you can also add business process descriptions, this document will be golden.

Triggers: Avoid them. They might seem like a great idea to automatically perform an action when something is done to a set of data, but they cause their own set of headaches. They are difficult to track and troubleshoot, they often get forgotten, and potentially lead to circular operations.

Naming Convention: Settle on one before you start creating database objects, and stick to it. Use this for all your objects: tables, views, stored procedures, functions, indexes, triggers (if you must use them), and the database itself.

Normalization: Normalize normalize normalize. If your app must have a de-normalized view, please put it in a separate view. But as far as your tables go, normalize with the intention of not storing any piece of data more than once. That is how the relational database was meant to be used. Don't get too hung up on which normal form to use and the differences between them. Just stick to the intention of storing any piece of data only once, and you'll be fine.

Data Types: Yes memory is cheap, but performance is still king. Don't use bigint if int will do the job. Don't use varchar(200) if varchar(20) will suffice. When your data volume grows, these little things add up and make a huge difference. Do think of future growth and scaling out/up, but don't overkill and don't try to solve problems that have not come up yet. For instance, if you are developing a new app with a target audience of 100,000 users, don't try to scale it up to 50 million users at this point. When that problem happens, you will be better equipped (in terms of resources, money, personnel) to handle a good problem to have. Right now, it could scuttle your efforts if you try to build and host a database that is much bigger than you'll need in the foreseeable future.

Big Data (I am currently also inclined to include NoSQL with this): This is a fad, a term that is currently "hot". Vendors that have jumped onto the bandwagon will market it to you. Stay away from it. For most business solutions, the relational database still works beautifully.

XML: Storing xml inside a relational database is a questionable idea. It is a compromise. XML and relational data are two very different paradigms designed to address two very different problems, and they don't marry well. I am not against generating xml from relational data for consumption by web services etc, or parsing xml streams out into atomic elements to store in relational format. Just not a big fan of storing xml as is in a table. If you must do this, question why.

Database Development

Row-by-Row Operations: Avoid loops, cursors, and "for" type constructs. If at all possible, use set-based operations. Jeff Moden has written some very good material on the how/why of this, look it up.

Documentation: Repeat after me: code is not its own documentation. Commit this to the foundation of your very being. Your future self will thank you a few months from now, when you have to come back and rework some complex code and you don't have the time to figure out the what/how/why of code. Others who may have to work on your code a few months from now will thank you too. Write comments, introductions, and explanations in your code objects. Over-communicate.

Naming Conventions: Name entities so that the name tells you what they are. If a variable contains an account number, don't call it @x or @an or @acc. Call it @Account_Number. This makes code much easier to read and maintain.

Readability: Code must be neatly organized and easy to read. Prefer readability over "cuteness" (using very smart techniques that are impossible for anybody to read/understand). Be consistent with your code layout. If you write inner joins a certain way, write them that way all the time. Be consistent with upper or lower case, and ask everybody in your organization to follow established standards. This may be painful in the short term, but in the long term it will make code easier for DBAs and developers to read and will help speed development time. Promise.

Source Code Control: Script out your schema. As far as possible, don't actually create any objects through a GUI. Play with the GUI, sure, but when it comes time to execute your changes, script them out and save them. Check the scripts into source control. This way, you will not have to repeat or remember each detail of each object, and deployment on multiple machines will be easier, in addition to giving you a backup. Also script out your configuration table data as csv files and check those into source control too. This will help auto-populate your config tables once you've created them.

Backup and Restore: Definitely have a restore plan, not just a backup plan. If you do not know that you can restore your backups successfully (not guess, but know for sure), you don't have a backup plan at all. The only way to know is to actually restore your backups. Do it. Test it. Plan it. Formally implement it.

Database as a Service: This is something I'd like to put into practice, but for some reason or another have not been able to achieve yet. This is my ideal database consumption method: don't allow SQL code in your application. Put every database operation inside a stored procedure, and give your application permissions to only execute stored procedures and nothing more: no DML (select / insert / update / delete) and definitely no DDL (create / alter / drop). Only give applications execute permission on stored procedures. Why? Among the benefits are:
- Caching the execution plan for performance.
- Using the database server (as opposed to the web server or application server) for data manipulation which is what it was designed to do.
- Not sending massive volumes of data over the network between the database server and other servers.
- Making your application more secure by reducing vulnerability to a SQL injection attack.
- Preventing non-database programmers from writing potentially inefficient and possibly buggy database code.
- Being able to tweak the stored procedure for performance tuning or simple functionality without having to re-release/update the application or its code.

In conclusion, you don't necessarily have to do all of the above, but at least take the time to understand why these are recommended and then make your own informed decision. Your application will be better off for it.

1 comment:

  1. Great article, Hakim. A good reminder on what to focus on during dev. We're working on a .net application in house, and I've implemented the same 'database-as-a-service' methodology as you've described above.