Warning: Trying to access array offset on value of type bool in /home2/myrddin/public_html/wp-content/plugins/organic-customizer-widgets/public/class-organic-widgets-public.php on line 68

Of Guids, Primary Keys, and Clustered …

One of the more raucous debates among developers and DBA’s for the past decade or more has been over the use of GUIDs (Graphical Unique Identifiers) or UID as primary keys on database tables.  The amount of blog entries that actually miss the entire point of good database design is actually deplorable.  So here is my attempt at addressing the problem not simply from a performance perspective (though that is ultimately important) or a size perspective (though you may be concerned here) but by identifying the issues from a database design principle perspective and offering various options depending on your constraints.

Traditional Principles

The first question is how you are using the GUID?  Somehow, when talking about the use of GUIDs we forget that often that are meaningless from a business perspective, but form a technical perspective they are tantamount.  This is probably because data design in the middle tier has had a huge impact on how we design our databases (for better or worse).  But back to basics, what is the GUID’s function?  If it is nothing more than a MBUN (meaningless but unique number) then it should probably never be used as the primary key in the database. (See this great post for a never perspective on this http://randommadness.blogspot.com/2008/07/guids-and-clustered-indexes.html.)

If the GUID is a meaningful key, then use it, and apply traditional database design principles.  For example, when the order number is a GUID and the order items are numbered sequentially, the primary key of the order items table may be the GUID and the Sequence.  In this case the use of the GUID is appropriate because it has meaning, it is the order ID and in most cases, all the order items will be pulled out at the same time.  The performance issues and such consideration discussed below do not apply in this case, because the GUID means something.

However, if there is no other unique key but the GUID as a MBUN, then review your design again.  If there is still no better meaningful key (and I am highly skeptical there is not another in most business data elements), then perhaps the GUID should be used as a key and even as a unique key, but refer to the constraints below for the trade-offs.  Even in cases where a unique key is defined, it should not always be the primary key for the table anyway.  In database design the primary key must be determined based upon a number of issues.  This unique identifier is only a candidate for a unique key or index, not necessarily the primary.

Considerations

GUIDs are Ubiquitous

GUIDs may be created in any tier of the application, supplying a unique identifier for an entity in the application without the expense of a database call.  Especially for disconnected systems that may not have reliable real-time access to the database, this feature makes GUIDs an irreplaceable design tool.

TRADEOFF
GUIDS Integer/Long Integer
Create Anywhere Create only in Database

VERDICT: GUIDs where applicable

GUIDs are larger

GUIDs take more space, they are 16 bytes versus using an integer key which is 4 bytes.  In most cases, this is a non-issue.  Disk space is cheap.  And with the addition of larger integer types to the databases, the difference is becoming less important.   And since GUIDs can be generated in the middle tier without a database call, they are more useful to use as unique identifiers, especially in disconnected systems.

TRADEOFF
GUIDS Integer/Long Integer
16 bytes 4 bytes/8 bytes

VERDICT: Noise (i.e., not applicable to the decision for 99.999% of cases)

GUIDs are Bad Key Values?

Basically the GUIDs may really poor sequential values to use for a database key.  Since database keys are based upon the idea of sequential references to look up the value and GUIDs are not necessarily sequential.  Database server engines have tried to compensate for this by supplying functionality to create a more sequential GUID to use, but it basically then requires that all GUIDs be created in the database machine, thus removing the advantage of ubiquitous GUID creation.  So what is the impact? 

On non-clustered indexes, this means that the index needs to be rebuilt form time to time for performance reasons because its performance will fall off faster than say a date and time based index in which most of the values are sequential.  Okay, so not a big problem with non-clustered indexes.

TRADEOFF
GUIDS Sequential Value
Degrade index performance faster Standard Performance

VERDICT: Noise (no large impact on traditional database design)
MITIGATION: Rebuild the indexes more often based upon the frequency of inserts/updates.

Index Sequencing – To Cluster or not to Cluster?

However, the BIG problem comes in when GUIDs are used as the clustered index for a table.  A clustered index defines how a table is physically laid out, so when inserts are non-sequential, as the insertions of GUIDs have a tendency to be, the lack of sequences in the clustered index will likely cause a large number of page splits and impact the performance of the record insertion.  However if the clustered index is on a sequential column such as an integer or a date time the page splits would not occur and the insertions would have much better performance.

TRADEOFF
CLUSTER GUIDS CLUSTER Sequential Data
Insertion performance quickly degrades Fewer page splits

VERDICT: Never use GUIDs as Clustered Indexes where performance is critical

Table Joins – One Key to bring them all?

Where tables will be joined to other tables in the same database having multiple fields that combine to create the primary key for a table becomes cumbersome and tends to increase disk usage by propagating that data to a number of tables where it may not be used.  One solution to this issue is to create a Database ID field which is a sequence (integer) that is the primary key and the clustered index.  This becomes a consideration because in many design instances the GUID has become that single key that allows the system accomplish the same goal of tying elements together, and this is often used in the object model to reference data items without a reference to the database.

TRADEOFF
One Key Data Item Many Key Data Items
Less data to pass around More meaningful data to key from
Must begin key lookup in source table and join Can look up the key information in the desired table
Usually better performance in joins Reduces the number of joins necessary
Requires less disk space Require more disk space

 VERDICT: It Depends…

  • Ease of Use & DB space – One Key
  • Fewer Joins more meaningful data – Many Keys

Proposed Solutions

So what to do?  In most cases, in smaller databases where there are either not a large number of insertions or there is little change to the data once it has been added to the table, there really is not much of an issue.  Use traditional database design principles and allow the database engines to optimize your tables and indexes for you.  A good Database Administrator in a modern database engine can compensate for many ‘mediocre’ database designs.

But when performance is critical and a large number of records are being inserted and referenced, or when large batch or set operations are being performed on the data, the considerations above must be addressed.   But there is a compromise solution that will allow both the developers to use their ubiquitous GUIDs for convenience and will allow the database administrators to have a well tuned database.

The solution is to use both a GUID as a unique key to reference the table while also using a database identity (most likely a sequence and an integer) as the clustered index for the table.  The Database Identity does not ever have to be shared outside of the data layer and in this particular case, it should not be shared outside of the database (it would encourage short-cuts in the services layer that could circumvent the designer’s intent, i.e., it will be abused and misused).  Very often, development departments will mandate the use of the database ID for consistency.  If you know every table has a clustered index on the primary key with is called <TableName>ID and that when the primary key of that table is exported with the same name, you the implementation of the join is a breeze and the query designer can concentrate on the logic implementation rather than the mechanics of the join.  The can be overused, however, and should rarely be a ‘blanket mandate’ nor does it excuse the database designer from considering the alternatives, particularly with lookup tables.  However, if it is used, then it may be appropriate to name of GUID field as <TableName>UID.

The GUID field then is defined as the UID type (as appropriate for the database, avoid using text or strings as the GUID type) and a unique index constraint is placed upon it.  In cases where an entry may be inserted that would not necessitate a GUID be defined prior to the insertion, the table should default to a NEWID () or NEWSEQUENTIALID () or the appropriate database command.  The primary key should be an integer field, defined as the primary key and created with a clustered index.  This field should be auto generated upon insertion and seeded as appropriate.

Design and Development Considerations

Insertions into Child Tables – The impact of this design, especially if the Database ID is not passed out to the middle tier, means that an insert into a child table may necessitate a read from the parent table to retrieve the Database ID of that table.  However, in batch set operations where a large amount of data is being processed, this should not be necessary.

Reads from Child Tables by Parent ID – In this case the Parent GUID may not be propagated to the child table.  Consider that all program accounts for a particular customer must be read by CustomerUID.  This would necessitate a join on the Customer Table by CustomerUID and joining it with the Account table where CustomerID = CustomerID.  In most cases this should not have a significant impact, but if a large number of the reads form the Account table are accomplished in this fashion, then perhaps the CustomerUID should also be added to the Account table removing the need for a join to the Customer table.

Additional Considerations

  • Consider turning off the NEWID generation in batch tables
  • Consider creating staging tables where rules are processed against the data, and using set operations to push that data into the production tables
  • The indexes for the GIUD keys must be rebuilt fairly frequently, especially after a large number of insertions; this may or may not be an issue with the particular database engine which may force the rebuild as needed.

Conclusion

GUIDs are not going away, and a good database design must consider how to use them most appropriately.  The good thing is that in most cases, traditional database design principles are enough, so long as we remember that the definition of a “Meaningful” value is in the context of the user of the system, not the middle tier developer.  With a little bit of care, even large high-performance database designs can incorporate both the need to use GUIDs and the need to provide high performance insertions and retrievals.

Share

Leave a Reply