http://seancorfield.github.io for newer blog posts." />

An Architect's View

CFML, Clojure, Software Design, Frameworks and more...

An Architect's View

To UUID or not to UUID?

April 16, 2011 · 10 Comments

A discussion that crops up from time to time is whether primary keys should be UUIDs or simple auto-incrementing integers. There are pros and cons on both sides. UUIDs provide universal uniqueness which allows data to be stored across multiple databases, merged and so on. Auto-increment primary keys are easy to work with and efficient for indexing - but cause problems when you need to deal with data created across multiple databases, such as data migration and merging. UUIDs, as 36 character strings, take up a lot more space and can be much less efficient as an index.

But what about storing UUIDs in a more efficient format? A DBA recently recommended using BINARY(16) since that holds 128-bit data and a UUID is usually decomposable as two 64-bit long integers. The question is how to do this? And, I guess, another question is: is it really that much more efficient than a 36 character string?

When I first looked at implementing the DBA's advice, I couldn't find a type in cfqueryparam that would support BINARY(16). There's BLOB... and I guess that would work if I could provide a byte array? Then the question is how to get a UUID - a 36 character string of hex with '-' separators - into a byte array in the first place? And then you have to ask how efficient that would be (under the hood, the UUID generator will create two 64-bit longs and then have to convert that to a string before CFML gets it's hands on it!).

Then another issue is the raw speed of UUID generation in the first place. Adobe made some big improvements between CF8 and CF9 here (100x faster) but generating UUIDs is traditionally still quite slow and in CFML they get converted to a delimited hex string so you can't even get at the underlying binary data. When I did some research, I found that the UUID generator built into Java (since Java 5) can only produce type 3 (name-based) and type 4 (random) UUIDs - it can't create time-based (type 1) or DCE security (type 2) UUIDs. There are third-party generators that can produce time-based UUIDs and which perform better than the built-in Java version.

Overall tho' this sort of fiddling around is something better suited to a low-level language like Java. We want a raw UUID. We want to get the two 64-bit longs and write them to a 16-byte array and use that as our BINARY key. As an experiment, I tried Johann Burkard's lightweight time-based UUID generator - see below - and wrapped it up in Clojure to create the value I wanted:

(defn uuid [] (com.eaio.uuid.UUID.))

(defn uuid-as-byte-array []
  (let [u (uuid)
        ^long lo (.getClockSeqAndNode u)
        ^long hi (.getTime u)]
    (-> (java.nio.ByteBuffer/allocate 16)
      (.putLong hi)
      (.putLong lo)
      (.array))))

There may well be more efficient ways of constructing the byte array, such as direct write operations and bit manipulation, but this seems clean and simple enough.

I'd be very interested to hear what others have done around UUID and/or binary indexes on large tables and, in particular, how you've deal with that from CFML?

See Johann Burkard's comparison of UUID generators for feature / performance analysis. See also Mark Kruger's recent post about leveraging Java UUID from ColdFusion - and the performance improvements possible (and read the comments).

Tags: clojure · coldfusion

10 responses so far ↓

  • 1 Randy Merrill // Apr 16, 2011 at 8:58 PM

    I was kinda curious of this from the aspect of a db, specifically Postgres since it has a native uuid type.

    I found this post that talks about the performance difference a bit between the uuid and a normal character array: http://jochem.vandieten.net/2008/02/06/postgresql-uuids-and-coldfusion-1/

    For me the benefits of it being almost guaranteed unique across databases is a big benefit. I haven't had to work with any databases where it would actually be an big performance slayer so I can't really speak from experience there.
  • 2 Tyler Clendenin // Apr 17, 2011 at 2:26 AM

    Due to environmental requirements we have been forced to move to using GUID's for our primary keys. Overall, it has not been an issue, though SQL Server statistics is most likely the main reason for that. The main reason that I personally hate GUID's as primary keys is that it kills any kind of clustered key index, because any row with a PK generated with NewID() will always create a fragmented index.

    Unfortunately the method NEWSEQUENTIALID()

    http://msdn.microsoft.com/en-us/library/ms189786.aspx

    came around after we started using uniqueidentifiers for our primary keys. From what I understand (sans a reboot). As long as you have a network card (which of course you will) GUID's created with the sequential id function will be unique and will be sequential. This seems like it would have tremendous advantages. Eventually I hope to start using this method to generate primary keys for our schemas.
  • 3 Tyler Clendenin // Apr 17, 2011 at 2:33 AM

    Oh, and I forgot to mention how much I hate using the createUUID function from Coldfusion on a window machine. If you didn't know, java has a little bug that messes up your computers clock every time you request a new UUID unless you specify the -XX:+ForceTimeHighResolution argument to your JVM. I had a server that was off by several hours because of this silly bug.

    http://kb2.adobe.com/cps/190/tn_19007.html
  • 4 Barney Boisvert // Apr 17, 2011 at 2:23 PM

    One huge gotcha with UUIDs is using them for the clustered index of a table. With a clustered index on a UUID column, INSERTs can be ridiculously expensive, as on average half the rows need to shift "down" to put the new row in place. With a big dataset, it can be significantly more performant to place the primary key (clustered index) on a sequential integer, and put a unique key on your UUID column.
  • 5 Sam Jones // Apr 18, 2011 at 12:39 AM

    Wouldn't it be more efficient to create the uuid using a database function like newid() in ms-sql server or the uuid() funktionin mysql?
  • 6 Sean Corfield // Apr 19, 2011 at 2:29 PM

    @Barney, can you simply not have a PK declared and just add a unique key on the UUID? MySQL certainly allows that but I'm not sure what the overall impact would be for most people...
  • 7 Ken Redler // Apr 19, 2011 at 7:23 PM

    One option is to set the UUID column as the primary key, but not clustered. Then choose another column -- one whose values create a more meaningful sequence -- and set that as the table's clustered index. The table will thus naturally order itself according to something that makes sense with respect to inserts or selects (like an order number or timestamp), but retain the useful qualities of the UUID for clustering, object creation outside the database, etc.
  • 8 Michel Fleur // Apr 22, 2011 at 1:31 AM

    @Tyler Clendenin:
    - Since 1.5, java has java.util.UUID, that hasn't got the clock-problem that Form.CreateUUID has.
    - That problem was related to Thread.sleep() and fixed from Java 1.4.1
    - UUIDs and Clustered indexes is indeed not a very good combination, on the other hand
    - Generating random UUIDs client (java)-side increases scalability, because there is less (=no) coherence in the generation process. Beware of the birthday paradox though...

    @Ken Redler
    The problem with adding an additional, more meaningful column for a clustered index, is that it also introduces a performance hit, unless it's a column with a strictly rising value.
  • 9 Arne Lewinski // May 20, 2011 at 6:40 AM

    I am wondering, how many discussions about UUIDs exists. Do you guys reflect on this? Global uniqueness sounds nice. But it lacks consistency by chance (also if double IDs are very rare). In default it lacks type information, so you have to encode this or administrate a UUID to type-table to get the proper type. It uses four times more space, it is about two times slower, DB Admins going crazy comparing UUIDs. It gives the responsibility for persistent object creation to EVERYONE, not only the db, so you lack the paradigm of layering. UUIDs refer to the context of multi-master systems like microsoft active directory domain controllers and of course always server-side. Do you really are in need of multi-master systems??? I am wondering... wondering...
  • 10 Sean Corfield // May 20, 2011 at 11:24 AM

    @Arne, I think in these days of noSQL data stores and distributed data persistence, UUIDs are a fact of life. Performance of int vs UUID is no longer relevant if you have to step far outside a single master database...

    You certainly do not give up layering - UUID creation is done in the data layer, where it belongs (and with newer data stores, UUID generation is done in the database itself).

    Also, there's no less type information in a UUID than any other PK. The value alone is never enough: you always need something else to determine the "shape" of the data behind the PK.

    However, it's entirely possible you will never need UUIDs. I don't have that luxury, hence this blog post...

Leave a Comment

Leave this field empty