What do we do about primary keys

This piece is about two related subjects. The first is the great question of whether we should rely on auto incremented integers or uuids to identify entities. The second is who gets to generate those identifiers (your code or some external application like your RDBMS).

I feel that just about any article I read on these subjects promptly mixes the two areas. Also, none of the articles seem to making valid points (this one made me quite sad) so I got the urge to make my own attempt.

First off, auto incremented integer primary keys. Why do we use them?

  • they’re the default. Just about any RDBMS has them, any ORM will happily use them.
  • they’re fast. Integers take 4 bytes to represent, bigints (as per mySQL type) take 8 bytes. They’re naturally fast for indexing.
  • they’re naturally unique for a table — and that’s usually good enough.

Why can auto incremental primary keys be bad?

  • they disclose information about your entities (for example, when used in URLs, but as long as they get into some UI, it can be a problem) and helps with scraping data off your db as well as simply leaking metrics (image you have an online shop and someone makes today an order, gets an auto incremented id then makes another order at the end of the month — boom, they know how many orders you got)
  • to debug you need to pass around several pieces of information, which makes things difficult in certain settings (like multi-tenant environments, where a tenant might have their own prod + staging)

What about UUIDs? What’s their upside?

  • they’re *really* unique. Running at 16 bytes of randomly generated data, you have a better chance at winning the jackpot 100 times in a row that running into an UUID collision
  • they can be used as such to identify data across tables / databases / servers. This makes transitions and migrations easy (lets say you want to start over with a new db, but without collisions with IDs from your previous db — auto increments are an issue unless you arbitrarily set a different starting point)
  • no meaning can be inferred from them, they can be used publicly without disclosing anything about your system (aside from your use of uuid’s, of course)

What about downsides?

  • At 16 bytes, they are usually represented as strings. This is a big performance hit on indexing, particularly if you use relationship. A search by an UUID — by default — is a text search, which is significantly slow.
  • Speaking of mySQL, you need version 8 to be able to partially mitigate the performance impact by changing their representation to binary, but on any generation you need to also call UUID_TO_BIN

It’s not a binary choice though, there are other systems as well — you can use your own UUID-like system with your own character space, or perhaps a different primitive, like a 64-bit integer. But regardless of system, there’s another question that needs to be answered: who will generate your identifier?

Well, the obvious first choice is … the database engine itself!

  • it’s right there! I mean, you are using one.
  • modern versions (like mySQL 8) can do either autoincrements or UUIDs, with various encodings (binary for UUID, which helps the performance issue)
  • you’re doing it at the lowest layer so consistency is assured (even in clustered environments, as long as your primary key is clustered)

Why would you consider an alternative, you ask?

  • leaving it to the RDBMS couples you with that engine. Particularly if using db-generated UUIDs, which aren’t available everywhere just yet.
  • performance: image you want to save related entities. You need one trip to the db to save the first and get its primary key so that you can pass it to the second one to establish the relationship. Second trip is to save the second entity.

So, you might choose to generate your key in code.

  • you’re free to manage relationships outside of RDBMS, for flexibility and performance (no foreign keys, you can save in a single transaction in the above example)
  • you are not tied to a db vendor

What about downside of managing your own generation?

  • you’d better use UUIDs — autoincrements managed this way need an extra db call to see what value you can use.
  • it’s not scalable, two copies of your application generating UUIDs with the same algorithm are increasing the risk of collision by 4 and it rises exponentially the more copies you run.
  • unless you write your own implementation of UUID generation, you’re tying yourself to a library vendor.

Is there a middle way? How about generating your own identifiers (so not leaving it to the db) but also not doing it in the application code? Twitter is using this thing, called Snowflake, which is a dedicate service generating id’s. It’s much faster than DB-generated content (where each process has to deal with a potentially large table structure and account for unicity) with some network latency issues (potentially, but likely negligible when running in the same private network).

Of course, an extra service comes with its own headaches (security / maintenance). What do you think?