
128-bit numbers are absolutely humongous, and by virtue of being a primary key they get duplicated for every single relation. So if your user ID, for instance, is a 128-bit UUID, any table and column that references the user ID now also stores a 128-bit UUID. It's not a problem if you have very small amounts of data. But once your tables start to grow, it will become a misery to work with.

In my opinion, the best pattern is to have just a regular bigserial for primary keys to use for internal database relations, since Postgres is extremely good at generating and optimizing PK sequences (You're taking away all that optimization just because you want an enormous random number to be the primary key).

Then just have a uuid field for application-level identifiers and natural keys. You'll save yourself a whole lot of headache with this pattern.
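
A minimal sketch of that pattern (names are made up; gen_random_uuid() assumes Postgres 13+ or the pgcrypto extension):

    CREATE TABLE users (
        id        bigserial PRIMARY KEY,                            -- internal relations join on this
        public_id uuid NOT NULL DEFAULT gen_random_uuid() UNIQUE,   -- what the application/API sees
        email     text NOT NULL UNIQUE
    );

    CREATE TABLE orders (
        id      bigserial PRIMARY KEY,
        user_id bigint NOT NULL REFERENCES users (id)               -- 8-byte FK instead of a 16-byte uuid
    );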



They're not "absolutely humongous" but merely two times as large as a 64-bit number. That's nothing. They're not even that bad for performance, assuming you have random writes regardless. UUIDs work incredibly well.

Where they don't work well is when they're stored in textual encoding (base16 with dashes), which I've encountered far more than I'd like. Random UUIDs also don't work well for append-only insertions like events, for which ULIDs (or the newer UUID variants) are more appropriate.
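
For a sense of the per-value size difference (the literal is just an arbitrary example):

    SELECT pg_column_size('123e4567-e89b-12d3-a456-426614174000'::uuid);  -- 16 bytes
    SELECT pg_column_size('123e4567-e89b-12d3-a456-426614174000'::text);  -- 37 bytes (36 chars + varlena header)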

But for primary keys? Totally fine.

> extremely good at generating and optimizing PK sequences (You're taking away all that optimization just because you want an enormous random number to be the primary key).

Please share the benchmarks to back this up. UUIDs optimize down perfectly fine.

If you're trying to squeeze every last ounce of performance out of the database then certainly this is a yak you can shave. But most Postgres users have 1001 problems that are much bigger than UUIDs. If you've managed to avoid folks writing shitty queries and this is the bottleneck you're facing then I salute you.

Otherwise, don't worry. UUIDs are fine.


> but merely two times as large as a 64-bit number. That's nothing.

It's literally 2x as large. This starts to matter quite a bit at the millions/billions of rows scale.

> They're not even that bad for performance, assuming you have random writes regardless.

> Please share the benchmarks to back this up.

I've done [0] benchmarks, with [1] source to recreate it. This was for loading data, but it demonstrates quite nicely the problem of massive B+tree splits. An integer PK loads data ~2x as quickly as a UUIDv4 PK stored as `uuid` type, and ~3x as fast as one stored as `text` type.
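
For anyone who wants the general shape without reading the gists, a rough sketch of this kind of load test (not the linked harness itself; assumes psql and Postgres 13+):

    -- the integer PK appends at the right edge of the B+tree; the v4 uuid PK splits pages all over it
    CREATE TABLE t_bigint (id bigserial PRIMARY KEY, payload text);
    CREATE TABLE t_uuid   (id uuid DEFAULT gen_random_uuid() PRIMARY KEY, payload text);

    \timing on
    INSERT INTO t_bigint (payload) SELECT md5(g::text) FROM generate_series(1, 10000000) g;
    INSERT INTO t_uuid   (payload) SELECT md5(g::text) FROM generate_series(1, 10000000) g;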

> But most Postgres users have 1001 problems that are much bigger than UUIDs.

All of which become compounded by the use of non-k-sortable keys.

> If you've managed to avoid folks writing shitty queries and this is the bottleneck you're facing then I salute you.

Using UUIDs as a PK without having an extremely good reason to do so demonstrates a lack of understanding of how the underlying technology works. Worse, it demonstrates a lack of understanding of a basic data structure, something that devs should have near and dear to their heart.

With Postgres specifically, the use of UUIDv4 (or ULID) PKs will also cause an enormous amount of WAL bloat [2], unless you happen to be rolling your own DB, are using ZFS for the file system, and have disabled `full_page_writes` (that setup being the only safe way to do so). I have personally seen this take down prod, because the other nodes couldn't keep up with the changes being issued (fun fact: most AWS instances under `.4xlarge` in size have a baseline network speed that's much lower than rated).
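
If you want to measure this yourself, a rough sketch (pg_stat_wal needs Postgres 14+):

    SELECT wal_records, wal_fpi, wal_bytes FROM pg_stat_wal;
    -- ...run the same insert workload against an integer-PK table and a UUIDv4-PK table...
    SELECT wal_records, wal_fpi, wal_bytes FROM pg_stat_wal;
    -- wal_fpi (full-page images) is where the random-PK workload balloons with full_page_writes on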

[0]: https://gist.github.com/stephanGarland/ee38c699a9bb999894d76...

[1]: https://gist.github.com/stephanGarland/fe0788cf2332d6e241ff3...

[2]: https://www.2ndquadrant.com/en/blog/on-the-impact-of-full-pa...


> This starts to matter quite a bit at the millions/billions of rows scale.

Only if you have no other data in the database? Our UTF-8 encoded strings easily beat all the UUIDs in terms of storage size.


I doubt those are also duplicated repeatedly in indexes throughout the database.


> I've done [0] benchmarks, with [1] source to recreate it.

Where's the source for the data files? Without that it's not possible to make sense of these benchmarks. Are the keys sorted?

Comparing random-insert to insert-at-end, if that is what you are indeed doing, is silly and not representative of real-world scenarios.

Even then, it's _less than_ a 2x overhead, which is far from an order-of-magnitude difference and squarely within the realm of "I don't care."

> All of which become compounded by the use of non-k-sortable keys.

But that's the thing -- they don't! Bad queries (missing an index, improper pruning via predicate pushdown, recursive queries) will all dominate performance regardless of what you do here. We're not talking _quadratic_ behavior here. The compounding effect you are worried about is negligible for the vast, vast majority of workloads.

> With Postgres specifically, the use of UUIDv4 (or ULID) PKs will also cause an enormous amount of WAL bloat [2],

Again, only for _random insertions_, which is not a problem specific to UUIDs. Yes, I agree that if you are rewriting all of your tables all the time you're going to have a bad time. However, UUIDs are not inherently problematic, as UUIDv7 (and its predecessor, ULID) is temporally ordered, which nearly eliminates the disadvantages discussed here.
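
For example, a sketch assuming Postgres 18's built-in uuidv7() (older versions need an extension or client-side generation):

    -- new keys are time-prefixed, so they land at the right edge of the index like a sequence would
    CREATE TABLE events (
        id      uuid PRIMARY KEY DEFAULT uuidv7(),
        payload jsonb
    );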

I fail to see how ULIDs inherently cause WAL bloat, and checking my notes (read: databases) shows they do not. Why do _you_ believe otherwise?

> Using UUIDs as a PK without having an extremely good reason to do so demonstrates a lack of understanding of how the underlying technology works. Worse, it demonstrates a lack of understanding of a basic data structure, something that devs should have near and dear to their heart.

What is problematic is not understanding the basic data structures and the underlying technology; that is the root of most database issues, and it is my point: you almost always have bigger fish to fry. Whether or not you opt to wield UUIDs will not save you here.

In fact, 2ndquadrant agrees with me[0]:

> Chances are your database schema is more complicated and uses various other indexes, so maybe the issues due to UUIDs are fairly negligible in the bigger picture.

[0]: https://www.2ndquadrant.com/en/blog/sequential-uuid-generato...


> Where's the source for the data files? Without that it's not possible to make sense of these benchmarks. Are the keys sorted?

The names were randomly generated, as were the UUIDs. Serial / Identity were of course handled by the DB itself. Nothing was pre-sorted.

> 2ndquadrant agrees with me

I wouldn’t say a summary “maybe the issues are fairly negligible” counts as agreeing. Do terrible queries matter more? Often, yes. However, I’ve found there is a strong correlation between devs using UUIDv4s (even after having been warned of the consequences) and poor schema and query design.

I’m not going to argue any points further as I don’t see it being fruitful. My day job is DBRE; I have run both MySQL and Postgres DBs at the TB scale, managed and self-hosted. I’m sure there are many who have run far larger clusters, but I believe I have enough experience seeing the impact first-hand to be able to speak about them.


If your argument boils down to "bad engineers use them" then I don't know what to tell you. There's nothing intrinsically wrong or harmful with UUIDs, and they're not really any worse of a footgun than any other feature.

I don't particularly care about your credentials, but it's harmful to preach extremes. UUIDs don't deserve the hate they're getting here.

Credentials: Postgres (and begrudgingly MySQL) for thousands of developers at the TB scale, 100TB scale, as well as PB scale (we don't talk about that one), and all (yes, all) the Postgres-related OLAP DBs, as well as maintaining a fork of Postgres for some time.

The FUD here is unreal.


UUIDs are an inefficient bloated way to store PKs and only make sense in scenarios where distributed clients need a way to independently generate conflict-free identifiers.


To be honest, it sounds like you've made some of the terrible mistakes with uuid PKs described in this thread and now you're just very aggressively defending the poor decision making.

There's no FUD. These are facts.


I haven't! I literally never think about these things because they are literally never problems. It's frustrating to see people suggest that such things are intrinsically problematic because they have misused them, or witnessed them being misused, without recognizing that they are not intrinsically worse or better than alternatives.

I've witnessed severe, horrifying problems, absolute nightmares, due to misuse of serial/bigserial. I'm not here trying to convince anyone not to use them. Right tool for the right job and all.

The argument here reads to me as akin to suggesting: a baseball bat is not very good at cutting loaves of bread! Don't use baseball bats! You should really stick to knives!

Can we perhaps just agree that it's both possible to hold knives without cutting yourself and possess and wield a baseball bat without ruining your sandwich?

Anyways, at this point you're just posting flamebait and I don't particularly enjoy the direction the thread has taken by pulling credentials and implying inadequacy. It's not constructive. I hope at least some readers are convinced that they need not fear the UUID.


I can understand not caring at first if this option is easier, but it's not. Bigserials are already the easiest thing to use.


> But once your tables start to grow, it will become a misery to work with.

I buy the data layout argument for databases with clustered tables and no semi-sequential UUID support. But the storage argument looks vanishingly applicable to me: if someone needs to add a column to one of these tables, it basically offsets a 4-byte optimization already.


8 bytes (assuming 128-bit instead of 64-bit), but yeah.

It's not quite as simple as saving 8 bytes per row, though. It's 8 bytes for the UUID column, plus 8 in at least the PK index, plus 8 more for any other index the UUID is in.

Then you need to do the same for any foreign-key columns, and the indexes those are in, as well.

However, unless your table has very few non-uuid columns, the rest of the table size will dwarf the extra n*8 bytes here. And if you are storing any blobs (like pictures, documents etc) then frankly all the uuids won't add up to anything.
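
As a rough worked example, with made-up numbers:

    100M rows x (8 B extra per row + 8 B extra per PK index entry) ~ 1.6 GB extra
    100M rows x ~200 B of other columns per row                    ~ 20 GB of heap regardless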

In summary, whether using UUIDs is right for you depends a lot on your context. An insert-only log table with a billion rows is a very different use case to, say, a list of employees or customers.

Generally I'm very comfortable with UUIDs for tables of, say, less than 100m rows. Very high insert rates, though, suggest tables bigger than that, and perhaps benefit from different strategies.

When it comes to data stored in multiple places (think on-phone first, synced to the cloud, exported from multiple places to consolidated BI systems), UUIDs solve a lot of problems.

Context is everything.


Quick question. So you’re saying use bigserial for the PK (instead of uuid) and have another column where I set a uuid as the application-level identifier.

But wouldn’t that require an index on it if it’s going to be an identifier that’s used to query?


Yes, but then you only have a single index containing the UUIDs, instead of having them duplicated everywhere the key is used in a foreign key relation. Those indexes quickly become explosive in relations like m2m tables.
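
For example, a hypothetical m2m table, assuming users and groups tables keyed by bigint:

    CREATE TABLE user_groups (
        user_id  bigint NOT NULL REFERENCES users (id),
        group_id bigint NOT NULL REFERENCES groups (id),
        PRIMARY KEY (user_id, group_id)   -- 16 bytes of key per entry here vs 32 with uuid FKs
    );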


UUIDs are for when an auto-incrementing ID can't be used, usually due to scale (if you will be inserting millions of records per second from millions of clients).

However, Postgres isn't suited to such scale (something like Spanner is).

Therefore, you probably shouldn't be using UUIDs with Postgres.

The other reason for UUIDs is that the system is badly architected and, for example, the client gets to choose its own IDs, which are written into a DB. Such systems are generally best avoided.


Also think about systems that are "beyond one database", like data that lives in multiple places at the same time (on-phone local storage, in a database, consolidated and synced with other databases, etc.).

In other words once you move beyond '1 sql database' they become important.

(I'm not downvoting you, just disagreeing with you.)


I've used auto-incrementing integers across multiple systems.

There was one system responsible for allocating the integers (what should have been a relational database that held our user accounts & their metadata); once that write was committed¹ ², other systems (e.g., one that kept bulk/frequent per-user metrics (lots of data), but which could shard it better) used it.

I think I'd opt for UUIDs today (the size difference is minimal, and they distribute more naturally if you shard on them), but integers are doable in this manner too.

¹which … this is its own battle. It turned out later our database was less than good here. ("So call me, maybe.")

²you need a lock here. That could matter. But "user creation" is not a frequent enough event for lock contention to matter. We only did this for user IDs.


This is why I asked at the start of the thread. We *need* 128-bit numbers to make the uniqueness work across loads of systems (like a distributed/federated type thing).


Maybe, but they don't need to be the PKs. And you only need a 128-bit number at all if you want one generator not to be able to guess the others' numbers, like if clients are generating them.


Are you mixing up primary key with clustered key?

The Primary Key serves to uniquely identify the row. This is used to reference the row in other tables. (So the Invoice table contains the Primary Key value for the customer.)

The Clustered key describes the physical sort order on the disk. Clustering on your most common search improves performance for that search. Searching on other criteria becomes equally slow.

It's really common to search through invoices by date, so a date key is ideal for a clustered index on that table. Looking up all invoices for a date range is faster than all invoices by, say, customer.

Clustering the customer or employee table by date, though, isn't great; I don't often want to see my employees in take-on order.
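
(In SQL Server syntax, for example, where the distinction is explicit; the names are made up:)

    -- physical order of the invoice rows follows the date
    CREATE CLUSTERED INDEX IX_invoices_date ON invoices (invoice_date);
    -- the primary key remains a separate, nonclustered unique index
    ALTER TABLE invoices ADD CONSTRAINT PK_invoices PRIMARY KEY NONCLUSTERED (invoice_id);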

Primary Keys are taught at uni, but in my era anyway they didn't put the same emphasis on selecting a correct clustered key. So I feel like people tend to assume the Primary Key is also the clustered key.


I'm used to Postgres, which doesn't have clustered keys. If this is MySQL and you're using a clustered key, the pkey is still how you'd identify rows in foreign keys etc., so I think the same advice of a bigserial pkey applies.

We learned to use natural pkeys in college, which in practice burned me enough that I'll never do it anymore. I just set some uniqueness constraints.


Even in these situations, you probably want bigserial PKs on the shards. I think Citus does this by starting each shard's sequence at a different value.
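
The usual trick (a sketch, not necessarily what Citus actually does internally) is to give each shard a non-overlapping sequence:

    -- shard 1 of 4
    CREATE SEQUENCE users_id_seq START 1 INCREMENT 4;   -- yields 1, 5, 9, ...
    -- shard 2 of 4
    CREATE SEQUENCE users_id_seq START 2 INCREMENT 4;   -- yields 2, 6, 10, ...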



