Some thoughts on partition keys in clustered databases

Partition keys are a common concept among distributed software including CouchDB and Azure CosmosDB to other systems like Kafka and AWS Kinesis. They are a key input into the system that has consequences on how well it can be used to solve different end user problems and how well it performs under load.

To start, forget about partition keys briefly. Instead, picture a system that has an infinite supply of buckets to store data, but each bucket has a limited size. They key question is, how is data divided among the buckets. This is Azure CosmosDB.

CosmosDB is a neat product. It has,

Georeplication with support for multiple write regions
Support for multi-modal interaction
Tunable transactions

It also, for reasonably large datasets, requires that a partition key be chosen in order to distribute data. Microsoft does a better job than I will of addressing of the nuances of how partition keys are used in CosmosDB. The concept is similar to many other distributed systems. In order to process lots of stuff or store lots of stuff (or both!), the “stuff” needs to be divided up and worked on by separate physical hardware in parallel. The choice of how data and computation is split up determines how well the system performs for the end user. For CosmosDB specifically, our choice of partitioning key determines how much data we can store (each logical partition is limited to 20GB), how much our system costs (CosmosDB charges by provisioned request units that are split across logical partitions), and how fast our application can execute queries.

For distributed databases there is often an additional wrinkle to the choice of partition key. It cannot be changed after database creation. This is at least the case with DynamoDB, CosmosDB, and CouchDB. There are ways around this by migrating data between databases when the partition key changes, but this can be expensive and slow.

What’s developer to do when faced with choosing a partition key for a new system? I have some thoughts but few concrete answers:

The dataset (probably) is not as big as it looks

First, should we care about the choice of partition key in the first place? In the case of building an application which stores data in a CosmosDB collection, it may be the case that the entire data set is less than 20GB. Congratulations, choice of partition key is now not the most important part of the application’s design.

Size limits can be deceptive. 20GB of 4K video isn’t that much video. 20GB of user records including passwords is a lot of users. US Census data downloads are typically measured in the sub-gigabyte range; as are many population datasets hosted on AWS S3 open data registry. I’ve found myself sometimes over thinking data systems, it can be the case that an application’s entire data may fit into a handful of gigabytes. Consider this well since if it is the case then the problem of how to store data is greatly simplified.

Customer id is (probably) not the right choice

We’re now at the point that the dataset is big enough to need partitioning. For those of us that build enterprise or small and medium business applications (ie stuff for other businesses), one possibility is to divide data by tenant. A tenant or customer id is an identifier for each customer organization using the application. For example, Slack very likely maintains a unique identifier for each new Slack organization created. Partitioning data by customer identifier can be appealing since the customer concept is likely already present in the system for the purpose of authorization. Again, think of the example of Slack where separate customer organizations cannot, by default, interact with each other.

Unfortunately storing data this way is probably the wrong choice.

“Partitions over a size limit”

The issue with using a customer identifier alone is that there can be significant variability in size. Yet again, Slack, one customer might have four people, the other might be an enterprise with 50000 employees with integrated file sharing. The design runs into an issue if the largest customers exceed the partition size limit since these are often hard limits that will require choosing a new key to sub-divide the data and a migration to start using the new key. Another way of designing around the issue is to use a customer id based partition key but only store reference data in the clustered database. Large objects are stored elsewhere, perhaps in another clustered database partitioned by object id.

“Partitions data using reference keys”

Sometimes the use case itself is problematic

I once worked on a project that had an architecture team mandate the choice of technology. The technology choice was made independently of the product requirements.

This created some problems for us. We were using CouchDB v2.X at the time and had some pretty significant performance issues that were not the fault of the technology. Our product team wanted to enable highly flexible search across a variety of fields in the dataset and needed all of the fields to be optional. We compromised on a partition key choice that enabled each search query to be roughly evenly distributed across the hosts in the cluster. This wasn’t ideal though because it meant that every query was distributed to every host, sort of similar to a table scan in DynamoDB.

“Partitioned Table Scan”

Microsoft’s guidance on choosing partition keys even mentions that ideally the partition identifier should help scope which physical hosts handle queries.

“CosmosDB wants you to include he partition key”

Several years later the system has evolved significant and these “search” use cases are now handled by an elasticsearch based system; one designed for the use case in mind enabling end users to search across events. Incidentally, elasticsearch also happens to be a clustered, distributed, database. The difference being that elasticsearch is intended for enabling search use cases and it makes trade-offs in other areas for this.

Finally, the examples are not always useful and this means careful thought is required

This is a compliant about the importance of key choice and where I want to conclude.

CosmosDB’s documentation on key choice uses user identifier and first name as examples. This is great for building an identity and access management system and little else. AWS’ DynamoDB blog post on partition keys is better, addressing more complex use cases and touching on nuances of DynamoDB’s two key approach for organizing records. CouchDB’s documentation on partitioning is closer to CosmosDB’s, with an idealized example and a couple notes indicating that cardinality is important.

Partition key choice is easy when dealing with scenarios where certain fields of stored data will always be present in queries handled by the system. User id is often used as an example because in systems dealing with users, frequently identity systems, almost all queries to the system will be based on that field. Many of us are not building identity systems though which means that finding the right partition key will take some thought to avoid traps like partitioning by customer in multi-tenant systems.

My summarized advice here is about avoiding certain pitfalls,

Don’t partition if it is not needed or conversely be sure that the problem actually requires partitioned data
Don’t partition in a way that allows (unexpected) unbounded growth within each partition key
Don’t try to implement search with tools not meant for search
Do consider that the system’s use cases for ingesting and querying data will dictate the choice of key

The dataset (probably) is not as big as it looks#

Customer id is (probably) not the right choice#

Sometimes the use case itself is problematic#

Finally, the examples are not always useful and this means careful thought is required#

The dataset (probably) is not as big as it looks

Customer id is (probably) not the right choice

Sometimes the use case itself is problematic

Finally, the examples are not always useful and this means careful thought is required