Designing Data Intensive Applications - Chapter 6 Partitions

DDIA Notes Chapter 6 - Sharding

Chapter 5 discussed replication -> Having multiple copies of data on different nodes

Sharding / Partitioning

Sharding or partitioning allows breaking up data into partitions, where each partition is essentially a small database. This approach is used to distribute a large dataset across many processors, leading to faster lookups for big data.

Usually combined with replication, although the strategy is the same as in Chapter 5, so replication won’t be covered.

Key Range Partition

Partition by Hash

Skewed Workloads and Relieving Hot Spots

Hash doesn’t fix everything; sometimes there are hot keys (like a celebrity on Instagram). Keys can be split into parts to ease the load, such as adding two random numbers to the celebrity ID to yield more partitions. This splits the writes but means reads need to check all 100 different fields.

Secondary Indexes

Another index is needed, like occurrence of User 123 or the number of cars with the color red. Relational databases are good at this, but partitioning these indexes is difficult.

Rebalancing Partitions

Evenly distributing the load of all partitions as data changes over time is essential as systems crash, the dataset increases, RAM increases, queries increase, and CPU increases. Hash mod is not ideal because it requires data to be moved around after it changes (e.g., Mod 10 -> Mod 15).