Wednesday, September 27, 2023

Database Sharding

 What is Database Sharding?

Any application or website that sees significant growth will need to scale to take care of the increase in traffic. Some organisation's choose to scale their databases dynamically.

Sharding is a database architecture pattern related to horizontal partitioning.

  • Separating a table's rows into multiple different tables (which are known as partitions)
  • Each partition will have the same table and schema but the rows would be different
  • Data in each of the rows is different
  • Data is broken up into smaller sets known as shards.
  • These shards are then pushed into multiple distributed database nodes.
  • All the data collectively in all these shards represent the entire dataset.
  • This is also known as scaling out.
  • We can add more machines as the load increases.
  • Leads to faster processing.


Vertical partitioning
  • Columns are separated and put into new tables
  • Data is unique but there is a common column in all tables to align the data
  • Known as scaling up
  • Upgrade hardware, CPU and memory.



Benefits
  1. As the table grows, queries slow down.
  2. With horizontal sharding, queries go through fewer rows and hence are faster.
  3. If there is an outage, only a single shard may get impacted and while it may impact the application, the overall impact would be less as compared to the whole DB being down.
Drawbacks
  1. Complex
  2. Very difficult to return to unsharded/original form
  3. Unbalanced shards
    1. In case the sharding was done by name e.g. A-H, I-P and S-Z
    2. If the names are more in A-H and less in S-Z, it would lead to unbalanced shards
    3. Application queries to A-H will slow down
    4. Thus, A-H shard is known as database hotspot.
  4. Not supported by all DBs like PostgreSQL
Sharding Architectures

Some of the Architectures that are important for all to be aware of:

Range Based sharding
  • Sharding data based on a range (obvious by the name)
  • Relatively simple to implement
  • Every shard has unique set of data
  • Read and write from the shard where the range falls in.
  • As expected, this can lead to unbalanced data as well.

Directory Based sharding
  • Create a lookup table that will contain a mapping of a Key and shard ID.
  • Do note that the key here should have low cardinality (less # of possible values)
  • Application needs to have another query to get shard ID from the lookup table.
  • Can the lookup table become a point of failure? Yes, it can.


Key Based Sharding
  • Also known as Hash Based Sharding
  • Creates a hash value from a key and that hash value represents the Shard ID
  • The shard key value should be static and not change with time
  • If data grows and more shards need to be added, effort would be needed to add corresponding hash and do remapping of existing values.







No comments:

Post a Comment