
Wednesday, September 27, 2023

Database Sharding

What is Database Sharding?

Any application or website that sees significant growth will need to scale to handle the increase in traffic. Some organisations choose to scale their databases dynamically.

Sharding is a database architecture pattern related to horizontal partitioning.

  • A table's rows are separated into multiple different tables (which are known as partitions)
  • Each partition has the same schema and columns, but holds different rows
  • The data is broken up into smaller sets known as shards.
  • These shards are then distributed across multiple database nodes.
  • All the data collectively in all these shards represents the entire dataset.
  • This is also known as scaling out.
  • We can add more machines as the load increases.
  • Leads to faster processing.


Vertical partitioning
  • Columns are separated and put into new tables
  • Each table holds unique data, but there is a common column in all tables to align the rows
  • Known as scaling up
  • Upgrade the hardware of the existing machine (CPU, memory).
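The two partitioning styles above can be illustrated with a toy in-memory "table" (plain Python, not a real database; the table and column names are made up for illustration):

```python
# A toy "users" table: a list of rows with the same schema.
users = [
    {"id": 1, "name": "Asha", "email": "asha@example.com"},
    {"id": 2, "name": "Ravi", "email": "ravi@example.com"},
    {"id": 3, "name": "Zoya", "email": "zoya@example.com"},
]

# Horizontal partitioning (sharding): same schema per partition,
# but each partition holds a different subset of the rows.
shard_1 = [row for row in users if row["id"] <= 2]
shard_2 = [row for row in users if row["id"] > 2]

# Vertical partitioning: different columns per table, with the
# common "id" column kept in both tables to align the data.
names = [{"id": r["id"], "name": r["name"]} for r in users]
emails = [{"id": r["id"], "email": r["email"]} for r in users]
```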



Benefits
  1. As a table grows, queries slow down; keeping each shard small avoids this.
  2. With horizontal sharding, queries go through fewer rows and hence run faster.
  3. If there is an outage, only a single shard may get impacted; while that may affect the application, the overall impact is smaller compared to the whole DB being down.
Drawbacks
  1. Complexity
  2. Very difficult to return to the unsharded/original form
  3. Unbalanced shards
    1. Say the sharding was done by name, e.g. A-H, I-P and Q-Z
    2. If more names fall in A-H and fewer in Q-Z, it leads to unbalanced shards
    3. Application queries to A-H will slow down
    4. The A-H shard is then known as a database hotspot.
  4. Not natively supported by all DBs, e.g. PostgreSQL
Sharding Architectures

Some common sharding architectures everyone should be aware of:

Range Based sharding
  • Shards data based on ranges of a value (obvious from the name)
  • Relatively simple to implement
  • Every shard has a unique set of data
  • Reads and writes go to the shard whose range contains the key.
  • As expected, this can lead to unbalanced data as well.
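A minimal sketch of range-based routing, using name ranges like the A-H example above (the shard names and boundaries are illustrative, not a recommendation):

```python
# Each entry: (low letter, high letter, shard ID). Ranges are based
# on the first letter of the name, inclusive on both ends.
RANGES = [
    ("A", "H", "shard_1"),
    ("I", "P", "shard_2"),
    ("Q", "Z", "shard_3"),
]

def route_by_range(name: str) -> str:
    """Return the shard whose range contains the name's first letter."""
    first = name[0].upper()
    for lo, hi, shard in RANGES:
        if lo <= first <= hi:
            return shard
    raise ValueError(f"no shard covers {name!r}")
```

If most names land in A-H, shard_1 becomes the hotspot described in the drawbacks above.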

Directory Based sharding
  • Create a lookup table that contains a mapping of a key to a shard ID.
  • Do note that the key here should have low cardinality (a small number of possible values)
  • The application needs an extra query to get the shard ID from the lookup table.
  • Can the lookup table become a single point of failure? Yes, it can.
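A sketch of directory-based routing. The region codes and shard IDs below are made-up examples of a low-cardinality key; in a real system the lookup table would live in a highly available store, since it is itself a potential point of failure:

```python
# The "directory": an explicit mapping from a low-cardinality key
# (here, a region code) to a shard ID.
LOOKUP = {
    "EU": "shard_1",
    "US": "shard_2",
    "APAC": "shard_3",
}

def route_by_directory(region: str) -> str:
    """The extra lookup the application performs before every query."""
    return LOOKUP[region]
```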


Key Based Sharding
  • Also known as Hash Based Sharding
  • A hash value is computed from a key, and that hash value determines the Shard ID
  • The shard key value should be static and not change with time
  • If data grows and more shards need to be added, most existing keys hash to different shards, so effort is needed to remap and move existing data.







Saturday, October 17, 2020

ElasticSearch (Introduction)

Q. What is ElasticSearch?

ElasticSearch is an open source analytics and full text search engine.

Often used to enable search functionality for applications. One can build complex search functionality using ElasticSearch. It can also be used to aggregate data and analyze results.

Data is stored as documents (JSON Object) in ElasticSearch.

For better understanding: a document in ElasticSearch corresponds to a row in a relational database, and a document's fields correspond to the columns of that row.

e.g.

{
    "Name": "James Bond",
    "DOB": "01-01-1999",
    "EmployeeID": "10203"
}


Data is stored in Nodes. There are multiple nodes and each node stores a part of the data.

A node is an instance of ElasticSearch. A single machine can run many nodes, and to store large datasets we use multiple machines. In production, though, it is best to run one node per machine.

Set of nodes is called a Cluster.

Every document in ElasticSearch is stored within an index (all documents need to be indexed).

e.g. A document containing a person's details like name, country, DOB etc. may be stored in an index named "people_index".

Indexes are broken into shards which reside on nodes. Shards are created based on the capacity of the node(s). Sharding helps improve performance, since queries can now run in parallel on multiple shards.

By default, an index has one shard, and we can increase/decrease the # of shards using the split/shrink APIs.
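The reason the shard count can't simply be changed in place is the routing formula: ElasticSearch picks a document's shard as roughly hash(routing) % number_of_primary_shards, where the routing value defaults to the document's _id. A sketch (ElasticSearch uses murmur3 internally; crc32 here is only a stand-in for illustration):

```python
import zlib

def pick_shard(doc_id: str, num_primary_shards: int) -> int:
    """Illustrative version of ElasticSearch's shard routing:
    hash(_routing) % number_of_primary_shards, with _routing
    defaulting to the document _id."""
    routing = doc_id
    return zlib.crc32(routing.encode()) % num_primary_shards
```

Changing num_primary_shards would route existing documents to different shards, which is why a split/shrink creates a new index rather than mutating the old one.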

But what if a node/shard fails? Will we lose data? 

If there is no copy of the data, we will lose it. Fortunately, replication is enabled by default in ElasticSearch, with a default value of 1.

Replication is configured at the index level. We can choose the # of replicas we need while creating an index (default is 1 as mentioned above). Copies of shards are created. These copies are known as "replicas" or "replica shards".

The main shard is known as the "primary shard" and a replication group (primary and replicas) is created.

Naturally, the replicas are stored on different nodes to avoid a single point of failure.

But what if I have only one node?

Replication on one node makes no sense, since failure of that node still loses the data. A minimum of 2 nodes is needed for replication to be effective.

How is replication distributed?

1 node:

  • Replication ineffective.

2 nodes:

  • With the default configuration of 1, the replica will be placed on the other node (other than the primary's)
  • With a configuration > 1, one replica is placed on the other node and the rest remain unassigned, since ElasticSearch never puts two copies of the same shard on one node
> 2 nodes:
  • With a configuration > 1, the replicas are distributed across the nodes to improve availability and fault tolerance.
What does replication achieve?
  • High availability (a replica is available in case of downtime)
  • Throughput (search queries can be routed to replicas as well, improving performance)
Can we take backups of the data?

Snapshots can be taken at any point in time (at index level or cluster level). Can be used to restore data as well.