Saturday, September 26, 2020

Cassandra (High Level details)



What is Cassandra?
  • Open source
  • Distributed Database
  • NoSQL Database
  • Can add lots of nodes in a cluster hence highly scalable
  • High availability since the nodes can be spread across DC's.
  • Fault tolerant, durable with no single point of failure.
  • Note that in Cassandra distributed architecture, there are multiple nodes. These nodes can be replicated across multiple data centers to avoid a single point of failure.

All the nodes have the same functionality and no one node is a master node. We do not have a master slave node architecture. All nodes have the same functionality.

If all nodes have the same functionality, how do the nodes know about the other nodes in the cluster?
Ans. Snitch.

What is a Snitch?

As per official documentation: https://cassandra.apache.org/doc/latest/operating/snitch.html?highlight=snitch

In Cassandra, a snitch has two functions:
  1. Teaches Cassandra enough about your network topology to route requests efficiently
  2. Allows Cassandra to spread replicas around your cluster to avoid correlated failures. It does this by grouping machines into “data centers” and “racks.” Cassandra will do its best not to have more than one replica on the same “rack” (which may not actually be a physical location).
There are various snitch classes. Read more about them in the above link.
SimpleSnitch is suitable for only a single DC.

In PropertyFileSnitch, one will put the IP address of all the nodes in a cluster and also specify the rack and DC. Data is entered in the cassandra-topology.properties file.
This is fine if we had lesser # of nodes. But if we had 1000 nodes, it would be very tedious to keep creating and modifying entries for the nodes.

What is Gossip?
Gossip is how the nodes in a cluster communicate with each other.
Every second, a node sends information about itself and the other nodes information (that it has) to at-least 3 other nodes.
Thus, as mentioned before, there is no master slave configuration. Eventually all nodes have information of all the other nodes and they keep sending/sharing information every 1 sec.
Do note that this is internal communication between nodes in a cluster.

Data is distributed across nodes (so rows in a table can be in different nodes).

But if the rows in a table are in different nodes and if a node goes down, would we lose data?
We have an option to replicate data by setting a replication factor. If the replication factor is 1, then data is not replicated and you may lose data. But if the replication factor is increased to 2 or 3 etc, the data is replicated across nodes and we have high availability of data even if nodes go down.