Saturday, October 17, 2020

ElasticSearch (Introduction)

Q. What is ElasticSearch?

ElasticSearch is an open source analytics and full text search engine.

It is often used to add search functionality to applications, and complex search features can be built on top of it. It can also be used to aggregate data and analyze the results.

Data is stored in ElasticSearch as documents, which are JSON objects.

As an analogy, a document in ElasticSearch corresponds to a row in a relational database table.

A document contains fields, which correspond to the columns of that row.

For example:

{

"Name": "James Bond",

"DOB": "01-01-1999",

"EmployeeID":"10203"

}
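
To make the row/column analogy concrete, here is a minimal Python sketch that turns a relational-style row into the JSON document shown above. The column names and values come from the example; the helper name row_to_document is made up for illustration.

```python
import json

def row_to_document(columns, row):
    """Pair each column name with its row value to form a document (a dict)."""
    return dict(zip(columns, row))

# A relational "row" and its column names, taken from the example above.
columns = ["Name", "DOB", "EmployeeID"]
row = ("James Bond", "01-01-1999", "10203")

document = row_to_document(columns, row)

# Serialize to JSON, the format in which ElasticSearch stores documents.
print(json.dumps(document, indent=2))
```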

Data is stored on nodes. A deployment usually has multiple nodes, and each node stores a part of the data.

A node is a running instance of ElasticSearch. A single machine can run several nodes, and large datasets are spread across multiple machines. In practice it is best to run one node per machine, so that a single machine failure takes down only one node.

A set of nodes that work together is called a cluster.

Every document in ElasticSearch is stored within an index (all documents need to be indexed).

For example, a document containing a person's details such as name, country and DOB may be stored in an index named "people_index".
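
As a rough sketch of what indexing a document looks like over the REST API, the snippet below builds (but does not send) a PUT request to the _doc endpoint of "people_index". The URL http://localhost:9200, the document ID 1 and the field values are assumptions for illustration only; a real cluster may also require authentication.

```python
import json
from urllib.request import Request

ES_URL = "http://localhost:9200"  # assumption: a local node without security

def build_index_request(index, doc_id, document):
    """Build a PUT /<index>/_doc/<id> request that would store the document."""
    body = json.dumps(document).encode("utf-8")
    return Request(
        url=f"{ES_URL}/{index}/_doc/{doc_id}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )

# Hypothetical person document; the values are made up for this sketch.
person = {"name": "James Bond", "country": "UK", "DOB": "01-01-1999"}
request = build_index_request("people_index", 1, person)

# Nothing is sent here; passing the request to urllib.request.urlopen would send it.
print(request.get_method(), request.full_url)
```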

Indexes are broken into shards, which reside on nodes. Each shard holds a subset of the index's documents, so an index can grow beyond the capacity of a single node. Sharding also improves performance, since a query can run in parallel on multiple shards.
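
The idea can be sketched in a few lines of Python: each document is routed to one shard by hashing its ID, and a search fans out to all shards in parallel before the partial results are merged. This is a toy model, not ElasticSearch's implementation; zlib.crc32 is a stand-in hash chosen for this sketch, and the function names are hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def pick_shard(doc_id, number_of_primary_shards):
    """Route a document to a shard: hash(id) modulo the number of primary shards."""
    # crc32 is a stand-in hash for this sketch, not the one ElasticSearch uses.
    return zlib.crc32(str(doc_id).encode("utf-8")) % number_of_primary_shards

def search_shard(docs, predicate):
    """Search a single shard."""
    return [doc for doc in docs if predicate(doc)]

def search_index(shards, predicate):
    """Run the same query on every shard in parallel, then merge the hits."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda docs: search_shard(docs, predicate),
                                 shards.values()))
    return sorted(hit for part in partials for hit in part)

# Spread ten toy "documents" (just their IDs) over three shards.
shards = {n: [] for n in range(3)}
for doc_id in range(1, 11):
    shards[pick_shard(doc_id, 3)].append(doc_id)

even_ids = search_index(shards, lambda doc: doc % 2 == 0)
print(even_ids)
```

Because routing depends on the number of primary shards, changing that number would send existing IDs to different shards, which is one way to see why the shard count cannot simply be edited in place.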

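By default, an index has one primary shard. The number of primary shards is fixed when the index is created. To increase or decrease it later, use the split or shrink API, each of which creates a new index with the desired number of shards.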
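
Split and shrink do not accept arbitrary shard counts. As I understand the documented rules, a split target must be a multiple of the source shard count and a shrink target must be a factor of it. The hypothetical helpers below check only that arithmetic; other preconditions (for example the index.number_of_routing_shards limit on splits, or the index being read-only) are not modeled.

```python
def can_split(source_shards, target_shards):
    """Split: the target count must be a larger multiple of the source count."""
    return target_shards > source_shards and target_shards % source_shards == 0

def can_shrink(source_shards, target_shards):
    """Shrink: the target count must be a smaller factor of the source count."""
    return 0 < target_shards < source_shards and source_shards % target_shards == 0

print(can_split(2, 6))    # True: 6 is a multiple of 2
print(can_split(2, 5))    # False: 5 is not a multiple of 2
print(can_shrink(8, 4))   # True: 4 is a factor of 8
print(can_shrink(8, 3))   # False: 3 is not a factor of 8
```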

But what if a node/shard fails? Will we lose data?

If there is no copy of the data, yes. This is why ElasticSearch supports replication and enables it by default: each index gets 1 replica unless configured otherwise.

Replication is configured at the index level. We can choose the number of replicas when creating an index (the default is 1, as mentioned above) and change it later. ElasticSearch then creates that many copies of each shard. These copies are known as "replicas" or "replica shards".
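
As an illustration, the settings body sent when creating an index might look like the dictionary built below, using the number_of_shards and number_of_replicas index settings. The helper names are hypothetical. The second function shows the resulting arithmetic: each primary shard gets number_of_replicas copies, so the total number of shard copies is primaries × (1 + replicas).

```python
import json

def index_settings(number_of_shards=1, number_of_replicas=1):
    """Build the body for an index-creation request (PUT /<index>)."""
    return {
        "settings": {
            "index": {
                "number_of_shards": number_of_shards,
                "number_of_replicas": number_of_replicas,
            }
        }
    }

def total_shard_copies(body):
    """Primaries plus all of their replicas."""
    settings = body["settings"]["index"]
    return settings["number_of_shards"] * (1 + settings["number_of_replicas"])

body = index_settings(number_of_shards=3, number_of_replicas=2)
print(json.dumps(body, indent=2))
print(total_shard_copies(body))  # 3 primaries + 6 replicas = 9
```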

The original shard is known as the "primary shard". A primary shard together with its replicas forms a replication group.

A replica is never stored on the same node as its primary, which avoids a single point of failure.

But what if I have only one node?

With only one node, replication has no effect: the replicas cannot be placed on the same node as their primary, so they stay unassigned, and a failure of that node still means loss of data. At least 2 nodes are needed for replication to be effective.

How is replication distributed?

1 node:

Replication is ineffective. Replicas stay unassigned.

2 nodes:

With the default configuration of 1, the replica is placed on the other node (the one that does not hold the primary).
With a configuration > 1, only one replica can be placed on the other node, because a node never holds two copies of the same shard. The remaining replicas stay unassigned until more nodes join.

> 2 nodes:

With a configuration > 1, the replicas are distributed across the other nodes, one copy of a shard per node, which improves availability and fault tolerance.
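
The three cases can be reproduced with a small toy allocator for a single replication group. It enforces only one rule, that a node holds at most one copy of a given shard; the real allocator also balances many shards and weighs factors such as disk usage, none of which is modeled here. The node names and the allocate function are hypothetical.

```python
def allocate(nodes, number_of_replicas):
    """Place one primary and its replicas, at most one copy per node."""
    copies = ["primary"] + [f"replica-{i}" for i in range(1, number_of_replicas + 1)]
    assigned = dict(zip(nodes, copies))   # zip stops when nodes run out
    unassigned = copies[len(nodes):]      # copies that found no free node
    return assigned, unassigned

# 1 node: the replica has nowhere to go.
print(allocate(["node-1"], 1))

# 2 nodes, 2 replicas: one replica is placed, the other stays unassigned.
print(allocate(["node-1", "node-2"], 2))

# 3 nodes, 2 replicas: every copy lands on its own node.
print(allocate(["node-1", "node-2", "node-3"], 2))
```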

What does replication achieve?

High availability (a replica can take over if the node holding the primary goes down)
Throughput (search queries can be served by replicas as well as the primary, so more queries can be handled at once)
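
The throughput point can be illustrated with a simple round-robin over the copies in one replication group. This is only a simplification for intuition: the way ElasticSearch actually picks a copy for each search is more sophisticated, and the copy names are made up.

```python
from collections import Counter
from itertools import cycle

# One replication group: a primary and two replicas holding the same data.
copies = ["primary", "replica-1", "replica-2"]
next_copy = cycle(copies)

# Route six search queries, each to the next copy in turn.
served = Counter(next(next_copy) for _ in range(6))

# Each copy handles a third of the load instead of the primary handling all of it.
print(dict(served))
```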

Can we take backups of the data?

Yes. Snapshots can be taken at any point in time, either of specific indexes or of the entire cluster, and can later be used to restore the data.
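
As a sketch of the snapshot REST calls, the helpers below build the method, path and body for creating and restoring a snapshot through the _snapshot endpoints. The repository and snapshot names are hypothetical, the repository is assumed to be registered already, and nothing is actually sent.

```python
import json

REPOSITORY = "my_backup_repo"       # hypothetical; must already be registered
SNAPSHOT = "snapshot_2020_10_17"    # hypothetical snapshot name

def snapshot_request(indices=None):
    """Create a snapshot; leaving out indices snapshots all of them."""
    body = {} if indices is None else {"indices": ",".join(indices)}
    return ("PUT", f"/_snapshot/{REPOSITORY}/{SNAPSHOT}", body)

def restore_request(indices=None):
    """Restore from the snapshot, optionally limited to some indices."""
    body = {} if indices is None else {"indices": ",".join(indices)}
    return ("POST", f"/_snapshot/{REPOSITORY}/{SNAPSHOT}/_restore", body)

for method, path, body in (snapshot_request(["people_index"]),
                           restore_request(["people_index"])):
    print(method, path, json.dumps(body))
```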