Saturday, October 17, 2020

ElasticSearch (Introduction)

Q. What is ElasticSearch?

ElasticSearch is an open source analytics and full text search engine.

It is often used to add search functionality to applications; one can build complex search features with ElasticSearch, and it can also be used to aggregate data and analyze the results.

Data is stored as documents (JSON Object) in ElasticSearch.

e.g. A document in ElasticSearch corresponds to a row in a relational database (a rough analogy, for better understanding).

This document contains fields, which correspond to the columns of a relational database row.

e.g

{
    "Name": "James Bond",
    "DOB": "01-01-1999",
    "EmployeeID": "10203"
}
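
As a minimal sketch (assuming a local single-node cluster listening on http://localhost:9200, the Python requests library, and a made-up index name "employees"), the document above could be indexed like this:

import requests

# Index the employee document with an explicit ID.
# PUT /<index>/_doc/<id> creates the document (or replaces it if the ID already exists).
doc = {
    "Name": "James Bond",
    "DOB": "01-01-1999",
    "EmployeeID": "10203",
}

resp = requests.put("http://localhost:9200/employees/_doc/10203", json=doc)
print(resp.json())  # shows the index, the document _id and the result ("created"/"updated")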


Data is stored in Nodes. There are multiple nodes and each node stores a part of the data.

A node is an instance of ElasticSearch. A machine can run many nodes, and to store large datasets we use multiple machines, each running one or more nodes. In practice it is best to have one node per machine.

A set of nodes is called a cluster.
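
As a quick sketch (same assumptions: a local cluster on http://localhost:9200 and the requests library), the nodes in a cluster and the overall cluster health can be inspected like this:

import requests

ES = "http://localhost:9200"

# List the nodes in the cluster (name, roles, heap usage, etc.).
print(requests.get(f"{ES}/_cat/nodes?v").text)

# Overall cluster health: status (green/yellow/red), number of nodes, shard counts.
print(requests.get(f"{ES}/_cluster/health").json())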

Every document in ElasticSearch is stored within an index (all documents need to be indexed).

e.g. A document containing a person's details like name, country, DOB etc. may be stored in an index named "people_index".

Indexes are broken into shards, which reside on nodes. Shards are created based on the capacity of the node(s) and placed on them. Sharding improves performance, since queries can be run in parallel across multiple shards.

By default, an index has one shard, and the number of shards can later be increased or decreased using the Split and Shrink APIs.
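
For instance (a sketch with the same local-cluster assumption; index names are made up), the number of primary shards is fixed at index-creation time, and the Split API can later produce a new index with more shards:

import requests

ES = "http://localhost:9200"

# Create an index with 2 primary shards.
requests.put(f"{ES}/people_index", json={
    "settings": {"number_of_shards": 2}
})

# The source index must be made read-only before it can be split.
requests.put(f"{ES}/people_index/_settings", json={
    "index.blocks.write": True
})

# Split into a new index whose shard count is a multiple of the original (here 2 -> 4).
requests.post(f"{ES}/people_index/_split/people_index_split", json={
    "settings": {"index.number_of_shards": 4}
})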

But what if a node/shard fails? Will we lose data? 

If there is no copy of the data, we will lose it. Replication is enabled by default in ElasticSearch; the default value is 1.

Replication is configured at the index level. We can choose the # of replicas we need when creating an index (the default is 1, as mentioned above). Copies of the shards are created; these copies are known as "replicas" or "replica shards".
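
A rough sketch (same assumptions; the index name is made up): the number of replicas can be set when the index is created and, unlike the number of shards, changed at any time afterwards:

import requests

ES = "http://localhost:9200"

# Create an index with 2 primary shards and 1 replica of each primary (the default).
requests.put(f"{ES}/orders_index", json={
    "settings": {"number_of_shards": 2, "number_of_replicas": 1}
})

# number_of_replicas is a dynamic setting, so it can be raised or lowered on a live index.
requests.put(f"{ES}/orders_index/_settings", json={
    "index": {"number_of_replicas": 2}
})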

The main shard is known as the "primary shard" and a replication group (primary and replicas) is created.

Naturally, the replicas are stored on different nodes than their primary shard, to avoid a single point of failure.

But what if I have only one node?

Replication on a single node makes no sense, since failure of that node would still lose the data. A minimum of two nodes is needed for replication to be effective.

How is replication distributed?

1 node:

  • Replication ineffective.

2 nodes:

  • With the default configuration of 1, the replica will be placed on the other node (the one not holding the primary).
  • With a configuration > 1, only one replica per shard can actually be allocated (on the other node); the extra copies stay unassigned, since ElasticSearch never places two copies of the same shard on the same node.

> 2 nodes:

  • With a configuration > 1, the replicas will be distributed across multiple nodes to improve availability and fault tolerance.

What does replication achieve?

  • High availability (a replica is available in case of downtime)
  • Throughput (search queries can be routed to replicas as well, improving performance)

Can we take backups of the data?

Snapshots can be taken at any point in time (at the index level or the cluster level) and can also be used to restore data.
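
A hedged sketch of the Snapshot APIs (this assumes a shared-filesystem repository whose path is already whitelisted under path.repo in elasticsearch.yml; all names are illustrative):

import requests

ES = "http://localhost:9200"

# Register a snapshot repository backed by a shared filesystem.
requests.put(f"{ES}/_snapshot/my_backup", json={
    "type": "fs",
    "settings": {"location": "/mnt/es_backups"}
})

# Take a snapshot of the whole cluster (an "indices" field can limit it to specific indices).
requests.put(f"{ES}/_snapshot/my_backup/snapshot_1?wait_for_completion=true")

# Restore a single index from the snapshot.
requests.post(f"{ES}/_snapshot/my_backup/snapshot_1/_restore", json={
    "indices": "people_index"
})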




     




    Saturday, September 26, 2020

    Cassandra (High Level details)



    What is Cassandra?
    • Open source
    • Distributed Database
    • NoSQL Database
    • Highly scalable, since lots of nodes can be added to a cluster
    • Highly available, since the nodes can be spread across DCs (data centers)
    • Fault tolerant and durable, with no single point of failure
    • Note that in Cassandra's distributed architecture there are multiple nodes, and these nodes can be replicated across multiple data centers to avoid a single point of failure.

    All nodes have the same functionality; no node is a master, and there is no master-slave architecture.

    If all nodes have the same functionality, how do the nodes know about the other nodes in the cluster?
    Ans. Snitch.

    What is a Snitch?

    As per official documentation: https://cassandra.apache.org/doc/latest/operating/snitch.html?highlight=snitch

    In Cassandra, a snitch has two functions:
    1. Teaches Cassandra enough about your network topology to route requests efficiently
    2. Allows Cassandra to spread replicas around your cluster to avoid correlated failures. It does this by grouping machines into “data centers” and “racks.” Cassandra will do its best not to have more than one replica on the same “rack” (which may not actually be a physical location).
    There are various snitch classes. Read more about them in the above link.
    SimpleSnitch is suitable for only a single DC.

    With PropertyFileSnitch, one lists the IP address of every node in the cluster, along with its rack and DC, in the cassandra-topology.properties file, using entries of the form <ip>=<DC>:<rack> (e.g. 192.168.1.10=DC1:RAC1).
    This is fine with a small number of nodes, but with, say, 1000 nodes it becomes very tedious to keep creating and modifying entries for every node.

    What is Gossip?
    Gossip is how the nodes in a cluster communicate with each other.
    Every second, each node sends information about itself (and whatever information it has about other nodes) to up to three other nodes.
    Thus, as mentioned before, there is no master-slave configuration; eventually every node has information about all the other nodes, and they keep sharing this information every second.
    Do note that this is internal communication between nodes in a cluster.

    Data is distributed across nodes (so rows in a table can be in different nodes).

    But if the rows of a table are spread across different nodes and a node goes down, would we lose data?
    We have the option to replicate data by setting a replication factor. If the replication factor is 1, the data is not replicated and may be lost. If the replication factor is increased to 2 or 3, the data is replicated across nodes and remains highly available even if some nodes go down.
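
    As a minimal sketch (assuming the cassandra-driver Python package, a locally reachable node, and made-up keyspace and data center names), the replication factor is set per keyspace when it is created:

from cassandra.cluster import Cluster

# Connect to one node; the driver discovers the rest of the cluster from it.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# NetworkTopologyStrategy lets you choose a replication factor per data center,
# so each row in this keyspace will be stored on 3 nodes in "dc1".
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo_ks
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3}
""")

cluster.shutdown()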








    Monday, August 10, 2020

    MapReduce example

    Q. Given a dataset of user IDs, movie IDs, ratings and timestamps, find the count of each rating value (the ratings breakdown).

    MapReduce code:

    Using the mrjob library, write the following script:


    from mrjob.job import MRJob
    from mrjob.step import MRStep


    class MovieRatings(MRJob):

        def steps(self):

            return [

                MRStep(mapper=self.mapper_get_ratings,

                       reducer=self.reducer_count_ratings)

            ]



        def mapper_get_ratings(self, _, line):

            (userID, movieID, rating, timestamp) = line.split('\t')

            yield rating, 1


    Here the mapper splits each tab-delimited line into a tuple and yields a 1 for every rating value it sees.


        def reducer_count_ratings(self, key, values):

            yield key, sum(values)


    The reducer sums the values for each key, where the keys are the rating values.



    if __name__ == '__main__':

        MovieRatings.run()


    The main run part.


    Execute the script locally or on a Hadoop cluster to get the breakdown of rating counts.
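
    A hedged example of how the job is typically invoked with mrjob (the script and input file names here are made up):

    python movie_ratings.py u.data                          # run locally
    python movie_ratings.py -r hadoop hdfs:///data/u.data   # run on the Hadoop cluster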


    Friday, August 7, 2020

    Hadoop Overview and Terminology

     Trying to list the components of Hadoop as an overview:

    1. HDFS (Hadoop Distributed File system)

    • Data Storage
    • Distribute the storage of Big Data across clusters of computers
    • Makes all clusters look like 1 file system
    • Maintains copies of data for backup and for use when a computer in the cluster fails.
    2. YARN (Yet another resource negotiator)
    • Data Processing
    • Manages the resources on the computer cluster
    • Decides which nodes are free/available to run tasks, etc.
    3. MapReduce
    • Process Data
    • Programming model that allows one to process data
    • Consists of mappers and reducers
    • Mappers transform data in parallel
      • Converts raw data into key value pairs
      • Same key can appear multiple times in a dataset
      • Shuffle and sort to sort and group the data by keys
    • Reducers aggregate the data
      • Process each key's values
    4. PIG
    • SQL Style syntax
    • Programming API to retrieve data
    5. HIVE
    • Like a SQL Database
    • Can run SQL queries on the database (makes the distributed data system look like a SQL DB)
    • Hadoop is not a relational DB and HIVE makes it look like one
    6. Apache Ambari
    • Gives view of your cluster
    • Resources being used by your systems
    • Execute HIVE Queries, import DB into HIVE
    • Execute PIG Queries
    7. SPARK
    • Allows to run queries on the data
    • Create spark queries using Python/Java/Scala
    • Quickly and efficiently process data in the cluster
    8. HBASE
    • NoSQL DB
    • Columnar Data store
    • Fast DB meant for large transactions
    9. Apache Storm
    • Process streaming data
    10. OOZIE
    • Schedule jobs on your cluster
    11. ZOOKEEPER
    • Coordinating resources
    • Tracks which nodes are functioning
    • Applications rely on zookeeper to maintain reliable and consistent performance across clusters
    12. SQOOP
    • Data Ingestion
    • Connector between relational DB and HDFS
    13. FLUME
    • Data Ingestion
    • Transport Web Logs to your cluster
    14. KAFKA
    • Data Ingestion
    • Collects data from clusters or web servers and broadcasts it to the Hadoop cluster.