Friday, August 7, 2020

Hadoop Overview and Terminology

An overview of the main components of the Hadoop ecosystem:

1. HDFS (Hadoop Distributed File System)
  • Data Storage
  • Distributes the storage of Big Data across a cluster of computers
  • Makes all the computers in the cluster look like one file system
  • Maintains copies of data for backup, which are also used when a computer in the cluster fails.
2. YARN (Yet Another Resource Negotiator)
  • Resource management for data processing
  • Manages the resources on the computer cluster
  • Decides which nodes are free/available to run tasks
3. MapReduce
  • Process Data
  • Programming model that allows one to process data
  • Consists of mappers and reducers
  • Mappers transform data in parallel
    • Convert raw data into key-value pairs
    • The same key can appear multiple times in a dataset
  • Shuffle and sort groups the mapper output by key and sorts it before it reaches the reducers
  • Reducers aggregate the data
    • Process each key's values
4. PIG
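  • Scripting language (Pig Latin) with a SQL-like style
  • High-level programming API to retrieve and transform data without writing MapReduce code directly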
5. HIVE
  • Like a SQL Database
  • Can run SQL queries on the database (makes the distributed data system look like a SQL DB)
  • Hadoop is not a relational DB, but HIVE makes it look like one
6. Apache Ambari
  • Gives a view of your cluster
  • Shows the resources being used by your systems
  • Execute HIVE Queries, import DB into HIVE
  • Execute PIG Queries
7. SPARK
  • Allows you to run queries on the data
  • Create Spark queries using Python/Java/Scala
  • Quickly and efficiently process data in the cluster
8. HBASE
  • NoSQL DB
  • Column-oriented data store
  • Fast DB meant for high transaction volumes
9. Apache Storm
  • Process streaming data
10. OOZIE
  • Schedule jobs on your cluster
11. ZOOKEEPER
  • Coordinating resources
  • Tracks which nodes are functioning
  • Applications rely on zookeeper to maintain reliable and consistent performance across clusters
12. SQOOP
  • Data Ingestion
  • Connector between relational DB and HDFS
13. FLUME
  • Data Ingestion
  • Transport Web Logs to your cluster
14. KAFKA
  • Data Ingestion
  • Collects data from a cluster or web servers and broadcasts it to the HADOOP cluster.
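
To make the HDFS item (1) more concrete, here is a toy sketch in plain Python of how a file could be split into blocks and copied onto several machines, so that losing one machine does not lose data. Everything here is an assumption for illustration: the function names `place_blocks` and `readable_after_failure`, the node names, and the simple round-robin placement are hypothetical. Real HDFS uses its own placement policy; the 128 MB block size and 3 copies used below are only the commonly cited defaults.

```python
def place_blocks(file_size_mb, block_size_mb, nodes, replication=3):
    """Split a file into blocks and assign each block to `replication` distinct nodes.

    Toy round-robin placement (an assumption, not the real HDFS policy).
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    # Ceiling division: a partially filled last block still counts as a block.
    num_blocks = -(-file_size_mb // block_size_mb)
    placement = {}
    for block in range(num_blocks):
        # Consecutive nodes (wrapping around) hold the copies of this block.
        placement[block] = [nodes[(block + i) % len(nodes)] for i in range(replication)]
    return placement


def readable_after_failure(placement, failed_node):
    """Return True if every block still has a copy on a node that did not fail."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


if __name__ == "__main__":
    nodes = ["node1", "node2", "node3", "node4"]  # hypothetical cluster
    placement = place_blocks(300, 128, nodes)
    for block, replicas in placement.items():
        print(block, replicas)
    print(readable_after_failure(placement, "node3"))
```

With three copies of each block, the file stays readable when any single node fails; with `replication=1` the same check fails as soon as the node holding a block goes down.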

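The MapReduce flow in item 3 (map, then shuffle and sort, then reduce) can be sketched in plain Python with a word count. This is only a single-machine illustration of the programming model, not Hadoop's API: the function names `mapper`, `shuffle_and_sort`, `reducer`, and `word_count` are hypothetical, and on a real cluster the mappers and reducers would run in parallel on different nodes.

```python
from collections import defaultdict


def mapper(line):
    """Map step: turn one raw line into (word, 1) key-value pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Group the values by key and return the groups sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reducer(key, values):
    """Reduce step: aggregate all the values collected for one key."""
    return key, sum(values)


def word_count(lines):
    # Each line is mapped independently, which is what makes the step parallelizable.
    mapped = [pair for line in lines for pair in mapper(line)]
    return [reducer(key, values) for key, values in shuffle_and_sort(mapped)]


if __name__ == "__main__":
    sample = ["big data big cluster", "data cluster data"]
    print(word_count(sample))
    # prints [('big', 2), ('cluster', 2), ('data', 3)]
```

Note how the same key ("data") is emitted several times by the mappers, and the shuffle and sort step is what brings those values together for a single reducer call.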