
Friday, August 7, 2020

Hadoop Overview and Terminology

An overview of the main components of the Hadoop ecosystem:

1. HDFS (Hadoop Distributed File System)
  • Data Storage
  • Distributes the storage of Big Data across the computers in a cluster
  • Makes all the computers in the cluster look like 1 file system
  • Maintains copies of the data, both as a backup and for use when a computer in the cluster fails
2. YARN (Yet Another Resource Negotiator)
  • Resource management for data processing
  • Manages the resources on the computer cluster
  • Decides which nodes are free/available to run tasks
3. MapReduce
  • Data Processing
  • Programming model that allows one to process data across the cluster
  • Consists of mappers and reducers
  • Mappers transform data in parallel
    • Convert raw data into key-value pairs
    • The same key can appear multiple times in a dataset
  • Shuffle and sort then groups and sorts the mapper output by key
  • Reducers aggregate the data
    • Process each key's values
4. PIG
  • Scripting language (Pig Latin) with SQL-style syntax
  • Programming API to retrieve data
5. HIVE
  • Like a SQL Database
  • Can run SQL queries on the database (makes the distributed data system look like a SQL DB)
  • Hadoop is not a relational DB, but HIVE makes it look like one
6. Apache Ambari
  • Gives a view of your cluster
  • Shows the resources being used by your systems
  • Execute HIVE Queries, import DB into HIVE
  • Execute PIG Queries
7. SPARK
  • Lets you run queries on the data
  • Create Spark queries using Python/Java/Scala
  • Quickly and efficiently processes data in the cluster
8. HBASE
  • NoSQL DB
  • Column-oriented data store
  • Fast DB meant for very high transaction rates (large volumes of reads and writes)
9. Apache Storm
  • Process streaming data
10. OOZIE
  • Schedule jobs on your cluster
11. ZOOKEEPER
  • Coordinates resources
  • Tracks which nodes are functioning
  • Applications rely on ZooKeeper to maintain reliable and consistent performance across the cluster
12. SQOOP
  • Data Ingestion
  • Connector between relational DB and HDFS
13. FLUME
  • Data Ingestion
  • Transports web logs to your cluster
14. KAFKA
  • Data Ingestion
  • Collects data from a cluster or from web servers and broadcasts it to the Hadoop cluster
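
The MapReduce flow from item 3 (mappers, then shuffle and sort, then reducers) can be sketched in plain Python as a word count. This is only an illustration that runs in a single process; a real Hadoop job distributes the mappers and reducers across the nodes of the cluster. The function names (`mapper`, `shuffle_and_sort`, `reducer`, `run_job`) and the sample input are made up for this sketch and are not part of any Hadoop API.

```python
from collections import defaultdict


def mapper(line):
    """Map step: convert one raw line into (key, value) pairs."""
    for word in line.lower().split():
        # The same key can be emitted many times across the dataset.
        yield (word, 1)


def shuffle_and_sort(pairs):
    """Group all values by key, then order the groups by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reducer(key, values):
    """Reduce step: aggregate all the values collected for one key."""
    return (key, sum(values))


def run_job(lines):
    """Run map, shuffle and sort, and reduce one after another."""
    mapped = [pair for line in lines for pair in mapper(line)]
    grouped = shuffle_and_sort(mapped)
    return [reducer(key, values) for key, values in grouped]


if __name__ == "__main__":
    sample = ["big data big cluster", "data node"]
    print(run_job(sample))
    # [('big', 2), ('cluster', 1), ('data', 2), ('node', 1)]
```

Each mapper call only looks at its own line, which is why the map step can run in parallel; the reducer only ever sees the values for a single key, which is what the shuffle and sort step guarantees.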