My Learning Cafe

Showing posts with label KAFKA. Show all posts

Showing posts with label KAFKA. Show all posts

Friday, August 7, 2020

Hadoop Overview and Terminology

Trying to list the components of Hadoop as an overview:

1. HDFS (Hadoop Distributed File system)

Data Storage
Distribute the storage of Big Data across clusters of computers
Makes all clusters look like 1 file system
Maintains copies of data for backup and also to be used when a computer in a clusters fails.

2. YARN (Yet another resource negotiator)

Data Processing
Manages the resources on the computer cluster
Decides which nodes are free/available to be run etc

3. MapReduce

Process Data
Programming model that allows one to process data
Consists of mappers and reducers
Mappers transform data in parallel

Converts raw data into key value pairs
Same key can appear multiple times in a dataset
Shuffle and sort to sort and group the data by keys

Reducers aggregate the data

Process each keys values

4. PIG

SQL Style syntax
Programming API to retrieve data

5. HIVE

Like a SQL Database
Can run SQL queries on the database (makes the distributed data system look like a SQL DB)
Hadoop is not a relational DB and HIVE makes it look like one

6. Apache Ambari

Gives view of your cluster
Resources being used by your systems
Execute HIVE Queries, import DB into HIVE
Execute PIG Queries

7. SPARK

Allows to run queries on the data
Create spark queries using Python/Java/Scala
Quickly and efficiently process data in the cluster

8. HBASE

NoSQL DB
Columnar Data store
Fast DB meant for large transactions

9. Apache Storm

Process streaming data

10. OOZIE

Schedule jobs on your cluster

11. ZOOKEEPER

Coordinating resources
Tracks which nodes are functioning
Applications rely on zookeeper to maintain reliable and consistent performance across clusters

12. SQOOP

Data Ingestion
Connector between relational DB and HDFS

13. FLUME

Data Ingestion
Transport Web Logs to your cluster

14. KAFKA

Data Ingestion
Collect data from a cluster or web servers and broadcast to HADOOP cluster.

Subscribe to: Posts (Atom)