An overview of the main components of the Hadoop ecosystem:
1. HDFS (Hadoop Distributed File System)
- Data storage
- Distributes the storage of big data across clusters of computers
- Makes the whole cluster look like one file system
- Maintains copies of data for backup and for use when a computer in the cluster fails
2. YARN (Yet Another Resource Negotiator)
- Data processing
- Manages the resources of the computer cluster
- Decides which nodes are free/available and what gets to run where
3. MapReduce
- Processes data
- Programming model that lets you process data in parallel
- Consists of mappers and reducers
- Mappers transform the data in parallel
- Convert raw input into key/value pairs
- The same key can appear multiple times in a dataset
- Shuffle and sort groups the mapped data by key
- Reducers aggregate the data
- Process all of the values for each key
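The map / shuffle-and-sort / reduce pipeline above can be sketched in plain Python (a toy word-count simulation of the model, not the Hadoop API; the function names and sample lines are illustrative):

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Emit a (word, 1) key/value pair for every word in the input line.
    # The same key (word) can appear many times across the dataset.
    for word in line.lower().split():
        yield (word, 1)

def shuffle_and_sort(pairs):
    # Group all values by key, like Hadoop does between map and reduce.
    pairs = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, [value for _, value in group]

def reducer(key, values):
    # Aggregate all of the values for a single key.
    return key, sum(values)

lines = ["the quick brown fox", "the lazy dog"]
mapped = [pair for line in lines for pair in mapper(line)]
counts = dict(reducer(k, v) for k, v in shuffle_and_sort(mapped))
# counts == {"brown": 1, "dog": 1, "fox": 1, "lazy": 1, "quick": 1, "the": 2}
```

In real Hadoop the mappers and reducers run on different nodes and the framework handles the shuffle over the network, but the data flow is the same.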
4. PIG
- High-level scripting language (Pig Latin) with a SQL-like feel
- A programming interface for querying and transforming data without writing mappers and reducers directly
5. HIVE
- Looks like a SQL database
- Lets you run SQL queries over the data (makes the distributed data system look like a SQL DB)
- Hadoop is not a relational DB, but HIVE makes it look like one
6. Apache Ambari
- Gives a view of your cluster
- Shows which resources your systems are using
- Can execute HIVE queries and import databases into HIVE
- Can execute PIG queries
7. SPARK
- Lets you run queries on your data
- Write SPARK jobs in Python, Java, or Scala
- Quickly and efficiently processes data in the cluster
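For flavor, Spark's Python API chains transformations on distributed collections (RDDs). The word count below is written the way it would look in PySpark, but uses a tiny local stand-in class since no cluster is assumed here (the `FakeRDD` name and its methods only mimic the real `pyspark` API):

```python
class FakeRDD:
    """Toy local stand-in for a Spark RDD; no cluster involved."""

    def __init__(self, items):
        self.items = list(items)

    def flatMap(self, fn):
        # Apply fn to each item and flatten the results into one collection.
        return FakeRDD(x for item in self.items for x in fn(item))

    def map(self, fn):
        # Apply fn to each item.
        return FakeRDD(fn(item) for item in self.items)

    def reduceByKey(self, fn):
        # Combine all values that share a key using fn.
        acc = {}
        for key, value in self.items:
            acc[key] = fn(acc[key], value) if key in acc else value
        return FakeRDD(acc.items())

    def collect(self):
        # Materialize the results as a list.
        return list(self.items)

# The classic Spark word count, chained exactly as in PySpark.
rdd = FakeRDD(["the quick brown fox", "the lazy dog"])
counts = dict(
    rdd.flatMap(lambda line: line.split())
       .map(lambda word: (word, 1))
       .reduceByKey(lambda a, b: a + b)
       .collect()
)
```

Note the same map/reduce ideas as in MapReduce, but expressed as a concise chain of operations, which is a large part of Spark's appeal.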
8. HBASE
- NoSQL DB
- Columnar Data store
- Fast DB built for very high transaction rates
9. Apache Storm
- Processes streaming data in real time
10. OOZIE
- Schedules jobs on your cluster
11. ZOOKEEPER
- Coordinates resources
- Tracks which nodes are up and functioning
- Applications rely on ZOOKEEPER to maintain reliable and consistent operation across the cluster
12. SQOOP
- Data ingestion
- Connector between relational databases and HDFS
13. FLUME
- Data ingestion
- Transports web logs into your cluster
14. KAFKA
- Data ingestion
- Collects data from clusters of web servers and broadcasts it to the HADOOP cluster