As mentioned in a previous blog post - Basics of Big data:
Hadoop is a set of tools that supports running applications on big data.
Hadoop addresses challenges created by Big Data, namely:
- Velocity: A lot of data arriving at high speed
- Volume: Lots of data being gathered (and the volume keeps growing)
- Variety: Data comes from varied sources (audio, video, log files - not neatly structured data)
Key attributes:
- Redundant and reliable - no data is lost if a machine fails.
- Mainly focused on batch processing
- Scalable - works on a distributed model (one codebase, many machines)
- Runs on commodity hardware (no special hardware needed); reliability comes from the software, not the hardware.
Architecture:
Hadoop consists of two main pieces:
- MapReduce - the processing part of Hadoop
- HDFS (Hadoop Distributed File System) - stores the data
On each machine, the MapReduce server is called the TaskTracker.
The HDFS server on each machine is called the DataNode.
Cluster
As you need more storage or more computing power, just add more machines; Hadoop scales relatively linearly.
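For illustration, here is a minimal sketch, assuming the classic Hadoop 1.x Java API and placeholder host names/ports, of how client code is pointed at a cluster: however many machines you add, clients only need the addresses of the two coordinators described in the next sections.

```java
import org.apache.hadoop.conf.Configuration;

public class ClusterConfig {
    public static Configuration clusterConf() {
        Configuration conf = new Configuration();
        // Hadoop 1.x property naming the HDFS coordinator (NameNode); host/port are placeholders
        conf.set("fs.default.name", "hdfs://namenode-host:9000");
        // Hadoop 1.x property naming the MapReduce coordinator (JobTracker); host/port are placeholders
        conf.set("mapred.job.tracker", "jobtracker-host:9001");
        return conf;
    }
}
```

In practice these values normally live in core-site.xml and mapred-site.xml rather than in code; scaling the cluster means adding DataNode/TaskTracker machines, not changing these two addresses.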
JobTracker is the coordinator for MapReduce
A single JobTracker keeps track of the jobs run, regardless of the size of the cluster. It accomplishes the following (a job-submission sketch follows this list):
- Accepts users' jobs
- Divides each job into tasks
- Assigns each task to an individual TaskTracker
- Tracks the status of the jobs
- Tracks the availability of each TaskTracker; if a TaskTracker fails (e.g. a hardware failure), its tasks are reassigned to another TaskTracker.
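To make the JobTracker's role concrete, here is a minimal word-count sketch using the classic (Hadoop 1.x, old-API) Java classes; the input/output paths come from the command line and the job name is arbitrary. The JobClient.runJob() call is what hands the job to the JobTracker, which then divides it into tasks, assigns them to TaskTrackers and tracks them as listed above.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.LongSumReducer;
import org.apache.hadoop.mapred.lib.TokenCountMapper;

public class WordCount {
    public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(WordCount.class);
        job.setJobName("word count");

        // Library mapper/reducer: emit (token, 1) per word, then sum the counts per token
        job.setMapperClass(TokenCountMapper.class);
        job.setCombinerClass(LongSumReducer.class);
        job.setReducerClass(LongSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // Input and output directories on HDFS (taken from the command line)
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Hands the job to the JobTracker, which splits it into map and reduce tasks,
        // assigns them to TaskTrackers, and re-runs tasks if a TaskTracker fails
        JobClient.runJob(job);
    }
}
```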
NameNode is the coordinator for HDFS
It keeps the information on where data is located.
A client talks to the NameNode (for reads/writes) and is redirected to the DataNode that holds the data.
Note: Data never flows via the NameNode; the NameNode only provides the data location details.
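As a sketch of that read path (the file path below is just a placeholder), the HDFS Java client hides this redirection: FileSystem.get() talks to the NameNode named in the configuration, and the stream returned by open() pulls the actual bytes from the DataNodes.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connects to the NameNode configured for this cluster
        FileSystem fs = FileSystem.get(conf);

        // open() asks the NameNode where the file's blocks live;
        // the returned stream then reads the bytes directly from the DataNodes
        FSDataInputStream in = fs.open(new Path("/user/demo/sample.txt"));
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
        reader.close();
    }
}
```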
[Diagram: a clustered environment with one JobTracker and one NameNode.]