Tuesday, April 15, 2014

Hadoop (Introduction)

Refer to the Big Data Introduction post.

As mentioned in a previous blog post - Basics of Big data:

Hadoop is a set of tools that supports running applications on big data.
Hadoop addresses the challenges created by Big Data, namely:
  • Velocity: lots of data arriving at high speed
  • Volume: lots of data being gathered (and the volume keeps growing)
  • Variety: data comes from varied sources (audio, video, log files - not organized/structured data)
Key attributes:
  • Redundant and Reliable - ensuring no loss of data if a machine fails.
  • Mainly focused on batch processing
  • Scalable - Works on a distributed model (one code - multiple machines)
  • Runs on commodity hardware (no special hardware needed); reliability is built into the software, not required of the hardware.
Architecture:
Hadoop consists of two main pieces:
  1. MapReduce - the processing part of Hadoop (a minimal word-count sketch follows below)
  2. HDFS (Hadoop Distributed File System) - stores the data
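To make the MapReduce half concrete, here is a minimal word-count sketch against the standard Hadoop MapReduce Java API (org.apache.hadoop.mapreduce). The class names WordCountMapper and WordCountReducer are just illustrative: the mapper emits (word, 1) pairs and the reducer sums them per word.

    // Word-count sketch: mapper and reducer (they could also live in separate
    // files or as static nested classes of a driver). Same package assumed.
    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit (word, 1) for every word in the input line.
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum the counts emitted for this word across all mappers.
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }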


On each machine, the MapReduce server is called the TaskTracker.
The HDFS server on each machine is called the DataNode.



Cluster
As you need more storage or more computing power - add more machines
Hadoop will scale relatively linearly.



The JobTracker is the coordinator for MapReduce.
A single JobTracker keeps track of all jobs run (regardless of the size of the cluster). It does the following (a small job-submission sketch follows this list):
  • Accepts users' jobs
  • Divides each job into tasks
  • Assigns each task to an individual TaskTracker
  • Tracks the status of the jobs
  • Tracks the availability of each TaskTracker; if there is a hardware failure, the affected tasks are reassigned to another TaskTracker
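To show where the JobTracker fits, here is a minimal driver sketch that submits the word-count job from the architecture section using the classic (Hadoop 1.x era) Java API. The client-side Job object hands the work to the JobTracker configured for the cluster, which then divides it into tasks for the TaskTrackers. The input/output paths are placeholders, not real ones.

    // Hypothetical driver: submits the word-count job to the cluster.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Classic constructor; Hadoop 2.x prefers Job.getInstance(conf, ...).
            Job job = new Job(conf, "word count");
            job.setJarByClass(WordCountDriver.class);

            // Mapper/Reducer classes from the sketch shown earlier.
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Input/output HDFS paths - placeholders.
            FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
            FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

            // Submitting the job hands it to the JobTracker, which divides it into
            // tasks and assigns them to TaskTrackers across the cluster.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }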


The NameNode is the coordinator for HDFS.
It keeps the information on where the data (blocks) are located.
A client talks to the NameNode (for reads/writes) and gets redirected to the DataNodes that hold the data.
Note: Data never flows via the NameNode. The NameNode only provides the data location details.
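As a rough illustration of that flow, here is a minimal HDFS read using the standard org.apache.hadoop.fs.FileSystem Java API. The open() call asks the NameNode where the file's blocks live; the bytes themselves are then streamed straight from the DataNodes. The NameNode address and file path below are placeholders.

    // Hypothetical HDFS read: namenode-host and the file path are placeholders.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Normally the NameNode address comes from core-site.xml.
            conf.set("fs.default.name", "hdfs://namenode-host:9000");

            FileSystem fs = FileSystem.get(conf);

            // open() fetches block locations from the NameNode; the data
            // itself is read directly from the DataNodes.
            Path file = new Path("/user/demo/input/sample.txt");
            try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(fs.open(file)))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
            fs.close();
        }
    }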



[Diagram: clustered environment with one JobTracker and one NameNode]


