Tuesday, April 15, 2014

Hadoop (Introduction)

Refer to the Big Data Introduction post.

As mentioned in a previous blog post - Basics of Big data:

Hadoop is a set of tools that supports running applications on big data.
Hadoop addresses the challenges created by Big Data, namely:
  • Velocity: lots of data arriving at high speed
  • Volume: lots of data being gathered (and the volume keeps growing)
  • Variety: data comes from varied sources (audio, video, log files - not organized/structured data)
Key attributes:
  • Redundant and Reliable - ensuring no loss of data if a machine fails.
  • Mainly focused on batch processing
  • Scalable - Works on a distributed model (one code - multiple machines)
  • Runs on commodity hardware (no special hardware needed); reliability is built into the software, not required of the hardware.
Architecture:
Hadoop consists of two main pieces:
  1. MapReduce - the processing part of Hadoop (a minimal word-count sketch follows below)
  2. HDFS (Hadoop Distributed File System) - stores the data
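To make the MapReduce half concrete, here is a minimal word-count sketch against the standard Hadoop MapReduce Java API (org.apache.hadoop.mapreduce). The class names WordCountMapper and WordCountReducer are just illustrative: the mapper emits (word, 1) pairs and the reducer sums them per word.

    // Word-count sketch: mapper and reducer (they could also live in separate
    // files or as static nested classes of a driver). Same package assumed.
    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit (word, 1) for every word in the input line.
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum the counts emitted for this word across all mappers.
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }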


On each machine, the MapReduce server is called the TaskTracker.
The HDFS server on each machine is called the DataNode.



Cluster
As you need more storage or more computing power - add more machines
Hadoop will scale relatively linearly.



The JobTracker is the coordinator for MapReduce.
A single JobTracker keeps track of all jobs run (regardless of the size of the cluster). It does the following (a small job-submission sketch follows this list):
  • Accepts users' jobs
  • Divides each job into tasks
  • Assigns each task to an individual TaskTracker
  • Tracks the status of the jobs
  • Tracks the availability of each TaskTracker; if there is a hardware failure, the affected tasks are reassigned to another TaskTracker
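To show where the JobTracker fits, here is a minimal driver sketch that submits the word-count job from the architecture section using the classic (Hadoop 1.x era) Java API. The client-side Job object hands the work to the JobTracker configured for the cluster, which then divides it into tasks for the TaskTrackers. The input/output paths are placeholders, not real ones.

    // Hypothetical driver: submits the word-count job to the cluster.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Classic constructor; Hadoop 2.x prefers Job.getInstance(conf, ...).
            Job job = new Job(conf, "word count");
            job.setJarByClass(WordCountDriver.class);

            // Mapper/Reducer classes from the sketch shown earlier.
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Input/output HDFS paths - placeholders.
            FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
            FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

            // Submitting the job hands it to the JobTracker, which divides it into
            // tasks and assigns them to TaskTrackers across the cluster.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }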


The NameNode is the coordinator for HDFS.
It keeps the information on where the data (blocks) are located.
A client talks to the NameNode (for reads/writes) and gets redirected to the DataNodes that hold the data.
Note: Data never flows via the NameNode. The NameNode only provides the data location details.
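As a rough illustration of that flow, here is a minimal HDFS read using the standard org.apache.hadoop.fs.FileSystem Java API. The open() call asks the NameNode where the file's blocks live; the bytes themselves are then streamed straight from the DataNodes. The NameNode address and file path below are placeholders.

    // Hypothetical HDFS read: namenode-host and the file path are placeholders.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Normally the NameNode address comes from core-site.xml.
            conf.set("fs.default.name", "hdfs://namenode-host:9000");

            FileSystem fs = FileSystem.get(conf);

            // open() fetches block locations from the NameNode; the data
            // itself is read directly from the DataNodes.
            Path file = new Path("/user/demo/input/sample.txt");
            try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(fs.open(file)))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
            fs.close();
        }
    }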



[Diagram: clustered environment with one JobTracker and one NameNode]


