Monday, February 3, 2020

Big Data and Hadoop Introduction

How do we define Big Data?

In layman's terms, big data refers to very large volumes of data, typically measured in petabytes (1,024 terabytes) or exabytes (1,024 petabytes). This data can be in structured or unstructured formats.

Such large volumes of data cannot be handled by traditional data processing software.

Big data attributes:

1. Volume - as mentioned above, we are dealing with very large volumes of data.
2. Variety (format) - data can be structured, semi-structured, or unstructured.
3. Velocity - data is generated and arrives at high speed, and often needs to be processed in near real time.
4. Veracity - in large data sets, the quality and accuracy of data can vary and need to be ascertained.

Examples:
- Amazon and Netflix recommendation engines based on subscriber interests.
- Uber geolocation data.
- Apple Siri - voice data.
- Driverless cars - sensor and image data processing.


What is Hadoop?

As per the textbook definition, "Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware."


What does this imply?

- Hadoop can store large volumes of data.
- It runs on commodity hardware, implying a lower cost per GB.



Hadoop stores data across the nodes of a cluster and can therefore scale horizontally. This helps reduce the cost per GB as volume grows. It also means that data is distributed across multiple machines and must be fetched from different nodes rather than from a single machine.
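To make the idea of distributed storage concrete, here is a minimal sketch (not from the original post) of a Java client reading a file from HDFS through Hadoop's FileSystem API. The NameNode address and file path below are hypothetical placeholders; the point is that the client simply asks for a path and Hadoop fetches the blocks from whichever nodes hold them.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; in a real setup this usually
        // comes from core-site.xml rather than being hard-coded.
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");

        // The FileSystem client asks the NameNode where the file's blocks
        // live; the blocks may be spread across many DataNodes in the cluster.
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/data/sample.txt"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}

The caller never needs to know which machine holds which block; that lookup is handled by the NameNode, which is what makes horizontal scaling transparent to applications.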
