Monday, August 10, 2020

MapReduce example

 Q. Given a data set of user IDs, movie IDs, ratings, and timestamps, find the number of occurrences of each rating (a first step toward identifying popular movies).

MapReduce code:

Use the mrjob library (MRJob and MRStep) and have the following script:


from mrjob.job import MRJob
from mrjob.step import MRStep


class MovieRatings(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_ratings,
                   reducer=self.reducer_count_ratings)
        ]

    def mapper_get_ratings(self, _, line):
        (userID, movieID, rating, timestamp) = line.split('\t')
        yield rating, 1


Here the mapper splits each tab-delimited line into a (userID, movieID, rating, timestamp) tuple and yields the pair (rating, 1): a count of 1 for every rating it sees.


    def reducer_count_ratings(self, key, values):
        yield key, sum(values)


For the reducer we sum the 1s for each key, where the keys are the rating values (the shuffle-and-sort phase has already grouped the values by key).
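To see what the mapper and reducer produce end to end, here is a plain-Python sketch of the map, shuffle-and-sort, and reduce flow. The sample lines and the driver loop are illustrative only; mrjob handles this plumbing for you:

```python
from collections import defaultdict

# A few made-up MovieLens-style lines: userID \t movieID \t rating \t timestamp
sample_lines = [
    "196\t242\t3\t881250949",
    "186\t302\t3\t891717742",
    "22\t377\t1\t878887116",
]

def mapper_get_ratings(line):
    # Same splitting logic as the MRJob mapper above
    (userID, movieID, rating, timestamp) = line.split('\t')
    yield rating, 1

def reducer_count_ratings(key, values):
    yield key, sum(values)

# Simulate the shuffle-and-sort phase: group mapper output by key
grouped = defaultdict(list)
for line in sample_lines:
    for key, value in mapper_get_ratings(line):
        grouped[key].append(value)

# Run the reducer once per key
results = {}
for key in sorted(grouped):
    for k, total in reducer_count_ratings(key, grouped[key]):
        results[k] = total

print(results)  # {'1': 1, '3': 2}
```

On a real cluster, the mapper calls run in parallel across nodes and the grouping happens across the network, but the logical flow is the same.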



if __name__ == '__main__':
    MovieRatings.run()


The standard entry point that tells mrjob to run the job.


Execute the same script locally or on a Hadoop cluster to get the breakdown of rating counts. With mrjob (file names here are illustrative, assuming the script is saved as movie_ratings.py and the input is a tab-delimited ratings file):

    python movie_ratings.py u.data                # run locally
    python movie_ratings.py -r hadoop u.data      # run on the Hadoop cluster


Friday, August 7, 2020

Hadoop Overview and Terminology

 An overview of the components of the Hadoop ecosystem:

1. HDFS (Hadoop Distributed File system)

  • Data Storage
  • Distribute the storage of Big Data across clusters of computers
  • Makes all clusters look like 1 file system
  • Maintains copies of data for backup and also to be used when a computer in the cluster fails.
2. YARN (Yet another resource negotiator)
  • Data Processing
  • Manages the resources on the computer cluster
  • Decides which nodes are free/available to run tasks, etc.
3. MapReduce
  • Process Data
  • Programming model that allows one to process data
  • Consists of mappers and reducers
  • Mappers transform data in parallel
    • Converts raw data into key value pairs
    • Same key can appear multiple times in a dataset
    • Shuffle and sort to sort and group the data by keys
  • Reducers aggregate the data
    • Process each key's values
4. PIG
  • SQL-like scripting language (Pig Latin)
  • Programming API to retrieve data
5. HIVE
  • Like a SQL Database
  • Can run SQL queries on the database (makes the distributed data system look like a SQL DB)
  • Hadoop is not a relational DB and HIVE makes it look like one
6. Apache Ambari
  • Gives view of your cluster
  • Resources being used by your systems
  • Execute HIVE Queries, import DB into HIVE
  • Execute PIG Queries
7. SPARK
  • Allows you to run queries on the data
  • Create spark queries using Python/Java/Scala
  • Quickly and efficiently process data in the cluster
8. HBASE
  • NoSQL DB
  • Columnar Data store
  • Fast DB meant for high transaction rates
9. Apache Storm
  • Process streaming data
10. OOZIE
  • Schedule jobs on your cluster
11. ZOOKEEPER
  • Coordinating resources
  • Tracks which nodes are functioning
  • Applications rely on zookeeper to maintain reliable and consistent performance across clusters
12. SQOOP
  • Data Ingestion
  • Connector between relational DB and HDFS
13. FLUME
  • Data Ingestion
  • Transport Web Logs to your cluster
14. KAFKA
  • Data Ingestion
  • Collects data from clusters or web servers and broadcasts it to the Hadoop cluster.
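The shuffle-and-sort step described under MapReduce (item 3) can be illustrated in a few lines of plain Python; the key/value pairs here are made up:

```python
from itertools import groupby

# Unordered key/value pairs, as mappers running in parallel might emit them
pairs = [("b", 1), ("a", 1), ("b", 1), ("a", 1), ("a", 1)]

# Shuffle and sort: order by key so identical keys become adjacent
pairs.sort(key=lambda kv: kv[0])

# Each reducer then receives one key together with all of its values
counts = {}
for key, group in groupby(pairs, key=lambda kv: kv[0]):
    counts[key] = sum(value for _, value in group)

print(counts)  # {'a': 3, 'b': 2}
```

itertools.groupby only groups adjacent items, which is exactly why the sort has to happen before the reduce, both here and in Hadoop.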