MapReduce example

Monday, August 10, 2020

Q. Given a data set of user IDs, movie IDs and ratings, find the number of ratings and the popular movies.

MapReduce code:

Use MRJob (from the mrjob Python library) with the following script:

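from mrjob.job import MRJob
from mrjob.step import MRStep

class MovieRatings(MRJob):

    def steps(self):
        # A single step: one mapper followed by one reducer.
        return [
            MRStep(mapper=self.mapper_get_ratings,
                   reducer=self.reducer_count_ratings)
        ]

    def mapper_get_ratings(self, _, line):
        # Each input line is tab-separated: userID, movieID, rating, timestamp.
        (userID, movieID, rating, timestamp) = line.split('\t')
        yield rating, 1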

Here we split each line on the tab delimiter and unpack the fields into a tuple.

The mapper emits the rating as the key and 1 as the value for every line.

    def reducer_count_ratings(self, key, values):
        yield key, sum(values)

In the reducer the keys are the rating values, so summing the 1s for each key gives the number of times that rating was given.
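To see what the mapper and reducer produce without installing mrjob or Hadoop, the data flow can be imitated in plain Python. This is only a rough sketch, not how mrjob works internally: the sample rows are made up, and the names map_rating, reduce_count and run_locally are hypothetical helpers introduced for illustration.

```python
from collections import defaultdict

# Made-up sample rows in the tab-separated layout the mapper expects:
# userID, movieID, rating, timestamp
SAMPLE_LINES = [
    "196\t242\t3\t881250949",
    "186\t302\t3\t891717742",
    "22\t377\t1\t878887116",
    "244\t51\t2\t880606923",
    "166\t346\t1\t886397596",
]


def map_rating(line):
    """Emit (rating, 1) for one input line, like mapper_get_ratings."""
    user_id, movie_id, rating, timestamp = line.split("\t")
    yield rating, 1


def reduce_count(key, values):
    """Sum the 1s collected for one rating, like reducer_count_ratings."""
    yield key, sum(values)


def run_locally(lines):
    """Imitate map, shuffle (group by key) and reduce in a single process."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_rating(line):
            grouped[key].append(value)

    results = {}
    for key in sorted(grouped):
        for out_key, total in reduce_count(key, grouped[key]):
            results[out_key] = total
    return results


rating_counts = run_locally(SAMPLE_LINES)
print(rating_counts)  # {'1': 2, '2': 1, '3': 2}
```

Note that the keys stay strings because the rating is never converted to a number after the split.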

if __name__ == '__main__':
    MovieRatings.run()

This is the entry point: it runs the job when the script is executed directly.

Run the script locally or on a Hadoop cluster (mrjob selects the Hadoop runner with -r hadoop) to get a breakdown of how many times each rating was given.
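The script above counts rating values, while the question also asks for popular movies. One possible reading, and it is an assumption here, is that a movie's popularity is the number of ratings it received. Under that assumption the same idea applies with movieID as the key, followed by a sort on the counts. The sketch below does this in plain Python rather than mrjob; the sample rows and the function names are hypothetical.

```python
from collections import Counter

# Made-up rows: userID, movieID, rating, timestamp (tab-separated)
SAMPLE_LINES = [
    "1\t50\t5\t100",
    "2\t50\t4\t101",
    "3\t50\t3\t102",
    "1\t172\t5\t103",
    "2\t172\t4\t104",
    "3\t133\t1\t105",
]


def count_ratings_per_movie(lines):
    """Stage 1: key on movieID instead of rating and count occurrences."""
    counts = Counter()
    for line in lines:
        user_id, movie_id, rating, timestamp = line.split("\t")
        counts[movie_id] += 1
    return counts


def rank_by_popularity(counts):
    """Stage 2: order movies from most to least rated (ties by movieID)."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


ranking = rank_by_popularity(count_ratings_per_movie(SAMPLE_LINES))
for movie_id, num_ratings in ranking:
    print(movie_id, num_ratings)
```

In an MRJob version, the first stage would correspond to changing the mapper to yield movieID, 1, and the sort would probably need a second MRStep.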
