Q. Given a data set of movie IDs, ratings and user IDs, find the number of ratings and the most popular movies.
MapReduce code:
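Use mrjob (the MRJob and MRStep classes) with the following script:
from mrjob.job import MRJob
from mrjob.step import MRStep

class MovieRatings(MRJob):
    def steps(self):
        # A single step: one mapper followed by one reducer.
        return [
            MRStep(mapper=self.mapper_get_ratings,
                   reducer=self.reducer_count_ratings)
        ]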
    def mapper_get_ratings(self, _, line):
        (userID, movieID, rating, timestamp) = line.split('\t')
        yield rating, 1
Here the mapper splits each tab-delimited line into a tuple of four fields.
It emits the pair (rating, 1) for every line, so each occurrence of a rating value is counted once.
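To see what the mapper emits without installing mrjob, here is a minimal plain-Python sketch of the same split-and-emit logic. The sample lines and the function name map_rating are made up for illustration; only the four-field tab-separated layout comes from the script above.

```python
# Hypothetical sample input in the layout the mapper expects:
# userID <tab> movieID <tab> rating <tab> timestamp
SAMPLE_LINES = [
    "1\t50\t3\t881250949",
    "2\t50\t3\t891717742",
    "3\t77\t1\t878887116",
]


def map_rating(line):
    """Mimic mapper_get_ratings: return the (rating, 1) pair for one line."""
    user_id, movie_id, rating, timestamp = line.split("\t")
    return rating, 1


pairs = [map_rating(line) for line in SAMPLE_LINES]
print(pairs)  # [('3', 1), ('3', 1), ('1', 1)]
```

Note that the rating stays a string because the line is only split, never converted to a number.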
    def reducer_count_ratings(self, key, values):
        yield key, sum(values)
The reducer receives each rating value as the key, together with all the 1s emitted for it, and sums them to get how many times that rating was given.
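Between the mapper and the reducer, the framework groups the emitted values by key. The following is a rough plain-Python sketch of that grouping followed by the summing reducer. The input pairs and the helper names shuffle and reduce_count are assumptions for illustration, not part of mrjob.

```python
from collections import defaultdict

# Hypothetical mapper output: one (rating, 1) pair per input line.
mapped_pairs = [("3", 1), ("3", 1), ("1", 1), ("5", 1), ("3", 1)]


def shuffle(pairs):
    """Group values by key, roughly what happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_count(key, values):
    """Mimic reducer_count_ratings: total the 1s collected for one rating."""
    return key, sum(values)


counts = dict(reduce_count(k, v) for k, v in shuffle(mapped_pairs).items())
print(sorted(counts.items()))  # [('1', 1), ('3', 3), ('5', 1)]
```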
if __name__ == '__main__':
    MovieRatings.run()
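This is the entry point: it runs the job when the script is executed directly.
Run the script locally (mrjob's default) or on a Hadoop cluster (with the -r hadoop option) to get a breakdown of how many times each rating value occurs.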
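The script above counts occurrences of each rating value; it does not rank movies. For the "popular movies" half of the question, one possible approach is to key on movieID instead of rating and then sort by count. The sketch below does this in plain Python and assumes that "popular" means "most rated"; the sample lines and the function names are hypothetical, and turning the sort into a second mrjob step is not shown.

```python
from collections import Counter

# Hypothetical sample input: userID <tab> movieID <tab> rating <tab> timestamp
SAMPLE_LINES = [
    "1\t50\t5\t881250949",
    "2\t50\t4\t891717742",
    "3\t50\t3\t878887116",
    "1\t77\t2\t881250950",
    "2\t77\t5\t891717743",
    "3\t12\t1\t878887117",
]


def count_ratings_per_movie(lines):
    """Map each line to (movieID, 1) and sum per movie, like the job above."""
    counts = Counter()
    for line in lines:
        user_id, movie_id, rating, timestamp = line.split("\t")
        counts[movie_id] += 1
    return counts


def most_rated(lines, top_n=2):
    """Return the top_n (movieID, count) pairs, most rated first."""
    return count_ratings_per_movie(lines).most_common(top_n)


print(most_rated(SAMPLE_LINES))  # [('50', 3), ('77', 2)]
```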