Monday, August 10, 2020

MapReduce example

 Q. Given a data set of user IDs, movie IDs and ratings, find how many times each rating was given, and the popular movies.

MapReduce code:

Use the mrjob library (MRJob) with the following script:

from mrjob.job import MRJob
from mrjob.step import MRStep


class MovieRatings(MRJob):

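    def steps(self):
        # Single step: map each line to (rating, 1), then sum per rating.
        return [
            MRStep(mapper=self.mapper_get_ratings,
                   reducer=self.reducer_count_ratings)
        ]

    def mapper_get_ratings(self, _, line):
        # Each input line is: userID <tab> movieID <tab> rating <tab> timestamp
        (userID, movieID, rating, timestamp) = line.split('\t')
        yield rating, 1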


Here we split each tab-delimited line into a tuple of (userID, movieID, rating, timestamp). The mapper then emits the rating as the key with a count of 1, once for every line.


    def reducer_count_ratings(self, key, values):
        # key is a rating value; values are the 1s emitted by the mapper
        yield key, sum(values)


For the reducer, the keys are the rating values. Summing the 1s emitted for each key gives the number of times that rating was given.
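To see what the mapper and reducer do together without installing mrjob or Hadoop, here is a rough sketch in plain Python. It reuses the same mapper and reducer logic as standalone functions and imitates the group-by-key step that the framework normally does between them. The sample rows and the run_local helper are made up for illustration; they are not part of mrjob.

```python
from collections import defaultdict

# Hypothetical sample rows: userID, movieID, rating, timestamp (tab-separated).
SAMPLE_LINES = [
    "1\t50\t5\t881250949",
    "2\t50\t4\t881250950",
    "3\t172\t5\t881250951",
    "4\t133\t1\t881250952",
]


def mapper_get_ratings(_, line):
    # Same logic as the mapper in the job: emit (rating, 1) per line.
    userID, movieID, rating, timestamp = line.split('\t')
    yield rating, 1


def reducer_count_ratings(key, values):
    # Same logic as the reducer in the job: sum the 1s for one rating.
    yield key, sum(values)


def run_local(lines):
    # Hypothetical helper standing in for the framework's shuffle step:
    # collect every mapped value under its key.
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper_get_ratings(None, line):
            grouped[key].append(value)

    # Call the reducer once per key, in sorted key order.
    results = {}
    for key in sorted(grouped):
        for out_key, out_value in reducer_count_ratings(key, grouped[key]):
            results[out_key] = out_value
    return results


if __name__ == '__main__':
    for rating, count in run_local(SAMPLE_LINES).items():
        print(rating, count)
```

Note that the keys stay as strings here, because the mapper yields the rating exactly as it was read from the line.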



if __name__ == '__main__':
    MovieRatings.run()


The main block starts the job when the script is run directly.


Run the same script locally or on a Hadoop cluster to get a breakdown of how many times each rating was given.
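The question also asks for the popular movies, which the script above does not cover. One possible way to get there is to key the mapper on movieID instead of rating and then sort the totals. The sketch below does this in plain Python, assuming "popular" means "rated the most times" (that definition is an assumption, as are the sample rows and the most_rated helper). In an mrjob job, the sort would typically be a second step rather than a local sort.

```python
from collections import defaultdict

# Hypothetical sample rows: userID, movieID, rating, timestamp (tab-separated).
SAMPLE_LINES = [
    "1\t50\t5\t881250949",
    "2\t50\t4\t881250950",
    "3\t172\t5\t881250951",
    "4\t133\t1\t881250952",
    "5\t50\t3\t881250953",
    "6\t172\t2\t881250954",
]


def mapper_get_movies(_, line):
    # Emit (movieID, 1) so the count is per movie rather than per rating.
    userID, movieID, rating, timestamp = line.split('\t')
    yield movieID, 1


def reducer_count_movies(movieID, counts):
    # Sum the 1s to get how many ratings this movie received.
    yield movieID, sum(counts)


def most_rated(lines, top_n=3):
    # Hypothetical helper: group mapper output by movieID (the shuffle step).
    grouped = defaultdict(list)
    for line in lines:
        for movieID, one in mapper_get_movies(None, line):
            grouped[movieID].append(one)

    # Reduce each group to a (movieID, total) pair.
    totals = []
    for movieID, counts in grouped.items():
        totals.extend(reducer_count_movies(movieID, counts))

    # Highest count first; ties broken by movieID for a repeatable order.
    totals.sort(key=lambda pair: (-pair[1], pair[0]))
    return totals[:top_n]


if __name__ == '__main__':
    for movieID, total in most_rated(SAMPLE_LINES):
        print(movieID, total)
```

Counting ratings says nothing about whether a movie was rated well; averaging the rating per movie would be a different job.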

