Kaggle.com offers a variety of public datasets, including one of Goodreads Book Reviews. I thought I’d take a quick and dirty look at this thing and see what I could see. I wrote a little program to gather the ratios of ratings to reviews. I’ve been curious about that, and overall the dataset seems to suggest that for every 33 ratings there is 1 written review. That was gleaned from an overall calculation, row by row.
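For reference, that 33:1 figure is just the grand total of the ratingsCount column divided by the grand total of the reviewsCount column (the two totals below appear in a comment in the code at the end of the post):

```python
# Grand totals computed by the script, row by row, over the whole dataset.
total_ratings = 3465722733
total_reviews = 104000732

# Roughly 33 ratings for every written review.
print(round(total_ratings / total_reviews))  # prints 33
```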
Looking further into it, I found that the dataset is not at all clean – the columns often don't line up, so the ratingsCount column might hold a number or it might hold 'J.K. Rowling'. It needs a lot of work, which I'm a little too lazy to do this morning, so instead I went through again and ignored all the rows whose rating wasn't in the 1-5 star range. This gave me some bad results as well. The 4-star column totals seem worthless, but the others seem reasonably consistent and provided one possible insight:
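A tidier way to skip the misaligned rows (a sketch of an alternative, not what the script below does – that one uses a try/except) is to coerce the numeric columns with pandas and drop whatever fails to parse. The tiny DataFrame here is a made-up stand-in for the messy CSV:

```python
import pandas as pd

# Hypothetical mini-frame: one good row, and one row where the columns
# slipped so an author name landed in ratingsCount.
df = pd.DataFrame({
    'rating': ['4', '3'],
    'ratingsCount': ['1200', 'J.K. Rowling'],
    'reviewsCount': ['40', '12'],
})

# errors='coerce' turns anything unparseable into NaN, which dropna() removes.
for col in ['rating', 'ratingsCount', 'reviewsCount']:
    df[col] = pd.to_numeric(df[col], errors='coerce')
df = df.dropna(subset=['rating', 'ratingsCount', 'reviewsCount'])

# Keep only rows whose rating is in the 1-5 star range.
df = df[df['rating'].between(1, 5)]
print(len(df))  # prints 1 -- the misaligned row is gone
```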
ratingsCount: 1 592
ratingsCount: 2 2378
ratingsCount: 3 55836
ratingsCount: 4 52425090
ratingsCount: 5 11170
reviewsCount: 1 79
reviewsCount: 2 298
reviewsCount: 3 5540
reviewsCount: 4 1661702
reviewsCount: 5 1085
1 star ratio, ratings to reviews: 7
2 star ratio, ratings to reviews: 7
3 star ratio, ratings to reviews: 10
4 star ratio, ratings to reviews: 31
5 star ratio, ratings to reviews: 10
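The per-star ratios above are just each ratingsCount total divided by the matching reviewsCount total; the script does integer division, which is why they print as whole numbers:

```python
# Totals from the output above.
ratings = {1: 592, 2: 2378, 3: 55836, 4: 52425090, 5: 11170}
reviews = {1: 79, 2: 298, 3: 5540, 4: 1661702, 5: 1085}

for star in range(1, 6):
    # floor division, as in the script
    print(star, "star ratio:", ratings[star] // reviews[star])
# prints 7, 7, 10, 31, 10
```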
If this data is to be believed, it looks to me like the less someone likes a book, the more likely they are to say something about it (compare the 1- and 2-star ratios to the 3- and 5-star ones). Negativity is more eager to express itself. I feel like this falls in line with natural intuition, and crosses over to other areas of life: social media, the news media in general, politics, and so on.
The dataset is here.
Python code:

from argparse import ArgumentParser
import csv

import pandas as pd


class GoodreadsAnalysis:
    def __init__(self):
        self.args = self.arguments()
        self.parse_csv()

    def arguments(self):
        """
        argument parser
        :return: parsed args
        """
        parser = ArgumentParser()
        parser.add_argument('--input_file', default="./goodreads_book_reviews.csv")
        return parser.parse_args()

    def parse_csv(self):
        columns = ['bookID', 'title', 'author', 'rating', 'ratingsCount',
                   'reviewsCount', 'reviewerName', 'reviewerRatings', 'review']
        df = pd.read_csv(self.args.input_file, names=columns, quoting=csv.QUOTE_NONE)
        print(df.head())

        ratings_count = {}
        reviews_count = {}
        total_ratings = 0
        total_reviews = 0

        for index, row in df.iterrows():
            try:
                # parse all three fields up front so a bad reviewsCount
                # can't leave the two tallies out of sync
                rating = int(row.rating)
                n_ratings = int(row.ratingsCount)
                n_reviews = int(row.reviewsCount)
            except (ValueError, TypeError):
                # bad column (misaligned row) -- skip it
                continue
            if 0 < rating < 6:
                ratings_count[rating] = ratings_count.get(rating, 0) + n_ratings
                reviews_count[rating] = reviews_count.get(rating, 0) + n_reviews

        for k, v in sorted(ratings_count.items()):
            print("ratingsCount:", k, v)
            total_ratings += v
        print()
        for k, v in sorted(reviews_count.items()):
            print("reviewsCount:", k, v)
            total_reviews += v

        print("totals (ratings, reviews):", total_ratings, total_reviews)
        # 3465722733 104000732
        # 33:1
        for i in range(1, 6):
            # integer division, matching the whole-number ratios above
            ratio = ratings_count[i] // reviews_count[i]
            print("{} star ratio, ratings to reviews:".format(i), ratio)


if __name__ == '__main__':
    g = GoodreadsAnalysis()