Kaggle.com offers a variety of public datasets, including one of Goodreads Book Reviews. I thought I’d take a quick and dirty look at this thing and see what I could see. I wrote a little program to gather the ratios of ratings to reviews. I’ve been curious about that, and overall the dataset seems to suggest that for every 33 ratings there is 1 written review. That was gleaned from an overall calculation, row by row.
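For reference, that 33:1 figure is just the grand total of the ratingsCount column divided by the grand total of the reviewsCount column (the two totals below appear in a comment in the code at the end of the post):

```python
# Grand totals computed by the script, row by row, over the whole dataset.
total_ratings = 3465722733
total_reviews = 104000732

# Roughly 33 ratings for every written review.
print(round(total_ratings / total_reviews))  # prints 33
```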
Looking further into it, I found that the dataset is not at all clean – the columns often don't line up, so the ratingsCount column might hold a number or it might hold 'J.K. Rowling'. It needs a lot of work, which I'm a little too lazy to do this morning, so instead I went through again and ignored all the rows whose rating wasn't in the 1-5 star range. This gave me some bad results as well. The 4-star column totals seem worthless, but the others seem reasonably consistent and provided one possible insight:
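A tidier way to skip the misaligned rows (a sketch of an alternative, not what the script below does – that one uses a try/except) is to coerce the numeric columns with pandas and drop whatever fails to parse. The tiny DataFrame here is a made-up stand-in for the messy CSV:

```python
import pandas as pd

# Hypothetical mini-frame: one good row, and one row where the columns
# slipped so an author name landed in ratingsCount.
df = pd.DataFrame({
    'rating': ['4', '3'],
    'ratingsCount': ['1200', 'J.K. Rowling'],
    'reviewsCount': ['40', '12'],
})

# errors='coerce' turns anything unparseable into NaN, which dropna() removes.
for col in ['rating', 'ratingsCount', 'reviewsCount']:
    df[col] = pd.to_numeric(df[col], errors='coerce')
df = df.dropna(subset=['rating', 'ratingsCount', 'reviewsCount'])

# Keep only rows whose rating is in the 1-5 star range.
df = df[df['rating'].between(1, 5)]
print(len(df))  # prints 1 -- the misaligned row is gone
```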
ratingsCount: 1 592
ratingsCount: 2 2378
ratingsCount: 3 55836
ratingsCount: 4 52425090
ratingsCount: 5 11170
reviewsCount: 1 79
reviewsCount: 2 298
reviewsCount: 3 5540
reviewsCount: 4 1661702
reviewsCount: 5 1085
1 star ratio, ratings to reviews: 7
2 star ratio, ratings to reviews: 7
3 star ratio, ratings to reviews: 10
4 star ratio, ratings to reviews: 31
5 star ratio, ratings to reviews: 10
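The per-star ratios above are just each ratingsCount total divided by the matching reviewsCount total; the script does integer division, which is why they print as whole numbers:

```python
# Totals from the output above.
ratings = {1: 592, 2: 2378, 3: 55836, 4: 52425090, 5: 11170}
reviews = {1: 79, 2: 298, 3: 5540, 4: 1661702, 5: 1085}

for star in range(1, 6):
    # floor division, as in the script
    print(star, "star ratio:", ratings[star] // reviews[star])
# prints 7, 7, 10, 31, 10
```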
If this data is to be believed, it looks to me like the less someone likes a book, the more likely they are to say something about it (compare the 1- and 2-star ratios to the 3- and 5-star ones). Negativity is more eager to express itself. I feel like this falls in line with natural intuition, and crosses over to other areas of life: social media, the news media in general, politics, and so on.
The dataset is here.
Python code:

from argparse import ArgumentParser
import csv

import pandas as pd


class GoodreadsAnalysis:
    def __init__(self):
        self.args = self.arguments()
        self.parse_csv()

    def arguments(self):
        """
        argument parser
        :return: parsed args
        """
        parser = ArgumentParser()
        parser.add_argument('--input_file', default="./goodreads_book_reviews.csv")
        return parser.parse_args()

    def parse_csv(self):
        columns = ['bookID', 'title', 'author', 'rating', 'ratingsCount',
                   'reviewsCount', 'reviewerName', 'reviewerRatings', 'review']
        df = pd.read_csv(self.args.input_file, names=columns, quoting=csv.QUOTE_NONE)
        print(df.head())

        ratings_count = {}
        reviews_count = {}
        total_ratings = 0
        total_reviews = 0

        for index, row in df.iterrows():
            try:
                # parse all three fields up front so a bad reviewsCount
                # can't leave the two tallies out of sync
                rating = int(row.rating)
                n_ratings = int(row.ratingsCount)
                n_reviews = int(row.reviewsCount)
            except (ValueError, TypeError):
                # bad column (misaligned row) -- skip it
                continue
            if 0 < rating < 6:
                ratings_count[rating] = ratings_count.get(rating, 0) + n_ratings
                reviews_count[rating] = reviews_count.get(rating, 0) + n_reviews

        for k, v in sorted(ratings_count.items()):
            print("ratingsCount:", k, v)
            total_ratings += v
        print()
        for k, v in sorted(reviews_count.items()):
            print("reviewsCount:", k, v)
            total_reviews += v

        print("totals (ratings, reviews):", total_ratings, total_reviews)
        # 3465722733 104000732
        # 33:1
        for i in range(1, 6):
            # integer division, matching the whole-number ratios above
            ratio = ratings_count[i] // reviews_count[i]
            print("{} star ratio, ratings to reviews:".format(i), ratio)


if __name__ == '__main__':
    g = GoodreadsAnalysis()