I recently had the pleasure of attempting an advanced statistics class called “Mining Massive Data Sets” on Coursera. My math skills were not quite up to it, so I’m backtracking now and taking lower-level statistics classes to work my way back up to it. One of the interesting parts of the class covered how Netflix pioneered its “recommendation engine”, using methodologies which have become the standard for all such beasts. Mainly it centers on the concept of “people like you”. Your own feedback is loaded into a massive data set along with everybody else’s, and a lot of math is used to arrange and group the data into “neighbors” and “communities” in an attempt to feed you other products you are more likely to purchase. The more data the better, of course, but in the end it is all just math, and it suffers from what I call “dimensionality poverty”: there are not enough factors taken into account. The only factor, really, is the five-star rating, which is a static beast. Once entered, there it remains. It stands alone, dependent on nothing, honored as the true and only fact, but is it?
What about time? You rate something once and only once. Or, if you rate it again at a later time, the newer rating supplants the old. Which was more valid? Were you wrong before or are you wrong now? Or maybe you were wrong neither time: when you first watched Legally Blonde you gave it 4 stars, and the second time around you gave it 5. Both were authentic and valid ratings based on your reaction, but “there can be only one”.
You can never do something for the first time more than once. This is another limitation, because even if you could add a time dimension to the rating system, your second rating would be influenced by your initial experience, your third by the first two, and so on.
Restaurant ratings suffer from the same problem – not only time itself, but age, weather, season, and a multitude of other factors. If you ate a meal at The Burger Squat every day for a year, each experience would have its own rating. You might have different food, it might be prepared differently, you’d be in a different mood, it might be raining and so on.
The recommendation engines are really just correlation engines, calculating how strongly “if you liked this you’ll like that” holds and recommending whatever clears some coefficient – greater than 0.40 for example. At least they have some basis, some data to go on. In my bookstore years I would often get people asking me to recommend something with no context whatsoever. I did not know them, or the history of their reading, or the mood they were in, and even when they tried to fill in some of these details it was still something of a wild-ass guess. I remember a man once insisting that I sell him whatever I thought was the best novel I’d ever read. At the time it was One Hundred Years of Solitude. (Last year, a few decades later, I read it again, and while I enjoyed the first half, I got bored and didn’t bother finishing it!) A few weeks later the man returned to the store in a rage, threw the book down on the counter and shouted that it was the worst piece of crap he’d ever come across and how dare I!
So you never know.
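For the curious, here is roughly what that “if you liked this you’ll like that” calculation might look like. This is only a minimal sketch under my own assumptions, not how Netflix or anybody else actually does it: the ratings are invented, the function names (pearson, item_correlation, recommend) are made up for illustration, and plain Pearson correlation between two titles is just one plausible choice of coefficient. The 0.40 cutoff is the example figure from above.

```python
from math import sqrt

# Hypothetical five-star ratings, user -> {title: stars}. All data is invented.
ratings = {
    "ann":  {"Movie A": 5, "Movie B": 4, "Movie C": 1},
    "bob":  {"Movie A": 4, "Movie B": 5, "Movie C": 2},
    "cara": {"Movie A": 1, "Movie B": 2, "Movie C": 5},
    "dev":  {"Movie A": 2, "Movie B": 1, "Movie C": 4},
}

THRESHOLD = 0.40  # the example cutoff; an assumption, not a known industry value


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if undefined)."""
    n = len(xs)
    if n < 2:
        return 0.0
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # everyone gave the same rating, so there is nothing to correlate
    return cov / sqrt(var_x * var_y)


def item_correlation(ratings, a, b):
    """Correlate two titles using only the users who rated both."""
    pairs = [(r[a], r[b]) for r in ratings.values() if a in r and b in r]
    return pearson([p[0] for p in pairs], [p[1] for p in pairs])


def recommend(ratings, liked, threshold=THRESHOLD):
    """Titles whose correlation with `liked` exceeds the threshold, best first."""
    titles = {t for r in ratings.values() for t in r}
    scored = [(t, item_correlation(ratings, liked, t)) for t in titles if t != liked]
    return sorted(
        [(t, c) for t, c in scored if c > threshold],
        key=lambda tc: tc[1],
        reverse=True,
    )


print(recommend(ratings, "Movie A"))
```

With these made-up numbers, Movie B correlates with Movie A at 0.8 and clears the bar, while Movie C comes out at -1.0 and is never suggested. Notice that the only input is one star rating per person per title, with no time, mood, or weather anywhere in sight.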
Yesterday I read a review of the TV series “Black Mirror”, and while I agreed with the reviewer’s rave about how interesting and excellent the productions are, her choice of “best” episode was not even in my top three, and my own faves didn’t even get mentioned in her article.
Maybe someday there will be so much data that any number of factors can be fed into a recommendation engine, where they can say, hey, you liked Book A when you were 12 and played with toy soldiers and Book B when you were 14 and dressed in all black and Movie C when you were 19 and brewing artisanal beer and Music D when you were 23 and stoned out of your fucking mind and City E when you were 31 and depressed because you just got dumped and hit the road so now that you’re 35 there’s a chance you might like Restaurant F!!