> Here’s a challenge for students of expository writing: review a popular product on Amazon and aim to get your review chosen by readers as “most helpful.” It’s dead hard. The product review, as a literary form, is in its heyday. Polemical, evocative, witty, narrative, exhortative, furious, ironic, off the cuff....
What I found amusing was that, right after reading this article, I got a notification that the journal version of our paper, "Estimating the Helpfulness and Economic Impact of Product Reviews: Mining Text and Reviewer Characteristics," co-authored with my frequent collaborator Anindya Ghose, has been accepted for publication in the IEEE Transactions on Knowledge and Data Engineering (TKDE) journal.
As the title suggests, one of the problems we attack in the paper is how to predict the usefulness of a product review. For example, on Amazon you will see, at the top of many reviews, how many people considered a particular review helpful:
So, the question is: Can we predict how helpful a particular review will be?
Our first attempts to address this problem appeared in the WITS 2006 and ICEC 2007 papers. Following the scientific zeitgeist, a large number of other papers appeared in those years, all tackling the question of predicting review helpfulness. (See the actual paper for references.)
What I found rather surprising was the relative ease of the task. A few relatively straightforward features can be used to predict with good accuracy whether a review will be deemed helpful or not.
- Check the readability of the review, as measured by one of the many readability metrics; check the number of spelling errors; and measure basic statistics of the text, such as review length. Using just the readability and the fraction of spelling errors in the review, we can estimate with 70%-80% accuracy whether a review will be deemed helpful or not.
- Check for spelling errors in the review and check the grammar: To get a proxy variable for the spelling errors, just compare all the words in the review against an online English dictionary. If a word does not appear in the dictionary, it is likely a typo. (Yes, I know about acronyms, proper names, etc. We only need a rough proxy.) It is also possible to check the grammar (although that did not make it into the paper): Just compute the log-likelihood of a review based on the frequencies of its unigrams, bigrams, and trigrams under the statistics from Google N-grams. If the likelihood is very low, the review is likely to have grammatical errors. To ensure that the log-likelihoods are comparable across reviews, we compare the log-likelihood of each review against the median for reviews with similar readability scores. (Update: Amazingly enough, Zappos noticed the same thing and took action to improve the spelling and grammar of its reviews.)
- Check the history of the reviewer. If the reviewer has written helpful reviews in the past, future reviews are also highly likely to be helpful. Also, if a reviewer has disclosed personal details (name, location, etc.), the reviews are more likely to be helpful. Again, using just reviewer history and disclosure details, we get 70%-80% accuracy, as measured by the AUC metric.
- Check the "subjectivity" of the review. We call a review objective if it contains mainly information that can be found in the product description and specs. A subjective review contains information that depends on the personal experiences of the reviewer. Helpful reviews tend to contain a mix of both.
Interestingly enough, all three feature sets seem to have equivalent predictive power. Even using them all together does not seem to substantially increase the predictive performance.
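Since the 70%-80% figures above are AUC values, it may help to see what that metric actually computes: the probability that a randomly chosen helpful review is scored above a randomly chosen unhelpful one. A self-contained sketch on toy data (the labels and scores are made up; this is not the paper's pipeline):

```python
def auc(labels, scores):
    """AUC as a rank statistic: the probability that a random positive
    outscores a random negative, with ties counting as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need examples of both classes")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy example: 1 = deemed helpful, scores from some feature-based model.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.3, 0.7, 0.4, 0.5, 0.2]
print(auc(labels, scores))  # 8/9, well above the 0.5 of a random ranker
```

An AUC of 0.5 corresponds to random guessing, so the 0.7-0.8 range means each feature set alone ranks helpful reviews above unhelpful ones most of the time.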
While preparing the final version of the paper, I also checked other papers attacking the same problem. While many were trying to predict helpfulness using textual features, I noticed that a few used interesting alternative features:
- Coverage of product features. Many products can be considered an aggregation of multiple product features. For example, a digital camera has resolution, size, battery life, sensor size, etc. How many product features does the review discuss? This feature tends to have predictive power (Liu et al., EMNLP 2007).
- Dynamics of reviews. Reviews posted early on get a higher fraction of helpful votes. In contrast, later reviews need to be more informative and comprehensive to attract the same fraction of helpful votes (Liu et al., EMNLP 2007).
- Controversy. The helpfulness of a review depends not only on its own content but also on how controversial the product under consideration is (Danescu-Niculescu-Mizil et al., WWW 2009).
- Social network of reviewers. If reviewer A trusts the reviews of reviewer B, then the reviews of B are likely to be more helpful than the reviews of A. ("Exploiting Social Context for Review Quality Prediction"; by Lu, Tsaparas, Ntoulas, and Polanyi; WWW 2010)
Although I have not seen a paper combining all the above features to predict the helpfulness of a review (or to rank reviews by helpfulness), I guess that this set of features would bring predictive accuracy pretty close to its limit for this task.
What is next? I guess personalized recommendations will appear sooner or later, matching users with the reviews most likely to benefit them. (Update: See Eugene's comment below for related papers.) For example, a beginner in photography will be interested in a different type of review when buying an SLR than a seasoned professional. We already know that the preferences of similar users can be used to recommend products (see Netflix), so it is not unlikely that different types of reviews will be deemed helpful by different types of users.
So, did you find this blog post useful?