Saturday, October 4, 2008

Reviewing the Reviewers

I received today the latest issue of TOIS, and the title of the editorial by Gary Marchionini caught my eye: "Reviewer Merits and Review Control, in an Age of Electronic Manuscript Management Systems". The article makes the case for using electronic manuscript management systems to grade reviewer effort and to keep a memory of the reviewing process, including both the reviews and the reviewer ratings.

In principle, I agree with the idea. Having the complete reviewing history of each reviewer, and of each journal and conference, can bring several improvements to the process:

1. Estimating and Fixing Biases

One way to see the publication process is as noisy labeling of examples, where the true labels are "accept" or "reject". The reviewers can be modeled as noisy processes, each with its own sensitivity and specificity. The perfect reviewer has sensitivity=1, i.e., marks as "accept" all the "true accepts", and specificity=1, i.e., marks as "reject" all the "true rejects".

Given enough noisy ratings, it is possible to use statistical techniques to infer the "true label" of each paper and, at the same time, the sensitivity and specificity of each reviewer. Bob Carpenter has presented a hierarchical Bayesian model that can be used for this purpose, but simpler maximum likelihood models, like that of Dawid and Skene, also work very well. In my own (synthetic) experiments, the MLE method worked almost perfectly for recovering both the quality characteristics of the reviewers and the true labels of the papers (of course, without the uncertainty estimates that the Bayesian methods provide).
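
For the curious, here is a minimal sketch of how such an estimator can work in the binary case, in the spirit of an EM / Dawid-Skene approach, run on synthetic ratings. All the numbers, variable names, and modeling shortcuts below are my own illustration, not the code from either reference:

```python
# Minimal sketch of a Dawid & Skene-style EM for binary labels
# ("accept" = 1, "reject" = 0). Synthetic setup for illustration only.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: papers with hidden true labels, judged by noisy reviewers.
n_papers, n_reviewers = 300, 10
true_labels = rng.random(n_papers) < 0.3            # ~30% "true accepts"
sens = rng.uniform(0.6, 0.95, n_reviewers)          # P(accept vote | true accept)
spec = rng.uniform(0.6, 0.95, n_reviewers)          # P(reject vote | true reject)

votes = np.empty((n_papers, n_reviewers), dtype=int)
for j in range(n_reviewers):
    accept_if_true = rng.random(n_papers) < sens[j]
    accept_if_false = rng.random(n_papers) < (1 - spec[j])
    votes[:, j] = np.where(true_labels, accept_if_true, accept_if_false)

# EM: alternate between estimating P(true accept) per paper (E-step)
# and reviewer sensitivity/specificity plus prevalence (M-step).
p_accept = votes.mean(axis=1)                       # start from the vote average
for _ in range(50):
    # M-step
    prevalence = p_accept.mean()
    est_sens = np.clip((p_accept @ votes) / p_accept.sum(), 1e-6, 1 - 1e-6)
    est_spec = np.clip(((1 - p_accept) @ (1 - votes)) / (1 - p_accept).sum(),
                       1e-6, 1 - 1e-6)
    # E-step (log-space for numerical stability)
    log_like_1 = (votes * np.log(est_sens) + (1 - votes) * np.log(1 - est_sens)).sum(axis=1)
    log_like_0 = ((1 - votes) * np.log(est_spec) + votes * np.log(1 - est_spec)).sum(axis=1)
    log_odds = np.log(prevalence) - np.log(1 - prevalence) + log_like_1 - log_like_0
    p_accept = 1 / (1 + np.exp(-log_odds))

recovered = p_accept > 0.5
print("label accuracy:", (recovered == true_labels).mean())
print("sensitivity abs. error:", np.abs(est_sens - sens).round(2))
print("specificity abs. error:", np.abs(est_spec - spec).round(2))
```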

One issue with such a model? The assumption that there is an underlying "true" label. For people with different backgrounds and research interests, what is a "true accept" and what is a "true reject" is not easy to define, even with perfect reviewing.

2. Reviewer Ratings

Reviewer reviewing by the editors

The statistical approaches described above reduce the quality of a reviewer to two metrics. However, these metrics only capture how well the recommendations agree with the "true" decision (publish or not). They say nothing about other aspects of the review: comprehensiveness, depth, timeliness, and helpfulness are all important qualities that need to be captured using different methods.

Marchionini mentions that current manuscript management systems allow editors to rate reviewers in terms of both timeliness and quality. Following the references, I ran into the article "Reviewer Merits," published in Information Processing and Management, in which the Editors-in-Chief of many IR journals state:
Electronic manuscript systems easily provide time data for reviewers and some offer rating scales and note fields for editors to evaluate review quality. Many of us (editors) are beginning to use these capabilities and, over time, we will be able to have systematic and persistent reviewer quality data. Graduate students, faculty, chairs, and deans should be aware that these data are held.
Now, while I agree with reviewer accountability, I think that this statement is not worded properly. I find the phrase "should be aware" semi-threatening. ("We, the editors, are rating you... remember that!")

If reviewer quality history is being kept, then the reviewers should be aware and have access to it. Being reminded that "your history is out there somewhere" is not the way to go. If reviewer quality is going to be a credible evaluation metric, the reviewers need to know how well they did. (Especially junior reviewers, and especially when the review does not meet the quality standards.)

Furthermore, if the editors are the ones rating the reviewers, then who controls the quality of these ratings? How do we know that the evaluation is fair and accurate? Notice that with a single editorial quality rating per review, the statistical approaches described above do not work: there are no multiple independent ratings of the same review from which to separate the rater's bias from the review's actual quality.

Reviewer reviewing by the authors

In the past, I have argued that authors should rate reviewers. My main point in that post was to propose a system that encourages reviewers to participate by rewarding the highly performing ones. (There is a similar letter to Science, titled "Rewarding Reviewers.") Since each author provides multiple feedback points, it is much easier to correct for the biases in the authors' ratings of the reviewers.
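
As a toy illustration of what such a correction might look like (the data, scale, and normalization choice below are all made up; z-scoring each author's ratings is just one simple option among many):

```python
# Sketch: correct for per-author rating bias by z-scoring each author's ratings
# before averaging them per reviewer. Data and names are made up for illustration.
from collections import defaultdict
from statistics import mean, pstdev

# (author, reviewer, rating on a 1-5 helpfulness scale)
ratings = [
    ("author_A", "rev_1", 5), ("author_A", "rev_2", 4), ("author_A", "rev_3", 5),
    ("author_B", "rev_1", 2), ("author_B", "rev_2", 1), ("author_B", "rev_3", 3),
]

# Per-author mean and spread, so a harsh author and a generous author
# contribute comparable signals.
by_author = defaultdict(list)
for author, _, r in ratings:
    by_author[author].append(r)
stats = {a: (mean(rs), pstdev(rs) or 1.0) for a, rs in by_author.items()}

# Average the normalized ratings per reviewer.
by_reviewer = defaultdict(list)
for author, reviewer, r in ratings:
    mu, sigma = stats[author]
    by_reviewer[reviewer].append((r - mu) / sigma)
scores = {rev: mean(zs) for rev, zs in by_reviewer.items()}
print(scores)   # rev_2 ranks lowest for both authors, despite their different scales
```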

3. Reviewer History and Motivation

If we keep a history of reviewers, we should not forget potential side effects. One clear issue that I see is motivation. If "reviews of reviewers" become a public record, it is not clear how easy it will be to recruit reviewers.

Right now, many accept invitations to review knowing that they will be able to do a decent job. If the expectations increase, it will be natural for people to decline invitations and focus only on the few reviews for which they can do a great job. Arguably, the reviewing record is never going to be as important for evaluation as other metrics, such as research productivity or teaching, so it is unlikely that people will devote more time to it.

So, there will always be the tradeoff: more reviews or better reviews?

One solution that I have proposed in the past: Impose a budget! Every researcher should remove from the reviewing system the workload that their submissions generate. Five papers submitted (not accepted) within a year? Assuming three reviews per submission, the researcher needs to review 3x5 = 15 papers to remove the workload that these five papers generated. (See also the article "In Search of Peer Reviewers," which presents the same ideas.)
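
The bookkeeping for such a budget is trivial; here is a toy sketch using the three-reviews-per-submission assumption from the example above (names and numbers are purely illustrative):

```python
# Toy bookkeeping for the "reviewing budget" idea: each submission consumes
# REVIEWS_PER_PAPER reviews, and each review you complete pays one back.
REVIEWS_PER_PAPER = 3   # typical load per submission, as assumed above

def review_debt(papers_submitted: int, reviews_completed: int) -> int:
    """Reviews still owed to the system (negative means a surplus)."""
    return papers_submitted * REVIEWS_PER_PAPER - reviews_completed

print(review_debt(papers_submitted=5, reviews_completed=0))   # 15, as in the example
print(review_debt(papers_submitted=5, reviews_completed=15))  # 0: workload removed
```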

4. Training Reviewers

So, suppose that we have a system in place to keep reviewer history, we have solved the issue of motivation, and one facet of researcher reputation is now the reviewer quality score. How do we learn to review properly? A system that estimates the sensitivity and specificity of a reviewer can provide some information on how strict or lenient that reviewer is compared to others.
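
For instance, given per-reviewer sensitivity and specificity estimates, it is easy to translate them into an "implied acceptance rate" and compare each reviewer against the pool (a toy sketch; the numbers and the assumed prevalence of true accepts are made up):

```python
# Sketch: turn estimated (sensitivity, specificity) pairs into a strictness
# comparison. "prevalence" is the assumed share of papers that truly merit acceptance.
prevalence = 0.3
reviewers = {                # reviewer: (sensitivity, specificity)
    "rev_1": (0.95, 0.60),   # lenient: accepts many true rejects
    "rev_2": (0.60, 0.95),   # strict: misses many true accepts
    "rev_3": (0.85, 0.85),
}

def implied_accept_rate(sens: float, spec: float, prev: float) -> float:
    """Probability this reviewer votes 'accept' on a random submission."""
    return prev * sens + (1 - prev) * (1 - spec)

rates = {r: implied_accept_rate(s, p, prevalence) for r, (s, p) in reviewers.items()}
avg = sum(rates.values()) / len(rates)
for r, rate in sorted(rates.items(), key=lambda kv: kv[1]):
    tag = "stricter than average" if rate < avg else "more lenient than average"
    print(f"{r}: accepts {rate:.0%} of submissions ({tag})")
```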

However, we need something more than that. What makes a review constructive? What makes a review fair? In principle, we could rely on academic advising to pass such qualities to newer generations of researchers. In practice, when someone starts reviewing a significant volume of papers, there is no advisor or mentor to oversee the process.

Therefore, we need some guidelines. An excellent set of guidelines is given in the article "A Peer Review How-To". Let me highlight some nuggets:

Reviewers make two common mistakes. The first mistake is to reflexively demand that more be done. Do not require experiments beyond the scope of the paper, unless the scope is too narrow.
[...]
Do not reject a manuscript simply because its ideas are not original, if it offers the first strong evidence for an old but important idea.

Do not reject a paper with a brilliant new idea simply because the evidence was not as comprehensive as could be imagined.

Do not reject a paper simply because it is not of the highest significance, if it is beautifully executed and offers fresh ideas with strong evidence.

Seek a balance among criteria in making a recommendation.

Finally, step back from your own scientific prejudices.

And now excuse me, because I have to review a couple of papers...