Sunday, January 27, 2008

Who is Reviewing the Reviewers?

Complaining about the quality of the reviews that we get back from journals and conferences is a favorite pastime in academia. We all like to pinpoint why a particular review was not good and led to an unfair rejection of a paper. We never praise the good feedback that we get back.

Perhaps this lack of recognition is one of the reasons that we actually have bad reviews. Once someone is a member of a program committee, there are very few incentives to do an excellent job. The only real incentive is to do an average job and not be embarrassed during the discussion phase or when the program chair (or the associated editor) reads the reviews. There is almost no incentive to do an excellent job. At best, the authors and the program chair will be happy with the quality of the reviews and there will give thanks to the anonymous reviewer. It is a pretty thankless job!

One way to avoid this is to start a reputation system for reviewers. When the reviews are ready, the text of the review is sent to the authors. The authors then rate the quality of the reviews, and the reviewer is assigned (privately) a reviewing score. The highest rated reviewers are then recognized for their efforts.

There are a couple of issues of course. Authors will be happy when the reviewer is positive towards the paper and this gives the incentive to the reviewer to be lenient and propose acceptance for many papers. To counter-balance this, we should condition the reviews on the score assigned by the reviewer to the paper. Therefore, the reviewer evaluation will be a matrix like:

Reviewer's score for the paper
RejectWeak RejectWeak AcceptAcceptStrong Accept
Author's Evaluation of ReviewNot Useful11110
Very Useful03110

where the number in the cell correspond to the number of reviews with the given ratings. For example, the values in the column "Weak Reject" mean that the reviewer submitted 1+2+3=6 reviews with a "Weak Reject" rating and 3 of them were deemed "Very Useful" by the authors, 2 of them were deemed "Useful" and 1 was rated as "Not Useful".

Potentially, such a metric can be propagated from conference to conference, and the reviewer can even make the matrix public if desired.

It is important though, not to see this approach as a punitive measure. Having "halls of shame" and other such tactics may result in a backslash. "Why participate in a program committee and risk seeing my name published in a hall-of-shame list?" The number of researchers available to review is not growing fast in contrast to the number of publications that are being submitted to the conferences. The goal of the system should be to reward and encourage participation, not punish.

Wouldn't it be nice to allow PhD students to review papers, and encourage them to review papers not with the goal of rejecting them but with the goal of offering constructive feedback? And a recognition like "top reviewer for ACM 2008 conference on..." would definitely look good on the CV and signal that the student can give high quality feedback.

Update: Of course, you should also look at the (imho, classic): "Reviewing the Reviewers" by Ken Church.

Wednesday, January 23, 2008

Misunderstandings of Power-Law Distributions

Power laws are ubiquitous. In its most basic form, a power-law distribution has the following form:

$Pr\{x=k; a\} = \frac{k^{-a}}{\zeta(a)}$

where $a>1$ is the parameter of the power-law and $\zeta(a) = \sum_{i=1}^{+\infty} \frac{1}{i^a}$ is the Riemann zeta function that serves as a normalizing constant.

Part A

Often in research we get data sets that seem to follow a power-law. Since power-laws are cool, it is common to examine whether some distribution is indeed a power law and then try to estimate its parameter(s).

One of the most common approaches is to:
  • Generate a frequency-count histogram: in the x-axis we have the frequency of an event, and in the y-axis we have how often such events happen (count).
  • Plot the histogram in a log-log plot. Since the power law is $Pr\{x=k\} \propto k^{-a}$ then $\log(Pr\{x=k\}) = -a \cdot \log(k) +b$, the resulting plot shows typically a straight line, with a negative slope.
  • Run least-squares regression to estimate the slope and the intersection of the best fitting line.
  • If the $R^2$ is high (or if the line looks to fit nicely), then the distribution is a power-law and the parameters are given by the regression.
Unfortunately, this (common!) strategy is wrong. Running a regression in a log-log space violates many of the underlying assumptions of an OLS regression. First of all, the errors in a log space are distorted and cannot follow a normal distribution. Second, many distributions will tend to fit a straight line in a log-log plot. In fact, I have seen people fitting Gaussians with a few outliers to a log-log plot and naming the distribution a power-law.

For the reasons above, it is always better to use more robust estimating techniques to estimate the parameters of a power-law. Personally, I have been using the maximum likelihood estimator. You can also read this great paper in arXiv that discusses estimation techniques for many variations of power-laws. (See Appendix A for a more rigorous discussion of the perils of the least-squares approach.) From SIGMOD, I also recall the paper on modeling skew in data streams that presents an approach that converges faster than the MLE estimate.

Have you seen any Bayesian estimator? The MLE estimator tends to be unstable with only few observed sample values and it would be useful to impose a prior, limiting the acceptable parameter values to a predefined space. I have seen a short discussion (Section II.B) but the approach requires numeric methods. Any method that results in a closed-form expression?

Part B

Another common misunderstanding of the power-law distributions is their interpretation. A common description of a power-law is "there are only a few events that occur frequently and many events that occur only a few times." While this is true, it hides the fact that the "few events that occur frequently" appear much more often compared to other "familiar" distributions. In other words, under a probabilistic interpretation, we have too many events that "occur frequently" not "only a few".

For example, word frequencies tend to follow a power-law distribution. This means that there are a few words (like if, the, a, when, and so on) that appear frequently, and many words that occur only a few times. However, if words frequencies followed a Gaussian distribution, then we would never observe words with high frequency. Most of the words would tend to appear with roughly equal frequencies, and the outliers would be extremely rare.

To inverse the example, if the height of people followed a power-law instead of a Gaussian (say with $a=2$), then we would have 5 billion midgets at 1ft, 1.2 billion short people at 2ft, 300 million people 4ft tall, .... 1.2 million 64ft tall, 20 thousand at 500ft, and a handful of 30,000ft tall.

In other words, power-laws have heavy tails and outliers appear frequently. Not acknowledging the frequent-outliers phenomenon can lead to wrong modeling decisions. See for example Manderbrot's article "How the Finance Gurus Get Risk All Wrong".

Part C

Finally a question and call for help: What is the distribution of the sum of two power-law random variables? In other words, we try to compute the distribution for the random variable $Z=X+Y$ where $X$ and $Y$ follow power-law distributions with parameters $a_1$ and $a_2$. Following the standard definition, we need to compute the convolution:

$Pr\{ z \} = \sum_{k=1}^{z-1} Pr\{ x=k \} \cdot Pr\{ y=z-k \} =\sum_{k=1}^{z-1} \frac{k^{a_1}}{\zeta(a_1)} \cdot \frac{(z-k)^{a_2}}{\zeta(a_2)}$

(The index $k$ goes from $1$ to $z-1$ because power-laws are only defined for positive integers.)

Now, what is the result of this convolution?

The paper by Wilke et al. claims that if we sum two power-law random variables with parameters $a_1$ and $a_2$ the result is again a power-law distribution with parameter $\min(a_1,a_2)$.

If we use the Fourier transform of a power-law (is this transformation correct?), and compute the convolution by taking the product of the transforms and the going back, then the resulting distribution is again a power-law and the configuring parameter seems to be $a_1 + a_2 +1$. My gut feeling though says that some underlying assumption is violated, and that this result is not correct.

Any other pointers?

Update (1/31/2008): Daniel Lemire reposted my bleg (I learned from Suresh that bleg stands for "blog request for help"), and Yaroslav Bulatov pointed to the paper "The Distribution of Sums of Certain I.I.D. Pareto Variates" by Colin M. Ramsay, which pretty much answers the question. The derivation uses Laplace transforms (instead of Fourier) and the Bromwich integral, and tends to be quite technical. However, if you look at Figures 1 and 2, you will pretty much understand the result of the summation. On a first check, the shape of the resulting distribution seems similar to the result of summing two exponential distributions (see the comments below).

Tuesday, January 15, 2008

Are you a Bayesian or a Frequentist? (Or Bayesian Statistics 101)

OK, the previous post was actually a brain teaser given to me by Roy Radner back in 2004, when I joined Stern, in order to teach me the difference between Bayesian and Frequentist statistics. It actually illustrates nicely how the two techniques lead to different conclusions. For completeness, let me list the question again:

You have a coin that when flipped ends up head with probability $p$ and ends up tail with probability $1-p$. (The value of $p$ is unknown.)

Trying to estimate $p$, you flip the coin 14 times. It ends up head 10 times.

Then you have to decide on the following event: "In the next two tosses we will get two heads in a row."

Would you bet that the event will happen or that it will not happen?

One of the basic differences of Bayesian and Frequentists is how they treat the parameters. Using frequentist statistics (the statistics that I learned in school :-) we would say that the best (maximum likelihood) estimate for $p$ is $p=\frac{10}{14}$, i.e., $p \approx 0.714$.

In this case, the probability of two heads is $0.714^2 \approx 0.51$ and it makes sense to bet for the event.

Bayesian approach: $p$ is not a value, it's a distribution

A Bayesian approach, instead of considering only the ML estimate for $p$, it would treat $p$ as a random variable with its own distribution of possible values. The distribution can be defined by the existing evidence. The logic goes as follows. What is the probability of a given value of $p$, given the data?

$Pr\{ p | data \} = \frac{ Pr\{ data | p \} \cdot Pr\{ p \}} { Pr\{ data \} }$

Since the term $Pr\{ data \}$ is going to be the same for all values of p, we can for now ignore it. We will see that we can infer it later, as the probability distribution when integrated over all values of $p$ will need to be equal to 1.

The value $Pr\{ data | p \}$ is easy to compute using simple statistics. The result of the experiment is a binomial distribution and we have

$Pr\{ data | p \} = \binom{14}{10} \cdot p^{10} \cdot (1-p)^{4}$

Again, ignoring the binomial coefficient $\binom{14}{10}$, we have:

$Pr\{ p | data \} \propto p^{10} \cdot (1-p)^{4} \cdot Pr\{ p \}$

where the $\propto$ symbol means proportional.

Now, the factor $Pr\{ p \}$, the prior distribution, comes into play. A very convenient prior distribution for this scenario (also known as the conjugate prior of the binomial) is the Beta distribution, $Beta(p;a,b)$ defined as:

$Beta(p;a,b) = \frac{ \Gamma(a+b) } {\Gamma(a) \cdot \Gamma(b)} \cdot p^{a-1} \cdot (1-p)^{b-1}$

In this case, we have:

$Pr\{ p \} = \frac{ \Gamma(a+b) } {\Gamma(a) \cdot \Gamma(b)} \cdot p^{a-1} \cdot (1-p)^{b-1}$

Dropping once again the constant:

$Pr\{ p | data \} \propto p^{10} \cdot (1-p)^{4} \cdot p^{a-1} \cdot (1-p)^{b-1}$

$Pr\{ p | data \} \propto p^{(10+a-1)} \cdot (1-p)^{(4+b-1)}$

Now, what distribution is that? We need the normalizing factor to get the proper distribution. Looking at the form of the Beta distribution above, we can see that this is again a Beta distribution, and the normalizing factor is

$\frac{ \Gamma((10+a)+ (4+b)) } { \Gamma(10+a) \cdot \Gamma(4+b) } = \frac{1}{B(10+a, 4+b)}$

where B(x,y) is the Beta function (not to be confused with the Beta distribution!)

Therefore the distribution for the parameter $p$ is $Beta(p;a+10, b+4)$. If we assume that we know nothing about $p$, we can assume that the prior is a uniform distribution, and the uniform distribution is a special case of Beta, with $a=b=1$. In this case the distribution for $p$ is $Beta(p; 1+10, 1+4) = Beta(p; 11, 5)$, which has the following form:

Probability of "Two Heads in a Row"

Now, we need to compute the probability of the event "two heads in a row", given the data that we have, which is $Pr\{HH|data\}$. To simplify:
  • We do not update $p$ between the two tosses.
  • We assume conditional independence that, given $p$, the two tosses are independent.
Given that we will not update the value of $p$, we can break the process into two steps. First, we compute the possible values of $p$ as described above; then, for each value of $p$ we compute the probability of two heads in a row. These steps are encoded into the following integral computation:

$Pr\{HH|data\} = \int_0^1 Pr\{HH|p\} \cdot Pr\{p|data\} dp$

Given the conditional independence of the two tosses:

$Pr\{HH|p\} = [Pr\{H|p\}]^2 = p^2$

So, we have:

$Pr\{HH|data\} = \int_0^1 p^2 \cdot Pr\{p|data\} dp$

From above, we know that:

$Pr\{ p | data \}= \frac{ 1 } { B(10+a, 4+b) } \cdot p^{10+a-1} \cdot (1-p)^{4+b-1}$

By replacing this value, we get:

$Pr\{HH|data\} = \frac{ 1 } { B(10+a, 4+b) } \int_0^1 p^{(10+a-1)+2} \cdot (1-p)^{4+b-1} dp$

Solving the integral, we have:

$Pr\{HH|data\} = \frac{ B(10+a+2, 4+b) } { B(10+a, 4+b) }$

If we assume that we know nothing about $p$, we can assume that the prior is a uniform distribution, and the uniform distribution is a special case of Beta, with $a=b=1$. In this case, the probability of the event "two heads in a row" is $\frac{ B(13, 5) } { B(11, 5) } = 0.485$ and it makes sense to bet against the event. So, the Frequentist approach gives probability 51% and the Bayesian approach with uniform prior gives 48.5%. Kudos to Roy for coming up with example, and shame on me for screwing up the initial posting!

(Update based on Foster's comment below: instead of using the uniform distribution as a prior, we can be even more agnostic. In this case, we can use the Beta(0,0) distribution as a prior. Such a distribution corresponds to the case where any mean of the distribution is equally likely. In this case, the two approaches, Bayesian and frequentist give the same results.)

(Update 2 based on multiple comments below: I changed the example to be more "fully Bayesian". Initially, I was estimating the full distribution for $p$, but then, in order to compute the "two heads in a row" probability, I was collapsing $p$ into a point estimate, using its mean value. Now, with the help of the commenters, the example follows closer the Bayesian paradigm. I had to change the number of heads and tails for the example to still be valid. WolframAlpha was surprisingly easy to use for identifying numbers that will work for this example. The values (3 heads, 1 tails),  (5 heads, 2 tails), (8 heads, 3 tails), (try your own value of tails)... also work.)

Monday, January 14, 2008

Are you a Bayesian or a Frequentist?

You have a coin that when flipped ends up head with probability p and ends up tail with probability 1-p. (The value of p is unknown.)

Trying to estimate p, you flip the coin 10 100 times. It ends up head 7 71 times. (Update: The correct numbers for the example to be instructional are 71 and 100, not 7 and 10 as I listed previously.)

Then you have to decide on the following event: "In the next two tosses we will get two heads in a row."

Would you bet that the event will happen or that it will not happen?

See the solution

Saturday, January 12, 2008

Definining Probability in Prediction Markets

The New Hampshire Democratic primary was one of the few(?) events in which prediction markets did not give an "accurate" forecast for the winner. In a typical "accurate" prediction, the candidate that has the contract with the highest price ends up winning the election.

This result, combined with an increasing interest/hype about the predictive accuracy of prediction markets, generated a huge backslash. Many opponents of prediction markets pointed out the "failure" and started questioning the overall concept and the ability of prediction markets to aggregate information.

Interestingly enough, such failed predictions are absolutely necessary if we want to take the concept of prediction markets seriously. If the frontrunner in a prediction market was always the winner, then the markets would have been a seriously flawed mechanism. In such a case, an obvious trading strategy would be to buy the frontrunner's contract and then simply wait for the market to expire to get a guaranteed, huge profit. If for example Obama was trading at 66 cents and Clinton at 33 cents (indicating that Obama is twice as likely to be the winner), and the markets were "always accurate" then it would make sense to buy Obama's contract the day before the election and get $1 back the next day. If this was happening every time, then this would not be an efficient market. This would be a flawed, inefficient market.

In fact, I would like to argue that the late streak of successes of the markets to always pick the winner of the elections lately has been an anomaly, indicating the favorite bias that exists in these markets. The markets were more accurate than they should, according to the trading prices. If the market never fails then the prices do not reflect reality, and the favorite is actually underpriced.

The other point that has been raised in many discussions (mainly from a mainstream audience) is how we can even define probability for an one-time event like the Democratic nomination for the 2008 presidential election. What it means that Clinton has 60% probability of being the nominee and Obama has 40% probability? The common answer is that "if we repeat the event for many times, 60% of the cases Clinton will be the nominee and 40% of the cases, it will be Obama". Even though this is an acceptable answer for someone used to work with probabilities, it makes very little sense for the "average Joe" who wants to understand how these markets work. The notion of repeating the nomination process multiple times is an absurd concept.

The discussion brings in mind the ferocious battles between Frequentists and Bayesians for the definition of probability. Bayesians could not accept that we can use a Frequentist approach for defining probabilities for events. "How can we define the probability of success for an one-time event?" The Frequentist would approach the prediction market problem by defining a space of events and would say:
After examining prediction markets for many state-level primaries, we observed that 60% of the cases the frontrunners who had a contract priced at 0.60 one day before the election, were actually the winners of the election. In 30% of the cases, the candidates who had a contract priced at 0.30 one day before the election, was actually the winners of the election, and so on.
A Bayesian would criticize such an approach, especially when the sample size of measurement is small, and would point to the need to have an initial belief function, that should be updated as information signals come from the market. Interestingly enough, the two approaches tend to be equivalent in the presence of infinite samples, which is however rarely the case.

I just could not help but notice that the fight between the proponents and enemies of the prediction markets was reminiscent of the battles between Bayesians and Frequentists :-)