## Wednesday, January 23, 2008

### Misunderstandings of Power-Law Distributions

Power laws are ubiquitous. In its most basic (discrete) form, a power-law distribution has the following form:

$Pr\{x=k; a\} = \frac{k^{-a}}{\zeta(a)}$

where $a>1$ is the parameter of the power-law and $\zeta(a) = \sum_{i=1}^{+\infty} \frac{1}{i^a}$ is the Riemann zeta function that serves as a normalizing constant.
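In code, this pmf is easy to evaluate; here is a minimal sketch (the infinite zeta series is replaced by a truncated sum, which is an approximation I chose for simplicity):

```python
def zeta_sum(a, terms=100_000):
    """Truncated approximation of the Riemann zeta normalizer
    (assumption: 100k terms is plenty for the exponents used here)."""
    return sum(k ** -a for k in range(1, terms + 1))

def powerlaw_pmf(k, a, norm):
    """Pr{x = k} = k^{-a} / zeta(a), with the normalizer precomputed."""
    return k ** -a / norm

a = 2.5
norm = zeta_sum(a)                  # zeta(2.5) is roughly 1.3415
total = sum(powerlaw_pmf(k, a, norm) for k in range(1, 10_000))
# `total` should be very close to 1, since the pmf is normalized
```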

#### Part A

Often in research we get data sets that seem to follow a power-law. Since power-laws are cool, it is common to examine whether some distribution is indeed a power law and then try to estimate its parameter(s).

One of the most common approaches is to:
• Generate a frequency-count histogram: on the x-axis we have the frequency of an event, and on the y-axis we have how often events with that frequency occur (the count).
• Plot the histogram on a log-log plot. Since the power law is $Pr\{x=k\} \propto k^{-a}$, we have $\log(Pr\{x=k\}) = -a \cdot \log(k) + b$, so the resulting plot typically shows a straight line with a negative slope.
• Run least-squares regression to estimate the slope and the intercept of the best-fitting line.
• If the $R^2$ is high (or if the line seems to fit nicely), declare the distribution a power-law, with the parameters given by the regression.
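For concreteness, the naive procedure above can be sketched as follows (illustration only, given the problems discussed next; the function name and data handling are my own):

```python
import math
from collections import Counter

def naive_loglog_fit(samples):
    """Fit a line to the log-log frequency histogram by ordinary
    least squares. This is the flawed procedure described above."""
    counts = Counter(samples)                  # value -> how often it occurs
    xs = [math.log(v) for v in counts]
    ys = [math.log(c) for c in counts.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept                    # estimated exponent is -slope
```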
Unfortunately, this (common!) strategy is wrong. Running a regression in log-log space violates many of the underlying assumptions of an OLS regression. First, taking logarithms distorts the errors, so they cannot follow a normal distribution with constant variance. Second, many distributions tend to look like a straight line on a log-log plot. In fact, I have seen people fit a Gaussian with a few outliers on a log-log plot and name the distribution a power-law.

For the reasons above, it is always better to use more robust estimation techniques for the parameters of a power-law. Personally, I have been using the maximum likelihood estimator. You can also read this great paper on arXiv that discusses estimation techniques for many variations of power-laws. (See Appendix A for a more rigorous discussion of the perils of the least-squares approach.) From SIGMOD, I also recall the paper on modeling skew in data streams that presents an approach that converges faster than the MLE estimate.
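As an illustration of the MLE approach, here is a sketch of the closed-form estimator for the continuous analogue (the Pareto case), which is commonly used to approximate the discrete case; the discrete MLE itself has no closed form and must be maximized numerically over the zeta normalizer:

```python
import math
import random

def pareto_mle_exponent(samples, x_min=1.0):
    """Closed-form MLE for a continuous power law p(x) ~ x^{-a}, x >= x_min:
    a_hat = 1 + n / sum(ln(x_i / x_min))."""
    n = len(samples)
    return 1.0 + n / sum(math.log(x / x_min) for x in samples)

# Sanity check on synthetic data: draw from a Pareto with a = 2.5
# via inverse-transform sampling, then recover the exponent.
random.seed(0)
a_true = 2.5
samples = [(1.0 - random.random()) ** (-1.0 / (a_true - 1.0))
           for _ in range(20_000)]
a_hat = pareto_mle_exponent(samples)
```

With 20,000 samples the estimate should land close to the true exponent; with only a handful of samples it becomes unstable, which is exactly the motivation for a Bayesian prior raised below.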

Have you seen any Bayesian estimator? The MLE tends to be unstable with only a few observed samples, and it would be useful to impose a prior, limiting the acceptable parameter values to a predefined space. I have seen a short discussion (Section II.B), but the approach requires numerical methods. Is there a method that yields a closed-form expression?

#### Part B

Another common misunderstanding of power-law distributions concerns their interpretation. A common description of a power-law is "there are only a few events that occur frequently and many events that occur only a few times." While this is true, it hides the fact that the "few events that occur frequently" appear much more often than they would under other, "familiar" distributions. In other words, under a probabilistic interpretation, we have too many events that "occur frequently," not "only a few."

For example, word frequencies tend to follow a power-law distribution. This means that there are a few words (like if, the, a, when, and so on) that appear frequently, and many words that occur only a few times. However, if word frequencies followed a Gaussian distribution, then we would never observe words with very high frequency. Most words would appear with roughly equal frequencies, and the outliers would be extremely rare.

To invert the example: if the height of people followed a power-law instead of a Gaussian (say with $a=2$), then we would have 5 billion people 1 ft tall, 1.2 billion 2 ft tall, 300 million 4 ft tall, ..., 1.2 million 64 ft tall, 20 thousand 500 ft tall, and a handful 30,000 ft tall.
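The counts in this thought experiment follow directly from counts proportional to $k^{-2}$, anchored at 5 billion people of height 1 ft (a hypothetical population, for illustration only):

```python
def count_at_height(height_ft, base=5e9):
    """People of a given height when counts fall off as height^{-2},
    anchored (hypothetically) at 5 billion people of height 1 ft."""
    return base / height_ft ** 2

# reproduce the counts quoted in the example above
heights = {h: count_at_height(h) for h in (1, 2, 4, 64, 500, 30_000)}
```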

In other words, power-laws have heavy tails, and outliers appear frequently. Not acknowledging the frequent-outliers phenomenon can lead to wrong modeling decisions. See for example Mandelbrot's article "How the Finance Gurus Get Risk All Wrong".

#### Part C

Finally a question and call for help: What is the distribution of the sum of two power-law random variables? In other words, we try to compute the distribution for the random variable $Z=X+Y$ where $X$ and $Y$ follow power-law distributions with parameters $a_1$ and $a_2$. Following the standard definition, we need to compute the convolution:

$Pr\{ z \} = \sum_{k=1}^{z-1} Pr\{ x=k \} \cdot Pr\{ y=z-k \} = \sum_{k=1}^{z-1} \frac{k^{-a_1}}{\zeta(a_1)} \cdot \frac{(z-k)^{-a_2}}{\zeta(a_2)}$

(The index $k$ goes from $1$ to $z-1$ because power-laws are only defined for positive integers.)

Now, what is the result of this convolution?

The paper by Wilke et al. claims that if we sum two power-law random variables with parameters $a_1$ and $a_2$ the result is again a power-law distribution with parameter $\min(a_1,a_2)$.

If we use the Fourier transform of a power-law (is this transformation correct?), compute the convolution by taking the product of the transforms, and then transform back, the resulting distribution is again a power-law and the parameter seems to be $a_1 + a_2 + 1$. My gut feeling, though, says that some underlying assumption is violated and that this result is not correct.
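One sanity check is to compute the convolution sum directly (with truncated zeta normalizers) and measure the empirical log-log slope of the tail. In this sketch, with hypothetical parameters $a_1 = 2.5$ and $a_2 = 3.5$, the slope comes out near $\min(a_1, a_2) = 2.5$ rather than $a_1 + a_2 + 1 = 7$:

```python
import math

def zeta_sum(a, terms=50_000):
    # truncated zeta normalizer (assumption: 50k terms suffices for a > 2)
    return sum(k ** -a for k in range(1, terms + 1))

def convolve_powerlaws(a1, a2, z_max):
    """Pr{Z = z} for Z = X + Y, computed by the direct convolution sum."""
    n1, n2 = zeta_sum(a1), zeta_sum(a2)
    p1 = [0.0] + [k ** -a1 / n1 for k in range(1, z_max)]
    p2 = [0.0] + [k ** -a2 / n2 for k in range(1, z_max)]
    pz = [0.0] * (z_max + 1)
    for z in range(2, z_max + 1):
        pz[z] = sum(p1[k] * p2[z - k] for k in range(1, z))
    return pz

pz = convolve_powerlaws(2.5, 3.5, 1000)
# empirical log-log slope of the tail, measured between z = 400 and z = 800
slope = -math.log(pz[800] / pz[400]) / math.log(2.0)
```

Of course, a numerical check on a truncated support is only suggestive of the asymptotic behavior, not a proof.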

Any other pointers?

Update (1/31/2008): Daniel Lemire reposted my bleg (I learned from Suresh that bleg stands for "blog request for help"), and Yaroslav Bulatov pointed to the paper "The Distribution of Sums of Certain I.I.D. Pareto Variates" by Colin M. Ramsay, which pretty much answers the question. The derivation uses Laplace transforms (instead of Fourier) and the Bromwich integral, and tends to be quite technical. However, if you look at Figures 1 and 2, you will pretty much understand the result of the summation. On a first check, the shape of the resulting distribution seems similar to the result of summing two exponential distributions (see the comments below).