Thursday, March 19, 2009

Fitting a Power-Law with Censored Data

A few days back, I was describing how to model the distribution of waiting times for Mechanical Turk. There, I described how to use the maximum likelihood estimator to infer the parameter of the power-law, using as input to the estimator the duration of the completed tasks.

However, this approach introduces a bias: There are tasks that we observed that have not finished yet. Take a look for example at the two distribution plots for the completion time of tasks on Mechanical Turk:

You see that there is a "cutoff" in the pdf plot at some point between 512 and 1024 hours. Similarly, we see a sudden drop in the cdf plot. This is just the effect of keeping only the completed tasks, ignoring the tasks that are still running.

By using only the completed tasks, we effectively ignore the information provided by the incomplete tasks: knowing that a task has not finished after running for $U$ hours, gives us information that the duration of the task will be at least $U$: This is valuable information. Such data points are called censored. Let me give some background on censored data for readers not familiar with the concept.

What do you mean by calling a data point censored?

In general, we call a data point censored when we cannot get the exact value for the data point, but we can provide a lower bound for its value, or an upper value, or both. According to the type of bounds, the data points are classified as: right censored (lower bound), left censored (upper bound), and interval censored (both lower and upper bound).

By far the most typical type of censored data is the right censored data. The duration of the unfinished tasks on Mechanical Turk are right censored. In a salary survey if the income is marked as "100K+ per year" is also a right censored data point. The lifetime of a patient treated by a drug who is still alive at the end of the study is also a right censored data point.

A left censored point is an observation for which we know its maximum possible value. In a salary survey if the income is marked as "Less than 10K+ per year" this is a left censored data point.


How can we modify a maximum likelihood estimator to use censored data points?

Deriving the simple MLE estimator (no censored data points)

Let's see first how we derive a maximum likelihood estimator. I will use as an example the continuous power-law distribution but the process is largely the same for other distributions as well. For the continuous power law, the probability density function is

$Pr(x=x_i) = (\alpha-1) \cdot {x_i}^{-\alpha}$

First, let's say that we observe $n$ data points with values $x_1, \ldots, x_n$. We assume that the data points come indeed from a continuous power-law and we are trying to estimate the most likely parameter $\alpha$ that generated these points.

The likelihood of observing these data points is the probability of seeing the the value $x_1$ and the value $x_2$ and the value $x_3$, ... and the value $x_n$. So, assuming independence across the points, the likelihood function is:

$L(\alpha) = \prod_{i=1}^n Pr(x=x_i) $

By substituting the value $Pr(x=x_i)$ with the power-law instantiation, we have:

$L(\alpha) = \prod_{i=1}^n (\alpha-1) \cdot {x_i}^{-\alpha}$

We need to find the $\alpha$ that maximizes $l(\alpha)$. Instead of maximizing $l(\alpha)$ directly, we opt to maximize the logarithm which is also maximized at the same value, and it much easier to work with analytically:

$l(\alpha) = \ln(L(\alpha)) = \sum_{i=1}^n \ln\left( (\alpha-1) \cdot {x_i}^{-\alpha} \right)$
which results in:
$l(\alpha) = n \ln(\alpha-1) - \alpha \sum_{i=1}^n \log(x_i)$

To find the value of $\alpha$ that maximizes $l(\alpha)$, we take the derivative with respect to $\alpha$ and set the derivative equal to 0:

$\frac{d}{d \alpha}l(\alpha) = 0$

so

$n \frac{1}{\alpha-1} - \sum_{i=1}^n \log(x_i) = 0$

So the MLE estimator is:

$\alpha = 1 + n \left[\sum_{i=1}^n \ln(x_i) \right]^{-1}$

Let's move on now to the case of censored data.

Deriving the MLE estimator with right censored data

Now, let's say that we observe $n$+$m$ data points drawn from the distribution: We have $n$ completed tasks, with duration values $x_1, \ldots, x_n$ and $m$ non-completed tasks that had duration $u_1, \ldots, u_m$ when we observed them but they were still going on (and so their final duration is going to be longer than the currently observed one --- right censored points).

Again, we assume that the data points come indeed from a continuous power-law and we are trying to estimate the most likely parameter $\alpha$ that generated these points.

Now the likelihood of observing these data points is slightly different. We need to accommodate the fact that the duration for the censored observations will be longer than the currently observed one. So we get this version, (for more details see also the related Wikipedia entry) boldfacing the part that deals with the censored data:

$L(\alpha) = \prod_{i=1}^n Pr(x=x_i) \cdot$ $ \mathbf{\prod_{i=1}^m Pr(x>u_i)}$

Since $Pr(x>u_i) = {u_i}^{-(\alpha-1)}$ the likelihood becomes:

$L(\alpha) = \prod_{i=1}^n (\alpha-1) {x_i}^{-\alpha} \cdot \prod_{i=1}^m {u_i}^{-\alpha+1}$

Analogously to the case above, we take the log:

$l(\alpha) = \ln(L(\alpha))$

which becomes:

$l(\alpha) =n \ln(\alpha-1) -\alpha \sum_{i=1}^n \ln(x_i) + \sum_{i=1}^m \ln(u_i) -\alpha \sum_{i=1}^m \ln(u_i)$

and set the derivative equal to 0:

$\frac{d}{d \alpha}l(\alpha) = 0$
$n \frac{1}{\alpha-1} -\sum_{i=1}^n \ln(x_i) -\sum_{i=1}^m \ln(u_i) = 0$

So, here is the MLE estimator that uses censored data points as well:

$\alpha= 1 + n \left[ \sum_{i=1}^n \ln(x_i) +\sum_{i=1}^m \ln(u_i)\right]^{-1}$

As you will see the MLE estimator with censored data also contains the factor $\sum_{i=1}^m \ln(u_i) \right$ but the normalizing constant is still $n$ and not $n+m$.

So, what is the difference? For power laws, the higher the fraction of censored points, the lower the exponent, i.e., the tail is heavier: an expected outcome as the censored data points tend to represent unobserved tail points. For example, for Mechanical Turk, the exponent without censored data points is 1.35 and with censored data points is 1.29. Not a big difference in this case, but there are situations where this correction can change significantly the overall analysis (e.g., if the exponent is close to 2).