Tuesday, January 15, 2008

Are you a Bayesian or a Frequentist? (Or Bayesian Statistics 101)

OK, the previous post was actually a brain teaser given to me by Roy Radner back in 2004, when I joined Stern, in order to teach me the difference between Bayesian and Frequentist statistics. It actually illustrates nicely how the two techniques lead to different conclusions. For completeness, let me list the question again:

You have a coin that when flipped ends up head with probability $p$ and ends up tail with probability $1-p$. (The value of $p$ is unknown.)

Trying to estimate $p$, you flip the coin 14 times. It ends up head 10 times.

Then you have to decide on the following event: "In the next two tosses we will get two heads in a row."

Would you bet that the event will happen or that it will not happen?

One of the basic differences of Bayesian and Frequentists is how they treat the parameters. Using frequentist statistics (the statistics that I learned in school :-) we would say that the best (maximum likelihood) estimate for $p$ is $p=\frac{10}{14}$, i.e., $p \approx 0.714$.

In this case, the probability of two heads is $0.714^2 \approx 0.51$ and it makes sense to bet for the event.

Bayesian approach: $p$ is not a value, it's a distribution

A Bayesian approach, instead of considering only the ML estimate for $p$, it would treat $p$ as a random variable with its own distribution of possible values. The distribution can be defined by the existing evidence. The logic goes as follows. What is the probability of a given value of $p$, given the data?

$Pr\{ p | data \} = \frac{ Pr\{ data | p \} \cdot Pr\{ p \}} { Pr\{ data \} }$


Since the term $Pr\{ data \}$ is going to be the same for all values of p, we can for now ignore it. We will see that we can infer it later, as the probability distribution when integrated over all values of $p$ will need to be equal to 1.

The value $Pr\{ data | p \}$ is easy to compute using simple statistics. The result of the experiment is a binomial distribution and we have

$Pr\{ data | p \} = \binom{14}{10} \cdot p^{10} \cdot (1-p)^{4}$

Again, ignoring the binomial coefficient $\binom{14}{10}$, we have:

$Pr\{ p | data \} \propto p^{10} \cdot (1-p)^{4} \cdot Pr\{ p \}$

where the $\propto$ symbol means proportional.

Now, the factor $Pr\{ p \}$, the prior distribution, comes into play. A very convenient prior distribution for this scenario (also known as the conjugate prior of the binomial) is the Beta distribution, $Beta(p;a,b)$ defined as:

$Beta(p;a,b) = \frac{ \Gamma(a+b) } {\Gamma(a) \cdot \Gamma(b)} \cdot p^{a-1} \cdot (1-p)^{b-1}$

In this case, we have:

$Pr\{ p \} = \frac{ \Gamma(a+b) } {\Gamma(a) \cdot \Gamma(b)} \cdot p^{a-1} \cdot (1-p)^{b-1}$

Dropping once again the constant:

$Pr\{ p | data \} \propto p^{10} \cdot (1-p)^{4} \cdot p^{a-1} \cdot (1-p)^{b-1}$

$Pr\{ p | data \} \propto p^{(10+a-1)} \cdot (1-p)^{(4+b-1)}$

Now, what distribution is that? We need the normalizing factor to get the proper distribution. Looking at the form of the Beta distribution above, we can see that this is again a Beta distribution, and the normalizing factor is

$\frac{ \Gamma((10+a)+ (4+b)) } { \Gamma(10+a) \cdot \Gamma(4+b) } = \frac{1}{B(10+a, 4+b)}$

where B(x,y) is the Beta function (not to be confused with the Beta distribution!)

Therefore the distribution for the parameter $p$ is $Beta(p;a+10, b+4)$. If we assume that we know nothing about $p$, we can assume that the prior is a uniform distribution, and the uniform distribution is a special case of Beta, with $a=b=1$. In this case the distribution for $p$ is $Beta(p; 1+10, 1+4) = Beta(p; 11, 5)$, which has the following form:


Probability of "Two Heads in a Row"

Now, we need to compute the probability of the event "two heads in a row", given the data that we have, which is $Pr\{HH|data\}$. To simplify:
  • We do not update $p$ between the two tosses.
  • We assume conditional probability (i.e., given $p$, the two tosses are independent).
Given that we will not update the value of $p$, we can break the process into two steps. First, we compute the possible values of $p$ as described above; then, for each value of $p$ we compute the probability of two heads in a row. These steps are encoded into the following integral computation:

$Pr\{HH|data\} = \int_0^1 Pr\{HH|p\} \cdot Pr\{p|data\} dp$

Given the conditional independence of the two tosses:

$Pr\{HH|p\} = [Pr\{H|p\}]^2 = p^2$

So, we have:

$Pr\{HH|data\} = \int_0^1 p^2 \cdot Pr\{p|data\} dp$

From above, we know that:

$Pr\{ p | data \}= \frac{ 1 } { B(10+a, 4+b) } \cdot p^{10+a-1} \cdot (1-p)^{4+b-1}$

By replacing this value, we get:

$Pr\{HH|data\} = \frac{ 1 } { B(10+a, 4+b) } \int_0^1 p^{(10+a-1)+2)} \cdot (1-p)^{4+b-1} dp$

Solving the integral, we have:

$Pr\{HH|data\} = \frac{ B(10+a+2, 4+b) } { B(10+a, 4+b) }$

If we assume that we know nothing about $p$, we can assume that the prior is a uniform distribution, and the uniform distribution is a special case of Beta, with $a=b=1$. In this case, the probability of the event "two heads in a row" is $\frac{ B(13, 5) } { B(11, 5) } = 0.485$ and it makes sense to bet against the event. So, the Frequentist approach gives probability 51% and the Bayesian approach with uniform prior gives 48.5%. Kudos to Roy for coming up with example, and shame on me for screwing up the initial posting!

(Update based on Foster's comment below: instead of using the uniform distribution as a prior, we can be even more agnostic. In this case, we can use the Beta(0,0) distribution as a prior. Such a distribution corresponds to the case where any mean of the distribution is equally likely. In this case, the two approaches, Bayesian and frequentist give the same results.)

(Update 2 based on multiple comments below: I changed the example to be more "fully Bayesian". Initially, I was estimating the full distribution for $p$, but then, in order to compute the "two heads in a row" probability, I was collapsing $p$ into a point estimate, using its mean value. Now, with the help of the commenters, the example follows closer the Bayesian paradigm. I had to change the number of heads and tails for the example to still be valid. WolframAlpha was surprisingly easy to use for identifying numbers that will work for this example. The values (3 heads, 1 tails),  (5 heads, 2 tails), (8 heads, 3 tails), (try your own value of tails)... also work.)

19 comments:

  1. If we assume that we know nothing about p, we can assume that the prior is a uniform distribution

    Why? If p is unkown, assuming anything shouldn't be relevant after the new measures. AFAIK the proper bayesian approach would give no difference if we start from a 1:1 ratio and adjusted the odds after every toss. Also we could assess the occurrences of two heads in a row in the first 100 tosses and use it instead.

    ReplyDelete
  2. Daniel,

    This is exactly what the Bayesian approach is doing. It starts with a Beta(a,b) distribution, and it updates the distribution to Beta(a+#heads), b+#tails).

    If you do the math, you will see that the mean of the Beta(1,1) is actually 0.5, so this may qualify as "1:1" ratio that you imply.

    Regarding the two heads in a row, the experiment simply assumes independence across experiments, so this is not an issue.

    ReplyDelete
  3. And once again, how can I write proper math in Blogger? Once I figure this out, I will fix all the ugly math in the posting.

    Actually, all the math (except the [(a+71)/(a+b+100) ]^2), which wasn't wrapped in $s) rendered correctly for me in Firefox. I had to switch over to IE to see the ugly math.

    ReplyDelete
  4. There's a useful interpretation of the Beta(1:1) that may shed some light on this phenomenon.

    First: the Beta distribution is "convenient" as the prior distribution here to a large extent because it is the "conjugate" prior to the binomial distribution. This means that the posterior distribution will have the same functional form as the prior.

    So we can (sort of) unify the frequentist and Bayesian perspectives by thinking of the prior as being equivalent to having seen some number of heads and tails before. (Consider that when presented with yet new evidence, the old coin tosses are combined with the prior to form the new "prior," adding the heads and tails to a and b.)

    Using the "uniform" distribution, Beta(1,1), then corresponds to seeing one head and one tail previously. So from a frequentist point of view, it is as if we saw 72 heads and 30 tails -- after which we would be on the other side of the decision threshold (as in the solution).

    ReplyDelete
  5. Ah, the improper prior case of Beta(0,0)!

    (For this posting, I assumed that someone who knows what a conjugate prior is will not really need to read the posting :-)

    ReplyDelete
  6. I'm very late to the game here, but I was looking for Bayesian vs. Frequentist problems and stumbled across this one. I couldn't reproduce your results, however. After thinking about this for a long time, I realized that you forgot to take into account a vital piece of information! Since we are looking at the probability of tossing two heads in a row, we need the probability of tossing one head given 71 successes out of 100, followed by the probability of tossing a second head given 72 successes out of 101. Thus the probability comes out to .5002855511136494

    Unfortunately, that's on the same side of even odds as the frequentist estimate. Try 29 heads out of 41 tosses.

    ReplyDelete
  7. @jrm: Good catch.

    Yes, I implicitly assumed that we are not going to perform another Bayesian update in between the two tosses.

    So 29 out of 41 works with both approaches? (Updating vs. not updating the estimate between the two tosses?)

    ReplyDelete
  8. If you the simple frequentist estimate of 29/41, you will infer that the probability of two subsequent heads is .500297

    If you use the Beta(1,1) prior, you will infer a probability of .6976 for the first flip. If you ignore the result of the flip and continue to use the .6976, you will infer a probability of .4867, but if you take the first flip into consideration you will infer a probability of .4915

    So which is the right answer? If we pick the coin's bias by selecting a random number between 0 and 1, then we have a flat prior and should use Beta(1,1), and get the answer of .4915

    Of course, the variance is high, so you might have to do this experiment a lot to get this actual result.

    ReplyDelete
  9. Isn't the whole point of being Bayesian the estimation of posterior uncertainty? There's also the problematic issue in this pedagogical example that you're estimating down to three decimal places based on a diffuse prior and only 100 training samples.

    The central 95% interval for Beta(71+1,29+1) is roughly (.61,.79), which is a whole lot of uncertainty. The 95% interval on 2 heads in a row (without updating in between) is (.41,.58). Of course, if I could include graphics, I could plot out the posterior density or provide a histogram!

    ReplyDelete
  10. @Bob: Good comment. I added a plot of the posterior, to demonstrate that the mean estimate is rather uncertain. I do not know of the analytic form for the density of the product of two Betas (is there any?), so I skipped that.

    Now, regarding the decimals, you are right: the reliability of the measurement does not allow for more digits. Potentially I could find some other number combination, with higher number of trials that would generate the same result, but for an introductory example I would consider the current version fine.

    ReplyDelete
  11. The original problem came down to making an estimate of where the peak of the curve was, not the width. You could make the distribution as narrow as you'd like by increasing the number of tosses. (Of course you'd eventually reach an unrealistic number.)

    ReplyDelete
  12. The probability of a particular number of successes in a future trial is found by combining the posterior with the likelihood function of the future event, called the posterior predictive, which here is a beta-binomial. The probabilities for 0, 1 and 2 successes, based on a uniform prior (which one wouldn't use in the case of a coin!) are 0.0885, 0.4112 and 0.5003, respectively, as found by another method in an earlier comment too.

    ReplyDelete
  13. Very nice! Thanks for putting this up!

    ReplyDelete
  14. Jonathan FischoffAug 12, 2010 12:54 PM

    Nice. You might want to add to this page: http://math.stackexchange.com/questions/1747/how-to-determine-if-coin-comes-up-heads-more-often-than-tails

    ReplyDelete
  15. Giovanni PetrisAug 13, 2010 09:38 AM

    In the original post, I think the posterior probability of two heads in a row is E(p^2|data), which is different from E(p|data)^2. Coin flips are conditionally independent, given p, but they are not independent.

    ReplyDelete
  16. "I think the posterior probability of two heads in a row is E(p^2|data), which is different from E(p|data)^2."

    Giovanni, you have greater expertise in that topic than I do. Could you please provide some additional explanation, and help me fix this?

    ReplyDelete
  17. Ah, Giovanni, always stirring up the pot.

    As pointed out, the correct quantity is the posterior predictive. In this case, we need to calculate p(HH|data), the probability of two heads in a row conditional on the data. The math is

    p(HH|data)
    = \int p(HH|p) p(p|data) dp
    = \int [p(H|p)]^2 p(p|data) dp {by cond. ind.}
    = \int p^2 p(p|data) dp
    = E[p^2|data] {as Giovanni pointed out}
    = Beta(73,31)/Beta(71,31) {if I did the math correctly.}

    The final conclusion is that the probability of two heads in a row is 0.4865791, two tails is 0.09442224, and one of each is 0.4189987. So anybody assuming a uniform prior would not bet on two heads in a row.

    ReplyDelete
  18. You are almost correct, but p(HH|data) should factor as p(H|(data + H))p(H|data). If the first posterior flip is `heads' you need to count that as one more data point.

    ReplyDelete
  19. Thanks Jarad for the explanation, and the previous commenters for pointing the missing part!

    I now updated the posting. If you see anything missing, let me know.

    ReplyDelete