A Computer Scientist in a Business School

Sunday, October 28, 2007

Visualizing the Dirichlet

Last week, Foster Provost, Xiahoan Zhang (one of our PhD students), and I were trying to understand the internals of Latent Dirichlet Allocation (LDA).

In particular, we were getting strange results from the LDA-C program by David Blei, and we wanted to figure out what we were doing wrong. The first suspects were the parameter values. We wanted to see how the values of the different parameters affect the behavior of the technique.

The conceptual model of LDA is pretty simple. We have a Dirichlet distribution, parameterized by a k-dimensional vector alpha (k is the number of topics). The Dirichlet distribution allows us to draw k-dimensional random vectors theta_i, one for each document in the collection. The vector theta_i represents the topic mixture within the document. For example, for 3 topics (say, art, technology, and sports), a vector theta_i = (0.2, 0.5, 0.3) means that document i is 20% about art, 50% about technology, and 30% about sports. Then, each topic is modeled as a probability distribution over words, representing the relative frequency of each word within the given topic.
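To make the generative story concrete, here is a minimal Python sketch of it, using only the standard library. Everything in it is an illustrative assumption rather than anything from LDA-C: the function names, the six-word vocabulary, and the per-topic word probabilities are made up, and the Dirichlet draw is done with the standard trick of normalizing independent Gamma draws.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one vector from Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(alpha, topic_word_probs, vocab, n_words, rng):
    """Generate one document: draw theta, then a topic and a word for each position."""
    theta = sample_dirichlet(alpha, rng)  # topic mixture for this document
    words = []
    for _ in range(n_words):
        # Pick a topic according to theta, then a word according to that topic.
        topic = rng.choices(range(len(alpha)), weights=theta)[0]
        word = rng.choices(vocab, weights=topic_word_probs[topic])[0]
        words.append(word)
    return theta, words

# Hypothetical toy setup: 3 topics (art, technology, sports) over 6 words.
vocab = ["painting", "museum", "laptop", "software", "goal", "referee"]
topic_word_probs = [
    [0.45, 0.45, 0.03, 0.03, 0.02, 0.02],  # art
    [0.03, 0.03, 0.45, 0.45, 0.02, 0.02],  # technology
    [0.02, 0.02, 0.03, 0.03, 0.45, 0.45],  # sports
]

rng = random.Random(0)
theta, words = generate_document([1.0, 1.0, 1.0], topic_word_probs, vocab, 10, rng)
print(theta)
print(words)
```

The printed theta is the document's topic mixture (it sums to 1), and the word list is the "document" generated from it.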

The "alpha" parameter of the Dirichlet is crucial, as it defines the distribution of the theta_i vectors or, in other words, the "distribution of topic distributions".

When I looked at some visualizations of the Dirichlet, the shapes looked strange. Most of the visualizations of the Dirichlet (e.g., on Wikipedia) showed that the mass of the density is around the "center" of the simplex. For example, for a Dirichlet with three topics (image from Wikipedia):

most of the density was around the area (0.33, 0.33, 0.33). At the same time, very little mass was allocated at the "corners" of the simplex. Such a shape implied that the majority of the theta vectors drawn from such a distribution would mix many topics within a document, and that very few documents would be generated with only one topic.

This was a counterintuitive modeling choice. Why would we want to force most documents to have many topics, instead of letting the majority of documents have only a small number of topics? Something seemed wrong.

To better understand how the Dirichlet density is allocated in the simplex, I pulled up my favorite tool, Maple, and started writing my visualization code. First, I defined the 3-d Dirichlet density function.

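# Multivariate Beta function: the normalizing constant of the 3-d Dirichlet.
B := (a1, a2, a3) -> (GAMMA(1.0*a1) * GAMMA(1.0*a2) * GAMMA(1.0*a3)) / GAMMA(1.0*a1+1.0*a2+1.0*a3);

# Dirichlet density at the point (x1, x2, 1-x1-x2) of the simplex.
Dir := (x1, x2, a1, a2, a3) -> (x1^(a1-1)) * (x2^(a2-1)) * ( (1-x1-x2)^(a3-1)) /B(a1,a2,a3) ;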
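For readers without Maple, the same density can be sketched in Python with the standard library. This is only an illustrative translation of the B and Dir definitions, not the code I ran: the function name log_dirichlet_density is made up, and it works directly in log space (via math.lgamma) to avoid overflow. A handy sanity check is that for alpha = (1, 1, 1) the density is constant and equal to GAMMA(3) = 2 everywhere on the simplex.

```python
import math

def log_dirichlet_density(x, alpha):
    """Log density of Dirichlet(alpha) at a point x with all x_i > 0 and sum(x) == 1."""
    # Log of the multivariate Beta function (the normalizing constant).
    log_norm = sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))
    return sum((a - 1.0) * math.log(xi) for a, xi in zip(alpha, x)) - log_norm

center = (1 / 3, 1 / 3, 1 / 3)

# Sanity check: with alpha = (1, 1, 1) the density is flat, so this prints log(2).
print(log_dirichlet_density(center, (1.0, 1.0, 1.0)))
```

Because the third coordinate is passed explicitly here, the caller is responsible for supplying a point that actually lies inside the simplex.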

Then, I plotted the log of the density in an animated 3d plot. Since I just wanted to see how the value of the vector alpha changes the distribution, I decided to keep all the elements of alpha equal to each other, and vary them from 0.3 to 2.0.

with(plots);

plotsetup(gif, plotoutput=`LogDirichletDensity-alpha_0.3_to_alpha_2.0.gif`);

animate ( plot3d, [eval(log(Dir(x1, x2, a1, a2, a3)), {a1=a, a2=a, a3=a}), x1=0.00..1, x2=0.00..1, axes=BOXED, grid=[25,25], gridstyle=triangular, orientation=[-135, 60], shading=zhue, contours=20, style=surfacecontour, view=-3..2 ], a=0.3..2.0, frames=100);

The result was pretty revealing:

When the values of alpha are below 1.0, the majority of the probability mass is in the "corners" of the simplex, generating mostly documents that have a small number of topics. When the values of alpha are above 1.0, the majority of the generated documents tend to contain all the topics. (Yes, I am pretty sure that anyone with experience in Bayesian statistical modeling who is reading this post is yawning by now. But this is my blog and I will write things that I find interesting.)
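The same effect can be checked by sampling instead of plotting. The sketch below is a rough Python illustration under my own assumptions: the function names, the 0.8 threshold for calling a document "dominated" by one topic, the number of draws, and the seed are all arbitrary choices, and the Dirichlet draw uses normalized Gamma variates from the standard library.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one vector from Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def fraction_dominated(a, n_topics=3, threshold=0.8, n_draws=20000, seed=0):
    """Fraction of draws from a symmetric Dirichlet whose largest component exceeds threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        theta = sample_dirichlet([a] * n_topics, rng)
        if max(theta) > threshold:
            hits += 1
    return hits / n_draws

low = fraction_dominated(0.3)   # alpha below 1.0: mass near the corners
high = fraction_dominated(2.0)  # alpha above 1.0: mass near the center
print("alpha=0.3:", low)
print("alpha=2.0:", high)
```

The first fraction should come out far larger than the second: with alpha at 0.3 a large share of the sampled documents are dominated by a single topic, while with alpha at 2.0 only a small share are.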

So, for general topic collections, and traditional "topical clustering" tasks, it seems best to drive the LDA inference towards configurations that have low alpha values, so that each document contains a small number of topics.

Someone might ask: when would we want to examine configurations where alpha is higher than 1.0 at all? Can such configurations be useful? Well, there are cases where documents frequently contain a mixture of topics. In product reviews, the reviewers tend to cover a large number of product features (e.g., for a digital camera they will talk about the lenses, about the image quality, about battery life, and so on). Hotel reviews on TripAdvisor also exhibit a similar structure. Similarly, in feedback postings in the reputation systems of eBay and of Amazon, the buyers tend to give feedback about various characteristics of the merchant with whom they transacted (how fast the delivery was, whether the product was accurately described, and so on). In these cases, LDA tends to work best when the values of alpha are above 1.0. Otherwise the resulting topics do not make much sense.

Friday, October 26, 2007

Using Facebook ... as a facebook

There is a lot of discussion lately about Facebook, its API, its valuation, its growth prospects, and so on. Lately, I realized that Facebook has a very nice functionality: it is a ... facebook! What do I mean?

I have had a Facebook profile since late 2004. While I have a full profile, I am not that active, since I tend to reach most of my contacts either through email or through instant messaging. Nevertheless, every year, students who take my undergrad class add me as a "friend" on Facebook, and these contacts remain even after the class is over.

Now, as my freshman students from 2004 reach their senior year, they start asking for recommendation letters. Remembering who is asking for a recommendation, just by looking at the name, tends to be tricky: each year we deal with hundreds of students, and unfortunately our human memories are not keeping up with Moore's law. However, when the student is on Facebook, I can easily check the corresponding Facebook profile. By looking at the photo of the student, I can very easily remember the student, the performance in class, and the general impression that I had formed during the course. Then, checking the grades on the homework and projects completes the picture, and writing a recommendation letter becomes much, much easier.

In effect, Facebook for me today tends to be like a LinkedIn for connecting with my students. If only they removed such sentences from their privacy policy: "Facebook may also collect information about you from other sources, such as newspapers, blogs, instant messaging services, and other users of the Facebook service .... We may use information about you that we collect from other sources, including but not limited to newspapers and Internet sources such as blogs, instant messaging services, Facebook Platform developers and other users of Facebook, to supplement your profile..."

Wednesday, October 10, 2007

Only for Database Geeks

From today's xkcd:

Monday, October 8, 2007

Implicit and Explicit Changes in Contracts: Rent-A-Coder

I have blogged previously about outsourcing some research tasks using Amazon Mechanical Turk. Another option that I have been using is Rent-A-Coder (RAC): I am using this service mainly for outsourcing basic programming tasks, such as building basic crawlers for data collection. (Even though someone could argue that students can be used for writing such programs, I believe that the time of the PhD students is better spent trying to solve research problems, rather than completing basic programming tasks.) So far, I have been rather satisfied with the overall process, although there were glitches from time to time. (More about the overall experience in another post.)

However, an incident that happened lately forced me to think seriously about whether to assign any important project through the RAC platform. Specifically, for one of the projects, we ended up disagreeing with the coder about the specifications. The summary of the disagreement:

I posted a project description and the corresponding specifications.
The coder proposed a change to the specifications.
I did not explicitly accept or reject the proposed change, but rather pointed the coder to the contract to see the description.
I accepted the bid.
The coder had 24 hours to review and accept or reject the project assignment.
The coder assumed that I had agreed with the modifications that he proposed.
The coder delivered the modified project, which was not according to the project specification.

After the project was delivered, we could not agree whether the deliverable was acceptable or not. Since we could not agree, we resorted to arbitration, which is done by the Rent-A-Coder staff. According to the RAC terms of service, in case of disagreement, RAC serves as a judge and decides who is correct.

The arbitrator, as part of the analysis said:

The coder has proposed a change in the contract.
The buyer has the following options:
A. Explicitly reject the change in the contract.
B. Implicitly/explicitly accept the change in the contract.
The buyer did not reject the change and implicitly accept the new requirements.

These steps are not described in the terms of RAC, but rather were devised by the arbitrator. In defense of this ruling, the arbitrator said that the same rules would apply if I, as a buyer, had asked for some extra features to be implemented. In such a case, if the coder did not explicitly reject them, the coder would have to implement the extra features, even if they are not described explicitly in the project description.

And here lies my disagreement with the ruling. My understanding is that a contract can be modified only explicitly, not implicitly. A party can reject changes implicitly or explicitly. This setting effectively gives priority to the statements in the contract (in this case, project specification) over the changes that are proposed during negotiations (which can be accepted and agreed upon explicitly). Otherwise, we have a dangerous precedent, where one party (buyer or coder) can start proposing an endless list of amendments, and the other party has to waste time explicitly rejecting the proposed changes, stating that they are out of the scope of the original project description.

At least this is my understanding of the Uniform Commercial Code. It is clear to me that Rent-A-Coder has the right to rule independently of the provisions of the code. However, it seems problematic to devise a new set of undocumented rules when handling outsourcing contracts, instead of relying on existing legislation. If nothing else, it does not build trust among the participants in the RAC marketplace to know that the existing law does not apply to RAC contracts.

Opinions? Am I incorrect in my analysis?

Disclaimer: I am not a lawyer, so the analysis above is based on my own interpretation of the law.