Tuesday, December 18, 2007


Just came back from a short trip in England and I realized that the number of visitors to the blog has gone up tremendously. After looking at the referrals, I spotted the website that sent the traffic:


Apparently Steven D. Levitt and Stephen J. Dubner, the authors of the well-known Freakonomics book and of the homonymous blog at The New York Times, considered interesting the posting about the efficiency of the prediction markets.

Apparently having a link from a New York Times blog results in a couple of thousand visitors within a day. Given that I was just talking about some very preliminary results in that posting, reaching that level of readability so early on, is as close as it gets to "instant gratification" in terms of research :-)

Plus, I now feel the urge to write down all the results as soon as possible, following the "
authors should write to satisfy their readers"mode of doing research (via Daniel Lemire). Indeed a very good motivation!

Sunday, December 9, 2007

Political Prediction Markets: Some Thoughts

Apparently, my last postings on the predictability of the political prediction markets generated some interest. The analysis is more difficult in this scenario, but for the next few days we see stabilizing signals with a trend to go upwards" and we were proven wrong: the price declined from 43 on Dec 2nd, to 39.5 on Dec 9th, an 8% decline. I realized what was wrong in my reasoning. What was stabilizing was the sentiment index, not the price. And a stabilized sentiment around 50% tends to be a pretty bad adviser on how the market will move.

Bo's comment made me think about parallels in "prediction market trading" and "stock market trading". As Bo pointed out, in existing stock markets, there is a significant amount of algorithmic trading. This algorithmic trading makes the stock market significantly more efficient than, say, in the early 1980's where the programmatic trading was at its infancy. In fact, I have heard many stories from old-timers, saying that in the early days it was extremely easy to find inefficiencies in the markets and get healthy profits. As algorithmic trading proliferated, it became increasingly harder to spot inefficiencies in the market.

Something similar can happen today with prediction markets. If we have a prediction market platform that allows automatic/algorithmic trading, then we can improve tremendously the efficiency of today's prediction markets. Furthermore, such a tool (if done with play money) can be used as a great educational tool, similar to the now inactive Penn-Lehman Automated Trading (PLAT) Project. Allowing also for some data integration from the existing prediction markets (BetFair, Intrade, etc.) we could have a pretty realistic tool that can be used for many educational purposes that, at the same time, can generate useful and efficient prediction markets.

Now, I need to find someone willing to fund the idea. Ah, there are a couple of NSF call for proposals still open :-)

Tuesday, December 4, 2007

By Popular Demand: Mitt Romney

The last post generated some general interest and I got requests to post analysis for more presidential candidates. As one more data point, here is the market for Mitt Romney to be the Republican Presidential Nominee in 2008:

I checked again our sentiment indicator (in maroon), which seems to capture well the upward spikes. (If you see carefully, our indicator spikes upwards before the market.)

This market, similarly to the market of Hillary Clinton, seems to move in cycles. This cyclical behavior can be nicely visualized by plotting the 30-day moving average of our sentiment index (in black). It seems that a downward cycle has started for Romney and should should continue for the next couple of weeks, until the 30-day moving average gets close to 0.3 or so. Time will tell :-)

Sunday, December 2, 2007

Prediction Markets are NOT Efficient

I have been wondering in the past if prediction markets are efficient. Then, I was saying:
how long does it take for a prediction market to incorporate all the available information about an event? Liquidity seems to be an issue for the existing prediction markets, preventing them from reaching equilibrium quickly.
In fact, today's prediction markets are far from being efficient. Ari Gilder and Kevin Lerman, as part of an undegraduate project at University of Pennsylvania supervised by Fernando Pereira, have shown that by using linguistic analysis of news articles it is possible to predict the future price movements of the Iowa Electronic Markets. Therefore, the Iowa markets did not incorporate all the available information. Furthermore, the results indicated that it is possible to predict the price movement by simply using past pricing data. Therefore, the markets were not even weakly efficient. (Kevin is now a first year PhD student at Columbia University.)

One question was whether liquidity played a role in that result. The Iowa markets are thinly traded with upper limit on how much someone can bet. This imposes some artificial constraints making it difficult for information to flow freely into the market. Therefore, it is important to examine other markets with higher liquidity.

Over the last months we have been discussing this issue with George Tziralis, trying to examine how to evaluate the "Efficient Prediction Market" hypothesis. After long discussions, we came up with some techniques for extracting signals from the news about the prediction markets and see whether we can use these signals for predicting the future performance of markets in InTrade. Our sentiment indicator seems to work pretty well, even in liquid markets. Here is a preliminary result for the market on whether Hillary Clinton will be the Democratic Presidential Nominee in 2008:

Our sentiment index (in maroon) is close to 1 when we predict that the market will move higher, and it is close to 0 when we predict that the market will move down. Typically, it works pretty well for predicting long periods of price increases and declines. To put our money where our mouth is, the signal from the last few days shows that Hillary's market price will edge lower in the next few days/weeks.

The market prices for whether Giuliani will be the Republican Presidential Nominee in 2008, together with our sentiment index is displayed below.

The analysis is more difficult in this scenario, but for the next few days we see stabilizing signals with a trend to go upwards.

We will need to analyze quite a few more markets before generating the paper, but so far the results seem interesting.

Let's see what the future brings :-)

Saturday, November 24, 2007

My First Mechanical Turk Paper

A while back, I wrote about my experiences after using Amazon Mechanical Turk for conducting experiments that require input from users. I am now happy to announce :-) that the first paper that uses this methodology been accepted at IEEE ICDE 2008 and is now available online.

The paper, co-authored with Wisam Dakka, discusses a simple empirical technique for automatically extracting from a text database a set of facet hierarchies that are useful for browsing the contents of the database. We used Mechanical Turk in this context to evaluate the "precision" and "recall" of the generated hierarchies. This experiment would have been almost impossible to conduct without using Mechanical Turk, as it would require multiple users reading and annotating thousands of news articles. Using Mechanical Turk, the experiments were done in less than three days.

In the final experiment of the paper, though, where we needed to interview and time the users while they were using the faceted hierarchies to complete various tasks, we resorted to the traditional, lab-based setting. However, during the summer, we only managed to recruit five users that expressed interest to participate. We observed them in the lab while performing their tasks and recorded their reactions and impressions. (Fortunately, the results were statistically significant.)

Next time, we will attempt to use Mechanical Turk for such "interview+timing" experiments as well. However, I will need to talk more with people that perform often such experiments to see how they would react to such approaches, where the human subjects are completely disconnected from the researcher. Even though simple timing experiments can be easily performed using MTurk, I am a little uncomfortable about the reliability of such experiments.

Tuesday, November 20, 2007

GMail Supports IMAP (but forgets to notify the abuse team)

A couple of weeks back, Google announced that GMail supports IMAP, an email protocol that allows synchronization of email across email clients and across platforms. The support of IMAP, together with the option to buy additional disk space on Google was a great incentive for me to migrate all my email to GMail, and have it available online, accessible and searchable from everywhere.

So, I paid $20 for the extra 10Gb of disk space, and I setup my email client (Thunderbird 2.0) to access GMail over IMAP. One of the first things that Thunderbird does when accessing an email account through IMAP is to synchronize the content of the remote folders with the content of the locally stored ones. Plus, I decided to upload all my old email to GMail.

Unfortunately, GMail considers such actions "Unusual Activity" and keeps locking me out of my email account for 24 hours. (Under Thunderbird, I get the cryptic message "lockdown in sector 4," whatever this means.) In fact, over the last week, it is almost impossible to do anything that resembles "heavy activity" with my GMail account, since I am continuously locked out and I get the following when I try to login through the web:

Although I understand that Google wants to protect the service from being abused, I see little reason for locking out completely its users from accessing email. Bandwidth throttling seems to be a much better choice for controlling "strange" behavior; in the worse case, Google can block access through IMAP but still allow the user to access the account over the web.

Furthermore, there should be an option for contacting customer support and getting some answer back. Right now, I only get the standard boilerplate response, indicating that I have done something wrong (bad me!) and in 24 hours I will get again access to my email. It is absolutely impossible to reach someone at Google and understand why I am getting locked out.

I suspect that the abuse-detection team (is there such a thing?) needs to update its policies and triggers, to understand better the "expected behavior" of email clients under IMAP. Blocking access without any warning to a mission-critical service (especially for paying customers), seems like a no-no decision to me.

Update (Dec 3, 2007): I just received an email from Google:


You recently contacted us about disabled access to your Gmail account due to abnormal account activity, specifically message uploading.

While our engineers are working diligently to make the upload process faster and easier, we're currently unable to provide support for message uploading.

We wanted to remind you that, at this time, uploading an excessive number of messages to your Gmail account via IMAP may lead to being temporarily locked out of your account. If this happens to you, please be aware that these lockouts are temporary and you should be able to re-access your
account shortly.

We appreciate your patience while we work to improve Gmail.


The Google Team
Well, too bad that now GMail blocks every time that I am trying to synchronize my Thunderbird client with GMail, which corresponds to a large number of downloaded messages. Furthermore, this email does not answer at all why I am blocked from accessing my GMail through the web interface, or actually why I am getting locked at all. (Have anyone heard of throttling?)

Tuesday, November 13, 2007

New Class: Search and the New Economy

Next semester, I will be teaching an MBA class with the title "Search and the New Economy," and I will be also participating in the undergraduate version of the class, taught by Norm White. The intended audience for the class are MBA students, that have interest in technology but are not necessarily programmers.

I have been thinking a lot on how to organize such a class, so that it has some internal structure and flow. My current list of topics:
  1. Search Engine Marketing: Introduction, Search Basics: Crawling, Indexing, Ranking, Pagerank, Spam, TrustRank
  2. Search Engine Marketing: Analyzing and Understanding Users╩╣s Behavior, Web Analytics
  3. Search Engine Marketing: Search Engine Optimization
  4. Search Engine Marketing: AdWords, AdSense, Click Fraud
  5. Social Search and Collective Intelligence: Blog Analysis and Aggregation, Network Analysis, Opinion Mining
  6. Social Search and Collective Intelligence: Recommender Systems, Reputation Systems
  7. Social Search and Collective Intelligence: Prediction Markets
  8. Social Search and Collective Intelligence: Wikis and Collaborative Production
  9. Ownership of Electronic Data: Privacy on the Web
  10. Ownership of Electronic Data: Intellectual Property issues on the Web
  11. Ownership of Electronic Data: The Future of Privacy and Intellectual Property
  12. Future Directions and Wrapping‐up
Some rough sketches of the assignments for this course:
  • Run and optimize an online advertising campaign, using Google AdWords or Microsoft adCenter.
  • Analyze the visitorship data of an online website to analyze the effectiveness of different pages. You can use Google Analytics, or tools like CrazyEgg
  • Optimize the keyword campaign of a company by choosing the appropriate keywords and bid amounts, depending on the competition and the rank of the organic pages.
  • Analyze (or build) a recommender system for movies, books, and TV Shows using Facebook data.
  • Build a dating recommendation system using Facebook data
  • Build prediction markets at Inkling Markets, for an event of interest, examine the accuracy of the predictions, and analyze the behavior of the participants. Alternatively, analyze real‐money prediction markets at InTrade and BetFair and examine the effect of real‐life events in political campaigns.
  • Use Google Trends to build a predictor of unemployment measures.
Any more topics what would be worth covering? Alternative exercises?

Update (Feb 21): The class material (slides and recorded lectures, for now) is now available at the class website. You can also look at the class roster and at the prediction market site of the class.

Wednesday, November 7, 2007

What is Wrong with the ACM Typesetting Process?

Recently, I had to go through the process of preparing the camera-ready version for two ACM TODS papers. I am not sure what exactly is the problem but the whole typesetting process at ACM seems to be highly problematic.

My own pet peeves:

Pet peeve A: The copyeditors do not know how to typeset math and they do not even check the paper to see if they have incorporated correctly their own edits.

I detected problems repeatedly and the copyeditor consistently does not check the proofs after making the edits. Here are a few examples.

Example #1

I submit the latex sources and the PDF, with the following equation:

The copyeditor does not like the superscripted e^{\beta x_a}, so decides to convert it into the inline form exp(\beta x_a). Not a bad idea! Look, though, what I get back instead:

To make things worse, such errors were pervasive and appeared in many equations in the paper. I asked the copyeditor to fix these errors and send me back the paper after the mistakes are fixed, so that I can check it again. I get reassured that I will be able to inspect the galley proofs again before they go to print. Well, why would I expect that someone who does such mistakes will be diligent enough to let me inspect again the paper...

A couple of weeks later, and despite all the promises, I get an email indicating that my paper was published and is available online. I check the ACM Digital Library, and I see my paper online, with the following formula:

OK, so we managed to get an interesting hybrid :-). Seriously, do the ACM copyeditors even LOOK at what they are doing? If they do check and they do not understand that this is an error, why do we even have copyeditors?

Example #2

I assumed that the previous snafu was just an exception. Well, never say never. A couple of days back, I got the galleys for another TODS paper, due to be published in the next few days. Again, the copyeditor decided to make (minor) changes in the equations. In my originally submitted paper, I had the following equations:

In the galleys, the same equations look like:

I will repeat myself: do the ACM copyeditors even LOOK at what they are doing? If they do check and they do not understand that this is an error, why do we even have copyeditors?

Pet peeve B: Converting vectorized figures into bitmaps

If you have submitted a paper to a conference, you know how crazy the copyeditors get about getting PDFs with only Type 1 fonts, vectorized, not-bitmapped, and so on. This is a good thing, as the resulting PDFs contain only scalable, vector-based fonts that look nice both on screen and on paper.

For the same reason, I also prepare nice, vectorized figures for my papers, so that they look nice both on screen and on paper. However, for some reason, the copyeditors at ACM they seem to like to convert the vectorized images into horrible, ugly bitmaps that do not scale and look awful. Here is an example of a figure in the original PDF:

Here is how the same figure looks at the PDF that I received as a galley:

Am I too picky? Is it bad that I want my papers to look good?

End of pet peeves

(Note: The same copyediting process, described above, at IEEE seems to work perfectly fine.)

I start believing that the whole idea of publishing is a horribly outdated process. I assumed that copyeditors were a part of a chain that adds value to the paper, not a part that subtracts value.

If I need to check carefully my paper, being afraid that the copyeditor will introduce bugs, that the copyeditor will make everything look horrible, then why do we even have copyeditors? Just get rid of them; they are simply parasites in the whole process! Can you imagine having a professor that teaches a class and at the end the students know less about a topic? Would you keep this professor teaching?

Make everything open access. Let every author be responsible for the way that the paper looks. Let the authors revise papers in digital libraries that have problems. Why we consider perfectly acceptable to have bug fixes and new versions for applications and operating systems, but we want the papers that we produce to be frozen in time and completely static?

Furthermore, the whole motivation for having journals is to have the peer-reviewing process that guarantees that the "published" paper is better than the submitted one. Everything else is secondary. Why keep in the chain processes that only cause problems?

When are we going to realize that the publication system should be completely revamped? Why not having an ongoing reviewing process, improving the paper continuously? Should we keep the system as-is so that we can be "objectively evaluated" by counting static papers that are produced once and never visited again?

OK, venting is over. Back to the SIGMOD papers.

Sunday, October 28, 2007

Visualizing the Dirichlet

Last week, while working with Foster Provost and Xiahoan Zhang, one of our PhD students, we were trying to understand the internals of the Latent Dirichlet Allocation.

In particular, we were getting strange results from the LDA-C program by David Blei, and we wanted to figure out what we were doing wrong. The first suspects were the parameter values. We wanted to see how the values of the different parameters affect the behavior of the technique.

The conceptual model of LDA is pretty simple. We have a Dirichlet distribution, parameterized by a k-dimensional vector alpha (k is the number of topics). The Dirichlet distribution allows us to draw k-dimensional random vectors theta_i, one for each document in the collection. The vector theta_i represents the topic mixture within the document. For example, for 3 topics (say, art, technology, and sports), a vector theta_i = (0.2, 0.5, 0.3), means that the document i is 20% about art, 50% about technology, and 30% about sports. Then, each topic is modeled as a random distribution of words, representing the relative frequency of each word within the given topic.

The "alpha" parameter of the Dirichlet is crucial, as it defines the distribution of the theta_i vectors or, in other words, the "distribution of topic distributions".

After looking at some visualizations of the Dirichlet, the demonstrated shapes looked strange. Most of the visualizations of the Dirichlet (e.g., on Wikipedia) showed that the mass of the density distribution is around the "center" of the simplex. For example for a Dirichlet with three topics (image from Wikipedia):

most of the density was around the area (0.33, 0.33, 0.33). At the same time, very little mass was allocated at the "corners" of the simplex. Such a shape implied that the majority of the theta vectors drawn from such a distribution would be vectors that have mix many topics within a document. At the same time, very few documents would be generated with only one topic.

This was a counterintuitive modeling choice. Why do we want to force most documents to have many topics and not have the majority of documents to have only a small number of topics? Something seemed wrong.

To understand better how the Dirichlet density function is allocated in the simplex, I pulled my favorite Maple, and started coding my visualization code. First, I defined the 3-d Dirichlet density function.
B := (a1, a2, a3) -> (GAMMA(1.0*a1) * GAMMA(1.0*a2) * GAMMA(1.0*a3)) / GAMMA(1.0*a1+1.0*a2+1.0*a3);

Dir := (x1, x2, a1, a2, a3) -> (x1^(a1-1)) * (x2^(a2-1)) * ( (1-x1-x2)^(a3-1)) /B(a1,a2,a3) ;

Then, I plotted the log of the density in an animated 3d plot. Since I just wanted to see how the value of the vector alpha change the distribution, I decided to keep all the vector elements to alpha equal to each other, and vary them to have values from 0.3 to 2.0.


plotsetup(gif, plotoutput=`LogDirichletDensity-alpha_0.3_to_alpha_2.0.gif`);

animate ( plot3d, [eval(log(Dir(x1, x2, a1, a2, a3)), {a1=a, a2=a, a3=a}), x1=0.00..1, x2=0.00..1, axes=BOXED, grid=[25,25], gridstyle=triangular, orientation=[-135, 60], shading=zhue, contours=20, style=surfacecontour, view=-3..2 ], a=0.3..2.0, frames=100);

The result was pretty revealing:

When the values of alpha are below 1.0, the majority of the probability mass is in the "corners" of the simplex, generating mostly documents that have a small number of topics. When the values of alpha are above 1.0, the majority of the generated documents tend to contain all the topics. (Yes, I am pretty sure that any people with experience in Bayesian statistical modeling who are reading this post are yawning by now. But this is my blog and I will write things that I find interesting.)

So, for general topic collections, and traditional "topical clustering" tasks, it seems best to drive the LDA inference towards configurations that have low alpha values, so that each document contains a small number of topics.

Someone might ask, when we would want at all to examine configurations where alpha is higher than 1.0? Can such configurations be useful at all? Well, there are cases when documents tend to contain frequently a mixture of topics. In product reviews, the reviewers tend to cover a large number of product features (e.g., for a digital camera they will talk about the lenses, about the image quality, about battery life, and so on). Hotel reviews on TripAdvisor also exhibit a similar structure. Similarly, in feedback postings in the reputation systems of eBay and of Amazon, the buyers tend to give feedback about various characteristics of the merchant with whom they transacted (how fast was the delivery, whether the product was accurately described, and so on). In these cases, LDA tends to work best when the values of alpha are above 1.0. Otherwise the resulting topics do not make much sense.

Friday, October 26, 2007

Using Facebook ... as a facebook

There is a lot of discussion lately about Facebook, its API, its valuation, its growth prospects, and so on. Lately, I realized that Facebook has a very nice functionality: it is a ... facebook! What do I mean?

I have had a Facebook profile since late 2004. While I have a full profile, I am not that active, since I tend to reach most of my contacts either through email or through instant messaging. Nevertheless, every years students who take my undergrad class, add me as a "friend" on Facebook, and these contacts remain even after the class is over.

Now, as my freshmen students in 2004 reach their senior year, they start asking for recommendation letters. Remembering who is the person who asks for a recommendation, just by looking at the name, tends to be tricky: Each year we deal with hundreds of students, and unfortunately our human memories are not keeping up with Moore's law. However, when the student is on Facebook then I can check easily the corresponding Facebook profile. By looking at the photo of the student, I can remember very easily the student, the performance in class, and the general impression that I had formed during the course. Then, checking the grades in the homeworks and projects completes the image, and writing a recommendation letter tends to be much much easier.

In effect, Facebook for me today tends to be like a LinkedIn for connecting with my students. If only they removed such sentences from their privacy policy: "Facebook may also collect information about you from other sources, such as newspapers, blogs, instant messaging services, and other users of the Facebook service .... We may use information about you that we collect from other sources, including but not limited to newspapers and Internet sources such as blogs, instant messaging services, Facebook Platform developers and other users of Facebook, to supplement your profile..."

Monday, October 8, 2007

Implicit and Explicit Changes in Contracts: Rent-A-Coder

I have blogged previously about outsourcing some research tasks using Amazon Mechanical Turk. Another option that I have been using is Rent-A-Coder (RAC): I am using this service mainly for outsourcing basic programming tasks, such as building basic crawlers for data collection. (Even though someone could argue that students can be used for writing such programs, I believe that the time of the PhD students is better spent trying to solve some research problems, rather than completing basic programming tasks). So far, I have been rather satisfied with the overall process, although there were glitches from time to time. (More about the overall experiences in another post.)

However, an incident that happened lately forced me to think seriously about assigning any important project using the RAC platform. Specifically, for one of the projects, we ended up disagreeing with the coder about the specifications. The summary of the disagreement:
  • I have posted a project description and the corresponding specifications.
  • The coder proposed a change to the specifications
  • I did not accept or reject explicitly the proposed change, but rather pointed the coder to the contract to see the description
  • I accepted the bid
  • The coder had 24 hours to review and accept or reject the project assignment
  • The coder assumed that I agreed with the modifications that he proposed, and
  • The coder delivered the modified project, which was not according to the project specification
After the project was delivered, we could not agree whether the deliverable was acceptable or not. Since we could not agree, we resorted to arbitration, which is done by the Rent-A-Coder staff. According the the RAC terms of service, in case of disagreement, the RAC serves as a judge and decides who is correct.

The arbitrator, as part of the analysis said:
The coder has proposed a change in the contract.
The buyer has the following options:
A. Explicitly reject the change in the contract.
B. Implicitly/explicitly accept the change in the contract.
The buyer did not reject the change and implicitly accept the new requirements.
These steps are not described in the terms of RAC, but rather are devised by the arbitrator. In defense of this ruling, the arbitrator said that the same rules would apply if myself, as a buyer, had asked for some extra features to be implemented. In such a case, if the coder did not explicitly reject them, the coder would have to implement the extra features, even if they are not described explicitly in the project description.

And here lies my disagreement with the ruling. My understanding is that a contract can be modified only explicitly, not implicitly. A party can reject changes implicitly or explicitly. This setting effectively gives priority to the statements in the contract (in this case, project specification) over the changes that are proposed during negotiations (which can be accepted and agreed upon explicitly). Otherwise, we have a dangerous precedent, where one party (buyer or coder) can start proposing an endless list of amendments, and the other party has to waste time explicitly rejecting the proposed changes, stating that they are out of the scope of the original project description.

At least this is my understanding of the Uniform Commercial Code. It is clear to me that the Rent-A-Coder has the right to rule independently of the provisions of the code. However, it seems problematic to devise a new set of undocumented rules when handling outsourcing contracts, instead of relying on existing legislation. If not anything else, it does not build trust among the participants in the RAC marketplace to know that the existing law does not apply in the RAC contracts.

Opinions? Am I incorrect in my analysis?

Disclaimer: I am not a lawyer, so the analysis above is based on my own interpretation of the law.

Saturday, September 22, 2007

The Price of Privacy (comScore edition)

The data from comScore are used extensively to analyze trends about internet companies, and are also used by academic researchers when doing web research and need user behavior data. The breadth of information that comScore captures about the users is breath-taking. Almost all the clicks, URLs, and queries submitted by a user are available to researchers that get access to the collected data. Even though comScore does not release any personally identifiable information about its panelists, it is not impossible to infer a lot about each individual user by looking at their Internet behavior. (The examples from the AOL query log are plenty.)

So, I have been wondering what comScore offers to the users to convince them to participate in the comScore panel. According to comScore, the benefits that their panelists get are:
  • Security software applications such as server-based virus protection, remote data storage, encrypted local storage, Internet history removal
  • Attractive sweepstakes prizes
  • Opportunity to impact and improve the Internet
Still, I thought that this cannot be everything. The offerings were not at all lucrative to convince someone to release so much personal information. Today, though, I got my answer. Someone called me from comScore and after asking me a long list of demographic-related questions (to which I lied systematically), I was offered to be one of the panelists for comScore, and have all my internet behavior tracked and recorded, so that I can "have the opportunity to impact and improve the Internet" (yeah!). Then the coveted lucrative prize was revealed: $25 when I download the software and $5 per month afterwards, (plus the benefits listed above).

So, now you know. Your privacy is worth a couple of double espressos per month. You have already surrendered for free everything about you to Google researchers. Now you can get paid $5/month to have the rest of the world to look at your Internet habits.

Ambiguous First Names and Disambiguation

I was preparing an assignment for my class, trying to introduce students to issues of data quality, and I was using Facebook data for this.

As a simple example, I wanted students to find automatically the gender of a person, given only the first name, since 1/3 of the Facebook users do not list their gender. (The "homework motivation" was the need to send letters to customers, and we need to decide whether to put "Dear Mr." or "Dear Ms." as a greeting.) In general, the task is relatively easy and the majority of the names are not ambiguous. However, there is a set of highly ambiguous names, for which inference based on first name is problematic. For your viewing pleasure, the most ambiguous first names, together with the confidence that the name belongs to a male:

Ariel 50.00%
Yang 50.00%
Kiran 50.00%
Nikita 50.00%
Casey 49.30%
Min 46.67%
Paris 53.85%
Dorian 53.85%
Adi 45.45%
Kendall 45.45%
Quinn 54.55%
Aubrey 54.55%
Sunny 44.83%
Angel 55.32%
Yan 41.67%
Yi 41.67%
Yu 58.33%
Devon 59.46%
Nana 40.00%
Jin 38.89%
Ji 38.46%
Ming 61.54%
Taylor 37.80%
Rory 62.50%
Carey 36.36%
Sami 63.64%
Robin 34.55%
Ali 34.45%
Jean 34.09%

The next part of the homework, motivated by the ambiguity for some of the first names, asks students to guess the gender of a person based on the other stated preferences on Facebook profiles, regarding movies, books, TV shows and so on.

Based on the analysis of these features, women favor overwhelmingly the books "Something Borrowed," "Flyy Girl," "Good In Bed," "The Other Boleyn Girl," "Anne Of Green Gables", the movie "Dirty Dancing" and they like dancing as an activity.

On the other hand, characteristics that are unique to men are movies like "Terminator 2," "Wall Street," "Unforgiven," "The Good the Bad and the Ugly," "Seven Samurai"; the book "Moneyball"; sports-related activities (baseball, lifting) and sports-related TV shows (e.g., PTI, Sportscenter, Around the Horn). Another distinguishing feature of men is that they list "women" and "girls" as their interests (and in this case they should also think about taking perhaps some dancing lessons :-)

Friday, September 7, 2007

Experiences using Amazon Mechanical Turk

Everyone who has ever done web-related research knows that user studies and obtaining user evaluations is one of the most frustrating parts of the process. We often need to create data sets with annotated data, users have to look at the results of the proposed systems and evaluate their quality, and so on. Commonly, we tend to rely on "volunteers" for such tasks, i.e., asking friends and colleagues to participate in the experiment, doing mundane, repetitive and tedious work. Typically, after the "volunteer" has participated in one experiment it was extremely difficult to be convinced to participate in another one. (PhD students are the exception --- they typically have no choice.) The whole process can easily take 2-3 weeks just to gather enough data, plus it is unclear how accurate are the results that are gathered by such overused human subjects.

When I joined Stern, I was glad to find out about the "behavioral lab", which made recruiting of users much easier. We just have to specify on a web form the target demographics and then a set of 1000 registered (real) volunteers are notified. This facilitates greatly the process and makes sure that the participating users are indeed willing to participate in the experiments. Still though, the process is tiring as someone has to wait in the lab for the users, give directions, pay the participants, and so on.

One interesting alternative is Amazon Mechanical Turk, a service introduced by Amazon in November 2005. Mechanical Turk allows requesters to post a large number of small tasks, and pay a small amount to the person who completes the task. Examples of such tasks include:
  • can you see a person in the photo?
  • is the document relevant to a query?
  • is the review of this product positive or negative?
We have started experimenting with Mechanical Turk early in 2007, to complete various tasks. I was truly surprised with the speed that "Turkers" complete these tasks. Asking users to annotate 150 articles took less than 2 hours. Clearly, much faster than traditional techniques.

One of the problems that we faced was the uncertainty about the validity of the submitted answers. We had no way to ensure that an answer submitted by an "Turker" was a carefully thought answer or just a random response.

To avoid this problem, we decided to get multiple, redundant answers for the same question. To be able to decide about statistical significance, we asked for 5 answers for the same question. We marked an answer as correct only if at least 4/5 answers agreed. Furthermore, to discourage users from submitting random responses, we clarified in the instructions that we will pay only for submissions that agree with the responses submitted by other annotators. This followed the spirit of the ESP game by Luis von Ahn, and ensured that the Turkers had the appropriate incentives to submit correct answers. Even though this approach increases the cost, it ensures that the received answers are consistent and the level of noise is low.

A second approach for minimizing noise in the answers is the use of the "qualification tests." Instead of letting users submit directly answers, we wanted to see if they are competent enough to participate in these experiments. For example, we had a task where we were soliciting relevance feedback for sets of queries and documents. To make sure that the users follow the instructions, we asked users to submit their answers for already labeled query-document pairs (in this case, the pairs were coming from TREC collections). We also required annotators to retake the qualification tests if they wanted to mark a large number of query-document pairs. This ensured that annotators that submit a large number of evaluations also pass a proportionally larger number of qualification tests. However, the presence of qualification tests slows down the process by a factor of 3 to 4. (Nothing comes for free :-)

Overall, our experience with Mechanical Turk has been very positive. The interface is very clean, easy to program, and the answers come back very quickly. It is not uncommon to send thousands of requests in the evening and have all the results ready by the following morning.

Now, let's see if the reviewers will like the results :-)

Update (Nov 24, 2007): Our first paper that uses Amazon Mechanical Turk for conducting the experimental evaluation has been accepted at IEEE ICDE 2008 and is now available online. Hopefully, more will follow soon.

Wednesday, August 29, 2007

Experimental Repeatability Requirements for SIGMOD 2008

This is a very interesting change for SIGMOD 2008. Starting this year, there is a requirement for authors to submit data sets and code for replicating the experiments. For details, see:


I will be very interested to see how much of a burden this will be, how many times people will cite the use of proprietary software, and so on. I expect some small problems for this year but hopefully we will get used quickly to this mode of operation and start working on a paper with reproducibility of experiments in mind.

In any case, this is a very interesting development and I expect many conferences to follow.

Monday, August 13, 2007

Only in Greece...

I was having a drink with some friends in Greece, listening to some light music at a summer bar. Suddenly, the music stops and the DJ announces: "The police is checking drivers for blood alcohol content and they are waiting next to the hotel Xenia. The customers are advised to take the alternate route...".

A few minutes later, a new announcement informed the customers for the location of a second block from the police, advising customers to either take a taxi, or use the breathanalyzer that is available in the bar in the restroom (yes, there was a coin-operated breathanalyzer in the restroom!), so that they can "drive home safely."

If only all this creativity was used for something useful...

Wednesday, July 25, 2007

Being Accurate when having an Agenda

I was reading a report a few days back, and one of the sentences was:

"we observed that 50% of the women get salaries lower than the median salary of men"

Just think of alternative formulations of the same sentence. I will refrain from commenting further.

Thursday, July 12, 2007

Evaluating Information Extraction using xROC Curves

Suppose that you have an information extraction system, which behaves like a blackbox. The box takes as input a document and gives in the output a set of relation tuples. For example, the information extraction system may process newspaper articles and identify mergers and acquisitions (Company1 bought Company2), or management succession events (Person1 succeeds Person2 in the the CEO position), and so on.

As expected, such systems are inherently noisy and generate imperfect output. Sometimes they miss tuples that appear in the documents and sometimes they generate spurious tuples. One of the important questions is how to evaluate such a system objectively and with the minimum amount of effort.

A common evaluation strategy is to use precision-recall curves to show how the system behaves under different settings. The precision of the output is defined as the number of correct tuples in the output over the total number of generated tuples; recall is defined as the as the number of correct tuples in the output over the total number of correct tuples that can be extracted from the documents.

Unfortunately, precision is problematic due to its dependence on the input class distribution, as the following example illustrates:

  • Example: Consider an extraction system E that generates a table of companies and their headquarters locations, Headquarters(Company, Location) from news articles in The New York TimesBusiness'' and the "Sports'' section. The "Business'' documents contain many tuples for the target relation, while "Sports'' documents do not contain any. The information extraction system works well, but occasionally extracts spurious tuples from some documents, independently of their topic. If the test set contains a large number of "Sports'' documents then the extraction system will also generate a large number of incorrect tuples from these "bad'' documents, bringing down the precision of the output. Actually, the more "Sports'' documents in the test set, the worse the reported precision, even though the underlying extraction system remains the same. Notice, though, that the recall is not affected by the document distribution in the test set and remains constant, independently of the number of "Sports" documents in the test set.

The fact that precision depends on the distribution of good and bad documents in the test set is well-known in machine learning, from the task of classifier evaluation. To evaluate classifiers, it is preferable to use ROC curves, which are independent of the class distribution in the test set. The ROC curves summarize graphically the tradeoffs between the different types of errors. When characterizing a binary decision process with ROC curves, we plot the
true positive rate (the fraction of true positives correctly classified as positives, i.e., recall) as the ordinate, and the false positive rate (the fraction of true negatives incorrectly classified as positives) as the abscissa.

The standard application of ROC curves for information extraction is unfortunately problematic, for two reasons.

First reason: We typically do not know what a "true negative" is. Unlike document classification, a "bad tuple" does not exist apriori in a document. It only exists because the extraction system can extract it.
  • Solution 1: One way to overcome this problem is to measure the number of all bad tuples that can be extracted from a document using all possible settings and all available extraction systems. Then, we can use this number as the normalizing factor to define the false positive rate. This solution works when dealing with a static set of extraction systems. Alas, the definition of false positive rate becomes unstable if we introduce later another system (or another setting) that generates previously unseen noisy tuples; this changes the number of all bad tuples, which serves as the normalizing constant, and forces recomputation of all false positive rates.
  • Solution 2: Another way to avoid this problem is by having an un-normalized x-axis (abscissa). Instead of having the false positive rate, we can have the average number of bad tuples generated. In this case, the new curve is called the "Free Response Operating Characteristic" (FROC) curve. Such techniques are widely used in radiology to evaluate the performance of systems that detect nodules in MRI and CAT scans. (A nodule refers to a small aggregation of cells, indicative of a disease.) A problem with this approach is the lack of a "probabilistic" interpretation of the x-axis; the probabilistic interpretation can be convenient when analyzing/integrating the extraction system as part of of a bigger system, and we are not simply trying to measure its performance in a vacuum.
Second reason: It is too expensive to know all the "true positives". You may have noticed that recall is the y-axis in the ROC/FROC curves. Unfortunately, computing recall is a rather expensive task. If we want to be correct, we need to have annotators read a document and infer which tuples can be extracted from each document; the number of all good tuples will be used as the normalizing constant to compute the recall of each extraction system. This is an expensive task, especially compared to the computation of precision or compared to the computation of the false positive rate. To compute the false positives we need only to evaluate the extracted tuples, a presumably much easier task than reading a lengthy document.
  • Solution 1: We can play the same trick as above, to avoid the problem of reading/annotating the documents. We process each document multiple times, using all possible settings and all possible extraction systems. The union of the extracted tuples can be then validated to identify the set of all correct tuples. As in the case of true negatives, though, the definition becomes unstable if we have a dynamic set of extraction systems that can identify more good tuples at some point in the future, forcing a re-calculation of recall metrics for all systems.
  • Solution 2: We can also have an un-normalized y-axis. For instance, we can have as the ordinate (y-axis) the average number of good tuples extracted for each document. (I have not seen anything like the FROC curves that will leave the true positive rate unnormalized, though.) The downside is that by leaving recall unnormalized, the values now depend on the distribution of good and bad documents in the input: the more bad documents with no good tuples in the test set, the lower the unnormalized value will be. Therefore, this definition seems to go against the spirit of ROC curves.
I am not aware of any alternative definitions of the ROC curves that do not have the problems of the solutions that I described above. If you have any ideas or pointers, post it in the comments.

Monday, July 9, 2007

Blending Mind Maps with Wikipedia

Following up on the topic, of how to use wikis to organize research literature about a given topic.

I have been thinking about the idea of keeping wikis to keep track of the state-of-the-art in any field. One way to organize the literature is to transform an existing survey article into a wiki and let people edit at-will (e.g., see this article on duplicate record detection). Another alternative is to keep a set of tables in a wiki, which summarize the results of a paper in a single line (e.g., see here and here for NLP-related tasks).

A problem with both approaches is the lack of a network that will allow people to mark and annotate the relations across different papers. The idea of using CMapTools (http://cmap.ihmc.us/) was interesting but I could not manage to install the software to see how it really works.

Today, I run into WikiMindMap, a web-based service that uses the wiki structure to convert Wikipedia articles into a mind map. The heuristics that it uses are rather basic but one could use this tool to organize the literature in a wiki (having the advantage of collaboration), and still have all the advantages of a visual tool that can show connections across entities. See for example the wikified article, mentioned above, transformed into a mind map.

Friday, July 6, 2007

Data Cleaning Service

Often, I get questions from colleagues and students on how to match "dirty" data, i.e., string database entries that are similar but not exactly identical. Although there is a very significant amount of research literature, there are not that many software packages available for free. For this reason, I decided to implement the technique that we described in our WWW2003 paper, and make it available on the web for anyone to use:


So, now if you have two lists of names that need to be matched you can try this service. It works well for small datasets with a few thousands entries in each. I am pretty sure that it can scale to much bigger datasets, but it will take some time for the service to finish the job. One of the neat features of the service is that you can submit the job, keep the id of the submitted job, and come back to retrieve the results later.

If you have any ideas, or any bug reports, feel free to contact me.

Wikipedia-based Web Services

We have launched a beta version of some web services that use Wikipedia data for named entity extraction, for term identification, and for document expansion. You can take a look and try the services at:


We are actively working on this, so if you see a bug or some unexpected behavior, feel free to drop me an email.

Monday, June 25, 2007

Brain Activity and Economic Judgment?

Just attended the keynote talk of Tom Mitchel at ACL 2007 in Prague. He described how the brain reacts when seeing a word or a picture associated with different concepts. For example, there are different parts of the brain activated when seeing a hammer (the word or a picture) and different parts when seeing an apartment. There seem to be different parts of the brain that process different concepts, so there is evidence that the "meaning" of a word corresponds to the activation of particular neurons.

I am wondering if similar things happen when someone is trying to judge the economic value of something. How do people act when seeing a bargain compared to seeing something that they perceive as expensive? Is there a similar activation pattern in the brain?

Update (July 2): I was reading the latest issue of Scientific American and I found an answer to the question above. Apparently, there is a related paper published in the January 26th issue of Science with the title "The Neural Basis of Loss Aversion in Decision Making under Risk" that discusses which parts of the brain get activated when someone needs to make a decision that will result in a future gain or loss. Interesting paper to read.

Monday, June 18, 2007

Are Publishers Making Themselves Useless?

No matter how nicely I try to format my camera-ready papers, occasionally I will get an email like this from the publisher:
  1. On page 2, top of column 1, you have a widow (last line of a paragraph at the top of a column), this is a bad break, please tighten the previous column or force a line over to eliminate this bad break.
  2. On page 2, bottom of column 1, there is an orphan, please tighten the previous material or force the head over to the next column. (The use of \vfill\eject placed before the head commmand will force text to the next page or column).
  3. Please make the body of the references section flush left/ragged right (not justified). This will eliminate letter and word spacing of references with urls (web addresses).
  4. On the last page, please use \vfill\eject before reference 19 to move more over to column two (2) and produce a more balanced last page. (This command is great for forcing material over to the next page.) Be sure that you have run bibtex, then cut and paste the references from your bbl file into your tex file.
I cannot understand why the author has to deal with all these typesetting issues, if there is publisher involved in the process. Isn't typesetting something that the publisher does? Given that I have submitted my LaTeX sources and all the necessary files needed to compile the paper, it is trivially easy to fix all these issues.

In the pre-digital age, a publisher would assign an editor to each accepted paper who would read the manuscript, fix grammatical and syntactic errors, typeset the article, and prepare a nice, readable version from the manuscript, which was either written by hand or typed in a typewriter. Today it seems that publishers want to push all the typesetting work to the authors, minimizing as much as possible the cost of production. Makes sense (economically).

However, as publishers push more and more of their responsibilities to the authors, they increasingly make themselves irrelevant. There is no reason to pay a publisher to prepare proceedings or even journals when the publisher does not even want to take care of the most esoteric publishing-related issues. In fact, there is no reason for a publisher to exist at all. What is the added value that a publisher offers? Open access and the web made already a significant part of the publisher's role obsolete; if publishers voluntarily make typesetting the task of their suppliers (authors), then they voluntarily make themselves a useless part of the workflow, ripe to be dropped completely in the near future.

Wednesday, June 13, 2007

Are Prediction Markets Efficient?

I was reading Fred Wilson's post about the information efficiency of the venture capital market.

While reading the posting, I started wondering whether prediction markets are efficient, according to the definition of Eugene Fama. In other words, how long does it take for a prediction market to incorporate all the available information about an event? Liquidity seems to be an issue for the existing prediction markets, preventing them from reaching equilibrium quickly. But if we had enough liquidity, how long would it take for humans to "agree" on prices that incorporate all the available information about an event?

Wednesday, June 6, 2007

Playing with Wikipedia

I was working with Wisam Dakka on a Wikipedia project, and I was puzzled by some Wikipedia entries that had really long titles.

The first one that I noticed was a term with 163 characters: "Krungthepmahanakornamornratanakosinmahintarayu
which redirects to Bangkok. I do not know if this is a prank, or a valid entry. (Update: It is a correct entry, according to the talk page of the entry.)

Then, I noticed another term with 255 characters:
"Wolfeschlegelsteinhausenbergerdorffvoralternwarenge wissenhaftschaferswessenschafewarenwohlgepflegeund sorgfaltigkeitbeschutzenvonangreifendurchihrraubgierig
which in fact is a valid term and the 255 characters is simply a shortcut for the 580 character entry :-)

Finally, there is a term with 182 characters:
that has Greek roots, and I will let you click to find out its exact meaning.

Also, these entries seem to trigger some buggy behavior on Google. If you do a web search for the above terms, you will find no web page with these words. However, Google returns a set of product matches on Google, none of which is really correct.

The joy of large-scale data processing!

Friday, June 1, 2007

Uses and Abuses of Student Ratings

I recently revisited an old posting from Tomorrow's Professor mailing list about "Uses and Abuses of Student Ratings". It is an excerpt from the book "Evaluating Faculty Performance, A Practical Guide to Assessing Teaching, Research, and Service" and lists a set of common problems in the use of student ratings for evaluating the teaching performance of a faculty member. I enjoyed (re-)reading the whole list, but I particularly liked these three items:
  • Abuse 1: Overreliance on Student Ratings in the Evaluation of Teaching ...
  • Abuse 2: Making Too Much of Too Little ... Is there really a difference between student ratings averages of 4.0 and 4.1? ....To avoid the error of cutting a log with a razor, student ratings results should be categorized into three to five groups ... Utilizing more than three to five groups will almost certainly exceed the measurement sophistication of the instrument being used.
  • Abuse 5: Using the Instrument (or the Data Collected) Inappropriately ... While we have 20 items on our ratings form ... only #7 really matters for making personnel decisions.
I am confident that if we read an academic paper that analyzes the results of a questionnaire in the same way that we currently analyze student ratings, the paper would have been considered naive (no statistical significance of findings, no control variables, use of a single instrument for evaluation, and so on). Unfortunately, it is does not seem likely that we will ever apply the same rigor when analyzing the student ratings forms.

Sunday, May 13, 2007

Impact Factors and Other Metrics for Faculty Evaluations

A few months back, as a department we had to submit to the school a list of the "A" journals in our field. The ultimate reason for this request was to have a list of prime venues for each field, and thus facilitate the task of the promotion & tenure committees that include researchers with little (or no) knowledge of the candidates field.

Generating such a list can be a difficult task, especially if we try to keep the size of the list small, and directly connected with the problem of ranking journals. There are many metrics that can be used to for such a ranking, and the used metric is the "impact factor", proposed by Eugene Garfield. The impact factor ranks the "impact" of the research published in each journal by counting the average number of citations to the 2- and 3-year old articles published in the journal, over the last year. The basic idea is that journals with a large number of recent incoming citations (from last year) that also point to relatively recent articles (2- and 3-year old papers), show the importance of the topics published in the given journal. The choice of the time window makes comparisons across fields difficult, but generally within a single field the impact factor is a good metric for ranking journals.

Garfield, most probably expecting the outcome, explicitly warned that the impact factor should not be used to judge the quality of the research of an individual scientist. The simplest reason is the fact that the incoming citations to the papers that are published in the same journal follow a power-law: a few papers receive a large number of citations, while many others get only a few. Copying from a related editorial in Nature: "we have analysed the citations of individual papers in Nature and found that 89% of last year’s figure was generated by just 25% of our papers. [...snip...] Only 50 out of the roughly 1,800 citable items published in those two years received more than 100 citations in 2004. The great majority of our papers received fewer than 20 citations." So, the impact factor is a pretty bad metric for really examining the quality of an individual article, even though the article might have been published in an "A" journal.

The impact factor (and other journal-ranking metrics) was devised be used as a guideline for librarians to allocate the subscription resources, and as a rough metric for providing guidance to scientists that are trying to decide which journals to follow. Unfortunately, such metrics have been mistakenly used as convenient measures for summarily evaluating the quality of someone's research. ("if published in a journal with high impact factor, it is a good paper; if not, it is a bad paper"). While impact factor can (?) be a prior, it almost corresponds to a Naive Bayes classifier that does not examine any feature of the classified object before making the classification decision.

For the task of evaluating the work of an individual scientist, other metrics appear to be better. For example, the much-discussed h-index and its variants seem to get traction. (I will generate a separate post for this subject.) Are such metrics useful? Perhaps. However, no metric, no matter how carefully crafted can substitute careful evaluation of someone's research. These metrics are only useful as auxiliary statistics, and I do hope that they are being used like that.

Friday, May 11, 2007

Names (or Parents?) Make a Difference

I just read an article about a study showing that the name of a girl can be used to predict whether a girl will study math or physics after the age of 16. The study, done by David Figlio, professor of economics at the University of Florida indicated that girls with "very feminine" names, such as Isabella, Anna, and Elizabeth, are less likely to study hard sciences compared to girls with names like Grace or Alex.

Myself, I find it hard to understand how someone can estimate the "femininity" of a name but it might be just me. Even if there is such a scale though, I do not see any causality in the finding, as implied in the article. (I see predictive power, but no causality.) In my own interpretation, parents that choose "very feminine" names also try to steer their daughters towards more "feminine" careers. I cannot believe that names by themselves set a prior probability on the career path of a child. (The Freakonomics book had a similar discussion about names and success.)

Oh well, how you can lie with statistics...

Tuesday, May 8, 2007

Replacing Survey Articles with Wikis?

Earlier this year, together with Ahmed Elmagarmid and Vassilios Verykios, we published a survey article at IEEE TKDE on duplicate record detection (also known as record linkage, deduplication, and with many other names).

Although I see this paper as a good effort in organizing the literature in the field, I will be the first to recognize that the paper is incomplete. We tried our best to include every research effort that we identified, and the reviewers helped a lot in this respect. However, I am confident that there are still many nice papers that we missed.

Furthermore, since the time the paper has been accepted for publication, many more papers have been published and many more will be published in the future. So, this means that the useful half-life of (any?) such survey is necessarily short.

How can we make such papers more relevant and more resistant to deprecation? One solution that I am experimenting with is to make the survey article a wiki, and then post it to Wikipedia, allowing other researchers to add their own papers in the survey.

I am not sure if Wikipedia is the best option, due to licensing issues, though. A personal wiki may be a better option, but I do not have a good grasp of the pros and cons of each approach. One of the benefits of Wikipedia is the existence of nice templates for handling citations. One of the disadvantages is the copyright license of Wikipedia, which may discourage (or prevent) people from posting material there.

Furthermore, it is not clear that a wikified document is the best way to organize a survey. A few days back, I got a (forwarded) email from Foster Provost, who was seeking my opinion for the best way to organize an annotated bibliography. (Dragomir Radev had a similar question.) Is a wiki the best option? Or is it by construction too flat? Should we use some other type of software that allows people to generate explicit, annotated connections between the different papers? (Any public tool?)

Any ideas?