Saturday, November 24, 2007

My First Mechanical Turk Paper

A while back, I wrote about my experiences using Amazon Mechanical Turk for conducting experiments that require input from users. I am now happy to announce :-) that the first paper that uses this methodology has been accepted at IEEE ICDE 2008 and is now available online.

The paper, co-authored with Wisam Dakka, discusses a simple empirical technique for automatically extracting from a text database a set of facet hierarchies that are useful for browsing the contents of the database. We used Mechanical Turk in this context to evaluate the "precision" and "recall" of the generated hierarchies. This experiment would have been almost impossible to conduct without using Mechanical Turk, as it would require multiple users reading and annotating thousands of news articles. Using Mechanical Turk, the experiments were done in less than three days.
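As a rough illustration of what "precision" and "recall" mean in this setting (a minimal sketch, not the exact procedure from the paper): assume that annotators mark which facet terms are appropriate for an article, and that we compare those against the terms the system generated. The function name and the sample terms below are hypothetical.

```python
def precision_recall(generated, annotated):
    """Return (precision, recall) of generated terms against annotator-approved terms."""
    generated, annotated = set(generated), set(annotated)
    correct = generated & annotated
    # Precision: fraction of generated terms that annotators approved.
    precision = len(correct) / len(generated) if generated else 0.0
    # Recall: fraction of annotator-approved terms that the system generated.
    recall = len(correct) / len(annotated) if annotated else 0.0
    return precision, recall


# Hypothetical data for a single article (made up for illustration).
generated = ["politics", "europe", "sports", "elections"]
annotated = ["politics", "elections", "france"]
p, r = precision_recall(generated, annotated)
print(f"precision={p:.2f} recall={r:.2f}")
```

With many annotators per article, one would presumably aggregate their judgments first (for example, by majority vote) before computing these numbers.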

In the final experiment of the paper, though, where we needed to interview and time the users while they were using the faceted hierarchies to complete various tasks, we resorted to the traditional, lab-based setting. However, during the summer, we only managed to recruit five users who expressed interest in participating. We observed them in the lab while they performed their tasks and recorded their reactions and impressions. (Fortunately, the results were statistically significant.)

Next time, we will attempt to use Mechanical Turk for such "interview+timing" experiments as well. However, I will need to talk more with people who often perform such experiments to see how they would react to such approaches, where the human subjects are completely disconnected from the researcher. Even though simple timing experiments can be easily performed using MTurk, I am a little uncomfortable about the reliability of such experiments.

Tuesday, November 20, 2007

GMail Supports IMAP (but forgets to notify the abuse team)

A couple of weeks back, Google announced that GMail supports IMAP, an email protocol that allows synchronization of email across email clients and across platforms. The support of IMAP, together with the option to buy additional disk space on Google, was a great incentive for me to migrate all my email to GMail, and have it available online, accessible and searchable from everywhere.

So, I paid $20 for the extra 10GB of disk space, and I set up my email client (Thunderbird 2.0) to access GMail over IMAP. One of the first things that Thunderbird does when accessing an email account through IMAP is to synchronize the content of the remote folders with the content of the locally stored ones. Plus, I decided to upload all my old email to GMail.

Unfortunately, GMail considers such actions "Unusual Activity" and keeps locking me out of my email account for 24 hours. (Under Thunderbird, I get the cryptic message "lockdown in sector 4," whatever this means.) In fact, over the last week, it has been almost impossible to do anything that resembles "heavy activity" with my GMail account, since I am continuously locked out and I get the following when I try to log in through the web:


Although I understand that Google wants to protect the service from being abused, I see little reason for completely locking users out of their email. Bandwidth throttling seems to be a much better choice for controlling "strange" behavior; in the worst case, Google can block access through IMAP but still allow the user to access the account over the web.
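To make the throttling idea concrete, here is a minimal token-bucket sketch. It is purely illustrative: I have no knowledge of how GMail handles this internally, and the class name, method name, and parameter values are all hypothetical. The point is that a client exceeding its budget is told to slow down, rather than being locked out.

```python
import time


class TokenBucket:
    """Allow `rate` operations per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.clock = clock          # injectable clock, useful for testing
        self.tokens = float(capacity)
        self.last = clock()

    def delay_for(self, cost=1.0):
        """Charge `cost` tokens and return how many seconds the caller should wait."""
        now = self.clock()
        # Refill according to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        self.tokens -= cost
        if self.tokens >= 0:
            return 0.0
        # The bucket is in "debt": delay the request instead of rejecting it.
        return -self.tokens / self.rate


# Hypothetical limits: 5 IMAP operations per second, bursts of up to 10.
bucket = TokenBucket(rate=5, capacity=10)
for i in range(12):
    wait = bucket.delay_for()
    print(f"request {i}: wait {wait:.2f}s")
```

A heavy synchronization would simply become slower under such a scheme, while the account stays accessible.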

Furthermore, there should be an option for contacting customer support and getting some answer back. Right now, I only get the standard boilerplate response, indicating that I have done something wrong (bad me!) and that in 24 hours I will get access to my email again. It is absolutely impossible to reach someone at Google and understand why I am getting locked out.

I suspect that the abuse-detection team (is there such a thing?) needs to update its policies and triggers, to better understand the "expected behavior" of email clients under IMAP. Blocking access without any warning to a mission-critical service (especially for paying customers) seems like a no-no decision to me.

Update (Dec 3, 2007): I just received an email from Google:

Hello,

You recently contacted us about disabled access to your Gmail account due to abnormal account activity, specifically message uploading.

While our engineers are working diligently to make the upload process faster and easier, we're currently unable to provide support for message uploading.

We wanted to remind you that, at this time, uploading an excessive number of messages to your Gmail account via IMAP may lead to being temporarily locked out of your account. If this happens to you, please be aware that these lockouts are temporary and you should be able to re-access your account shortly.

We appreciate your patience while we work to improve Gmail.

Sincerely,

The Google Team

Well, too bad that GMail now blocks me every time I try to synchronize my Thunderbird client with GMail, which corresponds to a large number of downloaded messages. Furthermore, this email does not explain at all why I am blocked from accessing my GMail through the web interface, or actually why I am getting locked out at all. (Has anyone heard of throttling?)

Tuesday, November 13, 2007

New Class: Search and the New Economy

Next semester, I will be teaching an MBA class with the title "Search and the New Economy," and I will also be participating in the undergraduate version of the class, taught by Norm White. The intended audience for the class is MBA students who have an interest in technology but are not necessarily programmers.

I have been thinking a lot about how to organize such a class, so that it has some internal structure and flow. My current list of topics:
  1. Search Engine Marketing: Introduction, Search Basics: Crawling, Indexing, Ranking, Pagerank, Spam, TrustRank
  2. Search Engine Marketing: Analyzing and Understanding Users' Behavior, Web Analytics
  3. Search Engine Marketing: Search Engine Optimization
  4. Search Engine Marketing: AdWords, AdSense, Click Fraud
  5. Social Search and Collective Intelligence: Blog Analysis and Aggregation, Network Analysis, Opinion Mining
  6. Social Search and Collective Intelligence: Recommender Systems, Reputation Systems
  7. Social Search and Collective Intelligence: Prediction Markets
  8. Social Search and Collective Intelligence: Wikis and Collaborative Production
  9. Ownership of Electronic Data: Privacy on the Web
  10. Ownership of Electronic Data: Intellectual Property issues on the Web
  11. Ownership of Electronic Data: The Future of Privacy and Intellectual Property
  12. Future Directions and Wrapping Up
Some rough sketches of the assignments for this course:
  • Run and optimize an online advertising campaign, using Google AdWords or Microsoft adCenter.
  • Analyze the visitor data of a website to assess the effectiveness of different pages. You can use Google Analytics, or tools like CrazyEgg.
  • Optimize the keyword campaign of a company by choosing the appropriate keywords and bid amounts, depending on the competition and the rank of the organic pages.
  • Analyze (or build) a recommender system for movies, books, and TV Shows using Facebook data.
  • Build a dating recommendation system using Facebook data.
  • Build prediction markets at Inkling Markets, for an event of interest, examine the accuracy of the predictions, and analyze the behavior of the participants. Alternatively, analyze real‐money prediction markets at InTrade and BetFair and examine the effect of real‐life events in political campaigns.
  • Use Google Trends to build a predictor of unemployment measures.
Any more topics that would be worth covering? Alternative exercises?

Update (Feb 21): The class material (slides and recorded lectures, for now) is now available at the class website. You can also look at the class roster and at the prediction market site of the class.

Wednesday, November 7, 2007

What is Wrong with the ACM Typesetting Process?

Recently, I had to go through the process of preparing the camera-ready version for two ACM TODS papers. I am not sure what exactly the problem is, but the whole typesetting process at ACM seems to be highly problematic.

My own pet peeves:

Pet peeve A: The copyeditors do not know how to typeset math and they do not even check the paper to see if they have correctly incorporated their own edits.

I have detected problems repeatedly, and the copyeditor consistently fails to check the proofs after making the edits. Here are a few examples.

Example #1

I submit the LaTeX sources and the PDF, with the following equation:


The copyeditor does not like the superscripted e^{\beta x_a}, so decides to convert it into the inline form exp(\beta x_a). Not a bad idea! Look, though, what I get back instead:


To make things worse, such errors were pervasive and appeared in many equations in the paper. I asked the copyeditor to fix these errors and send me back the paper after the mistakes were fixed, so that I could check it again. I was reassured that I would be able to inspect the galley proofs again before they went to print. Well, why would I expect that someone who makes such mistakes would be diligent enough to let me inspect the paper again...

A couple of weeks later, and despite all the promises, I get an email indicating that my paper was published and is available online. I check the ACM Digital Library, and I see my paper online, with the following formula:


OK, so we managed to get an interesting hybrid :-). Seriously, do the ACM copyeditors even LOOK at what they are doing? If they do check and they do not understand that this is an error, why do we even have copyeditors?
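For readers less familiar with the two notations involved in this conversion: the superscripted form and the inline form of the exponential term mentioned above denote the same quantity, so a correct edit only needs to swap one for the other. The snippet below shows just that single term, not the full equation from the paper.

```latex
\[
  e^{\beta x_a} \;=\; \exp(\beta x_a)
\]
```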

Example #2

I assumed that the previous snafu was just an exception. Well, never say never. A couple of days back, I got the galleys for another TODS paper, due to be published in the next few days. Again, the copyeditor decided to make (minor) changes in the equations. In my originally submitted paper, I had the following equations:



In the galleys, the same equations look like:


I will repeat myself: do the ACM copyeditors even LOOK at what they are doing? If they do check and they do not understand that this is an error, why do we even have copyeditors?

Pet peeve B: Converting vectorized figures into bitmaps

If you have submitted a paper to a conference, you know how crazy the copyeditors get about getting PDFs with only Type 1 fonts, vectorized, not-bitmapped, and so on. This is a good thing, as the resulting PDFs contain only scalable, vector-based fonts that look nice both on screen and on paper.

For the same reason, I also prepare nice, vectorized figures for my papers, so that they look nice both on screen and on paper. However, for some reason, the copyeditors at ACM seem to like converting the vectorized images into horrible, ugly bitmaps that do not scale and look awful. Here is an example of a figure in the original PDF:


Here is how the same figure looks in the PDF that I received as a galley:

Am I too picky? Is it bad that I want my papers to look good?

End of pet peeves

(Note: The same copyediting process, described above, at IEEE seems to work perfectly fine.)

I am starting to believe that the whole idea of publishing is a horribly outdated process. I assumed that copyeditors were a part of a chain that adds value to the paper, not a part that subtracts value.

If I need to carefully check my paper, being afraid that the copyeditor will introduce bugs, that the copyeditor will make everything look horrible, then why do we even have copyeditors? Just get rid of them; they are simply parasites in the whole process! Can you imagine having a professor who teaches a class and at the end the students know less about a topic? Would you keep this professor teaching?

Make everything open access. Let every author be responsible for the way that the paper looks. Let the authors revise papers in digital libraries that have problems. Why do we consider it perfectly acceptable to have bug fixes and new versions for applications and operating systems, but we want the papers that we produce to be frozen in time and completely static?

Furthermore, the whole motivation for having journals is to have the peer-reviewing process that guarantees that the "published" paper is better than the submitted one. Everything else is secondary. Why keep in the chain processes that only cause problems?

When are we going to realize that the publication system should be completely revamped? Why not have an ongoing reviewing process, improving the paper continuously? Should we keep the system as-is so that we can be "objectively evaluated" by counting static papers that are produced once and never visited again?

OK, venting is over. Back to the SIGMOD papers.