Thursday, October 28, 2010

Cease and desist...

This was just too funny to resist posting.

Here is the background: As part of the core undergraduate introductory class "Information Technology in Business and Society", students have to create a website. To make things more interesting, I ask them to pick a few queries on Google and try to create a website that will show up at the top of the results for these queries. Essentially, the assignment mixes technical skills with the ability to understand how pages are ranked and how to analyze the "competition" for these keywords.

So, a student of mine (John Cintolo) created a website about "Hit Club Music Summer 2010", with links to YouTube videos. No copyright infringement or anything illegal.

And one day later, he got a "cease and desist" letter from HotNewClubSongs. It has so many gems that I am reproducing it here, for your viewing pleasure.

To whom it may concern

It has come to my attention that your website "Hit Club Music Summer 2010" on this URL has potential to threaten my Alexa page ranking. As a consequence, this may cause our website to lose vital income which is generated from ad-space and it will not be tolerated. Due to the nature of your actions I am requesting a formal take-down of your website due to copyright infringement as the music posted on your "http://www.youtube.com" links is not endorsed by the rightful authors, as counseled by my attorney. Considering that you are also going through the New York University server, your actions may cost you and your educational institution unless you cease the aforementioned copyright infringement. If you continue hosting your service I will be forced to file a civil suit in which you will be charged for any lost advertisement revenue, averaging $0.52 per day.

In addition, your html markup shows your ineptitude in online web design, making your website an inefficient option for visitors who truly care about the Club Songs Industry. The listing of the dates on your monthly playlists go in ascending order rather than descending. This is just one of the many flaws of your clearly haphazardly designed website. However, I will give you neither my website URL nor my constructive criticism, for you are clearly trying to make money in an industry which doesn’t have room for your lack of music and website design knowledge. My page viewers have complimented me numerous times on the layout and content of my page.

You may contact me at this e-mail for any further concerns, although it is clear there is not much more to say. Your carelessness, inefficiency, and utter incompetence have gotten you into this hole, and unless you find a way out by October 31st, when my ad-space revenue comes in, further action will be taken. Also, for legal purposes, when and where was this website created? In the chance that it was created before September 30th, 2010, a law suit will be filed for the obvious decrease in revenue from my ads last month, totaling $7.34.

Thank you for your time,

HotNewClubSongs- A Forerunner in the Club Music Industry

Needless to say, I congratulated the student for achieving the goals of the assignment, and offered to cover the damages :-)

Tuesday, October 26, 2010

Student websites

I am just posting this to provide links to the pages of my students, so that Google indexes their websites quickly. Feel free to browse, of course...

https://files.nyu.edu/aan261/public/
https://files.nyu.edu/abs452/public/
https://files.nyu.edu/aco241/public/
https://files.nyu.edu/ag2846/public/
https://files.nyu.edu/ahr258/public/
https://files.nyu.edu/am3036/public/
https://files.nyu.edu/amb748/public/
https://files.nyu.edu/amh513/public/
https://files.nyu.edu/aml552/public/
https://files.nyu.edu/amo328/public/
https://files.nyu.edu/ap1730/public/
https://files.nyu.edu/ap2427/public/
https://files.nyu.edu/arr284/public/
https://files.nyu.edu/asn255/public/
https://files.nyu.edu/aww243/public/
https://files.nyu.edu/bjh292/public/
https://files.nyu.edu/bk940/public/
https://files.nyu.edu/bm1032/public/
https://files.nyu.edu/bmw308/public/
https://files.nyu.edu/cc2739/public/
https://files.nyu.edu/chm270/public/
https://files.nyu.edu/cl1296/public/
https://files.nyu.edu/dr1241/public/
https://files.nyu.edu/dzw201/public/99-cent-pizza-places-in-nyc.html
https://files.nyu.edu/esj227/public/
https://files.nyu.edu/eze200/public/
https://files.nyu.edu/fh443/public/
https://files.nyu.edu/fm812/public/
https://files.nyu.edu/glh237/public/
https://files.nyu.edu/hdw217/public/
https://files.nyu.edu/hrs260/public/
https://files.nyu.edu/hws221/public/
https://files.nyu.edu/hxl203/public/
https://files.nyu.edu/id398/public/
https://files.nyu.edu/igm215/public/
https://files.nyu.edu/jdp343/public/
https://files.nyu.edu/jil245/public/
https://files.nyu.edu/jjl442/public/
https://files.nyu.edu/jkl324/public/
https://files.nyu.edu/jl3093/public/
https://files.nyu.edu/jm3894/public/
https://files.nyu.edu/jnz213/public/
https://files.nyu.edu/jp1961/public/
https://files.nyu.edu/jpc406/public/
https://files.nyu.edu/jsa314/public/
https://files.nyu.edu/jwi208/public/
https://files.nyu.edu/jws377/public/
https://files.nyu.edu/jz692/public/
https://files.nyu.edu/kac471/public/
https://files.nyu.edu/kc1294/public/
https://files.nyu.edu/kcc282/public/
https://files.nyu.edu/kl991/public/
https://files.nyu.edu/km1602/public/
https://files.nyu.edu/kpk256/public/
https://files.nyu.edu/kr881/public/
https://files.nyu.edu/krg267/public/
https://files.nyu.edu/lla236/public/
https://files.nyu.edu/lrg275/public/
https://files.nyu.edu/lsc291/public/
https://files.nyu.edu/mam931/public/
https://files.nyu.edu/mc3077/public/
https://files.nyu.edu/mjj282/public/
https://files.nyu.edu/mkj233/public/
https://files.nyu.edu/ml2550/public/
https://files.nyu.edu/ms4761/public/
https://files.nyu.edu/ms5579/public/
https://files.nyu.edu/msk378/public/
https://files.nyu.edu/mss479/public/
https://files.nyu.edu/nel233/public/
https://files.nyu.edu/nez204/public/
https://files.nyu.edu/nrt222/public/
https://files.nyu.edu/nsb268/public/
https://files.nyu.edu/prp247/public/
https://files.nyu.edu/ps1486/public/
https://files.nyu.edu/psr244/public/
https://files.nyu.edu/qhg200/public/
https://files.nyu.edu/qy220/public/
https://files.nyu.edu/rc1600/public/
https://files.nyu.edu/rf1048/public/
https://files.nyu.edu/rp1244/public/
https://files.nyu.edu/rrt221/public/
https://files.nyu.edu/rs2898/public/
https://files.nyu.edu/sa1386/public/
https://files.nyu.edu/sc2532/public/
https://files.nyu.edu/scs384/public/
https://files.nyu.edu/sek351/public/
https://files.nyu.edu/shk350/public/
https://files.nyu.edu/sk2742/public/
https://files.nyu.edu/sl2663/public/
https://files.nyu.edu/slc439/public/
https://files.nyu.edu/sly232/public/
https://files.nyu.edu/smk483/public/
https://files.nyu.edu/sr1860/public/
https://files.nyu.edu/sw1262/public/
https://files.nyu.edu/tpj214/public/
https://files.nyu.edu/us266/public/
https://files.nyu.edu/vl515/public/
https://files.nyu.edu/wfk212/public/
https://files.nyu.edu/wo253/public/
https://files.nyu.edu/xl345/public/
https://files.nyu.edu/xl396/public/
https://files.nyu.edu/xy267/public/
https://files.nyu.edu/yl809/public/
https://files.nyu.edu/yp429/public/
https://files.nyu.edu/yz511/public/
https://files.nyu.edu/zsn202/public/

Can Crowdsourcing Scale? The Role of Active Learning

Nobody denies that crowdsourcing is becoming mainstream. People use Mechanical Turk for all sorts of applications. And many startups create business plans assuming that crowdsourcing markets will be able to provide enough labor to complete the tasks posted in the market.

And at this point, things become a little tricky.

Can crowdsourcing markets scale? MTurk can tag a thousand images within a few hours. But what will happen if we place one million images in the market? Will there be enough labor to handle all of the posted tasks? How long will the task take? And what will be the cost?


Scaling by combining machine learning with crowdsourcing

Unless you can come up with ingenious ideas, the acquisition of data comes at a cost. To reduce cost, we need to reduce the need for humans to label data. To reduce the need for humans, we need to automate the process. To automate the process, we need to build machine learning models. To build machine learning models, we need humans to label data.... Infinite loop? Yes and no.

The basic idea is to use crowdsourcing in conjunction with machine learning. In particular, we leverage ideas from active learning: use humans only for the uncertain cases, not for everything. The machine learning model takes care of the simple cases, and humans are asked to help with the most important and ambiguous ones.
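To make this concrete, here is a minimal sketch of uncertainty-based routing (a sketch only, using scikit-learn; the classifier choice and the 0.9 confidence threshold are my illustrative assumptions, not the setup of any specific paper):

```python
# Minimal sketch: let the model handle confident cases and send only the
# ambiguous ones to human labelers. Classifier and threshold are
# illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def route_items(X_labeled, y_labeled, X_incoming, confidence_threshold=0.9):
    """Return (auto_labels, indices_needing_humans) for the incoming items."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X_labeled, y_labeled)

    proba = model.predict_proba(X_incoming)   # shape: (n_items, n_classes)
    confidence = proba.max(axis=1)            # probability of the top class
    auto_labels = model.classes_[proba.argmax(axis=1)]

    needs_human = np.where(confidence < confidence_threshold)[0]
    return auto_labels, needs_human
```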

We also need to keep one extra thing in mind: Crowdsourcing generates noisy training data, as opposed to the perfect data that most active learning algorithms expect from humans. So, we need to perform active learning not only to identify the cases that are ambiguous for the model, but also to figure out which human labels are likely to be noisy, and fix them. And we need to be proactive in estimating the quality of the workers.
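As a toy illustration of this point (not the repeated-labeling algorithm from our papers), the following sketch aggregates redundant labels by majority vote and computes a crude per-worker quality score based on agreement with the consensus:

```python
# Toy sketch: majority-vote aggregation of redundant crowd labels, plus a
# crude worker-quality score (agreement with the consensus). An
# illustration only, not the repeated-labeling algorithm itself.
from collections import Counter, defaultdict

def aggregate_labels(labels):
    """labels: list of (item_id, worker_id, label) tuples."""
    votes = defaultdict(list)
    for item_id, worker_id, label in labels:
        votes[item_id].append((worker_id, label))

    consensus = {item: Counter(lab for _, lab in ws).most_common(1)[0][0]
                 for item, ws in votes.items()}

    agree, total = defaultdict(int), defaultdict(int)
    for item, ws in votes.items():
        for worker_id, label in ws:
            total[worker_id] += 1
            agree[worker_id] += int(label == consensus[item])

    quality = {w: agree[w] / total[w] for w in total}
    return consensus, quality

labels = [(1, "w1", "porn"), (1, "w2", "porn"), (1, "w3", "clean"),
          (2, "w1", "clean"), (2, "w2", "clean"), (2, "w3", "clean")]
consensus, quality = aggregate_labels(labels)   # quality["w3"] == 0.5
```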

In any case, after addressing the quality complications, and once we have enough data, we can use the acquired data to build basic machine learning models. The basic machine learning models can then take care of the simple cases, and free humans to handle the more ambiguous and difficult cases. Then, once we collect enough training data for the more difficult cases, we can then build an even better machine learning model. The new model will then automate an even bigger fraction of the process, leaving humans to deal with only the harder cases. And we repeat the process.

This idea was at the core of our KDD 2008 paper, and since then we have significantly expanded these techniques to work with a wider variety of cases (see our current working paper: Repeated Labeling using Multiple Noisy Labelers.)

Example: AdSafe Media.

Here is an example application, deployed in practice through AdSafe Media. Say that we want to build a classifier that recognizes porn pages. The process, which follows our KDD paper, works as follows:
  1. We get a few web pages labeled as porn or not. 
  2. We get multiple workers to label each page, to ensure quality.
  3. We compute the quality of each labeler, fix biases, and get better labels for the pages.
  4. We train a classifier that classifies pages as porn or not.
  5. For incoming pages, we classify them using the automatic classifier.
    • If the classifier is confident, we use the outcome of the classifier
    • If the classifier is not confident, the page is directed to humans for labeling (the more ambiguous the page, the more humans we need)
  6. Once we get enough new training data, we move to Step 4 again.
Benefits: Once the classifier is robust enough, there is no need to use humans to handle basic tasks. The classifier takes care of the majority of tasks, ensuring that the speed of classification is high and that the cost is low. (Even at 0.1 cents per page, humans are simply too expensive when we deal with billions of pages.) Humans are reserved for the pages that are difficult to classify. This ensures that for the difficult cases there is always someone to provide feedback, and this crowdsourced feedback ensures that the classifier improves over time.
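For completeness, here is a hedged sketch of the retraining loop (steps 4-6 above). The threshold, batch handling, and the get_crowd_labels helper are hypothetical; AdSafe's actual pipeline is certainly more involved:

```python
# Sketch of steps 4-6: classify incoming pages, send low-confidence ones to
# the crowd, and retrain once enough new labels arrive. The threshold and
# the get_crowd_labels helper are hypothetical, not AdSafe's pipeline.
def crowdsourced_loop(model, incoming_batches, get_crowd_labels,
                      threshold=0.9, retrain_every=1000):
    new_X, new_y = [], []
    for X_batch in incoming_batches:              # X_batch: numpy feature matrix
        proba = model.predict_proba(X_batch)
        confident = proba.max(axis=1) >= threshold
        # Confident pages keep the automatic label (step 5, first bullet);
        # ambiguous pages go to human labelers (step 5, second bullet).
        ambiguous = X_batch[~confident]
        if len(ambiguous):
            new_X.extend(ambiguous)
            new_y.extend(get_crowd_labels(ambiguous))
        # Step 6: once enough new training data accumulates, retrain.
        if len(new_y) >= retrain_every:
            model.fit(new_X, new_y)               # or refit on old + new data
            new_X, new_y = [], []
    return model
```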

Other example: SpeakerText. 

According to the press, SpeakerText is using (?) this idea: they use an automatic transcription package to generate a first rough transcript, and then use humans to improve the transcription. The high-quality transcriptions can then be used to train a better model for automatic speech recognition. And the cycle continues.

Another example: Google Books. 

The ReCAPTCHA technique is used as the crowdsourcing component for digitizing books for the Google Books project. As you may have imagined, Google is actively using optical character recognition (OCR) to digitize the scanned books and make them searchable. However, even the best OCR software will not be able to recognize some words from the scanned books.

ReCAPTCHA uses the millions of users on the Internet (most notably, the 500 million Facebook users) as transcribers that fix whatever OCR cannot capture. I guess that Google reuses the fixed words in order to improve their internal OCR system, so that they can reach their goal of digitizing 129,864,880 books a little bit faster.

The limit?

I guess the Google Books and ReCAPTCHA projects are really testing the scalability limits of this approach. The improvements in the accuracy of machine learning systems start being marginal once we have enough training data, and we need orders of magnitude more training data to see noticeable improvements.

Of course, with 100 million books to digitize, even an "unnoticeable" improvement of 0.01% in accuracy corresponds to 1 billion more words being classified correctly (assuming 100K words per book), and results in 1 billion fewer ReCAPTCHAs needed. But I am not sure how many ReCAPTCHAs are needed in order to achieve this hypothetical 0.01% improvement. Luis, if you are reading, give us the numbers :-)

But in any case, I think that 99.99% of the readers of this blog would be happy to hit this limit.

Thursday, October 21, 2010

A Plea to Amazon: Fix Mechanical Turk!

It is now almost four years since I started experimenting with Mechanical Turk. Over these years I have been a great evangelist of the idea.

But as Mechanical Turk becomes mainstream, it is now time for the service to get the basic stuff right. The last few weeks I found myself repeating the same things again and again, so I realized that it is now time to write these things down...

Mechanical Turk, It is Time to Grow Up

The beta testing is over. If the platform wants to succeed, it needs to evolve. Many people want to build on top of MTurk, and the foundations are lacking important structural elements.

Since the beginning of September, I have met with at least 15 different startups describing their ideas and their problems in using and leveraging Mechanical Turk. And hearing their stories, one after the other, I realized: Every single requester has the same problems:
  • Scaling up
  • Managing the complex API
  • Managing execution time
  • Ensuring quality
These problems were identified years ago, and they have never been addressed.

The current status quo simply cannot continue. It is not good for the requesters, not good for the workers, and not even good for getting the tasks completed. Amazon, pay attention. These are not just feature requests. These are fundamental requirements for any marketplace to function.

Amazon likes to present the hands-off approach to Mechanical Turk as a strategic choice: In the same way that EC2, S3, and many other web services are targeted at developers, Mechanical Turk is a neutral clearinghouse for labor. It provides just the ability to match requesters and workers. Everything else is the responsibility of the two consenting parties.

Too bad that this hands-off approach cannot work for a marketplace. The badly needed aspects can be summarized in four bullet points:
  • Requesters need a better interface to post tasks
  • Requesters need a true reputation system for workers
  • Workers need a trustworthiness guarantee for requesters
  • Workers need a better user interface
Below, I discuss these topics in more detail.


Requesters Need: A Better Interface To Post Tasks

A major task of a marketplace is to reduce overhead, friction, transaction costs, and search costs. The faster and easier it is to transact, the better the market. And MTurk fails miserably on that aspect.

I find it amazing that the last major change on Mechanical Turk for the requesters was the introduction of a UI to submit batch tasks. This was back in the summer of 2008. George Bush was the president, Lehman Brothers was an investment bank, Greece had one of the fastest-growing GDPs in Europe, Facebook had less than 100 million users, and Twitter was still a novelty. It would take 8 more months for FourSquare to launch.

It is high time to make it easier for requesters to post tasks. It is ridiculous to call the command-line tools user-friendly!

What is the benefit of having access to a workforce for microtasks, if a requester needs to hire a full-time developer (costing at least $60K) just to deal with all the complexities? How many microtasks does someone need to execute to recoup the cost of development?

If every requester, in order to get good results, needs to: (a) build a quality assurance system from scratch, (b) ensure proper allocation of qualifications, (c) learn to break tasks properly into a workflow, (d) stratify workers according to quality, (e) [whatever else...], then the barrier is just too high. Only very serious requesters will devote the necessary time and effort.

What is the expected outcome of this barrier? We expect to see a few big requesters and a long tail of small requesters that are posting tiny tasks. (Oh wait, this is the case already.) In other words: It is very difficult for small guys to grow.

Since we are talking about allowing easy posting of tasks: Amazon, please take a look at TurkIt. Buy it, copy it, do whatever, but please allow easy implementation of such workflows in the market. Very few requesters have simple, one-pass tasks. Most requesters want to have crowdsourced workflows. Give them the tools to do so easily.

MTurk is shooting itself in the foot by encouraging requesters to build their own interfaces and workflow systems from scratch! For many, many HITs, the only way to have a decent interface is to build it yourself in an iframe. What is the problem with iframes? By doing that, MTurk makes it extremely easy for the requester to switch labor channels. A requester who has built an iframe-powered HIT can easily get non-Turk workers to work on these HITs. (Hint: just use different workerid's for other labor channels and get the other workers to visit the iframe HTML page directly to complete the task.) Yes, it is good for the requester in the long term not to be locked in, but I guess all requesters would be happier if they did not have to build the app from scratch.


Requesters Need: A True Reputation System for Workers

My other big complaint. The current reputation system on Mechanical Turk is simply bad. "Number of completed HITs" and "approval rate" are easy to game.

Requesters need a better reputation profile for workers. Why? A market without a reputation mechanism turns quickly into a market for lemons: When requesters cannot easily differentiate good workers from bad ones, they tend to assume that every worker is bad. This results in good workers getting paid the same amount as the bad ones. With such low wages, good workers leave the market. In the end, the only Turkers who remain in the market are the bad ones (or the crazy good ones willing to work for the same payment as the bad workers.)

This in turn requires the same task to be completed by many workers, way too many times, to ensure quality. I am not against redundancy! (Quite the opposite!) But it should be a technique for taking moderate-quality input and generating high-quality output. A technique for capturing diverse points of view on the same HIT. Repeated labeling should NOT be the primary weapon against spam.

The lack of a strong reputation system hurts everyone, and hurts the marketplace! Does Amazon want to run a market for lemons? I am sure that the margins will not be high.

Here are a few suggestions on what a worker reputation mechanism should include.
  • Have more public qualification tests: Does the worker have the proper English writing skills? Can the worker proofread? Most marketplaces (eLance, oDesk, vWorker, Guru) allow participants to pass certification tests to signal their quality and knowledge in different areas. The same should happen on Turk. If Amazon does not want to build such tests, why not let requesters make their own qualification tests available to other requesters for a fee? Myself, I would pay to use the qualifications assigned by CastingWords and CrowdFlower. These requesters would serve as the certification authorities for MTurk, in the same way that universities certify abilities for the labor markets.
  • Keep track of working history: For which requesters did the worker work in the past? How many HITs, for what payment? For how long? A long history of work with reputable requesters is a good sign. In the real world, working history matters. People list their work histories in their resumes. Why not on MTurk?
  • Allow rating of workers: What is the rating that the worker received for the submitted work? Please allow requesters to rate workers. We have it everywhere else. We rate films, books, electronics, we rate pretty much everything.
  • Disconnect payment from rating: Tying reputation to acceptance rate is simply wrong. Currently, we can either accept the work and pay, or reject the work and refuse to pay. This is just wrong. We do not rate restaurants based on how often the customers refused to pay for the food! I should not have to reject and not pay for the work, if the only thing that I want to say is that the quality was not perfect. Rejecting work should be an option reserved for spammers. It should never be used against honest workers that do not meet the expectations of the requester.
  • Separate HITs and ratings by type: What was the type of the submitted work? Transcription? Image tagging? Classification? Content generation? Twitter spam? Workers are not uniformly good at all types of tasks. Writing an article requires a very different set of skills from those required for transcription, which in turn are different from the skills for image tagging. Allow requesters to see the rating across these different categories. Almost as good as the public qualification tests.
  • And make all the above accessible from an API, for automatic hiring decisions (a hypothetical sketch of such a profile follows this list).
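To make the wish list concrete, here is a purely hypothetical sketch of what a richer, API-accessible worker profile might look like; the fields are illustrative and not part of the current MTurk API:

```python
# Hypothetical data model mirroring the wish list above.
# Illustrative fields only, not the actual MTurk API.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class WorkerReputation:
    worker_id: str
    qualifications: List[str]               # public tests passed, e.g. "proofreading"
    work_history: Dict[str, int]            # requester_id -> HITs completed
    ratings_by_task_type: Dict[str, float]  # e.g. {"transcription": 4.6, "tagging": 3.9}
    approval_rate: float                    # kept separate from the ratings above

def hireable(w: WorkerReputation, task_type: str, min_rating: float = 4.0) -> bool:
    """Example of an automatic hiring decision using the per-task-type rating."""
    return w.ratings_by_task_type.get(task_type, 0.0) >= min_rating
```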
It cannot be that hard to do the above! Amazon.com has been running a huge marketplace with thousands of merchants for years. The guys at Amazon know how to design, maintain, and protect a reputation system for a much bigger marketplace. How hard can it be to port it to Mechanical Turk?

(Amazon's response about the reputation system... )

In a recent meeting, I asked this same question: Why not have a real reputation system?

The MTurk representative defended the current setup, with the following argument:
On the Amazon.com marketplace, the (large number of) buyers can rate the (small number of) merchants, but not vice versa. So, the same thing happens on MTurk. The (large number of) workers can rate the (small number of) requesters using TurkerNation and TurkOpticon. So the opposite should not happen: requesters should not rate workers.
I felt that the answer made sense: two-sided reputation systems indeed have deficiencies. They often lead to mutual-admiration schemes, so such systems end up being easy to hack (not that the current system is too hard to beat.) So, I was satisfied with the given answer... For approximately 10 minutes! Then I realized: Humbug!

There is no need for a reputation system for product buyers on Amazon.com's marketplace! It is not like eBay, where a buyer can win the auction and never pay! The reputation of the buyer on Amazon.com is irrelevant. On Amazon, when a buyer buys a product, as long as the credit card payment clears, the reputation of the buyer simply does not matter. There is no uncertainty, and no need to know anything about the buyer.

Now let's compare the Amazon.com product marketplace with MTurk: The uncertainty on MTurk is about the workers (who are the ones selling services of uncertain quality). The requester is the buyer in the MTurk market. So, indeed, there should not be a need for a reputation system for requesters, but the workers should be rated.

And at that point, people will protest: Why do we have the Hall of Fame/Shame on Turker Nation, why do we have TurkOpticon? Does Panos consider these efforts irrelevant and pointless?

And here is my reply: The very fact that we have such systems means that there is something very wrong with the MTurk marketplace. I expand below.


Workers Need: A Trustworthiness Guarantee for Requesters

Amazon should really learn from its own marketplace on Amazon.com. Indeed, on Amazon.com, it is not possible to rate buyers. Amazon simply ensures that when a buyer buys a product online, the buyer pays the merchant. So, Amazon, as the marketplace owner, ensures the trustworthiness of at least one side of the market.

Unfortunately, MTurk does not really guarantee the trustworthiness of the requesters. Requesters are free to reject good work and not pay for work they get to keep. Requesters do not have to pay on time. In a sense, the requesters are serving as the slave masters. The only difference is that on MTurk the slaves can choose their master.

And so, Turker Nation and TurkOpticon were born for exactly this reason: To allow workers to learn more about their masters. To learn which requesters behave properly, which requesters abuse their power.

However, this generates a wrong dynamic in the market. Why? Let's see how things operate.

The Requester Initiation Process

When new requesters come to the market, they are treated with caution by the experienced, good workers. Legitimate workers will simply not complete many HITs of a new requester, until the workers know that the requester is legitimate, pays promptly, and does not reject work unfairly. Most of the good workers will complete just a few HITs of the newcomer, and then wait and observe how the requester behaves.

Now, try to be on the requester's side.

If the requester posts small batches, things may work well. A few good workers do a little bit of good work, and the results come back like magic. The requester is happy, pays, everyone is happy. The small requester will come back after a while, post another small batch, and so on. This process generates a large number of happy small requesters.

However, what happens when the newcomers post big batches of HITs? Legitimate workers will do a little bit of work and then wait and see. Nobody wants to risk a mass rejection, which can be lethal for the reputation of the worker. Given the above, who are the workers willing to work on the HITs of the new, unproven requester? You guessed right: spammers and inexperienced workers. Result? The requester gets low-quality results, gets disappointed, and wonders what went wrong.

In the best case, the new requesters will seek expert help (if they can afford it). In the worst case, the new requesters leave the market and use more conventional solutions.

At this point, it should be clear that just having a subjective reputation system for requesters is simply not enough. We need a trustworthiness guarantee for the requesters. Workers should not be afraid of working for a particular requester.

Online merchants in the Amazon marketplace do not need to check the reputation of the people they sell to. Amazon ensures that the buyers are legitimate and not fraudsters. Can you imagine if every seller on Amazon had to check the credit score and the trustworthiness of every buyer they sell to? What did you say? It would be a disaster? That people would only sell to a few selected buyers? Well, witness the equivalent disaster on Mechanical Turk.

So, what is needed for the requesters? Since the requester is essentially the "buyer", there is no need to have subjective ratings. The worker should see a set of objective characteristics of the requester, and decide whether to pick a specific posted HIT or not. Here are a few things that are objective:
  • Show speed of payment: The requester payment already goes into an Amazon-controlled "escrow" account. The worker should know how fast the requester typically releases payment.
  • Show the rejection rate for the requester: Is a particular requester litigious, frequently rejecting workers' submissions as spam?
  • Show the appeal rate for the requester: A particular requester may have a high rejection rate just due to an attack from spammers. However, if the rejected workers appeal and win frequently, then there is something wrong with the requester.
  • Disallow the ability to reject work that is not spam: The requester should not be able to reject submitted work without paying. Rejection should be a last-resort mechanism, reserved only for obviously bad work. The worker should have the right to appeal (and potentially have the submitted work automatically reviewed by peers). This would remove a significant source of uncertainty from the market, allowing workers to be more confident about working with a new requester.
  • Show total volume of posted work: Workers want to know if the requester is going to come back to the market. The volume of posted work and the lifetime of the requester in the market are important characteristics: workers can use this information to decide whether it makes sense to invest the time to learn the tasks of the requester.
  • Make all the above accessible from an API: Let other people build worker-facing applications on top of MTurk.
So, a major role of a marketplace is to instill a sense of trust. Requesters should trust the workers to complete the work, and workers should not have to worry about unreasonable behavior from the requesters. This minimizes the search costs associated with finding a trustworthy partner in the market.

Let's see the final part that is missing.


Workers Need: A Better User Interface

As mentioned earlier, beyond trust, the other important role of the market is to minimize transaction overhead and search costs as much as possible. The transacting parties should find each other as fast as possible, fulfill their goals, and move on. The marketplace should almost be invisible. In this market, where requesters post tasks and the tasks wait for the workers, it is important to make it as easy as possible for workers to find tasks they want to work on.

Current Problem: Unpredictable Completion Times

Unfortunately, workers are highly restricted by the current interface in their ability to find tasks. Workers cannot search for a requester unless the requester puts their name in the keywords. Workers also have no way to navigate and browse through the available tasks to find things of interest.

In the end, workers rely on two main sorting mechanisms: see the most recent HITs, or see the HITgroups with the most HITs. This means that workers use priority queues to pick the tasks to work on.

What is the result when tasks are being completed following priorities? The completion times of the tasks follow a power-law! (For details on the analysis, see the preprint of the XRDS report "Analyzing the Amazon Mechanical Turk Marketplace".) What is the implication? It is effectively impossible to predict the completion time of the posted tasks. For the current marketplace (with a power-law exponent a=1.5), the distribution cannot even be used to predict the average waiting time: the theoretical average is infinite, i.e., in practice the mean completion time is expected to increase continuously as we observe the market for longer periods of time.
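A quick simulation illustrates the point, assuming the exponent refers to a density p(t) proportional to t^(-1.5), i.e., a Pareto shape parameter of 0.5 (my reading of the report, not code from it):

```python
# With a density ~ t^(-1.5) (Pareto shape 0.5), the theoretical mean is
# infinite, so the sample mean keeps drifting upward as we observe the
# market longer instead of settling on a predictable average.
import numpy as np

rng = np.random.default_rng(0)
completion_times = 1.0 + rng.pareto(0.5, size=10_000_000)   # shape parameter 0.5

for n in (10**3, 10**5, 10**7):
    print(n, completion_times[:n].mean())   # the running mean does not stabilize
```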

The proposed solutions? They are so easy and so obvious that it even hurts to propose them:
  • Have a browsing system with tasks being posted under task categories. See for example, the main page for oDesk, where tasks are being posted under one or more categories. Is this really hard to do?


  • Improve the search engine. Seriously, how hard is it to include all the fields of a HIT into the search index? Ideally it would be better to have a faceted interface on top, but I would be happy to just see the basic things done right.
  • Use a recommender system to propose HITs to workers. For this suggestion, I have to credit a site on the Internet with some nifty functionality: it monitors your past buying and rating history, and then recommends products that you may enjoy. It is actually pretty nice and helped that online store differentiate itself from its competitors. Trying to remember the name of the site... The recommendations look like this:



    It would be a good idea to have something like that on Amazon Mechanical Turk. Ah! I remembered! The name of the site with the nice recommendations is Amazon! Seriously, Amazon cannot build a good recommender system for its own market?

Competition awaits


Repeat after me: A labor marketplace is not the same thing as a computing service. Even if everything is an API, the design of the market still matters.

It is too risky to assume that MTurk can simply be a bare-bones clearinghouse for labor, in the same way that S3 can be a bare-bones provider of cloud storage. There is simply no sustainable advantage and no significant added value. Network effects are not strong (especially in the absence of reputation), and just clearing payments and dealing with the Patriot Act and KYC is not a significant added value.

Other marketplaces already do that, build APIs, and have better design as well. It will not be difficult for them to get to the micro segment of the crowdsourcing market, and it may happen much faster than Amazon expects. Imho, oDesk and eLance are moving towards this space by having strong APIs for worker management, and good reputation systems. Current MTurk requesters that create their HITs using iframes can very easily hire eLance and oDesk workers instead of using MTurk.

The recent surge of microcrowdsourcing services indicates that there are many who believe that the position of MTurk in the market is ready to be challenged.

Is it worth trying to challenge MTurk? Luis von Ahn, looking at an earlier post of mine, tweeted:

MTurk is TINY (total market size is on the order of $1M/year): Doesn't seem like it's worth all the attention.

I will reply with a prior tweet of mine:

Mechanical Turk is for crowdsourcing what AltaVista was for search engines. We now wait to see who will be the Google.

Friday, October 15, 2010

Mechanical Turk and Data Driven Journalism: The Case of ProPublica

Last year, in a Mechanical Turk Meetup in New York, I met with Amanda Michel of ProPublica, a "non-profit newsroom that produces investigative journalism in the public interest".

ProPublica had a set of very interesting ideas on how to use crowdsourcing to improve their practices and increase their reporting reach, starting with operational aspects of data-driven journalism and going up to more ambitious goals. What was common in all these efforts was a simple goal: find, reveal, and fight corruption. When you meet with such people, it is hard not to be inspired. So, over the last year I kept interacting with ProPublica on how to use Mechanical Turk for their goals.

Take a simple example. ProPublica was facing a significant data integration problem. For one of their projects, they wanted to extract data from hundreds of different city, county, and state databases. Needless to say, building an integration system of such scale is difficult and beyond the reach of many advanced IT companies. It is definitely not a problem that a journalism organization could solve for the purpose of writing a single story. How could Mechanical Turk help? The Turkers could be the ones interacting with the databases, creating an effective, human-powered hidden-web crawler that was up and running in a couple of days.

Mechanical Turk quickly became an integral part of ProPublica's newsroom operations. It became so valuable that ProPublica today published an article describing how they are using Amazon's Mechanical Turk to do data-driven reporting, and they made public ProPublica's Guide to Mechanical Turk. It goes step by step through all the challenges that a newcomer on Mechanical Turk may face, and shows how to best approach the tool. Needless to say, these links are being passed around on Twitter like crazy.

ProPublica is a great case study, not because they did something artistic or fancy, but because they kept a razor-sharp focus on integrating crowdsourcing into their operations. The 10,000 sheep will be passed around virally and inspire ideas, but mainstream adoption will come after reading success stories like the one from ProPublica. At the end of the day, people want to know how to get things done.

I kept the best part for the end. From this article:

ProPublica has received a Special Distinction Award from the Knight-Batten Awards for Innovations in Journalism. ProPublica's Distributed Reporting Project was honored for "systematizing the process of crowdsourcing, conducting experiments, polishing their process and tasking citizens with serious assignments." The judges called it "a major step forward with how we understand crowdsourcing."

Tuesday, October 12, 2010

Be a Top Mechanical Turk Worker: You Need $5 and 5 Minutes

The current reputation system on Mechanical Turk is simply inadequate. The only built-in reputation metrics are the number of completed HITs and the approval rate. 

Some people believe that they are adequate as a basic filtering mechanism. They are not. 

For example, require all workers in your HITs to have 1000 completed HITs and a 99% approval rate. You believe that you will only get high-quality workers? You are wrong!

I tried to filter workers using just these metrics. I failed. Spammers got me again. (And once in, spammers submit a lot of crap. It costs nothing.) I questioned why. How can it be? And then I realized: It is trivial to beat these metrics.

Let's see the effort it takes to beat the system. 

The mission: Become a top Turker, 100% approval rate and 1000 completed HITs. 
  • Step 1: Login as a requester. Post a task with 1000 HITs. Each HIT pays 1 cent. Total cost: $15. Out of these, $10 go to the worker, $5 go to Amazon. The title of the HIT: "Write a 500 word review". No sane worker will touch these HITs. Done. Logout.
  • Step 2: Login as a worker, using a different email. Complete and submit the 1000 HITs created in Step 1. You just need to click submit 1000 times. Bored? iMacro and Greasemonkey can help. Done. Logout.
  • Step 3: Login as a requester again. Approve all submitted HITs. Pay the $15. Amazon gets $5. The worker account has the remaining $10. Done. Logout.

Your worker account is a top Turker now. 1000 completed HITs and 100% approval rate. Congratulations! You have a license to spam.

Monday, October 11, 2010

The Explosion of Micro-Crowdsourcing Services

In my last post, I expressed my surprise at the sudden explosion of research-oriented workshops at computer science conferences that are explicitly focused on the concept of crowdsourcing.

I should also note, though, that there is a parallel explosion of micro-crowdsourcing services. Here is a list of services that I have run into:
Some of the companies above are serious, some are new and upcoming, some are copycats, and some are there just to facilitate spamming. 

I thought of doing a more detailed comparison (similar to the report that Brent Frei prepared last year for the more general area of paid crowdsourcing), but then I realized that I do not trust half of these companies enough to even give them my email.

This growing list makes it clear that we are entering the bubble period. Bubbles are not necessarily negative. During bubble periods you see many innovations coming into the industry from many different parties. While most of the entrants in the market will die sooner rather than later, I expect to see interesting things coming out of this.

Do not forget that the dotcom bubble generated the Pets.com failures but also gave birth to Google, which replaced the early dominant players, such as Lycos and AltaVista.


Saturday, October 9, 2010

The Explosion of Crowdsourcing Workshops

Over the last couple of years, there has been an explosion of workshops that look at the topic of crowdsourcing from the academic point of view, within the broader computer science field. Here are the ones that I am aware of:
  1. Human Computation Workshop (HCOMP 2009), with KDD 2009
  2. Workshop on Crowdsourcing for Search Evaluation, with SIGIR 2010
  3. Second Human Computation Workshop (HCOMP 2010), with KDD 2010
  4. Advancing Computer Vision with Humans in the Loop (ACVHL), with CVPR 2010
  5. Creating Speech and Language Data With Amazon’s Mechanical Turk, with NAACL 2010
  6. Computational Social Science and the Wisdom of Crowds, with NIPS 2010
  7. The People’s Web Meets NLP: Collaboratively Constructed Semantic Resources, with COLING 2010
  8. Workshop on Ubiquitous Crowdsourcing, with UBIComp 2010
  9. Enterprise Crowdsourcing Workshop, with ICWE 2010
  10. Collaborative Translation: technology, crowdsourcing, and the translator perspective, with AMTA 2010
  11. Workshop on Crowdsourcing and Translation
  12. Crowdsourcing for Search and Data Mining, with WSDM 2011
(If you think that I missed a relevant workshop, drop me a line, and I will add it to the list above)

In addition to the workshops above, we also have the CrowdConf 2010 conference, organized by CrowdFlower, with some academic presence but overall targeted mainly to industry.

Yes, one workshop in 2009, followed by ten (at least!) additional workshops in 2010, and who knows how many more in 2011. (I am already aware of 3 planned workshops, in addition to the one in WSDM 2011.) 

I am deeply interested in the topic and I already feel that I am losing track of the venues that I need to follow.

Friday, October 8, 2010

Mechanical Turk Requester Activity: The Insignificance of the Long Tail

The Pareto principle says that 80% of the effects come from 20% of the causes. It is a favorite anecdote to cite that 20% of the employees in an organization do 80% of the work, or that 20% of the customers are those that generate 80% of the profits.

In online settings, such inequalities are often amplified. For Wikipedia we have the 1% rule, where 1% of the contributors (this is 0.003% of the users) contribute two thirds of the content. In the Causes application on Facebook, there are 25 million users, but only 1% of them contribute a donation.

So, adapting this question for Mechanical Turk, we want to see: What is the distribution of activity across requesters?

The Activity Distribution: The (Insignificance of the) Long Tail of Requesters

To analyze the level of participation, for the XRDS paper, we took the requesters that posted a task on Mechanical Turk from January 2009 until April 2010, and we ranked them according to the total reward amount of their posted HITs. Then, we measured what percentage of the rewards comes from the top requesters in the market. Here is the resulting plot:


Indeed, the result shows that Mechanical Turk is closer to the "1% rule" of Wikipedia than to the general 80-20 principle. As in Wikipedia, the top 1% of the requesters contribute two thirds of the activity in the market.

By reading the graph, we see the following:
  • Castingwords, the top requester across the 10K requesters in the dataset, accounts for 10% of the dollar-weighted activity (!).
  • The top 0.1% of the requesters (i.e., the top-10 requesters) account for 30% of the dollar-weighted activity.
  • The top 1% of the requesters account for 60% of the dollar-weighted activity.
  • The top 10% of the requesters account for 90% of the dollar-weighted activity.
  • The long tail of the 90% of the requesters is effectively insignificant.
A closer look at the distribution of requester activity shows that the activity per requester follows roughly a log-normal distribution.

  • The average level of posted rewards is $58. This corresponds to an average level of activity of just four dollars per month.
  • The median is just $1.60. Yes, this is not a typo: 1.6 dollars. In other words, 50% of the requesters never post more than a couple of dollars worth of tasks.
  • Only a small fraction of requesters (less than 1%) posted 1000 dollars worth of tasks or more over the period from January 2009 till April 2010. 
The lognormal distribution of activity also suggests that requesters increase their participation exponentially over time: they post a few tasks and get the results. If the results are good, they increase the size of the tasks they post next time by some percentage. This multiplicative behavior is the basic process that generates a lognormal distribution of activity.
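A small simulation of this multiplicative story (the growth factors, starting amount, and horizon are arbitrary assumptions, chosen only for illustration) shows the lognormal shape emerging:

```python
# Toy simulation: each period, an active requester scales spending by a
# random factor. The sum of many independent log-factors is approximately
# normal (CLT), so the spending levels end up approximately lognormal.
# All parameters here are arbitrary, for illustration only.
import numpy as np

rng = np.random.default_rng(42)
n_requesters, n_periods = 10_000, 16

spend = np.full(n_requesters, 2.0)                      # everyone starts with a tiny batch
for _ in range(n_periods):
    spend *= rng.uniform(0.5, 2.0, size=n_requesters)   # scale up or down by a percentage

log_spend = np.log(spend)                               # roughly bell-shaped
print(round(log_spend.mean(), 2), round(log_spend.std(), 2))
```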

What I would like to try is to check whether this model indeed corresponds to reality. Do we see a geometric growth in activity as the requester stays in the market for longer? Do we observe "deaths" of requesters? (The Fader-Hardie model may be a nice, simple model to try.) What is the expected future activity of a requester?

Such questions may be useful for guiding workers when they decide whether to invest time and effort to build a good reputation with a given requester (e.g., by completing qualification tests or completing the basic HITs that unlock access to the "protected" HITs.)

What Tasks Are Posted on Mechanical Turk?

A few months back, I got an invitation from Michael Bernstein (of Soylent fame) to write a small article about Mechanical Turk for the student magazine of ACM, the ACM XRDS (aka Crossroads). I could have written a summary of past research, a position paper, or anything that I find interesting.

Instead of summarizing and resubmitting already published material, I decided to push myself and start analyzing some data that I have been collecting about the Mechanical Turk marketplace. In the past, I analyzed the data about the demographics of the workers on Mechanical Turk. However, we have limited analysis for the requester side of the market, and for the type of tasks being posted.

My goal was to put some very preliminary analysis in place, just to scratch the surface of a variety of questions that I have heard over time. Hopefully this will push me to start working more towards getting the answers, and will inspire some interesting new questions for the students and others that read the article. A preprint of the paper is available through the NYU Faculty Digital Archive and the print version should appear sometime early in 2011.

Data Set

I have been collecting data about the marketplace through my Mechanical Turk Tracker. The tracker collects complete snapshots of the marketplace, every hour, starting from January 2009. For the analysis, I took the data from the period of January 2009 till April 2010. The snapshots contain a total of:
  • 165,368 HIT groups
  • 6,701,406 HITs
  • 9,436 requesters
  • $529,259 rewards
These numbers, of course, do not account for the redundancy of the posted HITs, or for HITs that were posted and disappeared between our hourly crawls. Nevertheless, they should be good approximations (within an order of magnitude) of the activity of the marketplace.

Top Requesters

The first question that I looked at was an analysis of the tasks that are being posted on Mechanical Turk. One way to understand what types of tasks are being completed in the marketplace is to find the “top” requesters and analyze the HITs that they post. By ranking requesters according to the sum of the posted rewards, we get the following list, showing the level of activity and the type of tasks that these requesters post. (Note: To avoid skewing the data towards one-shot requesters, I excluded from the list requesters that were active only for short periods of time or posted only a small number of HITs. The goal was to find not only the requesters that post big tasks, but also requesters that do so consistently over time.)


So, transcription, classification, and content generation seem to be common activities on Mechanical Turk. This indicates that people have developed sufficient best practices and can actually get quality work done. (If not, they would not be posting so many tasks.)
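For readers who want to reproduce this kind of ranking from their own crawls, here is a hedged sketch; the column names are assumptions about what a snapshot table might contain, not the actual schema of the Mechanical Turk Tracker:

```python
# Sketch: sum posted rewards per requester and keep only requesters with
# sustained activity. Columns ("requester_id", "reward", "num_hits",
# "first_seen", "last_seen") are assumed, not the tracker's real schema.
import pandas as pd

def top_requesters(hitgroups: pd.DataFrame, min_days_active=30, min_hits=1000):
    per_requester = hitgroups.groupby("requester_id").agg(
        total_reward=("reward", "sum"),      # reward = price per HIT * number of HITs
        total_hits=("num_hits", "sum"),
        first_seen=("first_seen", "min"),
        last_seen=("last_seen", "max"),
    )
    days_active = (per_requester["last_seen"] - per_requester["first_seen"]).dt.days
    steady = per_requester[(days_active >= min_days_active) &
                           (per_requester["total_hits"] >= min_hits)]
    return steady.sort_values("total_reward", ascending=False)
```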

Top Keywords

We also wanted to get a feeling for the tasks that are being posted in the market, across all requesters. The table below shows the top-50 most frequent HIT keywords in the dataset, ranked by total reward amount, number of HITgroups, and number of HITs.


Beyond the tasks identified before, we also see that data collection, image tagging, website feedback, and usability tests are common tasks posted in the marketplace.
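For completeness, here is a similar hedged sketch for the keyword table, again with assumed column names (each HITgroup carrying a list of keywords):

```python
# Sketch: explode the per-HITgroup keyword list and aggregate total reward,
# number of HITgroups, and number of HITs per keyword. Column names are
# assumptions, not the tracker's actual schema.
import pandas as pd

def top_keywords(hitgroups: pd.DataFrame, k=50):
    exploded = hitgroups.explode("keywords")
    table = exploded.groupby("keywords").agg(
        total_reward=("reward", "sum"),
        hit_groups=("hit_group_id", "nunique"),
        hits=("num_hits", "sum"),
    )
    return table.sort_values("total_reward", ascending=False).head(k)
```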

In future posts, I will post further analysis of other aspects of the AMT marketplace.