tag:blogger.com,1999:blog-71185634030274676312024-03-19T00:58:18.609-04:00A Computer Scientist in a Business SchoolRandom thoughts of a computer scientist who is working behind the enemy lines; and lately turned into a double agent.Panos Ipeirotishttp://www.blogger.com/profile/15283752183704062501noreply@blogger.comBlogger247125tag:blogger.com,1999:blog-7118563403027467631.post-50773400305452753502024-01-18T15:21:00.002-05:002024-01-22T18:22:01.763-05:00The PiP-AUC score for research productivity: A somewhat new metric for paper citations and number of papers<p><span style="color: #374151; font-family: inherit; white-space-collapse: preserve;">Many years back, we </span><a href="https://www.behind-the-enemy-lines.com/2018/11/distribution-of-paper-citations-over.html" style="font-family: inherit; white-space-collapse: preserve;" target="_blank">conducted some analysis</a><span style="color: #374151; font-family: inherit; white-space-collapse: preserve;"> on how the number of citations for a paper evolves over time. We noticed that while the raw number of citations tends to be a bit difficult to estimate, if we calculate the percentile of citations for each paper, based on the year of publication, we get a number that </span><a href="https://arxiv.org/abs/2103.16025" style="font-family: inherit; white-space-collapse: preserve;" target="_blank">stabilizes very quickly</a><span style="color: #374151; font-family: inherit; white-space-collapse: preserve;">, even within 3 years of publication. That means we can estimate the future potential of a paper rather quickly by checking how it is doing against other papers of the same age. 
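As a minimal sketch of the idea (toy data, purely illustrative, not the code behind the actual analysis), the cohort percentile of a paper can be computed like this:

```python
# Illustrative sketch: rank each paper against papers published in
# the same year. The titles, years, and citation counts are made up.
from collections import defaultdict

papers = [  # (title, year, citations) -- hypothetical examples
    ("A", 2018, 120), ("B", 2018, 15), ("C", 2018, 40),
    ("D", 2021, 30), ("E", 2021, 5), ("F", 2021, 30),
]

citations_by_year = defaultdict(list)
for _, year, cites in papers:
    citations_by_year[year].append(cites)

def citation_percentile(year, cites):
    """Fraction of same-year papers with strictly fewer citations."""
    cohort = citations_by_year[year]
    return sum(c < cites for c in cohort) / len(cohort)

for title, year, cites in papers:
    print(title, round(citation_percentile(year, cites), 2))
```

In practice, the cohort would be all indexed papers from the same publication year, and it is this percentile that stabilizes within about three years.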
The percentile score of a paper is a very reliable indicator of its future.</span></p><p><span style="font-family: inherit;"><span style="color: #374151; white-space-collapse: preserve;">To make it easy for everyone to check the percentile scores of their papers, we created a small app at </span></span></p><p style="text-align: center;"><span style="font-family: inherit;"><span style="color: #374151; font-size: large; white-space-collapse: preserve;"><b><a href="https://scholar.ipeirotis.org/">https://scholar.ipeirotis.org/</a> </b></span></span></p><p><span style="font-family: inherit;"><span style="color: #374151; white-space-collapse: preserve;">that allows anyone to search for a Google Scholar profile and then calculate the percentile scores of each paper. We then take all the papers for an author, calculate their percentile scores, and sort them in descending order based on their scores. This generates a plot like this, with the paper percentile on the y-axis and the paper rank on the x-axis.</span></span></p><p></p><div class="separator" style="clear: both; text-align: center;"><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEiy2BlC9BfVlKRVI5ogoSPRnSuomBI5XMMWnSPRhAwgqde4DAdJXLTW6pOYkQyz8P_gIElgm92eSGi-NVp2FFuvliE-x8sFlltP9s50ypa-SoqpnmFD7LA9q1GMRDJfdWIquI1XbERlGTbIn7oyBpKWMTak3JDQrd1IxKtoC98kvN5iKdT87yLtiXyog0s" style="margin-left: 1em; margin-right: 1em;"><span style="font-family: inherit;"><img alt="" data-original-height="808" data-original-width="804" height="240" src="https://blogger.googleusercontent.com/img/a/AVvXsEiy2BlC9BfVlKRVI5ogoSPRnSuomBI5XMMWnSPRhAwgqde4DAdJXLTW6pOYkQyz8P_gIElgm92eSGi-NVp2FFuvliE-x8sFlltP9s50ypa-SoqpnmFD7LA9q1GMRDJfdWIquI1XbERlGTbIn7oyBpKWMTak3JDQrd1IxKtoC98kvN5iKdT87yLtiXyog0s" width="239" /></span></a></div><span style="font-family: inherit;"><br /><br /></span></div><p></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: 
rgba(69,89,164,.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 transparent; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 transparent; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 transparent; --tw-shadow: 0 0 transparent; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; border: 0px solid rgb(217, 217, 227); box-sizing: border-box; color: #374151; margin: 1.25em 0px; white-space-collapse: preserve;"><span style="font-family: inherit;">Then, an obvious next question came up: How can we also normalize the x-axis, which shows the number of papers?</span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgba(69,89,164,.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 transparent; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 transparent; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 transparent; --tw-shadow: 0 0 transparent; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; border: 0px solid rgb(217, 217, 227); box-sizing: border-box; color: #374151; margin: 1.25em 0px; white-space-collapse: preserve;"><span style="font-family: inherit;">Older scholars have more years to publish, giving them more chances to write high-percentile papers. To control for that, we also calculated percentiles for the number of papers published, using a dataset of around 15,000 faculty members at top US universities. 
The plot below shows how the percentiles for the number of publications evolve over time.</span></p><p></p><div class="separator" style="clear: both; text-align: center;"><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEiyuT3QRzyWSzBaG34vWVB6YXSw4AGBo1EHUDuBWrMDck3dVnsZB_FO0ZpJxWedvtWxt6eoRPtT0DLR80b3HpdIaFR4qL9LmCDC8Jwk4V1hK4SaotG9TMDL382H0J2zZHMSzdwweun26arZAJ-WoKLudzDovwF_2_UdUbvjF7ysGJHgVHWtYegiL0dWa1o" style="margin-left: 1em; margin-right: 1em;"><span style="font-family: inherit;"><img alt="" data-original-height="908" data-original-width="1156" height="240" src="https://blogger.googleusercontent.com/img/a/AVvXsEiyuT3QRzyWSzBaG34vWVB6YXSw4AGBo1EHUDuBWrMDck3dVnsZB_FO0ZpJxWedvtWxt6eoRPtT0DLR80b3HpdIaFR4qL9LmCDC8Jwk4V1hK4SaotG9TMDL382H0J2zZHMSzdwweun26arZAJ-WoKLudzDovwF_2_UdUbvjF7ysGJHgVHWtYegiL0dWa1o" width="306" /></span></a></div><span style="font-family: inherit;"><br /><br /></span></div><span style="font-family: inherit;"><span style="color: #374151; white-space-collapse: preserve;">Now, we can use the percentile scores for the number of papers published to normalize the x-axis as well. Instead of showing the raw number of papers on the x-axis, we normalize paper productivity against the percentile benchmark shown above. 
The result is a graph like this for the superstar Jure Leskovec</span><br /></span><div><span style="font-family: inherit;"><br /></span></div><div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjcO7NuKjr8wT1WRMFD2k-BQX2DDALQVkxpncuRoA_TSLqs0z76oGr2kH-rb2CpiPKYiXsemBR-K7vjsn8XxXPhf2M_2zJ4N4MofbuKv9DHxEZTBrwzZh4ElEPL-I-6b5b9Geku_lxMrH3khCG9c30VCcF6a54Y4FuBi7EuL-be9oCTecA6v5po-xsV1Ww" style="margin-left: 1em; margin-right: 1em;"><span style="font-family: inherit;"><img alt="" data-original-height="812" data-original-width="804" height="240" src="https://blogger.googleusercontent.com/img/a/AVvXsEjcO7NuKjr8wT1WRMFD2k-BQX2DDALQVkxpncuRoA_TSLqs0z76oGr2kH-rb2CpiPKYiXsemBR-K7vjsn8XxXPhf2M_2zJ4N4MofbuKv9DHxEZTBrwzZh4ElEPL-I-6b5b9Geku_lxMrH3khCG9c30VCcF6a54Y4FuBi7EuL-be9oCTecA6v5po-xsV1Ww" width="238" /></span></a></div><span style="font-family: inherit;"><br /><span style="color: #374151; white-space-collapse: preserve;">and a less impressive one for yours truly:</span></span></div><div><div class="separator" style="clear: both; text-align: center;"><span style="font-family: inherit;"><br /></span></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjg_3k00YZu7wIZQOSinqJD4WlQv4YaNODVde-9vdLgcIivSyPQoVfDOImkEOwbIrCe3tw07yNz_WlkqXODQQmyTdi0ebXkv0PR7e-ssQePSLl59QhJFj64HQXGiBg9HfsjsqXwzd9mtLN4nKo-TYkpUZt2bytIDndiR0b8vllej-BPRd2YOx10Mqir8Lg" style="margin-left: 1em; margin-right: 1em;"><span style="font-family: inherit;"><img alt="" data-original-height="812" data-original-width="812" height="240" src="https://blogger.googleusercontent.com/img/a/AVvXsEjg_3k00YZu7wIZQOSinqJD4WlQv4YaNODVde-9vdLgcIivSyPQoVfDOImkEOwbIrCe3tw07yNz_WlkqXODQQmyTdi0ebXkv0PR7e-ssQePSLl59QhJFj64HQXGiBg9HfsjsqXwzd9mtLN4nKo-TYkpUZt2bytIDndiR0b8vllej-BPRd2YOx10Mqir8Lg" width="240" /></span></a></div><div><span style="font-family: inherit;"><br 
/></span></div><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgba(69,89,164,.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 transparent; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 transparent; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 transparent; --tw-shadow: 0 0 transparent; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; border: 0px solid rgb(217, 217, 227); box-sizing: border-box; color: #374151; margin: 1.25em 0px; white-space-collapse: preserve;"><span style="font-family: inherit;">Now, with a graph like this, with the x and y axes being normalized between 0 and 1, we have a nice new score that we have given the thoroughly boring name "Percentile in Percentile Area Under the Curve" score, or PiP-AUC for short. It is a score that ranges between 0 and 1, and you can play with different names to see their scores.</span></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgba(69,89,164,.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 transparent; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 transparent; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 transparent; --tw-shadow: 0 0 transparent; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; border: 0px solid rgb(217, 217, 227); box-sizing: border-box; color: #374151; margin: 1.25em 0px; white-space-collapse: preserve;"><span style="font-family: inherit;"><strike>At some point, we may also calculate the percentile scores of the PiP scores, but we will do that in the future. 
:-) </strike></span> UPDATE: If you are also curious about the percentiles for the PiP-AUC scores, here is the distribution:</p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgba(69,89,164,.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 transparent; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 transparent; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 transparent; --tw-shadow: 0 0 transparent; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; border: 0px solid rgb(217, 217, 227); box-sizing: border-box; color: #374151; margin: 1.25em 0px; white-space-collapse: preserve;"></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEj6VCaQZJxRrmmoxGt8Z3zPeITrQmul73Lgbqb_PmGfuErYW8R51sXp7iQGGx_IhTbQUaquucuW65-mSpo90jHKuw3gNwCZf1JfT-kk78KFHmQjx2XOKLMDqEVreG8ycONkuntiLJac2z37er6isqx9MsfHSE92x9rLtauRZY8h60R9VpRm4mPT0Co9kn0" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1306" data-original-width="1308" height="240" src="https://blogger.googleusercontent.com/img/a/AVvXsEj6VCaQZJxRrmmoxGt8Z3zPeITrQmul73Lgbqb_PmGfuErYW8R51sXp7iQGGx_IhTbQUaquucuW65-mSpo90jHKuw3gNwCZf1JfT-kk78KFHmQjx2XOKLMDqEVreG8ycONkuntiLJac2z37er6isqx9MsfHSE92x9rLtauRZY8h60R9VpRm4mPT0Co9kn0" width="240" /></a></div><br />The x-axis shows the PiP-AUC score, and the y-axis shows the corresponding percentile. So, if you have a PiP-AUC score of 0.6, you are in the top 25% (i.e., the 75th percentile) for that metric. 
With a score of 0.8, you are in the top 10% (i.e., the 90th percentile), etc.<p></p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgba(69,89,164,.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 transparent; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 transparent; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 transparent; --tw-shadow: 0 0 transparent; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; border: 0px solid rgb(217, 217, 227); box-sizing: border-box; color: #374151; margin: 1.25em 0px; white-space-collapse: preserve;">In general, the tool is helpful when trying to understand the impact of newer work published in the last few years. Especially for people with many highly cited but old papers, the percentile scores are very helpful for quickly finding the newer gems. I also like the PiP-AUC scores and plots, as they offer a good balance of overall productivity and impact. 
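For the curious, the area computation can be sketched as follows. This is an illustrative sketch of an area under a normalized percentile curve; the benchmark positions below are made up, while the real x-axis values come from the faculty productivity percentiles described above:

```python
# Rough sketch of an AUC over the normalized percentile plot.
# All numbers are hypothetical examples.

# Citation percentiles of an author's papers, sorted descending:
paper_percentiles = [0.95, 0.90, 0.70, 0.40, 0.10]
# x-position of rank k = productivity percentile of having published
# at least k papers (made-up values in [0, 1]):
rank_percentiles = [0.15, 0.30, 0.45, 0.60, 0.75]

def trapezoid_area(xs, ys):
    """Area under the piecewise-linear curve through (xs, ys)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]))

# Pad the curve so it starts at x = 0:
xs = [0.0] + rank_percentiles
ys = [paper_percentiles[0]] + paper_percentiles
pip_auc = trapezoid_area(xs, ys)
print(round(pip_auc, 4))  # a score between 0 and 1
```

Since both axes are percentiles, the resulting area is automatically bounded between 0 and 1.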
Admittedly, it is a strict score, so it is not especially bragging-worthy most of the time :-)</p><p style="--tw-border-spacing-x: 0; --tw-border-spacing-y: 0; --tw-ring-color: rgba(69,89,164,.5); --tw-ring-offset-color: #fff; --tw-ring-offset-shadow: 0 0 transparent; --tw-ring-offset-width: 0px; --tw-ring-shadow: 0 0 transparent; --tw-rotate: 0; --tw-scale-x: 1; --tw-scale-y: 1; --tw-scroll-snap-strictness: proximity; --tw-shadow-colored: 0 0 transparent; --tw-shadow: 0 0 transparent; --tw-skew-x: 0; --tw-skew-y: 0; --tw-translate-x: 0; --tw-translate-y: 0; border: 0px solid rgb(217, 217, 227); box-sizing: border-box; color: #374151; margin: 1.25em 0px; white-space-collapse: preserve;"><i>(With thanks to Sen Tian and Jack Rao for their work.)</i></p></div>Panos Ipeirotishttp://www.blogger.com/profile/15283752183704062501noreply@blogger.comtag:blogger.com,1999:blog-7118563403027467631.post-44713660486445202662022-10-18T22:53:00.002-04:002022-10-18T22:53:10.523-04:00 Tell these fucking colonels to get this fucking economist out of jail.<p style="text-align: justify;"><span style="background-color: white; color: #050505; font-size: 15px; white-space: pre-wrap;"><span style="font-family: Roboto;">Today is October 18th. It has been 41 years since Greece elected Andreas Papandreou as prime minister with 48% of the vote, fundamentally changing the course of history for Greece. Whether for better or for worse is still debated, but the change was real.</span></span></p><div class="x11i5rnm xat24cr x1mh8g0r x1vvkbs xtlvy1s x126k92a" style="background-color: white; color: #050505; font-size: 15px; margin: 0.5em 0px 0px; overflow-wrap: break-word; white-space: pre-wrap;"><div dir="auto" style="text-align: justify;"><span style="font-family: Roboto;">On October 6th, Roy Radner passed away at the age of 95. He was a faculty member at our department and a famous microeconomist with a highly distinguished career. 
Many others have written about him and his accomplishments as an economist and academic, so I will not try to do the same. </span></div></div><div class="x11i5rnm xat24cr x1mh8g0r x1vvkbs xtlvy1s x126k92a" style="background-color: white; color: #050505; font-size: 15px; margin: 0.5em 0px 0px; overflow-wrap: break-word; white-space: pre-wrap;"><div dir="auto" style="text-align: justify;"><span style="font-family: Roboto;">But Roy also played an important role in making that election in 1981 possible. Why? Let me tell you his story.</span></div></div><div class="x11i5rnm xat24cr x1mh8g0r x1vvkbs xtlvy1s x126k92a" style="background-color: white; margin: 0.5em 0px 0px; overflow-wrap: break-word;"><div dir="auto"><div style="text-align: justify;"><span style="color: #050505; font-size: 15px; white-space: pre-wrap;"><span style="font-family: Roboto;"><br /></span></span></div><span style="color: #050505; font-size: 15px; white-space: pre-wrap;"><span style="font-family: Roboto;"><a name='more'></a></span></span></div></div><div class="x11i5rnm xat24cr x1mh8g0r x1vvkbs xtlvy1s x126k92a" style="background-color: white; color: #050505; font-size: 15px; margin: 0.5em 0px 0px; overflow-wrap: break-word; white-space: pre-wrap;"><div dir="auto" style="text-align: justify;"><span style="font-family: Roboto;"><br /></span></div><div dir="auto"><div style="text-align: justify;"><span style="font-family: Roboto;">When I joined Stern in 2004, Roy Radner came to my office, telling me (lovingly) that he dislikes data mining, but I should not take that personally.</span></div><span class="x3nfvp2 x1j61x8r x1fcty0u xdj266r xhhsvwb xat24cr xgzva0m xxymvpz xlup9mm x1kky2od" style="display: inline-flex; height: 16px; margin: 0px 1px; vertical-align: middle; width: 16px;"><span style="font-family: Roboto;"><img alt="🙂" height="16" referrerpolicy="origin-when-cross-origin" src="https://static.xx.fbcdn.net/images/emoji.php/v9/ta5/1.5/16/1f642.png" style="border: 0px; text-align: justify;" 
width="16" /></span></span><div style="text-align: justify;"><span style="font-family: Roboto;"><br /></span></div><div style="text-align: justify;"><span style="font-family: Roboto;">He also wanted to connect with me, so he shared a story with me. So, he started talking:</span></div></div></div><div class="x11i5rnm xat24cr x1mh8g0r x1vvkbs xtlvy1s x126k92a" style="background-color: white; color: #050505; font-size: 15px; margin: 0.5em 0px 0px; overflow-wrap: break-word; white-space: pre-wrap;"><div dir="auto" style="text-align: justify;"><span style="font-family: Roboto;">Roy:</span></div><div dir="auto" style="text-align: justify;"><span style="font-family: Roboto;"><br /></span></div></div><div class="x11i5rnm xat24cr x1mh8g0r x1vvkbs xtlvy1s x126k92a" style="background-color: white; color: #050505; font-size: 15px; margin: 0.5em 0px 0px; overflow-wrap: break-word; white-space: pre-wrap;"><div dir="auto" style="text-align: justify;"><i><span style="font-family: Roboto;">"I had a friend from Greece. But he died a few years back. "</span></i></div><div dir="auto" style="text-align: justify;"><i><span style="font-family: Roboto;">[...]</span></i></div><div dir="auto" style="text-align: justify;"><i><span style="font-family: Roboto;">"He hired me for my first job at Berkeley. A great economist and a great department chair. Strong Trotskyist. Back in the day, especially at Berkeley, economists were not afraid to declare their political views."</span></i></div><div dir="auto" style="text-align: justify;"><i><span style="font-family: Roboto;">[...]</span></i></div><div dir="auto" style="text-align: justify;"><i><span style="font-family: Roboto;">"I visited him in Greece, coming from Italy by ferry and then driving a long way down the Western part of Greece. He had a nice Polish mother and an American wife. 
He also had a young son; I loved playing with him."</span></i></div><div dir="auto" style="text-align: justify;"><i><span style="font-family: Roboto;">[...]</span></i></div><div dir="auto" style="text-align: justify;"><i><span style="font-family: Roboto;">“At some point, he left Berkeley and returned to Greece to start a new economics research center after the prime minister invited him.”</span></i></div><div dir="auto" style="text-align: justify;"><i><span style="font-family: Roboto;">[...]</span></i></div><div dir="auto" style="text-align: justify;"><span style="font-family: Roboto;">(NB: At this point, I understand he is talking about Andreas Papandreou, and I am starstruck listening to all the first-hand stories about him.)</span></div><div dir="auto" style="text-align: justify;"><i><span style="font-family: Roboto;">[...]</span></i></div><div dir="auto" style="text-align: justify;"><i><span style="font-family: Roboto;">“Well, when the dictatorship came, they arrested him. And there were rumors that the colonels may execute him.”</span></i></div><div dir="auto" style="text-align: justify;"><i><span style="font-family: Roboto;"><br /></span></i></div></div><div class="x11i5rnm xat24cr x1mh8g0r x1vvkbs xtlvy1s x126k92a" style="background-color: white; color: #050505; font-size: 15px; margin: 0.5em 0px 0px; overflow-wrap: break-word; white-space: pre-wrap;"><div dir="auto" style="text-align: justify;"><i><span style="font-family: Roboto;">“But he was a famous economist, very well-respected. 
The idea that a fellow academic may be executed, because of his beliefs, in a Western, allied country was unbelievable.”</span></i></div><div dir="auto" style="text-align: justify;"><i><span style="font-family: Roboto;"><br /></span></i></div></div><div class="x11i5rnm xat24cr x1mh8g0r x1vvkbs xtlvy1s x126k92a" style="background-color: white; color: #050505; font-size: 15px; margin: 0.5em 0px 0px; overflow-wrap: break-word; white-space: pre-wrap;"><div dir="auto" style="text-align: justify;"><i><span style="font-family: Roboto;">“So, the unthinkable happened. For the first time in history, 250 economists agreed on something. We wrote a letter demanding that the dictators release Andreas Papandreou immediately.”</span></i></div><div dir="auto" style="text-align: justify;"><i><span style="font-family: Roboto;"><br /></span></i></div></div><div class="x11i5rnm xat24cr x1mh8g0r x1vvkbs xtlvy1s x126k92a" style="background-color: white; color: #050505; font-size: 15px; margin: 0.5em 0px 0px; overflow-wrap: break-word; white-space: pre-wrap;"><div dir="auto" style="text-align: justify;"><i><span style="font-family: Roboto;">“We wrote a letter to the US President, Lyndon Johnson, asking him to intervene and get Andreas Papandreou out of jail."</span></i></div><div dir="auto" style="text-align: justify;"><i><span style="font-family: Roboto;"><br /></span></i></div><div dir="auto" style="text-align: justify;"><i><span style="font-family: Roboto;">"As part of a committee, Kenneth Galbraith, Kenneth Arrow, and I go to the White House to deliver the message. 
Johnson agrees to see us for five minutes.”</span></i></div><div dir="auto" style="text-align: justify;"><i><span style="font-family: Roboto;"><br /></span></i></div></div><div class="x11i5rnm xat24cr x1mh8g0r x1vvkbs xtlvy1s x126k92a" style="background-color: white; color: #050505; font-size: 15px; margin: 0.5em 0px 0px; overflow-wrap: break-word; white-space: pre-wrap;"><div dir="auto" style="text-align: justify;"><i><span style="font-family: Roboto;">“Panos, you may not be familiar with US Presidents, but Johnson was a rough Texan. He was not known for being gentle and polite, and his language was not exactly… presidential.”</span></i></div><div dir="auto" style="text-align: justify;"><i><span style="font-family: Roboto;"><br /></span></i></div></div><div class="x11i5rnm xat24cr x1mh8g0r x1vvkbs xtlvy1s x126k92a" style="background-color: white; color: #050505; font-size: 15px; margin: 0.5em 0px 0px; overflow-wrap: break-word; white-space: pre-wrap;"><div dir="auto" style="text-align: justify;"><i><span style="font-family: Roboto;">“So, after we talked to Johnson, he rolled his eyes, he picked up the phone, and said:”</span></i></div><div dir="auto" style="text-align: justify;"><i><span style="font-family: Roboto;"><br /></span></i></div></div><div class="x11i5rnm xat24cr x1mh8g0r x1vvkbs xtlvy1s x126k92a" style="background-color: white; color: #050505; font-size: 15px; margin: 0.5em 0px 0px; overflow-wrap: break-word; white-space: pre-wrap;"><div dir="auto" style="text-align: justify;"><i><b><span style="font-family: Roboto;">“Tell these fucking colonels to get this fucking economist out of jail.”</span></b></i></div><div dir="auto" style="text-align: justify;"><i><b><span style="font-family: Roboto;"><br /></span></b></i></div><div dir="auto" style="text-align: justify;"><i><b><span style="font-family: Roboto;">(and the rest is history)</span></b></i></div><div dir="auto" style="text-align: justify;"><i><b><span style="font-family: Roboto;"><br 
/></span></b></i></div></div><div class="x11i5rnm xat24cr x1mh8g0r x1vvkbs xtlvy1s x126k92a" style="background-color: white; color: #050505; font-size: 15px; margin: 0.5em 0px 0px; overflow-wrap: break-word; white-space: pre-wrap;"><div dir="auto" style="text-align: justify;"><span style="font-family: Roboto;">This is how Roy changed the history of Greece. </span></div><div dir="auto" style="text-align: justify;"><span style="font-family: Roboto;"><br /></span></div><div dir="auto" style="text-align: justify;"><span style="font-family: Roboto;">By getting Johnson to tell the fucking colonels to get that fucking economist out of jail. So that the fucking economist could go on to serve three times as prime minister of Greece, and become one of the most consequential prime ministers of the modern Greek republic.</span></div><div dir="auto" style="text-align: justify;"><span style="font-family: Roboto;"><br /></span></div></div><div class="x11i5rnm xat24cr x1mh8g0r x1vvkbs xtlvy1s x126k92a" style="background-color: white; color: #050505; font-size: 15px; margin: 0.5em 0px 0px; overflow-wrap: break-word; white-space: pre-wrap;"><!--more--><div dir="auto" style="text-align: justify;"><span style="font-family: Roboto;"><br /></span></div><div dir="auto" style="text-align: justify;"><span style="font-family: Roboto;">NYTimes: <a href="https://timesmachine.nytimes.com/timesmachine/1967/05/08/83601370.html?pageNumber=1&fbclid=IwAR30IsWXH8qaWatSYCN4JoGuq5dBcZc41aQj_Sn9_W8KAb3cRXJvAezcCG4" target="_blank">Johnson to Appeal to Save Jailed Son of Papandreou</a></span></div><div dir="auto" style="text-align: justify;"><span style="font-family: Roboto;">NYTimes: <a href="https://timesmachine.nytimes.com/timesmachine/1967/05/03/107188981.html?pageNumber=43&fbclid=IwAR0SzDUnGfo0-RCa2MILF6u7S7msr9SsQCRBubcqk9OQ-ZsPfJEbqSO8qu4" target="_blank">Letters to the Editor of The Times - Andreas Papandreou</a></span></div><div dir="auto" style="font-family: inherit; text-align: justify;"><br
/></div></div>Panos Ipeirotishttp://www.blogger.com/profile/15283752183704062501noreply@blogger.comtag:blogger.com,1999:blog-7118563403027467631.post-68183479059560070462021-11-23T15:22:00.000-05:002021-11-23T15:22:09.547-05:00"Geographic Footprint of an Agent" or one of my favorite data science interview questions<p> Last week we wrote in the Compass blog how we <a href="https://medium.com/compass-true-north/estimating-the-geographic-area-of-agent-18bfd45657c6" target="_blank">estimate the geographic footprint of an agent</a>.</p><p>At the very core, the technique is simple: Use the addresses of the houses that an agent has bought or sold in the past; get their longitude and latitude; and then apply a 2-dimensional kernel density estimation to find what are the areas where the agent is likely to be active. Doing the kernel density estimation is easy; the fundamentals of our approach are material that you can find in tutorials for applying a KDE. There are two interesting twists that make the approach more interesting:</p><p></p><ol style="text-align: left;"><li>How can we standardize the "geographic footprint" <b>score</b> to be interpretable? The density scores that come back from a kernel density application are very hard to interpret. Ideally, we want a score from 0 to 1, with 0 being "completely outside of the area of activity" and 1 being "as important as it gets". We show how to use a percentile transformation of the likelihood values to create a score that is normalized, interpretable, and very well calibrated.</li><li>What are the metrics for evaluating such a technique? We show how we can use the concept of "recall-efficiency" curves to provide a common way to evaluate the models.</li></ol><p></p><p><a href="https://medium.com/compass-true-north/estimating-the-geographic-area-of-agent-18bfd45657c6" target="_blank">You can read more in the blog post. 
</a></p><span><a name='more'></a></span><p>Despite its simplicity, this topic ended up being an amazing interview question. I think it is a great question for separating candidates who have a deeper knowledge of data science from those with only a superficial understanding.</p><div>The typical question during the interview is:</div><p></p><blockquote>"You have a property where the owner wants to sell the house. How can you determine if the property is <i>geographically </i>relevant to a particular agent? For each agent, we know all their past transactions, and we can assume that future behavior is captured well by their past behavior. For all their past transactions, you can get the address, zip code, longitude, and latitude. <i>Do not try, for now, to figure out if the agent is the <b>best </b>one among many candidate agents; that is a harder problem. </i>Just figure out if the agent is active in the location where the property lies."</blockquote><p>I am always surprised by how many people start on the wrong foot, trying to shoehorn the problem into a binary classification problem. Some indicative answers:</p><p></p><ul style="text-align: left;"><li>"I will find all the properties that the agent has <b>not </b>transacted in the past, and treat them as negatives".</li><li>"I will keep showing properties to the agent and when they say no, I will mark them as negatives"</li><li>"I will add features like bathrooms, bedrooms, price, and then predict if it is relevant or not"</li></ul><div>In my experience, once people start like that, they rarely get to a solution that works. The key problem is that they have no <i><b>easy </b></i>way of evaluating the model they propose. </div><div><br /></div><div>So, a typical follow-up question is:</div><div><blockquote>How would you evaluate your approach?</blockquote></div><div>This now gets interesting. 
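(As an aside, for readers more interested in the technique than the interview dynamics: the scoring approach, a 2-dimensional kernel density estimate plus a percentile transformation of the density values, can be sketched in a few lines. This is an illustrative toy version with made-up coordinates and a hand-rolled Gaussian KDE, not Compass's production code.)

```python
# Illustrative sketch: score how geographically relevant a point is
# to an agent, using a naive 2-D Gaussian KDE over past transactions
# plus a percentile transform of the density values. Toy data only.
import numpy as np

def kde_scores(train, queries, bw=0.005):
    """Unnormalized Gaussian kernel density at each query point."""
    d2 = ((queries[:, None, :] - train[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * bw * bw)).mean(axis=1)

rng = np.random.default_rng(0)
# Hypothetical past transactions of one agent, clustered around
# (latitude, longitude) = (40.73, -73.99):
past = rng.normal(loc=[40.73, -73.99], scale=0.01, size=(200, 2))

# Reference distribution: the density at each past transaction,
# used for the percentile transformation.
ref = np.sort(kde_scores(past, past))

def footprint_score(lat, lon):
    """0-1 score: fraction of past transactions with lower density."""
    d = kde_scores(past, np.array([[lat, lon]]))[0]
    return float(np.searchsorted(ref, d) / len(ref))

print(footprint_score(40.73, -73.99))  # inside the footprint: near 1
print(footprint_score(41.50, -73.99))  # far outside: 0.0
```

The percentile transform is what makes the score interpretable: instead of a raw likelihood, it reports how the queried location compares with the agent's own historical activity.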
Common superficial answers are:</div><div><ul style="text-align: left;"><li>I would use precision and recall, or AUC.</li><li>I would measure how often agents accept my recommendations.</li></ul></div><div>If they recommend precision and recall, I ask how they plan to measure recall (relatively easy; hopefully the answer is a temporal training/test split) and precision. Precision gets tricky, as there is no "base rate" and there are no clear "negatives".</div><div><br /></div><div>If someone recommends measuring the agents' reactions, I acknowledge that this is correct in principle, but ask them to propose an offline evaluation, so that we can test the approach before releasing it to the agents. </div><div><br /></div><div>This is typically the point where the better candidates, even if they veered off track, will start moving towards a workable solution (not necessarily KDE; it can also be clustering, convex hulls, histograms over zip codes as vector representations, etc.), while the one-trick ponies will reveal their lack of substance.</div><div><br /></div><div>I am kind of sorry that I have to retire this question from my question bank. I am still impressed that such a simple question had such strong discriminatory power.</div><div><br /></div><div><br /></div><p></p><p></p>Panos Ipeirotishttp://www.blogger.com/profile/15283752183704062501noreply@blogger.comtag:blogger.com,1999:blog-7118563403027467631.post-73388689879495034142019-11-18T16:27:00.004-05:002019-11-18T17:08:13.821-05:00Mechanical Turk, 97 cents per hour, and common reporting biases<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The New York Times has an article about Mechanical Turk in today's print edition: "<a href="https://www.nytimes.com/interactive/2019/11/15/nyregion/amazon-mechanical-turk.html">I Found Work on an Amazon Website. I Made 97 Cents an Hour</a>". (You will find a couple of quotes from yours truly.)</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The content of the article follows the current zeitgeist: Tech companies exploiting gig workers. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
While it is hard to deny that there are tasks on MTurk that are really bad, I think the article paints an unfairly gloomy picture of the overall platform.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Here are a few of the issues:</div>
<div>
<ul>
<li style="text-align: justify;"><b>Availability and survivorship bias. </b>While the article does accurately describe the cesspool of low-paying tasks that are available on Mechanical Turk, it fails to convey the fact that these tasks are available on the platform <i><b>because nobody wants to work on them</b></i>. The tasks that are easily available to everyone are the ones that nobody competes to grab: low-paying, badly designed tasks.</li>
<li style="text-align: justify;"><b>The activity levels of workers follow a power-law</b>. We have plenty of evidence that a significant part of the work on MTurk is done by a small minority of workers. While it is hard to have a truly accurate measurement of what percent of the workers do what percent of the tasks, <a href="https://en.wikipedia.org/wiki/1%25_rule_(Internet_culture)">the 1% rule</a> is a good approximation. For example, in my demographic surveys, where I explicitly limit the participation to only once per month, 50% of the responses come from 5% of the participants. Expect the bias to be much stronger in other, more desirable tasks. Such a heavily biased propensity to participate introduces strong sampling problems when trying to find the right set of workers to interview.</li>
<li style="text-align: justify;"><b>Doing piecemeal work while untrained results in low pay</b>. This is a pet peeve of mine, for all the articles of the type "I tried working on MTurk / driving Uber / delivering packages / etc / and I got lousy pay". Well, if you work piecemeal on <b><i>any </i></b>task, the tasks will take a very long time initially, and the hourly wage will suck. This will hold for Turking, coding, lawyering, or anything else. If someone decides to become a freelance journalist, the first few articles will result in abysmally bad hourly wages as well; expert freelance writers often charge 10x the rates that beginner freelance writers charge, if not more. I am 100% confident that the same applies to MTurk workers as well: <b><i>Experienced workers make 10x what beginners make.</i></b></li>
</ul>
</div>
<div style="text-align: justify;">
Having said that, I do agree that Amazon could prohibit tasks that are obviously paying very little (as a rule of thumb, it is impossible to get paid more than minimum wage when the HIT is paying less than 5c/task). But I also think that regular workers are smart enough to know that and avoid such tasks. </div>
Panos Ipeirotishttp://www.blogger.com/profile/15283752183704062501noreply@blogger.comtag:blogger.com,1999:blog-7118563403027467631.post-4604544329103673672018-11-16T13:32:00.000-05:002018-11-16T13:32:28.511-05:00Distribution of paper citations over timeA few weeks ago we had a discussion about citations, and how we can compare the citation impact of papers that were published in different years. Obviously, older papers have an advantage as they have more time to accumulate citations.<br />
<br />
To compare papers, just for fun, we ended up opening the profile page of each paper in Google Scholar, and we analyzed the paper citations year by year to find the "winner." (They were both great papers, by great authors, fyi. It was more of a "Lebron vs. Jordan" discussion, as opposed to anything serious.)<br />
<br />
This process got me curious though. Can we tell how a paper is doing at any given point in time? How can we compare a 2-year-old article, published in 2016, with 100 citations against a 10-year-old document, published in 2008, with 500 citations?<br />
<br />
To settle the question, we started with the profiles of faculty members in the top-10 US universities and downloaded about 1.5M publications, across all fields, and their citation histories over time.<br />
<br />
We then analyzed the citation histories of these publications, and, for each year, we ranked the papers based on the number of citations received over time. Finally, we computed the citation numbers corresponding to different percentiles of performance.<br />
<br />
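The ranking step described above can be sketched with pandas. The data below is made up for illustration (the column names and numbers are not from the actual dataset): for each paper, we compute its citation percentile among papers of the same age.

```python
import pandas as pd

# Illustrative data: cumulative citations for each paper at a given age (years).
papers = pd.DataFrame({
    "paper_id":  ["a", "b", "c", "d", "e", "f"],
    "age":       [5, 5, 5, 10, 10, 10],
    "citations": [20, 50, 90, 50, 100, 200],
})

# Rank each paper against the other papers of the same age, as a percentile.
papers["percentile"] = papers.groupby("age")["citations"].rank(pct=True) * 100

print(papers.sort_values(["age", "percentile"]))
```

On real data, the same groupby-rank would run per publication year (age cohort), which is all that is needed to place a paper on the percentile curves below.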
<b>Cumulative percentiles</b><br />
<b><br /></b>
The plot below shows the number of citations that a paper needs to have at different stages to be placed in a given percentile.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEieNZ9OnfTGH6iayyv92b7V3B0yfSBP323qB-_gA1vtRRzvydwahPCGs9YqxLE4TbO5XaX-4qChkuOtXa7mjVwXulLrMaTi_qsIuI4IvHoY5wAnN_4WQ1xqPfJRm_g6N1Dxt70poaYJmN8/s1600/cummulative-citations.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="800" data-original-width="1600" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEieNZ9OnfTGH6iayyv92b7V3B0yfSBP323qB-_gA1vtRRzvydwahPCGs9YqxLE4TbO5XaX-4qChkuOtXa7mjVwXulLrMaTi_qsIuI4IvHoY5wAnN_4WQ1xqPfJRm_g6N1Dxt70poaYJmN8/s640/cummulative-citations.png" width="640" /></a></div>
<br />
A few data points, focusing on certain age milestones: 5-years after publication, 10-years after publication, and lifetime.<br />
<br />
<ul>
<li><b>50% line:</b> The performance of a "median" paper. The median paper gets around 20 citations 5 years after publication, 50 citations within 10 years, and around 90 citations in its lifetime. <b><i>Milestone scores: 20,50,90</i></b></li>
<li><b>75% line:</b> These papers perform "better," citation-wise than 75% of the remaining papers with the same age. Such papers get around 50 citations within 5 years, 100 citations within 10 years of publication, and around 200 citations in their lifetime. <b><i>Milestone scores: 50,100,200</i></b></li>
<li><b>90% line:</b> These papers perform better than 90% of the papers in their cohort. Around 90 citations within 5 years, 200 citations within 10 years, and 500 citations in their lifetime. <b><i>Milestone scores: 90,200,500</i></b></li>
</ul>
<div>
<b><i><br /></i></b></div>
<br />
<b>Yearly percentiles and peak years</b><br />
<b><br /></b>
We also wanted to check at which point papers reach their peak and start collecting fewer citations. The plot below shows the percentiles based on the yearly numbers of accumulated citations. The vast majority of papers reach their peak 5-10 years after publication, after which the number of yearly citations starts declining.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgtt_fDOV9b7Vn-fN6FUHwYGc3HxLu47YFu_qrzS6FDq0NQn5VfijcIZbinB9uoy6IgXYkaM8a4oMcuWNvtUEiBIQSjSTd6kwW4iRmxTlPR8O0BvM2nUInSPZcEEHUDtofXCAifOEZIBBo/s1600/yearly-citations.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="800" data-original-width="1600" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgtt_fDOV9b7Vn-fN6FUHwYGc3HxLu47YFu_qrzS6FDq0NQn5VfijcIZbinB9uoy6IgXYkaM8a4oMcuWNvtUEiBIQSjSTd6kwW4iRmxTlPR8O0BvM2nUInSPZcEEHUDtofXCAifOEZIBBo/s640/yearly-citations.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
Below is the plot of the peak year for a paper based on the paper percentile:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEilL3XfeLMMSsmDek5ZEChyphenhyphenOq93hWv1aFXP2hf5T8eruq72IuzVoZ9EucMWDM84wOCqbD9-8N9-nLFP-4MaEhQLlH_Be14HmMp1ElTgEcf293vcVWCUJnuMss85JjAaMBe3PtgsxH630pA/s1600/best-year.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="338" data-original-width="631" height="213" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEilL3XfeLMMSsmDek5ZEChyphenhyphenOq93hWv1aFXP2hf5T8eruq72IuzVoZ9EucMWDM84wOCqbD9-8N9-nLFP-4MaEhQLlH_Be14HmMp1ElTgEcf293vcVWCUJnuMss85JjAaMBe3PtgsxH630pA/s400/best-year.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
There is an interesting effect around the 97.5% percentile: Above that level, a 'rich-get-richer' effect seems to kick in, and we effectively do not observe a peak year. The number of <b><i>citations per year</i></b> keeps increasing. You could call these papers the "classics".<br />
<br />
What does it take to be a "classic"? 200 citations at 5 years or 500 citations at 10 years.Panos Ipeirotishttp://www.blogger.com/profile/15283752183704062501noreply@blogger.comtag:blogger.com,1999:blog-7118563403027467631.post-86396724444437973912018-01-29T09:53:00.001-05:002018-01-31T09:07:57.320-05:00How many Mechanical Turk workers are there?<i>TL;DR: There are about 100K-200K unique workers on Amazon. </i><i>On average, there are 2K-5K workers active on Amazon at any given time, which is equivalent to having 10K-25K full-time employees. On average, 50% of the worker population changes within 12-18 months. </i><i>Workers exhibit widely different patterns of activity, with most workers being active only occasionally, and a few workers being very active. Combining our results with the results from <a href="https://arxiv.org/abs/1712.05796">Hara et al</a>, we see that MTurk has a yearly transaction volume of a few hundreds of millions of dollars.</i><br />
<i><br /></i>
<i>For more details read below, or take a look <a href="http://www.ipeirotis.com/?publication=demographics-and-dynamics-of-mechanical-turk-workers">at our WSDM 2018 paper</a>.</i><br />
<br />
<div style="text-align: center;">
--</div>
<br />
<b>Question</b><br />
<br />
A topic that frequently comes up when discussing Mechanical Turk is "how many workers are there on the platform"?<br />
<br />
In general, this is a question that is very easy for Amazon to answer, but much harder for outsiders. Amazon claims that there are 500,000 workers on the platform. How can we check the validity of this statement?<br />
<br />
<div style="text-align: center;">
--</div>
<br />
<b>Basic capture-recapture model</b><br />
<br />
A common technique for this problem is the <b>capture-recapture</b> technique, which is widely used in the field of ecology to measure the population of a species.<br />
<br />
The simplest possible technique is the following:<br />
<ul>
<li><b>Capture/marking </b>phase: Capture $n_1$ animals, mark them, and release them back. </li>
<li><b>Recapture </b>phase: A few days later, capture $n_2$ animals. Assuming there are $N$ animals overall, $n_1/N$ of them are marked. So, for each of the $n_2$ captured animals, the probability that the animal is marked is $n_1/N$ (from the capture/marking phase).</li>
<li><b>Calculation</b>: In expectation, we will see $n_2 \cdot \frac{n_1}{N}$ marked animals in the recapture phase. (Notice that we do not know $N$.) So, if we actually see $m$ marked animals during the recapture phase, we set $m = n_2 \cdot \frac{n_1}{N}$ and we get the estimate that:<br /><br /><div style="text-align: center;">
$N = \frac{n_1 \cdot n_2}{m}$.</div>
</li>
</ul>
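In code, this (Lincoln-Petersen) estimator is a one-liner. The numbers below are hypothetical, purely to show the mechanics:

```python
def lincoln_petersen(n1: int, n2: int, m: int) -> float:
    """Estimate population size N from a two-phase capture-recapture experiment.

    n1: individuals captured and marked in the first phase
    n2: individuals captured in the second phase
    m:  marked individuals among the n2 recaptured
    """
    if m == 0:
        raise ValueError("No recaptures: the estimate is undefined.")
    return n1 * n2 / m

# Hypothetical surveys: 1,000 workers captured, 1,000 recaptured, 100 overlap.
print(lincoln_petersen(1000, 1000, 100))  # 10000.0
```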
<div>
In our setting, we adapted the same idea, where "capture" and "recapture" correspond to participating in a demographics survey. In other words, we "capture/mark" MTurk workers that complete the survey on one day. Then, on another day, we "recapture" by surveying more workers and see how many workers overlap between the two surveys.<br />
<br />
<div style="text-align: center;">
--</div>
<b><br /></b>
<b>First (naive) attempt</b><br />
<br />
We decided to apply this technique to estimate the size of the Mechanical Turk population. As the "capture" period, we considered the set of surveys running over a period of 30 days. As the "recapture" period, we considered the surveys that we ran over another 30-day period. The plot below shows the results.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh0m23rFIOGqrEIgS6hdLVNX1U3UwZMHqFENXLcOFwdc5uaGepy8VJdFEvKThI2GmvJjVUe-z6F0yaJeDUpAxRT-oJdI7nWuNZsEljvwDdaR8ZnCNvZJuuATZeb9AfPZt1DtN9E_Og-YQc/s1600/population+estimates.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1017" data-original-width="1580" height="410" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh0m23rFIOGqrEIgS6hdLVNX1U3UwZMHqFENXLcOFwdc5uaGepy8VJdFEvKThI2GmvJjVUe-z6F0yaJeDUpAxRT-oJdI7nWuNZsEljvwDdaR8ZnCNvZJuuATZeb9AfPZt1DtN9E_Og-YQc/s640/population+estimates.PNG" width="640" /></a></div>
<br /></div>
<div>
The x-axis shows the beginning of the recapture period, and the y-axis the estimate of the number of workers. The color of each dot corresponds to the difference in time between the capture-recapture periods: black is a short time, and red is a long time.<br />
<br />
If we focus on the black-color dots (~60 days between the surveys), we get a (naive) estimate of around 10K-15K workers. <i>(Warning: this is incorrect.)</i><br />
<br />
While we could stop here, we see some results that are not consistent with our model. Remember that color encodes the time between samples: black for a short time (~2 months) between samples, red for a long time (~2 yrs) between samples. Notice that, as the time between the two periods increases, the estimates become higher, and we get the "rainbow cake" effect in the plot. For example, for July 2017, our estimate is 12K workers if we compare against a capture from May 2017, but the estimate goes up to 45K workers if we compare against a sample from May 2015. Our model, though, says that the time between captures should <b><i>not </i></b>affect the population estimates. This indicates that there is something wrong with the model.<br />
<br />
<div style="text-align: center;">
--</div>
<br />
<b>Assumptions of basic model</b><br />
<br /></div>
<div>
The basic capture-recapture estimation described above relies on a couple of assumptions. Both of these assumptions are violated when applying this technique to an online environment.</div>
<div>
<ul>
<li><b>Assumption of no arrivals / departures ("closed population")</b>: The vanilla capture-recapture scheme assumes that there are no arrivals or departures of workers between the capture and recapture phase.</li>
<li><b>Assumption of no selection bias ("equal catchability")</b>: The vanilla capture-recapture scheme assumes that every worker in the population is equally likely to be captured.</li>
</ul>
<div>
In ecology, the issue of closed populations has been examined under many different settings (birth-death of animals, immigration, spatial patterns of movement, etc.), and there are many research papers on the topic. Catchability has received comparatively less attention. This is reasonable: in ecology, the assumption of a closed population is problematic in many settings, while assuming that the probability of capturing an animal is uniform among similar animals is reasonable. Typically, the focus is on segmenting the animals into groups (e.g., nesting females vs. hunting males) and assigning different catchability to groups (but not to individuals). </div>
<div>
<br /></div>
<div>
In online settings, though, the assumption of equal catchability is more problematic. First, we have <b>activity bias</b>: Workers exhibit very different levels of activity, and a worker who works every day is much more likely to see and complete a task than someone who works once a month. Second, we have <b>selection bias</b>: Some workers may like to complete surveys, while others may avoid such tasks.<br />
<br />
So, to improve our estimates, we need to use models that alleviate these assumptions.</div>
<div style="text-align: center;">
<br /></div>
<div style="text-align: center;">
--</div>
</div>
<div style="text-align: center;">
<br /></div>
<div style="text-align: left;">
<b>Endowing workers with survival probabilities </b></div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
We can extend the model by giving each worker a certain survival probability, allowing workers to "disappear" from the platform. In the plot above, we can see that the population estimate increases as the time between the two samples increases. This hints that workers leave the platform, and the overlap between the capture and recapture samples becomes smaller over time. </div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
If we account for that, we can get an estimate that the "half-life" of a Mechanical Turk worker is between 12-18 months. <b>In other words, approximately 50% of the Mechanical Turk population changes every 12-18 months. </b></div>
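To make the half-life figure concrete: under a simple geometric-survival assumption (each worker independently stays on the platform from one month to the next with a fixed probability $p$, which is an assumption of this sketch, not the exact model of the paper), a half-life of $H$ months implies $p = 0.5^{1/H}$:

```python
def monthly_retention(half_life_months: float) -> float:
    """Monthly retention probability p implied by a half-life H: p**H = 0.5."""
    return 0.5 ** (1.0 / half_life_months)

for h in (12, 18):
    print(f"half-life {h} months -> monthly retention {monthly_retention(h):.4f}")
```

So a 12-18 month half-life corresponds to roughly 94-96% of workers staying on the platform from one month to the next.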
<div style="text-align: left;">
<br /></div>
<div style="text-align: center;">
--</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
<b>Endowing workers with propensity to participate</b></div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
We can also extend the model by associating a certain <i>propensity </i>with each worker. The propensity is the probability that a worker is active and willing to participate in a task at any given time.<br />
<br />
In our work, we assumed that the underlying "propensity to participate" follows a Beta distribution across the worker population, with unknown parameters. When the propensities follow a Beta distribution, the number of times k that a worker participates in the survey follows a <a href="https://en.wikipedia.org/wiki/Beta-binomial_distribution">Beta Binomial distribution</a>. Since we know how many workers participated k times in our surveys, it is then easy to estimate the underlying parameters of the Beta distribution.<br />
<br />
Notice that we had to depart from the simple "two occasion" model above, and instead use multiple capturing periods over time. Intuitively, workers that have high propensity to participate will appear many times in our results, while inactive workers will appear only a few times.</div>
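The estimation step can be sketched as follows. This is a simplified maximum-likelihood fit on synthetic counts, assuming a fixed number of survey occasions per worker; the actual procedure in the paper is more involved:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import betabinom

rng = np.random.default_rng(42)
n_surveys = 30  # survey occasions each worker could have joined (illustrative)

# Synthetic "ground truth": propensities drawn from Beta(0.3, 20),
# participation counts drawn from Binomial(n_surveys, propensity).
propensities = rng.beta(0.3, 20.0, size=5000)
counts = rng.binomial(n_surveys, propensities)

def neg_log_lik(params):
    a, b = params
    if a <= 0 or b <= 0:
        return np.inf
    # Beta-Binomial likelihood of the observed participation counts.
    return -betabinom.logpmf(counts, n_surveys, a, b).sum()

# Recover the Beta(a, b) parameters from the counts alone.
result = minimize(neg_log_lik, x0=[1.0, 10.0], method="Nelder-Mead")
a_hat, b_hat = result.x
print(f"estimated Beta({a_hat:.2f}, {b_hat:.2f})")
```

The fit should land near the true Beta(0.3, 20), illustrating that the participation histogram alone pins down the propensity distribution.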
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
By doing this analysis, we can observe that (as expected) the distribution of activity is highly skewed: A few workers are very active in the platform, while others are largely inactive. A nice property of the Beta distribution is its flexibility: Its shape can be pretty much anything: uniform, Gaussian-like, bimodal, heavy-tailed... you name it. </div>
<div style="text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgO6aQ7E-A8quBKIufXusElm8gEfEfNS7NdZsD16BkRPLxe3BICrIpTvybtkUUA30_MzU9TmD9nVRVTshMU2RLCJ0URFRm-OIQuitZ378_svgVlhlgDQFVjOTgwzddJ-F2GbaT_8MsbP8w/s1600/beta-histogram.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="521" data-original-width="978" height="340" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgO6aQ7E-A8quBKIufXusElm8gEfEfNS7NdZsD16BkRPLxe3BICrIpTvybtkUUA30_MzU9TmD9nVRVTshMU2RLCJ0URFRm-OIQuitZ378_svgVlhlgDQFVjOTgwzddJ-F2GbaT_8MsbP8w/s640/beta-histogram.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: center;">
<br />
<div style="text-align: left;">
<br />
In our analysis, we estimated that the propensity distribution follows a Beta(0.3,20) distribution. We plot above the "inverse CDF" of the distribution (Inverse CDF: "what percentage of the workers have propensity higher than x").<br />
<br />
As you can see, the propensity follows a familiar (and expected) pattern. Only 0.1% of the workers have propensity higher than 0.2, and only 10% have propensity higher than 0.05.<br />
<br />
Intuitively, a propensity of 0.2 means that the worker is active and willing to participate 20% of their time (this is roughly equivalent to a full-time level of activity; full-time employees work around 2000 hrs per year, out of 24*365 available hours in a year). A propensity of 0.05 means that the worker is active and available approximately 24 hr * 0.05 ~ 1 hour per day.<br />
<br /></div>
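The two percentages quoted above can be read off the survival function of the fitted Beta(0.3, 20); a quick check with scipy:

```python
from scipy.stats import beta

# Fitted propensity distribution from the analysis above.
dist = beta(0.3, 20)

# Fraction of workers with propensity above a threshold (the "inverse CDF" plot).
print(f"P(propensity > 0.20) = {dist.sf(0.20):.4f}")  # roughly full-time activity
print(f"P(propensity > 0.05) = {dist.sf(0.05):.4f}")  # roughly 1 hour per day
```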
--<br />
<br /></div>
<div style="text-align: left;">
<b>How big is the platform?</b></div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
So, how many workers are there? Under such highly skewed distributions, giving an exact number of workers is rather futile. The best that you can do is give a ballpark estimate, and hope to be roughly correct on the order of magnitude. What our estimates are showing is that <b>there are around 180K distinct workers on the MTurk platform</b>. This is good news for anyone who is trying to reach a large number of distinct workers through the platform. </div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
Our analysis also allows us to estimate how many workers are active and willing to participate in our task at any given time. We estimate that around <b>2K to 5K workers are available at any given time</b>. Converting this number to full-time-employee equivalents, this corresponds to 10K-25K full-time workers.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
The latter part also allows us to give some low and high estimates on the transaction volume of MTurk. </div>
<div style="text-align: left;">
<ul>
<li><b>Lower bound</b>: Assuming 2K workers active at any given time, this is 2000*24*365=17,520,000 work hours in a year. If we assume that <a href="https://arxiv.org/abs/1712.05796">the median wage is \$2/hr</a>, this is roughly <b>\$35M/yr </b>transaction volume on Amazon Mechanical Turk (with Amazon netting ~\$7M in fees).</li>
<li><b>Upper bound</b>: Assuming 5K workers active at any given time, this is 5000*24*365=43,800,000 work hours in a year. If we assume average wage of \$12/hr, this is around <b>\$525M/yr</b> transaction volume (with Amazon netting ~$100M in fees).</li>
</ul>
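The arithmetic behind the two bounds, spelled out:

```python
HOURS_PER_YEAR = 24 * 365  # workers active "at any given time" -> work-hours/year

def yearly_volume(active_workers: int, hourly_wage: float) -> float:
    """Annual transaction volume implied by a steady number of active workers."""
    return active_workers * HOURS_PER_YEAR * hourly_wage

lower = yearly_volume(2_000, 2.0)   # 2K workers at the $2/hr median wage
upper = yearly_volume(5_000, 12.0)  # 5K workers at a $12/hr average wage
print(f"${lower / 1e6:.0f}M - ${upper / 1e6:.0f}M per year")
```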
<div>
I understand that a range of \$35M to \$500M may not be very helpful, but these are very rough estimates. If someone wanted my own educated guess, I would put it somewhere in the middle of the two, i.e., transaction volume of a few hundreds of millions of dollars.</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
</div>
Panos Ipeirotishttp://www.blogger.com/profile/15283752183704062501noreply@blogger.comtag:blogger.com,1999:blog-7118563403027467631.post-84068740347514533472017-01-17T11:42:00.000-05:002017-01-17T16:08:41.997-05:00Why was my Amazon Mechanical Turk registration denied?(<a href="https://www.quora.com/Why-was-my-Amazon-Mechanical-Turk-registration-denied/answer/Panos-Ipeirotis?srid=wO">This is my answer to a question posted on Quora</a>)<br />
<br />
Mechanical Turk is a platform for work. Workers get paid, which now makes Amazon a payment processor. Payment processors move money on behalf of other people, and are therefore under heavy scrutiny from the US government for issues related to anti-money laundering (AML), counter-terrorism, tax compliance, etc.<br />
<br />
One of the key things required from financial institutions is to have a <a href="https://en.wikipedia.org/wiki/Customer_Identification_Program">“Customer Identification Program” (CIP),</a> also known as a <a href="https://en.wikipedia.org/wiki/Know_your_customer">“Know Your Customer” (KYC) </a>process. The CIP/KYC is a set of procedures that a financial institution needs to follow to establish that it knows the true identity of a customer. The processes that each financial institution follows vary, and the exact processes are rarely available to the public, as they are considered security measures. Furthermore, the practices are regularly monitored by regulators (OCC, Fed, FinCEN, etc.) and change over time to follow best practices.<br />
<br />
In your particular case, the most likely reason is that Amazon was not able to verify your identity.<br />
<br />
If you are in the US, Amazon most probably can get your SSN and other personal details and verify whether you are a real person. However, even if you live in the US, if you have no credit history, no bank accounts, and so on, the verification will come back with low confidence. Following standard risk management processes, Amazon could plausibly reject such applications as part of their CIP processes: it is better to have a false negative (rejecting a normal account) than a false positive (e.g., accepting an account that will be involved in money laundering or tax-evasion schemes).<br />
<br />
For other countries, Amazon's ability to follow CIP/KYC processes that conform to US regulations varies. I assume, for example, that the cooperation of the US with UK or Australian authorities is much smoother than with, say, Chinese authorities. So, if you live outside the US, the probability of having your account approved depends on how reliably Amazon can verify individual identities in your country.<br />
<br />
Given that Amazon gets paid by requesters, I assume their focus is to establish CIP processes first in regions where potential requesters reside, which is not always the place where workers reside. This also means that you are more likely to be approved if you first register as a requester (assuming this is an option for you), and then try to create the worker account.<br />
<br />Panos Ipeirotishttp://www.blogger.com/profile/15283752183704062501noreply@blogger.comtag:blogger.com,1999:blog-7118563403027467631.post-92130853776941783612016-03-13T14:10:00.003-04:002016-08-02T11:39:37.585-04:00AlphaGo, Beat the Machine, and the Unknown UnknownsIn Game 4, of the 5-game series between AlphaGo and Lee Sedol, the human Go champion, Lee Sedol <a href="http://www.nytimes.com/aponline/2016/03/13/world/asia/ap-as-skorea-game-human-vs-computer.html">managed to get his first win</a>. According to the NY Times article:<br />
<br />
<blockquote class="tr_bq">
Lee had said earlier in the series, which began last week, that he was unable to beat AlphaGo because he could not find any weaknesses in the software's strategy. But after Sunday's match, the 33 year old South Korean Go grandmaster, who has won 18 international championships, said he found two weaknesses in the artificial intelligence program. <b>Lee said that when he made an unexpected move, AlphaGo responded with a move as if the program had a bug, indicating that the machine lacked the ability to deal with surprises.</b></blockquote>
<br />
<hr width="50%" />
<br />
This part reminded me of one of my favorite papers: <b><a href="http://www.ipeirotis.com/?publication=beat-the-machine-challenging-humans-to-find-a-predictive-models-unknown-unknowns">Beat the Machine: Challenging Humans to Find a Predictive Model’s “Unknown Unknowns”</a>. </b><br />
<br />
In the paper, we tried to use humans to "beat the machine" and identify vulnerabilities in a machine learning system. The key idea was to reward humans whenever they identified cases where the machine fails while being confident that it provides the correct answer. In other words, we encouraged humans to find "unexpected" errors, not just cases where the machine was naturally going to be uncertain.<br />
<br />
<hr width="50%" />
<br />
As an example case, consider a system that detects adult content on the web. Our baseline machine learning system had an accuracy of ~99%. We then asked Mechanical Turk workers to do the following task: Find web pages with adult content that the machine learning system classifies as non-adult with high confidence. The humans had no information about the system; the only thing they could do was submit a URL and get back an answer.<br />
<br />
The reward structure was the following: Humans get \$1 for each URL that the machine misses, otherwise they get \$0.001. In other words, we provided a strong incentive to find problematic cases.<br />
<br />
After some probing, humans were quick to uncover underlying vulnerabilities: For example, adult pages in Japanese, Arabic, etc., were classified by our system as non-adult, despite their obvious adult content. Similarly for other categories, such as hate speech, violence, etc. <b>Humans were quickly able to "beat the machine" and identify the "unknown unknowns".</b><br />
<br />
<hr width="50%" />
<br />
Simply put, humans were able to figure out which cases the system was likely to have missed during training. At the end of the day, the training data is provided by humans, and no system has access to all possible training data. We operate in an "open world", while training data implicitly assumes a "closed world".<br />
<br />
As we see from the AlphaGo example, since most machine learning systems rely on the existence of training data (or some immediate feedback for their actions), machines may get into trouble when they face examples that are unlike any examples they have seen in their training data.<br />
<br />
We designed our Beat The Machine system to encourage humans to discover such vulnerabilities early.<br />
<br />
In a sense, our BTM system is like hiring hackers to break into your network, to identify security vulnerabilities before they become a real problem. The BTM system applies this principle to machine learning systems, encouraging a period of intense probing for vulnerabilities before deploying the system in practice.<br />
<br />
Well, perhaps Google hired Lee Sedol with the same idea: Get the human to identify cases where the machine will fail, and reward the human for doing so. Only in that case, AlphaGo managed to eat its cake (figure out a vulnerability) and have it too (beat Lee Sedol, and not pay the \$1M prize) :-)Panos Ipeirotishttp://www.blogger.com/profile/15283752183704062501noreply@blogger.comtag:blogger.com,1999:blog-7118563403027467631.post-19565406162079001762016-02-29T10:33:00.001-05:002016-03-01T01:21:25.250-05:00A Cohort Analysis of Mechanical Turk Requesters<div style="text-align: justify;">
In my last post, I examined the number of "active requesters" on Mechanical Turk, and concluded that there is a significant decline in the numbers over the last year. The definition of "active requester" was: "<i>A requester is active at time X if he has a HIT running at time X</i>". A potential issue with this definition is that an improvement in the speed of HIT completion (e.g., due to increased labor supply) could drive down that number.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
For this reason, I decided to perform a proper <a href="https://en.wikipedia.org/wiki/Cohort_analysis">cohort analysis</a> of the requesters on Mechanical Turk. In the cohort analysis that follows, we examine how many requesters that first appeared on the platform in a given month (say, September 2015) are still posting tasks in the subsequent months.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Here is the resulting "layer cake" plot that shows what happens in each cohort. Each of the layers corresponds to requesters that were first seen in a given month. (<a href="https://gist.github.com/ipeirotis/ce26e0e76a5192f89c2e">code</a>, <a href="https://gist.github.com/ipeirotis/6af638e971537b0f9524">data</a>) <span style="font-size: x-small;">(<a href="http://tomblomfield.com/post/81105143223/customer-churn-can-kill-your-startup">Read this post if you want a little bit more background on how the plot should look.</a>)</span></div>
<div style="text-align: justify;">
<br /></div>
<div class="separator" style="clear: both; text-align: justify;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1Sca6QiHNYwXgPbRSR_QD-9-XhW8yjQgcRPwf_vEfiU8wniLNyFemUJOMcT-m6ooVwC4rSQOmhmc7uiuYQv1jaOpRnZwF53NF1tF5LSajDChvqLp4coiK1HRMmsvlxP3RhU3BkyAvfiA/s1600/mturk-cohort-analysis.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="390" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1Sca6QiHNYwXgPbRSR_QD-9-XhW8yjQgcRPwf_vEfiU8wniLNyFemUJOMcT-m6ooVwC4rSQOmhmc7uiuYQv1jaOpRnZwF53NF1tF5LSajDChvqLp4coiK1HRMmsvlxP3RhU3BkyAvfiA/s640/mturk-cohort-analysis.png" width="640" /></a></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
For example, the bottom layer corresponds to all the requesters that were first seen in May 2014 (the first month that the new version of MTurk Tracker started collecting data). We can see that we had ~2700 "new" requesters that month. (The May 2014 cohort obviously contains all prior cohorts in our dataset, as we do not know when these requesters really started posting.) Out of these requesters, approximately 1700 also posted a task in June 2014 or later, approximately 1000 posted a task in March 2015 or later, and approximately 500 posted a task in February 2016.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The layer on top (slightly darker blue) illustrates the evolution of the June 2014 cohort. By stacking them on top of each other, we can see the composition of the requesters that have been active in every single month.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
As the plot makes obvious, until March 2015 the acquisition of new requesters every month was compensating for the requesters lost from prior cohorts. Starting in March 2015, however, we see a decline in the overall numbers, as the loss of requesters from prior cohorts outpaces the acquisition of new ones. So, the cohort analysis supports the conclusions of the prior post, as the trends and conclusions are very similar (it is always good to have a few robustness checks).</div>
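The cohort computation behind this plot can be sketched with pandas. The toy activity log below is hypothetical, standing in for the MTurk Tracker data; the real dataset is linked above.

```python
import pandas as pd

# Hypothetical activity log: one row per (requester, month) with activity.
activity = pd.DataFrame({
    "requester_id": ["a", "a", "a", "b", "b", "c"],
    "month": pd.to_datetime(
        ["2014-05", "2014-06", "2014-08", "2014-05", "2014-07", "2014-06"]
    ),
})

# Each requester's cohort is the first month they were observed.
activity["cohort"] = activity.groupby("requester_id")["month"].transform("min")

# Count distinct active requesters per (cohort, month); unstacking gives the
# matrix behind a "layer cake" plot (one row per cohort, one column per month).
cohorts = (
    activity.groupby(["cohort", "month"])["requester_id"]
    .nunique()
    .unstack(fill_value=0)
)
print(cohorts)

# A stacked area chart of this matrix reproduces the layer-cake view:
# cohorts.T.plot.area()
```

Replacing `.nunique()` with a sum over per-requester spending would give the revenue-based version of the same analysis.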
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Of course, a more comprehensive cohort analysis would also analyze the revenue generated by each cohort, and not just the number of active users. That requires a bit more digging into the data, which I will do in a subsequent post.</div>
Panos Ipeirotishttp://www.blogger.com/profile/15283752183704062501noreply@blogger.comtag:blogger.com,1999:blog-7118563403027467631.post-39877745925145725372016-02-26T23:31:00.001-05:002016-02-28T14:07:42.222-05:00The Decline of Amazon Mechanical Turk<div style="text-align: justify;">
It seems that, after years of neglect, Mechanical Turk is starting to lose its appeal. In our latest measurement, we see Mechanical Turk losing almost 50% of its requesters in a YoY measurement.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
A few days ago, Kristy Milland (aka SpamGirl) asked me if there was a way to see the active requesters on Mechanical Turk over time. I did not have this dashboard on MTurk Tracker, but it is an important metric, so I decided to add it to the MTurk Tracker website.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
So, MTurk Tracker now has a tab called "<a href="http://www.mturk-tracker.com/#/activerequesters">Active Requesters</a>" which shows how many requesters are "active" on Mechanical Turk at any given time. A requester is "active at time X" if they had a task running on MTurk both before and after time X.</div>
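This definition can be sketched as a simple interval check; the requester IDs and HIT observation windows below are made up for illustration and are not the tracker's actual schema.

```python
from datetime import datetime

# Hypothetical HIT observations: (requester_id, first_seen, last_seen).
hits = [
    ("r1", datetime(2015, 3, 1), datetime(2015, 3, 10)),
    ("r2", datetime(2015, 3, 5), datetime(2015, 3, 6)),
    ("r1", datetime(2015, 4, 1), datetime(2015, 4, 2)),
]

def active_requesters(hits, at):
    """Requesters with a HIT observed running both before and after time `at`."""
    return {r for r, first, last in hits if first <= at <= last}

print(sorted(active_requesters(hits, datetime(2015, 3, 5))))  # ['r1', 'r2']
```

Counting the size of this set at each point in time yields the time series shown in the chart below.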
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Here is the chart for the active requesters between Jan 1, 2015 and February 28, 2016: </div>
<div style="text-align: justify;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjkU9t7DoEl3oJrwQkmuiFvIVxj3W-76NGpJyCxvWHLVLDWAWgejNy6XImMtF4GnLJCEg96gxk5iiM3AwuGZ9s2fTQrfvRNPD_xkYAHMRmRWDkKoWWb3Gf_ZiQ_lQkhwN5ZwWB228TzgMI/s1600/mturk-active-requesters.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="310" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjkU9t7DoEl3oJrwQkmuiFvIVxj3W-76NGpJyCxvWHLVLDWAWgejNy6XImMtF4GnLJCEg96gxk5iiM3AwuGZ9s2fTQrfvRNPD_xkYAHMRmRWDkKoWWb3Gf_ZiQ_lQkhwN5ZwWB228TzgMI/s640/mturk-active-requesters.PNG" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: justify;">
</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
As you can see, starting in March 2015 (that is, before the announcement of the price increases), we see a decline in the number of active requesters. Interestingly, when the fee increases were announced, we see a small "valley" around the period of the fee increases. The numbers remain stable until November, but after that we see a steady decline.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<b>Overall, we observe a YoY decline of almost 50% in terms of active requesters.</b></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
What is driving the decline? Hard to tell. Perhaps requesters are abandoning crowdsourcing in favor of more automated solutions, such as deep learning. Perhaps requesters with long-running jobs build their own workforce (e.g., using UpWork). Perhaps they use alternative platforms, such as CrowdFlower. Or perhaps my own metric is flawed, and I need to revise it.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
But, unless we have a bug in the code, the future does not seem promising for Mechanical Turk. And this is a shame.</div>
<div style="text-align: justify;">
<br /></div>
<div>
<div style="text-align: justify;">
<br /></div>
</div>
Panos Ipeirotishttp://www.blogger.com/profile/15283752183704062501noreply@blogger.comtag:blogger.com,1999:blog-7118563403027467631.post-25246272100640135572015-06-10T12:38:00.001-04:002015-06-10T12:48:16.386-04:00An API for MTurk DemographicsA few months back, I launched <a href="http://demographics.mturk-tracker.com/">demographics.mturk-tracker.com</a>, a tool that continuously runs surveys of the Mechanical Turk worker population and displays live statistics about gender, age, income, country of origin, etc.<br />
<br />
Of course, there are many other reports and analyses that can be produced using the data. To make it easier for others to use and analyze the data, we now offer a simple API for retrieving the raw survey responses.<br />
<br />
Here is a quick example: We first call the API and get back the raw responses:<br />
<span style="background-color: white; color: navy; font-family: monospace; font-size: 14px; line-height: 1.21429em; text-align: right;"><br />
</span> <span style="background-color: white; color: navy; font-family: monospace; font-size: 14px; line-height: 1.21429em; text-align: right;">In [1]:</span><br />
<div class="inner_cell" style="-webkit-box-align: stretch; -webkit-box-flex: 1; -webkit-box-orient: vertical; align-items: stretch; background-color: white; box-sizing: border-box; display: flex; flex-direction: column; flex: 1 1 0%; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 20px;">
<div class="input_area" style="background: rgb(247, 247, 247); border-radius: 2px; border: 1px solid rgb(207, 207, 207); box-sizing: border-box; line-height: 1.21429em;">
<div class=" highlight hl-ipython2" style="background: transparent; border: none; box-sizing: border-box; margin: 0.4em; padding: 0px;">
<pre style="background-color: transparent; border-radius: 2px; border: none; box-sizing: border-box; color: #333333; font-size: inherit; line-height: inherit; overflow: auto; padding: 0px; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;"><span class="kn" style="box-sizing: border-box; color: green; font-weight: bold;">import</span> <span class="nn" style="box-sizing: border-box; color: blue; font-weight: bold;">requests</span>
<span class="kn" style="box-sizing: border-box; color: green; font-weight: bold;">import</span> <span class="nn" style="box-sizing: border-box; color: blue; font-weight: bold;">json</span>
<span class="kn" style="box-sizing: border-box; color: green; font-weight: bold;">import</span> <span class="nn" style="box-sizing: border-box; color: blue; font-weight: bold;">pprint</span>
<span class="kn" style="box-sizing: border-box; color: green; font-weight: bold;">import</span> <span class="nn" style="box-sizing: border-box; color: blue; font-weight: bold;">pandas</span> <span class="kn" style="box-sizing: border-box; color: green; font-weight: bold;">as</span> <span class="nn" style="box-sizing: border-box; color: blue; font-weight: bold;">pd</span>
<span class="kn" style="box-sizing: border-box; color: green; font-weight: bold;">from</span> <span class="nn" style="box-sizing: border-box; color: blue; font-weight: bold;">datetime</span> <span class="kn" style="box-sizing: border-box; color: green; font-weight: bold;">import</span> <span class="n" style="box-sizing: border-box;">datetime</span>
<span class="kn" style="box-sizing: border-box; color: green; font-weight: bold;">import</span> <span class="nn" style="box-sizing: border-box; color: blue; font-weight: bold;">time</span>
<span class="c" style="box-sizing: border-box; color: #408080; font-style: italic;"># The API call that returns the last 10K survey responses</span>
<span class="n" style="box-sizing: border-box;">url</span> <span class="o" style="box-sizing: border-box; color: #666666;">=</span> <span class="s" style="box-sizing: border-box; color: #ba2121;">"https://mturk-surveys.appspot.com/"</span> <span class="o" style="box-sizing: border-box; color: #666666;">+</span> \
<span class="s" style="box-sizing: border-box; color: #ba2121;">"_ah/api/survey/v1/survey/demographics/answers?limit=10000"</span>
<span class="n" style="box-sizing: border-box;">resp</span> <span class="o" style="box-sizing: border-box; color: #666666;">=</span> <span class="n" style="box-sizing: border-box;">requests</span><span class="o" style="box-sizing: border-box; color: #666666;">.</span><span class="n" style="box-sizing: border-box;">get</span><span class="p" style="box-sizing: border-box;">(</span><span class="n" style="box-sizing: border-box;">url</span><span class="p" style="box-sizing: border-box;">)</span>
<span class="n" style="box-sizing: border-box;">json</span> <span class="o" style="box-sizing: border-box; color: #666666;">=</span> <span class="n" style="box-sizing: border-box;">json</span><span class="o" style="box-sizing: border-box; color: #666666;">.</span><span class="n" style="box-sizing: border-box;">loads</span><span class="p" style="box-sizing: border-box;">(</span><span class="n" style="box-sizing: border-box;">resp</span><span class="o" style="box-sizing: border-box; color: #666666;">.</span><span class="n" style="box-sizing: border-box;">text</span><span class="p" style="box-sizing: border-box;">)</span></pre>
</div>
</div>
</div>
<br />
<br />
Then we need to reformat the returned JSON object and transform the responses into a flat table:<br />
<span style="background-color: white; color: navy; font-family: monospace; font-size: 14px; line-height: 1.21429em; text-align: right;"><br />
</span> <span style="background-color: white; color: navy; font-family: monospace; font-size: 14px; line-height: 1.21429em; text-align: right;">In [2]:</span><br />
<div class="inner_cell" style="-webkit-box-align: stretch; -webkit-box-flex: 1; -webkit-box-orient: vertical; align-items: stretch; background-color: white; box-sizing: border-box; display: flex; flex-direction: column; flex: 1 1 0%; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 20px;">
<div class="input_area" style="background: rgb(247, 247, 247); border-radius: 2px; border: 1px solid rgb(207, 207, 207); box-sizing: border-box; line-height: 1.21429em;">
<div class=" highlight hl-ipython2" style="background: transparent; border: none; box-sizing: border-box; margin: 0.4em; padding: 0px;">
<pre style="background-color: transparent; border-radius: 2px; border: none; box-sizing: border-box; color: #333333; font-size: inherit; line-height: inherit; overflow: auto; padding: 0px; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;"><span class="c" style="box-sizing: border-box; color: #408080; font-style: italic;"># This function takes as input the response for a single survey, and transforms it into a flat dictionary</span>
<span class="k" style="box-sizing: border-box; color: green; font-weight: bold;">def</span> <span class="nf" style="box-sizing: border-box; color: blue;">flatten</span><span class="p" style="box-sizing: border-box;">(</span><span class="n" style="box-sizing: border-box;">item</span><span class="p" style="box-sizing: border-box;">):</span>
    <span class="n" style="box-sizing: border-box;">fmt</span> <span class="o" style="box-sizing: border-box; color: #666666;">=</span> <span class="s" style="box-sizing: border-box; color: #ba2121;">"%Y-%m-</span><span class="si" style="box-sizing: border-box; color: #bb6688; font-weight: bold;">%d</span><span class="s" style="box-sizing: border-box; color: #ba2121;">T%H:%M:%S.</span><span class="si" style="box-sizing: border-box; color: #bb6688; font-weight: bold;">%f</span><span class="s" style="box-sizing: border-box; color: #ba2121;">Z"</span>
    <span class="n" style="box-sizing: border-box;">hit_answer_date</span> <span class="o" style="box-sizing: border-box; color: #666666;">=</span> <span class="n" style="box-sizing: border-box;">datetime</span><span class="o" style="box-sizing: border-box; color: #666666;">.</span><span class="n" style="box-sizing: border-box;">strptime</span><span class="p" style="box-sizing: border-box;">(</span><span class="n" style="box-sizing: border-box;">item</span><span class="p" style="box-sizing: border-box;">[</span><span class="s" style="box-sizing: border-box; color: #ba2121;">"date"</span><span class="p" style="box-sizing: border-box;">],</span> <span class="n" style="box-sizing: border-box;">fmt</span><span class="p" style="box-sizing: border-box;">)</span>
    <span class="n" style="box-sizing: border-box;">hit_creation_str</span> <span class="o" style="box-sizing: border-box; color: #666666;">=</span> <span class="n" style="box-sizing: border-box;">item</span><span class="o" style="box-sizing: border-box; color: #666666;">.</span><span class="n" style="box-sizing: border-box;">get</span><span class="p" style="box-sizing: border-box;">(</span><span class="s" style="box-sizing: border-box; color: #ba2121;">"hitCreationDate"</span><span class="p" style="box-sizing: border-box;">)</span>
    <span class="k" style="box-sizing: border-box; color: green; font-weight: bold;">if</span> <span class="n" style="box-sizing: border-box;">hit_creation_str</span> <span class="ow" style="box-sizing: border-box; color: #aa22ff; font-weight: bold;">is</span> <span class="bp" style="box-sizing: border-box; color: green;">None</span><span class="p" style="box-sizing: border-box;">:</span>
        <span class="n" style="box-sizing: border-box;">hit_creation_date</span> <span class="o" style="box-sizing: border-box; color: #666666;">=</span> <span class="bp" style="box-sizing: border-box; color: green;">None</span>
        <span class="n" style="box-sizing: border-box;">diff</span> <span class="o" style="box-sizing: border-box; color: #666666;">=</span> <span class="bp" style="box-sizing: border-box; color: green;">None</span>
    <span class="k" style="box-sizing: border-box; color: green; font-weight: bold;">else</span><span class="p" style="box-sizing: border-box;">:</span>
        <span class="n" style="box-sizing: border-box;">hit_creation_date</span> <span class="o" style="box-sizing: border-box; color: #666666;">=</span> <span class="n" style="box-sizing: border-box;">datetime</span><span class="o" style="box-sizing: border-box; color: #666666;">.</span><span class="n" style="box-sizing: border-box;">strptime</span><span class="p" style="box-sizing: border-box;">(</span><span class="n" style="box-sizing: border-box;">hit_creation_str</span><span class="p" style="box-sizing: border-box;">,</span> <span class="n" style="box-sizing: border-box;">fmt</span><span class="p" style="box-sizing: border-box;">)</span>
        <span class="c" style="box-sizing: border-box; color: #408080; font-style: italic;"># convert to unix timestamp</span>
        <span class="n" style="box-sizing: border-box;">hit_date_ts</span> <span class="o" style="box-sizing: border-box; color: #666666;">=</span> <span class="n" style="box-sizing: border-box;">time</span><span class="o" style="box-sizing: border-box; color: #666666;">.</span><span class="n" style="box-sizing: border-box;">mktime</span><span class="p" style="box-sizing: border-box;">(</span><span class="n" style="box-sizing: border-box;">hit_creation_date</span><span class="o" style="box-sizing: border-box; color: #666666;">.</span><span class="n" style="box-sizing: border-box;">timetuple</span><span class="p" style="box-sizing: border-box;">())</span>
        <span class="n" style="box-sizing: border-box;">answer_date_ts</span> <span class="o" style="box-sizing: border-box; color: #666666;">=</span> <span class="n" style="box-sizing: border-box;">time</span><span class="o" style="box-sizing: border-box; color: #666666;">.</span><span class="n" style="box-sizing: border-box;">mktime</span><span class="p" style="box-sizing: border-box;">(</span><span class="n" style="box-sizing: border-box;">hit_answer_date</span><span class="o" style="box-sizing: border-box; color: #666666;">.</span><span class="n" style="box-sizing: border-box;">timetuple</span><span class="p" style="box-sizing: border-box;">())</span>
        <span class="n" style="box-sizing: border-box;">diff</span> <span class="o" style="box-sizing: border-box; color: #666666;">=</span> <span class="nb" style="box-sizing: border-box; color: green;">int</span><span class="p" style="box-sizing: border-box;">(</span><span class="n" style="box-sizing: border-box;">answer_date_ts</span><span class="o" style="box-sizing: border-box; color: #666666;">-</span><span class="n" style="box-sizing: border-box;">hit_date_ts</span><span class="p" style="box-sizing: border-box;">)</span>
    <span class="n" style="box-sizing: border-box;">result</span> <span class="o" style="box-sizing: border-box; color: #666666;">=</span> <span class="p" style="box-sizing: border-box;">{</span>
        <span class="s" style="box-sizing: border-box; color: #ba2121;">"worker_id"</span><span class="p" style="box-sizing: border-box;">:</span> <span class="nb" style="box-sizing: border-box; color: green;">str</span><span class="p" style="box-sizing: border-box;">(</span><span class="n" style="box-sizing: border-box;">item</span><span class="p" style="box-sizing: border-box;">[</span><span class="s" style="box-sizing: border-box; color: #ba2121;">"workerId"</span><span class="p" style="box-sizing: border-box;">]),</span>
        <span class="s" style="box-sizing: border-box; color: #ba2121;">"gender"</span><span class="p" style="box-sizing: border-box;">:</span> <span class="nb" style="box-sizing: border-box; color: green;">str</span><span class="p" style="box-sizing: border-box;">(</span><span class="n" style="box-sizing: border-box;">item</span><span class="p" style="box-sizing: border-box;">[</span><span class="s" style="box-sizing: border-box; color: #ba2121;">"answers"</span><span class="p" style="box-sizing: border-box;">][</span><span class="s" style="box-sizing: border-box; color: #ba2121;">"gender"</span><span class="p" style="box-sizing: border-box;">]),</span>
        <span class="s" style="box-sizing: border-box; color: #ba2121;">"household_income"</span><span class="p" style="box-sizing: border-box;">:</span> <span class="nb" style="box-sizing: border-box; color: green;">str</span><span class="p" style="box-sizing: border-box;">(</span><span class="n" style="box-sizing: border-box;">item</span><span class="p" style="box-sizing: border-box;">[</span><span class="s" style="box-sizing: border-box; color: #ba2121;">"answers"</span><span class="p" style="box-sizing: border-box;">][</span><span class="s" style="box-sizing: border-box; color: #ba2121;">"householdIncome"</span><span class="p" style="box-sizing: border-box;">]),</span>
        <span class="s" style="box-sizing: border-box; color: #ba2121;">"household_size"</span><span class="p" style="box-sizing: border-box;">:</span> <span class="nb" style="box-sizing: border-box; color: green;">str</span><span class="p" style="box-sizing: border-box;">(</span><span class="n" style="box-sizing: border-box;">item</span><span class="p" style="box-sizing: border-box;">[</span><span class="s" style="box-sizing: border-box; color: #ba2121;">"answers"</span><span class="p" style="box-sizing: border-box;">][</span><span class="s" style="box-sizing: border-box; color: #ba2121;">"householdSize"</span><span class="p" style="box-sizing: border-box;">]),</span>
        <span class="s" style="box-sizing: border-box; color: #ba2121;">"marital_status"</span><span class="p" style="box-sizing: border-box;">:</span> <span class="nb" style="box-sizing: border-box; color: green;">str</span><span class="p" style="box-sizing: border-box;">(</span><span class="n" style="box-sizing: border-box;">item</span><span class="p" style="box-sizing: border-box;">[</span><span class="s" style="box-sizing: border-box; color: #ba2121;">"answers"</span><span class="p" style="box-sizing: border-box;">][</span><span class="s" style="box-sizing: border-box; color: #ba2121;">"maritalStatus"</span><span class="p" style="box-sizing: border-box;">]),</span>
        <span class="s" style="box-sizing: border-box; color: #ba2121;">"year_of_birth"</span><span class="p" style="box-sizing: border-box;">:</span> <span class="nb" style="box-sizing: border-box; color: green;">int</span><span class="p" style="box-sizing: border-box;">(</span><span class="n" style="box-sizing: border-box;">item</span><span class="p" style="box-sizing: border-box;">[</span><span class="s" style="box-sizing: border-box; color: #ba2121;">"answers"</span><span class="p" style="box-sizing: border-box;">][</span><span class="s" style="box-sizing: border-box; color: #ba2121;">"yearOfBirth"</span><span class="p" style="box-sizing: border-box;">]),</span>
        <span class="s" style="box-sizing: border-box; color: #ba2121;">"location_city"</span><span class="p" style="box-sizing: border-box;">:</span> <span class="nb" style="box-sizing: border-box; color: green;">str</span><span class="p" style="box-sizing: border-box;">(</span><span class="n" style="box-sizing: border-box;">item</span><span class="o" style="box-sizing: border-box; color: #666666;">.</span><span class="n" style="box-sizing: border-box;">get</span><span class="p" style="box-sizing: border-box;">(</span><span class="s" style="box-sizing: border-box; color: #ba2121;">"locationCity"</span><span class="p" style="box-sizing: border-box;">)),</span>
        <span class="s" style="box-sizing: border-box; color: #ba2121;">"location_region"</span><span class="p" style="box-sizing: border-box;">:</span> <span class="nb" style="box-sizing: border-box; color: green;">str</span><span class="p" style="box-sizing: border-box;">(</span><span class="n" style="box-sizing: border-box;">item</span><span class="o" style="box-sizing: border-box; color: #666666;">.</span><span class="n" style="box-sizing: border-box;">get</span><span class="p" style="box-sizing: border-box;">(</span><span class="s" style="box-sizing: border-box; color: #ba2121;">"locationRegion"</span><span class="p" style="box-sizing: border-box;">)),</span>
        <span class="s" style="box-sizing: border-box; color: #ba2121;">"location_country"</span><span class="p" style="box-sizing: border-box;">:</span> <span class="nb" style="box-sizing: border-box; color: green;">str</span><span class="p" style="box-sizing: border-box;">(</span><span class="n" style="box-sizing: border-box;">item</span><span class="p" style="box-sizing: border-box;">[</span><span class="s" style="box-sizing: border-box; color: #ba2121;">"locationCountry"</span><span class="p" style="box-sizing: border-box;">]),</span>
        <span class="s" style="box-sizing: border-box; color: #ba2121;">"hit_answered_date"</span><span class="p" style="box-sizing: border-box;">:</span> <span class="n" style="box-sizing: border-box;">hit_answer_date</span><span class="p" style="box-sizing: border-box;">,</span>
        <span class="s" style="box-sizing: border-box; color: #ba2121;">"hit_creation_date"</span><span class="p" style="box-sizing: border-box;">:</span> <span class="n" style="box-sizing: border-box;">hit_creation_date</span><span class="p" style="box-sizing: border-box;">,</span>
        <span class="s" style="box-sizing: border-box; color: #ba2121;">"post_to_completion_secs"</span><span class="p" style="box-sizing: border-box;">:</span> <span class="n" style="box-sizing: border-box;">diff</span>
    <span class="p" style="box-sizing: border-box;">}</span>
    <span class="k" style="box-sizing: border-box; color: green; font-weight: bold;">return</span> <span class="n" style="box-sizing: border-box;">result</span></pre>
</div>
</div>
</div>
<div class="prompt input_prompt" style="background-color: white; border-top-color: transparent; border-top-style: solid; border-top-width: 1px; box-sizing: border-box; color: navy; font-family: monospace; font-size: 14px; line-height: 1.21429em; margin: 0px; min-width: 14ex; padding: 0.4em; text-align: right;">
<br /></div>
<div class="inner_cell" style="-webkit-box-align: stretch; -webkit-box-flex: 1; -webkit-box-orient: vertical; align-items: stretch; background-color: white; box-sizing: border-box; display: flex; flex-direction: column; flex: 1 1 0%; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 20px;">
<div class="input_area" style="background: rgb(247, 247, 247); border-radius: 2px; border: 1px solid rgb(207, 207, 207); box-sizing: border-box; line-height: 1.21429em;">
<div class=" highlight hl-ipython2" style="background: transparent; border: none; box-sizing: border-box; margin: 0.4em; padding: 0px;">
<pre style="background-color: transparent; border-radius: 2px; border: none; box-sizing: border-box; color: #333333; font-size: inherit; line-height: inherit; overflow: auto; padding: 0px; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;"><span class="c" style="box-sizing: border-box; color: #408080; font-style: italic;"># We now transform our API answer into a flat table (Pandas dataframe)</span>
<span class="n" style="box-sizing: border-box;">responses</span> <span class="o" style="box-sizing: border-box; color: #666666;">=</span> <span class="p" style="box-sizing: border-box;">[</span><span class="n" style="box-sizing: border-box;">flatten</span><span class="p" style="box-sizing: border-box;">(</span><span class="n" style="box-sizing: border-box;">item</span><span class="p" style="box-sizing: border-box;">)</span> <span class="k" style="box-sizing: border-box; color: green; font-weight: bold;">for</span> <span class="n" style="box-sizing: border-box;">item</span> <span class="ow" style="box-sizing: border-box; color: #aa22ff; font-weight: bold;">in</span> <span class="n" style="box-sizing: border-box;">json</span><span class="p" style="box-sizing: border-box;">[</span><span class="s" style="box-sizing: border-box; color: #ba2121;">"items"</span><span class="p" style="box-sizing: border-box;">]]</span>
<span class="n" style="box-sizing: border-box;">df</span> <span class="o" style="box-sizing: border-box; color: #666666;">=</span> <span class="n" style="box-sizing: border-box;">pd</span><span class="o" style="box-sizing: border-box; color: #666666;">.</span><span class="n" style="box-sizing: border-box;">DataFrame</span><span class="p" style="box-sizing: border-box;">(</span><span class="n" style="box-sizing: border-box;">responses</span><span class="p" style="box-sizing: border-box;">)</span>
<span class="n" style="box-sizing: border-box;">df</span><span class="p" style="box-sizing: border-box;">[</span><span class="s" style="box-sizing: border-box; color: #ba2121;">"gender"</span><span class="p" style="box-sizing: border-box;">]</span><span class="o" style="box-sizing: border-box; color: #666666;">=</span><span class="n" style="box-sizing: border-box;">df</span><span class="p" style="box-sizing: border-box;">[</span><span class="s" style="box-sizing: border-box; color: #ba2121;">"gender"</span><span class="p" style="box-sizing: border-box;">]</span><span class="o" style="box-sizing: border-box; color: #666666;">.</span><span class="n" style="box-sizing: border-box;">astype</span><span class="p" style="box-sizing: border-box;">(</span><span class="s" style="box-sizing: border-box; color: #ba2121;">"category"</span><span class="p" style="box-sizing: border-box;">)</span>
<span class="n" style="box-sizing: border-box;">df</span><span class="p" style="box-sizing: border-box;">[</span><span class="s" style="box-sizing: border-box; color: #ba2121;">"household_income"</span><span class="p" style="box-sizing: border-box;">]</span><span class="o" style="box-sizing: border-box; color: #666666;">=</span><span class="n" style="box-sizing: border-box;">df</span><span class="p" style="box-sizing: border-box;">[</span><span class="s" style="box-sizing: border-box; color: #ba2121;">"household_income"</span><span class="p" style="box-sizing: border-box;">]</span><span class="o" style="box-sizing: border-box; color: #666666;">.</span><span class="n" style="box-sizing: border-box;">astype</span><span class="p" style="box-sizing: border-box;">(</span><span class="s" style="box-sizing: border-box; color: #ba2121;">"category"</span><span class="p" style="box-sizing: border-box;">)</span></pre>
</div>
</div>
</div>
<br />
<br />
We can then save the data to a vanilla CSV file and see what the raw data looks like:<br />
<span style="background-color: white; color: navy; font-family: monospace; font-size: 14px; line-height: 1.21429em; text-align: right;"><br />
</span> <span style="background-color: white; color: navy; font-family: monospace; font-size: 14px; line-height: 1.21429em; text-align: right;">In [3]:</span><br />
<div class="input" style="-webkit-box-align: stretch; -webkit-box-orient: horizontal; align-items: stretch; background-color: white; box-sizing: border-box; display: flex; flex-direction: row; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 20px; page-break-inside: avoid;">
<div class="inner_cell" style="-webkit-box-align: stretch; -webkit-box-flex: 1; -webkit-box-orient: vertical; align-items: stretch; box-sizing: border-box; display: flex; flex-direction: column; flex: 1 1 0%;">
<div class="input_area" style="background: rgb(247, 247, 247); border-radius: 2px; border: 1px solid rgb(207, 207, 207); box-sizing: border-box; line-height: 1.21429em;">
<div class=" highlight hl-ipython2" style="background: transparent; border: none; box-sizing: border-box; margin: 0.4em; padding: 0px;">
<pre style="background-color: transparent; border-radius: 2px; border: none; box-sizing: border-box; color: #333333; font-size: inherit; line-height: inherit; overflow: auto; padding: 0px; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;"><span class="c" style="box-sizing: border-box; color: #408080; font-style: italic;"># Let's save the file as a CSV</span>
<span class="n" style="box-sizing: border-box;">df</span><span class="o" style="box-sizing: border-box; color: #666666;">.</span><span class="n" style="box-sizing: border-box;">to_csv</span><span class="p" style="box-sizing: border-box;">(</span><span class="s" style="box-sizing: border-box; color: #ba2121;">"data/mturk_surveys.csv"</span><span class="p" style="box-sizing: border-box;">)</span>
<span class="o" style="box-sizing: border-box; color: #666666;">!</span>head -5 data/mturk_surveys.csv
</pre>
</div>
</div>
</div>
</div>
<div class="output_wrapper" style="-webkit-box-align: stretch; -webkit-box-orient: vertical; align-items: stretch; background-color: white; box-sizing: border-box; display: flex; flex-direction: column; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 20px; position: relative; z-index: 1;">
<div class="output" style="-webkit-box-align: stretch; -webkit-box-orient: vertical; align-items: stretch; box-sizing: border-box; display: flex; flex-direction: column;">
<div class="output_area" style="-webkit-box-align: stretch; -webkit-box-orient: horizontal; align-items: stretch; box-sizing: border-box; display: flex; flex-direction: row; padding: 0px; page-break-inside: avoid;">
<div class="output_subarea output_stream output_stdout output_text" style="-webkit-box-flex: 1; box-sizing: border-box; flex: 1 1 0%; line-height: 1.21429em; max-width: calc(100% - 14ex); overflow-x: auto; padding: 0.4em;">
<pre style="background-color: transparent; border-radius: 0px; border: 0px; box-sizing: border-box; font-size: inherit; line-height: inherit; overflow: auto; padding: 0px; vertical-align: baseline; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;"></pre>
<pre style="background-color: transparent; border-radius: 0px; border: 0px; box-sizing: border-box; font-size: inherit; line-height: inherit; overflow: auto; padding: 0px; vertical-align: baseline; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;">,gender,hit_answered_date,hit_creation_date,household_income,household_size,location_city,location_country,location_region,marital_status,post_to_completion_secs,worker_id,year_of_birth
0,male,2015-06-10 15:57:23.072000,2015-06-10 15:50:23,"$25,000-$39,999",5+,kochi,IN,kl,single,420.0,4ce5dfeb7ab9edb7f3b95b630e2ad0de,1992
1,male,2015-06-10 15:57:01.022000,2015-06-10 15:35:22,"Less than $10,000",4,?,IN,?,single,1299.0,cd6ce60cff5e120f3c006504bbf2eb86,1987
2,male,2015-06-10 15:21:53.070000,2015-06-10 15:20:08,"$60,000-$74,999",2,?,US,?,married,105.0,73980a1be9fca00947c59b93557651c8,1971
3,female,2015-06-10 15:16:50.111000,2015-06-10 14:50:06,"Less than $10,000",2,jacksonville,US,fl,married,1604.0,a4cdbe00c93728aefea6cdfb53b8c489,1992</pre>
</div>
</div>
</div>
</div>
<br />
Or we can take a peek at the top countries:<br />
<span style="background-color: white; color: navy; font-family: monospace; font-size: 14px; line-height: 1.21429em; text-align: right;"><br />
</span> <span style="background-color: white; color: navy; font-family: monospace; font-size: 14px; line-height: 1.21429em; text-align: right;">In [4]:</span><br />
<div class="input" style="-webkit-box-align: stretch; -webkit-box-orient: horizontal; align-items: stretch; background-color: white; box-sizing: border-box; display: flex; flex-direction: row; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 20px; page-break-inside: avoid;">
<div class="inner_cell" style="-webkit-box-align: stretch; -webkit-box-flex: 1; -webkit-box-orient: vertical; align-items: stretch; box-sizing: border-box; display: flex; flex-direction: column; flex: 1 1 0%;">
<div class="input_area" style="background: rgb(247, 247, 247); border-radius: 2px; border: 1px solid rgb(207, 207, 207); box-sizing: border-box; line-height: 1.21429em;">
<div class=" highlight hl-ipython2" style="background: transparent; border: none; box-sizing: border-box; margin: 0.4em; padding: 0px;">
<pre style="background-color: transparent; border-radius: 2px; border: none; box-sizing: border-box; color: #333333; font-size: inherit; line-height: inherit; overflow: auto; padding: 0px; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;"><span class="c" style="box-sizing: border-box; color: #408080; font-style: italic;"># Let's see the top countries</span>
<span class="n" style="box-sizing: border-box;">country</span> <span class="o" style="box-sizing: border-box; color: #666666;">=</span> <span class="n" style="box-sizing: border-box;">df</span><span class="p" style="box-sizing: border-box;">[</span><span class="s" style="box-sizing: border-box; color: #ba2121;">'location_country'</span><span class="p" style="box-sizing: border-box;">]</span><span class="o" style="box-sizing: border-box; color: #666666;">.</span><span class="n" style="box-sizing: border-box;">value_counts</span><span class="p" style="box-sizing: border-box;">()</span>
<span class="n" style="box-sizing: border-box;">country</span><span class="o" style="box-sizing: border-box; color: #666666;">.</span><span class="n" style="box-sizing: border-box;">head</span><span class="p" style="box-sizing: border-box;">(</span><span class="mi" style="box-sizing: border-box; color: #666666;">20</span><span class="p" style="box-sizing: border-box;">)</span>
</pre>
</div>
</div>
</div>
</div>
<div class="output_wrapper" style="-webkit-box-align: stretch; -webkit-box-orient: vertical; align-items: stretch; background-color: white; box-sizing: border-box; display: flex; flex-direction: column; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 20px; position: relative; z-index: 1;">
<div class="output" style="-webkit-box-align: stretch; -webkit-box-orient: vertical; align-items: stretch; box-sizing: border-box; display: flex; flex-direction: column;">
<div class="output_area" style="-webkit-box-align: stretch; -webkit-box-orient: horizontal; align-items: stretch; box-sizing: border-box; display: flex; flex-direction: row; padding: 0px; page-break-inside: avoid;">
<div class="prompt output_prompt" style="box-sizing: border-box; color: darkred; font-family: monospace; line-height: 1.21429em; margin: 0px; min-width: 14ex; padding: 0.4em; text-align: right;">
Out[4]:</div>
<div class="output_text output_subarea output_execute_result" style="-webkit-box-flex: 1; box-sizing: border-box; flex: 1 1 0%; line-height: 1.21429em; max-width: calc(100% - 14ex); overflow-x: auto; padding: 0.4em;">
<pre style="background-color: transparent; border-radius: 0px; border: 0px; box-sizing: border-box; font-size: inherit; line-height: inherit; overflow: auto; padding: 0px; vertical-align: baseline; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;">US 5748
IN 1281
CA 30
PH 22
GB 16
ZZ 15
DE 14
AE 11
BR 10
RO 10
TH 7
AU 7
PE 7
MK 7
FR 6
IT 6
NZ 6
SG 6
RS 5
PK 5
dtype: int64</pre>
</div>
</div>
</div>
</div>
<br />
I hope that the examples are sufficient to get people started using the API, and I am looking forward to seeing what analyses people will perform.Panos Ipeirotishttp://www.blogger.com/profile/15283752183704062501noreply@blogger.comtag:blogger.com,1999:blog-7118563403027467631.post-27900498027710375232015-06-08T20:07:00.001-04:002015-06-08T20:07:54.969-04:00Postdoc Position for Quality Control in Crowdsourcing<div style="text-align: justify;">
The Center for Data Science at NYU invites applications for a post-doctoral fellowship in statistical methodology relating to evaluating rater quality for a new research program in the application of crowdsourcing ratings of human speech production.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<b>Duties and Responsibilities</b>: This is a two-year postdoctoral position affiliated with the NYU Center for Data Science. The successful candidate will join a dynamic group of researchers in several NYU Centers including PRIISM, MAGNET, the Stern School of Business, the NYU Medical School and the Department of Communicative Sciences and Disorders. We are seeking highly motivated individuals to develop and test novel statistical and computational methods for evaluating rater quality in crowdsourced tasks. Responsibilities will include development, testing and implementation of statistical algorithms, as well as preparation of manuscripts for academic publication. Advanced knowledge of R is preferred. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<b>Position Qualifications</b>: Candidates will ideally have a doctoral degree in Statistics, Biostatistics, Data Science, Computer Science, or a related field, as well as genuine interests and experiences in interdisciplinary research that integrates study of human speech, citizen science games and computational statistics. Candidates will ideally have expertise in the following areas: Bayesian statistics, numerical methods and techniques, psychometrics and/or knowledge of programming languages. Outstanding computing and communication skills are required.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Please send CV, letter of intent, and three reference letters to Daphna Harel (daphna dot harel at nyu dot edu) by July 31, 2015.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The position is for 2 years (subject to good research progress). The successful candidate will be based at the NYU Center for Data Science, under the primary supervision of NYU faculty members Panos Ipeirotis and Daphna Harel, and will closely work with a multidisciplinary team including NYU faculty members Tara McAllister Byun, R. Luke DuBois, and Mario Svirsky. The position will preferably start by September 2015 (start date negotiable).</div>
<div style="text-align: justify;">
<br /></div>
Panos Ipeirotishttp://www.blogger.com/profile/15283752183704062501noreply@blogger.comtag:blogger.com,1999:blog-7118563403027467631.post-71409478749934055312015-05-29T10:20:00.001-04:002015-06-02T22:31:40.795-04:00The World Bank Report on Online Labor<div style="text-align: justify;">
I am often asked about statistics and data about the global population of "crowdsourcing" workers, going beyond Mechanical Turk. I am happy to say that from now on I will be able to point everyone to <a href="https://www.dropbox.com/s/97css2nuiihbtx5/Global%20OO%20Study_WB%20Rpt%20Final2.pdf?dl=0">a study from The World Bank</a>, in which I was fortunate to participate. The report examines the global landscape of online labor, identifying opportunities and providing statistics about the market.<br />
<br />
The study will be officially released on Wednesday June 3rd, and for those of you willing to attend the launch event through Webex, here is the information:</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: center;">
---</div>
<div style="text-align: justify;">
<b>When</b>: </div>
<div style="text-align: justify;">
Wednesday, June 3, 2015, 9:00AM - 11:30AM EDT</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<b>Where</b>:<br />
<a href="https://worldbankgroup.webex.com/worldbankgroup/j.php?MTID=m4ac0151d13cd83426d5b5b3614c16f99">Webex URL</a></div>
<div style="text-align: justify;">
Meeting number: 730 125 194</div>
<div style="text-align: justify;">
Meeting password: online1</div>
<div style="text-align: justify;">
Audio connection: 1-650-479-3207 Call-in toll number (US/Canada)</div>
<div style="text-align: justify;">
Access code: 730 125 194</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<b>Title</b>:<br />
The New Online Outsourcing Approach for Jobs, Youth and Women's Empowerment and Services Exports</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<b>Abstract</b>: </div>
<div style="text-align: justify;">
This event will discuss the new online outsourcing (OO) phenomenon in the world today, its implications for developing countries, and how your clients can leverage it as an innovative approach for jobs, youth employment and women's empowerment.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
OO refers to the contracting of third-party workers and providers (often overseas) to supply services or perform tasks via Internet-based marketplaces or platforms. Also known as paid crowdsourcing, online work, microwork and other names - these technology-mediated channels allow clients to outsource their paid work to a large, distributed, global labor pool of remote workers, to enable performance, coordination, quality control, delivery, and payment of such services online.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The global OO marketplace today includes numerous emerging and growing platforms, such as Upwork (formerly Elance-oDesk), Crowdflower, CloudFactory, Amazon Mechanical Turk, etc. There is also a wide variety of services that can be performed online - such as data entry, digitization, graphics rendering and design, programming and apps development, accounting and legal services, etc. Workers in developing countries can access and perform jobs from all over the world, as long as they have a computer and Internet access. In addition to jobs and income, OO offers workers flexible hours and working environments, helps them develop professional skills, and drives positive social change for youth and women.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The event will share with participants the OO study, which comprehensively covers the definition and segments, trends and market size, the economic and non-financial impact on workers, and the implications and policy recommendations. In addition, the event will show how you can apply the online toolkit to assess the readiness of your client countries for OO.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The World Bank's ICT Unit is excited to share this new global study and toolkit, which was developed in partnership with the Rockefeller Foundation and Dalberg Global Development Advisors.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<b>Who:</b></div>
<div style="text-align: justify;">
<ul>
<li>Chair: Mavis Ampah, Lead ICT Policy Specialist and Practice Lead on Jobs, GTIDR </li>
<li>Siou Chew Kuek, Senior ICT Specialist and TTL, GTIDR </li>
<li>Cecilia Paradi-Guilford, ICT Innovation Specialist and Co-TTL, GTIDR </li>
<li>Saori Imaizumi, ICT Innovation and Education Consultant, GTIDR </li>
</ul>
</div>
Panos Ipeirotishttp://www.blogger.com/profile/15283752183704062501noreply@blogger.comtag:blogger.com,1999:blog-7118563403027467631.post-91119415767640234662015-04-06T15:17:00.001-04:002015-11-04T14:28:05.590-05:00Demographics of Mechanical Turk: Now Live! (April 2015 edition)One of the most common questions that I receive is whether I have new data about the demographics of Mechanical Turk workers. The latest data that I had collected were back in 2010, and it was not clear how things had changed since then. The key problem was not that I could not run additional surveys; that would have been trivial. However, the results of the surveys were always changing over time: the aggregate data varied too much across surveys, so I refrained from publishing data that seemed to be unreliable.<br />
<div>
<div>
<br /></div>
<div>
So, I thought about how to tackle two problems at once:</div>
</div>
<div>
<ul>
<li>Make it easy for people to see current data about the demographics of Mechanical Turk workers</li>
<li>Make it easy to understand the inherent variability of the collected data, and potentially understand the source of the variability</li>
</ul>
<div>
For that reason, we built a new site:</div>
<div>
<br /></div>
<div style="text-align: center;">
<a href="http://demographics.mturk-tracker.com/">http://demographics.mturk-tracker.com/</a></div>
</div>
<div style="text-align: center;">
(please also check <a href="http://www.behind-the-enemy-lines.com/2015/06/an-api-for-mturk-demographics.html">the API</a>)</div>
<div>
<br /></div>
<div style="text-align: left;">
The site displays live data about the demographics of the workers, based on a small 5-question survey that workers are asked to answer (we pay 5 cents for each response). To capture the variability over time, we post one survey every 15 minutes, which allows us to observe changes in the answers over time. We also allow each worker to answer the survey only once per month.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
A few key results:<br />
<br />
<b><a href="http://demographics.mturk-tracker.com/#/countries" target="_blank">Country</a></b></div>
<div style="text-align: left;">
<br />
Overall, we see that approximately 80% of the Mechanical Turk workers are from the US and 20% are from India.</div>
<div style="text-align: left;">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjm-KVHx5m-hphdg3cZ2IufZy0VBNwGtsW0XuYpF-ab3FuJzS1Fx4zAf2dbkkhPgv6V13QHE1Slf49NmX7bqiSp7cafM15jyjGp7M90WDSqHJWXP7cwiKjxXebV0rXTv5P4o3IV5eQA9Ys/s1600/countries.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="242" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjm-KVHx5m-hphdg3cZ2IufZy0VBNwGtsW0XuYpF-ab3FuJzS1Fx4zAf2dbkkhPgv6V13QHE1Slf49NmX7bqiSp7cafM15jyjGp7M90WDSqHJWXP7cwiKjxXebV0rXTv5P4o3IV5eQA9Ys/s1600/countries.PNG" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
However, this mix is not stable during the day. Around 8-10am UTC (i.e., 3am NYC time, 1.30pm India time), there is a much higher share of workers from India (~50%), which then drops to about 5% at 8-10pm UTC.</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj6SNL-AiJTG4Inos8tScmLloSeMNA07GtBl7febIMvmHwQ45jsefFAPOFiEIg80hb4gu4k5i7fw6xdTZyy4yCIrcUjYLscnnfP1yo47l0iaccdriR0dUVSfDqJmIUfZpyjZFdY7lUgg80/s1600/countries-hourly.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="254" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj6SNL-AiJTG4Inos8tScmLloSeMNA07GtBl7febIMvmHwQ45jsefFAPOFiEIg80hb4gu4k5i7fw6xdTZyy4yCIrcUjYLscnnfP1yo47l0iaccdriR0dUVSfDqJmIUfZpyjZFdY7lUgg80/s1600/countries-hourly.PNG" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<b><a href="http://demographics.mturk-tracker.com/#/gender" target="_blank">Gender</a></b></div>
<br />
The gender participation seems to be balanced, with roughly 50% males and 50% females. The charts that examine variability based on hour of day and day of the week do not show any change in this pattern.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZjjJzqY04fv9lEdZqY5e9gLCmMSwV8-BEIVYYOV6Lfj6LAmGbPZGntp82-DPw2ltCpACML7fdXHB5yWv4hyOUwfZlHd3r7YQUILKOBBMa17dlHyzYTjnwtCTq1QDLQgQwXQNIcIViCsM/s1600/gender.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="252" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZjjJzqY04fv9lEdZqY5e9gLCmMSwV8-BEIVYYOV6Lfj6LAmGbPZGntp82-DPw2ltCpACML7fdXHB5yWv4hyOUwfZlHd3r7YQUILKOBBMa17dlHyzYTjnwtCTq1QDLQgQwXQNIcIViCsM/s1600/gender.PNG" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<b><a href="http://demographics.mturk-tracker.com/#/yearOfBirth" target="_blank">Year of birth</a></b></div>
<div class="separator" style="clear: both; text-align: left;">
<b><br />
</b></div>
<div class="separator" style="clear: both; text-align: left;">
Roughly 50% of the workers were born in the 1980s and are around 30 years old. Approximately 20% of the workers were born in the 1990s, and another 20% in the 1970s. </div>
<div class="separator" style="clear: both; text-align: left;">
<b><br />
</b></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjYNfX5lWbXTRpgW61J1L1e35w_FAyOLVW2pFLWUcQRbeiYe34d5WEwT6W0iBGTfV-tvqhuyxej3pgBgwTrDPOfeGVpx5Dx0riX5U0292c-_FefMBpZzHCkzLQrYJQgUjijXTAywWiXpXQ/s1600/year-of-birth.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="248" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjYNfX5lWbXTRpgW61J1L1e35w_FAyOLVW2pFLWUcQRbeiYe34d5WEwT6W0iBGTfV-tvqhuyxej3pgBgwTrDPOfeGVpx5Dx0riX5U0292c-_FefMBpZzHCkzLQrYJQgUjijXTAywWiXpXQ/s1600/year-of-birth.PNG" width="640" /></a></div>
<b><a href="http://demographics.mturk-tracker.com/#/maritalStatus" target="_blank">Marital Status</a></b><br />
<br />
Approximately 40% of the workers are single, 40% are married, and 10% are cohabitating.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtM59YVnUSyJjxZMi1vGlW8LO3Dke4Ln-Vo_V6PXWadUj5nExSwZAqmx5fM_OAL10qCfdlG1m-cxENj84rSfF1jqLwgSI8VhdwoSka12LbWQmaolDWNQS4jruGyHZayJROUId_264AXl8/s1600/marital-status.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="232" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtM59YVnUSyJjxZMi1vGlW8LO3Dke4Ln-Vo_V6PXWadUj5nExSwZAqmx5fM_OAL10qCfdlG1m-cxENj84rSfF1jqLwgSI8VhdwoSka12LbWQmaolDWNQS4jruGyHZayJROUId_264AXl8/s1600/marital-status.PNG" width="640" /></a></div>
<br />
<b><a href="http://demographics.mturk-tracker.com/#/householdSize" target="_blank">Household Size</a></b><br />
<br />
Approximately 15% live alone. Then 25% have a household size of two and 25% have a household size of three. Around 25% live in a household of four, and around 10% have five or more members in their household.<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjT3j4p3qvgh9ZaTGvTqe9_AA2jSypay2CA9HH9qk6txtzWFEa4-nZty1hqMG8SIIm8l_aN2X8_UAduHbc_muKE3kCxvKKT_mmekd0-oGHbUdSdfafOBZhIXMK5MPhr69C_e-jZBVgNCIg/s1600/hh-size.PNG" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="262" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjT3j4p3qvgh9ZaTGvTqe9_AA2jSypay2CA9HH9qk6txtzWFEa4-nZty1hqMG8SIIm8l_aN2X8_UAduHbc_muKE3kCxvKKT_mmekd0-oGHbUdSdfafOBZhIXMK5MPhr69C_e-jZBVgNCIg/s1600/hh-size.PNG" width="640" /></a><br />
<b><a href="http://demographics.mturk-tracker.com/#/householdIncome" target="_blank">Income level</a></b><br />
<br />
The median household income is around \$50K per year for US Turkers, which is on par with the median US household income. Indian workers have considerably lower household incomes, with most of them around \$10K/yr.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiu5TbOGGPTFXgrMuu-HWvvKms-_D5UNIU_1qecmt7lYBvfG4GkysR5RYoDB5rvSuZJKO5ZQGUVbM4yHYbPgTLezy9lVqRGfmpTkTSwyu_agstXdu6HVKLiph-VfFxGv9za2tZoaoJnbgw/s1600/hh-income.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="226" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiu5TbOGGPTFXgrMuu-HWvvKms-_D5UNIU_1qecmt7lYBvfG4GkysR5RYoDB5rvSuZJKO5ZQGUVbM4yHYbPgTLezy9lVqRGfmpTkTSwyu_agstXdu6HVKLiph-VfFxGv9za2tZoaoJnbgw/s1600/hh-income.PNG" width="640" /></a></div>
<br />
<br />
<b>Next steps</b><br />
<b><br />
</b> In our next steps, we plan on making the (anonymized) survey responses available through an API, and potentially adding a few more graphs of interest. If you have any ideas or suggestions, please send them my way.<br />
<br /></div>
Panos Ipeirotishttp://www.blogger.com/profile/15283752183704062501noreply@blogger.comtag:blogger.com,1999:blog-7118563403027467631.post-45024943310557363512014-06-09T16:07:00.000-04:002014-06-09T16:21:33.254-04:00My Peer Grading Scheme<div style="text-align: justify;">
One of the components that I use in my class is student presentations. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
While I like having students present, I always had a hard time grading the presentations. Plus, many students seemed to target the presentation at me, trying to sound technical and advanced, leaving the rest of the class bored and uninterested.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
For that reason, I adopted a peer-grading scheme. Students present to the class and get rated by the class, not by me. (Although I still reserve a small degree of editorial judgment for assigning the grades.) Here is how my scheme works, after a few years of experience.</div>
<ol>
<li style="text-align: justify;"><b>Rating scale:</b> Students assign a grade from 0 to 10 to the presentations.</li>
<li style="text-align: justify;"><b>No self-grading:</b> Students do not grade their own presentations. (<i>Early on, there were students that were assigning 10 to themselves, and lower grade to everyone else. Now they can still grade themselves if they want but the grade is ignored</i>.)</li>
<li style="text-align: justify;"><b>Normalization:</b> All assigned grades are normalized, to have a zero mean and one standard deviation. (<i>This normalization was introduced to fight the problem where a student would try to game the system by assigning low grades to everyone else, hoping to lower the average rating of all other students</i>.)</li>
<li style="text-align: justify;"><b>Grade assignment:</b> The presentation grade is the average of the assigned normalized scores. Formally, each student $s_i$ assigns to presentation $t$ a grade $z(s,t)$. The overall grade of the presentation is the mean value $E[z(*,t)]$ of the $z(s_i,t)$ grades.</li>
<li style="text-align: justify;"><b>Ensuring careful grading by asking students to estimate class rating:</b> One problem with the peer grading scheme was that many students did not take it seriously enough, and assigned random grades (typically, the same grade to everyone). To avoid indifferent grading, I decided to give credit (~10%) based on the correlation of the assigned grades $z(s,t)$ against the mean value $E[z(*,t)]$ (across all presentations $t$). This ensured that students will at least try to figure out what other students will assign to the presentation, and will not assign random grades.</li>
<li style="text-align: justify;"><b>Separate assigned and estimated grades:</b> The problem with introducing the requirement to agree with the class was that some students believed to be better assessors than the rest of the class. So, they felt that their own grade was the correct one, and did not like losing credit for assigning their own "true" grade. To address that issue, I now ask students to assign two grades: their own grade $z_p(s,t)$, and an estimate of the class grade $z_c(s,t)$. The personal grade $z_p$ is used to compute $E(z(*,t)]$ in Step 4, and I use the $z_c$ to compute the correlation in Step 5. </li>
<li style="text-align: justify;"><b>Examine self-grading:</b> Given that the class-estimate grades are not directly used to grade a presentation, students are also asked to provide an estimate of their own grade as part of Step 6. Effectively, students are encouraged to estimate properly their own grade.</li>
</ol>
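The normalization and correlation-credit steps above can be sketched in a few lines of pandas. This is a minimal illustration with made-up grades, not the actual grading code; the matrix layout (graders as rows, presentations as columns, NaN for missing or dropped self-grades) is my assumption:

```python
import numpy as np
import pandas as pd

# Raw grades: rows = graders, columns = presentations, NaN = no grade
# (self-grades are dropped per Step 2 of the scheme).
raw = pd.DataFrame(
    [[np.nan, 8, 6], [7, np.nan, 9], [5, 6, np.nan]],
    index=["s1", "s2", "s3"], columns=["t1", "t2", "t3"],
)

# Step 3: normalize each grader's scores to zero mean and unit std.
z = raw.sub(raw.mean(axis=1), axis=0).div(raw.std(axis=1), axis=0)

# Step 4: a presentation's grade is the mean normalized score E[z(*, t)].
grades = z.mean(axis=0)

# Step 5: credit each grader by correlating their grades with the class mean.
credit = z.T.corrwith(grades)
print(grades, credit, sep="\n")
```

In the real scheme, the correlation in Step 5 would use the separate class-estimate grades $z_c$ rather than the personal grades shown here.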
<div style="text-align: justify;">
The only thing that I have not tried so far is to modify Step 4 to take into consideration the different correlations from Step 5, effectively weighting each student's grades based on their correlation with the rest of the class. However, most students tend to exhibit the same, moderate agreement with the class (typical correlation values are in the 0.4-0.6 range, after rating 15-20 presentations), so in practice I do not expect to see a difference.<br />
<br />
Overall, I am pretty happy with the scheme. Students indeed try to impress the class (and not me), and many presentations are interesting, interactive, and engaging. The grades are also very consistent with the overall feeling that I get for each presentation, so I did not have to exercise my "editorial oversight" and adjust the grade very often (only in a couple of cases, where the students ran into technical problems during the presentation). I would be really interested in trying this scheme in one of the big MOOC classes that use peer grading, to see if it can instill the same sense of responsibility in peer grading. </div>
Panos Ipeirotishttp://www.blogger.com/profile/15283752183704062501noreply@blogger.comtag:blogger.com,1999:blog-7118563403027467631.post-57913936752130169002014-04-01T15:17:00.000-04:002014-04-28T21:56:39.234-04:00Online Markets: Selling products vs. selling time<div style="text-align: justify;">
We had an interesting discussion a few days back about online job markets, and why they are not a <b><i>huge </i></b>success so far, when other, comparatively less important products are getting huge valuations and visibility. For example, oDesk <a href="https://www.odesk.com/info/riseofonlinework/">reached a total transaction volume of a billion dollars</a> over the 10 years of its existence, and roughly 5% to 10% of that volume becomes revenue for the company. Other labor marketplaces typically report even smaller numbers.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
While nobody can ignore a billion dollars of transaction volume, I am puzzled why this number has not skyrocketed. It is very clear that the market serves a purpose: work is a trillion-dollar industry. Allowing people to work online allows for better and more efficient access to human capital, alleviates the need for immigration, and improves the lives of the people involved. It is a no-brainer.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Why does it take so long for online work to take off? What is missing?</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: center;">
<div style="text-align: center;">
***</div>
</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
I was puzzled by these questions for a long time. <a href="http://www.behind-the-enemy-lines.com/2013/07/online-labor-markets-why-they-cant.html">I postulated that there are obstacles that prevent <b><i>employers </i></b>from hiring online</a>, but recently I got some hints that there are obstacles on the worker side as well. I talked with some friends of mine back in Greece, who are making a very comfortable living working through the platform. I asked how they liked making US salaries while living in Greece, and their answer was surprising. <b>They did not see online work as a long-term solution, but rather as a temporary gig.</b></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
When I asked why, they both indicated the same problem: There is no room in such markets for career evolution. You end up selling your time, and time is not something that scales. It is very hard to grow your business when you are always a freelancer, without the ability to hire new people, delegate tasks, and build a business. Compare now online work with a market like Amazon and eBay. Both Amazon and eBay allow sellers to effectively build businesses. Currently, online job markets allow workers to just sell their time.</div>
<div>
<div style="text-align: justify;">
<br /></div>
</div>
<div>
<div style="text-align: justify;">
<b>When sellers have a capped growth, the market faces headwinds of growth as it tries to reach maturity.</b></div>
</div>
<div style="text-align: center;">
<div style="text-align: center;">
***</div>
</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
On a general note, this gives birth to a general hypothesis on what can make a marketplace (hugely) successful: <b>The market should allow sellers to grow, without an obvious ceiling. Otherwise, the best sellers are unlikely to be attracted to participate in the platform, due to the lack of upside.</b></div>
<br />
<div style="text-align: justify;">
Take some marketplace companies and interpret them through this framework:</div>
<ul>
<li style="text-align: justify;"><b>Google Helpouts: </b>Same restrictions on seller growth as all other job marketplaces.</li>
<li style="text-align: justify;"><b>Uber</b>: Obviously, currently the sellers have a cap on growth, which is limited by their time. However, Uber allows the enrollment of limo/taxi agencies, which potentially grow indefinitely.</li>
<li style="text-align: justify;"><b>AirBnB:</b> No obvious seller cap for someone who wants to enter the hospitality business.</li>
<li style="text-align: justify;"><b>TaskRabbit:</b> Very obvious growth cap for the individual sellers of services.</li>
<li style="text-align: justify;"><b>OpenTable</b>: No obvious limit of growth for participating restaurants.</li>
<li style="text-align: justify;"><b>eBay/Amazon</b>: No obvious limit of growth for sellers that sell products online</li>
<li style="text-align: justify;"><b>Etsy</b>: This is an interesting case. On the surface, the company looks like eBay/Amazon. However, the etsy guidelines dictate that "<i><a href="http://www.etsy.com/help/article/483">Everything on Etsy must be Handmade, Vintage, or a Craft Supply</a>.</i>" Unfortunately, this places restrictions on seller growth as it implicitly limits sellers to be (very) small businesses. My bet is that either Etsy will revise this policy down the road, once more and more sellers start hitting their growth ceiling.</li>
</ul>
<div>
<div style="text-align: justify;">
How accurate is the hypothesis? Time will tell...</div>
</div>
Panos Ipeirotishttp://www.blogger.com/profile/15283752183704062501noreply@blogger.comtag:blogger.com,1999:blog-7118563403027467631.post-81764629114212592072014-01-22T06:00:00.000-05:002014-01-22T15:11:56.999-05:00Future of Education: Fighting Obesity or Fighting Hunger?<div style="text-align: justify;">
I have been following with interest the discussion about the future of education.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: center;">
<div style="text-align: center;">
***</div>
</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Some people criticize existing educational institutions, indicating that they offer little in terms of real training, and that real learning occurs outside the classroom, by actually doing. "Nobody learns how to build a system in a computer science class." "Nobody learns how to build a company in an entrepreneurship program."</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Others are lamenting that by shifting to training-oriented schemes, we are losing the ability to offer deeper education, on topics that are not marketable. Who is going to study poetry if it has no return on investment? Who is going to teach literature if there is no demand for it?</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
These two criticisms seem to be pushing in two different directions.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: center;">
<div style="text-align: center;">
***</div>
</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
In reality, we need to address two different needs:</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
One need is to truly democratize education, taking the content of the top courses and making it accessible and available to everyone. People who want to learn machine learning can now take courses from top professors, instead of having to read a book. People can now advance their careers easily, without having to enroll in expensive degree programs.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The other need is to preserve the breadth of education, shielding it from market forces. This means preserving a structure in which students are exposed to diverse fields during their education, whether or not there is market demand for those fields.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: center;">
<div style="text-align: center;">
***</div>
</div>
<div style="text-align: center;">
<div style="text-align: justify;">
<br /></div>
</div>
<div style="text-align: justify;">
This tension reminded me of the debate about genetically modified foods.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Mass production of food has pretty much solved the problem of world hunger. A few decades ago, famine was a real problem in many areas of the world, due to the inability to produce enough food to feed a growing population: floods, droughts, and diseases disrupted production, resulting in shortages. Today, advances in agriculture allow the abundant production of grains and food: wheat and rice varieties are now robust, resistant to diseases, and adaptable to many different climates, allowing us to feed the world.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The advances that solved the problem of world hunger ended up creating other problems. Processed carbohydrates are causing obesity, diabetes, gout, and many other "luxury" diseases in the developed world. The poor in the developed world are not dying because they are hungry; they are dying because they starve themselves of essential ingredients in their diet.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: center;">
***</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The parallels are striking. The MOOCs, Khan Academies, and Code Academies of the world are the genetically modified foods for those living in the "third world of education". These courses may not be the most nutritious, and they may not provide all the "nutrition" for a complete education. However, the choice for many of these people in the "third world of education" is not <i>Stanford vs. a Coursera MOOC</i>. It is <i><b>nothing </b>vs. a Coursera MOOC</i>. Given that choice, take the MOOC every time.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Those who live in the "developed world of education" can be pickier. They may have access to the genetically modified MOOCs, but if they can afford it, the organic, artisanal, locally sourced education can potentially be better than the mass-produced MOOC. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: center;">
***</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Horses for <i>courses </i>(pun intended).</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<br /></div>
Panos Ipeirotishttp://www.blogger.com/profile/15283752183704062501noreply@blogger.comtag:blogger.com,1999:blog-7118563403027467631.post-45932186964010055512014-01-20T19:57:00.000-05:002014-02-03T10:53:46.123-05:00Crowdsourcing research: What is really new?<div style="text-align: justify;">
A common question that comes up when discussing research in crowdsourcing is how it compares with similar efforts in other fields. Having discussed these comparisons a few times, I thought it would be good to collect them in a single place.</div>
<ul>
<li style="text-align: justify;"><b>Ensemble learning</b>: In machine learning, you can generate a large number of "weak classifiers" and then build a stronger classifier on top. In crowdsourcing, you can treat each human as a weak classifier and then learn on top. What is the difference? In crowdsourcing, each judgement has a cost. With ensembles, you can trivially easy create 100 weak classifiers, classify each object, and then learn on top. In crowdsourcing, you have a cost for every classification decision. Furthermore, you cannot force every person to participate, and often you have a heavy-tailed participation: A few humans participate a lot, but from many of them we get only a few judgments.</li>
<li style="text-align: justify;"><b>Quality assurance in manufacturing</b>: When factories create batches of products, they also have a sampling process where they examine the quality of the manufactured products. For example, a factory creates light bulbs, and wants 99% of them to be operating. The typical process involves setting aside a sample for testing and testing if they meet the quality requirement. In crowdsourcing, this would be equivalent to verifying, with gold testing or with post-verification, the quality of each worker. Two key differences: The heavy-tailed participation of workers means that gold-testing each person is not always efficient, as you may end up testing a user a lot, and the the user may leave. Furthermore, it is often the case that a sub-par worker can still generate somewhat useful information, while for tangible products, the product is either acceptable or not.</li>
<li style="text-align: justify;"><b>Active learning:</b> Active learning assumes that humans can provide input to a machine learning model (e.g., disambiguate an ambiguous example) and the answers are assumed to be perfect. In crowdsourcing this is not the case, and we need to explicitly take the noise into account.</li>
<li style="text-align: justify;"><b>Test theory and Item Response Theory:</b> Test theory focuses on how to infer the skill of a person through a set of questions. For example, to create a SAT or GRE test, we need to have a mix of questions of different difficulties, and we need to whether these questions really separate the persons that have different abilities. Item Response Theory studies exactly these questions, and based on the answers that users give to the tests, IRT calculates various metrics for the questions, such as the probability that a user of a given ability will answer correctly the question, the average difficulty of a question, etc. Two things make IRT unapplicable directly to a crowdsourcing setting: First, IRT assumes that we know the correct answer to each question; second, IRT often requires 100-200 answers to provide robust estimates of the model parameters, a cost that is typically too high for many crowdsourcing applications (except perhaps the citizen science and other volunteer based projects).</li>
<li style="text-align: justify;"><b>Theory of distributed systems: </b>This <a href="http://www.amazon.com/Introduction-Distributed-Algorithms-Gerard-Tel/dp/0521794838" target="_blank">part </a>of <a href="http://groups.csail.mit.edu/tds/distalgs.html" target="_blank">CS theory</a> is actually much closer to many crowdsourcing problems than many people realize, especially the work on asynchronous distributed systems, which attempts to solve many coordination problems that appear in crowdsourcing (e.g. agree on an answer). The work on analysis of byzantine systems, which explicitly acknowledges the existence of malicious agents, provides significant theoretical foundations for defending systems against spam attacks, etc. One thing that I am not aware of, is the explicit dealing of noisy agents (as opposed to malicious ones), and I am not aware of any study of incentives within that context that will affect the way that people answer to a given question.</li>
<li style="text-align: justify;"><b>Database systems and User-defined-functions (UDFs)</b>: In databases, a query optimizer tries to identify the best way to execute a given query, trying to return the correct results as fast as possible. An interesting part of database research that is applicable to crowdsourcing is the inclusion of user-defined-functions in the optimization process. A User-Defined-Function is typically a slow, manually-coded function that the query optimizer tries to invoke as little as possible. The ideas from UDFs are typically applicable when trying to optimize in a human-in-the-loop-as-UDF approach, with the following caveats: (a) UDFs were considered to be return perfect information, and (b) the UDFs were assumed to have a deterministic or a stochastic but normally distributed execution time. The existence of noisy results and the fact that execution times with humans can be often long-tailed make the immediate applicability of UDF research in optimizing crowdsourcing operations rather challenging. However, it is worth reading the related chapters about UDF optimization in the database textbooks.</li>
<li style="text-align: justify;"><b>(Update) Information Theory and Error Correcting Codes: </b>We can model the workers are noisy channels, that get as input the true signal and return back a noisy representation. The idea of using advanced error correcting codes to improve crowdsourcing is rather underexplored, imho. Instead we rely too much on redundancy-based solutions, although pure redundancy has been theoretically proven to be a suboptimal technique for error correction. (<a href="http://www.behind-the-enemy-lines.com/2013/07/majority-voting-and-information-theory.html">See an earlier, related blog post.</a>) Here are a couple of potential challenges: (a) The errors of the humans are very rarely independent of the "message" and (b) It is not clear if we can get humans to compute properly functions that are commonly required for the implementation of error correcting codes. See a related e</li>
<li style="text-align: justify;"><b>(Update) Information Retrieval and Interannotator Agreement: </b>In information retrieval, it is very common to examine the agreement of the annotators when labeling the same set of items. My own experience with reading the literature, and the related metrics is that they implicitly assume that all workers have the same level of noise, an assumption that is often violated in crowdsourcing.</li>
</ul>
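To make the contrast with ensemble learning concrete, here is a minimal sketch of redundancy-based label aggregation in which, unlike the "free" weak classifiers of an ensemble, every human judgment carries a cost. The worker accuracies and the per-judgment cost are illustrative assumptions, not figures from any real platform.

```python
import random
from collections import Counter

def majority_vote(labels):
    """Aggregate noisy worker labels by plurality vote."""
    return Counter(labels).most_common(1)[0][0]

def crowd_label(true_label, worker_accuracies, cost_per_judgment=0.05, rng=random):
    """Collect one binary judgment (0/1) per worker and aggregate.

    Each worker answers correctly with her own probability; unlike an
    ensemble of classifiers, every judgment we collect costs money.
    """
    labels = [true_label if rng.random() < acc else 1 - true_label
              for acc in worker_accuracies]
    return majority_vote(labels), cost_per_judgment * len(labels)
```

Adding redundancy improves the aggregate answer, but the cost grows linearly with the number of judgments collected, which is exactly the trade-off that is absent in classic ensemble learning.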
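On the Item Response Theory point above: the standard two-parameter logistic (2PL) model gives the probability that a person of a given ability answers an item correctly. A minimal sketch (ability and difficulty live on the same latent scale; any values passed in are illustrative):

```python
import math

def irt_2pl(ability, difficulty, discrimination=1.0):
    """Two-parameter logistic IRT model: probability that a person of
    the given ability answers an item of the given difficulty correctly."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))
```

When ability equals difficulty the probability is 0.5, and a higher discrimination parameter makes the item separate nearby abilities more sharply; fitting these parameters robustly is what requires the 100-200 answers per item mentioned above.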
<div>
<div style="text-align: justify;">
Are there other fields, and other caveats, that should be included in the list?</div>
</div>
Panos Ipeirotishttp://www.blogger.com/profile/15283752183704062501noreply@blogger.comtag:blogger.com,1999:blog-7118563403027467631.post-27523476172103968712013-10-17T17:48:00.000-04:002013-10-17T17:58:36.285-04:00Badges and the Lake Wobegon effect<br />
<div style="text-align: justify;">
For those not familiar with the term, the <a href="http://en.wikipedia.org/wiki/Lake_Wobegon#The_Lake_Wobegon_effect">Lake Wobegon effect</a> is observed when all or nearly all of a group claim to be above average; it comes from the fictional town where "all the women are strong, all the men are good looking, and all the children are above average."</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Interestingly enough, as Wikipedia states, this effect of the majority of the group thinking that they are performing above-average "<i>has been observed among drivers, CEOs, hedge fund managers, presidents, coaches, radio show hosts, late night comedians, stock market analysts, college students, parents, and state education officials, among others</i>."</div>
<br />
<div style="text-align: justify;">
So, a natural question was whether this effect also appears in an online labor setting. We took some data from an online certification company, similar to <a href="http://smarterer.com/">Smarterer</a>, where people take tests to show how well they know a particular skill (e.g., Excel, audio editing, etc.). The tests are not pass/fail but more like a GRE/SAT score: there is no "passing" score, only a percentile indicator that shows what percentage of other participants have a lower score. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Interestingly enough, we noticed a Lake Wobegon effect there as well: most of the workers who displayed the badge of achievement had scores above average, giving yet another data point for the Lake Wobegon effect.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Of course, this does not mean that all users who took the test performed above average. Test takers can choose to make their final score public to the world or keep it private. Given that the user's profile is also used on a site where employers look for potential hires, there is an element of strategic choice in whether the test score is visible. Having a low score is often worse than having no score at all.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
So, we wanted to see what scores make users comfortable with their performance and incentivize them to display their badge of achievement. <a href="http://people.stern.nyu.edu/mk3539/">Marios</a> analyzed the data and compared the distribution of scores for workers who kept their scores private with that of workers who made their performance public. Here is the outcome:</div>
<div style="text-align: justify;">
<br /></div>
<div class="separator" style="clear: both; text-align: justify;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhWUHYSvFZKPeqY4xkGhFTTNpPzXsaw1QLObw7OBrOWRXF7sGyWrz9XrNX7YPL3ILtb7IB7UfwmGyF2TJU511q1ES69sEPy70SJMsX9I1PW6nBqs38kGgRT-dyX7kjsqEx9HtxeG8zbpDc/s1600/public-private-test-result-ExpertRating.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="488" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhWUHYSvFZKPeqY4xkGhFTTNpPzXsaw1QLObw7OBrOWRXF7sGyWrz9XrNX7YPL3ILtb7IB7UfwmGyF2TJU511q1ES69sEPy70SJMsX9I1PW6nBqs38kGgRT-dyX7kjsqEx9HtxeG8zbpDc/s640/public-private-test-result-ExpertRating.PNG" width="600" /></a></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
It becomes clear that scores below 50% are rarely posted, while scores that exceed 60% have significantly higher odds of being posted online for the world to see. This becomes even clearer if we take the log-odds of a worker deciding to make the score public, given the achieved percentile:</div>
<div style="text-align: justify;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiukmhqTUPK-F-acg0hpz5vCOip6KASAtFSLO98j2IVsL66v4CsolsVxIRgEQXUqY9oHBBawly_dOK44zS0cBNgmLvBBfsfB6FFk09p5K_W-xOCyeJPt3heYxg1wuhbnZD2ajPGMaLZo1o/s1600/public-private-test-result-ExpertRating-logodds.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="450" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiukmhqTUPK-F-acg0hpz5vCOip6KASAtFSLO98j2IVsL66v4CsolsVxIRgEQXUqY9oHBBawly_dOK44zS0cBNgmLvBBfsfB6FFk09p5K_W-xOCyeJPt3heYxg1wuhbnZD2ajPGMaLZo1o/s640/public-private-test-result-ExpertRating-logodds.PNG" width="600" /></a></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
So, in the world of online labor, if you ever hire someone who chose to display a certification, chances are good that you picked a worker who is better than average, at least on the test. (We have some other results on the predictive power of tests in terms of work performance, but this is a topic that cannot fit into the margins of this blog post :-)</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Needless to say, this effect illustrates a direction that can take crowdsourcing, and labor markets in general, out of the race-to-the-bottom, market-for-lemons-style pricing, where only price separates the various workers. Just as educational history serves as signaling for the potential quality of an employee in an offline setting, we are going to see more and more globally recognized certifications replacing educational history for many online workers.</div>
Panos Ipeirotishttp://www.blogger.com/profile/15283752183704062501noreply@blogger.comtag:blogger.com,1999:blog-7118563403027467631.post-22689705786155842632013-09-11T08:31:00.000-04:002013-09-11T08:31:39.743-04:00CrowdScale workshop at HCOMP 2013<div>
A public service announcement, to advertise CrowdScale (<a href="http://www.crowdscale.org/">http://www.crowdscale.org/</a>) a cool workshop at <a href="http://www.humancomputation.com/2013/">HCOMP 2013</a> that focuses on challenges that people face when applying crowdsourcing at scale.</div>
<div>
<br /></div>
<div>
A couple of interesting twists on the classic workshop recipe:</div>
<div>
<ul>
<li>The workshop invites submission of short (2-page) position papers <b>which identify and motivate key problems</b> or potential approaches for crowdsourcing at scale, <b>even if there aren’t satisfactory solutions proposed.</b> (Deadline: October 4)</li>
<li>Second, there is a <a href="http://www.crowdscale.org/shared-task">shared task challenge</a>, which also carries a cool $1500 reward for the winner.</li>
</ul>
</div>
<div>
The CfP follows:</div>
<div>
<br /></div>
<div>
<blockquote>
<i>Crowdsourcing at a large scale raises a variety of open challenges:</i><ul>
<li><i>How do we programmatically measure, incentivize and improve the quality of work across thousands of workers answering millions of questions daily? </i></li>
</ul>
<ul>
<li><i>As the volume, diversity and complexity of crowdsourcing tasks increase, how do we scale the hiring, training and evaluation of workers? </i></li>
</ul>
<ul>
<li><i>How do we design effective elastic marketplaces for more skilled work? </i></li>
</ul>
<ul>
<li><i>How do we adapt models for long-term, sustained contributions rather than ephemeral participation of workers?</i></li>
</ul>
<i>We believe tackling <a href="http://www.crowdscale.org/position-paper-ideas">such problems</a> will be key to taking crowdsourcing to the next level – from its uptake by early adopters today, to its future as how the world’s work gets done.</i><br />
<i>To advance the research and practice in crowdsourcing at scale, our workshop <a href="http://www.crowdscale.org/cfp">invites position papers</a> tackling such issues of scale. In addition, we are organizing a <a href="http://www.crowdscale.org/shared-task">shared task challenge</a> regarding how to best aggregate crowd labels on large crowdsourcing datasets released by Google and CrowdFlower.<br />Twitter: <a href="https://twitter.com/search?q=%23crowdscale">#crowdscale</a> • <a href="https://twitter.com/CrowdAtScale">@CrowdAtScale</a><br />Organizers</i><ul>
<li><i><a href="http://www.linkedin.com/in/tatianajosephy">Tatiana Josephy</a> (<a href="https://twitter.com/tatianajosephy">@tatianajosephy</a>), CrowdFlower </i></li>
</ul>
<ul>
<li><i><a href="https://www.ischool.utexas.edu/~ml/">Matthew Lease</a> (<a href="https://twitter.com/mattlease">@mattlease</a>), University of Texas at Austin </i></li>
</ul>
<ul>
<li><i><a href="http://research.google.com/pubs/PraveenParitosh.html">Praveen Paritosh</a> (<a href="https://twitter.com/heuristicity">@heuristicity</a>), Google </i></li>
</ul>
</blockquote>
</div>
Panos Ipeirotishttp://www.blogger.com/profile/15283752183704062501noreply@blogger.comtag:blogger.com,1999:blog-7118563403027467631.post-67947261696359144612013-07-31T09:30:00.000-04:002013-07-31T09:30:03.419-04:00Online labor markets: Why they can't scale and the crowdsourcing solution.<div style="text-align: justify;">
<a href="http://www.behind-the-enemy-lines.com/2012/07/the-disintermediation-of-firm.html" target="_blank">I am a big proponent of outsourcing work using online labor markets</a>. Over the last decade I outsourced hundreds of projects, ranging from simple data entry to big, complex software products. I learned to create project specs, learned how to manage contractors, and learned how to keep projects moving forward. In general, I consider myself competent in managing distributed teams and projects.<br />
<br />
I have also met and talked with many people who share my passion for this style of work. We discuss strategies for hiring, for managing short- and long-term projects, for pricing, for handling legal risks, and other topics of interest. After many such discussions, I reached a striking conclusion: <b>Everyone has a completely different style of managing this process.</b><br />
<br />
<b>This plurality of "best practices" is a bad thing.</b> Having too many best practices means that there are no best practices. The lack of consensus makes it impossible to effectively teach a newcomer how to handle the process. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<b>The problem with manual hiring in online labor markets</b></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
People that want to use contractors for their projects face the following problems:</div>
<ul>
<li style="text-align: justify;"><b>Few people know what they want:</b> Just for fun, go and check random projects on oDesk, eLance, and Freelancer. An endless list of poorly described projects, requests for "clone of Facebook" for $500, and a lot of related crap. It is not a surprise that many of these projects remain open for ever.</li>
<li style="text-align: justify;"><b>Few people know how to hire:</b> Ask any startup CEO how easy is to hire an employee. It is a pain. The art and craft of inferring the match of an individual to a given task is a very hard problem. Few people know how to do it right. Even within Google and Microsoft, with their legendary interviewing processes, interviewing is seen by many as a hard, time-consuming, and unrewarding experience.</li>
<li style="text-align: justify;"><b>Few people know how to manage a project: </b>Even fewer people know how to manage a project. The harrowing fact is that most people believe that they can. Most people hire someone, hoping that the employee will be in their head, will understand what these vague specifications mean, will know everything that is not documented in a project, and will be able to do a great job. Very few people realize that outsourcing a project means that you will need to spend significant amount of time <i>managing </i>the project.</li>
</ul>
<div style="text-align: justify;">
The result of the combination of these factors? <b>Online labor does not scale through manual hiring. </b>(Of course, this is not unique to online outsourcing; offline hiring has the same problem.) There are simply not enough qualified employers who can hire effectively to create the demand for jobs that online labor markets need in order to continue to grow.<br />
<br />
<b>Online hiring vs online shopping</b><br />
<br />
The counter-argument is that labor was always like that. Since the market for labor operates "manually," the transition to electronic hiring will allow for growth. In the same way that people who were initially afraid of shopping online eventually started buying things online, they are going to switch to hiring online.<br />
<br />
I do not buy this argument. When people buy an item online, they buy a standardized product. They are not ordering a bespoke item created to the customer's specifications. Customization is typically limited to a specific set of dimensions. You can customize your Mac to have a better processor, more memory, and a larger hard disk. But you cannot order a laptop with a 19-inch screen, and you cannot ask for 96 GB of memory. <br />
<br />
But in online labor markets this is exactly what happens. A random customer comes and asks for a web application ("just the functionality of the X website") and wants the app built for $500. It is as if someone went to a computer store and asked for a laptop with a 19-inch screen, 128 GB of memory, and a 10 TB disk. And, since <a href="http://www.amazon.com/Upgrade-OptiPlex-Desktop-DDR2-400-PC2-3200/dp/B002OFNN6C" target="_blank">1 GB of memory costs 7 dollars</a>, it is reasonable to just pay $1000 for 128 GB, right?</div>
<div style="text-align: justify;">
<br />
<b>Lessons from online shopping</b><br />
<br /></div>
<div style="text-align: justify;">
Based on the experience of shopping's transition from offline to online, let's see how online labor can move forward. </div>
<ol>
<li style="text-align: justify;"><b>Standardize and productize</b>: Currently, in online markets, most people ask for a specific set of tasks. Content generation, website authoring, transcriptions, translations, etc. Many of these can be "productized" and be offered as standardized packages, perhaps with a few pre-set customizations available. (Instead of "select the hard disk size, you have a "select blog post length".) This vertical-oriented strategy is followed by many crowdsourcing companies and offers to the client a clean separation from the process of hiring and managing a task. This vertical strategy works well to create small offerings but it is not clear if there is sufficient demand within each vertical to fuel the growth expected for a startup. This is a topic for a new blog post.</li>
<li style="text-align: justify;"><b>Productize the project creation/management: </b>When a standardized offering is not sufficient, the client is directed into hiring a product manager that will spec the requirements, examine if there is sufficient supply of skills in the market, hire individual contractors, manage the overall process, etc. This is similar to renovating a house. The delivered product is often completely customized, but the client does not seek to hire separately electricians, carpenters, painter, etc. Instead, the owner hires a "general contractor" who creates the master plan for the renovation, procures the materials, hires subcontractors, etc. While it eases some of the problems, this is a process suitable only for reasonably big project.</li>
<li style="text-align: justify;"><b>Become a staffing agency</b>: A problem with all existing marketplaces is that they are not acting as employers, but only as matching agents. Few, if any, marketplaces are guaranteeing quality. Every transaction is a transaction between "consenting adults." Unfortunately, very few potential employers understand that, and hire with the implicit assumption that the marketplace is placing a guarantee on the quality of the contractors. So, if the contractor ends up being unqualified for the task, there is very little recourse. By guaranteeing quality, the employer (who is the one spending the money) gets some minimum level of guarantee about the deliverable. Unfortunately, providing such quality guarantees is easier said than done.</li>
<li style="text-align: justify;"><b>Let contractors build offerings</b>: By observing the emergence of marketplaces like Etsy, you can see that people are becoming more comfortable with ordering semi-bespoke, handcrafted items online, for which they have little information. A potential route is to allow the contractors in online markets to build such "labor products" and price them themselves, in the same way that Etsy sellers are putting up their handcrafted stuff online.</li>
</ol>
<div style="text-align: justify;">
All these approaches are fine, and I expect most current marketplaces to adopt one or more of these strategies over time. However, all of them rely on the same assumption: That hiring, as shopping, will be a human activity. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
What happens, though, if we stop assuming that hiring is a human-mediated effort?</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<b>Crowdsourcing practices to the rescue</b></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
I will not pretend that the current state of the crowdsourcing industry offers concrete solutions to the problems listed above. But today's efforts in crowdsourcing move us towards an <b>algorithmically-mediated work environment.</b></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<b>Of course, like all automatic solutions, the initial environment is much worse than "traditional" approaches.</b> We see that in all the growing pains of Mechanical Turk. It is often easier to just hire a couple of trusted virtual assistants from oDesk to do the job, instead of trying to implement the full solution stack to get things done properly on MTurk.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
However, the initial learning curve starts paying off later. Production environments that rely on a "crowd" <b>need</b> to automate the hiring and management of workers as much as possible. This automation makes the tasks much more scalable than traditional hiring and project management: high startup costs, then lower marginal costs for adding workers to a process.<br />
<br />
This leads to easier scalability. Of course, the moment the benefits of easier scalability start becoming obvious, it will be too late for players that rely on manual hiring to catch up. It is one of the reasons that I believe that Mechanical Turk has the potential to be the major labor platform, even if this seems a laughable proposition at this point. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
I will make a prediction: Crowdsourcing is currently at the forefront of defining the methods and practices in the workplace for the next few decades. Assembly lines and the integration of machines into the work environment led to the mass production revolution of the 20th century. The current crowdsourcing practices will define how the majority of people are going to work on knowledge tasks in the future. A computer process will monitor and manage the working process, and manual hiring will soon be a thing of the past for many "basic" knowledge tasks.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Some will find this prospect frightening. I do not find it any more frightening than having traffic lights regulate traffic at intersections, or having the autopilot take care of my flight.</div>
Panos Ipeirotishttp://www.blogger.com/profile/15283752183704062501noreply@blogger.comtag:blogger.com,1999:blog-7118563403027467631.post-68448262410035138672013-07-28T19:43:00.000-04:002013-07-29T22:04:34.764-04:00Crowdsourcing and information theory: The disconnect<div style="text-align: justify;">
In crowdsourcing, redundancy is a common approach to ensure quality. One of the questions that arises in this setting is the question of equivalence. Let's assume that a worker has a known probability $q$ of giving a correct answer, when presented with a choice of $n$ possible answers. If I want to simulate one high-quality worker of quality $q$, how many workers of quality $q' < q$ do I need?</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<b>Information Theory and the Noisy Channel Theorem</b></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Information theory, and the noisy channel theorem, can give an answer to the problem: Treat each worker as a noisy channel, and measure the "capacity" of each worker. Then, the sum of the capacities of the different workers should give us the equivalent capacity of a high-quality worker.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
We have that the capacity $C(q,n)$ of a worker with quality $q$, who returns the correct answer with probability $q$, when presented with $n$ choices, is:</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
$C(q,n) = H(\frac{1}{n}, n) - H(q, n)$</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
where $H(q,n) = -q \cdot \log(q) - (1-q) \cdot \log(\frac{1-q}{n-1}) $ is the entropy (aka uncertainty) of the worker. </div>
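The capacity formula above is easy to compute numerically. Here is an illustrative Python sketch (the function names are my own, not from any toolkit), using the convention $0 \cdot \log(0) = 0$:

```python
import math

def entropy(q, n):
    """H(q, n): uncertainty of a worker who is correct with probability q
    among n choices, with errors spread uniformly over the n-1 wrong answers.
    Uses the convention 0 * log(0) = 0."""
    h = 0.0
    if q > 0:
        h -= q * math.log2(q)
    if q < 1:
        h -= (1 - q) * math.log2((1 - q) / (n - 1))
    return h

def capacity(q, n):
    """C(q, n) = H(1/n, n) - H(q, n), in bits per answer."""
    return math.log2(n) - entropy(q, n)
```

For instance, `capacity(1, 4)` gives 2 bits, `capacity(0.85, 4)` about 1.15 bits, and `capacity(0.25, 4)` gives 0: a worker who guesses uniformly among 4 answers carries no information.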
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<b>Examples</b></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The value $H(\frac{1}{n}, n) = \log(n)$ is the initial entropy that we have for the question, when no answers are given. Intuitively, when $n=2$, the initial uncertainty is equal to $\log(2)=1$ bit, since we need one bit to describe the correct answer out of the 2 available. When $n=4$, the uncertainty is equal to $\log(4)=2$ bits, as we need 2 bits to describe the correct answer out of the 4 available.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Therefore, a perfect worker, with quality $q=1$ will have $H(1,n)=0$ entropy, and therefore the capacity of a perfect worker is $\log(n)$.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<b>Can Two Imperfect Workers Simulate a Perfect One?</b></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Now, here comes the puzzle. Assume that we have $n=4$: the workers have to choose among 4 possible answers. We also have two workers with $q=0.85$, who select the correct answer out of the 4 available with 85% probability. Each of these workers has capacity $C(0.85, 4) = 1.15$ bits. At the same time, we have one perfect worker with $q=1$. This worker has a capacity of $C(1,4)=2$ bits. So, in principle, the two noisy workers are sufficient to simulate a perfect worker (and would leave a remaining 0.3 bits to use :-)</div>
<div style="text-align: justify;">
<br />
<b>What am I missing?</b></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
My problem is that I do not see how to reach this theoretical limit. I cannot figure out how to use these two workers with $q=0.85$ in order to reconstruct the correct answer. Asking the two workers to work in parallel will not cut it (it is still possible for both workers to agree and be incorrect). Sequential processing (first have one worker narrow the four answers down to two, then have the second worker pick the correct one of those two) seems more powerful, but again I do not understand how to operationalize it.<br />
<br />
According to information theory, these two $q=0.85$ workers are equivalent, on average, to one perfect $q=1.0$ worker. (Actually, they seem to carry even more information.) And even if we give up on perfection and set the target quality at $q=0.99$, we have $C(0.99,4)=1.9$. I still cannot see how to combine two workers with 85% accuracy to simulate a 99% accurate worker.</div>
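A small Monte Carlo sketch (my own illustration, hypothetical code) makes the difficulty concrete: two independent $q=0.85$ workers agree on only about 73% of the questions, and while their agreed answers are about 99% accurate, the remaining ~27% of the questions stay unresolved, so simple parallel agreement alone cannot deliver a 99%-accurate answer for every question.

```python
import random

def worker_answer(truth, q=0.85, n=4):
    """Simulate one worker: correct with probability q, otherwise a
    uniformly random wrong answer among the other n-1 choices."""
    if random.random() < q:
        return truth
    wrong = [a for a in range(n) if a != truth]
    return random.choice(wrong)

def agreement_stats(trials=100_000, q=0.85, n=4):
    """Fraction of questions where two workers agree, and the accuracy
    of the agreed answer on those questions."""
    agree = agree_correct = 0
    for _ in range(trials):
        truth = random.randrange(n)
        a, b = worker_answer(truth, q, n), worker_answer(truth, q, n)
        if a == b:
            agree += 1
            agree_correct += (a == truth)
    return agree / trials, agree_correct / agree
```

Analytically, the agreement rate is $q^2 + (1-q)^2/(n-1) = 0.73$, and the conditional accuracy is $0.7225/0.73 \approx 0.99$; the simulation should reproduce both.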
<div style="text-align: justify;">
<br />
<ul>
<li><b><i>Update 1</i></b> (thanks to Michael Nielsen): Information theory operates over a large amount of transmitted information, so posing the question as "answering a single question" makes it sound more impossible than it should.<br /><br />We need 2 bits of information to transfer the answer for a multiple choice question with n=4 choices. Say that we have a total of N such questions. So, we need 2N bits to transfer perfectly all the answers. If we have perfect workers, with $q=1$, we have that $C(1,4)=2$, and we need 2N bits / 2 bits/answer = N answers from these workers.<br /><br />Now, let's say that we have workers with $q'=0.85$. In that case $C(0.85, 4) = 1.15$ bits per answer. Therefore, we need 2N bits / 1.15 bits/answer = 1.74N answers from these 85% accurate workers in order to perfectly reconstruct the answers for these N questions.<br /><br />So, if we get from these 85% workers a total of 100 answers (each one 85% correct), we should be able to reconstruct the 100% correct answers for ~57 (=100/1.74) questions.<br /><br />Of course, we need to be intelligent about what exactly we ask in order to get these 100 answers.</li>
</ul>
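The arithmetic in Update 1 can be wrapped in a few lines (again an illustrative sketch of the Shannon lower bound, not a scheme for actually eliciting the answers):

```python
import math

def capacity(q, n):
    """C(q, n) = log2(n) - H(q, n), with the convention 0 * log(0) = 0."""
    h = -q * math.log2(q) if q > 0 else 0.0
    if q < 1:
        h -= (1 - q) * math.log2((1 - q) / (n - 1))
    return math.log2(n) - h

def answers_needed(num_questions, q, n):
    """Shannon lower bound on the number of answers from quality-q workers
    needed to perfectly reconstruct num_questions n-way questions."""
    bits_needed = num_questions * math.log2(n)  # log2(n) bits per question
    return math.ceil(bits_needed / capacity(q, n))
```

For N=100 four-way questions, perfect workers need 100 answers, while 85%-accurate workers need 174, matching the 1.74N figure above.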
<br />
I see in Wikipedia, in the article about the <a href="http://en.wikipedia.org/wiki/Noisy-channel_coding_theorem" target="_blank">noisy channel theorem</a>, that "<i>Simple schemes such as 'send the message 3 times and use a best 2 out of 3 voting scheme if the copies differ' are inefficient error-correction methods</i>" and that "<i>Advanced techniques such as <a href="http://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_code">Reed–Solomon codes</a> and, more recently, <a href="http://en.wikipedia.org/wiki/Turbo_code">turbo codes</a> come much closer to reaching the theoretical Shannon limit</i>". Unfortunately, my familiarity with such coding schemes is minimal (i.e., I have no clue), so I cannot understand their applicability in a crowdsourcing setting.<br />
<br />
<div>
So, here is my question: What coding schemes should we use in crowdsourcing in order to get closer to the theoretical limits given by Shannon? Or what is the fundamental thing that I miss? Because I do feel that I am missing something...</div>
<div>
<br /></div>
<div>
Any help would be appreciated.</div>
</div>
<div style="text-align: justify;">
<br />
<ul>
<li><b><i>Update 2</i></b> (thanks to the comments by stucchio and syrnik): Information theory predicts that we can <b><i>always</i></b> recover the <b><i>perfect</i></b> answers from noisy workers, given sufficient worker capacity. For anyone who has worked in crowdsourcing, this sounds very strange, and seems practically infeasible. The problem does not seem to be in the assumptions of the analysis; instead, it seems to lie in the feasibility of <b><i>implementing a proper encoding scheme on top of human computation</i></b>.<br /><br />The key concept in information theory is the coding scheme that is used to encode the information, to make the transmission of information robust to errors. Information theory does not say <b><i>how</i></b> we can recover this perfect information using a noisy channel. Over time, researchers came up with appropriate encoding schemes that approach very closely the theoretical maximum (see above, Reed-Solomon codes, turbo codes, etc). However, it is not clear whether these schemes are translatable into a human computation setting.<br /><br />Consider this gross simplification (which, I think, is good enough to illustrate the concept): In telecommunications, we put a "checksum" together with each message, to capture cases of incorrect information transmission. When the message gets transmitted erroneously, the checksum does not match the message content. This may be the result of corruption in the message content, or the result of corruption in the checksum (or both). In such cases, we re-transmit the message. Based on the noise characteristics of the channel, we can decide how long the message should be, how long the checksum should be, etc., to achieve maximum communication efficiency.<br /><br />For example, consider using a parity bit, the simplest possible checksum computation. 
We count the number of 1 bits in the message: if the number of 1's is odd, we set the parity bit to 1, if the number of 1's is even, we set the parity bit to 0. The extra parity bit increases the size of the message but can be used to detect errors when the message gets transmitted over a noisy channel, and reduce the error rate. By increasing the number of parity bits we can reduce the error rate to arbitrarily low levels.<br /><br />In a human computation setting, computing such a checksum is highly non-trivial. Since we do not really know the original message, we cannot compute at the source an error-free checksum. We can of course try to create "meta"-questions that will try to compute the "checksum" or even try to modify all the original questions to have an error-correcting component in them. <br /><br />See now the key difference: In information theory, we have computed <b><i>error-free</i></b> the message to be transmitted with built-in error-correction. Consider now the same implementation in a human computation setting: We ask the user to inspect the previous k questions, and report some statistics about the previously submitted answers. The user now operates on the <b><i>noisy </i></b>message (i.e., the given, noisy answers), therefore even the error-free computation of the checksum is going to be noisy, defeating the purpose of an error-correcting code.<br /><br />Alternatively, we can try to take the original questions, and try to ask them in a way that enforces some error-correcting capability. However, it is not clear that these "meta-questions" are going to have the same level of complexity for the humans, even if in the information-theoretic sense, they carry the same amount of information.<br /><br />It seems that in a human computation setting we have <b><i>noisy computation</i></b>, as opposed to <i><b>noisy transmission. 
</b></i>Since the computation is noisy, there is a good chance that the computation of these "checksums" is going to be correlated with the original errors. Therefore, it is not clear whether we can actually implement the proper encoding schemes on top of human computers, to achieve the theoretical maximums predicted by information theory.<br /><br />Or, at least, this seems like a very interesting and challenging research problem.</li>
</ul>
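For concreteness, the parity-bit scheme discussed above looks like this in its textbook form (plain Python, just to make the mechanics visible; this is the telecommunications version, not a crowdsourcing implementation):

```python
def add_parity(bits):
    """Append an even-parity bit so the total number of 1s is even."""
    return bits + [sum(bits) % 2]

def parity_ok(coded):
    """Any single flipped bit makes the count of 1s odd, exposing the error."""
    return sum(coded) % 2 == 0
```

A single corrupted bit is detected (though not located), prompting retransmission. The point of Update 2 is that even this trivial check requires computing the parity <i>error-free at the source</i>, which is exactly what a noisy human computation cannot guarantee.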
</div>
Panos Ipeirotishttp://www.blogger.com/profile/15283752183704062501noreply@blogger.comtag:blogger.com,1999:blog-7118563403027467631.post-41312484663290842392013-06-28T20:34:00.003-04:002013-06-28T20:34:59.822-04:00Facebook implements brand safety, doing it "manually" (crowdsourcing?)<div style="text-align: justify;">
I just read that Facebook <a href="http://newsroom.fb.com/News/658/A-New-Review-Policy-For-Pages-and-Groups" target="_blank">started thinking about</a> <a href="http://integralads.com/our-solutions/brand-safety" target="_blank">brand safety</a>, and will restrict ads from appearing next to content that may be controversial (e.g., adult-oriented content).</div>
<div>
<div style="text-align: justify;">
<br /></div>
</div>
<div>
<div style="text-align: justify;">
I was rather surprised to find out that Facebook has <b><i>not </i></b>been doing that already. It is known that Facebook <a href="http://bits.blogs.nytimes.com/2011/12/05/spot-porn-on-facebook-for-a-quarter-cent-an-image/" target="_blank">has been using crowdsourcing</a> to detect content that violates the terms of service. So, I assumed that the categorization of content as brand-inappropriate was also part of that process. Apparently not.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Given the similarities of the two tasks (the difference between no-ads-for-brand-safety and violating-terms-of-service is often just a matter of the intensity of the offense), I assume that Facebook is also going to adopt a crowdsourcing-style solution (perhaps with a private crowd), and then build a machine learning algorithm on top, using the crowd judgements. At least the wording "<i>In order to be thorough, this review process will be manual at first, but in the coming weeks we will build a more scalable, automated way</i>" in the announcement seems to imply that.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Or perhaps, to blow my own horn, Facebook should just use <a href="http://integralads.com/" target="_blank">Integral Ad Science</a> (aka AdSafe Media). At AdSafe, we built a solution for exactly this problem back in 2009, employing a combination of crowdsourcing and machine learning to detect brand-inappropriate content. We did not go just for porn, but also for other categories, such as alcohol use, offensive language, hate speech, etc. In fact, most of my work in crowdsourcing was inspired, one way or another, by the problems faced when trying to deploy a crowdsourcing solution at scale. Beyond the academic research, my work with Integral also led to one of the best blog posts that I have written, "<a href="http://www.behind-the-enemy-lines.com/2011/03/uncovering-advertising-fraud-scheme.html" target="_blank">Uncovering an advertising fraud scheme (or, the Internet is for Porn)</a>".</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Perhaps, the next step is to demonstrate how to use <a href="http://project-troia.com/" target="_blank">Project Troia</a>, together with a good machine learning toolkit in order to deploy quickly a system for detecting brand inappropriate content. Maybe Facebook could use that ;-)</div>
</div>
Panos Ipeirotishttp://www.blogger.com/profile/15283752183704062501noreply@blogger.comtag:blogger.com,1999:blog-7118563403027467631.post-17572513870154349862013-06-28T16:57:00.003-04:002013-06-28T17:06:36.641-04:00Mechanical Turk account verification: Why Amazon disables so many accounts<div style="text-align: justify;">
Over the last year, Amazon embarked on a big effort: All holders of an Amazon Payments account (which includes all the Mechanical Turk workers) had to verify their accounts by providing their social security number, address, full legal name, etc. Users who did not provide this information found their accounts disabled and unable to perform any financial transaction.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
This led to big changes in the market, as many international workers realized that Amazon could not verify their identity (even if they provided the correct information), and they found themselves locked out of Mechanical Turk.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
So, why would Amazon start doing that?</div>
<br />
<ul>
<li><div style="text-align: justify;">
<a href="http://turkrequesters.blogspot.com/2013/01/the-reasons-why-amazon-mechanical-turk.html" target="_blank"><b>Low quality of international workers</b></a>. While there are certainly many high-quality workers outside the US, there is a certain segment of workers that join the market with the sole purpose of getting something for nothing. Especially after Indian workers became eligible to receive cash compensation (instead of just gift cards available to other non-US workers), the number of spam attacks from India went up significantly. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
So, <a href="http://www.behind-the-enemy-lines.com/2012/10/why-odesk-has-no-spammers.html" target="_blank">identity verification can help on that front</a>. It is well-known that it is difficult to have a good reputation scheme in a system that allows for cheap generation of identities. When identities are easy to create, every time someone commits a bad action and gets caught, the account gets closed and a new account is created, ready to commit the same bad actions again. This significantly hurts new workers, who are de facto treated as potential spammers, discouraging them from joining the market. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
I have long criticized the fact that Amazon allowed for easy generation of identities. Even though it seemed that Amazon required SSNs and other information to create an account, this was effectively an optional step. In fact, it was possible to use <a href="http://www.fakenamegenerator.com/" target="_blank">Fake Name Generator</a> to create plenty of <a href="http://www.behind-the-enemy-lines.com/2012/01/identify-verification-and-how-to-bypass.html" target="_blank">seemingly authentic "US based" accounts</a>, simply using <a href="http://ssa-custhelp.ssa.gov/app/answers/detail/a_id/149/~/social-security's-death-master-file" target="_blank">SSN numbers of dead people</a>. This meant that many fake accounts existed, many of them "US based," which then used Amazon Payments to forward their earnings to their true puppetmaster.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<br /></div>
</li>
<li><div style="text-align: justify;">
<b>Labor law</b>. Even though many (small) requesters are unaware of the fact, when you post jobs on Mechanical Turk, you directly engage in hiring contractors to do some work for you. Many people believe that you are paying Amazon, who then pays the workers, but in reality Amazon acts simply as a payment processor. Amazon does not act as an employer; the requester acts as an employer. As discussed in the past, this forces many requesters to <a href="http://www.behind-the-enemy-lines.com/2009/07/is-amazon-mechanical-turk-black-market.html" target="_blank">unknowingly participate in a black market</a>.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The moment requesters realize that they are actually employing all these contractors is when a worker ends up receiving more than $600 in payments from the requester over the fiscal year. At that point, due to IRS regulations, the requester needs to send a 1099-MISC form to the MTurk worker. Amazon then provides the full information (SSN, address, etc.) of the worker to the requester. <b>So Amazon would like to have the correct information, to avoid forcing requesters to send 1099 forms to fake addresses, with fake names and SSNs.</b></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
I should clarify here that the $600 limit is the point where the employer is <b>forced </b>to send a 1099-MISC form. In principle, a requester may want to send 1099-MISC forms to <b>all </b>workers, and Amazon may want to provide this information on demand. (I doubt that this can be the reason, though).</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Finally, there was a new regulation from the IRS last year: the IRS introduced the concept of a <a href="http://www.irs.gov/uac/Form-1099-K,-Merchant-Card-and-Third-Party-Network-Payments" target="_blank">1099-K form</a>. Since Amazon acts as a de facto payment processor (and not as an employer), Amazon should also report the amount of payments sent to each worker. So, even if no worker has met the $600 limit from a single requester, <b>if the overall payments to a single worker were high enough (specifically <a href="http://www.irs.gov/pub/irs-utl/irdm_section_6050w_faqs_7_23_11.pdf" target="_blank">$20,000/yr or more, across more than 200 transactions</a>) then again Amazon needs to report this information and include valid worker information there.</b></div>
<div style="text-align: justify;">
<br /></div>
</li>
<li style="text-align: justify;"><b>Money laundering</b>: Since Mechanical Turk started becoming a marketplace with significant volume, this may have raised some flags in all the places that monitor financial transactions for money laundering. All US companies need to comply with the infamous <a href="http://en.wikipedia.org/wiki/USA_PATRIOT_Act">US Patriot Act</a>, and for Mechanical Turk the <a href="http://en.wikipedia.org/wiki/USA_PATRIOT_Act,_Title_III">provisions about money laundering and financing of terrorist activities</a> may have been a reason for cleaning up the marketplace from fake worker identities. The basic idea, known as the "<a href="http://en.wikipedia.org/wiki/Know_your_customer">Know Your Customer (KYC)</a>" doctrine, is that Amazon should know from whom they get money and to whom they send the money. Since <a href="http://www.behind-the-enemy-lines.com/2010/02/why-mechanical-turk-allows-only-us.html" target="_blank">Amazon accepts payments from US requesters only</a>, they know where the money comes from. Now, by cleaning up the marketplace from fake identities and verifying the existing ones, they also know where the money flows, so they are closer to compliance with the money-laundering laws.<br /> </li>
</ul>
<div style="text-align: justify;">
Overall, there are many reasons for Amazon to check and clean up the market from fake accounts and prevent any anonymous activity. For me, this is a good step, despite all the problems that it may generate for workers that have problems proving their identity. Even in India, <a href="http://www.stern.nyu.edu/experience-stern/faculty-research/india-unique-id" target="_blank">the new UID system</a> will eventually allow the legitimate Indian workers to prove their identity without problems.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
One concern that someone expressed to me was that this direction was <a href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2228728" target="_blank">removing the ability of workers to be truly anonymous</a>. I am not exactly sure how this can be a concern, given that it is well established that in the workplace (electronic or not) there is a very limited right to privacy. Knowing the true identity of your workers (contractors or employees) is a pretty fundamental right of the employer, and I doubt that the expectation that a worker remains anonymous can be a "<a href="http://www.nolo.com/legal-encyclopedia/workplace-privacy" target="_blank">reasonable expectation of privacy</a>". The only case where I see this happening is if Amazon switches from being a payment processor to being an employer of all the Mechanical Turk workers, but I doubt this will happen anytime soon. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
At the end of the day, markets do not mix well with true and complete anonymity.</div>
Panos Ipeirotishttp://www.blogger.com/profile/15283752183704062501noreply@blogger.comtag:blogger.com,1999:blog-7118563403027467631.post-53096329445060350432013-06-18T16:31:00.003-04:002013-06-18T16:31:46.793-04:00Project Troia: Quality Assurance in CrowdsourcingOne of the key problems in crowdsourcing is the issue of quality control. Over the last few years, a large number of methods have been proposed for estimating the quality of workers and the quality of the generated data. A few years back, we released the <a href="https://github.com/ipeirotis/Get-Another-Label" target="_blank">Get Another Label</a> toolkit, which allowed people to run their data through a command-line interface and get back estimates of the worker quality, estimates of how well the data had been labeled, and a list of the data points that have high uncertainty and therefore may require additional attention.<div>
<br /></div>
<div>
The next step for Get Another Label was to get it ready to work in more practical settings. The GAL toolkit assumed that we had all the labels assigned by the workers; we processed them and got the results. In reality, though, most tasks run in an incremental mode: the task runs over time, new data arrive, new workers arrive, and the "load-analyze-output" process was not a good fit. We wanted something that gives back estimates of worker quality on the fly, and likewise identifies on the fly the data points that need the most attention.</div>
<div>
<br /></div>
<div>
Towards this goal, over the last few months we have been porting the GAL code into a web service, called <a href="http://project-troia.com/" target="_blank">Project Troia</a>. You can load the data <b><i>as the crowdsourced project runs</i></b> and get back the results immediately. This allows for very fast estimation of worker quality, and for quick identification of the data points that either meet the target quality or require additional labeling effort. Among other features, Project Troia:</div>
<ul>
<li>Supports labeling with any number of discrete categories, not just binary.</li>
<li>Supports labeling with continuous variables.</li>
<li>Allows the specification of arbitrary misclassification costs (e.g., "marking spam as legitimate has cost 1, marking legitimate content as spam has cost 5").</li>
<li>Allows for seamless mixing of gold labels and redundant labels for quality control.</li>
<li>Estimates the quality of the workers that participate in the task and returns the estimates on-the-fly.</li>
<li>Estimates the quality of the data that are returned back by the algorithm and returns the estimate of labeling accuracy on-the-fly.</li>
<li>Estimates a quality-sensitive payment for every worker, based on the quality of the work done so far.</li>
</ul>
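To make the misclassification-cost bullet concrete, here is a minimal sketch of how an expected per-label cost can be computed from a worker's confusion matrix (my own illustration under simplified assumptions; the actual Troia implementation is more involved and is described in the paper):

```python
def expected_cost(confusion, costs, priors):
    """Expected misclassification cost per label for one worker.

    confusion[i][j]: P(worker reports class j | true class is i)
    costs[i][j]:     cost of labeling a class-i item as class j (0 on diagonal)
    priors[i]:       prevalence of class i
    """
    k = len(priors)
    return sum(priors[i] * confusion[i][j] * costs[i][j]
               for i in range(k) for j in range(k))
```

With classes (spam, legitimate), the costs from the bullet above become `costs = [[0, 1], [5, 0]]`. A worker who marks 10% of spam as legitimate and 20% of legitimate content as spam, on a 30%-spam stream, has an expected cost of 0.3·0.1·1 + 0.7·0.2·5 = 0.73 per label, while a perfect worker scores 0.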
<div>
If you are interested in the description of the methods implemented in the toolkit, please take a look at the paper "<a href="http://archive.nyu.edu/handle/2451/31825" target="_blank">Quality-based Pricing for Crowdsourced Workers</a>". Our experiments indicate that when labeling allocation happens following the suggestions of Project Troia, we achieve the target data quality with almost optimal budget, and workers are fairly compensated for their effort. (For details, see the paper :-)</div>
<div>
<br /></div>
<div>
Special thanks to Tagasauris, oDesk, and Google for providing support for developing the software. Needless to say, <a href="http://project-troia.com/documentation/3_API.html" target="_blank">the API is free to use</a>, and the <a href="https://github.com/ipeirotis/Troia-Server" target="_blank">source code is available on Github</a>. We hope that you will find it useful.</div>
<div>
</div>
Panos Ipeirotishttp://www.blogger.com/profile/15283752183704062501noreply@blogger.com