Friday, November 16, 2018

Distribution of paper citations over time

A few weeks ago we had a discussion about citations, and how we can compare the citation impact of papers that were published in different years. Obviously, older papers have an advantage as they have more time to accumulate citations.

To compare papers, just for fun, we ended up opening the profile page of each paper in Google Scholar, and we analyzed the paper citations years by year to find the "winner." (They were both great papers, by great authors, fyi. It was more of a "Lebron vs. Jordan" discussion, as opposed to anything serious.)

This process got me curious though. Can we tell how a paper is doing at any given point in time? How can we compare a 2-year-old article, published in 2016, with 100 citations against a 10-year-old document, published in 2008, with 500 citations?

To settle the question, we started with the profiles of faculty members in the top-10 US universities and downloaded about 1.5M publications, across all fields, and their citation histories over time.

We then analyzed the citation histories of these publications, and, for each year, we ranked the papers based on the number of citations received over time. Finally, we computed the citation numbers corresponding to different percentiles of performance.

Cumulative percentiles

The plot below shows the number of citations that a paper needs to have at different stages to be placed in a given percentile.

A few data points, focusing on three age milestones: 5 years after publication, 10 years after publication, and lifetime.

  • 50% line: The performance of the "median" paper. The median paper gets around 20 citations within 5 years of publication, 50 citations within 10 years, and around 90 citations in its lifetime. Milestone scores: 20, 50, 90
  • 75% line: These papers perform better, citation-wise, than 75% of the other papers of the same age. Such papers get around 50 citations within 5 years, 100 citations within 10 years, and around 200 citations in their lifetime. Milestone scores: 50, 100, 200
  • 90% line: These papers perform better than 90% of the papers in their cohort. Around 90 citations within 5 years, 200 citations within 10 years, and 500 citations in their lifetime. Milestone scores: 90, 200, 500
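To make the ranking step concrete, here is a minimal sketch: given per-paper citation histories (toy data here; the real analysis used ~1.5M publications), compute cumulative citations at a given age and read off nearest-rank percentiles. The function names and the toy numbers are illustrative, not part of the original analysis.

```python
import math

def citations_at_age(yearly_citations, age):
    """Cumulative citations accumulated in the first `age` years."""
    return sum(yearly_citations[:age])

def percentile(values, p):
    """Nearest-rank p-th percentile (p in 0-100)."""
    s = sorted(values)
    k = max(0, math.ceil(p / 100 * len(s)) - 1)
    return s[k]

# Toy cohort: papers published the same year; citation counts per year of age.
papers = [
    [1, 2, 4, 5, 8],     # slow starter
    [10, 15, 12, 9, 4],  # early peak
    [3, 6, 9, 12, 15],   # steady riser
    [0, 1, 1, 2, 2],     # rarely cited
]
at_age_5 = [citations_at_age(p, 5) for p in papers]
median_at_5 = percentile(at_age_5, 50)
```

Repeating this for every age and percentile level produces the curves in the plot.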


Yearly percentiles and peak years

We also wanted to check at which point papers reach their peak and start collecting fewer citations. The plot below shows the percentiles based on the yearly number of citations received. The vast majority of papers reach their peak 5-10 years after publication, after which the number of yearly citations starts to decline.


Below is the plot of the peak year for a paper based on the paper percentile:


There is an interesting effect around the 97.5th percentile: above that level, a 'rich-get-richer' effect seems to kick in, and we effectively do not observe a peak year; the number of citations per year keeps increasing. You could call these papers the "classics".

What does it take to be a "classic"? 200 citations at 5 years or 500 citations at 10 years.

Monday, January 29, 2018

How many Mechanical Turk workers are there?

TL;DR: There are about 100K-200K unique workers on Amazon Mechanical Turk. On average, 2K-5K workers are active on the platform at any given time, which is equivalent to having 10K-25K full-time employees. On average, 50% of the worker population changes within 12-18 months. Workers exhibit widely different patterns of activity, with most workers being active only occasionally, and a few workers being very active. Combining our results with the results of Hara et al., we see that MTurk has a yearly transaction volume of a few hundred million dollars.

For more details read below, or take a look at our WSDM 2018 paper.

--

Question

A topic that frequently comes up when discussing Mechanical Turk is: "How many workers are there on the platform?"

In general, this is a question that is very easy for Amazon to answer, but much harder for outsiders. Amazon claims that there are 500,000 workers on the platform. How can we check the validity of this statement?

--

Basic capture-recapture model

A common technique for this problem is capture-recapture, which is widely used in ecology to estimate the population of a species.

The simplest possible technique is the following:
  • Capture/marking phase: Capture $n_1$ animals, mark them, and release them back. 
  • Recapture phase: A few days later, capture $n_2$ animals. Assuming there are $N$ animals overall, $n_1/N$ of them are marked. So, for each of the $n_2$ captured animals, the probability that the animal is marked is $n_1/N$ (from the capture/marking phase).
  • Calculation: In expectation, we see $n_2 \cdot \frac{n_1}{N}$ marked animals in the recapture phase. (Notice that we do not know $N$.) So, if we actually see $m$ marked animals during the recapture phase, we set $m = n_2 \cdot \frac{n_1}{N}$ and we get the estimate:

     $N = \frac{n_1 \cdot n_2}{m}$.

In our setting, we adapted the same idea, where "capture" and "recapture" correspond to participating in a demographics survey. In other words, we "capture/mark" the MTurk users that complete the survey on one day. Then, on another day, we "recapture" by surveying more workers and see how many workers overlap between the two surveys.
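A quick simulation (with made-up numbers, not our survey data) shows the estimator in action:

```python
import random

def lincoln_petersen(n1, n2, m):
    """Population estimate N = n1 * n2 / m from one capture-recapture pair."""
    return n1 * n2 / m

# Simulate a closed population of 10,000 workers.
random.seed(42)
population = range(10_000)
marked = set(random.sample(population, 1_000))  # capture: survey on day 1
recaptured = random.sample(population, 1_000)   # recapture: survey on day 2
m = sum(1 for w in recaptured if w in marked)   # workers seen in both surveys

estimate = lincoln_petersen(1_000, 1_000, m)    # should be close to 10,000
```

Note that the simulation satisfies both assumptions discussed below (closed population, equal catchability), which is exactly why the estimator works here.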

--

First (naive) attempt

We decided to apply this technique to estimate the size of the Mechanical Turk population. We treated the surveys running over one 30-day period as the "capture" phase, and the surveys running over a different 30-day period as the "recapture" phase. The plot below shows the results.


The x-axis shows the beginning of the recapture period, and the y-axis the estimate of the number of workers. The color of each dot corresponds to the difference in time between the capture-recapture periods: black is a short time, and red is a long time.

If we focus on the black-color dots (~60 days between the surveys), we get a (naive) estimate of around 10K-15K workers. (Warning: this is incorrect.)

While we could stop here, we see some results that are not consistent with our model. Remember that color encodes the time between samples: black is a short time (~2 months), red is a long time (~2 years). Notice that, as the time between the two periods increases, the estimates become higher, giving the "rainbow cake" effect in the plot. For example, for July 2017, our estimate is 12K workers if we compare against a capture from May 2017, but it goes up to 45K workers if we compare against a sample from May 2015. Our model, though, says that the time between captures should not affect the population estimate. This indicates that there is something wrong with the model.

--

Assumptions of basic model

The basic capture-recapture estimation described above relies on a couple of assumptions. Both of these assumptions are violated when applying this technique to an online environment.
  • Assumption of no arrivals / departures ("closed population"): The vanilla capture-recapture scheme assumes that there are no arrivals or departures of workers between the capture and recapture phase.
  • Assumption of no selection bias ("equal catchability"): The vanilla capture-recapture scheme assumes that every worker in the population is equally likely to be captured.

In ecology, the issue of closed populations has been examined under many different settings (birth-death of animals, immigration, spatial patterns of movement, etc.), and there are many research papers on the topic. Catchability has received comparatively less attention. This is reasonable: in ecology, the assumption of a closed population is problematic in many settings, whereas assuming that the probability of capturing an animal is uniform among similar animals is reasonable. Typically, the focus is on segmenting the animals into groups (e.g., nesting females vs. hunting males) and assigning a different catchability to each group (but not to individuals).

In online settings, though, the assumption of equal catchability is more problematic. First, we have activity bias: workers exhibit very different levels of activity, and a worker who works every day is much more likely to see and complete a task than someone who works once a month. Second, we have selection bias: some workers may like to complete surveys, while others may avoid such tasks.

So, to improve our estimates, we need to use models that alleviate these assumptions.

--

Endowing workers with survival probabilities 

We can extend the model by giving each worker a certain survival probability, allowing workers to "disappear" from the platform. Looking at the plot above, we can see that the population estimate increases as the time between two samples increases. This hints that workers leave the platform, and the intersection of capture and recapture becomes smaller over time.

If we account for that, we can get an estimate that the "half-life" of a Mechanical Turk worker is between 12-18 months. In other words, approximately 50% of the Mechanical Turk population changes every 12-18 months. 
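As a sanity check on the arithmetic: under a constant monthly survival rate s, the half-life is ln(0.5)/ln(s). The 95.5% monthly survival below is an illustrative value chosen to land inside the quoted range, not an estimate from the paper.

```python
import math

def half_life_months(monthly_survival):
    """Months until 50% of a cohort remains, given a constant monthly survival rate."""
    return math.log(0.5) / math.log(monthly_survival)

# An illustrative monthly survival of 95.5% implies a half-life of roughly
# 15 months, squarely inside the 12-18 month range quoted above.
hl = half_life_months(0.955)
```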

--

Endowing workers with propensity to participate

We can also extend the model by associating a certain propensity with each worker. The propensity is the probability that a worker is active and willing to participate in a task at any given time.

In our work, we assumed that the underlying "propensity to participate" follows a Beta distribution across the worker population, with unknown parameters. When the propensities follow a Beta distribution, the number of times k that a worker participates in our surveys follows a Beta-Binomial distribution. Since we know how many workers participated k times in our surveys, it is then easy to estimate the underlying parameters of the Beta distribution.

Notice that we had to depart from the simple "two occasion" model above, and instead use multiple capturing periods over time. Intuitively, workers that have high propensity to participate will appear many times in our results, while inactive workers will appear only a few times.
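Here is a stdlib-only sketch of this estimation step, on simulated data: we draw worker propensities from a Beta, count appearances across capture occasions, and recover the Beta parameters by maximizing the Beta-Binomial likelihood over a crude grid. The sample sizes, grid, and parameter values are illustrative, not the paper's actual procedure.

```python
import math
import random

def log_betabinom_pmf(k, n, a, b):
    """log P(K = k) for a Beta-Binomial(n, a, b), computed via log-gamma."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    log_beta_num = math.lgamma(k + a) + math.lgamma(n - k + b) - math.lgamma(n + a + b)
    log_beta_den = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return log_choose + log_beta_num - log_beta_den

# Simulate 2,000 workers over n = 50 capture occasions, with an illustrative
# Beta(0.3, 20) propensity distribution.
random.seed(7)
n = 50
counts = []
for _ in range(2_000):
    p = random.betavariate(0.3, 20)
    counts.append(sum(random.random() < p for _ in range(n)))

def log_likelihood(a, b):
    return sum(log_betabinom_pmf(k, n, a, b) for k in counts)

# Crude grid-search MLE for (a, b); a real analysis would use a proper optimizer.
grid = [(a, b) for a in (0.1, 0.3, 1.0, 3.0) for b in (5, 20, 80)]
a_hat, b_hat = max(grid, key=lambda ab: log_likelihood(*ab))
```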

By doing this analysis, we can observe that (as expected) the distribution of activity is highly skewed: A few workers are very active in the platform, while others are largely inactive. A nice property of the Beta distribution is its flexibility: Its shape can be pretty much anything: uniform, Gaussian-like, bimodal, heavy-tailed... you name it. 




In our analysis, we estimated that the propensity follows a Beta(0.3, 20) distribution. We plot above the complementary CDF of the distribution ("what percentage of the workers have propensity higher than x").

As you can see, the propensity follows a familiar (and expected) pattern. Only 0.1% of the workers have propensity higher than 0.2, and only 10% have propensity higher than 0.05.

Intuitively, a propensity of 0.2 means that the worker is active and willing to participate 20% of the time. This is roughly a full-time level of activity: full-time employees work around 2,000 hours per year, out of the 24*365 = 8,760 available hours in a year. A propensity of 0.05 means that the worker is active and available approximately 24 hr * 0.05 ≈ 1.2 hours per day.
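Such tail statements are easy to check numerically: the quantity plotted is P(propensity > x) for a Beta(0.3, 20). Below is a self-contained numeric sketch using only the standard library (with SciPy available, `scipy.stats.beta.sf(x, 0.3, 20)` computes the same in one call).

```python
import math

def beta_sf(x, a, b, steps=50_000):
    """P(X > x) for X ~ Beta(a, b), via the substitution t = u**a,
    which removes the integrable singularity at 0 when a < 1."""
    # complete Beta function via log-gamma
    B = math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))
    # trapezoid rule for (1/a) * integral of (1 - t**(1/a))**(b-1), t in [0, x**a]
    upper = x ** a
    h = upper / steps
    total = 0.0
    for i in range(steps + 1):
        t = i * h
        f = max(0.0, 1.0 - t ** (1.0 / a)) ** (b - 1.0)
        total += (0.5 if i in (0, steps) else 1.0) * f
    cdf = (total * h / a) / B
    return 1.0 - cdf

share_above_005 = beta_sf(0.05, 0.3, 20)  # fraction with propensity > 0.05
share_above_020 = beta_sf(0.20, 0.3, 20)  # fraction with propensity > 0.20
```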

--

How big is the platform?

So, how many workers are there? Under such highly skewed distributions, giving an exact count of workers is rather futile. The best that you can do is give a ballpark estimate, and hope to be roughly correct on the order of magnitude. Our estimates show that there are around 180K distinct workers on the MTurk platform. This is good news for anyone who is trying to reach a large number of distinct workers through the platform.

Our analysis also allows us to estimate how many workers are active and willing to participate in our task at any given time: around 2K to 5K workers. Converted to full-time-employee equivalents, this corresponds to 10K-25K full-time workers.

The latter part also allows us to give some low and high estimates on the transaction volume of MTurk. 
  • Lower bound: Assuming 2K workers active at any given time, this is 2000*24*365=17,520,000 work hours in a year. If we assume that the median wage is \$2/hr, this is roughly \$35M/yr transaction volume on Amazon Mechanical Turk (with Amazon netting ~\$7M in fees).
  • Upper bound: Assuming 5K workers active at any given time, this is 5000*24*365=43,800,000 work hours in a year. If we assume an average wage of \$12/hr, this is around \$525M/yr transaction volume (with Amazon netting ~\$100M in fees).

I understand that a range of \$35M to \$525M may not be very helpful, but these are very rough estimates. If someone wanted my own educated guess, I would put it somewhere in the middle of the two, i.e., a transaction volume of a few hundred million dollars.
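The back-of-the-envelope above, spelled out in code; the 20% fee rate is an assumption that matches the \$7M-on-\$35M figure in the lower bound:

```python
def annual_volume(workers_active, hourly_wage):
    """Transaction volume implied by a constant number of concurrently active workers."""
    work_hours_per_year = workers_active * 24 * 365
    return work_hours_per_year * hourly_wage

low = annual_volume(2_000, 2)     # 2K workers at $2/hr  -> $35.04M/yr
high = annual_volume(5_000, 12)   # 5K workers at $12/hr -> $525.6M/yr

# Amazon's cut, assuming a ~20% commission on the transaction volume
low_fees, high_fees = 0.20 * low, 0.20 * high
```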



Tuesday, January 17, 2017

Why was my Amazon Mechanical Turk registration denied?

(This is my answer to a question posted on Quora)

Mechanical Turk is a platform for work. Workers get paid, which now makes Amazon a payment processor. Payment processors move money on behalf of other people, and are therefore under heavy scrutiny from the US government for issues related to anti-money-laundering (AML), counter-terrorism, tax compliance, etc.

One of the key things required from financial institutions is a "Customer Identification Program" (CIP), also known as a "Know Your Customer" (KYC) process. The CIP/KYC is a set of procedures that a financial institution follows to establish the true identity of a customer. The processes that each financial institution follows vary, and the exact processes are rarely available to the public, as they are considered security measures. Furthermore, the practices are regularly monitored by regulators (OCC, Fed, FinCEN, etc.) and change over time to follow best practices.

In your particular case, the most likely reason is that Amazon was not able to verify your identity.

If you are in the US, Amazon most probably can get your SSN and other personal details and verify whether you are a real person. However, even if you live in the US, if you have no credit history, no bank accounts, and so on, the verification will come back with low confidence. Following standard risk management practices, Amazon could plausibly reject such applications as part of its CIP processes: it is better to reject a legitimate account than to accept an account that will be involved in money laundering or tax-evasion schemes.

For other countries, Amazon's ability to follow CIP/KYC processes that conform to US regulations varies. I assume, for example, that the cooperation of the US with UK or Australian authorities is much smoother than with, say, Chinese authorities. So, if you live outside the US, the probability of having your account approved depends on how robustly Amazon can verify individual identities in your country.

Given that Amazon gets paid by requesters, I assume their focus is to establish CIP processes first in regions where potential requesters reside, which is not always the place where workers reside. This also means that you are more likely to be approved if you first register as a requester (assuming this is an option for you), and then try to create the worker account.

Sunday, March 13, 2016

AlphaGo, Beat the Machine, and the Unknown Unknowns

In Game 4 of the 5-game series between AlphaGo and Lee Sedol, the human Go champion managed to get his first win. According to the NY Times article:

Lee had said earlier in the series, which began last week, that he was unable to beat AlphaGo because he could not find any weaknesses in the software's strategy. But after Sunday's match, the 33-year-old South Korean Go grandmaster, who has won 18 international championships, said he found two weaknesses in the artificial intelligence program. Lee said that when he made an unexpected move, AlphaGo responded with a move as if the program had a bug, indicating that the machine lacked the ability to deal with surprises.



This part reminded me of one of my favorite papers:  Beat the Machine: Challenging Humans to Find a Predictive Model’s “Unknown Unknowns”

In the paper, we tried to use humans to "beat the machine" and identify vulnerabilities in a machine learning system. The key idea was to reward humans whenever they identified cases where the machine failed while being confident that its answer was correct. In other words, we encouraged humans to find "unexpected" errors, not just cases where the machine was naturally going to be uncertain.



As an example case, consider a system that detects adult content on the web. Our baseline machine learning system had an accuracy of ~99%. We then asked Mechanical Turk workers to do the following task: find web pages with adult content that the machine learning system classifies as non-adult with high confidence. The humans had no information about the system; the only thing they could do was submit a URL and get back an answer.

The reward structure was the following: Humans get \$1 for each URL that the machine misses, otherwise they get \$0.001. In other words, we provided a strong incentive to find problematic cases.

After some probing, humans were quick to uncover underlying vulnerabilities: For example, adult pages in Japanese, Arabic, etc., were classified by our system as non-adult, despite their obvious adult content. Similarly for other categories, such as hate speech, violence, etc. Humans were quickly able to "beat the machine" and identify the "unknown unknowns".



Simply put, humans were able to figure out which cases the system was likely to have missed during training. At the end of the day, the training data is provided by humans, and no system has access to all possible training data. We operate in an "open world," while training data implicitly assumes a "closed world".

As we see from the AlphaGo example, since most machine learning systems rely on the existence of training data (or some immediate feedback for their actions), machines may get into trouble when they face examples that are unlike anything in their training data.

We designed our Beat The Machine system to encourage humans to discover such vulnerabilities early.

In a sense, our BTM system is like hiring hackers to break into your network, to identify security vulnerabilities before they become a real problem. The BTM system applies this principle to machine learning systems, encouraging a period of intense probing for vulnerabilities before deploying the system in practice.

Well, perhaps Google hired Lee Sedol with the same idea: Get the human to identify cases where the machine will fail, and reward the human for doing so. Only in that case, AlphaGo managed to eat its cake (figure out a vulnerability) and have it too (beat Lee Sedol, and not pay the \$1M prize) :-)

Monday, February 29, 2016

A Cohort Analysis of Mechanical Turk Requesters

In my last post, I examined the number of "active requesters" on Mechanical Turk, and concluded that there has been a significant decline in the numbers over the last year. The definition of "active requester" was: "A requester is active at time X if they have a HIT running at time X". A potential issue with this definition is that an improvement in the speed of HIT completion (e.g., due to increased labor supply) could drive down that number.

For this reason, I decided to perform a proper cohort analysis of the requesters on Mechanical Turk. In the cohort analysis that follows, we examine how many of the requesters that first appeared on the platform in a given month (say, September 2015) are still posting tasks in the subsequent months.

Here is the resulting "layer cake" plot that shows what happens in each cohort. Each of the layers corresponds to the requesters that were first seen in a given month. (code, data) (Read this post if you want a little more background on what the plot should look like.)
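For readers who want to reproduce a layer from raw data, here is the core computation: a requester belongs to the cohort of their first observed month, and a cohort's layer value at month m counts the members whose last observed activity is in month m or later. The tiny dataset and field names below are made up for illustration.

```python
# requester -> months ("YYYY-MM") in which they posted at least one HIT
activity = {
    "r1": ["2014-05", "2014-06", "2015-03"],
    "r2": ["2014-05"],
    "r3": ["2014-06", "2016-02"],
}

def layer_value(activity, cohort_month, month):
    """Number of requesters first seen in `cohort_month` whose last observed
    activity is in `month` or later ("YYYY-MM" strings sort chronologically)."""
    return sum(
        1
        for months in activity.values()
        if min(months) == cohort_month and max(months) >= month
    )
```

For example, `layer_value(activity, "2014-05", "2015-03")` counts how many of the May 2014 cohort were still posting in March 2015 or later.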


For example, the bottom layer corresponds to all the requesters that were first seen in May 2014 (the first month that the new version of MTurk Tracker started collecting data). We can see that we had ~2700 "new" requesters that month. (The May 2014 cohort obviously contains all prior cohorts, as we do not know when these requesters really started posting.) Out of these requesters, approximately 1700 also posted a task in June 2014 or later, approximately 1000 posted a task in March 2015 or later, and approximately 500 posted a task in February 2016.

The layer on top (slightly darker blue) illustrates the evolution of the June 2014 cohort. By stacking them on top of each other, we can see the composition of the requesters that have been active in every single month.

As the plot makes obvious, until March 2015 the acquisition of new requesters every month was compensating for the requesters lost from the prior cohorts. Starting in March 2015, however, we see a decline in the overall numbers, as the loss of requesters from prior cohorts dominates the acquisition of new requesters. So, the cohort analysis supports the conclusions of the prior post, as the trends and conclusions are very similar (it is always good to have a few robustness checks).

Of course, a more comprehensive cohort analysis would also analyze the revenue generated by each cohort, and not just the number of active users. That requires a little bit more digging in the data, but I will do that in a subsequent post.

Friday, February 26, 2016

The Decline of Amazon Mechanical Turk

It seems that, after years of neglect, Mechanical Turk is starting to lose its appeal. In our latest measurement, Mechanical Turk has lost 50% of its requesters year-over-year.

A few days ago,  Kristy Milland (aka SpamGirl) asked me if there is a way to see the active requesters on Mechanical Turk over time. I did not have this dashboard on Mechanical Turk tracker, but it was an important metric, so I decided to add it in the MTurk Tracker website.

So, MTurk Tracker now has a tab called "Active Requesters," which shows how many requesters are "active" on Mechanical Turk at any given time. "Active at time X" means "had a task that was running on MTurk before time X and after time X".
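The definition translates directly into code; the tuple format for observed HITs below is an assumption for illustration, not the tracker's actual schema.

```python
from datetime import date

# (requester_id, first_seen, last_seen) for each HIT observed by the crawler
hits = [
    ("req-A", date(2015, 3, 1), date(2015, 3, 10)),
    ("req-B", date(2015, 3, 5), date(2015, 3, 6)),
    ("req-A", date(2015, 6, 1), date(2015, 6, 2)),
]

def active_requesters(hits, t):
    """Requesters with at least one HIT running at time t (i.e., a HIT
    observed both before and after t)."""
    return {r for r, start, end in hits if start <= t <= end}
```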

Here is the chart for the active requesters between Jan 1, 2015 and February 28, 2016: 


As you can see, starting in March 2015 (that is, before the announcement of the price increases), we see a decline in the number of active requesters. Interestingly, when the fee increases were announced, we see a small "valley" around the period of the fee increases. The numbers remain stable until November, but after that we see a steady decline.

Overall, we observe a YoY decline of almost 50% in terms of active requesters.

What is driving the decline? Hard to tell. Perhaps requesters abandon crowdsourcing in favor of more automated solutions, such as deep learning. Perhaps requesters with long-running jobs build their own workforce (e.g., using UpWork). Perhaps they use alternative platforms, such as CrowdFlower. Or perhaps my own metric is flawed, and I need to revise it.

But, unless we have a bug in the code, the future does not seem promising for Mechanical Turk. And this is a shame.


Wednesday, June 10, 2015

An API for MTurk Demographics

A few months back, I launched demographics.mturk-tracker.com, a tool that continuously runs surveys of the Mechanical Turk worker population and displays live statistics about gender, age, income, country of origin, etc.

Of course, there are many other reports and analyses that can be built on top of the data. To make it easier for other people to use and analyze the data, we now offer a simple API for retrieving the raw survey data.

Here is a quick example: We first call the API and get back the raw responses:

In [1]:
import requests
import json
import pprint
import pandas as pd
from datetime import datetime
import time

# The API call that returns the last 10K survey responses
url = "https://mturk-surveys.appspot.com/" + \
    "_ah/api/survey/v1/survey/demographics/answers?limit=10000"
resp = requests.get(url)
json = json.loads(resp.text)  # note: this shadows the json module, which we no longer need


Then we reformat the returned JSON object and transform the responses into a flat table:

In [2]:
# This function takes as input the response for a single survey, and transforms it into a flat dictionary
def flatten(item):
    fmt = "%Y-%m-%dT%H:%M:%S.%fZ"
    
    hit_answer_date = datetime.strptime(item["date"], fmt)
    hit_creation_str = item.get("hitCreationDate")
    
    if hit_creation_str is None: 
        hit_creation_date = None 
        diff = None
    else:
        hit_creation_date = datetime.strptime(hit_creation_str, fmt)
        # convert to unix timestamp
        hit_date_ts = time.mktime(hit_creation_date.timetuple())
        answer_date_ts = time.mktime(hit_answer_date.timetuple())
        diff = int(answer_date_ts-hit_date_ts)
    
    result = {
        "worker_id": str(item["workerId"]),
        "gender": str(item["answers"]["gender"]),
        "household_income": str(item["answers"]["householdIncome"]),
        "household_size": str(item["answers"]["householdSize"]),
        "marital_status": str(item["answers"]["maritalStatus"]),
        "year_of_birth": int(item["answers"]["yearOfBirth"]),
        "location_city": str(item.get("locationCity")),
        "location_region": str(item.get("locationRegion")),
        "location_country": str(item["locationCountry"]),
        "hit_answered_date": hit_answer_date,
        "hit_creation_date": hit_creation_date,
        "post_to_completion_secs": diff
    }
    return result

# We now transform our API answer into a flat table (Pandas dataframe)
responses = [flatten(item) for item in json["items"]]
df = pd.DataFrame(responses)
df["gender"]=df["gender"].astype("category")
df["household_income"]=df["household_income"].astype("category")


We can then save the data to a vanilla CSV file and see what the raw data looks like:

In [3]:
# Let's save the file as a CSV
df.to_csv("data/mturk_surveys.csv")

!head -5 data/mturk_surveys.csv

,gender,hit_answered_date,hit_creation_date,household_income,household_size,location_city,location_country,location_region,marital_status,post_to_completion_secs,worker_id,year_of_birth
0,male,2015-06-10 15:57:23.072000,2015-06-10 15:50:23,"$25,000-$39,999",5+,kochi,IN,kl,single,420.0,4ce5dfeb7ab9edb7f3b95b630e2ad0de,1992
1,male,2015-06-10 15:57:01.022000,2015-06-10 15:35:22,"Less than $10,000",4,?,IN,?,single,1299.0,cd6ce60cff5e120f3c006504bbf2eb86,1987
2,male,2015-06-10 15:21:53.070000,2015-06-10 15:20:08,"$60,000-$74,999",2,?,US,?,married,105.0,73980a1be9fca00947c59b93557651c8,1971
3,female,2015-06-10 15:16:50.111000,2015-06-10 14:50:06,"Less than $10,000",2,jacksonville,US,fl,married,1604.0,a4cdbe00c93728aefea6cdfb53b8c489,1992

Or we can take a peek at the top countries:

In [4]:
# Let's see the top countries
country = df['location_country'].value_counts()
country.head(20)
Out[4]:
US    5748
IN    1281
CA      30
PH      22
GB      16
ZZ      15
DE      14
AE      11
BR      10
RO      10
TH       7
AU       7
PE       7
MK       7
FR       6
IT       6
NZ       6
SG       6
RS       5
PK       5
dtype: int64

I hope that these examples are sufficient to get people started with the API, and I am looking forward to seeing what analyses people will perform.

Monday, June 8, 2015

Postdoc Position for Quality Control in Crowdsourcing

The Center for Data Science at NYU invites applications for a post-doctoral fellowship in statistical methodology relating to evaluating rater quality for a new research program in the application of crowdsourcing ratings of human speech production.

Duties and Responsibilities: This is a two-year postdoctoral position affiliated with the NYU Center for Data Science. The successful candidate will join a dynamic group of researchers in several NYU centers, including PRIISM, MAGNET, the Stern School of Business, the NYU Medical School, and the Department of Communicative Sciences and Disorders. We are seeking highly motivated individuals to develop and test novel statistical and computational methods for evaluating rater quality in crowdsourced tasks. Responsibilities will include the development, testing, and implementation of statistical algorithms, as well as the preparation of manuscripts for academic publication. Advanced knowledge of R is preferred.

Position Qualifications: Candidates will ideally have a doctoral degree in Statistics, Biostatistics, Data Science, Computer Science, or a related field, as well as genuine interests and experiences in interdisciplinary research that integrates study of human speech, citizen science games and computational statistics. Candidates will ideally have expertise in the following areas: Bayesian statistics, numerical methods and techniques, psychometrics and/or knowledge of programming languages. Outstanding computing and communication skills are required.

Please send CV, letter of intent, and three reference letters to Daphna Harel  (daphna dot harel at nyu dot edu) by July 31, 2015.

The position is for 2 years (subject to good research progress). The successful candidate will be based at the NYU Center for Data Science, under the primary supervision of NYU faculty members Panos Ipeirotis and Daphna Harel, and will closely work with a multidisciplinary team including NYU faculty members Tara McAllister Byun, R. Luke DuBois, and Mario Svirsky. The position will preferably start by September 2015 (start date negotiable).

Friday, May 29, 2015

The World Bank Report on Online Labor

I am often asked about statistics and data on the global population of "crowdsourcing" workers, beyond Mechanical Turk. I am happy to say that from now on I will be able to point everyone to a study from The World Bank, in which I was fortunate to participate. The report examines the global landscape of online labor, identifies the opportunities, and provides relevant statistics.

The study will be officially released on Wednesday June 3rd, and for those of you willing to attend the launch event through Webex, here is the information:

---
When
Wednesday, June 3, 2015, 9:00AM - 11:30AM EDT

Where:
Webex URL
Meeting number: 730 125 194
Meeting password: online1
Audio connection: 1-650-479-3207 Call-in toll number (US/Canada)
Access code: 730 125 194

Title:
The New Online Outsourcing Approach for Jobs, Youth and Women's Empowerment and Services Exports

Abstract
This event will discuss the new online outsourcing (OO) phenomenon in the world today, its implications for developing countries, and how your clients can leverage it as an innovative approach for jobs, youth employment, and women's empowerment.

OO refers to the contracting of third-party workers and providers (often overseas) to supply services or perform tasks via Internet-based marketplaces or platforms. Also known as paid crowdsourcing, online work, microwork and other names - these technology-mediated channels allow clients to outsource their paid work to a large, distributed, global labor pool of remote workers, to enable performance, coordination, quality control, delivery, and payment of such services online.

The global OO marketplace today includes numerous emerging and growing platforms, such as Upwork (formerly Elance-oDesk), Crowdflower, CloudFactory, Amazon Mechanical Turk, etc. There is also a wide variety of services that can be performed online, such as data entry, digitization, graphics rendering and design, programming and app development, accounting and legal services, etc. Workers in developing countries can access and perform jobs from all over the world, as long as they have computer and Internet access. In addition to jobs and income, OO offers workers flexibility in their time and working environment, opportunities to develop professional skills, and can drive positive social change for youth and women.

The event will share with participants the OO study, which comprehensively covers the definition and segments of the market, trends and market size, economic and non-financial impact on workers, and the implications and policy recommendations. In addition, the event will show how you can apply the online toolkit to assess the readiness of your client countries for OO.

The World Bank's ICT Unit is excited to share this new global study and toolkit, which was developed in partnership with the Rockefeller Foundation and Dalberg Global Development Advisors.

Who:
  • Chair: Mavis Ampah, Lead ICT Policy Specialist and Practice Lead on Jobs, GTIDR 
  • Siou Chew Kuek, Senior ICT Specialist and TTL, GTIDR 
  • Cecilia Paradi-Guilford, ICT Innovation Specialist and Co-TTL, GTIDR 
  • Saori Imaizumi, ICT Innovation and Education Consultant, GTIDR 

Monday, April 6, 2015

Demographics of Mechanical Turk: Now Live! (April 2015 edition)

One of the most common questions that I receive is whether I have new data about the demographics of Mechanical Turk workers. The latest data I had collected were from 2010, and it was not clear how things have changed since then. The key problem was not that I could not run additional surveys; that would have been trivial. However, the results of the surveys kept changing over time: the aggregate data varied too much across surveys, so I refrained from publishing data that seemed unreliable.

So, I thought of how to tackle two problems at once:
  • Make it easy for people to see current data about the demographics of Mechanical Turk workers
  • Make it easy to understand the inherent variability of the collected data, and potentially understand the source of the variability
For that reason, we built a new site:

(please also check the API)

The site displays live data about the demographics of the workers, based on a small 5-question survey that workers are asked to answer (for 5 cents each). To capture the variability over time, we post one survey every 15 minutes, allowing us to observe changes in the answers. We also allow each worker to answer the survey only once per month.

A few key results:

Country

Overall, we see that approximately 80% of the Mechanical Turk workers are from the US and 20% are from India.

However, this mix is not stable during the day. Around 8-10am UTC (i.e., 3am NYC time, 1:30pm India time), there is a much higher share of workers from India (~50%), which then drops to 5% at 8-10pm UTC.
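As a minimal sketch of how such hourly breakdowns can be computed from the survey stream, here is some Python over a list of (UTC hour, country) records. The records below are hypothetical, not actual survey data:

```python
from collections import Counter, defaultdict

# Hypothetical response records: (UTC hour of submission, worker country).
responses = [
    (3, "US"), (9, "IN"), (9, "IN"), (9, "US"),
    (21, "US"), (21, "US"), (21, "IN"), (21, "US"),
]

# Count responses per country for each hour of the day.
by_hour = defaultdict(Counter)
for hour, country in responses:
    by_hour[hour][country] += 1

def india_share(hour):
    """Fraction of responses from India during the given UTC hour."""
    counts = by_hour[hour]
    return counts["IN"] / sum(counts.values())
```

With a real response log, plotting `india_share` across all 24 hours yields the daily pattern described above.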



Gender

Gender participation seems balanced, with roughly 50% male and 50% female workers. The charts that examine variability by hour of day and day of the week do not show any change in this pattern.



Age

Roughly 50% of the workers were born in the 1980s and are around 30 years old. Approximately 20% were born in the 1990s, and another 20% in the 1970s.

Marital Status

Approximately 40% of the workers are single, 40% are married, and 10% are cohabiting.

Household Size

Approximately 15% live alone. Then 25% have a household size of two and 25% have a household size of three. Around 25% live in a household of four, and around 10% have five or more members in their household.

Income level

The median household income is around $50K per year for US Turkers, which is on par with the median US household income. Indian workers have considerably lower household incomes, mostly around $10K/yr.



Next steps

In our next steps, we plan to make the (anonymized) survey responses available through an API, and potentially add a few more graphs of interest. If you have any ideas or suggestions, please send them my way.

Monday, June 9, 2014

My Peer Grading Scheme

One of the components that I use in my class is student presentations. 

While I like having students present, I always had a hard time grading the presentations. Moreover, many students seemed to aim their presentations at me, trying to sound technical and advanced, leaving the rest of the class bored and uninterested.

For that reason, I adopted a peer-grading scheme: students present to the class and get rated by the class, not by me. (Although I still reserve a small degree of editorial judgement when assigning the grades.) Here is how my scheme works, after a few years of refinement.
  1. Rating scale: Students assign a grade from 0 to 10 to the presentations.
  2. No self-grading: Students do not grade their own presentations. (Early on, some students were assigning a 10 to themselves and lower grades to everyone else. Now they can still grade themselves if they want, but the grade is ignored.)
  3. Normalization: All the grades assigned by a student are normalized to have zero mean and unit standard deviation. (This normalization was introduced to fight attempts to game the system by assigning low grades to everyone else, hoping to lower the average rating of all other students.)
  4. Grade assignment: The presentation grade is the average of the normalized scores. Formally, each student $s$ assigns to presentation $t$ a normalized grade $z(s,t)$, and the overall grade of the presentation is the mean value $E[z(*,t)]$ of the grades $z(s,t)$ across students.
  5. Ensuring careful grading by asking students to estimate the class rating: One problem with the peer-grading scheme was that many students did not take it seriously and assigned effectively random grades (typically, the same grade to everyone). To avoid indifferent grading, I give credit (~10%) based on the correlation of a student's assigned grades $z(s,t)$ with the mean values $E[z(*,t)]$, across all presentations $t$. This ensures that students at least try to figure out how the rest of the class will rate a presentation, instead of assigning random grades.
  6. Separate assigned and estimated grades: The problem with rewarding agreement with the class was that some students believed themselves to be better assessors than the rest of the class; they felt that their own grade was the correct one, and did not like losing credit for assigning their "true" grade. To address this, I now ask students to assign two grades: their own grade $z_p(s,t)$, and an estimate of the class grade $z_c(s,t)$. The personal grade $z_p$ is used to compute $E[z(*,t)]$ in Step 4, and the estimate $z_c$ is used to compute the correlation in Step 5.
  7. Examine self-grading: Given that the class-estimate grades are not directly used to grade a presentation, students are also asked to provide an estimate of their own grade as part of Step 6. Effectively, students are encouraged to assess their own work honestly.
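As a sketch, the normalization (Step 3), averaging (Step 4), and agreement scoring (Step 5) fit in a few lines of Python. The student names and grades below are hypothetical, each student here submits a single grade per presentation for simplicity, and Pearson correlation is computed by hand to keep the example self-contained:

```python
from statistics import mean, stdev

# raw[s][t] = grade (0-10) that student s assigned to presentation t (made-up data)
raw = {
    "alice": {"p1": 8, "p2": 6, "p3": 9},
    "bob":   {"p1": 5, "p2": 4, "p3": 7},
    "carol": {"p1": 9, "p2": 7, "p3": 8},
}

def normalize(grades):
    """Step 3: z-score a student's grades (zero mean, unit standard deviation)."""
    mu, sigma = mean(grades.values()), stdev(grades.values())
    return {t: (g - mu) / sigma if sigma else 0.0 for t, g in grades.items()}

z = {s: normalize(g) for s, g in raw.items()}

# Step 4: presentation grade = mean normalized score across graders.
presentations = sorted({t for g in raw.values() for t in g})
grade = {t: mean(z[s][t] for s in z) for t in presentations}

def pearson(xs, ys):
    """Pearson correlation, implemented inline to avoid version dependencies."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Step 5: credit each student by how well their grades track the class mean.
agreement = {
    s: pearson([z[s][t] for t in presentations],
               [grade[t] for t in presentations])
    for s in z
}
```

In the full scheme, the class-estimate grades $z_c$ (Step 6) would feed the correlation step instead of the personal grades, and self-grades would be excluded before averaging.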
The only thing that I have not tried so far is to modify Step 4 to take into account the correlations from Step 5, effectively weighting each student's grades by their agreement with the rest of the class. However, most students exhibit the same moderate agreement with the class (typical correlation values are in the 0.4-0.6 range, after rating 15-20 presentations), so in practice I do not expect to see a difference.

Overall, I am pretty happy with the scheme. Students indeed try to impress the class (and not me), and many presentations are interesting, interactive, and engaging. The grades are also very consistent with the overall feeling that I get for each presentation, so I did not have to practice my "editorial oversight" and adjust the grade very often (only in a couple of cases, where the students ran into technical problems during the presentation). I would be really interested to try this scheme in one of the big MOOC classes that use peer grading, and see if it can instill the same sense of responsibility in peer grading. 

Tuesday, April 1, 2014

Online Markets: Selling products vs. selling time

We had an interesting discussion a few days back about online job markets, and why they are not a huge success so far, when other, comparatively less important products get huge valuations and visibility. For example, oDesk reached a total transaction volume of a billion dollars over the 10 years of its existence, and roughly 5% to 10% of that volume becomes revenue for the company. Other labor marketplaces typically report even smaller numbers.

While nobody can ignore a billion dollars of transaction volume, I am puzzled why this number has not skyrocketed. It is very clear that the market serves a purpose: work is a trillion-dollar industry. Letting people work online provides better and more efficient access to human capital, alleviates the need for immigration, and improves the lives of the people involved. It is a no-brainer.

Why does it take so long for online work to take off? What is missing?

***

I was puzzled by these questions for a long time. I postulated that there are obstacles that prevent employers from hiring online, but recently I got some hints that there are obstacles on the worker side as well. I talked with some friends of mine back in Greece, who make a very comfortable living working through the platform. I asked how they like making US salaries while living in Greece, and their answer was surprising: they did not see online work as a long-term solution, but rather as a temporary gig.

When I asked why, they both pointed to the same problem: there is no room in such markets for career evolution. You end up selling your time, and time does not scale. It is very hard to grow your business when you are always a freelancer, without the ability to hire people, delegate tasks, and build a company. Now compare online work with a market like Amazon or eBay: both allow sellers to effectively build businesses. Online job markets, by contrast, allow workers only to sell their time.

When sellers have capped growth, the market faces growth headwinds as it tries to reach maturity.
***

On a more general note, this suggests a hypothesis about what can make a marketplace (hugely) successful: the market should allow sellers to grow without an obvious ceiling. Otherwise, the best sellers are unlikely to participate in the platform, due to the lack of upside.

Take some marketplace companies and interpret them through this framework:
  • Google Helpouts: Same restrictions on seller growth as all other job marketplaces.
  • Uber: Obviously, currently the sellers have a cap on growth, which is limited by their time. However, Uber allows the enrollment of limo/taxi agencies, which potentially grow indefinitely.
  • AirBnB: No obvious seller cap for someone who wants to enter the hospitality business.
  • TaskRabbit: Very obvious growth cap for the individual sellers of services.
  • OpenTable: No obvious limit of growth for participating restaurants.
  • eBay/Amazon: No obvious limit of growth for sellers that sell products online
  • Etsy: This is an interesting case. On the surface, the company looks like eBay/Amazon. However, the Etsy guidelines dictate that "Everything on Etsy must be Handmade, Vintage, or a Craft Supply." Unfortunately, this places restrictions on seller growth, as it implicitly limits sellers to being (very) small businesses. My bet is that Etsy will revise this policy down the road, once more and more sellers start hitting their growth ceiling.
How accurate is the hypothesis? Time will tell...

Wednesday, January 22, 2014

Future of Education: Fighting Obesity or Fighting Hunger?

I have been following with interest the discussion about the future of education.

***

Some people criticize existing educational institutions, indicating that they offer little in terms of real training, and that real learning occurs outside the classroom, by actually doing. "Nobody learns how to build a system in a computer science class." "Nobody learns how to build a company in an entrepreneurship program."

Others are lamenting that by shifting to training-oriented schemes, we are losing the ability to offer deeper education, on topics that are not marketable. Who is going to study poetry if it has no return on investment? Who is going to teach literature if there is no demand for it?

These two criticisms seem to be pushing in two different directions.

***

In reality, we need to address two different needs:

One need is to truly democratize education, taking the content of the top courses and making it accessible and available to everyone. People who want to learn machine learning can now take courses from top professors, instead of having to read a book. People can now advance their careers easily, without having to enroll in expensive degree programs.

The other need is to preserve the breadth of education, shielding it from market forces: to preserve a structure in which students are exposed to diverse fields during their education, whether or not there is market demand for those fields.

***

This tension reminded me of the debate about genetically modified foods.

Mass production of food pretty much solved the problem of world hunger. A few decades ago, there was a real problem with world hunger. Famine was a real problem in many areas of the world, due to the inability to produce enough food to feed the growing population: floods, droughts, diseases were disrupting production, resulting in shortages. Today, the advances in agriculture allow the abundant production of grains and food: wheat and rice varieties are now robust, resistant to diseases, adaptable to many different climates, and allow us to feed the world.

The advances that solved the problem of world hunger ended up creating other problems. Processed carbohydrates are causing obesity, diabetes, gout, and many other "luxury" diseases in the developed world. The poor in the developed world are not dying because they are hungry; they are dying from diets starved of essential nutrients.

***

The parallels are striking. The MOOCs, Khan Academies, and Code Academies of the world are the genetically modified foods for those living in the "third world of education." These courses may not be the most nutritious, and they may not provide all the "nutrients" for a complete education. However, the choice for many of these people is not Stanford vs. a Coursera MOOC; it is nothing vs. a Coursera MOOC. Given that choice, take the MOOC every time.

Those that live in the "developed world of education" can be pickier. They may have access to the genetically modified MOOCs, but if they can afford it, the organic, artisanal, locally sourced education can be potentially better than the mass produced MOOC. 

***

Horses for courses (pun intended).


Monday, January 20, 2014

Crowdsourcing research: What is really new?

A common question that comes up when discussing research in crowdsourcing is how it compares with similar efforts in other fields. Having discussed these comparisons a few times, I thought it would be good to collect them all in a single place.
  • Ensemble learning: In machine learning, you can generate a large number of "weak classifiers" and then build a stronger classifier on top. In crowdsourcing, you can treat each human as a weak classifier and learn on top. What is the difference? In crowdsourcing, each judgment has a cost. With ensembles, you can trivially create 100 weak classifiers, classify each object, and then learn on top; in crowdsourcing, you pay for every classification decision. Furthermore, you cannot force every person to participate, and participation is often heavy-tailed: a few humans participate a lot, but from most of them we get only a few judgments.
  • Quality assurance in manufacturing: When factories create batches of products, they use a sampling process to examine the quality of the manufactured products. For example, a factory creates light bulbs and wants 99% of them to be operational; the typical process involves setting aside a sample and testing whether it meets the quality requirement. In crowdsourcing, this would be equivalent to verifying, with gold testing or with post-verification, the quality of each worker. Two key differences: the heavy-tailed participation of workers means that gold-testing each person is not always efficient, as you may end up testing a worker extensively only to have them leave. Furthermore, a sub-par worker can often still generate somewhat useful information, while a tangible product is either acceptable or not.
  • Active learning: Active learning assumes that humans can provide input to a machine learning model (e.g., disambiguate an ambiguous example) and the answers are assumed to be perfect. In crowdsourcing this is not the case, and we need to explicitly take the noise into account.
  • Test theory and Item Response Theory: Test theory focuses on how to infer the skill of a person through a set of questions. For example, to create an SAT or GRE test, we need a mix of questions of different difficulties, and we need to know whether these questions really separate people of different abilities. Item Response Theory studies exactly these questions: based on the answers that users give to the tests, IRT calculates various metrics, such as the probability that a user of a given ability will answer a question correctly, the average difficulty of a question, etc. Two things make IRT inapplicable directly to a crowdsourcing setting: first, IRT assumes that we know the correct answer to each question; second, IRT often requires 100-200 answers to provide robust estimates of the model parameters, a cost that is typically too high for many crowdsourcing applications (except perhaps citizen science and other volunteer-based projects).
  • Theory of distributed systems: This part of CS theory is actually much closer to many crowdsourcing problems than many people realize, especially the work on asynchronous distributed systems, which attempts to solve many coordination problems that appear in crowdsourcing (e.g., agreeing on an answer). The work on byzantine systems, which explicitly acknowledges the existence of malicious agents, provides significant theoretical foundations for defending systems against spam attacks, etc. What I am not aware of is any explicit treatment of noisy (as opposed to malicious) agents, or any study, within that context, of incentives that affect the way people answer a given question.
  • Database systems and user-defined functions (UDFs): In databases, a query optimizer tries to identify the best way to execute a given query, returning correct results as fast as possible. An interesting part of database research that is applicable to crowdsourcing is the inclusion of user-defined functions in the optimization process. A UDF is typically a slow, manually coded function that the query optimizer tries to invoke as little as possible. The ideas from UDFs are applicable when treating the human in the loop as a UDF, with the following caveats: (a) UDFs were assumed to return perfect information, and (b) UDFs were assumed to have a deterministic, or stochastic but normally distributed, execution time. The noisiness of human results and the often long-tailed execution times of human tasks make the immediate application of UDF research to optimizing crowdsourcing operations rather challenging. Still, the related chapters on UDF optimization in the database textbooks are worth reading.
  • (Update) Information Theory and Error Correcting Codes: We can model workers as noisy channels that take the true signal as input and return a noisy representation of it. The idea of using advanced error-correcting codes to improve crowdsourcing is rather underexplored, imho. Instead we rely too much on redundancy-based solutions, although pure redundancy has been theoretically proven to be a suboptimal technique for error correction. (See an earlier, related blog post.) Here are a couple of potential challenges: (a) the errors of humans are very rarely independent of the "message," and (b) it is not clear whether we can get humans to properly compute the functions commonly required to implement error-correcting codes.
  • (Update) Information Retrieval and Inter-annotator Agreement: In information retrieval, it is very common to examine the agreement of annotators labeling the same set of items. My experience with the literature and the related metrics is that they implicitly assume all workers have the same level of noise, an assumption that is often violated in crowdsourcing.
Are there any other fields, and associated caveats, that should be included in the list?
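To make the redundancy baseline concrete, here is a minimal sketch of majority-vote label aggregation, the redundancy-based approach that the error-correcting-codes bullet above argues is suboptimal. The item names and worker labels are hypothetical:

```python
from collections import Counter

# votes[item] = labels collected from different workers (made-up data)
votes = {
    "img1": ["cat", "cat", "dog"],
    "img2": ["dog", "dog", "dog"],
    "img3": ["cat", "dog", "dog", "dog"],
}

def majority_vote(labels):
    """Return the most frequent label; ties are broken arbitrarily by Counter."""
    return Counter(labels).most_common(1)[0][0]

consensus = {item: labels and majority_vote(labels) for item, labels in votes.items()}
# consensus == {"img1": "cat", "img2": "dog", "img3": "dog"}
```

Note that this treats every worker as equally reliable; weighting votes by estimated worker accuracy (as in ensemble learning or the test-theory bullet) is the natural next step.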