A Computer Scientist in a Business School

Wednesday, June 10, 2015

An API for MTurk Demographics

A few months back, I launched demographics.mturk-tracker.com, a tool that runs continuously surveys of the Mechanical Turk worker population and displays live statistics about gender, age, income, country of origin, etc.

Of course, there are many other reports and analyses that can be presented using the data. In order to make easier for other people to use and analyze the data, we now offer a simple API for retrieving the raw survey data.

Here is a quick example: We first call the API and get back the raw responses:

In [1]:

import requests
import json
import pprint
import pandas as pd
from datetime import datetime
import time

# The API call that returns the last 10K survey responses
url = "https://mturk-surveys.appspot.com/" + \
    "_ah/api/survey/v1/survey/demographics/answers?limit=10000"
resp = requests.get(url)
json = json.loads(resp.text)

Then we need to reformat the returned JSON object and transform the responses into a flat table

In [2]:

# This function takes as input the response for a single survey, and transforms it into a flat dictionary
def flatten(item):
    fmt = "%Y-%m-%dT%H:%M:%S.%fZ"
    
    hit_answer_date = datetime.strptime(item["date"], fmt)
    hit_creation_str = item.get("hitCreationDate")
    
    if hit_creation_str is None: 
        hit_creation_date = None 
        diff = None
    else:
        hit_creation_date = datetime.strptime(hit_creation_str, fmt)
        # convert to unix timestamp
        hit_date_ts = time.mktime(hit_creation_date.timetuple())
        answer_date_ts = time.mktime(hit_answer_date.timetuple())
        diff = int(answer_date_ts-hit_date_ts)
    
    result = {
        "worker_id": str(item["workerId"]),
        "gender": str(item["answers"]["gender"]),
        "household_income": str(item["answers"]["householdIncome"]),
        "household_size": str(item["answers"]["householdSize"]),
        "marital_status": str(item["answers"]["maritalStatus"]),
        "year_of_birth": int(item["answers"]["yearOfBirth"]),
        "location_city": str(item.get("locationCity")),
        "location_region": str(item.get("locationRegion")),
        "location_country": str(item["locationCountry"]),
        "hit_answered_date": hit_answer_date,
        "hit_creation_date": hit_creation_date,
        "post_to_completion_secs": diff
    }
    return result

# We now transform our API answer into a flat table (Pandas dataframe)
responses = [flatten(item) for item in json["items"]]
df = pd.DataFrame(responses)
df["gender"]=df["gender"].astype("category")
df["household_income"]=df["household_income"].astype("category")

We can then save the data to a vanilla CSV file, and see how the raw data looks like:

In [3]:

# Let's save the file as a CSV
df.to_csv("data/mturk_surveys.csv")

!head -5 data/mturk_surveys.csv

,gender,hit_answered_date,hit_creation_date,household_income,household_size,location_city,location_country,location_region,marital_status,post_to_completion_secs,worker_id,year_of_birth
0,male,2015-06-10 15:57:23.072000,2015-06-10 15:50:23,"$25,000-$39,999",5+,kochi,IN,kl,single,420.0,4ce5dfeb7ab9edb7f3b95b630e2ad0de,1992
1,male,2015-06-10 15:57:01.022000,2015-06-10 15:35:22,"Less than $10,000",4,?,IN,?,single,1299.0,cd6ce60cff5e120f3c006504bbf2eb86,1987
2,male,2015-06-10 15:21:53.070000,2015-06-10 15:20:08,"$60,000-$74,999",2,?,US,?,married,105.0,73980a1be9fca00947c59b93557651c8,1971
3,female,2015-06-10 15:16:50.111000,2015-06-10 14:50:06,"Less than $10,000",2,jacksonville,US,fl,married,1604.0,a4cdbe00c93728aefea6cdfb53b8c489,1992

Or we can take a peek at the top countries:

In [4]:

# Let's see the top countries
country = df['location_country'].value_counts()
country.head(20)

Out[4]:

US    5748
IN    1281
CA      30
PH      22
GB      16
ZZ      15
DE      14
AE      11
BR      10
RO      10
TH       7
AU       7
PE       7
MK       7
FR       6
IT       6
NZ       6
SG       6
RS       5
PK       5
dtype: int64

I hope that the examples are sufficient to get people started using the API, and I am looking forward to see what analyses people will perform.

Monday, June 8, 2015

Postdoc Position for Quality Control in Crowdsourcing

The Center for Data Science at NYU invites applications for a post-doctoral fellowship in statistical methodology relating to evaluating rater quality for a new research program in the application of crowdsourcing ratings of human speech production.

Duties and Responsibilities: This is a two-year postdoctoral position in the affiliated with the NYU Center for Data Science. The successful candidate will join a dynamic group of researchers in several NYU Centers including PRIISM, MAGNET, the Stern School of Business, the NYU Medical School and the Department of Communicative Sciences and Disorders. We are seeking highly motivated individuals to develop and test novel statistical and computational methods for evaluating rater quality in crowdsourced tasks. Responsibilities will include development, testing and implementation of statistical algorithms, as well as preparation of manuscripts for academic publication. Advanced knowledge of R is preferred.

Position Qualifications: Candidates will ideally have a doctoral degree in Statistics, Biostatistics, Data Science, Computer Science, or a related field, as well as genuine interests and experiences in interdisciplinary research that integrates study of human speech, citizen science games and computational statistics. Candidates will ideally have expertise in the following areas: Bayesian statistics, numerical methods and techniques, psychometrics and/or knowledge of programming languages. Outstanding computing and communication skills are required.

Please send CV, letter of intent, and three reference letters to Daphna Harel (daphna dot harel at nyu dot edu) by July 31, 2015.

The position is for 2 years (subject to good research progress). The successful candidate will be based at the NYU Center for Data Science, under the primary supervision of NYU faculty members Panos Ipeirotis and Daphna Harel, and will closely work with a multidisciplinary team including NYU faculty members Tara McAllister Byun, R. Luke DuBois, and Mario Svirsky. The position will preferably start by September 2015 (start date negotiable).

Friday, May 29, 2015

The World Bank Report on Online Labor

I am often asked about statistics and data about the global population of "crowdsourcing" workers, going beyond Mechanical Turk. I am happy to say that from now on I will be able to point everyone to a study from The World Bank, which I was fortunate to participate. The reports examines the global landscape of online labor, identifying the opportunities, and providing statistics about the global landscape.

The study will be officially released on Wednesday June 3rd, and for those of you willing to attend the launch event through Webex, here is the information:

---

When:

Wednesday, June 3, 2015, 9:00AM - 11:30AM EDT

Where:
Webex URL

Meeting number: 730 125 194

Meeting password: online1

Audio connection: 1-650-479-3207 Call-in toll number (US/Canada)

Access code: 730 125 194

Title:
The New Online Outsourcing Approach for Jobs, Youth and Women's Empowerment and Services Exports

Abstract:

This event will discuss the new online outsourcing (OO) phenomena in the world today, its implications for developing countries, and how your clients can leverage it as an innovative approach for jobs, youth employment and women's empowerment.

OO refers to the contracting of third-party workers and providers (often overseas) to supply services or perform tasks via Internet-based marketplaces or platforms. Also known as paid crowdsourcing, online work, microwork and other names - these technology-mediated channels allow clients to outsource their paid work to a large, distributed, global labor pool of remote workers, to enable performance, coordination, quality control, delivery, and payment of such services online.

The global OO marketplace today includes numerous emerging and growing platforms; such as Upwork (formerly Elance-oDesk), Crowdflower, CloudFactory, Amazon Mechanical Turk, etc. There are also wide variety of services that can be performed online - such as data entry, digitization, graphics rendering and design, programming and apps development, accounting and legal services, etc. Workers in developing countries can have access and perform jobs from all over the world - as long as they have computer and Internet access. In addition to jobs and income - OO offers workers flexible time and working environment, develop skills for professional, and drive positive social change for youth and women.

The event will share with participants the OO study that covers comprehensively the definition and segments, trends and market size, economic and non-financial impact on workers, and the implications and policy recommendations. In addition the event will show how u can apply the online toolkit to assess the readiness of your client countries for OO.

The World Bank's ICT Unit is excited to share this new global study and toolkit, which was developed in partnership with the Rockefeller Foundation and Dalberg Global Development Advisors.

Who:

Chair: Mavis Ampah, Lead ICT Policy Specialist and Practice Lead on Jobs, GTIDR
Siou Chew Kuek, Senior ICT Specialist and TTL, GTIDR
Cecilia Paradi-Guilford, ICT Innovation Specialist and Co-TTL, GTIDR
Saori Imaizumi, ICT Innovation and Education Consultant, GTIDR

Monday, April 6, 2015

Demographics of Mechanical Turk: Now Live! (April 2015 edition)

One of the most common question that I receive is whether I have new data about the demographics of Mechanical Turk workers. The latest data that I had collected were back in 2010, and it was not clear how things have changed since then. The key problem was not that I could not run additional surveys; that would have been trivial. However, the results of the surveys were always changing over time: the aggregate data varied too much across surveys, so I refrained from publishing data that seemed to be unreliable.

So, I thought of how I tackle two problems at once:

Make it easy for people to see current data about the demographics of Mechanical Turk workers
Make it easy to understand the inherent variability of the collected data, and potentially understand the source of the variability

For that reason, we built a new site:

http://demographics.mturk-tracker.com/

(please also check the API)

The site displays live data about the demographics of the workers, based on a small 5-question survey that users are asked to answer (paying 5 cents for each). To be able to capture the time variability, we post one survey every 15 mins, allowing us to observe changes in the answers over time. We also restrict each worker to be able to answer the survey only once per month.

A few key results:

Country

Overall, we see that approximately 80% of the Mechanical Turk workers are from the US and 20% are from India.

However, this mix is not stable during the day. Around 8-10am UTC (ie 3am NYC time, 1.30pm India time), there is much higher number of workers from India (~50%), which then goes down to 5% at 8-10pm UTC.

Gender

The gender participation seems to be balanced, with roughly 50% males and 50%. The charts that examine variability based on hour of day and day of the week do not show any change in this pattern.

Year of birth

Roughly 50% of the workers are born in the 1980's and are around 30 yrs old. Approximately 20% of the workers are born in the 1990's, and another 20% are born in the 1970's.

Marital Status

Approximately 40% of the workers are single, 40% are married, and 10% are cohabitating.

Household Size

Approximately 15% live alone. Then 25% have a household size of two and 25% have a household size of three. Around 25% live in a household of four, and around 10% have five or more members in their household.

Income level

The median household income is around \$50K per year for US Turkers, which is on par with the median US household income. Indian workers have considerably lower household income, with most of them being around \$10K/yr.

Next steps

In our next steps, we plan on making the (anonymized) survey responses available through an API, and potentially add a few more graphs of interest. If you have any idea or suggestion, please send it my way.