A few months back, I launched demographics.mturk-tracker.com, a tool that runs continuous surveys of the Mechanical Turk worker population and displays live statistics about gender, age, income, country of origin, etc.
Of course, there are many other reports and analyses that can be built from the same data. To make it easier for other people to use and analyze the data, we now offer a simple API for retrieving the raw survey responses.
Here is a quick example: We first call the API and get back the raw responses:
import requests
import json
import pprint
import pandas as pd
from datetime import datetime
import time
# The API call that returns the last 10K survey responses
url = ("https://mturk-surveys.appspot.com/"
       "_ah/api/survey/v1/survey/demographics/answers?limit=10000")
resp = requests.get(url)
resp.raise_for_status()  # stop early if the request failed
data = json.loads(resp.text)
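If you plan to call the API repeatedly, it may help to wrap the request in a small function. The following is only a sketch: `fetch_answers` is a hypothetical helper name of my choosing, the timeout value is arbitrary, and `limit` is the only query parameter I am assuming, since it is the only one used in the call above.

```python
import requests

# Endpoint taken from the call above; `limit` is the only parameter shown there.
BASE_URL = ("https://mturk-surveys.appspot.com/"
            "_ah/api/survey/v1/survey/demographics/answers")


def fetch_answers(limit=10000, session=None, timeout=30):
    """Return the list of raw survey responses (hypothetical helper)."""
    # Allow a requests.Session (or a test double) to be passed in.
    http = session or requests
    resp = http.get(BASE_URL, params={"limit": limit}, timeout=timeout)
    resp.raise_for_status()  # raise on HTTP error codes instead of parsing an error page
    payload = resp.json()
    # "items" is the key that holds the individual responses in the returned JSON
    return payload.get("items", [])
```

Passing a `requests.Session` as `session` reuses the connection across calls, and also makes the function easy to exercise without touching the network.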
Then we need to reformat the returned JSON object and transform the responses into a flat table:
# This function takes as input the response for a single survey, and transforms it into a flat dictionary
def flatten(item):
    fmt = "%Y-%m-%dT%H:%M:%S.%fZ"
    hit_answer_date = datetime.strptime(item["date"], fmt)
    hit_creation_str = item.get("hitCreationDate")
    if hit_creation_str is None:
        hit_creation_date = None
        diff = None
    else:
        hit_creation_date = datetime.strptime(hit_creation_str, fmt)
        # Seconds between posting the HIT and receiving the answer.
        # Subtracting the datetimes directly avoids the local-timezone
        # (and DST) conversion that time.mktime would apply.
        diff = int((hit_answer_date - hit_creation_date).total_seconds())
    result = {
        "worker_id": str(item["workerId"]),
        "gender": str(item["answers"]["gender"]),
        "household_income": str(item["answers"]["householdIncome"]),
        "household_size": str(item["answers"]["householdSize"]),
        "marital_status": str(item["answers"]["maritalStatus"]),
        "year_of_birth": int(item["answers"]["yearOfBirth"]),
        "location_city": str(item.get("locationCity")),
        "location_region": str(item.get("locationRegion")),
        "location_country": str(item["locationCountry"]),
        "hit_answered_date": hit_answer_date,
        "hit_creation_date": hit_creation_date,
        "post_to_completion_secs": diff
    }
    return result
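To make the elapsed-time field concrete, here is a small worked example of the date handling inside `flatten`. The two raw timestamp strings are my assumption: I reconstructed them from the first row of the sample CSV output further down, using the format string from the function.

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S.%fZ"  # same format string as in flatten()

# Assumed raw values, reconstructed from the first row of the sample CSV output
created = datetime.strptime("2015-06-10T15:50:23.000Z", FMT)
answered = datetime.strptime("2015-06-10T15:57:23.072Z", FMT)

# Time between posting the HIT and receiving the answer, in whole seconds
elapsed = answered - created
secs = int(elapsed.total_seconds())
print(secs)  # 420
```

The result matches the `post_to_completion_secs` value of 420 reported for that row.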
# We now transform our API answer into a flat table (Pandas dataframe)
responses = [flatten(item) for item in data["items"]]
df = pd.DataFrame(responses)
df["gender"] = df["gender"].astype("category")
df["household_income"] = df["household_income"].astype("category")
We can then save the data to a vanilla CSV file, and see what the raw data looks like:
# Let's save the file as a CSV
df.to_csv("data/mturk_surveys.csv")
!head -5 data/mturk_surveys.csv
,gender,hit_answered_date,hit_creation_date,household_income,household_size,location_city,location_country,location_region,marital_status,post_to_completion_secs,worker_id,year_of_birth
0,male,2015-06-10 15:57:23.072000,2015-06-10 15:50:23,"$25,000-$39,999",5+,kochi,IN,kl,single,420.0,4ce5dfeb7ab9edb7f3b95b630e2ad0de,1992
1,male,2015-06-10 15:57:01.022000,2015-06-10 15:35:22,"Less than $10,000",4,?,IN,?,single,1299.0,cd6ce60cff5e120f3c006504bbf2eb86,1987
2,male,2015-06-10 15:21:53.070000,2015-06-10 15:20:08,"$60,000-$74,999",2,?,US,?,married,105.0,73980a1be9fca00947c59b93557651c8,1971
3,female,2015-06-10 15:16:50.111000,2015-06-10 14:50:06,"Less than $10,000",2,jacksonville,US,fl,married,1604.0,a4cdbe00c93728aefea6cdfb53b8c489,1992
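For anyone who wants to pick up the analysis from the CSV rather than from the API, the following sketch reloads the four sample rows above with pandas. It reads from an in-memory string instead of the file so that it runs on its own; the `approx_age` column is a name I made up for illustration, and the age is only approximate because the survey records the year of birth, not the full date.

```python
import io
import pandas as pd

# The header and the four sample rows shown above, copied verbatim
sample = "\n".join([
    ',gender,hit_answered_date,hit_creation_date,household_income,household_size,location_city,location_country,location_region,marital_status,post_to_completion_secs,worker_id,year_of_birth',
    '0,male,2015-06-10 15:57:23.072000,2015-06-10 15:50:23,"$25,000-$39,999",5+,kochi,IN,kl,single,420.0,4ce5dfeb7ab9edb7f3b95b630e2ad0de,1992',
    '1,male,2015-06-10 15:57:01.022000,2015-06-10 15:35:22,"Less than $10,000",4,?,IN,?,single,1299.0,cd6ce60cff5e120f3c006504bbf2eb86,1987',
    '2,male,2015-06-10 15:21:53.070000,2015-06-10 15:20:08,"$60,000-$74,999",2,?,US,?,married,105.0,73980a1be9fca00947c59b93557651c8,1971',
    '3,female,2015-06-10 15:16:50.111000,2015-06-10 14:50:06,"Less than $10,000",2,jacksonville,US,fl,married,1604.0,a4cdbe00c93728aefea6cdfb53b8c489,1992',
])

# index_col=0 restores the unnamed index column written by to_csv;
# parse_dates turns the two timestamp columns back into datetimes
reloaded = pd.read_csv(
    io.StringIO(sample),
    index_col=0,
    parse_dates=["hit_answered_date", "hit_creation_date"],
)

# Approximate age at the time of the answer (hypothetical derived column)
reloaded["approx_age"] = (
    reloaded["hit_answered_date"].dt.year - reloaded["year_of_birth"]
)
print(reloaded[["gender", "household_income", "approx_age"]])
```

Note that the income brackets contain commas, which is why they are quoted in the file; `read_csv` handles the quoting, so each bracket comes back as a single value.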
Or we can take a peek at the top countries:
# Let's see the top countries
country = df['location_country'].value_counts()
country.head(20)
US 5748
IN 1281
CA 30
PH 22
GB 16
ZZ 15
DE 14
AE 11
BR 10
RO 10
TH 7
AU 7
PE 7
MK 7
FR 6
IT 6
NZ 6
SG 6
RS 5
PK 5
dtype: int64
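Raw counts are often easier to read as percentages. As a rough sketch, the snippet below takes the five largest counts from the output above and converts them to shares. Keep in mind that these are shares among those five countries only, not among all responses, since I am not using the full table here.

```python
import pandas as pd

# The five largest counts, copied from the output above
top = pd.Series({"US": 5748, "IN": 1281, "CA": 30, "PH": 22, "GB": 16})

# Share of each country among these five, as a percentage
share = (top / top.sum() * 100).round(1)
print(share)
```

With the full dataframe, `df['location_country'].value_counts(normalize=True)` should give the corresponding fractions over all responses directly.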
I hope that the examples are sufficient to get people started using the API, and I am looking forward to seeing what analyses people will perform.