
Wednesday, June 10, 2015

An API for MTurk Demographics

A few months back, I launched demographics.mturk-tracker.com, a tool that continuously runs surveys of the Mechanical Turk worker population and displays live statistics about gender, age, income, country of origin, etc.

Of course, there are many other reports and analyses that can be presented using the data. To make it easier for other people to use and analyze the data, we now offer a simple API for retrieving the raw survey data.

Here is a quick example: We first call the API and get back the raw responses:

In [1]:
import requests
import json
import pprint
import pandas as pd
from datetime import datetime
import time

# The API call that returns the last 10K survey responses
url = "https://mturk-surveys.appspot.com/" + \
    "_ah/api/survey/v1/survey/demographics/answers?limit=10000"
resp = requests.get(url)
# Parse the JSON response; use a separate name to avoid shadowing the json module
data = json.loads(resp.text)


Then we need to reformat the returned JSON object and transform the responses into a flat table.

In [2]:
# This function takes as input the response for a single survey, and transforms it into a flat dictionary
def flatten(item):
    fmt = "%Y-%m-%dT%H:%M:%S.%fZ"
    
    hit_answer_date = datetime.strptime(item["date"], fmt)
    hit_creation_str = item.get("hitCreationDate")
    
    if hit_creation_str is None: 
        hit_creation_date = None 
        diff = None
    else:
        hit_creation_date = datetime.strptime(hit_creation_str, fmt)
        # convert to unix timestamp
        hit_date_ts = time.mktime(hit_creation_date.timetuple())
        answer_date_ts = time.mktime(hit_answer_date.timetuple())
        diff = int(answer_date_ts-hit_date_ts)
    
    result = {
        "worker_id": str(item["workerId"]),
        "gender": str(item["answers"]["gender"]),
        "household_income": str(item["answers"]["householdIncome"]),
        "household_size": str(item["answers"]["householdSize"]),
        "marital_status": str(item["answers"]["maritalStatus"]),
        "year_of_birth": int(item["answers"]["yearOfBirth"]),
        "location_city": str(item.get("locationCity")),
        "location_region": str(item.get("locationRegion")),
        "location_country": str(item["locationCountry"]),
        "hit_answered_date": hit_answer_date,
        "hit_creation_date": hit_creation_date,
        "post_to_completion_secs": diff
    }
    return result

# We now transform our API answer into a flat table (Pandas dataframe)
responses = [flatten(item) for item in data["items"]]
df = pd.DataFrame(responses)
df["gender"]=df["gender"].astype("category")
df["household_income"]=df["household_income"].astype("category")


We can then save the data to a vanilla CSV file and see what the raw data looks like:

In [3]:
# Let's save the file as a CSV
df.to_csv("data/mturk_surveys.csv")

!head -5 data/mturk_surveys.csv

,gender,hit_answered_date,hit_creation_date,household_income,household_size,location_city,location_country,location_region,marital_status,post_to_completion_secs,worker_id,year_of_birth
0,male,2015-06-10 15:57:23.072000,2015-06-10 15:50:23,"$25,000-$39,999",5+,kochi,IN,kl,single,420.0,4ce5dfeb7ab9edb7f3b95b630e2ad0de,1992
1,male,2015-06-10 15:57:01.022000,2015-06-10 15:35:22,"Less than $10,000",4,?,IN,?,single,1299.0,cd6ce60cff5e120f3c006504bbf2eb86,1987
2,male,2015-06-10 15:21:53.070000,2015-06-10 15:20:08,"$60,000-$74,999",2,?,US,?,married,105.0,73980a1be9fca00947c59b93557651c8,1971
3,female,2015-06-10 15:16:50.111000,2015-06-10 14:50:06,"Less than $10,000",2,jacksonville,US,fl,married,1604.0,a4cdbe00c93728aefea6cdfb53b8c489,1992

Or we can take a peek at the top countries:

In [4]:
# Let's see the top countries
country = df['location_country'].value_counts()
country.head(20)
Out[4]:
US    5748
IN    1281
CA      30
PH      22
GB      16
ZZ      15
DE      14
AE      11
BR      10
RO      10
TH       7
AU       7
PE       7
MK       7
FR       6
IT       6
NZ       6
SG       6
RS       5
PK       5
dtype: int64
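The same dataframe supports plenty of other quick summaries. As a purely illustrative follow-up (not part of the original notebook), here are a few one-liners that reuse the columns defined in flatten() above:

# A few more hypothetical examples using the dataframe `df` built above
# (column names as defined in the flatten() function)

# Gender breakdown, as a fraction of all responses
print(df["gender"].value_counts(normalize=True))

# Approximate age distribution, derived from the year of birth
df["age"] = datetime.now().year - df["year_of_birth"]
print(df["age"].describe())

# Median time, in minutes, from posting the survey HIT to receiving the answer
print(df["post_to_completion_secs"].median() / 60)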

I hope that these examples are sufficient to get people started using the API, and I am looking forward to seeing what analyses people will perform.

Monday, January 20, 2014

Crowdsourcing research: What is really new?

A common question that comes up when discussing research in crowdsourcing is how it compares with similar efforts in other fields. Having discussed these comparisons a few times, I thought it would be good to collect them in a single place.
  • Ensemble learning: In machine learning, you can generate a large number of "weak classifiers" and then build a stronger classifier on top. In crowdsourcing, you can treat each human as a weak classifier and then learn on top. What is the difference? In crowdsourcing, each judgment has a cost. With ensembles, you can trivially create 100 weak classifiers, classify each object, and then learn on top. In crowdsourcing, you pay a cost for every classification decision. Furthermore, you cannot force every person to participate, and participation is often heavy-tailed: a few humans participate a lot, but from most of them we get only a few judgments. (The small voting sketch at the end of this post contrasts plain redundancy with accuracy-weighted voting.)
  • Quality assurance in manufacturing: When factories create batches of products, they also have a sampling process to examine the quality of the manufactured products. For example, a factory creates light bulbs and wants 99% of them to be operational. The typical process involves setting aside a sample for testing and checking whether it meets the quality requirement. In crowdsourcing, this would be equivalent to verifying, with gold testing or with post-verification, the quality of each worker. Two key differences: The heavy-tailed participation of workers means that gold-testing each person is not always efficient, as you may end up testing a user a lot, only for the user to leave. Furthermore, it is often the case that a sub-par worker can still generate somewhat useful information, while a tangible product is either acceptable or not.
  • Active learning: Active learning assumes that humans can provide input to a machine learning model (e.g., disambiguate an ambiguous example) and that their answers are perfect. In crowdsourcing this is not the case, and we need to explicitly take the noise into account.
  • Test theory and Item Response Theory: Test theory focuses on how to infer the skill of a person through a set of questions. For example, to create a SAT or GRE test, we need to have a mix of questions of different difficulties, and we need to know whether these questions really separate the persons that have different abilities. Item Response Theory studies exactly these questions and, based on the answers that users give to the tests, IRT calculates various metrics for the questions, such as the probability that a user of a given ability will answer the question correctly, the average difficulty of a question, etc. Two things make IRT inapplicable directly to a crowdsourcing setting: First, IRT assumes that we know the correct answer to each question; second, IRT often requires 100-200 answers to provide robust estimates of the model parameters, a cost that is typically too high for many crowdsourcing applications (except perhaps citizen science and other volunteer-based projects).
  • Theory of distributed systems: This part of CS theory is actually much closer to many crowdsourcing problems than many people realize, especially the work on asynchronous distributed systems, which attempts to solve many coordination problems that appear in crowdsourcing (e.g., agreeing on an answer). The work on the analysis of Byzantine systems, which explicitly acknowledges the existence of malicious agents, provides significant theoretical foundations for defending systems against spam attacks, etc. What I am not aware of is explicit treatment of noisy agents (as opposed to malicious ones), or of any study of incentives within that context that affect the way people answer a given question.
  • Database systems and user-defined functions (UDFs): In databases, a query optimizer tries to identify the best way to execute a given query, trying to return the correct results as fast as possible. An interesting part of database research that is applicable to crowdsourcing is the inclusion of user-defined functions in the optimization process. A user-defined function is typically a slow, manually coded function that the query optimizer tries to invoke as little as possible. The ideas from UDFs are typically applicable when trying to optimize in a human-in-the-loop-as-UDF approach, with the following caveats: (a) UDFs were assumed to return perfect information, and (b) UDFs were assumed to have a deterministic, or stochastic but normally distributed, execution time. The existence of noisy results and the fact that execution times with humans can often be long-tailed make the immediate applicability of UDF research to optimizing crowdsourcing operations rather challenging. However, it is worth reading the related chapters about UDF optimization in the database textbooks.
  • (Update) Information Theory and Error Correcting Codes: We can model the workers as noisy channels that take the true signal as input and return a noisy representation of it. The idea of using advanced error correcting codes to improve crowdsourcing is rather underexplored, imho. Instead we rely too much on redundancy-based solutions, although pure redundancy has been theoretically proven to be a suboptimal technique for error correction. (See an earlier, related blog post.) Here are a couple of potential challenges: (a) the errors of the humans are very rarely independent of the "message", and (b) it is not clear if we can get humans to properly compute the functions that are commonly required for the implementation of error correcting codes.
  • (Update) Information Retrieval and Interannotator Agreement: In information retrieval, it is very common to examine the agreement of annotators when labeling the same set of items. My own experience with the literature and the related metrics is that they implicitly assume that all workers have the same level of noise, an assumption that is often violated in crowdsourcing.
What other fields, and what other caveats, should be included in the list?
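To make the ensemble and redundancy points above concrete, here is a minimal, hypothetical Python sketch (not from the original post): it simulates five workers with heterogeneous, made-up accuracies on binary labels and compares plain majority voting against a vote that weights each worker by the log-odds of being correct, assuming those accuracies are known.

import random
from collections import Counter
from math import log

random.seed(0)

# Hypothetical setup: five simulated workers with heterogeneous accuracies
# label 1,000 binary items. The accuracies and the data are made up.
accuracies = [0.95, 0.85, 0.60, 0.55, 0.55]
truth = [random.choice([0, 1]) for _ in range(1000)]

def answer(true_label, acc):
    # A worker with accuracy `acc` reports the true label with probability acc
    return true_label if random.random() < acc else 1 - true_label

labels = [[answer(t, acc) for acc in accuracies] for t in truth]

# Plain redundancy: unweighted majority vote over the five workers
majority = [Counter(row).most_common(1)[0][0] for row in labels]

# Weighted vote: each worker votes with weight log(acc / (1 - acc)),
# i.e., the log-odds of being correct (here assumed to be known)
def weighted_vote(row):
    score = sum((1 if lab == 1 else -1) * log(acc / (1 - acc))
                for lab, acc in zip(row, accuracies))
    return 1 if score > 0 else 0

weighted = [weighted_vote(row) for row in labels]

acc_majority = sum(m == t for m, t in zip(majority, truth)) / len(truth)
acc_weighted = sum(w == t for w, t in zip(weighted, truth)) / len(truth)
print("majority vote accuracy:", acc_majority)
print("weighted vote accuracy:", acc_weighted)

With accuracies like these, the single reliable worker outweighs the three noisy ones combined, which is exactly the regime where plain redundancy wastes judgments.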

Wednesday, September 11, 2013

CrowdScale workshop at HCOMP 2013

A public service announcement, to advertise CrowdScale (http://www.crowdscale.org/), a cool workshop at HCOMP 2013 that focuses on the challenges people face when applying crowdsourcing at scale.

A couple of interesting twists on the classic workshop recipe:
  • First, the workshop invites submission of short (2-page) position papers which identify and motivate key problems or potential approaches for crowdsourcing at scale, even if no satisfactory solutions are proposed yet. (Deadline: October 4)
  • Second, there is a shared task challenge, which also carries a cool $1500 reward for the winner.
The CfP follows:

Crowdsourcing at a large scale raises a variety of open challenges:
  • How do we programmatically measure, incentivize and improve the quality of work across thousands of workers answering millions of questions daily? 
  • As the volume, diversity and complexity of crowdsourcing tasks increase, how do we scale the hiring, training and evaluation of workers? 
  • How do we design effective elastic marketplaces for more skilled work? 
  • How do we adapt models for long-term, sustained contributions rather than ephemeral participation of workers?
We believe tackling such problems will be key to taking crowdsourcing to the next level – from its uptake by early adopters today, to its future as how the world’s work gets done.
To advance the research and practice in crowdsourcing at scale, our workshop invites position papers tackling such issues of scale. In addition, we are organizing a shared task challenge regarding how to best aggregate crowd labels on large crowdsourcing datasets released by Google and CrowdFlower.
Twitter: #crowdscale, @CrowdAtScale
Organizers

Monday, July 9, 2012

Discussion on Disintermediating a Labor Channel

Last Friday, I wrote a short blog post with the title "Disintermediating a Labor Channel: Does it Make Sense?" where I argued that trying to bypass a labor channel (Mechanical Turk, oDesk, etc) in order to save on the extra fees does not make much sense.

Despite the fact that there was no discussion in the comments, that piece seemed to generate a significant amount of feedback across various semi-private channels (fb/plus/twitter) and in many real-life discussions.

Fernando Pereira wrote on Google Plus:
Your argument sounds right, but I'm wondering about quality: can I control quality/biases in the outside labor platform? How do I specify labor platform requirements to meet my requirements? It could be different from quality control for outsourced widgets because outsourced labor units might be interdependent, and thus susceptible to unwanted correlation between workers.
Another friend wrote in my email:
So, do you advocate that oDesk should be controlling the process? Actually, I'd rather have higher control over my employees and know who is doing what.
Both questions have a similar flavor, which indicates that I failed to express my true thoughts on the issue.

I do not advocate giving up control of the "human computation" process. I advocate letting a third-party platform handle the "low level" recruiting and payment of the workers, preferably through an API-fied process. Payments, money-transfer regulations, and immigration are big tasks that are best handled by specialized platforms; they are too much for most other companies. Handling such things on your own is about as interesting as handling air conditioning, electricity supply, and failed disks and motherboards when you are building a software application: let someone else do these things for you.



Here is one useful classification that I think will further clarify my argument. Consider the different "service models" for crowdsourcing, which I have adapted from the NIST definition of cloud services.
  • Labor Applications/Software as a Service (LSaaS). The capability provided to the client is to use the provider’s applications running on a cloud-labor infrastructure. [...] The client does not manage or control the underlying cloud labor, with the possible exception of limited user-specific application configuration settings. Effectively, the client only cares about the quality of the provided results of the labor and does not want to know about the underlying workflows, quality management, etc. [Companies like CastingWords and uTest fall into this category: They offer a vertical service, which is powered by the crowd, but the end client typically only cares about the result]
  • Labor Platform as a Service (LPaaS). The capability provided to the client is to deploy onto the labor pool consumer-created or acquired applications created using programming languages and tools supported by the provider. The client does not manage or control the underlying labor pool, but has control of the overall task execution, including workflows, quality control, etc. The platform provides the necessary infrastructure to support the generation and implementation of the task execution logic. [Companies like Humanoid fall into this category: Creating a platform for other people to build their crowd-powered services on top.]
  • Labor Infrastructure as a Service (LIaaS). The capability provided to the client is to provision labor for the client, who then allocates workers to tasks. The consumer of labor services does not get involved with the recruiting process or the details of payment, but has full control over everything else. Much like the Amazon Web Services approach (use EC2, S3, RDS, etc. to build your app), the service provider just provides raw labor and guarantees that the labor force satisfies a particular SLA (e.g., response time within X minutes, has the skills that are advertised in the resume, etc.) [Companies like Amazon Mechanical Turk, oDesk, etc. fall into this category]
From these definitions, I believe that it does not make sense to build your own "infrastructure" if you are going to rely on remote workers. (I have a very different attitude toward creating an in-house, local team of workers that provides the labor, but this gets very close to being a traditional temp agency, so I do not treat it as crowdsourcing.)

I have not yet formed an opinion on the "platform as a service" or "software as a service" models.

For the software as a service model, I think it is up to you to decide whether you like the output of the system (transcription, software testing, etc). The crowdsourcing part is truly secondary.

For the platform as a service model, I do not have enough experience with the existing offerings to know whether to trust their quality assurance schemes. (The usual cognitive bias of liking best what you built yourself applies here.) Perhaps in a couple of years it will make no sense to build your own quality assurance scheme. But at this point, I think that we are all still relying on bespoke, custom-made schemes, with no good argument to trust a standardized solution offered by a third party.

Monday, July 2, 2012

Visualizations of the oDesk "oConomy"

[Crossposted from the oDesk Blog. Blog post written together with John Horton.]

A favorite pastime of the oDesk Research Team is to run analyses using data from oDesk’s database in order to provide a better understanding of oDesk’s online workplace and the way the world works. Some of these analyses were so interesting we started sharing them with the general public, and posted them online for the world to see.

Deep down, however, we were not happy with this approach. All our analyses and plots were static. We wanted to share something more interactive, using one of the newer JavaScript-based visualization packages. So, we posted a job on oDesk looking for d3.js developers and found Zack Meril, a tremendously talented JavaScript developer. Zack took our ideas and built a great tool for everyone to use:


The oDesk Country Dashboard

This dashboard allows you to interactively explore the world of work based upon oDesk’s data. We list below some of our favorite discoveries from playing with its visualizations. Do let us know if you find something interesting. Note that the tool supports “deep linking,” which means that the URL in your address bar fully encodes the view that you see.

Visualization #1: Global Activity

The first interactive visualization shows the level of contractor activity of different countries across different days of the week and times of day. The pattern seems pretty “expected”:


On second thought, though, we started wondering: why do we see such regularity? The x-axis is GMT time. Given that oDesk is a global marketplace, shouldn't the contractor activity be smoother? Furthermore, oDesk has relatively few contractors from Western Europe, so it seems strange that our contractor community generally follows the waking and sleeping patterns of the UK. Investigating further, if you hover over the visualization, you get a closer look at what contractors are doing throughout the world:

At 8am GMT on Wednesday morning: Russia, India, and China are awake and their activity is increasing.


As we move towards the peak of the global activity at 3pm, the activity of the Asian countries has already started declining. However, at the same time North and Latin America start waking up, compensating for the decrease in activity in Asia, and leading to the world peak.


After 4pm GMT, Asia starts going to sleep, and the activity decreases. The activity continues to decline as America signs off, hitting the low point of activity at 4am GMT (but notice how China, Philippines, and Australia start getting active, preventing the activity level from going to zero).


Visualization #2: Country-Specific Activity

A few weeks back, we also wrote about the rather unusual working pattern of the Philippines: contractors from the Philippines tend to keep a schedule that mostly follows U.S. working hours, rather than a "normal" 9-5 day. Since then, we realized that the Philippines is not the only country following this pattern; for example, Bangladesh and Indonesia have similar activity patterns. So, we thought, why not make it easy to explore and find working patterns? They reveal something about the culture, habits, and even type of work that gets done in these countries. A few findings of interest:

Visualization #3: Work Type By Country

Finally, we wondered: what are the factors that influence these working patterns? Why do some culturally similar countries have very similar working patterns (e.g., Russia and Ukraine), while others have very different patterns (e.g., Pakistan, Bangladesh, and India)? So, with our third visualization we examine the types of work completed on oDesk, broken down by country. We used the bubble chart from d3.js to visualize the results. Here is, for example, the breakdown for the U.S.:


U.S. contractors mainly work on tasks related to writing. We see many clients explicitly limit their search for writing contractors to U.S.-based workers only, both for English proficiency and (perhaps more importantly) for the cultural affinity of the writers to their audience. Take a look at Russia: almost all the work done in Russia is web programming and design, followed by mobile and desktop development.


At the opposite end is the Philippines, where few programming tasks are being completed, but significant amounts of data entry, graphic design, and virtual assistant work happen:


Another interesting example is Kenya. As you can see, most of the work done there (and there is a significant amount of work done in Kenya) is about blog and article writing:


Exploring Further: Activity Patterns and Types of Projects

One pattern that was not directly obvious was the correlation between activity patterns and type of work. Countries that engage mainly in computer programming tend to have a larger fraction of contractors who keep using oDesk. For example, see the similarity in the activity patterns of Bolivia, Poland, Russia, and Ukraine, and the corresponding project types that get completed in these countries:






We should note, however, that the opposite does not hold: there are other countries that have similar activity patterns and a high degree of contractor stickiness (e.g., Argentina, Armenia, Bolivia, Belarus, China, Uruguay, and Venezuela) but rather different mixes of completed project types.


Source available on Github


One thing that attracted me to spend my sabbatical at oDesk was the fact that oDesk has been pretty open with its data from the beginning. To this end, you will notice that the Country Explorer is an open source project, so you are welcome to just fork us on Github and get the code for the visualizations.


New ideas and visualizations


I am thinking about what other types of graphs would be interesting to create. Supply and demand of skills? Asking prices and transaction prices of contractors across countries and across skills? Of course, if you have specific ideas you’d like to see us work on, tell us in the comments! We are happy to explore directions and data that interest you.

Friday, May 25, 2012

The Emergence of Teams in Online Work

When I started as an assistant professor back in 2004 and joined the NYU/Stern Business School, I was in a strange position. I had funding to spend, but no students to work with. I had work to be done (mainly writing crawlers) that was time-consuming, but not particularly novel or intellectually rewarding. Semi-randomly, at the same time, I heard about the website Rent-A-Coder, which was being used by undergraduate students who were "outsourcing" their programming assignments. I started using Rent-A-Coder, tentatively at first, to get programming tasks done, and then, over time, I got fascinated by the concept of online work and the ability to hire people online and get things done. (My Mechanical Turk research, and my current appointment at oDesk, are a natural evolution of these interests.)

As I started completing increasingly complicated projects using remote contractors, I started thinking about how we can best manage a diverse team of remote workers, each one in a different location, working on different tasks, etc. The topic raises many interesting questions, both in terms of theory and in terms of developing practical "best practices" guidelines.

While trying to understand better the theoretical problems that arise in the space, I was reading the paper "Online Team Formation in Social Networks" that was published in WWW2012; the paper describes a technique for identifying teams of people in a social network (i.e., graph) that have complementary skills and can form a well-functioning unit, and tries to do so while preserving workload restrictions for individual workers.

Given my personal experience, from the practical side, and the existence of research papers that deal with the topic, I got curious to understand whether the topic of online team formation is a fringe topic, or something that deserves further attention.

Do we see teams being formed online? If yes, is this a phenomenon that increases in significance?

So, I pulled the oDesk data and tried to answer the question.

How many teams have a given size? How does this distribution evolve over time? I plotted, for each week, the number of projects that had x contractors active in the project (i.e., billing some time).
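For readers curious how such a count can be assembled, here is a minimal, hypothetical pandas sketch; the tiny billing log and its column names (week, project_id, worker_id) are assumptions for illustration, not the actual oDesk schema.

import pandas as pd

# Hypothetical billing log: one row per (week, project, worker) with billed time.
# Column names and values are made up for illustration.
billings = pd.DataFrame({
    "week":       ["2012-W01", "2012-W01", "2012-W01", "2012-W02", "2012-W02"],
    "project_id": ["p1", "p1", "p2", "p1", "p2"],
    "worker_id":  ["w1", "w2", "w3", "w1", "w3"],
})

# Team size of a project in a given week = number of distinct workers who billed time
team_size = (billings
             .groupby(["week", "project_id"])["worker_id"]
             .nunique()
             .rename("team_size"))

# Number of projects in each week that had a team of a given size
teams_per_week = (team_size
                  .reset_index()
                  .groupby(["week", "team_size"])
                  .size())
print(teams_per_week)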

The results were revealing: not only do we observe teams of people being formed online, but we also see an exponential increase in the number of teams of any given size.



In fact, in the above graph, if we account for the fact that bigger teams contain an (exponentially) larger number of people, we can see that the majority of the online workers today are not working as individuals but are now part of an online team.



Update [thanks for the question, Yannis!]: Since the exponential growth of oDesk.com makes it difficult to see what fraction of people work in teams and whether that fraction is increasing or decreasing, here is the chart that shows what percentage of workers work in teams of a given size:

What is interesting is the consistent decrease in the fraction of people working alone (teams of one) and in teams of 2-3. Instead, we see a slow but consistent increase in teams of size 4-7 and 8-16, as an overall fraction of the population. As you can see, over the last year, the percentage of contractors in teams of size 4-7 is getting close to surpassing the percentage of contractors working alone. Similarly, the percentage of contractors in teams of 8-16 is getting close to surpassing the percentage of contractors in teams of 2-3. The trends for bigger teams also seem to be increasing, but there is still too much noise to be able to infer anything.

What's coming?

Given the trend for online work to be done in teams formed online, I expect to see a change in the way that many companies are formed in the future. At this point, it seems far-fetched that a startup company can be formed online, distributed across the globe, and operate on a common project. (Yes, there are such teams, but they are the exception rather than the norm.)

But if these trends continue, expect sooner rather than later to see companies naturally hiring online and working with remote collaborators, no matter where the talent is located. People have been talking about online work as an alternative to immigration, but this seemed to be a solution for the distant future.

With the exponential increase that we observe, the future may come much sooner than expected.

Thursday, May 10, 2012

TREC 2012 Crowdsourcing Track

TREC 2012 Crowdsourcing Track - Call for Participation

 June 2012 – November 2012
https://sites.google.com/site/treccrowd/

Goals

As part of the National Institute of Standards and Technology (NIST)'s annual Text REtrieval Conference (TREC), the Crowdsourcing track investigates emerging crowd-based methods for search evaluation and/or developing hybrid automation and crowd search systems.

This year, our goal is to evaluate approaches to crowdsourcing high quality relevance judgments for two different types of media:
  1. textual documents
  2. images
For each of the two tasks, participants will be expected to crowdsource relevance labels for approximately 20k topic-document pairs (i.e., 40k labels when taking part in both tasks). In the first task, the documents will be from an English news text corpus, while in the second task the documents will be images from Flickr and from a European news agency.

Participants may use any crowdsourcing methods and platforms, including home-grown systems. Submissions will be evaluated against a gold standard set of labels and against consensus labels over all participating teams.

Tentative Schedule

  • Jun 1: Document corpora, training topics (for image task) and task guidelines available
  • Jul 1: Training labels for the image task
  • Aug 1: Test data released
  • Sep 15: Submissions due
  • Oct 1: Preliminary results released
  • Oct 15: Conference notebook papers due
  • Nov 6-9: TREC 2012 conference at NIST, Gaithersburg, MD, USA
  • Nov 15: Final results released
  • Jan 15, 2013: Final papers due

Participation

To take part, please register by submitting a formal application directly to NIST (even if you are a returning participant). See the bottom part of the page at http://trec.nist.gov/pubs/call2012.html

Participants should also join our Google Group discussion list, where all track related communications will take place.

Organizers

  • Gabriella Kazai, Microsoft Research
  • Matthew Lease, University of Texas at Austin
  • Panagiotis G. Ipeirotis, New York University
  • Mark D. Smucker, University of Waterloo

Further information

For further information, please visit https://sites.google.com/site/treccrowd/

We welcome any questions you may have, either by emailing the organizers or by posting on the Google Group discussion page.


Tuesday, April 3, 2012

Philippines: The country that never sleeps (or, When is the world working? The oDesk Edition)

Why are you awake?

Over the last few months, I have used oDesk to hire a couple of virtual assistants, who help me with a variety of tasks. They are from the Philippines, and we communicate over Skype whenever I have tasks for them to do. (Hi Maria! Hi Reineer!) One of the things that I found puzzling was that they seemed to be online during working hours in New York, despite the 12-hour difference with Manila. When I asked them, they told me that most of the time they work for US-based clients, and their work is much easier when they are synchronized with a US schedule (real-time interactions with the clients, and so on). So they tend to stay awake until late at night and sleep during the morning in the Philippines.

I found that behavior strangely fascinating, so I decided to dig deeper and figure out if this is some quirkiness of my own virtual assistants, or whether this is a more systematic pattern.

The oDesk Team client: All-you-can-eat data

One characteristic that differentiates oDesk from other online labor platforms is the focus on hourly contracts, instead of project-based or piecemeal contracts. To enable truthful billing, oDesk asks service providers to use the oDesk Team client whenever they are billing time. The client records the time billed and, at the same time, takes screenshots at random intervals (shared only with the paying client) and records the level of activity on the computer. This, in turn, ensures that clients can audit what service providers were doing while they were billing hours for work.

So, I got the data recorded by the oDesk Team client that shows when a worker is active. I plotted the number of active workers at different times of the day (time is local to the location of the service provider, not global UTC time), for various days of the week. Here is the plot with numbers from the top 7 countries, ranked by number of workers:
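For the curious, here is a minimal, hypothetical pandas sketch of the local-time bucketing described above; the toy work-diary log and its column names (country, utc_time, utc_offset) are assumptions for illustration, not the actual oDesk schema.

import pandas as pd

# Hypothetical work-diary log: one row per observation of an active worker,
# with the UTC timestamp and the worker's UTC offset in hours.
# Column names and values are made up for illustration.
log = pd.DataFrame({
    "country":    ["PH", "PH", "US", "IN"],
    "utc_time":   pd.to_datetime(["2012-03-28 13:10", "2012-03-29 02:40",
                                  "2012-03-28 15:05", "2012-03-28 06:30"]),
    "utc_offset": [8, 8, -5, 5.5],  # hours to add to UTC to get local time
})

# Convert each observation to the worker's local time
log["local_time"] = log["utc_time"] + pd.to_timedelta(log["utc_offset"], unit="h")

# Count active workers per country, by local day of week and local hour
activity = (log
            .groupby(["country",
                      log["local_time"].dt.dayofweek.rename("local_dow"),
                      log["local_time"].dt.hour.rename("local_hour")])
            .size()
            .rename("active_workers"))
print(activity)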


One thing that is immediately interesting: the Philippines never sleeps!

All other countries have very natural patterns of being awake and asleep; the Philippines is the exception. Activity in the Philippines rarely drops below 5,000 active workers! Even all the other countries combined cannot, at their downtime, beat the Philippines at its low point. The supply of labor is remarkably constant over time.

There are a couple of natural break points (see the small dip around lunch time and another one around dinner time), but even during the (Philippine) night the work keeps going. In fact, you can clearly see that the peak of activity is at around 9pm-10pm in the Philippines, which is when the East Coast of the US starts working as well. The low point for the Philippines is at around 4am-5am their time, which is 4pm-5pm on the East Coast.

Update: A couple of fascinating comments from the Hacker News thread for this post:

I have cousins that work at help desks in the Philippines, and their work schedules are designed to match US time zones. After work, they hang out at bars with happy hours designed for them - I believe around ten in the morning. They hang out, then go home to sleep for the rest of the day. Globalisation at work.

I'm a Filipino Developer. This is actually an alternative for us developers in the Philippines, instead of going abroad working overseas which will be very far from our families. We got a lot of opportunities from foreigners who want to outsource their development projects. This earns us quite substantial income Although it's not as high as when your really working abroad, being with your family and seeing your children grow up mostly makes up for it. Staying up late is not that hard as me myself is most productive at night when kids are asleep. I know most programmers share this work time.

The Data

For those that want to play more with the data, here is a link to a Google Spreadsheet. If you want more details or a slightly different view of the data, I would be happy to dig more in the oDesk database.


What is the application? Real-time human computation

So, why do we care that the Philippines is awake all the time? The immediate benefit is that a team in the Philippines can ensure the availability of labor for handling real-time tasks. If you have a human-powered application, you do not want any dead periods when the application slows down or becomes completely unresponsive. By hiring people from the Philippines, it is possible to have a "private crowd" available around the clock, simply by asking the contractors to "show up" at different points during the day/week.

What is the difference from other services? If you hire a big outsourcing company, the expectation is that they will work during (their) normal business hours, leaving the service down for many hours. On Mechanical Turk, this drop in performance comes naturally: if you restrict your tasks to US workers only, the speed drops when the US goes to sleep. If you run the task in India, the same thing happens. (Mixing the two crowds tends to result in many complications, as the expectations for price are very different and Indian workers tend to overwhelm tasks that are priced for US workers.)

Overall, the Philippines seems to have a nice balance of availability throughout the day and generally low prices. In terms of quality, things tend to be somewhere between the US and India, so careful screening and quality control are important. But for many people experienced with managing crowds, the Philippines is a great source of "crowds."

Myself, I have already put my money where my mouth is, across multiple crowd applications that I have built.

Thursday, March 22, 2012

ACM EC 2012 Workshops

Thursday, June 7th, 2012:
Friday, June 8th, 2012:

The (Unofficial) NIST Definition of Crowdsourcing

A few weeks ago, I was attending the NSF Workshop on Social Networks and Mobility in the Cloud. There, I ran into the NIST definition of cloud computing.

After reading it, I felt that it would be a nice exercise to transform the definition into something similar for the dual area of "cloud labor" (aka crowdsourcing). I found it to be a useful exercise. While the NIST definition focuses on highlighting features that are commonly available in computing services, these features do have corresponding interpretations within the framework of "cloud labor". At the same time, we can also see that there are significant differences, as there are fundamental differences between humans and computers.

Anyway, here is my attempt to take the NIST definition and translate it into a similar definition for crowdsourcing. Intentionally, I am plagiarizing the NIST definition, introducing changes only where necessary.

In the definition, I am trying to use the term "worker" for the person doing the job, the term "client" for the person that is paying for the labor, and "service provider" for the platforms that connect clients and workers.

The (Unofficial) NIST Definition of Cloud Labor / Crowdsourcing

Cloud labor is a model for enabling convenient, on-demand network access to a (shared) pool of human workers with different skills (e.g., transcribers, translators, developers, virtual assistants, graphic designers, etc) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model promotes availability and is composed of five essential characteristics, three service models, and four deployment models.

Essential Characteristics
  • On-demand self-service. A client can unilaterally provision labor capabilities (e.g., virtual assistants, content moderators, developers, and so on) as needed, automatically, without requiring human interaction with the service provider.
  • Broad access. Capabilities are available and accessed through standard mechanisms that promote use by heterogeneous thin or thick client platforms (e.g., from PhD students hiring for a small survey, to companies such as uTest and TopCoder that engage their workers deeply).
  • Resource pooling. The labor resources are pooled by the service provider to serve multiple clients using a multi-tenant model, with different workers dynamically assigned and reassigned according to employer demand. There is a sense of location and time independence in that the client generally has no control or knowledge over the exact location of the provided labor but may be able to specify location and other desirable qualifications at a higher level of abstraction (e.g., country, language knowledge, or skill proficiency).
  • Rapid elasticity. Labor can be rapidly and elastically provisioned, in some cases automatically, to quickly scale out and rapidly released to quickly scale in. To the client, the labor capabilities available for provisioning often appear to be unlimited and can be purchased in any quantity at any time.
  • Measured service. Labor cloud provision systems automatically control and optimize resource use by leveraging a metering capability at some level of abstraction appropriate to the type of service (e.g., content generation, translation, software development, etc.). Resource usage can be monitored, controlled, and reported, providing transparency for the service provider, the client, and the worker, so that there is a better understanding of the quality of the provisioned labor services.
Service Models
  • Labor Applications/Software as a Service (LSaaS). The capability provided to the client is to use the provider’s applications running on a cloud-labor infrastructure. The applications are accessible from various client devices through a thin client interface such as a web browser (e.g., web application for ordering content generation, or proofreading, or transcription, or software testing, or ...). The client does not manage or control the underlying cloud labor, with the possible exception of limited user-specific application configuration settings. Effectively, the client only cares about the quality of the provided results of the labor and does not want to know about the underlying workflows, quality management, etc. [Companies like CastingWords and uTest fall into this category]
  • Labor Platform as a Service (LPaaS).  The capability provided to the client is to deploy onto the labor pool consumer-created or acquired applications created using programming languages and tools supported by the provider. The client does not manage or control the underlying labor pool, but has control of the overall task execution, including workflows, quality control, etc. The platform provides the necessary infrastructure to support the generation and implementation of the task execution logic.
    [Companies like Humanoid fall into this category]
  • Labor Infrastructure as a Service (LIaaS). The capability provided to the client is to provision labor for the client, who then allocates workers to tasks. The consumer of labor services does not get involved with the recruiting process or the details of payment, but has full control over everything else. Much like the Amazon Web Services approach (use EC2, S3, RDS, etc. to build your app), the service provider just provides raw labor and guarantees that the labor force satisfies a particular SLA (e.g., response time within X minutes, has the skills that are advertised in the resume, etc.)
    [Companies like Amazon Mechanical Turk fall into this category] 
Deployment Models
  • Private labor pool. The labor pool is operated solely for an organization. It may be managed by the organization or a third party and may exist on premise or off premise.
  • Community labor pool. The labor pool is shared by several organizations and supports a specific community that has shared concerns (e.g., enthusiasts of an application such as birdwatchers, or volunteers for a particular cause such as disaster management). It may be managed by the organizations or a third party and may exist on premise or off premise.
  • Public labor pool. The labor pool is made available to the general public or a large industry group and is provisioned by an organization (or coalition of organizations) selling labor services.
  • Hybrid labor pool. The labor pool is a composition of two or more pools (private, community, or public) that remain unique entities but are bound together by standardized or proprietary technology that enables data and application portability (e.g., handling activity bursts by fetching public labor to support the private labor pool of a company).
Differences between a Computing and Labor Cloud

The NIST definition highlights some of the key aspects of a "cloud labor" service. However, by omission, it also illustrates some key differences that we need to take into consideration when thinking about "cloud labor" services.
  • Need for training and lack of instantaneous duplication. In the computing cloud we can pre-configure computing units with a specific software installation (e.g., a LAMP stack) and then replicate them as necessary to meet the needs of the application. With human workers, the equivalent of software installation is training. The key difference is that training takes time and we cannot “store the image and replicate as needed.” So, for cases where a client wants workers to have task-specific training, we will observe a latency in starting task completion equal to the time necessary for a worker to learn the requirements specific to the given task. When training is specific to the client, this latency can be significant. When training is transferable across clients, things are expected to be better, assuming a well-functioning, well-designed market.
  • Allocation over space. In the computing cloud we can request allocation of services in different geographic locations, but this is a desirable feature rather than a key one. With human labor, though, especially when the work contains an offline component, we may need to explicitly request specific geographic regions.
  • Allocation over time. With computing services, time is of little importance, beyond the normal load fluctuations over time of day and day of the week. Furthermore, we can easily operate a computing device 24/7. With human labor, this is not possible. Not only do we have to face the fact that humans get tired, but humans are also typically available for work during the “working hours” of their time zone. Since we cannot take a person and replicate them across time zones, this becomes a crucial difference when we expect real-time, on-demand labor services around the clock.
How Mature are Today's Online Labor Markets?

If we examine the existing “labor cloud,” we will see that, of the characteristics that define the computing cloud (on-demand self-service, broad access through APIs, resource pooling, rapid elasticity, and measured service), only a subset are available through today's labor platforms.

Take the case of Amazon Mechanical Turk:
  • On-demand self-service: Yes.
  • Broad access through APIs: Yes
  • Resource pooling: Yes and no. While there is a pool of workers available, no assignment is done by the service provider. This implies that there may be nobody willing to work on the posted task, and this cannot be inferred before testing the system. It is really up to the workers to decide whether they will serve a particular labor request.
  • Rapid elasticity: Yes and no. The scaling-out capability is rather limited (scaling in is trivially easy). As in the case of resource pooling, it is up to the workers to decide whether to work on a task.
  • Measured service: No. Quality and productivity measurement is done on the employer side.
2 yes, 1 no, and 2 "yes and no". Glass half-full? Glass half-empty? I will go for the half-full interpretation for now but we can see that we still have a long way to go.

Wednesday, March 14, 2012

When do reviewers submit their reviews? (ACM EC 2012 version)

A few weeks back, just after the deadline for the submission of papers for ACM EC'12, I wrote a brief blog post, showing how more than 60% of the submissions came within the last 24 hours before the deadline.

Now we are in the process of reviewing the papers, and the deadline for reviewers to submit their reviews was March 5th, a few days before sending the reviews back to the authors for feedback. Here is the plot of the submission activity. On the x-axis we have time, and on the y-axis the percentage of reviews received by that time. The yellow line marks the official deadline.


The similarities with the submission dynamics for papers are striking. One day before the deadline, we had received only 40% of the reviews. Within the next 24 hours, we jumped from 40% to 85%, receiving approximately 300 reviews during that period. If we go to 36 hours before the deadline, we can see a jump from 20% to 85%.

The key difference with the paper submissions plot is that reviewers can submit late, much to the chagrin of the PC Chairs and the Senior PC members that are trying to get the discussion going. You can see clearly that, after the deadline, we needed one additional day to go from 85% to 90%, and then another extra day to reach the 98% completion rate.

On a positive note, despite the fondness of both authors and reviewers for submitting material very close to the deadline, the overall quality of the submissions for EC'12 seems to be pretty high. (Self-selection at work, I guess.) Kevin and I are doing our best to see how we can accommodate as many papers as possible.