Monday, March 26, 2012

Mechanical Turk: More SETI@Home and less Amazon Web Services

A few days back, I wrote about the requirements that labor markets need to satisfy in order to claim that they offer scalable "cloud labor" services. As a reminder, the characteristics that define cloud services are:
  • on-demand self-service
  • broad access through APIs
  • resource pooling
  • rapid elasticity
  • measured service
I used Amazon Mechanical Turk for a first test of these conditions, and the results were:
  • On-demand self-service: Yes. We can access the labor pool whenever it is needed.
  • Broad access through APIs: Yes. Computers can handle the overall process of hiring, task handling, etc.
  • Resource pooling: Yes and No. While there is a pool of workers available, there is no assignment done from the service provider. This implies that there may be nobody willing to work on the posted task and this cannot be inferred before testing the system. It is really up to the workers to decide whether they will serve a particular labor request.
  • Rapid elasticity: Yes and No. The scaling out capability (rapidly increasing the labor pool) is rather limited. We simply cannot suddenly hire hundreds of workers to work in parallel on a task, for a sustained period of time (workers that do 1-2 tasks and then leave cannot be counted for the purpose of elasticity). As in the case of resource pooling, it is up to the workers to decide whether to work on a task, and it is highly unclear what level of pricing could achieve what level of elasticity.
  • Measured Service: No. Quality and productivity measurement is done by the employer side, and there is no SLA with the client that is paying for the provided services, which could guarantee a minimum level of performance.

So, why does MTurk fail these tests?

The root cause of failure is the voluntary, market-based mechanism for allocating labor to tasks. (Yes, markets are not necessarily efficient, especially when they are not designed properly.)

The fact that MTurk cannot "forcibly" assign a task to a worker makes it almost impossible to ever satisfy the requirements for these conditions. If someone wants to solicit a large number of workers (rapid elasticity), it is not clear that the market will have enough participants to satisfy the needs. Even if there are, we do not know the wage that the available workers will require. If, however, there was a guaranteed pool available, with known prices, then MTurk could say what the limits of elasticity are, and how much it would cost. Similarly for pooling.

In a sense, today's Mechanical Turk is more similar to the SETI@Home in 1999, rather than to EC2 and S3 from Amazon in 2009. Here are the similarities:
  • Distributed, voluntarily participating infrastructure
    • With Amazon Web Services (AWS) such as EC2, S3, etc. there is a single provider of hardware infrastructure, who plans for availability, does capacity planning by upgrading the infrastructure when needed, etc.
    • In SETI@Home, the computation was coming from volunteers that were joining the network at their own will, and could potentially donate time to other projects beyond SETI (e.g., protein folding and others). There was no single provider of hardware capabilities, as in the Amazon case, but rather a distributed, completely heterogeneous infrastructure.
    • On Mechanical Turk (and crowdsourcing in general), every person comes and leaves at will. There is no single agency that hires all the workers and plans for availability, does capacity planning, etc.
  • Diversity of underlying infrastructure
    • With EC2 and S3, we have an SLA guarantee for the services we are buying. If we buy 3 m1.medium machines, Amazon provides the memory, cpu speed, and other characteristics of these machines.
    • In SETI@Home, the computation was split into multiple pieces and distributed to a large number of computers, each with different capabilities. Through testing SETI was building profiles of the different machines to potentially allocate data units more efficiently.
    • On Mechanical Turk, we observe the same setting today but with human tasks. We have no idea what the skills of the underlying "human units" are, unless we probe and test beforehand.
  • No guarantee of "uptime" (task completion)
    • With EC2 and S3, we have a reasonable guarantee of uptime: When a service receives a request, we expect that the answer will come back, with probability following the SLA guarantees (which is very high). Very rarely do we need to plan for cases where the system is unavailable; such planning is not seen as a common everyday need.
    • In SETI@Home, there was no guarantee that a data unit was ever going to be returned by the client. The client may decide to uninstall the application, switch off the computer, or do any action that could interrupt the computation process. SETI was keeping track of the reliability of the machines and how often they returned their data units back, within a reasonable amount of time.
    • On Mechanical Turk, we also need to handle the fact that a task may not be completed after the assignment, may be returned and need to be reposted, etc. MTurk keeps track of such failures and keeps statistics about the tasks that were returned and abandoned by each worker.
  • Malicious clients
    • With EC2 and S3, we have almost a guarantee that the CPU will not misrepresent its capabilities and will always return correct results. Similarly for storage we have a 99.99999% guarantee that the data will not be lost. We may maintain multiple servers for a service, mainly as an attempt to increase reliability and have load balancing, but we start with the understanding that even the first machine will operate on a “best effort” basis and will not behave maliciously.
    • In SETI@Home, there were many attempts from people to game the system and return back non-properly processed data, just to increase their statistics and place in the standings. To avoid malicious clients, SETI was performing the computation multiple times, effectively wasting the available computing capacity for reliability purposes.
    • We observe the same thing with Mechanical Turk. Instead of trusting each individual to do an honest effort, we need to resort to redundancy, gold tests, and so on, effectively wasting capacity. The introduction of "trusted" workers (Mechanical Turk masters) reduces the problem but the fundamental problem is still there.
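
To make the cost of redundancy concrete, here is a minimal sketch of the two techniques mentioned above: majority voting over repeated answers, and scoring a worker against gold tests. The function names, the tie-handling rule, and the sample data are my own assumptions for illustration; this is not how Mechanical Turk or SETI@Home implement it.

```python
from collections import Counter


def majority_label(answers):
    """Return (label, agreement) for the most common answer.

    On a tie for first place the label is None, so the task can be reposted.
    """
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers).most_common()
    top_label, top_count = counts[0]
    agreement = top_count / len(answers)
    if len(counts) > 1 and counts[1][1] == top_count:
        return None, agreement
    return top_label, agreement


def gold_accuracy(worker_answers, gold_answers):
    """Share of gold (known-answer) tasks that a worker answered correctly."""
    graded = [task for task in gold_answers if task in worker_answers]
    if not graded:
        return None  # the worker has not seen any gold task yet
    correct = sum(
        1 for task in graded if worker_answers[task] == gold_answers[task]
    )
    return correct / len(graded)


# Hypothetical data: three workers label the same item, one worker is
# checked against two gold questions.
print(majority_label(["cat", "cat", "dog"]))
print(gold_accuracy({"g1": "yes", "g2": "no"}, {"g1": "yes", "g2": "yes"}))
```

Note that asking three workers for one usable answer means paying three times per task, which is exactly the wasted capacity described above.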

So, what is the future? 

The naive solution is to have a "traditional" outsourcing service, sending tasks to a classic BPO company such as Tata Consulting, and rely on their reliability and availability guarantees. (Interestingly enough, many of these BPOs use crowdsourcing-like approaches to manage internally their tens of thousands of employees that handle basic tasks.) While I see the appeal, I do not find the solution satisfactory.

Personally, I see a supply-side market emerging, in which workers can advertise what they offer and clients can place requests against these services. (Fiverr is currently offering such a "supply-side" service, which mirrors the "demand-side" service offered by Mechanical Turk.) The service that will successfully merge the two sides and connect supply and demand efficiently will be the winner...

Thursday, March 22, 2012

ACM EC 2012 Workshops

Thursday, June 7th, 2012:
Friday, June 8th, 2012:

The (Unofficial) NIST Definition of Crowdsourcing

A few weeks ago, I was attending the NSF Workshop on Social Networks and Mobility in the Cloud. There, I ran into the NIST definition of cloud computing.

After reading it, I felt that it would be a nice exercise to transform the definition into something similar for the dual area of "cloud labor" (aka crowdsourcing). I found it to be a useful exercise. While the NIST definition is focused and highlights features that are commonly available in computing services, these features do have corresponding interpretations within the framework of "cloud labor". At the same time, we can also see that there are significant differences, as there are fundamental differences between humans and computers.

Anyway, here is my attempt to take the NIST definition and translate it into a similar definition for crowdsourcing. Intentionally, I am plagiarizing the NIST definition, introducing changes only where necessary.

In the definition, I am trying to use the term "worker" for the person doing the job, the term "client" for the person that is paying for the labor, and "service provider" for the platforms that connect clients and workers.

The (Unofficial) NIST Definition of Cloud Labor / Crowdsourcing

Cloud labor is a model for enabling convenient, on-demand network access to a (shared) pool of human workers with different skills (e.g., transcribers, translators, developers, virtual assistants, graphic designers, etc) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model promotes availability and is composed of five essential characteristics, three service models, and four deployment models.

Essential Characteristics
  • On-demand self-service. A client can unilaterally provision labor capabilities (e.g., virtual assistants, content moderators, developers, and so on) as needed, automatically, without requiring human interaction with the service provider.
  • Broad access. Capabilities are available and accessed through standard mechanisms that promote use by heterogeneous thin or thick client platforms (e.g., from PhD students hiring for a small survey, to companies such as uTest and TopCoder that engage deeply their workers)
  • Resource pooling. The labor resources are pooled by the service provider to serve multiple clients using a multi-tenant model, with different workers dynamically assigned and reassigned according to employer demand. There is a sense of location and time independence in that the client generally has no control or knowledge over the exact location of the provided labor but may be able to specify location and other desirable qualifications at a higher level of abstraction (e.g., country, language knowledge, or skill proficiency).
  • Rapid elasticity. Labor can be rapidly and elastically provisioned, in some cases automatically, to quickly scale out and rapidly released to quickly scale in. To the client, the labor capabilities available for provisioning often appear to be unlimited and can be purchased in any quantity at any time.
  • Measured service. Labor cloud provision systems automatically control and optimize resource use by leveraging a metering capability at some level of abstraction appropriate to the type of service (e.g., content generation, translation, software development, etc). Resource usage can be monitored, controlled, and reported providing transparency for both the service provider, the client and the worker, so that there is a better understanding of the quality of the provisioned labor services.
Service Models
  • Labor Applications/Software as a Service (LSaaS). The capability provided to the client is to use the provider’s applications running on a cloud-labor infrastructure. The applications are accessible from various client devices through a thin client interface such as a web browser (e.g., web application for ordering content generation, or proofreading, or transcription, or software testing, or ...). The client does not manage or control the underlying cloud labor, with the possible exception of limited user-specific application configuration settings. Effectively, the client only cares about the quality of the provided results of the labor and does not want to know about the underlying workflows, quality management, etc. [Companies like CastingWords and uTest fall into this category]
  • Labor Platform as a Service (LPaaS).  The capability provided to the client is to deploy onto the labor pool consumer-created or acquired applications created using programming languages and tools supported by the provider. The client does not manage or control the underlying labor pool, but has control of the overall task execution, including workflows, quality control, etc. The platform provides the necessary infrastructure to support the generation and implementation of the task execution logic.
    [Companies like Humanoid fall into this category]
  • Labor Infrastructure as a Service (LIaaS). The capability provided to the client is to provision labor for the client, who then allocates workers to tasks. The consumer of labor services does not get involved with the recruiting process or the details of payment, but has full control over everything else. Much like the Amazon Web Services approach (use EC2, S3, RDS, etc. to build your app), the service provider just provides raw labor and guarantees that the labor force satisfies a particular SLA (e.g., response time within X minutes, has the skills that are advertised in the resume, etc.)
    [Companies like Amazon Mechanical Turk fall into this category] 
Deployment Models
  • Private labor pool. The labor pool is operated solely for an organization. It may be managed by the organization or a third party and may exist on premise or off premise.
  • Community labor pool. The labor pool is shared by several organizations and supports a specific community that has shared concerns (e.g., enthusiasts of an application such as birdwatchers, or volunteers for a particular cause such as disaster management). It may be managed by the organizations or a third party and may exist on premise or off premise.
  • Public labor pool. The labor pool is made available to the general public or a large industry group and is provisioned by an organization (or coalition of organizations) selling labor services.
  • Hybrid labor pool. The labor pool is a composition of two or more pools (private, community, or public) that remain unique entities but are bound together by standardized or proprietary technology that enables data and application portability (e.g., handling activity bursts by fetching public labor to support the private labor pool of a company).
Differences between a Computing and Labor Cloud

The NIST definition highlights some of the key aspects of a "cloud labor" service. However, by omission, it also illustrates some key differences that we need to take into consideration when thinking about "cloud labor" services.
  • Need for training and lack of instantaneous duplication. In the computing cloud we can pre-configure computing units with a specific software installation (e.g. with a LAMP stack) and then replicate as necessary to meet the needs of the application. With human workers, the equivalent of the software installation part is training. The key difference is that training takes time and we cannot “store the image and replicate as needed.” So, for cases where a client wants the workers to have task-specific training, we will observe a latency in starting the task completion equal to the time necessary for training the worker to learn the requirements specific to the given task. When training is specific to the client, this latency can be significant. When training is transferable across clients, things are expected to be better, assuming a well-functioning and well-designed market.
  • Allocation over space. In computing cloud we can request allocation of services in different geographical locations, but this is a desirable and not a key feature. With human labor though, especially when it contains an offline component, we may need to explicitly request specific geographic regions.
  • Allocation over time. With computing services, time is of little importance, excluding the normal load fluctuations over time of day and days of the week. Furthermore, we can easily operate a computing device 24/7. With human labor, this is not possible. Not only do we have to face the fact that humans get tired, but humans are also typically available for work only during the “working hours” of their timezone. Since we cannot take a person and replicate across time zones, this becomes a crucial difference when we expect real-time on-demand labor services around the clock.
How Mature are Today's Online Labor Markets?

If we examine the existing “labor cloud”, we will see that, of the characteristics that define the computing cloud (on-demand self-service, broad access through APIs, resource pooling, rapid elasticity, and measured service), only a subset is available through today's labor platforms.

Take the case of Amazon Mechanical Turk:
  • On-demand self-service: Yes.
  • Broad access through APIs: Yes
  • Resource pooling: Yes and No. While there is a pool of workers available, there is no assignment done from the service provider. This implies that there may be nobody willing to work on the posted task and this cannot be inferred before testing the system. It is really up to the workers to decide whether they will serve a particular labor request.
  • Rapid elasticity: Yes and No. The scaling out capability is rather limited (scaling in is trivially easy). As in the case of resource pooling, it is up to the workers to decide whether to work on a task.
  • Measured Service: No. Quality and productivity measurement is done by the employer side.
2 yes, 1 no, and 2 "yes and no". Glass half-full? Glass half-empty? I will go for the half-full interpretation for now but we can see that we still have a long way to go.

Wednesday, March 14, 2012

When do reviewers submit their reviews? (ACM EC 2012 version)

A few weeks back, just after the deadline for the submission of papers for ACM EC'12, I wrote a brief blog post, showing how more than 60% of the submissions came within the last 24 hours before the deadline.

Now we are in the process of reviewing the papers, and the deadline for reviewers to submit their reviews was on March 5th, a few days before sending the reviews back to the authors for feedback. Here is the plot of the submission activity. On the x-axis we have the time, and on the y-axis the percent of reviews received by that time. With a yellow line, I marked the official deadline.

The similarities with the submission dynamics for papers are striking. One day before the deadline, we had received only 40% of the reviews. Within the next 24 hours, we jumped from 40% to 85%, receiving approximately 300 reviews during that period. If we go to 36 hours before the deadline, we can see a jump from 20% to 85%.

The key difference with the paper submissions plot is that reviewers can submit late, much to the chagrin of the PC Chairs and the Senior PC members that are trying to get the discussion going. You can see clearly that, after the deadline, we needed one additional day to go from 85% to 90%, and then another extra day to reach the 98% completion rate.

On a positive note, despite the love of both authors and reviewers for submitting material very close to the deadline, the overall quality of the submissions for EC'12 seems to be pretty high. (Self-selection at work, I guess.) With Kevin, we are doing our best to see how we can accommodate as many papers as possible.

50% of the online ads are never seen

Almost a year back, I was involved in an advertising fraud case, as part of my involvement with AdSafe Media. (See the related Wall Street Journal story.) Long story short, it was a sophisticated scheme for generating user traffic to websites that were displaying ads to real users, but these users could never see these ads, as the ads were never visible to the user. While we were able to uncover the scheme, what triggered our investigation was almost an accident: our adult-content classifier seemed to detect porn in websites that had absolutely nothing suspicious. While it was a great investigative success, we could not overlook the fact that this was not a systematic method for discovering such attempts at fraud. As part of the effort to make detection more systematic, the following idea came up:

Let's monitor the duration for which a user can actually see an ad.

After a few months of development to get this feature to work, it became possible to measure the exact amount of time an ad was visible to a user. While this feature could now easily detect any fraud attempt that delivers ads to users that never see them, this was now almost secondary. It was the first time that we could monitor the amount of time that users get exposed to ads.

50% of the Ads are (almost) Never Seen.

By measuring the statistics of more than 1.5 billion ad impressions per day, it was possible to understand deeply how different websites perform. Some of the high level results:
  • 38% of the ads are never in view to a user
  • 50% of the ads are in view for less than 0.5 seconds
  • 56% of the ads are in view for less than 5 seconds
Personally, I found these numbers impressive. 50% of the delivered ads are never seen for more than 0.5 seconds! I wanted to check myself whether 0.5 seconds is sufficient to understand the ad. Apparently, the guys at AdSafe thought about that as well, so here is their experiment:

You know the old saying, "half of my marketing budget is completely wasted, I just do not know which half"? Well, apparently this intuition was correct :-) The cool thing now is that you can find out which half of the budget is wasted :-)

Give me More Data!

OK, the high level results were good, but honestly, I was not satisfied. The 50%-of-the-ads-are-never-seen is a good one-liner but I was craving more data. Were these results reliable? Or some convenient accident? So, I talked with Arun Ahuja, who gave me access to much more detailed data, sending my way the measurements for the top-1000 websites that run ads, ranked by number of visitors. (Fun fact of the day: Arun is working for AdSafe after replying to a tweet of mine. Who said that Twitter is not a recruiting mechanism?)

The first thing that I wanted to check is whether the timing measurements are reliable. For that, I got the visitorship and time-on-page data from Comscore, and compared the ranked lists by AdSafe and Comscore. The two lists had more than 75% overlap, which was pretty significant, given that the Comscore list also contained sites that do not display ads (e.g., Wikipedia). I also ranked the sites by number of visitors and by time spent on page, and compared the rankings of AdSafe and Comscore. The resulting Spearman rank correlation coefficient was 0.72, which was strong enough to convince me that the measurements were solid.
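
For readers who want to run the same kind of check, here is a minimal sketch of a Spearman rank correlation: rank both lists (averaging ranks for ties) and take the Pearson correlation of the ranks. The function names and the sample numbers are made up for illustration; they are not the AdSafe or Comscore data.

```python
import math


def average_ranks(values):
    """Rank values from 1..n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical time-on-page values (seconds) for the same five sites,
# as measured by two different sources.
source_a = [12.0, 30.0, 45.0, 60.0, 600.0]
source_b = [20.0, 15.0, 70.0, 50.0, 480.0]
print(spearman(source_a, source_b))
```

Because only the ranks matter, the two sources can disagree on the absolute times and still show a high correlation, as long as they order the sites similarly.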

The first thing that I wanted to see was the distribution of time that people spend on a web page. The times within a website followed a log-normal distribution, so the best way to summarize these values was by using the geometric mean of the samples, which is equal to $\left(\prod_{i=1}^{n} t_i\right)^{1/n} = \exp\left(\frac{1}{n}\sum_{i=1}^{n} \ln t_i\right)$; for the log-normal distribution, the geometric mean is equal to the median of the distribution, which is a pretty robust statistic. OK, done with the geeky stat details.
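
As a small illustration of the formula, here is a sketch that computes the geometric mean through the mean of the logs, which avoids overflowing the product when there are many samples. The function name and the sample times are my own, made up for illustration.

```python
import math


def geometric_mean(times):
    """Geometric mean computed as exp(mean(ln t)); needs positive samples."""
    if not times:
        raise ValueError("need at least one sample")
    if any(t <= 0 for t in times):
        raise ValueError("all samples must be strictly positive")
    return math.exp(sum(math.log(t) for t in times) / len(times))


# Hypothetical time-on-page samples in seconds, with one very long visit.
samples = [2.0, 8.0, 4.0, 64.0, 1.0]
print(geometric_mean(samples))       # pulled up far less by the 64s visit
print(sum(samples) / len(samples))   # than the arithmetic mean is
```

The requirement for strictly positive samples matters in practice: a time of exactly zero has no logarithm, so such sessions have to be handled separately.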

The next thing was to plot the median time on page across different sites. Not surprisingly, the distribution is also a heavy-tailed one. While most people stay on a particular web page for just a few seconds on average (cough, median), there are a few sites on which people spend significantly more time. Here is the distribution:

What is the site with the highest median time on page? No, it is not Facebook. (You see, on Facebook people do move from one page to another...) The puzzles page of USA Today and Pandora are two of the top sites in terms of time on page, with median times around 10 minutes each.

Percent of Users Exposed to Ads, for Various Periods of Time

Unlike "time on page", the median ad visibility per site is not a very informative metric, given that the median time is close to zero for many sites. Instead, it is better to set different thresholds for ad visibility, and see what percentage of user sessions reach that level of ad visibility.
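
Here is a minimal sketch of that metric for a single site: for each threshold, compute the share of user sessions in which the ad was visible for longer than the threshold. The session values are made up for illustration, and the function name is my own.

```python
import statistics


def share_visible_longer_than(visibility_secs, threshold):
    """Fraction of sessions whose ad-visible time strictly exceeds threshold."""
    if not visibility_secs:
        raise ValueError("need at least one session")
    hits = sum(1 for v in visibility_secs if v > threshold)
    return hits / len(visibility_secs)


# The thresholds used in the plots (in seconds).
THRESHOLDS = [0, 2.5, 5.0, 7.5, 10]

# Hypothetical per-session ad visibility (seconds) for one site.
sessions = [0, 0, 0, 0, 0, 0.3, 3.0, 6.0, 12.0, 25.0]

profile = {t: share_visible_longer_than(sessions, t) for t in THRESHOLDS}
print(statistics.median(sessions))  # close to zero, so not very informative
print(profile)                      # one number per threshold
```

In this made-up example the median says almost nothing about the site, while the per-threshold shares still separate sessions that saw the ad briefly from those that saw it for a long time.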

You can see below the distributions for $t>0 secs$, $t>2.5 secs$, $t>5.0 secs$, $t>7.5 secs$, and $t>10 secs$.

How to interpret these plots?

For example, in the $t>0$ plot, we see that ~12% of the sites in the dataset were displaying the ad to 90%-100% of the visiting users. However, based on the $t>2.5$ plot, we can see that only 5% of the sites manage to show the ad for more than 2.5 seconds to 90%-100% of the visiting users, and these numbers plummet further for higher thresholds.

On the other side of the distribution, we can see that ~5% of the sites do not manage to make their ads visible to their users for more than 2.5 seconds for 90%+ of their visitors, and this number grows to 10% of the sites if we ask for the visibility to be higher than 10 seconds.

If you want to have the overall picture, here is a summary plot that puts together the histograms above:

Again, just a few data points to get you to interpret this plot quickly:
  • In 15% of the sites, the ad is not visible for 40% or more of the user sessions (see $t>0$ line)
  • In 60% of the sites, the ad is not visible for 70% or more of the user sessions (see $t>0$ line)
  • In 60% of the sites, the ad is not visible for more than 10 seconds for 40% of the user sessions (see $t>10$ line)
  • In 75% of the sites, the ad is not visible for more than 10 seconds for 50% of the user sessions (see $t>10$ line)

Correlation of time on page and ad visibility

And now let's move to the juicy stats. What is the correlation between the time on the page and the time that people actually see the ads on the page? Interestingly enough, the two numbers are not correlated:

What is wrong here? Well the main problem lies in the fact that many ads are never visible to the user (38% of them to be exact), or are visible for only brief periods of time (50% are seen for less than 0.5 seconds). From the above, we can see that the metric "percentage of user sessions with ad visibility greater than X seconds" is more descriptive than just the median.

In fact, if we compute the correlation of the visibility metrics with time on page, we get a clearer picture:

Correlation between time on page and percent of user sessions exposed to the ad for various periods of time (0 secs, 2.5 secs, 5.0 secs, 7.5 secs, 10 secs)

As you can see, the metric that correlates best with time on page is the metric that examines what percentage of user sessions are exposed to an ad for more than 10 seconds. Indeed, we can see that there is a clearer trend, but still the variance is extremely high.

"Above the Fold" vs. "Below the Fold"?

Another common way to evaluate the visibility of an ad is to examine whether it is "above the fold" (i.e., near the top of the page and visible when the page loads), or "below the fold". This is a concept that is borrowed from the printed press and is a decent heuristic; unfortunately, it is not always accurate in the digital world. The site "Life below 600px" does a good job in explaining this. (Please visit the site, it is worth checking out :-)

To examine the effect of the "above the fold" visibility, we also measured the probability that an ad is visible when the site loads. (We decided not to use a hard metric such as "600px and below" as display sizes come in all sorts of variants).

Here is the median ad visibility, as a function of the probability of seeing an ad on load:

Here is the probability of seeing an ad for more than 10 seconds, as a function of the probability of seeing an ad on load:

There is definitely a positive correlation. But there is still a significant amount of remaining noise. As you can see, there are cases where the ad is visible on load ("above the fold") but people do not see the ad for long periods of time, and there are cases where the ad is not visible on load ("below the fold") but people still end up seeing it for long periods of time.

Example Sites

Given all the metrics and combinations, it would be good to examine a few sample sites to understand better what layouts and content generate the different combinations of time on site, ad visibility, etc.
  • High time on page, high ad visibility, above the fold: Check the ZeroHedge site. This combination is the "expected" combination. Ads are visible when the page loads, users stay at the site for long (3-4 minutes median time on page), and they get exposed to the ads for long periods of time, with high probability (The probability of ad visibility above 10 seconds is greater than 70%.)
  • High time on page, low ad visibility, above the fold: Check the "That Guy with the Glasses" site. (It is better to see a representative internal page). In this site, there is a banner ad on top, but the actual content of the site is the video. So users quickly scroll down to the video and never see the top banner ad.
  • High time on page, low ad visibility, below the fold: Consider the page with puzzles at USA Today. This is a page where users spend a significant amount of time. However, they rarely see the ad, as it is rendered below the game, and users simply do not scroll down there. (Median time on page 12 minutes, with median ad visibility being 0, and probability of seeing the ad for any period of time below 10%)
  • Low time on page, high ad visibility, below the fold: Check the site In this site, the main banner ad is rarely above the fold. However, the users seem to habitually scroll down to the options in the lower part of the page, so they get exposed to the ad for significant amounts of time. (The probability of ad visibility above 10 seconds is greater than 40%, while the median time on page is just 20 seconds.)

The Future of Ad Pricing

I would be very surprised if the pricing model for ads does not change to account for the visibility statistics. For display ads that get paid per impression, it is a no-brainer. If the user never sees the ad, there is no real impression, and the ad should not be paid for. But even for ads that get paid on a per-click basis, the visibility statistics are important. How can we compute the clickthrough rate reliably in the presence of ads that are not even seen? I would expect visibility statistics to become a standard part of the computation of the clickthrough rate, which is a key metric of effectiveness for an ad.
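
To illustrate the point about clickthrough rates, here is a small sketch that compares the usual rate (clicks over all served impressions) with a visibility-adjusted rate (clicks over impressions that were actually in view). What counts as "in view" is an assumption that would need a threshold like the ones discussed above, and all the numbers are hypothetical.

```python
def clickthrough_rates(served, viewable, clicks):
    """Return (raw_ctr, adjusted_ctr).

    raw_ctr divides clicks by all served impressions; adjusted_ctr divides
    by the impressions that were in view. adjusted_ctr is None when no
    impression was viewable.
    """
    if served <= 0:
        raise ValueError("served must be positive")
    if not 0 <= viewable <= served:
        raise ValueError("viewable must be between 0 and served")
    if clicks < 0:
        raise ValueError("clicks cannot be negative")
    raw_ctr = clicks / served
    adjusted_ctr = clicks / viewable if viewable > 0 else None
    return raw_ctr, adjusted_ctr


# Hypothetical campaign: half of the served ads were never in view.
raw, adjusted = clickthrough_rates(served=1000, viewable=500, clicks=5)
print(raw, adjusted)
```

If half of the impressions are never seen, the raw rate understates how well the ad performs when it is actually shown by a factor of two.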

The question is how fast this change will come. Perhaps the moment advertisers realize that they should not be paying for ads that are never shown to the users.