Thursday, March 22, 2012

The (Unofficial) NIST Definition of Crowdsourcing

A few weeks ago, I was attending the NSF Workshop on Social Networks and Mobility in the Cloud. There, I ran into the NIST definition of cloud computing.

After reading it, I felt that it would be a nice exercise to transform the definition into something similar for the dual area of "cloud labor" (aka crowdsourcing), and it turned out to be a useful one. While the NIST definition focuses on features that are commonly available in computing services, these features do have corresponding interpretations within the framework of "cloud labor". At the same time, we can also see significant differences, since humans and computers are fundamentally different.

Anyway, here is my attempt to take the NIST definition and translate it into a similar definition for crowdsourcing. Intentionally, I am plagiarizing the NIST definition, introducing changes only where necessary.

In the definition, I am trying to use the term "worker" for the person doing the job, the term "client" for the person that is paying for the labor, and "service provider" for the platforms that connect clients and workers.

The (Unofficial) NIST Definition of Cloud Labor / Crowdsourcing

Cloud labor is a model for enabling convenient, on-demand network access to a (shared) pool of human workers with different skills (e.g., transcribers, translators, developers, virtual assistants, graphic designers, etc.) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model promotes availability and is composed of five essential characteristics, three service models, and four deployment models.

Essential Characteristics
  • On-demand self-service. A client can unilaterally provision labor capabilities (e.g., virtual assistants, content moderators, developers, and so on) as needed, automatically, without requiring human interaction with the service provider.
  • Broad access. Capabilities are available and accessed through standard mechanisms that promote use by heterogeneous thin or thick client platforms (e.g., from PhD students hiring for a small survey, to companies such as uTest and TopCoder that engage their workers deeply).
  • Resource pooling. The labor resources are pooled by the service provider to serve multiple clients using a multi-tenant model, with different workers dynamically assigned and reassigned according to client demand. There is a sense of location and time independence in that the client generally has no control over or knowledge of the exact location of the provided labor but may be able to specify location and other desirable qualifications at a higher level of abstraction (e.g., country, language knowledge, or skill proficiency).
  • Rapid elasticity. Labor can be rapidly and elastically provisioned, in some cases automatically, to quickly scale out and rapidly released to quickly scale in. To the client, the labor capabilities available for provisioning often appear to be unlimited and can be purchased in any quantity at any time.
  • Measured service. Labor cloud provision systems automatically control and optimize resource use by leveraging a metering capability at some level of abstraction appropriate to the type of service (e.g., content generation, translation, software development, etc.). Resource usage can be monitored, controlled, and reported, providing transparency for the service provider, the client, and the worker, so that there is a better understanding of the quality of the provisioned labor services.
Service Models
  • Labor Applications/Software as a Service (LSaaS). The capability provided to the client is to use the provider’s applications running on a cloud-labor infrastructure. The applications are accessible from various client devices through a thin client interface such as a web browser (e.g., web application for ordering content generation, or proofreading, or transcription, or software testing, or ...). The client does not manage or control the underlying cloud labor, with the possible exception of limited user-specific application configuration settings. Effectively, the client only cares about the quality of the provided results of the labor and does not want to know about the underlying workflows, quality management, etc. [Companies like CastingWords and uTest fall into this category]
  • Labor Platform as a Service (LPaaS). The capability provided to the client is to deploy onto the labor pool client-created or acquired applications created using programming languages and tools supported by the provider. The client does not manage or control the underlying labor pool, but has control of the overall task execution, including workflows, quality control, etc. The platform provides the necessary infrastructure to support the generation and implementation of the task execution logic.
    [Companies like Humanoid fall into this category]
  • Labor Infrastructure as a Service (LIaaS). The capability provided to the client is to provision labor for the client, who then allocates workers to tasks. The client does not get involved with the recruiting process or the details of payment, but has full control over everything else. Much like the Amazon Web Services approach (use EC2, S3, RDS, etc. to build your app), the service provider just provides raw labor and guarantees that the labor force satisfies a particular SLA (e.g., response time within X minutes, has the skills that are advertised in the resume, etc.).
    [Companies like Amazon Mechanical Turk fall into this category] 
Deployment Models
  • Private labor pool. The labor pool is operated solely for an organization. It may be managed by the organization or a third party and may exist on premise or off premise.
  • Community labor pool. The labor pool is shared by several organizations and supports a specific community that has shared concerns (e.g., enthusiasts of an application such as birdwatchers, or volunteers for a particular cause such as disaster management). It may be managed by the organizations or a third party and may exist on premise or off premise.
  • Public labor pool. The labor pool is made available to the general public or a large industry group and is provisioned by an organization (or coalition of organizations) selling labor services.
  • Hybrid labor pool. The labor pool is a composition of two or more pools (private, community, or public) that remain unique entities but are bound together by standardized or proprietary technology that enables data and application portability (e.g., handling activity bursts by fetching public labor to support the private labor pool of a company).
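The burst-handling example for the hybrid labor pool can be sketched in a few lines. This is only an illustration under simple assumptions: tasks are interchangeable, the private pool has a fixed capacity, and the names `Allocation` and `allocate_burst` are hypothetical rather than part of any real platform's API.

```python
from dataclasses import dataclass


@dataclass
class Allocation:
    private: int  # tasks handled by the organization's own workers
    public: int   # overflow tasks sent to a public labor pool


def allocate_burst(tasks: int, private_capacity: int) -> Allocation:
    """Fill the private pool first and send only the overflow to the public pool."""
    if tasks < 0 or private_capacity < 0:
        raise ValueError("tasks and private_capacity must be non-negative")
    private = min(tasks, private_capacity)
    return Allocation(private=private, public=tasks - private)


# A normal day fits in the private pool; a burst spills into the public pool.
print(allocate_burst(tasks=80, private_capacity=100))
print(allocate_burst(tasks=250, private_capacity=100))
```

With a private capacity of 100, the first call keeps all 80 tasks in-house, while the second sends 150 of the 250 tasks to the public pool.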
Differences between a Computing and Labor Cloud

The NIST definition highlights some of the key aspects of a "cloud labor" service. However, by omission, it also illustrates some key differences that we need to take into consideration when thinking about "cloud labor" services.
  • Need for training and lack of instantaneous duplication. In the computing cloud we can pre-configure computing units with a specific software installation (e.g., with a LAMP stack) and then replicate as necessary to meet the needs of the application. With human workers, the equivalent of software installation is training. The key difference is that training takes time and we cannot “store the image and replicate as needed.” So, for cases where a client wants the workers to have task-specific training, we will observe a latency in starting the task equal to the time the worker needs to learn the requirements specific to the given task. When training is specific to the client, this latency can be significant. When training is transferable across clients, things are expected to be better, assuming a well-designed and well-functioning market.
  • Allocation over space. In the computing cloud we can request allocation of services in different geographical locations, but this is a desirable feature rather than a key one. With human labor though, especially when it contains an offline component, we may need to explicitly request specific geographic regions.
  • Allocation over time. With computing services, time is of little importance, apart from the normal load fluctuations over the time of day and the days of the week. Furthermore, we can easily operate a computing device 24/7. With human labor, this is not possible. Not only do we have to face the fact that humans get tired, but humans are also typically available for work only during the “working hours” of their timezone. Since we cannot take a person and replicate them across time zones, this becomes a crucial difference when we expect real-time on-demand labor services around the clock.
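To make the "allocation over time" point concrete, here is a minimal sketch that checks which hours of the day a labor pool can cover. It assumes that each worker is described only by an integer UTC offset, that everyone works 9:00 to 17:00 local time, and that daylight saving time and weekends can be ignored; the function name `covered_utc_hours` is hypothetical.

```python
def covered_utc_hours(utc_offsets, start=9, end=17):
    """Return the UTC hours (0-23) in which at least one worker is in local working hours.

    utc_offsets: iterable of integer UTC offsets, one per worker (e.g. -5, 1, 8).
    start, end:  local working hours, covering the hours start..end-1.
    """
    covered = set()
    for offset in utc_offsets:
        for local_hour in range(start, end):
            # Convert the worker's local hour back to UTC, wrapping around midnight.
            covered.add((local_hour - offset) % 24)
    return covered


# Workers at UTC-5, UTC+1, and UTC+8 (assumed offsets, for illustration only).
pool = [-5, 1, 8]
uncovered = sorted(set(range(24)) - covered_utc_hours(pool))
print("UTC hours with nobody at work:", uncovered)
```

Even with workers spread over three distant time zones, this pool leaves UTC hours 0, 22, and 23 uncovered, which is the kind of gap that never arises with machines running 24/7.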
How Mature are Today's Online Labor Markets?

If we examine the existing “labor cloud”, we will see that only a subset of the characteristics that define the computing cloud (on-demand self-service, broad access through APIs, resource pooling, rapid elasticity, and measured service) is available through today's labor platforms.

Take the case of Amazon Mechanical Turk:
  • On-demand self-service: Yes.
  • Broad access through APIs: Yes.
  • Resource pooling: Yes and No. While there is a pool of workers available, the service provider does not assign workers to tasks. This implies that there may be nobody willing to work on the posted task, and this cannot be inferred before testing the system. It is really up to the workers to decide whether they will serve a particular labor request.
  • Rapid elasticity: Yes and No. The scaling out capability is rather limited (scaling in is trivially easy). As in the case of resource pooling, it is up to the workers to decide whether to work on a task.
  • Measured service: No. Quality and productivity measurement is left to the client.
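The checklist above can be tallied mechanically. In this small sketch the dictionary simply transcribes the five verdicts from the list; the variable names are mine.

```python
from collections import Counter

# Verdicts for Amazon Mechanical Turk, copied from the checklist above.
assessment = {
    "On-demand self-service": "yes",
    "Broad access through APIs": "yes",
    "Resource pooling": "yes and no",
    "Rapid elasticity": "yes and no",
    "Measured service": "no",
}

# Count how many characteristics received each verdict.
tally = Counter(assessment.values())
for verdict, count in tally.items():
    print(f"{verdict}: {count}")
```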
2 yes, 1 no, and 2 "yes and no". Glass half-full? Glass half-empty? I will go for the half-full interpretation for now, but we can see that we still have a long way to go.