A Computer Scientist in a Business School

Thursday, October 17, 2013

Badges and the Lake Wobegon effect

For those not familiar with the term, the Lake Wobegon effect is the case when all or nearly all of a group claim to be above average, and comes from the finctional town where "all the women are strong, all the men are good looking, and all the children are above average."

Interestingly enough, as Wikipedia states, this effect of the majority of the group thinking that they are performing above-average "has been observed among drivers, CEOs, hedge fund managers, presidents, coaches, radio show hosts, late night comedians, stock market analysts, college students, parents, and state education officials, among others."

So, a natural question was whether this effect also appears in an online labor setting. We took some data from an online certification company, similar to Smarterer, where people take tests to show how well they know a particular skill (e.g., Excel, Audio Editing, etc.) The tests are not pass/fail but more like a GRE/SAT score: there is no "passing" score, only a percentile indicator that shows what percentage of other participants have a lower score.

Interestingly enough, we noticed a Lake Wobegon effect there as well: Most of the workers that displayed the badge of achievement, have scores above average, giving yet another point for the Lake Wobegon effect.

Of course, this does not mean that all users that took the test performed above average. Test takers have the choice to make their final score public to the world, or keep it private. Given that the user's profile is also used in a site where employers look for potential hires, there is some form of strategic choice in whether the test score is visible or not. Having a low score is often worse than having no score at all.

So, we wanted to see what scores make users comfortable with their performance, and incentivizes them to display their badge of achievement. Marios analyzed the data, and compared the distribution of scores for workers that decided to keep their score private, compared to the workers that made their performance public. Here is the outcome:

It becomes clear that scores below 50% are not posted often, while scores that exceed 60% have significantly higher odds of being posted online for the world to see. This becomes more clear if we take the log-odds of a worker deciding to make the score public, given the achieved percentile:

So, in the world of online labor if you ever hire someone who chose to display a certification, you know that there are good chances that you picked a worker that is better than average, at least in the test. (We have some other results on the predictive power of tests in terms of work performance, but this is a topic that cannot fit into the margins of this blog post :-)

Needless to say, this effect illustrates a direction that will take crowdsourcing, and labor markets in general, out of the race-to-the-bottom, market-for-lemons-style, pricing, where only price can separate the various workers. As education history serves in an offline setting as signaling for the potential quality of the employee, we are going to see more and more globally recognized certifications replacing educational history for many online workers.

Wednesday, September 11, 2013

CrowdScale workshop at HCOMP 2013

A public service announcement, to advertise CrowdScale (http://www.crowdscale.org/) a cool workshop at HCOMP 2013 that focuses on challenges that people face when applying crowdsourcing at scale.

A couple of interesting twists from the classic workshop recipe;

The workshop invites submission of short (2-page) position papers which identify and motivate key problems or potential approaches for crowdsourcing at scale, even if there aren’t satisfactory solutions proposed. (Deadline: October 4)
Second, there is a shared task challenge, which also carries a cool $1500 reward for the winner.

The CfP follows:

Crowdsourcing at a large scale raises a variety of open challenges:

How do we programmatically measure, incentivize and improve the quality of work across thousands of workers answering millions of questions daily?

As the volume, diversity and complexity of crowdsourcing tasks increase, how do we scale the hiring, training and evaluation of workers?

How do we design effective elastic marketplaces for more skilled work?

How do we adapt models for long-term, sustained contributions rather than ephemeral participation of workers?We believe tackling such problems will be key to taking crowdsourcing to the next level – from its uptake by early adopters today, to its future as how the world’s work gets done.

To advance the research and practice in crowdsourcing at scale, our workshop invites position papers tackling such issues of scale. In addition, we are organizing a shared task challenge regarding how to best aggregate crowd labels on large crowdsourcing datasets released by Google and CrowdFlower.
Twitter: #crowdscale • @CrowdAtScale
Organizers

Tatiana Josephy (@tatianajosephy), CrowdFlower

Matthew Lease (@mattlease), University of Texas at Austin

Praveen Paritosh (@heuristicity), Google

Wednesday, July 31, 2013

Online labor markets: Why they can't scale and the crowdsourcing solution.

I am a big proponent of outsourcing work using online labor markets. Over the last decade I outsourced hundreds of projects, ranging from simple data entry to big, complex software products. I learned to create project specs, learned how to manage contractors, and learned how to keep projects moving forward. In general, I consider myself competent in managing distributed teams and projects.

I also met and talked with many people that share my passion for this style of work. We discuss strategies for hiring, for managing the short- and long-term projects, for pricing, for handling legal risks, and other topics of interest. After many such discussions, I reached a striking conclusion: Everyone has a completely different style of managing this process.

This plurality of "best practices" is a bad thing. Having too many best practices means that there are no best practices. The lack of consensus makes it impossible to effectively teach a newcomer of how to handle the process.

The problem with manual hiring in online labor markets

People that want to use contractors for their projects face the following problems:

Few people know what they want: Just for fun, go and check random projects on oDesk, eLance, and Freelancer. An endless list of poorly described projects, requests for "clone of Facebook" for $500, and a lot of related crap. It is not a surprise that many of these projects remain open for ever.
Few people know how to hire: Ask any startup CEO how easy is to hire an employee. It is a pain. The art and craft of inferring the match of an individual to a given task is a very hard problem. Few people know how to do it right. Even within Google and Microsoft, with their legendary interviewing processes, interviewing is seen by many as a hard, time-consuming, and unrewarding experience.
Few people know how to manage a project: Even fewer people know how to manage a project. The harrowing fact is that most people believe that they can. Most people hire someone, hoping that the employee will be in their head, will understand what these vague specifications mean, will know everything that is not documented in a project, and will be able to do a great job. Very few people realize that outsourcing a project means that you will need to spend significant amount of time managing the project.

The result of the combination of these factors? Online labor does not scale through manual hiring. (Of course, this is not unique to online outsourcing. Offline hiring has the same problem.) There are simply not enough qualified employers that can hire effectively, who will be able to create demand for jobs for the online labor markets to continue to grow.

Online hiring vs online shopping

The counter-argument is that labor was always like that. Since the market for labor operates "manually," the transition to electronic hiring will allow for growth. In the same way people were initially afraid of shopping online, they started buying things online, they are going to switch to hiring online.

I do not buy this argument. When people buy an item online, they buy a standardized product. They are not ordering a bespoke item, which is created according to the customer specifications. Customization is typically limited and allowed on a specific set of dimensions. You can customize your Mac to have a better processor, more memory, and a larger hard disk. But you cannot order a laptop with a 19 in screen, and cannot ask for 96 Gb of memory.

But in online markets this is what happens. The random customer comes and asks for a web application ("just the functionality of the X website"), and wants this app to be built for $500. It is the same as if someone goes to a computer store and asks for a laptop with a 19 inch screen, with 128Gb of memory, and 10Tb disk. And, since 1Gb of memory costs 7 dollars, it is reasonable to just pay $1000 for 128Gb, right?

Lessons from online shopping

Based on the experience for the transition of shopping from offline to online, let's see how online labor can move forward.

Standardize and productize: Currently, in online markets, most people ask for a specific set of tasks. Content generation, website authoring, transcriptions, translations, etc. Many of these can be "productized" and be offered as standardized packages, perhaps with a few pre-set customizations available. (Instead of "select the hard disk size, you have a "select blog post length".) This vertical-oriented strategy is followed by many crowdsourcing companies and offers to the client a clean separation from the process of hiring and managing a task. This vertical strategy works well to create small offerings but it is not clear if there is sufficient demand within each vertical to fuel the growth expected for a startup. This is a topic for a new blog post.
Productize the project creation/management: When a standardized offering is not sufficient, the client is directed into hiring a product manager that will spec the requirements, examine if there is sufficient supply of skills in the market, hire individual contractors, manage the overall process, etc. This is similar to renovating a house. The delivered product is often completely customized, but the client does not seek to hire separately electricians, carpenters, painter, etc. Instead, the owner hires a "general contractor" who creates the master plan for the renovation, procures the materials, hires subcontractors, etc. While it eases some of the problems, this is a process suitable only for reasonably big project.
Become a staffing agency: A problem with all existing marketplaces is that they are not acting as employers, but only as matching agents. Few, if any, marketplaces are guaranteeing quality. Every transaction is a transaction between "consenting adults." Unfortunately, very few potential employers understand that, and hire with the implicit assumption that the marketplace is placing a guarantee on the quality of the contractors. So, if the contractor ends up being unqualified for the task, there is very little recourse. By guaranteeing quality, the employer (who is the one spending the money) gets some minimum level of guarantee about the deliverable. Unfortunately, providing such quality guarantees is easier said than done.
Let contractors build offerings: By observing the emergence of marketplaces like Etsy, you can see that people are becoming more comfortable with ordering semi-bespoke, handcrafted items online, for which they have little information. A potential route is to allow the contractors in online markets to build such "labor products" and price them themselves, in the same way that Etsy sellers are putting up their handcrafted stuff online.

All these approaches are fine, and I expect most current marketplaces to adopt one or more of these strategies over time. However, all of them rely on the same assumption: That hiring, as shopping, will be a human activity.

What happens, though, if we stop assuming that hiring is a human-mediated effort?

Crowdsourcing practices to the rescue

I will not pretend that the current state of the crowdsourcing industry offers concrete solutions to the problems listed above. But today's efforts in crowdsourcing move us towards an algorithmically-mediated work environment.

Of course, like all automatic solutions, the initial environment is much worse than "traditional" approaches. We see that in all the growing pains of Mechanical Turk. It is often easier to just hire a couple of trusted virtual assistants from oDesk to do the job, instead of trying to implement the full solution stack to get things done properly on MTurk.

However, the initial learning curve starts paying off later. Production environments that rely on a "crowd" need to automate as much as possible the hiring and management of workers. This automation makes the tasks much more scalable than traditional hiring and project management. High-startup costs, then lower marginal costs of adding workers to a process.

This leads to easier scalability. Of course, the moment the benefits of easier scalability start becoming obvious, it will be too late for players that rely on manual hiring to catch up. It is one of the reasons that I believe that Mechanical Turk has the potential to be the major labor platform, even if this seems a laughable proposition at this point.

I will make a prediction: Crowdsourcing is currently at the forefront of defining the methods and practices in the workplace for the next few decades. Assembly lines and integration of machines in the work environment led to the mass production revolution of the 20th century. The current crowdsourcing practices will define how the majority of people are going to work on knowledge tasks in the future. A computer process will monitor and manage the working process, and hiring manually will be soon a thing of the past, for many "basic" knowledge tasks.

Some will find this prospect frightening. I do not find it any more frightening than having traffic lights regulate traffic in intersections, or having the auto-pilot taking care of my flight.

Sunday, July 28, 2013

Crowdsourcing and information theory: The disconnect

In crowdsourcing, redundancy is a common approach to ensure quality. One of the questions that arises in this setting is the question of equivalence. Let's assume that a worker has a known probability $q$ of giving a correct answer, when presented with a choice of $n$ possible answers. If I want to simulate one high-quality worker workers of quality $q$, how many workers of quality $q' < q$ do we need?

Information Theory and the Noisy Channel Theorem

Information theory, and the noisy channel theorem, can give an answer to the problem: Treat each worker as a noisy channel, and measure the "capacity" of each user. Then, the sum of the capacities of the different workers should give us the equivalent capacity of a high-quality worker.

We have that the capacity $C(q,n)$ of a worker with quality $q$, who returns the correct answer with probability $q$, when presented with $n$ choices, is:

$C(q,n) = H(\frac{1}{n}, n) - H(q, n)$

where $H(q,n) = -q \cdot \log(q) - (1-q) \cdot \log(\frac{1-q}{n-1}) $ is the entropy (aka uncertainty) of the worker.

Examples

The value $H(\frac{1}{n}, n) = \log(n)$ is the initial entropy that we have for the question, when no answers are given. Intuitively, when $n=2$, the initial uncertainty is equal to $\log(2)=1$ bit, since we need one bit to describe the correct answer out of the 2 available. When $n=4$, the uncertainty is equal to $\log(4)=2$ bits, as we need 2 bits to describe the correct answer out of the 4 available.

Therefore, a perfect worker, with quality $q=1$ will have $H(1,n)=0$ entropy, and therefore the capacity of a perfect worker is $\log(n)$.

Can Two Imperfect Workers Simulate a Perfect One?

Now, here comes the puzzle. Assume that we have $n=4$, and the workers have to choose among 4 possible answers. We also have also two workers with $q=0.85$, that select with 85% probability the correct answer out of 4 available. These workers have each capacity equal to $C(0.85, 4) = 1.15$ bits. At the same time, we have one perfect worker with $q=1$. This worker has a capacity of $C(1,4)=2$ bits. So, in principle, the two noisy workers are sufficient to simulate a perfect worker (and would leave a remaining 0.3 bits to use :-)

What am I missing?

My problem is that I do not get how to reach this theoretical limit. I cannot figure out how to use these two workers with $q=0.85$, in order to reconstruct the correct answer. Asking two workers to work in parallel will not cut it (still possible for both workers to agree and be incorrect). Sequential processing (get first a worker to select two out of the four answers, then the second one pick the correct out of the two) seems more powerful, but again I do not understand how to operationalize this.

According to information theory, these two $q=0.85$ workers are equivalent, on average, with one perfect $q=1.0$ worker. (Actually, they seem to carry more information). And even if we avoid perfection, and we set target quality at $q=0.99$, $C(0.99,4)=1.9$. I still cannot see how I can combine two workers with 85% accuracy to simulate a 99% accurate worker.

Update 1 (thanks to Michael Nielsen): Information theory operates over a large amount of transmitted information, so posing the question as "answering a single question" makes it sound more impossible than it should.

We need 2 bits of information to transfer the answer for a multiple choice question with n=4 choices. Say that we have a total of N such questions. So, we need 2N bits to transfer perfectly ann the answers. If we have perfect workers, with $q=1$, we have that $C(1,4)=2$, and we need 2N bits / 2 bits/answer = N answers, from these workers.

Now, let's say that we have workers with $q'=0.85$. In that case $C(0.85, 4) = 1.15$ bits per answer. Therefore, we need 2N bits / 1.15 bits/answer = 1.74N answers from these 85% accurate workers in order to perfectly reconstruct the answers for these N questions.

So, if we get from these 85% workers a total of 100 answers (each one 85% correct), we should be able to reconstruct the 100% correct answer for ~57 (=100/1.74) questions.

Of course we should be intelligent of what exactly to ask and get these 100 answers.

I see in Wikipedia, in the article about the noisy channel theorem, that "Simple schemes such as 'send the message 3 times and use a best 2 out of 3 voting scheme if the copies differ' are inefficient error-correction methods" and that "Advanced techniques such as Reed–Solomon codes and, more recently, turbo codes come much closer to reaching the theoretical Shannon limit". Unfortunately, my familiarity with such coding schemes is minimal (i.e., I have no clue), so I cannot understand their applicability in a crowdsourcing setting.

So, here is my question: What coding schemes should we use in crowdsourcing in order to get closer to the theoretical limits given by Shannon? Or what is the fundamental thing that I miss? Because I do feel that I am missing something...

Any help would be appreciated.

Update 2 (thanks to the comments by stucchio and syrnik): Information theory predicts that we can always recover the perfect answers from noisy workers, given sufficient worker capacity. For anyone that has worked in crowdsourcing, this sounds very strange, and seems practically infeasible. The problem does not seem to be in the assumptions of the analysis; instead it seems to rely on the feasibility of implementing a proper encoding scheme on top of human computation.

The key concept in information theory is the coding scheme that is used to encode the information, to make the transmission of information robust to errors. Information theory does not say how we can recover this perfect information using a noisy channel. Over time, researchers came up with appropriate encoding schemes that approach very closely the theoretical maximum (see above, Reed-Solomon codes, turbo codes, etc). However, it is not clear whether these schemes are translatable into a human computation setting.

Consider this gross simplification (which, I think, is good enough to illustrate the concept): In telecommunications, we put a "checksum" together with each message, to capture cases of incorrect information transmission. When the message gets transmitted erroneously, the checksum does not match the message content. This may be the result of corruption in the message content, or the result of corruption in the checksum (or both). In such cases, we re-transmit the message. Based on the noise characteristics of the channel, we can decide how long the message should be, how long the checksum should be, etc., to achieve maximum communication efficiency.

For example, consider using a parity bit, the simplest possible checksum computation. We count the number of 1 bits in the message: if the number of 1's is odd, we set the parity bit to 1, if the number of 1's is even, we set the parity bit to 0. The extra parity bit increases the size of the message but can be used to detect errors when the message gets transmitted over a noisy channel, and reduce the error rate. By increasing the number of parity bits we can reduce the error rate to arbitrarily low levels.

In a human computation setting, computing such a checksum is highly non-trivial. Since we do not really know the original message, we cannot compute at the source an error-free checksum. We can of course try to create "meta"-questions that will try to compute the "checksum" or even try to modify all the original questions to have an error-correcting component in them.

See now the key difference: In information theory, we have computed error-free the message to be transmitted with built-in error-correction. Consider now the same implementation in a human computation setting: We ask the user to inspect the previous k questions, and report some statistics about the previously submitted answers. The user now operates on the noisy message (i.e., the given, noisy answers), therefore even the error-free computation of the checksum is going to be noisy, defeating the purpose of an error-correcting code.

Alternatively, we can try to take the original questions, and try to ask them in a way that enforces some error-correcting capability. However, it is not clear that these "meta-questions" are going to have the same level of complexity for the humans, even if in the information-theoretic sense, they carry the same amount of information.

It seems that in a human computation setting we have noisy computation, as opposed to noisy transmission. Since the computation is noisy, there is a good chance that the computation these "checksums" is going to be correlated with the original errors. Therefore, it is not clear whether we can actually implement the proper encoding schemes on top of human computers, to achieve the theoretical maximums predicted by information theory.

Or, at least, this seems like a very interesting, and challenging research problem.

Friday, June 28, 2013

Facebook implements brand safety, doing it "manually" (crowdsourcing?)

I just read that Facebook started thinking about brand safety, and will restrict ads from appearing next to content that may be controversial (e.g., adult-oriented content).

I was rather surprised to find out that Facebook has not been doing that already. It is know that Facebook has been using crowdsourcing to detect content that violates the terms of service. So, I assumed that the categorization of the content as brand-inappropriate was also part of that process. Apparently not.

Given the similarities of the two tasks (the difference between no-ads-for-brand-safety and violating-terms-of-service is often just part of intensity of the offense), I assume that Facebook is also going to adopt a crowdsourcing-style solution (perhaps with a private crowd), and then they will build a machine learning algorithm on top using the crowd judgements. At least the wording "In order to be thorough, this review process will be manual at first, but in the coming weeks we will build a more scalable, automated way" in the announcement seems to imply that.

Or perhaps, to blow my own horn, Facebook should just use Integral Ad Science, (aka AdSafe Media). At AdSafe, we built a solution for exactly this problem back in 2009, employing a combination of crowdsourcing and machine learning to detect brand-inappropriate content. We did not go just for porn, but also for other categories, such as alcohol use, offensive language, hate speech, etc. In fact, most of my work in crowdsourcing was inspired, one way or another, through the problems faced when trying to deploy a crowdsourcing solution at scale. Also, except for the academic research, my work with Integral also led to one of the best blog posts that I have written, "Uncovering an advertising fraud scheme (or, the Internet is for Porn)".

Perhaps, the next step is to demonstrate how to use Project Troia, together with a good machine learning toolkit in order to deploy quickly a system for detecting brand inappropriate content. Maybe Facebook could use that ;-)

Mechanical Turk account verification: Why Amazon disables so many accounts

Over the last year, Amazon embarked into a big effort: All holders of an Amazon Payments account (which includes all the Mechanical Turk worker) had to verify their accounts, by providing their social security number, address, full legal name, etc. Users that did not provide this information found their accounts disabled, and unable to perform any financial transaction.

This led to big changes in the market, as many international workers realized that Amazon could not verify their identity (even if they provided the correct information), and they found themselves locked out of Mechanical Turk.

So, why would Amazon start doing that?

Low quality of international workers. While there are certainly many high-quality workers outside the US, there is a certain segment of workers that join the market with the sole purpose of getting something for nothing. Especially after Indian workers became eligible to receive cash compensation (instead of just gift cards available to other non-US workers), the number of spam attacks from India went up significantly.

So, identity verification can help in that front. It is well-known that it is difficult to have a good reputation scheme that allows for cheap generation of identities. When identities are easy to create, every time someone commits a bad action and gets caught, the account gets closed and a new account is created, ready to commit the same bad actions again. This hurts significantly new workers, that are defacto treated as potential spammers, discouraging them to join the market.

I have long criticized the fact that Amazon allowed for easy generation of ids. Even though it seemed that Amazon required SSN numbers, and other information to create an account, this was an effectively optional step. In fact, it was possible to use Fake Name Generator, and create plenty of seemingly authentic "US based" accounts, using simply SSN numbers of dead people. This meant that many fake accounts existed, many of them being "US based" that then used Amazon Payments to forward their earnings to the true puppetmaster holder.
Labor law. Even though many (small) requesters are unaware of the fact, when you post jobs on Mechanical Turk, you directly engage into hiring contractors to do some work for you. Many people believe that you are paying Amazon, who then pays the workers but in reality Amazon acts simply as a payment processor. Amazon does not act as an employer; the requester acts as an employer. As discussed in the past, this forces many requesters to unknowingly participate in a black market.

The moment requesters realize that they are actually employing all these contractors is when some workers end up receiving more than $600 in payments from the requester over the fiscal year. At that point, due to IRS regulations, the requester needs to send a 1099-MISC form to the MTurk worker. Amazon then provides the full information (SSN, address, etc) of the workers to the requester. So Amazon would like to have the correct information, to avoid forcing the requesters to send 1099 forms to fake addresses, with fake names and SSNs.

I should clarify here that the $600 limit is the point where the employer is forced to send a 1099-MISC form. In principle, a requester may want to send 1099-MISC forms to all workers, and Amazon may want to provide this information on demand. (I doubt that this can be the reason, though).

Finally, there was a new regulation from IRS last year: IRS introduced the concept of a 1099-K form. Since Amazon acts as a defacto payment processor (and not as an employer), Amazon should also report the amount of payments sent to each worker. So, even if no worker have met the $600 limit from a single requester, if the overall payments for a single worker was high enough (specifically $20,000/yr or more, and more than 200 requesters) then again Amazon needs to report this information and include valid worker information there.
Money laundering: Since Mechanical Turk started becoming a marketplace with significant volume, this may have raised some flags in all the places that monitor financial transactions for money laundering. All US companies need to comply with the infamous US Patriot Act, and for Mechanical Turk the provisions about money laundering and financing of terrorist activities may have been a reason for cleaning up the marketplace from fake worker identities. The basic idea, known as the "Know Your Customer (KYC)" doctrine, is that Amazon should know from whom they get money and to whom they send the money. Since Amazon accepts payments from US requesters only, they know where the money come from. Now, with cleaning up the marketplace from fake identities and verifying the existing ones, they also know where the money flows to, so they seem to be more in compliance with the money laundering laws.

Overall, there are many reasons for Amazon to check and clean up the market from fake accounts and prevent any anonymous activity. For me, this is a good step, despite all the problems that it may generate for workers that have problems proving their identity. Even in India, the new UID system will eventually allow the legitimate Indian workers to prove their identity without problems.

One concerns that someone expressed to me was that this direction was removing the ability of workers to be truly anonymous. I am not exactly sure how this can be a concern, given that it is well established that in the workplace (electronic or not) there is very limited right to privacy. Knowing the true identity of your workers (contractors or employees) is a pretty fundamental right of the employer, and I doubt that the expectation that a worker remains anonymous can be a "reasonable expectation of privacy". The only case that I see this happening is if Amazon switches from being a payment processor to being an employer of all the Mechanical Turk workers, but I doubt this will happen anytime soon.

At the end of the day, markets do not mix well with true and complete anonymity.

Tuesday, June 18, 2013

Project Troia: Quality Assurance in Crowdsourcing

One of the key problems in crowdsourcing is the issue of quality control. Over the last few years, a large number of methods have been proposed for estimating the quality of workers and the quality of the generated data. A few years back, we have released the Get Another Label toolkit, which allowed people to run their data through a command-line interface, and get back estimates of the worker quality, estimates of how well the data have been labeled, and identify the data points that have high uncertainty and therefore may require additional attention.

The next step for the Get Another Label was to get it ready to work in more practical settings. The GAL toolkit, assumed that we have all the labels assigned by the workers, we process them, and get the results. In reality, though, most tasks run in an incremental mode. The task is running over time, new data arrive, new workers arrive, and the "load-analyze-output" process was not a good fit. We wanted to have something that gives back estimates of worker quality on the fly, and again on-the-fly identifies the data points that need most attention.

Towards this goal, over the last few months we have been porting the GAL code into a web service, called Project Troia. You can load the data as the crowdsourced project runs and get back the results immediately. This allows for very fast estimation of worker quality, and also allows the quick identification of data points that either meet the target quality, or require additional labeling effort.

Supports labeling with any number of discrete categories, not just binary.
Supports labeling with continuous variables.
Allows the specification of arbitrary misclassification costs (e.g., "marking spam as legitimate has cost 1, marking legitimate content as spam has cost 5").
Allows for seamless mixing of gold labels and redundant labels for quality control.
Estimates the quality of the workers that participate in the task and returns the estimates on-the-fly.
Estimates the quality of the data that are returned back by the algorithm and returns the estimate of labeling accuracy on-the-fly.
Estimates a quality-sensitive payment for every worker, based on the quality of the work done so far.

If you are interested in the description of the methods implemented in the toolkit, please take a look at the paper "Quality-based Pricing for Crowdsourced Workers". Our experiments indicate that when labeling allocation happens following the suggestions of Project Troia, we achieve the target data quality with almost optimal budget, and workers are fairly compensated for their effort. (For details, see the paper :-)

Special thanks to Tagasauris, oDesk, and Google for providing support for developing the software. Needless to say, the API is free to use, and the source code is available on Github. We hope that you will find it useful.

Tuesday, April 2, 2013

Intrade Archive: Data for Posterity

A few years back, I have done some work on prediction markets. For this line of research, we have been collecting data from Intrade, to perform our experimental analysis. Some of the data is available through the Intrade Archive, a web app that I wrote in order to familiarize myself with the Google App Engine.

In the last few weeks, through, after the effective shutdown of Intrade, I started receiving requests on getting access to the data stored in the Intrade Archive. So, after popular demand, I gathered all the data from the Intrade Archive, and also all the past data that I had about all the Intrade contracts going back to 2003, and I put them all on GitHub for everyone to access and download. The Excel file contains a description of the contracts, while the zip file contains information about all the individual trades and the daily opening and closing prices.

On purpose, I exclude all the Financial contracts, as the trading of these events have limited research interest. (Plus, they were too many of them.) The information from "official" stock and options exchanges has much higher volume and is a better source of information than the comparatively illiquid contracts on Intrade.

The link to the GitHub repository is also now available from the home page of the Intrade Archive. I hope that the resource hungry crawlers can now be put to sleep, not to ever come back again :-)

Enjoy!

Monday, February 25, 2013

WikiSynonyms: Find synonyms using Wikipedia redirects

Many many years back, I worked with Wisam Dakka on a paper to create faceted interfaced for text collections. One of the requirements for that project was to discover synonyms for named entities. While we explored a variety of directions, the one that I liked most was Wisam's idea to use the Wikipedia redirects to discover terms that are mostly synonymous.

Did you know, for example, that ISO/IEC 14882:2003 and X3J16 are synonyms of C++? Yes, me neither. However, Wikipedia reveals that through its redirect structure.

The Wikisynonyms web service

What we mean by redirects? Well, if you try to visit the Wikipedia page for President Obama, you will be redirected to the canonical page Barack Obama. Effectively "President Obama" is deemed by Wikipedians to be a close synonym of "Barack Obama", and therefore the redirect. Similarly, the term "Obama" is also a redirect, etc. (You can check the full list of redirects here.)

While I was visiting oDesk, I felt that this service can be useful for a variety of purposes so, following the oDesk model, we hired a contractor to implement this synonym extraction as a web API and service. If you want to try it out please go to:

http://wikisynonyms.ipeirotis.com/search

The API is very simple. Just issue a GET request like this:

curl 'http://wikisynonyms.ipeirotis.com/api/{TERM}'

For example, to find synonyms for Hillary Clinton:

curl 'http://wikisynonyms.ipeirotis.com/api/Hillary_Clinton'

and for Obama

curl 'http://wikisynonyms.ipeirotis.com/api/Obama'

Mashape integration

Since we may change the URL of the service, I would recommend registering and using Mashape to access the WikiSynonyms service through Mashape instead:

curl 'https://wikisynonyms.p.mashape.com/{TERM}' --header 'X-Mashape-Authorization: your_mashape_key'

You can easily download Wikipedia

Interestingly enough, this synonym extraction technique remains little-known, despite the easiness of extracting these synonyms. And whenever I mention Wikipedia, most people are worried that they will need to scrape the HTML from Wikipedia, and nobody likes this monkey business.

Strangely, most people are unaware that you can download Wikipedia in a relational form and put it directly in a database. In fact, you can download only the parts that you need. Here are the basic links:

The Wikipedia schema is at http://www.mediawiki.org/wiki/Manual:Database_layout
The files are at http://dumps.wikimedia.org/enwiki/latest/ and you can download each table individually.
To implement the synonyms service, you only need to fetch the redirect table and the page table.
More instructions at http://en.wikipedia.org/wiki/Wikipedia:Database_download

This redirect structure (as opposed, say to the normal link structure and the related anchor text) is highly precise. By eyeballing the results, I would guess that precision is around 97% to 99%.

Application: Extracting synonyms of oDesk skills

One application that we used the service was to extract synonyms for the set of skills that are used to annotate the jobs posted on oDesk. For example, you can find the synonyms for C++:

Or you can find the synonyms for Python:

Oops, as you see the term Python is actually ambiguous, and Wikipedia has a disambiguation page with the different 'senses' of the term. Since we are not doing any automatic disambiguation, we return a 300 HTTP response and ask the user to select one of the applicable terms. So, if we query now with the term 'Python (programming language)' we get:

Open source and waiting for feedback

The source code together with the installation instructions for the service is available on GitHub. Feel free to point any problems or suggestions for improvement. And thank oDesk Research for all the support in creating the service and making it open source for everyone to use.

Monday, January 28, 2013

Towards a Market for Intelligence

Last September, I was visiting CMU and a student asked me a question: "Do you know any crowdsourcing market, where we can assign tasks to people, as opposed to waiting for the workers to pick the tasks they want to work on?"

Most crowdsourcing services do not satisfy this requirement. Mechanical Turk, oDesk, eLance, and all others typically expect the workers to express interest to a task. At most, you may be able to invite workers to participate in a task, but you cannot really assign a task to a worker.

A notable exception is Fiverr, which plays the supply side of the market; however, Fiverr has a different limitation: The tasks that can be performed are posted by the workers, and the employers pick from a set of existing tasks.

After thinking for a while, I realized that there more such markets in which you can assign tasks to "workers". The main difference is that the "workers" are not necessarily humans, but APIs. Enter the world of Mashape, an API marketplace. Are you looking for someone to classify tweets according to their sentiment? Query the Mashape marketplace for sentiment analysis API's and then assign the task to the intelligence units of your choice.

With the advent of the API marketplaces, we see the emergence of marketplaces for "intelligence units". All these intelligence units (API's, or human workers) have different levels of quality, various levels of pricing, and even various levels of capacity and responsiveness.

With the proper abstraction, a task management platform, can use and optimize these distributed intelligence units without having to worry about the "implementation details" of this intelligence.