## Friday, June 28, 2013

### Facebook implements brand safety, doing it "manually" (crowdsourcing?)

I was rather surprised to find out that Facebook has not been doing that already. It is known that Facebook has been using crowdsourcing to detect content that violates the terms of service. So, I assumed that the categorization of content as brand-inappropriate was also part of that process. Apparently not.

Given the similarities of the two tasks (the difference between no-ads-for-brand-safety and violating-terms-of-service is often just a matter of the intensity of the offense), I assume that Facebook is also going to adopt a crowdsourcing-style solution (perhaps with a private crowd), and then build a machine learning algorithm on top of the crowd judgements. At least the wording "In order to be thorough, this review process will be manual at first, but in the coming weeks we will build a more scalable, automated way" in the announcement seems to imply that.

Or perhaps, to blow my own horn, Facebook should just use Integral Ad Science (aka AdSafe Media). At AdSafe, we built a solution for exactly this problem back in 2009, employing a combination of crowdsourcing and machine learning to detect brand-inappropriate content. We did not go just for porn, but also for other categories, such as alcohol use, offensive language, hate speech, etc. In fact, most of my work in crowdsourcing was inspired, one way or another, by the problems we faced when trying to deploy a crowdsourcing solution at scale. Also, apart from the academic research, my work with Integral led to one of the best blog posts that I have written, "Uncovering an advertising fraud scheme (or, the Internet is for Porn)".

Perhaps the next step is to demonstrate how to use Project Troia, together with a good machine learning toolkit, to quickly deploy a system for detecting brand-inappropriate content. Maybe Facebook could use that ;-)

### Mechanical Turk account verification: Why Amazon disables so many accounts

Over the last year, Amazon embarked on a big effort: all holders of an Amazon Payments account (which includes all Mechanical Turk workers) had to verify their accounts by providing their social security number, address, full legal name, etc. Users who did not provide this information found their accounts disabled and were unable to perform any financial transaction.

This led to big changes in the market, as many international workers realized that Amazon could not verify their identity (even if they provided the correct information), and they found themselves locked out of Mechanical Turk.

So, why would Amazon start doing that?

• Low quality of international workers. While there are certainly many high-quality workers outside the US, there is a certain segment of workers that join the market with the sole purpose of getting something for nothing. Especially after Indian workers became eligible to receive cash compensation (instead of just gift cards available to other non-US workers), the number of spam attacks from India went up significantly.

So, identity verification can help on that front. It is well known that it is difficult to have a good reputation scheme when identities are cheap to generate. When identities are easy to create, every time someone commits a bad action and gets caught, the account gets closed and a new account is created, ready to commit the same bad actions again. This significantly hurts new workers, who are de facto treated as potential spammers, which discourages them from joining the market.

I have long criticized the fact that Amazon allowed for easy generation of ids. Even though it seemed that Amazon required SSNs and other information to create an account, this was effectively an optional step. In fact, it was possible to use Fake Name Generator and create plenty of seemingly authentic "US based" accounts, simply using the SSNs of dead people. This meant that many fake accounts existed, many of them "US based", which then used Amazon Payments to forward their earnings to the true puppetmaster holder.

• Labor law. Even though many (small) requesters are unaware of the fact, when you post jobs on Mechanical Turk, you are directly hiring contractors to do some work for you. Many people believe that you are paying Amazon, who then pays the workers, but in reality Amazon acts simply as a payment processor. Amazon does not act as an employer; the requester acts as an employer. As discussed in the past, this forces many requesters to unknowingly participate in a black market.

The moment requesters realize that they are actually employing all these contractors is when some workers end up receiving more than $600 in payments from the requester over the fiscal year. At that point, due to IRS regulations, the requester needs to send a 1099-MISC form to the MTurk worker. Amazon then provides the full information (SSN, address, etc.) of the workers to the requester. So Amazon would like to have the correct information, to avoid forcing the requesters to send 1099 forms to fake addresses, with fake names and SSNs. I should clarify here that the $600 limit is the point where the employer is forced to send a 1099-MISC form. In principle, a requester may want to send 1099-MISC forms to all workers, and Amazon may want to provide this information on demand. (I doubt that this can be the reason, though.)

Finally, there was a new regulation from the IRS last year: the IRS introduced the 1099-K form. Since Amazon acts as a de facto payment processor (and not as an employer), Amazon should also report the amount of payments sent to each worker. So, even if no worker has met the $600 limit from a single requester, if the overall payments to a single worker were high enough (specifically, $20,000/yr or more, and more than 200 transactions), then Amazon again needs to report this information and include valid worker information there.
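
To make the two reporting rules concrete, here is a minimal sketch that flags which requester-worker pairs would trigger a 1099-MISC and which workers would trigger a 1099-K. The function name, the record layout, and the exact boundary comparisons are assumptions made for illustration; in particular, counting individual payments toward the "200" threshold is my reading of the rule, and nothing here is tax advice or Amazon's actual logic.

```python
from collections import defaultdict

# Thresholds as quoted in the text; the exact boundary handling is an assumption.
MISC_LIMIT = 600.0         # paid by one requester to one worker in a year (1099-MISC)
K_AMOUNT_LIMIT = 20_000.0  # total paid to one worker in a year (1099-K)
K_COUNT_LIMIT = 200        # number of payments to one worker in a year (1099-K)


def reporting_obligations(payments):
    """Return (misc_pairs, k_workers) for (requester, worker, amount) records."""
    pair_totals = defaultdict(float)
    worker_totals = defaultdict(float)
    worker_counts = defaultdict(int)
    for requester, worker, amount in payments:
        pair_totals[(requester, worker)] += amount
        worker_totals[worker] += amount
        worker_counts[worker] += 1

    # Requester-level rule: more than $600 from a single requester.
    misc_pairs = sorted(
        pair for pair, total in pair_totals.items() if total > MISC_LIMIT
    )
    # Processor-level rule: both the amount and the count condition must hold.
    k_workers = sorted(
        worker for worker, total in worker_totals.items()
        if total >= K_AMOUNT_LIMIT and worker_counts[worker] > K_COUNT_LIMIT
    )
    return misc_pairs, k_workers
```

Note that a worker paid small amounts by hundreds of different requesters can trigger the second rule without ever triggering the first, which is exactly the case where only the payment processor sees the full picture.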

• Money laundering. Since Mechanical Turk started becoming a marketplace with significant volume, this may have raised some flags in all the places that monitor financial transactions for money laundering. All US companies need to comply with the infamous USA PATRIOT Act, and for Mechanical Turk the provisions about money laundering and financing of terrorist activities may have been a reason for cleaning up the marketplace of fake worker identities. The basic idea, known as the "Know Your Customer (KYC)" doctrine, is that Amazon should know from whom they get money and to whom they send the money. Since Amazon accepts payments from US requesters only, they know where the money comes from. Now, by cleaning up the marketplace of fake identities and verifying the existing ones, they also know where the money flows to, so they seem to be more in compliance with the money laundering laws.

Overall, there are many reasons for Amazon to check and clean up the market of fake accounts and prevent any anonymous activity. For me, this is a good step, despite all the problems that it may generate for workers who have trouble proving their identity. Even in India, the new UID system will eventually allow legitimate Indian workers to prove their identity without problems.

One concern that someone expressed to me was that this direction removes the ability of workers to be truly anonymous. I am not exactly sure how this can be a concern, given that it is well established that in the workplace (electronic or not) there is a very limited right to privacy. Knowing the true identity of your workers (contractors or employees) is a pretty fundamental right of the employer, and I doubt that the expectation that a worker remains anonymous can be a "reasonable expectation of privacy". The only case where I see this happening is if Amazon switches from being a payment processor to being an employer of all the Mechanical Turk workers, but I doubt this will happen anytime soon.

At the end of the day, markets do not mix well with true and complete anonymity.

## Tuesday, June 18, 2013

### Project Troia: Quality Assurance in Crowdsourcing

One of the key problems in crowdsourcing is the issue of quality control. Over the last few years, a large number of methods have been proposed for estimating the quality of workers and the quality of the generated data. A few years back, we released the Get Another Label toolkit, which allowed people to run their data through a command-line interface and get back estimates of the worker quality, estimates of how well the data have been labeled, and the data points that have high uncertainty and therefore may require additional attention.

The next step for Get Another Label was to get it ready to work in more practical settings. The GAL toolkit assumed that we have all the labels assigned by the workers, we process them, and we get the results. In reality, though, most tasks run in an incremental mode. The task runs over time, new data arrive, new workers arrive, and the "load-analyze-output" process was not a good fit. We wanted something that gives back estimates of worker quality on the fly and, again on the fly, identifies the data points that need the most attention.

Towards this goal, over the last few months we have been porting the GAL code into a web service, called Project Troia. You can load the data as the crowdsourced project runs and get back the results immediately. This allows for very fast estimation of worker quality, and also allows the quick identification of data points that either meet the target quality, or require additional labeling effort.
• Supports labeling with any number of discrete categories, not just binary.
• Supports labeling with continuous variables.
• Allows the specification of arbitrary misclassification costs (e.g., "marking spam as legitimate has cost 1, marking legitimate content as spam has cost 5").
• Allows for seamless mixing of gold labels and redundant labels for quality control.
• Estimates the quality of the workers that participate in the task and returns the estimates on-the-fly.
• Estimates the quality of the data returned by the algorithm and returns the estimate of labeling accuracy on-the-fly.
• Estimates a quality-sensitive payment for every worker, based on the quality of the work done so far.
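
To illustrate why arbitrary misclassification costs matter, here is a minimal sketch that picks the label with the lowest expected cost, using the spam example from the list above. The function names and the dictionary layout are hypothetical and are not Project Troia's API; the posterior over the true label is assumed to come from some upstream quality-estimation step.

```python
def expected_cost(posterior, costs, assigned):
    """Expected cost of assigning `assigned`, given a posterior over true labels.

    posterior: {true_label: probability}
    costs:     {true_label: {assigned_label: cost}}
    """
    return sum(prob * costs[true][assigned] for true, prob in posterior.items())


def min_cost_label(posterior, costs):
    """Return the label whose assignment has the lowest expected cost."""
    return min(posterior, key=lambda label: expected_cost(posterior, costs, label))


# Cost matrix from the example: costs[true_label][assigned_label].
COSTS = {
    "spam": {"spam": 0, "legit": 1},   # marking spam as legitimate has cost 1
    "legit": {"spam": 5, "legit": 0},  # marking legitimate content as spam has cost 5
}

# Hypothetical posterior: the item is probably spam, but not certainly.
posterior = {"spam": 0.7, "legit": 0.3}
choice = min_cost_label(posterior, COSTS)
```

With these numbers, assigning "spam" has expected cost 0.3 × 5 = 1.5, while assigning "legit" has expected cost 0.7 × 1 = 0.7, so the cost-sensitive choice is "legit" even though "spam" is the more probable label.
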
If you are interested in the description of the methods implemented in the toolkit, please take a look at the paper "Quality-based Pricing for Crowdsourced Workers". Our experiments indicate that when labeling allocation happens following the suggestions of Project Troia, we achieve the target data quality with almost optimal budget, and workers are fairly compensated for their effort. (For details, see the paper :-)
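
As a rough illustration of the on-the-fly idea (and not of the actual algorithm in the paper or in Project Troia), the sketch below keeps a running, smoothed accuracy estimate per worker that is updated each time a worker answers a gold-labeled item. The class name, method names, and the choice of a simple smoothed accuracy are all assumptions made for this example.

```python
from collections import defaultdict


class IncrementalWorkerAccuracy:
    """Running accuracy per worker, updated one gold-labeled answer at a time."""

    def __init__(self, prior_correct=1.0, prior_wrong=1.0):
        # Pseudo-counts keep estimates for new workers away from 0 or 1.
        self.prior_correct = prior_correct
        self.prior_wrong = prior_wrong
        self._correct = defaultdict(int)
        self._seen = defaultdict(int)

    def add_gold_answer(self, worker_id, given_label, gold_label):
        """Record one answer on an item whose true label is known."""
        self._seen[worker_id] += 1
        if given_label == gold_label:
            self._correct[worker_id] += 1

    def accuracy(self, worker_id):
        """Smoothed fraction of gold items the worker answered correctly."""
        correct = self._correct.get(worker_id, 0) + self.prior_correct
        seen = self._seen.get(worker_id, 0) + self.prior_correct + self.prior_wrong
        return correct / seen


tracker = IncrementalWorkerAccuracy()
for given, gold in [("spam", "spam"), ("legit", "legit"), ("spam", "legit"), ("spam", "spam")]:
    tracker.add_gold_answer("worker-1", given, gold)
```

Because each update touches only two counters, the estimate is available immediately after every new label, in contrast to the "load-analyze-output" style of batch processing. A worker with no history starts at the prior (0.5 with the default pseudo-counts).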

Special thanks to Tagasauris, oDesk, and Google for providing support for developing the software. Needless to say, the API is free to use, and the source code is available on GitHub. We hope that you will find it useful.