A Computer Scientist in a Business School

Wednesday, December 5, 2012

Mechanical Turk changing the defaults: The game has changed

Back in the summer of 2011, Mechanical Turk introduced a new type of qualification, the Mechanical Turk "Masters". The Master qualification was assigned by Amazon to workers that have proven themselves in the marketplace.

What exactly makes someone "proven"? This is, understandably, a well-kept secret by Amazon. The opacity of the qualification process annoys many workers: It is hard to prove that you are a Master and qualify for it, when you do not know how this qualification is granted. Rumor has it that Amazon deploys decoy tasks on Mechanical Turk just to examine the performance of the workers and decide which ones to qualify as Masters. If this is correct, then it also explains why Amazon is rather secretive about the exact requirements: Workers would try to ace these test tasks, and let their guard down in others.

The existence of Masters was a good development towards creating a true reputation scheme for Mechanical Turk. However, an action taken by Amazon a month back has changed the dynamic of the market: Now the default, for all tasks created through the web UI, is to require Masters workers. Removing the requirement is done only through the "advanced" menu, and is followed by a warning that you may not get good results if you opt not to use Masters.

Tiny change? No. This is huge. Here are a few of the immediate, positive effects:

People that use the Web UI are typically newcomers who do not know how (or do not want) to implement sophisticated quality control schemes. They just want to execute some simple tasks. The task templates help a lot to create a usable interface, and the Masters requirement ensures that they are not going to get back crappy results. A happy customer is a long-term customer.
Masters will not touch badly designed and ambiguous tasks. This enforces discipline from the requester side, to get things designed properly. Otherwise the tasks are left untouched, which is a good signal that something is wrong with the task.
Masters will not touch offensively priced tasks, paying less than minimum wage, while demanding high-quality work. This (hurray!) removes the impression that Mechanical Turk is about dirt-cheap work and emphasizes what crowdsourcing is about: Dynamic allocation of labor on tasks, without the overhead of hiring, negotiations, etc.

There are, of course, a few downsides:

There are far fewer Masters workers. A current search reveals 20,744 workers. This is at least an order of magnitude lower than the number of active workers that Amazon used to advertise. Of course, these Masters are much more active than the average worker, but still there are not enough of them for all the tasks that require them.
There is now a significant lag before a task gets picked up by workers. Masters are much more careful about the requesters they work with, and a new requester will need to prove that they are not rejecting work unfairly, and that they pay on time. Until then, the task will get only a few workers willing to test it.
The tasks now take much longer to complete. My current sense is that there is a 10x slowdown (but the improvement in quality is definitely worth it).
There is an increased cost. Masters require decent wages (so no more 5 cents for 5 minutes of work), and there is an increased overhead from Amazon (30% overhead for Masters vs 10% for regular workers). My take? You get what you pay for.
It is not clear on what tasks the Masters are tested and how a new worker can become a Master. It would be great if Amazon also got quality signals from a few reliable big requesters, but I can see many practical problems in implementing such a solution.

Overall though, this change in the defaults shows that Amazon has started acting on the criticism. It is clear that this is a risky move, as a lot of the work posted on Mechanical Turk will not get done, due to lack of interest in poorly paying or badly designed tasks.

But on the other hand, it shows that Amazon is looking for the long term: Let newcomer requesters get guaranteed results, and if they want to get things done faster they can focus on pricing and better task design. If they want to get further and engage other Turkers, such requesters will be aware of the risks and benefits of such a move.

So, effectively now we have the "novice" requesters, who get protected by default through the Masters qualification, and the "advanced" requesters that can implement their own qualification schemes to replace the Masters qualification. This default level of protection makes the life of wannabe-scammer workers very difficult: no obvious victims to attack. Just hunting for a victim requester will become so difficult that it makes sense to give up scamming and either convert to doing real work, or abandon the market.

A tiny change in the defaults with short-term problems and many big, long-term benefits. Personally, I find this move exhilarating.

Sunday, November 18, 2012

How big is Mechanical Turk?

A question that people ask me very often is about the size of Mechanical Turk. How many tasks are being completed on the marketplace every day? What is the transaction volume? Let me give a quick answer: I have no idea. Since Amazon does not release any statistics about the marketplace, it is pretty much impossible to know for sure.

Mechanical Turk Tracker

However, I do have some estimates, mainly by using the data that I have been collecting through the Amazon Mechanical Turk Tracker. For those not familiar with the site, over the last four years, we have been crawling the Mechanical Turk site every few minutes and capturing the complete state of the market: What tasks are available, their prices, the number of HITs available, etc.

One feature that we revamped lately is the ability to see the number of tasks that are posted and completed every day. You can check the "Arrivals" tab to see the details.

Estimating HITs posted and completed

How do we estimate the number of tasks that get posted and completed? The estimation is a little bit tricky and not 100% foolproof but it works reasonably well, based on my current observations.

Since we can keep track of the history of a task over time, we can see the changes in the number of available HITs over time. For example, we may observe a task that has the following number of HITs in sequential crawls, over time:

1000...700...500...2000...1000...100...[disappeared]

For this task, we estimate that we have an initial posting of 1000 HITs. Then, we see 1000-700 = 300 HITs completed between the first and second crawl. Then, 700-500=200 HITs completed between the second and third crawls. However, between the third and fourth crawl we see a "refill" with 2000-500=1500 HITs, which have been posted. Then we see 2000-1000 = 1000 HITs being completed, then 1000-100=900 HITs completed, and finally the task disappears and the last 100 HITs are assumed to be completed. This generates a total of 1000+1500 HITs posted, and 300+200+1000+900+100 HITs completed.
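The bookkeeping in this example can be written down as a short sketch. This is only an illustration of the rule described above, not the tracker's actual code: the function name `estimate_hits` and the `disappeared` flag are hypothetical, and the extra sanity tests are not modeled.

```python
def estimate_hits(counts, disappeared=True):
    """Estimate (posted, completed) HITs for one task.

    counts: number of available HITs seen in consecutive crawls.
    disappeared: if True, the task vanished after the last crawl and
    its remaining HITs are assumed to have been completed.
    """
    if not counts:
        return 0, 0
    posted = counts[0]          # the initial posting
    completed = 0
    for prev, cur in zip(counts, counts[1:]):
        if cur > prev:
            posted += cur - prev        # a "refill" between two crawls
        else:
            completed += prev - cur     # HITs that went away are counted as done
    if disappeared:
        completed += counts[-1]
    return posted, completed


posted, completed = estimate_hits([1000, 700, 500, 2000, 1000, 100])
print(posted, completed)  # 2500 2500
```

Note that a refill and some completions that happen between the same two crawls partially cancel out, so this rule can only undercount both numbers.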

We do have some extra sanity tests but let's consider the current description as sufficient. For the record, I have checked with a few big requesters and my estimated numbers were pretty close to the actual ones, so I feel reasonably confident that I am not off completely.

Analyzing daily volumes

Now, by looking at the current arrivals data, we can see that my tracker estimates approximately \$30K-\$40K of tasks completed per day. Given that I cannot observe redundancy, and that I may miss HITs that are getting posted and completed between my crawls, I may be underestimating. However, I may also be wrong by considering as "completed" tasks that were simply taken down, without being done. To be on the safe side, I will put my under-reporting factor somewhere between 1 and 10. In other words, I estimate the real daily volume to be somewhere between \$30K and \$400K. Yes, there is a huge difference between the two, but we get the order of magnitude, and you can be as pessimistic or as optimistic as you want.

These numbers generate a yearly transaction volume for Mechanical Turk between \$10M and \$150M. Given that Mechanical Turk takes 10% to 20% as fees, this is a revenue for Amazon between \$1M (low estimate) and \$30M (high estimate) per year.
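For readers who want to redo the back-of-the-envelope arithmetic with their own assumptions, here is a minimal sketch. The function names are made up for illustration, and the only inputs are the numbers quoted above (the \$30K-\$40K daily estimate, the 1x-10x under-reporting factor, and the 10%-20% fee).

```python
DAYS_PER_YEAR = 365


def yearly_volume_range(daily_low, daily_high, factor_low=1, factor_high=10):
    """Scale the daily estimates by the under-reporting factor, then annualize."""
    return (daily_low * factor_low * DAYS_PER_YEAR,
            daily_high * factor_high * DAYS_PER_YEAR)


def fee_revenue_range(volume_low, volume_high, fee_low_pct=10, fee_high_pct=20):
    """Apply the lowest fee to the low volume and the highest fee to the high one."""
    return volume_low * fee_low_pct // 100, volume_high * fee_high_pct // 100


volume = yearly_volume_range(30_000, 40_000)
revenue = fee_revenue_range(*volume)
print(f"yearly volume: ${volume[0]:,} to ${volume[1]:,}")    # $10,950,000 to $146,000,000
print(f"yearly fees:   ${revenue[0]:,} to ${revenue[1]:,}")  # $1,095,000 to $29,200,000
```

The rounded figures in the text (\$10M to \$150M, and \$1M to \$30M) are these numbers at one significant digit.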

What would be the value of Mechanical Turk as a startup?

I love that question. Not because it is sensible. But because I get to be completely tongue-in-cheek, and make fun of the absolutely ridiculous P/E ratio for the Amazon stock: Currently the trailing P/E for Amazon is a wonderful 2,681 (yep, not a typo). Assuming that the Mechanical Turk division generates some earnings in the \$1M to \$5M range, the valuation of Mechanical Turk is somewhere between roughly \$2.7 billion and \$13 billion! Not shabby for a 7-year-old startup :-p.

OK, getting more serious: The price-to-sales ratio for Amazon is somewhere in the 1.75 range. Therefore, given an estimated yearly transaction volume for Mechanical Turk between \$10M and \$150M, the estimated valuation for Amazon Mechanical Turk is somewhere between roughly \$15M (pathetic) and \$250M (respectable).

What is the growth?

While I am less certain about the numbers that have to do with the absolute transaction volume, I am much more confident about the growth numbers. Since my methodology remained the same over time, the growth of the sample should match reasonably well the growth of the overall market.

If you go again to the Arrivals tab on Mechanical Turk Tracker, and change the date range to go back to 2009, you will be able to see how the arrivals and completions have changed over time.

Forget about the absolute numbers. What is very clear is that the last few years were very good for Mechanical Turk. While the numbers were pretty low early on, there was a 3x to 6x YoY growth in terms of transaction volume. This was really healthy.

One thing that puzzles me is what happened around March 2012. My tracker seems to detect a sudden stop in the growth. I am not quite sure what is going on there. Is there something wrong with my crawler? Did something change on the Mechanical Turk site that caused a lower rate of completed jobs? I noticed, for example, that Amazon now puts the "Masters" qualification as a default option for all the HITs posted through the web interface. This can definitely decrease the rate of completing jobs but I am sure that it will also increase the overall level of satisfaction of the requesters with the answers submitted by the Turkers. Anyhoo, I do not have enough information, so I do not want to overanalyze that part.

Conclusion

Mechanical Turk is an interesting experiment for Amazon. It is not clear how important the project is to the rest of the company and how much Jeff Bezos supports the effort after all these years. But Bezos is well-known for planning for the long term, and my (imperfect) statistics tend to confirm (tentatively) that the market is on a good path.

Let's see how things play out...

Tuesday, November 13, 2012

Why I love crowdsourcing (the concept) and hate crowdsourcing (the term)

The term crowdsourcing is in fashion. It is being used to describe pretty much everything under the sun today.

Unfortunately, the word crowdsourcing is also getting increasingly associated with "getting things done for free", or at least at ultra-cheap prices. The "crowd" will generate the content for the website. The "crowd" will fix the mistakes. The "crowd" will do everything, and preferably for "points", for "badges", for a spot on the leaderboard, or maybe for a few pennies if we end up using Mechanical Turk.

But this association of the term crowdsourcing with low-cost labor is now visibly turning people off. Everybody wants to "use" the crowd but the workers in the crowd feel stiffed. The NoSpec movement was an early warning. The angry tone of some of the threads in Turker Nation is also an indication that many workers are not very happy with the way that they are treated by some requesters.

However, these negative associations are now endangering a very important concept: The idea that we can structure tasks in a way that is robust to the presence of imperfect workers, and that anyone can participate, as long as there is work available. Well-structured tasks allow the on-the-task evaluation of the workers, and can automatically infer whether someone is a good fit for a task or not.

This is not insignificant. It is well-known that one of the biggest barriers to breaking into the workforce is the lack of prior relevant experience. Students today often beg to get unpaid internships, just to have in their resume the lines with the coveted work experience. In online labor markets, newcomers often bid lower than what they would accept normally, just to build their feedback history. Crowdsourcing can change that.

But as long as crowdsourcing gets associated with low wages, nobody will see the real benefit: That work is within reach, immediately. That someone can experiment with different types of work easily (stock trading? product design?).

Perhaps a new term can describe better the true value of crowdsourcing, and also get the stigmatizing term "crowd" out of the name. (Nobody wants to be part of a "crowd".)

Personally, I favor the term "open work". As in the case of "open access" and "open source software", it describes the opportunity to access work, without barriers. I also like the "fair trade work" motto from MobileWorks but this is more closely connected to work being offered to developing countries. But I think that "open work" captures better the essence of the advantages behind crowdsourcing.

Update: The term open is indeed also associated with free-as-in-beer consumption. However, open can refer both to the supply-side (production) and the demand-side (consumption). For example:

Linux is open, in the sense that anyone can take the source code, modify it, and contribute back (open production); open source software is also available, often, for free, for installation to any machine (open consumption).
In publishing, open access typically means accessing papers without paying (open consumption), but there are also journals (e.g., PLoS ONE) that accept pretty much any technically-valid paper (open production).

In the case of crowdsourcing, "open work" would refer mainly to the open production side. As in the production side of open source, and open access publishing, it does not mean that the participants are not paid for the generation of the artifacts.

What do you think?

Sunday, October 21, 2012

New version of Get-Another-Label available

I am often asked what type of technique I use for evaluating the quality of the workers on Mechanical Turk (or on oDesk, or ...). Do I use gold tests? Do I use redundancy?

Well, the answer is that I use both. In fact, I use the code "Get-Another-Label" that I have developed together with my PhD students and a few other developers. The code is publicly available on GitHub.

We have updated the code recently, to add some useful functionality, such as the ability to pass (for evaluation purposes) the true answers for the different tasks, and get back answers about the quality of the estimates of the different algorithms.

So, now, if you have a task where the answers are discrete (e.g., "is this comment spam or not?", or "how many people in the photo? (a) none, (b) 1-2, (c) 3-5, (d) more than 5", etc) then you can use the Get-Another-Label code, which supports the following:

Allows any number of discrete categories, not just binary
Allows the specification of arbitrary misclassification costs (e.g., "marking spam as legitimate has cost 1, marking legitimate content as spam has cost 5")
Allows for seamless mixing of gold labels and redundant labels for quality control
Estimates the quality of the workers that participate in your tasks. The metric is normalized to be between 0% for a worker that gives completely random labels, and 100% for a perfect worker.
Estimates the quality of the data that are returned back by the algorithm. The metric is normalized to be 0% for data that have the same quality as unlabeled data, and 100% for perfectly labeled data.
Allows the use of evaluation data, that are used to examine the accuracy of the quality control algorithms, both for the data and for the worker quality.

Currently, we support vanilla majority voting, and the expectation-maximization algorithm to combine the labels assigned by the workers. We also support maximum likelihood, minimum cost, and "soft" classification schemes. In most cases, expectation maximization together with the minimum cost classification approach tends to work best, but you can try it yourself.

An important side-effect of reporting the estimated quality of the data is that you can then allocate further labeling resources to the data points that have the highest expected cost. Jing has done plenty of experiments and has concluded that, in the absence of any other information (e.g., who is the worker who will label the example), it is always best to focus the labeling efforts on the examples with the highest expected cost.
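To make the difference between majority voting, minimum cost classification, and "highest expected cost first" concrete, here is a toy sketch. It is not the Get-Another-Label implementation: the soft labels below are plain vote fractions rather than expectation-maximization estimates, worker quality is ignored, and all function names and the example data are hypothetical. The cost matrix is the one from the example earlier in this post (spam marked as legitimate costs 1, legitimate marked as spam costs 5).

```python
from collections import Counter

CATEGORIES = ["spam", "legit"]
# COSTS[(true_category, assigned_category)]; correct decisions cost 0.
COSTS = {("spam", "legit"): 1.0, ("legit", "spam"): 5.0}


def majority_vote(votes):
    """Return the most frequent label (ties go to the label seen first)."""
    return Counter(votes).most_common(1)[0][0]


def soft_label(votes):
    """Turn raw votes into a probability per category (simple vote fractions)."""
    counts = Counter(votes)
    return {c: counts[c] / len(votes) for c in CATEGORIES}


def expected_cost(soft, decision):
    """Expected misclassification cost of assigning `decision` to the example."""
    return sum(p * COSTS.get((true, decision), 0.0) for true, p in soft.items())


def min_cost_label(votes):
    """Pick the category with the lowest expected misclassification cost."""
    soft = soft_label(votes)
    return min(CATEGORIES, key=lambda d: expected_cost(soft, d))


def prioritize(items):
    """Order examples so that the highest remaining expected cost comes first."""
    def remaining_cost(votes):
        soft = soft_label(votes)
        return min(expected_cost(soft, d) for d in CATEGORIES)
    return sorted(items, key=lambda name: remaining_cost(items[name]), reverse=True)


votes = ["spam", "spam", "legit"]
print(majority_vote(votes))   # spam
print(min_cost_label(votes))  # legit

items = {"c1": ["spam", "spam", "spam"],
         "c2": ["spam", "spam", "legit"],
         "c3": ["spam", "legit"]}
print(prioritize(items))      # ['c2', 'c3', 'c1']
```

With asymmetric costs the two schemes disagree on the 2-to-1 vote: wrongly calling legitimate content spam is expensive enough that "legit" is the cheaper decision, and that same example is the one that would get the next label.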

I expect this version of the code to be the last iteration of the GAL codebase. In our next step, we will transfer GAL into a web service environment, allowing for streaming, real-time estimation of worker and data quality, and also allowing for continuous labels, supporting quality-sensitive payment estimation, and many other tasks. Stay tuned: Project-Troia is just around the corner.

Saturday, October 20, 2012

Why oDesk has no scammers

So, in my last blog post, I described a brief outline on how to use oDesk to execute automatically a set of tasks, in a "Mechanical Turk" style (i.e., no interviews for hiring and completely computer-mediated process for posting a job, hiring, and ending a contract).

A legitimate question appeared in the comments:

"Well, the concept is certainly interesting. But is there a compelling reason to do microstasks on oDesk? Is it because oDesk has a rating system?"

So, here is my answer: If you hire contractors on oDesk you will not run into any scammers, even without any quality control. Why is that? Is there a magic ingredient at oDesk? Short answer: Yes, there is an ingredient: Lack of anonymity!

It is a very well-known fact that if a marketplace allows anonymous participants and cheap generation of new identities, the marketplace is going to fall victim to malicious participants. There are many examples of markets that allowed anonymity and easy generation of pseudonyms, and that ultimately became "markets for lemons". Unfortunately, when you have cheap identity generation, the reputation system of the marketplace becomes extremely easy to manipulate.

So, what is different with oDesk? oDesk has contractors that are not anonymous and their userids are tied (strongly) to a real world identity (onymous?). For example, to withdraw money from oDesk into a bank account, the name on the bank account needs to match the name that is listed on oDesk. There are other mechanisms as well for verifying the identity of the contractors (e.g., when I listed myself as a contractor, I had to upload copies of my driving license, copies of my bank statements, etc), but the details of the implementation do not matter. The key element is to make it difficult or costly to create new or false identities.

Strong identity verification pretty much eliminates any type of scam. Why? Because the scammers cannot simply shut down their account after being caught scamming and create a new one. Therefore, with 99.9% probability, the oDesk contractors will not try to scam you. Now, do not get me wrong: you are going to run into incompetent contractors. But there is a difference between an incompetent contractor and one that deliberately tries to scam you.

As my colleague John Horton says: "An incompetent worker who puts some effort in the task is like a bad bus driver: Very slow to take you to your destination but at least you are going towards the correct place, albeit slowly. The scammers are like the unlicensed cab drivers that take you to a random place in order to demand arbitrary fare amounts afterwards to take you to your correct destination".

Sunday, October 14, 2012

Using oDesk for microtasks

Quite a few people keep asking me about Mechanical Turk. Truth be told, I have not used MTurk for my own work for quite some time. Instead I use oDesk to get workers for my tasks, and, increasingly, for my microtasks as well.

When I mention that people can use oDesk for micro-tasks, people often get surprised: "oDesk cannot be used through an API, it is designed for human interaction, right?" Oh well, yes and no. Yes, most jobs require some form of interviewing, but there are certainly jobs where you do not need to manually interview a worker before engaging them. In fact, with most crowdsourcing jobs having both the training and the evaluation component built into the working process, the manual interview is often not needed.

For such crowdsourcing-style jobs, you can use the oDesk API to automate the hiring of workers to work on your tasks. You can find the API at http://developers.odesk.com/w/page/12364003/FrontPage (Saying that the API page is, ahem, badly designed, is an understatement. Nevertheless, it is possible to figure out how to use it, relatively quickly, so let's move on.)

Here are the typical steps for a crowdsourcing-style contract on oDesk:

First, post a job: Use the "Post a Job" call from the Jobs API
Once the job is posted, poll the job openings to find who applied: Use the "List all the offers" call from the Offers API
Examine the details of the contractors that bid on the job: Use the "Get Offer" from the Offers API, to examine the details of each contractor. For example, for a task we had to have at most 10 people from a given country. So, the first 10 people from each country were hired, while subsequent applications from a country that had already 10 applicants were denied. Other people may decide not to hire contractors with less than 50 hours of prior work. It seems to be an interesting research topic to intelligently decide what aspects of the contractor matter most for a job, and hire/decline applications based on such info.
Make offers to the contractors: [That is the stupid part: Apparently the API does not allow the buyer to simply "accept" the bid by the contractor, although this is trivially possible through the web interface]. Use the "Post a Job" call, and create a new job opening for the contractor. Then use the "Make Offer" call from the Offers API, to generate an offer for the contractor(s) that you want to hire.

If you do not want to pay per hour, but rather per task, create an hourly contract, but set the maximum working hours per week at zero. Yes, this is not a mistake. You will be using the Custom Payments functionality to effectively submit "bonus payments" to the contractor.
Typically, it is better to have a mixture of both hourly wage and a fixed price component. You can have a no-hourly-wage policy by setting at 0 the maximum hours that can be charged, simulating MTurk. Or you can specify the hourly wage, and set the limit of how many hours can be charged per week.

Direct the contractor how to work: For that use the Message Center API, to send a message to the contractor, with the URL where you host your task. [Note: oDesk does not provide functionality for handling the task execution, so it is up to you to build that infrastructure. If you have ever built an "external HIT" on MTurk, you are ready to go. You just need to send the oDesk workers a URL where they can log in to your website, together with their username/password. You can go full force and allow oDesk authentication, but this seems a little bit too much for me.]
Whenever the contractor has completed enough tasks, use the Custom Payment API to submit the payment. Repeat as needed.
When the task is done, end the contract using the contracts API.
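The screening rule mentioned in the third step (hire the first 10 applicants from each country, optionally require some minimum number of prior hours) is easy to express in code once the offers have been fetched. The sketch below is deliberately independent of the oDesk API: it does not make any API calls, and the applicant fields (`id`, `country`, `hours_worked`) are hypothetical names for whatever the "Get Offer" call returns.

```python
from collections import defaultdict


def screen_applicants(applicants, max_per_country=10, min_hours=0):
    """Decide 'hire' or 'decline' for each applicant, in application order.

    applicants: list of dicts with (hypothetical) keys
    'id', 'country' and 'hours_worked'.
    """
    hired_per_country = defaultdict(int)
    decisions = []
    for applicant in applicants:
        country = applicant["country"]
        too_inexperienced = applicant["hours_worked"] < min_hours
        country_full = hired_per_country[country] >= max_per_country
        if too_inexperienced or country_full:
            decisions.append((applicant["id"], "decline"))
        else:
            hired_per_country[country] += 1
            decisions.append((applicant["id"], "hire"))
    return decisions


applicants = [
    {"id": "a1", "country": "GR", "hours_worked": 120},
    {"id": "a2", "country": "GR", "hours_worked": 10},
    {"id": "a3", "country": "US", "hours_worked": 300},
    {"id": "a4", "country": "GR", "hours_worked": 80},
    {"id": "a5", "country": "GR", "hours_worked": 90},
]
for applicant_id, decision in screen_applicants(applicants, max_per_country=2, min_hours=50):
    print(applicant_id, decision)
```

Each "hire" decision would then be followed by the "Post a Job" and "Make Offer" calls described above, and each "decline" by rejecting the application.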

That's all folks! In the next few weeks, I will try to post the code for some of the crowdsourcing experiments that we conducted with oDesk.

Sunday, July 29, 2012

The disintermediation of the firm: The future belongs to individuals

My experience with online outsourcing

I joined the Stern Business School back in 2004. In my first couple of years, my research approach was pretty much a continuation of my PhD years: I was doing a lot of coding and experimentation myself. However, at some point I got tired of writing boring code: Crawlers, front-end websites, and other "non-research" pieces of code were not only uninteresting but were also a huge drain of time.

So, I started experimenting with hiring coders. First locally at NYU. Unfortunately, non-research student coders turned out to be a bad choice. They were not experienced enough to write good code, and were doing this task purely for monetary reasons, not for learning. I got nothing useful out of this. Just expensive pieces of crap code.

In the summer of 2005, I started experimenting with online outsourcing. I tried eLance, Guru, and Rent-A-Coder. I tentatively started posting there programming projects that were not interesting conceptually (e.g., "crawl this website and store the data in a CSV file", "grab the CSV data from that website and put them in a database", "create a demo website that does that", etc).

Quickly, I realized that this was a win-win situation: The code was completed quickly, the quality of the websites was much better than what I could prepare myself, and I was free to focus on my research. Once I started getting PhD students, outsourcing non-research coding requirements became a key part of my research approach: PhD time was too valuable to waste on writing crawlers and dealing with HTML parsing peculiarities.

Seven years of outsourcing: Looking back

Seven years have passed since my first outsourcing experiments. I thought it is now a good time to look back and evaluate.

Across all outsourcing sites (excluding Mechanical Turk), I realized that I had posted and hired contractors for a total of 235 projects. Honestly, I was amazed by the number but amortized this is just one project per 10 days, which is reasonably close to my expectations.

Given the reasonably large number of projects, I thought that I may be able to do some basic quantitative analysis to figure out what patterns lead to my own, personal satisfaction with the result. I started coding the results, adding variables that were both personal (how much did I care about the project? how detailed were the specs? how much did I want to spend?) and contractor-specific (past history, country of origin, communication while bidding, etc).

Quickly, even before I finished coding, a pattern emerged: All the "exceeded expectations" projects were done by individual contractors or small teams of 2-3 people. All the "disappointments" were with contractors that were employees of a bigger contracting firm.

In retrospect, it is a matter of incentives: The employees do not have the incentive to produce to the maximum of their labor power. In contrast, individuals with their own company produce much closer to their maximum capacity; the contractor-owners are also connected to the product of their work, and they are better workers overall.

I would not attribute causality to my observation but rather self-selection: Individuals that are knowledgeable understand that the bigger firm does not have much to offer. In the past, the bigger firm was fulfilling the role of being visible and, therefore, bringing projects; the firm also offers a stable salary but for talented individuals this quickly becomes a stagnating salary.

With the presence of online marketplaces, the need to have a big firm to get jobs started decreasing. Therefore, the talented contractors do not really need a bigger firm to bring the work.

The capable individuals disintermediate the firm.

The emergence of the individual

Although the phenomenon is still in its infancy, I expect the rise of individuals and the emergence of small teams to be an important trend in the next few years. The bigger firms will feel the increased pressure from agile teams of individuals that can operate faster and get things done quicker. Furthermore, talented individuals, knowing that they can find good job prospects online, will start putting higher pressure on their employers: Either there is a more equitable share of the surplus, or the value-producing individuals will move into their own ventures.

Marx would have been proud: The value-generating labor is now getting into the position of reaping the value of the generated work. Ironically, this "emancipation" is happening through the introduction of capitalist free markets that connect the planet, and not through a communist revolution.

Friday, July 20, 2012

On Retention Rates

I spent this week in Suncadia, a small resort near Seattle, in the amazing workshop on Crowdsourcing Personalized Education, organized by Dan Weld, Mausam, Eric Horvitz, and Meredith Ringel Morris. (The slides should be available online soon.) It was an amazing workshop, and the quality of the projects that were presented was simply exceptional.

Beyond the main topic of the workshop (online education and crowdsourcing), I noticed one measure of success being mentioned by multiple projects: Retention. Retention is typically defined as the number of users that remain active, compared to the total number of users.

How exactly to define retention is a tricky issue. What is the definition of an "active" user? What is the total number of users? You can manipulate the number to give you back something that looks nice.

For example, for online courses many people list the number of registered participants as users (e.g., 160,000 students enrolled for the AI class). Then, if you take the 22,000 students that graduated as active, you get a retention rate of 13.75%.

Of course, if you want to make the number higher (13.75% seems low) you can just change the definition of what counts as user (e.g., "watched at least one video") and decrease the denominator, or change the definition of active user (e.g., "submitted an assignment") and increase the numerator.

A relatively common definition is the number of users that come back at least once a week, divided by the number of users registered in that time period. At least Duolingo, Foldit, and a few other projects seemed to have a similar definition. With this definition, a number of 20% and above is typically considered successful, as this was also noted to be the retention rate for Twitter.

How to measure retention in online labor?

So, I started wondering what the appropriate measure of retention is for online labor sites. The "come back at least once every week" definition is a weak one. We need people to engage with the available tasks.

One idea is to measure the percentage of users that earn >X dollars per week. To avoid comparing workers with different lifetimes, it is a good practice to compare users that started at the same time (e.g., the "May 2012 cohort") and see the retention rates stratified by cohort. The problem with this metric is that you need to examine it not only for different cohorts but also for different values of X.
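As a rough illustration of why this metric becomes a table rather than a single number, here is a minimal sketch that computes it for every (cohort, X) pair. The function name, the data layout, and the toy numbers are all made up for the example.

```python
from collections import defaultdict


def cohort_retention(cohort_of, weekly_earnings, thresholds):
    """Fraction of each cohort that earned more than X dollars this week.

    cohort_of: dict worker_id -> cohort label (e.g. "2012-05").
    weekly_earnings: dict worker_id -> dollars earned in the week examined;
    workers missing from this dict are treated as having earned nothing.
    Returns a dict keyed by (cohort, X).
    """
    cohort_sizes = defaultdict(int)
    for cohort in cohort_of.values():
        cohort_sizes[cohort] += 1

    retention = {}
    for x in thresholds:
        retained = defaultdict(int)
        for worker, cohort in cohort_of.items():
            if weekly_earnings.get(worker, 0) > x:
                retained[cohort] += 1
        for cohort, size in cohort_sizes.items():
            retention[(cohort, x)] = retained[cohort] / size
    return retention


# Toy data, purely hypothetical.
cohort_of = {"a": "2012-05", "b": "2012-05", "c": "2012-05", "d": "2012-05",
             "e": "2012-06", "f": "2012-06"}
weekly_earnings = {"a": 50, "b": 5, "d": 20, "e": 100}

for key, rate in sorted(cohort_retention(cohort_of, weekly_earnings, [1, 10]).items()):
    print(key, rate)
```

In the toy data, the May cohort looks better than the June cohort at X=1 (75% vs 50%) and identical at X=10 (50% vs 50%), which is exactly the kind of ambiguity that makes the choice of X matter.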

An alternative approach is to examine the "hours worked per week". In that case, we need to examine what percentage of the 40-hour working week is captured by the labor site.

Say that we have 500,000 registered users and we observe that at any given time we have 5,000 of them active on the site. (These are commonly quoted numbers for Mechanical Turk.) What does this 1% activity mean?

First, we need to see what a good comparative metric is. Suppose that full success is that all 500,000 workers come and work full time. In that case, we can expect an average activity level of (40*50)/(24*365)=22.8% (40 is the total working hours in a week, 50 is the working weeks in a year, and 24*365 is the total number of hours in a year). So an average of 22.8% of activity is the maximum attainable; to keep things simpler, we can say that seeing on average 20% of the users working on the site is perfect.

So, if a site has an average of 1% activity, it is not as bad as it sounds. Measured against the 20% ceiling, it is equivalent to 1 out of 20 registered users working full time on the site.
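The arithmetic above can be checked with a few lines of Python. This is only a sketch of the calculation in the text; the function names are made up for the illustration, and the rounded 20% ceiling is the simplification used above.

```python
HOURS_PER_WORK_WEEK = 40
WORK_WEEKS_PER_YEAR = 50
HOURS_PER_YEAR = 24 * 365

def max_activity_fraction():
    """Share of all hours in a year covered by one full-time worker."""
    return (HOURS_PER_WORK_WEEK * WORK_WEEKS_PER_YEAR) / HOURS_PER_YEAR

def full_time_equivalent_share(observed_activity, ceiling):
    """Share of registered users that would have to work full time
    to produce the observed average activity level."""
    return observed_activity / ceiling

registered = 500_000
active = 5_000
observed = active / registered        # 1% of users active at any time
ceiling = max_activity_fraction()     # about 22.8%

print(f"ceiling: {ceiling:.1%}")
print(f"full-time share (exact ceiling): "
      f"{full_time_equivalent_share(observed, ceiling):.1%}")
print(f"full-time share (rounded 20% ceiling): "
      f"{full_time_equivalent_share(observed, 0.20):.1%}")
```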

Friday, July 13, 2012

Why is oDesk called oDesk?

[This post has been removed after a request. You will need to stay in the dark until the Singularity. Or maybe not.]

Monday, July 9, 2012

Discussion on Disintermediating a Labor Channel

Last Friday, I wrote a short blog post with the title "Disintermediating a Labor Channel: Does it Make Sense?" where I argued that trying to bypass a labor channel (Mechanical Turk, oDesk, etc) in order to save on the extra fees does not make much sense.

Although there was no discussion in the comments, that piece generated a significant amount of feedback across various semi-private channels (fb/plus/twitter) and in many real-life discussions.

Fernando Pereira wrote on Google Plus:

Your argument sounds right, but I'm wondering about quality: can I control quality/biases in the outside labor platform? How do I specify labor platform requirements to meet my requirements? It could be different from quality control for outsourced widgets because outsourced labor units might be interdependent, and thus susceptible to unwanted correlation between workers.

Another friend wrote in my email:

So, do you advocate that oDesk should be controlling the process? Actually, I'd rather have higher control over my employees and know who is doing what.

Both questions have a similar flavor, which indicates that I failed to express my true thoughts on the issue.

I do not advocate giving up control of the "human computation" process. I advocate letting a third-party platform handle the "low level" recruiting and payment of the workers, preferably through an API-fied process. Payments, money transfer regulations, and immigration are big tasks that are best handled by specialized platforms. They are too much for most other companies. Handling such things on your own is as interesting as handling issues like air conditioning, electricity supply, and failed disks and motherboards when you are building a software application: Let someone else do these things for you.

Here is one useful classification that I think will clarify my argument further. Consider the different "service models" for crowdsourcing, which I have adapted from the NIST definition of cloud services.

Labor Applications/Software as a Service (LSaaS). The capability provided to the client is to use the provider’s applications running on a cloud-labor infrastructure. [...] The client does not manage or control the underlying cloud labor, with the possible exception of limited user-specific application configuration settings. Effectively, the client only cares about the quality of the provided results of the labor and does not want to know about the underlying workflows, quality management, etc. [Companies like CastingWords and uTest fall into this category: They offer a vertical service, which is powered by the crowd, but the end client typically only cares about the result]

Labor Platform as a Service (LPaaS). The capability provided to the client is to deploy onto the labor pool consumer-created or acquired applications created using programming languages and tools supported by the provider. The client does not manage or control the underlying labor pool, but has control of the overall task execution, including workflows, quality control, etc. The platform provides the necessary infrastructure to support the generation and implementation of the task execution logic. [Companies like Humanoid fall into this category: Creating a platform for other people to build their crowd-powered services on top.]

Labor Infrastructure as a Service (LIaaS). The capability provided to the client is to provision labor for the client, who then allocates workers to tasks. The consumer of labor services does not get involved with the recruiting process or the details of payment, but has full control over everything else. Much like the Amazon Web Services approach (use EC2, S3, RDS, etc. to build your app), the service provider just provides raw labor and guarantees that the labor force satisfies a particular SLA (e.g., response time within X minutes, has the skills that are advertised in the resume, etc.) [Companies like Amazon Mechanical Turk, oDesk, etc. fall into this category]

From these definitions, I believe that it does not make sense to build your own "infrastructure" if you are going to rely on remote workers. (I have a very different attitude for creating an in-house, local, team of workers that provides the labor, but this gets very close to being a traditional temp agency, so I do not treat this as crowdsourcing.)

I have no formed opinion on the "platform as a service" or a "software as a service" model (yet).

For the software as a service model, I think it is up to you to decide whether you like the output of the system (transcription, software testing, etc). The crowdsourcing part is truly secondary.

For the platform as a service model, I do not have enough experience with existing offerings to know whether to trust the quality assurance scheme. (Usual cognitive bias of liking-best-what-you-built-yourself applies here.) Perhaps in a couple of years, it would make no sense to build your own quality assurance scheme. But at this point, I think that we are all still relying on bespoke, custom-made schemes, with no good argument to trust a standardized solution offered by a third-party.

Friday, July 6, 2012

Disintermediating a Labor Channel: Does it Make Sense?

Over the years, I have talked with plenty of startups on building crowdsourcing services and platforms. Mechanical Turk (for the majority) and oDesk (for the cooler kids :-p) are common choices for recruiting workers. (For a comparison of the two, based on my personal experiences, look here.)

A common aspiration of many startups is to be able to build their own labor force and channel. Through Facebook, through cell phone, through ads, everyone wants to have direct control of the labor.

My reaction: This is stupid!

(Usual disclaimer that I work for oDesk for this year, etc., applies here, but I will stand behind my opinion even without any relationship to any labor marketplace.)

A very short-sighted reason for this is cost savings: oDesk and Mechanical Turk have a 10% fee. Therefore by disintermediating the labor platform, the company can save 10% of the labor cost. Well, to immediately make the adjustment, you do not save 10%. You save maximum 7%. The other 3% is the fee that will be taken by the payment channel (credit card, paypal, etc). The fact that the cost is borne by the worker when using Paypal is not true savings. For foreign workers, you also have a 1%-2% hit when using Paypal or a credit card, which goes on top of the best FX exchange rates. Add extra overhead to handle fraud, mistakes, and other things-that-happen, and the true savings are at most 5% to 6%.
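The back-of-the-envelope numbers above can be written down as a short Python sketch. The percentages are the ones quoted in the text; the function name is made up for the illustration, and the fraud and mistakes overhead is left at zero by default because the text does not put a number on it.

```python
def net_savings(platform_fee, payment_fee, fx_fee, overhead=0.0):
    """Net saving (in percent of labor cost) from bypassing the platform.

    platform_fee: fee charged by the labor platform (10 in the text)
    payment_fee:  fee taken by the payment channel (3 in the text)
    fx_fee:       extra FX hit for foreign workers (1 to 2 in the text)
    overhead:     fraud, mistakes, etc. (not quantified in the text)
    """
    return platform_fee - payment_fee - fx_fee - overhead

best_case = net_savings(platform_fee=10, payment_fee=3, fx_fee=1)
worst_case = net_savings(platform_fee=10, payment_fee=3, fx_fee=2)
print(f"net savings before overhead: {worst_case}% to {best_case}%")
```

Any positive overhead only pushes these numbers lower, which is why 5% to 6% is an upper bound.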

But even 5%, isn't that something worth saving? No.

Problem #1: If you are a small startup, saving 5% in labor costs should not be the goal. Developing and managing a payment system, and handling complaints about payments, is going to cost much more in development time than the corresponding savings. Creating a payment network is typically not at the core of a crowdsourcing startup, and it should not be. Let others deal with the payment and build your product.

Problem #2: If you are a bigger company, saving 5% in labor costs may be more important. However, if we are talking about labor, then bigger companies start hitting compliance issues. Handling money laundering regulations, handling IRS regulations, and many other HR-related aspects are typically worth the 5% extra. Who wants to be in the HR business if they have a product that is doing something else?

So, why do people still obsess about this? Why does everyone want to build their own labor platform?

Well, because VCs ask for this. "If you are building on top of MTurk/oDesk/whatever, what is your competitive advantage? What prevents others from duplicating what you have done?"

The knee-jerk reaction to this demand from VCs is to build a bespoke labor network. Which works fine, as long as you are talking about a relatively small network. Once the size of the labor force becomes bigger, then other problems appear: Identity verification, compliance, regulations, and immigration are all time-consuming tasks. (Especially when dealing with foreign contractors.) And they are never tasks that add value to the company. They are all pure overhead, and solving such issues is absolutely non-trivial.

Do you think it is accidental that Amazon does not pay MTurk contractors outside India and the US in cash? Having seen from the inside at oDesk what the overhead is to build reliable and compliant solutions for handling international payments, I can easily say: Stay away, this is not something you want to do at scale, having to deal with bureaucrats from all different countries around the world.

The parallel with building your own data centers vs getting computing resources from the cloud is direct and should be evident. Unless there is a very good reason to handle your own machines (and space, and air conditioning, and handling electrical failures over the summer, etc.), you just build your infrastructure using the cloud. Same thing with labor.

Allocating resources to handle overhead tasks is taking away resources from the main goal: Building a better product! Let others take care of infrastructural issues.

Monday, July 2, 2012

Visualizations of the oDesk "oConomy"

[Crossposted from the oDesk Blog. Blog post written together with John Horton.]

A favorite pastime of the oDesk Research Team is to run analyses using data from oDesk’s database in order to provide a better understanding of oDesk’s online workplace and the way the world works. Some of these analyses were so interesting we started sharing them with the general public, and posted them online for the world to see.

Deep inside, however, we were not happy with our current approach. All our analyses and plots were static. We wanted to share something more interactive, using one of the newer javascript-based visualization packages. So, we posted a job on oDesk looking for d3.js developers and found Zack Meril, a tremendously talented Javascript developer. Zack took our ideas and built a great tool for everyone to use:

The oDesk Country Dashboard

This dashboard allows you to interactively explore the world of work based upon oDesk’s data. We list below some of our favorite discoveries from playing with its visualizations. Do let us know if you find something interesting. Note that the tool supports “deep linking,” which means that the URL in your address bar fully encodes the view that you see.

Visualization #1: Global Activity

The first interactive visualization shows the level of contractor activity of different countries across different days of the week and times of day. The pattern seems pretty “expected”:

On second thought, though, we started wondering. Why do we see such regularity? The x-axis is GMT time. Given that oDesk is a global marketplace, shouldn’t the contractor activity be smoother? Furthermore, oDesk has a relatively small number of contractors from Western Europe, so it seems kind of strange that our contractor community generally follows the waking and sleeping patterns of the UK. If you hover over the visualization, you get a closer look at what contractors are doing throughout the world:

At 8am GMT on Wednesday morning: Russia, India, and China are awake and their activity is increasing.

As we move towards the peak of the global activity at 3pm, the activity of the Asian countries has already started declining. However, at the same time North and Latin America start waking up, compensating for the decrease in activity in Asia, and leading to the world peak.

After 4pm GMT, Asia starts going to sleep, and the activity decreases. The activity continues to decline as America signs off, hitting the low point of activity at 4am GMT (but notice how China, Philippines, and Australia start getting active, preventing the activity level from going to zero).

Visualization #2: Country-Specific Activity

A few weeks back, we also wrote about the rather unusual working pattern of the Philippines: contractors from the Philippines tend to keep a schedule that mostly follows U.S. working hours, rather than a “normal” 9-5 day. Since then, we realized that the Philippines is not the only country following this pattern. For example, Bangladesh and Indonesia have similar activity patterns to the Philippines. So, we thought, why not make it easy to explore and find working patterns? They reveal something about the culture, habits, and even type of work that gets done in these countries. A few findings of interest:

Visualization #3: Work Type By Country

Finally, we wondered “What are the factors that influence these working patterns?” Why do some culturally similar countries have very similar working patterns (e.g., Russia and Ukraine), while others have very different patterns (e.g., Pakistan, Bangladesh, and India)? So, with our third visualization we examine types of work completed on oDesk broken down by country. We used the bubble chart from d3.js to visualize the results. Here is, for example, the breakdown for U.S.:

U.S. contractors are mainly working in tasks related to writing. We do see many clients explicitly limit their search for writing contractors to U.S.-based only, both for English proficiency but also (and perhaps more importantly) for the cultural affinity of the writers to their audience. Take a look at Russia: Almost all the work done in Russia is Web programming and design, followed by mobile and desktop development.

At the opposite end is the Philippines, where few programming tasks are being completed, but significant amounts of data entry, graphic design, and virtual assistant work happen:

Another interesting example is Kenya. As you can see, most of the work done there (and there is a significant amount of work done in Kenya) is about blog and article writing:

Exploring Further: Activity Patterns and Types of Projects

One pattern that was not directly obvious was the correlation between activity patterns and type of work. Countries that are engaging mainly in computer programming tend to have a larger fraction of users that use oDesk. For example, see the similarity in the activity patterns of Bolivia, Poland, Russia, and Ukraine: and the corresponding project types that get completed in these countries:

We should note however that the opposite does not hold: There are other countries that have similar activity patterns and high degree of contractor stickiness (e.g., Argentina, Armenia, Bolivia, Belarus, China, Uruguay, and Venezuela) that have rather different project completion dates.

Source available on Github

One thing that attracted me to spend my sabbatical at oDesk was the fact that oDesk has been pretty open with its data from the beginning. To this end, you will notice that the Country Explorer is an open source project, so you are welcome to just fork us on Github and get the code for the visualizations.

New ideas and visualizations

I am thinking of what other types of graphs would be interesting to create. Supply and demand of skills? Asking prices and transaction prices of contractors across countries and across skills? Of course, if you have specific ideas you’d like to see us work on, tell us in the comments! Happy to explore directions and data that you are interested in exploring.

Thursday, June 21, 2012

The oDesk Flower: Playing with Visualizations

In the past couple of weeks, while at oDesk, I have been trying to learn the data stored in the database, creating random plots to understand what is happening in the market. My absolute favorite source of data is the data about the micro-level activity of the workers (when they work, how much they type, how much they move the mouse, etc.).

A few weeks back, I posted a blog post about the activity levels of different countries, with the basic observation that the activity in the Philippines fluctuates much less within the 24-hr day compared to all other countries.

You are doing it wrong: The use of radar plots

After I posted that plot, I received the following email:

This is periodic data, which means modular thinking. When you visualize periodic data using a linear plot, you necessarily have a cutting point for the x-axis, which can affect the perception of various trends in the data. You should use something similar to the Flickr Flow, e.g a radar plot in Excel.

So, following the advice of people that really understand visualization, I transformed the activity plot into a radar plot (in Excel):

The oDesk Flower

As you can see, indeed the comment was correct. Given the periodicity of the data, having a cyclical display is better than having a single horizontal line display. Beautiful to look at? Check. I called this visualization "The oDesk Flower" :-)

Unfortunately, it is not truly informative due to the huge number of countries in the plot. But I think it works well to give the global pace of activity over the week and across countries.

One thing that I did not like in this plot was the fact that I could not really compare the level of activity from one country to another. So, I normalized the values to be the percentage of contractors from that country that are active. A new flower emerged:

For comparison, here is the corresponding linear plot, illustrating the percentage of contractors from various countries that are active at any given time:

Fighting overplotting using kernel smoothing and heatmaps

The plot above is kind of interesting and indeed it shows the pattern of activity. However, we have a lot of "overplotting", which makes the plot busy. It is hard to understand where the majority of the lines are falling.

To understand better the flow of the lines, I decided to play a little bit with R. I loaded the data set with the activity line from each country, and then used kernel based smoothing (bkde2D) to find the regions of the space that had the highest density. To plot the result, I used a contour plot (filled.contour), which allows for the easy generation of heatmaps. Here is the R code:

and here is the resulting plot:

I like how this plot shows the typical activity across countries, which ranges from 2% to 6% of the total registered users. At the same time, we can see (the yellow-green "peaks") that there are also countries that have 8% to 10% of their users being active every week.
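For readers who want to experiment with the idea, here is a rough Python analogue of the density step. It is not the original R code: it evaluates a plain two-dimensional Gaussian kernel density estimate directly on a grid, whereas bkde2D uses a binned approximation. The toy points, grid, and bandwidths are hypothetical; the resulting grid is the kind of matrix one would hand to a filled contour plot.

```python
import math

def kde2d(points, x_grid, y_grid, bw_x, bw_y):
    """Evaluate a 2D Gaussian product-kernel density estimate on a grid.

    Returns a list of rows: density[i][j] is the estimate at
    (x_grid[j], y_grid[i]).
    """
    norm = 1.0 / (len(points) * 2.0 * math.pi * bw_x * bw_y)
    density = []
    for gy in y_grid:
        row = []
        for gx in x_grid:
            total = 0.0
            for px, py in points:
                zx = (gx - px) / bw_x
                zy = (gy - py) / bw_y
                total += math.exp(-0.5 * (zx * zx + zy * zy))
            row.append(norm * total)
        density.append(row)
    return density

# Hypothetical toy data: (hour of week, percent of contractors active).
# Three "countries" sit near 4% and one outlier sits at 9%.
points = [(10, 3.8), (10, 4.0), (10, 4.2), (10, 9.0)]
x_grid = [9, 10, 11]
y_grid = [2, 4, 6, 8, 10]
density = kde2d(points, x_grid, y_grid, bw_x=1.0, bw_y=1.0)

# Locate the grid cell with the highest estimated density.
peak_value, peak_x, peak_y = max(
    (density[i][j], x_grid[j], y_grid[i])
    for i in range(len(y_grid))
    for j in range(len(x_grid))
)
print(peak_x, peak_y)
```

The densest cell lands where most of the lines overlap, which is the information that overplotting hides in the raw line chart.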

Need for interactivity

So, what did I learn from all these exercises? While I could create nice plots, I felt that static visualizations are, in the end, of limited value. Other people cannot do any dynamic exploration of the data. Nobody can customize the plot to show a slightly different view, and in general we lack the flexibility given by, say, the visualization gadgets of Google or by the data-driven documents created using d3.js.

I would love to be able to create some more interactive plots and let other people play with and explore the data that oDesk has. Perhaps I should hire a contractor on oDesk to do that :-)

Friday, May 25, 2012

The Emergence of Teams in Online Work

When I started as an assistant professor, back in 2004, and joined the NYU/Stern Business School, I found myself in a strange position. I had funding to spend, but no students to work with. I had work to be done (mainly writing crawlers) that was time-consuming, but not particularly novel or intellectually rewarding. Semi-randomly, at the same time, I heard about the website Rent-A-Coder, which was being used by undergraduate students that were "outsourcing" their programming assignments. I started using Rent-A-Coder, tentatively at first, to get programming tasks done, and then, over time, I got fascinated by the concept of online work, and the ability to hire people online and get things done. (My Mechanical Turk research and my current appointment at oDesk are a natural evolution of these interests.)

As I started completing increasingly complicated projects using remote contractors, I started thinking about how we can best manage a diverse team of remote workers, each one being in a different location, working on different tasks, etc. The topic raises many interesting questions, both in terms of theory and in terms of developing practical "best practices" guidelines.

While trying to understand better the theoretical problems that arise in the space, I was reading the paper "Online Team Formation in Social Networks" that was published in WWW2012; the paper describes a technique for identifying teams of people in a social network (i.e., graph) that have complementary skills and can form a well-functioning unit, and tries to do so while preserving workload restrictions for individual workers.

Given my personal experience, from the practical side, and the existence of research papers that deal with the topic, I got curious to understand whether the topic of online team formation is a fringe topic, or something that deserves further attention.

Do we see teams being formed online? If yes, is this a phenomenon that increases in significance?

So, I pulled the oDesk data and tried to answer the question.

How many teams have a given size? How does this distribution evolve over time? I plotted the number of projects in each week that had x contractors that were active in the project (i.e., billed some time).

The results were revealing: Not only do we observe teams of people being formed online, but we also see an exponential increase in the number of teams of any given size.
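The counting behind such a plot can be sketched in a few lines of Python. This is only an illustration: the record layout (week, project id, contractor id for every contractor who billed time) and the toy data are assumptions, not the actual oDesk schema.

```python
from collections import Counter, defaultdict

def team_size_distribution(billing_records):
    """Count, per week, how many projects had a given number of
    active contractors.

    billing_records: iterable of (week, project_id, contractor_id),
    one record per billing event (duplicates are fine).
    Returns {week: Counter({team_size: number_of_projects})}.
    """
    members = defaultdict(set)
    for week, project_id, contractor_id in billing_records:
        members[(week, project_id)].add(contractor_id)

    distribution = defaultdict(Counter)
    for (week, _project_id), contractors in members.items():
        distribution[week][len(contractors)] += 1
    return distribution

# Hypothetical toy data.
records = [
    ("2012-W20", "p1", "a"),
    ("2012-W20", "p1", "b"),
    ("2012-W20", "p1", "a"),  # same contractor billing twice
    ("2012-W20", "p2", "c"),
    ("2012-W21", "p1", "a"),
]
dist = team_size_distribution(records)
for week in sorted(dist):
    print(week, dict(dist[week]))
```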

In fact, in the above graph, if we account for the fact that bigger teams contain an (exponentially) larger number of people, we can see that the majority of the online workers today are not working as individuals but are now part of an online team.

Update [thanks for the question, Yannis!]: Since the exponential growth of oDesk.com makes it difficult to understand the fraction of people working in teams and whether it is increasing or decreasing, here is the chart that shows what percentage of workers work in teams of a given size:

What is interesting is the consistent decrease in the fraction of people working alone (teams of one), and in teams of 2-3. Instead, we see a slow but consistent increase in teams of size 4-7 and 8-16, as an overall fraction of the population. As you can see, over the last year, the percentage of contractors in teams of size 4-7 is getting close to surpassing the percentage of contractors working alone. Similarly, the percentage of contractors in teams of 8-16 is getting close to surpassing the percentage of contractors in teams of 2-3. The trends for bigger teams also seem to be increasing, but there is still too much noise to be able to infer anything.
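To see why a few large teams can account for a large share of the workers, here is a small Python sketch that weights each team by its size and groups teams into the buckets used above. The "17+" bucket and the toy data are assumptions added for the example, and a contractor active on two projects is counted once per project.

```python
from collections import Counter

# Team-size buckets taken from the discussion above.
BUCKETS = [(1, 1, "1"), (2, 3, "2-3"), (4, 7, "4-7"), (8, 16, "8-16")]

def bucket_label(size):
    """Map a team size to its bucket label."""
    for low, high, label in BUCKETS:
        if low <= size <= high:
            return label
    return "17+"  # assumed catch-all for larger teams

def worker_share_by_bucket(team_sizes):
    """team_sizes: one entry per active project in a given week.
    Returns the fraction of contractor slots falling in each bucket."""
    workers = Counter()
    for size in team_sizes:
        workers[bucket_label(size)] += size  # a team of n holds n workers
    total = sum(workers.values())
    return {label: count / total for label, count in workers.items()}

# Hypothetical week: four solo projects, one pair, one team of four.
shares = worker_share_by_bucket([1, 1, 1, 1, 2, 4])
print(shares)
```

In this toy week the single team of four holds as many workers as the four solo projects combined, even though solo projects are four times as common.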

What's coming?

Given the trend for online work to be done in teams, formed online, I expect to see a change in the way that many companies are being formed in the future. At this point, it seems far fetched that a startup company can be formed online, being distributed across the globe, and operate on a common project. (Yes, there are such teams but they are more of an exception, rather than the norm.)

But if these trends continue, expect sooner rather than later to see companies naturally hiring online and working with remote collaborators, no matter where the talent is located. People have been talking about online work being an alternative to immigration, but this seemed to be a solution for the remote future.

With the exponential increase that we observe, the future may come much sooner than expected.

Thursday, May 10, 2012

TREC 2012 Crowdsourcing Track

TREC 2012 Crowdsourcing Track - Call for Participation

June 2012 – November 2012
https://sites.google.com/site/treccrowd/

Goals

As part of the National Institute of Standards and Technology (NIST)'s annual Text REtrieval Conference (TREC), the Crowdsourcing track investigates emerging crowd-based methods for search evaluation and/or developing hybrid automation and crowd search systems.

This year, our goal is to evaluate approaches to crowdsourcing high quality relevance judgments for two different types of media:

textual documents
images

For each of the two tasks, participants will be expected to crowdsource relevance labels for approximately 20k topic-document pairs (i.e., 40k labels when taking part in both tasks). In the first task, the documents will be from an English news text corpus, while in the second task the documents will be images from Flickr and from a European news agency.

Participants may use any crowdsourcing methods and platforms, including home-grown systems. Submissions will be evaluated against a gold standard set of labels and against consensus labels over all participating teams.

Tentative Schedule

Jun 1: Document corpora, training topics (for image task) and task guidelines available
Jul 1: Training labels for the image task
Aug 1: Test data released
Sep 15: Submissions due
Oct 1: Preliminary results released
Oct 15: Conference notebook papers due
Nov 6-9: TREC 2012 conference at NIST, Gaithersburg, MD, USA
Nov 15: Final results released
Jan 15, 2013: Final papers due

Participation

To take part, please register by submitting a formal application directly to NIST (even if returning participant). See the bottom part of the page at http://trec.nist.gov/pubs/call2012.html

Participants should also join our Google Group discussion list, where all track related communications will take place.

Organizers

Gabriella Kazai, Microsoft Research
Matthew Lease, University of Texas at Austin
Panagiotis G. Ipeirotis, New York University
Mark D. Smucker, University of Waterloo

Further information

For further information, please visit https://sites.google.com/site/treccrowd/

We welcome any questions you may have, either by emailing the organizers or by posting on the Google Group discussion page.

Saturday, May 5, 2012

ACM EC 2012 schedule

Schedule at a glance:

And the papers within each session:

Wednesday, April 25, 2012

The Google attack: How I attacked myself using Google Spreadsheets and I ramped up a $1000 bandwidth bill

It all started with an email.

From: Amazon Web Services LLC

Subject: Review of your AWS Account Estimated Month to Date Billing Charges of $720.85

Greetings from AWS,

During a routine review of your AWS Account's estimated billing this month, we noticed that your charges thus far are a bit larger than previous monthly charges. We'd like to use this opportunity to explore the features and functionality of AWS that led you to rely on AWS for more of your needs.

You can view your current estimated monthly charges by going here:

https://aws-portal.amazon.com/gp/aws/developer/account/index.html?ie=UTF8&action=activity-summary

AWS Account ID: XXXXXXX27965

Current Estimated Charges: $720.85

If you have any feedback on the features or functionality of AWS that has helped enable your confidence in our services to begin ramping your usage we would like to hear about it. Additionally, if you have any questions pertaining to your billing, please contact us by using the email address on your account and logging in to your account here:

https://aws-portal.amazon.com/gp/aws/html-forms-controller/contactus/aws-account-and-billing

Regards,

AWS Customer Service

This message was produced and distributed by Amazon Web Services LLC, 410 Terry Avenue North, Seattle, Washington 98109-5210

What? $720 in charges? My usual monthly charges for Amazon Web Services were around $100, so getting this email with a usage of $720 after just two weeks within the month was a big alert. I logged in to my account to see what was going on, and I saw this:

An even bigger number: $1177.76 in usage charges! A thousand, one hundred, seventy seven dollars. Out of which $1065 in outgoing bandwidth transfer costs. The scary part: 8.8 Terabytes of outgoing traffic! Tera. Not Giga. Terabytes.

To make things worse, I realized that the cost was going up hour after hour. Fifty to a hundred dollars more in billing charges with each. passing. hour. I started sweating.

What happened?

Initially I was afraid that a script that I set up to back up my photos from my local network to S3 had consumed that bandwidth. But then I realized that I had been running this backup-to-S3 script for a few months, so it could not suddenly start consuming more resources. In any case, all the traffic that is incoming to S3 is free. This was a matter of outgoing traffic.

Then I started suspecting that the cause of this spike may be the developers that are working on various projects of mine. Could they have mounted the S3 bucket on an EC2 machine that is in a different region? In that case, we may indeed have a problem, as all the I/O operations happening within that machine would count as bandwidth costs. I checked all my EC2 machines. No, this was not the problem. All EC2 machines are in us-east, and my S3 buckets are all in US Standard. No charges for operations between EC2 machines and S3 buckets within the same region.

What could be causing this? Unfortunately, I did not have any logging enabled for my S3 buckets. I enabled logging and waited to see what would happen next. But logging would take a few hours, and the bandwidth meter was running. No time to waste.

Thankfully, even in the absence of logging, Amazon provides access to the usage reports of all the AWS resources. The report indicated the bucket that was causing the problem:

My S3 bucket with the name "t_4e1cc9619d4aa8f8400c530b8b9c1c09" was generating 250GB of outgoing traffic, per hour.

Two-hundred-fifty Gigabytes. Per hour.

At least I knew what the source of the traffic was. It was a big bucket with images that were being used for a variety of tasks on Amazon Mechanical Turk.

But something was still strange. The bucket was big, approximately 250GB of images. Could Mechanical Turk generate so much traffic? Given that the average size of each image was 500KB to 1MB, the bucket would have to be serving 250,000 to 500,000 images per hour. This is on the order of 100 requests per second.

There was no way that Mechanical Turk was responsible for this traffic. The cost of the Mechanical Turk tasks would have dwarfed the cost of bandwidth. Somehow the S3 bucket was being "Slashdotted" but without being featured on Slashdot or in any other place that I was aware of.
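The request-rate estimate can be reproduced with a short Python sketch. It assumes decimal units (1GB = 1000MB) and uses the two image sizes quoted above as the bounds; the function names are made up for the illustration.

```python
def images_per_hour(gb_per_hour, image_size_mb):
    """Number of images served per hour for a given outgoing volume,
    assuming decimal units (1 GB = 1000 MB)."""
    return gb_per_hour * 1000 / image_size_mb

def requests_per_second(gb_per_hour, image_size_mb):
    """Average request rate implied by the outgoing volume."""
    return images_per_hour(gb_per_hour, image_size_mb) / 3600

for size_mb in (1.0, 0.5):
    print(f"{size_mb} MB images: "
          f"{images_per_hour(250, size_mb):,.0f} images/hour, "
          f"{requests_per_second(250, size_mb):.0f} requests/second")
```

Under these assumptions the rate comes out at roughly 70 to 140 requests per second, the same order of magnitude as the figure quoted above.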

Strange.

Very strange.

Checking the Logs

Well, I enabled logging for the S3 bucket, so I was waiting for the logs to appear.

The first logs showed up and I was in for a surprise. Here are the IPs and the User-Agents of the requests.


74.125.156.82 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.64.83 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.64.84 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.156.81 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.158.86 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.156.92 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.64.87 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.156.81 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.158.82 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.158.85 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.158.89 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.156.83 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.158.90 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.156.92 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.64.85 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.158.82 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.156.88 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.158.86 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.156.89 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.158.83 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.156.94 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.156.83 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.158.88 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.156.83 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.64.92 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.156.80 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.64.88 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.158.84 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.158.87 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)

So, it was Google that was crawling the bucket. Aggressively. Very aggressively.

Why would Google crawl this bucket? Yes, the URLs were technically public but there was no obvious place to get the URLs. Google could not have gotten the URLs from Mechanical Turk. The images in the tasks posted to Mechanical Turk are not accessible to Google to crawl.

At least I knew it was Google. I guess, somehow, I let Google learn about the URLs of the images in the bucket (how?) and Google started crawling them. But something was still puzzling. How can an S3 bucket with 250GB of data generate 40 times that amount of traffic? Google would just download each object once and be done with it. It would not re-crawl the same object many times.

I checked the logs again. Interestingly enough, there was a pattern: Each image was being downloaded every hour. Every single one of them. Again and again. Something was very very strange. Google kept launching its crawlers, repeatedly, to download the same content in the S3 bucket, every hour. For a total of 250GB of traffic, every hour.
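This kind of pattern is easy to check for once the logs are reduced to (timestamp, object key) pairs. The sketch below is only an illustration: the records are made-up sample data in a simplified form, not the actual S3 access log format, and the function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical, simplified log records: (ISO timestamp, object key).
records = [
    ("2000-01-01T10:05:00", "img/a.jpg"),
    ("2000-01-01T11:03:00", "img/a.jpg"),
    ("2000-01-01T12:07:00", "img/a.jpg"),
    ("2000-01-01T10:06:00", "img/b.jpg"),
    ("2000-01-01T11:04:00", "img/b.jpg"),
    ("2000-01-01T10:30:00", "img/c.jpg"),
]


def hours_fetched(records):
    """Map each object key to the set of distinct clock hours in which it was requested."""
    seen = defaultdict(set)
    for timestamp, key in records:
        hour = datetime.fromisoformat(timestamp).replace(minute=0, second=0, microsecond=0)
        seen[key].add(hour)
    return seen


def refetched_keys(records, min_hours=2):
    """Return the keys that were requested in at least `min_hours` different hours."""
    return sorted(key for key, hours in hours_fetched(records).items() if len(hours) >= min_hours)


print(refetched_keys(records))
```

If nearly every key in the bucket shows up in every hour, as it did in my logs, something is re-fetching the whole bucket on a schedule.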

Google should be smarter than that. Why waste all that bandwidth re-downloading an identical image every hour?

Why would Google download the same images again and again?

Wait, this is not the real Google crawler...

Looking more carefully, there was one red flag. This was not the Google crawler. The Google crawler is named Googlebot for web pages and Googlebot-Image for images. It is not called Feedfetcher, as this user agent was.
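The distinction is easy to miss when skimming logs, so a small helper can make it explicit. This is a minimal sketch that relies on simple substring matching of the user agent string; the function name and the returned labels are my own, and user agent strings can of course be spoofed.

```python
def classify_google_agent(user_agent: str) -> str:
    """Label a user agent string as one of Google's fetchers, or 'other'."""
    ua = user_agent.lower()
    if "feedfetcher-google" in ua:
        return "feedfetcher"
    # Check the image crawler first, since its name also contains "googlebot".
    if "googlebot-image" in ua:
        return "googlebot-image"
    if "googlebot" in ua:
        return "googlebot"
    return "other"


sample = "Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)"
print(classify_google_agent(sample))
```

Run against the log lines above, every single request comes back as Feedfetcher and none as Googlebot.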

What the heck is Feedfetcher? A few interesting pieces of information from Google:

Feedfetcher is how Google grabs RSS or Atom feeds when users choose to add them to their Google homepage or Google Reader
Feedfetcher retrieves feeds only after users have explicitly added them to their Google homepage or Google Reader
[Feedfetcher] is not retrieving content to be added to Google's search index
Feedfetcher retrieves feeds only after users have explicitly added them to their Google homepage or Google Reader. Feedfetcher behaves as a direct agent of the human user, not as a robot, so it ignores robots.txt

Interesting. So these images were in some form of a personal feed.

Shooting myself in the foot, the Google Spreadsheet way

And this information started unraveling the full story. I remembered!

All the URLs for these images were also stored in a Google Spreadsheet, so that I could inspect the results of the crowdsourcing process. (The spreadsheet was not being used or accessed by Mechanical Turk workers, it was just for viewing the results.) I used the =image(url) function to display a thumbnail of the image in a spreadsheet cell.

So, all this bandwidth waste was triggered by my own stupidity. I asked Google to download all the images to create the thumbnails in Google Spreadsheet. Talk about shooting myself in the foot. I launched the Google crawler myself.

But why did Google download the images again and again? That seemed puzzling. It seemed perfectly plausible that Google would fetch 250GB of data (i.e., the total size of the bucket), although I would have gone for a lazy evaluation approach (i.e., loading on demand, as opposed to pre-fetching). But why download the same content again and again?

Well, the explanation is simple: Apparently Google is using Feedfetcher as a "URL fetcher" for all sorts of "personal" URLs someone adds to its services, and not only for feeds. Since these URLs are private, Google does not want to store them permanently anywhere on its servers. This makes perfect sense from the point of view of respecting user privacy. The problem is that it does not allow for any form of caching, as Google does not store the personal data anywhere.

So, every hour, Google was launching the crawlers against my bucket, generating a tremendous amount of crawler traffic. Notice that even if I had a robots.txt, Feedfetcher would have ignored it. (Furthermore, it is not possible to place a robots.txt file in the root directory of https://s3.amazonaws.com, as this is a common server for many different accounts.)

The final touch in the overall story? Normally, if you were to do the same thing with URLs from a random website, Google would have rate limited its crawlers, so as not to overload the website. However, the s3.amazonaws.com domain is a huuuge domain, containing terabytes (petabytes?) of web content. Google has no reason to rate limit against such a huge domain with huge traffic. It made perfect sense to launch 100+ connections per second against a set of URLs that were hosted in that domain...

So, I did not just shoot myself in the foot. I took a Tsar Bomba and I launched it against my foot. The $1000 bandwidth bill (generated pretty much within a few hours) was the price of my stupidity.

Ooof, mystery solved. I killed the spreadsheet and made the images private. Google started getting 403 errors, and I hope that it will soon stop. Expensive mistake, but at least resolved.

And you cannot help but laugh at the following irony: One of the main arguments for using the AWS infrastructure is that it is virtually immune to any denial of service attack. On the other hand, this immunity breeds a new type of attack: Bring the service down not by stopping it but by making it extremely expensive to run...

The real lesson: Google as a medium for launching an attack against others

Then I realized: This is a technique that can be used to launch a denial of service attack against a website hosted on Amazon (or even elsewhere). The steps:

Gather a large number of URLs from the targeted website. Preferably big media files (jpg, pdf, etc)
Put these URLs in a Google feed, or just put them in a Google Spreadsheet
Put the feed into a Google service, or use the image(url) command in Google spreadsheet
Sit back and enjoy seeing Google launching a Slashdot-style denial of service attack against your target.

What I find fascinating in this setting is that Google becomes such a powerful weapon due to a series of perfectly legitimate design decisions. First, they completely separate their index from the URLs that they fetch for private purposes. Very clean and nice design. The problem? No caching. Second, Google is not doing lazy evaluation in the feeds but tries to pre-fetch them to be ready and fresh for the user. The problem? Google is launching its Feedfetcher crawlers again and again. Combine the two, and you have a very, very powerful tool that can generate untraceable denial of service attacks.

The law of unintended consequences. Scary and instructive at the same time: You never know how the tools that you build can be used, no matter how noble the intentions and the design decisions.

PS: Amazon was nice enough to refund the bandwidth charges (before the post went public), as they considered this activity accidental and not intentional. Thanks TK!