A Computer Scientist in a Business School

Wednesday, April 25, 2012

The Google attack: How I attacked myself using Google Spreadsheets and ran up a $1000 bandwidth bill

It all started with an email.

From: Amazon Web Services LLC

Subject: Review of your AWS Account Estimated Month to Date Billing Charges of $720.85

Greetings from AWS,

During a routine review of your AWS Account's estimated billing this month, we noticed that your charges thus far are a bit larger than previous monthly charges. We'd like to use this opportunity to explore the features and functionality of AWS that led you to rely on AWS for more of your needs.

You can view your current estimated monthly charges by going here:

https://aws-portal.amazon.com/gp/aws/developer/account/index.html?ie=UTF8&action=activity-summary

AWS Account ID: XXXXXXX27965

Current Estimated Charges: $720.85

If you have any feedback on the features or functionality of AWS that has helped enable your confidence in our services to begin ramping your usage we would like to hear about it. Additionally, if you have any questions pertaining to your billing, please contact us by using the email address on your account and logging in to your account here:

https://aws-portal.amazon.com/gp/aws/html-forms-controller/contactus/aws-account-and-billing

Regards,
AWS Customer Service
This message was produced and distributed by Amazon Web Services LLC, 410 Terry Avenue North, Seattle, Washington 98109-5210

What? $720 in charges? My usual monthly charges for Amazon Web Services were around $100, so getting this email about $720 in usage just two weeks into the month was a big red flag. I logged in to my account to see what was going on, and I saw this:

An even bigger number: $1177.76 in usage charges! A thousand, one hundred, seventy seven dollars. Out of which $1065 in outgoing bandwidth transfer costs. The scary part: 8.8 Terabytes of outgoing traffic! Tera. Not Giga. Terabytes.
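As a quick sanity check on the bill, the two figures above imply a per-gigabyte price for outgoing transfer. This is only a back-of-the-envelope sketch: it assumes decimal units (1 TB = 1000 GB) and a flat rate, whereas actual AWS pricing may be tiered.

```python
# Figures quoted in the post.
bandwidth_cost_usd = 1065.0
outgoing_tb = 8.8

# Assumption: decimal units, 1 TB = 1000 GB, and a single flat rate.
outgoing_gb = outgoing_tb * 1000
cost_per_gb = bandwidth_cost_usd / outgoing_gb

print(f"Implied transfer price: ${cost_per_gb:.3f} per GB")
```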

To make things worse, I realized that the cost was going up hour after hour. Fifty to a hundred dollars more in billing charges with each. passing. hour. I started sweating.

What happened?

Initially I was afraid that a script I had set up to back up my photos from my local network to S3 had consumed that bandwidth. But then I realized that I had been running this backup-to-S3 script for a few months, so it could not suddenly have started consuming more resources. In any case, all traffic incoming to S3 is free. This was a matter of outgoing traffic.

Then I started suspecting that the spike might be due to the developers who work on various projects of mine. Could they have mounted the S3 bucket on an EC2 machine in a different region? In that case, we might indeed have a problem, as all the I/O operations on that machine would count as bandwidth costs. I checked all my EC2 machines. No, this was not the problem. All EC2 machines are in us-east, and my S3 buckets are all in US Standard. There are no charges for operations between EC2 machines and S3 buckets within the same region.

What could be causing this? Unfortunately, I did not have logging enabled for my S3 buckets. I enabled logging and waited to see what would happen next. But the logs would take a few hours to appear, and the bandwidth meter was running. No time to waste.

Thankfully, even in the absence of logging, Amazon provides access to the usage reports of all the AWS resources. The report indicated the bucket that was causing the problem:

My S3 bucket with the name "t_4e1cc9619d4aa8f8400c530b8b9c1c09" was generating 250GB of outgoing traffic, per hour.

Two-hundred-fifty Gigabytes. Per hour.

At least I knew what the source of the traffic was. It was a big bucket with images that were being used for a variety of tasks on Amazon Mechanical Turk.

But something was still strange. The bucket was big, approximately 250GB of images. Could Mechanical Turk generate so much traffic? Given that the average size of each image was 500KB to 1MB, the bucket would have to be serving 250,000 to 500,000 images per hour. That is on the order of 100 requests per second.
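The request-rate estimate can be checked with a few lines of arithmetic. This is a minimal sketch assuming decimal units (1 GB = 1000 MB) and a uniform image size; the function name is just for illustration.

```python
def requests_per_second(traffic_gb_per_hour: float, image_size_mb: float) -> float:
    """Requests per second needed to serve the hourly traffic at one image size."""
    # Assumption: decimal units, 1 GB = 1000 MB.
    images_per_hour = traffic_gb_per_hour * 1000 / image_size_mb
    return images_per_hour / 3600


# The post puts the average image size between 500KB and 1MB.
for size_mb in (0.5, 1.0):
    rate = requests_per_second(250, size_mb)
    print(f"{size_mb} MB images: about {rate:.0f} requests per second")
```

With 1MB images the rate is around 70 requests per second, and with 500KB images it is around 140, so the true figure sits somewhere in that range.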

There was no way that Mechanical Turk was responsible for this traffic. The cost of Mechanical Turk would have dwarfed the cost of bandwidth. Somehow the S3 bucket was being "Slashdotted" but without being featured on Slashdot or in any other place that I was aware of.

Strange.

Very strange.

Checking the Logs

Well, I had enabled logging for the S3 bucket, so I waited for the logs to appear.

The first logs showed up and I was in for a surprise. Here are the IPs and the User-agent of the requests.

74.125.156.82 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.64.83 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.64.84 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.156.81 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.158.86 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.156.92 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.64.87 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.156.81 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.158.82 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.158.85 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.158.89 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.156.83 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.158.90 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.156.92 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.64.85 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.158.82 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.156.88 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.158.86 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.156.89 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.158.83 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.156.94 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.156.83 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.158.88 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.156.83 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.64.92 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.156.80 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.64.88 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.158.84 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
74.125.158.87 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
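A listing like the one above is easier to read once it is tallied. The sketch below counts requests per user agent and per /24 network. It assumes the simplified "IP, space, user agent" layout shown in this post; real S3 access log records carry many more fields and would need a proper parser.

```python
from collections import Counter


def tally(lines):
    """Count requests per user agent and per /24 network for 'IP user-agent' lines."""
    agents, networks = Counter(), Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Everything before the first space is the IP; the rest is the user agent.
        ip, _, agent = line.partition(" ")
        agents[agent] += 1
        networks[".".join(ip.split(".")[:3]) + ".0/24"] += 1
    return agents, networks


# Three of the lines from the excerpt above.
sample = [
    "74.125.156.82 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)",
    "74.125.64.83 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)",
    "74.125.64.84 Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)",
]
agents, networks = tally(sample)
print(agents.most_common(1))
print(dict(networks))
```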

So, it was Google that was crawling the bucket. Aggressively. Very aggressively.

Why would Google crawl this bucket? Yes, the URLs were technically public but there was no obvious place to get the URLs. Google could not have gotten the URLs from Mechanical Turk. The images in the tasks posted to Mechanical Turk are not accessible to Google to crawl.

At least we know it is Google. I guess that, somehow, I let Google learn the URLs of the images in the bucket (how?) and Google started crawling them. But something was still puzzling. How can an S3 bucket with 250GB of data generate 40 times that amount of traffic? Google would just download each object once and be done with it. It would not re-crawl the same object many times.

I checked the logs again. Interestingly enough, there was a pattern: Each image was being downloaded every hour. Every single one of them. Again and again. Something was very very strange. Google kept launching its crawlers, repeatedly, to download the same content in the S3 bucket, every hour. For a total of 250GB of traffic, every hour.
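A pattern like this can be confirmed by measuring, for each object, the gap between consecutive downloads. The following is a minimal sketch over hypothetical (timestamp, key) pairs; the sample data and the function name are made up for illustration and are not taken from my actual logs.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import median


def median_refetch_seconds(requests):
    """Map each object key to the median gap, in seconds, between its downloads."""
    times = defaultdict(list)
    for timestamp, key in requests:
        times[key].append(timestamp)
    intervals = {}
    for key, stamps in times.items():
        stamps.sort()
        gaps = [(later - earlier).total_seconds()
                for earlier, later in zip(stamps, stamps[1:])]
        if gaps:  # keys fetched only once have no gap to report
            intervals[key] = median(gaps)
    return intervals


# Hypothetical log entries: two images, each fetched once per hour.
start = datetime(2012, 4, 24, 10, 0)
requests = [
    (start, "img1.jpg"),
    (start + timedelta(minutes=5), "img2.jpg"),
    (start + timedelta(hours=1), "img1.jpg"),
    (start + timedelta(hours=1, minutes=5), "img2.jpg"),
    (start + timedelta(hours=2), "img1.jpg"),
]
for key, seconds in median_refetch_seconds(requests).items():
    print(f"{key}: re-fetched every {seconds / 60:.0f} minutes")
```

A median close to 3600 seconds across most keys would point to a scheduled hourly refresh rather than organic traffic.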

Google should be smarter than that. Why waste all the bandwidth to re-download an identical image every hour?

Why would Google download the same images again and again?

Wait, this is not the real Google crawler...

Looking more carefully, there was one red flag. This was not the Google crawler. The Google crawler is named Googlebot for web pages and Googlebot-Image for images. It is not called Feedfetcher, as this user agent is.
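The distinction can be made mechanically by looking for the product token in the User-Agent header. This is a rough sketch based on simple substring matching; the function name and labels are hypothetical, and since user-agent strings can be spoofed, a match is a hint rather than proof of who is fetching.

```python
def classify_google_agent(user_agent: str) -> str:
    """Return a coarse label for a User-Agent string based on its product token."""
    ua = user_agent.lower()
    if "feedfetcher-google" in ua:
        return "feedfetcher"
    # Check the more specific token first: "googlebot-image" contains "googlebot".
    if "googlebot-image" in ua:
        return "googlebot-image"
    if "googlebot" in ua:
        return "googlebot"
    return "other"


seen = "Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)"
print(classify_google_agent(seen))
```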

What the heck is Feedfetcher? A few interesting pieces of information from Google:

Feedfetcher is how Google grabs RSS or Atom feeds when users choose to add them to their Google homepage or Google Reader
Feedfetcher retrieves feeds only after users have explicitly added them to their Google homepage or Google Reader
[Feedfetcher] is not retrieving content to be added to Google's search index
Feedfetcher retrieves feeds only after users have explicitly added them to their Google homepage or Google Reader. Feedfetcher behaves as a direct agent of the human user, not as a robot, so it ignores robots.txt

Interesting. So these images were in some form of a personal feed.

Shooting myself in the foot, the Google Spreadsheet way

And this information started unraveling the full story. I remembered!

All the URLs for these images were also stored in a Google Spreadsheet, so that I could inspect the results of the crowdsourcing process. (The spreadsheet was not being used or accessed by Mechanical Turk workers; it was just for viewing the results.) I used the =image(url) command to display a thumbnail of the image in a spreadsheet cell.

So, all this bandwidth waste was triggered by my own stupidity. I asked Google to download all the images to create the thumbnails in Google Spreadsheet. Talk about shooting myself in the foot. I launched the Google crawler myself.

But why did Google download the images again and again? That seemed puzzling. It seemed perfectly plausible that Google would fetch 250GB of data (i.e., the total size of the bucket), although I would have gone for a lazy evaluation approach (i.e., loading on demand, as opposed to pre-fetching). But why download the same content again and again?

Well, the explanation is simple: Apparently Google is using Feedfetcher as a "URL fetcher" for all sorts of "personal" URLs that someone adds to its services, and not only for feeds. Since these URLs are private, Google does not want to store them permanently anywhere on its servers. That makes perfect sense from the point of view of respecting user privacy. The problem is that this does not allow for any form of caching, as Google does not store the personal data anywhere.

So, every hour, Google was launching the crawlers against my bucket, generating a tremendous amount of crawler traffic. Notice that even if I had a robots.txt, Feedfetcher would have ignored it in any case. (Furthermore, it is not possible to place a robots.txt file in the root directory of https://s3.amazonaws.com as this is a common server for many different accounts; but in any case Feedfetcher would have ignored it.)

The final touch in the overall story? Normally, if you were to do the same thing with URLs from a random website, Google would have rate limited its crawlers so as not to overload the website. However, the s3.amazonaws.com domain is a huge domain, containing terabytes (petabytes?) of web content. Google has no reason to rate limit against such a huge domain with huge traffic. It made perfect sense to launch 100+ connections per second against a set of URLs that were hosted in that domain...

So, I did not just shoot myself in the foot. I took a Tsar Bomba and I launched it against my foot. The $1000 bandwidth bill (generated pretty much within a few hours) was the price of my stupidity.

Oof, mystery solved. I killed the spreadsheet and made the images private. Google started getting 403 errors, and I hope that it will soon stop. Expensive mistake, but at least resolved.

And you cannot help but laugh at the following irony: One of the main arguments for using the AWS infrastructure is that it is virtually immune to any denial of service attack. On the other hand, this immunity to denial of service breeds a new type of attack: Bring the service down not by stopping it but by making it extremely expensive to run...

The real lesson: Google as a medium for launching an attack against others

Then I realized: This is a technique that can be used to launch a denial of service attack against a website hosted on Amazon (or even elsewhere). The steps:

Gather a large number of URLs from the targeted website. Preferably big media files (jpg, pdf, etc.)
Put these URLs in a Google feed, or just put them in a Google Spreadsheet
Put the feed into a Google service, or use the image(url) command in Google spreadsheet
Sit back and enjoy seeing Google launching a Slashdot-style denial of service attack against your target.

What I find fascinating in this setting is that Google becomes such a powerful weapon due to a series of perfectly legitimate design decisions. First, they completely separate their index from the URLs that they fetch for private purposes. Very clean and nice design. The problem? No caching. Second, Google is not doing lazy evaluation in the feeds but tries to pre-fetch them to be ready and fresh for the user. The problem? Google is launching its Feedfetcher crawlers again and again. Combine the two, and you have a very, very powerful tool that can generate untraceable denial of service attacks.

The law of unintended consequences. Scary and instructive at the same time: You never know how the tools that you build can be used, no matter how noble the intentions and the design decisions.

PS: Amazon was nice enough to refund the bandwidth charges (before the post went public), as they considered this activity accidental and not intentional. Thanks TK!

Feedback, Unemployment, and Crowdsourcing: A Modest Proposal

I just finished reading the paper "Inefficient Hiring in Entry-Level Labor Markets" by Amanda Pallais, an assistant professor of Economics at Harvard University.

This is the first paper that I have read that provides experimental evidence that labor markets are "not efficient" in the following way: If we have a new worker, or a worker with no known past history, we do not know what the worker can and cannot do. Most employers will not hire this worker due to this lack of knowledge. And since the worker is never hired, nobody is able to leave feedback about the performance of the worker. This leads to a vicious cycle for the new entrants, who cannot break into the market because they do not have feedback, and they cannot get feedback because they cannot get into the market.

While this phenomenon is known, it was not obvious that lack of feedback is causing this inefficiency. The alternative explanation was that good workers will find work to do, and bad workers simply do not get jobs because they do not even know how to apply and enter the market efficiently.

What Amanda did was pretty interesting. She created a randomized experiment. She used oDesk and opened a position for data entry, a position that required pretty much no special skills. She received approximately 3000 job applications. Out of these, she randomly hired 1000 workers. The 2000 non-hired workers formed the "control" group. Within the 1000 hired workers, she created two groups: one that received detailed public feedback and evaluation, and another that received generic, uninformative feedback (e.g., "Good work"). Given the randomized selection, the differences in the future evolution of the workers were pretty much the result of the treatments in this controlled field experiment.

The results were revealing:

Workers randomly selected to receive jobs were more likely to be employed, requested higher wages, and had higher earnings than control group workers.
In the two months after the experiment, inexperienced workers' earnings approximately tripled as a result of obtaining a job.
Providing workers with more detailed evaluations substantially increased their earnings and the wages they requested.
The benefits of detailed evaluations were not universal: detailed performance evaluations helped those who performed well and hurt those who performed poorly.

Even more notable, the benefit to the workers who received the "you get a job" treatment did not come at the expense of other workers. Employment increased, and the money that was "wasted" to conduct the experiment (the tasks were not useful to anyone) generated enough return to cover the cost.

In principle, oDesk may want to engage in such "wasteful" hiring just to get workers to bootstrap and start with some meaningful feedback in their profiles: When you create an account at oDesk, you get a random job (which nobody cares about) and then the quality of the submitted work is evaluated, to generate some meaningful feedback for the worker (e.g., "great at setting up a map reduce task on Amazon Web Services").

Or, perhaps, they can skip the wasteful part, and use crowdsourcing as a perfectly valid mechanism for generating this valuable public feedback by letting people do actual work.

Crowdsourcing as a solution to the cold start problem

Note how this need for early feedback so that workers can enter the market naturally leads to crowdsourcing as a solution to the entrance problem.

If getting a job is the blocker for starting your career, then crowdsourcing allows new entrants to pick jobs without having to worry about the interview process. Just pick an available task and do it.

The findings of the study also suggest that crowdsourcing by itself is not enough. Any crowdsourcing application that provides jobs should be accompanied by a detailed feedback/scoring system. For example, if the crowdsourcing platform is about, say, translation, then there should be public feedback that lists the tasks that the person completed (what language pairs, etc.) and the corresponding performance statistics (e.g., time taken to complete the task, quality of the outcome, etc.).

In a setting like this, crowdsourcing becomes not a novelty item but an integral part of any labor platform, facilitating entry of the workers. It is not a place where jobs get done on the cheap. It is the place that generates information about the quality of the workers, which in turn makes the workers more valuable to the firms.

Should crowdsourcing firms receive favorable treatment by the government?

So, if crowdsourcing tasks that generate public feedback on the performance of the participating workers benefit the workers, the future employers, and society overall (by decreasing unemployment), the question is why we do not encourage companies to make more of their work available in such a format. While a service like Mechanical Turk would not qualify (anonymity of workers, plus lack of reputation), other services that generate useful public information could be the focus of favorable legislation and/or tax treatment.

Perhaps it is time to give crowdsourcing the attention and stature it deserves.

Tuesday, April 3, 2012

Philippines: The country that never sleeps (or, When is the world working? The oDesk Edition)

Why are you awake?

Over the last few months, I have used oDesk to hire a couple of virtual assistants, who help me with a variety of tasks. They are from the Philippines and we communicate over Skype whenever I have tasks for them to do. (Hi Maria! Hi Reineer!). One of the things that I found puzzling was that they seemed to be online during working hours in New York, despite the 12-hour time difference with Manila. When I asked them, they told me that most of the time they work for US-based clients, and their work is much easier when they are synchronized with a US schedule (real-time interactions with the clients, and so on). So they tend to stay awake until late at night and then sleep during their morning in the Philippines.

I found that behavior strangely fascinating, so I decided to dig deeper and figure out if this is some quirkiness of my own virtual assistants, or whether this is a more systematic pattern.

The oDesk Team client: All-you-can-eat data

One characteristic that differentiates oDesk from other online labor platforms is the focus on hourly contracts, instead of project-based or piecemeal contracts. To enable truthful billing, oDesk asks the service providers to use the oDesk client whenever they are billing time. The client records the time billed, takes screenshots at random intervals (which are given only to the paying client), and records the level of activity on the computer. This, in turn, ensures that clients can audit what service providers were doing while they were billing hours for work.

So, I got the data recorded by the oDesk Team client that show when a worker is active. I plotted the number of active workers at different times of the day (time is local to the location of the service provider, and not the global UTC time), for various days of the week. Here is the plot with numbers from the top-7 countries, ranked by number of workers:

One thing that is immediately interesting: The Philippines never sleeps!

All other countries have very natural patterns of being awake and asleep; the Philippines is an exception. We see that the minimum for the Philippines rarely drops below 5,000 active workers! All other countries (combined!) in their downtime cannot beat the Philippines in its low time. The supply of work is very constant over time.

There are a couple of natural break points (see the small dip around lunch time and another one at around dinner time) but even during the (Philippines) night the work keeps going. In fact, you can clearly see that the peak of employment is at around 9pm-10pm in the Philippines, which is when the East Coast in the US starts working as well. The low point for the Philippines is at around 4am-5am their time, which is 4pm-5pm on the East Coast.
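The 12-hour shift between the two clocks is easy to verify. The sketch below uses fixed UTC offsets rather than a time zone database: UTC+8 for Manila and UTC-4 for New York. The New York offset is an assumption that only holds while US daylight saving time is in effect, as it was on the date of this post; in winter the difference becomes 13 hours.

```python
from datetime import datetime, timedelta, timezone

MANILA = timezone(timedelta(hours=8), "PHT")  # the Philippines is at UTC+8
# Assumption: New York on daylight saving time (UTC-4); in winter it is UTC-5.
NEW_YORK_EDT = timezone(timedelta(hours=-4), "EDT")


def manila_to_new_york_hour(hour: int) -> int:
    """Convert an hour of the day in Manila to the matching hour in New York."""
    local = datetime(2012, 4, 3, hour, tzinfo=MANILA)
    return local.astimezone(NEW_YORK_EDT).hour


for manila_hour in (21, 4):
    ny_hour = manila_to_new_york_hour(manila_hour)
    print(f"{manila_hour:02d}:00 in Manila is {ny_hour:02d}:00 in New York")
```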

Update: A couple of fascinating comments from the Hacker News thread for this post:

I have cousins that work at help desks in the Philippines, and their work schedules are designed to match US time zones. After work, they hang out at bars with happy hours designed for them - I believe around ten in the morning. They hang out, then go home to sleep for the rest of the day. Globalization at work.

I'm a Filipino Developer. This is actually an alternative for us developers in the Philippines, instead of going abroad working overseas which will be very far from our families. We got a lot of opportunities from foreigners who want to outsource their development projects. This earns us quite substantial income Although it's not as high as when you're really working abroad, being with your family and seeing your children grow up mostly makes up for it. Staying up late is not that hard as I myself am most productive at night when kids are asleep. I know most programmers share this work time.

The Data

For those that want to play more with the data, here is a link to a Google Spreadsheet. If you want more details or a slightly different view of the data, I would be happy to dig more in the oDesk database.

What is the application? Real-time human computation

So, why do we care that the Philippines is awake all the time? The immediate benefit is that a team in the Philippines can ensure the availability of labor for handling real-time tasks. If you have a human-powered application, you do not want any dead periods in which the application slows down or becomes completely unresponsive. By hiring people from the Philippines, however, it is possible to have a "private crowd" available around the clock, simply by asking the contractors to "show up" at different points during the day/week.

What is the difference with other services? If you hire a big outsourcing company, then the expectation is that they will work during (their) normal business hours, leaving the service down for many hours. On Mechanical Turk, this drop in performance comes naturally. If you restrict your tasks to US only, the speed drops when the US goes to sleep. If you run the task in India, the same thing will happen. (Mixing the two crowds tends to result in many complications as the expectations for price are very different and Indians tend to overwhelm tasks that are priced for US workers.)

Overall, the Philippines seems to have a nice balance of availability throughout the day, and generally low prices. In terms of quality, things tend to be somewhere between the US and India, so careful screening and quality control are important. But for many people experienced with managing crowds, it seems that the Philippines is a great source of "crowds."

Myself, I have already put my money where my mouth is, across multiple crowd applications that I have built.