## Wednesday, April 25, 2012

### The Google attack: How I attacked myself using Google Spreadsheets and I ramped up a $1000 bandwidth bill It all started with an email. From: Amazon Web Services LLC Subject: Review of your AWS Account Estimated Month to Date Billing Charges of$720.85

Greetings from AWS,

During a routine review of your AWS Account's estimated billing this month, we noticed that your charges thus far are a bit larger than previous monthly charges. We'd like to use this opportunity to explore the features and functionality of AWS that led you to rely on AWS for more of your needs.

You can view your current estimated monthly charges by going here:

https://aws-portal.amazon.com/gp/aws/developer/account/index.html?ie=UTF8&action=activity-summary

AWS Account ID: XXXXXXX27965
Current Estimated Charges: $720.85 If you have any feedback on the features or functionality of AWS that has helped enable your confidence in our services to begin ramping your usage we would like to hear about it. Additionally, if you have any questions pertaining to your billing, please contact us by using the email address on your account and logging in to your account here: https://aws-portal.amazon.com/gp/aws/html-forms-controller/contactus/aws-account-and-billing Regards, AWS Customer Service This message was produced and distributed by Amazon Web Services LLC, 410 Terry Avenue North, Seattle, Washington 98109-5210 What? \$720 in charges? My usual monthly charges for Amazon Web Services were around \$100, so getting this email with a usage of \$720 after just two weeks within the month was a big alert. I login to my account to see what is going on, and I see this:

An even bigger number: \$1177.76 in usage charges! A thousand, one hundred, seventy seven dollars. Out of which \$1065 in outgoing bandwidth transfer costs. The scary part: 8.8 Terabytes of outgoing traffic! Tera. Not Giga. Terabytes.

To make things worse, I realized that the cost was going up hour after hour. Fifty to hundred dollars more in billing charges with each. passing. hour. I started sweating.

What happened?

Initially I was afraid that a script that I setup to backup my photos from my local network to S3 consumed that bandwidth. But then I realized that I am running this backup-to-S3 script for a few months now, so it could not suddenly start consuming more resources. In any case, all the traffic that is incoming to S3 is free. This was a matter of outgoing traffic.

Then I started suspecting that the cause of this spike maybe due to the developers that are working in various projects of mine. Could they have mounted the S3 bucket into an EC2 machine that is in a different region? In that case, we may have indeed problems, as all the I/O operations that are happening within a machine would count as bandwidth costs. I checked all my EC2 machines. No, this is not the problem. All EC2 machines are in us-east, and my S3 buckets are all in US Standard. No charges for operations between EC2 machines and S3 buckets within the same region.

What could be causing this? Unfortunately, I did not have any logging enabled to my S3 buckets. I enabled logging and expected to see what would happen next. But logging would take a few hours, and the bandwidth meter was running. No time to waste.

Thankfully, even in the absence of logging, Amazon provides access to the usage reports of all the AWS resources. The report indicated the bucket that was causing the problem:

My S3 bucket with the name "t_4e1cc9619d4aa8f8400c530b8b9c1c09" was generating 250GB of outgoing traffic, per hour.

Two-hundred-fifty Gigabytes. Per hour.

At least I knew what was the source of the traffic. It was a big bucket with images that were being used for a variety of tasks on Amazon Mechanical Turk.

But still something was strange. The bucket was big, approximately 250GB of images. Could Mechanical Turk generate so much traffic? Given that on average the size of each image was 500Kb to 1MB, the bucket should have been serving 250,000 images per hour. This is 100+ requests per second.

There was no way that Mechanical Turk was responsible for this traffic. The cost of Mechanical Turk would have trumpeted the cost of bandwidth. Somehow the S3 bucket was being "Slashdotted" but without being featured on Slashdot or in any other place that I was aware of.

Strange.

Very strange.

Checking the Logs

Well, I enabled logging for the S3 bucket, so I was waiting for the logs to appear.

The first logs showed up and I was in a for a surprise. Here are the IP's and the User-agent of the requests.




So, it was Google that was crawling the bucket. Aggressively. Very aggressively.

Why would Google crawl this bucket? Yes, the URLs were technically public but there was no obvious place to get the URLs. Google could not have gotten the URLs from Mechanical Turk. The images in the tasks posted to Mechanical Turk are not accessible to Google to crawl.

At least we know it is Google. I guess, somehow, I let Google learn about the URLs of the images in the bucket (how?) and Google started crawling them. But something was still puzzling. How can an S3 bucket with 250Gb of data generate 40 times that amount of traffic? Google would just download once and get done with that. It would not re-crawl the same object many times.

I checked the logs again. Interestingly enough, there was a pattern: Each image was being downloaded every hour. Every single one of them. Again and again. Something was very very strange. Google kept launching its crawlers, repeatedly, to download the same content in the S3 bucket, every hour. For a total of 250GB of traffic, every hour.

Google would have been smarter than that. Why wasting all the bandwidth to re-download an identical image every hour?

Wait, this is not the real Google crawler...

Looking more carefully, there was one red flag. This is not the Google crawler. The Google crawler is named GoogleBot for web pages and Googlebot-Image for images. It is not called Feedfetcher as this user agent.

What the heck is Feedfetcher? A few interesting pieces of information from Google:
• [Feedfetcher] is not retrieving content to be added to Google's search index
• Feedfetcher retrieves feeds only after users have explicitly added them to their Google homepage or Google Reader. Feedfetcher behaves as a direct agent of the human user, not as a robot, so it ignores robots.txt
Interesting. So these images were in some form of a personal feed.

And this information started unraveling the full story. I remembered!

All the URLs for these images were also stored in a Google Spreadsheet, so that I can inspect the results of the crowdsourcing process. (The spreadsheet was not being used or accessed by Mechanical Turk workers, it was just for viewing the results.) I used the =image(url) command to display a thumbnail of the image in a spreadsheet cell.

But why did Google download the images again and again? That seemed puzzling. It seemed perfectly plausible that Google would fetch 250Gb of data (i.e., the total size of the bucket), although I would have gone for a lazy evaluation approach (i.e., loading on demand, as opposed to pre-fetching). But why downloading the same content again and again?

Well, the explanation is simple: Apparently Google is using Feedfetcher as a "url fetcher" for all sorts of "personal" URLs someone adds to its services, and not only for feeds. Since these URLs are private, Google does not want to store them anywhere permanently in the Google servers. Makes perfect sense from the point of view of respecting user privacy. The problem is that this does not allow for any form of caching, as Google does not store anywhere the personal data.

So, every hour, Google was launching the crawlers against my bucket, generating a tremendous amount of crawler traffic. Notice that even if I had a robots.txt, Feedfetcher would have ignored it in any case. (Furthermore, it is not possible to place a robots.txt file in the root directory of https://s3.amazonaws.com as this is a common server for many different accounts; but in any case Feedefetcher would have ignored it.)

The final touch in the overall story? Normally, if you were to do the same thing with URLs from a random website, Google would have rate limited its crawlers, not to overload the website. However, the s3.amazonaws.com domain is a huuuge domain, containing terabytes (petabytes?) of web content. Google has no reason to rate limit against such a huge domain with huge traffic. It made perfect sense to launch 100+ connections per second against a set of URLs that were hosted in that domain...

So, I did not just shoot myself in the foot. I took a Tsar Bomba and I launched it against my foot. The \$1000 bandwidth bill (generated pretty much within a few hours) was the price of my stupidity.

Ooof, mystery solved. I killed the spreadsheet and make the images private. Google started getting 403 errors, and I hope that it will soon stop. Expensive mistake, but at least resolved.

And you cannot help but laugh at the following irony: One of the main arguments for using the AWS infrastructure is that it is virtually invincible to any denial of service attack. On the other hand, the avoidance of the denial of service breeds a new type of attack: Bring the service down not by stopping the service but by making it extremely expensive to run...

The real lesson: Google as a medium for launching an attack against others

Then I realized: This is a technique that can be used to launch a denial of service attack against a website hosted on Amazon (or even elsewhere). The steps:
1. Gather a large number of URLs from the targeted website. Preferably big media files (jpg, pdf, etc)