Google App Engine and Java: First Impressions

Over the last few days I have been playing with Google App Engine, the infrastructure provided by Google for building applications in the cloud. To give some context, I tried to build a crawler that retrieves and stores historical information from a marketplace.

I had already built this application, and it was running on my local Linux machine, storing information in a SQL database. However, I was getting uncomfortable seeing the database grow significantly, and my big queries were interfering with other users who were using the database machine for their own projects. So, I decided to see how easy it is to port such a vanilla project into the cloud.

My impressions so far:

Ease of programming

It was pretty easy to follow the provided tutorials and get a basic application up and running quickly. It may be a good way to introduce students to (web) programming. The Eclipse plugin hides much of the complexity and lets the programmer focus on developing the application.

Database support

No SQL database anymore. While we still save "entities" and there is support for relationships across entities, the Google datastore is built on Bigtable, not on a SQL database. This means no joins. You can always implement your own version of a join, but this is not how the Google datastore is supposed to be used. Slowly you realize that denormalization is desirable and often absolutely necessary. For someone like me who likes a fully normalized schema, with no inconsistencies anywhere, it felt almost too messy: too much information replicated everywhere, the need to be extra careful to avoid update anomalies, and so on. I can see a significant learning curve for migrating databases into such an environment. Giving up joins is not easy... (Our MBA students, who keep all their data in a single spreadsheet, will feel right at home ;-)

But it is not that bad. Personally, it helped me to think of the entities in the Google datastore as materialized views of some underlying relations, and to use lazy updating techniques to keep the data consistent.
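To make the "materialized view with lazy updating" idea concrete, here is a minimal sketch in plain Java. It does not use the datastore API at all: two in-memory maps stand in for entity kinds, and every name in it (Seller, Listing, sellerNameCopy, the version counter) is a hypothetical chosen for illustration. A listing carries its own copy of the seller name so that reading it needs no join, and the copy is refreshed only when the listing is read and found to be stale.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: plain maps stand in for two datastore entity kinds.
class DenormalizedViewSketch {

    // The "underlying relation": the authoritative seller record.
    static final class Seller {
        String name;
        long version = 1; // bumped on every change to the seller
        Seller(String name) { this.name = name; }
    }

    // The denormalized entity: it copies the seller name so that
    // displaying a listing never requires a join.
    static final class Listing {
        final String sellerId;
        final String title;
        String sellerNameCopy;
        long sellerVersionSeen; // version of the seller the copy came from
        Listing(String sellerId, String title) {
            this.sellerId = sellerId;
            this.title = title;
        }
    }

    final Map<String, Seller> sellers = new HashMap<>();
    final Map<String, Listing> listings = new HashMap<>();

    void putSeller(String id, String name) {
        sellers.put(id, new Seller(name));
    }

    // Updating the base record does NOT touch any listing.
    void renameSeller(String id, String newName) {
        Seller s = sellers.get(id);
        s.name = newName;
        s.version++;
    }

    void putListing(String listingId, String sellerId, String title) {
        Listing l = new Listing(sellerId, title);
        refresh(l);
        listings.put(listingId, l);
    }

    // Lazy update: repair the copied field only when the listing is read
    // and its copy is older than the seller record.
    Listing readListing(String listingId) {
        Listing l = listings.get(listingId);
        if (l.sellerVersionSeen != sellers.get(l.sellerId).version) {
            refresh(l);
        }
        return l;
    }

    private void refresh(Listing l) {
        Seller s = sellers.get(l.sellerId);
        l.sellerNameCopy = s.name;
        l.sellerVersionSeen = s.version;
    }

    public static void main(String[] args) {
        DenormalizedViewSketch db = new DenormalizedViewSketch();
        db.putSeller("s1", "Acme");
        db.putListing("l1", "s1", "Widget");
        db.renameSeller("s1", "Acme Corp");
        System.out.println(db.readListing("l1").sellerNameCopy);
    }
}
```

The trade-off is the usual one: writes to the seller stay cheap, while each read of a listing pays for one extra lookup to check the version.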

30-second limit

By far the most annoying aspect of Google App Engine is the 30-second limit on execution time for any process. Nothing can run for more than 30 seconds. Since I wanted to build a crawler, I had to rethink the infrastructure. It was necessary to break the task into smaller chunks that can be completed within the 30-second limit.

To achieve this, I built a "task queue" structure that keeps track of the pages that need to be fetched, and I stored this queue as a persistent structure in the datastore. Then, the "crawler" process picks URLs from the queue and fetches whatever pages it can within the 30-second limit, storing the retrieved pages in the Google datastore. What is pretty annoying is that the 30-second limit also includes the time to fetch each page. Often, I was timing out just because the remote server was slow to send the requested page.
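The loop described above can be sketched roughly as follows. This is not the crawler's actual code: the queue and the page store are ordinary in-memory collections rather than datastore entities, the fetcher is passed in as a function, and the names (crawlBatch, budgetMillis, reserveMillis) are hypothetical. The idea is simply to stop starting new fetches once the time left falls to a reserve that should cover the slowest fetch you are willing to wait for.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Hypothetical sketch of one time-budgeted crawler run.
class BudgetedCrawlerSketch {

    /**
     * Fetches URLs from the front of the queue until the queue is empty or
     * the remaining budget is no larger than the reserve.
     * Returns the number of pages fetched in this run.
     */
    static int crawlBatch(Deque<String> queue,
                          Map<String, String> store,
                          Function<String, String> fetcher,
                          LongSupplier clockMillis,
                          long budgetMillis,
                          long reserveMillis) {
        long deadline = clockMillis.getAsLong() + budgetMillis;
        int fetched = 0;
        while (!queue.isEmpty()
                && deadline - clockMillis.getAsLong() > reserveMillis) {
            // Peek first, remove only after the page is stored, so a URL
            // is not lost if the fetch throws.
            String url = queue.peekFirst();
            store.put(url, fetcher.apply(url));
            queue.removeFirst();
            fetched++;
        }
        return fetched;
    }

    public static void main(String[] args) {
        Deque<String> queue = new ArrayDeque<>();
        queue.add("http://example.com/a");
        queue.add("http://example.com/b");
        Map<String, String> store = new HashMap<>();

        // Stand-in fetcher; a real one would perform an HTTP request.
        Function<String, String> fakeFetcher = url -> "<html>" + url + "</html>";

        int n = crawlBatch(queue, store, fakeFetcher,
                System::currentTimeMillis, 25_000L, 5_000L);
        System.out.println("fetched " + n + ", left in queue " + queue.size());
    }
}
```

Checking the clock before each fetch does not protect against a single slow remote server, so a real fetcher would also need its own timeout, shorter than the reserve.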

Finally, to get the crawler running "all the time", I scheduled a cron job that starts the "30-second crawler" process every minute. It is almost like trying to take a trip in a car that can run for only 30 seconds at a time and can be restarted once a minute. Not very elegant, not the most efficient, but it works for lightweight tasks.

Quota system

Google App Engine allows applications to run for free, as long as they stay below some usage quota. Once the app exceeds its daily allocated free quota, it gets billed, up to a maximum specified limit.

In other words, you pay for CPU usage. This is in direct contrast to Amazon EC2, which charges by the "wall time" that a virtual machine is running. Since Google App Engine charges only for the resources actually consumed, it encourages code that is as efficient as possible and uses as few resources as possible.
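The difference between CPU time and wall time is easy to see on any JVM. The sketch below has nothing to do with how App Engine or EC2 actually meter usage; it only uses the standard ThreadMXBean to compare the two clocks for one piece of work, and the class and method names are hypothetical. A thread that sleeps (or waits on a slow remote server) accumulates wall time while consuming almost no CPU time.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical sketch: compare wall-clock time and CPU time for one task.
class CpuVsWallSketch {

    /** Returns { wallNanos, cpuNanos } spent by the current thread on the work. */
    static long[] measure(Runnable work) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // CPU timing is optional on some JVMs; report 0 if it is unavailable.
        boolean cpuSupported = threads.isCurrentThreadCpuTimeSupported();

        long wallStart = System.nanoTime();
        long cpuStart = cpuSupported ? threads.getCurrentThreadCpuTime() : 0L;
        work.run();
        long cpuEnd = cpuSupported ? threads.getCurrentThreadCpuTime() : 0L;
        long wallEnd = System.nanoTime();

        return new long[] { wallEnd - wallStart, cpuEnd - cpuStart };
    }

    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // Waiting: lots of wall time, very little CPU time.
        long[] waiting = measure(() -> sleepQuietly(200));
        System.out.printf("waiting: wall=%d ms, cpu=%d ms%n",
                waiting[0] / 1_000_000, waiting[1] / 1_000_000);
    }
}
```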

Artists say that the limitations of a medium are a major force for creativity. I have to say that the quota system has the same effect. I found myself thinking and rethinking how to make the process as efficient as possible. Since I can always see the exact amount of resources spent by each process, I feel compelled to keep optimizing. This is not the case for regular desktop programming. OK, it takes 2 seconds instead of 0.1. So what? I have plenty of resources, and I can afford to be sloppy. When I am being billed for the consumed resources, I have a pretty immediate incentive to write the best code possible.

I may be overreaching here, but I see billing according to CPU usage as a force that will encourage deeper learning in Computer Science. The effect of optimization is immediate and measurable, and it is often necessary to optimize just to get your process running.

I remember the stories of the old-timers and how they were trying to super-optimize their code, so that the mainframe could execute it overnight and they could get the results back. Well, the mainframe is back!

A Computer Scientist in a Business School

Monday, April 20, 2009