Monday, April 20, 2009

Google App Engine and Java: First Impressions

Over the last few days I have been playing with Google App Engine, the infrastructure provided by Google for building applications in the cloud. To give some context, I tried to build a crawler that would retrieve and store historical information from a marketplace.

I had already built this application, and it was running on my local Linux machine, storing information in a SQL database. However, I was getting uncomfortable seeing the database grow significantly, and running big queries was interfering with other users who were using the database machine for their own projects. So, I decided to see how easy it is to port such a vanilla project into the cloud.

My impressions so far:

Ease of programming

It was pretty easy to follow the provided tutorials and get a basic application up and running quickly. It may be a good way to introduce students to (web) programming. The Eclipse plugin hides a very significant fraction of the complexity and allows the programmer to focus on application development.
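For reference, the basic application that the tutorial walks you through is essentially a plain servlet; something along the following lines is close to what the Eclipse plugin generates (the class name here is just a placeholder, not the generated one):

    import java.io.IOException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    // A minimal App Engine servlet; the plugin also generates the
    // web.xml and appengine-web.xml entries that map it to a URL.
    public class HelloAppEngineServlet extends HttpServlet {
        @Override
        public void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws IOException {
            resp.setContentType("text/plain");
            resp.getWriter().println("Hello from App Engine");
        }
    }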

Database support

No SQL database anymore. While we still save "entities" and there is support for relationships across entities, the Google data storage is based on BigTable, not on a SQL database. This means no joins. You can always implement your own version of a join, but this is not how the Google datastore is supposed to be used. Slowly you realize that denormalization is desirable and often absolutely necessary. For someone like me, who likes a fully normalized schema that guarantees there are no inconsistencies anywhere, it felt almost too messy: too much information replicated everywhere, the need to be extra careful not to introduce anomalies, and so on. I can see a significant learning curve for migrating databases into such an environment. Giving up joins is not easy... (Our MBA students, who keep all their data in a single spreadsheet, will feel right at home ;-) But it is not that bad. Personally, it helped me to think of the entities in the Google datastore as materialized views of some underlying relations, and to use lazy updating techniques to keep the data consistent.
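To make the "materialized view" idea concrete, here is a rough sketch using the low-level datastore API; the marketplace entities and property names are placeholders I made up, not my actual schema. The point is that the seller's name is copied onto every listing, so displaying or filtering listings never requires a join:

    import com.google.appengine.api.datastore.DatastoreService;
    import com.google.appengine.api.datastore.DatastoreServiceFactory;
    import com.google.appengine.api.datastore.Entity;

    // Denormalized "Listing" entity: instead of joining against a Seller
    // entity at query time, we copy the seller's name (and anything else
    // we filter or display on) into every listing. If the seller's name
    // ever changes, a background task lazily updates the affected listings.
    public class ListingStore {
        private final DatastoreService datastore =
                DatastoreServiceFactory.getDatastoreService();

        public void saveListing(String listingId, String title, double price,
                                String sellerId, String sellerName) {
            Entity listing = new Entity("Listing", listingId);
            listing.setProperty("title", title);
            listing.setProperty("price", price);
            listing.setProperty("sellerId", sellerId);
            // Replicated from the Seller entity -- the "materialized view" part.
            listing.setProperty("sellerName", sellerName);
            datastore.put(listing);
        }
    }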

30-second limit

By far the most annoying aspect of Google App Engine is the limit of 30 seconds of execution time for any process. Nothing can run for more than 30 seconds. Since I wanted to build a crawler, I had to rethink the infrastructure: it was necessary to break the task into smaller chunks that could be completed within the 30-second limit.

To achieve this, I built a "task queue" structure that kept track of the pages that needed to be fetched, and this queue was stored as a persistent structure in the datastore. The "crawler" process then picked URLs from the queue and fetched whatever pages it could within the 30-second limit, storing the retrieved pages in the Google datastore. Rather annoyingly, the 30-second limit also includes the time to fetch the page; often I was timing out just because the remote server was slow to send the requested page.
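Roughly, the crawler looks like the sketch below; nextUrlFromQueue() and storePage() are placeholders for the datastore-backed queue operations, and the 25-second budget is just a safety margin I picked below the hard limit:

    import java.io.IOException;
    import java.net.URL;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    import com.google.appengine.api.urlfetch.HTTPResponse;
    import com.google.appengine.api.urlfetch.URLFetchService;
    import com.google.appengine.api.urlfetch.URLFetchServiceFactory;

    // Invoked once a minute; crawls as many queued URLs as fit in the
    // time budget, then exits before App Engine kills the request.
    public class CrawlerServlet extends HttpServlet {
        private static final long BUDGET_MILLIS = 25 * 1000; // margin below the 30s limit

        @Override
        public void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws IOException {
            URLFetchService fetcher = URLFetchServiceFactory.getURLFetchService();
            long deadline = System.currentTimeMillis() + BUDGET_MILLIS;

            String url;
            while (System.currentTimeMillis() < deadline
                    && (url = nextUrlFromQueue()) != null) {      // placeholder: pop a URL from the persistent queue
                HTTPResponse page = fetcher.fetch(new URL(url));  // fetch time counts against the 30s, too
                storePage(url, page.getContent());                // placeholder: persist the page bytes to the datastore
            }
            resp.setContentType("text/plain");
            resp.getWriter().println("done");
        }

        private String nextUrlFromQueue() { /* query the persistent queue */ return null; }
        private void storePage(String url, byte[] content) { /* datastore put */ }
    }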

Finally, to get the crawler running "all the time", I scheduled a cron job that starts the "30-second crawler" process every minute. It is almost like trying to complete a road trip with a car that can only run for 30 seconds at a time and can be restarted once every minute. Not very elegant, and not the most efficient, but it works for lightweight tasks.
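For reference, this kind of schedule is expressed in App Engine's cron configuration (war/WEB-INF/cron.xml); the /crawl path below is simply whatever URL the crawler servlet happens to be mapped to:

    <?xml version="1.0" encoding="UTF-8"?>
    <cronentries>
      <cron>
        <url>/crawl</url>
        <description>Restart the 30-second crawler</description>
        <schedule>every 1 minutes</schedule>
      </cron>
    </cronentries>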

Quota system

Google App Engine allows applications to run for free, as long as they stay below some usage quota. Once the app exceeds its daily allocated free quota, it gets billed, up to a maximum specified limit. 

In other words, you pay for CPU usage. This is in direct contrast to Amazon EC2, which charges by the "wall time" a virtual machine is running. Since Google App Engine charges only for the resources actually consumed, it encourages writing code that is as efficient as possible and wastes as few resources as possible.

Artists say that the limitations of a medium are a major force for creativity. I have to say that the quota system has the same effect. I found myself thinking and rethinking how I could make each process as efficient as possible. Since I can see at any time the exact amount of resources spent by each process, I am compelled to make the processes as efficient as possible. This is not the case for regular desktop programming. OK, it takes 2 seconds instead of 0.1. So what? I have plenty of resources, and I can afford to be sloppy. When I am being billed for the consumed resources, I have a pretty immediate incentive to write the best code possible.

I may be overreaching here, but I see the concept of being billed according to CPU usage as a force that will encourage deeper learning in Computer Science. The effect of optimization is immediate and measurable, and it is often necessary to optimize just to get your process running at all.

I remember the stories of the old-timers and how they tried to super-optimize their code, so that the mainframe could execute it overnight and they could get the results back. Well, the mainframe is back!


Wednesday, April 8, 2009

LiveOps and Human Computation

When we talk about human computation, the canonical examples are either the Games with a Purpose from Luis von Ahn, or Amazon's Mechanical Turk.

Recently, though, I learned about LiveOps, a company that enables "micro-outsourcing" of small tasks, such as handling a telephone call or taking a pizza order at a drive-through. Quoting Jonathan Zittrain, who wrote a paper on Ubiquitous Human Computing:
We are in the initial stages of distributed human computing that can be directed at mental tasks the way that surplus remote server rackspace or Web hosting can be purchased to accommodate sudden spikes in Internet traffic (von Ahn 2005; Hewlett Packard (HP) 2008) or PCs can be arranged in grid computing configurations, each executing code in an identical virtual environment (International Business Machines (IBM) 2006). At some fast food drivethroughs, the microphone and speaker next to the marquee menu are patched through to an order-taker thousands of miles away. That person types up the requested order and dispatches it back to a screen in the food preparation area of the restaurant while the car is idling (Richtel 2006). Services like LiveOps recruit workers for such mental contracting tasks (LiveOps 2008a). Applicants to LiveOps navigate a fully automated hours-long vetting system that tests their skills and suitability. Out of 2,000 applicants per week, roughly 40 emerge for a second round of interviews by LiveOps managers (LiveOps 2008b). 
Those who succeed and become contractors for firms like LiveOps encounter an unusual combination of freedom and control. They can work whenever they like, wherever they like, for as much or as little time as they like. When they log in to work they choose from a menu of assignments tailored to their skill and reputation levels. These might include taking pizza orders, placing sales calls, lobbying for a political candidate, or handling customer service inquiries. Then there is the control: every call and transaction is measured and recorded. Interactions can be monitored live by fellow LiveOps mentors or official LiveOps managers, or pulled up later as part of a larger assessment of contractors’ work. Judgments are developed and recorded about contractors’ performance, such that an incoming pizza order can be routed to the best pizza-order-taker – who may not be the same as the best political campaigner (Hornik 2007). Contractors can be de-accredited at any time.
I find the similarities with Mechanical Turk striking, but I can clearly see how LiveOps differentiates itself by handling tasks that are not suitable for the Mechanical Turk platform. I also find it mildly entertaining that I have been using LiveOps as an example in class when we talk about VoIP, but had never thought of actually digging deeper to see how they work.

On a tangentially related note, if you want to find papers related to human computation, you can visit a wiki that we created at http://hcomp2009.wikispaces.com/. Feel free to add more papers, add notes to the current papers, or simply send suggestions on how to improve it. In the Human Computation Workshop (HComp 2009) we are trying to bring together people who are interested in all aspects of human computation, and the wiki is just one part of this effort.