
Monday, June 20, 2011

Crowdsourcing and the discovery of a hidden treasure

A few months back, I started advising Tagasauris, a company that provides media annotation services, using crowdsourcing. 

This month, Tagasauris is featured in a Wired article, titled "Hidden Treasure". It is a story of rediscovering a "lost" set of photos, from the shooting of the movie "American Graffiti". You can see the article by clicking the image:
Hidden Treasure

Rediscovered: Never before seen American Graffiti photos in the Magnum archive.

IN MARCH, the Magnum photo agency stumbled onto a remarkable find: Nearly two dozen lost photos from the set of American Graffiti. The images feature pre-Star Wars George Lucas as well as cast members like Richard Dreyfuss, Mackenzie Phillips, and Ron Howard, and they offer an unparalleled look at the making of the 1973 film. So where did Magnum discover these gems? In its own archive. Magnum had hired Tagasauris, a company that tags photos using Amazon Mechanical Turk workers, to add keywords to hundreds of thousands of untagged images. When those workers came across the Graffiti photos, they quickly identified the actors, scenes, and other image details. Magnum originally hoped the phototagging would improve its archive's searchability, which it has, but the agency was also thrilled that the initiative unearthed such an incredible trove - images that visually resurrect an American classic.

Since there are some interesting aspects of this story that go beyond the simple "tag using MTurk" narrative, I would like to give a few more details.

Magnum Photos

One of the clients of Tagasauris is Magnum Photos, a cooperative owned by its photographer members and created to handle the commercial side of their work. The list of members of Magnum Photos includes photographers such as Robert Capa, Henri Cartier-Bresson, David Seymour, George Rodger, Steve McCurry, and many others. (See their Wikipedia entry for further details.) A few photos in the Magnum Photos archive that you may recognize:


One of my favorite parts of the Magnum website is the Archival Calendar, where they have a set of photos showcasing various historic events. Beats Facebook browsing by a wide margin. But let's get back to the story.

The problem

So, what is the problem at Magnum Photos? The same problem that almost every big media company faces: a very large number of media objects without useful, descriptive metadata. No keywords, no description, nothing to aid the discovery process. Just the image file and mechanical details such as the film number. (Well, my own photo archive looks very similar...)

This lack of metadata is the case not only for the archive but also for the new, incoming photos that arrive every day from its members. (To put it mildly, photographers are not exactly eager to sit, tag, and describe the hundreds of photos they shoot every day.) This means that a large fraction of the Magnum Photos archive, which contains millions of photos, is virtually unsearchable. The photos are effectively lost in the digital world, even though they are digitized and available on the Internet.

An example of such "lost" photos is a set from the shooting of the movie "American Graffiti". People at Magnum Photos knew that one of their photographers, Dennis Stock, who died in 2009, was on set during the production of the movie and had taken photos of the then young and unknown members of the cast. Magnum Photos had no idea where these photos were. They knew they had digitized the archive of Dennis Stock, they knew that the photos were in the archive, but nobody could locate them among the millions of other, untagged photos.

For those unfamiliar with the movie, American Graffiti is a 1973 film by George Lucas (pre-Star Wars), starring the then unknown Richard Dreyfuss, Ron Howard, Paul Le Mat, Charles Martin Smith, Cindy Williams, Candy Clark, Mackenzie Phillips, and Harrison Ford. The subsequent rise of so many of its actors to stardom has given the movie an almost cult status.

The Magnum Photos archive is a trove of similar "hidden treasures". Sitting there, waiting for some accidental, serendipitous discovery.

The tagging solution and the machine support

Magnum Photos had its own set of annotators. However, the annotators could not keep up even with the volume of incoming photos, and going back to annotate the archive was an even more daunting task. This meant lost revenue for Magnum Photos: if you cannot find a photo, you cannot license it, and you cannot sell it.

Tagasauris proposed to solve the problem using crowdsourcing. With hundreds of workers tagging in parallel, it became possible to tame the influx of untagged incoming photos and to start working backwards through the archive.

Of course, vanilla photo tagging is not a solution. Workers type misspelled words (named entities are systematic offenders), try to get away with generic tags, and so on. Following the lessons learned from the ESP Game and the subsequent studies, Tagasauris built solutions for cleaning the tags, rewarding specificity, and, in general, ensuring high quality in a noisy tagging process.

A key component was the ability to match the tags entered by the workers with named entities, which were then linked to Freebase entities.
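To make this concrete, here is a minimal sketch, in Python, of how noisy worker tags might be normalized and matched against a dictionary of canonical entity names. This is not the actual Tagasauris pipeline; the entity identifiers and the similarity cutoff are made-up placeholders for illustration.

```python
import difflib

# Hypothetical canonical entities (in practice, names and aliases would come from Freebase).
CANONICAL_ENTITIES = {
    "george lucas": "/en/george_lucas",
    "richard dreyfuss": "/en/richard_dreyfuss",
    "ron howard": "/en/ron_howard",
    "mackenzie phillips": "/en/mackenzie_phillips",
    "harrison ford": "/en/harrison_ford",
}

def match_tag(raw_tag, cutoff=0.8):
    """Map a noisy worker tag (e.g., a misspelled actor name) to a canonical entity id, if any."""
    tag = raw_tag.strip().lower()
    matches = difflib.get_close_matches(tag, list(CANONICAL_ENTITIES), n=1, cutoff=cutoff)
    return CANONICAL_ENTITIES[matches[0]] if matches else None

print(match_tag("Richard Dreyfus"))  # -> '/en/richard_dreyfuss', despite the misspelling
print(match_tag("car"))              # -> None: a generic tag with no matching entity
```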

The result? When workers were tagging the photos from Magnum Photos, they identified the actors in the shots, and the machine process in the background assigned "semantic tags" to the photos, such as [George Lucas], [Richard Dreyfuss], [Ron Howard], [Mackenzie Phillips], [Harrison Ford] and others.

Yes, humans + machines can generate results that are better than the sum of the parts.

The machine support, cont.

So, how did the workers discover the photos from American Graffiti? As you may imagine, the workers had no idea that the photos they were tagging came from the shooting of the film. They could identify the actors, but that was it.

Going from tagging actors to understanding the context of the photo shoot is a task that cannot be expected of lay, non-expert taggers. You need experts who can "connect the dots". Unfortunately, subject experts are expensive. And they tend not to be interested in tedious tasks, such as assigning tags to photos.

However, this "connecting the dots" is a task where machines are better than humans. We have recently seen how Watson, by having access to semantically connected ontologies (often generated by humans), could identify the correct answers to a wide variety of questions.

Tagasauris employed a similar strategy. Knowing the entities that appear in a set of photos, it is then possible to identify additional metadata. For example, look at the five actors that were identified in the photos (red boxes, with white background), and the associated semantic graph that links the different entities together:


Bingo! The node that connects all these entities together is "American Graffiti", a tag that no worker had entered.

At this point, you can understand how the story evolved. A graph activation/spreading algorithm suggests the tag, experts can verify it, and the rest is history.
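As a rough illustration of the idea (not the actual Tagasauris implementation), a single step of spreading activation over an entity graph can surface the connecting node: activate the entities tagged in a set of photos, push activation to their neighbors, and rank the untagged nodes by accumulated score. The tiny graph below is a made-up fragment.

```python
from collections import defaultdict

# Toy entity graph: edges from tagged people to related works (a tiny made-up fragment).
edges = {
    "George Lucas":       ["American Graffiti", "Star Wars"],
    "Richard Dreyfuss":   ["American Graffiti", "Jaws"],
    "Ron Howard":         ["American Graffiti", "Happy Days"],
    "Mackenzie Phillips": ["American Graffiti"],
    "Harrison Ford":      ["American Graffiti", "Star Wars"],
}

def suggest_connecting_entities(tagged_entities, graph):
    """One step of spreading activation: score neighbors by how many tagged entities link to them."""
    activation = defaultdict(float)
    for entity in tagged_entities:
        for neighbor in graph.get(entity, []):
            activation[neighbor] += 1.0
    return sorted(activation.items(), key=lambda kv: -kv[1])

tags = ["George Lucas", "Richard Dreyfuss", "Ron Howard", "Mackenzie Phillips", "Harrison Ford"]
print(suggest_connecting_entities(tags, edges))
# "American Graffiti" comes out on top (score 5.0), even though no worker ever typed it.
```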

Meagan Young looked at the stream of incoming photos, noticed the American Graffiti tag, realized that the "lost" photos were found, and she notified the others at Magnum Photos and Todd Carter, the CEO of Tagasauris. The "hidden treasure" was identified, and the Wired story was underway...

Crowdsourcing: It is not just about the humans

This is not a story to show how cool discovery based on linked entities is. That is old news for many people who work with such data. It is, however, a simple example of using crowdsourcing in a more intelligent way than it is currently being used. Machines cannot do everything (in fact, they are especially bad at tasks that are "trivial" for humans), but when humans provide enough input, the machines can take it from there and significantly improve the overall process.

One can even see the obvious next step: use face recognition and let tagging be done collaboratively by humans and machines. Google and Facebook have very advanced face-recognition algorithms. Combine them intelligently with human workers, and you are way ahead of solutions that rely solely on humans to tag faces. A sketch of such a workflow follows.
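Here is a minimal, hypothetical sketch of that human+machine loop: the machine proposes a name with a confidence score, high-confidence labels are accepted automatically, and low-confidence faces are routed to crowd workers. The machine_recognize and ask_crowd functions are stubs standing in for a real face-recognition model and a crowdsourcing platform.

```python
CONFIDENCE_THRESHOLD = 0.9  # arbitrary cutoff for auto-accepting machine labels

def machine_recognize(photo):
    # Stub for a real face-recognition model; returns (face_id, proposed_name, confidence).
    return [("face-1", "Harrison Ford", 0.97), ("face-2", "Paul Le Mat", 0.55)]

def ask_crowd(photo, face_id, suggestion):
    # Stub for a crowdsourcing call: workers verify or correct the machine's best guess.
    return suggestion

def tag_faces(photo):
    """Machine proposes labels; low-confidence faces are routed to human workers."""
    tags = []
    for face_id, name, confidence in machine_recognize(photo):
        if confidence >= CONFIDENCE_THRESHOLD:
            tags.append((face_id, name, "machine"))
        else:
            tags.append((face_id, ask_crowd(photo, face_id, suggestion=name), "human"))
    return tags

print(tag_faces("magnum-archive-photo.jpg"))
```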

I think the lesson is clear: Let humans do what they do best, and let machines do what they do best. (And expect the balance to change as we move forward and machines can do more.) Undoing and ignoring decades of research in computer science, just because it is easier to use cheap labor, is a disservice not only to computer science. It is a disservice to the potential of crowdsourcing as well.

Friday, May 21, 2010

Prediction Optimizers

The announcement of the Google Prediction API created quite a lot of discussion about the future of predictive modeling. The reactions were mainly positive, but there was one common concern: the Google Prediction API operates as a black box. You give the data, you train, you get predictions. No selection of the training algorithm, no parameter tuning, no nothing. Black box. Data in, predictions out.

So, the natural question arises: Is it possible to do machine learning like that? Can we trust something if we do not understand the internals of the prediction mechanism?

Declarative query execution and the role of query optimizers

For me, trained as a database researcher, this approach directly corresponds to that of a query optimizer. In relational databases, you upload your data and issue declarative SQL queries to get answers. The internal query optimizer selects how to evaluate the SQL query in order to return the results in the fastest possible manner. Most users of relational databases today do not even know how a query is executed. Is the join executed as a hash join or as a sort-merge join? In which order do we join the tables? How is the GROUP BY aggregation computed? I bet that 99% of the users have no idea, and they do not want to know.

Definitely, knowing how a database works can help. If you know that you will perform mainly lookup queries on a given column, you want to build an index. Or create a histogram with the distribution statistics for another column. Or create a materialized view for frequently executed queries. But even these tasks are increasingly automated today. The database tuning advisor in SQL Server routinely suggests indexes and partitionings for my databases that I would never have thought of building. (And I have a PhD in databases!)

Declarative predictive modeling

I see an exact parallel of this approach in the Google Prediction API. You upload the data and Google selects the most promising model for you. (Or, more likely, they build many models and do some form of meta-learning on top of their predictions.) I would call this a "prediction optimizer", making the parallel with the query optimizer that is built into every relational database system today. My prediction (pun intended) is that such prediction optimizers will be an integral part of every predictive analytics suite in the future.

Someone may argue that it would be better to have the ability to hand-tune some parts. For example, if you know something about the structure of the data, you may pass hints to the "prediction optimizer", indicating that a particular learning strategy is better. This has a direct correspondence in the query optimization world: if you know that a particular execution strategy is better, you can pass a HINT to the optimizer as part of the SQL query.
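To make the idea concrete, here is a minimal sketch of what a "prediction optimizer" might do internally, using scikit-learn purely for illustration (the internals of the Google Prediction API are, of course, not public): try several candidate learners, score them by cross-validation, and keep the best. The candidate list and the dataset are arbitrary choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)  # stand-in for the user's uploaded data

# Candidate learners the "optimizer" considers; a user hint could restrict this dictionary.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "rbf_svm": SVC(kernel="rbf"),
    "random_forest": RandomForestClassifier(n_estimators=200),
}

# Score each candidate by cross-validation and keep the best one.
scores = {name: cross_val_score(model, X, y, cv=5).mean() for name, model in candidates.items()}
best_name = max(scores, key=scores.get)
best_model = candidates[best_name].fit(X, y)

print(scores)
print("selected:", best_name)  # the user only sees predictions; this choice stays hidden
```

A user "hint" would then simply restrict or re-weight the candidate dictionary, much as an SQL HINT restricts the plans the query optimizer considers.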

Can we do better? Yes (but 99% of the people do not care)

The obvious question is, can we do better than relying on a prediction optimizer to build a model? The answer is pretty straightforward: Yes!

In fact, if you build a custom solution tailored to your data and workload, it can be significantly better than any commercial database. Databases carry a lot of extra baggage (e.g., transactions) that is not useful in every application but slows down execution considerably. I will not even get into the discussion of web crawlers, financial trading systems, and so on. However, such solutions come at a cost (time, people, money...). Many people just want to store and manage their data. For them, existing database systems and their behind-the-scenes optimizers are good enough!

Similarly, you can expect to see many people building predictive models "blindly" using a black-box approach, relying on a prediction optimizer to handle the details. If people do not care about hash joins vs. sort-merge joins, I do not think anyone will care whether the prediction came from a support vector machine with a radial basis function, from a random forest, or from a Mechanical Turk worker (yes, I had to put MTurk in the post).

The future

I know that predictions, especially about the future, are hard, but here is my take: We are going to see a market for predictive modeling suites, similar to the market for databases. Multiple vendors will build (or have already built) such suites. In the same way that Oracle, Microsoft, IBM, Teradata, and others compete today for the best SQL engine, we will see competition for turnkey solutions for predictive modeling. You upload the data, and then the engines compete on scalability, speed of training, and the best ROC curve.

Let's see: Upload-train-predict. Waiting for an answer...

Sunday, September 6, 2009

The different attitudes of computer scientists and economists

I was reading Noam Nisan's blog post about the different attitudes of computer scientists and economists. Noam hypothesizes that economists emphasize research on “what is” while computer scientists emphasize “what can be”, and offers the view of an algorithmic game theorist.

I have my own interpretation of this topic, mainly from a data mining point of view.

Economists are interested in suggesting policies (i.e., suggesting to people "what to do"). Therefore, it is important to build models that assign causality. Computer scientists are rarely interested in the issue of causality. Computer scientists control the system (the computer), and algorithms can be directed to perform one way or another. In contrast, economists cannot really control the system that they study. They do not even know how the system behaves.

When a computer scientist proposes an algorithm, the main focus is to examine its performance under different settings of incoming data; how the (computer) system behaves is under control. When an economist suggests a policy, it is highly unclear how the underlying (rational?) agents will behave. Therefore, it is important to figure out what exactly "causes" the behavior of the agents, and what policies can change this behavior.

One area that gets closer to economics in this respect is data mining and machine learning. Get the data, and learn how the underlying system behaves. For example, get data about credit card transactions and learn which of them are fraudulent. However, there is a significant difference in focus: computer scientists are mainly focused on predictive modeling. As long as the system can "predict" the outcome on unseen data, things are fine. A black box with perfect predictive performance is great. Explanatory models are rarely the focus. At best, someone may want to understand the internals of the predictive model, but even if the model can be understood (e.g., by reading the rules or the decision tree), these rules are rarely causal in nature.

Let me give you an example: Suppose that you are trying to predict the price per square foot for houses. As one independent variable (feature), you add the average size of the houses in the area. What will the predictive model find? That places with smaller houses have a higher price per square foot. Unexpected? Not really. Houses in urban areas are typically smaller and more expensive compared to their suburban and rural counterparts.

For a predictive model, this information is absolutely sufficient; the average house size is a valuable feature for predictive purposes. Think, however, what would happen if someone devised policy based on this feature. A house builder would try to build smaller houses in rural areas, hoping that the resulting prices would be higher. Or a politician in Manhattan would encourage the construction of bigger apartments, since the data "show" that if the average house size increases, prices will drop. Absurd? Yes.
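A toy simulation with synthetic, made-up numbers illustrates how this happens: an "urban" factor drives both smaller average house sizes and higher prices per square foot, so a regression on average size alone picks up a negative coefficient that has nothing to do with causation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Synthetic confounder: 1 = urban area, 0 = suburban/rural (all numbers are made up).
urban = rng.integers(0, 2, size=n)

# Urban areas have smaller average houses and higher prices per square foot.
avg_house_size = 1200 - 400 * urban + rng.normal(0, 100, size=n)   # square feet
price_per_sqft = 200 + 300 * urban + rng.normal(0, 30, size=n)     # dollars

# Regress price on average house size alone: the slope comes out strongly negative...
slope, intercept = np.polyfit(avg_house_size, price_per_sqft, 1)
print(f"slope: {slope:.3f} dollars per extra square foot of average size")

# ...but shrinking houses does not cause higher prices; "urban" drives both variables.
```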

Even funnier things can come up if someone uses country-wide data to predict the demand for apartments from apartment prices. The result will show that increasing prices actually increases demand, even though we would expect the opposite. (This is just the effect of prices being higher in places where demand is higher.)

Predictive modeling can survive (or even thrive) by exploiting such strange correlations. A causal model that captures correlations and presents them as causes can wreak havoc.

So, an economist will try to build a model that generates causal relationships. In the case above, a model based on supply and demand is more likely to capture the true "causes" of increased apartment prices. A house builder can see these effects and make a more informed decision about how to build. Similarly for a politician who is trying to encourage the construction of more affordable housing.

Often, causal models are called "structural" in economics. (I am not sure where the term comes from; I have seen a few different interpretations.) They typically start by modeling the micro-behavior of agents and then proceed to explain the behavior of a large system comprising the interactions of such agents. A benefit of such models is that their assumptions are easier to check, test, and challenge. In contrast to "statistical" models, such models tend to generate relationships that are easier to consider "causal".

An advantage of causal models over predictive models is that causal models are valid even if the underlying data distribution changes. Causal models are supposed to be robust, as long as the behavior of the agents remains the same. A predictive model works under the assumption that the "unseen" data follow the same distribution as the "training" data. Change the distribution of the unseen data, and any performance guarantee for the predictive models disappears.

Update 1: This is not an attempt to downgrade the importance of predictive models. Most of the results presented by Google after a query are generated using predictive modeling algorithms. You get recommendations from Amazon and Netflix as the outcome of predictive algorithms. Your inbox remains spam-free due to the existence of the spam filter, again a system built using predictive modeling techniques. It is too hard, if not impossible, to build "causal" models for these applications.

Update 2: An interesting example of a company deriving policy based on their predictive model is American Express. They realized that the feature "customer buys in a 99c store" is correlated with higher delinquency rates. So, AmEx decided to decrease the credit limit for such customers. Of course, the result will be that potentially affected customers will stop visiting such stores, decreasing the value of this policy for AmEx. Furthermore, this action may cause even more economic stress to these customers that are now "forced" to buy from more expensive stores, and this may result in a much higher default rate for AmEx. This "unexpected" outcome is the effect of devising policy based on non-causal variables.

If AmEx had a variable "customer in economic distress", which arguably has a causal effect on default rates, then it would be possible to take this action without customers being able to game the system. However, since AmEx relied on the variable "customer buys in a 99c store", which is merely an outcome of "customer in economic distress", customers can simply change that behavior while still being in economic distress, and the signal loses its value.