Monday, September 28, 2009

Rationality, P=NP, Prediction Markets, and a Paradox

I got the idea for this post after reading Dick Lipton's post on betting on the P=NP problem. The discussion in the comments was extensive, mainly touching on issues of risk aversion, the inability of humans to properly estimate small probabilities, and so on. (Wolfers and Snowberg argue that such mispricing is due to the inability of humans to understand very small probabilities.) The discussion continued on the Overcoming Bias blog, and there one of the comments, tongue-in-cheek as it was, caught my eye:

probabilities must agree with logic on certainly-true and certainly-false statements, which means that the probability of logical truths has to be 1, and logical falsehoods have to be 0.

So, if P=NP is a decidable problem, it is either true or false. A fully rational agent participating in the market should therefore know whether P=NP; it is not a matter of probabilities. All the information needed to make the decision is available. So, if the market has one or more rational players, the price should immediately converge to 0 or 1, depending on the truth of the statement. Right?

So, which of the following is true?

  • There are no rational agents. So, all the analyses of prediction markets that assume trader rationality are incomplete.
  • There are rational agents. The market does not converge to 0 or 1 because the P=?NP problem is undecidable.
  • There are rational agents, but the risk-free return accumulated until the time of settlement exceeds the return from the market. So, the market price mainly conveys information about how long it will take for the problem to be officially resolved. (See the sketch after this list.)
  • If your laptop cannot find the solution, neither can the market.
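
To make the third bullet concrete, here is a back-of-the-envelope sketch (my own illustration, not part of the original discussion): if a rational trader is certain that a contract will eventually pay $1, the most they should pay today is the discounted value of that dollar, so the price reveals an implied settlement horizon rather than a probability. The 5% rate and the $0.70 price below are made-up numbers.

```python
import math

def implied_years_to_settlement(price, risk_free_rate):
    """Years until settlement implied by the price of a contract that
    a trader is certain will pay $1 at settlement.

    Price of a sure $1 paid T years from now, discounted continuously:
        price = exp(-r * T)  =>  T = -ln(price) / r
    """
    return -math.log(price) / risk_free_rate

# Hypothetical numbers: a contract trading at $0.70, 5% risk-free rate.
print(implied_years_to_settlement(0.70, 0.05))  # ~7.1 years
```
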
OK, back to more serious work.

Monday, September 14, 2009

Citation Tracker: Monitoring Citations to your Publications

One of the common pastimes of academics is checking services such as Google Scholar to see how many papers cite our work. Quite often the statistics from Google Scholar, or from other services such as Web of Science, are used to create a citation report for promotion and tenure purposes.

While Google Scholar is extremely valuable for finding papers that cite a particular piece of work, it has some shortcomings, especially when creating a citation report for promotion. First, Google Scholar does not differentiate between peer-reviewed publications (journal, conference, or workshop papers) and other documents (such as tech reports or term papers); so, when preparing a citation report, I have to go over the list of papers, keeping the "legitimate" citations and removing the citations that are not admissible. Second, Google Scholar is sometimes noisy: it lists the same paper twice, or splits citations for the same paper into two different entries; at other times it misses papers that can be found through a web search.

Another feature that I would really like to see is the ability to find the "new" citations for a given paper and to create the appropriate alerts. A simple RSS feed would work wonders, but it is not there.

Of course, Google Scholar also does not monitor the web for other types of documents that may mention a particular paper. PhD seminars, or even blog posts, are things that I would like to keep track of when monitoring who cites my work. Especially for such volatile pages, I typically want to keep copies so that I can retrieve them a few years later, when compiling my promotion packet.

For this reason, over the summer, I created a tool that augments Google Scholar: it monitors citation services (Google Scholar, Libra, CiteSeerX, SSRN) and also monitors the Web (Google, Bing, Ask) for mentions of a paper.

You can access a pre-alpha version at http://www.citation-tracker.com

Some of the features:
  • Import publications from Google Scholar, DBLP, or BibTeX, or add them manually.
  • Review the citations for each paper, and decide which ones to keep, which to discard, and which ones to examine later.
  • Monitor citation services (Google Scholar, Libra, CiteSeerX, SSRN) and see notifications when new citations to your papers appear.
  • Automatically generate a citation report, listing the papers that cite your work.
I have been using the service over the last few weeks and it seems reasonably stable. I import my papers using Google Scholar, "accept" the existing citations, and then wait for the new citations that pop up every now and then. I find it pretty useful for discovering new papers that cite my work.
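
For those curious about the "new citations" part, the underlying idea is a simple set difference between what you have already reviewed and what the latest crawl returns. The sketch below is only my illustration of that idea; the function names and data are hypothetical and not the actual implementation behind citation-tracker.com.

```python
def new_citations(previously_seen, current_results):
    """Return citations that appear in the latest crawl but were not
    seen (accepted or discarded) before. Citations are keyed by a
    normalized (title, year) pair to absorb minor formatting noise.
    """
    def key(citation):
        title, year = citation
        return (" ".join(title.lower().split()), year)

    seen_keys = {key(c) for c in previously_seen}
    return [c for c in current_results if key(c) not in seen_keys]

# Hypothetical data: citations already reviewed vs. the latest crawl.
reviewed = [("Answering queries using views", 2001)]
latest = [("Answering Queries Using Views", 2001),
          ("A new paper citing my work", 2009)]
print(new_citations(reviewed, latest))  # [("A new paper citing my work", 2009)]
```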

Over the last few days I even started importing papers by other researchers that I consider relevant to my work, so that I can see which new papers cite them.

Feel free to log in and play with the system. Needless to say, it is an early release, so I expect to see bugs here and there. If you see any bug, or if you would like to see a new feature, please add a note using the "feedback" tab that is visible on the side of the screen.

Enjoy!

Sunday, September 6, 2009

The different attitudes of computer scientists and economists

I was reading Noam Nisan's blog post about the different attitudes of computer scientists and economists. Noam hypothesizes that economists emphasize research on “what is” while computer scientists emphasize “what can be”, and offers the view of an algorithmic game theorist.

I have my own interpretation of this topic, mainly from a data mining point of view.

Economists are interested in suggesting policies (i.e., suggesting to people "what to do"). Therefore, it is important for them to build models that assign causality. Computer scientists are rarely interested in the issue of causality: they control the system (the computer), and algorithms can be directed to perform one way or another. In contrast, economists cannot really control the system that they study. They do not even know how the system behaves.

When a computer scientist proposes an algorithm, the main focus is to examine the performance of the algorithm under different settings of incoming data; how the (computer) system itself will behave is under control. When an economist suggests a policy, it is highly unclear how the underlying (rational?) agents will behave. Therefore, it is important to figure out what exactly "causes" the behavior of the agents, and what policies can change this behavior.

One area that gets closer to economics in this respect is data mining and machine learning: get the data and learn how the underlying system behaves. For example, get data about credit card transactions and learn which of them are fraudulent. However, there is a significant difference in focus: computer scientists are mainly focused on predictive modelling. As long as the system can "predict" the outcome on unseen data, things are OK; a black box with perfect predictive performance is great. Explanatory models are rarely the focus. At best, someone may want to understand the internals of the predictive model, but even if the model can be understood (e.g., by reading the rules or the decision tree), these rules are rarely causal in nature.

Let me give you an example. Suppose that you are trying to predict the price per square foot for houses. As one independent variable (feature) you add the average house size in the area. What will the predictive model find? That places with smaller houses have a higher price per square foot. Unexpected? Not really: houses in urban areas are typically smaller and more expensive than their suburban and rural counterparts.
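
A quick simulation makes the point; the numbers below are made up purely for illustration. Urban areas have smaller, pricier houses, so a regression of price per square foot on average house size picks up a negative slope, even though shrinking a house does not make it more expensive.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hidden confounder: is the area urban?
urban = rng.random(n) < 0.5

# Urban areas: smaller houses, higher price per square foot.
avg_house_size = np.where(urban, 900, 2200) + rng.normal(0, 150, n)   # sq ft
price_per_sqft = np.where(urban, 800, 250) + rng.normal(0, 50, n)     # $/sq ft

# Least-squares fit: price_per_sqft ~ a + b * avg_house_size
b, a = np.polyfit(avg_house_size, price_per_sqft, 1)
print(f"slope = {b:.2f} $/sqft per extra sqft of average size")  # negative
```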

For a predictive model, this information is absolutely sufficient; the average house size is a valuable feature for predictive purposes. Think, however, what would happen if someone devised policy based on this feature. A house builder would build smaller houses in rural areas, hoping that the resulting prices would be higher. Or a politician in Manhattan would encourage the construction of bigger apartments, since the model has "shown" that if the average house size increases, prices will drop. Absurd? Yes.

Even funnier things come up if someone uses country-wide data to predict demand for apartments from apartment prices. The result will show that increasing prices actually increases demand, even though we would expect the opposite. (This is just the effect of prices being higher in places where demand is higher.)
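
The same kind of toy simulation (again, made-up numbers) shows the price/demand version: cities with stronger demand have both higher prices and more apartments rented, so a naive regression of quantity on price comes out with a positive slope.

```python
import numpy as np

rng = np.random.default_rng(1)
cities = 500

# Demand shocks differ across cities; prices rise where demand is strong.
demand_shock = rng.normal(0, 1, cities)
price = 1000 + 300 * demand_shock + rng.normal(0, 50, cities)            # $/month
apartments_rented = 5000 + 800 * demand_shock + rng.normal(0, 100, cities)

slope, intercept = np.polyfit(price, apartments_rented, 1)
print(f"slope = {slope:.2f} apartments per extra dollar of rent")  # positive
```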

Predictive modeling can survive (or even thrive) by exploiting such strange correlations. A causal model that captures correlations and presents them as causes can wreak havoc.

So, an economist will try to build a model that generates causal relationships. In the case above, a model based on supply and demand is more likely to capture the true "causes" of increased apartment prices. A house builder can see these effects and make a more informed decision on how to build; the same holds for a politician who is trying to encourage the construction of more affordable housing.

Often, causal models are called "structural" in economics [I am not sure where the term comes from; I have seen a few different interpretations]. They typically start by modelling the micro-behavior of agents, and then proceed to explain the behavior of a large system comprising the interactions of such agents. A benefit of such models is that the assumptions are easier to check, test, and challenge. In contrast to "statistical" models, such models tend to generate relationships that are easier to consider "causal".

An advantage of causal models over predictive models is that causal models remain valid even if the underlying data distribution changes. Causal models are supposed to be robust as long as the behavior of the agents remains the same. A predictive model, on the other hand, works under the assumption that the "unseen" data follow the same distribution as the "training" data. Change the distribution of the unseen data, and any performance guarantee for the predictive model disappears.
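
To illustrate that last point, here is a toy example of distribution shift (my own illustration, with invented numbers): a model fit on houses in one size range can look fine on data from the same range and fall apart once the incoming data move to a different range.

```python
import numpy as np

rng = np.random.default_rng(2)

# True relationship is nonlinear, but it looks linear inside the training range.
def true_price(size):
    return 100 * np.sqrt(size)

train_size = rng.uniform(800, 1200, 500)   # training distribution (sq ft)
test_size = rng.uniform(3000, 4000, 500)   # shifted distribution (sq ft)

slope, intercept = np.polyfit(train_size, true_price(train_size), 1)

def mae(size):
    return np.mean(np.abs((slope * size + intercept) - true_price(size)))

print(f"error on data like the training set: {mae(train_size):.1f}")
print(f"error after the distribution shifts:  {mae(test_size):.1f}")  # much larger
```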

Update 1: This is not an attempt to downgrade the importance of predictive models. Most of the results presented by Google after a query are generated using predictive modeling algorithms. You get recommendations from Amazon and Netflix as the outcome of predictive algorithms. Your inbox remains spam-free due to the existence of the spam filter, again a system built using predictive modeling techniques. It is too hard, if not impossible, to build "causal" models for these applications.

Update 2: An interesting example of a company deriving policy from its predictive model is American Express. AmEx realized that the feature "customer buys in a 99c store" is correlated with higher delinquency rates, so it decided to decrease the credit limit for such customers. Of course, the result will be that potentially affected customers will stop visiting such stores, decreasing the value of this policy for AmEx. Furthermore, this action may cause even more economic stress to customers who are now "forced" to buy from more expensive stores, and this may result in a much higher default rate for AmEx. This "unexpected" outcome is the effect of devising policy based on non-causal variables.

If AmEx had a variable "customer in economic distress", which arguably has a causal effect on default rates, then it could take this action without customers being able to game the system. However, since AmEx relied on the variable "customer buys in a 99c store", which is merely an outcome of the variable "customer in economic distress", customers can simply change their shopping behavior while remaining in economic distress, defeating the purpose of the policy.