A Computer Scientist in a Business School

Sunday, October 25, 2009

What is the (Real) Cost of Open Access?

After the transformation of Communications of the ACM, I find myself increasingly interested in the articles that are published in CACM. As expected, one of the common ways to show my interest is by sharing the URL of a paper on Twitter, on Facebook, on the blog, or directly with friends and colleagues. Unfortunately, CACM has a closed-access policy, effectively preventing anyone without an ACM membership or a university account from actually reading the papers. The same holds for papers published in conferences and journals, but there I can typically find the paper on the author's home page. For CACM, this is often not the case.

Needless to say, I hate closed-access policies. While I can understand the shortsightedness of for-profit publishers, I fail to see why ACM has not adopted at least a "semi" Open Access model, making, say, the current issue of Communications of the ACM available to the public, or giving public access to papers published 10 or 20 years ago in its journals and conferences.

The stated goal of the association is to promote the field. By restricting access, ACM simply does not work towards this goal!

The main argument that I hear is that publishing has some costs. But I am really trying to understand what these costs are. What is their magnitude? And who is being paid? Almost like the health-care debate, we are told that something is expensive but we have no idea who ends up getting the money.

Let's examine the potential cost factors:

Printing: I understand that printing on paper has costs. But covering the cost of printing seems easy: Amortize it across the print subscribers. (Or even abolish print versions.)

Servers for distribution: What is the cost of electronically distributing papers? The cost of running a server should not be a concern. In the worst case, NSF should provide funds for that. I find it hard to think that NSF would turn down a request for funding a server that provides open access to scientific journals!

Submission handling: The cost of the submission website? I doubt that it is above $5K per year, per journal. Ask for a nominal submission fee (say $50 per paper) to cover this; at that rate, 100 submissions a year would pay for a $5K website. The cost of the copyeditors? We can do much better without them, thank you. (Seriously, why do we still have copyeditors?)

Admin cost: The only other cost that I can think of is the cost of the admin staff. But how much is it? I honestly have no idea! Is it so high that the ACM member subscriptions cannot cover it? I have been trying to find the budget of ACM but I cannot find anything public.

Are there other hidden costs?

If anyone has pointers or extra information, please let me know. I am really trying to understand the real costs of high-quality electronic publishing.

Sunday, October 11, 2009

When Noise is Your Friend: Smoothed Analysis

Have you ever encountered the phrase "the algorithm has exponential running time in the worst-case scenario, but in practice we observed it to be pretty efficient"? It is the phrase that divides theoreticians and practitioners. Many theoretical computer scientists focus on the analysis of worst-case complexity, often generating results that contradict practice.

For example, the simplex algorithm for linear programming is well known to be pretty efficient in practice. In theory, the worst-case complexity of simplex is exponential, classifying the simplex algorithm as a "non-efficient" algorithm. However, simplex has exponential running time only for very special cases. Most practitioners would even argue that you will never encounter such strange cases in practice. Only an adversary could potentially design such inputs.

Similarly, the Traveling Salesman Problem (TSP) is a hallmark example of an NP-complete problem, i.e., unlikely to have an efficient algorithm anytime soon. However, there are many TSP solvers that can provide almost optimal solutions for pretty big inputs.

K-means is another such algorithm. It has a horrible worst-case scenario, but ask the millions of people that use it for clustering: it is one of the most efficient clustering algorithms, despite its worst-case exponential complexity.

So, how can we reconcile theory and practice?

A very nice approach towards this reconciliation is smoothed analysis. I first learned about this approach for analyzing algorithms by attending the (fascinating) job talk of Jon Kelner. Jon showed that if you perturb the input a little bit before feeding it to the simplex algorithm, then it is almost impossible for the perturbed input to generate an exponential running time. In other words, by adding a little bit of noise to the data, we get a guarantee that we avoid the "tricky" parts of the input space.

What is the beauty of this approach? It explains why in many cases "inefficient" algorithms work well in practice: Most real data contain noise, and this noise can actually be beneficial! The other big lesson is that sometimes an algorithm ends up having a horrible worst-case performance just because of a small number of potential inputs that are almost adversarial. Adding noise may take care of these strange cases.
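To make the idea concrete, here is a small sketch in Python. It does not use simplex: as a stand-in, I use a textbook quicksort that always picks the first element as the pivot, which is known to need a quadratic number of comparisons on already-sorted input. The function names, the input size, and the noise level sigma are my own choices for illustration, and the noise here is deliberately large relative to the gaps between the values, so treat this as a toy experiment rather than a real smoothed analysis.

```python
import random


def quicksort_comparisons(values):
    """Count the comparisons made by quicksort with the first element as pivot."""
    comparisons = 0
    stack = [list(values)]  # explicit stack, to avoid deep recursion on bad inputs
    while stack:
        items = stack.pop()
        if len(items) <= 1:
            continue
        pivot, rest = items[0], items[1:]
        comparisons += len(rest)  # every remaining item is compared to the pivot
        stack.append([x for x in rest if x < pivot])
        stack.append([x for x in rest if x >= pivot])
    return comparisons


def perturb(values, sigma, rng):
    """Add independent Gaussian noise with standard deviation sigma to each value."""
    return [v + rng.gauss(0.0, sigma) for v in values]


def smoothed_cost(values, sigma, trials, rng):
    """Average the comparison count over several noisy copies of one input."""
    total = sum(quicksort_comparisons(perturb(values, sigma, rng))
                for _ in range(trials))
    return total / trials


n = 400
adversarial = list(range(n))  # sorted input: the bad case for this pivot rule
rng = random.Random(0)        # fixed seed (an assumption) for repeatable runs

worst = quicksort_comparisons(adversarial)
smoothed = smoothed_cost(adversarial, sigma=n / 4, trials=20, rng=rng)

print("comparisons on the adversarial input:", worst)
print("average comparisons after adding noise:", smoothed)
```

The adversarial input costs n(n-1)/2 comparisons, while the noisy copies of the very same input cost far fewer on average. That is the flavor of the argument: fix the worst input you can think of, shake it a little, and measure the expected cost.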

The last issue of Communications of the ACM has a great review article by Spielman and Teng on smoothed analysis. It explains the difference between worst-case, average-case, and smoothed analysis, and points to a wide variety of problems that have been analyzed using this technique. Highly recommended!
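For readers who want the three notions side by side, here is a rough summary in my own simplified notation (not necessarily the notation of the article): T(x) is the running time on input x, D is an assumed distribution over inputs, g is random noise (for example Gaussian), and sigma controls the size of the noise.

```latex
\begin{align*}
\text{worst case:}   \quad & \max_{x} \; T(x) \\
\text{average case:} \quad & \mathbb{E}_{x \sim D} \, [\, T(x) \,] \\
\text{smoothed:}     \quad & \max_{x} \; \mathbb{E}_{g} \, [\, T(x + \sigma g) \,]
\end{align*}
```

As sigma goes to zero the smoothed measure falls back to the worst case, and as sigma grows the noise swamps the input and the measure behaves like an average case.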