## Tuesday, December 18, 2007

### Freakonomized!

Just came back from a short trip to England, and I realized that the number of visitors to the blog has gone up tremendously. After looking at the referrals, I spotted the website that sent the traffic:

Apparently Steven D. Levitt and Stephen J. Dubner, the authors of the well-known Freakonomics book and of the homonymous blog at The New York Times, found my post about the efficiency of prediction markets interesting.

Apparently, having a link from a New York Times blog results in a couple of thousand visitors within a day. Given that I was only discussing some very preliminary results in that post, reaching that level of readership so early on is as close as it gets to "instant gratification" in research :-)

Plus, I now feel the urge to write down all the results as soon as possible, following the "authors should write to satisfy their readers" mode of doing research (via Daniel Lemire). Indeed a very good motivation!

## Monday, December 10, 2007

Randall Munroe's interpretation of the article from Reuters "If it's Facebook, it's love":

## Sunday, December 9, 2007

### Political Prediction Markets: Some Thoughts

Apparently, my last postings on the predictability of the political prediction markets generated some interest. For the Giuliani market, we had written that "the analysis is more difficult in this scenario, but for the next few days we see stabilizing signals with a trend to go upwards," and we were proven wrong: the price declined from 43 on Dec 2nd to 39.5 on Dec 9th, an 8% decline. I realized what was wrong in my reasoning: what was stabilizing was the sentiment index, not the price. And a sentiment index stabilized around 50% tends to be a pretty bad adviser on how the market will move.

Bo's comment made me think about the parallels between "prediction market trading" and "stock market trading". As Bo pointed out, in existing stock markets there is a significant amount of algorithmic trading. This algorithmic trading makes the stock market significantly more efficient than, say, in the early 1980s, when programmatic trading was in its infancy. In fact, I have heard many stories from old-timers saying that in the early days it was extremely easy to find inefficiencies in the markets and make healthy profits. As algorithmic trading proliferated, it became increasingly harder to spot inefficiencies in the market.

Something similar can happen today with prediction markets. If we have a prediction market platform that allows automatic/algorithmic trading, then we can tremendously improve the efficiency of today's prediction markets. Furthermore, such a tool (if done with play money) can be a great educational tool, similar to the now inactive Penn-Lehman Automated Trading (PLAT) Project. If we also allow for some data integration from the existing prediction markets (BetFair, Intrade, etc.), we could have a pretty realistic tool that serves many educational purposes and, at the same time, generates useful and efficient prediction markets.

Now, I need to find someone willing to fund the idea. Ah, there are a couple of NSF calls for proposals still open :-)

## Tuesday, December 4, 2007

### By Popular Demand: Mitt Romney

The last post generated some general interest and I got requests to post analysis for more presidential candidates. As one more data point, here is the market for Mitt Romney to be the Republican Presidential Nominee in 2008:

I checked again our sentiment indicator (in maroon), which seems to capture the upward spikes well. (If you look carefully, our indicator spikes upwards before the market.)

This market, similarly to the market for Hillary Clinton, seems to move in cycles. This cyclical behavior can be nicely visualized by plotting the 30-day moving average of our sentiment index (in black). It seems that a downward cycle has started for Romney and should continue for the next couple of weeks, until the 30-day moving average gets close to 0.3 or so. Time will tell :-)
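For reference, the 30-day moving average used for smoothing the sentiment index is just a trailing window; here is a minimal sketch (the daily sentiment values below are made up for illustration):

```python
def moving_average(values, window=30):
    """Trailing moving average; uses shorter windows at the start of the series."""
    averages = []
    for i in range(len(values)):
        window_values = values[max(0, i - window + 1): i + 1]
        averages.append(sum(window_values) / len(window_values))
    return averages

# Made-up daily sentiment values, just for illustration:
daily_sentiment = [0.6, 0.4, 0.7, 0.5, 0.3]
smoothed = moving_average(daily_sentiment, window=3)
```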

## Sunday, December 2, 2007

### Prediction Markets are NOT Efficient

In the past, I have wondered whether prediction markets are efficient. Back then, I was saying:
how long does it take for a prediction market to incorporate all the available information about an event? Liquidity seems to be an issue for the existing prediction markets, preventing them from reaching equilibrium quickly.
In fact, today's prediction markets are far from efficient. Ari Gilder and Kevin Lerman, as part of an undergraduate project at the University of Pennsylvania supervised by Fernando Pereira, have shown that by using linguistic analysis of news articles it is possible to predict future price movements in the Iowa Electronic Markets. Therefore, the Iowa markets did not incorporate all the available information. Furthermore, the results indicated that it is possible to predict the price movement by simply using past pricing data. Therefore, the markets were not even weakly efficient. (Kevin is now a first-year PhD student at Columbia University.)
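As a side note, a crude version of this weak-efficiency check (not the method used in the project above, just an illustration) is to test whether successive price changes are correlated: in a weakly efficient market, the lag-1 autocorrelation of daily price changes should be close to zero.

```python
def lag1_autocorrelation(prices):
    """Lag-1 autocorrelation of daily price changes.
    Values far from zero hint that past moves help predict future ones,
    i.e., evidence against weak-form efficiency."""
    changes = [b - a for a, b in zip(prices, prices[1:])]
    mean = sum(changes) / len(changes)
    deviations = [c - mean for c in changes]
    numerator = sum(a * b for a, b in zip(deviations, deviations[1:]))
    denominator = sum(d * d for d in deviations)
    return numerator / denominator if denominator else 0.0
```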

One question was whether liquidity played a role in that result. The Iowa markets are thinly traded, with an upper limit on how much someone can bet. This imposes artificial constraints, making it difficult for information to flow freely into the market. Therefore, it is important to examine other markets with higher liquidity.

Over the last few months, George Tziralis and I have been discussing this issue, trying to figure out how to evaluate the "Efficient Prediction Market" hypothesis. After long discussions, we came up with some techniques for extracting signals from the news about the prediction markets, to see whether we can use these signals to predict the future performance of markets on InTrade. Our sentiment indicator seems to work pretty well, even in liquid markets. Here is a preliminary result for the market on whether Hillary Clinton will be the Democratic Presidential Nominee in 2008:

Our sentiment index (in maroon) is close to 1 when we predict that the market will move higher, and close to 0 when we predict that the market will move lower. Typically, it works pretty well for predicting long periods of price increases and declines. To put our money where our mouth is: the signal from the last few days shows that Hillary's market price will edge lower in the next few days/weeks.

The market prices for whether Giuliani will be the Republican Presidential Nominee in 2008, together with our sentiment index, are displayed below.

The analysis is more difficult in this scenario, but for the next few days we see stabilizing signals with a trend to go upwards.

We will need to analyze quite a few more markets before writing the paper, but so far the results seem interesting.

Let's see what the future brings :-)

## Saturday, November 24, 2007

### My First Mechanical Turk Paper

A while back, I wrote about my experiences using Amazon Mechanical Turk to conduct experiments that require input from users. I am now happy to announce :-) that the first paper that uses this methodology has been accepted at IEEE ICDE 2008 and is now available online.

The paper, co-authored with Wisam Dakka, discusses a simple empirical technique for automatically extracting from a text database a set of facet hierarchies that are useful for browsing the contents of the database. We used Mechanical Turk in this context to evaluate the "precision" and "recall" of the generated hierarchies. This experiment would have been almost impossible to conduct without using Mechanical Turk, as it would require multiple users reading and annotating thousands of news articles. Using Mechanical Turk, the experiments were done in less than three days.

In the final experiment of the paper, though, where we needed to interview and time the users while they were using the faceted hierarchies to complete various tasks, we resorted to the traditional, lab-based setting. During the summer, however, we only managed to recruit five users who expressed interest in participating. We observed them in the lab while they performed their tasks and recorded their reactions and impressions. (Fortunately, the results were statistically significant.)

Next time, we will attempt to use Mechanical Turk for such "interview+timing" experiments as well. However, I will need to talk more with people who often perform such experiments to see how they would react to such approaches, where the human subjects are completely disconnected from the researcher. Even though simple timing experiments can easily be performed using MTurk, I am a little uncomfortable about the reliability of such experiments.

## Tuesday, November 20, 2007

### GMail Supports IMAP (but forgets to notify the abuse team)

A couple of weeks back, Google announced that GMail supports IMAP, an email protocol that allows synchronization of email across email clients and across platforms. The support of IMAP, together with the option to buy additional disk space on Google, was a great incentive for me to migrate all my email to GMail, and have it available online, accessible and searchable from everywhere.

### Ambiguous First Names and Disambiguation

I was preparing an assignment for my class, trying to introduce students to issues of data quality, and I was using Facebook data for this.

As a simple example, I wanted students to automatically infer the gender of a person given only the first name, since 1/3 of Facebook users do not list their gender. (The "homework motivation" was the need to send letters to customers, deciding whether to open with "Dear Mr." or "Dear Ms." as a greeting.) In general, the task is relatively easy and the majority of names are not ambiguous. However, there is a set of highly ambiguous names for which inference based on the first name is problematic. For your viewing pleasure, here are the most ambiguous first names, together with the confidence that the name belongs to a male:

| First name | Confidence (male) |
|------------|-------------------|
| Ariel | 50.00% |
| Yang | 50.00% |
| Kiran | 50.00% |
| Nikita | 50.00% |
| Casey | 49.30% |
| Min | 46.67% |
| Paris | 53.85% |
| Dorian | 53.85% |
| Adi | 45.45% |
| Kendall | 45.45% |
| Quinn | 54.55% |
| Aubrey | 54.55% |
| Sunny | 44.83% |
| Angel | 55.32% |
| Yan | 41.67% |
| Yi | 41.67% |
| Yu | 58.33% |
| Devon | 59.46% |
| Nana | 40.00% |
| Jin | 38.89% |
| Ji | 38.46% |
| Ming | 61.54% |
| Taylor | 37.80% |
| Rory | 62.50% |
| Carey | 36.36% |
| Sami | 63.64% |
| Robin | 34.55% |
| Ali | 34.45% |
| Jean | 34.09% |
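The confidence scores are just relative frequencies over profiles that do list a gender; a minimal sketch of the inference (the names and counts below are made up for illustration, not the actual Facebook data):

```python
# Hypothetical counts of profiles listing each first name, by gender:
name_counts = {
    "Michael": {"male": 980, "female": 20},
    "Casey":   {"male": 352, "female": 362},
    "Maria":   {"male": 5,   "female": 995},
}

def male_confidence(name):
    """Fraction of profiles with this first name that are male."""
    counts = name_counts[name]
    return counts["male"] / (counts["male"] + counts["female"])

def guess_gender(name, threshold=0.75):
    """Return 'male'/'female' when confident enough, else 'ambiguous'."""
    p = male_confidence(name)
    if p >= threshold:
        return "male"
    if p <= 1 - threshold:
        return "female"
    return "ambiguous"
```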

The next part of the homework, motivated by the ambiguity for some of the first names, asks students to guess the gender of a person based on the other stated preferences on Facebook profiles, regarding movies, books, TV shows and so on.

Based on the analysis of these features, women overwhelmingly favor the books "Something Borrowed," "Flyy Girl," "Good In Bed," "The Other Boleyn Girl," and "Anne of Green Gables," the movie "Dirty Dancing," and dancing as an activity.

On the other hand, characteristics unique to men are movies like "Terminator 2," "Wall Street," "Unforgiven," "The Good, the Bad and the Ugly," and "Seven Samurai"; the book "Moneyball"; sports-related activities (baseball, lifting); and sports-related TV shows (e.g., PTI, SportsCenter, Around the Horn). Another distinguishing feature of men is that they list "women" and "girls" as their interests (and in this case they should perhaps also think about taking some dancing lessons :-)

## Friday, September 7, 2007

### Experiences using Amazon Mechanical Turk

Everyone who has ever done web-related research knows that user studies and user evaluations are among the most frustrating parts of the process. We often need to create annotated data sets, have users look at the results of the proposed systems and evaluate their quality, and so on. Commonly, we rely on "volunteers" for such tasks, i.e., we ask friends and colleagues to participate in the experiment, doing mundane, repetitive, and tedious work. Typically, after a "volunteer" has participated in one experiment, it is extremely difficult to convince them to participate in another. (PhD students are the exception --- they typically have no choice.) The whole process can easily take 2-3 weeks just to gather enough data, and it is unclear how accurate the results gathered from such overused human subjects are.

When I joined Stern, I was glad to find out about the "behavioral lab," which made recruiting users much easier. We just have to specify the target demographics on a web form, and a set of 1000 registered (real) volunteers are notified. This greatly facilitates the process and ensures that the participating users are indeed willing to participate in the experiments. Still, the process is tiring, as someone has to wait in the lab for the users, give directions, pay the participants, and so on.

One interesting alternative is Amazon Mechanical Turk, a service introduced by Amazon in November 2005. Mechanical Turk allows requesters to post a large number of small tasks, and pay a small amount to the person who completes the task. Examples of such tasks include:
• can you see a person in the photo?
• is the document relevant to a query?
• is the review of this product positive or negative?
We started experimenting with Mechanical Turk early in 2007, to complete various tasks. I was truly surprised at the speed with which "Turkers" complete these tasks. Asking users to annotate 150 articles took less than 2 hours. Clearly, much faster than traditional techniques.

One of the problems we faced was uncertainty about the validity of the submitted answers. We had no way to ensure that an answer submitted by a "Turker" was carefully considered and not just a random response.

To address this problem, we decided to get multiple, redundant answers to the same question. To be able to decide about statistical significance, we asked for 5 answers to each question and marked an answer as correct only if at least 4 out of 5 agreed. Furthermore, to discourage users from submitting random responses, we clarified in the instructions that we would pay only for submissions that agreed with the responses submitted by other annotators. This followed the spirit of the ESP game by Luis von Ahn and ensured that the Turkers had the appropriate incentives to submit correct answers. Even though this approach increases the cost, it ensures that the received answers are consistent and the level of noise is low.
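The aggregation rule is easy to implement; a minimal sketch of the 4-out-of-5 agreement check:

```python
from collections import Counter

def aggregate(answers, min_agreement=4):
    """Accept the majority label for a question only if at least
    `min_agreement` of the redundant answers agree; otherwise reject."""
    label, count = Counter(answers).most_common(1)[0]
    return label if count >= min_agreement else None

accepted = aggregate(["yes", "yes", "yes", "yes", "no"])   # "yes"
rejected = aggregate(["yes", "no", "yes", "no", "yes"])    # None (only 3/5 agree)
```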

Overall, our experience with Mechanical Turk has been very positive. The interface is very clean, easy to program, and the answers come back very quickly. It is not uncommon to send thousands of requests in the evening and have all the results ready by the following morning.

Now, let's see if the reviewers will like the results :-)

Update (Nov 24, 2007): Our first paper that uses Amazon Mechanical Turk for conducting the experimental evaluation has been accepted at IEEE ICDE 2008 and is now available online. Hopefully, more will follow soon.

## Wednesday, August 29, 2007

### Experimental Repeatability Requirements for SIGMOD 2008

This is a very interesting change for SIGMOD 2008. Starting this year, there is a requirement for authors to submit data sets and code for replicating the experiments. For details, see:

http://www.sigmod08.org/sigmod_research.shtml

I will be very interested to see how much of a burden this will be, how many times people will cite the use of proprietary software, and so on. I expect some small problems this year, but hopefully we will quickly get used to this mode of operation and start writing papers with reproducibility of experiments in mind.

In any case, this is a very interesting development and I expect many conferences to follow.

## Monday, August 13, 2007

### Only in Greece...

I was having a drink with some friends in Greece, listening to some light music at a summer bar. Suddenly, the music stopped and the DJ announced: "The police are checking drivers for blood alcohol content, and they are waiting next to the hotel Xenia. Customers are advised to take the alternate route...".

A few minutes later, a new announcement informed the customers of the location of a second police roadblock, advising them to either take a taxi or use the breathalyzer available in the bar's restroom (yes, there was a coin-operated breathalyzer in the restroom!), so that they could "drive home safely."

If only all this creativity was used for something useful...

## Wednesday, July 25, 2007

### Being Accurate when having an Agenda

I was reading a report a few days back, and one of the sentences was:

"we observed that 50% of the women get salaries lower than the median salary of men"

Just think of alternative formulations of the same sentence. I will refrain from commenting further.

## Thursday, July 12, 2007

### Evaluating Information Extraction using xROC Curves

Suppose that you have an information extraction system that behaves like a black box. The box takes as input a document and gives as output a set of relation tuples. For example, the information extraction system may process newspaper articles and identify mergers and acquisitions (Company1 bought Company2), management succession events (Person1 succeeds Person2 in the CEO position), and so on.

As expected, such systems are inherently noisy and generate imperfect output. Sometimes they miss tuples that appear in the documents and sometimes they generate spurious tuples. One of the important questions is how to evaluate such a system objectively and with the minimum amount of effort.

A common evaluation strategy is to use precision-recall curves to show how the system behaves under different settings. The precision of the output is defined as the number of correct tuples in the output over the total number of generated tuples; recall is defined as the number of correct tuples in the output over the total number of correct tuples that can be extracted from the documents.
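Treating the system output and the gold standard as sets of tuples, the two metrics are straightforward to compute; a minimal sketch (the acquisition tuples below are made up for illustration):

```python
def precision_recall(extracted, correct):
    """Precision and recall over sets of extracted relation tuples."""
    extracted, correct = set(extracted), set(correct)
    true_positives = extracted & correct
    precision = len(true_positives) / len(extracted) if extracted else 0.0
    recall = len(true_positives) / len(correct) if correct else 0.0
    return precision, recall

# Made-up acquisition tuples, just for illustration:
gold = {("Google", "YouTube"), ("Oracle", "PeopleSoft"), ("eBay", "Skype")}
output = {("Google", "YouTube"), ("Oracle", "PeopleSoft"), ("Reuters", "NYSE")}
p, r = precision_recall(output, gold)   # precision 2/3, recall 2/3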

Unfortunately, precision is problematic due to its dependence on the input class distribution, as the following example illustrates:

• Example: Consider an extraction system E that generates a table of companies and their headquarters locations, Headquarters(Company, Location), from news articles in the "Business" and "Sports" sections of The New York Times. The "Business" documents contain many tuples for the target relation, while the "Sports" documents contain none. The information extraction system works well but occasionally extracts spurious tuples from some documents, independently of their topic. If the test set contains a large number of "Sports" documents, then the extraction system will also generate a large number of incorrect tuples from these "bad" documents, bringing down the precision of the output. In fact, the more "Sports" documents in the test set, the worse the reported precision, even though the underlying extraction system remains the same. Notice, though, that recall is not affected by the document distribution in the test set and remains constant, independently of the number of "Sports" documents.

The fact that precision depends on the distribution of good and bad documents in the test set is well known in machine learning, from the task of classifier evaluation. To evaluate classifiers, it is preferable to use ROC curves, which are independent of the class distribution in the test set. ROC curves summarize graphically the tradeoffs between the different types of errors. When characterizing a binary decision process with ROC curves, we plot the true positive rate (the fraction of true positives correctly classified as positive, i.e., recall) as the ordinate, and the false positive rate (the fraction of true negatives incorrectly classified as positive) as the abscissa.
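For a single operating point, the two rates are computed as follows; a minimal sketch assuming boolean predictions and ground-truth labels:

```python
def roc_point(predicted, actual):
    """True/false positive rates for one classifier threshold.
    `predicted` and `actual` are parallel lists of booleans."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    positives = sum(actual)
    negatives = len(actual) - positives
    tpr = tp / positives if positives else 0.0
    fpr = fp / negatives if negatives else 0.0
    return fpr, tpr
```

Sweeping the classifier's threshold and plotting the resulting (fpr, tpr) pairs traces out the ROC curve.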

The standard application of ROC curves for information extraction is unfortunately problematic, for two reasons.

First reason: We typically do not know what a "true negative" is. Unlike document classification, a "bad tuple" does not exist a priori in a document; it only exists because the extraction system can extract it.
• Solution 1: One way to overcome this problem is to measure the number of all bad tuples that can be extracted from a document using all possible settings and all available extraction systems. Then, we can use this number as the normalizing factor to define the false positive rate. This solution works when dealing with a static set of extraction systems. Alas, the definition of false positive rate becomes unstable if we later introduce another system (or another setting) that generates previously unseen noisy tuples; this changes the number of all bad tuples, which serves as the normalizing constant, and forces recomputation of all false positive rates.
• Solution 2: Another way to avoid this problem is by having an un-normalized x-axis (abscissa). Instead of the false positive rate, we can plot the average number of bad tuples generated. In this case, the new curve is called the "Free Response Operating Characteristic" (FROC) curve. Such techniques are widely used in radiology to evaluate the performance of systems that detect nodules in MRI and CAT scans. (A nodule refers to a small aggregation of cells, indicative of a disease.) A problem with this approach is the lack of a "probabilistic" interpretation of the x-axis; the probabilistic interpretation can be convenient when analyzing/integrating the extraction system as part of a bigger system, rather than simply measuring its performance in a vacuum.
Second reason: It is too expensive to know all the "true positives". You may have noticed that recall is the y-axis in the ROC/FROC curves. Unfortunately, computing recall is a rather expensive task. If we want to be correct, we need to have annotators read each document and infer which tuples can be extracted from it; the number of all good tuples will be used as the normalizing constant to compute the recall of each extraction system. This is an expensive task, especially compared to the computation of precision or of the false positive rate. To count the false positives, we only need to evaluate the extracted tuples, a presumably much easier task than reading a lengthy document.
• Solution 1: We can play the same trick as above, to avoid the problem of reading/annotating the documents. We process each document multiple times, using all possible settings and all possible extraction systems. The union of the extracted tuples can be then validated to identify the set of all correct tuples. As in the case of true negatives, though, the definition becomes unstable if we have a dynamic set of extraction systems that can identify more good tuples at some point in the future, forcing a re-calculation of recall metrics for all systems.
• Solution 2: We can also have an un-normalized y-axis. For instance, we can have as the ordinate (y-axis) the average number of good tuples extracted per document. (I have not seen anything like the FROC curves that leaves the true positive rate unnormalized, though.) The downside is that by leaving recall unnormalized, the values now depend on the distribution of good and bad documents in the input: the more bad documents with no good tuples in the test set, the lower the unnormalized value will be. Therefore, this definition seems to go against the spirit of ROC curves.
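Putting the pieces together, a FROC-style operating point (un-normalized x-axis, as in Solution 2 for the first problem; recall on the y-axis, assuming we somehow know the set of all good tuples) can be sketched like this:

```python
def froc_point(per_doc_extractions, gold_tuples):
    """FROC-style operating point for an extraction system:
    x = average number of spurious tuples per document,
    y = recall over the set of all good tuples.
    `per_doc_extractions` is a non-empty list of sets of tuples,
    one set per document; `gold_tuples` is the set of all good tuples."""
    total_docs = len(per_doc_extractions)
    extracted = set().union(*per_doc_extractions)
    bad = sum(len(doc - gold_tuples) for doc in per_doc_extractions)
    recall = len(extracted & gold_tuples) / len(gold_tuples)
    return bad / total_docs, recall
```

Sweeping the system's extraction threshold and plotting these points yields the FROC curve.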
I am not aware of any alternative definitions of the ROC curves that do not have the problems of the solutions described above. If you have any ideas or pointers, post them in the comments.

## Monday, July 9, 2007

### Blending Mind Maps with Wikipedia

Following up on the topic of how to use wikis to organize the research literature on a given topic.

I have been thinking about the idea of using wikis to keep track of the state of the art in any field. One way to organize the literature is to transform an existing survey article into a wiki and let people edit it at will (e.g., see this article on duplicate record detection). Another alternative is to keep a set of tables in a wiki that summarize the results of a paper in a single line (e.g., see here and here for NLP-related tasks).

A problem with both approaches is the lack of a network that allows people to mark and annotate the relations across different papers. The idea of using CMapTools (http://cmap.ihmc.us/) was interesting, but I could not manage to install the software to see how it really works.

Today, I ran into WikiMindMap, a web-based service that uses the wiki structure to convert Wikipedia articles into a mind map. The heuristics it uses are rather basic, but one could use this tool to organize the literature in a wiki (gaining the advantage of collaboration) while still having all the advantages of a visual tool that shows connections across entities. See, for example, the wikified article mentioned above, transformed into a mind map.

## Friday, July 6, 2007

### Data Cleaning Service

Often, I get questions from colleagues and students on how to match "dirty" data, i.e., string database entries that are similar but not exactly identical. Although there is a very significant amount of research literature, there are not many software packages available for free. For this reason, I decided to implement the technique that we described in our WWW2003 paper and make it available on the web for anyone to use:

http://hyperion.stern.nyu.edu/SMWebSite/Default.aspx

So, if you have two lists of names that need to be matched, you can try this service. It works well for small datasets, with a few thousand entries in each. I am pretty sure it can scale to much bigger datasets, but it will take some time for the service to finish the job. One of the neat features of the service is that you can submit a job, keep the id of the submitted job, and come back to retrieve the results later.
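For readers who just want a quick local baseline before trying the service (to be clear, this is not the technique from our WWW2003 paper, just a naive edit-distance sketch), matching two small lists of names can look like this:

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,        # deletion
                               current[j - 1] + 1,     # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]

def best_matches(list_a, list_b, max_distance=2):
    """For each name in list_a, the closest name in list_b, if close enough."""
    matches = {}
    for a in list_a:
        best = min(list_b, key=lambda b: edit_distance(a.lower(), b.lower()))
        if edit_distance(a.lower(), best.lower()) <= max_distance:
            matches[a] = best
    return matches
```

This brute-force approach is quadratic in the list sizes, which is exactly why scalable techniques (like the one behind the service) are needed for bigger datasets.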

If you have any ideas, or any bug reports, feel free to contact me.

### Wikipedia-based Web Services

We have launched a beta version of some web services that use Wikipedia data for named entity extraction, for term identification, and for document expansion. You can take a look and try the services at:

http://wikinet.stern.nyu.edu/Wikinet/Wikinet.aspx

We are actively working on this, so if you see a bug or some unexpected behavior, feel free to drop me an email.

## Monday, June 25, 2007

### Brain Activity and Economic Judgment?

Just attended the keynote talk of Tom Mitchell at ACL 2007 in Prague. He described how the brain reacts when seeing a word or a picture associated with different concepts. For example, different parts of the brain are activated when seeing a hammer (the word or a picture) than when seeing an apartment. Different parts of the brain seem to process different concepts, so there is evidence that the "meaning" of a word corresponds to the activation of particular neurons.

I am wondering if similar things happen when someone is trying to judge the economic value of something. How do people act when seeing a bargain compared to seeing something that they perceive as expensive? Is there a similar activation pattern in the brain?

Update (July 2): I was reading the latest issue of Scientific American and found an answer to the question above. Apparently, there is a related paper, published in the January 26th issue of Science with the title "The Neural Basis of Loss Aversion in Decision Making under Risk," that discusses which parts of the brain are activated when someone needs to make a decision that will result in a future gain or loss. An interesting paper to read.

## Monday, June 18, 2007

### Are Publishers Making Themselves Useless?

No matter how nicely I try to format my camera-ready papers, occasionally I will get an email like this from the publisher:
1. On page 2, top of column 1, you have a widow (last line of a paragraph at the top of a column), this is a bad break, please tighten the previous column or force a line over to eliminate this bad break.
2. On page 2, bottom of column 1, there is an orphan, please tighten the previous material or force the head over to the next column. (The use of \vfill\eject placed before the head command will force text to the next page or column).
3. Please make the body of the references section flush left/ragged right (not justified). This will eliminate letter and word spacing of references with urls (web addresses).
4. On the last page, please use \vfill\eject before reference 19 to move more over to column two (2) and produce a more balanced last page. (This command is great for forcing material over to the next page.) Be sure that you have run bibtex, then cut and paste the references from your bbl file into your tex file.
I cannot understand why the author has to deal with all these typesetting issues if there is a publisher involved in the process. Isn't typesetting something that the publisher does? Given that I have submitted my LaTeX sources and all the files needed to compile the paper, it is trivially easy to fix all these issues.

In the pre-digital age, a publisher would assign an editor to each accepted paper who would read the manuscript, fix grammatical and syntactic errors, typeset the article, and prepare a nice, readable version from the manuscript, which was either written by hand or typed on a typewriter. Today, it seems that publishers want to push all the typesetting work to the authors, minimizing the cost of production as much as possible. Makes sense (economically).

However, as publishers push more and more of their responsibilities to the authors, they increasingly make themselves irrelevant. There is no reason to pay a publisher to prepare proceedings or even journals when the publisher does not even want to take care of the most esoteric publishing-related issues. In fact, there is no reason for a publisher to exist at all. What is the added value that a publisher offers? Open access and the web have already made a significant part of the publisher's role obsolete; if publishers voluntarily make typesetting the task of their suppliers (the authors), then they voluntarily make themselves a useless part of the workflow, ripe to be dropped completely in the near future.

## Wednesday, June 13, 2007

### Are Prediction Markets Efficient?

I was reading Fred Wilson's post about the information efficiency of the venture capital market.

While reading the posting, I started wondering whether prediction markets are efficient, according to the definition of Eugene Fama. In other words, how long does it take for a prediction market to incorporate all the available information about an event? Liquidity seems to be an issue for the existing prediction markets, preventing them from reaching equilibrium quickly. But if we had enough liquidity, how long would it take for humans to "agree" on prices that incorporate all the available information about an event?

## Wednesday, June 6, 2007

### Playing with Wikipedia

I was working with Wisam Dakka on a Wikipedia project, and I was puzzled by some Wikipedia entries that had really long titles.

The first one that I noticed was a term with 163 characters: "Krungthepmahanakornamornratanakosinmahintarayujaniwesmahasatharnamornphimarnavatarnsathitsakkattiyavi", which redirects to Bangkok. I do not know if this is a prank or a valid entry. (Update: It is a correct entry, according to the talk page of the entry.)

Then, I noticed another term with 255 characters: "Wolfeschlegelsteinhausenbergerdorffvoralternwarengewissenhaftschaferswessenschafewarenwohlgepflegeundsorgfaltigkeitbeschutzenvonangreifendurchihrraubgierigfeindewelchevoralternzwolftausendjahresvorandieerscheinenwanderersteerdemenschderraumschiffgebrauchl", which in fact is a valid term; the 255 characters are simply a shortcut for the 580-character entry :-)

Finally, there is a term with 182 characters, "matosilphioparaomelitokatakechymenokichlepikossyphophattoperisteralektryonoptekephalliokigklopeleiolagoiosiraiobaphetraganopterygon," that has Greek roots, and I will let you click to find out its exact meaning.

Also, these entries seem to trigger some buggy behavior on Google. If you do a web search for the above terms, you will find no web page with these words. However, Google returns a set of product matches, none of which is correct.

The joy of large-scale data processing!
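As a toy sketch of the kind of check that surfaces such entries, here is how one might scan a list of article titles for the longest ones. (The sample titles below are hypothetical stand-ins; a real run would read a Wikipedia titles dump, one title per line.)

```python
def longest_titles(titles, top=3):
    """Return the `top` longest titles, longest first."""
    cleaned = [t.strip() for t in titles if t.strip()]
    return sorted(cleaned, key=len, reverse=True)[:top]

# Hypothetical sample; in practice, read these from a titles dump file.
sample = ["Bangkok", "Llanfairpwllgwyngyll", "A", "New York City"]
longest = longest_titles(sample, top=2)
```

Running this over the full list of article titles immediately brings the 163- and 255-character outliers to the top.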

## Friday, June 1, 2007

### Uses and Abuses of Student Ratings

I recently revisited an old posting from Tomorrow's Professor mailing list about "Uses and Abuses of Student Ratings". It is an excerpt from the book "Evaluating Faculty Performance, A Practical Guide to Assessing Teaching, Research, and Service" and lists a set of common problems in the use of student ratings for evaluating the teaching performance of a faculty member. I enjoyed (re-)reading the whole list, but I particularly liked these three items:
• Abuse 1: Overreliance on Student Ratings in the Evaluation of Teaching ...
• Abuse 2: Making Too Much of Too Little ... Is there really a difference between student ratings averages of 4.0 and 4.1? ....To avoid the error of cutting a log with a razor, student ratings results should be categorized into three to five groups ... Utilizing more than three to five groups will almost certainly exceed the measurement sophistication of the instrument being used.
• Abuse 5: Using the Instrument (or the Data Collected) Inappropriately ... While we have 20 items on our ratings form ... only #7 really matters for making personnel decisions.
I am confident that if we read an academic paper that analyzed the results of a questionnaire in the same way that we currently analyze student ratings, the paper would be considered naive (no statistical significance of the findings, no control variables, use of a single instrument for evaluation, and so on). Unfortunately, it does not seem likely that we will ever apply the same rigor when analyzing student ratings forms.
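To see why a 4.0 vs. 4.1 difference means little, here is a minimal sketch of the significance test such an analysis would require, using made-up ratings for two hypothetical instructors (the data and sample sizes are my own illustration, not from the book):

```python
import math

def welch_t(a, b):
    """Welch's t-statistic for two independent samples."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

# Hypothetical 1-5 ratings, 30 students per class, averaging 4.0 and 4.1.
a = [3, 4, 5, 4, 4, 3, 5, 4, 4, 4] * 3  # mean 4.0
b = [4, 4, 5, 4, 4, 3, 5, 4, 4, 4] * 3  # mean 4.1
t = welch_t(a, b)
```

With typical rating variances, |t| comes out well below the usual ~2.0 threshold for significance at the 5% level, so the 0.1-point gap is statistical noise, exactly the "cutting a log with a razor" error the excerpt warns about.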

## Sunday, May 13, 2007

### Impact Factors and Other Metrics for Faculty Evaluations

A few months back, as a department, we had to submit to the school a list of the "A" journals in our field. The ultimate reason for this request was to have a list of prime venues for each field, and thus to facilitate the task of the promotion & tenure committees, which include researchers with little (or no) knowledge of the candidate's field.

Generating such a list can be a difficult task, especially if we try to keep the list small, and it is directly connected with the problem of ranking journals. There are many metrics that can be used for such a ranking, and the most commonly used one is the "impact factor," proposed by Eugene Garfield. The impact factor ranks the "impact" of the research published in each journal by counting the average number of citations received over the last year by the articles the journal published in the two preceding years. The basic idea is that a journal with a large number of recent incoming citations (from last year) that point to relatively recent articles publishes important, timely research. The choice of the time window makes comparisons across fields difficult, but within a single field the impact factor is generally a good metric for ranking journals.
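The arithmetic behind the definition is a simple ratio. For concreteness, here it is with made-up numbers (the counts are hypothetical, purely for illustration):

```python
# A journal's 2006 impact factor: citations received in 2006 to its
# 2004-2005 articles, divided by the number of citable items it
# published in 2004-2005. All numbers below are invented.
citations_in_2006_to_recent_items = 150
citable_items_2004_2005 = 120

impact_factor_2006 = citations_in_2006_to_recent_items / citable_items_2004_2005
```

So a journal whose two-year-old articles attract 150 citations across 120 citable items gets an impact factor of 1.25.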

Garfield, most probably expecting this outcome, explicitly warned that the impact factor should not be used to judge the quality of the research of an individual scientist. The simplest reason is that the incoming citations to the papers published in the same journal follow a power law: a few papers receive a large number of citations, while many others get only a few. Copying from a related editorial in Nature: "we have analysed the citations of individual papers in Nature and found that 89% of last year’s figure was generated by just 25% of our papers. [...snip...] Only 50 out of the roughly 1,800 citable items published in those two years received more than 100 citations in 2004. The great majority of our papers received fewer than 20 citations." So, the impact factor is a pretty bad metric for examining the quality of an individual article, even if the article was published in an "A" journal.
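The skew is easy to see on toy data. The sketch below computes what share of a journal's citations goes to its top quartile of papers; the citation counts are invented, heavily skewed numbers, not Nature's actual data:

```python
def top_quartile_share(citations):
    """Fraction of all citations received by the top 25% most-cited papers."""
    counts = sorted(citations, reverse=True)
    k = max(1, len(counts) // 4)
    total = sum(counts)
    return sum(counts[:k]) / total if total else 0.0

# Made-up citation counts for eight papers in one journal.
cites = [120, 40, 10, 5, 2, 1, 1, 0]
share = top_quartile_share(cites)
```

With a distribution this skewed, the top two papers alone account for roughly 90% of the citations, while the journal-level average (the impact factor) tells you almost nothing about the typical paper.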

The impact factor (and other journal-ranking metrics) was devised to be used as a guideline for librarians allocating subscription resources, and as a rough metric for guiding scientists who are trying to decide which journals to follow. Unfortunately, such metrics have been mistakenly used as convenient measures for summarily evaluating the quality of someone's research ("if it is published in a journal with a high impact factor, it is a good paper; if not, it is a bad paper"). While the impact factor can (?) serve as a prior, using it alone corresponds to a Naive Bayes classifier that does not examine any feature of the classified object before making the classification decision.

For the task of evaluating the work of an individual scientist, other metrics appear to be better. For example, the much-discussed h-index and its variants seem to be getting traction. (I will write a separate post on this subject.) Are such metrics useful? Perhaps. However, no metric, no matter how carefully crafted, can substitute for a careful evaluation of someone's research. These metrics are only useful as auxiliary statistics, and I do hope that they are being used as such.
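For readers unfamiliar with it, the h-index is simple to state and compute: a scientist has index h if h of their papers have at least h citations each. A minimal sketch (the sample citation counts are hypothetical):

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    counts = sorted(citations, reverse=True)
    h = 0
    for rank, c in enumerate(counts, start=1):
        if c >= rank:
            h = rank  # the `rank`-th most-cited paper still has >= rank cites
        else:
            break
    return h
```

Note how it sidesteps the impact-factor problem above: one highly cited outlier (or one "A" venue) barely moves it, since the index depends on a sustained body of cited work.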

## Friday, May 11, 2007

### Names (or Parents?) Make a Difference

I just read an article about a study showing that a girl's name can be used to predict whether she will study math or physics after the age of 16. The study, done by David Figlio, a professor of economics at the University of Florida, indicated that girls with "very feminine" names, such as Isabella, Anna, and Elizabeth, are less likely to study the hard sciences than girls with names like Grace or Alex.

Myself, I find it hard to understand how someone can estimate the "femininity" of a name, but it might be just me. Even if there is such a scale, though, I do not see any causality in the finding, as implied in the article. (I see predictive power, but no causality.) In my own interpretation, parents who choose "very feminine" names also try to steer their daughters towards more "feminine" careers. I cannot believe that names by themselves set a prior probability on the career path of a child. (The Freakonomics book had a similar discussion about names and success.)

Oh well, how you can lie with statistics...

## Tuesday, May 8, 2007

### Replacing Survey Articles with Wikis?

Earlier this year, together with Ahmed Elmagarmid and Vassilios Verykios, we published a survey article at IEEE TKDE on duplicate record detection (also known as record linkage, deduplication, and with many other names).

Although I see this paper as a good effort in organizing the literature in the field, I will be the first to recognize that the paper is incomplete. We tried our best to include every research effort that we identified, and the reviewers helped a lot in this respect. However, I am confident that there are still many nice papers that we missed.

Furthermore, since the paper was accepted for publication, many more papers have been published, and many more will be published in the future. So, the useful half-life of (any?) such survey is necessarily short.

How can we make such papers more relevant and more resistant to deprecation? One solution that I am experimenting with is to make the survey article a wiki, and then post it to Wikipedia, allowing other researchers to add their own papers in the survey.

I am not sure, though, whether Wikipedia is the best option, due to licensing issues. A personal wiki may be a better option, but I do not have a good grasp of the pros and cons of each approach. One benefit of Wikipedia is the existence of nice templates for handling citations. One disadvantage is Wikipedia's copyright license, which may discourage (or prevent) people from posting material there.

Furthermore, it is not clear that a wikified document is the best way to organize a survey. A few days back, I got a (forwarded) email from Foster Provost, who was seeking my opinion on the best way to organize an annotated bibliography. (Dragomir Radev had a similar question.) Is a wiki the best option? Or is it by construction too flat? Should we use some other type of software that allows people to generate explicit, annotated connections between the different papers? (Any public tool?)

Any ideas?