Showing posts with label evaluation.

Thursday, October 17, 2013

Badges and the Lake Wobegon effect


For those not familiar with the term, the Lake Wobegon effect describes the situation where all or nearly all members of a group claim to be above average. It is named after the fictional town where "all the women are strong, all the men are good looking, and all the children are above average."

Interestingly enough, as Wikipedia notes, this tendency of the majority of a group to believe they are performing above average "has been observed among drivers, CEOs, hedge fund managers, presidents, coaches, radio show hosts, late night comedians, stock market analysts, college students, parents, and state education officials, among others."

So, a natural question was whether this effect also appears in an online labor setting. We took some data from an online certification company, similar to Smarterer, where people take tests to show how well they know a particular skill (e.g., Excel, audio editing, etc.). The tests are not pass/fail but work more like a GRE/SAT score: there is no "passing" score, only a percentile indicator showing what percentage of other participants have a lower score.

Interestingly enough, we noticed a Lake Wobegon effect there as well: most of the workers who displayed the badge of achievement had scores above average, providing yet another data point for the Lake Wobegon effect.

Of course, this does not mean that all users who took the test performed above average. Test takers can choose to make their final score public or to keep it private. Given that the user's profile is also used on a site where employers look for potential hires, there is an element of strategic choice in whether the test score is visible: having a low score is often worse than having no score at all.

So, we wanted to see what scores make users comfortable with their performance and incentivize them to display their badge of achievement. Marios analyzed the data and compared the distribution of scores for workers who kept their score private against that of workers who made their performance public. Here is the outcome:


It becomes clear that scores below 50% are rarely posted, while scores above 60% have significantly higher odds of being posted online for the world to see. This becomes even clearer if we look at the log-odds of a worker deciding to make the score public, given the achieved percentile:


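For readers who want to reproduce this kind of analysis on their own data, here is a minimal sketch of the log-odds computation, assuming a simple table with one row per test taker, a percentile column, and a made_public flag (the column names and the random placeholder values are hypothetical, not the actual data):

```python
import numpy as np
import pandas as pd

# Hypothetical input: one row per test taker, with the achieved percentile (0-100)
# and whether the score was made public. Replace with real data.
df = pd.DataFrame({
    "percentile": np.random.uniform(0, 100, 5000),
    "made_public": np.random.rand(5000) < 0.5,  # placeholder values
})

# Bin percentiles into 10-point buckets; per bucket, compute the empirical
# probability of displaying the badge and the corresponding log-odds.
df["bucket"] = pd.cut(df["percentile"], bins=range(0, 101, 10))
summary = df.groupby("bucket", observed=True)["made_public"].mean().rename("p_public").to_frame()
summary["log_odds"] = np.log(summary["p_public"] / (1 - summary["p_public"]))
print(summary)
```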
So, in the world of online labor, if you ever hire someone who chose to display a certification, chances are good that you picked a worker who is better than average, at least on the test. (We have some other results on the predictive power of tests in terms of work performance, but this is a topic that cannot fit into the margins of this blog post :-)

Needless to say, this effect points to a direction that can take crowdsourcing, and labor markets in general, out of the race-to-the-bottom, market-for-lemons style of pricing, where only price separates the various workers. Just as educational history serves as a signal of a potential employee's quality in the offline world, we are going to see more and more globally recognized certifications replacing educational history for many online workers.

Friday, July 22, 2011

A tale about parking

The media attention to my prior blog post was really not something that I enjoyed, not so much for the attention itself but because it focused on exactly the wrong issues. That post was NOT about me and my evaluation; that was not the main point. I thought that the salary issue was worth mentioning (apparently, it was not), but it was, indeed, a MINOR part of the issue.

In fact, after reflecting on this point, I realized the following: even if I had received a $1M bonus from NYU for my efforts, the basic problem would still be there: the teaching experience would degenerate into a witch hunt, focusing on cheating instead of on learning. And yes, I would still have written the same blog post even if I were fully satisfied with my annual evaluation. In fact, the post had been sitting in my folder of drafts for a few months, long before I received my annual evaluation.

If you want a parallel, consider this hypothetical story:



A tale about parking

Suppose that you live in a city with a huge traffic problem, and a resulting huge parking problem. Too many cars on the street.

People try to find parking and they drive around, and drive around. A lot. Some drivers get frustrated and double park. Some are stupid enough to double park during rush hour, block the traffic, and leave the car unattended. As expected, the police arrive and ticket the offender, sometimes towing the car as well. During quiet hours, however, when there is no traffic, many drivers double park without blocking anything, and nobody gives them a ticket.

Suddenly, in one neighborhood only, call it Redwich Village, a lone policeman starts assigning tickets for every parking violation. No matter if it is minor or major. No matter if the driver just stepped out, or if it is the first time that the driver double parked. Zero-tolerance policy.

By being more vigilant, our lone policeman issues 10 times more tickets than before. He also loses countless hours fighting with the offenders. This continuous fighting also annoys some other residents of the neighborhood, who want the policeman to focus on policing the neighborhood, not on spending all his time giving parking tickets.

But even our lone policeman gets frustrated: he realizes that he did not become a policeman to give parking tickets. While it is part of his duties, he feels it is simply better not to be so aggressive. His boss also gets a report that many neighborhood residents are annoyed, and he knows that the complaints are due to the zero-tolerance policy on parking tickets. Still, he says that he would like our lone policeman to continue this idiosyncratic zero-tolerance policy, enforced by nobody else, and to remain as diligent with his other duties as before.

Our lone policeman reflects on the overall experience and realizes that he is fighting a losing battle: as the number of cars in the city increases, more and more people will park illegally.

So, our lone policeman suggests that we need to do something more fundamental about the parking problem: people could carpool, ride bicycles, use mass transit, or simply walk. And he asks people to think of more such alternatives. If there are fewer cars in the city, the problem will be resolved.

He describes all his thoughts in a long blog post titled "Why I will never give parking tickets again." He explains the futility of parking tickets as a way to fight the underlying problem, and vows never again to be so vigilant about them: he will be exactly as vigilant as all the other policemen, which is as vigilant as he was before.

His blog post goes viral. The media pick up fragments; everyone reads whatever they want to read. Some headlines:
  • "Parking tickets in Redwich Village increase by 1000%. Is it impossible to park your car in Redwich?"
  • "Parking-related violations skyrocket in Redwich Village. Policeman punished for enforcing the rules."
  • "RedWich Village sucks. Only scumbags live in RedWich Village, what did you expect? Any lawful behavior?"
  • "Stupid city residents: We know that all people that live in cities are cheaters and park illegally"
  • "Why the government does not reward this honest policeman?"
  • "Why this policeman is vowing not to obey the law? Oh the society..."
Now, some of the business owners in Redwich Village are annoyed because people may not drive to Redwich if they think it is impossible to find parking. Some residents are also annoyed because real estate prices may go down if people believe that Redwich is a place where you cannot park your car. After all, it is all a matter of reputation.

And in this brouhaha, nobody pays any attention to the underlying problem. Is increased vigilance the solution to the parking problem? Should we give more tickets? Should we install cameras? Or should we follow the suggestions of our lone policeman and think of other ways to reduce traffic, and thereby resolve the parking problem at a more fundamental level?

The blog post of our lone policeman is neither about the policeman nor about Redwich. It is about the fact that there is too much traffic in the whole city, which in turn causes the parking problem. Parking scarcity is the symptom, not the real problem. And while he wrote about the traffic problem and suggested solutions, 99% of the coverage was about Redwich and about his own evaluation.



This is exactly how the discussion about cheating evolved in the media. Instead of focusing on how to make student evaluation objective and cheating-proof, the discussion focused on whether my salary went up sufficiently or not. That is not the main point; on reflection, it is not even a minor point. The real question is how we can best evaluate our students, and which evaluation strategies are robust to cheating, encourage creativity, and measure true learning.

And this is not a discussion that can be had while screaming.

Sunday, July 17, 2011

Why I will never pursue cheating again

The post is temporarily removed. I will restore it after ensuring that there are no legal liabilities for myself or my employer.

Until then, you can read my commentary in my new blog post: A tale about parking.

The discussion on Hacker News was good as well. Also see the response that I posted at the Business Insider website and the coverage at Inside Higher Education.

Monday, May 25, 2009

Evaluation Feedback and Stakhanovist Research Profiles

Every year, after the Spring semester, we receive a report with our annual evaluation, together with feedback and advice for career improvement (some written, some verbal). Part of the feedback that I received this year:
  1. You get too many best paper awards, and you do not have that many journal papers. You may want to write more journal papers instead of spending so much time polishing the conference papers that you send out.
  2. You are a member of too many program committees. You may consider reviewing less and write more journal papers instead.
I guess that having a Stakhanovist (*) research profile (see the corresponding ACM articles) is a virtue after all.

(*) Alexey Stakhanov was a miner in the Soviet Union who cut 102 tons of coal with a pneumatic drill during a six-hour shift, at a time when the average output was 6-7 tons. Stakhanov's record spawned the Stakhanovite movement, in which workers were encouraged to exceed production targets, typically by trying to break previous production records.

Friday, May 23, 2008

The (Statistical) Significance of the Impact Factor

Being in the middle of my tenure track, I cannot avoid running into the different ways that people use to evaluate research. One of the most common ways to evaluate papers (at least at a very high level) is to look at the impact factor of the journal and classify the paper as "published in a top journal," "published in an OK journal," or "published in a B-class journal." I have argued in the past that this is a problematic practice, and an article published in Nature provides the evidence. To summarize the reasoning: articles published within the same journal have widely different citation counts, so using the average is simply misleading.

I think that the best example that I have heard that illustrates the problem of reporting averages of highly-skewed distributions is from Paul Krugman's book "The Conscience of a Liberal":
...Bill Gates walks into a bar, the average wealth of the bar's clientele soars...
This is exactly what happens when we evaluate papers using the impact factor of the journal in which they appeared. This introduces two problems:
  • If you evaluate a paper using the impact factor of the journal, the evaluation is almost always a significant overestimate or a significant underestimate of the paper's "impact" (assuming that citations measure "impact"). Read the analysis below for an illustrative example.
  • The impact factor itself is a very brittle metric, as it is heavily influenced by a few outliers. If the in-journal citation distribution is indeed a power law, then the impact factor itself is a useless metric.
To make this clearer, I will use as an example ACM Transactions on Information Systems (TOIS). The journal has a rather impressive impact factor for a computer science journal, with an increasing trend.
Now, let's dissect the 5.059 impact factor for 2006. The impact factor is the number of citations generated in 2006 pointing to papers published in 2005 and 2004, divided by the total number of articles published in those two years. According to ISI Web of Knowledge, we have:
2006 Impact Factor

Cites in 2006 to articles published in:
2005 = 25
2004 = 147
Sum: 172

Number of articles published in:
2005 = 15
2004 = 19
Sum: 34

Calculation: 172/34 = 5.059
Now, let's break down these numbers by publication. Looking at the number of citations per publication, we can see that a single paper, "Evaluating collaborative filtering recommender systems" by Herlocker et al., received almost 30 citations in 2006. Taking this single publication out, the impact factor drops to 4.3.

In fact, if we also exclude the papers published in the Special Issue on Recommender Systems (January 2004), the impact factor drops even further, to roughly 2.5. At the same time, the impact factor of the special-issue papers alone is much higher, closer to 15.0 or so.
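As a small illustration of how sensitive the metric is to a single paper, here is a minimal sketch using the 2006 TOIS numbers above (only the Herlocker et al. citation count is broken out; the special-issue breakdown is not reproduced here):

```python
def impact_factor(total_citations, total_articles):
    """Citations in year Y to articles published in years Y-1 and Y-2, over the article count."""
    return total_citations / total_articles

# 2006 TOIS numbers from ISI Web of Knowledge (see the calculation above).
print(round(impact_factor(172, 34), 3))           # 5.059, the reported impact factor

# Remove the single Herlocker et al. paper (~30 citations in 2006, 1 article):
print(round(impact_factor(172 - 30, 34 - 1), 3))  # ~4.3: one paper moves the metric substantially
```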

Given the unusually high impact of that special issue, we can expect the 2007 impact factor for TOIS to decrease substantially. It would not be surprising to see the 2007 impact factor return to pre-2003 levels.

This simple example illustrates that the impact factor rarely represents the "average" paper published in the journal. There are papers that are significantly stronger than the impact factor suggests, and papers that are significantly weaker. (Implication: authors who use the impact factor of a journal as a representative metric of the quality of their research are using a metric that is almost never representative.)

Therefore, a set of other metrics may be preferable. The obvious choices are to use the median instead of the average, and to report the Gini coefficient for the papers published in the journal; the Gini coefficient shows how representative the impact factor actually is. The next step is to examine the distribution of the number of citations within each journal. Is it a power law, or an exponential? (I was not able to locate an appropriate reference.) Having these answers can lead to better analysis and easier comparisons.
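Here is a minimal sketch of the kind of summary I have in mind, computed over hypothetical per-paper citation counts for a single journal (the counts are made up for illustration):

```python
import numpy as np

def gini(values):
    """Gini coefficient of a non-negative sample (0 = perfectly even, close to 1 = dominated by a few)."""
    x = np.sort(np.asarray(values, dtype=float))
    n = len(x)
    if x.sum() == 0:
        return 0.0
    cum = np.cumsum(x)
    return (n + 1 - 2 * (cum / cum[-1]).sum()) / n

# Hypothetical per-paper citation counts over the journal's two-year window.
citations = [0, 0, 1, 1, 2, 2, 3, 4, 5, 8, 12, 30]

print("impact factor (mean):", round(np.mean(citations), 2))  # 5.67, pulled up by one outlier
print("median citations:    ", np.median(citations))          # 2.5, the "typical" paper
print("Gini coefficient:    ", round(gini(citations), 2))     # high value = mean is unrepresentative
```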

Tuesday, March 4, 2008

Course Evaluations and Prediction Markets: The Results

A couple of weeks back, I described my attempt to use prediction markets to predict my final course evaluation. The final result of the market was:

Final contract prices when the market closed:
  • $49.95
  • $36.86
  • $4.60
  • $4.34
  • $4.23


Taking the weighted average of these predictions, the market shows a predicted outcome of:

$\frac{49.95 \cdot 6.25 + 36.86 \cdot 6.75 + 4.60 \cdot 5.25 + 4.34 \cdot 5.75 + 4.23 \cdot 3.0}{49.95 + 36.86 + 4.60 + 4.34 + 4.23} \approx 6.227$

And what was the final course evaluation? Did the market work? Well, the final course evaluation was 6.212, based on 35 student ratings. A relative error of 0.002, or 0.2%. I cannot imagine getting a more accurate prediction!
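For completeness, here is a minimal sketch of the price-weighted computation, using the contract prices listed above and the range midpoints from the formula (small differences from the reported 6.227 come from rounding of the listed prices):

```python
# Closing contract prices and the midpoints of the evaluation ranges they covered.
prices    = [49.95, 36.86, 4.60, 4.34, 4.23]
midpoints = [6.25, 6.75, 5.25, 5.75, 3.0]

# Price-weighted average of the range midpoints = the market's point prediction.
predicted = sum(p * m for p, m in zip(prices, midpoints)) / sum(prices)
actual = 6.212  # final course evaluation

print(f"market prediction: {predicted:.3f}")
print(f"relative error:    {abs(predicted - actual) / actual:.2%}")
```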

Interestingly enough, by observing the market, I can see that very few people actually picked the 6.0-6.5 range. Most of the players bought contracts in the 6.5-7.0 range. These contracts played their role of counterbalancing the few players who bought contracts in the 1.0-5.0 and 5.0-6.0 ranges. So, while most of the action was in contracts that did not predict the correct range, this activity was crucial for the market to balance and produce the correct prediction.

Thursday, July 12, 2007

Evaluating Information Extraction using xROC Curves

Suppose that you have an information extraction system that behaves like a black box. The box takes a document as input and outputs a set of relation tuples. For example, the information extraction system may process newspaper articles and identify mergers and acquisitions (Company1 bought Company2), management succession events (Person1 succeeds Person2 in the CEO position), and so on.

As expected, such systems are inherently noisy and generate imperfect output. Sometimes they miss tuples that appear in the documents and sometimes they generate spurious tuples. One of the important questions is how to evaluate such a system objectively and with the minimum amount of effort.

A common evaluation strategy is to use precision-recall curves to show how the system behaves under different settings. The precision of the output is defined as the number of correct tuples in the output over the total number of generated tuples; recall is defined as the number of correct tuples in the output over the total number of correct tuples that can be extracted from the documents.

Unfortunately, precision is problematic due to its dependence on the input class distribution, as the following example illustrates:

  • Example: Consider an extraction system E that generates a table of companies and their headquarters locations, Headquarters(Company, Location), from news articles in the "Business" and "Sports" sections of The New York Times. The "Business" documents contain many tuples for the target relation, while the "Sports" documents contain none. The information extraction system works well, but occasionally extracts spurious tuples from some documents, independently of their topic. If the test set contains a large number of "Sports" documents, then the extraction system will also generate a large number of incorrect tuples from these "bad" documents, bringing down the precision of the output. In fact, the more "Sports" documents in the test set, the worse the reported precision, even though the underlying extraction system remains the same. Notice, though, that recall is not affected by the document distribution in the test set and remains constant, independently of the number of "Sports" documents (the small simulation below illustrates this).
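Here is a minimal simulation of this effect; all the numbers (tuples per "Business" document, extraction rate, spurious-tuple probability) are made up for illustration:

```python
import random

random.seed(0)

def evaluate(num_business, num_sports, good_per_business=5,
             extraction_rate=0.8, spurious_prob=0.3):
    """Simulate extraction over a test set and return (precision, recall)."""
    true_positives = false_positives = total_good = 0
    for _ in range(num_business):
        total_good += good_per_business
        # The system finds each correct tuple with probability extraction_rate...
        true_positives += sum(random.random() < extraction_rate
                              for _ in range(good_per_business))
        # ...and occasionally emits a spurious tuple, regardless of topic.
        false_positives += random.random() < spurious_prob
    for _ in range(num_sports):
        # "Sports" documents contain no correct tuples, only the occasional spurious one.
        false_positives += random.random() < spurious_prob
    extracted = true_positives + false_positives
    return true_positives / extracted, true_positives / total_good

# Same extractor, same "Business" documents; only the number of "Sports" documents grows.
for sports_docs in (0, 100, 1000):
    precision, recall = evaluate(num_business=100, num_sports=sports_docs)
    print(f"sports docs: {sports_docs:5d}  precision: {precision:.2f}  recall: {recall:.2f}")
```

Precision degrades as the share of "Sports" documents grows, while recall stays (in expectation) the same.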

The fact that precision depends on the distribution of good and bad documents in the test set is well known in machine learning from the task of classifier evaluation. To evaluate classifiers, it is preferable to use ROC curves, which are independent of the class distribution in the test set. ROC curves summarize graphically the tradeoffs between the different types of errors. When characterizing a binary decision process with ROC curves, we plot the true positive rate (the fraction of true positives correctly classified as positive, i.e., recall) on the ordinate, and the false positive rate (the fraction of true negatives incorrectly classified as positive) on the abscissa.

The standard application of ROC curves for information extraction is unfortunately problematic, for two reasons.

First reason: We typically do not know what a "true negative" is. Unlike document classification, a "bad tuple" does not exist a priori in a document; it only exists because the extraction system can extract it.
  • Solution 1: One way to overcome this problem is to measure the number of all bad tuples that can be extracted from a document using all possible settings and all available extraction systems. Then, we can use this number as the normalizing factor to define the false positive rate. This solution works when dealing with a static set of extraction systems. Alas, the definition of the false positive rate becomes unstable if we later introduce another system (or another setting) that generates previously unseen noisy tuples; this changes the number of all bad tuples, which serves as the normalizing constant, and forces a recomputation of all false positive rates.
  • Solution 2: Another way to avoid this problem is to leave the x-axis (abscissa) un-normalized. Instead of the false positive rate, we can plot the average number of bad tuples generated per document. In this case, the new curve is called a "Free Response Operating Characteristic" (FROC) curve. Such techniques are widely used in radiology to evaluate systems that detect nodules in MRI and CAT scans. (A nodule is a small aggregation of cells, indicative of a disease.) A problem with this approach is the lack of a "probabilistic" interpretation of the x-axis; the probabilistic interpretation can be convenient when analyzing or integrating the extraction system as part of a bigger system, rather than simply measuring its performance in a vacuum.
Second reason: It is too expensive to know all the "true positives." You may have noticed that recall is the y-axis in the ROC/FROC curves. Unfortunately, computing recall is a rather expensive task. To do it correctly, we need annotators to read each document and infer which tuples can be extracted from it; the number of all good tuples is then used as the normalizing constant to compute the recall of each extraction system. This is expensive, especially compared to the computation of precision or of the false positive rate: to count false positives, we only need to evaluate the extracted tuples, a presumably much easier task than reading a lengthy document.
  • Solution 1: We can play the same trick as above to avoid reading and annotating the documents. We process each document multiple times, using all possible settings and all available extraction systems; the union of the extracted tuples can then be validated to identify the set of all correct tuples. As with true negatives, though, the definition becomes unstable if a system added later identifies more good tuples, forcing a recalculation of the recall metrics for all systems.
  • Solution 2: We can also leave the y-axis un-normalized. For instance, we can use as the ordinate the average number of good tuples extracted per document. (I have not seen an FROC-like curve that leaves the true positive rate unnormalized, though.) The downside is that, by leaving recall unnormalized, the values now depend on the distribution of good and bad documents in the input: the more bad documents with no good tuples in the test set, the lower the unnormalized value will be. Therefore, this definition seems to go against the spirit of ROC curves. (A small sketch of these curve points follows the list.)
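Here is a minimal sketch of how such curve points could be computed; the per-document data structure, confidence scores, and gold counts are hypothetical and stand in for real annotations. The x-axis is the FROC-style average number of bad tuples per document, and for the y-axis the sketch reports both normalized recall and the un-normalized average number of good tuples per document:

```python
# Hypothetical per-document extraction output: each extracted tuple carries a
# confidence score and a manual correct/incorrect label; "gold" is the number of
# correct tuples the document actually contains (needed only for normalized recall).
docs = [
    {"extracted": [(0.9, True), (0.7, True), (0.4, False)], "gold": 3},
    {"extracted": [(0.8, True), (0.3, False)],              "gold": 2},
    {"extracted": [(0.6, False)],                           "gold": 0},  # a "Sports"-like document
]

def curve_point(docs, threshold):
    """Return (avg bad tuples/doc, recall, avg good tuples/doc) at a confidence threshold."""
    tp = fp = 0
    total_gold = sum(d["gold"] for d in docs)
    for d in docs:
        for score, is_correct in d["extracted"]:
            if score >= threshold:
                tp += is_correct
                fp += not is_correct
    return fp / len(docs), tp / total_gold, tp / len(docs)

# Sweep the confidence threshold to trace the curve.
for t in (0.9, 0.7, 0.5, 0.3):
    avg_bad, recall, avg_good = curve_point(docs, t)
    print(f"threshold {t:.1f}: avg bad/doc = {avg_bad:.2f}, "
          f"recall = {recall:.2f}, avg good/doc = {avg_good:.2f}")
```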
I am not aware of any alternative definitions of ROC curves that avoid the problems of the solutions described above. If you have any ideas or pointers, post them in the comments.

Friday, June 1, 2007

Uses and Abuses of Student Ratings

I recently revisited an old posting from the Tomorrow's Professor mailing list about "Uses and Abuses of Student Ratings." It is an excerpt from the book "Evaluating Faculty Performance: A Practical Guide to Assessing Teaching, Research, and Service" and lists a set of common problems in the use of student ratings for evaluating the teaching performance of a faculty member. I enjoyed (re-)reading the whole list, but I particularly liked these three items:
  • Abuse 1: Overreliance on Student Ratings in the Evaluation of Teaching ...
  • Abuse 2: Making Too Much of Too Little ... Is there really a difference between student ratings averages of 4.0 and 4.1? ....To avoid the error of cutting a log with a razor, student ratings results should be categorized into three to five groups ... Utilizing more than three to five groups will almost certainly exceed the measurement sophistication of the instrument being used.
  • Abuse 5: Using the Instrument (or the Data Collected) Inappropriately ... While we have 20 items on our ratings form ... only #7 really matters for making personnel decisions.
I am confident that if we read an academic paper that analyzed the results of a questionnaire in the same way that we currently analyze student ratings, the paper would be considered naive (no statistical significance of findings, no control variables, use of a single instrument for evaluation, and so on). Unfortunately, it does not seem likely that we will ever apply the same rigor when analyzing student ratings forms.

Sunday, May 13, 2007

Impact Factors and Other Metrics for Faculty Evaluations

A few months back, as a department we had to submit to the school a list of the "A" journals in our field. The ultimate reason for this request was to have a list of prime venues for each field, and thus facilitate the task of promotion and tenure committees, which include researchers with little (or no) knowledge of the candidate's field.

Generating such a list can be a difficult task, especially if we try to keep the list small, and it is directly connected to the problem of ranking journals. There are many metrics that can be used for such a ranking, and the most commonly used one is the "impact factor," proposed by Eugene Garfield. The impact factor measures the "impact" of the research published in each journal by counting the average number of citations, received over the last year, to the 2- and 3-year-old articles published in the journal. The basic idea is that a journal with many recent incoming citations (from the last year) pointing to relatively recent articles (2- and 3-year-old papers) is publishing topics of current importance. The choice of the time window makes comparisons across fields difficult, but within a single field the impact factor is generally a good metric for ranking journals.

Garfield, most probably expecting this outcome, explicitly warned that the impact factor should not be used to judge the quality of the research of an individual scientist. The simplest reason is that the incoming citations to the papers published in the same journal follow a power law: a few papers receive a large number of citations, while many others get only a few. Quoting a related editorial in Nature: "we have analysed the citations of individual papers in Nature and found that 89% of last year's figure was generated by just 25% of our papers. [...snip...] Only 50 out of the roughly 1,800 citable items published in those two years received more than 100 citations in 2004. The great majority of our papers received fewer than 20 citations." So, the impact factor is a pretty bad metric for examining the quality of an individual article, even if the article was published in an "A" journal.

The impact factor (and other journal-ranking metrics) was devised to serve as a guideline for librarians allocating subscription resources, and as a rough metric to guide scientists deciding which journals to follow. Unfortunately, such metrics have been mistakenly used as convenient measures for summarily evaluating the quality of someone's research ("if published in a journal with a high impact factor, it is a good paper; if not, it is a bad paper"). While the impact factor can (?) serve as a prior, using it this way amounts to a Naive Bayes classifier that does not examine any feature of the classified object before making the classification decision.

For the task of evaluating the work of an individual scientist, other metrics appear to be better. For example, the much-discussed h-index and its variants seem to get traction. (I will generate a separate post for this subject.) Are such metrics useful? Perhaps. However, no metric, no matter how carefully crafted can substitute careful evaluation of someone's research. These metrics are only useful as auxiliary statistics, and I do hope that they are being used like that.