Wednesday, September 15, 2010

Worker Evaluation in Crowdsourcing: Gold Data or Multiple Workers?

Evaluating the quality of workers in crowdsourcing environments is a long-standing problem.

The two most common approaches for dealing with the problem are:
  • Use "gold" data: Give to the workers questions for which we already know the answers and see how the workers are performing. (Crowdflower is using this approach.)
  • Use multiple workers: Give the same question to multiple workers and then use a latent class model à la Dawid & Skene to estimate the quality of the workers.
The nice thing about the two approaches is that they can be seamlessly combined into a unified algorithm.

What was not clear to me, though, was the relative importance of the two. How much better and faster can we estimate the quality of the workers if we use gold data? What is the rate of improving the quality estimation as we add more gold data? Generating high-quality gold data is an expensive process, so having good answers to these questions is important.

Being a professor, the moment these questions came up, I knew what I had to do: Ask a PhD student to give me the answers! (Thanks Jing!) So, I am just the messenger here; Jing did all the work and the analysis. You know where to send your thanks.

The assumptions: We have examples belonging to 2 categories. The examples are equally distributed in the two categories (i.e., 50% in each). We created a set of workers with their quality picked randomly and uniformly from the range (55% correct) to (100% correct), for an average quality of ~77%. The workers assigned (noisy) labels to the examples, with an accuracy directly proportional to their quality.
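For readers who want to reproduce a setup like this, here is a minimal sketch of the simulation. It is not Jing's code. It assumes that "quality" is simply the probability that a worker gives the correct label, and the function name, the number of examples, and the number of workers are made up for illustration.

```python
import random

def simulate_labels(n_examples=1000, n_workers=50, workers_per_example=3, seed=0):
    """Generate noisy binary labels under the assumptions described above."""
    rng = random.Random(seed)
    # Two categories, split exactly 50/50, in random order.
    truth = [i % 2 for i in range(n_examples)]
    rng.shuffle(truth)
    # Assumption: quality = probability of a correct label, uniform in [0.55, 1.0].
    quality = [rng.uniform(0.55, 1.0) for _ in range(n_workers)]
    labels = []  # tuples of (example_id, worker_id, assigned_label)
    for ex, true_class in enumerate(truth):
        # Each example goes to a random subset of distinct workers.
        for w in rng.sample(range(n_workers), workers_per_example):
            correct = rng.random() < quality[w]
            labels.append((ex, w, true_class if correct else 1 - true_class))
    return truth, quality, labels

truth, quality, labels = simulate_labels()
mean_quality = sum(quality) / len(quality)
print(f"{len(labels)} labels, mean worker quality {mean_quality:.3f}")
```

The expected mean of a uniform draw between 55% and 100% is 77.5%, which matches the ~77% average quoted above.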

We examined the performance of the Dawid & Skene algorithm, which we modified to take into consideration the existence of gold data. We measured two things:
  • Classification error: How well the algorithm estimates the correct class of the examples
  • Quality estimation error: How well the algorithm estimates the quality of the workers
We experimented with having 0% gold examples in the data, 25% gold, 50% gold, and 75% gold.
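To make the "Dawid & Skene with gold data" idea concrete, here is a rough sketch of one plausible way to do it: run the usual EM iterations, but clamp the class of every gold example to its known value. This is an assumption about how gold data can be folded in, not necessarily the exact modification we used. The function name, the add-one smoothing, and the small synthetic demo are all illustrative choices.

```python
import math
import random

def dawid_skene(labels, n_examples, n_workers, gold=None, n_iter=50):
    """Binary EM sketch. labels: list of (example, worker, label in {0, 1}).
    gold: dict mapping example -> known class. Returns (posteriors, confusion)."""
    gold = gold or {}
    by_example = [[] for _ in range(n_examples)]
    for ex, w, lab in labels:
        by_example[ex].append((w, lab))

    # Initialise P(class = 1) from the majority vote; gold examples are clamped.
    post = []
    for ex in range(n_examples):
        if ex in gold:
            post.append(float(gold[ex]))
        else:
            votes = [lab for _, lab in by_example[ex]]
            post.append(sum(votes) / len(votes) if votes else 0.5)

    conf = []
    for _ in range(n_iter):
        # M-step: class prior and per-worker confusion matrices (add-one smoothing).
        prior1 = (sum(post) + 1.0) / (n_examples + 2.0)
        counts = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(n_workers)]  # [worker][true][given]
        for ex in range(n_examples):
            for w, lab in by_example[ex]:
                counts[w][1][lab] += post[ex]
                counts[w][0][lab] += 1.0 - post[ex]
        conf = [[[row[0] / sum(row), row[1] / sum(row)] for row in c] for c in counts]

        # E-step: update the class posterior of every non-gold example.
        for ex in range(n_examples):
            if ex in gold:
                continue  # gold examples keep their known class
            log1, log0 = math.log(prior1), math.log(1.0 - prior1)
            for w, lab in by_example[ex]:
                log1 += math.log(conf[w][1][lab])
                log0 += math.log(conf[w][0][lab])
            diff = max(min(log0 - log1, 700.0), -700.0)  # guard against overflow
            post[ex] = 1.0 / (1.0 + math.exp(diff))
    return post, conf

# Small synthetic demo: 3 workers per example, 25% of the examples are gold.
rng = random.Random(1)
n_examples, n_workers = 400, 20
truth = [rng.randint(0, 1) for _ in range(n_examples)]
quality = [rng.uniform(0.55, 1.0) for _ in range(n_workers)]
labels = []
for ex in range(n_examples):
    for w in rng.sample(range(n_workers), 3):
        lab = truth[ex] if rng.random() < quality[w] else 1 - truth[ex]
        labels.append((ex, w, lab))

gold = {ex: truth[ex] for ex in range(n_examples // 4)}
post, conf = dawid_skene(labels, n_examples, n_workers, gold)

# Classification error: fraction of examples assigned to the wrong class.
predicted = [int(p > 0.5) for p in post]
class_error = sum(p != t for p, t in zip(predicted, truth)) / n_examples

# Quality estimation error: mean absolute gap between true and estimated accuracy.
est_quality = [0.5 * (c[0][0] + c[1][1]) for c in conf]
quality_error = sum(abs(e - q) for e, q in zip(est_quality, quality)) / n_workers
print(f"classification error {class_error:.3f}, quality error {quality_error:.3f}")
```

Passing `gold=None` gives the completely unsupervised version, so the same function can be used to compare the 0% gold setting against the others.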

Results on classification error

So, we measured the accuracy of estimating the correct class of each example. The results are listed below:

[Figures: classification error with 3, 5, and 10 workers per example]

One immediate observation is that the value of having gold data is limited when we have a significant number of workers per example. With 10 workers per example, it makes minimal difference whether we have gold data or not. Even with just 5 workers, the additional value of gold data is small.

Gold data makes sense when we have only a small number of workers per example. (Not an uncommon case!)

An interesting observation, though, is that we can achieve the same effect by simply forcing workers to work on more examples. Once a worker has given us 30 answers, the completely unsupervised algorithm can work almost as well as the algorithm that uses 75% gold data. This holds even when we have just 3 workers per example.

Of course, in an environment like Mechanical Turk, forcing workers to work on a large number of HITs may not be feasible. But we can always bundle multiple questions in a single HIT, achieving the same result.

OK, so gold data do not seem to be very useful for getting better accuracy in class estimation.

But, it should help in estimating the quality of workers, right?

Results on worker quality estimation

For the quality estimation, we also calculated the error when having 100% gold examples. (This is the lower bound for the estimation error, of course.)
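Why is the error not zero even with 100% gold? Because each worker answers only a finite number of questions, so the observed accuracy is a noisy estimate of the true quality. A minimal sketch of that baseline, assuming independent answers (the function name and the 24-out-of-30 example are made up for illustration):

```python
import math

def gold_quality_estimate(answers, gold):
    """Estimate one worker's quality from answers on gold questions.
    answers, gold: equal-length lists of labels."""
    n = len(answers)
    correct = sum(a == g for a, g in zip(answers, gold))
    q_hat = correct / n
    # Binomial standard error: shrinks with the square root of n.
    std_err = math.sqrt(q_hat * (1.0 - q_hat) / n)
    return q_hat, std_err

# Hypothetical worker: 30 gold questions, 24 answered correctly.
gold = [0, 1] * 15
answers = list(gold)
for i in range(6):
    answers[i] = 1 - gold[i]  # flip the first six answers to make them wrong

q_hat, std_err = gold_quality_estimate(answers, gold)
print(f"estimated quality {q_hat:.2f} +/- {std_err:.3f}")
```

The only way to push this error down is to collect more answers per worker, which is the same lever that helps the unsupervised algorithm.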

[Figures: worker quality estimation error with 3, 5, and 10 workers per example]

As expected, with 10 workers per example, we gain almost nothing in terms of quality estimation when we use gold data.

With 3-5 workers per example, having gold data improves the quality estimation in a rather consistent manner. Again, though, we observe that if we have each worker completing a large enough number of assignments, we can get most of the benefits of having gold data, without actually having gold data.

Why gold data then?

So, given the results above, someone would ask: Why do we even need gold data? The unsupervised approach seems to work very well!

In reality there are a few reasons for which we may still need to have gold data:
  • Imbalanced datasets: When we have very imbalanced data sets, the estimation becomes more challenging. For imbalanced data, we need to quickly and preemptively test workers using data from all categories, rather than waiting for the occasional object from the minority category to appear. To give an example: if we monitor a security camera trying to detect the presence of people in prohibited areas, we want to ensure that the workers will be tested early on with images that have people in them. Otherwise it may take a long time to get a reliable estimate of their ability to correctly classify examples from the minority class.
  • Very low quality of workers: When workers have very low quality, we need more workers per example and more labels per worker to replicate the results above. In this case, having gold data allows us to quickly get rid of the workers that do not meet quality standards. This is very useful in high-noise environments like the "unprotected" Amazon Mechanical Turk marketplace (by unprotected I mean without using qualification tests or other quality assurance mechanisms).
  • Giving confidence to non-technical people: If you say that you test the workers with known examples, everyone understands the process. If you say that you rely on agreement between workers, or on latent class models ("what?") and on expectation maximization or Bayesian estimation ("come again?"), most people will start feeling uncomfortable. Everyone understands random tests; not everyone is willing to let unsupervised methods direct the quality assurance process. So, even if the gold examples do not help much, they are a very reassuring factor for people who just want to know that there is a familiar, understandable, and easy-to-explain quality control mechanism in place.
  • Calibrating results and giving feedback: One of the final reasons for having gold data is to be able to calibrate workers and give them feedback about the expected coding standards. For example, when rating pages as porn or not and into degrees of severity, different people have different levels of sensitivity. If we have enough "sensitive" workers in the workforce, we may end up with results that are consistent but shifted upwards in terms of severity. (Or vice versa: if the coders are more tolerant, the results may be shifted downwards.) This can, incorrectly, give the impression that all data collected through crowdsourcing are wrong. However, if the final user of the data provides a few gold data points as anchors, the Dawid & Skene code gives back results that are more in line with expectations. At the same time, these gold data points can be used to give immediate feedback to the workers about their errors and implicitly direct them to use the expected rating guidelines and self-calibrate.
In practice, the last two reasons are often more important than the technical aspects of estimation. So, before starting any big crowdsourcing annotation project, spend some time and create some gold data. Or, alternatively, take a small dataset and label it using a large number of workers per example. Then verify the outcome, correct and calibrate some of the unexpected results, and run Dawid & Skene again. The generated data will be close enough to gold. Having such gold data will pay back the effort and cost multiple times during the overall process.

Tuesday, September 14, 2010

Analytics for Class Lectures

The classes for the new academic year have started, so naturally I started thinking about teaching-related topics.

Mining video interactions

A few days back, FXPal released TalkMiner, a system for indexing and searching video of lecture broadcasts. One of the interesting ideas is that it is possible to mine the interactions of students with the video, to see which topics interest the students, which parts of the class get skipped, and so on. From the blog post of FXPal:

The Berkeley webcasting system (developed by our president Larry Rowe while he was a professor there) showed that
… students almost always watched the lectures on-demand rather than in real-time, and they rarely watched the entire lecture.  Students use the webcasts to study for exams – we could see this clearly by patterns of usage – and, they primarily wanted to review selected material covered by the instructor.  In one class we discovered that for over 50% of the lectures, students watched less than 10 minutes from a 50-minute lecture and students watched the entire lecture only 10% of the time.  Consequently, for using the system, effective search is a big issue.

At Stern, all the classes get recorded and are available to students for reviewing the class material. The students get access to a layout like the following and have the ability to rearrange the layout, emphasizing the slides, or the video. (You can see a lecture of mine; login: scribe and password: Scribe987!)

It seems to be a natural next step to show to the instructor the patterns of interaction that students have with the videos. It would be very interesting to see what parts of the class go largely unexamined and which ones are played again and again. Needless to say, the parts that get replayed are either complicated topics, or topics that the instructor did not explain clearly.

Mining search queries using transcripts

Another interesting idea is to also have transcripts of the class. (For example, for this lecture [login: scribe and password: Scribe987!] see the transcript, done by CastingWords for $0.75/min.) This would allow students to search the class not only using text in the slides but also to recall particular points of the class discussion. This is especially important for courses that have a significant component of in-class discussion. We already know, from web search, that query logs are an important source of information. Doing the same for class content would easily identify what students are looking for in the class recordings.

One problem with transcription is that it is rather expensive. CastingWords and SpeakerText seem to charge one or two dollars per minute for human-verified transcriptions. (Fully-automatic solutions are not ready for prime time, as the automatic transcription of these YouTube videos shows. Make sure to click the "cc" button and then "transcribe audio".) With approximately 28 lectures a semester, 75 minutes each, at 1-2 dollars per minute, we have a cost of $2000 to $4000 per semester. At this cost level, it is certainly more beneficial to hire an extra TA rather than provide the transcription of the lecture to the students.
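The back-of-the-envelope estimate above works out as follows. The lecture count, lecture length, and per-minute rates are the ones quoted in this post; the variable names are just for illustration.

```python
lectures = 28              # approximate number of lectures per semester
minutes_per_lecture = 75
rates = {
    "budget rate quoted earlier ($0.75/min)": 0.75,
    "human-verified, low end ($1/min)": 1.00,
    "human-verified, high end ($2/min)": 2.00,
}

total_minutes = lectures * minutes_per_lecture
costs = {name: total_minutes * rate for name, rate in rates.items()}

print(f"{total_minutes} minutes of lecture per semester")
for name, cost in costs.items():
    print(f"{name}: ${cost:,.0f}")
```

So the exact range at 1-2 dollars per minute is $2,100 to $4,200, which I rounded to $2000 to $4000 above.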

Mining class participation 

Another thing that I would love to have is the ability to transcribe not only what the instructor said but also which students contributed to the discussion, together with what they said. This would allow us not only to track and quantify participation but also to uncover some patterns that may not be obvious to the instructor.

For example, take a look at this diagram below, created as part of the yearly teaching evaluation that we undergo at Stern:

The diagram was created by an evaluator who sat in my class, tracked the composition of the student body, where each student was sitting in the amphitheater, how many times they raised their hand, and how many times I asked them to answer a question. (To answer the inevitable question: No, the teaching feedback is not focused only on such analyses. In my earlier years, the feedback was focused more on substantial issues, e.g., structuring lectures and discussions, encouraging participation, etc. Now, with feedback and experience, the more substantial and important issues are addressed.  So we focus on such, seemingly more superficial, but also important, stuff...)

The results? I was paying significantly more attention to the left part of the amphitheater: 80% of the time I asked students sitting on the left, and only 20% of the time I asked students on the right. Also, the percentage of female students participating in the discussion was significantly lower: 50% of the male students participated, but only 21% of the female students did.

These are patterns that are hard to understand while teaching, but would be easier to find out if we had detailed transcripts of the class discussion together, potentially, with a standardized seat chart. I was also told that some universities (the rumor is about Harvard Business School) use software to track student participation. However, I was not able to locate any such software offerings. 

Moving forward

The ability to videotape lectures has been around for a while and is being used extensively for distance learning applications. (Columbia Engineering had a well-established distance learning program when I joined the PhD program back in 1999.) However, it was mainly a broadcast mechanism, and not a medium for providing feedback to the instructor (and even to the students who can see that they are lacking in terms of participation). 

It would be interesting to start having such technologies for providing feedback on teaching. Analytics have been changing many industries. Education has been surprisingly behind in that respect.