Monday, June 9, 2014

My Peer Grading Scheme

One of the components that I use in my class is student presentations. 

While I like having students present, I always had a hard time grading the presentations. Plus, many students seemed to target their presentation at me, trying to sound technical and advanced, leaving the rest of the class bored and uninterested.

For that reason, I adopted a peer-grading scheme. Students present to the class and get rated by the class, not by me. (Although I still reserve a small degree of editorial judgement when assigning the grades.) Here is how my scheme works, after a few years of experience:
  1. Rating scale: Students assign a grade from 0 to 10 to the presentations.
  2. No self-grading: Students do not grade their own presentations. (Early on, there were students who assigned a 10 to themselves and lower grades to everyone else. Now they can still grade themselves if they want, but the grade is ignored.)
  3. Normalization: All assigned grades are normalized to have zero mean and unit standard deviation. (This normalization was introduced to fight the problem where a student would try to game the system by assigning low grades to everyone else, hoping to lower the average rating of all the other students.)
  4. Grade assignment: The presentation grade is the average of the assigned normalized scores. Formally, each student $s$ assigns to presentation $t$ a normalized grade $z(s,t)$, and the overall grade of the presentation is the mean value $E[z(*,t)]$ of the grades $z(s,t)$ across students.
  5. Ensuring careful grading by asking students to estimate the class rating: One problem with the peer-grading scheme was that many students did not take it seriously enough and assigned effectively random grades (typically, the same grade to everyone). To discourage indifferent grading, I give a small amount of credit (~10%) based on the correlation of a student's assigned grades $z(s,t)$ with the mean values $E[z(*,t)]$, computed across all presentations $t$. This ensures that students at least try to figure out what the rest of the class will assign to a presentation, instead of assigning random grades. (See the sketch after this list.)
  6. Separate assigned and estimated grades: The problem with introducing the requirement to agree with the class was that some students believed themselves to be better assessors than the rest of the class. They felt that their own grade was the correct one, and did not like losing credit for assigning their own "true" grade. To address that issue, I now ask students to assign two grades: their own grade $z_p(s,t)$ and an estimate of the class grade $z_c(s,t)$. The personal grade $z_p$ is used to compute the mean $E[z_p(*,t)]$ in Step 4, and the class estimate $z_c$ is used to compute the correlation in Step 5.
  7. Examine self-grading: Given that the class-estimate grades are not directly used to grade a presentation, students are also asked to provide an estimate of their own grade as part of Step 6. Effectively, students are encouraged to estimate their own grade properly.
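To make Steps 3-5 concrete, here is a minimal sketch of the computation, assuming the grades are collected in a students-by-presentations matrix with NaN marking the entries a student did not grade (e.g., their own presentation). The function names and NumPy-based layout are just one possible illustration of the scheme, not the actual scripts used in class.

    import numpy as np

    def normalize_per_grader(raw):
        # Step 3: z-score each grader's row (zero mean, unit standard
        # deviation); NaN marks presentations the student did not grade.
        mean = np.nanmean(raw, axis=1, keepdims=True)
        std = np.nanstd(raw, axis=1, keepdims=True)
        std[std == 0] = 1.0  # a grader who gives everyone the same grade contributes all zeros
        return (raw - mean) / std

    def presentation_grades(z):
        # Step 4: the grade of presentation t is the mean E[z(*,t)] of the
        # normalized grades it received.
        return np.nanmean(z, axis=0)

    def grading_credit(z_estimates, class_means):
        # Step 5: per-student correlation between that student's estimates of
        # the class grade and the actual class means; this drives the ~10%
        # "careful grading" credit. With Step 6 in place, z_estimates would be
        # the separate class-estimate grades z_c rather than the personal grades.
        credit = []
        for row in z_estimates:
            mask = ~np.isnan(row)
            credit.append(np.corrcoef(row[mask], class_means[mask])[0, 1])
        return np.array(credit)

    # Toy example: 3 students x 4 presentations, graded on the 0-10 scale.
    raw = np.array([
        [np.nan, 7.0, 8.0, 6.0],   # student 0 skips their own presentation
        [9.0, np.nan, 7.0, 8.0],
        [5.0, 6.0, np.nan, 9.0],
    ])
    z = normalize_per_grader(raw)       # personal grades z_p(s,t)
    grades = presentation_grades(z)     # E[z(*,t)] per presentation
    credit = grading_credit(z, grades)  # correlation-based credit per student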
The only thing that I have not tried so far is to modify Step 4 to take into consideration the different correlations from Step 5, effectively weighting each student's grades based on their correlation with the rest of the class (see the short sketch below). However, most students tend to exhibit the same, moderate agreement with the class (typical correlation values are in the 0.4-0.6 range, after rating 15-20 presentations), so in practice I do not expect to see much of a difference.
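For reference, the untried weighted variant would be a small change on top of the sketch above; purely hypothetical:

    def weighted_presentation_grades(z, credit):
        # Hypothetical variant of Step 4: weight each student's normalized
        # grades by their Step-5 correlation, clipped at zero so that
        # anti-correlated graders are simply ignored.
        w = np.clip(credit, 0.0, None)[:, None]
        num = np.where(np.isnan(z), 0.0, z) * w
        den = np.where(np.isnan(z), 0.0, np.ones_like(z) * w)
        return num.sum(axis=0) / den.sum(axis=0)

    # grades_w = weighted_presentation_grades(z, credit)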

Overall, I am pretty happy with the scheme. Students indeed try to impress the class (and not me), and many presentations are interesting, interactive, and engaging. The grades are also very consistent with the overall feeling that I get for each presentation, so I did not have to practice my "editorial oversight" and adjust the grade very often (only in a couple of cases, where the students ran into technical problems during the presentation). I would be really interested to try this scheme in one of the big MOOC classes that use peer grading, and see if it can instill the same sense of responsibility in peer grading. 

Tuesday, April 1, 2014

Online Markets: Selling products vs. selling time

We had an interesting discussion a few days back about online job markets, and why they have not been a huge success so far, when other, comparatively less important products are getting huge valuations and visibility. For example, oDesk reached a total transaction volume of a billion dollars over the 10 years of its existence, and roughly 5% to 10% of that volume becomes revenue for the company. Other labor marketplaces typically have even smaller numbers to report.

While nobody can ignore a billion dollars of transaction volume, I am puzzled why this number has not skyrocketed. It is very clear that the market serves a purpose: work is a trillion-dollar industry. Letting people work online gives better and more efficient access to human capital, alleviates the need for immigration, and improves the lives of the people involved. It is a no-brainer.

Why does it take so long for online work to take off? What is missing?

***

I was puzzled by these questions for a long time. I postulated that there are obstacles that prevent employers from hiring online, but recently I got some hints that there are obstacles on the worker side as well. I talked with some friends of mine back in Greece, who are making a very comfortable living working through the platform. I asked them how they like making US salaries while living in Greece, and their answer was surprising: they did not see online work as a long-term solution, but rather as a temporary gig.

When I asked why, they both pointed to the same problem: there is no room in such markets for career evolution. You end up selling your time, and time is not something that scales. It is very hard to grow your business when you are always a freelancer, without the ability to hire new people, delegate tasks, and build a business. Compare online work with markets like Amazon and eBay: both allow sellers to effectively build businesses, while online job markets currently allow workers only to sell their time.

When sellers face a cap on their growth, the market itself faces headwinds as it tries to grow to maturity.

***

On a more general note, this gives birth to a hypothesis about what can make a marketplace (hugely) successful: The market should allow sellers to grow, without an obvious ceiling. Otherwise, the best sellers are unlikely to be attracted to participate in the platform, due to the lack of upside.

Take some marketplace companies and interpret them through this framework:
  • Google Helpouts: Same restrictions on seller growth as all other job marketplaces.
  • Uber: Obviously, sellers currently have a cap on growth, limited by their time. However, Uber allows the enrollment of limo/taxi agencies, which can potentially grow indefinitely.
  • AirBnB: No obvious seller cap for someone who wants to enter the hospitality business.
  • TaskRabbit: Very obvious growth cap for the individual sellers of services.
  • OpenTable: No obvious limit of growth for participating restaurants.
  • eBay/Amazon: No obvious limit of growth for sellers that sell products online.
  • Etsy: This is an interesting case. On the surface, the company looks like eBay/Amazon. However, the Etsy guidelines dictate that "Everything on Etsy must be Handmade, Vintage, or a Craft Supply." Unfortunately, this places restrictions on seller growth, as it implicitly limits sellers to being (very) small businesses. My bet is that Etsy will revise this policy down the road, once more and more sellers start hitting their growth ceiling.
How accurate is the hypothesis? Time will tell...

Wednesday, January 22, 2014

Future of Education: Fighting Obesity or Fighting Hunger?

I have been following with interest the discussion about the future of education.

***

Some people criticize existing educational institutions, indicating that they offer little in terms of real training, and that real learning occurs outside the classroom, by actually doing. "Nobody learns how to build a system in a computer science class." "Nobody learns how to build a company in an entrepreneurship program."

Others are lamenting that by shifting to training-oriented schemes, we are losing the ability to offer deeper education, on topics that are not marketable. Who is going to study poetry if it has no return on investment? Who is going to teach literature if there is no demand for it?

These two criticisms seem to be pushing in two different directions.

***

In reality, we need to address two different needs:

One need is to truly democratize education, taking the content of the top courses and making it accessible and available to everyone. People who want to learn machine learning can now take courses from top professors, instead of having to read a book on their own. People can now advance their careers easily, without having to enroll in expensive degree programs.

The other need is to preserve the breadth of education, shielding it from market forces. This means preserving the structure where students get exposed to diverse fields during their education, no matter whether there is a market and demand for these fields.

***

This tension reminded me of the discussion about genetically modified foods.

Mass production of food pretty much solved the problem of world hunger. A few decades ago, famine was a real problem in many areas of the world, due to the inability to produce enough food to feed a growing population: floods, droughts, and diseases were disrupting production, resulting in shortages. Today, the advances in agriculture allow the abundant production of grains and food: wheat and rice varieties are now robust, resistant to diseases, adaptable to many different climates, and allow us to feed the world.

The advances that solved the problem of world hunger ended up creating other problems. Processed carbohydrates are causing obesity, diabetes, gout, and many other "luxury" diseases in the developed world. The poor in the developed world are not dying because they are hungry; they are dying because their diets are starved of essential nutrients.

***

The parallels are striking. The MOOCs, Khan Academies, and Code Academies of the world are the genetically modified foods for those living in the "third world of education". These courses may not be the most nutritious, and they may not provide all the "nutrition" needed for their education. However, the choice for many of these people in the "third world of education" is not Stanford vs. a Coursera MOOC; it is nothing vs. a Coursera MOOC. Given that choice, take the MOOC every time.

Those that live in the "developed world of education" can be pickier. They may have access to the genetically modified MOOCs, but if they can afford it, the organic, artisanal, locally sourced education can be potentially better than the mass produced MOOC. 

***

Horses for courses (pun intended).


Monday, January 20, 2014

Crowdsourcing research: What is really new?

A common question that comes up when discussing research in crowdsourcing is how it compares with similar efforts in other fields. Having discussed this a few times, I thought it would be good to collect the answers in a single place.
  • Ensemble learning: In machine learning, you can generate a large number of "weak classifiers" and then build a stronger classifier on top. In crowdsourcing, you can treat each human as a weak classifier and then learn on top. What is the difference? In crowdsourcing, each judgment has a cost. With ensembles, you can trivially create 100 weak classifiers, classify each object, and then learn on top. In crowdsourcing, you pay a cost for every classification decision. Furthermore, you cannot force every person to participate, and participation is often heavy-tailed: a few humans participate a lot, but from many of them we get only a few judgments. (See the first sketch after this list.)
  • Quality assurance in manufacturing: When factories create batches of products, they also have a sampling process to examine the quality of the manufactured products. For example, a factory creates light bulbs and wants 99% of them to be operating. The typical process involves setting aside a sample and testing whether it meets the quality requirement. In crowdsourcing, this would be equivalent to verifying, with gold testing or with post-verification, the quality of each worker. Two key differences: The heavy-tailed participation of workers means that gold-testing each person is not always efficient, as you may end up testing a user a lot, and the user may then leave. Furthermore, it is often the case that a sub-par worker can still generate somewhat useful information, while for tangible products the product is either acceptable or not.
  • Active learning: Active learning assumes that humans can provide input to a machine learning model (e.g., disambiguate an ambiguous example) and that their answers are perfect. In crowdsourcing this is not the case, and we need to explicitly take the noise into account.
  • Test theory and Item Response Theory: Test theory focuses on how to infer the skill of a person through a set of questions. For example, to create a SAT or GRE test, we need a mix of questions of different difficulties, and we need to know whether these questions really separate persons of different abilities. Item Response Theory studies exactly these questions: based on the answers that users give to the tests, IRT calculates various metrics for the questions, such as the probability that a user of a given ability will answer the question correctly, the average difficulty of a question, etc. (A minimal model is sketched after this list.) Two things make IRT inapplicable directly to a crowdsourcing setting: first, IRT assumes that we know the correct answer to each question; second, IRT often requires 100-200 answers to provide robust estimates of the model parameters, a cost that is typically too high for many crowdsourcing applications (except perhaps citizen science and other volunteer-based projects).
  • Theory of distributed systems: This part of CS theory is actually much closer to many crowdsourcing problems than many people realize, especially the work on asynchronous distributed systems, which attempts to solve many coordination problems that also appear in crowdsourcing (e.g., agreeing on an answer). The work on the analysis of Byzantine systems, which explicitly acknowledges the existence of malicious agents, provides significant theoretical foundations for defending systems against spam attacks, etc. One thing that I am not aware of is the explicit handling of noisy agents (as opposed to malicious ones), and I am not aware of any study of incentives within that context that would affect the way that people answer a given question.
  • Database systems and user-defined functions (UDFs): In databases, a query optimizer tries to identify the best way to execute a given query, trying to return the correct results as fast as possible. An interesting part of database research that is applicable to crowdsourcing is the inclusion of user-defined functions in the optimization process. A user-defined function is typically a slow, manually coded function that the query optimizer tries to invoke as little as possible. The ideas from UDFs are typically applicable when trying to optimize in a human-in-the-loop-as-UDF approach, with the following caveats: (a) UDFs were assumed to return perfect information, and (b) UDFs were assumed to have a deterministic, or a stochastic but normally distributed, execution time. The existence of noisy results and the fact that execution times with humans can often be long-tailed make the immediate applicability of UDF research to optimizing crowdsourcing operations rather challenging. However, it is worth reading the related chapters about UDF optimization in the database textbooks.
  • (Update) Information Theory and Error Correcting Codes: We can model the workers as noisy channels that take the true signal as input and return a noisy representation of it. The idea of using advanced error-correcting codes to improve crowdsourcing is rather underexplored, imho. Instead we rely too much on redundancy-based solutions, although pure redundancy has been theoretically proven to be a suboptimal technique for error correction. (See an earlier, related blog post, and the first sketch after this list.) Here are a couple of potential challenges: (a) the errors of the humans are very rarely independent of the "message", and (b) it is not clear whether we can get humans to compute properly the functions that are commonly required for the implementation of error-correcting codes.
  • (Update) Information Retrieval and Inter-annotator Agreement: In information retrieval, it is very common to examine the agreement of annotators when they label the same set of items. My own experience from reading the literature is that the related metrics implicitly assume that all workers have the same level of noise, an assumption that is often violated in crowdsourcing. (See the last sketch after this list.)
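To make the redundancy point (and the view of workers as weak classifiers) concrete, here is a small sketch of my own, not taken from any specific paper: it treats each worker as an independent noisy channel with accuracy p and computes the error of a plain majority vote, i.e., of redundancy used as a repetition code. The slow improvement as n grows is what makes pure redundancy an expensive way to buy accuracy when every judgment costs money.

    from math import comb

    def majority_error(p, n):
        # Probability that a majority vote over n independent workers, each
        # correct with probability p, returns the wrong label (binary task,
        # odd n so that ties are impossible).
        needed = n // 2 + 1
        p_correct = sum(comb(n, i) * p**i * (1 - p)**(n - i)
                        for i in range(needed, n + 1))
        return 1 - p_correct

    # How much redundancy a pool of 70%-accurate workers needs:
    for n in (1, 3, 5, 9, 15, 25):
        print(n, round(majority_error(0.7, n), 4))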
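For the IRT comparison, the core of the model is a single logistic curve; here is a minimal two-parameter logistic (2PL) sketch, using the textbook parameter names rather than anything tied to a specific crowdsourcing system. Fitting a, b, and theta from crowdsourced answers is where the two caveats above (unknown correct answers, too few answers) start to bite.

    import math

    def p_correct_2pl(theta, a, b):
        # Two-parameter logistic IRT model: probability that a person with
        # ability theta answers correctly an item with discrimination a and
        # difficulty b.
        return 1.0 / (1.0 + math.exp(-a * (theta - b)))

    print(p_correct_2pl(theta=0.0, a=1.0, b=0.0))  # average person, average item: 0.5
    print(p_correct_2pl(theta=0.0, a=1.0, b=1.5))  # harder item: ~0.18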
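And for the inter-annotator agreement point, a standard metric such as Cohen's kappa is easy to compute, but note that nothing in it distinguishes a careful annotator from a noisy one; a minimal sketch:

    from collections import Counter

    def cohen_kappa(a, b):
        # Cohen's kappa for two annotators labeling the same items: observed
        # agreement corrected for the agreement expected by chance. Both
        # annotators are implicitly treated as equally noisy.
        n = len(a)
        observed = sum(x == y for x, y in zip(a, b)) / n
        ca, cb = Counter(a), Counter(b)
        expected = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
        return (observed - expected) / (1 - expected)

    print(cohen_kappa(["y", "y", "n", "n", "y"], ["y", "n", "n", "n", "y"]))  # ~0.62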
What other fields, and what other caveats, should be included in the list?