Today is October 18th. It is 41 years since Greece voted for Andreas Papandreou with a 48% vote percentage to be elected as prime minister, fundamentally changing the course of history for Greece. Positively or negatively, this is still debated, but the change was real.
A Computer Scientist in a Business School
Random thoughts of a computer scientist who is working behind the enemy lines; and lately turned into a double agent.
Tuesday, October 18, 2022
Tell these fucking colonels to get this fucking economist out of jail.
On October 6th, Roy Radner passed away at the age of 95. He was a faculty member at our department and a famous microeconomist with a highly distinguished career. Many others have written about him and his accomplishments as an economist and academic, so I will not try to do the same.
But Roy also played an important role in making that election in 1981 possible. Why? Let me tell you his story.
Tuesday, November 23, 2021
"Geographic Footprint of an Agent" or one of my favorite data science interview questions
Last week we wrote in the Compass blog how we estimate the geographic footprint of an agent.
At the very core, the technique is simple: Use the addresses of the houses that an agent has bought or sold in the past; get their longitude and latitude; and then apply a 2-dimensional kernel density estimation to find what are the areas where the agent is likely to be active. Doing the kernel density estimation is easy; the fundamentals of our approach are material that you can find in tutorials for applying a KDE. There are two interesting twists that make the approach more interesting:
- How can we standardize the "geographic footprint" score to be interpretable? The density scores that come back from a kernel density application are very hard to interpret. Ideally, we want a score from 0 to 1, with 0 being "completely outside of the area of activity" and 1 being "as important as it gets". We show how to use a percentile transformation of the likelihood values to create a score that is normalized, interpretable, and very well calibrated.
- What are the metrics for evaluating such a technique? We show how we can use the concept of "recall-efficiency" curves to provide a common way to evaluate the models.
Monday, November 18, 2019
Mechanical Turk, 97 cents per hour, and common reporting biases
The New York Times has an article about Mechanical Turk in today's print edition: "I Found Work on an Amazon Website. I Made 97 Cents an Hour". (You will find a couple of quotes from yours truly).
The content of the article follows the current zeitgeist: Tech companies exploiting gig workers.
While it is hard to argue that there are tasks on MTurk that are really bad, I think that the article paints an unfairly gloomy picture of the overall platform.
Here are a few of the issues:
- Availability and survivorship bias. While the paper does describe accurately the cesspool of low-paying tasks that are available on Mechanical Turk, it fails to convey the fact that these tasks are available on the platform because nobody wants to work on them. The tasks that are easily available for everyone are the ones for which nobody competes to grab: low-paying, badly designed tasks.
- The activity levels of workers follow a power-law. We have plenty of evidence that a significant part of the work on MTurk is done by a small minority of workers. While it is hard to have a truly accurate measurement of what percent of the workers do what percent of the tasks, the 1% rule is a good approximation. For example, in my demographic surveys, where I explicitly limit the participation to only once per month, 50% of the responses come from 5% of the participants. Expect the bias to be much stronger in other, more desirable tasks. Such a heavily biased propensity to participate introduces strong sampling problems when trying to find the right set of workers to interview.
- Doing piecemeal work while untrained results in low pay. This is a pet peeve of mine, for all the articles of the type "I tried working on MTurk / driving Uber / delivering packages / etc / and I got a lousy pay". Well, if you work piecemeal on any task, the tasks will take a very long time initially, and the hourly wage will suck. This will hold for Turking, coding, lawyering, or anything else. If someone decides to become a freelance journalist, the first few articles will result in abysmally bad hourly wages as well; expert freelance writers often charge 10x the rates that beginner freelancer writers charge, if not more. I am 100% confident that the same applies to MTurk workers as well: Experienced workers make 10x what beginners make.
Having said that, I do agree that Amazon could prohibit tasks that are obviously paying very little (as a rule of thumb, it is impossible to get paid more than minimum wage when the HIT is paying less than 5c/task). But I also think that regular workers are smart enough to know that and avoid such tasks.
Friday, November 16, 2018
Distribution of paper citations over time
A few weeks ago we had a discussion about citations, and how we can compare the citation impact of papers that were published in different years. Obviously, older papers have an advantage as they have more time to accumulate citations.
To compare papers, just for fun, we ended up opening the profile page of each paper in Google Scholar, and we analyzed the paper citations years by year to find the "winner." (They were both great papers, by great authors, fyi. It was more of a "Lebron vs. Jordan" discussion, as opposed to anything serious.)
This process got me curious though. Can we tell how a paper is doing at any given point in time? How can we compare a 2-year-old article, published in 2016, with 100 citations against a 10-year-old document, published in 2008, with 500 citations?
To settle the question, we started with the profiles of faculty members in the top-10 US universities and downloaded about 1.5M publications, across all fields, and their citation histories over time.
We then analyzed the citation histories of these publications, and, for each year, we ranked the papers based on the number of citations received over time. Finally, we computed the citation numbers corresponding to different percentiles of performance.
Cumulative percentiles
The plot below shows the number of citations that a paper needs to have at different stages to be placed in a given percentile.
A few data points, focusing on certain age milestones: 5-years after publication, 10-years after publication, and lifetime.
Yearly percentiles and peak years
We also wanted to check at which point papers reach their peak, and start collecting fewer citations. The plot below shows the percentiles based on the yearly numbers of accumulated citations. The vast majority of papers tend to reach their peak 5-10 years after publication; the number of yearly citations starts declining after 5-10 years.
Below is the plot of the peak year for a paper based on the paper percentile:
There is an interesting effect around the 97.5% percentile: After that level, it seems that a 'rich-gets-richer' effect kicks in, and we effectively do not observe a peak year. The number of citations per year keeps increasing. You could call these papers the "classics".
What does it take to be a "classic"? 200 citations at 5 years or 500 citations at 10 years.
To compare papers, just for fun, we ended up opening the profile page of each paper in Google Scholar, and we analyzed the paper citations years by year to find the "winner." (They were both great papers, by great authors, fyi. It was more of a "Lebron vs. Jordan" discussion, as opposed to anything serious.)
This process got me curious though. Can we tell how a paper is doing at any given point in time? How can we compare a 2-year-old article, published in 2016, with 100 citations against a 10-year-old document, published in 2008, with 500 citations?
To settle the question, we started with the profiles of faculty members in the top-10 US universities and downloaded about 1.5M publications, across all fields, and their citation histories over time.
We then analyzed the citation histories of these publications, and, for each year, we ranked the papers based on the number of citations received over time. Finally, we computed the citation numbers corresponding to different percentiles of performance.
Cumulative percentiles
The plot below shows the number of citations that a paper needs to have at different stages to be placed in a given percentile.
A few data points, focusing on certain age milestones: 5-years after publication, 10-years after publication, and lifetime.
- 50% line: The performance of a "median" paper. The median paper gets around 20 citations 5 years after publication, 50 citations within 10 years, and around 100 citations in its lifetime. Milestone scores: 20,50,90
- 75% line: These papers perform "better," citation-wise than 75% of the remaining papers with the same age. Such papers get around 50 citations within 5 years, 100 citations within 10 years of publication, and around 200 citations in their lifetime. Milestone scores: 50,100,200
- 90% line: These papers perform better than 90% of the papers in their cohort. Around 90 citations within 5 years, 200 citations within 10 years, and 500 citations in their lifetime. Milestones scores: 90,200,500
Yearly percentiles and peak years
We also wanted to check at which point papers reach their peak, and start collecting fewer citations. The plot below shows the percentiles based on the yearly numbers of accumulated citations. The vast majority of papers tend to reach their peak 5-10 years after publication; the number of yearly citations starts declining after 5-10 years.
What does it take to be a "classic"? 200 citations at 5 years or 500 citations at 10 years.
Monday, January 29, 2018
How many Mechanical Turk workers are there?
TL;DR: There are about 100K-200K unique workers on Amazon. On average, there are 2K-5K workers active on Amazon at any given time, which is equivalent to having 10K-25K full-time employees. On average, 50% of the worker population changes within 12-18 months. Workers exhibit widely different patterns of activity, with most workers being active only occasionally, and few workers being very active. Combining our results with the results from Hara et al, we see that MTurk has a yearly transaction volume of a few hundreds of millions of dollars.
For more details read below, or take a look at our WSDM 2018 paper.
Question
A topic that frequently comes up when discussing Mechanical Turk is "how many workers are there on the platform"?
In general, this is a question that is very easy for Amazon to answer, but much harder for outsiders. Amazon claims that there are 500,000 workers on the platform. How can we check the validity of this statement?
Basic capture-recapture model
A common technique for this problem is the capture-recapture technique, that is widely used in the field of ecology, to measure the population of a species.
The simplest possible technique is the following:
In our analysis, we estimated that the propensity distribution follows a Beta(0.3,20) distribution. We plot above the "inverse CDF" of the distribution (Inverse CDF: "what percentage of the workers have propensity higher than x").
As you can see, the propensity follows a familiar (and expected) pattern. Only 0.1% of the workers have propensity higher than 0.2, and only 10% have propensity higher than 0.05.
Intuitively, a propensity of 0.2 means that the worker is active and willing to participate 20% of their time (this is roughly equivalent to full-time level of activity; full-timer employees work around 2000 hrs per year, out of 24*365 available hours in a year). A propensity of 0.05 means that the worker is active and available approximately 24 hr * 0.05 ~ 1 hour per day.
--
For more details read below, or take a look at our WSDM 2018 paper.
--
Question
A topic that frequently comes up when discussing Mechanical Turk is "how many workers are there on the platform"?
In general, this is a question that is very easy for Amazon to answer, but much harder for outsiders. Amazon claims that there are 500,000 workers on the platform. How can we check the validity of this statement?
--
Basic capture-recapture model
A common technique for this problem is the capture-recapture technique, that is widely used in the field of ecology, to measure the population of a species.
The simplest possible technique is the following:
- Capture/marking phase: Capture $n_1$ animals, mark them, and release them back.
- Recapture phase: A few days later, capture $n_2$ animals. Assuming there are $N$ animals overall, $n_1/N$ of them are marked. So, for each of the $n_2$ captured animals, the probability that the animal is marked is $n_1/N$ (from the capture/marking phase).
- Calculation: On expectation, we expect to see $n_2 \cdot \frac{n_1}{N}$ marked animals in the recapture phase. (Notice that we do not know $N$.) So, if we actually see $m$ marked animals during the recapture phase, we set $m = n_2 \cdot \frac{n_1}{N}$ and we get the estimate that:$N = \frac{n_1 \cdot n_2}{m}$.
In our setting we adapted the same idea, where "capture" and "recapture" correspond to participating in a demographics survey. In other words, we "capture/mark" MTurk users that complete the survey in one day. Then, in another day, we also "recapture" by surveying more workers and we see how many workers overlap in the two surveys.
First (naive) attempt
We decided to apply this technique to estimate the size of the Mechanical Turk population. We considered as "capture" period the set of surveys running over a period of 30 days. Then we considered as "recapture" period, the surveys that we ran on another 30-day period. The plot below shows the results.
--
First (naive) attempt
We decided to apply this technique to estimate the size of the Mechanical Turk population. We considered as "capture" period the set of surveys running over a period of 30 days. Then we considered as "recapture" period, the surveys that we ran on another 30-day period. The plot below shows the results.
The x-axis shows the beginning of the recapture period, and the y-axis the estimate of the number of workers. The color of each dot corresponds to the difference in time between the capture-recapture periods: black is a short time, and red is a long time.
If we focus on the black-color dots (~60 days between the surveys), we get a (naive) estimate of around 10K-15K workers. (Warning: this is incorrect.)
While we could stop here, we see some results that are not consistent with our model. Remember, that color encodes time between samples: black is for short time (~2 months) between samples, red is for long time (~2yrs) between samples. Notice that, as the time between the two periods increases, the estimates are becoming higher, and we get the "rainbow cake" effect in the plot. For example, for July 2017, our estimate is 12K workers if we compare with a capture from May 2017, but the estimate goes up to 45K workers if we compare with a sample from May 2015. Our model, though, says that the time between captures should not affect the population estimates. This indicates that there is something wrong with the model.
Assumptions of basic model
If we focus on the black-color dots (~60 days between the surveys), we get a (naive) estimate of around 10K-15K workers. (Warning: this is incorrect.)
While we could stop here, we see some results that are not consistent with our model. Remember, that color encodes time between samples: black is for short time (~2 months) between samples, red is for long time (~2yrs) between samples. Notice that, as the time between the two periods increases, the estimates are becoming higher, and we get the "rainbow cake" effect in the plot. For example, for July 2017, our estimate is 12K workers if we compare with a capture from May 2017, but the estimate goes up to 45K workers if we compare with a sample from May 2015. Our model, though, says that the time between captures should not affect the population estimates. This indicates that there is something wrong with the model.
--
Assumptions of basic model
The basic capture-recapture estimation described above relies on a couple of assumptions. Both of these assumptions are violated when applying this technique to an online environment.
- Assumption of no arrivals / departures ("closed population"): The vanilla capture-recapture scheme assumes that there are no arrivals or departures of workers between the capture and recapture phase.
- Assumption of no selection bias ("equal catchability"): The vanilla capture-recapture scheme assumes that every worker in the population is equally likely to be captured.
In ecology, the issue of closed population has been examined under many different settings (birth-death of animals, immigration, spatial patterns of movement, etc.) and there are many research papers on the topic. Catchability, by comparison, has received comparatively less attention. This is reasonable, as in ecology the assumption of closed population is problematic in many settings. By comparison, assuming that the probability of capturing an animal is uniform among similar animals is reasonable. Typically the focus is on segmenting the animals into groups (e.g., nesting females vs hunting males) and assign different catchability heterogeneity to groups (but not to individuals).
In online settings though, the assumption of equal catchability is more problematic. First we have the activity bias: Workers exhibit very different levels of activity: A worker who works every day is much more likely to see and complete a task, compared to someone who works once a month. Similarly, we have a selection bias: Some workers may like to complete surveys, while others may avoid such tasks.
So, to improve our estimates, we need to use models that alleviate these assumptions.
So, to improve our estimates, we need to use models that alleviate these assumptions.
--
Endowing workers with survival probabilities
We can extend the model, allowing each worker to have a certain survival probability, to allow workers to "disappear" from the platform. If we see the plot above, we can see that the population estimate increases as the time between two samples increases. This hints that workers leave the platform, and the intersection of capture-recapture becomes smaller over time.
If we account for that, we can get an estimate that the "half-life" of a Mechanical Turk worker is between 12-18 months. In other words, approximately 50% of the Mechanical Turk population changes every 12-18 months.
--
Endowing workers with propensity to participate
We can also extend the model by associating a certain propensity for each worker. The propensity is the probability that a worker is active and willing to participate in a task, at any given time.
In our work, we assumed that the underlying "propensity to participate" follows a Beta distribution across the worker population, and the parameters of the Beta distribution are unknown. When we assume that follow a Beta distribution, then the probability that a worker participates in the survey k times, follows a Beta Binomial distribution. Since we know how many workers participated k times in our surveys, it is then easy to estimate the underlying parameters of the Beta distribution.
Notice that we had to depart from the simple "two occasion" model above, and instead use multiple capturing periods over time. Intuitively, workers that have high propensity to participate will appear many times in our results, while inactive workers will appear only a few times.
In our work, we assumed that the underlying "propensity to participate" follows a Beta distribution across the worker population, and the parameters of the Beta distribution are unknown. When we assume that follow a Beta distribution, then the probability that a worker participates in the survey k times, follows a Beta Binomial distribution. Since we know how many workers participated k times in our surveys, it is then easy to estimate the underlying parameters of the Beta distribution.
Notice that we had to depart from the simple "two occasion" model above, and instead use multiple capturing periods over time. Intuitively, workers that have high propensity to participate will appear many times in our results, while inactive workers will appear only a few times.
By doing this analysis, we can observe that (as expected) the distribution of activity is highly skewed: A few workers are very active in the platform, while others are largely inactive. A nice property of the Beta distribution is its flexibility: Its shape can be pretty much anything: uniform, Gaussian-like, bimodal, heavy-tailed... you name it.
In our analysis, we estimated that the propensity distribution follows a Beta(0.3,20) distribution. We plot above the "inverse CDF" of the distribution (Inverse CDF: "what percentage of the workers have propensity higher than x").
As you can see, the propensity follows a familiar (and expected) pattern. Only 0.1% of the workers have propensity higher than 0.2, and only 10% have propensity higher than 0.05.
Intuitively, a propensity of 0.2 means that the worker is active and willing to participate 20% of their time (this is roughly equivalent to full-time level of activity; full-timer employees work around 2000 hrs per year, out of 24*365 available hours in a year). A propensity of 0.05 means that the worker is active and available approximately 24 hr * 0.05 ~ 1 hour per day.
How big is the platform?
So, how many workers are there? Under such highly skewed distributions, giving an exact number for the number of workers is rather futile. The best that you can do is give a ballpark estimate, and hope to be roughly correct on the order of magnitude. What our estimates are showing is that there are round 180K distinct workers in the MTurk platform. This is good news for anyone who is trying to reach a large number of distinct workers through the platform.
Our analysis also allows us to estimate how many workers are active and willing to participate in our task at any given time. For that, we estimate that around 2K to 5K workers are available, at any given time. If we want to convert this number to full-time employee equivalence, this is equivalent to 10K-25K full-time workers.
The latter part also allows us to give some low and high estimates on the transaction volume of MTurk.
- Lower bound: Assuming 2K workers active at any given time, this is 2000*24*365=17,520,000 work hours in a year. If we assume that the median wage is \$2/hr, this is roughly \$35M/yr transaction volume on Amazon Mechanical Turk (with Amazon netting ~\$7M in fees).
- Upper bound: Assuming 5K workers active at any given time, this is 5000*24*365=43,800,000 work hours in a year. If we assume average wage of \$12/hr, this is around \$525M/yr transaction volume (with Amazon netting ~$100M in fees).
I understand that a range of \$35M to \$500M may not be very helpful, but these are very rough estimates. If someone wanted my own educated guess, I would put it somewhere in the middle of the two, i.e., transaction volume of a few hundreds of millions of dollars.
Subscribe to:
Posts (Atom)