Wednesday, August 5, 2009

Top Requesters on Mechanical Turk

Today I had a chat with Dahn Tamir about all things MTurk. He was particularly interested in the archive of all requesters that I have collected over the last 7 months. So, I queried the database, computed some basic statistics and sent him the results.

Then I thought: why not export the live results as well? A few lines of PHP later, the leaderboard with the top Mechanical Turk requesters was born and is now available at http://mturk-tracker.com/top_requesters/


For each requester, you can see the total number of projects they have posted on Mechanical Turk since January 2009, the total number of HITs, and the total value of the posted HITs. If you are also interested in whether the requester is still active, you can see when they last posted a HIT.

By clicking on a requester's name, you can see the archive of the last 100 tasks that they have posted, and by clicking on the requester ID you get to Amazon, where you can see the tasks that are available now.
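For readers curious how such a leaderboard could be computed, here is a minimal sketch of the aggregation. The real site uses PHP against my own database; the version below is Python with SQLite, and the `hit_groups` table, its column names, and the sample rows are all hypothetical, chosen only to illustrate the four statistics described above (projects, HITs, total value, last posting date).

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory database with a hypothetical schema and made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hit_groups ("
        "requester_id TEXT, requester_name TEXT, "
        "hits INTEGER, reward_cents INTEGER, posted_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO hit_groups VALUES (?, ?, ?, ?, ?)",
        [
            ("R1", "Alpha", 100, 5, "2009-02-01"),
            ("R1", "Alpha", 50, 10, "2009-07-15"),
            ("R2", "Beta", 10, 200, "2009-03-10"),
            ("R2", "Beta", 999, 100, "2008-12-31"),  # posted before the cutoff
        ],
    )
    return conn


def top_requesters(conn, since="2009-01-01", limit=10):
    """Return per-requester totals since a cutoff date, largest posted value first."""
    query = """
        SELECT requester_id,
               requester_name,
               COUNT(*)                 AS projects,
               SUM(hits)                AS total_hits,
               SUM(hits * reward_cents) AS total_value_cents,
               MAX(posted_at)           AS last_posted
        FROM hit_groups
        WHERE posted_at >= ?
        GROUP BY requester_id, requester_name
        ORDER BY total_value_cents DESC
        LIMIT ?
    """
    return conn.execute(query, (since, limit)).fetchall()


if __name__ == "__main__":
    for row in top_requesters(make_demo_db()):
        print(row)
```

Rewards are stored as integer cents in this sketch so that the totals do not pick up floating-point rounding errors.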

Enjoy!

20 comments:

  1. Excellent work, Panos! Thanks for sharing.

    If your server can handle the traffic, here's a crazy idea: since you're already archiving and displaying a requester's most recent tasks, how about allowing people to subscribe by RSS to any requester's task list? Sure, one may miss some of the faster moving HITs since you only scrape MTurk 4(?) times per hour, but it'd still be useful for keeping tabs on some requesters' postings.

  2. Sounds like a nice idea... If only I knew how to export the query results as RSS feeds... Any pointers?

  3. RSS is, like its name suggests, a really simple format. This will fill you in on the basics: www.petefreitag.com/item/465.cfm

    And how to enable autodiscovery so browsers know where to find your feed:
    www.petefreitag.com/item/384.cfm

    But you'd definitely need good caching in place in case things took off. If you're willing to share your data, I'd be happy to put together something on Google App Engine to serve up feeds for you. That way your current server won't be overburdened if you start getting a lot of traffic.

    Feel free to contact me at (my first name) (at) turkerz.com

  4. You know, it occurs to me I might as well do the same by scraping Amazon myself, without bothering you for data. A worker would typically be more interested in new posts than in historical data, so I could just hit the requester listings to source my feeds. I'll go ahead and do that.

  5. I evaluated Google App Engine, and it was not a good platform for this task. Running a crawler on App Engine is not something you want to do, due to the limitations on how much time each request can take.

    Also, to display all these statistics, I rely heavily on being able to execute SQL queries over the raw data. For GAE, doing any ad hoc heavy-duty data processing is almost impossible.

    I thought of connecting GAE to the database and using GAE as the frontend, but that started to be too much work for the purposes of this project.

  6. And you are correct that RSS feeds for requesters are useful. Actually, it makes sense to get Amazon to provide RSS feeds for the requesters, directly.

    In fact, an API for accessing the available tasks would be nice. And if someone can build a bot to actually complete the tasks as nicely as a human, I would not mind at all...

  7. It definitely makes sense for Amazon to offer RSS feeds themselves, but I'm not going to hold my breath after asking them for that for years now. :) API support would be nice too, and is something a number of people have asked for with no ensuing commitment on Amazon's part.

    I would cache the feeds for 15 minutes or so, so at worst App Engine would hit Amazon 4 times per hour per requester being followed. A quick test I just ran showed that I can grab and extract a requester's HIT listing in 2-6 seconds, well within the 30-second limit. (I already had all the code I needed from a previous MTurk scraper I wrote and just needed to make some minor tweaks to adapt it for this purpose.) Now I need to add in caching and provide a public interface so people can easily subscribe.

    For far more ambitious efforts like you've put together, however, App Engine definitely wouldn't be a good match. Too many limitations to be useful.

  8. I should clarify: 2-6 seconds for loading and processing one page of a requester's HIT listings. Only a few requesters post separate HITs in enough volume to need more than one page, so those timings should be more or less the common case, barring connection or server issues.

  9. If you manage to get it to run efficiently, drop me a line.

    I am running another crawler fully on GAE (http://intrade-archive.appspot.com/), archiving results from prediction markets. You can see that I have a soft spot for "wisdom of the crowds" applications :-) While it works fine for getting daily data, it fails miserably for markets that need to be pinged every 10 minutes or so.

  10. Loading times are much improved today. 1-2 seconds on an empty cache. I still need to stick a proper user interface on it, but otherwise things are looking pretty good. Might run into daily bandwidth limit issues if things take off, though if things get to that point I should probably invest a few dollars in the service anyway, or start bugging Amazon again to provide this kind of thing themselves. :)

    I'll send you the link when I finish up, hopefully this evening if I have time.

  11. I chanced upon your blog and found it very interesting and informative. I needed this type of information, and I am really thankful to you. This post will help a huge number of people. Great... Keep it up!

  12. Should we presume that the amount offered is the total amount that could be rewarded, or the amount that was rewarded? If it is the former, then the sum total being offered continues to be small (as you have pointed out before). Even the top listing does not represent more than a full-time junior employee's worth of work over 7 months.

  13. It is the total amount that could be rewarded, as I cannot tell if the requester actually paid, or they cancelled the HIT.

    One thing that I am missing is the possibility that the requester asks for redundancy and gets 2 or 3 Turkers to work on the same assignment. So, there is some undervaluation there. (In fact, for surveys, the undervaluation may be very significant as the same assignment is done by many workers.)

    But even assuming a 10x average undervaluation (and this is too much imho), the total amount of pay is still not much. I think that we need a few more things in place in order for people to trust MTurk (e.g., better quality control of the submissions).

  14. RSS feeds have been working for a while now, but I have some minor issues to iron out and still need to finish up an easy way to subscribe. One can subscribe manually though as follows:

    feed.crowdsauced.com/r/req/requesterid (replacing 'requesterid')

    feed.crowdsauced.com/r/req/A3MI6MIUNWCR7F (castingwords)

    or

    feed.crowdsauced.com/r/key/keyword (replacing 'keyword' with a search term)

    e.g., feed.crowdsauced.com/r/key/transcription

    As it is now, when new HITs are posted to an existing group, there is no new notification. Also, since I'm not signing into MTurk, there are a lot of things I can't provide direct links for, due to qualification restrictions. In those cases, a link is provided to the requester page instead. Of course, even signed in I would have the same issue, but less so.

  15. I can link to the RSS feeds from my "requesters" page. I will do that once I return from my trips. Remind me if you do not see the RSS links in a week.

  16. The link to the RSS feeds is now active.

  17. I'm on the list! Currently 39th. Cool.

  18. No, simply moved to http://mturk-tracker.com/top_requesters/

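For anyone following the RSS discussion in the comments, here is a rough sketch of the two pieces that came up: turning a list of tasks into an RSS 2.0 feed, and caching each requester's feed for 15 minutes so the upstream source is hit at most 4 times per hour per requester. This is not the code behind feed.crowdsauced.com, which was never posted here. The `build_rss` and `FeedCache` names, the task dictionary keys, and the example.com URLs are all assumptions made for illustration.

```python
import time
import xml.etree.ElementTree as ET
from email.utils import formatdate


def build_rss(title, link, description, tasks):
    """Serialize tasks (dicts with 'title', 'url', 'posted_ts') as an RSS 2.0 string."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    for task in tasks:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = task["title"]
        ET.SubElement(item, "link").text = task["url"]
        ET.SubElement(item, "guid").text = task["url"]
        # RSS expects RFC 822 style dates, e.g. "Thu, 01 Jan 1970 00:00:00 GMT".
        ET.SubElement(item, "pubDate").text = formatdate(task["posted_ts"], usegmt=True)
    return ET.tostring(rss, encoding="unicode")


class FeedCache:
    """Keep each requester's rendered feed for ttl_seconds before refetching."""

    def __init__(self, ttl_seconds=15 * 60, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the expiry logic can be tested
        self._entries = {}  # requester_id -> (fetched_at, feed_xml)

    def get(self, requester_id, fetch_tasks):
        """Return the cached feed, calling fetch_tasks(requester_id) only when stale."""
        now = self.clock()
        cached = self._entries.get(requester_id)
        if cached is not None and now - cached[0] < self.ttl:
            return cached[1]
        feed_xml = build_rss(
            title="Tasks posted by " + requester_id,
            link="https://example.com/requester/" + requester_id,  # placeholder URL
            description="Most recent tasks for this requester",
            tasks=fetch_tasks(requester_id),
        )
        self._entries[requester_id] = (now, feed_xml)
        return feed_xml
```

A caller would pass its own `fetch_tasks` function (a database query or a scraper) to `FeedCache.get`. The in-memory dictionary is only a stand-in: on a hosted platform the cache would more likely live in whatever shared cache service that platform provides.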