Tuesday, March 22, 2011

Crowdsourcing goes professional: The rise of the verticals

Over the last few months, I have noticed a trend. Instead of letting end-users interact directly with the crowd (e.g., on Mechanical Turk), we see a rise in the number of solutions that target a very specific vertical.
Add services like Trada for crowd-optimizing paid advertising campaigns, uTest for crowd-testing software applications, etc. and you will see that for most crowd applications there is now a professionally developed crowd-app.

Why do we see these efforts? This is the time when most people realize that crowdsourcing is not that simple. Using Mechanical Turk directly is a very costly enterprise and cannot be done effectively by amateurs: The interface needs to be professionally designed, quality control needs to be done intelligently, and the crowd needs to be managed in the same way that any employee is managed. Most companies do not have the time or the resources to invest in such solutions. So, we see the rise of verticals that address the most common tasks that were accomplished on Mechanical Turk.

(Interestingly enough, if I remember correctly, the rise of vertical solutions was also a phase during web search. In the period in which AltaVista started being spammed and full of irrelevant results, we saw the rise of topic-specific search engines that were trying to eliminate the problems of polysemy by letting you search only for web pages within a given topic.)

For me, this is the signal that crowdsourcing will stop being the fad of the day. Amateurish solutions will be shunned, and most people will find it cheaper to just use the services of the verticals above. Saying "oh, I paid just $[add offensively low dollar amount] to do [add trivial task] on Mechanical Turk" will stop being a novelty and people will just point to a company that does the same thing professionally and at a large scale.

This also means that the crowdsourcing space will become increasingly "boring." All the low-hanging fruit will be gone. Only people who are willing to invest time and effort in the long term will get into the space.

And it will be the time that we will get to separate the wheat from the chaff.

Wednesday, March 16, 2011

Uncovering an advertising fraud scheme. Or "the Internet is for porn"

You have heard about fraud and online advertising. You may have seen the Wall Street Journal video "Porn Sites Scam Advertisers", or even read the story in today's Wall Street Journal, "Off Screen, Porn Sites Trick Advertisers". (Hint: to avoid the WSJ paywall, search for the title of the article through Google News and click from there to read the full article.)

Since I am intimately familiar with the story covered by WSJ (i.e., I was part of the team at AdSafe that uncovered it), I thought it would also be good to cover the technical aspects in more detail, uncovering the way in which this advertising fraud scheme operated.

It is long but (I think) interesting. It is a story of a one-man-making-a-million-dollar-per-month fraud scheme. It shows how a moderately sophisticated advertising fraud scheme can generate very significant monetary benefits for the fraudster: Profits of millions of dollars per year.

If you want to skip the technical sleuthing details, you can skip directly to the overall picture and the discussion.

Disclaimer: In the story below, I will only mention by name the sites performing the fraudulent activities. All the brand names that you see are just for illustration purposes. They are not the ones affected by this case of fraud. Also remember that this is a personal blog. The views and opinions that I express here are my own and do not necessarily represent the views of AdSafe or the views of New York University.

The erroneous classifier

It all started while working at AdSafe. For those not familiar with AdSafe: The role of AdSafe is to provide brand protection services to online advertisers. In plain English, AdSafe analyzes website content and can block ads from appearing on individual web pages with content inappropriate for a brand. Porn, hate speech, gambling, celebrity gossip, torrents, are among the many categories that we detect.

On a nice Monday, the data science team gets the notification: The web page classifier was detecting a large number of porn web pages within legitimate, clean, big-brand-name websites. Think of websites such as BabyCenter, MSN MoneyCentral, HGTV, and so on. These sites would never have anything racy in their pages. However, we could see them being classified as having hard-core porn!

Why do we detect porn in clean sites? None of the pages within the sites contained anything offensive. No porn, no offensive material. Nothing. The websites were as clean as it gets. What was going on?

The invisible iframe hosting

The lifesaver was a technique developed at AdSafe: The key to the solution was the ability to read the address of the top frame that was hosting the ad(*). We were detecting porn because the ads that were supposed to appear within a "clean publisher" site were appearing within the frame of a porn website. Think of HGTV as an illustrative, but not the real, example of such a "clean publisher."

(*) For the technically curious: reading the address of the top frame is a challenging problem. For security reasons, browsers do not allow cross-domain scripting. So, it is not possible to just call the "top" object and read its properties. We have a proprietary solution for this.

By using this technique, we got our explanation: The HGTV website was appearing within an iframe of a porn website. In our case, the porn website was www.hqtubevideos.com.

WTF? This made no sense. Why would the porn website display HGTV (and the associated ads) within an iframe? Why would the porn site generate this "invisible" traffic towards HGTV? Just so that HGTV would get paid for the CPM ads? Or was the porn site trying to decrease the clickthrough rate of HGTV and ruin the performance of the CPC campaigns? Did the porn website love HGTV so much that it was trying to increase its traffic? No way. Did HGTV employ a porn website to increase its traffic? No way, either.

Made no sense whatsoever.

Checking the structure of the porn website

So, we decided to investigate. Let's see what is going on. First, we go and see the HTML source of www.hqtubevideos.com/play.html, which was the top frame that we were detecting. Here is the source:

The highlighted part shows an interesting redirection. We go to the www.hqtubevideos.com/index.php, but with the parameter ?x=1 at the end.

Loading the page www.hqtubevideos.com/index.php without this parameter loads a "vanilla" porn website. A few semi-suspicious attempts to add the website in the bookmarks in the beginning. Then porn pictures and links to affiliate sites. Plenty of porn but nothing to set an alarm. So far, so good.

The key, though, is this parameter ?x=1. Loading www.hqtubevideos.com/index.php?x=1, we see a key part added to the website, at the very bottom. Here is the corresponding source.

Aha! A 0x0 iframe, loading the following URLs:
The first URL seems to be loading some randomized hashid. Ignore.

The second URL, www.hqtubevideos.com/counter2.php, is a little bit more interesting and puzzling:

What is going on? Why would a porn site link to these domains? What is the connection?

The parked domains

We started by doing a whois to figure out the ownership of these domains. Unfortunately, the registration information for hqtubevideos is private and protected. However, the registration info for all the other domains is available. Not surprisingly, we see a common ownership for all seven of these domains:

Thomas Schneider
519 S. York Road
Dillsburg, Pennsylvania 17019
United States
Registered through: GoDaddy.com, Inc. (http://www.godaddy.com)
Created on: 11-Mar-09
Expires on: 11-Mar-12
Last Updated on: 10-Jan-11
Administrative Contact:
Schneider, Thomas garret.and@gmail.com
519 S. York Road
Dillsburg, Pennsylvania 17019
United States
7174327575 Fax --
Technical Contact:
Schneider, Thomas garret.and@gmail.com
519 S. York Road
Dillsburg, Pennsylvania 17019
United States
7174327575 Fax --
Domain servers in listed order:

Now we start seeing something being uncovered. Would this guy, Thomas Schneider, be behind this? Too easy to be true. We went and did a reverse whois to find other domains that contained the email garret.and@gmail.com. And here we are: The email is associated with the registration of 89 other domains, which are registered under a variety of last names, but all listing garret.and@gmail.com as the contact email:

Let's see what we have so far: The owner of a porn domain loads in a set of 0x0 iframes, a set of other websites, all operated by the same owner. But still, no clear motivation. Also, the connection with the publishers that we checked remains elusive.
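For readers who want to check a page for this pattern themselves, here is a minimal Python sketch that scans HTML for iframes declared with tiny dimensions. It is not the AdSafe classifier. The names `HiddenIframeFinder` and `find_hidden_iframes` and the 1-pixel threshold are assumptions made for illustration, and iframes hidden through CSS rather than through width/height attributes would not be caught.

```python
from html.parser import HTMLParser


def _to_pixels(value):
    """Parse '0', '1', or '1px' into an int; return None for anything else."""
    if value is None:
        return None
    text = value.strip().lower()
    if text.endswith("px"):
        text = text[:-2]
    return int(text) if text.isdigit() else None


class HiddenIframeFinder(HTMLParser):
    """Collect the src of every <iframe> declared at most max_px wide and tall."""

    def __init__(self, max_px=1):
        super().__init__()
        self.max_px = max_px  # assumed threshold, not a value from the post
        self.hidden = []

    def handle_starttag(self, tag, attrs):
        if tag != "iframe":
            return
        attr = dict(attrs)
        width = _to_pixels(attr.get("width"))
        height = _to_pixels(attr.get("height"))
        # Flag only iframes that declare both dimensions and keep them tiny.
        if width is not None and height is not None:
            if width <= self.max_px and height <= self.max_px:
                self.hidden.append(attr.get("src"))


def find_hidden_iframes(html, max_px=1):
    """Return the src attributes of all suspiciously small iframes in html."""
    finder = HiddenIframeFinder(max_px)
    finder.feed(html)
    return finder.hidden
```

Calling `find_hidden_iframes(page_source)` on a saved copy of a page returns the list of URLs loaded in near-invisible frames, which is a reasonable starting point for the kind of manual digging described here.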

Re-directions within the parked domains

Now, let's see what is going on within these URL calls, such as www.takemedical.com/go_with_post.php. Here is the HTML source of one of those URLs:

Interesting. Another redirection. The site automatically submits a search form, searching for the term "hihijiji". Loading the page in the browser with the GET method (as opposed to the POST method indicated in the form) takes us to the normal page of a parked domain.

But let's submit with the POST method:
curl www.takemedical.com/search.php -d token=hihijiji
Ha! The result is different:

<iframe frameborder="No" height="1" src="index2.php" width="1"> </iframe>
We uncovered a hidden URL. This is the point where everything will start falling into place.
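The GET-versus-POST difference above can also be reproduced from a script. Here is a small Python sketch using only the standard library; the function names `build_search_request` and `fetch` are mine, the `http://` scheme is an assumption (the curl command above omits it), and putting the token in the query string for the GET variant is just one guess at what a GET would look like.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_search_request(url, token, use_post=True):
    """Build (but do not send) a request that carries the search token."""
    payload = urlencode({"token": token})
    if use_post:
        # Supplying a request body makes urllib issue a POST, like `curl -d`.
        return Request(url, data=payload.encode("ascii"))
    # Assumed GET form: the same token passed in the query string.
    return Request(url + "?" + payload)


def fetch(request, timeout=10):
    """Send the request and return the response body as text."""
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Comparing `fetch(build_search_request(url, "hihijiji"))` with the `use_post=False` variant is the scripted equivalent of the experiment above, assuming the site still responds the same way.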

Re-directions and generating click fraud with normal click patterns

Within this hidden URL is where all the interesting things are happening! Let's load www.takemedical.com/index2.php and see the network activity. (In Chrome, go to Tools, Developer Tools, and then to the Network tab.) Here is the screenshot:

Indeed, here is where all the action is happening: These innocent-sounding parked domains load all sorts of ad sites, and then "click" on the ads. By click, we do not mean any actual click. Instead, the site loads the URL in the ad, which is typically a redirection to the ad server, which then redirects to the advertised URL. After this "click" within the iframe we finally have the publisher website (the "HGTV" that we mentioned before)!

Interestingly enough, the click fraud was very well done: It was not loading the same website all the time. Sometimes it was mevio, other times it was tremor, other times bodyarchitect.tv, and so on. And once we had been redirected enough times from the same IP address, the final redirect was going to find.fm, to execute a straightforward search. Clever! Engage in fraud, but be careful not to trigger any alarms.

Also, notice that the traffic patterns for the clicks are not bot-generated. These are actual users. With real and different web browsers. Different IP addresses. Different times of the day, following the usual traffic patterns per region. Good job: these click-fraud patterns are the least likely to be caught as they have patterns very similar to normal traffic.

For those interested in the details, here is the set of screenshots with the redirects:

The role of parked domains: Laundering traffic

At this point, we now know how this person makes money. Clearly, there is click-fraud: the scammer is employing click-fraud services to click on the pay-per-click ads "displayed" in his parked domains. If some of the ads are also pay-per-impression, he may also get paid for these invisible impressions that happen within the 0x0 iframe.

Why the parked domains though? Why not do the same directly within the porn site? The answer is simple: Traffic laundering.

What do I mean by "traffic laundering"? First, the ad networks are unlikely to place many ads within a porn site. On the other hand, they have ad-placement services for parked domains. Second, the publishers that get the traffic from the parked domains see in the referral URLs some legitimate-sounding domain names, not a porn site. Even if they go and check the site, they will only see an empty site full of ads. Nothing too suspicious. Hats off to the scammer. Clever scheme.

You think we are done? No. There is one more piece in the puzzle. How does the scammer attract visitors to the porn site?

Generating traffic through an adult traffic exchange

The other interesting part: The porn website does not really contain porn! There are a few images but most of the links are to other porn websites that actually host the videos. In other words, the scammer does not even pay the cost of hosting porn!

However, according to QuantCast and Compete, the website has a pretty significant number of unique visitors per month. Here is the traffic over the last year:

This porn website gets 500K to 1M unique visitors per month! That is a lot of traffic for a website without any real content! So, how does the guy get all the traffic?

The answer surprised me. Apparently, there is an exchange (yes, my dear readers, an exchange!) for buying and selling adult traffic! Its name: TrafficHolder.com

Do you want to buy traffic for people interested in midget sex? The price is \$2.94 per thousand visitors. Interested in latex? The cost is \$2.54 per thousand visitors. Interested in HD video? \$3.54 per thousand visitors. (The running price catalog and the available traffic volume are available at http://www.trafficholder.com/cgi-bin/traffic/manager/buying100.cgi)

How do the porn sites sell traffic to each other? Through pop-ups, through pop-unders, or by causing the first click on the website to redirect to the buyer's site. The term for this traffic is "skimmed traffic".

Following the trail, we figured out the source of the traffic for hqtubevideos.com: It was coming from the (very popular, apparently) website www.pornoxo.com. The reports from QuantCast and Compete confirm that PornoXo gets approximately 1 million unique visitors per month. If you visit the PornoXo.com website, you will see that the first click will create a pop-under that loads the page hqtubevideos.com/play.html. This is the page responsible for all the fraud that I described above.

Based on the exchange prices and the visitorship at PornoXo, this traffic has a cost of \$3K/month for hqtubevideos.com, which is significant. So, we need to figure out how the scammer recovers this cost.

How much money are we talking about?

So, the key question now: How much money does hqtubevideos.com generate through the scheme? To get a feeling of how much fraud is going on, please do the following:
  • Open Chrome
  • Open the options, and then Tools, then Developer Tools. This will load the monitoring tool.
  • Switch to the "Network" tab
  • Visit http://www.hqtubevideos.com/play.html and see what is being loaded in the background (my own counting was approximately 1 ad loading per 10 seconds)
Let's do some back-of-the-envelope, very conservative, approximations:
  • The site gets 500K-1M visitors per month
  • The cost of this traffic is approximately \$1.5K to \$3K per month
  • Each unique visit loads 7 sites, which then generate clicks. Let's assume that there is no reload of the invisible sites, to keep the estimates low.
    • Assuming 500K visitors and that just one click out of the seven sites goes through, this is 500K clicks per month (low estimate)
    • Assuming 1M visitors and that all clicks, in all 7 sites, go through, this is 7M clicks per month (high estimate)
  • A low-end estimate for CPC click costs is 30 cents, out of which we can assume that the scammer gets, say, 10 cents.
  • This generates a total income of \$50K to \$700K per month
  • The scheme has been running for 8 months now, generating total revenue of \$400K to \$5.6M so far. (And you thought that investment bankers were getting paid a lot...)
Notice that these approximations assume that the site only generates the direct clicks discussed above. You will notice that there is no end in the loading of ads, if you leave the website open for a while. Given that the site visitors come from PornoXo, there is a good chance they will keep watching the porn video at PornoXo, leaving hqtubevideos to load the ads in the background.

But even with the modest estimates listed above, we are talking about a business that generates tens of thousands of dollars per month, with really minimal requirements. This is a scheme that a single person can set up in a week...
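The back-of-the-envelope numbers above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function names are mine, and the traffic price of 3 dollars per thousand visitors is an assumption that sits in the middle of the TrafficHolder prices quoted earlier. Everything is kept in integers so that no rounding surprises creep in.

```python
def monthly_income(visitors, clicks_per_visit, payout_cents_per_click):
    """Scammer's monthly click income, in whole dollars."""
    return visitors * clicks_per_visit * payout_cents_per_click // 100


def traffic_cost(visitors, dollars_per_thousand):
    """Monthly cost of buying the traffic, in whole dollars."""
    return visitors * dollars_per_thousand // 1000


PAYOUT_CENTS = 10       # scammer's assumed share of a 30-cent click
MONTHS_RUNNING = 8      # how long the scheme has been live
PRICE_PER_THOUSAND = 3  # assumed exchange price, in dollars

# Low estimate: 500K visitors, one click per visit goes through.
low = monthly_income(500_000, 1, PAYOUT_CENTS)
# High estimate: 1M visitors, all seven sites generate a click.
high = monthly_income(1_000_000, 7, PAYOUT_CENTS)

print("monthly income:", low, "to", high)
print("total so far:", low * MONTHS_RUNNING, "to", high * MONTHS_RUNNING)
print("monthly traffic cost:",
      traffic_cost(500_000, PRICE_PER_THOUSAND), "to",
      traffic_cost(1_000_000, PRICE_PER_THOUSAND))
```

Changing the payout per click or the number of clicks per visit shows how sensitive the estimate is: the traffic cost barely moves, while the income scales linearly with every assumption.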

Overall picture

Trying to put all pieces together, I created the following graphical summary to see what is going on:

Let's follow the flow of the users:
  1. Scammer buys user traffic from PornoXo.com and sends it to HQTubeVideos.
  2. HQTubeVideos loads, in invisible iframes, some parked domains with innocent-sounding names (relaxhealth.com, etc)
  3. In the parked domains, ad networks serve display and PPC ads.
  4. The click-fraud sites click on the ads that appear within the parked domains.
  5. The legitimate publishers get invisible/fraudulent traffic through the (fraudulently) clicked ads from parked domains.
  6. Brand advertisers place their ad on the websites of the legitimate publishers, which in reality appear within the (invisible) iframe of HQTubeVideos.
  7. AdSafe detects the attempted placement within the porn website, and prevents the ads of the brand advertiser from appearing in the legitimate website, which is hosted within the invisible frame of the porn site.
Notice how nicely orchestrated the whole scheme is: The parked domains "launder" the porn traffic. The ad networks place the ads in some legitimate-sounding parked domains, not in a porn site. The publishers get traffic from innocent domains such as RelaxHealth, not from porn sites. The porn site loads a variety of publishers, distributing the fraud across many publishers and many advertisers.

Who has the incentives to fight this?

And now let's see who has the incentives to fight this. It is fraud, right? But I think it is a well-executed type of fraud. It targets and defrauds the player that has the least incentive to fight the scam.

Who is affected? Let's follow the money:
  1. The big brand advertisers (Continental, Coca Cola, Verizon, Vonage,...) pay the publishers and the ad networks for running their campaigns.
  2. The publishers pay the ad network and the scammer for the fraudulent clicks.
  3. The scammer pays PornoXo and TrafficHolder for the traffic.
The ad networks see clicks on their ads, they get paid, so not much to worry about. They would worry if their advertisers were not happy. But here we have a piece of genius:

The scammer did not target sites that would measure conversions or cost-per-acquisition. Instead, the scammer was targeting mainly sites that sell pay-per-impression ads and video ads. If the publishers display CPM ads paid by impression, any traffic is good, all impressions count. It is not an accident that the scammer targets publishers with video content, and plenty of pay-per-impression video ads. The publishers have no reason to worry if they get traffic and the cost-per-visit is low.

Effectively, the only ones hurt in this chain are the big brand advertisers, who feed the rest of the advertising chain.

Do the big brands care about this type of fraud? Yes and no, but not really deeply. Yes, they pay for some "invisible impressions". But this is a marketing campaign. In any case, not all marketing attempts are successful. Do all readers of The Economist look at the printed ads? Hardly. Do all web users pay attention to the banner ads? I do not think so. Invisible ads are just one of the things that make advertising a little bit more expensive and harder. Consider it part of the cost of doing business. In any case, compared to the overall marketing budget of these behemoths, the cost of such fraud is peanuts.

The big brands do not want their brand to be hurt. If the ads do not appear in places inappropriate for the brand, things are fine. Fighting the fraud publicly? This will just associate the brand with fraud. No marketing department wants that.

Note also that the fraudster does not target a single publisher, does not target a single advertiser. The damage is amortized so nicely that nobody feels that it is a big deal. A mastery of the long tail...

Well, but what if fraud is big? What if big bucks are wasted? Maybe some newspapers would like to investigate. Let's break the big story. What would be the effect? Publicizing that a significant source of their income (online advertising) is a dangerous thing, full of fraud? Who would like to shoot himself in the foot?

Fraud as (harmless?) parasite

Really. Genius. Defraud many rich guys a little bit each, and ensure that nobody has the incentives to really fight and chase the fraud.

The guy essentially realized that this type of fraud is really behaving like a parasite within a much bigger ecosystem. And it is a parasite that is so costly to remove that it makes sense to leave it there. As long as the parasite does not annoy the host too much, things will be fine.

Only if fraud becomes really big will there be a real incentive to fight advertising fraud. Until then, you know how to make \$500K/month...

Monday, March 14, 2011

Do Mechanical Turk workers lie about their location?

A few weeks back, Dahn Tamir graciously allowed me to take a peek at the data that he has been gathering about the workers on Mechanical Turk. He has assigned tasks over time to more than 50,000 workers on Mechanical Turk, so I consider his data to be one of the most representative samples of workers.

One of the nice tasks that he has been running is a simple HIT in which he asks workers to report their location. At the same time, in this task, Dahn was recording the IP of the worker. Why was the task nice? Because there is absolutely no incentive for the workers to be truthful. The submission will be accepted and paid no matter what. In a sense, it is a test that checks if workers will be truthful in cases where it is not possible to check their accuracy.

So, we used this test to check how sincere the workers are: We can simply geocode the IP address and find out the actual location of the worker. (With some degree of error, but good enough for approximation purposes.) For the workers that reported to be based in the US (approximately 22,000 workers), the HIT was asking for the zip code of the worker, making it easy to assign an approximate latitude/longitude location.

To measure how accurately the workers report their location, we measured the distance between the location of the IP and the location of the zip code. The plot below shows the distribution of the differences:

As you can see, most of the workers were pretty truthful about their location. The difference in distance was less than 10 miles for more than 60% of the workers: this difference can be easily explained by the limited accuracy of the geocoding APIs and by the approximation of using zip code locations.

Of course, the flip side of the coin is that a significant fraction of the workers were essentially lying about their location: For 10% of the workers (i.e., ~2250 of them) the IP address was more than 100 miles away from the reported zip code. For 2% of the workers (i.e., ~500 workers) the distance was more than 1000 miles away.

The biggest liar? A worker from Chennai, India who reported a zip code corresponding to Tampa in Florida. The IP was a cool 9500 miles away from the reported location!
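For the curious, the distance between two latitude/longitude points can be computed with the haversine formula. The sketch below is an illustration and not necessarily the exact computation behind the plot; the function names, the bucket labels, and the mean Earth radius of 3958.8 miles are my own choices, and any coordinates you feed it (from an IP geocoder or a zip code table) are approximations.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an assumed constant


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(min(1.0, a)))


def bucket(distance_miles):
    """Assign a distance to one of the ranges discussed in the post."""
    if distance_miles < 10:
        return "under 10 miles"
    if distance_miles <= 100:
        return "10 to 100 miles"
    if distance_miles <= 1000:
        return "100 to 1000 miles"
    return "over 1000 miles"
```

Plugging in rough city-center coordinates for Chennai (13.08, 80.27) and Tampa (27.95, -82.46) puts the pair in the "over 1000 miles" bucket by a very wide margin.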

Friday, March 11, 2011

The Road to Serfdom, ACM Edition


A couple of days back, I got the following email from ACM:

Dear Moderator/Chairs,

This is being sent to everyone with the chairs cc'd as the last and final requeset for the eform below to be completed or your panel overview abstract will be removed from the WWW 2011 Companion Publication and will NOT appear in the ACM DL.

Your prompt and immediate attention to the form below is needed.

permission release form URL: ....

ACM Copyrights & Permissions

Given that this was the "last and final requeset"[sic], I assumed that somehow I missed the previous requests. So, I checked my email to find out how late I was. Nope. Nothing in the archive, nothing in the trash, nothing in the spam, no entry in the delivery log. This was the first notification sent by ACM. They have just forgotten about this. But since they were running late, why not just threaten the authors? It is so much easier to pass the blame to others and be the first one to be aggressive.

What happened, ACM? Did you start getting advice on customer service from your pals at Sheridan Printing, who tend to send requests like this?

But I should not have been so surprised. This email just reflects the overall attitude of ACM. I have experienced this many times in the past. Anyway, I decided to sign the e-form, without firing back.

Donating copyright to ACM

Signing the form used to be a mechanical action. However, after reading Matt Blaze's post on copyright and academic publishing, I decided to read the form a little bit more carefully, to see exactly what I was signing.

As usual, we start with a transfer of copyright to ACM. The authors agree to transfer all their copyright rights to ACM, blah blah...

Wait a minute! Why does ACM need to own the copyright? No good reason. To publish and distribute the article, ACM just needs a non-exclusive license to print and distribute. There is no need to own the copyright.

If we follow ACM's logic, any artists who want to see their work exhibited in a museum need to give up the ownership of their work and give full ownership of their creations to the museum. For free. Without expecting any royalties back in return. Ever. Furthermore, instead of promoting the work, the museum would lock it in a "patron members access only" area. For all others, the museum would demand a separate entrance ticket to show each of the collection pieces. (Say, for a friendly price of $5 to see each painting?)

Anyway, let's not belabor the point with copyright. We know that ACM's policy sucks. We know that ACM is a bureaucracy serving just itself and not its members or the profession. Let's move on.

Let's move to the point that really got me fired up.

Protecting ACM from liability

What got me really pissed was the last part of the agreement:

Liability Waiver

* Your grant of permission is conditional upon you agreeing to the terms set out below.

I hereby release and discharge ACM and other publication sponsors and organizers from any and all liability arising out of my inclusion in the publication, or in connection with the performance of any of the activities described in this document as permitted herein. This includes, but is not limited to, my right of privacy or publicity, copyright, patent rights, trade secret rights, moral rights or trademark rights.

All permissions and releases granted by me herein shall be effective in perpetuity unless otherwise stipulated, and extend and apply to the ACM and its assigns, contractors, sublicensed distributors, successors and agents.

So, not only should we "voluntarily" donate ownership of our copyright to ACM. We also need to protect ACM from any liability.

In other words, ACM wants to get all the upside from owning the copyright, without ever distributing royalties to the contributing authors. (Not that it would be worth much. It is a matter of principle and a signal of respect to the authors, not an issue of monetary importance.) At the same time, ACM also wants the authors to provide a guarantee that if there is any problem with the copyright, the author will be the one liable for the damages.

All the upside for ACM, no revenue to the authors. All the downside to the authors, no obligations for ACM.

Thank you ACM for caring so much about your members. You will not be missed when you disappear.

Yours truly,
A lifetime member of ACM.

PS: In retrospect, the title of the post is offensive: From Wikipedia's definition of serfdom: "Serfdom included the forced labor of serfs bound to a hereditary plot of land owned by a lord in return for protection". In other words, the slave owners took the product of slaves' work, but in return they provided the protection and military support, to defend the slaves that were working the land. ACM also wants the slaves to "protect the land" as well. I owe an apology to the slave owners for the comparison.


Thursday, March 3, 2011

The promise and fear of an assembly line for knowledge work

Last week, Amanda Michel from ProPublica and I presented at the CAR 2011 conference (CAR stands for Computer-Assisted Reporting), on how to best use Mechanical Turk for a variety of tasks pertaining to data-driven journalism.

We discussed issues of quality assurance, how TurkIt-like workflow-based tasks can generate nice outcomes, and briefly touched upon the CrowdForge work from Niki Kittur and the team at CMU, showing that crowdsourcing can potentially generate intellectual outcomes comparable to those of trained humans.

The discussion after the session was a mix of excitement and fear. We have observed in the past how "assembly line" work for industrial production led to massive productivity improvements and was the basis for much of the progress in the 19th and 20th centuries. But that was for mechanical work. Yes, it replaced the centuries-old crafts of the blacksmiths, carpenters, and potters, but that was just part of progress.

What happens if we see now the assembly line extended into tasks that were traditionally considered creative and intellectual in nature? What would be the effect of an assembly line for knowledge work?

A few months back, I quoted Marx and Engels who, back in 1848, wrote in their Communist Manifesto:
the work of the proletarians has lost all individual character, and, consequently, all charm for the workman. ... [The workman] becomes an appendage of the machine, and it is only the most simple, most monotonous, and most easily acquired knack, that is required of him
(Btw, TIME magazine liked that connection enough to put it into their own article about Mechanical Turk.)

But how likely is it that this style of work will be extended further into the intellectual field? Are these Mechanical Turk experiments something generalizable, or just cute proof-of-concept experiments?

I was reminded of this question today, when I realized that many intellectual tasks are already commoditized:

The article "Inside the multimillion-dollar essay-scoring business: Behind the scenes of standardized testing" gives a dreadful view of how essays are being scored for the standardized tests.

Based on the description of the article, the (human-based) scoring process "goes too fast; relies on cheap, inexperienced labor; and does not accurately assess student learning." Needless to say, the workers were not exactly enthusiastic about their work. Match that with the computer-assisted scoring of essays, and you have an MTurk-like environment for much more intellectually-demanding tasks...

After reading this essay-scoring mill story, I started feeling a little bit uneasy. The MTurk-style work seems too far away to be in my future, so the discussion is always, ahem, academic. But the essay scoring brought the concept a little bit too close for comfort.