Showing posts with label fraud. Show all posts

Friday, June 28, 2013

Mechanical Turk account verification: Why Amazon disables so many accounts

Over the last year, Amazon embarked on a major effort: all holders of an Amazon Payments account (which includes all Mechanical Turk workers) had to verify their accounts by providing their social security number, address, full legal name, etc. Users who did not provide this information found their accounts disabled and were unable to perform any financial transaction.

This led to big changes in the market, as many international workers realized that Amazon could not verify their identity (even if they provided the correct information), and they found themselves locked out of Mechanical Turk.

So, why would Amazon start doing that?

  • Low quality of international workers. While there are certainly many high-quality workers outside the US, there is a certain segment of workers that join the market with the sole purpose of getting something for nothing. Especially after Indian workers became eligible to receive cash compensation (instead of just gift cards available to other non-US workers), the number of spam attacks from India went up significantly. 

    So, identity verification can help on that front. It is well known that it is difficult to build a good reputation scheme when identities are cheap to generate. When identities are easy to create, every time someone commits a bad action and gets caught, the account gets closed and a new account is created, ready to commit the same bad actions again. This significantly hurts new workers, who are de facto treated as potential spammers, which discourages them from joining the market. 

    I have long criticized the fact that Amazon allowed for easy generation of ids. Even though it seemed that Amazon required an SSN and other information to create an account, this was effectively an optional step. In fact, it was possible to use Fake Name Generator and create plenty of seemingly authentic "US based" accounts, simply using the SSNs of dead people. This meant that many fake accounts existed, many of them "US based", which then used Amazon Payments to forward their earnings to the true puppetmaster.


  • Labor law. Even though many (small) requesters are unaware of the fact, when you post jobs on Mechanical Turk, you directly engage in hiring contractors to do some work for you. Many people believe that you are paying Amazon, who then pays the workers, but in reality Amazon acts simply as a payment processor. Amazon does not act as an employer; the requester acts as an employer. As discussed in the past, this forces many requesters to unknowingly participate in a black market.

    The moment requesters realize that they are actually employing all these contractors is when some workers end up receiving more than $600 in payments from the requester over the fiscal year. At that point, due to IRS regulations, the requester needs to send a 1099-MISC form to the MTurk worker. Amazon then provides the full information (SSN, address, etc) of the workers to the requester. So Amazon would like to have the correct information, to avoid forcing the requesters to send 1099 forms to fake addresses, with fake names and SSNs.

    I should clarify here that the $600 limit is the point where the employer is forced to send a 1099-MISC form. In principle, a requester may want to send 1099-MISC forms to all workers, and Amazon may want to provide this information on demand. (I doubt that this can be the reason, though).

    Finally, there was a new regulation from the IRS last year: the IRS introduced the concept of a 1099-K form. Since Amazon acts as a de facto payment processor (and not as an employer), Amazon should also report the amount of payments sent to each worker. So, even if no worker has met the $600 limit from a single requester, if the overall payments to a single worker were high enough (specifically $20,000/yr or more, and more than 200 requesters), then again Amazon needs to report this information and include valid worker information there.

  • Money laundering: Since Mechanical Turk started becoming a marketplace with significant volume, this may have raised some flags in all the places that monitor financial transactions for money laundering. All US companies need to comply with the infamous USA PATRIOT Act, and for Mechanical Turk the provisions about money laundering and financing of terrorist activities may have been a reason for cleaning fake worker identities out of the marketplace. The basic idea, known as the "Know Your Customer (KYC)" doctrine, is that Amazon should know from whom they get money and to whom they send the money. Since Amazon accepts payments from US requesters only, they know where the money comes from. Now, by cleaning fake identities out of the marketplace and verifying the existing ones, they also know where the money flows to, so they seem to be more in compliance with the money laundering laws.
Overall, there are many reasons for Amazon to check the market, clean out fake accounts, and prevent any anonymous activity. For me, this is a good step, despite all the problems that it may generate for workers who have trouble proving their identity. Even in India, the new UID system will eventually allow legitimate Indian workers to prove their identity without problems.

One concern that someone expressed to me was that this direction removes the ability of workers to be truly anonymous. I am not exactly sure how this can be a concern, given that it is well established that in the workplace (electronic or not) there is a very limited right to privacy. Knowing the true identity of your workers (contractors or employees) is a pretty fundamental right of the employer, and I doubt that the expectation that a worker remains anonymous can be a "reasonable expectation of privacy". The only case in which I see this happening is if Amazon switches from being a payment processor to being an employer of all the Mechanical Turk workers, but I doubt this will happen anytime soon. 

At the end of the day, markets do not mix well with true and complete anonymity.

Wednesday, March 14, 2012

50% of the online ads are never seen

Almost a year back, I was involved in an advertising fraud case, as part of my involvement with AdSafe Media. (See the related Wall Street Journal story.) Long story short, it was a sophisticated scheme for generating traffic to websites that delivered ads to real users who could never see them, because the ads were never visible on screen. While we were able to uncover the scheme, what triggered our investigation was almost an accident: our adult-content classifier seemed to detect porn in websites that had absolutely nothing suspicious. While it was a great investigative success, we could not overlook the fact that this was not a systematic method for discovering such fraud attempts. As part of the effort to make detection more systematic, the following idea came up:

Why not monitor the duration for which a user can actually see an ad?

After a few months of development to get this feature to work, it became possible to measure the exact amount of time an ad was visible to a user. While this feature could now easily detect any fraud attempt that delivers ads to users who never see them, this was now almost secondary. It was the first time that we could monitor the amount of time that users are exposed to ads.



50% of the Ads are (almost) Never Seen.

By measuring the statistics of more than 1.5 billion ad impressions per day, it was possible to understand deeply how different websites perform. Some of the high level results:
  • 38% of the ads are never in view to a user
  • 50% of the ads are in view for less than 0.5 seconds
  • 56% of the ads are in view for less than 5 seconds
Personally, I found these numbers impressive. 50% of the delivered ads are never seen for more than 0.5 seconds! I wanted to check myself whether 0.5 seconds is sufficient to understand the ad. Apparently, the guys at AdSafe thought about that as well, so here is their experiment:



You know the old saying, "half of my marketing budget is completely wasted, I just do not know which half"? Well, apparently this intuition was correct :-) The cool thing now is that you can find out which half of the budget is wasted :-)




Give me More Data!

OK, the high level results were good, but honestly, I was not satisfied. The 50%-of-the-ads-are-never-seen is a good one-liner but I was craving more data. Were these results reliable? Or some convenient accident? So, I talked with Arun Ahuja, who gave me access to much more detailed data, sending my way the measurements for the top-1000 websites that run ads, ranked by number of visitors. (Fun fact of the day: Arun is working for AdSafe after replying to a tweet of mine. Who said that Twitter is not a recruiting mechanism?)

The first thing that I wanted to check was whether the timing measurements were reliable. For that, I got the visitorship and time-on-page data from Comscore, and compared the ranked lists from AdSafe and Comscore. The two lists had more than 75% overlap, which was pretty significant, given that the Comscore list also contained sites that do not display ads (e.g., Wikipedia). I also ranked the sites by number of visitors and by time spent on page, and compared the rankings of AdSafe and Comscore. The resulting Spearman rank correlation coefficient was 0.72, which was strong enough to convince me that the measurements were solid.
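To make the comparison concrete, here is a minimal sketch of the two checks: the overlap between two site lists and the Spearman rank correlation between two sets of measurements. The function names and the sample numbers are made up for illustration; the actual AdSafe and Comscore data are not reproduced here. Spearman's coefficient is computed as the Pearson correlation of the ranks, with tied values sharing their average rank.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def overlap(list_a, list_b):
    """Fraction of the sites in list_a that also appear in list_b."""
    return len(set(list_a) & set(list_b)) / len(list_a)


# Hypothetical time-on-page values (seconds) for the same five sites,
# as measured by two different sources.
source_a = [12.0, 45.0, 8.0, 600.0, 20.0]
source_b = [15.0, 30.0, 40.0, 550.0, 10.0]
print(spearman(source_a, source_b))
print(overlap(["a.example", "b.example", "c.example"], ["b.example", "c.example", "d.example"]))
```

Working on ranks rather than raw values matters here because time on page is heavy-tailed: a couple of sites with very long sessions would otherwise dominate the coefficient.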

The first thing that I wanted to see was the distribution of time that people spend on a web page. The times within a website followed a log-normal distribution, so the best way to summarize these values was by using the geometric mean of the samples, which is equal to $\left(\prod_{i=1}^n t_i\right)^{1/n} = \exp\left(\frac{1}{n}\sum_{i=1}^n \ln t_i\right)$; for the log-normal distribution, the geometric mean is equal to the median of the distribution, which is a pretty robust statistic. OK, done with the geeky stat details.
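For readers who want to try this, a minimal sketch of the geometric mean computed through the mean of the logarithms follows. The sample times are invented for illustration, and the helper name is my own; Python's standard library also ships `statistics.geometric_mean` (Python 3.8+), which should agree with it.

```python
import math
import statistics


def geometric_mean(times):
    """Geometric mean via exp(mean(log t)); all times must be strictly positive."""
    return math.exp(sum(math.log(t) for t in times) / len(times))


# Hypothetical time-on-page samples in seconds, with one long session.
times = [1.0, 10.0, 100.0]

print(geometric_mean(times))             # summary based on the logs
print(statistics.geometric_mean(times))  # standard-library equivalent
print(statistics.median(times))          # sample median, for comparison
print(statistics.mean(times))            # arithmetic mean, pulled up by the long session
```

Summing logarithms instead of multiplying the raw values also avoids overflow when there are many samples.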

The next thing was to plot the median time on page across different sites. Not surprisingly, the distribution is also a heavy-tailed one. While most people stay on a particular web page for just a few seconds on average (cough, median), there are a few sites on which people spend significantly more time. Here is the distribution:


What is the site with the highest median time on page? No, it is not Facebook. (You see, on Facebook people do move from one page to another...) The puzzles page of USA Today and Pandora are two of the top sites in terms of time on page, with median times around 10 minutes each.




Percent of Users Exposed to Ads, for Various Periods of Time

Unlike "time on page", the median ad visibility per site is not a very informative metric, given that the median time is close to zero for many sites. Instead, it is better to set different thresholds for ad visibility, and see what percentage of user sessions reach that level of ad visibility.
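As a rough illustration of the threshold metric, the sketch below computes, for one site, the share of user sessions in which the ad was visible for longer than each threshold. The session values are invented, and the function name is hypothetical; only the thresholds themselves come from the analysis in this post.

```python
import statistics

THRESHOLDS = (0.0, 2.5, 5.0, 7.5, 10.0)  # seconds


def exposure_shares(visible_seconds, thresholds=THRESHOLDS):
    """Map each threshold t to the fraction of sessions with ad visibility > t."""
    n = len(visible_seconds)
    return {t: sum(1 for v in visible_seconds if v > t) / n for t in thresholds}


# Hypothetical ad-visibility times (seconds) for ten sessions on one site.
sessions = [0.0, 0.0, 0.0, 0.0, 0.0, 0.3, 1.2, 6.0, 12.0, 30.0]

print(statistics.median(sessions))  # close to zero, so not very informative
for threshold, share in exposure_shares(sessions).items():
    print(f"t > {threshold}: {share:.0%} of sessions")
```

In this toy example the median is nearly zero even though a fifth of the sessions see the ad for more than 10 seconds, which is exactly the information the median hides.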

You can see below the distributions for $t>0$, $t>2.5$, $t>5.0$, $t>7.5$, and $t>10$ seconds.







How to interpret these plots?

For example, in the $t>0$ plot, we see that ~12% of the sites in the dataset were displaying the ad to 90%-100% of the visiting users. However, based on the $t>2.5$ plot, we can see that only 5% of the sites manage to show the ad for more than 2.5 seconds to 90%-100% of the visiting users, and these numbers plummet further for higher thresholds.

On the other side of the distribution, we can see that ~5% of the sites do not manage to make their ads visible to their users for more than 2.5 seconds for 90%+ of their visitors, and this number grows to 10% of the sites if we ask for the visibility to be higher than 10 seconds.

If you want to have the overall picture, here is a summary plot that puts together the histograms above:


Again, just a few data points to get you to interpret this plot quickly:
  • In 15% of the sites, the ad is not visible for 40% or more of the user sessions (see $t>0$ line)
  • In 60% of the sites, the ad is not visible for 70% or more of the user sessions (see $t>0$ line)
  • In 60% of the sites, the ad is not visible for more than 10 seconds for 40% of the user sessions (see $t>10$ line)
  • In 75% of the sites, the ad is not visible for more than 10 seconds for 50% of the user sessions (see $t>10$ line)



Correlation of time on page and ad visibility

And now let's move to the juicy stats. What is the correlation between the time on the page and the time that people actually see the ads on the page? Interestingly enough, the two numbers are not correlated:



What is wrong here? Well the main problem lies in the fact that many ads are never visible to the user (38% of them to be exact), or are visible for only brief periods of time (50% are seen for less than 0.5 seconds). From the above, we can see that the metric "percentage of user sessions with ad visibility greater than X seconds" is more descriptive than just the median.

In fact, if we compute the correlation of the visibility metrics with time on page, we get a clearer picture:




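Correlation between time on page and percent of user sessions exposed to ad for various periods of time

| Ad visibility threshold | Correlation with time on page |
|---|---|
| 0 secs | 0.09 |
| 2.5 secs | 0.18 |
| 5.0 secs | 0.22 |
| 7.5 secs | 0.24 |
| 10 secs | 0.25 |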


As you can see, the metric that correlates best with time on page is the one that examines what percentage of user sessions are exposed to an ad for more than 10 seconds. Indeed, we can see that there is a clearer trend, but the variance is still extremely high.







"Above the Fold" vs. "Below the Fold"?

Another common way to evaluate the visibility of an ad is to examine whether it is "above the fold" (i.e., near the top of the page and visible when the page loads), or "below the fold". This is a concept that is borrowed from the printed press and is a decent heuristic; unfortunately, it is not always accurate in the digital world. The site "Life below 600px" does a good job in explaining this. (Please visit the site, it is worth checking out :-)

To examine the effect of the "above the fold" visibility, we also measured the probability that an ad is visible when the site loads. (We decided not to use a hard metric such as "600px and below" as display sizes come in all sorts of variants).

Here is the median ad visibility, as a function of the probability of seeing an ad on load:


Here is the probability of seeing an ad for more than 10 seconds, as a function of the probability of seeing an ad on load:


There is definitely a positive correlation. But there is still a significant amount of remaining noise. As you can see, there are cases where the ad is visible on load ("above the fold") but people do not see the ad for long periods of time, and there are cases where the ad is not visible on load ("below the fold") but people still see it for long periods of time.





Example Sites

Given all the metrics and combinations, it would be good to examine a few sample sites to understand better what layouts and content generate the different combinations of time on site, ad visibility, etc.
  • High time on page, high ad visibility, above the fold: Check the ZeroHedge site. This combination is the "expected" combination. Ads are visible when the page loads, users stay at the site for long (3-4 minutes median time on page), and they get exposed to the ads for long periods of time, with high probability (The probability of ad visibility above 10 seconds is greater than 70%.)
  • High time on page, low ad visibility, above the fold: Check the "That Guy with the Glasses" site. (It is better to see a representative internal page). In this site, there is a banner ad on top, but the actual content of the site is the video. So users quickly scroll down to the video and never see the top banner ad.
  • High time on page, low ad visibility, below the fold: Consider the page with puzzles at USA Today. This is a page where users spend a significant amount of time. However, they rarely see the ad, as it is rendered below the game, and users simply do not scroll down there. (Median time on page 12 minutes, with median ad visibility being 0, and probability of seeing the ad for any period of time below 10%)
  • Low time on page, high ad visibility, below the fold: Check the site http://www.everydayhealth.com/. In this site, the main banner ad is rarely above the fold. However, the users seem to habitually scroll down to the options in the lower part of the page, so they get exposed to the ad for significant amounts of time. (The probability of ad visibility above 10 seconds is greater than 40%, while the median time on page is just 20 seconds.)



The Future of Ad Pricing

I would be very surprised if the pricing model for ads does not change to account for the visibility statistics. For display ads that get paid per impression, it is a no-brainer: if the user never sees the ad, there is no real impression, and the ad should not be paid for. But even for ads that get paid per click, the visibility statistics are important. How can we compute the clickthrough rate reliably in the presence of ads that are not even seen? I would expect visibility statistics to become a standard part of computing the clickthrough rate, which is a key metric of effectiveness for an ad.
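To see why this matters for clickthrough rates, here is a small sketch comparing the rate computed over all served impressions with the rate computed over viewable impressions only. The function name, the notion of "viewable" (say, in view for at least some minimum time), and all the counts are assumptions for illustration, not an established industry formula.

```python
def clickthrough_rates(served, viewable, clicks):
    """Return (CTR over served impressions, CTR over viewable impressions)."""
    if served <= 0 or viewable <= 0:
        raise ValueError("impression counts must be positive")
    return clicks / served, clicks / viewable


# Hypothetical campaign: half of the served impressions were never really in view.
raw_ctr, viewable_ctr = clickthrough_rates(served=100_000, viewable=50_000, clicks=100)
print(f"CTR over served impressions:   {raw_ctr:.2%}")
print(f"CTR over viewable impressions: {viewable_ctr:.2%}")
```

With half of the impressions unseen, the creative in this toy example performs twice as well as the raw number suggests; the low raw rate reflects placement, not the ad itself.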

The question is how fast this change will come. Perhaps the moment advertisers realize that they should not be paying for ads that are never shown to the users.

Wednesday, March 16, 2011

Uncovering an advertising fraud scheme. Or "the Internet is for porn"

You have heard about fraud and online advertising. You may have seen the Wall Street Journal video  "Porn Sites Scam Advertisers", or even read the story at today's Wall Street Journal about "Off Screen, Porn Sites Trick Advertisers" (Hint: to avoid the WSJ paywall, search the title of the article through Google News and click from there, to read the full article).

Since I am intimately familiar with the story covered by WSJ (i.e., I was part of the team at AdSafe that uncovered it), I thought it would be also good to cover the technical aspects in more detail, uncovering the way in which this advertising fraud scheme operated.

It is long but (I think) interesting. It is a story of a one-man-making-a-million-dollar-per-month fraud scheme. It shows how a moderately sophisticated advertising fraud scheme can generate very significant monetary benefits for the fraudster: Profits of millions of dollars per year.

If you want to skip the technical sleuthing details, you can skip directly to the overall picture and the discussion.



Disclaimer: In the story below, I will only mention by name the sites performing the fraudulent activities. All the brand names that you see are just for illustration purposes. They are not the ones affected by this case of fraud. Also remember that this is a personal blog. The views and opinions that I express here are my own and do not necessarily represent the views of AdSafe or the views of New York University.



The erroneous classifier

It all started while working at AdSafe. For those not familiar with AdSafe: The role of AdSafe is to provide brand protection services to online advertisers. In plain English, AdSafe analyzes website content and can block ads from appearing on individual web pages with content inappropriate for a brand. Porn, hate speech, gambling, celebrity gossip, torrents, are among the many categories that we detect.

On a nice Monday, the data science team gets the notification: The web page classifier was detecting a large number of porn web pages within legitimate, clean, big-brand-name websites. Think of websites such as BabyCenter, MSN MoneyCentral, HGTV, and so on. These sites would never have anything racy in their pages. However, we could see them being classified as having hard-core porn!

Why do we detect porn in clean sites? None of the pages within the sites contained anything offensive. No porn, no offensive material. Nothing. The websites were as clean as it gets. What was going on?



The invisible iframe hosting

The lifesaver was a technique developed at AdSafe: The key to the solution was the ability to read the address of the top frame that was hosting the ad(*). We were detecting porn because the ads that were supposed to appear within a "clean publisher" site were appearing within the frame of a porn website. Think of HGTV as an illustrative, but not the real, example of such a "clean publisher."

(*) For the technically curious: reading the address of the top frame is a challenging problem. For security reasons, browsers do not allow cross-domain scripting. So, it is not possible to just call the "top" object and read its properties. We have a proprietary solution for this.

By using this technique, we got our explanation: The HGTV website was appearing within an iframe of a porn website. In our case, the porn website was www.hqtubevideos.com.

WTF? This made no sense. Why would the porn website display HGTV (and the associated ads) within an iframe? Why would the porn site generate this "invisible" traffic towards HGTV? Just so that HGTV would get paid for the CPM ads? Or was the porn site trying to decrease the clickthrough rate of HGTV and ruin the performance of the CPC campaigns? Did the porn website love HGTV so much that it was trying to increase its traffic? No way. Did HGTV employ a porn website to increase its traffic? No way, either.

Made no sense whatsoever.



Checking the structure of the porn website

So, we decided to investigate. Let's see what is going on. First, we look at the HTML source of www.hqtubevideos.com/play.html, which was the top frame that we were detecting. Here is the source:



The highlighted part shows an interesting redirection. We go to the www.hqtubevideos.com/index.php, but with the parameter ?x=1 at the end.

Loading the page www.hqtubevideos.com/index.php without this parameter loads a "vanilla" porn website. A few semi-suspicious attempts to add the website in the bookmarks in the beginning. Then porn pictures and links to affiliate sites. Plenty of porn but nothing to set an alarm. So far, so good.

The key, though, is this parameter ?x=1. Loading www.hqtubevideos.com/index.php?x=1, we see a key part added to the website, at the very bottom. Here is the corresponding source.


Aha! A 0x0 iframe, loading the following URLs:
The first URL seems to be loading some randomized hashid. Ignore.

The second URL, www.hqtubevideos.com/counter2.php, is a little bit more interesting and puzzling:


What is going on? Why would a porn site link to these domains? What is the connection?



The parked domains

We started by doing a whois to figure out the ownership of these domains. Unfortunately, the registration information for hqtubevideos is private and protected. However, the registration info for all the other domains is available. Not surprisingly, we see a common ownership for all these seven domains:

Registrant:
Thomas Schneider
519 S. York Road
Dillsburg, Pennsylvania 17019
United States
Registered through: GoDaddy.com, Inc. (http://www.godaddy.com)
Domain Name: RELAXHEALTH.COM
Created on: 11-Mar-09
Expires on: 11-Mar-12
Last Updated on: 10-Jan-11
Administrative Contact:
Schneider, Thomas garret.and@gmail.com
519 S. York Road
Dillsburg, Pennsylvania 17019
United States
7174327575 Fax --
Technical Contact:
Schneider, Thomas garret.and@gmail.com
519 S. York Road
Dillsburg, Pennsylvania 17019
United States
7174327575 Fax --
Domain servers in listed order:
NS1.ROLENEWS.COM
NS2.ROLENEWS.COM

Now we start seeing something being uncovered. Would this guy, Thomas Schneider, be behind this? Too easy to be true. We went and did a reverse whois to find other domains that contained the email garret.and@gmail.com. And here we are: The email is associated with the registration of 89 other domains, which are registered under a variety of last names, but all listing garret.and@gmail.com as the contact email:
aboutclimax.com
aboutclinical.com
aboutcouples.com
abouterectile.com
abouterection.com
achieveday.com
achievedrugs.com
afterdeaths.com
afterdrugs.com
associatedmagazine.com
atlantea.org
baldnesshealth.com
basehealth.com
becomeerect.com
begineducate.com
behaviordesire.com
beingdizzy.com
bestcialis.com
bestclimax.com
bigcouples.com
bodychemical.com
bodyclimax.com
bodyday.com
bundlehealth.com
calnam.com
cancerdamage.com
carecouples.com
carloschongdds.com
ceaifa.com
cialisc.com
cigarettesfinder.com
clubofheads.com
coacaz.com
college-grants1.com
college-scholarships1.com
collinshall.com
conditionnews.com
couponvi.com
criminaldefenseattorneys2.com
criminaldefenselawfirms1.com
detailedhealth.com
drinkershealth.com
drinkingmagazine.com
eurovision-2008.com
experiencemedical.com
fantasiesmagazine.com
fearhealth.com
gendergibe.org
government-grants1.org
groupovienna.net
hardballdollars.com
hawgsandpaws.org
impotencemagazine.com
letscurepeyronies.com
levitrav.com
medicationmagazine.com
moorehabitat.org
nighttimemagazine.com
ownmeds.com
panimarock.com
playmeds.com
powerfulselling.com
printcoupons1.com
propeciav.com
relationshipmeds.com
relaxhealth.com
rml-inc.com
rxvis.com
savewhalompark.com
sex-tvs.com
shopwizz.biz
signbysign.com
steve-magic.com
styleandmore.net
syncsql.com
takemedical.com
taylor-training.com
testosteronehealth.com
thedongman.com
traumamedical.com
twohealth.com
viagracomp.com
viagraeds.com
viagramagazine.com
viagravi.com
washealth.com
waymagazine.com
weightmedical.com
worldcuplive1.com

Let's see what we have so far: The owner of a porn domain loads in a set of 0x0 iframes, a set of other websites, all operated by the same owner. But still, no clear motivation. Also, the connection with the publishers that we checked remains elusive.



Re-directions within the parked domains

Now, let's see what is going on within these URL calls, such as www.takemedical.com/go_with_post.php. Here is the HTML source of one of those URLs:


Interesting. Another redirection. The site automatically submits a search form, searching for the term "hihijiji". Loading the page in the browser with the GET method (as opposed to the POST method indicated in the form) takes us to the normal page of a parked domain.

But let's submit with the POST method:
curl www.takemedical.com/search.php -d token=hihijiji
Ha! The result is different:

<iframe frameborder="No" height="1" src="index2.php" width="1"> </iframe>
We uncovered a hidden URL. This is the point where everything will start falling into place.
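For those who want to automate this kind of check, here is a minimal sketch that scans an HTML document for iframes whose declared size is at most 1x1 pixel, like the one above and the 0x0 iframe found earlier. The class and function names are my own, and the approach is deliberately simple: it only looks at the `width` and `height` attributes, so an iframe hidden through CSS would not be caught.

```python
from html.parser import HTMLParser


class TinyIframeFinder(HTMLParser):
    """Collect the src of iframes whose declared width and height are at most max_px."""

    def __init__(self, max_px=1):
        super().__init__()
        self.max_px = max_px
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "iframe":
            return
        attributes = dict(attrs)
        try:
            width = int(attributes.get("width"))
            height = int(attributes.get("height"))
        except (TypeError, ValueError):
            return  # missing or non-numeric size (e.g. "100%"): not flagged
        if width <= self.max_px and height <= self.max_px:
            self.found.append(attributes.get("src"))


def find_tiny_iframes(html, max_px=1):
    """Return the src values of all iframes declared as max_px by max_px or smaller."""
    finder = TinyIframeFinder(max_px)
    finder.feed(html)
    return finder.found


# The response body shown above, as returned for the POST request.
response_body = '<iframe frameborder="No" height="1" src="index2.php" width="1"> </iframe>'
print(find_tiny_iframes(response_body))
```

A real crawler would also need to render the page and inspect computed styles, since hiding a frame does not require explicit size attributes.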





Re-directions and generating click fraud with normal click patterns

Within this hidden URL is where all the interesting things are happening! Let's load www.takemedical.com/index2.php and see the network activity. (In Chrome, go to Tools, Developer Tools, and then to the Network tab.) Here is the screenshot:



Indeed, here is where all the action is happening: These innocent-sounding parked domains load all sorts of ad sites, and then "click" on the ads. By click, we do not mean any actual click. Instead, the site loads the URL in the ad, which is typically a redirection to the ad server, which then redirects to the advertised URL. After this "click" within the iframe, we finally have the publisher website (the "HGTV" that we mentioned before)!

Interestingly enough, the click fraud was very well done: It was not loading the same website all the time. Sometimes it was mevio, other times it was tremor, other times bodyarchitect.tv, and so on. And once we had been redirected enough times from the same IP address, the final redirect went to find.fm, to execute a straightforward search. Clever! Engage in fraud, but be careful not to trigger any alarms.

Also, notice that the traffic patterns for the clicks are not bot-generated. These are actual users. With real and different web browsers. Different IP addresses. Different times of the day, following the usual traffic patterns per region. Good job: these click-fraud patterns are the least likely to be caught as they have patterns very similar to normal traffic.

For those interested in the details, here is the set of screenshots with the redirects:













The role of parked domains: Laundering traffic

At this point, we now know how this person makes money. Clearly, there is click-fraud: the scammer is employing click-fraud services to click on the pay-per-click ads "displayed" in his parked domains. If some of the ads are also pay-per-impression, he may also get paid for these invisible impressions that happen within the 0x0 iframe.

Why the parked domains though? Why not do the same directly within the porn site? The answer is simple: Traffic laundering.

What do I mean by "traffic laundering"? First, the ad networks are unlikely to place many ads within a porn site. On the other hand, they have ad-placement services for parked domains. Second, the publishers that get the traffic from the parked domains see in the referral URLs some legitimate-sounding domain names, not a porn site. Even if they go and check the site, they will only see an empty site full of ads. Nothing too suspicious. Hats off to the scammer. Clever scheme.

You think we are done? No. There is one more piece in the puzzle. How does the scammer attract visitors to the porn site?



Generating traffic through an adult traffic exchange

The other interesting part: The porn website does not really contain porn! There are a few images but most of the links are to other porn websites that actually host the videos. In other words, the scammer does not even pay the cost of hosting porn!

However, according to QuantCast and Compete, the website has a pretty significant number of unique visitors per month. Here is the traffic over the last year:






This porn website gets 500K to 1M unique visitors per month! That is a lot of traffic for a website without any real content! So, how does the guy get all the traffic?

The answer surprised me. Apparently, there is an exchange (yes, my dear readers, an exchange!) for buying and selling adult traffic! Its name: TrafficHolder.com

Do you want to buy traffic for people interested in midget sex? The price is \$2.94 per thousand visitors. Interested in latex? The cost is \$2.54 per thousand visitors. Interested in HD video? \$3.54 per thousand visitors. (The running price catalog and the available traffic volume are available at http://www.trafficholder.com/cgi-bin/traffic/manager/buying100.cgi)

How do the porn sites sell traffic to each other? Through pop-ups, pop-unders, or by causing the first click on the website to redirect to the buyer's site. The term for this traffic is "skimmed traffic".

Following the trail, we figured out the source of the traffic for hqtubevideos.com: It was coming from the (very popular, apparently) website www.pornoxo.com. The reports from QuantCast and Compete confirm that PornoXo gets approximately 1 million unique visitors per month. If you visit the PornoXo.com website, you will see that the first click will create a pop-under that loads the page hqtubevideos.com/play.html. This is the page responsible for all the fraud that I described above.

Based on the exchange prices and the visitorship at PornoXo, this traffic has a cost of \$3K/month for hqtubevideos.com, which is significant. So, we need to figure out how the scammer recovers this cost.



How much money are we talking about?

So, the key question now: How much money does hqtubevideos.com generate through the scheme? To get a feeling for how much fraud is going on, please do the following:
  • Open Chrome
  • Open the options, and then Tools, then Developer Tools. This will load the monitoring tool.
  • Switch to the "Network" tab
  • Visit http://www.hqtubevideos.com/play.html and see what is being loaded in the background (my own counting was approximately 1 ad loading per 10 seconds)
Let's do some back-of-the-envelope, very conservative, approximations:
  • The site gets 500K-1M visitors per month
  • The cost of this traffic is approximately \$1.5K to \$3K per month
  • Each unique visit loads 7 sites, which then generate clicks. Let's assume that there is no reload of the invisible sites, to keep the estimates low.
    • Assuming 500K visitors and that just one click out of the seven sites goes through, this is 500K clicks per month (low estimate)
    • Assuming 1M visitors and that all clicks, in all 7 sites, go through, this is 7M clicks per month (high estimate)
  • A low-end estimate for the cost per click (CPC) is 30 cents, out of which we can assume that the scammer gets, say, 10 cents.
  • This generates a total income of \$50K to \$700K per month
  • The scheme has been running for 8 months now, generating total revenue of \$400K to more than \$5M so far. (And you thought that investment bankers were getting paid a lot...)
Notice that these approximations assume that the site only generates the direct clicks discussed above. You will notice that there is no end to the loading of ads if you leave the website open for a while. Given that the site visitors come from PornoXo, there is a good chance they will keep watching the porn video at PornoXo, leaving hqtubevideos to load the ads in the background.
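
The estimate in the list above can be reproduced with a few lines of Python. This is only a sketch of the same back-of-the-envelope arithmetic: the function and constant names are mine, and every input (visitors, clicks per visit, the 10-cent cut, the 8 months, the traffic cost) is the assumption stated in the list, not a measured value.

```python
# Back-of-the-envelope income estimate, in exact integer arithmetic.
SCAMMER_CUT_CENTS = 10   # assumed scammer share of a ~30 cent click
MONTHS_RUNNING = 8       # how long the scheme has been live


def monthly_income_usd(visitors, clicks_per_visit, cut_cents=SCAMMER_CUT_CENTS):
    """Monthly income in whole dollars for the given traffic and click rate."""
    return visitors * clicks_per_visit * cut_cents // 100


# Low estimate: 500K visitors, one click per visit goes through.
low_monthly = monthly_income_usd(500_000, 1)
# High estimate: 1M visitors, all 7 hidden sites produce a click.
high_monthly = monthly_income_usd(1_000_000, 7)

# Subtract the estimated monthly traffic cost (1.5K to 3K USD).
low_net = low_monthly - 1_500
high_net = high_monthly - 3_000

if __name__ == "__main__":
    print(f"monthly income: {low_monthly:,} to {high_monthly:,} USD")
    print(f"monthly net of traffic cost: {low_net:,} to {high_net:,} USD")
    print(f"total over {MONTHS_RUNNING} months: "
          f"{low_monthly * MONTHS_RUNNING:,} to {high_monthly * MONTHS_RUNNING:,} USD")
```

The traffic cost barely dents the result: even the low estimate nets about \$48.5K per month, and over 8 months the totals come to \$400K and \$5.6M, consistent with the rounded figures above.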

But even with the modest estimates listed above, we are talking about a business that generates tens of thousands of dollars per month, with really minimal requirements. This is a scheme that a single person can set up in a week...



Overall picture

Trying to put all pieces together, I created the following graphical summary to see what is going on:



Let's follow the flow of the users:
  1. Scammer buys user traffic from PornoXo.com and sends it to HQTubeVideos.
  2. HQTubeVideos loads, in invisible iframes, some parked domains with innocent-sounding names (relaxhealth.com, etc)
  3. In the parked domains, ad networks serve display and PPC ads.
  4. The click-fraud sites click on the ads that appear within the parked domains.
  5. The legitimate publishers get invisible/fraudulent traffic through the (fraudulently) clicked ads from parked domains.
  6. Brand advertisers place their ad on the websites of the legitimate publishers, which in reality appear within the (invisible) iframe of HQTubeVideos.
  7. AdSafe detects the attempted placement within the porn website, and prevents the ads of the brand advertiser from appearing in the legitimate website, which is hosted within the invisible frame of the porn site.
Notice how nicely orchestrated the whole scheme is: The parked domains "launder" the porn traffic. The ad networks place the ads in some legitimate-sounding parked domains, not in a porn site. The publishers get traffic from innocent domains such as RelaxHealth, not from porn sites. The porn site loads a variety of publishers, distributing the fraud across many publishers and many advertisers.




Who has the incentives to fight this?

And now let's see who has the incentives to fight this. It is fraud, right? But I think it is a well-executed type of fraud. It targets and defrauds the player that has the least incentive to fight the scam.

Who is affected? Let's follow the money:
  1. The big brand advertisers (Continental, Coca Cola, Verizon, Vonage,...) pay the publishers and the ad networks for running their campaigns.
  2. The publishers pay the ad network and the scammer for the fraudulent clicks.
  3. The scammer pays PornoXo and TrafficHolder for the traffic.
The ad networks see clicks on their ads, they get paid, so not much to worry about. They would worry if their advertisers were not happy. But here we have a piece of genius:

The scammer did not target sites that would measure conversions or cost-per-acquisition. Instead, the scammer was targeting mainly sites that sell pay-per-impression ads and video ads. If the publishers display CPM ads paid by impression, any traffic is good, all impressions count. It is not an accident that the scammer targets publishers with video content, and plenty of pay-per-impression video ads. The publishers have no reason to worry if they get traffic and the cost-per-visit is low.

Effectively, the only ones hurt in this chain are the big brand advertisers, who feed the rest of the advertising chain.

Do the big brands care about this type of fraud? Yes and no, but not really deeply. Yes, they pay for some "invisible impressions". But this is a marketing campaign. In any case, not all marketing attempts are successful. Do all readers of The Economist look at the printed ads? Hardly. Do all web users pay attention to the banner ads? I do not think so. Invisible ads are just one of the things that make advertising a little bit more expensive and harder. Consider it part of the cost of doing business. In any case, compared to the overall marketing budget of these behemoths, the cost of such fraud is peanuts.

The big brands do not want their brand to be hurt. If the ads do not appear in places inappropriate for the brand, things are fine. Fighting the fraud publicly? This will just associate the brand with fraud. No marketing department wants that.

Note also that the fraudster does not target a single publisher, does not target a single advertiser. The damage is amortized so nicely that nobody feels that it is a big deal. A mastery of the long tail...

Well, but what if fraud is big? What if big bucks are wasted? Maybe some newspapers would like to investigate. Let's break the big story. What would be the effect? Publicizing that a significant source of their income (online advertising) is a dangerous thing, full of fraud? Who would like to shoot himself in the foot?



Fraud as (harmless?) parasite

Really. Genius. Defraud many rich guys a little bit each, and ensure that nobody has the incentives to really fight and chase the fraud.

The guy essentially realized that this type of fraud is really behaving like a parasite within a much bigger ecosystem. And it is a parasite that is so costly to remove that it makes sense to leave it there. As long as the parasite does not annoy the host too much, things will be fine.

Only if fraud becomes really big will there be a real incentive to fight advertising fraud. Until then, you know how to make \$500K/month...