Wednesday, March 14, 2012

50% of the online ads are never seen

Almost a year ago, I was involved in an advertising fraud case, as part of my work with AdSafe Media. (See the related Wall Street Journal story.) Long story short, it was a sophisticated scheme for generating user traffic to websites that were serving ads to real users, but in a way that the ads were never actually visible to them. While we were able to uncover the scheme, what triggered our investigation was almost an accident: our adult-content classifier seemed to detect porn in websites that showed absolutely nothing suspicious. While it was a great investigative success, we could not overlook the fact that this was not a systematic method for discovering such fraud attempts. As part of the effort to make the detection more systematic, the following idea came up:

Let's monitor the duration for which a user can actually see an ad.

After a few months of development to get this feature to work, it became possible to measure the exact amount of time an ad was visible to a user. While this feature could now easily detect any fraud attempt that delivers ads to users who never see them, this was now almost secondary. It was the first time that we could monitor the amount of time that users get exposed to ads.



50% of the Ads are (almost) Never Seen.

By measuring the statistics of more than 1.5 billion ad impressions per day, it was possible to understand deeply how different websites perform. Some of the high level results:
  • 38% of the ads are never in view to a user
  • 50% of the ads are in view for less than 0.5 seconds
  • 56% of the ads are in view for less than 5 seconds
Personally, I found these numbers impressive. 50% of the delivered ads are never seen for more than 0.5 seconds! I wanted to check myself whether 0.5 seconds is sufficient to understand the ad. Apparently, the guys at AdSafe thought about that as well, so here is their experiment:



You know the old saying, "half of my marketing budget is completely wasted, I just do not know which half"? Well, apparently this intuition was correct :-) The cool thing now is that you can find out which half of the budget is wasted :-)




Give me More Data!

OK, the high level results were good, but honestly, I was not satisfied. The 50%-of-the-ads-are-never-seen is a good one-liner but I was craving more data. Were these results reliable? Or some convenient accident? So, I talked with Arun Ahuja, who gave me access to much more detailed data, sending my way the measurements for the top-1000 websites that run ads, ranked by number of visitors. (Fun fact of the day: Arun is working for AdSafe after replying to a tweet of mine. Who said that Twitter is not a recruiting mechanism?)

The first thing that I wanted to check was whether the timing measurements are reliable. For that, I got the visitorship and time-on-page data from Comscore, and compared the ranked lists from AdSafe and Comscore. The two lists had more than 75% overlap, which was pretty significant, given that the Comscore list also contained sites that do not display ads (e.g., Wikipedia). I also ranked the sites by number of visitors and by time spent on page, and compared the rankings of AdSafe and Comscore. The resulting Spearman rank correlation coefficient was 0.72, which was strong enough to convince me that the measurements were solid.
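For readers who want to run this kind of sanity check themselves, here is a minimal sketch of the Spearman rank correlation, computed as the Pearson correlation of the ranks. The two lists are made-up numbers for five imaginary sites, not AdSafe or Comscore data, and the function names are my own.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1..end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical time-on-page values (seconds) for the same five sites,
# as reported by two different measurement sources.
source_a = [12.0, 45.0, 30.0, 600.0, 8.0]
source_b = [15.0, 40.0, 50.0, 480.0, 9.0]
rho = spearman(source_a, source_b)
print(f"Spearman rho = {rho:.2f}")
```

Working on ranks rather than raw values is what makes this suitable for heavy-tailed measurements: one site with a huge time on page cannot dominate the coefficient.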

The first thing that I wanted to see was the distribution of time that people spend on a web page. The times within a website followed a log-normal distribution, so the best way to summarize these values was by using the geometric mean of the samples, which is equal to $\left(\prod_{i=1}^{n} t_i\right)^{1/n} = \exp\left(\frac{1}{n}\sum_{i=1}^{n} \ln t_i\right)$; for the log-normal distribution, the geometric mean is equal to the median of the distribution, which is a pretty robust statistic. OK, done with the geeky stat details.
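If you want to convince yourself of the geometric-mean-equals-median claim, here is a small sketch on simulated data. The parameters (a true median of 20 seconds, 20,000 samples) are arbitrary assumptions of mine, not numbers from the AdSafe dataset.

```python
import math
import random
import statistics


def geometric_mean(times):
    """Geometric mean via the log-domain formula exp(mean(ln t_i))."""
    return math.exp(sum(math.log(t) for t in times) / len(times))


# Simulated log-normal sample: ln(t) ~ Normal(mu, sigma),
# so the true median of t is exp(mu) = 20 seconds (an assumed value).
random.seed(0)
mu, sigma = math.log(20.0), 1.0
times = [random.lognormvariate(mu, sigma) for _ in range(20000)]

gm = geometric_mean(times)
sample_median = statistics.median(times)
arithmetic_mean = statistics.fmean(times)

print(f"geometric mean:  {gm:.1f}")
print(f"sample median:   {sample_median:.1f}")
print(f"arithmetic mean: {arithmetic_mean:.1f}")
```

The geometric mean and the sample median should land close to each other, while the arithmetic mean gets pulled upwards by the long right tail. Summing logarithms instead of multiplying the raw values also avoids overflow when the number of samples is large.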

The next thing was to plot the median time on page across different sites. Not surprisingly, the distribution is also a heavy-tailed one. While most people stay on a particular web page for just a few seconds on average (cough, median), there are a few sites on which people spend significantly more time. Here is the distribution:


What is the site with the highest median time on page? No, it is not Facebook. (You see, on Facebook people do move from one page to another...) The puzzles page of USA Today and Pandora are two of the top sites in terms of time on page, with median times around 10 minutes each.




Percent of Users Exposed to Ads, for Various Periods of Time

Unlike "time on page", the median ad visibility per site is not a very informative metric, given that the median time is close to zero for many sites. Instead, it is better to set different thresholds for ad visibility, and see what percentage of user sessions reach that level of ad visibility.
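To make the threshold idea concrete, here is a small sketch for a single site. The per-session visibility times are invented for illustration, and `share_exposed` is a name I made up; the thresholds are the ones used in this post.

```python
import statistics


def share_exposed(visible_seconds, threshold):
    """Fraction of sessions whose ad was in view for strictly more than `threshold` seconds."""
    if not visible_seconds:
        raise ValueError("need at least one session")
    return sum(1 for t in visible_seconds if t > threshold) / len(visible_seconds)


# Hypothetical ad visibility (seconds) for ten user sessions on one site.
sessions = [0.0, 0.0, 0.0, 0.0, 0.0, 0.3, 4.0, 6.5, 12.0, 30.0]
thresholds = [0.0, 2.5, 5.0, 7.5, 10.0]

median_visibility = statistics.median(sessions)
profile = {t: share_exposed(sessions, t) for t in thresholds}

print(f"median visibility: {median_visibility} secs")
for t, share in profile.items():
    print(f"t > {t:>4} secs: {share:.0%} of sessions")
```

In this toy example the median is almost zero, which tells us very little, while the profile across thresholds still shows that a good chunk of the sessions get a long look at the ad.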

You can see below the distributions for $t>0$ secs, $t>2.5$ secs, $t>5.0$ secs, $t>7.5$ secs, and $t>10$ secs.







How to interpret these plots?

For example, in the $t>0$ plot, we see that ~12% of the sites in the dataset were displaying the ad to 90%-100% of the visiting users. However, based on the $t>2.5$ plot, we can see that only 5% of the sites manage to show the ad for more than 2.5 seconds to 90%-100% of the visiting users, and these numbers plummet further for higher thresholds.

On the other side of the distribution, we can see that ~5% of the sites do not manage to make their ads visible to their users for more than 2.5 seconds for 90%+ of their visitors, and this number grows to 10% of the sites if we ask for the visibility to be higher than 10 seconds.
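The histograms are simple to reproduce once you have, for each site, the share of sessions that pass a given threshold. Here is a rough sketch that assumes 10%-wide buckets, which is how I read the plots; the ten site values are invented.

```python
def decile_histogram(site_shares):
    """Share of sites that fall in each 10%-wide bucket of exposed sessions."""
    counts = [0] * 10
    for s in site_shares:
        bucket = min(int(s * 10), 9)  # a share of exactly 1.0 goes in the 90%-100% bucket
        counts[bucket] += 1
    return [c / len(site_shares) for c in counts]


# Hypothetical values: for each of ten sites, the fraction of sessions
# in which the ad was visible for longer than some threshold.
site_shares = [0.05, 0.35, 0.42, 0.58, 0.61, 0.66, 0.74, 0.88, 0.93, 1.0]
hist = decile_histogram(site_shares)

for i, share in enumerate(hist):
    print(f"{i * 10:>3}%-{(i + 1) * 10}%: {share:.0%} of sites")
```

The last bucket is the "90%-100% of the visiting users" bar discussed above, and the first bucket is the other end of the distribution.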

If you want to have the overall picture, here is a summary plot that puts together the histograms above:


Again, just a few data points to get you to interpret this plot quickly:
  • In 15% of the sites, the ad is not visible for 40% or more of the user sessions (see $t>0$ line)
  • In 60% of the sites, the ad is not visible for 70% or more of the user sessions (see $t>0$ line)
  • In 60% of the sites, the ad is not visible for more than 10 seconds for 40% of the user sessions (see $t>10$ line)
  • In 75% of the sites, the ad is not visible for more than 10 seconds for 50% of the user sessions (see $t>10$ line)
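Statements of this form ("in X% of the sites, the ad is not visible for Y% or more of the sessions") can be read off a cumulative count. Here is a sketch under the same assumption as before, namely that we have one exposed-session share per site for a given threshold; the data and the function name are hypothetical.

```python
def share_of_sites_missing_at_least(site_shares, missing_fraction):
    """Fraction of sites where at least `missing_fraction` of sessions do NOT reach the threshold."""
    hits = sum(1 for s in site_shares if (1.0 - s) >= missing_fraction)
    return hits / len(site_shares)


# Hypothetical per-site shares of sessions that reach the visibility threshold.
site_shares = [0.05, 0.35, 0.42, 0.58, 0.61, 0.66, 0.74, 0.88, 0.93, 1.0]

missing_40 = share_of_sites_missing_at_least(site_shares, 0.40)
missing_70 = share_of_sites_missing_at_least(site_shares, 0.70)

print(f"ad missed in 40%+ of sessions: {missing_40:.0%} of sites")
print(f"ad missed in 70%+ of sessions: {missing_70:.0%} of sites")
```

For a fixed threshold this curve can only go down as the required fraction of missed sessions goes up, which is a handy sanity check when reading numbers off a plot.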



Correlation of time on page and ad visibility

And now let's move to the juicy stats. What is the correlation between the time on the page and the time that people actually see the ads on the page? Interestingly enough, the two numbers are not correlated:



What is wrong here? Well, the main problem lies in the fact that many ads are never visible to the user (38% of them to be exact), or are visible for only brief periods of time (50% are seen for less than 0.5 seconds). From the above, we can see that the metric "percentage of user sessions with ad visibility greater than X seconds" is more descriptive than just the median.

In fact, if we compute the correlation of these visibility metrics with time on page, we get a clearer picture:




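Correlation between time on page and percent of user sessions exposed to an ad for various periods of time

| Ad visibility threshold | Correlation with time on page |
|---|---|
| 0 secs | 0.09 |
| 2.5 secs | 0.18 |
| 5.0 secs | 0.22 |
| 7.5 secs | 0.24 |
| 10 secs | 0.25 |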


As you can see, the metric that correlates best with time on page is the one that examines what percentage of user sessions are exposed to an ad for more than 10 seconds. Indeed, we can see that there is a clearer trend, but the variance is still extremely high.







"Above the Fold" vs. "Below the Fold"?

Another common way to evaluate the visibility of an ad is to examine whether it is "above the fold" (i.e., near the top of the page and visible when the page loads), or "below the fold". This is a concept that is borrowed from the printed press and is a decent heuristic; unfortunately, it is not always accurate in the digital world. The site "Life below 600px" does a good job in explaining this. (Please visit the site, it is worth checking out :-)

To examine the effect of the "above the fold" visibility, we also measured the probability that an ad is visible when the site loads. (We decided not to use a hard metric such as "600px and below" as display sizes come in all sorts of variants).

Here is the median ad visibility, as a function of the probability of seeing an ad on load:


Here is the probability of seeing an ad for more than 10 seconds, as a function of the probability of seeing an ad on load:


There is definitely a positive correlation. But there is still a significant amount of remaining noise. As you can see, there are cases where the ad is visible on load ("above the fold") but people do not see the ad for long periods of time, and there are cases where the ad is not visible on load ("below the fold") but people still end up seeing it for long periods of time.





Example Sites

Given all the metrics and combinations, it would be good to examine a few sample sites to understand better what layouts and content generate the different combinations of time on site, ad visibility, etc.
  • High time on page, high ad visibility, above the fold: Check the ZeroHedge site. This combination is the "expected" combination. Ads are visible when the page loads, users stay on the site for a long time (3-4 minutes median time on page), and they get exposed to the ads for long periods of time, with high probability. (The probability of ad visibility above 10 seconds is greater than 70%.)
  • High time on page, low ad visibility, above the fold: Check the "That Guy with the Glasses" site. (It is better to see a representative internal page). In this site, there is a banner ad on top, but the actual content of the site is the video. So users quickly scroll down to the video and never see the top banner ad.
  • High time on page, low ad visibility, below the fold: Consider the page with puzzles at USA Today. This is a page where users spend a significant amount of time. However, they rarely see the ad, as it is rendered below the game, and users simply do not scroll down there. (Median time on page 12 minutes, with median ad visibility being 0, and probability of seeing the ad for any period of time below 10%)
  • Low time on page, high ad visibility, below the fold: Check the site http://www.everydayhealth.com/. In this site, the main banner ad is rarely above the fold. However, the users seem to habitually scroll down to the options in the lower part of the page, so they get exposed to the ad for significant amounts of time. (The probability of ad visibility above 10 seconds is greater than 40%, while the median time on page is just 20 seconds.)



The Future of Ad Pricing

I would be very surprised if the pricing model for ads does not change to account for the visibility statistics. For display ads that get paid per impression, it is a no-brainer. If the user never sees the ad, there is no real impression, and the ad should not be paid for. But even for ads that get paid on a per-click basis, the visibility statistics are important. How can we compute the clickthrough rate reliably in the presence of ads that are not even seen? Since the clickthrough rate is a key metric of effectiveness for an ad, I would expect visibility statistics to become a standard part of its computation.
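As a back-of-the-envelope illustration of why this matters, here is one possible way to fold visibility into the clickthrough rate: divide the clicks by the impressions that were actually in view instead of by all the delivered impressions. This is only my sketch, not an industry-standard formula; the definition of "viewable" and all the numbers below are assumptions.

```python
def clickthrough_rates(impressions, viewable_impressions, clicks):
    """Return (raw CTR, CTR computed over viewable impressions only).

    Assumes every click comes from an impression that was actually in view.
    """
    if not 0 < viewable_impressions <= impressions:
        raise ValueError("viewable impressions must be in (0, impressions]")
    if not 0 <= clicks <= viewable_impressions:
        raise ValueError("clicks must be in [0, viewable impressions]")
    raw_ctr = clicks / impressions
    viewable_ctr = clicks / viewable_impressions
    return raw_ctr, viewable_ctr


# Hypothetical campaign: half of the delivered impressions were never really in view.
raw_ctr, viewable_ctr = clickthrough_rates(
    impressions=10_000, viewable_impressions=5_000, clicks=20
)
print(f"raw CTR:      {raw_ctr:.2%}")
print(f"viewable CTR: {viewable_ctr:.2%}")
```

With half of the impressions never in view, the raw number understates how well the ad performs when people actually get to see it by a factor of two.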

The question is how fast this change will come. Perhaps the moment advertisers realize that they should not be paying for ads that are never shown to the users.