A/B and Multivariate Test Validity: Beware of Bad Data!

According to the 2011 MarketingSherpa Landing Page Optimization Benchmark Report, 40% of the more than 2,000 marketers surveyed did not calculate the statistical significance of their A/B and multivariate test results in 2010. 40%! That’s a big chunk of marketers.

Clearly, validating your test results should be a key part of the conversion testing process, or you’re going to be acting on bad data (and losing cash).

But how can you tell when there might be problems with your numbers? Look out for these 4 types of validity threats:


Too small a sample size

To find a winner, test your layout and copy variations with enough test subjects to reach a high level of confidence in your results. But how many is enough? Several factors impact the sample size you’ll need including:

  • The current conversion rate of the page you’re testing (note: not the same as the conversion rate of your entire site)
  • The average number of daily visits to the test page
  • The number of versions you’re testing
  • The percentage of visitors in the experiment (sometimes you want to test with just a segment of your traffic)
  • The percentage improvement you expect over the control
  • How confident you need to be in the results (usually 95% but could be higher if the risks of being wrong are high)

To estimate how long you need to run your test for your results to be statistically significant at the 95% confidence level (i.e. a 5% chance you’ll think the variations are performing differently when really they aren’t), look to the Google Website Optimizer calculator. Amadesa also has an A/B experiment duration calculator that’s a little more flexible: it lets you choose the level of confidence you want to achieve. By playing with these calculators, you’ll find that if your site gets limited traffic, you won’t be able to run as many versions or segment your test traffic as much as a higher-volume site can.

Multiplying the potential duration of your experiment by your average daily visitors gives you an indication of your sample size (or you can use a complicated formula). It’s helpful to have a sample size in mind before you start testing because many testing tools can be a little misleading. They can turn “green” or “red” after only a few visits, falsely indicating a high level of confidence that you have a winner or loser, and then quickly revert back to “yellow” or inconclusive results. If you heed the first “green” bar, you will stop your test too early. By waiting until you’ve tested with your full pre-determined sample size, you stand a better chance of finding the real superior performer. But don’t worry, peeking during a test is ok, and necessary as we’ll see below.
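If you’d rather estimate the numbers yourself than rely on a calculator, the standard two-proportion sample-size formula that tools like these are built on can be sketched in a few lines of Python. Everything here is a hypothetical illustration (the function name, the 3% base conversion rate, the 20% expected lift, and the 1,000 daily visits are assumptions for the example), not the exact formula any particular calculator uses:

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_version(base_rate, expected_lift, confidence=0.95, power=0.80):
    """Estimate visitors needed per version to detect a relative lift in
    conversion rate, using the standard two-proportion sample-size formula."""
    p1 = base_rate
    p2 = base_rate * (1 + expected_lift)          # rate you hope the variation hits
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided critical z
    z_beta = NormalDist().inv_cdf(power)                      # z for desired power
    pooled = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * pooled * (1 - pooled))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

# Example: a page converting at 3%, hoping to detect a 20% relative lift,
# split across two versions with 1,000 test-page visits per day.
n = sample_size_per_version(0.03, 0.20)
days = ceil(2 * n / 1000)
```

Note how quickly the required sample grows as the base rate or the expected lift shrinks, which is exactly why low-traffic sites can’t support as many versions or as much traffic segmentation.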

Google Website Optimizer results showing statistical significance of 98.5%

An external event that changes visitor behavior

Events outside of an experiment, often called “history” threats, can affect response rates. Often these are holidays, major industry or company events, or news stories that significantly but temporarily affect the attitudes and behaviors of visitors, as well as the amount of traffic. So much so that you can’t tell whether response differences are due to your page changes or to the external event.

This is why we never recommend sequential testing, like trying one page version against your control in the first half of the month and another version in the second half. An external event that happens only during the second half of the month can alter your results. But even A/B split testing is susceptible to external influences. While an external event impacts all test versions equally, your overall results might have been different had you started the test earlier or later.

To minimize the risk of “history” impacts, here are a few tips:

  • Regularly analyze your data for consistency during the test, especially if it’s particularly long running.
  • Don’t run tests that extend into holidays (or across periods that differ significantly for your industry) unless holiday behavior is what you want to study in your test.
  • During the test, look out for industry or news stories that may temporarily affect purchase behavior or traffic.
  • Test over a longer duration, or repeat a test (to a point), until you are confident in your data.

A change in your technical environment or measurement tools

“Instrument change”, where something happens to the technical environment or the measurement tools used during a test, can invalidate the results of your A/B or multivariate experiment. This could be things like:

  • Inconsistent placement of test control code (e.g. in the body on some pages but in the head on others)
  • A code deploy happening during a test that disables or alters your control code
  • Performance issues stemming from web server or network problems
  • Testing software or reporting tool malfunction
  • Response time slowdowns due to heavy page weights or page code, or server overload

While your test is running, if you spot sudden changes in performance or in the distribution of traffic between variations, take a look under the covers to see if your technical environment or testing toolkit has changed in any way.

And to reduce the risk of “instrument change”, follow these guidelines:

  • Make sure your control code is placed correctly and consistently across all your versions.
  • Check your versions across browsers before launch to make sure there are no compatibility issues.
  • Be careful when deploying code while a test is running not to alter or delete your test control code.
  • Monitor for odd data during the test. If you have multiple sources of the same data, cross-check your numbers every once in a while to make sure there are no major differences.

A change in incoming traffic sources or traffic mix

When different types of visitors are not distributed equally between page versions, the test outcome can be affected. This is called “selection bias”. It can creep in if your incoming traffic sources, or mix of traffic, change dramatically during the test (due to a big email send or other channel-specific marketing activity), or if the profile of your testers doesn’t match the profile of your actual customers.

While your test is running, monitor your control to make sure it’s not deviating significantly from past performance. As with “instrument change” threats, look out for sudden changes in the performance of one page version over another, or in the distribution of traffic amongst your variations.

Here are a few ways to minimize “selection bias”:

  • Use traffic sources that most closely match the target audience for the page being tested.
  • Make sure visitors are being randomly distributed between your test versions. It should be impossible for you to predict which version a given visitor will see, and visitors shouldn’t be able to self-select the version they see either.
  • Compare the performance of your control with its recent historical performance for consistency.
  • Gather enough analytics data to allow deeper analysis post-test. For example, to compare weekend vs. weekday results, or new vs. returning visitors.
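One common way to meet the random-assignment requirement is deterministic hash-based bucketing, which is roughly how many testing tools split traffic under the hood. This is a simplified sketch, not any particular tool’s implementation (the function name, experiment ID, and version labels are made up for illustration):

```python
import hashlib

def assign_version(visitor_id, versions=("control", "variation"),
                   experiment="homepage-test-01"):
    """Bucket a visitor by hashing their ID: the same visitor always
    gets the same version, yet assignments spread evenly overall."""
    digest = hashlib.md5(f"{experiment}:{visitor_id}".encode()).hexdigest()
    return versions[int(digest, 16) % len(versions)]
```

Because the bucket depends only on the visitor ID and the experiment name, a returning visitor sees a consistent version, and neither you nor the visitor can predict or choose the assignment in advance.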

And finally, always take time to carefully analyze your data after each test has run its course. Sometimes you’ll uncover interesting insights, such as one variation working better for one audience while another worked better for a different group. Running follow-up tests can confirm these findings.



9 Responses to “A/B and Multivariate Test Validity: Beware of Bad Data!”

  1. Hi Amanda,

    While these are great insights, it makes sense to understand why a significant sample size is required. Basically, the confidence level is an indicator that the results shown in the sample can be extrapolated to the population, i.e. a conversion rate that increased by 15% in an A/B test will also result in a 15% increase when the winner goes live.

    Now the problem is that a lot of people think about confidence levels linearly, e.g. that a result with a 90% confidence level indicating an observed improvement of 10% is a significantly worse result than a 5% observed improvement with a 95% confidence level.

    The fundamentals of hypothesis testing mean that if results fall short of significance (a 95% confidence level or more), there is not enough evidence for the suggested (10%) improvement (mind you, it doesn’t mean a smaller (5%) improvement at a low confidence level). I think not understanding such things while claiming conversion improvements is hugely detrimental to the industry’s overall CRO efforts!

    Thanks so much for bringing this to everyone’s notice !

    • Amanda says:

      You’re most welcome – agree that there is a lot of misinformation out there about what ‘success’ looks like, and that results will automatically apply to sales increases. Running longer tests and repeating or reiterating are definitely recommended!

  2. Another tip to minimize the risk of “history” is to run the test for one or more complete weeks: a test started at 10am on a Monday should stop at 10am on a Monday.
    Visitor behavior differs greatly between day and night, and between weekdays and weekends.

    • Amanda says:

      Great suggestion. What do you think about running tests over a full 30 day period (at least when starting a testing program) to see if you can observe different behavior related to paydays/weekends etc.?

  3. I’m almost surprised about this post because it’s actually useful! Far too many A/B testing blog entries have little more than “A/B testing is kewl” and leave it at that, so thank you for good insights.

    A couple of comments.

    First, you list several ways in which a test can go “green” or “red” prematurely. There’s also another way: random chance. We’ve demonstrated this at work by writing some sample code which simulates a million visitors for both a base and a variant, each converting at 6% and completely randomly distributed. We repeatedly found “statistically significant” differences between the base and variant, even up to the million visitor mark (where the chance of a false result dropped significantly). So understanding your data instead of blindly following it is very important.
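    A minimal sketch of such an A/A simulation (with made-up parameters, not the commenter’s actual code) might look like this in Python:

```python
import random

def peeking_false_positive(n_visitors=20000, rate=0.06, check_every=500, z_crit=1.96):
    """Simulate an A/A test (both versions convert at the same rate) and
    report whether repeated peeking ever declares a 'significant' winner."""
    conv_a = conv_b = 0
    for i in range(1, n_visitors + 1):
        conv_a += random.random() < rate
        conv_b += random.random() < rate
        if i % check_every == 0:                      # "peek" at the results
            p_a, p_b = conv_a / i, conv_b / i
            pooled = (conv_a + conv_b) / (2 * i)
            se = (2 * pooled * (1 - pooled) / i) ** 0.5
            if se and abs(p_a - p_b) / se > z_crit:
                return True                           # false 'significant' result
    return False

# Across many simulated A/A tests, count how often peeking flags a "winner"
# even though both versions are identical.
hits = sum(peeking_false_positive() for _ in range(100))
```

    Because each run peeks dozens of times, the chance of at least one false “significant” reading is far higher than the nominal 5% error rate suggests.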

    And in a response, you asked about running a test for a full 30 day period. I would say that you need to identify your business cycles and try to run a test at least over a full cycle. So if your business peaks every two weeks, run your test for at least two weeks. This can minimize (though not eliminate) the chance of having skewed data.

    Again, great post. Thanks!

    • Amanda says:

      Thanks! Definitely makes sense to run over a full business cycle. Perhaps when you first start testing you should test for a full 30 days to help identify or confirm what a business cycle looks like for your company. From there you might be able to shorten up your tests.

  4. Cristina Chetroi says:

    Segmenting is also crucial when testing (at least by media and visitor type). For those using GWO it means integrating it with GA – otherwise the results are aggregate. And we all know how misleading averages can be.

    One of the last split tests I ran showed how a page performing worst for one AdWords campaign was actually winning for another – an insight you’d never get from the GWO console alone.

    I also find it useful not to ignore how NEW visitors alone perform, as a returning visitor who sees a new layout might not convert because of the ‘change’ factor rather than the layout per se.

    • Amanda says:

      When doing your analysis, totally agree that you need to evaluate performance against your various segments or traffic sources and then, if you can, start customizing for each group. Nice point! Integrating Google Website Optimizer and Google Analytics can certainly help you gather better insight into customer behaviors.

  5. Paul L says:

    Hey Amanda,

    Great post. I’ll be passing this article around to some of the more junior members of our team :).

    I think it’s worth mentioning how important it is to consider all these factors, especially during the initial setup of the test. For example, if you know you are running a site with low conversion counts, then it’s not a good idea to run 5 different ad copies, because it will take forever to get a good sample size on the ads. It is better to run two, possibly three, and then once a winner is identified, do a second round. Another example is to watch out for weekends, especially holiday weekends, when setting the length of tests because, as you said, they can severely mess up the data. The worst case is to discover 3 weeks into the test that a big holiday is going to mess up the data you wanted!


© 2014 Get Elastic Ecommerce Blog. All rights reserved.