A/B Test Case Study: Can Split Test Results Be Trusted?

In the spirit of the Winter Olympic Games that ended last week, I would like to talk about the “winner” of the final A/B experiment we ran for the Official Vancouver 2010 Olympic Store. We had already achieved significant improvements with the help of previous tests on the checkout process, product details page, and home page. So, shortly before Games time, in collaboration with Wider Funnel, we ran our fourth A/B experiment, this time on the product list page template, to take a last stab at conversion optimization.

Testing the Product List Page

The image below is the control version:

We looked at our previous tests, and after several hypotheses and investigations we produced two alternative variations with the following changes:

A) Introduced a vertical menu that shows all subcategories, for easier access to other products
B) Provided color thumbnails for products that have alternative colors
C) With the help of a recommendation engine (cross-sell), showed the most popular products from that specific category to increase revenue from top-selling items
D) Gave the filtered navigation a minor face-lift to improve usability

Treatment A:

Treatment B:

What We Learned

During the experiment, all variations were distributed equally across 100% of traffic. This was another tough experiment: even 2,272 transactions over 10 days did not produce a statistically significant winner. But we gathered just enough visitor and ecommerce data for a decision to be made.

Variation B was chosen because, according to Google Website Optimizer (GWO), it was converting 7.74% better than the control.

What Surprised Us: Control vs. Control

Additionally, we wanted to run a little test on GWO itself. We created another variation that was an exact copy of the control. Maybe it was a “statistical coincidence,” but this “exact copy” variation performed 4.97% better! We didn’t do this for any other tests and thus can’t confirm whether there is a pattern to this behavior. So this is up for discussion: have you tried a similar A/A test and found similar results?
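
Whether a lift like that 4.97% is real or just noise can be checked with a simple two-proportion z-test. Below is a minimal sketch in Python (standard library only); the conversion counts are purely hypothetical, since the raw per-variation numbers are not published here.

    from math import sqrt, erf

    def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
        """Return the z statistic and two-sided p-value for a difference in conversion rates."""
        p_a, p_b = conv_a / n_a, conv_b / n_b
        p_pool = (conv_a + conv_b) / (n_a + n_b)                # pooled rate under the null hypothesis
        se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))  # standard error of the difference
        z = (p_b - p_a) / se
        p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))   # two-sided normal tail probability
        return z, p_value

    # Hypothetical counts: 500 sessions per variation, conversion rates of 20% vs. 21%.
    z, p = two_proportion_z_test(conv_a=100, n_a=500, conv_b=105, n_b=500)
    print(f"z = {z:.2f}, p = {p:.2f}")  # a p-value this large is nowhere near 95% confidence

As the example shows, with only a few hundred sessions per variation, a relative lift of a few percent is well within the range of chance.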

This post is contributed by Janis Lanka (@janislanka), who manages front-end development for Elastic Path Software.



21 Responses to “A/B Test Case Study: Can Split Test Results Be Trusted?”

  1. GWO testing is good but not perfect…

    Depending on the amount of conversion increase or decrease, a test can really have no clear winner. The motivation of the customer is a huge part of any test.

    I have run the same test twice before, retesting to double-check results, and have had a different winner each time in A/B testing.

    The point is to rely on GWO to find big increases, but when the conversion increase or decrease is very small, realize that other outside factors could be swaying your results.

  2. @Audio Bible – good summary. I agree with the motivation factor, and in this case a once-every-four-years factor. Janis, this is a good case study, though, and something I will want to run on sites that have no seasonal or event motivation at the time, to see the results. I will update Get Elastic when I have results. Great post, as always.

  3. Brian says:

    Interesting results on the control vs. control. What were the conversion numbers on that 4.97% increase? Statistically significant?

  4. With A/B testing you assume that the underlying population coming to your site doesn’t change, but to me the Olympics is such a unique event that the type of visitors coming to the site could very well change from day to day, depending on news or events related to the Olympics. Your daily pageviews and visits naturally fluctuate every day, so I would expect the same to be true for A/A and A/B test results. There is lots of noise that you just can’t eliminate.

    Love your decision-making process though. You have a deadline and you take action using imperfect or ambiguous data.

  5. @Brian – the improvement for the copy of the Control was not statistically significant.

    This is a great example of why it’s best to make decisions based on statistical significance whenever possible. Clearly, the copy of the Control is unlikely to actually perform better than the Control.

    In the “real world,” though, deadlines and organizational constraints sometimes make statistical significance unachievable. Janis and the Elastic Path team were great at rolling with the data and making the best decision possible in a very short time period.

    @AudioBible – you’re right that significant traffic variations can produce a different winner. That should be a reason *for* testing with a tool like GWO, though, not *against* it! Understanding how differences in seasonality and traffic segments perform can only be achieved through controlled, statistically valid testing. Otherwise, your site changes are just guesswork.

  6. David Minor says:

    > but this alternative “exact” variation performed 4.97% better!

    It’s hard to say if this is surprising without knowing the confidence level.

  7. @Kevin – certainly looking forward to your findings.

    @Brian – to bring it down to numbers, there was a 3.39% increase in conversion rate and a 7.66% increase in AOV.

    @Michael – all four variations were served in parallel at the same time so that external factors like events wouldn’t affect the results. By the end of the test, each variation had been served around 500 times over the same period. There are other sources of noise I could talk about, but that’s probably another blog post.

  8. I’ve been doing “math” about this since I read the article. I believe the following approach would work to determine whether the variation you saw was an anomaly (you didn’t provide enough info for me to ‘do the math’ for you).

    1) If your data allows it, convert the ‘conversion percentage’ values into normal distributions. With 550 samples (for each variant), the conversion percentage would have to be above 1.6% for this step to be valid.
    2) Do a t-test to see if those distributions are within “statistical coincidence” at a given level of confidence (95%, 99%, etc.). I used an infinite-degrees-of-freedom assumption.

    The only question I’m still chasing is whether it is valid to do t-tests (with the infinite-degrees-of-freedom assumption) when using normal distributions to approximate binomial distributions. It passes the sniff test for me, but I don’t know for certain.

    @sehlhorst on Twitter

  9. @Janis – (I did not see your comment before posting mine) to be more specific, the data needed for the approach I suggest:
    1) the conversion % (absolute, not ‘improvement in’) and the number of samples for each of the test and control versions (of the exact same page).
    1b) confirmation that these binomial distributions (that’s what the absolute conversion data technically is) can be approximated by normal distributions – this can only be done if there are “enough samples,” which is a function of how close the absolute conversion percentage is to 0.

    Hope that helps – feel free to ping me offline if you want to go over it in detail.

    @sehlhorst
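
For readers who want to try the approach sketched in the two comments above, the “enough samples” condition in step 1b is usually checked with a rule of thumb on the expected counts. A minimal sketch, assuming an n × p ≥ 9 rule (which lines up with the ~1.6% figure quoted for 550 samples; other references use 5 or 10):

    def normal_approx_ok(n, p, threshold=9):
        """True when a Binomial(n, p) count is reasonably approximated by a normal distribution."""
        return n * p >= threshold and n * (1 - p) >= threshold

    print(normal_approx_ok(550, 0.016))  # False: just below the ~1.6% boundary for 550 samples
    print(normal_approx_ok(550, 0.017))  # True: above it, so the normal approximation (and a t-test) is reasonable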

  10. There’s a post at http://blog.asmartbear.com/easy-statistics-for-adwords-ab-testing-and-hamsters.html about A/B testing hamsters that talks about quickly determining statistical significance, using a simplification of Pearson’s chi-square test.

    Intuition tells me that the difference in your A/A test was not significant, but you can’t trust intuition when statistics are involved. It would be interesting to know the number of conversions from each to see the results of the χ² test.

  11. @carey – great link, and the math at the end makes sense. However, it is only applicable to a test of “which do you like better, given two choices?”

    I’m not convinced that it applies here – where the assumption is that all the sessions that didn’t convert are considered “mistrials.” Would love to know what others think!

    @sehlhorst
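
For completeness, here is a minimal sketch of the 2×2 chi-square test referenced in the linked post, again on hypothetical counts (converted vs. not converted for the control and its copy). For a 2×2 table this statistic is the square of the pooled two-proportion z statistic, so it answers the same question as the z-test sketch earlier in the post.

    from math import erfc, sqrt

    def chi_square_2x2(conv_a, n_a, conv_b, n_b):
        """Pearson chi-square statistic and p-value (1 degree of freedom, no continuity correction)."""
        observed = [[conv_a, n_a - conv_a],
                    [conv_b, n_b - conv_b]]
        row_totals = [sum(row) for row in observed]
        col_totals = [sum(col) for col in zip(*observed)]
        total = n_a + n_b
        chi2 = 0.0
        for i in range(2):
            for j in range(2):
                expected = row_totals[i] * col_totals[j] / total
                chi2 += (observed[i][j] - expected) ** 2 / expected
        return chi2, erfc(sqrt(chi2 / 2))  # upper-tail probability for chi-square with 1 df

    print(chi_square_2x2(conv_a=100, n_a=500, conv_b=105, n_b=500))  # hypothetical counts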

  12. John Town says:

    It really depends on what you’re trying to split test – you’re going to need enough data there to analyze and make a good decision. Being a statistician is very helpful in this case, or using a high-end program to do this.

    Split testing is not something you just eyeball.

  13. Rex Dixon says:

    Please share these results on our A/B Tests site.

  14. Your 4.97% variance is a good illustration of why there needs to be a threshold when choosing a winner. We typically recommend at least a 5% difference for this very reason. “Margin for error” doesn’t always mean you made a mistake; it can simply mean that you’re allowing for the differences in your audience.

  15. @Will – you have a very good point, and the 5% mark is good practice. And because of the “margin for error,” we are looking at additional data (conversion rate, AOV, etc.) to make the most educated decisions.

  16. Let’s say you had 1000 impressions, and 200 conversions – meaning a 20% conversion rate. But how certain can you be of this conversion rate given how many impressions it is based on? Fortunately, math can help – what you need to use is something called the “beta distribution”. The beta distribution takes 2 parameters, A and B. In your case, A is 200+1 or 201, and B is 1000-200+1 or 801. Now we can determine the accuracy of our 20% conversion rate estimate. The variance is A*B/((A+B)^2 * (A+B+1)), which in this case is 201*801/((201+801)^2*(201+801+1)) which is 0.000159879285.

    The standard deviation is the square root of the variance, which in this example case is 0.0126443381 or 1.3%. If we double the standard deviation, we get our 95% confidence interval, which is from 20%-(1.3%*2) to 20%+(1.3%*2).

    It means that based on the fact that we had 200 conversions with 1000 impressions, there is a 95% chance that the real conversion rate is between 17.4% and 22.6%.

    Using this approach, you can take the guesswork out of deciding how much data is enough to make a decision. This type of thing is the bread and butter of our company, SenseArray.
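
The arithmetic in the comment above is easy to reproduce. A minimal sketch using the same 200-conversions-out-of-1,000-impressions example:

    from math import sqrt

    conversions, impressions = 200, 1000
    a = conversions + 1                    # alpha parameter of the beta distribution (201)
    b = impressions - conversions + 1      # beta parameter (801)

    mean = a / (a + b)                                  # ~20.1%, close to the observed 20% rate
    variance = a * b / ((a + b) ** 2 * (a + b + 1))     # ~0.00015988, as quoted above
    std_dev = sqrt(variance)                            # ~0.0126, i.e. about 1.3%

    low, high = mean - 2 * std_dev, mean + 2 * std_dev  # rough 95% interval: mean plus or minus 2 standard deviations
    print(f"conversion rate ~ {mean:.1%}, 95% interval ~ [{low:.1%}, {high:.1%}]")
    # prints roughly 17.5% to 22.6%, in line with the 17.4%-22.6% range quoted in the comment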

  17. How did you calculate the increase in average order value?

  18. Evan Miller published a piece entitled “How Not to Run an A/B Test.” It might be worth a quick read. He demonstrates that the size of your sample can skew the results if you don’t freeze it in advance at a set number. http://www.evanmiller.org/how-not-to-run-an-ab-test.html
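
One practical takeaway from that article is to fix the sample size before the test starts rather than stopping as soon as a variation happens to look like a winner. A minimal sketch of the standard two-proportion sample-size formula; the 3% baseline conversion rate and 10% relative lift below are illustrative assumptions, not figures from this test:

    from math import sqrt, ceil

    def sample_size_per_variation(p_base, relative_lift, z_alpha=1.96, z_power=0.84):
        """Visitors needed per variation at 95% confidence and 80% power (z values hardcoded)."""
        p_alt = p_base * (1 + relative_lift)       # conversion rate if the lift is real
        p_bar = (p_base + p_alt) / 2
        numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                     + z_power * sqrt(p_base * (1 - p_base) + p_alt * (1 - p_alt))) ** 2
        return ceil(numerator / (p_base - p_alt) ** 2)

    # Detecting a 10% relative lift on a 3% baseline takes tens of thousands of visitors per variation.
    print(sample_size_per_variation(p_base=0.03, relative_lift=0.10))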

