
Ask GH

I often have doubts about why exactly an A/B test failed. Was it because the idea itself was bad? Or because it was executed poorly?

As an example, I ran a test where I added product testimonials to the trial sign-up page. It wasn't ugly, but it looked a little bit out of place. The test failed and I can't decide why: was it the bad design, or do testimonials simply not work?

  • AY

    Alex Yumas

    almost 7 years ago #

    The question is too broad to answer. Usually, if a test fails to show any significant results, it simply means THE HYPOTHESIS WAS WRONG.

    • PL

      Peep Laja

      almost 7 years ago #

      Well, not so fast.

      Let's say that the hypothesis is that by boosting the perception of security we will reduce anxiety and thus increase purchases.

      How many different ways are there to implement this hypothesis? Like 46.

      So if one fails, it doesn't mean that the hypothesis was necessarily wrong. Maybe shitty implementation.

      • LJ

        Lance Jones

        almost 7 years ago #

        Agreed -- a prerequisite for a good test is aligning the creative with the hypothesis. A lot of people get this wrong, too.

      • BL

        Brian Lang

        almost 7 years ago #

        Rephrasing the original question: "At what point do you move on and test an entirely new hypothesis?"
        If your answer is "once you've iterated and found a winner for that hypothesis", you've exposed yourself to the multiple comparison problem: unless it is controlled for, you will eventually ALWAYS find a "significant" winner, even if there truly isn't one.
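        To make this concrete, here's a rough simulation (a sketch with made-up numbers, not data from this thread): every "iteration" below has exactly the same true conversion rate as the control, yet the more variants you try against it, the more often at least one looks "significant" at the usual 5% level.

        ```python
        # Hypothetical simulation of the multiple comparison problem: all variants
        # share the control's true conversion rate, so any "winner" is a false positive.
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(0)
        true_rate = 0.10      # control and every variant truly convert at 10%
        visitors = 2000       # visitors per arm in each test
        alpha = 0.05

        def two_proportion_p(conv_a, n_a, conv_b, n_b):
            """Two-sided p-value of a two-proportion z-test."""
            p_pool = (conv_a + conv_b) / (n_a + n_b)
            se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
            z = (conv_b / n_b - conv_a / n_a) / se
            return 2 * stats.norm.sf(abs(z))

        for n_iterations in (1, 3, 5, 10):
            false_wins = 0
            for _ in range(2000):                     # 2000 simulated test programmes
                control = rng.binomial(visitors, true_rate)
                p_values = [two_proportion_p(control, visitors,
                                             rng.binomial(visitors, true_rate), visitors)
                            for _ in range(n_iterations)]
                false_wins += any(p < alpha for p in p_values)
            print(f"{n_iterations:>2} iterations -> a 'winner' appears "
                  f"{false_wins / 2000:.0%} of the time (true difference: none)")
        ```

        With independent looks, the chance of at least one false "winner" is roughly 1 - 0.95^k, so it passes 40% by around ten iterations on the same hypothesis.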

    • LJ

      Lance Jones

      almost 7 years ago #

      This! +1

      However, this outcome only applies if you create a valid hypothesis. And I believe creating a valid hypothesis is a weak spot for most marketers (I work with dozens of them every day at Adobe in my role as Optimization Director).

  • SE

    Sean Ellis

    almost 7 years ago #

    I think the question should almost start with "is it important to understand why an A/B test failed?" In my experience, I don't spend a lot of time thinking about why it failed. Maybe I should... I'm going to post that as a separate AskGH.

    I heard an interesting perspective from an optimization specialist at Microsoft this week. She said they don't think in terms of A/B testing successes and failures; they think in terms of gains and saves. Essentially, a change that doesn't improve results is something that could very easily have been implemented without measuring the impact. So the fact that you measured it and it hurt results means you were able to "save" the loss that would have resulted if you had just implemented it without testing.

    • JG

      Jim Gray

      almost 7 years ago #

      Yeah. This is really the approach I prefer to take in pitching it, over the standard article promising 300% jumps in profits overnight.

      You're going to make design revisions anyway. A/B testing gives you a mechanism for determining "is this, at worst, going to have a neutral effect on our conversions." If you get no result or a positive result, feel free to push the update live. If you get a negative result, you should hold back and investigate further before proceeding with that route.

    • MA

      Max Al Farakh

      almost 7 years ago #

      The point is that I don't want to miss a good idea because of some sort of execution issue. I could run the test again and again in slightly different variations, but how do I know when to stop? I've been struggling with this for quite some time.

      The "saves" and "gains" idea is great. I used "fail" for lack of a better word.

      • JP

        Jeff Pickhardt

        almost 7 years ago #

        If you share a screenshot or two, the community here could have a look.

        First, by saying it failed, do you mean there was no change in sign ups, or it actually had a negative impact? You say it looked out of place, but I think it would have to look really bad for that to be the reason it failed. That's why I'm interested in seeing a screenshot and hearing whether it was a "no change" or a "negative change" result.

        If it's quick to make the design fix, and re-running it won't slow down deploying other experiments, it sounds like you should re-run it with the fixed design. After all, you're asking about it online and seem to believe it would have produced an uplift if the design had been better, so you seem motivated to give it another shot.

        • MA

          Max Al Farakh

          almost 7 years ago #

          By "failed" I mean "failed to make an improvement". Anyway, it's not just about this particular test; I have doubts about every other test I do. I could show the screenshots, but that wouldn't solve the issue.

          That's the most important question – should I try to run a test one more time after it has already failed once? It takes at least a week to get meaningful results and I can't run other tests on the same page during that time. I hope you see why I'm struggling with this.

          • TD

            Tiffany Dasilva

            almost 7 years ago #

            It depends on how badly it failed. Are the tests only slightly off, like a 2-5% difference, or are they huge fails, like 50% or above?
            I treat the small differences as inconclusive, so I try a new version of the test (based on what I said below). If it's a huge difference after 2 weeks and I need some more evidence, I would probably go another week just to see what happens. If it's still not going the way I thought it would, I ask myself why and create a new test based on that (see below again for what I mean).

            • SE

              Sean Ellis

              almost 7 years ago #

              Great point!

            • MA

              Max Al Farakh

              almost 7 years ago #

              That's a great point indeed. In fact, we've been running different tests for two or three years now and we've never seen a difference bigger than 10%. Most of the alternatives are almost equal. I wonder, does this happen to everyone or just us? It makes me really paranoid sometimes.

  • SC

    Shana Carp

    almost 7 years ago #

    This gets at the core of experiment design, which is one of the most difficult parts of what an A/B test is.

    It all depends on how your hypothesis was written.

    Basically, your hypothesis needs to be testable and falsifiable - though falsifiability gets into some weird questions about "what is true" and "what is knowledge".

    Tightening up your procedures and knowing how you test will only help.

    • GB

      George Bullock

      almost 7 years ago #

      This is the most useful comment in the thread. Technologies like VWO and Apptimize have basically made A/B testing accessible to everyone. Someone with little background in experimental design and statistical analysis can have a test up and running in minutes. The drawback is that they may not really know what they're doing or what's going on behind the scenes. People pay for access to the tech, read a couple of Ultimate Guides to A/B Testing (probably released by the A/B testing software vendors to make it seem really easy), and think they're ready to rock.

      My sense is that A/B testing tools are having a similar effect on marketers that calculators have arguably had on people's ability to do mental arithmetic. A lot of people don't even think about doing the math anymore; they just whip out their smartphone's calculator and voila, you've calculated 15% of your dinner bill for the tip. That's what VWO really is: a high-powered, expensive calculator. 10 years ago you basically had to have a BA in stats to run A/B tests professionally. 20 years ago people were probably doing this stuff by hand if they didn't have Lotus or Excel.

      Fast forward to today and I suspect that in many cases, the people responsible for running A/B tests couldn't run them without the tech. Say you gave them a problem of their own in the form of a case study, complete with a data set in an Excel spreadsheet, and asked them to do basic things like identify the statistical procedure appropriate for their hypothesis, state the null and alternative hypotheses, and calculate the summary stats, test statistic, and p-values - I bet most of them couldn't do it.

      I'm not saying A/B testing tech is not great - it is. I don't think people need an MA in stats to run A/B tests either. I just think we would all be doing ourselves and our businesses a big favor if we took some time to crack open our old college stats books, went over the relevant material (experimental design, test selection, etc.), figured out how to make the numbers work on real data in Excel first, and then graduated to VWO with a much better understanding of what is going on behind the scenes. I broke down and did this over the course of 3 weeks, and the dividends have been huge in terms of not spinning my wheels on low-quality tests and of interpreting test results.
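      For what it's worth, the "work the numbers by hand first" exercise might look something like the sketch below: a two-proportion z-test on made-up counts (the numbers are invented, and this is just one reasonable procedure for conversion-rate data), stating the hypotheses and computing the summary stats, test statistic, and p-value.

      ```python
      # A by-hand style calculation on hypothetical counts (not real data).
      # H0: the variant's conversion rate equals the control's.
      # H1: the two conversion rates differ (two-sided test).
      from math import sqrt
      from scipy.stats import norm

      control_visitors, control_conversions = 5400, 486   # made-up export numbers
      variant_visitors, variant_conversions = 5350, 521

      # Summary statistics
      p_control = control_conversions / control_visitors
      p_variant = variant_conversions / variant_visitors

      # Pooled proportion and standard error under H0
      p_pool = (control_conversions + variant_conversions) / (control_visitors + variant_visitors)
      se = sqrt(p_pool * (1 - p_pool) * (1 / control_visitors + 1 / variant_visitors))

      # Test statistic and two-sided p-value
      z = (p_variant - p_control) / se
      p_value = 2 * norm.sf(abs(z))

      print(f"control {p_control:.2%} vs variant {p_variant:.2%}: z = {z:.2f}, p = {p_value:.3f}")
      ```

      With these particular made-up numbers the p-value comes out around 0.19 - the kind of "looks like a lift, but isn't significant" result a tool would also report. The point is knowing where that number comes from.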

      • BL

        Brian Lang

        almost 7 years ago #

        Going along with your example, I use a shiny calculator to calculate tips not because I couldn't calculate it on my own by hand, but because I have confidence in the calculator to provide me the right output based on my inputs, and in the process, that saves me some time.

        The problems with most A/B testing tool sets are: 1) they misrepresent the meaning of the p-value with something like "(1 - p-value)% chance to beat baseline", and 2) most tools don't bring up or address a) repeated significance testing, b) the need to calculate sample size up front based on statistical significance, power, and effect size, and c) correcting for the multiple comparison problem. Lately there has at least been more discussion of b), but the industry and most practitioners still have a long way to go in understanding these issues and the implications of not addressing them.
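        To make b) concrete, here's a rough up-front sample-size sketch using the standard normal-approximation formula for two proportions. The baseline rate and minimum detectable effect below are made-up inputs, not numbers from this thread, and different tools use slightly different formulas.

        ```python
        # Up-front sample size for a two-proportion test (hypothetical inputs).
        from math import ceil, sqrt
        from scipy.stats import norm

        baseline = 0.10      # assumed control conversion rate
        mde = 0.02           # smallest absolute lift worth detecting (10% -> 12%)
        alpha = 0.05         # two-sided significance level
        power = 0.80         # 1 - beta (chance of detecting the lift if it exists)

        p1, p2 = baseline, baseline + mde
        z_alpha = norm.ppf(1 - alpha / 2)
        z_beta = norm.ppf(power)

        n_per_arm = ((z_alpha + z_beta) ** 2 *
                     (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2)

        print(f"Need roughly {ceil(n_per_arm)} visitors per variation, decided before the test starts.")
        ```

        The exact number matters less than the habit: fix the sample size before launching and don't stop the test early just because the dashboard turns green.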

      • SC

        Shana Carp

        almost 7 years ago #

        As a follow-up:

        1) It probably shouldn't surprise you if I told you I calculate tips by hand, except before I've had coffee.

        2) I probably could do all the steps needed for the calculations for an A/B test with a textbook in front of me (albeit not in Excel, mostly because I switched to python/pandas for speed reasons, plus then I can stop repeating myself). I don't remember them off the top of my head, though I do remember they have to do with variances and the probability of falling within certain standard deviations of a Gaussian distribution... and that if a different distribution is involved, a standard A/B test is probably not the right move anyway.

        3) I don't have an MA in stats. I don't have an MBA. My actual degree would shock you (bonus points if you ask), which goes to show curiosity takes you far.

        4) The reason for my first answer is that I have no idea what the background behind the original question was. I don't know enough about what he did in the first place, or how things are done there, to begin with.

  • TD

    Tiffany Dasilva

    almost 7 years ago #

    I think you're halfway there. Your test didn't go as you thought, so you've outlined two possible reasons:
    - It looked out of place
    - It "doesn't work"

    Why not try a version where it is in place, or another form of social proof like a logo or press mention?

    I had a similar experience that I wrote about here:
    http://t.co/CZb7JnwSPd
    (I'm only plugging this because I'm too lazy to write out the whole story.) TL;DR version: I found myself testing testimonials and realizing they didn't work... but I was only halfway there. When I tried different tests afterwards, I came to some conclusions:
    - Press worked much better than testimonials
    - Our testimonials didn't seem to fit (maybe they aren't believable? Maybe they aren't trustworthy?)

    So... here's my advice:

    1) Take what you've learned from the experience and the questions you now have based on that test.
    2) Test one of the reasons you think it went badly (there's your hypothesis).
    3) See what happens - rinse and repeat.

    • MA

      Max Al Farakh

      almost 7 years ago #

      Thanks! It is hard to find the right balance between re-testing the same hypothesis and moving on to the next ones.

  • OY

    Omri Yacubovich

    almost 7 years ago #

    A prior question I'd ask is: how did you decide that the A/B test failed? Keep in mind that you need a significant amount of traffic; otherwise you'd be surprised to find that even if you run the exact same A/B test twice, you may see different results (a small simulation at the end of this comment illustrates this).

    Here's a free tool to calculate significance: http://www.usereffect.com/split-test-calculator

    Another thing you can do to better understand the reasons for failure or success is to look at the user flow and whether anyone even bothered reading your additional component. You can easily do that with crazyegg.com or clicktale.com - you may find out that the test failed because no one looked at the new component (or, for example, because it hurt the page loading speed).

    To get better and improve, I highly recommend going step by step and testing improvements against smaller KPIs when possible.
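    On the "same test twice, different results" point above, here's a tiny sketch with made-up traffic numbers: both variations draw from an identical true conversion rate, yet at low traffic the observed "lift" swings wildly between two identical runs.

    ```python
    # Two identical A/A-style runs at different (hypothetical) traffic levels.
    import numpy as np

    rng = np.random.default_rng(7)
    true_rate = 0.08                      # both variations truly convert at 8%

    for visitors_per_arm in (300, 3000, 30000):
        lifts = []
        for run in range(2):              # run the "same" test twice
            a = rng.binomial(visitors_per_arm, true_rate) / visitors_per_arm
            b = rng.binomial(visitors_per_arm, true_rate) / visitors_per_arm
            lifts.append((b - a) / a * 100)
        print(f"{visitors_per_arm:>6} visitors/arm: run 1 lift {lifts[0]:+.1f}%, run 2 lift {lifts[1]:+.1f}%")
    ```

    The true difference is zero in every case; only the sample size determines how violently the measured lift bounces around.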

  • EA

    Ethar Alali

    over 6 years ago #

    George Bullock (following on from Shana Carp) is absolutely bang on! I seriously thought I was the only person banging on about how most A/B tests are done incorrectly.

    A/B tests are a form of multivariate analysis. They are statistical in nature and, as part of good experimental design, fall under the same "rules" as research methods.

    When analysing the A/B test against a control, you don't start by asking "Did the A/B test succeed in improving X, Y, Z?" What you actually need to ask is "What is the probability that the improvement of the A result over B would have occurred for some reason other than the hypothesis?" This is really what you want to be asking, as it forms what statistical experimentalists call the "null hypothesis".

    By combining this with the chi-squared test and, by default, picking a value more than 2 standard deviations on either side of the mean (i.e. outside the 95th percentile - note, this is what Shana was getting at), you automatically take care of your type 1 and 2 errors (false positives and negatives), since these sorts of outliers normally do exist due to experimental error.

    If the probability that the A result occurred "by chance" - i.e. due to experimental error, a fluke, or anything else (since you *can't* care exactly what; it's not like you have the option) - is found to be significant at the 95th percentile (95.45% to be more exact :), this means you cannot discount the probability that the A result would have happened anyway. I.e. you *have* to accept the null hypothesis. It's only when the small p-value (which is this probability that the null hypothesis is true) doesn't follow a chi-squared distribution that you can call an A hypothesis/result significant (i.e. you can accept the alternative). Otherwise you're chasing red herrings and blind alleys. (A minimal worked example of this chi-squared check is sketched at the end of this comment.)

    So, the question is, how many tools out there do this out of the box? My guess is nearly none. It's up to the experimenter to design the experiment in such a way as to account for the above, then use the tools to help test thoroughly. This involves rephrasing the hypotheses appropriately for testing. Falsifiability is naturally accounted for, because what you are actually testing is the probability that the complement of the statement is true (the complement being the null hypothesis). Hence, any analysis so far that showed results from marketing and PR activities without following this approach cannot be concluded to have produced results separate from chance. The change may simply mean that more people who like blue buttons came along, or that the time of year or seasonality accounts for it. It also means you can run the same experiment at two different times and get two different results.

    Approaching it another way: correlation doesn't equal causation. What you're doing when A/B testing is showing that the correlation you see can't be accounted for by chance.

    Do I think the tools will lead us into an era of folk who can't design effective experiments? Yes. It will certainly create a two-tier system, with the vast majority of folk who come in being unable to design effective experiments. We've seen it with calculators, and we've also seen it with managed programming frameworks such as .NET and Java. It's a "skills window", a trade-off of skill for productivity. They aren't mutually exclusive, but our natural reliance on tools means they may as well be. However, at least it's bringing some level of experimentation into arenas which haven't valued it before. You can then start to build on that over the next 5 to 10 years or more.

    So +5 to George, +1 to Shana and back to class for the rest of you ;)
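    For anyone who wants to see the chi-squared mechanics above on concrete numbers, here is a minimal sketch on a made-up 2x2 table of conversions versus non-conversions (the counts are invented; scipy does the arithmetic, but the null-hypothesis logic is exactly as described in the comment).

    ```python
    # Minimal chi-squared test of independence on a hypothetical 2x2 A/B table.
    # H0 (null): conversion is independent of which variation the visitor saw.
    from scipy.stats import chi2_contingency

    #                converted   did not convert
    observed = [[420, 4580],     # variation A (control) - made-up counts
                [495, 4505]]     # variation B

    # scipy applies Yates' continuity correction for 2x2 tables by default
    chi2, p_value, dof, expected = chi2_contingency(observed)

    print(f"chi-squared = {chi2:.2f}, p = {p_value:.4f}")
    if p_value < 0.05:
        print("Reject the null: chance alone is an unlikely explanation for the difference.")
    else:
        print("Fail to reject the null: chance alone could explain the difference.")
    ```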
