What do Google, the Huffington Post, my optometrist and a cantaloupe have in common? They are all devout practitioners of A/B testing, the art, no, the science of concurrently testing different options and allowing real people to choose which one they prefer. (Except for the melon; I threw that in to see if this post would rank higher with or without random references to fruit.)
Many companies use A/B split testing to test hypotheses and make informed decisions related to websites, mobile applications, emails, and beyond. The range of items which can be tested is vast, and can include things like validating which headline or call-to-action has a higher click-through rate, optimizing landing page layouts, or testing which shopping cart flow results in the lowest abandonment rate.
Most importantly, A/B testing is a very effective way to experiment and iterate continuously while collecting hard data to drive decisions (leading to less drama and less guesswork).
What is A/B Testing?
A/B testing, broadly defined, is the act of comparing two items to each other to determine which best suits your needs. It is very often used to describe the data-driven method of testing whether changes made to an app or webpage result in a “better” experience. (Better for whom, you may ask? Hold that thought, we’ll touch on it below.)
A/B testing, like its cousin multivariate testing, is an attempt to remove some of the guesswork, emotion, and gut-work associated with creating effective web pages and applications. By systematically dissecting what it is you want to test, you can arrive at a more definitive answer. “What’s the best way to format this landing page?” is no longer an unwinnable debate focused on “I think,” but rather a constructive discussion centered on “I know.”
The basic idea behind these tests is that you choose the page elements or variations you’d like to compare, establish the metric (or metrics) you will measure, and then compare the performance of the two variations. For example, let’s say you want to determine which version of a page is most effective at getting visitors to register. By modifying different elements on the page or the flow of pages, you can test which changes are most effective against a control (the original version of the page) and therefore result in a higher conversion rate.
As Paras Chopra, founder of Visual Website Optimizer, wrote in Smashing Magazine, every A/B test is unique; however, the elements most often tested include:
- Call to Action (Wording, Size, Color, Placement)
- Headline or Product Description
- Form (Length, Types of Fields, Layout)
- Page Layout & Style
- Pricing & Promotional Offers
- Amount of Text
By modifying these elements on different versions of the page and splitting web traffic to the different versions, you can gather data about which one is more effective. Does the green call to action get more clicks than the blue call to action? Which version of the same headline gets more clicks?
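To make the bookkeeping concrete, here is a minimal Python sketch of how clicks for each version might be tallied into a conversion rate. The event log, the variant names ("green" and "blue"), and the conversion_rates helper are all made up for illustration; a real site would pull these numbers from its analytics.

```python
from collections import Counter


def conversion_rates(events):
    """Return {variant: conversions / visitors} for (variant, converted) pairs."""
    visitors = Counter()
    conversions = Counter()
    for variant, converted in events:
        visitors[variant] += 1
        if converted:
            conversions[variant] += 1
    return {variant: conversions[variant] / visitors[variant] for variant in visitors}


# Hypothetical log: which call to action each visitor saw, and whether they clicked.
events = [
    ("green", True), ("green", False), ("green", True),
    ("blue", False), ("blue", False), ("blue", True),
]
rates = conversion_rates(events)
for variant, rate in sorted(rates.items()):
    print(f"{variant}: {rate:.1%}")
```

Six visitors is nowhere near enough to decide anything, of course; the point is only that each version gets its own count of visitors and its own count of conversions.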
It is important to note that while A/B testing can tell you what works better, it cannot tell you why it is more effective. The folks who can analyze the data to identify the “why” and consciously apply it are the ones who will find the most value in multivariate tests (and earn the big(ger) bucks).
Running A/B Split Tests
Running continuous tests and making small changes based on your findings over time is one of the easiest ways to take a “good” product and make it better. There are a few general rules and best practices to follow when constructing and running an A/B Test but to sum it up, the process typically goes something like this:
- 1. Develop two versions of a page
- 2. Randomly divide users into two groups
- 3. Show each group a different version
- 4. Track how those users perform
- 5. Evaluate the versions
- 6. Go with the winning version
- 7. Repeat as needed
Here’s a more in-depth look at the process and best practices for A/B tests:
1. Generate a Hypothesis
The first step is understanding what it is you want to test, and formulating a general hypothesis about what might occur.
2. Choose A Metric or Single Variable
Next, it’s time to test your hypothesis. Start by determining what metric you’ll use to evaluate success (or lack thereof). A possible metric could be the percentage of users who 1) click a particular link or 2) respond to a specific call to action. (For those of you tut-tutting at the idea of only testing a single thing at a time, yes, you are right, it is possible to do multivariate tests and perform sophisticated cohort analysis. But why would you? For simplicity’s sake, it is easier to design, test, tally, and evaluate when there are fewer variables, and given how easy it can be to run a single A/B test, there are very few cases where you will only have one shot and will need to try many dimensions at once. If there are multiple variables you’d like to test, I suggest running a series of tests.)
3. Identify your Control and Establish a Baseline
“B” needs to be compared to something, and that something is “A”. A is your control. A will be your “do nothing” option – if we leave the page as is, this is how we would expect our chosen metric(s) to perform.
There are a few different schools of thought about using Controls in A/B tests. Some people claim that it is ok to run a control at a different time than the different variations you are testing. Those people might say that if you have enough data, there will be so little variance that it will not matter. Let’s call these people “Wrong.” There are other people, let’s refer to them as “Right,” who say that A/B Tests must be run in parallel, not serially.
Tests need to run at the same time, with different versions of the page, for the data to be meaningful. The “Right” people point out that by running the tests at the same time, you negate variability due to seasonal fluctuations, holidays, traffic spikes, other promotions running on your site, PR activities, or any of a million other confounding factors. It’s really up to you how you want to run the tests; you can side with “Right” or “Wrong,” as long as you can take steps to re-normalize the data if needed. (But really, save yourself grief and do it the “Right” way from the start.)
4. Split Tests: Pick Who Sees What
A/B Tests are also called “Split Tests” because, in practice, you’ll divert some of the traffic that would normally visit the baseline or control page to view the variation you are testing.
A key consideration to make: how much traffic do you want to divert to the test page? 50% – will half of all users be subjected to the test? A quarter? More? Less? As long as the volume of traffic is high enough, and the split is effectively random (as opposed to directing only one customer segment to the new page, while all other segments use the control) and you gather enough data points, there is no hard-and-fast rule about where to draw the line.
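One common way to get an effectively random split (an assumption on my part, not something any of the companies below have confirmed they use) is to hash each user’s ID and use the result to decide which version they see. The sketch below is a minimal Python illustration; the assign_variant function, the experiment name, and the user IDs are hypothetical.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, test_fraction: float = 0.5) -> str:
    """Return "B" for roughly test_fraction of users and "A" (the control) otherwise."""
    # Hashing the experiment name together with the user ID gives each
    # experiment its own independent split of the same user base.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Treat the first 8 bytes as an integer and scale it into [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "B" if bucket < test_fraction else "A"


# Divert roughly a quarter of traffic to the variation.
for user_id in ["user-1", "user-2", "user-3"]:
    print(user_id, assign_variant(user_id, "cta-color", test_fraction=0.25))
```

Because the assignment depends only on the user ID and the experiment name, a returning visitor always lands in the same group, and no customer segment is singled out for the new page.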
It goes without saying that you need to have enough data points for the results of an A/B Test to be statistically significant. Too few data points mean the data is anecdotal, and while we all like a good story, anecdotes are not math. Please consult your favorite stats textbook or online reference for more information about testing for statistical significance. (I’m talking to you, p-values.)
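For the curious, here is a rough Python sketch of one standard textbook check, the two-proportion z-test, which asks how likely a gap this large between two conversion rates would be if the versions were really identical. It is only one of several reasonable tests, the function name is mine, and the visitor counts are invented.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the assumption that both versions convert equally well.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Invented counts: 100 of 1,000 control visitors convert versus 150 of 1,000 on the variation.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value (below 0.05 is the usual convention) suggests the difference is unlikely to be a fluke. The normal approximation behind this test assumes a reasonably large sample in each group, which is one more reason anecdotes are not math.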
5. Document the Results and Draw Conclusions
Finally, you’ve run your A/B Test and it’s time to evaluate the results. First, make sure your experiment was valid — do the controls look right? How did your modified page compare to the control? Was the result expected? Did it show an improvement? Was the change big enough to warrant implementing the changes into production? Are there other variations you want to test? Document your conclusions, and double-check that the data back them up.
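When asking whether the change was “big enough,” it can help to look at the size of the improvement along with a range of plausible values for it. The Python sketch below computes the absolute difference, the relative lift over the control, and an approximate 95% confidence interval using the normal approximation. The function name and the counts are hypothetical, and the 1.96 multiplier assumes a 95% interval.

```python
import math


def lift_with_interval(conv_a: int, n_a: int, conv_b: int, n_b: int, z_crit: float = 1.96):
    """Return (absolute difference, (low, high) interval, relative lift over control)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    diff = p_b - p_a
    # Unpooled standard error of the difference between two proportions.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    interval = (diff - z_crit * se, diff + z_crit * se)
    relative_lift = diff / p_a  # undefined if the control never converts
    return diff, interval, relative_lift


# Invented counts: control converts 100 of 1,000 visitors; the variation converts 150 of 1,000.
diff, (low, high), lift = lift_with_interval(100, 1000, 150, 1000)
print(f"difference: {diff:.3f} (95% CI {low:.3f} to {high:.3f}), relative lift: {lift:.0%}")
```

If the interval comfortably excludes zero and even its low end would justify the engineering work, the change is probably worth shipping; if it straddles zero, keep collecting data or move on.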
6. Lather, Rinse. Repeat?
You may decide to move forward with the changes you tested, in which case you can wrap up this rev of the A/B Test. If you decide to continue tweaking and improving, don’t forget to go back to the start and make sure your hypothesis is still relevant.
Case Studies: Companies who A/B Test
There are many, many examples of how A/B testing is being used in the wild. Most are well-intentioned, sincere attempts to quantifiably optimize a particular flow or process, to increase click-through. But one person’s “well-intentioned, sincere” is another’s “creepy, manipulative.” Read these examples, judge for yourself.
Huffington Post: A/B Testing for the Most Effective Headlines
The Huffington Post uses A/B testing to make its headlines “viscerally effective.” As David Segal wrote in the New York Times Magazine:
“When most sites were merely guessing about what would resonate with readers, The Huffington Post brought a radical data-driven methodology to its home page, automatically moving popular stories to more prominent spaces and A/B testing its headlines. The site’s editorial director, Danny Shea, demonstrated to me how this works a few months ago, opening an online dashboard and pulling up an article about General Motors. One headline was ‘How GM Silenced a Whistleblower.’ Another read ‘How GM Bullied a Whistleblower.’ The site had automatically shown different headlines to different readers and found that ‘Silence’ was outperforming ‘Bully.’ So ‘Silence’ it would be. It’s this sort of obsessive data analysis that has helped web-headline writing become so viscerally effective.”
Many other news outlets do something similar. As reported by the BBC, “Sites like Slate and Upworthy, for example, often test up to 25 headlines using specially designed software to see which performs best. This leads to headlines such as: ‘They Had A Brilliant Idea To Give Cameras To Homeless People. And Then The Cameras Got Turned On Us.’ These can receive huge attention online and be widely shared, but are frequently derided as misleading ‘clickbait’ because the articles or videos they relate to can often be a disappointment.”
Google: 42 Shades of Blue Links
If there is any company that truly lives the test-everything ethos, it is Google. One of the more widely publicized (and sometimes ridiculed) examples of this involved testing various shades of blue links in ads across search and Gmail to determine which one site visitors would be most likely to click on.
Laugh all you want, but it turns out the right shade of blue could be worth an extra $200 million a year. As Dan Cobley, Google’s UK Managing Director, explains:
“In the world of the hippo, you ask the chief designer or the marketing director to pick a blue and that’s the solution. In the world of data you can run experiments to find the right answer…We ran ‘1%’ experiments, showing 1% of users one blue, and another experiment showing 1% another blue. And actually, to make sure we covered all our bases, we ran forty other experiments showing all the shades of blue you could possibly imagine and we saw which shades of blue people liked the most, demonstrated by how much they clicked on them. As a result we learned that a slightly purpler shade of blue was more conducive to clicking than a slightly greener shade of blue, and gee whizz, we made a decision…But the implications of that for us, given the scale of our business, was that we made an extra $200m a year in ad revenue.”
There are a couple of important footnotes to this story:
First, the constant need to test and validate every decision led some Google designers to leave the company. Doug Bowman, the company’s top designer at the time of the blue-test, decided to move on in part because of the hostility it created between the designers and engineers. “I can’t operate in an environment like that,” he wrote in his “Goodbye, Google” post. “I’ve grown tired of debating such miniscule (sic) design decisions. There are more exciting design problems in this world to tackle.”
Also, Google has since had a change of heart about the importance of designers and design throughout the company. While A/B Testing remains effective at proving which approach worked best, it does not necessarily lead to beautiful, well-designed pages or products. When one of its founders reclaimed the helm at Google, he elevated the importance of design. “Something strange and remarkable started happening at Google immediately after Larry Page took full control as CEO in 2011: it started designing good-looking apps.” In the years since then, the praise and focus have continued: “It would have been crazy to say just a few years ago. But today, Google produces better-designed software than any other tech behemoth. If you don’t believe that, then set down your Apple-flavored Kool-Aid. Take a cleansing breath, open your mind, and compare Android and iOS.”
TripAdvisor: Blue or Yellow Links? It depends.
TripAdvisor has been quietly and continuously using A/B Testing throughout its site for a long… long… ok, almost forever. Testing is deeply ingrained within their engineering culture, and given the enormous traffic the site sees in a given day, it is relatively easy to generate substantive results quickly. For example, they were able to use A/B testing to discover the impact of color on visits, finding that “certain colours draw some people in more than others. If people have arrived on a TripAdvisor page from a Google advert, for example, they’re more likely to click on a blue button. Other users navigating from within the TripAdvisor site, however, prefer yellow.”
A/B Test Everything! (Oh…Except That, And That.)
Multivariate tests are great; they’re objective, quantifiable, and can lead you to more definitive answers. However, as Ben Tilly points out, split testing is a poor substitute for many other critical tasks including:
- 1. Talking to users
- 2. Usability tests
- 3. Acceptance tests
- 4. Unit tests
- 5. Thinking
The first and last points on Tilly’s list bear the most significance. As powerful as data is throughout a product’s development process and lifecycle, product managers cannot simply hide behind the data. You must know and understand your users on a human level if you want to create great products and the best way to do that is to interact with them, directly, as humans. Data cannot replace human interaction.
In the context of A/B tests, if you do not know your users you’ll have a very hard time creating small, discrete tests, and may end up taking an assay approach (testing every possible combination to see which one might be best). Finally, A/B testing does not absolve you from having to actually think. You’ll need to understand what you want to test (and why), you’ll need to structure your hypothesis, understand your metrics, make sure you have solid controls in place, and when you get the results, you’ll need to be capable of critically assessing whether they make sense — not all data is good data.
A/B Testing is an extremely powerful tool that should be in every product manager’s toolbox; use it wisely, use it often, use it correctly, and use it in conjunction with other tools to make sure you have enough information to make solid choices. Many companies use A/B testing as a way to make and validate decisions related to their websites or mobile applications, including validating calls to action and testing which shopping cart flow results in the lowest abandonment rate. This type of testing is a very effective way to use hard data and continuous experimentation to drive decision-making, but as a Product Manager you must make sure you use the data from A/B, split, and multivariate tests judiciously, since they are a poor substitute for interacting directly with your customers and thinking about what they really need and want.