A/B Test Power Calculator
Before you run an A/B test, you need to know how many visitors each variant requires. Enter your baseline conversion rate and the minimum improvement you want to detect, and this calculator tells you exactly how large your sample needs to be.
Your current conversion rate before the test.
A 20% relative lift on a 5% baseline = detecting a rise to 6%.
Confidence level
Statistical power
Results
Visitors per variant
Total visitors needed
Baseline rate
Target rate
Confidence level
Statistical power
Estimated runtime
Based on your daily visitor count split 50/50 across both variants.
Test complete?
Once you've hit your sample size, check whether your results are statistically significant.
Why You Need a Power Test Before Your A/B Test
Most A/B test failures are not caused by a bad variant. They are caused by a badly sized experiment.
Underpowered tests miss real improvements
If your sample is too small, you won't have enough statistical sensitivity to reliably detect the lift you're looking for — even if it's genuinely there. You'll call the test inconclusive and move on, leaving a real winner on the table. A power test tells you the minimum sample required to avoid this.
Overpowered tests waste time and budget
Running a test for twice as long as you need ties up your traffic, delays other experiments, and in paid media contexts, extends the period of suboptimal spend. Calculate the right sample size upfront and stop exactly when you have enough data.
It forces you to commit to an MDE
Deciding your minimum detectable effect before the test starts is a forcing function. It makes you answer: what improvement is actually worth acting on? If a 5% lift would not change your marketing decisions, set your MDE higher. This prevents you from fishing for significance on tiny, meaningless effects.
It prevents peeking and early stopping
When you know exactly how many visitors you need, you have a hard stopping criterion. Without one, it's tempting to stop as soon as results look good — which inflates your false-positive rate dramatically. Once you hit your target sample size, take your results to our statistical significance calculator to get your p-value.
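To see how badly peeking distorts results, here is a minimal simulation sketch in plain Python (the parameters, names, and checking schedule are illustrative, not how any particular testing tool works). Both variants convert at the same 5% rate, so every "significant" result is a false positive. A single check at the pre-committed sample size holds the error rate near 5%; testing after every 500 visitors and stopping at the first "win" inflates it several-fold.

```python
import random
from statistics import NormalDist

def two_sided_p(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0
    z = abs(conv_a / n_a - conv_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(z))

random.seed(1)
TRIALS, N, CHECK_EVERY, RATE = 1000, 5_000, 500, 0.05  # A/A test: no true lift
final_hits = peek_hits = 0

for _ in range(TRIALS):
    conv_a = conv_b = 0
    peeked = False
    for i in range(1, N + 1):
        conv_a += random.random() < RATE
        conv_b += random.random() < RATE
        # Peeking rule: test at every interim check, stop at the first "win".
        if (not peeked and i % CHECK_EVERY == 0
                and two_sided_p(conv_a, i, conv_b, i) < 0.05):
            peek_hits += 1
            peeked = True
    # Disciplined rule: one test at the pre-committed sample size.
    final_hits += two_sided_p(conv_a, N, conv_b, N) < 0.05

print(f"single final check: {final_hits / TRIALS:.1%} false positives")
print(f"peeking every {CHECK_EVERY}: {peek_hits / TRIALS:.1%} false positives")
```

With ten interim looks, the peeking arm typically lands at three to four times the nominal 5% rate; peek more often and it climbs further, toward the 25%+ figure cited in the FAQ below.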
Test already running? Once you've collected enough data, check if your results are real.
Statistical Significance Calculator →
How Statistical Power Works
Four concepts that determine your required sample size — and how to set each one.
Baseline conversion rate
Your current conversion rate before the test. The lower your baseline, the more visitors you need to detect the same absolute change. A test on a 1% baseline requires far more visitors than the same test on a 10% baseline. Use the last 30 days of data from your analytics.
Minimum detectable effect (MDE)
The smallest improvement that would justify launching the variant. Smaller MDEs require dramatically larger samples — halving your MDE roughly quadruples the required sample size. Be honest: if you would not launch a variant for a 5% lift, don't power your test to detect 5%.
Confidence level (1 − α)
The probability of avoiding a false positive, i.e. declaring a winner when there is none. 95% is standard: a variant with no real effect will still test significant about 1 time in 20 by chance. Lowering to 90% reduces the required sample size but doubles that false-positive risk; use it only for low-stakes tests.
Statistical power (1 − β)
The probability of detecting a real effect when it exists. 80% is the industry standard — meaning a 20% chance of a false negative. Higher power (95%) dramatically increases required sample size. Use 80% unless the cost of missing a true improvement is unusually high.
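All four concepts combine in the standard two-proportion sample-size formula, which is what calculators like this one generally implement (the exact method behind the tool above may differ slightly). A minimal sketch in plain Python; the function name and defaults are illustrative:

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Visitors per variant for a two-sided, two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)             # target rate implied by the MDE
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 at 95% confidence
    z_power = NormalDist().inv_cdf(power)          # 0.84 at 80% power
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * sqrt(2 * p_bar * (1 - p_bar))
          + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
         / (p2 - p1) ** 2)
    return ceil(n)

# 5% baseline, 20% relative MDE (5% -> 6%), 95% confidence, 80% power:
print(sample_size_per_variant(0.05, 0.20))   # ≈ 8,159 per variant

# Halving the MDE roughly quadruples the sample:
print(sample_size_per_variant(0.05, 0.10))   # ≈ 31,234 per variant

# The same relative MDE is far more expensive on a low baseline:
print(sample_size_per_variant(0.01, 0.20))   # ≈ 42,693 per variant
print(sample_size_per_variant(0.10, 0.20))   # ≈ 3,841 per variant
```

The last four lines make the claims above concrete: halving the MDE roughly quadruples the sample, and a 1% baseline needs an order of magnitude more visitors than a 10% baseline for the same relative lift.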
Frequently Asked Questions
What is statistical power in A/B testing?
Statistical power is the probability that your test will detect a real effect when one exists. A power of 80% means there is a 20% chance of missing a genuine improvement — a false negative. Higher power requires more visitors but reduces the risk of incorrectly calling a test inconclusive.
What is a minimum detectable effect (MDE)?
The minimum detectable effect is the smallest improvement you want to reliably detect. Setting a smaller MDE requires more visitors. Choose your MDE based on the minimum business impact that would justify acting on the test — if a 5% lift would not change your decisions, set a higher MDE.
Should I use relative or absolute MDE?
Relative MDE is a percentage improvement on your baseline — a 20% relative lift on a 5% baseline means detecting a rise to 6%. Absolute MDE is a direct percentage point change — 5% to 6% is a 1pp absolute lift. Relative is more intuitive for most marketers; both give the same required sample size if they describe the same underlying change.
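The conversion between the two is simple arithmetic. A quick sketch using the numbers from the answer above (all values illustrative):

```python
baseline = 0.05                       # 5% baseline conversion rate

relative_mde = 0.20                   # 20% relative lift
target_from_relative = baseline * (1 + relative_mde)  # 0.06 -> 6%

absolute_mde = 0.01                   # 1 percentage point (pp) lift
target_from_absolute = baseline + absolute_mde        # 0.06 -> 6%

# Same underlying change, so a power calculation gives the same sample size.
assert abs(target_from_relative - target_from_absolute) < 1e-12
```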
What power level should I use — 80% or 95%?
80% is the industry standard for most tests. It accepts a 20% risk of missing a real effect. Use 95% when the cost of missing a true improvement is very high, but note that it significantly increases the required sample size — often by 60% or more.
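The "60% or more" figure falls out of the formula: required sample size scales roughly with the square of the summed z-scores, so moving power from 80% to 95% at 95% confidence gives (a back-of-envelope normal-approximation sketch, ignoring the small change in the variance terms):

```python
from statistics import NormalDist

z = NormalDist().inv_cdf
ratio = ((z(0.975) + z(0.95)) / (z(0.975) + z(0.80))) ** 2
print(f"{ratio:.2f}x")   # ≈ 1.66x, i.e. about 66% more visitors per variant
```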
What confidence level should I use?
95% is the standard. It means you accept a 5% chance of a false positive — declaring a winner when there is none. Use 90% for low-stakes, high-frequency tests where you can tolerate slightly more noise. Use 99% only for very high-stakes decisions where a false positive would be costly.
How long should I run my A/B test?
Run the test until you reach your required sample size per variant, and for at least one to two full weeks to average out day-of-week effects. Enter your daily visitor count into the calculator above to get a runtime estimate. Never stop early because results look significant — this is peeking and inflates your false-positive rate.
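A minimal sketch of that runtime arithmetic, assuming the calculator's 50/50 traffic split and treating the two-week minimum as a hard floor (the function name and parameters are illustrative, not the tool's actual code):

```python
from math import ceil

def estimated_runtime_days(per_variant, daily_visitors, min_days=14):
    """Days to reach the required total sample, floored at two full weeks."""
    days_for_sample = ceil(2 * per_variant / daily_visitors)  # 50/50 split
    return max(days_for_sample, min_days)

# e.g. ≈8,159 visitors per variant at 800 visitors/day -> 21 days
print(estimated_runtime_days(8159, 800))
```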
What happens if I stop before reaching the required sample size?
Stopping early — especially when results look promising — is called peeking, and it inflates your false-positive rate from 5% to 25% or higher. Always run to your pre-calculated sample size. Once you hit it, use our statistical significance calculator to evaluate your results.
Work with Jarrah
Ready to scale your winners?
We run paid media and CRO programs built on rigorous testing — not hunches.