**How Large Should My Sample Be?**

In the first post about sampling, we talked about what a sample is, how it relates to the population, and how proper sampling relates to drawing a valid conclusion. Then, we mentioned the confidence intervals – boundaries within which we estimate that the population parameter lies, with a particular amount of certainty. We also mentioned that the confidence intervals should not be too wide – otherwise, our estimate becomes useless.

**Sample size**

When planning a study, the sample size is often one of the main questions we need to answer because it is directly related to the budget. We all have an intuitive understanding that a sample should be ‘large enough’. But large enough for what? For example, let’s say that you are thinking about doing a package redesign. You want to know whether your customers would prefer the new package over the old one before you invest more money (check our post about A/B testing). So, you need a sample large enough to give a sufficiently precise estimate of the proportion of your customers who prefer the new package design.

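Once we narrow down what we need, the question about the required sample size becomes much easier to answer. There is a relatively simple formula that gives us the required sample size *n*. In its standard form, with a correction for finite populations, it reads:

$$
n = \frac{\dfrac{Z^2\,p(1-p)}{E^2}}{1 + \dfrac{Z^2\,p(1-p)}{E^2 N}}
$$

In the above formula: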

- *p* is the proportion that we want to estimate (e.g., the proportion of the population that prefers the new package design)
- *Z* is the value determining the confidence interval we would like to have for our proportion estimate
- *E* is the amount of error (in percentages) that we are willing to tolerate in our estimate
- *N* is the (exact or estimated) population size.

Even though the formula is straightforward, there are a couple of parameters that we should consider. First, we need an estimate of *p* (the proportion). “But estimating *p* is why I am doing the study in the first place!” you might cry. And you would be entirely right. If you have no idea what the size of *p* might be, the safest way to go is to assume that *p* is 0.5, since this is the value that maximizes the *p(1-p)* part of the expression and thus gives the most conservative (largest) estimate of the required sample. But maybe you do have some idea about the size of *p* from a previous study or some other source - for example, the last time you did a package redesign, 60% of people preferred the new package over the old one. In that case, you have reasonable grounds to set *p* to 0.6. Otherwise, keep it at 0.5.
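To see why 0.5 is the conservative choice, here is a minimal Python sketch that evaluates *p(1-p)* for a handful of candidate proportions. The candidate values and variable names are purely illustrative.

```python
# Evaluate the p(1-p) term for a few illustrative proportions.
candidates = [0.1, 0.3, 0.5, 0.6, 0.9]
variance_terms = {p: p * (1 - p) for p in candidates}

for p, term in variance_terms.items():
    print(f"p = {p:.1f} -> p(1-p) = {term:.2f}")

# The proportion with the largest p(1-p) demands the largest sample.
worst_case = max(variance_terms, key=variance_terms.get)
print("Most conservative choice:", worst_case)
```

The term peaks at 0.25 when *p* is 0.5 and shrinks as *p* moves toward 0 or 1, so any other assumed value of *p* yields a smaller required sample.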

Next, you need to decide on the value of *Z*. Traditionally, *Z* is set to either 1.96 or 2.576, depending on whether you would like to have a 95% or 99% confidence interval for your estimate (if you are not sure what a confidence interval is, we strongly advise you to read the first post on sampling before going any further).
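If you want a confidence level other than the two traditional ones, *Z* can be computed rather than looked up. The sketch below uses `NormalDist` from Python's standard library; the helper name `z_for_confidence` is our own, and it assumes a two-sided interval based on the normal distribution.

```python
from statistics import NormalDist


def z_for_confidence(confidence: float) -> float:
    """Return the two-sided standard-normal critical value for a confidence level."""
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    # A two-sided interval leaves (1 - confidence) / 2 in each tail.
    upper_quantile = 1 - (1 - confidence) / 2
    return NormalDist().inv_cdf(upper_quantile)


z95 = z_for_confidence(0.95)
z99 = z_for_confidence(0.99)
print(f"95%: Z = {z95:.3f}")
print(f"99%: Z = {z99:.3f}")
```

For 95% and 99% this gives roughly 1.96 and 2.576, matching the usual table values.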

The following step is to decide on the value of *E*, or how much error you can tolerate. Are you okay with your estimate potentially being ±5% off? Or ±10% off? The smaller the error you are willing to tolerate, the bigger the sample you will need, and vice versa. Finally, you need to enter the population size. You might have no idea about the size of the population, but that doesn’t matter much: if your population is so large that you don’t know how big it is, the required sample size will barely change with the exact figure. Population size affects the sample size only for small populations (e.g., below 1000-2000 entities), in which case you will probably know the approximate size of your population anyway.
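The trade-off between tolerated error and sample size can be made concrete with a short sketch. It assumes a population large enough to ignore, the standard expression Z²·p(1-p)/E² with *E* entered as a proportion (0.05 for ±5%), *Z* = 1.96 and *p* = 0.5. The function name is hypothetical.

```python
import math


def sample_size_large_population(e: float, p: float = 0.5, z: float = 1.96) -> int:
    """Required sample size when the population is too large to matter.

    `e` is the tolerated error as a proportion (0.05 means +/-5%).
    """
    if e <= 0:
        raise ValueError("e must be positive")
    # Round up: we cannot interview a fraction of a respondent.
    return math.ceil(z**2 * p * (1 - p) / e**2)


required = {e: sample_size_large_population(e) for e in (0.10, 0.05, 0.03)}
for e, n in required.items():
    print(f"E = +/-{e:.0%} -> n = {n}")
```

Halving the tolerated error roughly quadruples the required sample, because *E* enters the formula squared.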

So, once you have decided on all the parameters, you can plug them into the equation. You can also try it in our sample size calculator.
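As a rough sketch of that plug-in step, the Python function below implements the standard sample-size formula for a proportion with a finite population correction. We assume this is the form meant here; it is not necessarily the exact expression behind the calculator. The function and parameter names are our own, and *E* is entered as a proportion.

```python
import math
from typing import Optional


def sample_size(p: float = 0.5, z: float = 1.96, e: float = 0.05,
                population: Optional[int] = None) -> int:
    """Required sample size for estimating a proportion.

    p: expected proportion (0.5 is the conservative default)
    z: critical value (1.96 for 95%, 2.576 for 99%)
    e: tolerated error as a proportion (0.05 means +/-5%)
    population: population size N, or None if it is very large or unknown
    """
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    if e <= 0:
        raise ValueError("e must be positive")
    # Sample size for an effectively infinite population.
    n0 = z**2 * p * (1 - p) / e**2
    if population is None:
        return math.ceil(n0)
    # Finite population correction (assumed form: n0 / (1 + n0 / N)).
    return math.ceil(n0 / (1 + n0 / population))


# Package redesign example: 60% preferred the new design last time.
n_redesign = sample_size(p=0.6)
# Same settings with p = 0.5, for a small and a very large population.
n_small = sample_size(population=1_000)
n_large = sample_size(population=1_000_000)
print(n_redesign, n_small, n_large)
```

With these inputs, a population of one million needs the same sample as an unknown one, while a population of 1,000 needs noticeably fewer respondents.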

**In conclusion**

Having a large enough sample is important, but if our sample is not representative, no size will help us reach valid conclusions. In the third and final post on sampling, read about some practical sampling strategies that you can use to ensure that your sample gives you what you need.
