@vicaugusto33, sorry in advance for the lengthy discussion below!

I believe the point was to demonstrate two things:

- That a p-value only measures how surprising your sample would be if the null hypothesis were true, so it is evidence rather than a definitive verdict; and
- The result of a 1 Sample T-test is highly dependent on the sample you provide and whether it is representative of your likely population.

Because these hypothesis testing lessons are pretty confusing, I'm going to break this down to the nitty-gritty so we can see what's going on here.

## Background

So when running this test, your null hypothesis is that the sample belongs to a population with a mean age of 30.

In the previous exercise, using the first sample, the p-value was something like 0.56. If that is our only sample, then we conclude that we cannot rule out the null hypothesis: if the true mean age really were 30, we would see a sample mean at least this far from 30 about 56% of the time. (Strictly speaking, that is not the same as a 56% chance that the null hypothesis is true.) However, that sample size was only 14 (or supposed to represent 100 according to the exercise), and is probably not a good sample size for the likely population of BuyPie customers.
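To make the p-value a little less mysterious, here is a minimal sketch of the number a 1 Sample T-test is built on: the t-statistic, which is how far the sample mean sits from the hypothesized mean, measured in standard errors. The ages and the `t_statistic` helper below are made up for illustration and are not the exercise's data.

```python
import math
from statistics import mean, stdev

def t_statistic(sample, null_mean):
    """How many standard errors the sample mean is from null_mean."""
    n = len(sample)
    standard_error = stdev(sample) / math.sqrt(n)  # stdev uses the n - 1 denominator
    return (mean(sample) - null_mean) / standard_error

# Hypothetical ages, not the exercise's data.
ages = [31, 33, 35]
print(t_statistic(ages, 30))          # about 2.6: the sample mean is well above 30
print(t_statistic([28, 30, 32], 30))  # 0.0: the sample mean is exactly 30
```

The further the t-statistic is from 0, the smaller the p-value that `ttest_1samp` reports.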

For this exercise, you are told to loop through 1000 days of customer info, with each day being its own sample. In your loop you perform a 1 Sample T-test on that day's sample and you print out the p-value as well as the total number of tests where we could reject the null hypothesis: 499 of 1000.
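If you don't have the exercise open, the loop looks roughly like the sketch below. The `daily_visitors` here is a simulated stand-in, not Codecademy's data: I'm assuming ages that are roughly normal with a mean of 31 and a spread of 5, so your exact counts will differ.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Simulated stand-in for the exercise's data: 1000 days x 100 customer ages.
# The mean of 31 and standard deviation of 5 are assumptions for illustration.
rng = np.random.default_rng(seed=0)
daily_visitors = rng.normal(loc=31, scale=5, size=(1000, 100))

rejections = 0
pvals = []
for day in daily_visitors:
    # Null hypothesis: this day's sample comes from a population with mean age 30.
    tstat, pval = ttest_1samp(day, 30)
    pvals.append(pval)
    if pval < 0.05:
        rejections += 1

print("smallest p-value:", min(pvals))
print("largest p-value:", max(pvals))
print("rejected the null on {} of {} days".format(rejections, len(daily_visitors)))
```

The thing to notice is the gap between the smallest and largest p-values, even though every day is drawn from the same population.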

## Immediate Takeaways

Okay, cool, so we see 1000 p-values, some well below 0.05 and some well above. What does this tell us? It tells us that if we take 1000 different 100-person samples of the same population and run T-tests, the T-tests will vary widely in their confidence of accepting/rejecting the null hypothesis, depending on which values are included in the sample.

But what does this really tell us? It tells us that 100 is too small a sample size to be representative of our likely population. We shouldn't have such wildly different results if each of our samples was the correct size. How do we know that our current sample size is 100? Try running this code:

```
count = 0
for i in range(len(daily_visitors[0])):  # daily_visitors[0] is day 1 of our 1000 days
    count += 1
print(count)  # prints 100
```

As for how to pick the right sample size: the best way is to use a sample size calculator to find out what your sample size should be, so you can be statistically confident that it represents your likely population. Codecademy actually goes over this in the Sample Size Determination course, but for some reason they decided to put that course after the Hypothesis Testing course.

And no, the approach in the exercise is not a statistically sound way to evaluate your data; Codecademy just used it for example purposes.

## A Better Approach

You might be wondering how to properly find whether the average age of BuyPie's customers is 30, using a sample. Below, I'll give you an example of how to do just that using the data Codecademy provided.

First, add `import random` to the top of your code and comment out everything from your `for` loop down.

Now, let's take all of the data from `daily_visitors` and put it into one list:

```
all_customers = []
for i in range(len(daily_visitors)):
    for j in daily_visitors[i]:
        all_customers.append(j)
```
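If you prefer something shorter, the same flattening can be sketched with `itertools.chain` from the standard library. The tiny `daily_visitors` below is a made-up stand-in so the snippet runs on its own.

```python
from itertools import chain

# Made-up stand-in for daily_visitors: three "days" of customer ages.
daily_visitors = [[25, 31, 40], [29, 33], [35]]

# chain.from_iterable walks each day's list in order and yields every age.
all_customers = list(chain.from_iterable(daily_visitors))

print(all_customers)       # [25, 31, 40, 29, 33, 35]
print(len(all_customers))  # 6
```

Either version gives you one flat list; the nested loop is just more explicit about what is happening.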

Now, let's figure out the size of the sample we need. To do this, we'll use the sample size calculator from Codecademy's Sample Size Determination course.

For more info on how to choose the numbers, check out the course, which is included in the Data Science path.
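If you want a feel for what a calculator like that does under the hood, here is a rough sketch using the common survey sample size formula with a finite population correction. To be clear, the inputs are my assumptions, not something stated above: a 5% margin of error, a 50% proportion, and a population of 100,000. Codecademy's calculator may work differently, and `survey_sample_size` is just a name I made up.

```python
from statistics import NormalDist

def survey_sample_size(population, confidence=0.99, margin_of_error=0.05, proportion=0.5):
    """Survey sample size with a finite population correction (assumed formula)."""
    # Two-sided critical value, about 2.576 for 99% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Size needed if the population were effectively infinite.
    n_infinite = z ** 2 * proportion * (1 - proportion) / margin_of_error ** 2
    # Shrink it slightly because the population is finite.
    return n_infinite / (1 + (n_infinite - 1) / population)

n = survey_sample_size(population=100_000)
print(round(n))  # 659 with these assumed inputs
```

Many references round up rather than to the nearest whole number, which would give 660 here, so treat the last digit loosely.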

Now that we know that if we want a confidence level of 99%, we should use a sample of 659 people, we need to choose 659 randomly sampled people from `all_customers`. We can do this with `random.sample()`:

```
sample = random.sample(all_customers, k=659)
tstat, pval = ttest_1samp(sample, 30)
print(pval)
```

As you will see, `pval` will almost certainly be less than 0.05 (the exact value changes with each random sample), meaning that we can reject the null hypothesis and conclude that the mean age of our BuyPie customer population is not 30.
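Since `random.sample()` draws a fresh sample every time you run it, your p-value won't match anyone else's exactly. If you want repeatable results while experimenting, one option is to seed the generator first. A small sketch, with a made-up `ages` list and an arbitrary seed of 42:

```python
import random

# Made-up list of distinct ages, standing in for all_customers.
ages = list(range(18, 80))

random.seed(42)  # arbitrary seed, chosen only for illustration
first = random.sample(ages, k=5)

random.seed(42)  # same seed again, so we get the same draw
second = random.sample(ages, k=5)

print(first == second)       # True
print(len(set(first)) == 5)  # True: sampling is without replacement
```

Leave the seed out when you actually want to see how much the results vary from sample to sample.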

Now, assuming that the 100,000 data points we have represent the entire population, let's check what our actual mean is:

```
print(np.mean(all_customers)) # prints 31.00082
```

Our 1 Sample T-test was accurate! Of course, in real life we will probably never know our exact population size or be able to verify our T-test against the actual mean (especially for an online store). But, here it is cool to be able to check whether the proper sample size helped any.

If you are interested to see the difference between the exercise (1000 tests with an improper sample size) and running 1000 tests with the correct sample size, you can put this code in a loop, just like the original exercise:

```
null_true = 0
for i in range(1000):
    sample = random.sample(all_customers, k=659)
    tstat, pval = ttest_1samp(sample, 30)
    print(pval)
    if pval > 0.05:
        null_true += 1
print("We accepted our null hypothesis that the population's average age was 30...{} times!".format(null_true))
```
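To put the two sample sizes side by side, here is a sketch that measures how often each one rejects the null hypothesis. Again, the population is simulated (I'm assuming ages that are roughly normal with mean 31 and spread 5), and `rejection_rate` is a helper name I made up, so treat the numbers as illustrative only.

```python
import random
from scipy.stats import ttest_1samp

def rejection_rate(population, sample_size, null_mean=30, trials=200, alpha=0.05):
    """Fraction of random samples whose t-test rejects the null hypothesis."""
    rejected = 0
    for _ in range(trials):
        sample = random.sample(population, k=sample_size)
        _, pval = ttest_1samp(sample, null_mean)
        if pval < alpha:
            rejected += 1
    return rejected / trials

random.seed(1)
# Simulated population of 100,000 ages; the mean of 31 and spread of 5 are assumptions.
population = [random.gauss(31, 5) for _ in range(100_000)]

small = rejection_rate(population, sample_size=100)
large = rejection_rate(population, sample_size=659)
print("n=100 rejected the null in {:.0%} of tests".format(small))
print("n=659 rejected the null in {:.0%} of tests".format(large))
```

With the larger samples, nearly every test should reach the same conclusion, which is the consistency we were missing in the original exercise.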

Anyway, hopefully this helped you out. Happy coding!