One-Tailed Vs. Two-Tailed Tests
https://towardsdatascience.com/one-tailed-vs-two-tailed-tests/

Choosing between one- and two-tailed hypotheses affects every stage of A/B testing. Learn why the hypothesis direction matters and explore the pros and cons of each approach.

Introduction

If you’ve ever analyzed data using built-in t-test functions, such as those in R or SciPy, here’s a question for you: have you ever adjusted the default setting for the alternative hypothesis? If your answer is no—or if you’re not even sure what this means—then this blog post is for you!

The alternative hypothesis parameter, commonly referred to as “one-tailed” versus “two-tailed” in statistics, defines the expected direction of the difference between control and treatment groups. In a two-tailed test, we assess whether there is any difference in mean values between the groups, without specifying a direction. A one-tailed test, on the other hand, posits a specific direction—whether the control group’s mean is either less than or greater than that of the treatment group.
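To see where this setting lives in practice, here is a minimal sketch using SciPy's independent t-test; the data is simulated for illustration, and the alternative argument (accepted values 'two-sided', 'less', or 'greater' in recent SciPy versions) is exactly the parameter discussed above.

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
control = rng.normal(loc=10.0, scale=2.0, size=200)    # simulated metric for the control group
treatment = rng.normal(loc=10.4, scale=2.0, size=200)  # simulated metric for the treatment group

# Two-tailed: is there a difference in either direction?
print(ttest_ind(treatment, control, alternative='two-sided'))

# One-tailed (right-tailed): is the treatment mean greater than the control mean?
print(ttest_ind(treatment, control, alternative='greater'))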

Choosing between one- and two-tailed hypotheses might seem like a minor detail, but it affects every stage of A/B testing: from test planning to Data Analysis and results interpretation. This article builds a theoretical foundation on why the hypothesis direction matters and explores the pros and cons of each approach.

One-tailed vs. two-tailed hypothesis testing: Understanding the difference

To understand the importance of choosing between one-tailed and two-tailed hypotheses, let’s briefly review the basics of the t-test, the commonly used method in A/B testing. Like other Hypothesis Testing methods, the t-test begins with a conservative assumption: there is no difference between the two groups (the null hypothesis). Only if we find strong evidence against this assumption can we reject the null hypothesis and conclude that the treatment has had an effect.

But what qualifies as “strong evidence”? To answer this, a rejection region is defined under the null hypothesis: any result that falls within this region is deemed so unlikely under the null that we take it as evidence against the null hypothesis. The size of this rejection region is based on a predetermined probability, known as alpha (α), which represents the likelihood of incorrectly rejecting the null hypothesis.

What does this have to do with the direction of the alternative hypothesis? Quite a bit, actually. While the alpha level determines the size of the rejection region, the alternative hypothesis dictates its placement. In a one-tailed test, where we hypothesize a specific direction of difference, the rejection region is situated in only one tail of the distribution. For a hypothesized positive effect (e.g., that the treatment group mean is higher than the control group mean), the rejection region would lie in the right tail, creating a right-tailed test. Conversely, if we hypothesize a negative effect (e.g., that the treatment group mean is less than the control group mean), the rejection region would be placed in the left tail, resulting in a left-tailed test.

In contrast, a two-tailed test allows for the detection of a difference in either direction, so the rejection region is split between both tails of the distribution. This accommodates the possibility of observing extreme values in either direction, whether the effect is positive or negative.

To build intuition, let’s visualize how the rejection regions appear under the different hypotheses. Recall that according to the null hypothesis, the difference between the two groups should center around zero. Thanks to the central limit theorem, we also know this distribution approximates a normal distribution. Consequently, the rejection regions corresponding to the different alternative hypotheses look like this:
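As a numeric companion to this picture, the short sketch below (assuming α = 0.05 and a standard normal test statistic, values chosen only for illustration) computes where the rejection region begins for right-tailed, left-tailed, and two-tailed tests.

from scipy.stats import norm

alpha = 0.05
print("Right-tailed: reject if Z >", round(norm.ppf(1 - alpha), 3))        # 1.645
print("Left-tailed:  reject if Z <", round(norm.ppf(alpha), 3))            # -1.645
print("Two-tailed:   reject if |Z| >", round(norm.ppf(1 - alpha / 2), 3))  # 1.96

Note how the two-tailed cutoff (1.96) sits further out than the one-tailed cutoff (1.645), because the same 5% of probability is split between the two tails.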

Why does it make a difference?

The choice of direction for the alternative hypothesis impacts the entire A/B testing process, starting with the planning phase—specifically, in determining the sample size. Sample size is calculated based on the desired power of the test, which is the probability of detecting a true difference between the two groups when one exists. To compute power, we examine the area under the alternative hypothesis that corresponds to the rejection region (since power reflects the ability to reject the null hypothesis when the alternative hypothesis is true).

Since the direction of the hypothesis affects the size of this rejection region, power is generally lower for a two-tailed hypothesis. This is due to the rejection region being divided across both tails, making it more challenging to detect an effect in any one direction. The following graph illustrates the comparison between the two types of hypotheses. Note that the purple area is larger for the one-tailed hypothesis, compared to the two-tailed hypothesis:

In practice, to maintain the desired power level, we compensate for the reduced power of a two-tailed hypothesis by increasing the sample size (Increasing sample size raises power, though the mechanics of this can be a topic for a separate article). Thus, the choice between one- and two-tailed hypotheses directly influences the required sample size for your test. 
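As an illustration of this trade-off, here is a small sketch using statsmodels' power calculator; the effect size (Cohen's d = 0.2), α = 0.05, and target power of 0.8 are assumed values chosen only for the example.

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# One-tailed (right-tailed) alternative
n_one_tailed = analysis.solve_power(effect_size=0.2, alpha=0.05, power=0.8, alternative='larger')
# Two-tailed alternative
n_two_tailed = analysis.solve_power(effect_size=0.2, alpha=0.05, power=0.8, alternative='two-sided')

print(f"One-tailed: ~{n_one_tailed:.0f} observations per group")
print(f"Two-tailed: ~{n_two_tailed:.0f} observations per group")

With these assumed inputs, the two-tailed test requires noticeably more observations per group to reach the same power.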

Beyond the planning phase, the choice of alternative hypothesis directly impacts the analysis and interpretation of results. There are cases where a test may reach significance with a one-tailed approach but not with a two-tailed one, and vice versa. Reviewing the previous graph can help illustrate this: for example, a result in the left tail might be significant under a two-tailed hypothesis but not under a right one-tailed hypothesis. Conversely, certain results might fall within the rejection region of a right one-tailed test but lie outside the rejection area in a two-tailed test.

How to decide between a one-tailed and two-tailed hypothesis

Let’s start with the bottom line: there’s no absolute right or wrong choice here. Both approaches are valid, and the primary consideration should be your specific business needs. To help you decide which option best suits your company, we’ll outline the key pros and cons of each.

At first glance, a one-tailed alternative may appear to be the clear choice, as it often aligns better with business objectives. In industry applications, the focus is typically on improving specific metrics rather than exploring a treatment’s impact in both directions. This is especially relevant in A/B testing, where the goal is often to optimize conversion rates or enhance revenue. If the treatment doesn’t lead to a significant improvement, the examined change won’t be implemented.

Beyond this conceptual advantage, we have already mentioned one key benefit of a one-tailed hypothesis: it requires a smaller sample size. Thus, choosing a one-tailed alternative can save both time and resources. To illustrate this advantage, the following graphs show the required sample sizes for one- and two-tailed hypotheses with different power levels (alpha is set at 5%).

In this context, the decision between one- and two-tailed hypotheses becomes particularly important in sequential testing—a method that allows for ongoing data analysis without inflating the alpha level. Here, selecting a one-tailed test can significantly reduce the duration of the test, enabling faster decision-making, which is especially valuable in dynamic business environments where prompt responses are essential.

However, don’t be too quick to dismiss the two-tailed hypothesis! It has its own advantages. In some business contexts, the ability to detect “negative significant results” is a major benefit. As one client once shared, he preferred negative significant results over inconclusive ones because they offer valuable learning opportunities. Even if the outcome wasn’t as expected, he could conclude that the treatment had a negative effect and gain insights into the product.

Another benefit of two-tailed tests is their straightforward interpretation using confidence intervals (CIs). In two-tailed tests, a CI that doesn’t include zero directly indicates significance, making it easier for practitioners to interpret results at a glance. This clarity is particularly appealing since CIs are widely used in A/B testing platforms. Conversely, with one-tailed tests, a significant result might still include zero in the CI, potentially leading to confusion or mistrust in the findings. Although one-sided confidence intervals can be employed with one-tailed tests, this practice is less common.

Conclusions

By adjusting a single parameter, you can significantly impact your A/B testing: specifically, the sample size you need to collect and the interpretation of the results. When deciding between one- and two-tailed hypotheses, consider factors such as the available sample size, the advantages of detecting negative effects, and the convenience of aligning confidence intervals (CIs) with hypothesis testing. Ultimately, this decision should be made thoughtfully, taking into account what best fits your business needs.

(Note: all the images in this post were created by the author)

Chi-Squared Test: Comparing Variations Through Soccer
https://towardsdatascience.com/chi-squared-test-comparing-variations-through-soccer-e291ffe22c2f/

Understanding Different Types of Chi-Squared Tests: A/B Testing for Data Science Series (11)
Photo by Joppe Spaa on Unsplash


The term “Chi-Squared Test” is often thrown around without much clarification about which specific test is being referenced. Of course, if you are a data scientist, you should know which type of test the other person is referring to. However, if you are just starting your data science career or learning about data science, it’s pretty understandable that this distinction can be confusing.

That’s exactly why I’m writing this article today!

I’ll help you break down the different types of Chi-Squared tests and provide a thorough explanation of when and how to use them. This article will be a continuation of my A/B Testing and Hypothesis Testing series, so if you are interested, go check them out as well!

Also… Before I start getting into the details, I want to take a moment to acknowledge the timing of this article. As my final writing for 2024, I imagine many of you might be reading this on New Year’s Eve, New Year’s Day, or sometime shortly after.

I hope 2024 has been a fantastic year for you, and I wish you all an incredible 2025 ahead.

It’s been a great journey so far and I am excited to write more great articles in 2025! With that said, let’s get started into the last article of the year!

Note: Please skip to the second section if you know the basics of a Chi-Squared Test already!


Table of Contents

  1. Chi-Squared Test
  2. Analyzing Playing Styles in Serie A: Chi-squared test for Independence
  3. Penalty Kick Success Rate: Chi-Squared Test for Homogeneity
  4. Goals Scored in Soccer Matches: Chi-Squared Test for Goodness-of-Fit
  5. Summary

Chi-Squared Test

The Chi-Squared Test is a statistical method used to help us figure out whether the differences (or similarities) we observe between categorical variables happen by chance or reflect a real, meaningful relationship. This makes it an essential tool in hypothesis testing, widely used in Data Science and other fields.

If you recall, categorical variables are types of variables in data that represent groups or categories instead of numerical values.

To identify whether something is a categorical variable or not, I generally ask myself "Does this have a finite set of distinct groups?" For instance, if we’re analyzing "preferred drink choices" in a survey, the options might be categories like coffee, tea, water, soda, or juice. Since I can clearly list these distinct groups, I know it’s a categorical variable.

Types of Chi-Squared Tests

There are three main types of Chi-Squared Tests, each serving a specific purpose:

  • Chi-Squared Test for Independence determines if there is a significant association between two categorical variables in a single population.
  • Chi-Squared Test for Homogeneity compares the distribution of a categorical variable across two or more different populations or groups.
  • Chi-Squared Goodness of Fit Test determines whether the observed distribution of a single categorical variable matches a theoretical or expected distribution.

If this seems confusing, don’t worry! I’ll break down each type of test with detailed examples to make it clearer.

Key Concepts

Chi-Squared Test is a hypothesis-testing method

  • Null Hypothesis (H₀): Assumes no significant relationship or difference exists.
  • Alternative Hypothesis (Hₐ): Assumes a significant relationship or difference exists.
  • Right-Tailed: The Chi-Squared Test is always right-tailed because the test statistic (χ²) is based on squared differences, which are always positive. Large values of χ² indicate significant deviations from the null hypothesis, leading to its rejection.
  • Contingency Table: Generally for the chi-squared test for independence, you’ll see some sort of a table that shows the data for each categorical variable (like the one below). It summarizes the data to understand the relationship between the variables.
Incoming example using soccer (again)

If these concepts seem complex, don’t worry! In the next sections, I’ll explain Chi-squared with a concrete example and go over each type of Chi-Squared Test in detail, so you can confidently apply them to your own data analyses.

Now that we’ve covered the basics, let’s use one of my favorite topics – soccer – to illustrate this test. For my readers who have been following my articles, you know I love to use soccer!

Photo by Peter Glaser on Unsplash

Analyzing Playing Styles in Serie A: Chi-squared test for Independence

Imagine you are working with AC Milan’s head coach as a data scientist. You and the head coach have been hired by the team with a new season coming up (2025–2026). AC Milan has been known for their strong defensive style of football. However, this approach hasn’t been very successful in recent years.

With a new season on the horizon, the coach is curious to understand whether certain playing styles – such as Balanced, Possession-based, or Counter-Attack – are associated with better match outcomes (Win, Draw, or Loss) across Serie A teams.

To test this, you analyze match results from the entire past Serie A season (2024–2025). This data represents a single population: all matches played in Serie A that season. You’re examining the relationship between two categorical variables:

  • Playing Style: Balanced, Possession, Counter-Attack.
  • Match Result: Win, Draw, Loss.

For simplicity, the dataset for our hypothetical Serie A analysis is summarized in the table below.

Observed Data for our made-up Serie A example

Note for readers: While real-world match outcomes are influenced by various factors – like squad depth, player availability, and injuries – we’ll focus solely on playing style for this analysis!

Define the Hypothesis

How can we use a statistical test to determine whether certain playing styles influence match results? Well… remember that this is a hypothesis test, so let’s define the null and alternative hypotheses!

  • Null Hypothesis (Hₒ): Playing style is independent of match result. (No relationship exists between playing style and outcome.)
  • Alternative Hypothesis (Hₐ​): Playing style is associated with match result. (A relationship exists between the two variables.)

To test these hypotheses, we’ll use the Chi-Squared Test for Independence to evaluate whether the observed differences in match results are statistically significant!

What do we expect from our data?

Before we begin performing the test… let’s take a step back and consider what the Chi-Squared Test for Independence is trying to achieve. At its core, it’s safe to say that this test asks:

  • Are the differences in match results across playing styles just random?
  • Or do these differences reflect a meaningful relationship between playing style and outcomes?

Now, let’s consider this intuitively.

If playing style has no impact on match results, the number of wins, draws, and losses for each style should align with the overall proportions in the data. For example, if 50% of all matches result in a win, we’d expect each playing style to have roughly 50% wins!

If we were to calculate the expected count for each combination of playing style and match result, it can be computed simply from the row and column totals.

This is what we expect each cell’s value in our table to be (based on our data)

For example, the expected number of wins for the Balanced playing style is shown below. We can do the same for every single cell to get the expected data!

Expected number of wins for Balanced playing style
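For instance, taking the observed counts used in the Python example further below (assuming they match the table), the Balanced row contains 210 matches, the Win column contains 200 matches, and there are 410 matches in total, so the expected number of Balanced wins is (row total × column total) / grand total = (210 × 200) / 410 ≈ 102.4.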

Why Does This Matter?

Okay, we intuitively went over what we expect our data to be. But, why do we care about the expected data instead of just looking at the observed data?

Expected Data

Well…! The expected data is critical to our Chi-Squared Test because it serves as the baseline for comparison. By calculating the expected values, we essentially say:

  • "If playing style has no impact on match results (i.e., no relationship exists), this is what the data should look like."

When we compare the observed to the expected data, any significant deviation might indicate that playing style and match results are related.

Observed Data

It makes sense, right? If there’s truly no relationship, the observed data should closely align with the expected data. On the other hand, if the observed data is notably different from the expected, it suggests that a relationship might exist between playing style and match outcomes.

Understanding the Chi-Squared Test

This comparison is the foundation of the Chi-Squared Test, helping us determine whether the differences are statistically significant or just random chance.

The key metric in this test is the Chi-Squared statistic (χ²), which measures how much the observed data deviates from the expected data, assuming the two variables are independent. The formula for calculating the Chi-Squared statistic is:

χ² = Σ [ (O − E)² / E ]
  • O​ is the observed frequency in each cell of the contingency table.
  • E is the expected frequency in each cell of the contingency table.

For each cell in the table, you calculate this value and then sum them all up to get the total Chi-Squared statistic!
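Before handing the work to SciPy, here is a minimal NumPy sketch of that calculation, using the same observed counts as the example below; it builds the expected table from the row and column totals and sums the (O − E)² / E terms by hand.

import numpy as np

observed = np.array([
    [80, 50, 80],  # Balanced: Win, Draw, Loss
    [90, 20, 10],  # Possession
    [30, 30, 20]   # Counter-Attack
])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (3, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 3)
grand_total = observed.sum()

expected = row_totals @ col_totals / grand_total   # E = (row total x column total) / grand total
chi2_stat = ((observed - expected) ** 2 / expected).sum()

print("Chi-Squared statistic:", round(chi2_stat, 2))  # matches the value from chi2_contingency below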

Performing the Test Using Python

Now, you can use Python to calculate the Chi-Squared statistic and determine whether the observed differences are statistically significant! We’ll set the significance level (α) to 0.05.

import numpy as np
from scipy.stats import chi2_contingency

# Observed data
observed = np.array([
    [80, 50, 80],  # Balanced
    [90, 20, 10],  # Possession
    [30, 30, 20]   # Counter-Attack
])

# Perform Chi-Squared Test
chi2, p_value, dof, expected = chi2_contingency(observed)

# Results
print("Chi-Squared Statistic:", chi2) # 72.34
print("P-Value:", p_value) # 0.00001
print("Degrees of Freedom:", dof) # 4
print("Expected Frequencies:n", expected) # Table for expected values

# Significance level
alpha = 0.05

Looking at our data, we get a p-value far below the significance level (α = 0.05). What does this mean? It means that we reject the null hypothesis. This indicates a statistically significant relationship between playing style and match outcomes in Serie A.

So we can tell the head coach that we can explore adopting a Possession-based style, as it shows a stronger association with better match results (e.g., more wins) based on the analysis!


Applying Chi-Squared Tests in Different Scenarios

We just went over the chi-squared test for independence and how it works! Luckily, for all types of Chi-Squared tests – whether it’s the Chi-Squared Test for Homogeneity or the Chi-Squared Goodness-of-Fit Test – the foundational calculations remain the same as those used in the Chi-Squared Test for Independence!

  • Start with the observed data.
  • Calculate the expected data.
  • Compute the Chi-Squared statistic.
  • Evaluate its statistical significance.

Pretty simple right!!?

Since the procedures are very similar across these tests, for the other types of Chi-Squared tests… I’ll focus on the scenarios where each test is used instead of going into the details again!


Photo by Jeffrey F Lin on Unsplash

Penalty Kick Success Rate: Chi-Squared Test for Homogeneity

For a data science project, I analyzed data on the success rates of penalty kicks – categorized as Goal, Saved, or Missed – from various high-profile soccer tournaments, including the UEFA Champions League, FIFA World Cup, and Copa América.

The goal of this project was to investigate whether the success rates of penalty kicks differ across these tournaments or remain consistent, considering the varying levels of pressure and competition in each event.

Interesting right?

This kind of question is perfect for solving with the Chi-Squared Test for Homogeneity! Remember that in this variation of the Chi-Squared test you are looking to compare whether the distribution of a categorical variable (success rate of penalty kicks) is the same across multiple groups or populations (different soccer tournaments).

Below is the data, try out the Chi-Squared Test for Homogeneity yourself!

Observed Data for Success Rates of Penalty Kicks

Answer

import numpy as np
from scipy.stats import chi2_contingency

# Observed data
observed = np.array([
    [40, 35, 30],  # Goal
    [20, 25, 15],  # Saved
    [10, 15, 20]   # Missed
])

# Perform the Chi-Squared Test
chi2, p_value, dof, expected = chi2_contingency(observed)

# Print results
print("Chi-Squared Statistic:", chi2)
print("P-Value:", p_value)
print("Degrees of Freedom:", dof)
print("Expected Frequencies:n", expected)

# Chi-Squared Statistic: ≈ 6.74
# P-Value: ≈ 0.15
# Degrees of Freedom: 4
# Expected Frequencies:
#  [[35.0  37.5  32.5 ]
#   [20.0  21.43 18.57]
#   [15.0  16.07 13.93]]
# Fail to reject the null hypothesis: No significant difference in success rates across tournaments.

Photo by My Profit Tutor on Unsplash

Goals Scored in Soccer Matches: Chi-Squared Test for Goodness-of-Fit

Imagine you want to test whether the distribution of goals scored in soccer matches aligns with a Poisson distribution. If you recall, the Poisson distribution is often used to model the number of events (like goals) occurring within a fixed interval, such as a single soccer match.

Probability Distributions: Poisson vs. Binomial Distribution

Based on historical league data, you hypothesize the following probability distribution for goals scored:

Observed Data and Theoretical Probability

This kind of question is perfect for solving with the Chi-Squared Test for Goodness-of-Fit! Remember that we use this variation to determine whether a single categorical variable (in this case, goals scored) follows a specific theoretical or expected distribution (e.g., Poisson).

By comparing the observed data with the theoretical probabilities, we can determine whether the two distributions are consistent or significantly different.

What this means is that unlike our previous variations of Chi-Squared Test, our hypothesis would be looking to see if the distribution of the categorical variable matches our theoretical distribution!

  • Null Hypothesis (H₀): The observed distribution of goals matches the theoretical Poisson distribution.
  • Alternative Hypothesis (Hₐ): The observed distribution of goals does not match the theoretical Poisson distribution.

Since it might be a bit confusing to calculate the expected data, I’ll provide you with it! Try out the Chi-Squared Test for Goodness-of-fit yourself!

Expected Data
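If you want to reproduce those expected counts yourself, one way (sketched below) is to multiply each hypothesized probability by the total number of matches; the probabilities 0.25, 0.35, 0.25 and 0.15 are an assumption inferred from the expected counts used in the answer code, not values taken from the original table.

import numpy as np

total_matches = 200
hypothesized_probs = np.array([0.25, 0.35, 0.25, 0.15])  # assumed theoretical probabilities per goal category

expected = hypothesized_probs * total_matches
print(expected)  # [50. 70. 50. 30.]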

Answer

import numpy as np
from scipy.stats import chisquare

# Observed data
observed = np.array([50, 70, 60, 20])

# Expected data based on theoretical probabilities
expected = np.array([50, 70, 50, 30])

# Perform the Chi-Squared Goodness-of-Fit Test
chi2, p_value = chisquare(f_obs=observed, f_exp=expected)

# Print results
print("Chi-Squared Statistic:", chi2)
print("P-Value:", p_value)

# Chi-Squared Statistic: ≈ 5.33
# P-Value: ≈ 0.15
# Fail to reject the null hypothesis: The observed distribution matches the expected distribution.

Summary

Using soccer as a case study, we got to see how Chi-Squared Test for Independence can reveal relationships between variables, such as playing styles and match outcomes, enabling coaches to strategize more effectively.

Similarly, the Chi-Squared Test for Homogeneity helps compare distributions across groups, like penalty kick success rates in different tournaments.

And the Chi-Squared Goodness-of-Fit Test? It allows us to evaluate whether real-world data, such as goals scored in matches, aligns with a theoretical model, like the Poisson distribution, giving us better insights into game patterns and possible questions for further analysis.

I personally thought it was pretty confusing to understand the differences between the various types of Chi-Squared tests when I was learning them. My goal was to get these cleared up for you so you don’t struggle like I did (Well… I hope this article was able to do that for you).

I hope you were able to learn something!


Connect with me!

If you made it this far, I assume you are an aspiring data scientist, a teacher in the data science field, a professional looking to hone your craft, or just an avid learner in a different field! I would love to have a chat with you on anything!

For those wondering about my images: Unless otherwise noted, all images are by the author (myself)

Sunghyun Ahn – Medium

Jingle Bells and Statistical Tests
https://towardsdatascience.com/jingle-bells-and-statistical-tests-33ea90912099/

Data Types, Hypotheses and Statistical Tests That Fit Them with Festive Christmas Market Examples 🎄🎅🎡

It’s that magical time of the year. Twinkling lights and sparkling ornaments delight the eyes, while gifts, laughter, family time, and steaming cups of glühwein warm the soul. Despite the winter chill, there’s a heartwarming joy in being part of the crowd, sharing these magical moments together.

It’s a job hazard, really – after wandering through Christmas markets for three days in a row, I couldn’t help but view everything through the lens of statistical analysis. Then it hit me: why not explain statistical testing through lovely Christmas examples, to make it more enjoyable and easier to grasp? To everyone celebrating, I wish you a joyful Christmas filled with love, laughter, and of course glühwein. Happy reading!


Let’s start with a little refresher on what statistical tests actually are. They are essential tools used to make inferences about the data. It’s a little like trying to predict the crowd size at the Christmas market – we make a hypothesis and test it to see if we’re right (or totally off!). We propose a statement – a hypothesis – about a research question, and test it using appropriate techniques – statistical tests – to accept or reject it. Choosing the right statistical test depends on the data type, the distribution of the data, the sample size, and the nature of the hypothesis.

Data Types

There are four main data types which affects our decision on the chosen statistical test:

  • Nominal: Categorical data with no inherent order. In other words, no ranking involved. Think of the stall types at a Christmas market, such as food, ornament, gift etc.
  • Ordinal: Categorical data with a meaningful order. For example, the categories of visitor numbers across different Christmas markets, such as high, medium, low.
  • Interval: Numeric data with equal intervals between values, but with no true zero. Think of the weather temperature, since 0°C does not mean the absence of the temperature, just a cold spot.
  • Ratio: numeric data with true zero, where zero means the complete absence of the quantity. For example, the number of hot dogs sold or visitor number of the Christmas Market. If no one shows up, you’ve hit zero – and that’s a real zero.

Data Distribution

Data distribution refers to how values are spread or arranged across a dataset. Here’s a summary of key data distributions:

  • Normal: the distribution is symmetric, with most data points clustering around the mean. It is also called the bell curve.
  • Skewed: the distribution is skewed, so one tail is longer than the other.
  • Uniform: data points are spread evenly so all outcomes are equally likely.
  • Bimodal: the distribution has two peaks, which may indicate two underlying groups.
  • Exponential: the distribution has a high concentration of values at the lower end, but becomes sparse as values increase.

The data distribution is important to decide on whether parametric or non-parametric tests should be used. It’s like picking the right glühwein stall to get a delicious one— no one wants to wait in line at the wrong one!

Nature of Hypothesis

It refers to the type of claim or assertion being made about the data or population that is being tested in a statistical analysis. In general, hypotheses fall into two categories: the null hypothesis and the alternative hypothesis.

  • Null hypothesis: It suggests that there is no significant effect or relationship between variables in the population or data set being studied. Essentially, the null hypothesis assumes that any observed effect or difference is due to chance or random variation. For example: "The average spending per visitor is the same as it was last year."
  • Alternative Hypothesis: It asserts that there is a significant effect, difference, or relationship in the data. It reflects the researcher’s theory or belief about the relationship between variables. For example: "The average spending per visitor at the Christmas market has changed compared to last year."

Sample Size and Error Types

Sample size refers to the number of observations or data points collected in a study. The sample size influences the ability to detect a true effect or relationship in the population and the precision of the estimates. The bigger the sample, the better the test. It’s like needing more snowflakes to make a perfect snowman – smaller samples give you less power, and larger ones help reduce errors.

There are two types of error:

  • Type I error (False Positive): The probability of incorrectly rejecting the null hypothesis when it is true.
  • Type II error (False Negative): The probability of failing to reject the null hypothesis when the alternative hypothesis is true.

Larger sample sizes reduce both types of errors, but they are particularly effective in reducing Type II errors.

Central Limit Theorem states that, regardless of the population distribution, the sampling distribution of the sample mean will be approximately normally distributed when the sample size is sufficiently large (typically n > 30).

Common Statistical Tests and When to Use Them

🎅 Tests for Means

One-Sample t-Test is used to compare the mean of a sample to a known value.

  • Data Type: Interval or Ratio
  • Data Distribution: Normal distribution
  • Sample Size: Small or large (no minimum sample size requirement)
  • Hypothesis: The average spending per visitor at a Christmas market is different from last year’s average (€50). (See the sketch below.)
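A rough sketch of this test in Python; the spending data is simulated, and €50 is the hypothesized mean from the example above.

import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(42)
spending = rng.normal(loc=52, scale=10, size=40)  # simulated spending per visitor (EUR)

t_stat, p_value = ttest_1samp(spending, popmean=50)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")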

Independent t-Test is used to compare means between two independent groups.

  • Data Type: Interval or Ratio
  • Data Distribution: Normal distribution
  • Sample Size: Large (at least 30 samples per group)
  • Hypothesis: Visitors spend more at markets in large cities than in small towns.

Paired t-Test is used to compare means of the same group before and after an event.

  • Data Type: Interval or Ratio
  • Data Distribution: Normal distribution.
  • Sample Size: Small or large (no minimum sample size requirement)
  • Hypothesis: The average glühwein consumed per hour increases after live music starts at the market.

🎁 Tests for Relationships

Pearson Correlation is used to measure the strength of a linear relationship between two continuous variables.

  • Data Type: Interval or Ratio
  • Data Distribution: Normal distribution
  • Sample Size: Large (at least 30 samples per group)
  • Hypothesis: There is a correlation between the number of stalls and total market revenue.

Chi-Square Test is used to assess relationships between categorical variables.

  • Data Type: Nominal
  • Data Distribution: Non-normal or unknown
  • Sample Size: Large enough to avoid expected counts less than 5
  • Hypothesis: Visitor preferences for ornament materials (wooden, plastic, metal, fabric) are independent of the city.

Spearman Rank Correlation is used to assess relationships between ordinal variables or non-linear data.

  • Data Type: Ordinal or non-linear Interval/Ratio
  • Data Distribution: Skewed or non-normal
  • Sample Size: Small or large (no minimum sample size requirement)
  • Hypothesis: There is a relationship between the number of Santa Claus flying performances and visitor ratings.

🎇 Tests for Proportions

Z-Test for Proportions is used to compare proportions in a sample to a known proportion.

  • Data Type: Nominal
  • Data Distribution: Normal distribution
  • Sample Size: Large (at least 30 samples per group)
  • Hypothesis: The proportion of stalls selling candles is higher this year than last year.

Chi-Square Test for Independence is used to compare proportions between two or more groups.

  • Data Type: Nominal
  • Data Distribution: Non-normal or unknown
  • Sample Size: Large enough to avoid expected counts less than 5
  • Hypothesis: The proportions of crêpe stalls and currywurst stalls are similar across different cities in Germany.

🤶 Tests for Variances

F-Test is used to compare variances between two groups.

  • Data Type: Interval or Ratio
  • Data Distribution: Normal distribution
  • Sample Size: Moderate or large (no strict minimum sample size)
  • Hypothesis: The variances of total visitor numbers for Christmas markets in large and small cities are different.

Levene’s Test is used to test for equality of variances across groups.

  • Data Type: Interval or Ratio
  • Data Distribution: Non-normal or unknown
  • Sample Size: Small or large (no strict minimum sample size)
  • Hypothesis: The variances in the total amount sold at glühwein and hot cacao stalls are equal.

❄ Tests for Multiple Groups

ANOVA (Analysis of Variance) is used to compare means across three or more groups.

  • Data Type: Interval or Ratio
  • Data Distribution: Normal distribution
  • Sample Size: Large (at least 30 samples per group)
  • Hypothesis: Average spending differs across Christmas markets in Berlin, Munich, and Hamburg.

Kruskal-Wallis Test is used as a non-parametric alternative to ANOVA, used for ordinal or non-normally distributed data.

  • Data Type: Ordinal or non-normal Interval/Ratio
  • Data Distribution: Skewed or non-normal
  • Sample Size: Small or large (no strict minimum sample size)
  • Hypothesis: The median number of carousel rides is the same across Christmas Markets in Berlin, Munich, and Hamburg.

🎄 Time Series Tests

Augmented Dickey-Fuller Test is used to test for stationarity in time series data.

  • Data Type: Interval or Ratio (time series)
  • Data Distribution: Stationary or non-stationary
  • Sample Size: Large enough for reliable testing
  • Hypothesis: The daily visitor count is stable over the Christmas season at the Nuremberg Christmas Market. (See the sketch below.)
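A minimal sketch of this test in Python; the daily visitor series and its length are simulated assumptions for illustration.

import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
visitors = 5000 + rng.normal(0, 300, size=60)  # simulated daily visitor counts over the season

adf_stat, p_value, *_ = adfuller(visitors)
print(f"ADF statistic = {adf_stat:.2f}, p-value = {p_value:.3f}")
# A small p-value rejects the null of non-stationarity, i.e. the series looks stable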

And there you have it! The magic of statistical tests sprinkled with a bit of holiday cheer. May your Christmas markets be full of joy, and your data be as clear as the sparkling lights. 🎄🎀🕯

All you need to know about Non-Inferiority Hypothesis Test
https://towardsdatascience.com/all-you-need-to-know-about-non-inferiority-test-c58a74ec4cc5/

All You Need to Know About the Non-Inferiority Hypothesis Test

A non-inferiority test statistically proves that a new treatment is not worse than the standard by more than a clinically acceptable margin

Generated using Midjourney by prateekkrjain.com

While working on a recent problem, I encountered a familiar challenge – "How can we determine if a new treatment or intervention is at least as effective as a standard treatment?" At first glance, the solution seemed straightforward – just compare their averages, right? But as I dug deeper, I realised it wasn’t that simple. In many cases, the goal isn’t to prove that the new treatment is better, but to show that it’s not worse by more than a predefined margin.

This is where non-inferiority tests come into play. These tests allow us to demonstrate that the new treatment or method is "not worse" than the control by more than a small, acceptable amount. Let’s take a deep dive into how to perform this test and, most importantly, how to interpret it under different scenarios.

The Concept of Non-Inferiority Testing

In non-inferiority testing, we’re not trying to prove that the new treatment is better than the existing one. Instead, we’re looking to show that the new treatment is not unacceptably worse. The threshold for what constitutes "unacceptably worse" is known as the non-inferiority margin (Δ). For example, if Δ=5, the new treatment can be up to 5 units worse than the standard treatment, and we’d still consider it acceptable.

This type of analysis is particularly useful when the new treatment might have other advantages, such as being cheaper, safer, or easier to administer.

Formulating the Hypotheses

Every non-inferiority test starts with formulating two hypotheses:

  • Null Hypothesis (H0​): The new treatment is worse than the standard treatment by more than the non-inferiority margin Δ.
  • Alternative Hypothesis (H1​): The new treatment is not worse than the standard treatment by more than Δ.

When Higher Values Are Better:

For example, when we are measuring something like drug efficacy, where higher values are better, the hypotheses would be:

  • H0​: The new treatment is worse than the standard treatment by at least Δ (i.e., μnew − μcontrol ≤ −Δ).
  • H1​: The new treatment is not worse than the standard treatment by more than Δ (i.e., μnew − μcontrol > −Δ).

When Lower Values Are Better:

On the other hand, when lower values are better, like when we are measuring side effects or error rates, the hypotheses are reversed:

  • H0: The new treatment is worse than the standard treatment by at least Δ (i.e., μnew − μcontrol ≥ Δ).
  • H1​: The new treatment is not worse than the standard treatment by more than Δ (i.e., μnew − μcontrol < Δ).

Z-Statistic

To perform a non-inferiority test, we calculate the Z-statistic, which measures how far the observed difference between treatments is from the non-inferiority margin. Depending on whether higher or lower values are better, the formula for the Z-statistic will differ.

  • When higher values are better: Z = (δ + Δ) / SE(δ)
  • When lower values are better: Z = (δ − Δ) / SE(δ)

where δ is the observed difference in means between the new and standard treatments, and SE(δ) is the standard error of that difference.

Calculating P-Values

The p-value tells us whether the observed difference between the new treatment and the control is statistically significant in the context of the non-inferiority margin. Here’s how it works in different scenarios:

  • When higher values are better, we calculate p = 1 − P(Z ≤ calculated Z) as we are testing if the new treatment is not worse than the control (one-sided upper-tail test).

  • When lower values are better, we calculate p = P(Z ≤ calculated Z) since we are testing whether the new treatment has lower (better) values than the control (one-sided lower-tail test).

Understanding Confidence Intervals

Along with the p-value, confidence intervals provide another key way to interpret the results of a non-inferiority test.

  • When higher values are preferred, we focus on the lower bound of the confidence interval. If it’s greater than −Δ, we conclude non-inferiority.
  • When lower values are preferred, we focus on the upper bound of the confidence interval. If it’s less than Δ, we conclude non-inferiority.

The confidence interval is calculated using the formula:

  • When higher values are preferred, the one-sided lower bound: δ − 1.645 × SE(δ)
  • When lower values are preferred, the one-sided upper bound: δ + 1.645 × SE(δ)

Calculating the Standard Error (SE)

The standard error (SE) measures the variability or precision of the estimated difference between the means of two groups, typically the new treatment and the control. It is a critical component in the calculation of the Z-statistic and the confidence interval in non-inferiority testing.

To calculate the standard error for the difference in means between two independent groups, we use the following formula:

  • Between two means: SE(δ) = √( σ_new² / n_new + σ_control² / n_control )
  • Between two proportions: SE(δ) = √( p_new(1 − p_new) / n_new + p_control(1 − p_control) / n_control )

Where:

  • σ_new and σ_control are the standard deviations of the new and control groups.
  • p_new and p_control are the proportion of success of the new and control groups.
  • n_new​ and n_control are the sample sizes of the new and control groups.

The Role of Alpha (α)

In Hypothesis Testing, α (the significance level) determines the threshold for rejecting the null hypothesis. For most non-inferiority tests, α=0.05 (5% significance level) is used.

  • A one-sided test with α=0.05 corresponds to a critical Z-value of 1.645. This value is crucial in determining whether to reject the null hypothesis.
  • The confidence interval is also based on this Z-value. For a 95% confidence interval, we use 1.645 as the multiplier in the confidence interval formula.

In simple terms, if your Z-statistic is greater than 1.645 for higher values, or less than -1.645 for lower values, and the confidence interval bounds support non-inferiority, then you can confidently reject the null hypothesis and conclude that the new treatment is non-inferior.
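Putting these pieces together, here is a minimal sketch for the case where higher values are better; the means, standard deviations, sample sizes, and margin below are made-up numbers used only for illustration.

import numpy as np
from scipy.stats import norm

# Assumed summary statistics (illustrative only)
mean_new, sd_new, n_new = 72.0, 10.0, 150
mean_control, sd_control, n_control = 73.0, 10.0, 150
margin = 5.0    # non-inferiority margin (delta)
alpha = 0.05

delta = mean_new - mean_control                              # observed difference
se = np.sqrt(sd_new**2 / n_new + sd_control**2 / n_control)  # SE of the difference in means

z = (delta + margin) / se                                    # higher values are better
p_value = 1 - norm.cdf(z)                                    # one-sided upper-tail p-value
ci_lower = delta - norm.ppf(1 - alpha) * se                  # one-sided 95% lower bound

print(f"Z = {z:.2f}, p-value = {p_value:.4f}, CI lower bound = {ci_lower:.2f}")
if z > norm.ppf(1 - alpha) and ci_lower > -margin:
    print("Conclude non-inferiority: not worse than the standard by more than the margin.")
else:
    print("Cannot conclude non-inferiority.")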

Interpretation

Let’s break down the interpretation of the Z-statistic and confidence intervals across four key scenarios, based on whether higher or lower values are preferred and whether the Z-statistic is positive or negative.

Here’s a 2×2 framework:

Conclusion

Non-inferiority tests are invaluable when you want to demonstrate that a new treatment is not significantly worse than an existing one. Understanding the nuances of Z-statistics, p-values, confidence intervals, and the role of α will help you confidently interpret your results. Whether higher or lower values are preferred, the framework we’ve discussed ensures that you can make clear, evidence-based conclusions about the effectiveness of your new treatment.

Now that you’re equipped with the knowledge of how to perform and interpret non-inferiority tests, you can apply these techniques to a wide range of real-world problems.

Happy testing!

Note: All images, unless otherwise noted, are by the author.

How to Perform A/B Testing with Hypothesis Testing in Python: A Comprehensive Guide 🚀
https://towardsdatascience.com/how-to-perform-a-b-testing-with-hypothesis-testing-in-python-a-comprehensive-guide-17b555928c7e/

A Step-by-Step Guide to Making Data-Driven Decisions with Practical Python Examples

Have you ever wondered if a change to your website or marketing strategy truly makes a difference? 🤔 In this guide, I’ll show you how to use hypothesis testing to make data-driven decisions with confidence.

In data analytics, Hypothesis Testing is frequently used when running A/B tests to compare two versions of a marketing campaign, webpage design, or product feature to make data-driven decisions.

What You’ll Learn 🧐

  • The process of hypothesis testing
  • Different types of tests
  • Understanding p-values
  • Interpreting the results of a hypothesis test

1. Understanding Hypothesis Testing 🎯

What Is Hypothesis Testing?

Hypothesis testing is a way to decide whether there is enough evidence in a sample of data to support a particular belief about the population. In simple terms, it’s a method to test if a change you made has a real effect or if any difference is just due to chance.

Key Concepts:

  • Population Parameter: A value that represents a characteristic of the entire population (e.g., the true conversion rate of a website).
  • Sample Statistic: A value calculated from a sample, used to estimate the population parameter (e.g., the conversion rate from a sample of visitors).
  • Null Hypothesis (H0​): The default assumption. It often states that there is no effect or no difference.
  • Alternative Hypothesis (H1​): Contradicts the null hypothesis. It represents the outcome you’re testing for.

⚠ Your goal is to gather enough evidence to reject the null hypothesis and thereby support the alternative hypothesis ⚠

Types of Tests

Depending on your data and what you’re testing, you can choose from several statistical tests. Here’s a quick overview:

1. Z-test

  • Purpose: Used to see if there is a significant difference between sample and population means, or between means or proportions of two samples, when the sample size is large.
  • When to Use:
  • Large sample size (n≥30)
  • Data is approximately normally distributed

Example: Checking if the average time users spend on your website is different from the industry average.

Types of Z-tests:

One-Sample Z-test for Means: Tests whether the sample mean is significantly different from a known population mean.

Two-Sample Z-test for Means: Compares the means of two independent samples.

Z-test for Proportions: Tests hypotheses about population proportions, often used when dealing with categorical data and large sample sizes.

2. T-test

  • Purpose: Used to determine if there is a significant difference between sample means when the sample size is small.
  • When to Use:
  • Small sample size (n< 30)
  • Population variance is unknown
  • Data is approximately normally distributed

Example: Comparing average sales before and after a marketing campaign when you have a small dataset.

Types of T-tests:

One-Sample T-test: Tests whether the sample mean is significantly different from a known or hypothesized population mean.

Independent Two-Sample T-test: Compares the means of two independent samples.

Paired Sample T-test: Compares means from the same group at different times (e.g., before and after treatment).

3. Chi-Square Test

  • Purpose: Used for testing relationships between categorical variables.
  • When to Use:
  • Data is in categories (like yes/no, male/female)
  • Testing for independence or goodness-of-fit

Example: Determining if customer satisfaction is related to the type of product purchased.

Types of Chi-Square Tests:

Chi-Square Test for Independence: Determines whether there is an association between two categorical variables.

Chi-Square Goodness-of-Fit Test: Determines whether sample data matches a population distribution.

4. ANOVA (Analysis of Variance)

  • Purpose: Used to compare the means of three or more groups.
  • When to Use:
  • Comparing more than two groups
  • Data is numerical and normally distributed

Example: Comparing average sales across different regions.

2. The Hypothesis Testing Process 📝

To perform a hypothesis test, follow these seven steps:

  1. State the Hypotheses
  2. Choose the Significance Level (α)
  3. Collect and Summarize the Data
  4. Select the Appropriate Test and Check Assumptions
  5. Calculate the Test Statistic
  6. Determine the p-value
  7. Make a Decision and Interpret the Results

Let’s go through each step with a practical example

3. Practical Example: A/B Testing in Marketing 📊

Scenario:

Imagine you want to know if a new version of your website (Version B) leads to a higher conversion rate than your current website (Version A). Let’s find out!

Which Test Should You Use and Why?

You’re comparing the conversion rates (proportions) of two versions of a website.

Your Goal: To determine whether Version B has a higher conversion rate than Version A.

Characteristics of Your Data:

  • Type of Data: Categorical (converted vs. not converted)
  • Sample Size: Large for both versions (nA=nB=1,000)
  • Known Parameters: Population variances are not known
  • Number of Groups: Two independent groups
  • Study Design: Samples are independent; visitors are randomly assigned to each version

Given these characteristics, the Z-test for Proportions is the right choice because:

  • You’re comparing proportions between two independent samples
  • The sample sizes are large
  • The data is categorical.

Step 1: State the Hypotheses 🧐

  • Null Hypothesis (H0): The conversion rate of Version B is less than or equal to that of Version A ❌.
  • Alternative Hypothesis (H1​): The conversion rate of Version B is greater than that of Version A ✅.

Explanation:

  • You’re testing if Version B performs better than Version A.
  • The null hypothesis assumes there’s no improvement or that Version B is worse.
  • The alternative hypothesis is what you’re hoping to find evidence for – that Version B is better.

Step 2: Choose the Significance Level (α) 🎯

Select a significance level, α, which is the probability of rejecting the null hypothesis when it’s actually true (committing a Type I error).

  • Common choices: 0.05 (5%), 0.01 (1%)
  • For our test: Let’s use α=0.05.

Explanation:

  • A 5% significance level means you’re willing to accept a 5% chance of mistakenly rejecting the null hypothesis.
  • The choice depends on how much risk you’re willing to take.

Step 3: Collect and Summarize the Data 📈

Version A (Current Website):

  • Sample size (nA​) = 1,000 visitors
  • Number of conversions (xA​) = 80
  • Conversion rate (pA) = 80 / 1,000 = 0.08 (8%)

Version B (New Website):

  • Sample size (nB) = 1,000 visitors
  • Number of conversions (xB​) = 95
  • Conversion rate (pB) = 95 / 1,000 = 0.095 (9.5%)

Step 4: Calculate the Test Statistic 🧮

Calculate the Pooled Proportion (p)

Since you’re assuming the proportions are equal under H0:

p̂ = (x_A + x_B) / (n_A + n_B) = (80 + 95) / (1,000 + 1,000) = 175 / 2,000 = 0.0875

Explanation:

  • The pooled proportion represents the overall conversion rate assuming no difference between versions.
  • It provides a common proportion for calculating the standard error under H0​.

Calculate the Standard Error (SE)

SE = √[ p̂(1 − p̂) × (1/n_A + 1/n_B) ] = √[ 0.0875 × 0.9125 × (1/1,000 + 1/1,000) ] ≈ 0.0126

Explanation:

  • The standard error measures the variability we’d expect from random sampling.

Calculate the Z-Score

Z = (p_B − p_A) / SE = (0.095 − 0.08) / 0.0126 ≈ 1.19

Explanation:

  • The Z-score measures how many standard errors the observed difference between the two sample proportions is away from zero (the expected difference under H0​).
  • Z-score: ≈ 1.19
  • Expected Difference Under H0: If H0​ is true, we expect no difference, so this is 0

Question:

  • Is being about 1.19 standard errors above the expected difference enough to consider the increase significant?

Step 5: Determine the p-value 📊

Understanding the p-value

The p-value is the probability of observing a Z-score as extreme as the one calculated (or more extreme) if the null hypothesis H0​ is true.

In our example, the p-value is the probability of getting a Z-score of 1.19 or higher if there is actually no difference between the versions.

For a right-tailed test (since H1: pB>pA), the p-value is:

p-value = 1 − P(Z ≤ 1.19)

Calculating the p-value:

Using a standard normal distribution table or calculator:

  • p-value = 1 − P(Z ≤ 1.19) ≈ 1 − 0.883 = 0.12

Explanation:

  • There’s about a 12% chance of getting a Z-score of 1.19 or higher if H0 is true.
  • This means the observed difference could easily happen by chance.

Important Note: We won’t dive into the mathematical details of calculating p-values in this tutorial. Instead, we’ll use Python’s statistical libraries to compute them for you or refer to standard probability distribution tables.
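For instance, a minimal sketch of that lookup with SciPy, using the Z-score computed above:

from scipy.stats import norm

z = 1.19                   # Z-score from Step 4
p_value = 1 - norm.cdf(z)  # right-tailed test
print(round(p_value, 3))   # ~0.117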

Step 6: Make a Decision and Interpret the Results 🧐

To make a decision, we have to compare our p-value and our α.

Why Compare p-value to Significance Level (α)?

  • Significance Level (α): The threshold we set for deciding whether an observed result is statistically significant. Commonly set at 5% (0.05).
  • If p-value ≤ α : Reject H0​; the result is statistically significant.
  • If p-value > α: Fail to reject H0; not enough evidence to conclude a significant effect.

Decision:

  • p-value: 0.12
  • Significance Level (α): 0.05

Since p-value > α, we fail to reject the null hypothesis.

Interpretation:

  • There’s not enough evidence to say that Version B is better than Version A at the 5% significance level.
  • The observed increase might just be due to random chance.
  • 12% > 5% there is 12% of chance to observe Z>1,15 under the H0 (even if actually there is no difference between the 2 versions).
  • Imagine Repeating the Experiment Many Times: If there is truly no difference between Version A and Version B, and you repeat the experiment many times, about 12% of the time, you’d observe a difference as big as 1.5 (or more) just by chance.

Quick Python Tutorial: A/B Testing in Action 🐍

Let’s bring everything we’ve learned to life with a practical example using Python! Our imaginary company called TechGear, wants to test a new feature on their website.

Scenario:

TechGear is an online retailer specializing in tech gadgets. They want to know if adding a new recommendation engine (Version B) increases the purchase rate compared to their current website (Version A).

Step 1: Import Libraries 📚

First, you’ll need to import the necessary Python libraries.

import numpy as np
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest
import matplotlib.pyplot as plt
import seaborn as sns

Step 2: Generate Synthetic Data 🎲

We’ll create a dataset that simulates user behavior on both versions of the website.

# Set the random seed for reproducibility
np.random.seed(42)
# Define sample sizes
n_A = 1000  # Number of visitors in Version A
n_B = 1000  # Number of visitors in Version B
# Define conversion rates
p_A = 0.08   # 8% conversion rate for Version A
p_B = 0.095  # 9.5% conversion rate for Version B
# Generate conversions (1 = purchase, 0 = no purchase)
conversions_A = np.random.binomial(1, p_A, n_A)
conversions_B = np.random.binomial(1, p_B, n_B)
# Create DataFrames
data_A = pd.DataFrame({'version': 'A', 'converted': conversions_A})
data_B = pd.DataFrame({'version': 'B', 'converted': conversions_B})
# Combine data
data = pd.concat([data_A, data_B]).reset_index(drop=True)

Step 3: Summarize the Data 📊

Let’s see how our data looks.

# Calculate the number of conversions and total observations for each version
summary = data.groupby('version')['converted'].agg(['sum', 'count'])
summary.columns = ['conversions', 'total']
print(summary)

Step 4: Perform the Z-test 🧪

We’ll use the proportions_ztest function to perform the hypothesis test.

# Number of successes (conversions) and number of trials (visitors)
conversions = [summary.loc['A', 'conversions'], summary.loc['B', 'conversions']]
nobs = [summary.loc['A', 'total'], summary.loc['B', 'total']]
# Perform the z-test
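# alternative='smaller' tests H1: p_A < p_B, i.e. Version B has the higher conversion rate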
stat, p_value = proportions_ztest(count=conversions, nobs=nobs, alternative='smaller')
print(f"Z-statistic: {stat:.4f}")
print(f"P-value: {p_value:.4f}")

Explanation:

  • Z-statistic: Tells us how many standard errors our observed difference is from zero (the difference expected under the null hypothesis).
  • P-value: The probability of observing such a result if the null hypothesis is true.

Step 5: Interpret the Results 🧐

Let’s interpret the output.

alpha = 0.05  # Significance level
if p_value < alpha:
    print("We reject the null hypothesis. Version B has a higher conversion rate! 🎉 ")
else:
    print("We fail to reject the null hypothesis. No significant difference detected. 🤔")
  • If you rejected H0​: The new recommendation engine is effective! You might consider rolling it out to all users.
  • If you failed to reject H0​: The new feature didn’t make a significant impact. You may want to test other ideas.

Additional Notes 📝

  • Randomness: Since we’re generating random data, results may vary each time you run the code.
  • Repeat Testing: In real-world scenarios, consider running the test with more users or over a longer period to gather more data.

Congratulations! You’ve just performed an A/B test using Python. 🎉


Key Points to Remember 🍀

  • Hypothesis Testing Basics: State the Null Hypothesis (H0) and the Alternative Hypothesis (H1).
  • Goal: Determine if there’s enough evidence to reject H0​.
  • Types of Tests: Choose the appropriate statistical test based on data type and sample size.
  • Hypothesis Testing Steps:
1. State the Hypotheses
2. Choose the Significance Level (α)
3. Collect and Summarize Data
4. Select the Appropriate Test
5. Calculate the Test Statistic
6. Determine the p-value
7. Make a Decision
  • Set Clear Objectives: Know exactly what you’re testing and why.
  • Randomize Assignments: Randomly assign users to control and test groups to avoid bias.
  • Use Sufficient Sample Size: Larger samples yield more reliable results.
  • Control Variables: Keep other factors constant to isolate the effect.
  • Run the Test Long Enough: Ensure the test duration captures typical user behavior.
  • Consider Practical Impact: Evaluate if the detected difference is meaningful for your business.
  • Communicate Clearly: Present results in a straightforward way for stakeholders.
  • Iterate and Learn: Use findings to inform future tests and improvements.

You made it to the end – congrats! 🎉 I hope you enjoyed this article. If you found it helpful, please consider leaving a like and following me. I will regularly write about demystifying machine learning algorithms, clarifying Statistics concepts, and sharing insights on deploying ML projects into production.

The post How to Perform A/B Testing with Hypothesis Testing in Python: A Comprehensive Guide 🚀 appeared first on Towards Data Science.

]]>
t-Test : From Application to Theory https://towardsdatascience.com/t-test-from-application-to-theory-5e5051b0f9dc/ Fri, 12 Jul 2024 06:15:46 +0000 https://towardsdatascience.com/t-test-from-application-to-theory-5e5051b0f9dc/ An overview of the statistical tools, the motivation behind using these tools, and an explanation of the results and their interpretation

The post t-Test : From Application to Theory appeared first on Towards Data Science.

]]>
t-Test : From Application to Theory

To bridge the gap between mathematical computations and programmatic implementation of a two-sample t-Test, this article provides a step-by-step guide using a practical use case. It includes an overview of the statistical tools, the motivation behind using these tools, and an explanation of the results and their interpretation.

Photo by Steve Mushero on Unsplash

Have you ever been stuck in a loop where you repeatedly go over the concepts of statistical tools, memorize them, and revisit them, but the concepts still don’t stick? You know how to use formulas but feel like that just gives surface-level knowledge of the concept. I was in the same boat until I TA’d for a physics lab course 105M at UT Austin and applied statistical tools relevant to the problems we were solving. It was then that I finally understood the theory and application of the Student’s t-Test, and now it truly sticks.

Let’s start with a question.

Does Color Affect the Rolling Time of Two Similar Balls Rolled Down a Ramp?

Intuitively, the answer to this question might be NO! But let’s validate this hypothesis using prevalent statistical tests. In hypothesis testing terms :

  • Null Hypothesis would be that color doesn’t affect the rolling time (no effect)
  • Alternative Hypothesis would suggest that color does affect the rolling time (there is an effect).
Experiment illustration - Two similar balls of different colors rolling down a ramp

Data

We start by measuring the time taken by two similar balls of different colors (for instance, one black and one red) as we roll them down the ramp one by one, in multiple trials (let’s say 10 trials each).

The variation in rolling times across trials highlights the importance of conducting multiple trials rather than a single one, which helps provide a more reliable estimate.

It is also important to note that there can be many possible values (population) for the estimated rolling time, but we are capturing only a sample of these values with limited trials.

Best Estimate

Next, we calculate the expected value or the best estimate of the rolling time for each ball. We assume that the time recordings from different trials form a random distribution, and the expected value is best represented by the mean or the average value of the distribution.

Mathematical computation for Best Estimate/Mean of the two distributions (units in seconds)

Standard Error

As mentioned earlier, we gathered limited data (a sample) from just 10 trials out of all possible values (the population). Note that the calculated best estimate comes from the sample distribution. To quantify how far this sample mean is likely to be from the population mean, we calculate the standard error. The standard error helps us determine the range within which the population mean is likely to fall. It is based on the variance of the distribution, which indicates how dispersed the distribution is around the mean.

To calculate the standard error, first find the standard deviation (square root of variance), then divide it by the square root of the number of data points.

Mathematical computation for the Standard Deviation and Standard Error of the distributions (units in seconds)
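For readers who want to mirror these hand calculations in code, here is a minimal NumPy sketch. The arrays below are hypothetical stand-ins for the ten recorded trials per ball (the actual measurements appear only in the figures), so the printed numbers will not match the figures exactly.

import numpy as np

# Hypothetical rolling-time recordings in seconds (10 trials per ball)
black_times = np.array([1.52, 1.48, 1.50, 1.55, 1.47, 1.51, 1.49, 1.53, 1.50, 1.46])
red_times = np.array([1.50, 1.54, 1.49, 1.52, 1.48, 1.53, 1.51, 1.47, 1.55, 1.50])

for label, times in [("black", black_times), ("red", red_times)]:
    mean = times.mean()                 # best estimate
    std = times.std(ddof=1)             # sample standard deviation
    se = std / np.sqrt(len(times))      # standard error of the mean
    print(f"{label}: mean={mean:.3f} s, std={std:.3f} s, SE={se:.3f} s")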

We observe that the best estimates and the standard errors for both balls are comparable (the calculated ranges overlap), prompting us to consider that the distributions are similar and therefore that color may not affect the rolling time of the balls. However, how statistically significant and reliable are these findings? In essence, do these values provide sufficient evidence for us to draw conclusions about the hypothesis?

To measure the certainty about our results and present evidence in a more communicable way, we use test statistics. These statistics help us measure the probability of obtaining such results, providing a measure of certainty. For instance, we use the z-statistic if the population standard deviation is known and the t-statistic if only the sample standard deviation is known, as in our experiment.

T-statistic

We compare our two sample distributions (groups) using a Two-Sample t-Test, which relies on the best estimates and variances of the two groups. Depending on the similarity of the variances between the two groups, we decide between using pooled variance, as in Student’s t-Test for equal variances, or Welch’s t-Test, which is for unequal variances.

Using statistical tests such as the F-test or Levene’s test, we can assess the equality of variances.

Since the calculated standard deviations (square root of variance) of both distributions are very similar, we proceed with a Student’s t-Test for equal variances. We conduct a two-tailed test to check for inequality of distributions rather than specifically looking for lesser or greater values.

We use the pooled standard deviation along with the averages obtained from our two distributions to calculate the t-score.

Mathematical computation for Two sample Student's t-Test for the two distributions
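The same comparison can be run with SciPy, which implements the pooled-variance Student's t-Test when equal_var=True. This sketch reuses the hypothetical arrays from the NumPy snippet above, so it illustrates the mechanics rather than reproducing the exact t-score of about -0.38.

from scipy import stats

# Two-sample Student's t-Test with pooled variance (equal_var=True)
t_stat, p_value = stats.ttest_ind(black_times, red_times, equal_var=True)
print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.3f}")

# Critical values for a two-tailed test at the 95% confidence level, df = n1 + n2 - 2 = 18
t_crit = stats.t.ppf(0.975, df=18)
print(f"critical range: -{t_crit:.3f} to +{t_crit:.3f}")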

Result Interpretation

As we observed, the t-statistic is based on the difference between the means of the two samples. In our case, the t-statistic is small in magnitude (about -0.38), indicating that the difference between the means of the two distributions is also very small. This suggests that the recordings for the two balls are similar, hinting at the overall conclusion that color has no significant effect on the rolling time.

However, interpreting the t-statistic involves more than just observing the small difference in means, especially since we compared only two samples (limited trials) and not the entire populations. To make an informed inference, we need to determine the critical value and then compare our t-statistic with it.

The critical value is determined based on the confidence interval (e.g., 95%) and the sample sizes (degrees of freedom). A 95% confidence interval (CI) means that if the experiment is repeated several times, the true mean difference will fall within 95% of the calculated intervals.

To find the critical value or critical value range in our case (since we are checking for inequality), we use the t-distribution table. For a two-tailed test with a 95% CI, we look at the 0.05 significance level, which splits into 2.5% for each tail. Given our degrees of freedom (df = 18), the critical value range is approximately -2.101 to +2.101.

Two-tailed t-distribution for a 95% confidence interval and 18 degrees of freedom

Our t-statistic of -0.38 falls within the critical range for a 95% confidence interval, leading to two key inferences. Firstly, the observed difference in means between the rolling times of the red and black balls is very small, indicating that color has no effect on rolling time. Secondly, with 95% certainty, if we were to repeat this experiment multiple times, the true difference in means between the rolling times of the red and black balls would consistently fall within this range.

Therefore, our results showing a low difference between the means of recording times for the two balls are statistically significant and reliable at the 95% confidence level, suggesting no meaningful difference in rolling times based on ball color.

I am excited to have documented my understanding in the hope of assisting others who may have struggled, like me, with grasping these statistical tools. I look forward to seeing others implement these methods. Please feel free to reach out or refer to the references mentioned below for any unanswered questions.

Unless otherwise noted, all images are by the author.

References:

Student’s t-test – Wikipedia

Standard error – Wikipedia

numpy.std – NumPy v2.0 Manual

ttest_ind – SciPy v1.14.0 Manual

The post t-Test : From Application to Theory appeared first on Towards Data Science.

]]>
Chi-Squared Test: Revealing Hidden Patterns in Your Data https://towardsdatascience.com/chi-squared-test-revealing-hidden-patterns-in-your-data-d939df2dda71/ Tue, 25 Jun 2024 04:32:41 +0000 https://towardsdatascience.com/chi-squared-test-revealing-hidden-patterns-in-your-data-d939df2dda71/ Unlock Hidden Patterns in Your Data with the Chi-Squared Test in Python.

The post Chi-Squared Test: Revealing Hidden Patterns in Your Data appeared first on Towards Data Science.

]]>
Unlock hidden patterns in your data with the chi-squared test in Python.
Cover Photo by Sulthan Auliya on Unsplash

Part 1: What is Chi-Squared Test?

When discussing Hypothesis Testing, there are many approaches we can take, depending on the particular case. Common tests like the z-test and t-test are the go-to methods for testing our hypotheses (null and alternative). The metric we want to test differs depending on the problem. Usually, in generating hypotheses, we use the population mean or the population proportion as the metric. Let's say we want to test whether the population proportion of students who scored at least 75 on a math test is greater than 80%. Let the null hypothesis be denoted by H0 and the alternative hypothesis by H1; we generate the hypotheses as follows:

Figure 1: Example of generating hypotheses by Author

After that, we examine our data to see whether the population variance is known or unknown, which determines the test statistic formula to use. In this case, we use the z-statistic formula for a proportion. To calculate the test statistic from our sample, we first estimate the population proportion by dividing the number of students who got 75 by the total number of students who took the test. We then plug the estimated proportion into the test statistic formula. Finally, we decide whether to reject or fail to reject the null hypothesis by comparing the test statistic with the rejection region or the p-value.

But what if we want to test different cases? What if we want to make inferences about the proportions across groups of students (e.g., classes A, B, C, etc.) in our dataset? What if we want to test whether there is any association between the group a student belongs to and their preparation before the exam (whether or not they took extra courses outside school)? Is it independent or not? What if we want to test categorical data and make inferences about its population? To test that, we use the chi-squared test.

The chi-squared test is designed to help us draw conclusions about categorical data that fall into different categories. It compares each category's observed frequencies (counts) to the expected frequencies under the null hypothesis. Denoted X², the test statistic follows the chi-squared distribution, which allows us to determine the significance of the observed deviations from the expected values.

Figure 2: Chi-Squared Distribution made in Matplotlib by Author

The plot shows the chi-squared density for several degrees of freedom. In the chi-squared test, we decide whether to reject or fail to reject the null hypothesis using the chi-squared table rather than the z or t table. It lists critical values for selected significance levels and degrees of freedom. There are two types of chi-squared tests: the chi-squared goodness-of-fit test and the chi-squared test of a contingency table. Each has a different purpose in hypothesis testing. Alongside the theory for each test, I'll show you how to work through practical examples.

Part 2: Chi-squared goodness-of-fit test

This is the first type of chi-squared test. It analyzes data from a single categorical variable with k categories and is used to test hypotheses about the proportion of observations in each category of the population. For example, we surveyed 1000 students who got at least 75 on their math test. We observed that across 5 groups of students (Class A to E), the distribution looks like this:

Figure 3: Dummy data generated randomly by Author

We will do it in both manual and Python ways. Let’s start with the manual one.

Form Hypotheses

As we know, we have already surveyed 1000 students. I want to test whether the population proportions in each class are equal. The hypotheses will be:

Figure 4: Hypotheses of Students who at least got 75 from 5 classes by Author

Test Statistic

The test statistic formula for the chi-squared goodness-of-fit test is like this:

Figure 5: The Chi-squared goodness-of-fit test by Author

Where:

  • k: number of categories
  • fi: observed counts
  • ei: expected counts
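Written out in plain notation, this is the standard goodness-of-fit statistic that Figure 5 depicts:

\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}

The larger the gap between observed and expected counts, the larger the statistic.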

We already have the number of categories (5 from Class A to E) and the observed counts, but we don’t have the expected counts yet. To calculate that, we should reflect on our hypotheses. In this case, I assume that all class proportions are the same, which is 20%. We will make another column in the dataset named Expected. We calculate it by multiplying the total number of observations by the proportion we choose:

Figure 6: Calculate expected Counts by Author

Now we plug in the formula like this for each observed and expected value:

Figure 7: Calculate Test Statistic of goodness-of-fit test by Author

We now have the test statistic. But how do we decide whether to reject or fail to reject the null hypothesis?

Decision Rule

As mentioned above, we'll compare the test statistic against the chi-squared table. Remember that a small test statistic supports the null hypothesis, whereas a large test statistic supports the alternative hypothesis. So we reject the null hypothesis when the test statistic is large (this is an upper-tailed test). Because we are doing this manually, we use the rejection region to decide whether to reject or fail to reject the null hypothesis. The rejection region is defined as follows:

Figure 8: Rejection Region of goodness-of-fit test by Author

Where:

  • α: Significance Level
  • k: number of categories

The rule of thumb is: if our test statistic is larger than the chi-squared table value, we reject the null hypothesis. We'll use a significance level of 5% and look up the chi-squared table. For a 5% significance level and 4 degrees of freedom (five categories minus 1), the table value is 9.49. Because our test statistic is much larger than the table value (70.52 > 9.49), we reject the null hypothesis at a 5% significance level. Now you know how to perform the chi-squared goodness-of-fit test!
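If you'd rather not read the table by hand, both the statistic and the critical value can be checked in a few lines of SciPy. This sketch uses the observed counts that also appear in the Python example below and assumes equal expected counts of 200 per class.

import numpy as np
from scipy.stats import chi2

observed = np.array([157, 191, 186, 163, 303])  # observed counts per class
expected = np.full(5, 200)                      # 1000 students x 20% per class

chi2_stat = ((observed - expected) ** 2 / expected).sum()
critical_value = chi2.ppf(0.95, df=4)           # 5% significance level, k - 1 = 4 degrees of freedom

print(f"Test statistic: {chi2_stat:.2f}")       # 70.52
print(f"Critical value: {critical_value:.2f}")  # 9.49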

Python Approach

This is the Python approach to the chi-squared goodness-of-fit test using SciPy:

import pandas as pd
from scipy.stats import chisquare

# Define the student data
data = {
    'Class': ['A', 'B', 'C', 'D', 'E'],
    'Observed': [157, 191, 186, 163, 303]
}

# Transform dictionary into dataframe
df = pd.DataFrame(data)

# Define the null and alternative hypotheses
null_hypothesis = "p1 = 20%, p2 = 20%, p3 = 20%, p4 = 20%, p5 = 20%"
alternative_hypothesis = "The population proportions do not match the given proportions"

# Calculate the total number of observations and the expected count for each category
total_count = df['Observed'].sum()
expected_count = total_count / len(df)  # As there are 5 categories

# Create a list of observed and expected counts
observed_list = df['Observed'].tolist()
expected_list = [expected_count] * len(df)

# Perform the Chi-Squared goodness-of-fit test
chi2_stat, p_val = chisquare(f_obs=observed_list, f_exp=expected_list)

# Print the results
print(f"nChi2 Statistic: {chi2_stat:.2f}")
print(f"P-value: {p_val:.4f}")

# Print the conclusion
if p_val < 0.05:
    print("Reject the null hypothesis: The population proportions do not match the given proportions.")
else:
    print("Fail to reject the null hypothesis: The population proportions match the given proportions.")

Using the p-value, we also got the same result. We reject the null hypothesis at a 5% significance level.

Figure 9: Result of goodness-of-fit test using Python by Author

Part 3: Chi-squared test of a contingency table

We already know how to make inferences about the proportion of one categorical variable. But what if I want to test whether two categorical variables are independent?

To test that, we use the chi-squared test of the contingency table. We will utilize the contingency table to calculate the test statistic value. A contingency table is a cross-tabulation table that classifies counts summarizing the combined distribution of two categorical variables, each having a finite number of categories. From this table, you can determine if the distribution of one categorical variable is consistent across all categories of the other categorical variable.
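If your raw data comes as one row per student rather than a pre-built table, pandas can produce the contingency table for you. This is a small sketch with made-up column names and values, not the survey data used below.

import pandas as pd

# Hypothetical row-per-student records
df = pd.DataFrame({
    'Class': ['A', 'A', 'B', 'C', 'B', 'A'],
    'Course': ['Taken', 'Not Taken', 'Taken', 'Taken', 'Not Taken', 'Taken']
})

# Cross-tabulate the counts of Class vs. Course
contingency_table = pd.crosstab(df['Class'], df['Course'])
print(contingency_table)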

I will explain how to do it manually and using Python. In this example, we sampled 1000 students who got at least 75 on their math test. I want to test whether the group a student belongs to and whether they took a supplementary course outside school before the test (Taken or Not) are independent. The distribution looks like this:

Figure 10: Dummy data of contingency table generated randomly by Author

Form Hypotheses

To generate these hypotheses is very simple. We define the hypotheses as:

Figure 11: Generate hypotheses of contingency table test by Author

Test Statistic

This is the hardest part. In handling real data, I suggest you use Python or other statistical software directly because the calculation is too complicated if we do it manually. But because we want to know the approach from the formula, let’s do the manual calculation. The test statistic of this test is:

Figure 12: The Chi-squared contingency table formula by Author

Where:

  • r = number of rows
  • c = number of columns
  • fij: the observed counts
  • eij = (i-th row total * j-th column total) / sample size
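Restating Figure 12 and the expected counts in plain notation:

\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(f_{ij} - e_{ij})^2}{e_{ij}}, \qquad e_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{n}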

Recall Figure 10; the values there are only the observed counts. Before we use the test statistic formula, we should calculate the expected counts. We do that as follows:

Figure 13: Expected Counts of the contingency table by Author

Now we get the observed and expected counts. After that, we will calculate the test statistic by:

Figure 14: Calculate Test Statistic of contingency table test by Author

Decision Rule

We already have the test statistic; now we compare it with the rejection region. The rejection region for the contingency table test is defined by:

Figure 15: Rejection Region of contingency table test by Author

Where:

  • α: Significance Level
  • r = number of rows
  • c = number of columns

The rule of thumb is the same as for the goodness-of-fit test: if our test statistic is larger than the chi-squared table value, we reject the null hypothesis. We will use a significance level of 5%. Because the table has 5 rows and 2 columns, we look up the chi-squared value at a 5% significance level with (5-1) * (2-1) = 4 degrees of freedom, which gives 9.49. Because the test statistic is larger than the table value (22.9758 > 9.49), we reject the null hypothesis at a 5% significance level.

Python Approach

This is the Python approach to the chi-squared contingency table test using SciPy:

import pandas as pd
from scipy.stats import chi2_contingency

# Create the dataset
data = {
    'Class': ['group A', 'group B', 'group C', 'group D', 'group E'],
    'Taken Course': [91, 131, 117, 75, 197],
    'Not Taken Course': [66, 60, 69, 88, 106]
}

# Create a DataFrame
df = pd.DataFrame(data)
df.set_index('Class', inplace=True)

# Perform the Chi-Squared test for independence
chi2_stat, p_val, dof, expected = chi2_contingency(df)

# Print the results
print("Expected Counts:")
print(pd.DataFrame(expected, index=df.index, columns=df.columns))
print(f"nChi2 Statistic: {chi2_stat:.4f}")
print(f"P-value: {p_val:.4f}")

# Print the conclusion
if p_val < 0.05:
    print("nReject the null hypothesis: The variables are not independent")
else:
    print("nFail to reject the null hypothesis: The variables are independent")

Using the p-value, we also got the same result. We reject the null hypothesis at a 5% significance level.

Figure 16: Result of contingency table test using Python by Author

Now that you understand how to conduct hypothesis tests using the chi-square test method, it’s time to apply this knowledge to your own data. Happy experimenting!

Part 4: Conclusion

The chi-squared test is a powerful statistical method that helps us understand the relationships and distributions within categorical data. Forming the problem and proper hypotheses before jumping into the test itself is crucial. A large sample is also vital in conducting a chi-squared test; for instance, it works well for sizes down to 5,000 (Bergh, 2015), as small sample sizes can lead to inaccurate results. To interpret results correctly, choose the right significance level and compare the chi-square statistic to the critical value from the chi-square distribution table or the p-value.

Reference

  • G. Keller, Statistics for Management and Economics, 11th ed., Chapter 15, Cengage Learning (2017).
  • Daniel, Bergh. (2015). Chi-Squared Test of Fit and Sample Size-A Comparison between a Random Sample Approach and a Chi-Square Value Adjustment Method.. Journal of applied measurement, 16(2):204–217.

The post Chi-Squared Test: Revealing Hidden Patterns in Your Data appeared first on Towards Data Science.

]]>
Welch’s t-Test: The Reliable Way to Compare 2 Population Means with Unequal Variances https://towardsdatascience.com/welchs-t-test-the-reliable-way-to-compare-2-population-means-with-unequal-variances-bbf7a62967bc/ Fri, 14 Jun 2024 15:55:59 +0000 https://towardsdatascience.com/welchs-t-test-the-reliable-way-to-compare-2-population-means-with-unequal-variances-bbf7a62967bc/ Discover why Welch's t-Test is the go-to method for accurate statistical comparison, even when variances differ.

The post Welch’s t-Test: The Reliable Way to Compare 2 Population Means with Unequal Variances appeared first on Towards Data Science.

]]>
Photo by Simon Maage on Unsplash

Part 1: Background

In the first semester of my postgrad, I had the opportunity to take the course STAT7055: Introductory Statistics for Business and Finance. Throughout the course, I definitely felt a bit exhausted at times, but the amount of knowledge I gained about the application of various statistical methods in different situations was truly priceless. During the 8th week of lectures, something really interesting caught my attention, specifically the concept of Hypothesis Testing when comparing two populations. I found it fascinating to learn how the approach differs based on whether the samples are independent or paired, what to do when we know or don't know the population variances, and how to conduct hypothesis testing for two proportions. However, one aspect wasn't covered in the material, and it kept me wondering how to tackle it: performing hypothesis testing on two population means when the variances are unequal, which is where the Welch t-Test comes in.

To grasp the concept of how the Welch t-Test is applied, we can explore a dataset for the example case. Each stage of this process involves utilizing the dataset from real-world data.

Part 2: The Dataset

The dataset I’m using contains real-world data on World Agricultural Supply and Demand Estimates (WASDE) that are regularly updated. The WASDE dataset is put together by the World Agricultural Outlook Board (WAOB). It is a monthly report that provides annual predictions for various global regions and the United States when it comes to wheat, rice, coarse grains, oilseeds, and cotton. Furthermore, the dataset also covers forecasts for sugar, meat, poultry, eggs, and milk in the United States. It is sourced from the Nasdaq website, and you are welcome to access it for free here: WASDE dataset. There are 3 datasets, but I only use the first one, which is the Supply and Demand Data. Column definitions can be seen here:

Figure 1: Column Definitions by NASDAQ

I am going to use two different samples from specific regions, commodities, and items to simplify the testing process. Additionally, we will be using the R Programming Language for the end-to-end procedure.

Now let’s do a proper data preparation:

library(dplyr)

# Read and preprocess the dataframe
wasde_data <- read.csv("wasde_data.csv") %>%
  select(-min_value, -max_value, -year, -period) %>%
  filter(item == "Production", commodity == "Wheat")

# Filter data for Argentina and Australia
wasde_argentina <- wasde_data %>%
  filter(region == "Argentina") %>%
  arrange(desc(report_month))

wasde_oz <- wasde_data %>%
  filter(region == "Australia") %>%
  arrange(desc(report_month))

I split the data into two samples by region, namely Argentina and Australia, and the focus is production of the wheat commodity.

Now we’re set. But wait..

Before delving further into the application of the Welch t-Test, I can’t help but wonder why it is necessary to test whether the two population variances are equal or not.

Part 3: Testing Equality of Variances

When conducting hypothesis testing to compare two population means without knowledge of the population variances, it’s crucial to confirm the equality of variances in order to select the appropriate statistical test. If the variances turn out to be the same, we opt for the pooled variance t-test; otherwise, we can use Welch’s t-test. This important step guarantees the precision of the outcomes, since using an incorrect test could result in wrong conclusions due to higher risks of Type I and Type II errors. By checking for equality in variances, we make sure that the hypothesis testing process relies on accurate assumptions, ultimately leading to more dependable and valid conclusions.

Then how do we test the two population variances?

We have to generate two hypotheses as below:

Figure 2: null and alternative hypotheses for testing equality variances by author

The rule of thumb is very simple:

  1. If the test statistic falls into the rejection region, we reject H0 (the null hypothesis).
  2. Otherwise, we fail to reject H0 (the null hypothesis).

We can set the hypotheses like this:

# Hypotheses: Variance Comparison
h0_variance <- "Population variance of Wheat production in Argentina equals that in Australia"
h1_variance <- "Population variance of Wheat production in Argentina differs from that in Australia"

Now we should do the test statistic. But how do we get this test statistic? we use F-Test.

An F-test is any statistical test used to compare the variances of two samples, or the ratio of variances across multiple samples. Under the null hypothesis, and the usual assumptions about the error term, the test statistic F follows an F-distribution.

Figure 3: Illustration Probability Density Function (PDF) of F Distribution by Wikipedia

We can compute the test statistic by dividing the two sample variances:

Figure 4: F test formula by author

and the rejection region is:

Figure 5: Rejection Region of F test by author

where n is the sample size and α is the significance level. When the F value falls into either rejection region, we reject the null hypothesis.
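Restating Figures 4 and 5 in plain notation, with F_{q, d1, d2} denoting the q-th quantile of the F distribution with d1 and d2 degrees of freedom:

F = \frac{s_1^2}{s_2^2}, \qquad \text{reject } H_0 \text{ if } F < F_{\alpha/2,\, n_1 - 1,\, n_2 - 1} \text{ or } F > F_{1 - \alpha/2,\, n_1 - 1,\, n_2 - 1}

This mirrors the lower and upper cut-offs computed with qf() in the R code below.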

But there is a trick:

The labeling of sample 1 and sample 2 is arbitrary, so make sure to place the larger sample variance on top every time. This way, our F-statistic will consistently be greater than 1, and we only need to refer to the upper cut-off to reject H0 at significance level α.

We can do this as follows:

# Calculate sample variances
sample_var_argentina <- var(wasde_argentina$value)
sample_var_oz <- var(wasde_oz$value)

# Calculate F calculated value
f_calculated <- sample_var_argentina / sample_var_oz

We'll use a 5% significance level (0.05), so the decision rule is:

# Define significance level and degrees of freedom
alpha <- 0.05
alpha_half <- alpha / 2
n1 <- nrow(wasde_argentina)
n2 <- nrow(wasde_oz)
df1 <- n1 - 1
df2 <- n2 - 1

# Calculate critical F values
f_value_lower <- qf(alpha_half, df1, df2)
f_value_upper <- qf(1 - alpha_half, df1, df2)

# Variance comparison result
if (f_calculated > f_value_lower & f_calculated < f_value_upper) {
  cat("Fail to Reject H0: ", h0_variance, "n")
  equal_variances <- TRUE
} else {
  cat("Reject H0: ", h1_variance, "n")
  equal_variances <- FALSE
}

The result: we reject the null hypothesis at the 5% significance level. In other words, based on this test we believe the population variances of the two populations are not equal. Now we know why we should use the Welch t-Test instead of the pooled variance t-Test.

Part 4: The main course, Welch t-Test

The Welch t-test, also called Welch's unequal variances t-test, is a statistical method used for comparing the means of two separate samples. Unlike the standard pooled variance t-test, the Welch t-test does not assume equal variances; instead, it adjusts the degrees of freedom, which leads to a more precise evaluation of the difference between the two sample means. By not assuming equal variances, the Welch t-test offers a more dependable outcome when working with real-world data where this assumption may not hold. It is preferred for its adaptability and dependability, ensuring that conclusions drawn from statistical analyses remain valid even if the equal variances assumption is not met.

The test statistic formula is:

Figure 6: test statistic formula of Welch t-Test by author

where x̄1 and x̄2 are the sample means, s1² and s2² the sample variances, and n1 and n2 the sample sizes of the two groups.

The degrees of freedom can be defined like this:

Figure 7: Degree of Freedom formula by author
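Restating Figures 6 and 7 in plain notation, these are the standard Welch statistic and the Welch-Satterthwaite degrees of freedom:

t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}, \qquad \nu = \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}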

The rejection region for the Welch t-test depends on the chosen significance level and whether the test is one-tailed or two-tailed.

Two-tailed test: The null hypothesis is rejected if the absolute value of the test statistic |t| is greater than the critical value from the t-distribution with ν degrees of freedom at α/2.

  • ∣t∣>tα/2,ν​

One-tailed test: The null hypothesis is rejected if the test statistic t is greater than the critical value from the t-distribution with ν degrees of freedom at α for an upper-tailed test, or if t is less than the negative critical value for a lower-tailed test.

  • Upper-tailed test: t > tα,ν
  • Lower-tailed test: t < −tα,ν

So let's do an example with a one-tailed Welch t-Test.

Let's generate the hypotheses:

h0_mean <- "Population mean of Wheat production in Argentina equals that in Australia"
h1_mean <- "Population mean of Wheat production in Argentina is greater than that in Australia"

This is an upper-tailed test, so the rejection region is: t > tα,ν

Using the formula given above and the same significance level (0.05):

# Calculate sample means
sample_mean_argentina <- mean(wasde_argentina$value)
sample_mean_oz <- mean(wasde_oz$value)

# Welch's t-test (unequal variances)
s1 <- sample_var_argentina
s2 <- sample_var_oz
t_calculated <- (sample_mean_argentina - sample_mean_oz) / sqrt(s1/n1 + s2/n2)
df <- (s1/n1 + s2/n2)^2 / ((s1^2/(n1^2 * (n1-1))) + (s2^2/(n2^2 * (n2-1))))
t_value <- qt(1 - alpha, df)

# Mean comparison result
if (t_calculated > t_value) {
  cat("Reject H0: ", h1_mean, "n")
} else {
  cat("Fail to Reject H0: ", h0_mean, "n")
}

The result: we fail to reject H0 at the 5% significance level. In other words, we do not have sufficient evidence that the population mean of wheat production in Argentina is greater than that in Australia.

That's how to conduct the Welch t-Test. Now it's your turn. Happy experimenting!

Part 5: Conclusion

When comparing two population means during hypothesis testing, it is really important to start by checking if the variances are equal. This initial step is crucial as it helps in deciding which statistical test to use, guaranteeing precise and dependable outcomes. If it turns out that the variances are indeed equal, you can go ahead and apply the standard t-test with pooled variances. However, in cases where the variances are not equal, it is recommended to go with Welch’s t-test.

Welch’s t-test provides a strong solution for comparing means when the assumption of equal variances does not hold true. By adjusting the degrees of freedom to accommodate for the uneven variances, Welch’s t-test gives a more precise and dependable evaluation of the statistical importance of the difference between two sample means. This adaptability makes it a popular choice in various practical situations where sample sizes and variances can vary significantly.

In conclusion, checking for equality of variances and utilizing Welch’s t-test when needed ensures the accuracy of hypothesis testing. This approach reduces the chances of Type I and Type II errors, resulting in more reliable conclusions. By selecting the appropriate test based on the equality of variances, we can confidently analyze the findings and make well-informed decisions grounded on empirical evidence.

Resources

The post Welch’s t-Test: The Reliable Way to Compare 2 Population Means with Unequal Variances appeared first on Towards Data Science.

]]>
Mastering Statistical Tests (Part I) https://towardsdatascience.com/statistical-tests-demystified-how-to-choose-the-best-test-for-your-data-part-i-688a4b2a23b7/ Sun, 19 May 2024 19:13:31 +0000 https://towardsdatascience.com/statistical-tests-demystified-how-to-choose-the-best-test-for-your-data-part-i-688a4b2a23b7/ Your Guide to Choosing the Right Test for Your Data

The post Mastering Statistical Tests (Part I) appeared first on Towards Data Science.

]]>
Have you ever had a dataset and found yourself lost and confused about which statistical significance test is most suitable to answer your research question? Well, let me assure you, you're not alone. I was once that person! Despite my respect for Statistics, I never had a great passion for it. In this article, I will focus on unraveling some key concepts to help you make informed decisions when choosing the right statistical significance test for your data. Since statistical significance testing essentially involves dealing with variables (independent and dependent), I find it imperative to first pay a visit to the different types of those variables.

Types of data:

Photo by Claudio Schwarz on Unsplash

1- Categorical or nominal

A categorical (or nominal) variable has two or more categories without intrinsic order. For instance, eye color is a categorical variable with categories like blue, green, brown, and hazel. There is no agreed way to rank these categories. If a variable has a clear order, it is an ordinal variable, discussed below.

2- Ordinal

An ordinal variable is like a categorical variable, but with a clear order. For example, consider customer satisfaction levels: dissatisfied, neutral, and satisfied. These categories can be ordered, but the spacing between them is not consistent. Another example is pain severity: mild, moderate, and severe. Although we can rank these levels, the difference in pain between each category varies. If the categories were equally spaced, the variable would be an interval variable.

3- Interval or numerical

An interval (or numerical) variable, unlike an ordinal variable, has equally spaced intervals between values. For instance, consider temperature measured in Celsius. The difference between 20°C and 30°C is the same as between 30°C and 40°C. This equal spacing distinguishes interval variables from ordinal variables.


Are you still pondering the consequences of not correctly identifying the type of data? Let's clarify with a simple example. Imagine needing to compute the mean of a dataset that is categorical or ordinal. Does this hold any meaningful interpretation? For instance, what would the average "eye color" signify? It's clearly nonsensical. That said, it should also be emphasized that the type of data is not the only factor in determining the statistical test. The number of independent and dependent variables, and the number of groups they contain, matters just as much as the data type.

Photo by Samuel Regan-Asante on Unsplash

I would also like to remind you that there is no need to be intimidated by the number of tests to be discussed. One good way to think about these tests is that they are different approaches to calculate the p-value. The p-value itself can be conceived as a measure of the statistical compatibility of the data with the null hypothesis. That is,

The smaller the p-value, the greater the statistical incompatibility of the data with the null hypothesis, if the underlying assumptions used to calculate the p-value hold. This incompatibility can be interpreted as casting doubt on or providing evidence against the null hypothesis or the underlying assumptions.

Now, let us delve without any further due into the different tests that you need to understand when and how to use.


Statistical tests:

1- One sample student’s t-test

The one-sample t-test is a statistical test used to determine whether the mean of a single sample (from a normally distributed interval variable) of data significantly differs from a known or hypothesized population mean. This test is commonly used in various fields to assess whether a sample is representative of a larger population or to test hypotheses about population mean when the population standard deviation is unknown.

import pandas as pd
from scipy import stats
import numpy as np

# Sample data (scores of 20 students)
scores = [72, 78, 80, 73, 69, 71, 76, 74, 77, 79, 75, 72, 70, 73, 78, 76, 74, 75, 77, 79]

# Population mean under the null hypothesis
pop_mean = 75

# Create a pandas DataFrame
df = pd.DataFrame(scores, columns=['Scores'])

# Calculate sample mean and sample standard deviation
sample_mean = df['Scores'].mean()
sample_std = df['Scores'].std(ddof=1)  # ddof=1 for sample standard deviation

# Number of observations
n = len(df)

# Perform one-sample t-test
t_statistic, p_value = stats.ttest_1samp(df['Scores'], pop_mean)

# Critical t-value for two-tailed test at alpha=0.05 (95% confidence level)
alpha = 0.05
t_critical = stats.t.ppf(1 - alpha/2, df=n-1)

# Output results
print("Sample Mean:", sample_mean)
print("Sample Standard Deviation:", sample_std)
print("Degrees of Freedom (df):", n - 1)
print("t-statistic:", t_statistic)
print("p-value:", p_value)
print("Critical t-value (two-tailed, α=0.05):", t_critical)

# Decision based on p-value
if p_value < alpha:
    print("Reject the null hypothesis. There is a significant difference between the sample mean and the population mean.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference between the sample mean and the population mean.")

2- Binomial test

The test is used to determine if the proportion of successes in a sample is significantly different from a hypothesized proportion. It’s particularly useful when dealing with binary outcomes, such as success/failure or yes/no scenarios. This test is widely used in fields such as medicine, marketing, and quality control, where determining the significance of proportions is crucial.

from scipy import stats

# Define the observed number of successes and the number of trials
observed_successes = 55
n_trials = 100
hypothesized_probability = 0.5

# Perform the binomial test
results = stats.binomtest(observed_successes, n_trials, hypothesized_probability, alternative='two-sided')

print('Results of the binomial test:')
print(f'Observed successes: {observed_successes}')
print(f'Number of trials: {n_trials}')
print(f'Hypothesized probability: {hypothesized_probability}')
print(f'P-value: {results.pvalue}')

# Set significance level
alpha = 0.05

# Decision based on p-value
if results.pvalue < alpha:
    print("Reject the null hypothesis: The coin is not fair.")
else:
    print("Fail to reject the null hypothesis: There is no evidence to suggest the coin is not fair.")

3- Chi-square goodness of fit

The test is used to determine if an observed frequency distribution of a categorical variable differs significantly from an expected distribution. It helps assess whether the observed data fits a specific theoretical distribution. This test is widely used in fields such as genetics, marketing, and psychology to validate hypotheses about distributions.

import numpy as np
from scipy.stats import chisquare

# Observed frequencies
observed = np.array([25, 30, 20, 25])

# Expected frequencies for a uniform distribution
expected = np.array([25, 25, 25, 25])

# Perform Chi-Square Goodness of Fit test
chi2_stat, p_value = chisquare(f_obs=observed, f_exp=expected)

print('Results of the Chi-Square Goodness of Fit test:')
print(f'Observed frequencies: {observed}')
print(f'Expected frequencies: {expected}')
print(f'Chi-square statistic: {chi2_stat}')
print(f'P-value: {p_value}')

# Set significance level
alpha = 0.05

# Decision based on p-value
if p_value < alpha:
    print("Reject the null hypothesis: The observed distribution does not fit the expected distribution.")
else:
    print("Fail to reject the null hypothesis: The observed distribution fits the expected distribution.")

4- Two independent samples t-test

The test is used to compare the means of a normally distributed continuous dependent variable between two independent groups. For instance, imagine we're assessing the impact of a medical intervention. We recruit 100 participants, randomly assigning 50 to a treatment group and 50 to a control group. Here, we have two distinct samples, making the unpaired t-test appropriate for comparing their outcomes.

import numpy as np
from scipy import stats

# Generate example data (normally distributed)
np.random.seed(42)  # for reproducibility
treatment_group = np.random.normal(loc=75, scale=10, size=50)
control_group = np.random.normal(loc=72, scale=10, size=50)

# Perform independent samples t-test
t_statistic, p_value = stats.ttest_ind(treatment_group, control_group,equal_var=False)

# Decision based on p-value
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference in the treatment effect between groups.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference in the treatment effect between groups.")

It is also worth noting that if the variances of the two groups of the independent variable are equal, we should choose the ‘pooled’ version of the test. The only difference in the code would be changing ‘False’ to ‘True’:

t_statistic, p_value = stats.ttest_ind(group1, group2, equal_var = True)

5- Wilcoxon-Mann-Whitney test (Mann-Whitney U test)

It is a non-parametric test, meaning it makes no assumptions about the variables' distributions, used to compare two independent groups. It assesses whether the distributions of the two samples differ without assuming the data follow a specific distribution. This test is particularly useful when the assumptions of the independent samples t-test (such as normality and equal variance) are not met, or when analyzing ordinal or interval data that do not meet parametric assumptions.

import numpy as np
from scipy.stats import mannwhitneyu

# Generate example data
np.random.seed(42)  # for reproducibility
group1 = np.random.normal(loc=50, scale=10, size=30)
group2 = np.random.normal(loc=55, scale=12, size=35)

# Perform Wilcoxon-Mann-Whitney test
statistic, p_value = mannwhitneyu(group1, group2)

# Print results
print('Results of the Wilcoxon-Mann-Whitney test:')
print(f'Statistic: {statistic}')
print(f'P-value: {p_value}')

# Decision based on p-value
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: The distributions of the two groups are significantly different.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference in the distributions of the two groups.")

6- Chi-square test of independence

The chi-square test of independence is used to determine if there is a significant association between two categorical variables. It helps identify whether the distribution of one variable is independent of the other. This test is widely applied in fields like marketing, social sciences, and biology. To perform this test, you first need to pivot the data to create a contingency table, as shown in the Python code below.

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Create a contingency table
data = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Female', 'Female'],
    'Preference': ['Yes', 'No', 'Yes', 'No'],
    'Count': [20, 10, 30, 40]
})

# Pivot the data to get the contingency table
contingency_table = data.pivot(index='Gender', columns='Preference', values='Count').fillna(0).values

# Perform Chi-Square Test of Independence
chi2_stat, p_value, dof, expected = chi2_contingency(contingency_table)

# Print results
print('Results of the Chi-Square Test of Independence:')
print(f'Chi-square statistic: {chi2_stat}')
print(f'P-value: {p_value}')
print(f'Degrees of freedom: {dof}')
print('Expected frequencies:')
print(expected)

# Decision based on p-value
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant association between gender and product preference.")
else:
    print("Fail to reject the null hypothesis: There is no significant association between gender and product preference.")

Additionally, the chi-square test assumes that the expected count for each cell is five or higher. To find the expected count of a specific cell, we multiply its row total by its column total and then divide by the grand total. If this condition is not met, we must use the next test.
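chi2_contingency already returns the expected counts, so the five-or-higher rule can be checked directly. This is a minimal sketch using the counts from the gender/preference example above (Male: 20 Yes / 10 No; Female: 30 Yes / 40 No).

import numpy as np
from scipy.stats import chi2_contingency

# Contingency table of observed counts
contingency_table = np.array([[20, 10],
                              [30, 40]])

chi2_stat, p_value, dof, expected = chi2_contingency(contingency_table)
print("Expected counts:")
print(expected)

if (expected < 5).any():
    print("Some expected counts are below 5; consider Fisher's exact test instead.")
else:
    print("All expected counts are at least 5, so the chi-square assumption holds.")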

7- Fisher’s exact test

The test can be thought of as an alternative to chi-square test when one or more of your contingency table cells has an expected frequency of less than five. This makes it particularly valuable for small sample sizes or when dealing with sparse data.

import numpy as np
from scipy.stats import fisher_exact

# Create a contingency table
# Example data: treatment group vs. control group with success and failure outcomes
# Treatment group: 12 successes, 5 failures
# Control group: 8 successes, 7 failures
contingency_table = np.array([[12, 5],
                              [8, 7]])

# Perform Fisher's Exact Test
odds_ratio, p_value = fisher_exact(contingency_table)

# Print results
print("Results of Fisher's Exact Test:")
print(f'Odds ratio: {odds_ratio}')
print(f'P-value: {p_value}')

# Decision based on p-value
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant association between the treatment and the outcome.")
else:
    print("Fail to reject the null hypothesis: There is no significant association between the treatment and the outcome.")

It is worth noting that Fisher's exact test is generally used for 2×2 contingency tables, which is what the "fisher_exact" function in Python's "scipy.stats" module supports. For contingency tables larger than 2×2, the exact test becomes computationally intensive and is not directly supported by "fisher_exact". Instead, you can use other libraries such as "statsmodels" to handle larger tables. Consider the following Python code:

import numpy as np
from statsmodels.stats.contingency_tables import Table

# Example 7x7 contingency table
contingency_table = np.array([
    [8, 2, 1, 4, 5, 2, 3],
    [1, 5, 3, 2, 7, 4, 6],
    [3, 2, 6, 1, 4, 7, 5],
    [2, 3, 5, 8, 2, 1, 7],
    [5, 4, 2, 3, 6, 2, 8],
    [1, 3, 8, 4, 1, 3, 8],
    [7, 2, 3, 6, 8, 1, 4]
])

# Handle larger contingency tables with statsmodels
table = Table(contingency_table)
result = table.test_nominal_association()
print(result)

8- Paired t-test

This is the ‘dependent’ version of the student’s t-test that I have covered previously. The test is used to compare the means of two related groups to determine if there is a statistically significant difference between them. This test is commonly applied in before-and-after studies, or when the same subjects are measured under two different conditions.

import numpy as np
from scipy.stats import ttest_rel

# Example data: test scores before and after a training program
before = np.array([70, 75, 80, 85, 90])
after = np.array([72, 78, 85, 87, 93])

# Perform paired t-test
t_statistic, p_value = ttest_rel(before, after)

# Print results
print('Results of the paired t-test:')
print(f'T-statistic: {t_statistic}')
print(f'P-value: {p_value}')

# Decision based on p-value
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference between the before and after scores.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between the before and after scores.")

Photo by Tim Mossholder on Unsplash

Okay, were you able to guess the next test? If you are thinking that we will now relax the normality condition and thus need a non-parametric test, then congratulations. This non-parametric test is called the Wilcoxon Signed-Rank Test.

9- Wilcoxon signed-rank test

The Wilcoxon signed-rank test is a non-parametric test used to compare two related samples or repeated measurements on a single sample to assess whether their population ‘median’ ranks differ. It is often used as an alternative to the paired t-test when the data does not meet the assumptions of normality.

import numpy as np
from scipy.stats import wilcoxon

# Example data: stress scores before and after a meditation program
before = np.array([10, 15, 20, 25, 30])
after = np.array([8, 14, 18, 24, 28])

# Perform Wilcoxon signed-rank test
statistic, p_value = wilcoxon(before, after)

# Print results
print('Results of the Wilcoxon Signed-Rank Test:')
print(f'Statistic: {statistic}')
print(f'P-value: {p_value}')

# Decision based on p-value
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference between the before and after scores.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between the before and after scores.")

10- McNemar test

Yes, it is exactly what you are thinking of; this is the counterpart of the paired t-test but for when the dependent variable is categorical.

import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Create a paired 2x2 contingency table
# Rows = outcome before treatment, columns = outcome after treatment
# [[success -> success, success -> failure],
#  [failure -> success, failure -> failure]]
contingency_table = np.array([[15, 5],
                              [3, 17]])

# Perform McNemar test
result = mcnemar(contingency_table, exact=True)

# Print results
print('Results of the McNemar Test:')
print(f'Statistic: {result.statistic}')
print(f'P-value: {result.pvalue}')

# Decision based on p-value
alpha = 0.05
if result.pvalue < alpha:
    print("Reject the null hypothesis: There is a significant difference between before and after proportions.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between before and after proportions.")

Summary

In this part, I have covered three main groups of common statistical tests. The first group is necessary when analyzing a single population (One sample student’s t-test, Binomial test, and Chi-square goodness of fit). The second group (two independent samples t-test, Mann-Whitney U test, and Chi-square test of independence (Fisher’s exact test)) focuses on calculating p-values when examining the relationship between one dependent variable and one independent variable (specifically with exactly two independent groups). In the third group, I addressed tests (paired t-test, Wilcoxon signed-rank test, and McNemar test) required when assuming dependence between the two levels of the independent variable.

Mastering Statistical Tests (Part II): Your Guide to Choosing the Right Test for Your Data

In Part II, I will explore the tests required when a single independent variable has more than two levels, for both independent and dependent (paired) groups.


If you found value in this article, please show your support by 👏 clapping, 📝 leaving a comment, or treating me to a coffee ☕! Your encouragement fuels more data-driven insights. Also, feel free to connect with me on LinkedIn to continue the conversation and exchange ideas!


References

[1] https://stats.oarc.ucla.edu/sas/whatstat/what-statistical-analysis-should-i-usestatistical-analyses-using-sas/

[2] https://stats.oarc.ucla.edu/other/mult-pkg/whatstat/#assumption

[3] https://www.stat.berkeley.edu/~aldous/Real_World/ASA_statement.pdf

The post Mastering Statistical Tests (Part I) appeared first on Towards Data Science.

]]>