Hypothesis Testing and Statistical Analysis
This repository provides an overview of hypothesis testing and various statistical tests, along with practical examples in Python.
Overview
Hypothesis testing is a statistical tool used to validate or invalidate assumptions about a population based on sample data. This repository covers the definition, types, steps, relevance, and limitations of hypothesis testing. Additionally, it includes examples and code snippets for various statistical tests in Python.
Table of Contents
- Hypothesis Testing
  - Definition
  - Types
  - Steps
  - Relevance
  - Limitations
- Statistical Tests
  - Shapiro-Wilk Test
  - Kolmogorov-Smirnov Test
  - D'Agostino's Normality Test
  - Chi-Square Test
  - Correlation Test
  - Skewness
- Examples
  - T-Test
  - Z-Test
  - ANOVA Test
  - Mann-Whitney U Test
  - Augmented Dickey-Fuller Test
Hypothesis Testing
Definition
Hypothesis testing is a statistical method for making inferences about a population from a sample of data. It involves formulating null and alternative hypotheses, collecting relevant data, choosing an appropriate statistical test, and drawing conclusions about the population.
Types
- Simple Hypothesis: Specifies the population parameter as a single exact value.
- Composite Hypothesis: Does not specify an exact value; covers a range between upper and lower bounds.
- One-Tail Hypothesis: Tests for an effect in one direction only (see the sketch below).
- Two-Tail Hypothesis: Tests for an effect in either direction.
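As a minimal sketch of the one-tail vs. two-tail distinction (assuming SciPy 1.6+ for the alternative argument; data and expected_mean are placeholder names):

from scipy.stats import ttest_1samp

# Two-tail: H1 says the mean differs from expected_mean in either direction.
_, p_two_tail = ttest_1samp(data, expected_mean, alternative='two-sided')

# One-tail: H1 says the mean is greater than expected_mean (one direction only).
_, p_one_tail = ttest_1samp(data, expected_mean, alternative='greater')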
Steps
- Formulate the null and alternative hypotheses about the population.
- Collect relevant data from a sample that represents the population.
- Choose a statistical test based on the nature of the data.
- Analyze the test results to reject or fail to reject the null hypothesis (illustrated in the sketch below).
- Compile and summarize the findings into a report or research paper.
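A minimal end-to-end sketch of these steps, using simulated data and a one-sample t-test (all names below are placeholders):

import numpy as np
from scipy.stats import ttest_1samp

# Step 1: H0: the population mean is 50; H1: the mean differs from 50.
# Step 2: collect a representative sample (simulated here).
rng = np.random.default_rng(0)
sample = rng.normal(loc=52, scale=10, size=40)

# Step 3: the data are continuous and roughly normal, so a one-sample t-test fits.
stat, p_val = ttest_1samp(sample, 50)

# Step 4: compare the p-value with the significance level (alpha = 0.05).
if p_val < 0.05:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")

# Step 5: report the test statistic and p-value in the write-up.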
Relevance
Hypothesis testing is essential for validating theories when examining the entire population is impractical. Researchers typically aim to reject the null hypothesis in favor of the alternative hypothesis.
Limitations
- It works on sample data, so it is practical mainly for small datasets and its conclusions carry sampling error.
- It relies on probabilities, interpretations, and assumptions, so results are never fully certain.
Statistical Tests
Shapiro-Wilk Test
The Shapiro-Wilk test checks whether a sample is drawn from a normal distribution.
from scipy.stats import shapiro

# Null hypothesis: the sample comes from a normal distribution.
_, p_val = shapiro(data)
if p_val > 0.05:
    print("We fail to reject the null hypothesis. The data appears normally distributed.")
else:
    print("We reject the null hypothesis. The data is not normally distributed.")
Kolmogorov-Smirnov Test
The Kolmogorov-Smirnov (KS) test compares a sample's empirical cumulative distribution function with a reference distribution, here the standard normal.
from scipy.stats import kstest

# Null hypothesis: the sample follows the reference (standard normal) distribution.
# Standardize the data first if it is not already zero-mean, unit-variance.
_, p_val = kstest(data, 'norm')
if p_val > 0.05:
    print("We fail to reject the null hypothesis. The data appears normally distributed.")
else:
    print("We reject the null hypothesis. The data is not normally distributed.")
D'Agostino's Normality Test
D'Agostino and Pearson's normality test (scipy's normaltest) checks whether a sample comes from a normal distribution, based on skewness and kurtosis.
from scipy.stats import normaltest

# Null hypothesis: the sample comes from a normal distribution.
_, p_val = normaltest(data)
if p_val > 0.05:
    print("We fail to reject the null hypothesis. The data appears normally distributed.")
else:
    print("We reject the null hypothesis. The data is not normally distributed.")
Chi-Square Test
The Chi-Square test assesses whether two categorical variables are independent. The snippet below uses sklearn's chi2, which scores features against a target for feature selection; a contingency-table alternative follows it.
from sklearn.feature_selection import chi2

# sklearn's chi2 expects non-negative features and returns arrays
# (one statistic and one p-value per feature column).
chi, p_val = chi2(df[['Age']], df['Outcome'])
if p_val[0] > 0.05:
    print("We fail to reject the null hypothesis. The variables are independent.")
else:
    print("We reject the null hypothesis. The variables are dependent.")
Correlation Test
Pearson's correlation test measures linear association and assumes roughly normal data; Spearman's rank correlation measures monotonic association and makes no normality assumption. Both test the null hypothesis that the two variables are uncorrelated.
from scipy.stats import pearsonr, spearmanr

# Null hypothesis: the two variables are uncorrelated.
stat, p = spearmanr(data1, data2)
if p > 0.05:
    print('Variables are probably uncorrelated (independent).')
else:
    print('Variables are probably correlated (dependent).')

stat, p = pearsonr(data1, data2)
if p > 0.05:
    print('Variables are probably uncorrelated (independent).')
else:
    print('Variables are probably correlated (dependent).')
Skewness
Skewness measures the asymmetry of a distribution.
# skew() here is the pandas Series method.
skewness = data.skew()
if skewness > 1:
    print("Data is highly positively skewed.")
elif -1 <= skewness <= 1:
    print("Data is approximately symmetric (at most moderately skewed).")
else:
    print("Data is highly negatively skewed.")
Examples
T-Test
The one-sample T-Test compares the mean of a sample with a known or hypothesized value.
from scipy.stats import ttest_1samp

# Null hypothesis: the sample mean equals expected_mean.
_, p_val = ttest_1samp(data, expected_mean)
if p_val > 0.05:
    print("We fail to reject the null hypothesis.")
else:
    print("We reject the null hypothesis in favor of the alternative.")
Z-Test
The Z-Test compares the mean of a sample with a known value and is appropriate for larger samples (commonly n >= 30).
from statsmodels.stats.weightstats import ztest

# Null hypothesis: the sample mean equals expected_mean.
_, p_val = ztest(data, value=expected_mean)
if p_val > 0.05:
    print("We fail to reject the null hypothesis.")
else:
    print("We reject the null hypothesis in favor of the alternative.")
ANOVA Test
The one-way ANOVA test compares the means of two or more independent samples; its null hypothesis is that all group means are equal.
from scipy.stats import f_oneway

# Null hypothesis: all group means are equal.
stat, p = f_oneway(sample1, sample2, sample3)
if p > 0.05:
    print('No significant difference between group means.')
else:
    print('At least one group mean differs significantly.')
Mann-Whitney U Test
The Mann-Whitney U Test is a non-parametric test of whether two independent samples come from the same distribution.
from scipy.stats import mannwhitneyu

# Null hypothesis: the two samples come from the same distribution.
stat, p = mannwhitneyu(sample1, sample2)
if p > 0.05:
    print('Probably the same distribution.')
else:
    print('Probably different distributions.')
Augmented Dickey-Fuller Test
The Augmented Dickey-Fuller (ADF) test checks whether a time series is stationary; its null hypothesis is that the series has a unit root (i.e., is non-stationary).
from statsmodels.tsa.stattools import adfuller

# Null hypothesis: the series has a unit root (non-stationary).
stat, p, lags, obs, crit, t = adfuller(time_series_data)
if p > 0.05:
    print('Series is not stationary.')
else:
    print('Series is stationary.')
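As a complementary check, the critical values returned by adfuller can be compared with the test statistic directly; a minimal sketch, assuming the same time_series_data placeholder:

from statsmodels.tsa.stattools import adfuller

stat, p, lags, obs, crit, t = adfuller(time_series_data)

# crit maps significance levels to critical values: {'1%': ..., '5%': ..., '10%': ...}.
# A test statistic more negative than the 5% critical value indicates stationarity.
if stat < crit['5%']:
    print('Series is stationary at the 5% level.')
else:
    print('Series is not stationary at the 5% level.')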