Hypothesis Testing and Statistical Analysis
This repository provides an overview of hypothesis testing and various statistical tests, along with practical examples in Python.
Overview
Hypothesis testing is a statistical tool used to validate or invalidate assumptions about a population based on sample data. This repository covers the definition, types, steps, relevance, and limitations of hypothesis testing. Additionally, it includes examples and code snippets for various statistical tests in Python.
Table of Contents
- Hypothesis Testing
  - Definition
  - Types
  - Steps
  - Relevance
  - Limitations
- Statistical Tests
  - Shapiro-Wilk Test
  - Kolmogorov-Smirnov Test
  - D'Agostino's Normality Test
  - Chi-Square Test
  - Correlation Test
  - Skewness
- Examples
  - T-Test
  - Z-Test
  - ANOVA Test
  - Mann-Whitney U Test
  - Augmented Dickey-Fuller Test
Hypothesis Testing
Definition
Hypothesis testing is a statistical method for making inferences about a population from a sample of data. It involves formulating null and alternative hypotheses, collecting relevant data, choosing an appropriate statistical test, and drawing conclusions about the population.
Types
- Simple Hypothesis: Specifies the population parameter as a single exact value.
- Composite Hypothesis: Does not specify an exact value; covers a range between upper and lower bounds.
- One-Tail Hypothesis: Tests for an effect in one direction only (see the sketch below).
- Two-Tail Hypothesis: Tests for an effect in either direction.
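As a minimal sketch of the one-tail vs. two-tail distinction (assuming SciPy 1.6+ for the alternative argument; data and expected_mean are placeholder names):

from scipy.stats import ttest_1samp

# Two-tail: H1 says the mean differs from expected_mean in either direction.
_, p_two_tail = ttest_1samp(data, expected_mean, alternative='two-sided')

# One-tail: H1 says the mean is greater than expected_mean (one direction only).
_, p_one_tail = ttest_1samp(data, expected_mean, alternative='greater')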
Steps
- Formulate the null and alternative hypotheses about the population.
- Collect relevant data from a sample that represents the population.
- Choose a statistical test based on the nature of the data.
- Analyze the test results to reject or fail to reject the null hypothesis (illustrated in the sketch below).
- Compile and summarize the findings into a report or research paper.
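A minimal end-to-end sketch of these steps, using simulated data and a one-sample t-test (all names below are placeholders):

import numpy as np
from scipy.stats import ttest_1samp

# Step 1: H0: the population mean is 50; H1: the mean differs from 50.
# Step 2: collect a representative sample (simulated here).
rng = np.random.default_rng(0)
sample = rng.normal(loc=52, scale=10, size=40)

# Step 3: the data are continuous and roughly normal, so a one-sample t-test fits.
stat, p_val = ttest_1samp(sample, 50)

# Step 4: compare the p-value with the significance level (alpha = 0.05).
if p_val < 0.05:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")

# Step 5: report the test statistic and p-value in the write-up.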
Relevance
Hypothesis testing is essential for validating theories when examining the entire population is impractical. Researchers typically aim to reject the null hypothesis in favor of the alternative hypothesis.
Limitations
- It works on sample data, so it is practical mainly for small datasets and its conclusions carry sampling error.
- It relies on probabilities, interpretations, and assumptions, so results are never fully certain.
Statistical Tests
Shapiro-Wilk Test
The Shapiro-Wilk test checks whether a sample is drawn from a normal distribution.
from scipy.stats import shapiro

# Null hypothesis: the sample comes from a normal distribution.
_, p_val = shapiro(data)
if p_val > 0.05:
    print("We fail to reject the null hypothesis. The data appears normally distributed.")
else:
    print("We reject the null hypothesis. The data is not normally distributed.")
Kolmogorov-Smirnov Test
The Kolmogorov-Smirnov (KS) test compares a sample's empirical cumulative distribution function with a reference distribution, here the standard normal.
from scipy.stats import kstest

# Null hypothesis: the sample follows the reference (standard normal) distribution.
# Standardize the data first if it is not already zero-mean, unit-variance.
_, p_val = kstest(data, 'norm')
if p_val > 0.05:
    print("We fail to reject the null hypothesis. The data appears normally distributed.")
else:
    print("We reject the null hypothesis. The data is not normally distributed.")
D'Agostino's Normality Test
D'Agostino and Pearson's normality test (scipy's normaltest) checks whether a sample comes from a normal distribution, based on skewness and kurtosis.
from scipy.stats import normaltest

# Null hypothesis: the sample comes from a normal distribution.
_, p_val = normaltest(data)
if p_val > 0.05:
    print("We fail to reject the null hypothesis. The data appears normally distributed.")
else:
    print("We reject the null hypothesis. The data is not normally distributed.")
Chi-Square Test
The Chi-Square test assesses whether two categorical variables are independent. The snippet below uses sklearn's chi2, which scores features against a target for feature selection; a contingency-table alternative follows it.
from sklearn.feature_selection import chi2

# sklearn's chi2 expects non-negative features and returns arrays
# (one statistic and one p-value per feature column).
chi, p_val = chi2(df[['Age']], df['Outcome'])
if p_val[0] > 0.05:
    print("We fail to reject the null hypothesis. The variables are independent.")
else:
    print("We reject the null hypothesis. The variables are dependent.")
Correlation Test
Pearson's correlation test measures linear association and assumes roughly normal data; Spearman's rank correlation measures monotonic association and makes no normality assumption. Both test the null hypothesis that the two variables are uncorrelated.
from scipy.stats import pearsonr, spearmanr

# Null hypothesis: the two variables are uncorrelated.
stat, p = spearmanr(data1, data2)
if p > 0.05:
    print('Variables are probably uncorrelated (independent).')
else:
    print('Variables are probably correlated (dependent).')

stat, p = pearsonr(data1, data2)
if p > 0.05:
    print('Variables are probably uncorrelated (independent).')
else:
    print('Variables are probably correlated (dependent).')
Skewness
Skewness measures the asymmetry of a distribution.
# skew() here is the pandas Series method.
skewness = data.skew()
if skewness > 1:
    print("Data is highly positively skewed.")
elif -1 <= skewness <= 1:
    print("Data is approximately symmetric (at most moderately skewed).")
else:
    print("Data is highly negatively skewed.")
Examples
T-Test
The one-sample T-Test compares the mean of a sample with a known or hypothesized value.
from scipy.stats import ttest_1samp

# Null hypothesis: the sample mean equals expected_mean.
_, p_val = ttest_1samp(data, expected_mean)
if p_val > 0.05:
    print("We fail to reject the null hypothesis.")
else:
    print("We reject the null hypothesis in favor of the alternative.")
Z-Test
The Z-Test compares the mean of a sample with a known value and is appropriate for larger samples (commonly n >= 30).
from statsmodels.stats.weightstats import ztest

# Null hypothesis: the sample mean equals expected_mean.
_, p_val = ztest(data, value=expected_mean)
if p_val > 0.05:
    print("We fail to reject the null hypothesis.")
else:
    print("We reject the null hypothesis in favor of the alternative.")
ANOVA Test
The one-way ANOVA test compares the means of two or more independent samples; its null hypothesis is that all group means are equal.
from scipy.stats import f_oneway

# Null hypothesis: all group means are equal.
stat, p = f_oneway(sample1, sample2, sample3)
if p > 0.05:
    print('No significant difference between group means.')
else:
    print('At least one group mean differs significantly.')
Mann-Whitney U Test
The Mann-Whitney U Test is a non-parametric test of whether two independent samples come from the same distribution.
from scipy.stats import mannwhitneyu

# Null hypothesis: the two samples come from the same distribution.
stat, p = mannwhitneyu(sample1, sample2)
if p > 0.05:
    print('Probably the same distribution.')
else:
    print('Probably different distributions.')
Augmented Dickey-Fuller Test
The Augmented Dickey-Fuller (ADF) test checks whether a time series is stationary; its null hypothesis is that the series has a unit root (i.e., is non-stationary).
from statsmodels.tsa.stattools import adfuller

# Null hypothesis: the series has a unit root (non-stationary).
stat, p, lags, obs, crit, t = adfuller(time_series_data)
if p > 0.05:
    print('Series is not stationary.')
else:
    print('Series is stationary.')
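As a complementary check, the critical values returned by adfuller can be compared with the test statistic directly; a minimal sketch, assuming the same time_series_data placeholder:

from statsmodels.tsa.stattools import adfuller

stat, p, lags, obs, crit, t = adfuller(time_series_data)

# crit maps significance levels to critical values: {'1%': ..., '5%': ..., '10%': ...}.
# A test statistic more negative than the 5% critical value indicates stationarity.
if stat < crit['5%']:
    print('Series is stationary at the 5% level.')
else:
    print('Series is not stationary at the 5% level.')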