Ashish Pal
- Feb 13, 2023

Mastering the Basics of Statistics: An Introduction

Statistics is a branch of mathematics that deals with collecting, analyzing, and interpreting data. It helps in making informed decisions and drawing conclusions based on data.

Fundamental concepts of statistics include:

Data: Data refers to a set of values or observations collected for a specific purpose. It is the raw material for statistical analysis and can come from various sources such as surveys, experiments, or databases.
Variables: Variables are characteristics of the data that can take on different values. There are two types of variables: categorical (e.g. gender, colour) and numerical (e.g. height, weight).
Descriptive Statistics: Descriptive statistics summarize and describe the main features of a set of data. This includes measures of central tendency such as mean, median, and mode, and measures of variability such as range, variance, and standard deviation.
Probability: Probability is a mathematical concept that describes the likelihood of a particular event occurring. It is expressed as a number between 0 and 1, with 0 indicating an impossible event and 1 indicating a certain event.
Normal Distribution: Normal distribution is a common probability distribution that is symmetrical and bell-shaped. Many real-world data sets follow a normal distribution, and it is important for understanding statistical concepts such as hypothesis testing and confidence intervals.
Inferential Statistics: Inferential statistics use a sample of data to make inferences or conclusions about a larger population. This includes techniques such as hypothesis testing and regression analysis.
Hypothesis Testing: Hypothesis testing is a statistical method used to assess whether sample data provide enough evidence to reject a claim or hypothesis about a population. It involves defining a null and alternative hypothesis, selecting a sample, and calculating a test statistic to make a decision about the hypothesis.
Correlation and Causation: Correlation refers to a relationship between two variables, while causation refers to a relationship where one variable directly causes changes in another. It is important to understand the difference between the two because correlation on its own does not establish causation.

Data sources:

Data can come from various sources, such as:

Surveys: Surveys are a common method for collecting data. They can be administered in various forms, such as online, over the phone, or in-person.
Experiments: Experiments are designed to test a specific hypothesis or relationship between variables. Data is collected by manipulating one or more variables and measuring the effect on another.
Databases: Databases are collections of data that can be organized and analyzed. They can come from various sources such as financial records, customer transactions, or government data.
Social Media: Social media platforms such as Facebook, Twitter, and Instagram are rich sources of data. Companies and organizations can use data from these platforms to gather insights into consumer behavior and trends.
Sensors and IoT devices: The Internet of Things (IoT) refers to the network of physical devices, vehicles, home appliances, and other items embedded with electronics, software, and sensors. These devices generate large amounts of data, which can be used for various purposes such as predictive maintenance and traffic management.
Public Data: Governments and organizations often make data publicly available, such as census data, weather data, or environmental data. This data can be used for research, policy analysis, and decision making.

Variables:

Variables are characteristics or attributes of the data that can take on different values. There are two types of variables:

Categorical Variables: Categorical variables are variables that can be divided into categories or groups. For example, gender (male or female), color (red, blue, green), or type of food (pizza, sushi, burger).
Numerical Variables: Numerical variables are variables that can take on numerical values, such as height, weight, or age. They can be either discrete (e.g. whole numbers) or continuous (e.g. decimal values).

Descriptive Statistics

Measures of Central Tendency: Mean, median, and mode are used to describe the central tendency of a dataset. The mean is the average of all values, the median is the middle value when the data is arranged in order, and the mode is the most frequently occurring value.
Measures of Variability: Range, variance and standard deviation are used to describe the variability or spread of a dataset. The range is the difference between the largest and smallest values, the variance is the average of the squared deviations from the mean, and the standard deviation is the square root of the variance, which expresses the typical distance of values from the mean in the same units as the data.
Frequency Distributions: Frequency distributions are used to show how often each value occurs in a dataset. They are often presented in a table or graph, such as a histogram, to visually represent the data.
Box Plots: Box plots are a graphical representation of the distribution of a dataset. They show the range, median, and quartiles of the data, and can be used to identify outliers or extreme values.
Percentiles: Percentiles divide a dataset into 100 equal parts and describe the value at each of these parts. For example, the 50th percentile (also known as the median) separates the lower half from the upper half of the data.
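
The measures described above can be computed with Python's standard `statistics` module. The following is a minimal sketch; the `scores` list is a made-up sample used only for illustration, and the population versions of variance and standard deviation (`pvariance`, `pstdev`) are used on the assumption that the list is the whole dataset rather than a sample from a larger one.

```python
import statistics

# Hypothetical dataset: exam scores for nine students (illustrative values only).
scores = [55, 60, 60, 65, 70, 75, 80, 85, 98]

# Measures of central tendency
mean = statistics.mean(scores)            # arithmetic average
median = statistics.median(scores)        # middle value of the sorted data
mode = statistics.mode(scores)            # most frequently occurring value

# Measures of variability
value_range = max(scores) - min(scores)   # largest value minus smallest value
variance = statistics.pvariance(scores)   # average squared deviation from the mean
std_dev = statistics.pstdev(scores)       # square root of the variance

# Quartiles are the 25th, 50th, and 75th percentiles; the 50th equals the median.
q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")

print(f"mean={mean}, median={median}, mode={mode}")
print(f"range={value_range}, variance={variance}, std_dev={std_dev:.2f}")
print(f"quartiles: {q1}, {q2}, {q3}")
```

If the data were a sample drawn from a larger population, `statistics.variance` and `statistics.stdev` (which divide by n - 1) would usually be the more appropriate choice.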

Probability

Probability is a branch of mathematics that deals with the study of random events and the likelihood of their occurrence. It is used to quantify the uncertainty or risk associated with a particular event.

The probability of an event is a value between 0 and 1, with 0 representing an impossible event and 1 representing a certain event. The probability of an event is expressed as a decimal or a percentage. For example, the probability of flipping a coin and getting heads is 0.5 or 50%.

There are two main types of probability:

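Classical Probability: Classical probability is based on the idea that all possible outcomes of an event are equally likely to occur. The probability of an event is then the number of favorable outcomes divided by the total number of possible outcomes. For example, when rolling a fair die, each number from 1 to 6 has the same probability, 1/6.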
Empirical Probability: Empirical probability is based on observed data. It is calculated by dividing the number of successful outcomes by the total number of trials. For example, the empirical probability of winning a game of chance can be determined by counting the number of times a player wins and dividing by the total number of games played.
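
The two approaches can be compared on the die example. The sketch below computes the classical probability of rolling a 4 as a fraction and then estimates the same probability empirically from simulated rolls. The seed, the number of trials, and the choice of the face 4 are arbitrary assumptions made for illustration.

```python
import random
from fractions import Fraction

# Classical probability: each of the 6 faces of a fair die is equally likely.
favorable_outcomes = 1          # exactly one face shows a 4
total_outcomes = 6
classical = Fraction(favorable_outcomes, total_outcomes)

# Empirical probability: simulate rolls and count how often a 4 appears.
rng = random.Random(42)         # fixed seed so the run is repeatable
trials = 60_000
successes = sum(1 for _ in range(trials) if rng.randint(1, 6) == 4)
empirical = successes / trials  # successful outcomes / total trials

print(f"classical: {float(classical):.4f}")
print(f"empirical: {empirical:.4f}")
```

With a large number of trials, the empirical value should settle close to the classical one, although it will rarely match it exactly.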

Normal Distribution

The normal distribution, also known as the Gaussian distribution or bell curve, is a symmetrical, continuous probability distribution that is commonly used in statistics to model real-world data. It is defined by its mean (average) and standard deviation (measure of spread), and is often a reasonable model for variables that result from many small, independent influences, such as measurement errors.

The normal distribution has several important properties, including:

Symmetry: The normal distribution is symmetrical around its mean, which means that a value is equally likely to fall above the mean as below it (a probability of 0.5 on each side).
Unimodality: The normal distribution is unimodal, meaning that it has a single peak.
Bell-shaped: The normal distribution is shaped like a bell, with the peak of the distribution at the mean and the spread of the distribution determined by the standard deviation.
Asymptotic: The tails of the normal distribution approach zero, but never reach it. This means that there is a very small but nonzero probability of observing a value that is very far from the mean.
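
These properties can be checked numerically with `statistics.NormalDist` from Python's standard library. This is only a sketch: the mean of 170 and standard deviation of 10 (imagined here as heights in centimetres) are assumed values chosen for illustration.

```python
from statistics import NormalDist

# Hypothetical variable: heights in cm with mean 170 and standard deviation 10.
heights = NormalDist(mu=170, sigma=10)

# Symmetry: half of the probability lies below the mean.
p_below_mean = heights.cdf(170)

# Symmetry: points equally far either side of the mean have the same density.
density_left = heights.pdf(160)
density_right = heights.pdf(180)

# Unimodality and bell shape: the density is highest at the mean.
density_at_mean = heights.pdf(170)

# Tails: the probability of a value more than 4 standard deviations above
# the mean is tiny, but it is still greater than zero.
p_far_tail = 1 - heights.cdf(210)

print(f"P(X < mean) = {p_below_mean}")
print(f"density at 160 and 180: {density_left:.5f}, {density_right:.5f}")
print(f"P(X > 210) = {p_far_tail:.2e}")
```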

Topics under the normal distribution include:

Properties of the normal distribution
Mean and standard deviation
Z-scores and standardizing data
Using the normal distribution to calculate probabilities
The 68-95-99.7 rule
The central limit theorem
Normal approximation to the binomial distribution
Applications in hypothesis testing and estimation
Transformation of normal variables
Multivariate normal distribution

The normal distribution is widely used in many fields, including finance, economics, and engineering, to model the distribution of random variables and to make predictions about future events. It is also a useful tool for understanding the distribution of data and for identifying outliers or extreme values.
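
Two of the topics listed above, z-scores and the 68-95-99.7 rule, lend themselves to a short example. The helper names `z_score` and `probability_within` are hypothetical, and the test-score figures are invented for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / std_dev

def probability_within(k):
    """Probability that a normal variable falls within k standard deviations of its mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

# Hypothetical example: a score of 85 when scores have mean 70 and standard deviation 10.
z = z_score(85, mean=70, std_dev=10)
print(f"z-score: {z}")

# The 68-95-99.7 rule: share of values within 1, 2, and 3 standard deviations.
for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {probability_within(k):.4f}")
```

Standardizing a value to a z-score makes it possible to use a single standard normal distribution to calculate probabilities for any normally distributed variable.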

Inferential statistics

Inferential statistics is a branch of statistics that deals with the process of drawing conclusions about a population based on a sample of data. It is used to make predictions and test hypotheses about population parameters, such as the mean, standard deviation, and proportion, using statistical methods and techniques.

The main steps in inferential statistics include:

Formulating a research hypothesis: A research hypothesis is a statement about a population parameter that is to be tested using sample data.
Sampling: A sample of data is collected from the population and used to make inferences about the population.
Estimation: Using the sample data, a point estimate is calculated for the population parameter. This estimate provides the best guess for the value of the parameter.
Testing of hypothesis: The sample data is used to test the research hypothesis and determine if it is supported or not. A statistical test is performed to calculate a test statistic and a p-value, which is used to determine if the results are statistically significant.
Interpreting results: The results of the hypothesis test are interpreted, and a conclusion is drawn about the population parameter. If the results are statistically significant, it can be concluded that the hypothesis is supported by the data.
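
The sampling and estimation steps can be sketched as follows. The population here is simulated, so every number in it is an assumption rather than real data. The interval uses a normal approximation, which is a common simplification for a sample of this size and not the only way to build a confidence interval.

```python
import math
import random
import statistics
from statistics import NormalDist

# Hypothetical population: 10,000 simulated values standing in for real data.
rng = random.Random(0)
population = [rng.gauss(50, 12) for _ in range(10_000)]

# Sampling: draw a simple random sample without replacement.
sample = rng.sample(population, 200)

# Estimation: the sample mean is a point estimate of the population mean.
point_estimate = statistics.mean(sample)

# Interval estimate: approximate 95% confidence interval around the point estimate.
standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
z_critical = NormalDist().inv_cdf(0.975)   # about 1.96
lower = point_estimate - z_critical * standard_error
upper = point_estimate + z_critical * standard_error

print(f"point estimate: {point_estimate:.2f}")
print(f"approximate 95% confidence interval: ({lower:.2f}, {upper:.2f})")
```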

Inferential statistics play a crucial role in many fields, including business, medicine, and social sciences, by allowing researchers to draw meaningful conclusions from data and make informed decisions based on those conclusions.

Topics under Inferential Statistics include:

Point estimation
Interval estimation
Hypothesis testing
t-tests
ANOVA (Analysis of Variance)
Chi-square tests
Regression analysis
Logistic regression
Non-parametric tests
Bayesian inference
Multiple comparisons and correction methods
Power and sample size calculation
Confidence intervals
P-value and significance level
Type I and Type II errors
Effect size and interpretability of results

Hypothesis Testing

Hypothesis testing is a statistical method used to make inferences about a population based on a sample of data. The process involves formulating a research hypothesis and testing it against the sample data to determine if it is supported or not.

The steps in hypothesis testing include:

State the null and alternative hypothesis: The null hypothesis represents the default assumption that there is no difference or relationship between variables. The alternative hypothesis contradicts the null hypothesis and is the claim the researcher is seeking evidence for.
Choose a significance level: The significance level is the probability of making a Type I error (rejecting a true null hypothesis) that the researcher is willing to accept. It is chosen before the data is analyzed. A common significance level is 0.05.
Select a test statistic: A test statistic is a calculated value used to determine if the null hypothesis can be rejected or not.
Calculate the p-value: The p-value is the probability of observing a test statistic as extreme or more extreme than the one observed, given that the null hypothesis is true.
Make a decision: If the p-value is less than the significance level, the null hypothesis is rejected in favor of the alternative hypothesis. Otherwise, the null hypothesis is not rejected, which means the data did not provide enough evidence against it, not that it has been shown to be true.
Interpret the results: The results of the hypothesis test are interpreted and a conclusion is drawn about the population based on the sample data.
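
The steps above can be sketched as a two-sided one-sample z-test. This is one simple test among many, chosen here because it needs only the standard library. It assumes the population standard deviation is known, which is rarely true in practice, where a t-test is the usual alternative. The function name `one_sample_z_test` and the packet weights are hypothetical.

```python
import math
import statistics
from statistics import NormalDist

def one_sample_z_test(sample, null_mean, population_sd, alpha=0.05):
    """Two-sided z-test of H0: the population mean equals null_mean.

    Assumes the population standard deviation is known and that the
    sample mean is approximately normally distributed.
    """
    n = len(sample)
    sample_mean = statistics.mean(sample)
    # Test statistic: distance of the sample mean from the null value,
    # measured in standard errors.
    z = (sample_mean - null_mean) / (population_sd / math.sqrt(n))
    # p-value: probability of a statistic at least this extreme if H0 is true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Decision: reject H0 when the p-value is below the significance level.
    return z, p_value, p_value < alpha

# Hypothetical data: weights in grams of 16 packets labelled as 500 g.
weights = [498, 503, 497, 495, 499, 501, 494, 496,
           497, 500, 495, 498, 496, 499, 493, 497]

z, p_value, reject_null = one_sample_z_test(weights, null_mean=500, population_sd=4)
print(f"z = {z:.3f}, p-value = {p_value:.4f}, reject H0: {reject_null}")
```

Here the null hypothesis is that the mean packet weight is 500 g, and the alternative is that it differs from 500 g in either direction.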

Hypothesis testing is an important tool in inferential statistics and is used to make decisions about populations based on sample data in many fields including medicine, business, and social sciences.

Correlation and causation

Correlation and causation are two related but distinct concepts in statistics.

Correlation refers to the relationship between two variables and is often quantified using a correlation coefficient, such as Pearson's r. Correlation coefficients can range from -1 to 1 and indicate the strength and direction of the linear relationship between the two variables. A positive correlation coefficient indicates that as one variable increases, the other variable tends to increase as well. A negative correlation coefficient indicates that as one variable increases, the other variable tends to decrease. A correlation coefficient of 0 indicates no linear relationship between the two variables, although a non-linear relationship may still exist.
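
Pearson's r can be computed directly from its definition. The sketch below uses a hypothetical helper, `pearson_r`, and small invented datasets to show a positive coefficient, a negative coefficient, and a case where r is 0 even though the variables are clearly related, because Pearson's r only measures linear association.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length, at least 2 long")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of paired deviations from the two means.
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariation / (spread_x * spread_y)

# Hypothetical data for five students.
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 58, 61, 70, 74]   # rises as hours rise: positive r
hours_of_tv = [9, 7, 6, 4, 2]       # falls as hours rise: negative r

# y = x^2 on symmetric x values: a perfect curve with no linear trend.
symmetric_x = [-2, -1, 0, 1, 2]
squared = [x * x for x in symmetric_x]

print(f"studied vs score: {pearson_r(hours_studied, exam_score):.3f}")
print(f"studied vs tv:    {pearson_r(hours_studied, hours_of_tv):.3f}")
print(f"x vs x squared:   {pearson_r(symmetric_x, squared):.3f}")
```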

Causation refers to the relationship between an independent variable and a dependent variable, where the independent variable is causing changes in the dependent variable. In other words, it is a relationship in which one variable influences or affects the other.

It's important to note that correlation does not imply causation. Just because two variables are correlated does not mean that one is causing the other. There can be many other factors influencing the relationship, and further investigation and analysis are needed to determine the cause-and-effect relationship between the two variables.
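
A small simulation can illustrate how two variables end up correlated without either causing the other. In this sketch, a third variable (daily temperature) drives both ice cream sales and sunburn cases. All of the numbers, the variable names, and the helper `pearson_r` are invented for illustration and do not come from real data.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariation / (spread_x * spread_y)

rng = random.Random(7)

# Hidden common cause: daily temperature over 1,000 simulated days.
temperature = [rng.gauss(25, 5) for _ in range(1_000)]

# Both outcomes depend on temperature plus independent noise.
# Neither outcome appears in the formula for the other.
ice_cream_sales = [10 * t + rng.gauss(0, 20) for t in temperature]
sunburn_cases = [2 * t + rng.gauss(0, 5) for t in temperature]

r_raw = pearson_r(ice_cream_sales, sunburn_cases)

# Remove the temperature effect (using the known simulation coefficients)
# and correlate what is left over.
ice_cream_residual = [s - 10 * t for s, t in zip(ice_cream_sales, temperature)]
sunburn_residual = [c - 2 * t for c, t in zip(sunburn_cases, temperature)]
r_adjusted = pearson_r(ice_cream_residual, sunburn_residual)

print(f"correlation between sales and sunburn: {r_raw:.2f}")
print(f"after accounting for temperature:      {r_adjusted:.2f}")
```

In real data the common cause is usually unknown and its effect cannot simply be subtracted, which is why controlled experiments are the usual way to establish causation.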

Understanding the difference between correlation and causation is critical in many fields, including medical research, marketing, and social sciences, as it affects how results are interpreted and decisions are made.