What is Inferential Statistics in Machine Learning?


Introduction

Statistics is one of the core foundational skills required for data science. Any data science expert would recommend learning it or brushing up on it.

However, many statistics resources lean heavily on the mathematics and concentrate on deriving formulas rather than simplifying the concepts. I believe that statistics can be understood directly and practically. That's why I created this tutorial.

In this blog, we will walk you through inferential statistics, one of the most fundamental concepts in statistics for data science, along with its related concepts and their practical applications.

What are Inferential Statistics?

Inferential statistics is the area of statistics that uses analytical tools to draw conclusions about a population by analyzing random samples taken from it. Its goal is to make generalizations about the population. It does so by using a statistic from the sample data (such as the sample mean) to make inferences about a population parameter (such as the population mean).

Examples of Inferential Statistics

Inferential statistics is useful and cost-effective because it lets us draw conclusions about a population without collecting data from every member. Some examples of inferential statistics are given below:

Suppose the average marks of a sample of 100 students in a particular country are known. Using this sample information, inferential statistics can approximate the average marks of all students in that country.

Suppose a coach wants to find out how many throws, on average, his college sophomores can make without stopping. A sample of several students is asked to attempt the task, and their average is calculated. Inferential statistics uses this data to infer how many throws sophomores in general can make on average.
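To make the idea concrete, here is a minimal sketch of estimating a population mean from a sample. The "population" of marks is synthetic (generated with invented parameters), and names such as `population` and `sample` are our own, chosen for illustration only.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Synthetic "population" of exam marks (made-up numbers, for illustration only).
population = [random.gauss(70, 10) for _ in range(100_000)]

# Draw a simple random sample of 100 students.
sample = random.sample(population, 100)

population_mean = statistics.fmean(population)  # the parameter (normally unknown)
sample_mean = statistics.fmean(sample)          # the statistic we can compute

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

In practice we would never have the full population list; it is built here only so the sample estimate can be compared against the value it is trying to approximate.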


Why do we Need Inferential Statistics?

Let's say you want to know the average salary of data science professionals in India. Which of the following methods could you use to find it?

Meet each and every data science professional in India, note their salaries, and calculate the overall average.

Or pick a few professionals in a city like Gurgaon, note their salaries, and use them to estimate the Indian average.

The first method is not impossible, but it would require enormous resources and time. Today, companies want to make decisions quickly and efficiently, so the first method is impractical.

On the other hand, the second method seems feasible. However, there is a caveat. What if data science professionals in Gurgaon are not representative of those across India?

Then there is a good chance that your estimate of Indian data science professionals' salaries will be badly off.
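The caveat can be illustrated with a small simulation. This is only a sketch: every salary figure and city mean below is invented, and the list of cities other than Gurgaon is our own assumption. It compares a sample taken from a single city with a simple random sample taken from the whole population.

```python
import random
import statistics

random.seed(1)

# Hypothetical salaries (in lakh rupees per year); all figures are invented.
city_means = {"Gurgaon": 18, "Pune": 12, "Kochi": 9, "Jaipur": 8}
population = [
    (city, random.gauss(mean, 2))
    for city, mean in city_means.items()
    for _ in range(5_000)
]

all_salaries = [salary for _, salary in population]
true_mean = statistics.fmean(all_salaries)

# Second method from the text: sample professionals from one city only.
gurgaon_only = [salary for city, salary in population if city == "Gurgaon"]
biased_sample = random.sample(gurgaon_only, 200)

# Alternative: a simple random sample drawn from the whole population.
random_sample = random.sample(all_salaries, 200)

biased_error = abs(statistics.fmean(biased_sample) - true_mean)
random_error = abs(statistics.fmean(random_sample) - true_mean)

print(f"true mean:                  {true_mean:.2f}")
print(f"error of single-city sample: {biased_error:.2f}")
print(f"error of random sample:      {random_error:.2f}")
```

Both samples have the same size, so the difference in error comes from how the sample was chosen, not from how much data was collected.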

Purpose of Inferential Statistics

There are two primary purposes of inferential statistics. The first is parameter estimation: as in the examples above, we use a statistic computed from our sample, like the sample standard deviation, to estimate the corresponding population parameter, such as the standard deviation of the whole population.

The second is hypothesis testing. This is especially useful when something can only be tried on a small group, like a new diabetes medication. The information gathered from that group can be used to judge whether the medication is likely to be effective for the "full population" of diabetic patients (for example, by computing a test statistic such as a z-score).
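As a rough sketch of parameter estimation, the snippet below computes point estimates of the mean and standard deviation from a synthetic sample, plus an approximate 95% confidence interval for the mean. The data and its parameters are invented, and the interval relies on the normal approximation, which we assume is acceptable for a sample of this size.

```python
import math
import random
import statistics
from statistics import NormalDist

random.seed(7)

# Synthetic sample of 100 measurements (invented for illustration).
sample = [random.gauss(50, 8) for _ in range(100)]

n = len(sample)
sample_mean = statistics.fmean(sample)
sample_sd = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)

# Approximate 95% confidence interval for the population mean,
# using the normal approximation (an assumption of this sketch).
z_crit = NormalDist().inv_cdf(0.975)
margin = z_crit * sample_sd / math.sqrt(n)
ci_low, ci_high = sample_mean - margin, sample_mean + margin

print(f"estimated mean: {sample_mean:.2f}")
print(f"estimated SD:   {sample_sd:.2f}")
print(f"95% CI for the mean: ({ci_low:.2f}, {ci_high:.2f})")
```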

Applications in Machine Learning

A machine learning model is limited in the same way we are: it cannot look at the whole world before reaching a conclusion on a particular topic. It must take a sample, analyze the data, and then infer (using inferential statistics) the remaining information it needs. If the sample is very limited, incomplete, or biased, the model's "predictions" can be unreliable.

Types of Inferential Statistics

Inferential statistics can be divided into hypothesis testing and regression analysis. Hypothesis testing also involves using confidence intervals to test population parameters. The different types of inferential statistics are described below.

The Hypothesis is of Two Types:

Null Hypothesis: The null hypothesis assumes that any pattern in the sample observations is due purely to chance. It is denoted H0.

Alternative Hypothesis: The alternative hypothesis assumes that the sample observations are not due to chance alone and that some non-random cause affects them. It is denoted H1 or Ha.

Steps of Hypothesis Testing

The process of determining whether or not to reject the null hypothesis based on sample data is called hypothesis testing. It consists of four steps:

Define the null and alternative hypotheses
Define an analysis plan that determines how to use the sample data to evaluate the null hypothesis
Analyze the sample data to produce a single number called a "test statistic"
Interpret the result by applying a decision rule to decide whether or not to reject the null hypothesis

If the p-value of the test statistic is less than the significance level, we reject the null hypothesis; otherwise, we fail to reject the null hypothesis.

Technically, we never accept the null hypothesis. We say that we either fail to reject or reject the null hypothesis.
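The four steps can be sketched in code. The scenario is hypothetical: the sample values, the hypothesized mean of 100, and the known population standard deviation of 15 are all assumptions made for illustration, and a two-tailed one-sample z-test is just one possible analysis plan.

```python
import math
from statistics import NormalDist, fmean

# Step 1: state the hypotheses (hypothetical scenario).
# H0: the population mean is 100.   H1: the population mean is not 100.
mu_0 = 100.0
sigma = 15.0   # population standard deviation, assumed known
alpha = 0.05   # significance level

# Step 2: analysis plan: a two-tailed one-sample z-test on the sample mean.
sample = [112, 98, 105, 120, 101, 95, 110, 108, 117, 103,
          99, 114, 109, 106, 111, 97, 118, 104, 107, 113]

# Step 3: compute the test statistic.
n = len(sample)
z = (fmean(sample) - mu_0) / (sigma / math.sqrt(n))

# Step 4: apply the decision rule using the p-value.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
decision = "reject H0" if p_value < alpha else "fail to reject H0"

print(f"z = {z:.3f}, p-value = {p_value:.4f} -> {decision}")
```

Note that the decision is phrased as "reject" or "fail to reject", never "accept", in line with the point above.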

Terms in Hypothesis Testing

Significance Level

The significance level is the probability of rejecting the null hypothesis when it is actually true. For example, a significance level of 0.05 means that there is a 5% risk of concluding that there is a difference when there is no difference. It is denoted alpha (α).

The figure above shows two shaded regions equidistant from the value stated by the null hypothesis, each with a probability of 0.025, for a total of 0.05, which is our significance level. In a two-tailed test, these shaded regions together form the critical region.
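As an illustration, the boundaries of the critical region can be computed directly. This sketch assumes the test statistic follows a standard normal distribution; the helper name `in_critical_region` is our own.

```python
from statistics import NormalDist

alpha = 0.05
std_normal = NormalDist()  # mean 0, standard deviation 1

# Two-tailed test: alpha is split evenly, 0.025 in each tail.
upper_critical = std_normal.inv_cdf(1 - alpha / 2)
lower_critical = -upper_critical

def in_critical_region(z):
    """Return True if z falls in either shaded tail."""
    return z < lower_critical or z > upper_critical

print(f"critical values: {lower_critical:.3f} and {upper_critical:.3f}")
print(in_critical_region(2.5), in_critical_region(1.0))
```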

P-value

The p-value is the probability that the test statistic will be at least as extreme as the calculated value if the null hypothesis is true. A sufficiently low p-value is a reason to reject the null hypothesis. We reject the null hypothesis if the p-value is less than the significance level.
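A small sketch of how a p-value can be obtained from a test statistic, assuming the statistic is a z-value that follows the standard normal distribution under the null hypothesis. The function name `p_value` and the example z-values are our own.

```python
from statistics import NormalDist

def p_value(z, two_tailed=True):
    """p-value of a z-statistic under the standard normal distribution.

    The one-tailed value assumes the observed effect lies in the
    direction stated by the alternative hypothesis.
    """
    upper_tail = 1 - NormalDist().cdf(abs(z))
    return 2 * upper_tail if two_tailed else upper_tail

alpha = 0.05
for z in (0.8, 1.96, 2.5):
    p = p_value(z)
    verdict = "reject H0" if p < alpha else "fail to reject H0"
    print(f"z = {z:4.2f}  p = {p:.4f}  ->  {verdict}")
```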

Errors in Hypothesis Testing

We have explained what hypothesis testing is and the steps involved. While doing hypothesis testing, however, some errors may occur.

We classify these errors into two categories.

Type 1 Error: A Type 1 error is when we reject the null hypothesis, but it is actually true. The probability of a type 1 error is called the significance level alpha(α).

Type 2 Error: A Type 2 error is when we fail to reject the null hypothesis, but it is actually false. The probability of a type 2 error is called beta(β).
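Both error rates can be estimated by simulation. The sketch below repeats a two-tailed z-test many times on synthetic data: once with the null hypothesis true, and once with it false. The sample size, number of trials, and true mean of 0.5 are arbitrary choices made for illustration.

```python
import math
import random
from statistics import NormalDist, fmean

random.seed(42)

ALPHA = 0.05
SIGMA = 1.0      # population standard deviation, assumed known
N = 25           # sample size per simulated experiment
TRIALS = 4_000
Z_CRIT = NormalDist().inv_cdf(1 - ALPHA / 2)

def rejects_h0(true_mean):
    """Run one simulated two-tailed z-test of H0: mean = 0."""
    sample = [random.gauss(true_mean, SIGMA) for _ in range(N)]
    z = fmean(sample) / (SIGMA / math.sqrt(N))
    return abs(z) > Z_CRIT

# H0 is true (the mean really is 0): every rejection is a Type 1 error.
type_1_rate = sum(rejects_h0(0.0) for _ in range(TRIALS)) / TRIALS

# H0 is false (the mean is really 0.5): every non-rejection is a Type 2 error.
type_2_rate = sum(not rejects_h0(0.5) for _ in range(TRIALS)) / TRIALS

print(f"estimated Type 1 error rate (alpha): {type_1_rate:.3f}")
print(f"estimated Type 2 error rate (beta):  {type_2_rate:.3f}")
```

The estimated Type 1 rate should land close to the chosen α, while β depends on how far the true mean is from the hypothesized one and on the sample size.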

Z-test

The Z-test is mainly used when the data is normally distributed. We find the Z-statistic of the sample mean by computing its z-score. The z-score is given by the formula,

Z-score = (x – µ) / σ

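Here x is the observed value, µ is the population mean, and σ is the population standard deviation. When x is the mean of a sample of n observations, σ is replaced by the standard error σ/√n. The z-test is mainly used when the population mean and standard deviation are known.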
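The z-score formula translates directly into code. This is a minimal sketch with invented numbers; the second function, for a sample mean, divides by the standard error σ/√n rather than by σ.

```python
import math

def z_score(x, mu, sigma):
    """Z-score of a single value: (x - mu) / sigma."""
    return (x - mu) / sigma

def z_statistic(sample_mean, mu, sigma, n):
    """Z-statistic of a sample mean, using the standard error sigma / sqrt(n)."""
    return (sample_mean - mu) / (sigma / math.sqrt(n))

# Hypothetical numbers: population mean 70, population SD 10.
print(z_score(85, 70, 10))          # 1.5
print(z_statistic(72, 70, 10, 25))  # 1.0
```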

T-test

A t-test is similar to a z-test. It is used when we have the sample standard deviation but not the population standard deviation, or when the sample size is small (n<30).

Different Types of T-tests

One Sample T-test

A one-sample t-test compares the mean of sample data with a known value. When we have to compare the mean of the sample data with the population mean, we use a one-sample t-test.

We can perform a one-sample t-test when we do not have the population standard deviation or when the sample size is less than 30.

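The t-statistic is given by:

t = (x̄ – µ) / (s / √n)

where x̄ is the sample mean, µ is the population mean under the null hypothesis, s is the sample standard deviation, and n is the sample size.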
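Here is a minimal sketch of a one-sample t-test using the standard t-statistic, t = (x̄ - µ) / (s / √n). The packet weights and the claimed mean of 500 are hypothetical, and the critical value is taken from a standard t-table rather than computed.

```python
import math
import statistics

def one_sample_t(sample, mu_0):
    """t-statistic comparing the sample mean with a hypothesized mean mu_0."""
    n = len(sample)
    mean = statistics.fmean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    return (mean - mu_0) / (s / math.sqrt(n))

# Hypothetical data: weights (in grams) of 10 packets labeled as 500 g.
sample = [498, 502, 495, 501, 499, 497, 503, 496, 500, 494]
t_stat = one_sample_t(sample, 500)

# Two-tailed critical value for alpha = 0.05 and n - 1 = 9 degrees of freedom,
# read from a standard t-table.
T_CRIT = 2.262
decision = "reject H0" if abs(t_stat) > T_CRIT else "fail to reject H0"

print(f"t = {t_stat:.3f} -> {decision}")
```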

Two-Sample T-Test

We use the two-sample T-test when we want to evaluate whether the means of two samples are different or not. The two-sample T-test has two categories:

Independent Sample T-Test: The two samples should be drawn independently from two entirely separate populations. In other words, one group must not depend on the other.

Paired T-Test: If our samples are related in some way, we need to use a paired t-test. Here, related means that the samples are linked because we collect data from the same group twice, e.g., a blood test of patients in a hospital before and after medication.
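The two variants can be sketched as follows. For the independent case we use the pooled-variance form of the statistic, which assumes both populations have equal variance; that choice is ours, not something stated above. All data values are invented.

```python
import math
import statistics

def independent_t(a, b):
    """Pooled-variance t-statistic (assumes equal population variances)."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * statistics.variance(a)
                  + (n2 - 1) * statistics.variance(b)) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (statistics.fmean(a) - statistics.fmean(b)) / se

def paired_t(before, after):
    """Paired t-statistic: a one-sample t-test on the per-subject differences."""
    diffs = [y - x for x, y in zip(before, after)]
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.fmean(diffs) / se

# Independent samples: scores from two unrelated groups (hypothetical).
group_a = [71, 72, 73, 74, 75]
group_b = [73, 74, 75, 76, 77]
t_independent = independent_t(group_a, group_b)

# Paired samples: the same five patients before and after medication (hypothetical).
before = [140, 150, 135, 160, 145]
after = [132, 145, 130, 150, 139]
t_paired = paired_t(before, after)

print(f"independent t = {t_independent:.3f}")
print(f"paired t      = {t_paired:.3f}")
```

The paired version works on the differences within each pair, which is why it needs the two lists to be in the same subject order.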

Chi-square Test

The chi-square test is used when we have to compare categorical data. There are two types of chi-square test. Both use the chi-square statistic and distribution, but for different purposes.

The Goodness of Fit

Determines whether the sample data of the categorical variables match the population.

Test of Independence

Compares two categorical variables to see whether they are related.

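The chi-square statistic is given by:

χ² = Σ (O – E)² / E

where O is the observed count and E is the expected count in each category, summed over all categories.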
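A short sketch of both chi-square tests, using the standard statistic χ² = Σ (O - E)² / E. The die rolls and the 2x2 table are invented, the helper names are our own, and the critical values are taken from a standard chi-square table.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def expected_counts(table):
    """Expected cell counts of a contingency table if the variables are independent."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

# Goodness of fit: is a die fair? Hypothetical counts from 60 rolls.
observed = [8, 12, 9, 11, 14, 6]
expected = [10] * 6
chi2_fit = chi_square_statistic(observed, expected)
CRIT_DF5 = 11.070  # alpha = 0.05, 6 - 1 = 5 degrees of freedom (from a table)

# Test of independence: hypothetical 2x2 table of two categorical variables.
table = [[30, 20],
         [20, 30]]
flat_observed = [o for row in table for o in row]
flat_expected = [e for row in expected_counts(table) for e in row]
chi2_ind = chi_square_statistic(flat_observed, flat_expected)
CRIT_DF1 = 3.841   # alpha = 0.05, (2 - 1) * (2 - 1) = 1 degree of freedom

print(f"goodness of fit: chi2 = {chi2_fit:.2f}, reject H0: {chi2_fit > CRIT_DF5}")
print(f"independence:    chi2 = {chi2_ind:.2f}, reject H0: {chi2_ind > CRIT_DF1}")
```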

ANOVA (Analysis of Variance)

An ANOVA test can be used to assess the significance of experimental data. It is typically applied when there are more than two groups and we need to determine whether their population means are equal, which it does by comparing the variance between groups with the variance within groups.

For instance, students from several colleges sit for the same exam, and we want to see whether one college does better than the others.
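The college example can be sketched with a one-way ANOVA, where the F-statistic is the ratio of between-group to within-group variance. The scores are invented, one-way ANOVA is our assumed choice of test, and the critical value comes from a standard F-table.

```python
import statistics

def one_way_anova_f(groups):
    """F-statistic for a one-way ANOVA over a list of groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = statistics.fmean([x for g in groups for x in g])

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - statistics.fmean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

# Hypothetical exam scores from three colleges.
college_a = [70, 72, 74]
college_b = [75, 77, 79]
college_c = [80, 82, 84]

f_stat = one_way_anova_f([college_a, college_b, college_c])

# Critical value for alpha = 0.05 with (2, 6) degrees of freedom, from an F-table.
F_CRIT = 5.14
decision = "reject H0" if f_stat > F_CRIT else "fail to reject H0"

print(f"F = {f_stat:.2f} -> {decision}")
```

A large F means the group means differ by more than the spread within each group would explain. It does not say which college differs from the others.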

There are Two Types of ANOVA Tests: