Introduction
Statistics, a sub-field of mathematics, can be defined as the practice or science of collecting and analyzing numerical data in large quantities. Machine learning (ML) is a subset of artificial intelligence that uses algorithms to perform a specific task without explicit instructions. Statistical methods guide how the raw data available for machine learning is used, analyzed, and presented. Because ML takes a statistical approach, it has been applied successfully to problems such as speech analysis and computer vision. Statistical analysis examines data through a sample, so its conclusions depend on how well that sample represents the whole.
One does not need to be a renowned statistician to apply the statistical methods used in machine learning. They can be mastered gradually with the help of programming and other available tools.
What is Statistics?
Statistics is a branch of mathematics that deals with the collection, analysis, interpretation, and visualization of empirical data. Descriptive statistics and inferential statistics are its two main areas. Descriptive statistics describes the characteristics of the sample and population data (what happened). Inferential statistics uses these characteristics to test hypotheses, draw conclusions, and make predictions (what to expect).
Using statistics in machine learning
- Asking questions about data
- Data cleaning and preprocessing
- Choosing the right features
- Model evaluation
- Model predictions
The Role of Statistics in Machine Learning
Statistical learning is crucial in many areas of science, finance, and industry. Here are some examples of learning problems:
- Predict whether a patient hospitalized for a heart attack will have a second heart attack. The prediction should be based on the patient's demographic, dietary, and clinical measurements.
- Predict the share price in 6 months based on company performance measurements and economic data.
- Identify numbers in a handwritten zip code from a digitized image.
- Estimate the amount of glucose in a diabetic's blood from the infrared absorption spectrum of that person's blood.
- Identify risk factors for prostate cancer based on clinical and demographic variables.
The science of learning from data plays a key role in statistics, data mining, and artificial intelligence, and it intersects with engineering and other disciplines.
With this basic understanding, it's time to dive deep into learning all the critical concepts related to statistics for machine learning.
Statistics and Their Types
Statistics is a discipline concerned with data collection, organization, analysis, interpretation, and presentation. There are 2 types of statistics.
Descriptive statistics
Descriptive statistics is the understanding, analysis, and summarization of data in the form of numbers and graphs. We analyze different kinds of data (numerical and categorical) using charts and graphs such as bar charts, pie charts, scatter plots, and histograms. All of this interpretation and visualization is part of descriptive statistics. Descriptive statistics can be computed on either sample or population data, but in practice we rarely have access to the full population.
Inferential statistics
In inferential statistics, we draw a sample from the population and use it to reach conclusions about the population as a whole. We test hypotheses on the sample data and generalize the result to the population the sample came from. Various techniques support this process, including data visualization and data manipulation.
Now let's discuss the type of data a machine learning engineer receives.
Data Types
There are 2 types of data that we need to handle and analyze: numerical and categorical.
Numerical Data –
Numerical data consists of numbers, such as counts or measurements. It is divided into 2 categories: discrete and continuous numerical variables.
1. Discrete Numeric Variables – Discrete variables take values from a finite or countable set, such as class rank or the number of faculty in a department.
2. Continuous Numeric Variables – Continuous variables can take any value within a range, so the number of possible values is infinite; an employee's salary is an example.
Categorical data –
Categorical data represents categories and is usually stored as strings or characters, such as a name or a color. It also comes in 2 types.
1. Ordinal Variables – An ordinal categorical variable has values with a meaningful order, such as a student's grade (A, B, C) or a level (high, medium, low).
2. Nominal Variables – Nominal variables are variables that cannot be ordered; they simply contain names or a series of categories, such as the name of a color, objects, etc.
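The difference between ordinal and nominal variables can be sketched in a few lines of Python. This is only an illustration: the grade ranking and the sample values below are made up, and the name `GRADE_RANK` is an assumption introduced for this example.

```python
from collections import Counter

# Hypothetical ordinal variable: grades have a meaningful order (C < B < A),
# so we can attach a rank to each category and sort by it.
GRADE_RANK = {"C": 0, "B": 1, "A": 2}
grades = ["B", "A", "C", "A"]
sorted_grades = sorted(grades, key=lambda g: GRADE_RANK[g])

# Hypothetical nominal variable: colors have no order, so the meaningful
# operations are counting and comparing for equality.
colors = ["red", "blue", "red", "green"]
color_counts = Counter(colors)

print(sorted_grades)
print(color_counts.most_common(1))
```

Sorting makes sense for the grades because their order carries information; for the colors, only frequencies are meaningful.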
Elements of Statistical Learning
1. Measures of central tendency
A measure of central tendency gives an idea of where the center of the data lies. The most common measures are the mean, median, and mode.
Mean
The mean is the arithmetic average of a set of data. To calculate the mean, add the values and divide by the number of values. The sample mean is the arithmetic mean of the sample and is denoted by x̄ ("x-bar"). The population mean is the arithmetic mean of the population and is denoted by μ ("mu").
Median
The median is the middle value of the data once the values have been sorted in ascending order, provided there is an odd number of values. If there is an even number, the median is the average of the two middle values. In the income example, when the data are sorted in ascending order, the two middle values are $32,100 and $32,200, and their average, $32,150, is the median income.
Mode
The mode is the data value that appears with the highest frequency. Modes can be found for both quantitative and categorical variables, whereas means and medians can be computed only for quantitative variables. In the income example there is no mode, because every income value appears only once. For the year variable, the mode is 2010, with a frequency of 4.
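The three measures can be computed with Python's standard `statistics` module. The sketch below uses made-up `incomes` and `years` lists, not the article's income data, so the results differ from the figures quoted above.

```python
import statistics

# Made-up example data (not the article's income table).
incomes = [24_000, 28_000, 30_000, 32_000, 36_000, 42_000]
years = [2010, 2011, 2010, 2012, 2010, 2011]

mean_income = statistics.mean(incomes)      # sum of values / number of values
median_income = statistics.median(incomes)  # average of the two middle values

# multimode returns every value tied for the highest frequency.
# When all values are unique, it returns all of them, which signals
# that there is no single meaningful mode.
income_modes = statistics.multimode(incomes)
year_modes = statistics.multimode(years)

print(mean_income, median_income)
print(len(income_modes) == len(incomes))  # True: no income repeats
print(year_modes)
```

With an even number of incomes, the median is the average of the two middle values (30,000 and 32,000 in this made-up list).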
2. Measures of variability: Range, Variance, Standard Deviation
Measures of variability quantify the amount of variation, spread, or dispersion in the data.
Range
The range of a variable is equal to the difference between the maximum and minimum values. The income range is:
range(income) = max(income) − min(income) = 48,000 − 24,000 = $24,000
The range only reflects the difference between the largest and smallest observations; it says nothing about how the remaining values are distributed around the center.
Variance
The population variance is defined as the mean of the squares of the differences from the mean, denoted as σ² ("sigma-squared"):
σ² = Σ (xᵢ − μ)² / N
where the sum runs over all N values in the population and μ is the population mean.
A larger variance means the data are more spread out.
The sample variance (s²) is approximately the mean of the squared deviations, with N replaced by n − 1: s² = Σ (xᵢ − x̄)² / (n − 1). This difference arises because the sample mean x̄ is used as an approximation of the true population mean.
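The difference between dividing by N and dividing by n − 1 can be sketched directly. The function names and the `data` list are made up for illustration; the standard library's `statistics.pvariance` and `statistics.variance` compute the same quantities.

```python
import statistics

# Made-up data set with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]


def population_variance(values):
    """Mean of the squared deviations from the mean (divide by N)."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Sum of squared deviations divided by n - 1 instead of n."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)


pop_var = population_variance(data)
samp_var = sample_variance(data)

print(pop_var, statistics.pvariance(data))
print(samp_var, statistics.variance(data))
```

The sample variance is always slightly larger than the population formula applied to the same values, because it divides the same sum by a smaller number.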
Standard Deviation
The standard deviation (sd) of a set of values indicates how far the individual values typically deviate from the mean.
The sample standard deviation is the square root of the sample variance: s = √s². For the income data, for instance, the standard deviation is $7,201, so incomes typically deviate from the mean by about that amount.
The population standard deviation is the square root of the population variance: σ = √σ².
Figure: three data distributions with the same mean (100) and different standard deviations (5, 10, 20).
The closer the data points are to the mean, the narrower the peak, and the smaller the standard deviation. The standard deviation increases with the distance between the data points and the mean.
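To see the same mean paired with different spreads, here is a small sketch with three made-up data sets. They are chosen so that each has a mean of 100 and population standard deviations of 5, 10, and 20; they are not the distributions from the original figure.

```python
import statistics

# Made-up data sets: same mean, increasingly distant from it.
datasets = {
    "narrow": [95, 105, 95, 105],
    "medium": [90, 110, 90, 110],
    "wide": [80, 120, 80, 120],
}

means = {name: statistics.mean(values) for name, values in datasets.items()}
spreads = {name: statistics.pstdev(values) for name, values in datasets.items()}

for name in datasets:
    print(name, means[name], spreads[name])
```

The mean alone cannot tell these data sets apart; the standard deviation grows as the points move farther from 100.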
3. Rank Measures: Percentile, Z-score, Quartiles
Rank measures indicate where a specific data value falls in the data distribution in relation to other data values.
Percentile
A data value is the pth percentile of a data set if p percent of the values in the data set are at or below it. The median is the 50th percentile. The median income, for instance, is $32,150, and 50% of the data values are at or below that level.
Percentile ranking
The percentile rank of a given data value is the percentage of values in the data set that are at or below that value. For instance, the percentile rank of Applicant 1's income of $38,000 is 90%, because 90% of the incomes are equal to or below that amount.
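The percentile rank definition translates into a short function. The `percentile_rank` helper and the `incomes` list are made up for this sketch; the values were chosen so that $38,000 lands at the 90% rank mentioned above, and they are not the article's actual data.

```python
def percentile_rank(values, x):
    """Percentage of values in the data set that are at or below x."""
    at_or_below = sum(1 for v in values if v <= x)
    return 100 * at_or_below / len(values)


# Made-up incomes for ten hypothetical applicants.
incomes = [24_000, 25_000, 27_000, 30_000, 32_100,
           32_200, 34_000, 36_000, 38_000, 48_000]

rank_38k = percentile_rank(incomes, 38_000)
print(rank_38k)
```

Nine of the ten made-up incomes are at or below 38,000, which gives a percentile rank of 90.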
Interquartile range (IQR)
The 25th percentile of the data set is represented by the first quartile (Q1), the median is represented by the second quartile (Q2), and the third quartile (Q3) is the 75th percentile.
The IQR measures the spread of the middle 50% of the data and is calculated as IQR = Q3 − Q1, the difference between the 75th and 25th percentiles.
A data value x is flagged as an outlier if either x ≤ Q1 − 1.5 (IQR) or x ≥ Q3 + 1.5 (IQR).
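The outlier rule can be sketched with `statistics.quantiles` from the standard library. The `iqr_outliers` helper and the `data` list are made up for illustration. Note that several conventions exist for computing quartiles; this sketch assumes the `"inclusive"` method, and other methods can give slightly different Q1 and Q3 values.

```python
import statistics


def iqr_outliers(values):
    """Return Q1, Q3, the IQR, and the values flagged by the 1.5 * IQR rule."""
    # n=4 splits the data into quartiles and returns the cut points Q1, Q2, Q3.
    q1, _q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    outliers = [x for x in values if x <= lower or x >= upper]
    return q1, q3, iqr, outliers


# Made-up data with one value far from the rest.
data = [10, 12, 13, 14, 15, 16, 18, 50]
q1, q3, iqr, outliers = iqr_outliers(data)

print(q1, q3, iqr)
print(outliers)
```

Because the rule is built on quartiles rather than on the mean, the extreme value barely shifts the fences that are used to detect it.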
Z-score
The z-score for a given data value indicates how many standard deviations above or below the mean the data value is.
The z-score is computed as z = (data value − mean) / standard deviation. If z is positive, the value is above the mean; if z is negative, it is below the mean. For Applicant 6, the z-score is (24,000 − 32,540) / 7,201 ≈ −1.2, so this applicant's income is about 1.2 standard deviations below the mean.
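The Applicant 6 calculation can be reproduced with a small helper. The function name `z_score` is an assumption for this sketch; the income, mean, and standard deviation are the figures quoted in the text.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


# Figures quoted in the text: income 24,000, mean 32,540, sd 7,201.
z_applicant_6 = z_score(24_000, mean=32_540, sd=7_201)
print(round(z_applicant_6, 1))
```

The negative sign shows that the income is below the mean, and the magnitude shows by how many standard deviations.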
Conclusion
We learned about statistics and its importance in machine learning. We started with the types of statistics and the types of data we deal with, and then covered the basic measures of central tendency, variability, and rank that help us understand the nature of the data.