Study Guide: Mathematics: Statistics & Statistical Analysis
Source: https://www.fatskills.com/teaching/chapter/mathematics-statistics-statistical-analysis

Mathematics: Statistics & Statistical Analysis

By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.


Statistics
Statistics is the branch of mathematics that deals with collecting, recording, interpreting, illustrating, and analyzing large amounts of data.
The following terms are often used in the discussion of data and statistics:

Data – the collective name for pieces of information (singular is datum).
Quantitative data – measurements (such as length, mass, and speed) that provide information about quantities in numbers.
Qualitative data – information (such as colors, scents, tastes, and shapes) that cannot be measured using numbers.
Discrete data – information that can be expressed only by a specific value, such as whole or half numbers. For example, since people can be counted only in whole numbers, a population count would be discrete data.
Continuous data – information (such as time and temperature) that can be expressed by any value within a given range.
Primary data – information that has been collected directly from a survey, investigation, or experiment, such as a questionnaire or the recording of daily temperatures. Primary data that has not yet been organized or analyzed is called raw data.
Secondary data – information that has already been collected, sorted, and processed by someone other than the researcher who is using it.
Ordinal data – information that can be placed in numerical order, such as age or weight.
Nominal data – information that cannot be placed in numerical order, such as names or places.

Data Collection

Population

In statistics, the population is the entire collection of people, plants, etc., that data can be collected from. For example, a study to determine how well students in local schools perform on a standardized test would have a population of all the students enrolled in those schools, although a study may include just a small sample of students from each school. A parameter is a numerical value that gives information about the population, such as the mean, median, mode, or standard deviation. Remember that the symbol for the mean of a population is μ and the symbol for the standard deviation of a population is σ.

Sample
A sample is a portion of the entire population.

Whereas a parameter helped describe the population, a statistic is a numerical value that gives information about the sample, such as mean, median, mode, or standard deviation. Keep in mind that the symbols for mean and standard deviation are different when they are referring to a sample rather than the entire population.

For a sample, the symbol for the mean is $\bar{x}$ and the symbol for the standard deviation is $s$.

The mean and standard deviation of a sample may or may not be identical to those of the entire population because a sample is only a subset of the population. However, if the sample is random and large enough, the sample statistics can give reliable estimates of the population parameters. Samples are generally used when the population is too large to justify including every element or when acquiring data for the entire population is impossible.

Inferential Statistics
Inferential statistics is the branch of statistics that uses samples to make predictions about an entire population.
This type of statistic is often seen in political polls, where a sample of the population is questioned about a particular topic or politician to gain an understanding about the attitudes of the entire population of the country. Often, exit polls are conducted on election days using this method. Inferential statistics can have a large margin of error if you do not have a valid sample.

Sampling Distribution
Statistical values calculated from various samples of the same size make up the sampling distribution. For example, if several samples of identical size are randomly selected from a large population and then the mean of each sample is calculated, the distribution of values of the means would be a sampling distribution.
The sampling distribution of the mean is the distribution of the sample mean, $\bar{x}$, derived from random samples of a given size. It has three important characteristics. First, the mean of the sampling distribution of the mean is equal to the mean of the population that was sampled. Second, assuming the standard deviation is non-zero, the standard deviation of the sampling distribution of the mean equals the standard deviation of the sampled population divided by the square root of the sample size. This is sometimes called the standard error. Finally, as the sample size gets larger, the sampling distribution of the mean gets closer to a normal distribution via the central limit theorem.
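The first two characteristics can be checked exactly on a tiny example. The sketch below enumerates every possible sample of size 2, drawn with replacement, from a made-up population of six values. The population, the sample size, and all variable names are illustrative assumptions rather than part of the guide.

```python
from itertools import product
from math import sqrt
from statistics import mean, pstdev

# Hypothetical population: the six faces of a fair die.
population = [1, 2, 3, 4, 5, 6]
n = 2  # sample size, chosen only for illustration

# Mean of every possible sample of size n (sampling with replacement).
sample_means = [mean(sample) for sample in product(population, repeat=n)]

mu = mean(population)        # population mean
sigma = pstdev(population)   # population standard deviation

mean_of_means = mean(sample_means)      # center of the sampling distribution
standard_error = pstdev(sample_means)   # spread of the sampling distribution

print(mean_of_means, mu)                # both are 3.5
print(standard_error, sigma / sqrt(n))  # equal up to floating-point rounding
```

Enumerating all 36 samples avoids random noise, so the agreement here is exact rather than approximate.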

Survey Study
A survey study is a method of gathering information from a small group in an attempt to gain enough information to make accurate general assumptions about the population. Once a survey study is completed, the results are then put into a summary report.

Survey studies are generally in the format of surveys, interviews, or questionnaires as part of an effort to find opinions of a particular group or to find facts about a group.
It is important to note that the findings from a survey study are only as accurate as the sample chosen from the population.

Correlational Studies
Correlational studies seek to determine how strongly changes in one variable are associated with changes in a second variable. For example, correlational studies may look for a relationship between the amount of time a student spends studying for a test and the grade that student earned on the test or between student scores on college admissions tests and student grades in college.
It is important to note that correlational studies cannot show a cause and effect, but rather can show only that two variables are or are not potentially correlated.

Experimental Studies
Experimental studies take correlational studies one step farther, in that they attempt to prove or disprove a cause-and-effect relationship. These studies are performed by conducting a series of experiments to test the hypothesis. For a study to be scientifically accurate, it must have both an experimental group that receives the specified treatment and a control group that does not get the treatment. This is the type of study pharmaceutical companies do as part of drug trials for new medications. Experimental studies are only valid when proper scientific method has been followed. In other words, the experiment must be well-planned and executed without bias in the testing process, all subjects must be selected at random, and the process of determining which subject is in which of the two groups must also be completely random.

Observational Studies
Observational studies are the opposite of experimental studies.
In observational studies, the tester cannot change or in any way control all of the variables in the test. For example, a study to determine which gender does better in math classes in school is strictly observational. You cannot change a person's gender, and you cannot change the subject being studied. The big downfall of observational studies is that you have no way of proving a cause-and-effect relationship because you cannot control outside influences. Events outside of school can influence a student's performance in school, and observational studies cannot take that into consideration.

Random Samples
For most studies, a random sample is necessary to produce valid results. Random samples should not have any particular influence to cause sampled subjects to behave one way or another. The goal is for the random sample to be a representative sample, or a sample whose characteristics give an accurate picture of the characteristics of the entire population. To accomplish this, you must make sure you have a proper sample size, or an appropriate number of elements in the sample.
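A simple random sample gives every member of the population the same chance of being chosen. As a minimal sketch, assuming a hypothetical population of 1,000 ID numbers and an arbitrary sample size of 50, Python's standard `random` module can draw such a sample without replacement:

```python
import random

# Hypothetical population: ID numbers 1 through 1000.
population = list(range(1, 1001))

# A seeded generator makes the draw repeatable; the seed value is arbitrary.
rng = random.Random(2024)

# Draw 50 distinct members; each ID is equally likely to be chosen.
sample = rng.sample(population, k=50)

print(len(sample))       # 50
print(sorted(sample)[:5])  # the five smallest IDs drawn
```

Whether 50 is a proper sample size depends on the study; the number here is only a placeholder.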


In statistical studies, biases must be avoided. Bias is an error that causes the study to favor one set of results over another. For example, if a survey to determine how the country views the president's job performance only speaks to registered voters in the president's party, the results will be skewed because a disproportionately large number of responders would tend to show approval, while a disproportionately large number of people in the opposite party would tend to express disapproval. Extraneous variables are, as the name implies, outside influences that can affect the outcome of a study. They are not always avoidable and can trigger bias in the

 

Statistical Analysis

Measures of Central Tendency

A measure of central tendency is a statistical value that gives a reasonable estimate for the center of a group of data. There are several different ways of describing the measure of central tendency. Each one has a unique way it is calculated, and each one gives a slightly different perspective on the data set. Whenever you give a measure of central tendency, always make sure the units are the same. If the data has different units, such as hours, minutes, and seconds, convert all the data to the same unit, and use the same unit in the measure of central tendency. If no units are given in the data, do not give units for the measure of central tendency.

Mean
The statistical mean of a group of data is the same as the arithmetic average of that group. To find the mean of a set of data, first convert each value to the same units, if necessary.
Then find the sum of all the values, and count the total number of data values, making sure you take into consideration each individual value. If a value appears more than once, count it more than once. Divide the sum of the values by the total number of values and apply the units, if any.

Note that the mean does not have to be one of the data values in the set, and may not divide evenly.

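For instance, the mean of the data set {88, 72, 61, 90, 97, 68, 88, 79, 86, 93, 97, 71, 80, 84, 89} would be the sum of the fifteen numbers divided by 15:

$\bar{x} = \frac{88 + 72 + 61 + 90 + 97 + 68 + 88 + 79 + 86 + 93 + 97 + 71 + 80 + 84 + 89}{15} = \frac{1243}{15} \approx 82.9$
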
While the mean is relatively easy to calculate and averages are understood by most people, the mean can be very misleading if used as the sole measure of central tendency. If the data set has outliers (data values that are unusually high or unusually low compared to the rest of the data values), the mean can be very distorted, especially if the data set has a small number of values. If unusually high values are countered with unusually low values, the mean is not affected as much.

For example, if five of twenty students in a class get a 100 on a test, but the other 15 students have an average of 60 on the same test, the class average would appear as 70. Whenever the mean is skewed by outliers, it is always a good idea to include the median as an alternate measure of central tendency.
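The two calculations in this section can be reproduced in a few lines. This is a sketch only: giving each of the fifteen lower-scoring students exactly 60 is a simplifying assumption, since the text states only their average.

```python
from statistics import mean, median

# Data set from the mean example.
scores = [88, 72, 61, 90, 97, 68, 88, 79, 86, 93, 97, 71, 80, 84, 89]
data_mean = sum(scores) / len(scores)  # 1243 / 15

# Five students score 100; the other fifteen are assumed to score exactly 60.
class_scores = [100] * 5 + [60] * 15
class_mean = mean(class_scores)
class_median = median(class_scores)

print(round(data_mean, 2))  # 82.87
print(class_mean)           # 70
print(class_median)         # 60.0
```

The median of 60 describes the typical student in this class better than the mean of 70, which is pulled upward by the five perfect scores.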

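A weighted mean, or weighted average, is a mean that uses 'weighted' values. The formula is $\bar{x}_w = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n}$. Weighted values, such as $w_1, w_2, w_3, \ldots, w_n$, are assigned to each member of the set $x_1, x_2, x_3, \ldots, x_n$. If calculating weighted mean, make sure a weight value for each member of the set is used.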
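As a short sketch, a weighted mean is the sum of each weight times its value, divided by the sum of the weights. The function name and the course-grade numbers below are hypothetical.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("each value needs exactly one weight")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight

# Hypothetical course grade: homework 80 (weight 1), final exam 90 (weight 3).
course_grade = weighted_mean([80, 90], [1, 3])
print(course_grade)  # 87.5
```

The length check enforces the reminder in the text that every member of the set needs a weight.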

Median
The statistical median is the value in the middle of the set of data.
To find the median, list all data values in order from smallest to largest or from largest to smallest. Any value that is repeated in the set must be listed the number of times it appears. If there are an odd number of data values, the median is the value in the middle of the list. If there is an even number of data values, the median is the arithmetic mean of the two middle values.
For example, the median of the data set {88, 72, 61, 90, 97, 68, 88, 79, 86, 93, 97, 71, 80, 84, 88} is 86 since the ordered set is {61, 68, 71, 72, 79, 80, 84, 86, 88, 88, 88, 90, 93, 97, 97}.
The big disadvantage of using the median as a measure of central tendency is that it relies solely on a value's relative size as compared to the other values in the set. When the individual values in a set of data are evenly dispersed, the median can be an accurate tool. However, if there is a group of rather large values or a group of rather small values that are not offset by a different group of values, the information that can be inferred from the median may not be accurate because the distribution of values is skewed.
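The procedure for finding the median can be sketched directly: sort the values, then take the middle one, or the mean of the two middle ones. The helper name `median_by_hand` is an illustrative choice; the standard library's `statistics.median` gives the same result.

```python
from statistics import median

# Data set from the median example.
scores = [88, 72, 61, 90, 97, 68, 88, 79, 86, 93, 97, 71, 80, 84, 88]

def median_by_hand(values):
    """Sort, then return the middle value (odd count) or the mean of the two middle values (even count)."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median_by_hand(scores))        # 86
print(median_by_hand([1, 2, 3, 4]))  # 2.5
```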

Mode
The statistical mode is the data value that occurs the greatest number of times in the data set.
It is possible to have exactly one mode, more than one mode, or no mode. To find the mode of a set of data, arrange the data like you do to find the median (all values in order, listing all multiples of data values). Count the number of times each value appears in the data set. If all values appear an equal number of times, there is no mode. If one value appears more than any other value, that value is the mode. If two or more values appear the same number of times, but there are other values that appear fewer times and no values that appear more times, all of those values are the modes.
For example, the mode of the data set {88, 72, 61, 90, 97, 68, 88, 79, 86, 93, 97, 71, 80, 84, 88} is 88.
The main disadvantage of the mode is that the values of the other data in the set have no bearing on the mode. The mode may be the largest value, the smallest value, or a value anywhere in between in the set. The mode only tells which value or values, if any, occurred the greatest number of times. It does not give any suggestions about the remaining values in the set.
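The counting rules for the mode can be sketched as follows. The function name `modes` is hypothetical, and returning an empty list when every value appears equally often follows this guide's "no mode" convention (some tools instead report every value as a mode in that case).

```python
from collections import Counter

def modes(values):
    """Return all modes in ascending order, or [] when no value stands out."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    # Every distinct value appears equally often: no mode, per the text.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)

scores = [88, 72, 61, 90, 97, 68, 88, 79, 86, 93, 97, 71, 80, 84, 88]
print(modes(scores))           # [88]
print(modes([1, 1, 2, 2, 3]))  # [1, 2]
print(modes([1, 2, 3]))        # []
```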

Dispersion
A measure of dispersion is a single value that helps to 'interpret' the measure of central tendency by providing more information about how the data values in the set are distributed about the measure of central tendency. The measure of dispersion helps to eliminate or reduce the disadvantages of using the mean, median, or mode as a single measure of central tendency, and gives a more accurate picture of the dataset as a whole. To have a measure of dispersion, you must know or calculate the range, standard deviation, or variance of the data set.

Range
The range of a set of data is the difference between the greatest and lowest values of the data in the set.
To calculate the range, you must first make sure the units for all data values are the same, and then identify the greatest and lowest values. If there are multiple data values that are equal for the highest or lowest, just use one of the values in the formula.
Write the answer with the same units as the data values you used to do the calculations.
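A minimal sketch of the range calculation, including the unit conversion step, might look like this. The durations and the conversion table are made-up illustrations.

```python
# Conversion factors to a common unit (minutes); the units chosen are assumptions.
MINUTES_PER_UNIT = {"min": 1, "h": 60}

def to_minutes(value, unit):
    """Convert a duration to minutes so all values share one unit."""
    return value * MINUTES_PER_UNIT[unit]

def data_range(values):
    """Greatest value minus lowest value."""
    return max(values) - min(values)

# Hypothetical durations recorded in mixed units.
durations = [(90, "min"), (2, "h"), (45, "min")]
minutes = [to_minutes(value, unit) for value, unit in durations]

print(data_range(minutes))  # 75 (minutes)
```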

Standard Deviation
Standard deviation is a measure of dispersion that compares all the data values in the set to the mean of the set to give a more accurate picture. To find the standard deviation of a sample, use the formula

$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$

Note that $s$ is the standard deviation of a sample, $x_i$ represents the individual values in the data set, $\bar{x}$ is the mean of the data values in the set, and $n$ is the number of data values in the set. The higher the value of the standard deviation is, the greater the spread of the data values about the mean. The units associated with the standard deviation are the same as the units of the data values.

Variance
The variance of a sample, or just variance, is the square of the standard deviation of that sample. While the mean of a set of data gives the average of the set and gives information about where a specific data value lies in relation to the average, the variance of the sample gives information about the degree to which the data values are spread out and tells you how close an individual value is to the average compared to the other values. The units associated with variance are the same as the units of the data values squared.
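The sample standard deviation and sample variance can be computed step by step: find the mean, sum the squared deviations from it, divide by n - 1, and take the square root for the standard deviation. The sketch below uses a hypothetical eight-value data set and compares the hand calculation with Python's `statistics.stdev` and `statistics.variance`.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Hypothetical sample; its mean is 5 and its squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    x_bar = mean(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)

def sample_std_dev(values):
    """Square root of the sample variance."""
    return sqrt(sample_variance(values))

s_squared = sample_variance(data)  # 32 / 7
s = sample_std_dev(data)

print(round(s_squared, 3))  # 4.571
print(round(s, 3))          # 2.138
```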

Percentile
Percentiles and quartiles are other methods of describing data within a set. Percentiles tell what percentage of the data in the set fall below a specific point. For example, achievement test scores are often given in percentiles. A score at the 80th percentile is one which is equal to or higher than 80 percent of the scores in the set. In other words, about 80 percent of the scores were at or below that score.
Quartiles are percentile groups that make up quarter sections of the data set. The first quartile is the 25th percentile.
The second quartile is the 50th percentile; this is also the median of the dataset. The third quartile is the 75th percentile.
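As a rough sketch, quartiles can be obtained with `statistics.quantiles` (Python 3.8 or later). Different textbooks and tools interpolate quartiles slightly differently, so the first and third quartiles may vary with the method; the helper `percent_below` is a hypothetical name for the "percentage of data below a point" idea.

```python
from statistics import median, quantiles

scores = [88, 72, 61, 90, 97, 68, 88, 79, 86, 93, 97, 71, 80, 84, 88]

def percent_below(values, score):
    """Percentage of values strictly below the given score."""
    return 100 * sum(1 for v in values if v < score) / len(values)

# n=4 splits the data into quarters and returns the three cut points.
q1, q2, q3 = quantiles(scores, n=4)

print(q2 == median(scores))              # True: the second quartile is the median
print(round(percent_below(scores, 90)))  # 73, so a 90 sits near the 73rd percentile
```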

Skewness
Skewness is a way to describe the symmetry or asymmetry of the distribution of values in a dataset.

If the distribution of values is symmetrical, there is no skew. In general the closer the mean of a data set is to the median of the data set, the less skew there is. Generally, if the mean is to the right of the median, the data set is positively skewed, or right-skewed, and if the mean is to the left of the median, the data set is negatively skewed, or left-skewed. However, this rule of thumb is not infallible. When the data values are graphed on a curve, a set with no skew will be symmetric about its center, as a perfect bell curve is.

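To estimate skew, one commonly used formula (the adjusted Fisher-Pearson coefficient) is:

$\text{skew} = \frac{\sqrt{n(n-1)}}{n-2} \cdot \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left(\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{3/2}}$

Note that $n$ is the number of datapoints in the set, $x_i$ is the $i$th value in the set, and $\bar{x}$ is the mean of the set.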
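Several skewness formulas are in use. The sketch below assumes one common choice, the adjusted Fisher-Pearson coefficient, which scales the ratio of the third central moment to the 3/2 power of the second central moment by sqrt(n(n - 1)) / (n - 2). The function name and the test data are illustrative.

```python
from math import sqrt
from statistics import mean

def sample_skewness(values):
    """Adjusted Fisher-Pearson skewness (one common estimate, assumed here)."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three data points")
    x_bar = mean(values)
    m2 = sum((x - x_bar) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - x_bar) ** 3 for x in values) / n  # third central moment
    if m2 == 0:
        raise ValueError("all values are identical; skewness is undefined")
    return sqrt(n * (n - 1)) / (n - 2) * m3 / m2 ** 1.5

print(sample_skewness([1, 2, 3, 4, 5]))        # 0.0 for a symmetric set
print(sample_skewness([1, 1, 1, 2, 10]) > 0)   # True: long right tail
print(sample_skewness([1, 9, 10, 10, 10]) < 0) # True: long left tail
```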

Unimodal vs. Bimodal
If a distribution has a single peak, it would be considered unimodal.
If it has two discernible peaks it would be considered bimodal. Bimodal distributions may be an indication that the set of data being considered is actually the combination of two sets of data with significant differences. A uniform distribution is a distribution in which there is no distinct peak or variation in the data. No values or ranges are particularly more common than any other values or ranges.

Outlier
An outlier is an extremely high or extremely low value in the data set.
It may be the result of measurement error, in which case, the outlier is not a valid member of the data set. However, it may also be a valid member of the distribution. Unless a measurement error is identified, the experimenter cannot know for certain if an outlier is or is not a member of the distribution. There are arbitrary methods that can be employed to designate an extreme value as an outlier. One method designates an outlier (or possible outlier) to be any value less than $Q_1 - 1.5(IQR)$ or any value greater than $Q_3 + 1.5(IQR)$, where $Q_1$ and $Q_3$ are the first and third quartiles and $IQR = Q_3 - Q_1$ is the interquartile range.
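The 1.5 × IQR rule can be sketched as below. The quartiles come from `statistics.quantiles` with its default interpolation, which is an assumption: other quartile conventions can shift the fences slightly. The readings are made up.

```python
from statistics import quantiles

def iqr_outliers(values):
    """Return values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]

# Hypothetical readings with one suspicious value.
readings = [10, 12, 12, 13, 12, 11, 14, 13, 15, 10, 100]
print(iqr_outliers(readings))  # [100]
```

A flagged value is only a candidate outlier; as the text notes, it may still be a valid member of the distribution.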

Data Analysis

Simple Regression

In statistics, simple regression is using an equation to represent a relation between independent and dependent variables. The independent variable is also referred to as the explanatory variable or the predictor and is generally represented by the variable x in the equation. The dependent variable, usually represented by the variable y, is also referred to as the response variable. The equation may be any type of function – linear, quadratic, exponential, etc. The best way to handle this task is to use the regression feature of your graphing calculator. This will easily give you the curve of best fit and provide you with the coefficients and other information you need to derive an equation.

Line of Best Fit
In a scatter plot, the line of best fit is the line that best shows the trends of the data. The line of best fit is given by the equation $\hat{y} = ax + b$, where $a$ and $b$ are the regression coefficients. The regression coefficient $a$ is also the slope of the line of best fit, and $b$ is also the y-coordinate of the point at which the line of best fit crosses the y-axis. Not every point on the scatter plot will be on the line of best fit. The differences between the y-values of the points in the scatter plot and the corresponding y-values according to the equation of the line of best fit are the residuals.
The line of best fit is also called the least-squares regression line because it is also the line that has the lowest sum of the squares of the residuals.

Correlation Coefficient
The correlation coefficient is the numerical value that indicates how strong the relationship is between the two variables of a linear regression equation. A correlation coefficient of –1 is a perfect negative correlation. A correlation coefficient of +1 is a perfect positive correlation. Correlation coefficients close to –1 or +1 are very strong correlations. A correlation coefficient equal to zero indicates there is no linear correlation between the two variables. This test is a good indicator of whether or not the equation for the line of best fit is accurate.
The formula for the correlation coefficient is

$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$

where $r$ is the correlation coefficient, $n$ is the number of data values in the set, $(x_i, y_i)$ is a point in the set, and $\bar{x}$ and $\bar{y}$ are the means.
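A graphing calculator's regression feature can be imitated in a few lines. The sketch below uses the standard least-squares solution, slope = (sum of products of deviations) / (sum of squared x deviations) and intercept = mean of y minus slope times mean of x, which the guide does not spell out. The function names and the study-time data are hypothetical.

```python
from math import sqrt
from statistics import mean

def best_fit_line(xs, ys):
    """Return (a, b) for the least-squares line y_hat = a*x + b."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    a = s_xy / s_xx
    b = y_bar - a * x_bar
    return a, b

def correlation(xs, ys):
    """Correlation coefficient r for paired data."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

hours = [1, 2, 3, 4, 5]    # hypothetical hours studied
grades = [2, 4, 5, 4, 5]   # hypothetical quiz scores

a, b = best_fit_line(hours, grades)
r = correlation(hours, grades)
residuals = [y - (a * x + b) for x, y in zip(hours, grades)]

print(round(a, 2), round(b, 2))  # 0.6 2.2
print(round(r, 3))               # 0.775
```

For a least-squares line with an intercept, the residuals sum to zero (up to rounding), which is a quick sanity check on the fit.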

Z-Score
A z-score is an indication of how many standard deviations a given value falls from the mean. To calculate a z-score, use the formula $z = \frac{x - \mu}{\sigma}$, where $x$ is the data value, $\mu$ is the mean of the data set, and $\sigma$ is the standard deviation of the population.

If the z-score is positive, the data value lies above the mean. If the z-score is negative, the data value falls below the mean. These scores are useful in interpreting data such as standardized test scores, where every piece of data in the set has been counted, rather than just a small random sample. In cases where standard deviations are calculated from a random sample of the set, the z-scores will not be as accurate.
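A z-score is the data value minus the mean, divided by the standard deviation. The short sketch below applies that to made-up test scores; the function name and the numbers (mean 500, standard deviation 100) are assumptions for illustration.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    if sigma <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mu) / sigma

# Hypothetical standardized test with mean 500 and standard deviation 100.
print(z_score(650, 500, 100))  # 1.5  (above the mean)
print(z_score(430, 500, 100))  # -0.7 (below the mean)
```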

Central Limit Theorem
According to the central limit theorem, regardless of the distribution of the population being sampled, the distribution of the sample means tends to get closer and closer to a normal distribution as the sample size gets larger and larger (intuitively, a larger sample represents more of the elements of the population). As the sample size gets larger, the distribution of the sample mean will approach a normal distribution with a mean equal to the population mean and a variance equal to the population variance divided by the sample size.
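A small simulation can illustrate the theorem. The sketch below assumes a strongly right-skewed population (an exponential distribution with mean 1), draws 5,000 samples at two sample sizes, and compares how lopsided the resulting sample means are. The population, the sample sizes, the seed, and the plain moment-based skewness helper are all illustrative assumptions.

```python
import random

def skewness(values):
    """Plain moment-based skewness: third central moment over the 3/2 power of the second."""
    n = len(values)
    x_bar = sum(values) / n
    m2 = sum((x - x_bar) ** 2 for x in values) / n
    m3 = sum((x - x_bar) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

def sample_means(sample_size, num_samples, rng):
    """Means of repeated random samples from an exponential population with mean 1."""
    return [
        sum(rng.expovariate(1.0) for _ in range(sample_size)) / sample_size
        for _ in range(num_samples)
    ]

rng = random.Random(0)  # fixed seed so the run is repeatable
means_small = sample_means(2, 5000, rng)
means_large = sample_means(30, 5000, rng)

# Larger samples give sample means that are less skewed (closer to normal)
# and still centered on the population mean of 1.
print(skewness(means_small) > skewness(means_large))  # True
print(round(sum(means_large) / len(means_large), 1))  # 1.0
```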


Practice:
P1. Suppose the class average on a final exam is 87, with a standard deviation of 2 points. Find the z-score of a student who scored 82.
P2. Given the following graph, determine the range of patient ages:
[Graph of the ages of Patients 1 through 5 not reproduced]

P3. Calculate the sample standard deviation for the dataset

 

P1. Using the formula for z-score:
$z = \frac{x - \mu}{\sigma} = \frac{82 - 87}{2} = -2.5$
P2. Patient 1 is 54 years old; Patient 2 is 55 years old; Patient 3 is 60 years old; Patient 4 is 40 years old; and Patient 5 is 25 years old. The range of patient ages is the age of the oldest patient minus the age of the youngest patient. In other words, $60 - 25 = 35$. The range of ages is 35 years.
P3. To find the standard deviation, first find the mean:

Now, apply the formula for sample standard deviation: