Study Guide: Data Science 101
Source: https://www.fatskills.com/data-science/chapter/data-science-101

Data Science 101

By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.


What is type 1 error and type 2 error?
Type 1 error: falsely concluding that the intervention was successful. Known as a false positive.

Type 2 error: falsely concluding that the intervention was not successful. Known as a false negative.

What can we do about overfitting?
> Regularization (penalizing model complexity while we're training)
> L2 regularization penalizes really big weights - complexity(model) = sum of squares of weights
> Instead of minimizing only loss, regularization minimizes loss + complexity, which is called structural risk minimization
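A minimal sketch of the "loss + complexity" idea, assuming a weighting coefficient `lam` (not named in the notes above) that controls how strongly complexity is penalized. The function names are illustrative only.

```python
def l2_penalty(weights):
    """Model complexity measured as the sum of squared weights."""
    return sum(w * w for w in weights)


def regularized_loss(data_loss, weights, lam=0.1):
    """Structural risk: data loss plus lam times the L2 complexity term.

    lam is an assumed hyperparameter; larger values penalize big weights more.
    """
    return data_loss + lam * l2_penalty(weights)


small = [0.2, -0.1, 0.3]
large = [5.0, -4.0, 3.0]

# The same data loss is penalized far more when the weights are large.
print(regularized_loss(1.0, small))
print(l2_penalty(large))  # 50.0
print(regularized_loss(1.0, large))
```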

Describe true positive, false positive, false negative, true negative
> True positive - we correctly called wolf, the town is saved.
> False positive - we called wolf falsely, the town is mad
> False negative - There was a wolf but we didn't spot it. Chickens are eaten.
> True negative - no wolf, no alarm. All is well.

What is precision?
True Positive / (True Positive + False Positive)

When you classify something as positive, how often are you right?

What is recall?
True positive / (True positive + False Negative)

Of all the things that are actually positive, how many did you correctly classify as positive?

What is an ROC curve?
A graph showing the performance of a classification model at all classification thresholds. The curve plots two parameters, true positive rate (recall) and false positive rate (1 - specificity, where specificity is the true negative rate: true negative / (true negative + false positive)), each along an axis from 0 to 1.

i.e. TPR on the y axis, and FPR on the x axis

What is false positive rate?
(false positive / (false positive + true negative))

What is the bias?
An error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).


The effect on the model because the sample systematically misrepresents the 'real' data. Most datasets are a convenience sample - the data easiest to collect

What is variance?
An error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs (overfitting).

The effect on the model because it was built from this sample rather than that sample

Variance measures how inconsistent the predictions are from one another across different training sets.

What is skewness?
Asymmetry in a statistical distribution, in which the curve appears distorted or skewed either to the left or to the right. Skewness can be quantified to define the extent to which a distribution differs from a normal distribution.

Negative skewness means the longer tail extends toward the negative (left) side; positive skewness means it extends toward the positive (right) side.

What is kurtosis?
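Often described as the sharpness of the peak of a frequency-distribution curve; more precisely, it measures how heavy the tails of the distribution are relative to a normal distribution.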

What are the different ways to handle missing values?

1. Delete the entire row/column

2. Replace by a fixed value (e.g. "unknown")

3. General statistic replacement (replace values by a statistic associated with a particular column like mean or median)

4. Grouped statistic replacement (replace values by a statistic associated with a particular group)

5. Imputation - predict values based on nearest neighbours or likelihood
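Options 1, 3 and 4 above can be sketched in plain Python as follows. The rows, column names and helper functions are hypothetical, and `None` stands in for a missing value.

```python
from statistics import mean, median

# Hypothetical records; None marks a missing income.
rows = [
    {"group": "a", "income": 10.0},
    {"group": "a", "income": None},
    {"group": "b", "income": 30.0},
    {"group": "b", "income": 50.0},
    {"group": "b", "income": None},
]


def fill_with_statistic(rows, column, stat=mean):
    """Replace missing values with one statistic computed over the whole column."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = stat(observed)
    return [dict(r, **{column: fill if r[column] is None else r[column]}) for r in rows]


def fill_with_group_statistic(rows, column, group_col, stat=mean):
    """Replace missing values with a statistic computed within each group."""
    by_group = {}
    for r in rows:
        if r[column] is not None:
            by_group.setdefault(r[group_col], []).append(r[column])
    fills = {g: stat(vals) for g, vals in by_group.items()}
    return [
        dict(r, **{column: fills[r[group_col]] if r[column] is None else r[column]})
        for r in rows
    ]


dropped = [r for r in rows if r["income"] is not None]       # 1. delete the row
overall = fill_with_statistic(rows, "income", stat=median)   # 3. general statistic
grouped = fill_with_group_statistic(rows, "income", "group") # 4. grouped statistic
print(overall[1]["income"], grouped[1]["income"], grouped[4]["income"])
```

The grouped version assumes every group has at least one observed value; a group with none would need a fallback such as the overall statistic.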

What kinds of feature transformation can you perform on numeric features?

1. Rounding: round a numeric value to a chosen number of decimal places, or make it discrete so it can be turned into a categorical later

2. Discretization: binning of a variable to become categorical for better value management

3. Scaling: change the scale of the variable so features are comparable, e.g. min-max, z-score, etc.

What are some types of discretization methods?

1. Equal-width binning - bins have equal ranges, so the binned variable has roughly the same distribution as the original variable

2. Equal-density (frequency) binning - bins have an equal number of examples/records/rows, giving a uniform distribution across bins
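A small sketch contrasting the two binning methods on data with one outlier. The function names are illustrative, and the equal-width version assumes the values are not all identical (otherwise the bin width would be zero).

```python
def equal_width_bins(values, k):
    """Assign each value to one of k bins that span equal ranges."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


values = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(equal_width_bins(values, 3))      # [0, 0, 0, 0, 0, 0, 0, 0, 2]
print(equal_frequency_bins(values, 3))  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

With equal widths the outlier leaves the middle bin empty, while equal-frequency bins stay balanced.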

What are the 5 categories of feature generation?

1. Indicator features (Attributes that isolate key information)

2. Aggregation features (Attributes to hold values aggregated across multiple rows)

3. Interaction features (attributes that highlight interactions between two or more features)

4. Expanding features (attributes that split old features)

5. External data (an underused type but can lead to some of the biggest breakthroughs in performance)

What are examples of using indicator features?

1. Threshold flags (Create yes/no features to replace numeric/categorical)

2. Special events flag (create yes/no feature to indicate special event)

3. Multi-feature flag (create yes/no feature off two or more features)

4. Multi-category flag (create a yes/no feature for two or more categorical values of a categorical attribute)

What is principal component analysis?
Maps the present high dimension feature space to another low-dimension feature space where features become more manageable

> Convert a set of values of possibly correlated features into a set of linearly uncorrelated features called principal components
> Number of principal components is less than or equal to the number of original features
> First principal component has largest possible variance (i.e. accounts for as much of the variability as possible)

The goal is to maximize the variance captured by each successive component.

What is out of bag testing/bootstrapping?
Select the training set from the dataset with replacement. Suppose there are n objects in the dataset, select an object repeatedly and put it into the training set but allow it to remain in the dataset. Stop when n objects have been selected.

Some instances will be repeated in the training set, so the ones that were never selected go into the test set, which will contain about one third of the instances (roughly 36.8% for large n).

What is Bayes' rule?
Gives the probability of an event H given evidence E, based on prior knowledge of conditions that might be related to the event.

Pr[H | E] = (Pr[E | H] * Pr[H]) / Pr[E]

Pr[H] - prior probability of H, before evidence is seen

Pr[H | E] - posterior probability of H, after evidence is seen

Pr[E | H] - likelihood of the evidence given that H is true

Pr[E] - probability of the evidence
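A worked sketch of the rule, assuming the probability of the evidence is expanded over H and not-H (the law of total probability). The numbers are made up for illustration: a condition with 1% prevalence and a test with a 90% true positive rate and a 5% false positive rate.

```python
def posterior(prior_h, p_e_given_h, p_e_given_not_h):
    """Pr[H | E] via Bayes' rule, with Pr[E] expanded over H and not-H."""
    evidence = p_e_given_h * prior_h + p_e_given_not_h * (1 - prior_h)
    return p_e_given_h * prior_h / evidence


# Hypothetical numbers, not taken from any real test.
p = posterior(prior_h=0.01, p_e_given_h=0.9, p_e_given_not_h=0.05)
print(round(p, 3))  # 0.154
```

Even with a fairly accurate test, the low prior keeps the posterior at roughly 15%.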

What is cross validation?
Divide the dataset into training and test sets in multiple different ways. Build a model from each training set and evaluate it on the corresponding test set. Use the average performance over the partitions to judge how we're doing.
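One common way to do this is k-fold cross validation. The sketch below assumes contiguous, unshuffled folds and takes hypothetical `fit` and `score` callables supplied by the caller; real data would usually be shuffled first.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for k contiguous folds over range(n)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


def cross_val_score(xs, ys, fit, score, k=5):
    """Average the score of a model fitted on each training fold."""
    scores = []
    for train, test in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(score(model, [xs[i] for i in test], [ys[i] for i in test]))
    return sum(scores) / len(scores)


def fit_mean(xs_train, ys_train):
    """Toy 'model': always predict the mean of the training targets."""
    return sum(ys_train) / len(ys_train)


def mse(model, xs_test, ys_test):
    """Mean squared error of the constant prediction on the test fold."""
    return sum((y - model) ** 2 for y in ys_test) / len(ys_test)


xs = list(range(10))
ys = [2.0 * x for x in xs]
avg_mse = cross_val_score(xs, ys, fit_mean, mse, k=5)
print(avg_mse)
```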

Models that exhibit small variance and high bias ____ the truth target.
underfit

Models that exhibit high variance and low bias _____ the truth target
overfit

What is bagging?
Bootstrap aggregating: used to reduce variance, which helps avoid overfitting. The idea is to draw several bootstrap samples from the training data and build one model on each. The models in this ensemble vote with equal weight (classification), or their predictions are averaged (regression).

Difference between bagging and boosting?
Bagging draws bootstrap samples of the data, trains a separate model on each, then votes on the result. Boosting trains models one after another and tracks which samples each model gets right or wrong. The samples it gets wrong are given heavier weights, so later models focus on them.

What is a gini index in random forests?
Used to calculate a node's purity. A Gini score of 0 is a perfect split; you don't want the classes to be evenly mixed (i.e. gini = 0.5 for two classes).

The impurity (or purity) measure used in building a decision tree in CART is the Gini index. The decision tree built by the CART algorithm is always a binary decision tree (each node has only two child nodes).
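A minimal sketch of the Gini computation, using the standard definition (1 minus the sum of squared class proportions). The function names are illustrative; a split is scored by the size-weighted impurity of its two children.

```python
from collections import Counter


def gini_impurity(labels):
    """1 minus the sum of squared class proportions in a node."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(left, right):
    """Size-weighted Gini impurity of the two children of a binary split."""
    n = len(left) + len(right)
    return len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)


print(gini_impurity(["a", "a", "b", "b"]))  # 0.5
print(gini_impurity(["a", "a", "a", "a"]))  # 0.0
print(split_gini(["a", "a"], ["b", "b"]))   # 0.0
```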

What is boosting?
Refers to any ensemble method that can combine several weak learners into a strong learner and is used to reduce bias and variance. It does this through a weighted majority vote (classification) or a weighted sum (regression). AdaBoost and gradient boosting are two popular methods.

It trains the models sequentially

What is bootstrapping?
A sampling technique with replacement. When n instances are drawn from a training set of size n, on average about 63% of the distinct instances are sampled; the remaining 37% that are not sampled are called out-of-bag instances. Since the predictor never sees the out-of-bag instances during training, it can be evaluated on these instances without the need for a separate validation set or cross validation.
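The out-of-bag fraction can be checked empirically with a short simulation. This is a sketch with an arbitrary sample size and seed; the theoretical value is (1 - 1/n)^n, which approaches 1/e (about 0.368) as n grows.

```python
import random


def bootstrap_sample(n, rng):
    """Draw n indices with replacement; return (in_bag, out_of_bag) indices."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(in_bag))
    return in_bag, out_of_bag


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000
in_bag, out_of_bag = bootstrap_sample(n, rng)
oob_fraction = len(out_of_bag) / n
print(oob_fraction)  # close to 0.37
```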

What is root mean squared error?
The standard deviation of the residuals (prediction errors). Residuals are a measure of how far from the regression line data points are; RMSE is a measure of how spread out these residuals are. In other words, it tells you how concentrated the data is around the line of best fit.

RMSE = sqrt(sum((actual - predicted)^2) / n)
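The formula can be sketched directly in code. Note that each error is squared before summing; the example values below are made up.

```python
import math


def mse(actual, predicted):
    """Mean of the squared prediction errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Square root of the MSE, in the same units as the target."""
    return math.sqrt(mse(actual, predicted))


actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]
print(mse(actual, predicted))   # 1.5
print(rmse(actual, predicted))  # about 1.2247
```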

When should you use mean squared error vs root mean squared error?
MSE is useful for comparing models; RMSE is in the same units as the target, so it is more useful for interpreting the size of the errors.

What is specificity?
true negative / (true negative + false positive)

measures the proportion of actual negatives that are correctly identified as such (e.g., the percentage of healthy people who are correctly identified as not having the condition).

What is accuracy?
(tp + tn) / (tp + fp + fn + tn)

What is an f-1 score?
The harmonic mean of precision and sensitivity (recall)

F1 = 2TP / (2TP + FP + FN)

F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score
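The confusion-matrix formulas in this guide (precision, recall, specificity, false positive rate, accuracy and F1) can be collected in one sketch. The counts are hypothetical, and the function assumes no denominator is zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common metrics from confusion-matrix counts (no zero denominators)."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),          # also called sensitivity or TPR
        "specificity": tn / (tn + fp),     # true negative rate
        "fpr": fp / (fp + tn),             # equals 1 - specificity
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


m = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in m.items():
    print(f"{name}: {value:.3f}")

# F1 agrees with the harmonic mean of precision and recall.
harmonic = 2 * m["precision"] * m["recall"] / (m["precision"] + m["recall"])
```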

What is stacking?
A way of combining multiple models, that introduces the concept of a meta learner. It is less widely used than bagging and boosting. Unlike bagging and boosting, stacking may be (and normally is) used to combine models of different types. The procedure is as follows:

1. Split the training set into two disjoint sets.

2. Train several base learners on the first part.

3. Test the base learners on the second part.

4. Using the predictions from 3) as the inputs, and the correct responses as the outputs, train a higher level learner.

What is curse of dimensionality?
Having many features makes training very slow, and high-dimensional datasets are at risk of being sparse: points lie far apart, so more data is needed to cover the space.

What is the difference for reporting normal vs non normal data?
When reporting normal data you can report mean & confidence interval.

However, for non-normal data you should report the median and 1st/3rd quartiles, since the asymmetry of non-normal data makes the mean, standard deviation and confidence intervals based on them misleading.

What is statistical power?
The probability that we detect a difference between two populations when one truly exists, given a sample size, distribution and error rate. It equals 1 minus the probability of a type 2 error.

Power has an effect on how robust our results are

What is regression (reversion) to the mean?
An extreme observation tends to be followed by one closer to the average, so anomalies tend to revert toward the typical outcome over time.

What are the top regression mistakes?
1) using linear regression to analyse a nonlinear relationship.

2) treating correlation as causation.

3) reverse causality - statistical association between A and B does not prove that A causes B. B may cause A.

4) omitted variable bias

5) highly correlated explanatory variables (multicollinearity)

6) extrapolating beyond the data.

True or false, the probability that two events will both occur can never be greater than the probability that each will occur individually
True

What is pearson's chi squared test?
A statistical test applied to sets of categorical data to evaluate how likely it is that any observed difference between the sets arose by chance. It is suitable for unpaired data from large samples. Use the chi-square test for independence to determine whether there is a significant relationship between two categorical variables.

What is the law of compounding probabilities?
If two possible events A and B are independent, then the probability that both A and B will occur is equal to the product of the individual probabilities (prob(A) * prob(B)).

True or false: if an event can have a number of different and distinct (mutually exclusive) possible outcomes (A, B, C, etc.), then the probability that either A or B will occur is equal to the sum of the individual probabilities of A and B, and the sum of the probabilities of all possible outcomes is 1.
True
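Both probability laws can be verified exactly by enumerating the outcomes of two fair dice. This is an illustrative sketch using exact fractions; the `prob` helper is hypothetical.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))


def prob(event):
    """Exact probability of an event given as a predicate over outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))


# Multiplication law: the two dice are independent.
p_a = prob(lambda o: o[0] == 6)
p_b = prob(lambda o: o[1] == 6)
p_both = prob(lambda o: o[0] == 6 and o[1] == 6)
print(p_both, p_a * p_b)  # 1/36 1/36

# Addition law: faces 1 and 2 of the first die are mutually exclusive.
p_1 = prob(lambda o: o[0] == 1)
p_2 = prob(lambda o: o[0] == 2)
p_1_or_2 = prob(lambda o: o[0] in (1, 2))
print(p_1_or_2, p_1 + p_2)  # 1/3 1/3

# The probabilities of all faces of the first die sum to 1.
total = sum(prob(lambda o, f=f: o[0] == f) for f in range(1, 7))
```

The joint probability 1/36 is also no larger than either individual probability (1/6), matching the true/false card above.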

What is a random variable?
A random variable is a function that maps the sample space to the real line. Think of it as a numerical "summary" of an aspect of an experiment.

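What is a Q-Q plot?
A quantile-quantile plot compares the quantiles of a sample against those of a theoretical distribution, most often to visualize how close the sample distribution is to a normal distribution.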

What is Cramer's V Chi Squared Correlation?
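A measure of association between two categorical variables, derived from the chi-squared statistic. A value of 0 means the two variables are not related; a value of 1 means they are perfectly associated.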

What is Analysis of Variance (ANOVA) ?
Assesses whether the averages of a numerical variable across the categories of a categorical variable (typically three or more) are statistically different from each other. A p-value below 0.05 is conventionally taken to mean that at least one category mean differs, i.e. the two variables are related.

How to transform skewed distributions into symmetric?
Many models assume or work better with roughly normal (symmetric) inputs. For a right-skewed distribution, take the square root, cube root or logarithm of the variable. The square root can't handle negative numbers, and the logarithm can't handle zero or negative numbers. For a left-skewed distribution, take the square, cube or exponential of the variable.
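A small sketch of the effect of a log transform on right-skewed data. The skewness here is the moment-based (population) coefficient, which is one of several definitions, and the data values are made up.

```python
import math


def skewness(xs):
    """Moment-based skewness: third central moment over variance^1.5."""
    n = len(xs)
    mu = sum(xs) / n
    m2 = sum((x - mu) ** 2 for x in xs) / n
    m3 = sum((x - mu) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed, strictly positive data (log needs x > 0).
right_skewed = [1, 2, 2, 3, 3, 3, 4, 5, 8, 20, 60]
logged = [math.log(x) for x in right_skewed]

print(skewness(right_skewed))  # strongly positive
print(skewness(logged))        # smaller: the long right tail is pulled in
```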

What is ACID in relational databases?
Atomicity requires that each transaction is "all or nothing": if one part of the transaction fails, the entire transaction fails and the database state is left unchanged.
Consistency ensures that any transaction will bring the database from one valid state to another.
Isolation ensures that the concurrent execution of transactions results in a system state that would be obtained if the transactions were executed serially.
Durability means that once a transaction has been committed, it will remain so (even in the event of power loss, crashes or errors).

What is Consistency Availability and Partition Tolerance theorem?
The Consistency, Availability and Partition Tolerance (CAP) theorem names 3 essential system requirements for the successful design/deployment/implementation of applications in distributed systems: consistency, availability and partition tolerance.

A distributed system can only guarantee two of these features, not all 3.
Typically consistency is what suffers in new large-scale distributed non-relational systems, so ACID guarantees are relaxed. SQL databases are designed only to store/manage structured data.

What is NoSQL and what does it support?
NoSQL - a "not only SQL" database provides a mechanism for storage and retrieval of data that employs less constrained consistency models. They support BASE:
Basically available: the system guarantees availability of data in the CAP-theorem sense; there will be a response to any request, but data may be inconsistent or changing state.
Soft state: the state of the system could change over time, so even during times without input, there may be changes going on due to 'eventual consistency', thus state is always soft.
Eventual consistency: the system will eventually become consistent once it stops receiving input.

Data will propagate to everywhere it should sooner or later, but the system will continue to receive input and does not check the consistency of every transaction before it moves on to the next one. NoSQL databases are often highly optimized distributed key-value stores intended for simple retrieval and appending operations. Key-value stores allow an application to store its data in a schema-less way, so it can hold both structured and semi-structured data.

What are databases evaluated in terms of?
Databases are evaluated in terms of latency (time between submitting a query and getting results; lower is better) and throughput (number of queries served per second; higher is better).

What is a data warehouse?
A data warehouse is an OLAP (online analytical processing) relational database designed for query and analysis (reads). It is a storage repo of a carefully selected set of historical data of an entire enterprise that is collected (extract) from multiple different sources, cleaned and joined together (transform) and stored in structured tables (load). It is updated on a regular basis by the ETL process. Building a data warehouse requires: understanding the raw data schema and defining which data to include in the warehouse; developing a data cleaning/transformation process to turn raw data into a cleaned format; and designing a new relational database (the warehouse) to store the transformed data (schema-on-write).

What is a data mart?
Data Mart: a small data warehouse oriented towards individual departments or groups to address one or more specific subject areas of a business.

What is a data lake?
Data Lake: a storage repo that holds all historical data of an enterprise that is collected from multiple different sources and saved in its native/raw format on a distributed storage system such as HDFS. Data lakes allow you to store structured, semi-structured and unstructured data, while a data warehouse can only store structured data. A data lake requires fetching new data on a regular basis.

What are RNNs good for vs LSTMs?
Plain recurrent neural networks (RNNs) are good for short-term dependencies, whereas Long Short-Term Memory (LSTM) networks are good for long-term dependencies.

What is normalization?
Scales all numeric values to the range [0, 1]. Sensitive to outliers, since a single extreme value compresses the rest of the range.

What is standardization?
Rescales data to have a mean of 0 and a std dev of 1 (unit variance)

Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).
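Min-max normalization and z-score standardization can be sketched side by side. The sketch assumes the population standard deviation (`pstdev`) and a non-constant input; libraries may differ on those details. The values are made up and include one outlier.

```python
from statistics import mean, pstdev


def min_max_normalize(xs):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def standardize(xs):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]


values = [1.0, 2.0, 3.0, 4.0, 100.0]
normalized = min_max_normalize(values)
standardized = standardize(values)

# The outlier at 100 squeezes the other normalized values close to 0.
print(normalized)
print(standardized)
```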

What does model selection depend on?
No best method or one size fits all.

It usually depends on:
> Data size
> data type (numeric or nominal)
> problem type (prediction or grouping)

What is the binomial distribution?
A distribution in which each trial has only two possible outcomes, such as yes/no. The binomial distribution gives the probability of obtaining an exact number of successes in a fixed number of independent trials with the same success probability.

What is a Poisson distribution?
The Poisson distribution is what you should think of when counting events over a fixed interval of time, given a constant average rate at which events occur independently.
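The probability mass functions of the binomial and Poisson distributions can be written out from their standard formulas. This sketch uses `math.comb`, which needs Python 3.8 or later; the parameter names are illustrative.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """Probability of exactly k events when lam events are expected on average."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


print(binomial_pmf(2, 4, 0.5))  # 0.375: two heads in four fair coin flips
print(poisson_pmf(0, 2.0))      # about 0.1353: no events when 2 are expected
```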

What is a probability density function?
In probability theory, a probability density function (PDF), or density of a continuous random variable, is a function, whose value at any given sample (or point) in the sample space (the set of possible values taken by the random variable) can be interpreted as providing a relative likelihood that the value of the random variable would equal that sample.

What is a window function in SQL?
A function that computes a value for each row over a set of related rows (the window) without collapsing them into one row. You write something like

agg_func(var) OVER (PARTITION BY ____ ORDER BY ____)

e.g.

SUM(revenue) OVER (PARTITION BY product ORDER BY transaction) -> this would return a running total per product

SUM(revenue) OVER (PARTITION BY product) -> this would return a total for each product
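The two queries above can be tried out with Python's built-in `sqlite3` module. This sketch assumes the bundled SQLite is version 3.25 or later (window functions were added in that release); the `sales` table, its columns and its rows are hypothetical, and the ordering column is named `transaction_id` because `transaction` is a keyword in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, transaction_id INTEGER, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("a", 1, 10.0), ("a", 2, 20.0), ("b", 3, 5.0), ("a", 4, 30.0), ("b", 5, 15.0)],
)

rows = conn.execute(
    """
    SELECT product,
           transaction_id,
           -- running total within each product, in transaction order
           SUM(revenue) OVER (PARTITION BY product ORDER BY transaction_id) AS running_total,
           -- overall total for each product, repeated on every row
           SUM(revenue) OVER (PARTITION BY product) AS product_total
    FROM sales
    ORDER BY product, transaction_id
    """
).fetchall()

for row in rows:
    print(row)
conn.close()
```

Every input row is kept in the output, which is the key difference from GROUP BY.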

What are all the different types of window functions in SQL?
row_number(), rank(), dense_rank(), lag(), lead(), first_value(), last_value(), sum(), avg(), count(), nth_value(), ntile()

How does spark work?
Spark loads data from disk into memory, which is far faster, and carries out processing there - up to 100 times faster than disk-based processing in some operations.

What are the five stages of the experimental framework for A/B testing?

1. determine conversion to improve

2. hypothesize change

3. identify variables and create variations

4. run experiment

5. measure results

What makes a good market segmentation?

1. Distinct and identifiable

2. Sizable

3. Reachable

4. Stable

5. Profitable/valuable

6. Relevant

What type of relationship does Pearson's correlation work best with?
Pearson's correlation only measures linear relationships. If there's a nonlinear relationship, strength is understated

What is the most reliable way to demonstrate a causal relationship?
A randomized controlled trial (the treatment group receives the intervention and the control group receives no intervention).

What is covariance?
A measure of the tendency of two variables to vary together
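A short sketch of sample covariance, together with Pearson's correlation, which is covariance divided by the product of the two standard deviations. The data are made up; the quadratic example illustrates why Pearson's correlation understates nonlinear relationships.

```python
def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)


def pearson(xs, ys):
    """Covariance normalized by the two standard deviations."""
    return covariance(xs, ys) / (covariance(xs, xs) * covariance(ys, ys)) ** 0.5


xs = [1, 2, 3, 4, 5]
linear = [2 * x + 1 for x in xs]
quadratic = [(x - 3) ** 2 for x in xs]

print(covariance(xs, linear))  # 5.0
print(pearson(xs, linear))     # 1.0
print(pearson(xs, quadratic))  # 0.0, although y is fully determined by x
```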

What is a confidence interval?
A range of values computed from a sample such that, over repeated samples, a specified proportion (e.g. 95%) of the intervals constructed this way would contain the true population value (such as the mean).

How to create a classical hypothesis test?

1. Choose a test statistic (to measure effect, like mean).

2. Define a null hypothesis (a model of the system based on assumption that apparent effect isn't real)

3. Compute a p-value (probability of seeing an effect at least as large as the apparent effect if the null hypothesis is true)

4. If the p-value is low, then the effect is statistically significant
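The four steps can be sketched with a permutation test, which is one way (not the only way) to compute the p-value. Here the test statistic is the absolute difference in group means, the null hypothesis is that group labels do not matter, and the data, seed and iteration count are arbitrary choices for illustration.

```python
import random


def mean_diff(a, b):
    """Test statistic: absolute difference between the two group means."""
    return abs(sum(a) / len(a) - sum(b) / len(b))


def permutation_test(group_a, group_b, n_iter=2000, seed=0):
    """Estimate the p-value by shuffling group labels under the null hypothesis."""
    rng = random.Random(seed)
    observed = mean_diff(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        # Count shuffles whose effect is at least as large as the observed one.
        if mean_diff(pooled[:n_a], pooled[n_a:]) >= observed:
            extreme += 1
    return extreme / n_iter


# Hypothetical measurements for two groups.
group_a = [12.1, 11.8, 12.5, 12.9, 12.3, 12.7]
group_b = [10.2, 10.9, 10.5, 11.1, 10.4, 10.8]
p_value = permutation_test(group_a, group_b)
print(p_value)  # small, so the difference is statistically significant
```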

What is a parametric model?
A learning model that summarizes data with a set of parameters of fixed size (independent of the number of training examples) is called a parametric model. No matter how much data you throw at a parametric model, it won't change its mind about how many parameters it needs (e.g. linear/logistic regression, naive Bayes, neural networks).

What are benefits of a parametric model?
Benefits of Parametric Machine Learning Algorithms:
Simpler: These methods are easier to understand and interpret results.
Speed: Parametric models are very fast to learn from data.
Less Data: They do not require as much training data and can work well even if the fit to the data is not perfect.

What are nonparametric methods good for?
Nonparametric methods are good when you have a lot of data and no prior knowledge, and when you don't want to worry too much about choosing just the right features (e.g. support vector machines, decision trees, k-nearest neighbor).

What are benefits of nonparametric models?
Benefits of Nonparametric Machine Learning Algorithms:
Flexibility: Capable of fitting a large number of functional forms.
Power: No assumptions (or weak assumptions) about the underlying function.
Performance: Can result in higher performance models for prediction.

What is the box-cox transformation?
A useful data transformation technique used to stabilize variance, make the data more normal distribution-like, and improve the validity of measures of association such as the Pearson correlation between variables, among other data stabilization procedures.
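The transformation itself is a one-parameter power transform. The sketch below uses the standard one-parameter form and takes `lam` as given; in practice it is usually estimated from the data, which is not shown here.

```python
import math


def box_cox(x, lam):
    """One-parameter Box-Cox transform of a strictly positive value."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(x)  # the limit of (x**lam - 1) / lam as lam -> 0
    return (x ** lam - 1) / lam


print(box_cox(4.0, 0.5))  # 2.0
print(box_cox(3.0, 1.0))  # 2.0 (lam = 1 only shifts the data by 1)
print(box_cox(math.e, 0))
```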