Fatskills
Practice. Master. Repeat.
Study Guide: Deep Learning / Machine Learning Notes
Source: https://www.fatskills.com/machine-learning-101/chapter/deep-learning-machine-learning-notes

Deep Learning / Machine Learning Notes

By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.

What is deep learning?
A sub-field of machine learning built on a set of algorithms inspired by the structure and function of the brain.

What is backpropagation?
At the heart of backpropagation is an expression for the partial derivative ∂C/∂w of the cost function C with respect to any weight w (or bias b) in the network. The expression tells us how quickly the cost changes when we change the weights and biases.

What are vectors?
Vectors are special types of matrices, which are rectangular arrays of numbers.

Vectors are (ordered/unordered) collections of numbers?
Because vectors are ordered collections of numbers, they are often seen as column matrices: they have just one column and a certain number of rows.

LSA stands for___
Latent Semantic Analysis

SVD stands for ___
Singular Value Decomposition

What is entropy?
___is the measure of disorder or how messy the data is.

Equation of entropy?
Entropy = - p x log2(p) - q x log2(q), where p and q = 1 - p are the proportions of the two classes.
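As a rough illustration of the two-class entropy formula, the sketch below computes it in bits. The function name `binary_entropy` is an assumption made for this example, and the convention that 0 x log2(0) counts as 0 is the usual one rather than something the notes state.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a two-class split where one class has proportion p."""
    q = 1.0 - p
    total = 0.0
    for prob in (p, q):
        if prob > 0:  # treat 0 * log2(0) as 0 by convention
            total -= prob * math.log2(prob)
    return total

print(binary_entropy(0.5))  # 1.0 (a 50/50 split is maximally messy)
print(binary_entropy(1.0))  # 0.0 (a pure split has no disorder)
```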

Bias-variance tradeoff is___
In prediction models, prediction errors can be decomposed into two main subcomponents we care about: error due to "bias" and error due to "variance". There is a tradeoff between a model's ability to minimize bias and variance. Understanding these two types of error can help us diagnose model results and avoid the mistake of over- or under-fitting.

Cross validation?
a technique to evaluate predictive models by partitioning the original sample into a training set to train the model, and a test set to evaluate it. - Cross validation is a model evaluation method that is better than residuals. - The problem with residual evaluations is that they do not give an indication of how well the learner will do when it is asked to make new predictions for data it has not already seen.

What is Holdout cross-validation?
In _______, we hold out a percentage of observations and so we get two datasets. One is called the training dataset and the other is called the testing dataset. Here, we use the testing dataset to calculate our evaluation metrics, and the rest of the data is used to train the model.

Advantage of holdout cross-validation?
It is very easy to implement and it is a very intuitive method of cross-validation.

Problem of holdout cross-validation?
It provides a single estimate for the evaluation metric of the model. This is problematic because some models rely on randomness, and the result also depends on which observations happen to land in the test set. So in principle, the evaluation metric calculated on the test set can vary a lot from one split to another.

What is k-fold cross-validation?
In _____, we basically do holdout cross-validation many times. So in ______, we partition the dataset into k equal-sized samples. This cross-validation process is then repeated k times, with each of the k subsamples used exactly once as the validation data. The k results can then be averaged to produce a single estimation.
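The partition-and-rotate procedure can be sketched in a few lines. This is only an illustration: `k_fold_indices`, `cross_validate` and the `score_fn` callback are hypothetical names, and the data is not shuffled here, which a real workflow would usually do first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

def cross_validate(n_samples, k, score_fn):
    """Average a caller-supplied score_fn(train, validation) over the k folds."""
    scores = [score_fn(train, val) for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

for train, validation in k_fold_indices(6, 3):
    print(train, validation)
```

Each index appears in exactly one validation fold, so every observation is used once for evaluation and k - 1 times for training.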

What is Machine Learning?
Give computer ability to learn to make decisions from data without being explicitly programmed.

What is Unsupervised learning?
Uncovering hidden patterns from unlabeled data

Reinforcement learning
Software agents interact with an environment: - Learn how to optimize their behavior - Given a system of rewards and punishments - Draw inspiration from behavioral psychology

Types of recommender systems?
Collaborative filters are one of the most popular recommender models used in the industry and have found huge success for companies such as Amazon. - Collaborative filtering can be broadly classified into two types:

Content-based systems:

What is Item-based filtering?
If a group of people have rated two items similarly, then the two items must be similar. Therefore, if a person likes one particular item, they're likely to be interested in the other item too.

What is Overfitting?
- Overfitting occurs when the model fits the training data too closely, learning its noise instead of a pattern that generalizes to unseen data. It is usually the result of a model that is too complex, with many variables.
- An overfitted model is inaccurate on new data because the trend it has learned does not reflect the reality of the data.

How to prevent overfitting?
To prevent overfitting, we can use techniques like:
1> Cross-validation
2> Regularization, early stopping, pruning, Bayesian priors, dropout and model comparison
3> A simpler model: with fewer variables and parameters, the variance can be reduced.

What is Bias and Variance Tradeoff?
- Bias and variance are a way to diagnose the performance of a prediction algorithm by breaking down its prediction error. There are 2 types of prediction error: BIAS and VARIANCE.
- <1> Bias is error from erroneous assumptions in the learning algorithm. It occurs when an algorithm has limited flexibility to learn the true signal from a dataset. It is the difference between your model's expected prediction and the true value.
- <2> Variance is the algorithm's sensitivity to specific sets of training data. It is the variability of the model's prediction for a given data point.

How can you choose a classifier based on training set size?
- When the training set is small, a model with high bias and low variance tends to work better because it is less likely to overfit, e.g. Naive Bayes.

- When the training set is large, a model with low bias and high variance tends to perform better because it can capture complex relationships, e.g. a decision tree.

Explain confusion matrix with respect to Machine Learning algorithm?
Confusion matrix (or error matrix) is a specific table that is used to measure the performance of an algorithm.
- It is mostly used in supervised learning (in unsupervised learning it is called a matching matrix)
- A confusion matrix has 2 dimensions:
1) Actual
2) Predicted
- It also has identical sets of features in both these dimensions
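A small sketch of how such a table can be built from actual and predicted labels. The function name `confusion_matrix`, the nested-dict layout and the cat/dog labels are assumptions made for this example, not part of the notes.

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """Return a nested dict: matrix[actual_label][predicted_label] = count."""
    counts = Counter(zip(actual, predicted))
    return {a: {p: counts[(a, p)] for p in labels} for a in labels}

actual    = ["cat", "cat", "dog", "dog", "dog"]
predicted = ["cat", "dog", "dog", "dog", "cat"]
matrix = confusion_matrix(actual, predicted, labels=["cat", "dog"])

for label, row in matrix.items():
    print(label, row)
# cat {'cat': 1, 'dog': 1}
# dog {'cat': 1, 'dog': 2}
```

Rows are the actual classes and columns the predicted ones, so the diagonal holds the correct predictions.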

What are the three stages to build a model in Machine learning?
1> Model building:

- Choose the suitable algorithm for the model and train it according to the requirement.

2> Model testing:
- Check the accuracy of the model through the test data.

3> Applying the model:
- Make the required changes after testing and apply the final model.

When will you use classification over regression?
- Classification is used when your target variable is CATEGORICAL. E.g. predict gender of a person, type of color,etc.
- Regression is used when target variable is CONTINUOUS. E.g. estimate sale and price of a product, predicting sports score, amount of rainfall, etc.
Both belong to the category of Supervised ML Algorithms.

What is Random Forest?
Random Forest is a supervised ML algorithm that can be used for both classification and regression, and is most often introduced through classification problems.
- Random Forest operates by constructing multiple Decision trees during training phase. The decision of the majority of the trees is chosen by the random forest as the final decision.

Descriptive statistics
are statistics that describe, show or summarize data in a meaningful way such that, for example, patterns might emerge from the data.
- Descriptive statistics do not, however, allow us to make conclusions beyond the data we have analysed or reach conclusions regarding any hypotheses we might have made. They are simply a way to describe our data.

Difference Between Linear And Logistic Regression?
Two main differences are as follows -
1>
- Linear regression requires the dependent variable to be continuous, i.e. numeric values (no categories or groups).
- Binary logistic regression requires the dependent variable to be binary - two categories only (0/1).
- Multinomial or ordinal logistic regression can have a dependent variable with more than two categories.

2>
-Linear regression is based on least square estimation which says regression coefficients should be chosen in such a way that it minimizes the sum of the squared distances of each observed response to its fitted value.
- While logistic regression is based on Maximum Likelihood Estimation which says coefficients should be chosen in such a way that it maximizes the Probability of Y given X (likelihood)

How To Treat Outliers?
There are several methods to treat outliers -

Percentile Capping
Box-Plot Method
Standard Deviation
Weight of Evidence
Transformation
------------------------------------
- A box plot is a graphical display for describing the distribution of the data. Box plots use the median and the lower and upper quartiles.
An outlier is defined as the value above or below the upper or lower fences.

- A value that lies more than three standard deviations above or below the mean is considered an outlier. This is based on the characteristics of a normal distribution, for which about 99.73% of the data fall within this range.
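The two rules can be sketched as below. This is an illustration under stated assumptions: the fences are placed at 1.5 times the interquartile range (the usual box-plot convention, not stated in the notes), quartiles come from `statistics.quantiles` with its default method, and the function names are hypothetical.

```python
import statistics

def box_plot_outliers(values, whisker=1.5):
    """Values outside the lower and upper fences of a box plot."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < lower or v > upper]

def three_sigma_outliers(values):
    """Values more than three sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > 3 * sd]

small_sample = [10, 12, 11, 13, 12, 11, 10, 12, 95]
print(box_plot_outliers(small_sample))     # [95]
print(three_sigma_outliers(small_sample))  # []
```

In this small sample the value 95 inflates the standard deviation enough to hide itself from the three-sigma rule, while the quartile-based fences still flag it. That is one reason the box-plot method is often preferred for small or skewed data.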

What are the types of Outliers?
Outlier can be of two types: Univariate and Multivariate.

- Multi-variate outliers are outliers in an n-dimensional space.

How to detect Outliers?
- Most commonly used method to detect outliers is visualization.
- We use various visualization methods, like Box-plot, Histogram, Scatter Plot

What is Variable Transformation?
In data modelling, transformation refers to the replacement of a variable by a function. For instance, replacing a variable x by the square / cube root or logarithm x is a transformation. In other words, transformation is a process that changes the distribution or relationship of a variable with others.

What Is P-value And How It Is Used For Variable Selection?
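The p-value is the smallest level of significance at which you can reject the null hypothesis.

The p-value, or calculated probability, is the probability of finding the observed, or more extreme, results when the null hypothesis (H0) of a study question is true.

- One commonly used significance threshold (alpha) is 0.05.
- p-value < 0.05 --> reject the null hypothesis in favor of the alternative hypothesis.
- p-value > 0.05 --> fail to reject the null hypothesis (this does not prove that it is true).
- For variable selection, a predictor whose coefficient has a p-value above the threshold is a candidate for removal from the model.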

What is Simple vs. Multiple Linear Regression?
- Linear regression can be simple linear regression when you have only one independent variable .
- Whereas Multiple linear regression will have more than one independent variable.

What is residual?
The difference between an observed (actual) value of the dependent variable and the value of the dependent variable predicted from the regression line.

What is regularization and where might it be helpful? What is an example of using regularization in a model?
Regularization is useful for reducing variance in the model, meaning avoiding overfitting . For example, we can use L1 regularization in Lasso regression to penalize large coefficients.

Why might it be preferable to include fewer predictors over many?
- When we add irrelevant features, it increases model's tendency to overfit because those features introduce more noise.
- When two variables are correlated, they might be harder to interpret in case of regression, etc.
- curse of dimensionality
- adding random noise makes the model more complicated but useless
- computational cost

You're Uber and you want to design a heatmap to recommend to drivers where to wait for a passenger. How would you approach this?
Based on the past pickup location of passengers around the same time of the day, day of the week (month, year), construct
Ask someone for more details.
Based on the number of past pickups
account for periodicity (seasonal, monthly, weekly, daily, hourly)
special events (concerts, festivals, etc.) from tweets

What are various ways to predict a binary response variable? Can you compare two of them and tell me when one would be more appropriate? What's the difference between these? (SVM, Logistic Regression, Naive Bayes, Decision Tree, etc.)
Things to look at: N, P, linearly separable?, features independent?, likely to overfit?, speed, performance, memory usage
Logistic Regression
features roughly linear, problem roughly linearly separable
robust to noise, use l1,l2 regularization for model selection, avoid overfitting
the output come as probabilities
efficient and the computation can be distributed
can be used as a baseline for other algorithms
(-) can hardly handle categorical features
SVM
with a nonlinear kernel, can deal with problems that are not linearly separable
(-) slow to train, for most industry scale applications, not really efficient
Naive Bayes
computationally efficient when P is large by alleviating the curse of dimensionality
works surprisingly well for some cases even if the condition doesn't hold
with word frequencies as features, the independence assumption can be seen reasonable. So the algorithm can be used in text categorization
(-) assumes the features are conditionally independent of each other given the class
Tree Ensembles
good for large N and large P, can deal with categorical features very well
non parametric, so no need to worry about outliers
GBT's work better but the parameters are harder to tune
RF works out of the box, but usually performs worse than GBT
Deep Learning
works well for some classification tasks (e.g. image)
used to squeeze something out of the problem

What error metric would you use to evaluate how good a binary classifier is? What if the classes are imbalanced? What if there are more than 2 groups?
Accuracy: proportion of instances you predict correctly. Pros: intuitive, easy to explain, Cons: works poorly when the class labels are imbalanced and the signal from the data is weak
AUROC: plot fpr on the x axis and tpr on the y axis for different threshold. Given a random positive instance and a random negative instance, the AUC is the probability that you can identify who's who. Pros: Works well when testing the ability of distinguishing the two classes, Cons: can't interpret predictions as probabilities (because AUC is determined by rankings), so can't explain the uncertainty of the model
logloss/deviance: Pros: error metric based on probabilities, Cons: very sensitive to false positives, negatives
When there are more than 2 groups, we can have k binary classifications and add them up for logloss. Some metrics, such as AUC, are defined for the binary case and need a one-vs-rest or similar extension when there are more classes.
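The "who's who" reading of AUC can be checked directly by comparing every positive instance with every negative one. This brute-force sketch is for illustration only: `auc_by_pairs` is a hypothetical name, labels are assumed to be 0 or 1, and ties are counted as half a win, which is the usual convention.

```python
def auc_by_pairs(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc_by_pairs(labels, scores))  # 0.75
```

Only the ordering of the scores matters here, which is why AUC says nothing about whether the scores are well calibrated probabilities.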

What are some ways I can make my model more robust to outliers?
We can have regularization such as L1 or L2 to reduce variance (increase bias).
Changes to the algorithm:
Use tree-based methods instead of regression methods as they are more resistant to outliers. For statistical tests, use non parametric tests instead of parametric ones.
Use robust error metrics such as MAE or Huber Loss instead of MSE.
Changes to the data:
Winsorizing the data
Transforming the data (e.g. log)
Remove them only if you're certain they're anomalies not worth predicting

What could be some issues if the distribution of the test data is significantly different than the distribution of the training data?
The model that has high training accuracy might have low test accuracy. Without further knowledge, it is hard to know which dataset represents the population data and thus the generalizability of the algorithm is hard to measure. This should be mitigated by repeated splitting of train vs test dataset (as in cross validation).
When there is a change in data distribution, this is called the dataset shift. If the train and test data has a different distribution, then the classifier would likely overfit to the train data.
This issue can be overcome by using a more general learning method.
This can occur when:
P(y|x) are the same but P(x) are different. (covariate shift)
P(y|x) are different. (concept shift)
The causes can be:
Training samples are obtained in a biased way. (sample selection bias)
Train is different from test because of temporal, spatial changes. (non-stationary environments)
Solution to covariate shift
importance weighted cv

(Given a Dataset) Analyze this dataset and give me a model that can predict this response variable.
Start by fitting a simple model (multivariate regression, logistic regression), do some feature engineering accordingly, and then try some complicated models. Always split the dataset into train, validation, test dataset and use cross validation to check their performance.
Determine if the problem is classification or regression
Favor simple models that run quickly and you can easily explain.
Mention cross validation as a means to evaluate the model.
Plot and visualize the data.

What is bias?
Bias is the difference between the average prediction of our model and the correct value which we are trying to predict.

- Bias is the average difference between the estimator and the true value.
- High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).
- Model with high bias pays very little attention to the training data and oversimplifies the model.
- It always leads to high error on training and test data.

What is variance?
Variance is the variability of model prediction for a given data point or a value which tells us spread of our data.

- Variance is the average of the squared distances from each point to the mean.
- Variance shows how subject the model is to outliers, meaning those values that are far away from the mean.
- A model with high variance pays a lot of attention to the training data and does not generalize to data it hasn't seen before.
- As a result, such models perform very well on training data but have high error rates on test data.

What does it mean to have high variance?
A high variance indicates that the data points are very spread out from the mean, and from one another. - A model is too specific (overfitting), leading to high variance

What does it mean to have small variance?
A small variance indicates that the data points tend to be very close to the mean, and to each other.

What does high bias mean?
In machine learning terminology, underfitting means that a model is too general, leading to high bias, while overfitting means that a model is too specific, leading to high variance.
... Since you can't realistically avoid bias and variance altogether, this is called the bias-variance tradeoff.

What is the difference between bias and precision?
Bias is the average difference between the estimator and the true value. Precision is the standard deviation of the estimator. One measure of the overall variability is the Mean Squared Error, MSE, which is the average of the individual squared errors.

What is a normal distribution?
- The normal distribution is the most important and most widely used distribution in statistics.
- It is sometimes called the "bell curve,"
- A normal distribution is perfectly symmetrical around its center.
- A normal distribution has a bell-shaped density curve described by its mean and standard deviation .
-The density curve is symmetrical, centered about its mean, with its spread determined by its standard deviation.

What is a normal distribution determined by?
A normal distribution is determined by two parameters: the mean and the variance.

When a normal distribution is called Standard Normal Distribution (SND)?
A normal distribution is called Standard Normal Distribution (SND) when its mean is zero and SD is equal to 1.

What is Standard Normal Distribution?
The standard normal distribution (z distribution) is a normal distribution with a mean of 0 and a standard deviation of 1.

What is Normal Distribution?
- A "normal" distribution is also known as a bell-shaped curve or Gaussian curve.
- In a Gaussian or normal distribution, the mean , mode and median would all have the same (or similar) value and would look like the figure.
- A normal distribution is perfectly symmetrical around its center. - Total area under the curve = 1

How do you convert a normal distribution to a standard normal distribution?
So to convert a value to a Standard Score ("z-score"):
first subtract the mean,
then divide by the Standard Deviation.
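The two steps translate directly into code. In this sketch the names `z_score` and `standardize` are hypothetical, and the population standard deviation (`statistics.pstdev`) is an assumption; a sample standard deviation would give slightly different values.

```python
import statistics

def z_score(x, mean, sd):
    """First subtract the mean, then divide by the standard deviation."""
    return (x - mean) / sd

def standardize(values):
    """Convert every value in a list to its z-score."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation (assumption)
    return [z_score(v, mean, sd) for v in values]

print(z_score(85, mean=70, sd=10))            # 1.5
print(standardize([2, 4, 4, 4, 5, 5, 7, 9]))  # mean 5, SD 2
```

A score of 85 on a test with mean 70 and standard deviation 10 sits 1.5 standard deviations above the mean.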

What is Z score in normal distribution?
The standard score (more commonly referred to as a z-score) is a very useful statistic because it (a) allows us to calculate the probability of a score occurring within our normal distribution and (b) enables us to compare two scores that are from different normal distributions.

How do you find the Z value?
To find the Z score of a sample, you'll need to find the mean, variance and standard deviation of the sample. To calculate the z-score, you will find the difference between a value in the sample and the mean, and divide it by the standard deviation.

What is Z distribution?
In statistics, the Z-distribution is used to help find probabilities and percentiles for regular normal distributions (X).
-It serves as the standard by which all other normal distributions are measured.
-The Z-distribution is a normal distribution with mean zero and standard deviation 1;

What does a standard deviation of 1 mean?
A low standard deviation indicates that the data points tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the data points are spread out over a wider range of values.

What does a standard deviation of 0 mean?
xi - x = 0. This means that every data value is equal to the mean. This result along with the one above allows us to say that the sample standard deviation of a data set is zero if and only if all of its values are identical.

What is the importance of standard deviation?
Standard deviation is a number used to tell how measurements for a group are spread out from the average (mean), or expected value. A low standard deviation means that most of the numbers are very close to the average. A high standard deviation means that the numbers are spread out.

What does it mean when standard deviation is higher than mean?
A large standard deviation indicates that the data points can spread far from the mean and a small standard deviation indicates that they are clustered closely around the mean. ... The third population has a much smaller standard deviation than the other two because its values are all close to 7.

What are z scores used for?
The standard score (more commonly referred to as a z-score) is a very useful statistic because it (a) allows us to calculate the probability of a score occurring within our normal distribution and (b) enables us to compare two scores that are from different normal distributions.

How do you calculate normal distribution?
So to convert a value to a Standard Score ("z-score"):
first subtract the mean,
then divide by the Standard Deviation.

What is Z * in statistics?
The Z score is a test of statistical significance that helps you decide whether or not to reject the null hypothesis.
- The p-value is the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true.
- Z scores are measures of standard deviation. ... Both statistics are associated with the standard normal distribution.

What does the Z score tell you?
z-score is how many standard deviations away from the mean a data point is.

What is the goal of Linear Regression?
The goal of simple (univariate) linear regression is to model the relationship between a single feature (explanatory variable x) and a continuous valued response (target variable y).
y = ax + b
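One common way to find a and b is ordinary least squares, which picks the line that minimizes the sum of squared vertical distances to the points. The closed-form sketch below assumes that method; `fit_line` and the sample data are hypothetical.

```python
def fit_line(xs, ys):
    """Least squares estimates of slope a and intercept b in y = ax + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x  # the fitted line passes through the point of means
    return a, b

a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)  # 2.0 1.0
```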

What is regression line?
___is the best-fitting line.

What is offset or residuals?
____are the vertical lines from the regression line to the sample points, called prediction errors.
vertical offset = |y^ - y|

What is difference between simple linear and multiple linear regressions?
Simple linear regression has only one x and one y variable.

Multiple linear regression has one y and two or more x variables.

For instance, when we predict rent based on square feet alone that is simple linear regression.

When we predict rent based on square feet and age of the building that is an example of multiple linear regression.

What are the 4 assumptions of linear regression?
The 4 assumptions are:
- Linearity of the relationship between x and y
- Independence of residuals
- Normal distribution of residuals
- Equal variance of residuals

What is meant by dependent and independent variables?
Dependent variable depends upon independent variable.

What is a residual? How is it computed?
Residual is also called Error.
It is the difference between the predicted y value and the actual y value.
Residual = Actual y value - Predicted y value.

It can be positive or negative.

If residuals are always 0, then your model has a Perfect R square i.e. 1.

What is the difference between coefficient of determination, and coefficient of correlation?
The coefficient of correlation is the "R" value given in the summary table of the regression output. R square is also called the coefficient of determination. Multiply R by R to get the R square value. In other words, the coefficient of determination is the square of the coefficient of correlation.

R square, or the coefficient of determination, shows the percentage of variation in y that is explained by all the x variables together. Higher is better. For a least squares fit with an intercept, evaluated on the data it was fitted to, it lies between 0 and 1. Computed on new data it can be negative if the model predicts worse than the mean of y.

It is easy to explain the R square in terms of regression. It is not so easy to explain the R in terms of regression.

What is adjusted R2?
Adjusted R2 is used to compensate for the addition of variables to the model. It is always less than or equal to R2. As more independent variables (i.e. x variables) are added to the regression model, R2 usually increases. R2 increases even when the additional variables do little to help explain the dependent variable. To compensate for this, adjusted R2 is discounted (i.e. lowered) for the number of independent variables in the model.
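The discount can be made concrete with the commonly used formula adjusted R2 = 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of x variables. The notes do not give a formula, so treat this one, the function names and the sample numbers as assumptions.

```python
def r_squared(actual, predicted):
    """Share of the variation in y explained by the predictions."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n_samples, n_predictors):
    """R2 discounted for the number of x variables in the model."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

r2 = r_squared([1, 2, 3, 4, 5], [1.1, 1.9, 3.2, 3.8, 5.0])
print(round(r2, 4))                               # 0.99
print(round(adjusted_r_squared(r2, 5, 1), 4))     # 0.9867
print(round(adjusted_r_squared(r2, 5, 2), 4))     # 0.98
```

With the same R2, the model that uses two x variables gets a lower adjusted value than the model that uses one.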

What does coefficient of determination explain? (in terms of variation)
R square or coefficient of determination is the percentage of variation in y explained by all the x variables together.

Example, say we are trying to predict Rent based on square feet and number of bedrooms in the apartment. Say the R square for our model is 72% - that means that all the x variables i.e. square feet and number of bedrooms together explain 72% variation in y i.e. Rent.

Now let's say we add another x variable, for example age of the building, to our model. By adding this third relevant x variable the R square is expected to go up. Let's say the new R square is 95%. This means that square feet, number of bedrooms and age of the building together explain 95% of the variation in the Rent.

Remember, on the data the model was fitted to, the coefficient of determination or R square can only be as high as 1 and, for a least squares fit with an intercept, no lower than 0.

If we can predict our y variable (i.e. Rent in this case) perfectly, then we would have an R square (i.e. coefficient of determination) of 1.

Usually the R square of .70 is considered good.

For those cases where we know very little about, say, the hormones which increase our body's immunity against cancer, a regression model with an R square of .05 or even .02 can still be considered very good. It shows that at least our x variables (whatever they are) have some measurable association with cancer immunity.

What happens when p value for f test is lower than alpha i.e. what do you conclude?
When p value for f test is lower than alpha (which is usually .05 if nothing else is specified), then we reject H0. We conclude that:

We have the evidence that at least one of the [x variables] has a significant relationship with the [y variable].

Note: you need to replace the terms in the square parenthesis with the actual variable names, e.g., square feet and number of bedrooms for x variables and rent for y variable.

What is the difference between R square and adjusted R square?
R square and adjusted R square values are used for model validation in case of linear regression.

- R square indicates the proportion of variation in the dependent variable explained by all the independent variables in the model, whether or not each one is useful.

- Adjusted R squared applies a penalty for the number of independent variables, so it increases only when a new variable improves the model more than would be expected by chance.

How to find RMSE and MSE?
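RMSE and MSE are two of the most common measures of accuracy for a linear regression.

MSE is the mean squared error, the average of the squared differences between actual and predicted values: MSE = (1/n) x sum((y - y^)^2). RMSE is the root mean squared error, the square root of MSE: RMSE = sqrt(MSE).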
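A minimal sketch of both measures, assuming the standard definitions: MSE is the mean of the squared differences between actual and predicted values, and RMSE is its square root. The function names and sample numbers are hypothetical.

```python
import math

def mse(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Square root of MSE, expressed in the same units as y."""
    return math.sqrt(mse(actual, predicted))

actual = [3, 5, 7]
predicted = [2, 5, 9]
print(mse(actual, predicted))   # (1 + 0 + 4) / 3
print(rmse(actual, predicted))
```

Because RMSE is in the units of the target variable, it is usually the easier of the two to explain.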


What are the possible ways of improving the accuracy of a linear regression model?
There could be multiple ways of improving the accuracy of a linear regression, most commonly used ways are as follows:

Outlier Treatment:
-Regression is sensitive to outliers, hence it becomes very important to treat the outliers with appropriate values.
-Replacing the values with mean, median, mode or percentile depending on the distribution can prove to be useful.

What is the significance of an F-test in a linear model?
The use of F-test is to test the goodness of the model.

-When the model is re-iterated to improve the accuracy with changes, the F-test values prove to be useful in terms of understanding the effect of overall regression.

What are the disadvantages of the linear model?
- Linear regression is sensitive to outliers which may affect the result.

- Over-fitting

- Under-fitting

What are the important assumptions of Linear regression?
A linear relationship

Restricted Multi-collinearity value

Homoscedasticity

Firstly, there has to be a linear relationship between the dependent and the independent variables. To check this relationship, a scatter plot proves to be useful.

Secondly, there must be no or very little multi-collinearity between the independent variables in the dataset. The value needs to be restricted, which depends on the domain requirement.

The third is homoscedasticity. It is one of the most important assumptions and states that the errors have equal (constant) variance across all values of the independent variables.

What is heteroscedasticity?
Heteroscedasticity is exactly the opposite of homoscedasticity, which means that the error terms are not equally distributed.

-To correct this phenomenon, usually, a log function is used.

Entropy
disorder index

Bias coefficient
deviation coefficient

bias
discrepancy, deviation

ANOVA?
Analysis of variance

- ANOVA testing does not just examine the differences, it also looks at the degree of variance, or the difference between them, in variable means.

- It is a way of analyzing the statistical significance of the variables.

- When more than two groups are compared, ANOVA is preferred over running several t-tests because a single test avoids inflating the chance of a false positive.

- uncover relationships among variables, while a t-test does not

- Variations of ANOVA testing include One-Way ANOVA (used to search for statistically significant differences between two or more independent variables), Two-Way ANOVA (to uncover potential interaction of two independent variables on one dependent

Why do we use Anova?
___ is used to determine whether there are any statistically significant differences between the means of three or more independent (unrelated) groups.

Model capacity
Capacity of a model is its ability to fit a wide variety of functions

We can increase capacity by changing the number of input features and adding parameters

representational capacity
the family of functions used by the learning algorithm

§ Polynomial vs. neural network regression

effective capacity
The capacity the learning algorithm actually achieves. Because the optimization algorithm is imperfect, it can be less than the representational capacity.

Occam's razor (principle of parsimony)
Among competing hypotheses that explain observations equally well, we should choose the "simplest" one

Vapnik-Chervonenkis (VC) dimension
measures the capacity of a model f as the maximum number of points that can be labeled arbitrarily (basically given a paramset theta, how PERFECTLY can it label things)

Bayes error:
Error incurred by an oracle that knows the true data generating distribution (because of noise)

Regularization
any modification we make to a learning algorithm to reduce its generalization error but not its training error

types of regularization
§ L2 parameter regularization
§ L1 parameter regularization
§ Data set augmentation
§ Injecting noise
§ Early stopping
§ Bagging (bootstrap aggregating)
§ Dropout

k-fold Cross Validation
§ Partition the dataset into k nonoverlapping subsets
§ The test error is computed by averaging the test error across k trials
§ On trial i, the i-th subset is used as the test set and the rest of the data is used as the training set

stochastic gradient descent
SGD approximates the gradient using a small set of examples (minibatch)
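
As a rough sketch of the idea, the loop below fits a one-variable linear model with minibatch SGD in plain Python. The model, the mean squared error loss, the learning rate and the batch size are all assumptions chosen for illustration; the notes do not specify them.

```python
import random


def sgd_linear(xs, ys, lr=0.05, batch_size=4, epochs=200, seed=0):
    """Fit y ~ w * x + b by minibatch SGD on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # reshuffle every epoch so minibatches are random
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient estimated from the minibatch only, not the full dataset.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Noise-free toy data generated from y = 3x + 1.
xs = [i / 10 for i in range(-10, 11)]
ys = [3.0 * x + 1.0 for x in xs]
w, b = sgd_linear(xs, ys)
```

On this toy data the fitted `w` and `b` should end up close to 3 and 1.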

Most popular type of Grad. Descent for DL models?
Stochastic

DL motivations
§ Curse of dimensionality: Many machine learning problems become exceedingly difficult when the number of dimensions of the data is high
§ Deep learning algorithms exhibit reduced generalization error in high-dimensions

Neuron (Perceptron)
Converts inputs to outputs
z = w^T x + b
a = σ(z) (activation function)

Loss function used in project
-sum(y * log(y_hat) + (1 - y) * log(1 - y_hat)) / m

Sigmoid function
1 / (1 + e^-x)

Sigmoid derivative
sigmoid * (1 - sigmoid)
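
A small plain-Python sketch ties the neuron, the cross-entropy loss and the sigmoid cards together. It assumes the standard logistic sigmoid 1 / (1 + e^-x), whose derivative is sigmoid(x) * (1 - sigmoid(x)); the function names are chosen here for illustration.

```python
import math


def sigmoid(z):
    """Logistic sigmoid: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    """Derivative of the sigmoid, written in terms of the sigmoid itself."""
    s = sigmoid(z)
    return s * (1.0 - s)


def neuron(w, x, b):
    """Single neuron: weighted sum z = w . x + b, then a = sigmoid(z)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)


def binary_cross_entropy(y_true, y_pred):
    """Mean binary cross-entropy over m examples (predictions strictly in (0, 1))."""
    m = len(y_true)
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, y_pred))
    return -total / m
```

A quick sanity check is to compare `sigmoid_prime(z)` with the finite difference `(sigmoid(z + h) - sigmoid(z - h)) / (2 * h)` for a small `h`.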

ReLU Pros
§ Gives large and consistent gradients (does not saturate) when active
§ Efficient to optimize, converges much faster than sigmoid or tanh

ReLU Cons
§ Non zero centered output
§ Units "die", i.e., when inactive they will never update

dying ReLU problem
A "dead" ReLU always outputs the same value (zero as it happens, but that is not important) for any input. Probably this is arrived at by learning a large negative bias term for its weights.

In turn, that means that it takes no role in discriminating between inputs. For classification, you could visualise this as a decision plane outside of all possible input data.

Once a ReLU ends up in this state, it is unlikely to recover, because the gradient of the function is also 0 wherever its output is 0, so gradient descent learning will not alter the weights. "Leaky" ReLUs with a small positive gradient for negative inputs (y = 0.01x when x < 0, say) are one attempt to address this issue and give a chance to recover.

The sigmoid and tanh neurons can suffer from similar problems as their values saturate, but there is always at least a small gradient allowing them to recover in the long term.
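
The difference between the two activations can be seen directly in their gradients. The sketch below uses the slope 0.01 from the example above; treating the gradient at exactly 0 as belonging to the negative branch is an assumption made here for simplicity.

```python
def relu(x):
    return max(0.0, x)


def relu_grad(x):
    # Zero gradient for inactive inputs: a dead unit gets no weight update.
    return 1.0 if x > 0 else 0.0


def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x


def leaky_relu_grad(x, slope=0.01):
    # A small positive gradient survives for negative inputs, so the unit can recover.
    return 1.0 if x > 0 else slope
```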

Logistic Sigmoid and tanh Issue
Saturate across most of their domain, strongly sensitive only when z is closer to zero

Saturation
values are driven far to the left or right and the network does not learn quickly

As number of layers increases, generalization seems to...
improve

Jacobian matrix
Suppose f : ℝ^n → ℝ^m is a function which takes as input the vector x ∈ ℝ^n and produces as output the vector f(x) ∈ ℝ^m. Then the Jacobian matrix J of f is an m×n matrix whose entry in row i and column j is the partial derivative of the i-th output with respect to the j-th input:

J_ij = ∂f_i / ∂x_j

The Jacobian holds the partial derivatives of the transform and can be interpreted as the best local linear approximation of f when you zoom in on a specific point.
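
One way to make the definition concrete is to approximate each entry numerically. The sketch below uses central finite differences in plain Python; the example function `f` and the step size `eps` are assumptions chosen for illustration.

```python
def numerical_jacobian(f, x, eps=1e-6):
    """Approximate J[i][j] = d f_i / d x_j by central differences."""
    m, n = len(f(x)), len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        f_plus, f_minus = f(x_plus), f(x_minus)
        for i in range(m):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return J


def f(x):
    # Example map from R^2 to R^3, so its Jacobian is a 3x2 matrix.
    return [x[0] * x[1], x[0] + x[1], x[0] ** 2]


J = numerical_jacobian(f, [2.0, 3.0])
```

For this `f` the analytic Jacobian at (2, 3) has rows (3, 2), (1, 1) and (4, 0), which the numerical result should match closely.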

Tensor
In mathematics, tensors are geometric objects that describe linear relations between geometric vectors, scalars, and other tensors.

We could imagine flattening each tensor into a vector before we run back-propagation, computing a vector-valued gradient, and then reshaping the gradient back into a tensor.

What will happen if the weights are initialized to zero?
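Every unit in a layer computes the same output regardless of input (exactly 0 for ReLU or tanh units with zero biases), and all units in a layer receive identical gradients, so the symmetry between them is never broken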

Why do we tend to only regularize the weights and not the biases?
The biases typically require less data to fit accurately than the weights and regularizing the bias parameters can introduce a significant amount of underfitting.

L2 norm penalty for regularization
adds (lambda / 2) * w^T w to the loss function value

What assumption does L2 rely on?
a model with small weights is simpler than a model with large weights

L1 regularization
adds lambda * sum_i |w_i| to the objective function
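
The L2 and L1 penalties can be written out in a few lines of plain Python. This is only a sketch: `lam` stands for lambda, and using the sign of each weight as the L1 subgradient (with 0 at w = 0) is a common convention assumed here.

```python
def l2_penalty(w, lam):
    """(lam / 2) * w^T w"""
    return 0.5 * lam * sum(wi * wi for wi in w)


def l1_penalty(w, lam):
    """lam * sum of absolute weights"""
    return lam * sum(abs(wi) for wi in w)


def l2_grad(w, lam):
    # Shrinks each weight in proportion to its size.
    return [lam * wi for wi in w]


def l1_subgrad(w, lam):
    # Constant-size push toward zero, regardless of how small the weight is.
    return [lam * ((wi > 0) - (wi < 0)) for wi in w]
```

The L2 gradient fades as a weight approaches zero, while the L1 push keeps the same magnitude, which is one way to see why L1 tends to drive weights to exactly zero.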

L1 regularization tends to result in more ____ solutions
sparse

it's good for weeding out features that are unneeded


Feature selection tends to use ____ regularization
L1

Dataset Augmentation
takes current data and creates new data out of it with simple transforms

types of dataset augmentation
affine distortion
horizontal flip
noise
random translation
elastic deformation
hue shift

Injecting Noise
One way to improve the robustness of neural networks is simply to train them with random noise applied to their inputs.

Early Stopping
Instead of running our optimization algorithm until we reach a (local) minimum of validation error, we run it until the error on the validation set has not improved for some amount of time.
§ Every time the error on the validation set improves, we store a copy of the model parameters.
§ When the training algorithm terminates, we return these parameters, rather than the latest parameters.
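
The procedure above can be sketched as a generic loop. Everything named here is hypothetical: `train_step` performs one round of training and returns the updated parameters, `validation_error` scores them on the validation set, and `patience` is how many non-improving checks are tolerated before stopping.

```python
import copy


def train_with_early_stopping(params, train_step, validation_error,
                              patience=5, max_steps=1000):
    best_params = copy.deepcopy(params)
    best_error = validation_error(params)
    checks_without_improvement = 0
    for _ in range(max_steps):
        params = train_step(params)
        error = validation_error(params)
        if error < best_error:
            # Validation error improved: store a copy of the parameters.
            best_error = error
            best_params = copy.deepcopy(params)
            checks_without_improvement = 0
        else:
            checks_without_improvement += 1
            if checks_without_improvement >= patience:
                break
    # Return the best stored parameters, not the latest ones.
    return best_params, best_error
```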

Benefits of Early Stopping
unobtrusive (requires almost no change to the training procedure), no need to keep retuning a hyperparameter, and reduces the computational cost of training (but some data must be held out as a validation set)

Ensemble Methods of Reg.
Train multiple models and have each of the models vote on the answer

called model averaging

The reason that model averaging works is that different models will usually not make all the same errors on the test set.
...

Bagging
- Is a form of ensemble learning.
- You repeatedly sample your dataset with replacement (bootstrap samples), train one classifier on each sample, and take the majority vote as the final class.
- Reduces variance
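
A minimal sketch of bagging in plain Python follows. The `train` argument is a hypothetical function that takes a bootstrap sample and returns a model, where a model is any callable mapping an input to a class label.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


def train_bagged(data, train, n_models=5, seed=0):
    """Train one model per bootstrap sample."""
    rng = random.Random(seed)
    return [train(bootstrap_sample(data, rng)) for _ in range(n_models)]


def bagging_predict(models, x):
    """Final class is the majority vote of the individual models."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]
```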

When to NOT use model averaging?
When benchmarking algorithms for specific scientific papers

Dropout
randomly shuts off some neurons by using a binary mask

Should dropout be used during testing
no
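
A sketch of the binary mask in plain Python. It uses the "inverted dropout" variant, which is an assumption beyond the notes: surviving activations are scaled by 1 / keep_prob during training so that nothing needs to change at test time.

```python
import random


def dropout(activations, keep_prob, training, rng=random):
    """Apply a random binary mask during training; pass through unchanged at test time."""
    if not training:
        return list(activations)
    mask = [1 if rng.random() < keep_prob else 0 for _ in activations]
    # Rescale kept units so the expected activation matches test time.
    return [a * m / keep_prob for a, m in zip(activations, mask)]
```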

adversarial training
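Adversarial examples are inputs with small, deliberate alterations that trick a model into misclassifications (stop sign to speed limit, panda to gibbon, etc.); adversarial training includes such examples in training to make the model more robust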

Difference b/w learning and pure optimization?
In learning, we care about a performance measure P, which is optimized only indirectly

We reduce a cost function J(θ) using the training data in the hope that doing so will improve P. In pure optimization, minimizing J is a goal in and of itself

What does the "risk" in deep learning refer to?
the expected generalization error

What distr. do we know and which do we not when trying to learn?
we know the empirical distribution p_hat (defined by the training set) but not the true data-generating distribution p_data

what is empirical risk minimization?
the training process of minimizing the average loss on the training set (the empirical risk)

Difference between risk and empirical risk?
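We can compute and optimize the empirical risk (average loss on the training set), but the true risk (expected loss under the data-generating distribution) is unknown, so we can only hope it decreases too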

optimization algs that use the whole training set
batch or deterministic gradient methods

optimization algs that use only a single example
stochastic
or sometimes online methods

The term online is usually reserved for the case where the examples are drawn from a stream of continually created examples

somewhere in between online/batch methods
minibatch methods

minibatch sizes should be...
powers of 2 (often for efficient use of hardware)
filled with examples selected randomly

if we don't shuffle the examples during each epoch, our minibatches will be...
biased

Ill Conditioning
A problem is ill-conditioned if a small error in the initial data can result in much larger errors in the solution

evaluated using the Hessian (second derivative) matrix, e.g. via the ratio of its largest to smallest eigenvalue (the condition number)

Saddle point
A critical point (zero gradient) that is a local minimum along some directions and a local maximum along others

In high-dimensional spaces, which is a critical point most likely to be: a local min, a local max, or a saddle point?
Saddle Point

exploding gradient problem
On the face of an extremely steep cliff structure, the gradient update step can move the parameters extremely far, usually jumping off of the cliff

How to avoid cliffs
§ Redesign the network to have fewer layers
§ Use rectified linear (ReLU) activation
§ Weight regularization, e.g., L2 parameter regularization
§ Gradient clipping heuristic

Gradient Clipping Heuristic
When the gradient descent algorithm proposes to make a very large step, the gradient clipping heuristic intervenes to reduce the step size so that the update stays in the region where the gradient is still a reliable guide
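
One common form of the heuristic is clipping by norm, sketched below in plain Python. The threshold `max_norm` is a hypothetical hyperparameter; other variants (for example clipping each component separately) exist and are not shown.

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale grad so its Euclidean norm is at most max_norm; direction is unchanged."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]
```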

True or false? The gradients we calculate are exact
False. They are usually noisy estimates computed from a minibatch of examples

momentum purpose
accelerate learning

Momentum: how does it work?
The momentum algorithm accumulates an exponentially decaying moving average of past gradients and continues to move in their direction

what is velocity?
The velocity is set to an exponentially decaying moving average of the negative gradient
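
The velocity update can be sketched in plain Python as follows. The symbols are assumptions for illustration: `alpha` is the decay (momentum) coefficient, `lr` is the learning rate, and the toy objective f(t) = t^2 is used only to show the loop running.

```python
def momentum_step(theta, velocity, grad, lr=0.1, alpha=0.9):
    """One momentum update: decay the old velocity, add the negative gradient, then move."""
    velocity = [alpha * v - lr * g for v, g in zip(velocity, grad)]
    theta = [t + v for t, v in zip(theta, velocity)]
    return theta, velocity


# Minimize f(t) = t^2 (gradient 2t), starting from t = 5.
theta, velocity = [5.0], [0.0]
for _ in range(300):
    grad = [2.0 * t for t in theta]
    theta, velocity = momentum_step(theta, velocity, grad)
```

After the loop, `theta[0]` should be very close to the minimum at 0.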

Difference between standard and nesterov momentum
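With Nesterov momentum the gradient is evaluated after the current velocity has been applied, which adds a correction to the standard momentum step; standard momentum evaluates the gradient at the current parameters and then takes a single jump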

Adagrad
Adapts the learning rates of all model parameters by scaling them inversely proportional to the square root of the sum of all of their historical squared gradient values

Adagrad is good when
AdaGrad has desirable theoretical properties, and performs well, when the objective is convex

Adagrad is bad sometimes because
When applied to a non-convex function to train a neural network, the learning trajectory may pass through many different structures and eventually arrive at a region that is a locally convex bowl

AdaGrad shrinks the learning rate according to the entire history of the squared gradient and may have made the learning rate too small before arriving at such a convex structure

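Empirically it has been found that the accumulation of squared gradients from the beginning of training can shrink the effective learning rate too early and too much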

RMSProp
modifies AdaGrad to perform better in the non-convex setting by changing the gradient accumulation into an exponentially weighted moving average

Adam
combines RMSProp with Momentum

Estimates with exponential weighting of both first order (gradients) and second order (squares of gradients) moments
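
A single Adam update can be sketched in plain Python as below. The default hyperparameters and the bias-correction terms follow the commonly published form of the algorithm, not anything stated in these notes, so treat them as assumptions; `t` is the 1-based step count.

```python
import math


def adam_step(theta, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a list of parameters."""
    # First moment: exponentially weighted average of gradients (the momentum part).
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    # Second moment: exponentially weighted average of squared gradients (the RMSProp part).
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    # Bias correction compensates for m and v starting at zero.
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    theta = [p - lr * mh / (math.sqrt(vh) + eps)
             for p, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v
```

With bias correction, the very first step moves each parameter by roughly `lr` in the direction opposite to its gradient, whatever the gradient's scale.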

batch normalization
§ In order to maintain the expressive power of the network, it is common to replace the batch of hidden unit activations 𝐻 with 𝛾𝐻′ + 𝛽 rather than simply the normalized 𝐻′
§ The variables 𝛾 and 𝛽 are parameters that are learned by backpropagation and allow the new variable to have any mean and standard deviation.
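
For a single hidden unit, the normalize-then-rescale step might look like the sketch below. It handles a batch of scalar activations only, and the small constant `eps` added for numerical stability is an assumption not mentioned in the notes.

```python
import math


def batch_norm(h, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of activations H to H', then return gamma * H' + beta."""
    mean = sum(h) / len(h)
    var = sum((x - mean) ** 2 for x in h) / len(h)
    h_norm = [(x - mean) / math.sqrt(var + eps) for x in h]
    # gamma and beta are learned, so the output can take any mean and standard deviation.
    return [gamma * x + beta for x in h_norm]
```

After normalization the batch has mean close to 0 and standard deviation close to 1, so the output has mean close to `beta` and standard deviation close to `gamma`.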

Standard Weight initialization
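w_ij ~ uniform(-1/sqrt(m), 1/sqrt(m)), where m is the number of inputs to the layer

xavier uniform initialization
w_ij ~ uniform(-sqrt(6 / (m + n)), sqrt(6 / (m + n))), where m and n are the numbers of inputs and outputs of the layer

xavier normal initialization
normal with mean 0 and variance 2 / (n_in + n_out)

he normal initialization
normal with mean 0 and variance 2 / n_in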

lecun normal initialization
It draws samples from a truncated normal distribution centered on 0 with stddev = sqrt(1 / fan_in) where fan_in is the number of input units in the weight tensor.

lecun uniform initialization
It draws samples from a uniform distribution within [-limit, limit] where limit is sqrt(3 / fan_in) where fan_in is the number of input units in the weight tensor.
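
Three of these schemes are sketched below in plain Python, returning an n_in by n_out weight matrix as a list of rows. The function names are chosen here for illustration, the Xavier limit is taken as sqrt(6 / (n_in + n_out)), and the normal draw is not truncated, which is a simplification.

```python
import math
import random


def xavier_uniform(n_in, n_out, rng):
    limit = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-limit, limit) for _ in range(n_out)] for _ in range(n_in)]


def he_normal(n_in, n_out, rng):
    std = math.sqrt(2.0 / n_in)  # variance 2 / n_in
    return [[rng.gauss(0.0, std) for _ in range(n_out)] for _ in range(n_in)]


def lecun_uniform(n_in, n_out, rng):
    limit = math.sqrt(3.0 / n_in)
    return [[rng.uniform(-limit, limit) for _ in range(n_out)] for _ in range(n_in)]
```

Passing a seeded `random.Random` instance as `rng` keeps the draws reproducible.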

What are the two types of collaborative filtering?
1. Neighborhood methods (NN) 2. Latent Factors - Find a latent space with users and items.


