By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.
What is deep learning? A subfield of machine learning built on algorithms inspired by the structure and function of the brain (artificial neural networks).
What is backpropagation? At the heart of backpropagation is an expression for the partial derivative ∂C/∂w of the cost function C with respect to any weight w (or bias b) in the network. The expression tells us how quickly the cost changes when we change the weights and biases.
What are vectors? Vectors are special types of matrices, which are rectangular arrays of numbers.
Vectors are (ordered/unordered) collections of numbers? Because vectors are ordered collections of numbers, they are often seen as column matrices: they have just one column and a certain number of rows.
LSA stands for___ Latent Semantic Analysis
SVD stands for ___ Singular Value Decomposition
What is entropy? ___is the measure of disorder or how messy the data is.
Equation of entropy? H = -p log2(p) - q log2(q), where p and q = 1 - p are the proportions of the two classes.
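A minimal sketch of the two-class entropy formula, assuming p is the proportion of one class and q = 1 - p. The function name `binary_entropy` is illustrative only, not from the card.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a two-class split with proportions p and 1 - p."""
    q = 1.0 - p
    # By convention 0 * log2(0) is treated as 0, so empty classes are skipped.
    return sum(-x * math.log2(x) for x in (p, q) if x > 0)

print(binary_entropy(0.5))  # 1.0: a 50/50 split is maximally mixed
print(binary_entropy(1.0))  # 0.0: a pure set has no disorder
```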
Bias-variance tradeoff is___ In prediction models, prediction errors can be decomposed into two main subcomponents we care about: error due to "bias" and error due to "variance". There is a tradeoff between a model's ability to minimize bias and variance. Understanding these two types of error can help us diagnose model results and avoid the mistake of over- or under-fitting.
Cross validation? a technique to evaluate predictive models by partitioning the original sample into a training set to train the model, and a test set to evaluate it. - Cross validation is a model evaluation method that is better than residuals. - The problem with residual evaluations is that they do not give an indication of how well the learner will do when it is asked to make new predictions for data it has not already seen.
What is Holdout cross-validation? In _______, we hold out a percentage of observations and so we get two datasets. One is called the training dataset and the other is called the testing dataset. Here, we use the testing dataset to calculate our evaluation metrics, and the rest of the data is used to train the model.
Advantage of holdout cross-validation? It is very easy to implement and it is a very intuitive method of cross-validation.
Problem of holdout cross-validation? It provides only a single estimate of the model's evaluation metric. This is problematic because the estimate depends on which observations land in the test set, and some models also rely on randomness, so the metric can vary a lot from one split to the next.
What is k-fold cross-validation? In _____, we basically do holdout cross-validation many times. So in ______, we partition the dataset into k equal-sized samples. This cross-validation process is then repeated k times, with each of the k subsamples used exactly once as the validation data. The k results can then be averaged to produce a single estimation.
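The partitioning step can be sketched in plain Python. This is an illustration under simple assumptions (no shuffling, folds differing in size by at most one); the helper name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

In practice the indices are usually shuffled before splitting, and the k validation scores are averaged into one estimate.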
What is Machine Learning? Giving computers the ability to learn to make decisions from data without being explicitly programmed.
What is Unsupervised learning? Uncovering hidden patterns from unlabeled data
Reinforcement learning Software agents interact with an environment: - Learn how to optimize their behavior - Given a system of rewards and punishments - Draw inspiration from behavioral psychology
Types of recommender systems? Collaborative filters are one of the most popular recommender models used in the industry and have found huge success for companies such as Amazon. - Collaborative filtering can be broadly classified into two types:
Content-based systems:
What is Item-based filtering? If a group of people have rated two items similarly, then the two items must be similar. Therefore, if a person likes one particular item, they're likely to be interested in the other item too.
What is Overfitting? - Overfitting occurs when the model fits the training data too closely and, as a result, predicts unseen data poorly. It is often the result of an overly complex model with many variables. - An overfitted model is inaccurate because the trend it learns reflects noise in the training sample rather than the true pattern in the data.
How to prevent overfitting? To prevent overfitting, we can use techniques such as: 1> Cross-validation, 2> Regularization, early stopping, pruning, Bayesian priors, dropout and model comparison. Make a simpler model: with fewer variables and parameters, the variance can be reduced.
What is Bias and Variance Tradeoff? - Bias and variance are a way to diagnose the performance of a prediction algorithm by breaking down its prediction error. There are 2 types of prediction error: BIAS and VARIANCE - <1> Bias is error from erroneous assumptions in the learning algorithm. Bias occurs when an algorithm has limited flexibility to learn the true signal from a dataset. - It is the difference between your model's expected prediction and the true value. - <2> Variance is the algorithm's sensitivity to specific sets of training data. Variance is the variability of model prediction for a given data point.
How can you choose a classifier based on training set size? - When the training set is small, a model with high bias and low variance tends to work better because it is less likely to overfit, e.g. Naive Bayes.
- When the training set is large, a model with low bias and high variance tends to perform better because it can capture complex relationships, e.g. a decision tree.
Explain confusion matrix with respect to Machine Learning algorithm? Confusion matrix (or error matrix) is a specific table that is used to measure the performance of an algorithm. - It is mostly used in supervised learning (in unsupervised learning it is called a matching matrix) - A confusion matrix has 2 dimensions: 1) Actual 2) Predicted - It has identical sets of classes in both these dimensions
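A small sketch of how such a table can be built, assuming rows are actual classes and columns are predicted classes. The labels and data are made up for illustration.

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    counts = Counter(zip(actual, predicted))
    # Rows are actual classes, columns are predicted classes.
    return [[counts[(a, p)] for p in labels] for a in labels]

actual = ["cat", "dog", "cat", "dog", "cat"]
predicted = ["cat", "cat", "cat", "dog", "dog"]
matrix = confusion_matrix(actual, predicted, labels=["cat", "dog"])
print(matrix)  # [[2, 1], [1, 1]]
```

The diagonal holds the correct predictions; off-diagonal cells show which classes get confused with which.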
What are the three stages to build a model in Machine learning? 1> Model building: - Choose the suitable algorithm for the model and train it according to the requirement.
2> Model testing: - Check the accuracy of the model through the test data.
3> Applying the model: - Make the required changes after testing and apply the final model.
When will you use classification over regression? - Classification is used when your target variable is CATEGORICAL. E.g. predict gender of a person, type of color,etc. - Regression is used when target variable is CONTINUOUS. E.g. estimate sale and price of a product, predicting sports score, amount of rainfall, etc. Both belong to the category of Supervised ML Algorithms.
What is Random Forest? Random Forest is a supervised ML algorithm that is generally used for classification problems (it can also be used for regression). - Random Forest operates by constructing multiple decision trees during the training phase. For classification, the class chosen by the majority of the trees is the final decision.
Descriptive statistics are statistics that describe, show or summarize data in a meaningful way such that, for example, patterns might emerge from the data. - Descriptive statistics do not, however, allow us to make conclusions beyond the data we have analysed or reach conclusions regarding any hypotheses we might have made. They are simply a way to describe our data.
Difference Between Linear And Logistic Regression? Two main differences are as follows - 1> - Linear regression requires the dependent variable to be continuous, i.e. numeric values (no categories or groups). - Binary logistic regression requires the dependent variable to be binary: two categories only (0/1). - Multinomial or ordinal logistic regression can have a dependent variable with more than two categories.
2> -Linear regression is based on least square estimation which says regression coefficients should be chosen in such a way that it minimizes the sum of the squared distances of each observed response to its fitted value. - While logistic regression is based on Maximum Likelihood Estimation which says coefficients should be chosen in such a way that it maximizes the Probability of Y given X (likelihood)
How To Treat Outliers? There are several methods to treat outliers -
Percentile Capping, Box-Plot Method, Standard Deviation, Weight of Evidence Transformation.
- Box-plot method: A box plot is a graphical display for describing the distribution of the data. Box plots use the median and the lower and upper quartiles. An outlier is defined as a value above the upper fence or below the lower fence.
- Standard deviation method: A value that lies more than three standard deviations above or below the mean is considered an outlier. This is based on the normal distribution, for which about 99.73% of the data fall within this range.
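A sketch of the box-plot and standard-deviation rules, assuming the common convention of fences at 1.5 times the interquartile range (the card does not state the multiplier). The sample data and function names are hypothetical.

```python
import statistics

def iqr_fences(values, k=1.5):
    """Box-plot fences: Q1 - k*IQR and Q3 + k*IQR (k = 1.5 is an assumed convention)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def three_sigma_bounds(values):
    """Bounds at the mean plus or minus three sample standard deviations."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return mean - 3 * sd, mean + 3 * sd

data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 95]  # made-up sample
low, high = iqr_fences(data)
outliers = [x for x in data if x < low or x > high]
print(outliers)  # [95]
```

Note that an extreme value inflates the standard deviation itself, so the three-sigma rule can be less sensitive than the box-plot rule on small samples.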
What are the types of Outliers? Outliers can be of two types: univariate and multivariate. - Multivariate outliers are outliers in an n-dimensional space.
How to detect Outliers? - The most commonly used method to detect outliers is visualization. - We use various visualization methods, like box plots, histograms and scatter plots.
What is Variable Transformation? In data modelling, transformation refers to the replacement of a variable by a function. For instance, replacing a variable x by the square / cube root or logarithm x is a transformation. In other words, transformation is a process that changes the distribution or relationship of a variable with others.
What Is P-value And How It Is Used For Variable Selection? The p-value, or calculated probability, is the probability of finding the observed, or more extreme, results when the null hypothesis (H0) of a study question is true. It is compared with a chosen significance level to decide whether to reject the null hypothesis. In regression, a predictor whose coefficient has a p-value above that level is a candidate for removal.
- One commonly used significance level is 0.05. - p-value < 0.05 --> reject the null hypothesis in favor of the alternative hypothesis. - p-value > 0.05 --> fail to reject the null hypothesis (this does not prove it is true).
What is Simple vs. Multiple Linear Regression? - Linear regression can be simple linear regression when you have only one independent variable . - Whereas Multiple linear regression will have more than one independent variable.
What is residual? The difference between an observed (actual) value of the dependent variable and the value of the dependent variable predicted from the regression line.
What is regularization and where might it be helpful? What is an example of using regularization in a model? Regularization is useful for reducing variance in the model, meaning avoiding overfitting . For example, we can use L1 regularization in Lasso regression to penalize large coefficients.
Why might it be preferable to include fewer predictors over many? - When we add irrelevant features, it increases model's tendency to overfit because those features introduce more noise. - When two variables are correlated, they might be harder to interpret in case of regression, etc. - curse of dimensionality - adding random noise makes the model more complicated but useless - computational cost
You're Uber and you want to design a heatmap to recommend to drivers where to wait for a passenger. How would you approach this? Based on the past pickup location of passengers around the same time of the day, day of the week (month, year), construct Ask someone for more details. Based on the number of past pickups account for periodicity (seasonal, monthly, weekly, daily, hourly) special events (concerts, festivals, etc.) from tweets
What are various ways to predict a binary response variable? Can you compare two of them and tell me when one would be more appropriate? What's the difference between these? (SVM, Logistic Regression, Naive Bayes, Decision Tree, etc.) Things to look at: N, P, linearly separable?, features independent?, likely to overfit?, speed, performance, memory usage.
Logistic Regression - features roughly linear, problem roughly linearly separable - robust to noise; use L1 or L2 regularization for model selection and to avoid overfitting - the output comes as probabilities - efficient, and the computation can be distributed - can be used as a baseline for other algorithms - (-) can hardly handle categorical features
SVM - with a nonlinear kernel, can deal with problems that are not linearly separable - (-) slow to train; for most industry-scale applications, not really efficient
Naive Bayes - computationally efficient when P is large, alleviating the curse of dimensionality - works surprisingly well in some cases even if the independence condition doesn't hold - with word frequencies as features, the independence assumption can be seen as reasonable, so the algorithm can be used in text categorization - (-) assumes each feature is conditionally independent of every other feature given the class
Tree Ensembles - good for large N and large P; can deal with categorical features very well - non-parametric, so less need to worry about outliers - GBTs work better but the parameters are harder to tune - RF works out of the box, but usually performs worse than GBT
Deep Learning - works well for some classification tasks (e.g. image) - used to squeeze something out of the problem
What error metric would you use to evaluate how good a binary classifier is? What if the classes are imbalanced? What if there are more than 2 groups?
Accuracy: proportion of instances you predict correctly. Pros: intuitive, easy to explain. Cons: works poorly when the class labels are imbalanced and the signal from the data is weak.
AUROC: plot FPR on the x axis and TPR on the y axis for different thresholds. Given a random positive instance and a random negative instance, the AUC is the probability that you can identify who's who. Pros: works well when testing the ability to distinguish the two classes. Cons: can't interpret predictions as probabilities (because AUC is determined by rankings), so can't explain the uncertainty of the model.
Logloss/deviance: Pros: error metric based on probabilities. Cons: very sensitive to false positives and false negatives.
When there are more than 2 groups, we can have k binary classifications and add them up for logloss. Some metrics, like AUC, are only applicable in the binary case.
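The three metrics can be sketched in plain Python. This is an illustration with made-up labels and scores; the AUROC version uses the pairwise-ranking interpretation given above rather than integrating a plotted curve, and the 0.5 threshold is an assumption.

```python
import math

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def log_loss(y_true, y_prob, eps=1e-15):
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / len(y_true)

def auroc(y_true, y_prob):
    # Probability that a random positive is scored above a random negative
    # (ties count as one half).
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]
y_pred = [int(p >= 0.5) for p in y_prob]  # assumed 0.5 decision threshold

print(accuracy(y_true, y_pred))  # 0.5
print(auroc(y_true, y_prob))     # 0.75
print(log_loss(y_true, y_prob))
```

Accuracy depends on the chosen threshold, while AUROC depends only on how the scores rank the instances.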
What are some ways I can make my model more robust to outliers? We can use regularization such as L1 or L2 to reduce variance (at the cost of increased bias).
Changes to the algorithm: - Use tree-based methods instead of regression methods, as they are more resistant to outliers. - For statistical tests, use non-parametric tests instead of parametric ones. - Use robust error metrics such as MAE or Huber loss instead of MSE.
Changes to the data: - Winsorize the data. - Transform the data (e.g. log). - Remove outliers only if you're certain they're anomalies not worth predicting.
What could be some issues if the distribution of the test data is significantly different than the distribution of the training data? The model that has high training accuracy might have low test accuracy. Without further knowledge, it is hard to know which dataset represents the population data, and thus the generalizability of the algorithm is hard to measure. This should be mitigated by repeated splitting of train vs test dataset (as in cross validation). A change in data distribution is called dataset shift. If the train and test data have different distributions, then the classifier would likely overfit to the train data. This issue can be overcome by using a more general learning method.
This can occur when: - P(y|x) is the same but P(x) is different (covariate shift). - P(y|x) is different (concept shift).
The causes can be: - Training samples are obtained in a biased way (sample selection bias). - Train is different from test because of temporal or spatial changes (non-stationary environments).
Solution to covariate shift: importance-weighted cross-validation.
(Given a Dataset) Analyze this dataset and give me a model that can predict this response variable. Start by fitting a simple model (multivariate regression, logistic regression), do some feature engineering accordingly, and then try some complicated models. Always split the dataset into train, validation, test dataset and use cross validation to check their performance. Determine if the problem is classification or regression. Favor simple models that run quickly and you can easily explain. Mention cross validation as a means to evaluate the model. Plot and visualize the data.
What is bias? Bias is the difference between the average prediction of our model and the correct value which we are trying to predict. - Bias is the average difference between the estimator and the true value. - High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting). - Model with high bias pays very little attention to the training data and oversimplifies the model. - It always leads to high error on training and test data.
What is variance? Variance is the variability of model prediction for a given data point; in statistics, it is also a value that tells us the spread of our data. - Variance is the average of the squared distances from each point to the mean. - Variance shows how sensitive the model is to outliers, meaning values that are far away from the mean. - A model with high variance pays a lot of attention to the training data and does not generalize to data it hasn't seen before. - As a result, such models perform very well on training data but have high error rates on test data.
What does it mean to have high variance? A high variance indicates that the data points are very spread out from the mean, and from one another. - A model is too specific (overfitting), leading to high variance
What does it mean to have small variance? A small variance indicates that the data points tend to be very close to the mean, and to each other.
What does high bias mean? In machine learning terminology, underfitting means that a model is too general, leading to high bias, while overfitting means that a model is too specific, leading to high variance. ... Since you can't realistically avoid bias and variance altogether, this is called the bias-variance tradeoff.
What is the difference between bias and precision? Bias is the average difference between the estimator and the true value. Precision is the standard deviation of the estimator. One measure of the overall variability is the Mean Squared Error, MSE, which is the average of the individual squared errors.
What is a normal distribution? - The normal distribution is the most important and most widely used distribution in statistics. - It is sometimes called the "bell curve". - A normal distribution is perfectly symmetrical around its center. - A normal distribution has a bell-shaped density curve described by its mean and standard deviation. - The density curve is symmetrical, centered about its mean, with its spread determined by its standard deviation.
What is a normal distribution determined by? A normal distribution is determined by two parameters: the mean and the variance.
When a normal distribution is called Standard Normal Distribution (SND)? A normal distribution is called Standard Normal Distribution (SND) when its mean is zero and SD is equal to 1.
What is Standard Normal Distribution? The standard normal distribution (z distribution) is a normal distribution with a mean of 0 and a standard deviation of 1.
What is Normal Distribution? - A "normal" distribution is also known as a bell-shaped curve or Gaussian curve. - In a Gaussian or normal distribution, the mean, mode and median all have the same value. - A normal distribution is perfectly symmetrical around its center. - Total area under the curve = 1
How do you convert a normal distribution to a standard normal distribution? So to convert a value to a Standard Score ("z-score"): first subtract the mean, then divide by the Standard Deviation.
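The subtract-then-divide step, sketched with a small made-up sample. This version assumes the population standard deviation; `statistics.stdev` would be used instead for a sample estimate.

```python
import statistics

def z_scores(values):
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population SD (assumption for this sketch)
    # First subtract the mean, then divide by the standard deviation.
    return [(x - mean) / sd for x in values]

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, population SD 2
z = z_scores(scores)
print(z)
```

After the conversion the values have mean 0 and standard deviation 1, so a score of 9 becomes z = 2, i.e. two standard deviations above the mean.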
What is Z score in normal distribution? The standard score (more commonly referred to as a z-score) is a very useful statistic because it (a) allows us to calculate the probability of a score occurring within our normal distribution and (b) enables us to compare two scores that are from different normal distributions.
How do you find the Z value? To find the Z score of a sample, you'll need to find the mean, variance and standard deviation of the sample. To calculate the z-score, you will find the difference between a value in the sample and the mean, and divide it by the standard deviation.
What is Z distribution? In statistics, the Z-distribution is used to help find probabilities and percentiles for regular normal distributions (X). -It serves as the standard by which all other normal distributions are measured. -The Z-distribution is a normal distribution with mean zero and standard deviation 1;
What does a standard deviation of 1 mean? A low standard deviation indicates that the data points tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the data points are spread out over a wider range of values.
What does a standard deviation of 0 mean? Every deviation xi - x̄ equals 0, which means that every data value is equal to the mean. The sample standard deviation of a data set is zero if and only if all of its values are identical.
What is the importance of standard deviation? Standard deviation is a number used to tell how measurements for a group are spread out from the average (mean), or expected value. A low standard deviation means that most of the numbers are very close to the average. A high standard deviation means that the numbers are spread out.
What does it mean when standard deviation is higher than mean? A large standard deviation indicates that the data points can spread far from the mean and a small standard deviation indicates that they are clustered closely around the mean. A standard deviation larger than the mean therefore indicates that the spread is large relative to the typical value.
What are z scores used for? The standard score (more commonly referred to as a z-score) is a very useful statistic because it (a) allows us to calculate the probability of a score occurring within our normal distribution and (b) enables us to compare two scores that are from different normal distributions.
How do you calculate normal distribution? So to convert a value to a Standard Score ("z-score"): first subtract the mean, then divide by the Standard Deviation.
What is Z * in statistics? The Z score is a test statistic that helps you decide whether or not to reject the null hypothesis. - The p-value is the probability of observing a result at least as extreme as the one obtained, assuming the null hypothesis is true. - Z scores are measured in units of standard deviations. Both statistics are associated with the standard normal distribution.
What does the Z score tell you? z-score is how many standard deviations away from the mean a data point is.
What is the goal of Linear Regression? The goal of simple (univariate) linear regression is to model the relationship between a single feature (explanatory variable x) and a continuous valued response (target variable y). y = ax + b
What is regression line? ___is the best-fitting line.
What is offset or residuals? ____are the vertical lines from the regression line to the sample points, called prediction errors. vertical offset = |ŷ - y|
What is difference between simple linear and multiple linear regressions? Simple linear regression has only one x and one y variable.
Multiple linear regression has one y and two or more x variables.
For instance, when we predict rent based on square feet alone that is simple linear regression.
When we predict rent based on square feet and age of the building that is an example of multiple linear regression.
What are the 4 assumptions of linear regression? The 4 assumptions are: - Linearity of the relationship between x and y - Independence of residuals - Normal distribution of residuals - Equal variance of residuals
What is meant by dependent and independent variables? Dependent variable depends upon independent variable.
What is a residual? How is it computed? Residual is also called Error. It is the difference between the predicted y value and the actual y value. Residual = Actual y value - Predicted y value.
It can be positive or negative.
If residuals are always 0, then your model has a Perfect R square i.e. 1.
What is the difference between coefficient of determination, and coefficient of correlation? Coefficient of correlation is the "R" value which is given in the summary table in the regression output. R square is also called coefficient of determination. Multiply R times R to get the R square value. In other words, the coefficient of determination is the square of the coefficient of correlation.
R square or coeff. of determination shows the percentage of variation in y which is explained by all the x variables together. Higher is better. For a least-squares fit with an intercept, evaluated on the data it was fit to, it is always between 0 and 1; it cannot be negative there since it is a squared value.
It is easy to explain the R square in terms of regression. It is not so easy to explain the R in terms of regression.
What is adjusted R2? Adjusted R2 is used to compensate for the addition of variables to the model. It is always less than or equal to R2. As more independent variables (i.e. x variables) are added to the regression model, R2 usually increases. R2 increases even when the additional variables do little to help explain the dependent variable. To compensate for this, adjusted R2 is discounted (i.e. lowered) for the number of independent variables in the model.
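The discount can be sketched with the standard adjusted R2 formula, 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of x variables. The sample size of 50 below is a made-up value for illustration.

```python
def r_squared(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)             # total variation
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n_samples, n_predictors):
    # Discount R2 for the degrees of freedom used up by the predictors.
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

adj = adjusted_r_squared(0.72, n_samples=50, n_predictors=2)
print(round(adj, 4))  # 0.7081
```

With the same R2, adding more predictors lowers the adjusted value, which is why it is preferred when comparing models of different sizes.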
What does coefficient of determination explain? (in terms of variation) R square or coefficient of determination is the percentage of variation in y explained by all the x variables together.
Example, say we are trying to predict Rent based on square feet and number of bedrooms in the apartment. Say the R square for our model is 72% - that means that all the x variables i.e. square feet and number of bedrooms together explain 72% variation in y i.e. Rent.
Now let's say we add another x variable, for example age of the building, to our model. By adding this third relevant x variable the R square is expected to go up. Let's say the new R square is 95%. This means that square feet, number of bedrooms and age of the building together explain 95% of the variation in the Rent.
Remember, coefficient of determination or R square can only be as high as 1 (it can go down to 0, but not any lower).
If we can perfectly predict our y variable (i.e. Rent in this case) then we would have R square (i.e. coefficient of determination) of 1.
Usually the R square of .70 is considered good.
For those cases where we really know nothing much about, say, the hormones which increase our body's immunity against cancer, a regression model with an R square of .05 or even .02 is also considered very good. It shows that at least our x variables (whatever they are) are predicting some effect on cancer immunity.
What happens when p value for f test is lower than alpha i.e. what do you conclude? When p value for f test is lower than alpha (which is usually .05 if nothing else is specified), then we reject H0. We conclude that:
We have the evidence that at least one of the [x variables] has a significant relationship with the [y variable].
Note: you need to replace the terms in the square parenthesis with the actual variable names, e.g., square feet and number of bedrooms for x variables and rent for y variable.
What is the difference between R square and adjusted R square? R square and adjusted R square values are used for model validation in case of linear regression.
- R square measures the proportion of variation in the dependent variable explained by all the independent variables together, whether or not each one is useful.
- Adjusted R squared measures the same proportion but applies a penalty for the number of independent variables, so it increases only when a new variable improves the model more than would be expected by chance.
How to find RMSE and MSE? RMSE and MSE are two of the most common measures of accuracy for a linear regression.
MSE is the mean squared error: MSE = (1/n) Σ (y_i - ŷ_i)^2, where y_i is the actual value and ŷ_i the predicted value. RMSE is the root mean square error: RMSE = sqrt(MSE).
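A minimal sketch of both measures using their standard definitions (mean of squared errors, and its square root). The actual and predicted values are made up.

```python
import math

def mse(y_true, y_pred):
    # Average of the squared differences between actual and predicted values.
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    # Square root puts the error back in the units of y.
    return math.sqrt(mse(y_true, y_pred))

actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]
print(mse(actual, predicted))   # 1.5
print(rmse(actual, predicted))
```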
What are the possible ways of improving the accuracy of a linear regression model? There could be multiple ways of improving the accuracy of a linear regression, most commonly used ways are as follows:
Outlier Treatment: -Regression is sensitive to outliers, hence it becomes very important to treat the outliers with appropriate values. -Replacing the values with mean, median, mode or percentile depending on the distribution can prove to be useful.
What is the significance of an F-test in a linear model? The use of F-test is to test the goodness of the model.
-When the model is re-iterated to improve the accuracy with changes, the F-test values prove to be useful in terms of understanding the effect of overall regression.
What are the disadvantages of the linear model? - Linear regression is sensitive to outliers which may affect the result.
- Over-fitting
- Under-fitting
What are the important assumptions of Linear regression? A linear relationship
Restricted Multi-collinearity value
Homoscedasticity
Firstly, there has to be a linear relationship between the dependent and the independent variables. To check this relationship, a scatter plot proves to be useful.
Secondly, there must be no or very little multi-collinearity between the independent variables in the dataset. The value needs to be restricted, which depends on the domain requirement.
The third is homoscedasticity. It is one of the most important assumptions and states that the error terms have the same variance across all values of the independent variables.
What is heteroscedasticity? Heteroscedasticity is the opposite of homoscedasticity: the variance of the error terms is not constant.
- To correct this, a log transformation of the dependent variable is often used.
Entropy disorder index (a measure of disorder)
Bias coefficient offset (deviation) coefficient
bias discrepancy, deviation
ANOVA? Analysis of variance. - ANOVA testing does not just examine the differences, it also looks at the degree of variance, or the difference between them, in variable means.
- It is a way of analyzing the statistical significance of the variables.
- ANOVA analysis is considered to be more accurate than t-testing because it is more flexible and requires fewer observations.
- ANOVA can uncover relationships among variables, while a t-test does not.
- Variations of ANOVA testing include One-Way ANOVA (used to search for statistically significant differences between two or more independent variables), Two-Way ANOVA (to uncover potential interaction of two independent variables on one dependent
Why do we use Anova? ___ is used to determine whether there are any statistically significant differences between the means of three or more independent (unrelated) groups.
Model capacity Capacity of a model is its ability to fit a wide variety of functions
We can increase capacity by changing the number of input features and adding parameters
representational capacity the family of functions used by the learning algorithm
§ Polynomial vs. neural network regression
effective capacity The capacity the learning algorithm actually reaches in practice; because of imperfections in the optimization algorithm, it
can be less than the representational capacity
Occam's razor (principle of parsimony) Among competing hypotheses that explain observations equally well, we should choose the "simplest" one
Vapnik-Chervonenkis (VC) dimension measures the capacity of a model f as the maximum number of points that can be labeled arbitrarily (basically, given a parameter set theta, how PERFECTLY it can label things)
Bayes error: Error incurred by an oracle that knows the true data-generating distribution (nonzero because of noise)
Regularization any modification we make to a learning algorithm to reduce its generalization error but not its training error
types of regularization § L1 parameter regularization § L2 parameter regularization § Data set augmentation § Injecting noise § Early stopping § Bagging (bootstrap aggregating) § Dropout
k-fold Cross Validation § Partition the dataset into k nonoverlapping subsets § The test error is computed by averaging the test error across k trials § On trial i, the i-th subset is used as the test set and the rest of the data is used as the training set
stochastic gradient descent SGD approximates the gradient using a small set of examples (minibatch)
Most popular type of Grad. Descent for DL models? Stochastic
DL motivations § Curse of dimensionality: Machine learning algorithms become exceedingly difficult when the number of dimensions of the data is high § Deep learning algorithms exhibit reduced generalization error in high-dimensions
Neuron (Perceptron) Converts inputs to outputs: z = w^T x + b, a = σ(z), where σ is the activation function
Loss function used in project -(1/m) * sum(y * log(y_hat) + (1 - y) * log(1 - y_hat)) (binary cross-entropy averaged over m examples)
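A minimal sketch of this loss, read as binary cross-entropy averaged over m examples. The clipping constant `eps` is an assumption added here to avoid log(0); it is not part of the card.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy over m examples."""
    m = len(y_true)
    total = 0.0
    for y, y_hat in zip(y_true, y_pred):
        # Clip predictions away from 0 and 1 so the logs stay finite (assumption).
        y_hat = min(max(y_hat, eps), 1.0 - eps)
        total += y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat)
    return -total / m
```

An uninformative prediction of 0.5 for every example gives a loss of log(2), which is a handy sanity check.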
Sigmoid function 1 / (1 + e^-x)
Sigmoid derivative sigmoid * (1 - sigmoid)
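For reference, here is a small sketch of the standard logistic sigmoid, 1 / (1 + e^-x), and its derivative, sigmoid(x) * (1 - sigmoid(x)). The branch on the sign of x is an added numerical-stability detail, not something the cards mention.

```python
import math


def sigmoid(x):
    """Logistic sigmoid 1 / (1 + e^-x), written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sigmoid_derivative(x):
    """d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)
```

The derivative peaks at 0.25 when x = 0 and shrinks toward 0 as |x| grows, which is the saturation behavior described for sigmoid units.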
ReLU Pros § Gives large and consistent gradients (does not saturate) when active § Efficient to optimize, converges much faster than sigmoid or tanh
ReLU Cons Non zero centered output § Units "die", i.e., when inactive they will never update
dying ReLU problem A "dead" ReLU always outputs the same value (zero as it happens, but that is not important) for any input. Probably this is arrived at by learning a large negative bias term for its weights.
In turn, that means that it takes no role in discriminating between inputs. For classification, you could visualise this as a decision plane outside of all possible input data.
Once a ReLU ends up in this state, it is unlikely to recover, because the gradient of the function is also 0 for every negative input, so gradient descent learning will not alter the weights. "Leaky" ReLUs with a small positive gradient for negative inputs (y=0.01x when x < 0 say) are one attempt to address this issue and give a chance to recover.
The sigmoid and tanh neurons can suffer from similar problems as their values saturate, but there is always at least a small gradient allowing them to recover in the long term.
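The difference between a plain and a leaky ReLU is easiest to see in their gradients. A minimal sketch, using the slope 0.01 from the example above as the default `alpha`:

```python
def relu(x):
    return max(0.0, x)


def relu_grad(x):
    # Zero gradient for inactive units: nothing flows back, so weights stop updating.
    return 1.0 if x > 0 else 0.0


def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x


def leaky_relu_grad(x, alpha=0.01):
    # Small positive gradient for negative inputs gives the unit a chance to recover.
    return 1.0 if x > 0 else alpha
```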
Logistic Sigmoid and tanh Issue Saturate across most of their domain, strongly sensitive only when z is closer to zero
Saturation: when values are driven far to the left or right, the gradient is close to zero and the network does not learn quickly
As number of layers increases, generalization seems to... improve
Jacobian matrix Suppose f : ℝ^n → ℝ^m is a function which takes as input the vector x ∈ ℝ^n and produces as output the vector f(x) ∈ ℝ^m. Then the Jacobian matrix J of f is an m×n matrix whose entries are defined as follows:
J_ij = ∂f_i / ∂x_j
The Jacobian collects all first-order partial derivatives of the transform. It can be interpreted as the best local linear approximation of the transform around a given point, which is what you see if you zoom into that point.
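One way to build intuition for the m×n layout is to approximate each entry J_ij = ∂f_i / ∂x_j with central differences. The helper `numerical_jacobian` and the example function `example_f` below are hypothetical, chosen only for illustration.

```python
def numerical_jacobian(f, x, h=1e-6):
    """Approximate the m x n Jacobian of f at x with central differences."""
    fx = f(x)
    m, n = len(fx), len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        x_plus = list(x)
        x_plus[j] += h
        x_minus = list(x)
        x_minus[j] -= h
        f_plus, f_minus = f(x_plus), f(x_minus)
        for i in range(m):
            # Row i is the output component, column j is the input component.
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * h)
    return J


def example_f(x):
    # f : R^2 -> R^3, so the Jacobian is 3 x 2.
    return [x[0] * x[1], x[0] ** 2, x[0] + 3 * x[1]]


J = numerical_jacobian(example_f, [2.0, 5.0])
```

At the point (2, 5) the analytic Jacobian of this example is [[5, 2], [4, 0], [1, 3]], and the numerical estimate should agree closely.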
Tensor In mathematics, tensors are geometric objects that describe linear relations between geometric vectors, scalars, and other tensors.
We could imagine flattening each tensor into a vector before we run back-propagation, computing a vector-valued gradient, and then reshaping the gradient back into a tensor.
What will happen if the weights are initialized to zero? Every unit in a layer computes the same output and receives the same gradient regardless of input, so the units never learn different features (the symmetry is not broken)
Why do we tend to only regularize the weights and not the biases? The biases typically require less data to fit accurately than the weights and regularizing the bias parameters can introduce a significant amount of underfitting.
L2 norm penalty for regularization adds (lambda / 2) * w^T w to the loss function value
What assumption does L2 rely on? a model with small weights is simpler than a model with large weights
L1 regularization Adds lambda * sum(|w_i|) to the objective function
L1 regularization tends to result in more ____ solutions sparse
it's good for weeding out features that are unneeded
Feature selection tends to use ____ regularization L1
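The two penalties and their gradients can be compared side by side. This is a sketch under the usual conventions: `lam` stands for lambda, and the L1 gradient uses the subgradient sign(w) with 0 at w = 0.

```python
def l2_penalty(weights, lam):
    """(lambda / 2) * w^T w"""
    return 0.5 * lam * sum(w * w for w in weights)


def l1_penalty(weights, lam):
    """lambda * sum(|w_i|)"""
    return lam * sum(abs(w) for w in weights)


def l2_penalty_grad(weights, lam):
    # Shrinks each weight in proportion to its size.
    return [lam * w for w in weights]


def l1_penalty_grad(weights, lam):
    # Constant-magnitude push toward zero (subgradient; 0 at w == 0).
    return [lam * ((w > 0) - (w < 0)) for w in weights]
```

The L1 gradient does not shrink as a weight approaches zero, which is one common explanation for why L1 tends to produce sparse solutions.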
Dataset Augmentation takes current data and creates new data out of it with simple transforms
types of dataset augmentation affine distortion, horizontal flip, noise, random translation, elastic deformation, hue shift
Injecting Noise One way to improve the robustness of neural networks is simply to train them with random noise applied to their inputs.
Early Stopping Instead of running our optimization algorithm until we reach a (local) minimum of validation error, we run it until the error on the validation set has not improved for some amount of time. § Every time the error on the validation set improves, we store a copy of the model parameters. § When the training algorithm terminates, we return these parameters, rather than the latest parameters.
Benefits of Early Stopping unobtrusive, don't have to keep changing a hyperparameter, reduces computational time of the training procedure (the cost is that some data must be held out as a validation set)
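The early stopping procedure (keep the best parameters seen so far, stop after the validation error has not improved for a while) can be sketched as a generic loop. The names `train_step`, `validation_error`, and `patience` are assumptions for this sketch, and the toy problem at the end is made up purely to exercise the loop.

```python
import copy


def train_with_early_stopping(params, train_step, validation_error,
                              patience=3, max_steps=1000):
    """Return the parameters with the lowest validation error seen so far."""
    best_params = copy.deepcopy(params)
    best_error = validation_error(params)
    steps_without_improvement = 0
    for _ in range(max_steps):
        params = train_step(params)
        error = validation_error(params)
        if error < best_error:
            # Validation error improved: store a copy of the parameters.
            best_error = error
            best_params = copy.deepcopy(params)
            steps_without_improvement = 0
        else:
            steps_without_improvement += 1
            if steps_without_improvement >= patience:
                break
    # Return the stored best parameters, not the latest ones.
    return best_params, best_error


# Toy usage (hypothetical): w grows by 1 per step, validation error is (w - 3)^2.
def toy_step(p):
    return {"w": p["w"] + 1.0}


def toy_val_error(p):
    return (p["w"] - 3.0) ** 2


best_params, best_error = train_with_early_stopping({"w": 0.0}, toy_step, toy_val_error)
```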
Ensemble Methods of Reg. Train multiple models and have each of the models vote on the answer
called model averaging
The reason that model averaging works is that different models will usually not make all the same errors on the test set. ...
Bagging - Is a form of ensemble learning. - It is where you repeatedly sample your dataset with replacement, then train one classifier on each of the resulting training sets. The final class is a majority vote - Reduces variance
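The two mechanical pieces of bagging, sampling with replacement and majority voting, are short enough to sketch directly. The function names are illustrative, and training the individual classifiers is left out.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) examples from data with replacement."""
    return [rng.choice(data) for _ in data]


def majority_vote(predictions):
    """Return the most common predicted class (ties go to the first one seen)."""
    return Counter(predictions).most_common(1)[0][0]
```

Because sampling is with replacement, each bootstrap set usually repeats some examples and omits others, which is what makes the trained models differ.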
When to NOT use model averaging? When benchmarking algorithms for specific scientific papers
Dropout randomly shuts off some neurons by using a binary mask
Should dropout be used during testing? No
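A minimal sketch of a dropout mask follows. It assumes the common "inverted dropout" convention, where kept activations are divided by `keep_prob` during training so that nothing needs rescaling at test time; the cards do not say which convention is intended.

```python
import random


def dropout(activations, keep_prob, training, rng):
    """Apply a random binary mask during training; pass through at test time."""
    if not training:
        return list(activations)
    # Each unit is kept with probability keep_prob, otherwise shut off.
    mask = [1.0 if rng.random() < keep_prob else 0.0 for _ in activations]
    # Inverted dropout (assumption): rescale so the expected activation is unchanged.
    return [a * m / keep_prob for a, m in zip(activations, mask)]
```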
adversarial training Training on adversarial examples, i.e., inputs that were deliberately altered in small ways so that they trick a model into misclassifications (stop sign to speed limit, panda to gibbon, etc.), in order to make the model more robust to them
Difference b/w learning and pure optimization? In learning, we care about some performance measure P, which is optimized only indirectly
We reduce a cost function J(θ) using the training data in the hope that doing so will improve P. In pure optimization, minimizing J is a goal in and of itself
What does the "risk" in deep learning refer to? the expected generalization error
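What distr. do we know and which do we not when trying to learn? We know the empirical distribution p_hat defined by the training set, but we don't know the true data-generating distribution p_data
what is empirical risk? The average loss on the training set. Minimizing it is the training process known as empirical risk minimization
Difference between risk and empirical risk? We can compute and optimize the empirical risk directly, but we can only hope that doing so also reduces the risk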
optimization algs that use the whole training set batch or deterministic gradient methods
optimization algs that use only a single example stochastic or sometimes online methods
The term online is usually reserved for the case where the examples are drawn from a stream of continually created examples
somewhere in between online/batch methods minibatch methods
minibatch sizes should be... powers of 2, with the examples selected randomly
if we don't shuffle the examples during each epoch, our minibatches will be... biased
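Shuffling before slicing is what keeps minibatches unbiased. A small sketch with a hypothetical helper `minibatches`, called once per epoch:

```python
import random


def minibatches(data, batch_size, rng):
    """Shuffle the example order, then yield consecutive minibatches."""
    indices = list(range(len(data)))
    rng.shuffle(indices)  # a new random order each time this is called
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]


batches = list(minibatches(list(range(10)), 4, random.Random(0)))
```

The last batch can be smaller than `batch_size` when the dataset size is not a multiple of it.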
Ill Conditioning A problem is ill-conditioned if a small error in the initial data can result in much larger errors in the solution
evaluated using the Hessian (second derivative) matrix
Saddle point A critical point (zero gradient) that is a local min along some directions and a local max along others, so it is neither a local min nor a local max overall
In high-dimensional spaces, which critical point is most likely: local min, local max, or saddle point? Saddle point
exploding gradient problem On the face of an extremely steep cliff structure, the gradient update step can move the parameters extremely far, usually jumping off of the cliff
How to avoid cliffs § Redesign the network to have fewer layers § Use rectified linear (ReLU) activation § Weight regularization, e.g., L2 parameter regularization § Gradient clipping heuristic
Gradient Clipping Heuristic When gradient descent algorithm proposes to make a very large step, the gradient clipping heuristic intervenes to reduce the step size to be small enough
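One common form of this heuristic is clipping by norm: if the gradient's L2 norm exceeds a threshold, rescale it to that threshold while keeping its direction. The sketch below assumes that variant; the threshold name `max_norm` is chosen here for illustration.

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm, keeping its direction."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)  # small steps are left untouched
    scale = max_norm / norm
    return [g * scale for g in grad]
```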
True or false? The gradients we calculate are exact False. With minibatches, the gradient is only a noisy estimate of the true gradient
momentum purpose accelerate learning
How does momentum work? The momentum algorithm accumulates an exponentially decaying moving average of past gradients and continues to move in their direction
what is velocity? The velocity is set to an exponentially decaying moving average of the negative gradient
Difference between standard and nesterov momentum Nesterov momentum evaluates the gradient after the current velocity has been applied, which acts as a correction to the momentum step, whereas the standard algorithm evaluates the gradient at the current parameters
Adagrad Adapts the learning rates of all model parameters by scaling them inversely proportional to the square root of the sum of all of their historical squared gradient values
Adagrad is good when The objective is convex; in that setting AdaGrad performs well and has desirable theoretical properties
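The two momentum variants differ only in where the gradient is evaluated. Below is a minimal sketch for a single scalar parameter; `lr` (learning rate), `alpha` (momentum coefficient), and `grad_fn` are names assumed for this example.

```python
def momentum_step(theta, velocity, grad_fn, lr=0.1, alpha=0.9):
    """Standard momentum: gradient at the current parameters."""
    g = grad_fn(theta)
    # Velocity is a decaying average of negative gradients.
    velocity = alpha * velocity - lr * g
    return theta + velocity, velocity


def nesterov_step(theta, velocity, grad_fn, lr=0.1, alpha=0.9):
    """Nesterov momentum: gradient after applying the current velocity."""
    g = grad_fn(theta + alpha * velocity)
    velocity = alpha * velocity - lr * g
    return theta + velocity, velocity
```

With zero initial velocity the two take the same first step and start to differ from the second step onward.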
Adagrad is bad sometimes because When applied to a non-convex function to train a neural network, the learning trajectory may pass through many different structures and eventually arrive at a region that is a locally convex bowl
AdaGrad shrinks the learning rate according to the entire history of the squared gradient and may have made the learning rate too small before arriving at such a convex structure
Empirically it has been found the accumulation of squared
RMSProp modifies AdaGrad to perform better in the non-convex setting by changing the gradient accumulation into an exponentially weighted moving average
Adam combines RMSProp with Momentum
Estimates with exponential weighting of both first order (gradients) and second order (squares of gradients) moments
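The relationship between AdaGrad, RMSProp, and Adam shows up clearly when the three updates are written for a single scalar parameter. The hyperparameter names and default values below are common textbook choices, used here as assumptions rather than taken from the cards.

```python
import math


def adagrad_step(theta, g, r, lr=0.1, delta=1e-7):
    # r accumulates the entire history of squared gradients.
    r = r + g * g
    theta = theta - lr * g / (delta + math.sqrt(r))
    return theta, r


def rmsprop_step(theta, g, r, lr=0.01, rho=0.9, delta=1e-6):
    # r is an exponentially weighted moving average instead of a full sum.
    r = rho * r + (1 - rho) * g * g
    theta = theta - lr * g / math.sqrt(delta + r)
    return theta, r


def adam_step(theta, g, s, r, t, lr=0.001, beta1=0.9, beta2=0.999, delta=1e-8):
    # s: first moment (momentum-like), r: second moment (RMSProp-like).
    s = beta1 * s + (1 - beta1) * g
    r = beta2 * r + (1 - beta2) * g * g
    # Bias correction for the zero initialization; t is the 1-based step count.
    s_hat = s / (1 - beta1 ** t)
    r_hat = r / (1 - beta2 ** t)
    theta = theta - lr * s_hat / (math.sqrt(r_hat) + delta)
    return theta, s, r
```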
batch normalization § In order to maintain the expressive power of the network, it is common to replace the batch of hidden unit activations H with γH′ + β rather than simply the normalized H′ § The variables γ and β are parameters that are learned by backpropagation and allow the new variable to have any mean and standard deviation.
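A minimal sketch of the forward pass: normalize each hidden unit over the batch to get H′, then apply the learned scale γ and shift β. The small constant `eps` is a standard numerical-stability term assumed here, and H is represented as a list of rows (one per example).

```python
import math


def batch_norm(H, gamma, beta, eps=1e-5):
    """Return gamma * H' + beta, where H' is H normalized per unit over the batch."""
    m, n = len(H), len(H[0])
    out = [[0.0] * n for _ in range(m)]
    for j in range(n):
        column = [row[j] for row in H]
        mean = sum(column) / m
        var = sum((v - mean) ** 2 for v in column) / m
        std = math.sqrt(var + eps)
        for i in range(m):
            normalized = (H[i][j] - mean) / std        # H' has mean 0, std about 1
            out[i][j] = gamma[j] * normalized + beta[j]  # learned scale and shift
    return out
```

After the transform, each unit has mean β and standard deviation close to γ over the batch.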
Standard Weight initialization wij ~ uniform(-1/sqrt(m), 1/sqrt(m))
xavier uniform initialization wij ~ uniform(-sqrt(6 / (m + n)), sqrt(6 / (m + n)))
xavier normal initialization normal distribution centered on 0 with variance 2 / (nin + nout)
he normal initialization normal distribution centered on 0 with variance 2 / nin
lecun normal initialization It draws samples from a truncated normal distribution centered on 0 with stddev = sqrt(1 / fan_in) where fan_in is the number of input units in the weight tensor.
lecun uniform initialization It draws samples from a uniform distribution within [-limit, limit] where limit is sqrt(3 / fan_in) and fan_in is the number of input units in the weight tensor.
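The scale factors of these initializers are easy to mix up, so here they are gathered as small helper functions. The function names are invented for this sketch, the Xavier uniform limit is taken as sqrt(6 / (fan_in + fan_out)), and the truncation used by the LeCun normal variant is not modeled.

```python
import math
import random


def xavier_uniform_limit(fan_in, fan_out):
    return math.sqrt(6.0 / (fan_in + fan_out))


def xavier_normal_std(fan_in, fan_out):
    return math.sqrt(2.0 / (fan_in + fan_out))


def he_normal_std(fan_in):
    return math.sqrt(2.0 / fan_in)


def lecun_normal_std(fan_in):
    return math.sqrt(1.0 / fan_in)


def lecun_uniform_limit(fan_in):
    return math.sqrt(3.0 / fan_in)


def uniform_init(fan_in, fan_out, limit, rng):
    """Sample a fan_in x fan_out weight matrix from uniform(-limit, limit)."""
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]
```

A uniform distribution on [-a, a] has variance a^2 / 3, so the LeCun uniform limit sqrt(3 / fan_in) gives the same variance, 1 / fan_in, as the LeCun normal variant.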
What are the two types of collaborative filtering? 1. Neighborhood methods (NN) 2. Latent Factors - Find a latent space with users and items.