Fatskills
Practice. Master. Repeat.
Study Guide: All The Useful Machine Learning Interview Questions & Answers - Part 3
Source: https://www.fatskills.com/machine-learning-101/chapter/all-the-useful-machine-learning-interview-questions-answers-part-3


By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.


Q 101. Explain Eigenvectors and Eigenvalues.
Eigenvectors make linear transformations easier to understand. In data science they are most often computed for covariance and correlation matrices.

Simply put, eigenvectors are the directions along which a linear transformation acts purely by stretching, compressing or flipping, without changing the direction itself.

Eigenvalues are the factors by which the transformation scales each eigenvector.
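
As a minimal sketch, the defining relation A v = lambda v can be checked numerically. The 2x2 matrix below is a made-up, covariance-like example (an assumption for illustration), and NumPy's `numpy.linalg.eigh` is used because the matrix is symmetric.

```python
import numpy as np

# Hypothetical symmetric (covariance-like) matrix, chosen only for illustration.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(A)

# Each column of `eigenvectors` is one eigenvector v satisfying A v = lambda v.
checks = [
    np.allclose(A @ eigenvectors[:, i], eigenvalues[i] * eigenvectors[:, i])
    for i in range(len(eigenvalues))
]

print(eigenvalues)
print(checks)
```

Applying A to an eigenvector only rescales it by the matching eigenvalue, which is why the check holds for every column.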

Q 102. How would you define the number of clusters in a clustering algorithm?
The number of clusters can be chosen by comparing silhouette scores for different candidate values.
Clustering is often used to get a broader picture of how many groups are represented in the data. The silhouette score measures how well each point fits its own cluster compared with the nearest other cluster, so the number of cluster centres that gives the highest score is a good choice.

Another technique that can be used is the elbow method.
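
The silhouette idea can be sketched in a few lines of NumPy. The four points and the two candidate labelings below are hand-made assumptions (they do not come from a clustering algorithm), and `silhouette_score` here is a hypothetical helper that assumes at least two clusters.

```python
import numpy as np

def silhouette_score(X, labels):
    """Mean silhouette over all samples; assumes at least two clusters."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            # By convention a point alone in its cluster scores 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other points of the same cluster.
        a = dists[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the points of any other cluster.
        b = min(dists[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Toy data: two tight groups far apart.
X = [[0, 0], [0, 1], [10, 0], [10, 1]]
score_k2 = silhouette_score(X, [0, 0, 1, 1])  # candidate with 2 clusters
score_k3 = silhouette_score(X, [0, 0, 1, 2])  # candidate with 3 clusters
print(score_k2, score_k3)
```

The two-cluster labeling scores higher, so it would be the preferred choice for this toy data.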

Q 103. What are the performance metrics that can be used to estimate the efficiency of a linear regression model?
The performance metrics that are used in this case are:

Mean Squared Error
R2 score
Adjusted R2 score
Mean Absolute Error
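
A minimal sketch of how these four metrics are computed, using made-up targets and predictions. The function name `regression_metrics` and the `n_predictors` argument are assumptions introduced for illustration.

```python
def regression_metrics(y_true, y_pred, n_predictors):
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    mse = ss_res / n
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    r2 = 1 - ss_res / ss_tot
    # Adjusted R2 penalizes the number of predictors used by the model.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
    return {"mse": mse, "mae": mae, "r2": r2, "adj_r2": adj_r2}

metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.5, 7.0, 9.0], n_predictors=1)
print(metrics)
```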

Q 104. What is the default method of splitting in decision trees?
The default splitting criterion for decision tree classifiers (for example in scikit-learn) is the Gini Index. The Gini Index measures the impurity of a particular node.

The criterion can be changed through the classifier's parameters, for example to entropy.
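
For reference, Gini impurity is one minus the sum of squared class proportions in a node. The sketch below uses a hypothetical helper `gini_impurity` and toy labels.

```python
from collections import Counter

def gini_impurity(labels):
    """1 - sum(p_k^2) over the classes present in the node."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

mixed_node = gini_impurity(["a", "a", "b", "b"])  # evenly mixed node
pure_node = gini_impurity(["a", "a", "a", "a"])   # node with a single class
print(mixed_node, pure_node)
```

A pure node has impurity 0, and an evenly mixed two-class node has the maximum two-class impurity of 0.5.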

Q 105. How is p-value useful?
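The p-value is the probability of observing results at least as extreme as the ones obtained, assuming the null hypothesis is true. It is not the probability that the null hypothesis itself is true. It tells us about the statistical significance of our results: a small p-value (commonly below 0.05) means the data would be unlikely under the null hypothesis, so the null hypothesis is rejected.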

Q 106. Can logistic regression be used for classes more than 2?
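Yes. Standard logistic regression is a binary classifier, but it can be extended to more than two classes with multinomial (softmax) logistic regression or with a one-vs-rest scheme that trains one binary classifier per class. Algorithms like Decision Trees and Naïve Bayes classifiers handle multiple classes natively.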

Q 107. What are the hyperparameters of a logistic regression model?
Penalty, solver and C (the inverse of regularization strength) are the tunable hyperparameters of a Logistic Regression classifier in scikit-learn. Candidate values for each can be specified in a Grid Search to tune the classifier.

Q 108. Name a few hyper-parameters of decision trees.
The most important hyperparameters which one can tune in decision trees are:

Splitting criterion
Minimum number of samples per leaf
Minimum number of samples required to split a node
Maximum depth

Q 109. How to deal with multicollinearity?
Multicollinearity can be dealt with in the following ways:

Remove highly correlated predictors from the model.
Use Partial Least Squares Regression (PLS) or Principal Components Analysis (PCA).

Q 110. What is Heteroscedasticity?
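It is a situation in which the variance of the error term is unequal across the range of values of the predictor variable.

It is a problem in regression because ordinary least squares assumes constant error variance. When that assumption fails, the standard errors, and the significance tests built on them, become unreliable.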

Q 111. Is ARIMA model a good fit for every time series problem?
No, the ARIMA model is not suitable for every type of time series problem. There are situations where an ARMA model or other models are a better fit.

ARIMA works best when the time series contains standard temporal structures, such as trend and autocorrelation, that need to be captured.

Q 112. How do you deal with the class imbalance in a classification problem?
Class imbalance can be dealt with in the following ways:

Using class weights
Using resampling (oversampling the minority class or undersampling the majority class)
Using SMOTE
Choosing loss functions like Focal Loss
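
As a sketch of the class-weight option, weights can be set inversely proportional to class frequency. The formula below, n_samples / (n_classes * class_count), is the heuristic scikit-learn documents for `class_weight="balanced"`; the helper name and the 90/10 toy labels are assumptions.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Toy imbalanced target: 90 negatives, 10 positives.
weights = balanced_class_weights([0] * 90 + [1] * 10)
print(weights)
```

The rare class receives a much larger weight, so its errors count more in the loss.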

Q 113. What is the role of cross-validation?
Cross-validation is a technique used to estimate how well a machine learning algorithm will perform on unseen data, by training and testing it several times on different samples of the same dataset. In k-fold cross-validation, the dataset is broken into k parts with an equal number of rows. In each round, one part is held out as the test set while all the other parts form the train set, and the scores from all rounds are averaged.
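
A minimal sketch of how k-fold splits can be generated by index. The helper `k_fold_indices` is hypothetical and does not shuffle; in practice the rows are usually shuffled first unless the order matters.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k consecutive folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size

folds = list(k_fold_indices(10, 5))
for train, test in folds:
    print("train:", train, "test:", test)
```

Every row appears in exactly one test fold, so each observation is used for testing once and for training k - 1 times.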

Q 114. What is a voting model?
A voting model is an ensemble model which combines several classifiers. To produce the final result for a classification problem, it takes the class predicted by each model for a given data point and picks the class that received the most votes.
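
Hard (majority) voting can be sketched as follows. The three lists of predictions stand in for three hypothetical classifiers, and `hard_vote` is an illustrative helper, not a library function.

```python
from collections import Counter

def hard_vote(predictions_per_model):
    """predictions_per_model: one list of predicted labels per model."""
    voted = []
    # zip(*...) groups the predictions of all models for the same data point.
    for sample_preds in zip(*predictions_per_model):
        voted.append(Counter(sample_preds).most_common(1)[0][0])
    return voted

model_a = ["spam", "ham", "spam"]
model_b = ["spam", "spam", "ham"]
model_c = ["ham", "ham", "spam"]
final = hard_vote([model_a, model_b, model_c])
print(final)
```

Using an odd number of models for a two-class problem, as here, avoids tied votes.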

Q 115. How to deal with very few data samples? Is it possible to make a model out of it?
Yes. If very few data samples are available, we can use oversampling to generate additional data points and then build the model on the enlarged dataset.

Q 116. What are the hyperparameters of an SVM?
The gamma value, the C value (regularization parameter) and the type of kernel are the hyperparameters of an SVM model.

Q 117. What is Pandas Profiling?
Pandas Profiling is a Python library that generates an exploratory report for a pandas DataFrame. The report gives statistics on NULL values and usable values for each variable, which makes variable selection and data selection in the preprocessing phase much more effective.

Q 118. What impact does correlation have on PCA?
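PCA relies on correlation between variables. When variables are correlated, a few principal components can capture most of the variance, so PCA reduces dimensionality effectively. When variables are uncorrelated, every component carries a similar share of the variance and PCA offers little benefit.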

Q 119. How is PCA different from LDA?
PCA is unsupervised. LDA is supervised.

PCA takes into consideration the variance. LDA takes into account the distribution of classes.

Q 120. What distance metrics can be used in KNN?
The following distance metrics can be used in KNN:

Manhattan
Minkowski
Tanimoto
Jaccard
Mahalanobis
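
A short sketch of three of these metrics on toy inputs. The function names are illustrative; Manhattan distance is written as the p = 1 case of Minkowski distance, and p = 2 gives the Euclidean distance.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two numeric vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def manhattan(a, b):
    """Manhattan distance is Minkowski distance with p = 1."""
    return minkowski(a, b, 1)

def jaccard_distance(set_a, set_b):
    """1 - |intersection| / |union| for two sets."""
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)

d_manhattan = manhattan((0, 0), (3, 4))
d_euclidean = minkowski((0, 0), (3, 4), 2)
d_jaccard = jaccard_distance({1, 2, 3}, {2, 3, 4})
print(d_manhattan, d_euclidean, d_jaccard)
```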

Q 121. Which metrics can be used to measure correlation of categorical data?
The chi-square test can be used for doing so. It tests whether two categorical variables are associated.

Q 122. Which algorithm can be used in value imputation in both categorical and continuous categories of data?
KNN is one algorithm that can be used for imputation of both categorical and continuous variables.

Q 123. When should ridge regression be preferred over lasso?
We should use ridge regression when we want to use all predictors and not remove any as it reduces the coefficient values but does not nullify them.

Q 124. Which algorithms can be used for important variable selection?
Random Forest and XGBoost can be used for variable selection, since both provide feature importance scores that can be plotted as variable importance charts.

Q 125. What ensemble technique is used by Random forests?
Bagging is the technique used by Random Forests. Random forests are a collection of trees, each trained on data sampled from the original dataset, with the final prediction being the majority vote (classification) or the average (regression) of all trees.

Q 126. What ensemble technique is used by gradient boosting trees?
Boosting is the technique used by GBM.

Q 127. If we have a high bias error what does it mean? How to treat it?
High bias error means that the model we are using is ignoring important trends in the data and is underfitting.

To reduce underfitting:

We need to increase the complexity of the model
The number of features needs to be increased
Underfitting can sometimes give the impression that the data is noisy. Removing noise from the data helps the model find the most important signals and make effective predictions.

Increasing the number of epochs lengthens the training of the model, which also helps in reducing the error.

Q 128. Which type of sampling is better for a classification model and why?
Stratified sampling is better for classification problems because it takes into account the balance of classes in the train and test sets. The proportion of classes is maintained, and hence the model performs better. With random sampling, the data is divided into two parts without considering the balance of classes in the train and test sets. Hence some classes might be present only in the train set or only in the validation set, and the results of the resulting model are poor in this case.

Q 129. What is a good metric for measuring the level of multicollinearity?
VIF (variance inflation factor), or 1/tolerance, is a good measure of multicollinearity in models. Tolerance is the share of the variance of a predictor that is not explained by the other predictors, so VIF = 1 / (1 - R2), where R2 comes from regressing that predictor on all the others. The higher the VIF value, the greater the multicollinearity amongst the predictors.

A rule of thumb for interpreting the variance inflation factor:

1 = not correlated.
Between 1 and 5 = moderately correlated.
Greater than 5 = highly correlated.
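
A minimal NumPy sketch of the VIF calculation, where each predictor is regressed on the others and VIF = 1 / (1 - R2) for that regression. The synthetic data (two nearly identical predictors plus one independent predictor) and the helper name `vif` are assumptions for illustration.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    factors = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress predictor j on the remaining predictors plus an intercept.
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        factors.append(1.0 / (1.0 - r2))
    return factors

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)  # almost a copy of x1
x3 = rng.normal(size=200)             # unrelated to x1 and x2
vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)
```

The two near-duplicate predictors get a very large VIF, while the independent one stays close to 1.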

Q 130. When can a categorical variable be treated as a continuous variable, and what effect does it have when done so?
A categorical predictor can be treated as a continuous one when the data it represents is ordinal, that is, when its categories have a natural order. Treating an ordinal predictor as continuous can improve the performance of the model.

Q 131. What is the role of maximum likelihood in logistic regression?
Maximum likelihood estimation finds the values of the predictor coefficients under which the observed data is most probable. These are the coefficient estimates that the fitted logistic regression model uses.

Q 132. Which distance do we measure in the case of KNN?
KNN typically measures Euclidean distance to determine the nearest neighbours for continuous features, while Hamming distance is used when the features are categorical. K-means uses Euclidean distance.

Q 133. What is a pipeline?
A pipeline is a way of writing software such that each step in building a model (for example scaling, feature selection and fitting) is chained into a single object, and the process calls the individual functions for the individual tasks. The tasks are carried out in sequence on the data, and the entire process can be run on n threads by using composite estimators in scikit-learn.

Q 134. Which sampling technique is most suitable when working with time-series data?
We can use a custom iterative (forward-chaining) sampling in which we continuously add samples to the train set in time order. The thing to keep in mind is that the sample used for validation in one round is added to the train set of the next round, and a new, later sample is used for validation.
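
The iterative scheme described above can be sketched as an expanding-window split. The helper `expanding_window_splits` is hypothetical and assumes the rows are already sorted by time.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, validation_indices) in time order."""
    fold = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        # Train on everything seen so far, validate on the next block.
        train = list(range(0, i * fold))
        validation = list(range(i * fold, (i + 1) * fold))
        yield train, validation

splits = list(expanding_window_splits(12, 3))
for train, validation in splits:
    print("train:", train, "validation:", validation)
```

Each validation block becomes part of the next train set, and the model is never validated on data that is older than its training data.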

Q 135. What are the benefits of pruning?
Pruning helps in the following:

Reduces overfitting
Shortens the size of the tree
Reduces complexity of the model
Slightly increases bias, in exchange for lower variance

Q 136. What is normal distribution?
The distribution having the below properties is called normal distribution. 

The mean, mode and median are all equal.
The curve is symmetric at the center (i.e. around the mean, μ).
Exactly half of the values are to the left of center and exactly half the values are to the right.
The total area under the curve is 1.

Q 137. What is the 68 per cent rule in normal distribution?
The normal distribution is a bell-shaped curve.
Most of the data points lie around the mean. Approximately 68 per cent of the data falls within one standard deviation of the mean (about 95 per cent within two and 99.7 per cent within three), since the curve is bell-shaped and has no skewness.
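
The 68 per cent figure can be checked with the error function from Python's standard library: for a normal distribution, the probability of falling within k standard deviations of the mean is erf(k / sqrt(2)). The helper name `prob_within` is an assumption.

```python
import math

def prob_within(k):
    """P(|X - mu| <= k * sigma) for a normally distributed X."""
    return math.erf(k / math.sqrt(2))

within_1 = prob_within(1)
within_2 = prob_within(2)
within_3 = prob_within(3)
print(round(within_1, 4), round(within_2, 4), round(within_3, 4))
```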

Q 138. What is a chi-square test?
A chi-square test determines if sample data matches a population.

A chi-square test for independence compares two variables in a contingency table to see if they are related.

A very small chi-square test statistic implies that the observed data fits the expected data extremely well.
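
A minimal sketch of the test statistic for independence, computed as the sum of (observed - expected)^2 / expected over all cells. The 2x2 contingency tables are made-up counts, the helper name is hypothetical, and no continuity correction is applied.

```python
def chi_square_independence(table):
    """Return (statistic, degrees_of_freedom) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

stat, dof = chi_square_independence([[30, 10], [20, 40]])
stat_indep, _ = chi_square_independence([[10, 10], [20, 20]])
print(stat, dof, stat_indep)
```

The second table has identical proportions in both rows, so its statistic is zero: the observed counts match the expected counts exactly.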

Q 139. What is a random variable?
A random variable is a variable whose value is the outcome of a random experiment. Example: tossing a coin, we could get Heads or Tails. Rolling a die: we could get any of 6 values.

Q 140. What is the degree of freedom?
It is the number of independent values or quantities which can be assigned to a statistical distribution. It is used in Hypothesis testing and chi-square test.

Q 141. Which kind of recommendation system is used by amazon to recommend similar items?
Amazon uses a collaborative filtering algorithm for the recommendation of similar items. Its approach is known as item-to-item collaborative filtering: items are mapped to similar items based on how often the same users bought or liked them together.

Q 142. What is a false positive?
It is a test result which wrongly indicates that a particular condition or attribute is present.

Example - 'Stress testing, a routine diagnostic tool used in detecting heart disease, results in a significant number of false positives in women'

Q 143. What is a false negative?
A test result which wrongly indicates that a particular condition or attribute is absent.

Example - 'it's possible to have a false negative - the test says you aren't pregnant when you are'

Q 144. What is the error term composed of in regression?
In regression, the error is the sum of bias error (squared bias), variance error and irreducible error. Bias and variance error can be reduced, but the irreducible error cannot.

Q 145. Which performance metric is better R2 or adjusted R2?
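Adjusted R2, because it accounts for the number of predictors. R2 never decreases when more predictors are added, even if they add nothing useful, so it can show an apparent improvement. Adjusted R2 increases only when a new predictor improves the model more than would be expected by chance.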

Q 146. What's the difference between Type I and Type II error?
Type I and Type II errors refer to false results.
Type I is equivalent to a False positive while Type II is equivalent to a False negative. In a Type I error, a null hypothesis which is true gets rejected. In a Type II error, a null hypothesis which is false fails to get rejected.

Q 147. What do you understand by L1 and L2 regularization?
L2 regularization: It penalizes the sum of squared weights and tends to spread the error among all the terms, shrinking weights without making them zero. L2 corresponds to a Gaussian prior.

L1 regularization: It penalizes the sum of absolute weights and gives sparse solutions, with many weights driven to exactly 0. L1 corresponds to setting a Laplacian prior on the terms.
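
The different behaviour of the two penalties can be illustrated with the shrinkage step associated with each one: soft-thresholding for the penalty lam * |w| and proportional scaling for lam * w^2. This is a one-step illustration on made-up weights, not a full solver, and the function names are assumptions.

```python
import math

def soft_threshold(w, lam):
    """Shrinkage step for an L1 penalty lam * |w|: small weights become exactly 0."""
    return math.copysign(max(abs(w) - lam, 0.0), w)

def l2_shrink(w, lam):
    """Shrinkage step for an L2 penalty lam * w^2: weights shrink but stay nonzero."""
    return w / (1.0 + 2.0 * lam)

weights = [0.05, -0.3, 2.0]
lam = 0.1
l1_result = [soft_threshold(w, lam) for w in weights]
l2_result = [l2_shrink(w, lam) for w in weights]
print(l1_result)
print(l2_result)
```

The smallest weight is set to exactly zero under L1, while L2 only scales every weight down by the same factor.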

Q 148. Which one is better, Naive Bayes Algorithm or Decision Trees?
It depends on the problem you are solving, but some general advantages are as follows:

Naive Bayes:

Works well with small datasets, whereas decision trees need more data
Lesser overfitting
Smaller in size and faster in processing

Decision Trees:

Decision Trees are very flexible, easy to understand, and easy to debug
No preprocessing or transformation of features required
Prone to overfitting but you can use pruning or Random forests to avoid that.

Q 149. What do you mean by the ROC curve?
Receiver operating characteristic (ROC) curve: The ROC curve illustrates the diagnostic ability of a binary classifier. It is created by plotting the True Positive Rate against the False Positive Rate at various threshold settings. The performance metric of the ROC curve is AUC (area under the curve). The higher the area under the curve, the better the prediction power of the model.
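
A minimal sketch of how the curve and its area can be computed by sweeping a threshold over the predicted scores and applying the trapezoidal rule. The labels and scores are toy values, and the helper names are assumptions.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    thresholds = sorted(set(scores), reverse=True)
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for t in thresholds:
        # Everything scored at or above the threshold is predicted positive.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(y_true, scores)
area = auc(points)
print(points, area)
```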

Q 150. What do you mean by AUC curve?
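AUC stands for area under the curve, here the ROC curve. It summarizes the ROC curve in a single number between 0 and 1, where 0.5 corresponds to random guessing. The higher the area under the curve, the better the prediction power of the model.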

Q 151. What is log likelihood in logistic regression?
It is the sum, over all records, of the log of the probability that the model assigned to the outcome actually observed. At record level, the natural log of that probability is calculated for each record, and those values are totaled. That total (ll) is then used as the basis for the deviance (-2 x ll) and the likelihood (exp(ll)).

The same calculation can be applied to a naive model that assumes absolutely no predictive power, and to a saturated model that assumes perfect predictions.

The likelihood values are used to compare different models, while the deviances (test, naive, and saturated) can be used to determine the predictive power and accuracy. Fit measured on the development data set tends to be optimistic, so it has to be checked again once the model is applied to another data set.
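
As a sketch, the log-likelihood of a fitted model and of a naive model can be computed directly from predicted probabilities, with the deviance taken as minus two times the log-likelihood. The outcomes and probabilities below are made up, and the naive model is assumed to predict the overall positive rate for every record.

```python
import math

def log_likelihood(y_true, p_pred):
    """Sum of log P(observed outcome) over all records."""
    return sum(math.log(p) if y == 1 else math.log(1 - p)
               for y, p in zip(y_true, p_pred))

y = [1, 0, 1, 1]
p_model = [0.9, 0.2, 0.8, 0.6]      # hypothetical fitted probabilities of class 1
p_naive = [sum(y) / len(y)] * len(y)  # naive model: same base rate for every record

ll_model = log_likelihood(y, p_model)
ll_naive = log_likelihood(y, p_naive)
deviance_model = -2 * ll_model
deviance_naive = -2 * ll_naive
print(ll_model, ll_naive, deviance_model, deviance_naive)
```

A higher log-likelihood (lower deviance) than the naive model indicates that the predictors add information.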

Q 152. How would you evaluate a logistic regression model?
Model evaluation is a very important part of any analysis. It answers the following questions:

How well does the model fit the data? Which predictors are most important? Are the predictions accurate?

The following criteria can be used to assess model performance:

1. Akaike Information Criterion (AIC): In simple terms, AIC estimates the relative amount of information lost by a given model. The less information lost, the higher the quality of the model. Therefore, we always prefer models with minimum AIC.
2. Receiver operating characteristic (ROC) curve: The ROC curve illustrates the diagnostic ability of a binary classifier. It is created by plotting the True Positive Rate against the False Positive Rate at various threshold settings. The performance metric of the ROC curve is AUC (area under the curve). The higher the area under the curve, the better the prediction power of the model.
3. Confusion Matrix: In order to find out how well the model does in predicting the target variable, we use a confusion matrix (classification rate). It is a tabular representation of actual vs predicted values which helps us to find the accuracy of the model.

Q 153. What are the advantages of SVM algorithms?
SVM algorithms have advantages mainly in terms of complexity. Both logistic regression and SVM can form non-linear decision surfaces and can be coupled with the kernel trick. If logistic regression can be coupled with a kernel, then why use SVM?

- SVM is found to have better performance in practice in most cases.
- SVM is computationally cheaper, O(N^2*K), where K is the number of support vectors (support vectors are those points that lie on the class margin), whereas logistic regression is O(N^3).
- The classifier in SVM depends only on a subset of points. Since we need to maximize the distance between the closest points of the two classes (the margin), we need to care about only a subset of points, unlike logistic regression.

Q 154. Why does XGBoost perform better than SVM?
The first reason is that XGBoost is an ensemble method that uses many trees to make a decision, with each new tree correcting the errors of the previous ones.

SVM is a linear separator. When data is not linearly separable, SVM needs a kernel to project the data into a space where it can be separated. There lies its greatest strength and weakness: by projecting data into a high-dimensional space, SVM can find a linear separation for almost any data, but it needs a kernel to do so, and we can argue that there is no perfect kernel for every dataset.

Q 155. What is the difference between SVM Rank and SVR (Support Vector Regression)?
One is used for ranking and the other is used for regression.

There is a crucial difference between regression and ranking. In regression, the absolute value is crucial. A real number is predicted.

In ranking, the only thing of concern is the ordering of a set of examples. We only want to know which example has the highest rank, which one has the second-highest, and so on. From the data, we only know that example 1 should be ranked higher than example 2, which in turn should be ranked higher than example 3, and so on. We do not know by how much example 1 is ranked higher than example 2, or whether this difference is bigger than the difference between examples 2 and 3.

Q 156. What is the difference between the normal soft margin SVM and SVM with a linear kernel?

Hard-margin
You have the basic SVM - hard margin. This assumes that data is very well behaved, and you can find a perfect classifier - which will have 0 error on train data.

Soft-margin
Data is usually not well behaved, so SVM hard margins may not have a solution at all. So we allow for a little bit of error on some points. So the training error will not be 0, but average error over all points is minimized.

Kernels
The above assume that the best classifier is a straight line. But what if it is not a straight line (e.g. it is a circle, where inside the circle is one class and outside is another)? If we are able to map the data into higher dimensions, the higher dimension may give us a straight line.

Q 157. How is linear classifier relevant to SVM?
An SVM is a type of linear classifier. If you don't mess with kernels, it's arguably the simplest type of linear classifier.

Linear classifiers learn linear functions from your data that map your input to scores like so: scores = Wx + b, where W is a matrix of learned weights, b is a learned bias vector that shifts your scores, and x is your input data. This type of function may look familiar to you if you remember y = mx + b from high school.

A typical SVM loss function (the function that tells you how good your calculated scores are in relation to the correct labels) would be hinge loss. It takes the form: Loss = sum over all scores except the correct score of max(0, score - score(correct class) + 1).
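
The hinge loss formula above can be written out directly. The three class scores are made-up numbers for a single example, and `multiclass_hinge_loss` is an illustrative helper.

```python
def multiclass_hinge_loss(scores, correct_class, margin=1.0):
    """Sum of max(0, score_j - score_correct + margin) over the wrong classes j."""
    correct = scores[correct_class]
    return sum(max(0.0, s - correct + margin)
               for j, s in enumerate(scores) if j != correct_class)

# Scores for three classes; class 0 is the correct one.
loss = multiclass_hinge_loss([3.2, 5.1, -1.7], correct_class=0)
# Here the correct class wins by more than the margin, so nothing is penalized.
no_loss = multiclass_hinge_loss([5.0, 1.0, -2.0], correct_class=0)
print(loss, no_loss)
```

Only wrong classes that score within the margin of the correct class (or above it) contribute to the loss.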

Q 158. What are the advantages of using a naive Bayes for classification?
Very simple, easy to implement and fast.
If the NB conditional independence assumption holds, then it will converge quicker than discriminative models like logistic regression.
Even if the NB assumption doesn't hold, it works great in practice.
Need less training data.
Highly scalable. It scales linearly with the number of predictors and data points.
Can be used for both binary and multi-class classification problems.
Can make probabilistic predictions.
Handles continuous and discrete data.
Not sensitive to irrelevant features.

Q 159. Are Gaussian Naive Bayes the same as binomial Naive Bayes?
No, they make different assumptions about the features.

Binomial (Bernoulli) Naive Bayes: It assumes that all our features are binary, so that they take only two values. For example, 0 can represent 'word does not occur in the document' and 1 'word occurs in the document'.

Gaussian Naive Bayes: Because of the assumption of the normal distribution, Gaussian Naive Bayes is used in cases when all our features are continuous. For example in Iris dataset features are sepal width, petal width, sepal length, petal length. So its features can have different values in the data set as width and length can vary. We can't represent features in terms of their occurrences. This means data is continuous. Hence we use Gaussian Naive Bayes here.

Q 160. What is the difference between the Naive Bayes Classifier and the Bayes classifier?
Naive Bayes assumes conditional independence, P(X|Y,Z) = P(X|Z), whereas more general Bayes Nets (sometimes called Bayesian Belief Networks) allow the user to specify which attributes are, in fact, conditionally independent.

For the Bayesian network as a classifier, the features are selected based on some scoring functions like Bayesian scoring function and minimal description length(the two are equivalent in theory to each other given that there is enough training data). The scoring functions mainly restrict the structure (connections and directions) and the parameters(likelihood) using the data. After the structure has been learned the class is only determined by the nodes in the Markov blanket(its parents, its children, and the parents of its children), and all variables given the Markov blanket are discarded.

Q 161. In what real world applications is Naive Bayes classifier used?
Some real-world examples are given below:

Marking an email as spam or not spam
Classifying a news article as being about technology, politics, or sports
Checking whether a piece of text expresses positive or negative emotions
Face recognition software

Q 162. Is naive Bayes supervised or unsupervised?
First, Naive Bayes is not one algorithm but a family of Algorithms that inherits the following attributes:

1. Discriminant Functions
2. Probabilistic Generative Models
3. Bayesian Theorem
4. Naive Assumptions of Independence and Equal Importance of feature vectors.

Moreover, it is a special type of Supervised Learning algorithm that could do simultaneous multi-class predictions (as depicted by standing topics in many news apps).

Since these are generative models, they may be further classified, based on the distribution assumed for each feature, as Gaussian Naive Bayes, Multinomial Naive Bayes, Bernoulli Naive Bayes, etc.

Q 163. What do you understand by selection bias in Machine Learning?
Selection bias is the bias introduced by selecting individuals, groups or data for analysis in a way that proper randomization is not achieved. As a result, the sample obtained is not representative of the population intended to be analyzed. It is sometimes referred to as the selection effect. It is a distortion of a statistical analysis that results from the method of collecting samples. If you don't take selection bias into account, then some conclusions of the study may not be accurate.

The types of selection bias include:

Sampling bias: It is a systematic error due to a non-random sample of a population causing some members of the population to be less likely to be included than others resulting in a biased sample.
Time interval: A trial may be terminated early at an extreme value (often for ethical reasons), but the extreme value is likely to be reached by the variable with the largest variance, even if all variables have a similar mean.
Data: When specific subsets of data are chosen to support a conclusion or rejection of bad data on arbitrary grounds, instead of according to previously stated or generally agreed criteria.
Attrition: Attrition bias is a kind of selection bias caused by attrition (loss of participants) discounting trial subjects/tests that did not run to completion.

Q 164. What do you understand by Precision and Recall?
Precision and recall are used in pattern recognition, information retrieval and classification in machine learning. Precision, also called positive predictive value, is the fraction of relevant instances among the retrieved instances.

Recall, also known as sensitivity, is the fraction of the total amount of relevant instances which were actually retrieved.

Both precision and recall are therefore based on an understanding and measure of relevance.
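
In classification terms, precision is TP / (TP + FP) and recall is TP / (TP + FN). A minimal sketch on made-up labels, with a hypothetical helper `precision_recall`:

```python
def precision_recall(y_true, y_pred, positive=1):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    # Guard against division by zero when nothing is predicted or nothing is relevant.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

precision, recall = precision_recall([1, 1, 1, 1, 0, 0], [1, 1, 0, 0, 1, 0])
print(precision, recall)
```

Here two of the three predicted positives are correct (precision 2/3), but only two of the four actual positives are found (recall 1/2).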

Q 165. What Are the Three Stages of Building a Model in Machine Learning?
To build a model in machine learning, you need to follow a few steps:

Understand the business model
Data acquisitions
Data cleaning
Exploratory data analysis
Use machine learning algorithms to make a model
Use unknown dataset to check the accuracy of the model

Q 166. How Do You Design an Email Spam Filter in Machine Learning?
Understand the business model: Try to understand the attributes related to spam mail
Data acquisitions: Collect spam mail in order to read the hidden patterns in it
Data cleaning: Clean the unstructured or semi-structured data
Exploratory data analysis: Use statistical concepts to understand the data, like spread, outliers, etc.
Use machine learning algorithms to make a model: Naive Bayes or some other algorithm can be used
Use unknown dataset to check the accuracy of the model

Q 167. What is the difference between Entropy and Information Gain?
Entropy measures the impurity of a set of labels, while information gain is the decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most homogeneous branches). The first step is to calculate the entropy of the target.
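
A minimal sketch of both quantities, using base-2 entropy and a made-up split of ten labels into two branches. The helper names are assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5
# Hypothetical split on some attribute: each branch is mostly one class.
split = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]
parent_entropy = entropy(parent)
gain = information_gain(parent, split)
print(parent_entropy, gain)
```

The attribute whose split gives the largest gain would be chosen for the node.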

Q 168. What are collinearity and multicollinearity?
Collinearity is a linear association between two predictors. Multicollinearity is a situation where two or more predictors are highly linearly related.

Q 169. What is Kernel SVM?
Kernel SVM is an SVM that uses the kernel trick to form non-linear decision surfaces: the data is implicitly mapped into a higher-dimensional space where a linear separator can be found. Logistic regression can also be coupled with a kernel, so why use SVM?

- SVM is found to have better performance in practice in most cases.
- SVM is computationally cheaper, O(N^2*K), where K is the number of support vectors (support vectors are those points that lie on the class margin), whereas logistic regression is O(N^3).
- The classifier in SVM depends only on a subset of points. Since we need to maximize the distance between the closest points of the two classes (the margin), we need to care about only a subset of points, unlike logistic regression.

Q 170. What is the process of carrying out a linear regression?
Linear Regression Analysis consists of more than just fitting a straight line through a cloud of data points. It consists of 3 stages:

(1) analyzing the correlation and directionality of the data,
(2) estimating the model, i.e., fitting the line, 
(3) evaluating the validity and usefulness of the model.

 

Also see:
All The Useful Machine Learning Interview Questions & Answers - Part 1
All The Useful Machine Learning Interview Questions & Answers - Part 2

 


