Fatskills
Practice. Master. Repeat.
Study Guide: Interview QA Data Science: Part 3
Source: https://www.fatskills.com/data-science/chapter/interview-qa-data-science-part-3

By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.

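How do you split the data so that the training and test sets have the same class proportions?
Use stratified sampling, for example scikit-learn's StratifiedShuffleSplit class or train_test_split with the stratify argument.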
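
As a rough illustration of what stratification does, here is a minimal standard-library sketch. The function name stratified_split and the toy labels are assumptions made for this example; in practice a library routine would normally be used instead.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) so each class keeps its proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within each class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Hypothetical imbalanced labels: 20% "fraud", 80% "ok"
labels = ["fraud"] * 20 + ["ok"] * 80
train_idx, test_idx = stratified_split(labels, test_fraction=0.25)
train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts, test_counts)
```

Both splits keep the 20/80 class ratio of the full data set.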

What does setting "epoch = 1" mean in a neural network?
It means traversing the whole training data set exactly once.

What do you mean by Ensemble Model? When to use?
An ensemble model combines the predictions of several different models to achieve better accuracy than any single one of them.

Ensemble learning works best when the component classifiers are reasonably accurate and make errors that are independent of each other.

When will you use SVM and when will you use Random Forest?
SVM works best when the data is free of outliers, whereas Random Forest can be used even when outliers are present, since tree-based splits are robust to them.
SVM suits text classification, and Random Forest suits binomial/multinomial classification problems.
Random Forest controls overfitting by averaging many trees, each trained on a bootstrap sample with a random subset of features, rather than by pruning individual trees.
 

Applications of Machine Learning?
Self Driving Cars
Image Classification
Text Classification
Search Engine
Banking, Healthcare Domain


If you are given the use case "Predict whether a transaction is fraud or not fraud", which algorithm would you choose?
Logistic Regression

If you are given the use case "Predict the house price range in the coming years", which algorithm would you choose?
Linear Regression

What is the underlying mathematical knowledge behind Naïve Bayes?
Bayes Theorem

When to use Random Forest and when to use XGBoost?
If the data set is large and you want all the processor cores in your system to be utilised, go for XGBoost, which supports parallel processing. If your data is small, go for Random Forest.

If your model gives 90% accuracy on the training data and 60% accuracy on the test data, what problem are you facing?
Overfitting.

Overfitting can be reduced by methods such as tree pruning and removing noisy, overly fine-grained detail from the data set.

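If you type "How are" in Google, it suggests "How are you" / "How do you do". What is this based on?
This is query auto-completion. Suggestions of this kind typically come from a language model (for example, n-gram frequencies) built over past search queries, rather than from collaborative filtering, which recommends items based on the behaviour of similar users.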

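What are margin, kernels and regularization in SVM?
Margin - The distance between the hyperplane and the closest data points is referred to as the "margin".
Kernels - A kernel function implicitly maps the data into a higher-dimensional space where it is easier to separate. The three common kernels, chosen according to the data you are dealing with, are i) linear, ii) radial (RBF), iii) polynomial.
Regularization - The regularization parameter (the C parameter in Python's sklearn library) tells the SVM optimization how much you want to avoid misclassifying each training example. A large C penalises misclassification heavily and gives a narrower margin; a small C allows a wider margin with more misclassified points.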

What is Boosting? Explain how Boosting works?

What are Null Deviance and Residual Deviance (a Logistic Regression concept)?
Null deviance indicates how well the response is predicted by a model with nothing but an intercept.

Residual deviance indicates how well the response is predicted once the independent variables are added.

Note: the lower the value, the better the model.

What are the different methods to split a node in a decision tree?
Information gain and Gini index.
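
To make the Gini criterion concrete, here is a minimal sketch that scores two candidate splits. The helper names gini and weighted_gini and the toy labels are assumptions for illustration only; the split with the lower weighted impurity is preferred.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(groups):
    """Impurity of a split: child impurities weighted by child size."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * gini(g) for g in groups)

# Hypothetical parent node with 5 "yes" and 5 "no" labels
parent = ["yes"] * 5 + ["no"] * 5
split_a = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]          # fairly pure children
split_b = [["yes"] * 3 + ["no"] * 2, ["yes"] * 2 + ["no"] * 3]  # still mixed children

print(gini(parent), weighted_gini(split_a), weighted_gini(split_b))
```

Here split_a reduces impurity more than split_b, so a tree using the Gini index would choose it.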

What are the weaknesses of the Decision Tree algorithm?
Less suitable for continuous variables, because splitting on thresholds discretises them and loses information

Prone to overfitting, and performs poorly on small data

Why do we use PCA (Principal Component Analysis)?
PCA is a feature extraction technique used for dimensionality reduction.

With an imbalanced data set, will you
calculate the accuracy only, or
precision, recall and F1 score separately?
We need to calculate precision and recall separately, because accuracy alone is dominated by the majority class.
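
A small sketch of why the separate metrics matter, using made-up predictions on an imbalanced data set (5 positives out of 100). The function name and the data are assumptions for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels: 5 positives followed by 95 negatives
y_true = [1] * 5 + [0] * 95
# The model finds 2 of the 5 positives and raises 1 false alarm
y_pred = [1, 1, 0, 0, 0] + [1] + [0] * 94

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(accuracy, precision, recall, f1)
```

Accuracy is 96% here even though the model misses three of the five positive cases, which only the recall figure reveals.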

How do we ensure we are not overfitting the model?
Keep only the attributes/columns that are really important
Use K-fold cross-validation
Make use of dropout in the case of a neural network
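
The K-fold idea can be sketched as follows: every observation is used for validation exactly once and for training in the remaining folds. The generator name k_fold_indices is an assumption for this example, and shuffling is left out for brevity.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(10, 3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

A model would be fitted on each train part and scored on the matching validation part; a large gap between the two scores points to overfitting.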

Steps involved in building a Decision Tree and finding the root node of the tree
Step 1: How to find the root node

Use information gain to measure how much information each attribute carries about the target variable, and place the attribute with the highest information gain at the root node.

Step 2: How to find the information gain

Apply the entropy formula to calculate information gain: Gain(T, X) = Entropy(T) - Entropy(T, X), where T represents the target variable and X represents a feature.

Step 3: Identification of the next node

Within each branch, recompute the information gain for the remaining attributes and place the attribute with the highest gain as the next decision node.

Step 4: Predicted outcome

Recursively repeat steps 2 and 3 until we reach a leaf (terminal) node, which holds the predicted value of the target variable.

Step 5: Tree pruning and optimization for good results

Pruning reduces the size of the decision tree by removing sections of it, to avoid overfitting.
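
The root-node selection in steps 1 and 2 can be sketched as below. The tiny weather-style data set and the function names are assumptions made for illustration; Entropy(T, X) is computed as the size-weighted entropy of the target within each value of the feature.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, feature, target):
    """Gain(T, X) = Entropy(T) - Entropy(T, X)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[target])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - weighted

# Hypothetical training rows; "play" is the target variable
rows = [
    {"outlook": "sunny",    "windy": "no",  "play": "no"},
    {"outlook": "sunny",    "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no",  "play": "yes"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
    {"outlook": "rainy",    "windy": "no",  "play": "yes"},
    {"outlook": "rainy",    "windy": "yes", "play": "no"},
]

gains = {f: information_gain(rows, f, "play") for f in ("outlook", "windy")}
root = max(gains, key=gains.get)  # attribute with the highest gain
print(gains, root)
```

On this toy data "outlook" has the higher gain, so it would be placed at the root and the same calculation repeated inside each branch.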

What is a hyperplane in SVM?
It is the decision boundary (a line in two dimensions) that splits the input variable space; it is selected to best separate the points in the input variable space by their class (0/1, yes/no).

Explain Bigram with an Example?
Eg: I Love Data Science

Bigram - (I Love) (Love Data) (Data Science)
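
A minimal sketch of the same example in code, assuming simple whitespace tokenisation (the function name ngrams is illustrative):

```python
def ngrams(text, n=2):
    """Split text on whitespace and return consecutive n-word tuples."""
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams("I Love Data Science")
print(bigrams)
```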

What are the different activation functions in a neural network?
ReLU, Leaky ReLU, Softmax, Sigmoid
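
For reference, the four functions can be written in a few lines. This is a plain-Python sketch on scalars and lists; the leaky slope alpha = 0.01 is a common default chosen here as an assumption.

```python
import math

def relu(x):
    """max(0, x): passes positives through, zeroes out negatives."""
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Like ReLU, but negatives keep a small slope alpha."""
    return x if x > 0 else alpha * x

def sigmoid(x):
    """Squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    """Turns a list of scores into probabilities that sum to 1."""
    m = max(xs)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

print(relu(-2.0), leaky_relu(-2.0), sigmoid(0.0), softmax([1.0, 2.0, 3.0]))
```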

Which algorithms suit a text classification problem?
SVM and Naïve Bayes. Neural networks built with libraries such as Keras, Theano, CNTK or TFLearn (TensorFlow) are also used.

You are given a training data set with a lot of columns and rows. How do you reduce the dimension of this data?
Principal Component Analysis (PCA) helps here: it keeps the components that explain the maximum variance in the data set.
We can also check the correlation between numerical columns, remove multicollinearity (if it exists) and drop columns that may not impact the model.
We can also split the data into multiple smaller data sets and process them batch-wise.

You are given a data set on fraud detection. The classification model achieved an accuracy of 95%. Is it good?
An accuracy of 95% looks good, but we have to check the following items:

What was the class distribution of the data set for the classification problem?
Are sensitivity and specificity acceptable?

If there are only a few cases of one class and they are not classified correctly, it might be a problem.
In addition, since this is fraud detection, we need to be careful in prediction (i.e. not wrongly predicting a fraudulent transaction as non-fraud).
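
A short sketch of the concern, using invented numbers: if 5% of transactions are fraud, a model that always predicts "non-fraud" already reaches 95% accuracy while catching no fraud at all.

```python
# Hypothetical data: 50 fraud cases (1) among 1000 transactions
y_true = [1] * 50 + [0] * 950
y_pred = [0] * 1000               # a "model" that always predicts non-fraud

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)
sensitivity = tp / (tp + fn)      # share of actual fraud that is caught
specificity = tn / (tn + fp)      # share of non-fraud correctly passed
print(accuracy, sensitivity, specificity)
```

The accuracy looks fine, but a sensitivity of zero shows that every fraudulent transaction is predicted as non-fraud.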

What are prior probability and likelihood?
Prior probability:
The proportion of each class of the dependent variable in the data set.

Likelihood:
The probability of observing a given value of some other variable among the observations of a class, for example the probability that a particular word appears in messages classified as '1'.
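
The two quantities can be computed by simple counting and combined with Bayes' theorem. The message counts below are made up for illustration, with label 1 meaning spam.

```python
from fractions import Fraction

# Hypothetical (contains_word, label) pairs; label 1 = spam, 0 = not spam
messages = [(True, 1)] * 3 + [(False, 1)] * 1 + [(True, 0)] * 1 + [(False, 0)] * 5

n = len(messages)
n_spam = sum(1 for _, label in messages if label == 1)

prior = Fraction(n_spam, n)       # P(spam)
likelihood = Fraction(            # P(word | spam)
    sum(1 for w, label in messages if w and label == 1), n_spam)
likelihood_not = Fraction(        # P(word | not spam)
    sum(1 for w, label in messages if w and label == 0), n - n_spam)

evidence = likelihood * prior + likelihood_not * (1 - prior)   # P(word)
posterior = likelihood * prior / evidence                      # P(spam | word)
print(prior, likelihood, posterior)
```

The posterior of 3/4 matches a direct count: three of the four messages containing the word are spam.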

How can we know if your data is suffering from low bias and high variance?
...

How is kNN different from k-means clustering?
K-means is an unsupervised algorithm that partitions a data set into clusters such that each cluster is homogeneous and its points are close to each other. kNN is a supervised algorithm that classifies an unlabelled observation based on its K nearest neighbours.
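
A minimal sketch of the kNN side: it needs labelled training points and classifies a new point by majority vote among its k nearest neighbours. The points, labels and function name are assumptions for illustration.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify query by majority vote of its k nearest labelled points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labelled training data in two dimensions
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (2, 2)), knn_predict(points, labels, (7, 9)))
```

K-means, by contrast, would be given only the points, without the labels, and would discover the two groups itself.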

A Random Forest has 1000 trees, the training error is 0.00 and the validation error is 20.00. What is the issue here?
It is a classic example of overfitting: the model is not performing well on unseen data. We may have to tune the model using cross-validation and other techniques to overcome overfitting.

A data set contains variables with more than 30% missing values. How will you deal with them?
We can remove them, if that does not impact our model.
We can apply imputation techniques (like MICE, missForest, Amelia) to fill in the missing values.

What do you understand by Type I vs. Type II error?
Type I error occurs when - "we classify a value as positive, when the actual value is negative"

(False Positive)

Type II error occurs when - "we classify a value as negative, when the actual value is positive"

(False Negative)

Based on the data set, how will you know which algorithm to apply?
If it is a classification problem, we can use logistic regression, decision trees, etc.
If it is a regression problem, we can use Linear Regression.
If it is a clustering problem, we can use K-means (KNN, despite the similar name, is a classification algorithm).
We can also apply XGBoost or Random Forest for better accuracy.

Why is normalization important?
A data set can have one column in the range 10,000 to 20,000 and another column with values such as 1, 2, 3. These two columns are clearly on different scales, so the larger one dominates and the trend cannot be analysed accurately. We can apply min-max normalization here to convert both columns to a 0-1 scale.
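
A minimal sketch of min-max normalization on two columns with very different ranges (the column names and values are invented for illustration):

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

income = [10000, 15000, 20000]
rating = [1, 2, 3]

print(min_max_scale(income), min_max_scale(rating))
```

After scaling, both columns span the same 0-1 range, so neither dominates a distance-based or gradient-based model purely because of its units.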

What is Data Science?
Formally, it is a way to quantify your intuitions.
Technically, Data Science is a combination of Machine Learning, Deep Learning and Artificial
Intelligence, where Deep Learning is a subset of AI.

What is Deep Learning?
Deep Learning is a subset of machine learning that uses neural networks with many layers to learn
features directly from the data. The model is trained iteratively, and in general the more data and
iterations it gets, the better it works; it still needs retraining if the data distribution changes.

Where to use R & Python?
R works well whenever the data is structured. Python is efficient at handling unstructured data. R
struggles with high-volume data because it holds data in memory by default, whereas Python back ends
such as Theano/TensorFlow make such workloads fast compared with R.

Which Algorithms are used to do a Binary classification?
Logistic Regression, KNN, Random Forest, CART and C5.0 are a few of the algorithms that can perform
binary classification.

Which Algorithms are used to do a Multinomial classification?
Naïve Bayes, Random Forest are widely used for multinomial classification.

What is the LOGIT function?
The LOGIT function is the log of the odds, where the odds are the probability of success divided
by the probability of failure: logit(p) = log(p / (1 - p)). Logistic regression models the logit as a
linear function of the inputs; inverting it gives the final probability for the binary classification,
and we use the ROC curve to choose the cut-off value of that probability.
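
A small sketch of the logit and its inverse (the sigmoid). The cut-off of 0.5 is only a placeholder assumption; in practice it would be chosen from the ROC curve.

```python
import math

def logit(p):
    """Log of the odds p / (1 - p), defined for 0 < p < 1."""
    return math.log(p / (1 - p))

def inverse_logit(z):
    """Map a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

p = 0.8
odds = p / (1 - p)                 # about 4: success is four times as likely as failure
z = logit(p)                       # log(4)
cutoff = 0.5                       # hypothetical cut-off value
predicted_class = 1 if inverse_logit(z) >= cutoff else 0
print(odds, z, inverse_logit(z), predicted_class)
```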

What are all the pre-processing steps that are highly recommended?
• Structural Analysis
• Outlier Analysis
• Missing value treatments
• Feature engineering

What is a Normal Distribution?
A normal distribution is a symmetric, bell-shaped distribution in which Mean = Median = Mode. (The
equality alone is not sufficient, since other symmetric distributions share it.)

What is the empirical rule?
The empirical rule says that whenever data is normally distributed,
68 percent of the data lies within plus or minus 1 standard deviation of the mean
95 percent of the data lies within plus or minus 2 standard deviations of the mean
99.7 percent of the data lies within plus or minus 3 standard deviations of the mean
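
These percentages can be checked against the standard normal distribution with Python's statistics module (the helper name share_within is an assumption for this sketch):

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k) * 100, 2))
```

The exact shares are slightly above the rounded 68 and 95 percent quoted in the rule.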

What is a Regression problem statement?
With the help of independent variables (X), we predict a target variable (Y). If the target variable
is continuous, i.e. it can take infinitely many possible values, the problem falls under the
Regression problem statement.

What are the error metrics for a Regression problem statement?
Standard error metrics are RMSE and MAPE.
RMSE: Root Mean Squared Error (the square root of the mean of the squared errors).
MAPE: Mean Absolute Percentage Error (the mean of the absolute errors, each expressed as a percentage of the actual value).
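
Both metrics are easy to compute by hand. The actual and predicted values below are invented for illustration, and MAPE as written assumes no actual value is zero.

```python
import math

def rmse(actual, predicted):
    """Root of the mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mape(actual, predicted):
    """Mean absolute error as a percentage of the actual values (actuals must be non-zero)."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

actual = [100, 200, 400]
predicted = [110, 180, 400]

print(rmse(actual, predicted), mape(actual, predicted))
```

RMSE is in the units of the target and punishes large errors more heavily, while MAPE is scale-free and easier to compare across data sets.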

What is the R value in Linear regression?
R is the correlation coefficient, which lies in the range -1 to 1. If its absolute value is closer to 1, it
means that the independent variable is highly correlated with your target variable.
For simple linear regression it can be given by the formula: (slope * standard deviation(X)) / standard deviation(Y)
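
The following sketch checks that formula numerically on a made-up data set: the correlation computed directly agrees with the value obtained from the fitted slope and the two standard deviations.

```python
import math
from statistics import mean, stdev

# Hypothetical paired observations
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

mx, my = mean(x), mean(y)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)

slope = sxy / sxx                              # least-squares slope of y on x
r_direct = sxy / math.sqrt(sxx * syy)          # Pearson correlation coefficient
r_from_slope = slope * stdev(x) / stdev(y)     # formula from the text

print(slope, r_direct, r_from_slope)
```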

What is an Outlier?
An outlier is an observation that lies at an abnormal distance from other values. In a sense, this
definition leaves it up to the analyst (or a consensus process) to decide what will be considered
abnormal.
Example: data - (2,1,1,3,4,2,1,4,5,6,2,6,8,9,64,1,7,9)
Only one data point does not fit the distribution. All the other data points are within the
range 1-9, but one data point has a value of 64, which can be considered an influential data
point.

What are all the mechanisms which can identify Outliers?
Box plot is the standard mechanism which can be used in the univariate Analysis.
Scatter plot can be used for Bi-variate Analysis.

How can we treat Outliers?
Outliers should be investigated first: what is the reason behind the outlier value, and can the value
be corrected manually on the basis of that investigation? If it cannot be corrected, remove the
observation when the value deviates highly. If the deviation is low, we can keep the outlier as it is
and proceed.

What are the standard imputations that can be carried out for missing value treatment?
Mean, median and mode are usually reasonable replacements.
• Central imputation (mean, median or mode)
• KNN imputation

What is the formula for calculating the upper whisker and lower whisker values in a box plot?
Upper whisker: Q3 + 1.5 * IQR
Lower whisker: Q1 - 1.5 * IQR
IQR: Inter-Quartile Range, which is given by Q3 - Q1. Points beyond the whiskers are flagged as outliers.
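
Applied to the sample data from the outlier example, the whisker limits can be sketched as below. Quartile values depend on the interpolation method; this sketch uses the default of Python's statistics.quantiles, which is an assumption rather than the only convention.

```python
from statistics import quantiles

data = [2, 1, 1, 3, 4, 2, 1, 4, 5, 6, 2, 6, 8, 9, 64, 1, 7, 9]

q1, _median, q3 = quantiles(data, n=4)   # the three quartile cut points
iqr = q3 - q1
lower_whisker = q1 - 1.5 * iqr
upper_whisker = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_whisker or x > upper_whisker]
print(q1, q3, iqr, lower_whisker, upper_whisker, outliers)
```

Only the value 64 falls outside the limits, in line with the example.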

What are skewed and uniform distributions?
A uniform distribution is one where the data is spread equally across the range. Right- or left-skewed
data is data that is concentrated on one side of the plot, with a long tail on the other side.

What is the key assumption of Naïve Bayes?
The Naïve Bayes assumption is that all independent variables are equally important as well as
independent of each other. Reality doesn't support this idea much, but surprisingly the Naïve Bayes
model sometimes works efficiently for classification problems.

 
