Data Science

Data Science Questions




What is p-value?
P = Probability
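Probability of observing a result at least as extreme as the one obtained, assuming the null hypothesis is true. Generally a P-value < 0.05 (and sometimes < 0.01 or other values, depending on the trial design) indicates statistical significance.
If P < 0.05, a result this extreme would occur less than 5% of the time if the null hypothesis were true. It is not the probability that the null hypothesis itself is true.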
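As a rough illustration, the sketch below computes a one-sided p-value for a coin-flip experiment with the standard library only. The scenario (60 heads in 100 flips, null hypothesis of a fair coin) and the function name binomial_p_value are assumptions made up for the example.

```python
from math import comb

def binomial_p_value(heads, flips, p_null=0.5):
    """One-sided p-value: P(X >= heads) if the null hypothesis is true."""
    return sum(
        comb(flips, k) * p_null**k * (1 - p_null) ** (flips - k)
        for k in range(heads, flips + 1)
    )

# Hypothetical experiment: 60 heads observed in 100 flips of a supposedly fair coin.
p = binomial_p_value(60, 100)
print(f"p-value = {p:.4f}")
print("significant at 0.05" if p < 0.05 else "not significant at 0.05")
```

The p-value is the tail probability under the null hypothesis, so a small value means the observed count would be unusual for a fair coin.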

What is sampling?
Statistical analysis technique used to select, manipulate and analyze a representative subset of data points to identify patterns and trends in the larger data set being examined.

What is the difference between type I and type II error?
type I error: when the null hypothesis is true, but is rejected.
type II error: when the null hypothesis is false, but erroneously fails to be rejected

What is linear regression?
linear approach to modeling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables)

r-squared value
a statistical measure that represents the proportion of the variance for a dependent variable that's explained by an independent variable or variables in a regression model. Whereas correlation explains the strength of the relationship between an independent and dependent variable, R-squared explains to what extent the variance of one variable explains the variance of the second variable. R2 = 1 - SSres/SStotal, where SSres is the sum of squared residuals and SStotal is the total sum of squares about the mean.
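A minimal sketch of the calculation, assuming the usual definition in terms of the residual sum of squares and the total sum of squares. The data values and the function name r_squared are invented for illustration.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)             # total variation
    return 1 - ss_res / ss_tot

# Hypothetical observations and model predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.5]
print(r_squared(y_true, y_pred))
```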

What are the assumptions for linear regression?

1. There is a linear relationship between the dependent variable and the regressors, meaning the model you are creating actually fits the data.
2. The errors or residuals are normally distributed and independent of each other.
3. There is minimal multicollinearity between explanatory variables.
4. Homoscedasticity: the variance around the regression line is the same for all values of the predictor variable.

What is a statistical interaction?
when the effect of one factor (input variable) on the dependent variable (output variable) differs among levels of another factor

What is selection bias?
Selection (or 'sampling') bias occurs in an 'active' sense when the sample data gathered and prepared for modeling has characteristics that are not representative of the true, future population of cases the model will see. That is, active selection bias occurs when a subset of the data is systematically (i.e., non-randomly) excluded from analysis.

What are common probability distributions?
uniform, Bernoulli, binomial, Poisson, normal, log-normal, Student's t, chi-squared, gamma, beta, exponential

What is the binomial probability formula?
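P(X = k) = [n!/(k!(n-k)!)] * p^k * (1-p)^(n-k), where n is the number of trials, k the number of successes, and p the probability of success on each trial. The factor n!/(k!(n-k)!) is the binomial coefficient.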
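A short sketch of the full binomial probability, which multiplies the binomial coefficient by p^k (1-p)^(n-k). The function name binomial_pmf and the example numbers are assumptions for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Exactly 3 heads in 5 fair coin flips: 10 * 0.5**5 = 0.3125
prob = binomial_pmf(3, 5, 0.5)
print(prob)
```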

In python how is memory managed?
Memory is managed in a private heap space: all Python objects and data structures live in a private heap that the programmer cannot access directly. The Python interpreter handles it instead, although the core API exposes some tools for working with it. The memory manager allocates heap space for Python objects, while the built-in garbage collector reclaims memory that is no longer in use so that it becomes available again.

What are the data types in python?
text (str), numeric (int, float, complex), sequence types (list, tuple, range), mapping (dict), set types (set, frozenset), boolean (bool), binary (bytes, bytearray, memoryview)

What is the difference between a tuple and a list in python?
Tuples are immutable (they cannot be changed after creation), while lists are mutable.
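The difference can be seen in a few lines. The variable names below are made up for the example.

```python
point_list = [1, 2, 3]
point_list[0] = 10          # lists can be modified in place

point_tuple = (1, 2, 3)
try:
    point_tuple[0] = 10     # tuples do not support item assignment
    tuple_changed = True
except TypeError:
    tuple_changed = False

# Because a tuple of hashable items is itself hashable, it can be a dict key.
lookup = {point_tuple: "usable as a key"}
print(point_list, point_tuple, tuple_changed)
```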

What are the types of sorting algorithms in R?
insertion, bubble, selection

What are the data objects in R?
numeric (both integer and double), character and logical

What is the purpose of the group functions in SQL?
Get summary statistics of a data set: COUNT, MAX, MIN, AVG, and SUM. DISTINCT is not itself a group function, but it can be used inside one, e.g. COUNT(DISTINCT col).

Explain inner join, left join, right join, and union
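An inner join returns only the rows that have a match in both tables. A left join returns every row from the left table plus the matching rows from the right table, with NULLs where the right table has no match. A right join is the opposite of a left join. A full join returns all rows from both tables, with NULLs wherever one side has no match.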
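A small sketch of inner and left joins using Python's built-in sqlite3 module. The customers and orders tables are hypothetical, and only INNER and LEFT JOIN are shown because older SQLite versions do not support RIGHT or FULL joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, item TEXT);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES (1, 'book'), (1, 'pen'), (2, 'lamp');
""")

# Inner join: only customers that have at least one matching order.
inner_rows = conn.execute("""
    SELECT c.name, o.item FROM customers c
    INNER JOIN orders o ON o.customer_id = c.id
    ORDER BY c.id, o.item
""").fetchall()

# Left join: every customer; item is NULL (None) when there is no order.
left_rows = conn.execute("""
    SELECT c.name, o.item FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    ORDER BY c.id, o.item
""").fetchall()

print(inner_rows)
print(left_rows)
```

Cy has no orders, so Cy appears only in the left join result, paired with None.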

What does UNION do? What is the difference between UNION and UNION ALL?
UNION combines the results of two queries and removes duplicate records (where all columns in the results are the same); UNION ALL keeps the duplicates.
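The behaviour can be checked with the built-in sqlite3 module. The tables a and b are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (x INTEGER);
    CREATE TABLE b (x INTEGER);
    INSERT INTO a VALUES (1), (2);
    INSERT INTO b VALUES (2), (3);
""")

# UNION drops the duplicate value 2; UNION ALL keeps both copies.
union_rows = conn.execute(
    "SELECT x FROM a UNION SELECT x FROM b ORDER BY x"
).fetchall()
union_all_rows = conn.execute(
    "SELECT x FROM a UNION ALL SELECT x FROM b ORDER BY x"
).fetchall()

print(union_rows)
print(union_all_rows)
```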

What is the difference between SQL and MySQL or SQL Server?
SQL stands for Structured Query Language. It's a standard language for accessing and manipulating databases. MySQL is a database management system, like SQL Server, Oracle, Informix, Postgres, etc.

If a table contains duplicate rows, does a query result display the duplicate values by default? How can you eliminate duplicate rows from a query result?
Yes. One way to eliminate duplicate rows is to use the DISTINCT keyword in the SELECT clause.

How is k-NN different from k-means clustering?
k-NN: classification algorithm, where the k is an integer describing the number of neighboring data points that influence the classification of a given observation.
K-means: clustering algorithm, where the k is an integer describing the number of clusters to be created from the given data.

How would you create a logistic regression model?
xxx

What are cross-correlations with time lags in time series models?
Cross-correlation is the degree of similarity between two time series at different times or in space, where the lag can be considered when time is under investigation. Auto-correlation is the cross-correlation of a time series with itself, used to investigate the persistence between lagged times of the same time series or signal.

Explain the 80/20 rule
People usually tend to start with a 80-20% split (80% training set - 20% test set) and split the training set once more into a 80-20% ratio to create the validation set
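A minimal sketch of that two-stage split using only the standard library. The helper name split_80_20, the fixed seed, and the toy data set of 100 items are assumptions for illustration.

```python
import random

def split_80_20(items, seed=0):
    """Shuffle a copy of the items and cut it into an 80% part and a 20% part."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))
train_val, test = split_80_20(data)   # 80 / 20
train, val = split_80_20(train_val)   # 64 / 16 of the original 100
print(len(train), len(val), len(test))
```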

Explain what precision and recall are. How do they relate to the ROC curve?
Recall describes what percentage of actual positives are identified as positive by the model. Precision describes what percentage of positive predictions were correct. The ROC curve plots recall (the true positive rate) against the false positive rate, which is 1 - specificity, specificity being the percentage of actual negatives identified as negative by the model. Recall, precision, and the ROC curve are measures used to identify how useful a given classification model is.

Explain the difference between L1 and L2 regularization methods.
A regression model that uses the L1 regularization technique is called Lasso Regression, and a model that uses L2 is called Ridge Regression. The key difference between the two is the penalty term: L1 penalizes the sum of the absolute values of the coefficients, L2 the sum of their squares.
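The two penalty terms can be sketched directly. The function names, the weight vector, and the strength parameter lam are hypothetical; in a real model the penalty is added to the training loss.

```python
def l1_penalty(weights, lam):
    """Lasso-style penalty: lam * sum of absolute values of the weights."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """Ridge-style penalty: lam * sum of squared weights."""
    return lam * sum(w * w for w in weights)

weights = [0.5, -2.0, 0.0]
print(l1_penalty(weights, lam=0.1))
print(l2_penalty(weights, lam=0.1))
```

Squaring makes the L2 penalty grow faster for large weights, while the L1 penalty treats every unit of weight the same, which is why L1 tends to push small weights all the way to zero.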

What are hash table collisions?
If the range of key values is larger than the size of our hash table, which is usually the case, then we must account for the possibility that two different records with two different keys can hash to the same table index. There are a few different ways to resolve this issue. In hash table vernacular, the solution implemented is referred to as collision resolution.
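One common collision resolution strategy is separate chaining, where each slot holds a list of entries. The class below is a simplified sketch written for illustration, not a production hash table; the class name and table size are assumptions.

```python
class ChainedHashTable:
    """Toy hash table that resolves collisions by chaining entries in a list."""

    def __init__(self, size=4):
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:       # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))       # new key: append to the chain

    def get(self, key):
        for existing_key, value in self.buckets[self._index(key)]:
            if existing_key == key:
                return value
        raise KeyError(key)

table = ChainedHashTable(size=4)
table.put(1, "one")
table.put(5, "five")  # in CPython small ints hash to themselves, so 1 and 5 both land in slot 1
print(table.get(1), table.get(5), table.buckets[1])
```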

What is an exact test?
a test where all assumptions, upon which the derivation of the distribution of the test statistic is based, are met as opposed to an approximate test (in which the approximation may be made as close as desired by making the sample size big enough). This will result in a significance test that will have a false rejection rate always equal to the significance level of the test. For example an exact test at significance level 5% will in the long run reject true null hypotheses exactly 5% of the time

In your opinion, which is more important when designing a machine learning model: model performance or model accuracy?
xxxx

What is one way that you would handle an imbalanced data set that's being used for prediction (i.e., vastly more negative classes than positive classes)?
xxxx

How would you validate a model you created to generate a predictive model of a quantitative outcome variable using multiple regression?
xxxx

I have two models of comparable accuracy and computational performance. Which one should I choose for production and why?
xxx

How do you deal with sparsity?
Sparse data means incomplete or lack of input data or data with missing values, on which we train machine learning models to predict.

mean absolute error
model evaluation metric used with regression models. The mean absolute error of a model with respect to a test set is the mean of the absolute values of the individual prediction errors over all instances in the test set. Each prediction error is the difference between the true value and the predicted value for the instance.
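A short sketch of the calculation. The function name and the sample values are invented for the example.

```python
def mean_absolute_error(y_true, y_pred):
    """Mean of |true - predicted| over all instances."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Absolute errors are 0.5, 0.0 and 2.0, so the mean is 2.5 / 3.
mae = mean_absolute_error([3.0, 5.0, 2.0], [2.5, 5.0, 4.0])
print(mae)
```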

What are some situations where a general linear model fails?
Linear regressions are sensitive to outliers. E.g. if most of your data lives in the range (20,50) on the x-axis, but you have one or two points out at x = 200, this could significantly swing your regression results. Similarly, if you build your regression on the range x in (20,50), and then try to use that model to predict a y-value for x = 200, this is pretty significant extrapolation and is not necessarily going to be accurate.
Overfitting - It is easy to overfit your model such that your regression begins to model the random error (noise) in the data, rather than just the relationship between the variables. This most commonly arises when you have too many parameters compared to the number of samples
Linear regressions are meant to describe linear relationships between variables. So, if there is a nonlinear relationship, then you will have a bad model. However, you can sometimes compensate for this by transforming some of the parameters with a log, square root, etc. transformation.

What are the different types of keys in a relational database?
Alternate keys are the candidate keys that were not chosen as the primary key.
Artificial keys are created by assigning a unique number to each occurrence or record when there aren't any compound or standalone keys.
Compound keys are made by combining multiple elements to develop a unique identifier for a construct when there isn't a single data element that uniquely identifies occurrences within a construct. Also known as a composite key or a concatenated key, compound keys consist of two or more attributes.
Foreign keys are groups of fields in a database record that point to a key field or a group of fields that create a key of another database record that's usually in a different table. Often, foreign keys in one table refer to primary keys in another. As the referenced data can be linked together quite quickly, it can be critical to database normalization.
Natural keys are data elements that are stored within constructs and utilized as primary keys.
Primary keys are values that can be used to identify unique rows in a table and the attributes associated with them. For example, these can take the form of a Social Security number that's related to a specific person. In a relational model of data, the primary key is the candidate key chosen as the main way to identify a tuple in each relation.
Super keys are defined in the relational model as a set of attributes of a relation variable for which no two distinct tuples, in any relation assigned to that variable, have the same values for the attributes in the set. Equivalently, a super key is a set of attributes of a relation variable upon which all of its attributes are functionally dependent.

List disadvantages of linear models
Errors in linearity assumptions
Assumes the errors are not autocorrelated
It can't solve overfitting problems
You can't use it for count outcomes or binary outcomes

What is a random forest?
An ensemble method applied to ML projects that builds a large number of randomized decision trees and combines their outputs. It can be leveraged to analyze complex data sets. The basic premise here is that multiple weak learners can be combined to build one strong learner.

What is an eigenvalue? Eigenvector?
eigenvalue: the factor by which a linear transformation scales an eigenvector
eigenvector: a nonzero vector along whose direction a particular linear transformation only compresses, flips or stretches
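A small check of the defining relation A v = lambda v, where v is an eigenvector (a direction the transformation only scales) and lambda is the eigenvalue (the scale factor). The matrix, the vector, and the helper mat_vec are chosen for illustration.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

A = [[2, 1],
     [1, 2]]
v = [1, 1]                      # candidate eigenvector

Av = mat_vec(A, v)              # [3, 3]: same direction as v, three times as long
eigenvalue = Av[0] / v[0]
is_eigenvector = all(abs(av - eigenvalue * x) < 1e-12 for av, x in zip(Av, v))
print(Av, eigenvalue, is_eigenvector)
```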

What is a hash table?
There are two parts to a hash table. The first is an array, or the actual table where the data is stored, and the other is a mapping function that's known as the hash function.
It's a data structure that implements an associative array abstract data type, which maps keys to values. It uses the hash function to compute an index into an array of slots or buckets where the desired value can be found.

What are the algorithm techniques in ML?
Learning to learn
Reinforcement learning (deep adversarial networks, q-learning, and temporal difference)
Semi-supervised learning
Supervised learning (decision trees, linear regression, naive bayes, nearest neighbor, neural networks, and support vector machines)
Transduction
Unsupervised learning (association rules and k-means clustering)

What is regularization?
When you have overfitting issues in a statistical model, you can use regularization to resolve them. Regularization techniques like LASSO penalize some model parameters if they are likely to lead to overfitting.

Methods to avoid overfitting
Regularization (e.g. Lasso) that penalizes some parameters, cross-validation techniques (e.g. k-fold cross-validation), keeping the model simple by using fewer variables

What is the difference between inductive, deductive, and abductive learning?
Inductive learning describes smart algorithms that learn from a set of instances to draw conclusions. In statistical ML, k-nearest neighbor and support vector machine are good examples of inductive learning.
There are three literals in (top-down) inductive learning:
Arithmetic literals
Equality and inequality
Predicates
In deductive learning, the smart algorithms draw conclusions by following a truth-generating structure (major premise, minor premise, and conclusion) and then improve them based on previous decisions. In this scenario, the ML algorithm engages in deductive reasoning using a decision tree.
Abductive learning is a DL technique where conclusions are made based on various instances. With this approach, inductive reasoning is applied to causal relationships in deep neural networks.

What steps would you take to evaluate the effectiveness of your ML model?
You have to first split the data set into training and test sets. You also have the option of using a cross-validation technique to further segment the data set into a composite of training and test sets within the data.
Then you have to implement a choice selection of the performance metrics like the following:
Confusion matrix
Accuracy
Precision
Recall or sensitivity
Specificity
F1 score
For the most part, you can use measures such as accuracy, confusion matrix, or F1 score. However, it'll be critical for you to demonstrate that you understand the nuances of how each model can be measured by choosing the right performance measure to match the problem.

What is the difference between supervised and unsupervised ML?
Supervised machine learning requires labelled training data; unsupervised learning doesn't require labelled data.

What is Bias and Variance Tradeoff?
Bias is error introduced in your model due to oversimplification of the machine learning algorithm. It can lead to underfitting. When you train your model, it makes simplifying assumptions to make the target function easier to learn.
Low bias machine learning algorithms: Decision Trees, k-NN and SVM
High bias machine learning algorithms: Linear Regression, Logistic Regression

Variance is error introduced in your model due to an overly complex machine learning algorithm: your model also learns noise from the training data set and performs badly on the test data set. It can lead to high sensitivity and overfitting.
Normally, as you increase the complexity of your model, you will see a reduction in error due to lower bias in the model. However, this only happens till a particular point. As you continue to make your model more complex, you end up over-fitting your model and hence your model will start suffering from high variance.

The goal of any supervised machine learning algorithm is to have low bias and low variance to achieve good prediction performance.
The k-nearest neighbours algorithm has low bias and high variance, but the trade-off can be changed by increasing the value of k which increases the number of neighbours that contribute to the prediction and in turn increases the bias of the model.
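The support vector machine algorithm has low bias and high variance, but the trade-off can be changed by increasing the C parameter, which here denotes the number of violations of the margin allowed in the training data; this increases the bias but decreases the variance. (Some libraries, such as scikit-learn, define C inversely as the penalty on violations, in which case decreasing C has this effect.)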

What is a confusion matrix?
2X2 table that contains 4 outputs provided by the binary classifier. Various measures, such as error-rate, accuracy, specificity, sensitivity, precision and recall are derived from it. False positive, true negative, true positive, false negative.
Error Rate = (FP+FN)/(P+N)
Accuracy = (TP+TN)/(P+N)
Sensitivity(Recall or True positive rate) = TP/P
Specificity(True negative rate) = TN/N
Precision(Positive predicted value) = TP/(TP+FP)
F-Score (weighted harmonic mean of precision and recall) = (1+b²)(PREC·REC)/(b²·PREC+REC), where b is commonly 0.5, 1, or 2.
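The measures above can be computed from the four cells of the confusion matrix. The sketch below uses the standard F-beta form with (1 + beta²) in the numerator; the function name and the counts are made up for illustration.

```python
def classification_metrics(tp, fp, tn, fn, beta=1.0):
    """Derive common measures from the four cells of a 2x2 confusion matrix."""
    p, n = tp + fn, tn + fp            # actual positives and actual negatives
    precision = tp / (tp + fp)
    recall = tp / p
    return {
        "error_rate": (fp + fn) / (p + n),
        "accuracy": (tp + tn) / (p + n),
        "sensitivity": recall,
        "specificity": tn / n,
        "precision": precision,
        "f_score": (1 + beta**2) * precision * recall / (beta**2 * precision + recall),
    }

# Hypothetical counts from a binary classifier.
m = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```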

Sensitivity
True positive rate (TP/P)

Explain how a ROC curve works
The ROC curve is a graphical representation of the contrast between true positive rates and false positive rates at various thresholds. It is often used as a proxy for the trade-off between sensitivity (true positive rate) and the false positive rate.

Explain SVM
SVM stands for support vector machine. It is a supervised machine learning algorithm that can be used for both regression and classification. If you have n features in your training data set, SVM plots each data point in n-dimensional space, with the value of each feature being the value of a particular coordinate. SVM uses hyperplanes to separate the different classes based on the provided kernel function.

Explain decision trees
Supervised machine learning algorithm mainly used for regression and classification. It breaks down a data set into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision tree can handle both categorical and numerical data.

What is entropy and information gain in decision tree algorithms?
A decision tree is built top-down from a root node and involves partitioning the data into homogeneous subsets. ID3 uses entropy to check the homogeneity of a sample. If the sample is completely homogeneous, the entropy is zero, and if the sample is equally divided between two classes, it has an entropy of one.

The Information Gain is based on the decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest information gain.
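Both quantities can be sketched in a few lines. The function names, the yes/no labels, and the example split are assumptions for illustration; entropy is measured in bits (log base 2).

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(parent, subsets):
    """Entropy of the parent minus the size-weighted entropy of the subsets."""
    total = len(parent)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5                    # equally divided: entropy 1
split = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]  # hypothetical split on one attribute
print(entropy(parent), information_gain(parent, split))
```

A tree-building algorithm such as ID3 would evaluate this gain for every candidate attribute and split on the one with the highest value.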

What is tree pruning?
removing sub-nodes of a decision tree that contribute little predictive power, which simplifies the tree and reduces overfitting

What is ensemble learning?
combine models to improve stability and predictive power of the model:
- bagging (implement models on small sample populations and take mean of predictions, reduces variance)
- boosting (iterative technique to adjust weight of observation based on last classification, decreases bias error but may overfit training data)

What is random forest?
Ensemble learning method where we grow multiple trees. To classify a new object, each tree gives a classification and the forest chooses the one with the most votes.

What cross-validation technique would you use on a time series data set?
Instead of using k-fold cross-validation, you should be aware of the fact that a time series is not randomly distributed data: it is inherently ordered chronologically.
In the case of time series data, you should use techniques like forward chaining, where you train the model on past data and then test it on forward-facing data.
fold 1: training[1], test[2]
fold 2: training[1 2], test[3]
fold 3: training[1 2 3], test[4]
fold 4: training[1 2 3 4], test[5]
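The fold layout above can be generated with a short helper. The function name forward_chaining_splits is hypothetical, and the data is assumed to be already cut into chronologically ordered blocks numbered from 1.

```python
def forward_chaining_splits(n_blocks):
    """Yield (training_blocks, test_block) pairs where training always precedes test."""
    for test_block in range(2, n_blocks + 1):
        yield list(range(1, test_block)), test_block

splits = list(forward_chaining_splits(5))
for fold, (train_blocks, test_block) in enumerate(splits, start=1):
    print(f"fold {fold}: training{train_blocks}, test[{test_block}]")
```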

What is the normal distribution?
Data can be distributed in different ways, with a bias to the left or to the right, or it can all be jumbled up. However, data can also be distributed around a central value without any bias to the left or right, reaching a normal distribution in the form of a symmetrical bell-shaped curve.

What is a Box Cox transformation?
The dependent variable for a regression analysis might not satisfy one or more assumptions of an ordinary least squares regression. The residuals could either curve as the prediction increases or follow a skewed distribution. In such scenarios, it is necessary to transform the response variable so that the data meets the required assumptions. A Box-Cox transformation is a statistical technique to transform non-normal dependent variables into a normal shape. Normality is an important assumption for many statistical techniques; if your data isn't normal, applying a Box-Cox transformation means that you are able to run a broader number of tests.

How do you define the number of clusters in a clustering algorithm?
Elbow plot: plot the within-groups sum of squares against the number of clusters and look for the bend; that point suggests the value of k for k-means.

What is regularization?
The process of adding a tuning parameter to a model to induce smoothness in order to prevent overfitting. This is most often done by adding a penalty term, a constant multiple of a norm of the weight vector, to the loss function. The norm is often the L1 (Lasso) or L2 (Ridge) norm. The model should then minimize the regularized loss function calculated on the training set.

If you have 4GB RAM but want to train your model on 10GB data how can you do this?
For neural networks: small batches read from a memory-mapped NumPy array will work.
Steps:
1. Open the data as a memory-mapped NumPy array (e.g. numpy.memmap). This creates a mapping of the complete data set without loading the complete data set into memory.
2. Index into the array to get only the data required for the current batch.
3. Pass this data to the neural network.
4. Keep the batch size small.
For SVM: partial fit will work.
Steps:
1. Divide the one big data set into small data sets.
2. Use a partial fit method, which requires only a subset of the complete data set at a time (in scikit-learn, SGDClassifier with hinge loss provides partial_fit for a linear SVM; SVC itself does not).
3. Repeat step 2 for the other subsets.

What is Naive in a Naive Bayes?
Naive Bayes is based on Bayes' theorem, which describes the probability of an event based on prior knowledge of conditions that might be related to the event. The algorithm is 'naive' because it assumes the features are conditionally independent of each other given the class, an assumption that may or may not turn out to be correct.

Pearson-spearman
...

How would you sort the rows of this table numerically using SQL/Python/R
...

What are the various DDL commands in SQL? Give a brief description of their purposes.
Data Definition Language commands in SQL −
CREATE − it creates a new table, a view of a table, or other object in database.
ALTER − it modifies an existing database object, such as a table.
DROP − it deletes an entire table, a view of a table or other object in the database

What are the various DML commands in SQL? Give a brief description of their purposes.
Data Manipulation Language commands in SQL −
SELECT − it retrieves certain records from one or more tables.
INSERT − it creates a record.
UPDATE − it modifies records.
DELETE − it deletes records.

What are the various DCL commands in SQL? Give a brief description of their purposes.
Data Control Language commands in SQL −
GRANT − it gives a privilege to user.
REVOKE − it takes back privileges granted from user.

What is the purpose of the condition operators BETWEEN and IN?
The BETWEEN operator displays rows based on a range of values. The IN condition operator checks for values contained in a specific set of values.

If a table contains duplicate rows, does a query result display the duplicate values by default? How can you eliminate duplicate rows from a query result?
A query result displays all rows including the duplicate rows. To eliminate duplicate rows in the result, the DISTINCT keyword is used in the SELECT clause.

What is the default ordering of data using the ORDER BY clause? How could it be changed?
Ascending. It can be changed using the DESC keyword after the column name in the ORDER BY clause.

What is the purpose of the NVL function?
converts a NULL value to an actual value

What is the difference between cross joins and natural joins?
The cross join produces the cross product or Cartesian product of two tables. The natural join is based on all the columns having same name and data types in both the tables.

What is the difference between VARCHAR2 and CHAR datatypes?
VARCHAR2 represents variable-length character data, whereas CHAR represents fixed-length character data.

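Explain random sampling, stratified sampling, and cluster sampling.
In simple random sampling, every member of the population has an equal chance of being selected. In a stratified random sample, a population is divided into strata, or sub-populations, before sampling, and elements are sampled from every stratum. In cluster sampling, the population is divided into clusters and whole clusters are randomly selected. At first glance, the last two techniques seem very similar. However, in cluster sampling the actual cluster is the sampling unit; in stratified sampling, analysis is done on elements within each stratum.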

What are Z-scores and how are they useful?
A z-score (also called a standard score) gives you an idea of how far from the mean a data point is, measured in standard deviations. z = (x - mean)/SD
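A minimal sketch using the standard library's statistics module. The sample values are invented, and the population standard deviation (pstdev) is an assumption; a sample standard deviation could be used instead.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize each value: (x - mean) / standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(x - mu) / sigma for x in values]

# This sample has mean 5 and population standard deviation 2.
scores = z_scores([2, 4, 4, 4, 5, 5, 7, 9])
print(scores)
```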

How do you select features for a model? What do you look for?
· Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.
· Improves Accuracy: Less misleading data means modeling accuracy improves.
· Reduces Training Time: Fewer features reduce algorithm complexity, so algorithms train faster.

Root
top of decision tree

impurity
when a decision tree has inner nodes that are neither 100% one outcome nor the other

gini impurity
measure of how impure nodes on a decision tree are:
gini = 1 - (prob of no)^2 - (prob of yes)^2
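The formula generalizes to any number of classes as one minus the sum of squared class proportions. The function name and the labels below are made up for the example.

```python
def gini_impurity(labels):
    """1 minus the sum of squared class proportions in a node."""
    total = len(labels)
    return 1 - sum((labels.count(c) / total) ** 2 for c in set(labels))

print(gini_impurity(["yes", "yes", "no", "no"]))   # evenly mixed node
print(gini_impurity(["yes", "yes", "yes", "no"]))  # mostly one class
print(gini_impurity(["yes", "yes", "yes"]))        # pure node
```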

overfit
does well with training data but not with test data

What is the difference between structured and unstructured data?
Structured data is highly-organized and formatted in a way so it's easily searchable in relational databases. Unstructured data has no pre-defined format or organization, making it much more difficult to collect, process, and analyze.

What is caching and why do you use it in data science?
enables content to be retrieved faster because an entire network round trip is not necessary. Caching can be necessary to save various data files when the process of loading and/or manipulating data takes a considerable amount of time. There will be caching on the server where already computed elements may not need to be recomputed. When you want to access some data that takes a lot of time and resources to look up, you cache it so that the next time you want to look up that same data, the process of doing so is more efficient.

What is bias and what types of bias can occur during sampling?
Bias is the difference between the average prediction of a model and the correct value that you are trying to predict. A model with high bias pays very little attention to the training data and oversimplifies the model. It leads to high error on both training and test data.
The three types of bias that can occur during sampling are selection, undercoverage and survivorship bias.

What are some core steps to take for data preprocessing?
Data preprocessing involves giving structure to the data for better understanding and decision making related to the data. Some key steps in data preprocessing include:
Data discovery and acquisition: Gathering data from available sources and trying to understand and make sense of it.
Data structuring and transformation: Taking different data set formats and sizes and giving it a consistent size and shape when merged together.
Data cleaning: Imputing null values and treating outliers/anomalies in the data to make it usable for further analysis.
Exploratory Data Analysis: Finding patterns in the dataset and extracting new features from the given data in order to optimize the performance of a model.
Validating: Verifying data consistency and quality.
Publishing/Modeling: Processing the data further with an algorithm or machine learning model.

What are the feature selection methods to select the right variables?
There are two types of methods:
Filter methods include linear discriminant analysis, ANOVA and Chi-square (most commonly used). These methods are meant to pull the bad data out.
Wrapper Methods include forward selection, backward selection and recursive feature elimination.

You are given a dataset consisting of variables having more than 30% missing values. How will you deal with this?
If the data set is huge, you can remove the rows that have missing data values. This is the quickest way to deal with this. If the dataset is small, you can substitute missing values with the mean of the rest of the data using a pandas DataFrame in Python, i.e. df.fillna(df.mean()).

For given points, how will you calculate the Euclidean distance in Python? Given points: plot1 = [1,3], plot2 = [2,5].
from math import sqrt
euclidean_distance = sqrt( (plot1[0]-plot2[0])**2 + (plot1[1]-plot2[1])**2 )

How would you maintain a deployed model?
There are four essential steps:
Monitor to determine the performance accuracy of the model
Calculate evaluation metrics of the current model to determine if a new algorithm is needed
Compare the two models to determine which model performs the best
Rebuild the best performing model using the current state of data.

What are recommender systems?
system that predicts the rating or preference that a user would give to a product (or other choice). There are two different types of recommender systems: collaborative filtering and content-based filtering. Collaborative makes recommendations based on other users with similar interests. Content-based filtering uses the properties of the product to recommend products with similar properties.

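What are RMSE and MSE in linear regression models?
RMSE stands for root mean square error and MSE stands for mean square error. They are the most common measures of accuracy for a linear regression model. MSE = (1/n) * sum((y_i - yhat_i)^2) and RMSE = sqrt(MSE), where y_i is the true value, yhat_i the predicted value, and n the number of observations.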
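A short sketch of the standard definitions: MSE is the mean of the squared prediction errors and RMSE is its square root. The function names and sample values are assumptions for illustration.

```python
from math import sqrt

def mse(y_true, y_pred):
    """Mean of the squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Square root of the MSE, in the same units as the target."""
    return sqrt(mse(y_true, y_pred))

y_true = [3.0, 5.0, 2.0]
y_pred = [2.5, 5.0, 4.0]
print(mse(y_true, y_pred), rmse(y_true, y_pred))
```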

What is selection bias and what are the different types?
kind of error that occurs when a model builder decides what data is going to be used in a way that doesn't allow for randomization. It is the distortion of statistical analysis accuracy resulting from the non-randomized method of collecting samples.

What is overfitting and how can you avoid overfitting of your model?
condition where a model begins to describe the random error in the data rather than the relationships between variables. It reduces the model's usefulness outside the original dataset. This problem occurs when the model is too complex.
There are 3 main ways to avoid overfitting a model:

1. Keep the model simple by taking into account fewer variables, which reduces some of the noise in the training data.

2. Use cross-validation techniques such as k-folds.

3. Use regularization techniques such as LASSO that penalize certain model parameters if they are likely to cause overfitting.

What criteria would you use to select a representative sample?
diversity, consistency, and transparency. The sample must be as diverse as the data set. Any changes observed in the sample data should also be reflected in the true population. A discussion should be had within the analytics team to decide the appropriate sample size and structure that is a true representative of the full data set.

What is the difference between univariate, bivariate and multi-variate analysis?
The difference is in the number of variables used. Univariate uses 1 variable. Its purpose is to describe the data and find patterns in it. Bivariate analysis uses two variables. Its purpose is to find a relationship between the two variables. Multi-variate analysis uses more than two variables. Its purpose is to

What is A/B testing?
statistical hypothesis testing process whereby a hypothesis is made about the relationship between two data sets and those data sets are then compared against each other to determine if there is a statistically significant relationship or not. A prediction is made that dataset B will perform better than dataset A. Then both data sets are observed and compared to determine if B is a statistically significant improvement over A.

Describe the difference between covariance and correlation.
Covariance gives the direction of a linear relationship while correlation gives both strength and direction.
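
One way to see the difference is to change the units of one variable: the covariance is rescaled, while the correlation is unchanged. The sketch below uses the sample (n - 1) covariance and made-up height and weight values, both of which are assumptions for illustration.

```python
import math

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical data: the same heights in metres and in centimetres
heights_m = [1.5, 1.6, 1.7, 1.8]
heights_cm = [h * 100 for h in heights_m]
weights_kg = [50.0, 60.0, 65.0, 80.0]

cov_m = covariance(heights_m, weights_kg)
cov_cm = covariance(heights_cm, weights_kg)     # 100 times larger than cov_m
corr_m = correlation(heights_m, weights_kg)
corr_cm = correlation(heights_cm, weights_kg)   # same as corr_m
```

The sign of both quantities gives the direction; only the correlation, bounded between -1 and 1, can be read as a strength.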

Why is re-sampling done?
Estimating the accuracy of sample statistics by using subsets of accessible data or drawing randomly with replacement from a set of data points
Substituting labels on data points when performing significance tests
Validating models by using random subsets (bootstrapping, cross-validation)
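
As an illustration of the first use, the sketch below estimates the standard error of a sample mean by drawing with replacement (bootstrapping). The data, the number of resamples and the seed are arbitrary assumptions.

```python
import random
import statistics

def bootstrap_means(data, n_resamples=1000, seed=0):
    """Means of resamples drawn with replacement, each the size of the data."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    n = len(data)
    return [statistics.mean(rng.choices(data, k=n)) for _ in range(n_resamples)]

# Hypothetical sample
sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.8, 2.7, 4.9]

means = bootstrap_means(sample)
# The spread of the resampled means estimates the standard error of the mean
standard_error = statistics.stdev(means)
```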

A couple has two children, at least one of which is a girl. What is the probability that they have two girls?
There are 4 equally likely possibilities: BB, BG, GB and GG, where B = Boy, G = Girl and the first letter denotes the first child.
You can exclude the first case of BB because at least one child is a girl. Of the remaining 3 equally likely possibilities (BG, GB and GG), only one has two girls. The probability of having 2 girls, given at least one girl, is therefore 1/3.
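
The same counting argument can be checked by enumeration. This is a small sketch; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# All equally likely (first child, second child) combinations
families = list(product("BG", repeat=2))

# Condition on "at least one girl", which removes ('B', 'B')
at_least_one_girl = [f for f in families if "G" in f]
two_girls = [f for f in at_least_one_girl if f == ("G", "G")]

probability = Fraction(len(two_girls), len(at_least_one_girl))
```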

What is logistic regression?
method that models the relationship between a dependent variable (what you want to predict) and one or more independent variables (features) by estimating probabilities using its underlying logistic function (i.e. sigmoid). This technique is used to predict a binary outcome that is either zero or one, or a yes or no.
The two types of logistic regression are binary and multinomial. Binary deals with two categories whereas multinomial deals with three or more categories.
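
A minimal sketch of how a fitted binary model turns features into a probability and then a label. The weights, bias and 0.5 threshold are hypothetical values chosen for illustration, not the output of any training run.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(features, weights, bias):
    """Probability of the positive class: sigmoid of the linear score."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def predict_label(features, weights, bias, threshold=0.5):
    """Binary outcome obtained by thresholding the probability."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0

# Hypothetical coefficients
weights, bias = [1.0, -0.5], -0.5
p = predict_proba([2.0, 1.0], weights, bias)      # linear score z = 1.0
label = predict_label([2.0, 1.0], weights, bias)
```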

If a model established by your team gives 95% accuracy, how will you know if it's correct or not?
The best that you can do is to compare the performance of machine learning models on specific data to other models also trained on the same data. Machine learning model performance is relative. Your ideas of what score a good model can achieve only make sense and can only be interpreted in the context of the skill scores of other models also trained on the same data.
To do this you should develop a baseline model to provide the point from which the skill of all other models trained on a data set can be evaluated. If the model achieves performance below the baseline something is wrong. True model performance will fall within the range of the baseline and 100%.
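
One simple baseline, assumed here for illustration, is to always predict the most frequent class in the training data. On the hypothetical imbalanced labels below, that baseline already reaches 95% accuracy, so a model scoring 95% would add nothing over it.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def majority_baseline_accuracy(y_train, y_test):
    """Accuracy of always predicting the most common training label."""
    majority = Counter(y_train).most_common(1)[0][0]
    return accuracy(y_test, [majority] * len(y_test))

# Hypothetical labels: 95% of records belong to class 0
y_train = [0] * 95 + [1] * 5
y_test = [0] * 19 + [1]

baseline = majority_baseline_accuracy(y_train, y_test)  # 19 / 20
```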

What is a random forest model and how do you build a random forest model?
an ensemble built from a number of decision trees. The steps to building one are:

1. Randomly select "k" features from the total "m" features, where k << m.

2. Among "k" features, calculate the node "d" using the best split point.

3. Split the node into daughter nodes using the best split.

4. Repeat these last 2 steps until leaf nodes are finalized.

5. Build the forest by repeating steps 1-4 for "n" number of times to create "n" number of trees.
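
The steps above can be sketched in plain Python. Several details are assumptions that the steps do not specify: Gini impurity as the split criterion, a bootstrap sample of rows for each tree, a depth limit, majority voting at prediction time, and the toy data. Labels are assumed to be plain values such as integers, since tuples are used for inner nodes.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels, feature_ids):
    """Best (feature, threshold) among the candidate features, or None."""
    best, best_score = None, float("inf")
    for f in feature_ids:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, threshold), score
    return best

def build_tree(rows, labels, k, rng, depth=0, max_depth=3):
    """Grow one tree; leaves are class labels, inner nodes are tuples."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == max_depth or len(set(labels)) == 1:
        return majority
    feature_ids = rng.sample(range(len(rows[0])), k)  # step 1: k random features
    split = best_split(rows, labels, feature_ids)      # step 2: best split point
    if split is None:
        return majority
    f, threshold = split
    left = [i for i, row in enumerate(rows) if row[f] <= threshold]
    right = [i for i, row in enumerate(rows) if row[f] > threshold]
    children = [                                       # steps 3 and 4: recurse
        build_tree([rows[i] for i in side], [labels[i] for i in side],
                   k, rng, depth + 1, max_depth)
        for side in (left, right)
    ]
    return (f, threshold, children[0], children[1])

def predict_tree(tree, row):
    """Follow the splits down to a leaf label."""
    while isinstance(tree, tuple):
        f, threshold, left, right = tree
        tree = left if row[f] <= threshold else right
    return tree

def build_forest(rows, labels, n_trees, k, seed=0):
    """Step 5: repeat tree building n times, each on a bootstrap sample."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # assumption: bootstrap rows
        forest.append(build_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx], k, rng))
    return forest

def predict_forest(forest, row):
    """Majority vote across all trees."""
    return Counter(predict_tree(tree, row) for tree in forest).most_common(1)[0][0]

# Hypothetical data: features 0 and 1 separate the classes, feature 2 is noise
rows = [[i, 2 * i, i % 2] for i in range(1, 11)]
labels = [0] * 5 + [1] * 5
forest = build_forest(rows, labels, n_trees=15, k=2)
low_prediction = predict_forest(forest, [0, 0, 0])
high_prediction = predict_forest(forest, [11, 22, 0])
```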

What is the bias-variance trade-off and why is it important?
The goal of any supervised machine learning algorithm is to have low bias and low variance, but decreasing one tends to increase the other. For example, the k-nearest neighbors algorithm with a small k has low bias and high variance. Increasing the value of k increases the number of neighbors that contribute to each prediction, which in turn increases the bias of the model and lowers its variance.
The tradeoff between bias and variance is in the model complexity. If a model is too simple and has very few parameters then it may have high bias and low variance. On the other hand if a model has a large number of parameters then it's going to have high variance and low bias.
You need to find the right balance between the two. So it's important to understand the bias-variance trade-off in order to avoid overfitting or underfitting a model.

What is the difference between supervised and unsupervised learning?
The differences between these two types of learning are in data labeling, feedback mechanisms and algorithms used. Supervised learning uses labelled input data and a feedback mechanism, because predictions can be compared against the known labels. Unsupervised learning has no labelled data inputs and no feedback mechanism. Supervised learning most commonly uses decision tree, logistic regression and support vector machine algorithms. Unsupervised learning uses k-means clustering, hierarchical clustering, and Apriori algorithms.

What are the steps to making a decision tree?
Take the entire data set as input
Calculate entropy (measure of chaos of inputs) of your target value as well as predictor attributes.
Calculate the information gain of all attributes.
Choose the attribute with the highest information gain as the root node.
Repeat the same process on every branch until the decision node of each branch is finalized.
Another way to ask this question would be to ask for example, how you would build a decision tree to decide whether or not to accept a job offer. Clearly for this answer you must know how to calculate entropy and information gain.
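
A small sketch of the entropy and information gain calculations for the job-offer example. The attributes, their values and the accept/reject labels are made up for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy of the target minus the weighted entropy after splitting."""
    n = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical job offers and whether each was accepted
offers = [
    {"salary": "high", "commute": "short"},
    {"salary": "high", "commute": "long"},
    {"salary": "low", "commute": "short"},
    {"salary": "low", "commute": "long"},
]
accepted = ["yes", "yes", "no", "no"]

gains = {a: information_gain(offers, accepted, a) for a in ("salary", "commute")}
root = max(gains, key=gains.get)  # attribute with the highest information gain
```

In this toy data, salary alone determines the decision, so it carries all of the information gain and becomes the root node.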

Name 5 classification algorithms.
Linear Classifiers - Logistic regression. Naive Bayes classifier. Fisher's linear discriminant.
Quadratic classifiers
Neural networks - Recurrent and modular
Kernel Estimation - k nearest neighbor
Decision Trees - random forests
SVM - Linear and non-linear, least squares

How do you split your data between training and non-training?
Training and validation sets can be split on 2 principles. First, ensure the validation set is large enough to yield statistically meaningful results. Second, the validation set should be representative of the data set as a whole. In other words, don't pick a validation set with different characteristics than the training set. An effective way to split data is k-fold cross-validation. This method makes multiple splits of the dataset into training and validation sets, so that every observation is used for validation exactly once. It offers various samples of the data and ultimately helps reduce the chances of overfitting.
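
A minimal sketch of how k-fold splitting partitions row indices. The helper name, the shuffling and the seed are assumptions made for illustration.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (training, validation) index lists for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]   # k roughly equal folds
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

splits = list(k_fold_indices(10, k=5))
# Each of the 5 splits holds out 2 rows for validation and trains on the other 8
```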

How can you inspect missing data?
Some techniques that can be used to handle missing data are:
Imputation of missing values depending on whether the data is numerical or categorical.
Replacing values with mean, median, mode.
Using the average value of K nearest neighbors as an imputation estimate.
Using linear regression to predict values.

What are some common problems that data analysts encounter during analysis?
Having a poorly formatted data file. For instance, having CSV data with un-escaped newlines and commas in columns.
Having inconsistent and incomplete data.
Misspellings and duplicate entries.
Having different value representations and misclassified data.

What data validation methods can be used by an analyst?
Data Screening - Various algorithms are used to screen the entire data set to find any erroneous or questionable values. Such values need to be examined and handled.
Data Verification - Each suspect value is evaluated on a case-by-case basis, and a decision is made whether the value is accepted as valid, rejected as invalid, or replaced with some redundant value.

What is an outlier?
a value that appears far away and diverges from an overall pattern in a sample.

Briefly explain the different data structures in R.
Vector - a sequence of data elements of the same type.
List - R objects which contain elements of different types such as numbers, strings, vectors, or another list inside it.
Matrix - a two-dimensional data structure used to bind vectors of the same length. All elements in a matrix must be of the same type.
Dataframe - combines features of matrices and lists, i.e. different columns can have different data types.

What is the formula for precision?
TP/(TP+FP)

Simple random sampling
Software is used to randomly select subjects from the whole population.

Stratified sampling
Subsets of the data sets or population are created based on a common factor, and samples are randomly collected from each subgroup.
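
A rough sketch of stratified sampling with a proportional allocation. The proportional rule, the minimum of one record per stratum and the toy records are assumptions rather than part of the definition above.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Randomly sample the same fraction from every subgroup (stratum)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for group in strata.values():
        n = max(1, round(len(group) * fraction))  # at least one per stratum
        sample.extend(rng.sample(group, n))
    return sample

# Hypothetical population: 80 records in group "A" and 20 in group "B"
records = [("A", i) for i in range(80)] + [("B", i) for i in range(20)]

sample = stratified_sample(records, key=lambda r: r[0], fraction=0.1)
counts = Counter(group for group, _ in sample)  # 8 from "A", 2 from "B"
```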

Cluster sampling
The larger data set is divided into subsets (clusters) based on a defined factor, then a random sampling of clusters is analyzed.

Multistage sampling
A more complicated form of cluster sampling, this method involves dividing the larger population into a number of clusters. Second-stage clusters are then broken out based on a secondary factor, and those clusters are then sampled and analyzed. This staging could continue as multiple subsets are identified, clustered and analyzed.

Systematic sampling
A sample is created by setting an interval at which to extract data from the larger population -- for example, selecting every 10th row in a spreadsheet of 200 items to create a sample size of 20 rows to analyze.
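
The spreadsheet example can be written as a single slice. Here the rows are simply numbered 1 to 200, which is an assumption for illustration.

```python
# Hypothetical spreadsheet of 200 rows, numbered 1..200
rows = list(range(1, 201))

interval = 10
# Start at the 10th row and take every 10th row after it
sample = rows[interval - 1::interval]
```

In practice the starting row is often chosen at random within the first interval; a fixed start is used here to match the example.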

Calibration of predicted probabilities
Calibration denotes the consistency between predicted probabilities and their actual frequencies observed on a test dataset.
A perfectly calibrated model should have a calibration curve that is exactly on the diagonal line.
In reality the calibration curve is often quite distinct from the diagonal line and the average distance between the two measures the quality of the calibration.
The calibration loss is computed as the absolute difference between the calibration curve and the diagonal, averaged over the test set, weighted by the number of elements used to compute each point (or the sum of sample weights when it applies).
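
A minimal sketch of that computation, assuming equal-width probability bins and unweighted samples. The number of bins and the toy predictions are illustrative choices.

```python
def calibration_loss(y_true, y_prob, n_bins=10):
    """Weighted mean absolute gap between observed frequency and predicted probability."""
    bins = [[] for _ in range(n_bins)]
    for label, prob in zip(y_true, y_prob):
        index = min(int(prob * n_bins), n_bins - 1)  # prob == 1.0 goes in the last bin
        bins[index].append((label, prob))
    total = len(y_true)
    loss = 0.0
    for members in bins:
        if not members:
            continue
        observed = sum(label for label, _ in members) / len(members)
        predicted = sum(prob for _, prob in members) / len(members)
        # Each point on the curve is weighted by the number of elements behind it
        loss += len(members) / total * abs(observed - predicted)
    return loss

# Hypothetical predictions: 10% predicted, 10% observed (well calibrated)
well_calibrated = calibration_loss([1] + [0] * 9, [0.1] * 10)
# Hypothetical predictions: 90% predicted, 50% observed (overconfident)
overconfident = calibration_loss([1] * 5 + [0] * 5, [0.9] * 10)
```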

Definition of accuracy
Proportion of correct predictions (positive and negative) in the test set

Definition of precision
Proportion of positive predictions that were indeed positive (in the test set)

Definition of recall
Proportion of actual positive values found by the classifier

Definition of F1 score
Harmonic mean between Precision and Recall
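
Accuracy, precision, recall and F1 score can all be derived from the four confusion-matrix counts. In this sketch, returning 0.0 when a denominator is zero is a convention chosen for illustration, and the labels are made up.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical test set: 3 TP, 1 FN, 1 FP, 3 TN
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```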

Definition of Hamming loss
Fraction of labels that are incorrectly predicted (the lower the better)
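
A short sketch for the multilabel case, assuming each record is a list of 0/1 label indicators of equal length. The example rows are made up.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual labels that are predicted incorrectly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong / total

# Hypothetical records with three labels each; two of six labels are wrong
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 1, 1], [0, 1, 1]]
loss = hamming_loss(y_true, y_pred)
```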

Definition of cost matrix gain
Average gain per record that the test set (2177 rows) would yield given the specified gain for each outcome. Specified gains: TP = 1, TN = 0, FP = -0.3, FN = 0.
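
The average gain can be sketched with the gains specified above. The function name and the eight-row example are assumptions; the 2177-row test set itself is not reproduced here.

```python
# Gains per outcome, as specified in the definition above
GAINS = {"TP": 1.0, "TN": 0.0, "FP": -0.3, "FN": 0.0}

def average_gain(y_true, y_pred, gains=GAINS):
    """Average gain per record for binary labels (1 = positive)."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        if p == 1:
            outcome = "TP" if t == 1 else "FP"
        else:
            outcome = "FN" if t == 1 else "TN"
        total += gains[outcome]
    return total / len(y_true)

# Hypothetical labels: 3 TP, 1 FN, 1 FP, 3 TN
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
gain = average_gain(y_true, y_pred)  # (3 * 1 - 0.3) / 8
```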

Name three threshold independent evaluation metrics
Log loss, AUC, and calibration loss

Definition of Log loss
Log loss, aka logistic loss or cross-entropy loss. This is the loss function used in (multinomial) logistic regression and extensions of it such as neural networks, defined as the negative log-likelihood of a logistic model that returns y_pred probabilities for its training data y_true. Error metric that takes into account the predicted probabilities (the lower the better)
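
A minimal sketch of binary log loss as the average negative log-likelihood. Clipping probabilities with a small eps to avoid log(0) is a common convention assumed here, and the example predictions are made up.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-likelihood of binary labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

uninformative = log_loss([1, 0], [0.5, 0.5])     # equals ln(2)
confident_right = log_loss([1, 0], [0.9, 0.1])
confident_wrong = log_loss([1, 0], [0.1, 0.9])
```

Confident wrong predictions are penalized much more heavily than confident correct ones are rewarded, which is why the metric reflects the quality of the probabilities and not just the labels.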