Chapter [ ]: Data Mining and Machine Learning

[intro about data mining and machine learning]

Explain what regularization is and why it is useful.

In layman's terms:

As parents, we sometimes have to decide how much flexibility to give our children during their upbringing. Too much restriction may suppress the development of their character; too much flexibility may spoil their future. The answer is regularized flexibility: give enough flexibility, but add some regularization. Try to meet your kids' expectations, such as comic books, drawing sets, storytelling, chocolate, ice cream, or mobile games, to make them happy. But add some regularization like:

  • You have to finish your homework as well
  • Share the chocolate equally with your sister
  • Show me the result of your class test
  • Wait until I return, and then we will discuss it

Regularization in Machine Learning:

Regularization is a technique for coping with overfitting, which arises when training a model on sample data.

Let's take a look at a simple curve-fitting problem to understand regularization and overfitting. We are given a set of points (x1, y1), (x2, y2), ..., (xN, yN). Our goal is to find a model, that is, a function y = f(x), that fits the given data. To do this, we can use the method of least-squares error.

For simplicity, suppose f(x) is just a first-order linear function, f(x) = Wx + b. Our job is to figure out what W and b are. We set up an error function that looks like:

E(W, b) = Σi (yi - (W xi + b))^2

To figure out what W and b are, we minimize the error function above. However, in minimizing the error function, we can run into a problem called overfitting, where the model we found fits the training data very well but fails miserably when we apply it to new data (i.e., another set of data points).

To cope with this, we introduce a new term into the error function, which implies that the coefficients W are also drawn from a random process. The error function now looks like:

E(W, b) = Σi (yi - (W xi + b))^2 + α ||W||^2

The added coefficient α is called the regularization term.

To illustrate, suppose f(x) can be a polynomial of any order. To generate test data, we start with a function y = g(x). Next, we take points belonging to y = g(x), add some noise, and make those noisy points our training data. Our goal is to derive a function y = f(x) from those noisy points that is as close to the original function y = g(x) as possible. The plot below shows overfitting, where the derived function (the blue line) fits the training data well but does not resemble the original function.

After using regularization, the derived function looks much closer to the original function, as shown below:
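As a minimal sketch of this experiment, assuming NumPy and scikit-learn are available (the function g, the noise level, the polynomial degree, and the α value below are illustrative choices, not values from the original figures):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Original function g(x) and noisy training points drawn from it.
g = lambda x: np.sin(2 * np.pi * x)
x_train = rng.uniform(0, 1, 10)
y_train = g(x_train) + rng.normal(scale=0.2, size=x_train.shape)
X_train = x_train.reshape(-1, 1)

# High-order polynomial with no regularization: prone to overfitting.
overfit = make_pipeline(PolynomialFeatures(degree=9), LinearRegression())
overfit.fit(X_train, y_train)

# Same polynomial with an added L2 (ridge) penalty alpha: stays closer to g(x).
regularized = make_pipeline(PolynomialFeatures(degree=9), Ridge(alpha=1e-3))
regularized.fit(X_train, y_train)

x_test = np.linspace(0, 1, 100).reshape(-1, 1)
print("unregularized predictions:", overfit.predict(x_test)[:3])
print("regularized predictions:  ", regularized.predict(x_test)[:3])
```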

How would you validate a model you created to generate a predictive model of a quantitative outcome variable using multiple regression?

Proposed methods for model validation:

  • If the values predicted by the model are far outside of the response variable range, this would immediately indicate poor estimation or model inaccuracy.
  • If the values seem reasonable, examine the parameters; any of the following would indicate poor estimation or multi-collinearity: coefficient signs opposite to expectations, unusually large or small values, or observed inconsistency when the model is fed new data.
  • Use the model for prediction by feeding it new data, and use the coefficient of determination (R squared) as a model validity measure.
  • Use data splitting to form a separate dataset for estimating model parameters and another for validating predictions (see the sketch after this list).
  • Use jackknife resampling if the dataset contains a small number of instances, and measure validity with R squared and mean squared error (MSE).
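A minimal sketch of the data-splitting approach, assuming scikit-learn is available; the synthetic dataset and the 80/20 split below are illustrative assumptions:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical data for a quantitative outcome with several predictors.
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

# Hold out a validation set that is never used for estimating the parameters.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_val)

# R squared and MSE on unseen data are the validity measures mentioned above.
print("R^2 on validation data:", r2_score(y_val, y_pred))
print("MSE on validation data:", mean_squared_error(y_val, y_pred))
```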

How can you prove that one improvement you've brought to an algorithm is really an improvement over not doing anything?

Often it is observed that, in the pursuit of rapid innovation (aka "quick fame"), the principles of scientific methodology are violated, leading to misleading innovations, i.e. appealing insights that are confirmed without rigorous validation. One such scenario is when, given the task of improving an algorithm to yield better results, you come up with several ideas with potential for improvement.

An obvious human urge is to announce these ideas as soon as possible and push for their implementation. When asked for supporting data, often only limited results are shared, which are very likely to be affected by selection bias (known or unknown) or a misleading global minimum (due to a lack of appropriate variety in the test data).

Data scientists do not let their human emotions overrun their logical reasoning. While the exact approach to prove that one improvement you've brought to an algorithm is really an improvement over not doing anything would depend on the actual case at hand, there are a few common guidelines:

  • Ensure that there is no selection bias in test data used for performance comparison
  • Ensure that the test data has sufficient variety in order to be representative of real-life data (this helps avoid overfitting)
  • Ensure that "controlled experiment" principles are followed, i.e. while comparing performance, the test environment (hardware, etc.) must be exactly the same when running the original algorithm and the new algorithm
  • Ensure that the results are repeatable, with similar results across runs
  • Examine whether the results reflect local maxima/minima or global maxima/minima

One common way to achieve the above guidelines is through A/B testing, where both versions of the algorithm are kept running in a similar environment for a considerably long time and real-life input data is randomly split between the two. This approach is particularly common in Web Analytics.
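As a minimal sketch of judging whether the observed difference between the two versions is more than random fluctuation, assuming SciPy is available; the success counts and sample sizes below are invented for illustration:

```python
import math
from scipy.stats import norm

# Hypothetical A/B outcomes: successes out of randomly split traffic.
success_a, n_a = 410, 5000   # original algorithm
success_b, n_b = 470, 5000   # "improved" algorithm

p_a, p_b = success_a / n_a, success_b / n_b
p_pool = (success_a + success_b) / (n_a + n_b)

# Two-proportion z-test on the difference in success rates.
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
p_value = 2 * (1 - norm.cdf(abs(z)))

print(f"rate A = {p_a:.3f}, rate B = {p_b:.3f}, z = {z:.2f}, p-value = {p_value:.4f}")
```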

Is it better to have too many false positives, or too many false negatives? Explain.

It depends on the question, as well as on the domain in which we are trying to solve it.

In medical testing, false negatives may give patients and physicians the falsely reassuring message that disease is absent when it is actually present. This sometimes leads to inappropriate or inadequate treatment of the patient and their disease. So here it is preferable to have too many false positives.

For spam filtering, a false positive occurs when spam filtering or spam blocking techniques wrongly classify a legitimate email message as spam and, as a result, interfere with its delivery. While most anti-spam tactics can block or filter a high percentage of unwanted emails, doing so without creating significant false positives is a much more demanding task. So here we prefer too many false negatives over too many false positives.
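A minimal sketch of how this trade-off is usually controlled in practice, by moving the decision threshold of a probabilistic classifier; scikit-learn and a synthetic dataset are assumed purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data standing in for a real screening/filtering problem.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

# Lowering the threshold catches more positives (fewer false negatives, as in
# disease screening); raising it avoids false alarms (fewer false positives,
# as in spam filtering).
for threshold in (0.2, 0.5, 0.8):
    y_pred = (scores >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_te, y_pred).ravel()
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```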

Explain what is overfitting. How would you control for it?

As a concrete example:

[source: Pattern Recognition and Machine Learning, Bishop, P25]

Here we are trying to fit a regression to the data points (blue dots). A sine curve (green) is a reasonable fit. But we can also fit a polynomial, and if we raise the degree of the polynomial high enough, we can reduce the training error to nearly zero (a degree-9 polynomial can pass exactly through 10 data points). As shown here, the red curve is a 9-degree polynomial. Even though its root mean square error on the training data is smaller, its complexity makes it a likely result of overfitting.

Overfitting, however, is not a problem only associated with regression. It is relevant to various machine learning methods, such as maximum likelihood estimation, neural networks, etc.

In general, it is the phenomenon where the error keeps decreasing on the training set but starts increasing on the test set, as captured by the usual plot of training and test error against model complexity.

Several methods can be used to avoid overfitting the data:

  • Try to find the simplest possible hypothesis
  • Regularization (adding a penalty for complexity)
  • Randomization testing (randomize the class variable and try your method on this data; if it finds the same strong results, something is wrong)
  • Nested cross-validation (do feature selection on one level, then run the entire method in cross-validation on the outer level; see the sketch after this list)
  • Adjusting the False Discovery Rate
  • Using the reusable holdout method - a breakthrough approach proposed in 2015
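A minimal sketch of nested cross-validation, assuming scikit-learn; the estimator, parameter grid, and dataset are illustrative choices only:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Inner loop: hyper-parameter selection by cross-validated grid search.
inner = GridSearchCV(
    make_pipeline(StandardScaler(), SVC()),
    param_grid={"svc__C": [0.1, 1, 10]},
    cv=3,
)

# Outer loop: an unbiased estimate of how the whole tuning procedure generalizes.
outer_scores = cross_val_score(inner, X, y, cv=5)
print("nested CV accuracy: %.3f +/- %.3f" % (outer_scores.mean(), outer_scores.std()))
```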

What is a recommendation engine? How does it work?

We are all now familiar with recommendations from Netflix ("Other movies you might enjoy") or from Amazon ("Customers who bought X also bought Y").

Such systems are called recommendation engines or more broadly recommender systems.

They typically produce recommendations in one of two ways: using collaborative or content-based filtering.

Collaborative filtering methods build a model based on a user's past behavior (items previously purchased, movies viewed and rated, etc.) together with decisions made by other users. This model is then used to predict items (or ratings for items) that the user may be interested in.

Content-based filtering methods use features of an item to recommend additional items with similar properties. These approaches are often combined in Hybrid Recommender Systems.
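As a minimal sketch of the collaborative idea, here is item-based filtering on a toy user-item rating matrix; only NumPy is assumed, and the ratings are made up for illustration:

```python
import numpy as np

# Rows = users, columns = items; 0 means "not rated yet".
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

# Cosine similarity between item columns.
norms = np.linalg.norm(ratings, axis=0)
item_sim = (ratings.T @ ratings) / np.outer(norms, norms)

# Predict user 0's score for item 2 as a similarity-weighted average
# of the items that user has already rated.
user, target = 0, 2
rated = ratings[user] > 0
weights = item_sim[target, rated]
prediction = np.dot(weights, ratings[user, rated]) / weights.sum()
print(f"predicted rating of user {user} for item {target}: {prediction:.2f}")
```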

Here is a comparison of these 2 approaches used in two popular music recommender systems - Last.fm and Pandora Radio. (example from Recommender System entry)

  • Last.fm creates a "station" of recommended songs by observing what bands and individual tracks the user has listened to on a regular basis and comparing those against the listening behavior of other users. Last.fm will play tracks that do not appear in the user's library, but are often played by other users with similar interests. As this approach leverages the behavior of users, it is an example of a collaborative filtering technique.
  • Pandora uses the properties of a song or artist (a subset of the 400 attributes provided by the Music Genome Project) in order to seed a "station" that plays music with similar properties. User feedback is used to refine the station's results, deemphasizing certain attributes when a user "dislikes" a particular song and emphasizing other attributes when a user "likes" a song. This is an example of a content-based approach.

Here is a good Introduction to Recommendation Engines by Dataconomy and an overview of building a Collaborative Filtering Recommendation Engine by Toptal. For the latest research on recommender systems, check the ACM RecSys conference.

Explain what a false positive and a false negative are. Why is it important to differentiate these from each other?

In binary classification (or medical testing), a false positive is when an algorithm (or test) indicates the presence of a condition when in reality it is absent. A false negative is when an algorithm (or test) indicates the absence of a condition when in reality it is present.

In statistical hypothesis testing, a false positive is also called a type I error, and a false negative a type II error.

It is obviously very important to distinguish and treat false positives and false negatives differently because the costs of such errors can be hugely different.

For example, if a test for a serious disease is a false positive (the test says disease, but the person is healthy), then an extra test will be made to determine the correct diagnosis. However, if a test is a false negative (the test says healthy, but the person has the disease), then treatment may be delayed or withheld, and the person may die as a result.
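A minimal sketch of counting the two error types from a set of predictions, assuming scikit-learn; the labels below are invented:

```python
from sklearn.metrics import confusion_matrix

# 1 = condition present, 0 = condition absent (hypothetical test results).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"false positives (type I errors):  {fp}")
print(f"false negatives (type II errors): {fn}")
```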

What are feature vectors?

In pattern recognition and machine learning, a feature vector is an n-dimensional vector of numerical features that represents some object. Many algorithms in machine learning require a numerical representation of objects, since such representations facilitate processing and statistical analysis. When representing images, the feature values might correspond to the pixels of an image; when representing text, perhaps to term-occurrence frequencies. Feature vectors are equivalent to the vectors of explanatory variables used in statistical procedures such as linear regression. Feature vectors are often combined with weights using a dot product in order to construct a linear predictor function that is used to determine a score for making a prediction.
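A minimal sketch of turning raw objects into feature vectors, assuming a recent scikit-learn; the example sentences are invented:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog ate my homework"]

# Each document becomes an n-dimensional vector of term-occurrence counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the vocabulary dimensions
print(X.toarray())                         # one feature vector per document
```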

What is a gold standard?

A gold standard is a standard accepted as the most valid and the most widely used one.

In medicine and statistics, gold standard test usually refers to a diagnostic test or benchmark that is the best available under reasonable conditions. Other times, gold standard is used to refer to the most accurate test possible without restrictions.

A hypothetical ideal "gold standard" test has a sensitivity of 100% with respect to the presence of the disease (it identifies all individuals with a well-defined disease process; it does not have any false-negative results) and a specificity of 100% (it does not falsely identify the condition in someone who does not have it; it does not have any false-positive results). In practice, there are sometimes no true "gold standard" tests; the ideal and the best-available tests are then sometimes called the "perfect" and "alloyed" gold standards, respectively.

In machine learning, "gold standard" data can refer to a well annotated dataset. One that is manually reviewed by a group of domain experts and the validity of the features and class labels are meticulously verified.

What is the difference between supervised learning and unsupervised learning? Give concrete examples.

Common terms:

Suppose you provide solutions to your kids for each and every situation in their life; then your kids are supervised. But if your kids make decisions based on their own understanding, then your kids are unsupervised.

Machine Learning: Supervised vs. Unsupervised

In machine learning, such solutions are called targets or outputs, and situations are called inputs or unlabeled data. A situation and its solution in combination are called labeled data.

Supervised: If you train your machine learning task with a corresponding target for every input, it is called supervised learning; after sufficient training, it will be able to provide a target for any new input. Your learning algorithm seeks a function from inputs to their respective targets. If the targets are expressed as a set of classes, it is called a classification problem. Alternatively, if the target space is continuous, it is called a regression problem.

Unsupervised: On the contrary, if you train your machine learning task only with a set of inputs, it is called unsupervised learning; it will find the structure or relationships between different inputs. The most important unsupervised learning technique is clustering, which creates different clusters of inputs and is able to assign any new input to the appropriate cluster. Other unsupervised learning techniques include anomaly detection, Hebbian learning, and learning latent-variable models such as the expectation-maximization algorithm, the method of moments (mean, covariance), and blind signal separation techniques (principal component analysis, independent component analysis, non-negative matrix factorization, singular value decomposition).
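A minimal sketch contrasting the two settings on the same data, assuming scikit-learn; the synthetic blobs and the query point are purely illustrative:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic 2-D data in three groups; y are the "solutions" (labels).
X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# Supervised: learn a mapping from inputs X to known targets y.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("supervised prediction for a new point:", clf.predict([[0.0, 2.0]]))

# Unsupervised: only X is given; discover structure (clusters) without targets.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster assigned to the same point:  ", km.predict([[0.0, 2.0]]))
```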

Learning:

Generally, in supervised learning the target associated with each input is used to guide learning by reducing the error between the model's output and that target.

Since the examples given to the unsupervised task have no associated target, there is no credit or blame to be used in learning.

Statistical terms:

The basic idea of machine learning is to learn the probability distribution of the data. Suppose you have some inputs X1, X2, ..., Xn and their respective targets Y1, Y2, ..., Yn. The training examples are then a set of N labeled pairs: {(X1, Y1), (X2, Y2), ..., (XN, YN)}.

Now your supervised learning algorithm will try to learn the probability of Y given a particular X. In probability notation, this is called the posterior probability, P(Y|X).

Unsupervised learning is closely related to the problem of density estimation in statistics. The same task can be converted into an unsupervised one, where each input and its corresponding target are concatenated to create a new set of inputs of the form {(X1, Y1), (X2, Y2), ..., (XN, YN)}. This new set of "situations" is then used for unsupervised learning. In probability notation, what is learned is the joint probability, P(X, Y).

Reinforcement learning:

Other than supervised and unsupervised learning, another type of machine learning technique is reinforcement learning, in which an agent learns which actions to take by interacting with an environment and receiving rewards or penalties.

Semi-supervised learning:

If only very few inputs have associated targets, it is called semi-supervised learning.

Semi-supervised learning can often be augmented with active learning, where extra targets that were not available during the initial stage of training are obtained by interactively querying a trainer. This is generalized by proactive learning and is also called optimal experimental design.

When would you use SVMs vs. Random Forests? Explain your rationale.

The no free lunch theorem proves that, when comparing machine learning algorithms over all possible datasets, there is no single best one (and this is already the popularized version of the theorem; the formal one is even stronger).

In practical terms, both have their advantages and disadvantages. For a two-class problem, and if you expect your data to be reasonably clean and outlier-free, structural risk minimization is a powerful approach and I would probably go for the SVM. In a many-class case, and especially if you expect to have outliers, I would go for the Random Forest (using subsets of the training set with bagging and subsets of the features can help reduce their effect). This is just scratching the surface, of course, as there are many different variations of the basic algorithms, with different strengths and weaknesses.
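A minimal sketch of comparing the two on the same cross-validation splits before committing to either, assuming scikit-learn; the noisy two-class dataset and the hyper-parameters are illustrative stand-ins:

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# A noisy two-class problem (hypothetical stand-in for real data).
X, y = make_moons(n_samples=500, noise=0.3, random_state=0)

models = {
    "SVM (RBF kernel)": make_pipeline(StandardScaler(), SVC(C=1.0, gamma="scale")),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Compare both under identical conditions before choosing.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```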

What are the advantages of different classification algorithms?

There are a number of dimensions you can look at to give you a sense of what will be a reasonable algorithm to start with, namely:

  • Number of training examples
  • Dimensionality of the feature space
  • Do I expect the problem to be linearly separable?
  • Are features independent?
  • Are the features expected to be linearly related to the target variable?
  • Is overfitting expected to be a problem?
  • What are the system's requirements in terms of speed/performance/memory usage?

This list may seem a bit daunting because there are many issues that are not straightforward to answer. The good news, though, is that, as with many problems in life, you can address this question by following the Occam's Razor principle: use the least complicated algorithm that meets your needs and only go for something more complicated if strictly necessary.

Logistic Regression

As a general rule of thumb, I would recommend starting with Logistic Regression. Logistic regression is a pretty well-behaved classification algorithm that can be trained as long as you expect your features to be roughly linear and the problem to be linearly separable. You can do some feature engineering to turn most non-linear features into linear ones pretty easily. It is also pretty robust to noise, and you can avoid overfitting, and even do feature selection, by using l2 or l1 regularization. Logistic regression can also be used in Big Data scenarios, since it is pretty efficient and can be distributed using, for example, ADMM (see logreg). A final advantage of LR is that the output can be interpreted as a probability. This is something that comes as a nice side effect, since you can use it, for example, for ranking instead of classification.

Even in a case where you would not expect Logistic Regression to work perfectly, do yourself a favor and run a simple l2-regularized LR to establish a baseline before you move on to "fancier" approaches.
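A minimal sketch of such an l2-regularized baseline, assuming scikit-learn; the synthetic dataset and the value of C (the inverse regularization strength) are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical dataset standing in for your real problem.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# l2-regularized logistic regression as the baseline model.
baseline = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l2", C=1.0, max_iter=1000),
)

scores = cross_val_score(baseline, X, y, cv=5)
print("baseline accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```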

Ok, so now that you have set your baseline with Logistic Regression, what should your next step be? I would basically recommend two possible directions: (1) SVMs, or (2) Tree Ensembles. If I knew nothing about your problem, I would definitely go for (2), but I will start by describing why SVMs might be worth considering.

Support Vector Machines

Support Vector Machines (SVMs) use a different loss function (hinge) from LR. They are also interpreted differently (maximum-margin). However, in practice, an SVM with a linear kernel is not very different from a Logistic Regression (if you are curious, you can see how Andrew Ng derives SVMs from Logistic Regression in his Coursera Machine Learning Course). The main reason you would want to use an SVM instead of a Logistic Regression is that your problem might not be linearly separable. In that case, you will have to use an SVM with a non-linear kernel (e.g. RBF). The truth is that a Logistic Regression can also be used with a different kernel, but at that point you might be better off going for SVMs for practical reasons. Another related reason to use SVMs is if you are in a high-dimensional space. For example, SVMs have been reported to work better for text classification.

Unfortunately, the major downside of SVMs is that they can be painfully inefficient to train. So, I would not recommend them for any problem where you have many training examples. I would actually go even further and say that I would not recommend SVMs for most "industry scale" applications. Anything beyond a toy/lab problem might be better approached with a different algorithm.

Tree Ensembles

This gets me to the third family of algorithms: Tree Ensembles. This basically covers two distinct algorithms: Random Forests (Algorithm) and Gradient Boosted Trees. I will talk about the differences later, but for now let me treat them as one for the purpose of comparing them to Logistic Regression.

Tree Ensembles have different advantages over LR. One main advantage is that they do not expect linear features or even features that interact linearly. Something I did not mention about LR is that it can hardly handle categorical (binary) features. Tree Ensembles, because they are nothing more than a bunch of Decision Trees combined, can handle this very well. The other main advantage is that, because of how they are constructed (using bagging or boosting), these algorithms handle high-dimensional spaces as well as large numbers of training examples very well.

As for the difference between Random Forests (RF) and Gradient Boosted Decision Trees (GBDT), I won't go into many details, but one easy way to understand it is that GBDTs will usually perform better, but they are harder to get right. More concretely, GBDTs have more hyper-parameters to tune and are also more prone to overfitting. RFs can almost work "out of the box" and that is one reason why they are very popular.
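A minimal sketch of comparing the two ensemble families on the same folds, assuming scikit-learn; the dataset and the GBDT hyper-parameters shown are illustrative, not recommended settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# RF works nearly "out of the box"; GBDT exposes more knobs to tune
# (learning_rate, n_estimators, max_depth, ...) and overfits more easily.
rf = RandomForestClassifier(n_estimators=300, random_state=0)
gbdt = GradientBoostingClassifier(
    n_estimators=300, learning_rate=0.05, max_depth=3, random_state=0
)

for name, model in (("Random Forest", rf), ("GBDT", gbdt)):
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```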

Deep Learning

Last but not least, this answer would not be complete without at least a minor reference to Deep Learning. I would definitely not recommend this approach as a general-purpose technique for classification. But you have probably heard how well these methods perform in some cases, such as image classification. If you have gone through the previous steps and still feel you can squeeze something out of your problem, you might want to use a Deep Learning approach. The truth is that if you use an open source implementation such as Theano, you can get an idea of how some of these approaches perform on your dataset pretty quickly.

Summary

So, recapping, start with something simple like Logistic Regression to set a baseline and only make it more complicated if you need to. At that point, tree ensembles, and in particular Random Forests since they are easy to tune, might be the right way to go. If you feel there is still room for improvement, try GBDT or get even fancier and go for Deep Learning.

Should I split my data into a train/test split or train/validation/test subsets?

Let's say you have a dataset with 10,000 instances. As you mentioned, there are the following two options:

  1. Cross-validation: Divide the data into train and test sets, say of sizes 8,000 and 2,000. Then do a 5-fold cross-validation on the training set, so that each run will train on 6,400 points and test on the remaining 1,600 points. Your final model is the one that gives the minimum average validation error.
  2. Using a fixed validation set: Here, you divide the total data a priori into 3 parts of sizes 6,400, 1,600 and 2,000 respectively; train on the first part and do model selection using the second part, keeping the third part as the final test set.

The first one is more robust. Suppose for a given set of parameters, two models give the following errors on 5 folds:
Model 1: {5.1%, 4.9%, 5.2%, 5.0%, 4.8%} => Average : 5.0%
Model 2: {4.1%, 6.4%, 6.7%, 6.5%, 6.3%} => Average : 6.0%

Clearly, model 1 does better than model 2 on most folds, and therefore it should be the one selected. Model 2 performs better only on fold 1, perhaps because it just accidentally fits the noise in that split well.

However, if you use a fixed validation set, and it turns out to be similar to fold 1 above, you'll end up selecting the worse model.
The other advantage of cross-validation is that you use all of the training data to train your model. With a fixed validation set, the validation set is not used for learning, so rare patterns that appear only in the validation set will not be learned at all.

The downside of using cross-validation is that it takes roughly 5 times longer to train your model.

Thus, if you can tolerate the minor loss in accuracy due to the idiosyncratic behavior of a fixed validation set, and compute time is more crucial, go with method 2; otherwise go with method 1.
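A minimal sketch of both options on a hypothetical dataset of 10,000 instances, assuming scikit-learn; the logistic regression model is only a placeholder for whatever you are tuning:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Hypothetical dataset of 10,000 instances, as in the example above.
X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)

# Hold out 2,000 instances as the final test set in both approaches.
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=2_000, random_state=0)

model = LogisticRegression(max_iter=1000)

# Option 1: 5-fold cross-validation on the 8,000 development points.
cv_scores = cross_val_score(model, X_dev, y_dev, cv=5)
print("5-fold CV accuracy:", cv_scores.mean())

# Option 2: a single fixed validation split (6,400 train / 1,600 validation).
X_tr, X_val, y_tr, y_val = train_test_split(X_dev, y_dev, test_size=1_600, random_state=0)
print("fixed validation accuracy:", model.fit(X_tr, y_tr).score(X_val, y_val))
```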
