Chapter [ ]: Probability and Statistical Inference

What is statistical power?

Statistical power is the probability of rejecting a null hypothesis when it is false.

To put it another way, statistical power is the likelihood that a study will detect an effect when the effect is present. The higher the statistical power, the less likely you are to make a Type II error (concluding there is no effect when, in fact, there is).

For example, suppose there is a test for some medical condition. The power is the probability of a positive result when someone has the condition. The significance is the probability of a positive result when someone does not have the condition. A good test has high power (near one) and low significance (near zero). But there is usually a trade-off between power and significance: if you want more power, you have to accept more significance as well, and if you want lower significance, you will get less power.
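
As an illustration (a minimal sketch, assuming a two-sample t-test, a true effect size of 0.5, and α = 0.05; all numbers are hypothetical), power can be estimated by Monte Carlo simulation: generate many data sets in which the effect truly exists and count how often the test rejects the null hypothesis.

```python
# Sketch: estimating statistical power by simulation (hypothetical parameters).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, effect_size, alpha, n_sims = 50, 0.5, 0.05, 5000

rejections = 0
for _ in range(n_sims):
    control = rng.normal(loc=0.0, scale=1.0, size=n)
    treatment = rng.normal(loc=effect_size, scale=1.0, size=n)  # true effect present
    _, p_value = stats.ttest_ind(control, treatment)
    if p_value < alpha:
        rejections += 1

power = rejections / n_sims  # proportion of simulations that detected the effect
print(f"Estimated power: {power:.2f}")
```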

Explain what resampling methods are and why they are useful. Also explain their limitations.

Classical parametric statistical tests compare observed statistics to theoretical sampling distributions. Resampling is a data-driven, rather than theory-driven, methodology based on repeated sampling within the same sample.

Resampling refers to methods for doing one of the following:

  • Estimating the precision of sample statistics (medians, variances, percentiles) by using subsets of the available data (jackknifing) or by drawing randomly with replacement from a set of data points (bootstrapping)
  • Exchanging labels on data points when performing significance tests (permutation tests, also called exact tests, randomization tests, or re-randomization tests)
  • Validating models by using random subsets (bootstrapping, cross validation)

The main limitations are that resampling methods can be computationally expensive and that they assume the observed sample is representative of the population, so they cannot correct for a biased or very small sample. Here is a good overview of Resampling Statistics.
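
As an illustration (a minimal sketch on a placeholder sample; the data and the choice of the median are arbitrary), the bootstrap estimates the precision of a statistic by resampling with replacement:

```python
# Sketch: percentile-bootstrap confidence interval for the median (hypothetical data).
import numpy as np

rng = np.random.default_rng(42)
data = rng.exponential(scale=2.0, size=200)  # placeholder sample

n_boot = 10_000
boot_medians = np.empty(n_boot)
for i in range(n_boot):
    resample = rng.choice(data, size=data.size, replace=True)  # draw with replacement
    boot_medians[i] = np.median(resample)

lo, hi = np.percentile(boot_medians, [2.5, 97.5])  # percentile bootstrap CI
print(f"Sample median: {np.median(data):.3f}, 95% CI: [{lo:.3f}, {hi:.3f}]")
```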

What is selection bias, why is it important and how can you avoid it?

Selection bias, in general, is a problematic situation in which error is introduced due to a non-random population sample. For example, if a given sample of 100 test cases was made up of a 60/20/15/5 split across 4 classes that actually occur in relatively equal numbers in the population, then a given model may make the false assumption that the class frequencies in the sample are a reliable predictive factor. Avoiding non-random samples is the best way to deal with selection bias; however, when this is impractical, techniques such as resampling, boosting, and weighting can be introduced to help deal with the situation.
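
As an illustration of the weighting idea (a minimal sketch using the hypothetical 60/20/15/5 split above), over-represented classes can be down-weighted so that the weighted sample matches the roughly equal population class frequencies:

```python
# Sketch: inverse-probability-style class weights (all numbers are hypothetical).
import numpy as np

sample_share = np.array([0.60, 0.20, 0.15, 0.05])      # observed 60/20/15/5 split
population_share = np.array([0.25, 0.25, 0.25, 0.25])  # roughly equal in reality

class_weights = population_share / sample_share  # down-weight over-represented classes
print(dict(enumerate(np.round(class_weights, 3))))
# These could be passed as per-sample weights (e.g. sample_weight) when training a model.
```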

Give an example of how you would use experimental design to answer a question about user behavior.

Step 1: Formulate the Research Question:

What are the effects of page load times on user satisfaction ratings?

Step 2: Identify variables:

We identify cause and effect: the independent variable is page load time, and the dependent variable is user satisfaction rating.

Step 3: Generate Hypothesis:

Lower page load times will lead to higher user satisfaction ratings for a web page. Here the factor we analyze is page load time.

Step 4: Determine Experimental Design.

We consider the experimental complexity, i.e., varying one factor at a time or multiple factors at a time, in which case we use a factorial design (2^k design). A design is also selected based on the type of objective (comparative, screening, response surface) and the number of factors.
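
As a quick illustration (a minimal sketch with hypothetical factor names), a 2^k full factorial design simply enumerates every combination of k two-level factors as an experimental condition:

```python
# Sketch: a 2^3 full factorial design matrix (factor names are made up).
from itertools import product

factors = {
    "page_load_time": ["fast", "slow"],
    "button_position": ["left", "right"],
    "layout": ["A", "B"],
}

design = list(product(*factors.values()))  # 2^3 = 8 conditions
for run, condition in enumerate(design, start=1):
    print(run, dict(zip(factors.keys(), condition)))
```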

Here we also identify whether to use a within-participants, between-participants, or mixed design. For example: there are two versions of a page, one with the Buy button (call to action) on the left and the other version with this button on the right.

Within-participants design: the same users see both versions.

Between-participants design: one group of users sees version A and the other group sees version B.

Step 5: Develop experimental task & procedure:

A detailed description of the steps involved in the experiment, the tools used to measure user behavior, and the goals and success metrics should be defined. Collect quantitative data about user engagement to allow statistical analysis.

Step 6: Determine Manipulation & Measurements

Manipulation: one level of the factor will serve as the control and the other will be manipulated. We also identify the behavioral measures:

  1. Latency: the time between a prompt and the occurrence of the behavior (how long it takes a user to click Buy after being presented with products).
  2. Frequency: the number of times a behavior occurs (the number of times a user clicks on a given page within a time period).
  3. Duration: the length of time a specific behavior lasts (the time taken to add all products).
  4. Intensity: the force with which a behavior occurs (how quickly the user purchased a product).

Step 7: Analyze results

Analyze the user behavior data and determine whether the observations support or contradict the hypothesis, for example, how the majority of users' satisfaction ratings compared across page load times.
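
For example (a minimal sketch using made-up satisfaction ratings and assuming a between-participants design), the analysis could compare the two load-time conditions with a two-sample t-test:

```python
# Sketch: comparing satisfaction ratings for "fast" vs. "slow" page loads (hypothetical data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
fast_load = rng.normal(loc=4.2, scale=0.6, size=120)  # hypothetical 1-5 ratings
slow_load = rng.normal(loc=3.8, scale=0.7, size=120)

t_stat, p_value = stats.ttest_ind(fast_load, slow_load)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value would support the hypothesis that load time affects satisfaction.
```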

What is the difference between "long" ("tall") and "wide" format data?

In most data mining / data science applications there are many more records (rows) than features (columns) - such data is sometimes called "tall" (or "long") data.

In some applications like genomics or bioinformatics you may have only a small number of records (patients), eg 100, but perhaps 20,000 observations for each patient. The standard methods that work for "tall" data will lead to overfitting the data, so special approaches are needed.

(Figure: different approaches for tall data and wide data, from the presentation Sparse Screening for Exact Data Reduction.)

The problem is not just reshaping the data (there are many useful packages in Python and R for that), but avoiding false positives by reducing the number of features to find the most relevant ones.
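
For the reshaping part (a minimal sketch with hypothetical column names), pandas handles wide-to-long and long-to-wide conversion directly:

```python
# Sketch: reshaping between wide and long formats with pandas (made-up data).
import pandas as pd

wide = pd.DataFrame({
    "patient_id": [1, 2],
    "gene_a": [0.5, 0.9],
    "gene_b": [1.2, 0.3],
})

# Wide -> long: one row per (patient, gene) measurement.
long = wide.melt(id_vars="patient_id", var_name="gene", value_name="expression")

# Long -> wide: back to one column per gene.
wide_again = long.pivot(index="patient_id", columns="gene", values="expression")
print(long)
print(wide_again)
```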

Approaches for feature reduction like Lasso are well covered in Statistical Learning with Sparsity: The Lasso and Generalizations.
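
As an illustration of such feature reduction (a minimal sketch on synthetic data, not a definitive recipe), the Lasso shrinks most coefficients of a wide data set exactly to zero, leaving a small set of candidate features:

```python
# Sketch: Lasso-based feature reduction on "wide" data (synthetic example).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_samples, n_features = 100, 2000            # far more features than samples
X = rng.normal(size=(n_samples, n_features))
true_coef = np.zeros(n_features)
true_coef[:5] = [3.0, -2.0, 1.5, 2.5, -1.0]  # only 5 features actually matter
y = X @ true_coef + rng.normal(scale=0.5, size=n_samples)

lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)       # indices of features with nonzero weight
print(f"Features kept: {selected.size} of {n_features}")
```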

What method do you use to determine whether the statistics published in an article (or appeared in a newspaper or other media) are either wrong or presented to support the author's point of view, rather than correct, comprehensive factual information on a specific subject?

A simple rule is if some statistics are published in a newspaper, then they are wrong.

Every media organization has a target audience. This choice impacts a lot of decisions such as which article to publish, how to phrase an article, what part of an article to highlight, how to tell a given story, etc.

In determining the validity of statistics published in any article, one of the first steps will be to examine the publishing agency and its target audience. Even if it is the same news story involving statistics, you will notice that it will be published very differently across Fox News vs. WSJ vs. ACM/IEEE journals. So, data scientists are smart about where to get the news from (and how much to rely on the stories based on sources!).

Often the authors try to hide the inadequacy of their research through canny storytelling, omitting important details, and jumping to enticingly presented but false insights. Thus, a rule of thumb for identifying articles with misleading statistical inferences is to examine whether the article includes details on the research methodology followed and any perceived limitations of the methodological choices made. Look for words such as "sample size", "margin of error", etc. While there are no perfect answers as to what sample size or margin of error is appropriate, these attributes must certainly be kept in mind while reading the end results.

Another common case of erratic reporting is when journalists with poor data literacy pick up an insight from one or two paragraphs of a published research paper, while ignoring the rest of the paper, just to make their point. Here is how you can avoid being fooled by such articles. Firstly, a reliable article must not have any unsubstantiated claims: all assertions must be backed by references to past research, or otherwise clearly differentiated as an "opinion" rather than an assertion. Secondly, just because an article refers to renowned research papers does not mean that it uses the insights from those papers appropriately. This can be validated by reading the referenced papers in their entirety and independently judging their relevance to the article at hand. Lastly, though the end results might naturally seem like the most interesting part, it is often fatal to skip the details about the research methodology, which is where errors, bias, etc. can be spotted.

Ideally, I wish that all such articles published their underlying research data as well as the approach. That way, the articles could achieve genuine trust, as everyone would be free to analyze the data and apply the research approach to see the results for themselves.

How would you screen for outliers and what should you do if you find one?

Some methods to screen for outliers are z-scores, modified z-scores, box plots, Grubbs' test, the Tietjen-Moore test, exponential smoothing, the Kimber test for exponential distributions, and the moving window filter algorithm. Two of the more robust methods are described in detail below:

Inter Quartile Range

An outlier is a data point that lies more than 1.5 IQRs below the first quartile (Q1) or above the third quartile (Q3) in a given data set.

  • High = Q3 + 1.5(IQR)
  • Low = Q1 - 1.5(IQR)

Tukey Method

It uses the interquartile range to filter out very large or very small numbers. It is practically the same method as above, except that it uses the concept of "fences". The two fence values are:

  • Low outliers = Q1 - 1.5(Q3 - Q1) = Q1 - 1.5(IQR)
  • High outliers = Q3 + 1.5(Q3 - Q1) = Q3 + 1.5(IQR)

Anything outside of the fences is an outlier.
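
For example (a minimal sketch on a small hypothetical sample), the fences can be computed directly from the quartiles:

```python
# Sketch: IQR / Tukey fences outlier screening (made-up data).
import numpy as np

data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 10, 10, 100])  # 100 looks suspicious

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr

outliers = data[(data < low_fence) | (data > high_fence)]
print(f"Fences: [{low_fence:.1f}, {high_fence:.1f}], outliers: {outliers}")
```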

When you find outliers, you should not remove them without a qualitative assessment, because doing so alters the data and makes it no longer pure. It is important to understand the context of the analysis, or more importantly, the "why" question: why is an outlier different from the other data points?

This reason is critical. If the outliers are attributed to error, you may throw them out, but if they signify a new trend or pattern, or reveal a valuable insight into the data, you should retain them.

How would you use either the extreme value theory, Monte Carlo simulations or mathematical statistics (or anything else) to correctly estimate the chance of a very rare event?

Extreme value theory (EVT) focuses on rare events and extremes, as opposed to classical approaches to statistics, which concentrate on average behavior. EVT states that there are 3 types of distributions needed to model the extreme data points of a collection of random observations from some distribution: the Gumbel, Fréchet, and Weibull distributions, also known as the Extreme Value Distributions (EVD) 1, 2, and 3, respectively.

EVT states that, if you were to generate N data sets from a given distribution and then create a new data set containing only the maximum values of these N data sets, this new data set would be accurately described by one of the EVD distributions: Gumbel, Fréchet, or Weibull. The Generalized Extreme Value (GEV) distribution is a single model that combines these three EVD families.

Knowing which models to use, we can fit them to our data and evaluate the fit. Once the best-fitting model is found, analysis can be performed, including calculating the probabilities of rare events.
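
As an illustration (a minimal sketch on synthetic block maxima; the data, block size, and threshold are all made up), one could fit a GEV distribution with scipy and read off the tail probability of a rare level:

```python
# Sketch: fitting a GEV to block maxima and estimating a rare-event probability.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Block maxima: the yearly maximum of 365 daily observations, over 100 "years".
daily = rng.gumbel(loc=0.0, scale=1.0, size=(100, 365))
annual_max = daily.max(axis=1)

shape, loc, scale = stats.genextreme.fit(annual_max)
threshold = annual_max.max() * 1.2  # an extreme level, rarely or never seen in the data
p_exceed = stats.genextreme.sf(threshold, shape, loc=loc, scale=scale)
print(f"Estimated P(annual max > {threshold:.2f}) = {p_exceed:.2e}")
```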
