Don’t Just Remove Outliers From Your Data – Think Twice

Rethinking Outliers: Strategies for Informed Data Decisions

Arvid Eichner
5 min read · Oct 14, 2023

Outliers can affect the accuracy of statistical analyses and machine learning models. However, not all outliers are created equal. In this guide, we will explore different types of outliers and how to deal with them.

Boxplots are a great tool to visually identify outliers. Can you spot them?

What Are Outliers?

Outliers are data points that differ significantly from the other data points in a dataset. They can occur for various reasons, such as measurement errors, data entry mistakes, or genuinely rare events. Identifying and dealing with outliers is crucial because they can severely distort statistical analyses and machine learning models: due to their exceptionally high or low values, outliers can disproportionately affect the results, for example when they enter a model as values of an independent variable, and lead to misleading conclusions.
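To make the distortion concrete, here is a minimal sketch using only Python's standard library. The numbers are invented for illustration; the point is simply that one extreme value pulls the mean far more than the median.

```python
from statistics import mean, median

# Hypothetical measurements (made up for this example).
values = [10, 11, 12, 13, 14]

# The same measurements plus a single extreme value.
with_outlier = values + [100]

# Without the outlier, mean and median agree.
print("mean:", mean(values), "median:", median(values))

# With the outlier, the mean more than doubles while the median barely moves.
print("mean:", round(mean(with_outlier), 2), "median:", median(with_outlier))
```

Any method built on means and variances inherits this sensitivity, which is why a handful of extreme values can change the conclusions of an analysis.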

Identifying Outliers

There are several ways to identify and handle outliers. The most common ones are:

Trying to find outliers by visual inspection (Source: Polyvinyl Records)
  1. Visual Inspection: By far the easiest method to spot outliers. Create box plots, histograms, or scatter plots to visually identify data points that appear far from the main cluster.
  2. Z-Score Method: Calculate the so-called z-score for each data point. The z-score measures how many standard deviations a data point is from the mean. Data points with a high absolute z-score (typically greater than 2 or 3) can be considered outliers.
  3. IQR (Interquartile Range) Method: Calculate the IQR, which is the range between the 75th percentile (Q3) and the 25th percentile (Q1) of your data. Data points that lie more than 1.5 times the IQR below Q1 or above Q3 can be considered outliers.
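The z-score and IQR rules above can be sketched in a few lines of Python. This is an illustrative sketch, not a definitive implementation: the function names, the sample values, and the choice of the population standard deviation and the "inclusive" quantile method are all assumptions made for this example.

```python
from statistics import mean, pstdev, quantiles

def zscore_outliers(data, threshold=3.0):
    """Return values whose absolute z-score exceeds the threshold."""
    mu = mean(data)
    sigma = pstdev(data)  # population standard deviation (an assumption)
    if sigma == 0:
        return []  # all values identical, so nothing can be an outlier
    return [x for x in data if abs((x - mu) / sigma) > threshold]

def iqr_outliers(data, k=1.5):
    """Return values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lower or x > upper]

# Hypothetical gaze times in seconds; the last value is suspiciously high.
gaze_times = [2.1, 2.4, 2.2, 2.6, 2.3, 2.5, 2.4, 2.2, 9.8]

print(zscore_outliers(gaze_times, threshold=2.0))
print(zscore_outliers(gaze_times, threshold=3.0))
print(iqr_outliers(gaze_times))
```

Note that in a sample this small the extreme value inflates the standard deviation so much that it does not reach a z-score of 3; only the lower threshold of 2 and the IQR rule flag it. The same logic carries over to pandas, for example via `Series.quantile`.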

For a step-by-step guide on how to identify and remove outliers using the Python library pandas click here.

Handling Outliers

But wait! Before you go ahead and remove the data points that appear to be outliers, do yourself a favor and try to answer the following question:

Is the instrument that was used to collect the data reliable and valid?

Without going into too much detail, in the context of measurement instruments, reliable and valid mean:

  • Your instrument is reliable when it is consistent and stable, that is, it produces similar results when used multiple times, assuming no actual changes in the underlying construct being measured.
  • Your instrument is valid when it is accurate – a valid instrument should measure what it is intended to measure.

Measurement Errors Cause Outliers

Oftentimes, outliers are caused by unreliable or invalid instruments. We call this type of outlier a measurement error.

An example:

Meet Isabella. Isabella is a researcher in the field of marketing, studying consumer purchase behavior using eye-tracking technology. In her current project, she uses an eye-tracking device to record where on a computer screen participants focus their gaze and for how long. Using this data, she aims to gain insights into their purchase decisions.

As Isabella reviews her data, she notices something strange: roughly one third of her participants show oddly high product “gaze times”, almost four times as high as those of the remaining participants.

This extended gaze time would suggest an outstandingly strong interest in the corresponding products, but Isabella can’t seem to explain the substantial difference. After all, there was no intended distinction between participants; all of them underwent the exact same experimental setup.

Highly sceptical, Isabella re-examines the results, paying particular attention to potential demographic differences between the two groups. Finally, she notices a small but important detail:

The participants with the abnormally high gaze times were all wearing prescription glasses.

Apparently, the reflective coating of the glasses created glare when hit by light. This glare was then being picked up by the eye-tracking device, causing it to inaccurately register prolonged gaze times on certain products.

In reality, though, these participants may not have been particularly interested in the products, even though that is exactly what Isabella’s experiment would have wrongly concluded.

Go Ahead And Remove Outliers That Are Measurement Errors

If outliers are caused by measurement errors, for instance an unreliable instrument, there is a reasonable case to be made to remove the outliers as they do not represent genuine (reliable) measurements of the underlying construct.
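As a rough sketch of what this can look like in practice, the snippet below excludes records based on the identified cause of the error rather than on the measured value itself. The records, field names, and the `split_by_cause` helper are hypothetical and only loosely modeled on the eye-tracking example.

```python
# Hypothetical records; field names and values are invented for illustration.
participants = [
    {"id": 1, "wears_glasses": False, "gaze_time_s": 2.3},
    {"id": 2, "wears_glasses": True, "gaze_time_s": 9.1},
    {"id": 3, "wears_glasses": False, "gaze_time_s": 2.6},
    {"id": 4, "wears_glasses": True, "gaze_time_s": 8.7},
    {"id": 5, "wears_glasses": False, "gaze_time_s": 2.2},
    {"id": 6, "wears_glasses": False, "gaze_time_s": 2.4},
]

def split_by_cause(records, flag):
    """Split records into those kept and those excluded because `flag` is set."""
    kept = [r for r in records if not r[flag]]
    excluded = [r for r in records if r[flag]]
    return kept, excluded

kept, excluded = split_by_cause(participants, "wears_glasses")

# Keep a record of what was excluded and why, so the decision stays transparent.
print(f"kept {len(kept)} records, excluded {len(excluded)} "
      f"(ids {[r['id'] for r in excluded]}) due to suspected glare")
```

Filtering on the cause and logging the excluded records makes the removal easier to justify and to reproduce than silently dropping every high value.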

For a guide on how to remove outliers using pandas click here.

A Second Type of Outlier

If outliers cannot be explained by unreliable or invalid instruments, that is, you can confidently rule out measurement errors, you should, however, think twice about removing them. In this case, they represent genuine measurements that happen to be out of the ordinary. There is an ongoing debate about whether such outliers are to be removed, as removing them can lead to:

  • The loss of valuable information
  • Misrepresentation of the true patterns in the data (Bias)
  • Concerns of data manipulation
  • Decreased reproducibility of results
  • Potential misrepresentation of marginalized or vulnerable populations

These are compelling reasons to keep outliers that are not caused by measurement errors. Still, many statistical methods, such as linear regression or ANOVA, are sensitive to outliers and can produce distorted results when extreme values are present.

And now?

Well, these are your options:

  1. Choose a different model: Of course, this is not always an option. But many methods have alternatives that are less prone to outlier-induced bias, for example non-parametric tests.
  2. Transformation: Apply mathematical transformations to the data to reduce the impact of outliers. For example, you can use a square root or log transformation to make the data less skewed.
  3. Winsorization: Replace extreme values with the nearest “normal” data point. Winsorizing is less drastic than outright removal but can still reduce the impact of outliers.
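Options 2 and 3 can be sketched with the standard library as follows. The function names and sample values are assumptions for illustration. Note also that winsorization is often defined via fixed percentiles; the variant below instead treats values inside the 1.5 × IQR fences as "normal" and pulls extreme values back to the nearest such value, which is one possible reading of the description above.

```python
import math
from statistics import quantiles

def log_transform(data):
    """Compress large values with log(1 + x); assumes non-negative data."""
    return [math.log1p(x) for x in data]

def winsorize_to_fences(data, k=1.5):
    """Replace values outside the IQR fences with the nearest value inside them."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    inside = [x for x in data if lower <= x <= upper]
    lo, hi = min(inside), max(inside)
    return [min(max(x, lo), hi) for x in data]

# Hypothetical gaze times in seconds with one extreme value.
gaze_times = [2.1, 2.4, 2.2, 2.6, 2.3, 2.5, 2.4, 2.2, 9.8]

print([round(x, 2) for x in log_transform(gaze_times)])
print(winsorize_to_fences(gaze_times))
```

Both approaches keep every observation in the dataset. The log transformation shrinks the gap between the extreme value and the rest, while winsorizing caps it at the largest value still considered normal.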

Remember that your approach to handling outliers should be tailored to your data and the objectives of your analysis. Deriving insights that are both meaningful and representative of the underlying reality requires an approach that makes theoretical sense and fits the context of your research.

Key Takeaways:

  1. Always check your data for outliers. They will impact your analyses.
  2. Think twice! Before mindlessly removing outliers, consider instrument reliability and validity to rule out potential measurement errors.
  3. Go ahead and remove outliers that are probably measurement errors.
  4. Do not remove data points simply because they are outliers.

Written by Arvid Eichner

Ph.D. candidate in Information Systems / Data Science, passionate about Python, R, data, and statistics