Using Mathcad to Detect and Handle Outliers in Data Analysis
Knowledge Base Exclusive
by Frank Smietana
This article examines specific aspects of real-world data set preparation and illustrates the functionality provided in Mathcad to remedy data deficiencies. The first article addressed Data Quality Issues. This article discusses the vexing problem of outliers.
An outlier is a rarely occurring value of a variable that lies extremely far from the other values of that variable. As with missing values discussed in the previous article, you need to determine if the outlier is an erroneous value, perhaps due to an improperly calibrated instrument or other measurement error. Ideally, the source of the error can be found and the outlying data point can be correctly remeasured.
That ideal outcome is rarely achieved in practice, even when the source of the error is found, because most data points cannot be remeasured unless the experiment is performed again from the beginning.
Note that not all outliers are erroneous data values. For example, the field temperature observed for a single pump operating in Antarctica is very likely to be an outlier when compared to the temperature values logged for dozens of the same pumps operating in Western Europe.
Why Are Outliers Problematic?
To understand why outliers are problematic for statistical modeling and other advanced analytical techniques such as neural networks, look at normalization. Some of these techniques require data to be normalized to a fixed range of values, such as 0 to 1. If you normalize the following collection of temperature values:
You can see that 90% of the measured values have been squeezed into less than 20% of the normalized range (0.806 to 1.0). For a statistical algorithm that expects data points to be distributed somewhat uniformly, this dramatic distortion may render the data unusable for further analysis, or compromise the ability of modeling techniques to extract valid conclusions from the data.
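The squeezing effect is easy to reproduce outside Mathcad. The Python sketch below applies min-max normalization to a small set of temperatures; the readings are invented for illustration (nine European-like values and one Antarctic-like value) and are not the article's data, and `min_max_normalize` is a hypothetical helper name.

```python
def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; the range is zero")
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical pump temperatures in degrees C (illustrative only).
temps = [21.5, 22.0, 20.8, 23.1, 21.9, 22.4, 20.5, 23.0, 21.2, -40.0]
scaled = min_max_normalize(temps)

# The nine typical readings, after scaling.
typical = [s for s, t in zip(scaled, temps) if t > 0]
print(f"typical readings span {min(typical):.3f} to {max(typical):.3f}")
```

With a single extreme reading defining the lower end of the range, the remaining readings are crowded into a narrow band near 1.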
Using Mathcad to Detect and Handle Outliers
As we discussed in last month's article on handling missing data, you need to ensure that data preprocessing routines for handling data defects do not decrease the information content of the dataset significantly or create additional problems. Knowing something about the assumptions and limitations of the routines in your toolbox helps you reach that goal.
Let's start with a noisy, real-world dataset of 195 values:
Outliers may appear to be obvious simply by viewing a graph of the data with the mean overlaid as a reference point:
But this method is neither objective nor scalable. Imagine manually inspecting 100,000 data points!
The following three methods illustrate robust and objective approaches to identifying outliers. These methods are packaged as functions in the Mathcad Data Analysis Extension Pack, which discusses them in greater detail.
Grubbs' outlier test assumes that the sample is drawn from a normally distributed population. (Note that in some cases normality is not a valid assumption.) The test determines the likelihood that a particular point lying far from the majority of values was obtained randomly. The test statistic and its value for our example dataset are defined by:
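For reference, the Grubbs statistic in its standard textbook form is the largest absolute deviation from the sample mean, scaled by the sample standard deviation. The symbols below are generic notation assumed here, not necessarily those used in the Mathcad worksheet:

```latex
G = \frac{\max_{i=1,\dots,N} \lvert x_i - \bar{x} \rvert}{s},
\qquad
\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i,
\qquad
s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N} \left(x_i - \bar{x}\right)^{2}}
```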
The Grubbs statistic is then compared with a critical value derived from the probability that this maximum deviation lies outside some percentage of the distribution. The test criterion is:
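In its common two-sided form, the criterion flags the most extreme point as an outlier when the statistic G (the largest absolute deviation from the mean divided by the sample standard deviation) exceeds a critical value built from a Student's t quantile. The notation is assumed here for illustration: N is the sample size, α is one minus the confidence level, and qt(p, d) denotes the inverse cumulative t distribution with d degrees of freedom. A one-sided test on the maximum alone would use α/N in place of α/(2N).

```latex
G > \frac{N-1}{\sqrt{N}}
\sqrt{\frac{t^{2}}{N - 2 + t^{2}}},
\qquad
t = \operatorname{qt}\!\left(1 - \frac{\alpha}{2N},\; N - 2\right)
```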
98% confidence of outlier test being correct
where qt is the inverse cumulative distribution function of Student's t distribution. If you wish to be 98% confident (alpha = 0.02) of the analysis, the test finds that no outliers are present.
By relaxing the confidence criterion to 95% (alpha = 0.05), Grubbs' test indicates that outliers are present:
You now have the tools in place to write a useful routine for outlier detection based on the Grubbs criterion. The following function performs Grubbs' test on a data vector at a user-defined confidence level and marks any detected outliers as NaN.
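A rough Python counterpart of such a routine, using only the standard library, might look like the sketch below. It is not the Mathcad function. Several things are assumptions made for illustration: the names `t_cdf`, `t_quantile`, and `grubbs_mark` are hypothetical, the test is the two-sided form, the test is reapplied until no further point is flagged, and the t quantile (the role played by qt in Mathcad) is approximated by numerical integration and bisection. The sample data is invented.

```python
import math
import statistics


def t_cdf(t, df):
    """Student's t CDF via Simpson's rule on the density (sketch-level accuracy)."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

    a = abs(t)
    n = 2 * max(50, math.ceil(a / 0.02))  # even number of Simpson intervals
    h = a / n
    area = pdf(0.0) + pdf(a)
    for i in range(1, n):
        area += (4 if i % 2 else 2) * pdf(i * h)
    area *= h / 3
    return 0.5 + area if t >= 0 else 0.5 - area


def t_quantile(p, df):
    """Inverse t CDF for 0 < p < 1, found by bisection."""
    if p < 0.5:
        return -t_quantile(1 - p, df)
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < p:  # widen the bracket until it contains the quantile
        hi *= 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def grubbs_mark(data, alpha=0.05):
    """Return a copy of data with points flagged by a two-sided Grubbs test set to NaN."""
    out = list(data)
    while True:
        idx = [i for i, v in enumerate(out) if not math.isnan(v)]
        n = len(idx)
        if n < 3:  # the criterion needs at least one degree of freedom (n - 2)
            break
        vals = [out[i] for i in idx]
        mean = statistics.fmean(vals)
        sd = statistics.stdev(vals)
        if sd == 0:
            break
        worst = max(idx, key=lambda i: abs(out[i] - mean))
        g = abs(out[worst] - mean) / sd
        t = t_quantile(1 - alpha / (2 * n), n - 2)
        g_crit = (n - 1) / math.sqrt(n) * math.sqrt(t * t / (n - 2 + t * t))
        if g <= g_crit:
            break
        out[worst] = math.nan  # flag the point and test the remaining values again
    return out


# Invented sample: nine tightly clustered readings and one extreme value.
sample = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.1, 9.9, 25.0]
marked = grubbs_mark(sample, alpha=0.05)
print(marked)
```

Reapplying the test after each flagged point is a design choice of this sketch. Because Grubbs' test is formulated for a single outlier, repeated application should be treated with caution when several outliers may mask one another.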
In the following example, alpha has been relaxed to 0.1, resulting in two data points (indices 3 and 19) being flagged as outliers by marking them as NaN (Not a Number). This makes intuitive sense if you examine the raw data plot above.
Using Interquartile Range for Outlier Detection
Another way to think of outliers is as data values that lie at the extreme percentiles of your data set. Using box plots as an inspiration, you can apply the same methodology to outlier identification. Box plots label points as extreme if they lie more than 1.5 times the interquartile range above the upper quartile or below the lower quartile.
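The box-plot rule can be sketched in a few lines of Python. This is an illustration rather than the Extension Pack function: `iqr_outliers` is a hypothetical name, the data is invented, and the quartiles come from `statistics.quantiles` with the "inclusive" method. Quartile conventions differ between tools, so the fences may not match Mathcad's exactly.

```python
import statistics


def iqr_outliers(data, k=1.5):
    """Return indices of points beyond k * IQR from the quartiles, plus the fences."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    flagged = [i for i, v in enumerate(data) if v < lower or v > upper]
    return flagged, (lower, upper)


# Invented sample: nine clustered readings, one high value, one low value.
sample = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.1, 9.9, 25.0, 2.0]
flagged, (lower, upper) = iqr_outliers(sample)
print(f"fences: {lower:.2f} to {upper:.2f}; flagged indices: {flagged}")
```

Unlike Grubbs' test, this rule makes no normality assumption and needs no significance level, only the multiplier k.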
This method detects four outliers in the test data:
These outliers are indicated with blue boxes in the scatterplot below, as are the limits of the interquartile range.
Handling Detected Outliers
As with the missing values discussed in last month's article, outliers marked as NaNs can be handled by substituting appropriate values, such as the mean or the median of the dataset. Another approach is to delete the data point entirely, but this should be used only as a last resort. Frequently, outliers are a red flag that further investigation of your sampling and data collection tools is warranted.
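Substitution is straightforward once outliers are marked as NaN. The Python sketch below replaces each NaN with the median or the mean of the remaining values; `fill_nan` is a hypothetical helper and the data is invented.

```python
import math
import statistics


def fill_nan(data, how="median"):
    """Replace NaN entries with the median (default) or mean of the non-NaN values."""
    valid = [v for v in data if not math.isnan(v)]
    if not valid:
        raise ValueError("no valid values to compute a replacement from")
    fill = statistics.median(valid) if how == "median" else statistics.fmean(valid)
    return [fill if math.isnan(v) else v for v in data]


# Invented vector with one point already flagged as an outlier.
marked = [1.0, math.nan, 3.0, 8.0]
print(fill_nan(marked))               # median of the valid values fills the gap
print(fill_nan(marked, how="mean"))   # mean of the valid values fills the gap
```

The median is usually the safer default because it is not pulled toward any extreme values that escaped detection.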
This article explores outliers and their ramifications for the information content of a dataset, and illustrates some methods for detecting them in an objective and automated manner. So far, these articles have been examining continuous-valued data. Next month's issue examines approaches to transforming and maximizing the information content of categorical data.