It's an unfortunate fact of life that data are not well-behaved.
"Outliers" (unusual data values) show up in most research projects
involving data collection.
This is especially true in observational studies, where data may naturally
have very unusual values even if they come from reputable sources. Data
entry errors and rare events (such as readings from a thermometer left in
the sun, a change in accounting practice, or a subject who has a sudden
muscle spasm) are among the many reasons for the existence of
outliers in a dataset.
Likely Sources of Outliers
Data errors. Data recording or entry errors should be the first thing to
check as a possible reason for outlying observations. Use of a spreadsheet
program such as Excel for data entry can help improve input accuracy and
therefore reduce the occurrence of data recording errors.
"Rare" event syndrome. Another reason for outliers is the "rare" event
syndrome: extreme observations may occur that for some legitimate reason
do not fit within the typical range of other data values. Such unusual
observations might include
* a 70 degree January day in Oregon
* a 500 point rise/drop in a stock market index
* a high score on an aggressiveness scale for a troubled child
All these events may be relatively rare, but they still must be considered
part of the overall picture.
With large datasets, computer programs can be written to identify data
entry errors or extreme observations (SAS is a particularly good tool for
this purpose).
Why Are Outliers a Problem?
Developing techniques to look for outliers and understanding how they
impact data analysis are extremely important parts of a thorough analysis,
especially when statistical techniques are to be applied to the data.
For example, in the presence of outliers, any statistical test based on
sample means and variances can be distorted. Regression coefficients
estimated by minimizing the Sum of Squares for Error (SSE) are very
sensitive to outliers.
There are several other problematic effects of outliers, including:
* bias or distortion of estimates
* inflated sums of squares, which make it unlikely that you will partition
sources of variation in the data into meaningful components
* distortion of p-values (statistical significance, or lack thereof, can
be due to the presence of a few unusual data values, or even just one)
* faulty conclusions (it's quite possible to draw false conclusions if you
haven't looked for indications that there was anything unusual in the
data)
The following example may seem a bit extreme, but real data with this
feature actually exist. The results vividly demonstrate the potential
problems that lurk in the background due to unusual data values.
                  Sorted Data     Median   Mean   Variance   95% Confidence Interval
                                                             for the mean
Real Data         1 3 5 9 12      5        6.0    20.0       [0.45 to 11.55]
Data with Error   1 3 5 9 120     5        27.6   2676.8     [-36.63 to 91.83]
The first four data values in each row are the same. However, in the
second row, the fifth entry differs greatly from the value in the first
row (120 instead of 12). Note that in the presence of one
outlier, the median does not change in this example.
The median is robust (i.e., it usually does not vary greatly) in the
presence of a small number of outliers and is often the preferred summary
statistic for the "center" of a skewed distribution. Notice how just one
large outlier can greatly distort the mean, variance, and 95% confidence
interval for the mean. Similar results apply to regression, analysis of
variance, or any technique that uses sums of squares in the calculations.
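The figures in the table can be checked with a few lines of code. The following is a minimal Python sketch, not part of the original article; the function name summarize and the constant T_CRIT_DF4 are illustrative, and the t critical value 2.776 (4 degrees of freedom, two-sided 95%) is an assumption taken from a standard t table.

```python
import math
import statistics

# Assumed critical value of Student's t for a two-sided 95% interval
# with n - 1 = 4 degrees of freedom (from a standard t table).
T_CRIT_DF4 = 2.776

def summarize(data):
    """Return median, mean, sample variance, and 95% CI for the mean."""
    n = len(data)
    mean = statistics.mean(data)
    variance = statistics.variance(data)          # divides by n - 1
    half_width = T_CRIT_DF4 * math.sqrt(variance / n)
    return {
        "median": statistics.median(data),
        "mean": mean,
        "variance": variance,
        "ci": (mean - half_width, mean + half_width),
    }

real_data = [1, 3, 5, 9, 12]
data_with_error = [1, 3, 5, 9, 120]

for label, data in [("Real Data", real_data),
                    ("Data with Error", data_with_error)]:
    s = summarize(data)
    print(f"{label:16s} median={s['median']} mean={s['mean']:.1f} "
          f"variance={s['variance']:.1f} "
          f"CI=[{s['ci'][0]:.2f} to {s['ci'][1]:.2f}]")
```

Changing the single value 12 to 120 leaves the median at 5 but moves every statistic that is built on sums of squares.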
How to Detect Outliers
The "normal" distribution myth. For many statistical modeling purposes,
input data do not necessarily require a "normal" or symmetric, bell-shaped
distribution. (This feature applies primarily to residuals from a
statistical model -- a subject for future articles.) Discrete data or
counts, by definition, will not usually look very "normal".
In fact, for data to be used in a linear regression model, the independent
or explanatory variables need not have a normal distribution. It can be
demonstrated mathematically that normality is not required nor even
desirable for this type of data. What is important is to check for data
values that lie well outside the range of other data (called leverage
points) and that can have an undue influence on the results. Your objective
should be to collect data with a distribution that allows you to make the
best inferences possible about the population under study.
Visual aids. Check the distributions of data values by levels of a
categorical variable, if available. This procedure should always be one of
the first steps in data analysis and will quickly reveal the most obvious
outliers.
For continuous or interval data, a dotplot of a single variable or
multi-dimensional scatterplots are good methods to look for any outlying
observations. A box plot is another very helpful tool, since it makes no
distributional assumptions nor does it require any prior estimate of a
mean or standard deviation. Values that are extreme in relation to the
rest of the data are easily identified.
Another decision rule is to eliminate a certain percentage of the data at
the extremes in one or both tails, such as the top 1%. One weakness of this
method is that the cut-off value is based only on an ordering of the data,
so values closest to the 99% quantile may be eliminated when they
should be kept, or kept when they should have been eliminated.
Univariate tests exist to check for the presence of outliers; however,
many of them are designed to check for the presence of only one outlier,
and they also make distributional assumptions which are often not relevant
(e.g., assuming a normal distribution when you have very skewed non-negative
data). They often require that a location (mean) or scale (standard
deviation) parameter be estimated from the data. As shown earlier,
outliers greatly influence these estimates. This is one reason why
"eliminating data that exceed two or three standard deviations" may not be
a good, or even a reasonable, decision rule.
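To see why a standard-deviation rule can fail, consider the "Data with Error" row from the earlier table. The Python sketch below is an illustration added here, not the article's own method; it computes how many sample standard deviations each value lies from the mean and applies a two-standard-deviation cut-off.

```python
import statistics

data_with_error = [1, 3, 5, 9, 120]

mean = statistics.mean(data_with_error)
sd = statistics.stdev(data_with_error)    # sample standard deviation

# Distance of each value from the mean, in standard deviations.
z_scores = [(x - mean) / sd for x in data_with_error]

for x, z in zip(data_with_error, z_scores):
    flag = "flagged" if abs(z) > 2 else "not flagged"
    print(f"value={x:4d}  z={z:6.2f}  {flag}")
```

The value 120 inflates the mean and the standard deviation so much that it lies only about 1.79 standard deviations from the mean and is not flagged. With five observations no value can lie more than (n-1)/sqrt(n), about 1.79, sample standard deviations from the mean, so a two-standard-deviation rule can never flag anything in a sample this small.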
It only requires basic computing skills to find the interquartile range
(IQR) and then use a multiple of it to define which values
could be considered outliers. One way to apply this approach is to use
PROC UNIVARIATE with SAS and save the order statistics available with its
OUTPUT statement. The first quartile (q1), third quartile (q3), and
interquartile range (iqr) can be saved to an output data set or written
to macro variables (see example below). In a subsequent DATA step you can
flag observations that lie outside of q1-(1.5*iqr) and q3+(1.5*iqr) as
potential outliers and anything outside of q1-(3*iqr) and q3+(3*iqr) as
problematic outliers.
Here is a sample SAS program to detect outliers with order statistics.
OPTIONS ls=78 ps=55 nocenter formdlim=' ';
* compute the quartiles and interquartile range of y;
PROC UNIVARIATE DATA=mydata NOPRINT;
VAR y;
OUTPUT OUT=qdata Q1=q1 Q3=q3 QRANGE=iqr;
RUN;
* copy the order statistics into macro variables;
DATA _null_; SET qdata;
CALL SYMPUT("q1",q1); CALL SYMPUT("q3",q3); CALL SYMPUT("iqr",iqr);
RUN;
* save the outliers;
DATA outliers; SET mydata;
* two characters, so that the second flag is not truncated;
LENGTH severity $ 2;
IF (y <= (&q1 - 1.5*&iqr)) OR (y >= (&q3 + 1.5*&iqr)) THEN severity='*';
IF (y <= (&q1 - 3*&iqr)) OR (y >= (&q3 + 3*&iqr)) THEN severity='**';
IF severity='*' OR severity='**' THEN OUTPUT outliers;
RUN;
* list the flagged observations;
PROC PRINT DATA=outliers NOobs n;
VAR <id variables> y severity;
TITLE 'Data outliers for review';
RUN;
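For readers without SAS, the same fences can be sketched in Python and tried on the five-value example from the earlier table. This is an illustrative sketch only: the function name flag_outliers is hypothetical, and the choice of method="inclusive" in statistics.quantiles is an assumption. Quartile conventions differ between packages, so fences computed elsewhere may differ slightly.

```python
import statistics

def flag_outliers(values):
    """Return (value, severity) pairs using 1.5*IQR and 3*IQR fences."""
    # "inclusive" is one of several quartile conventions (an assumption
    # made for this sketch); other conventions shift the fences a little.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    flagged = []
    for y in values:
        severity = ""
        if y <= q1 - 1.5 * iqr or y >= q3 + 1.5 * iqr:
            severity = "*"       # potential outlier
        if y <= q1 - 3 * iqr or y >= q3 + 3 * iqr:
            severity = "**"      # problematic outlier
        if severity:
            flagged.append((y, severity))
    return flagged

print(flag_outliers([1, 3, 5, 9, 120]))
```

With these five values the quartiles are 3 and 9, so the outer fences are -15 and 27 and only 120 is flagged, with severity "**". Because quartiles are themselves resistant to a single extreme value, this rule does not suffer from the inflation problem that affects rules based on the mean and standard deviation.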
Multivariate outliers can also lurk undetected in an analysis.
Univariate tests for outliers are not guaranteed to identify multivariate
outliers. For example, two data values for the same observation (call them
x1 and x2) may not be considered univariate outliers when looked at
individually as described above. However, their combination can lie on the
periphery of the range in two-dimensional space. In this case the pair of
values is called an influential or leverage point, which can have a strong
impact on the computation of regression coefficients, for example.
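A small numerical illustration may help. The Python sketch below uses made-up data (the lists x1 and x2 are hypothetical, not from the article) in which x2 tracks x1 closely except for the last pair. It compares each coordinate's univariate z-score with the squared Mahalanobis distance, written out by hand for the two-variable case.

```python
import statistics

# Hypothetical data: x2 tracks x1 closely, except for the last pair (7, 2.0).
x1 = [1, 2, 3, 4, 5, 6, 7, 8, 7]
x2 = [1.2, 1.9, 3.1, 3.9, 5.2, 5.8, 7.1, 8.0, 2.0]

n = len(x1)
m1, m2 = statistics.mean(x1), statistics.mean(x2)
s1, s2 = statistics.stdev(x1), statistics.stdev(x2)

# Sample covariance and correlation between x1 and x2.
cov = sum((a - m1) * (b - m2) for a, b in zip(x1, x2)) / (n - 1)
r = cov / (s1 * s2)

def mahalanobis_sq(a, b):
    """Squared Mahalanobis distance of (a, b) from the bivariate mean."""
    z1, z2 = (a - m1) / s1, (b - m2) / s2
    return (z1 * z1 - 2 * r * z1 * z2 + z2 * z2) / (1 - r * r)

d2 = [mahalanobis_sq(a, b) for a, b in zip(x1, x2)]

for a, b, d in zip(x1, x2, d2):
    z1, z2 = (a - m1) / s1, (b - m2) / s2
    print(f"x1={a:3}  x2={b:4}  z1={z1:6.2f}  z2={z2:6.2f}  d^2={d:5.2f}")
```

For the pair (7, 2.0) both univariate z-scores are below 1 in absolute value, so neither coordinate looks unusual on its own, yet its squared distance is by far the largest in the sample because the combination runs against the correlation between the two variables.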
Outliers versus Influential Observations in Linear Regression
It is possible for an influential observation not to be an outlier, and
for an outlier not to be influential. Chatterjee and Hadi (Ref ___) give
the following definitions:
Outlier: An observation in which the Studentized residual is large
relative to other observations in the data set.
High-Leverage Point: Large leverage, far away from center of points in the
X space. May be regarded as outliers in the X space.
Influential Observation: Individually or jointly excessively influence the
regression equation.
These definitions are within the context of linear regression. In
particular, there are at least two definitions of an outlier in regression
which the following two figures illustrate:
Y |
  |                                             . A
  |
  |
  |
  |
  |                  . .
  |               . . .
  |            . .
  |        . . .
  |     . .
  |   .
  +-------------------------------------------------- X
Figure 2. High leverage point that conforms to the linear model.
From the diagram, A *is* an influential point in the sense that summary
statistics will be stronger. R-square will be much larger than when point
A is excluded from the calculations. Values of the hat matrix for X would
easily tell you this. However, A is not an influential point in terms of
how it influences the estimated coefficients of the regression line. The
slope of the line through those points would be about the same regardless
of whether point A was in the dataset or not.
Now whether one considers A to be an outlier or not depends on how one
defines an outlier. The point A lies far away from the rest of the data,
which is an outlier in my book. However, if you were to look at the
residual for A it would be quite small. So statistics such as Mahalanobis
distance would spot this outlier (or an eyeball looking at a univariate or
bivariate plot!) but residuals (unstandardized, standardized,
studentized) would not.
In the plot below, point A is both an outlier and influential.
Y |
  |                                             . A
  |
  |
  |                   . .
  |                 . . .
  |               . . .
  |             . .
  |           .
  |         . .
  |      . . .
  |    . .
  +-------------------------------------------------- X
Figure 3. High leverage point that does not conform to the linear model.
A point considered to be an outlier depends on how far it lies away from
the rest of the data, regardless of whether it conforms to a model
estimated by the rest of the data. A point is influential if it doesn't
conform to the remainder of the data. Another way to look at it is: do
your results change substantially when computing results with and without
this observation? If so, it is considered influential. In the first plot
above, removing A would not substantially change the coefficient
estimates; however, it will considerably change r^2 and p-values from
significance tests.
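The with-and-without comparison can be made concrete. The following Python sketch uses made-up data (a noisy six-point cluster and two hypothetical placements of point A) to mimic the two figures; the helper names fit_line and leverage are illustrative. Leverage is computed with the standard hat-matrix diagonal for simple linear regression, 1/n + (x_i - xbar)^2 / Sxx.

```python
def fit_line(xs, ys):
    """Least squares fit of y = a + b*x; returns (a, b, r_squared)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return ybar - slope * xbar, slope, sxy ** 2 / (sxx * syy)

def leverage(xs, i):
    """Hat-matrix diagonal for observation i in simple linear regression."""
    n = len(xs)
    xbar = sum(xs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    return 1 / n + (xs[i] - xbar) ** 2 / sxx

# Hypothetical noisy cluster, plus a point A far to the right in X.
bulk_x = [1, 2, 3, 4, 5, 6]
bulk_y = [2.5, 1.0, 4.5, 2.5, 6.0, 4.5]
point_a_on_line = (20, 14.3)     # like Figure 2: conforms to the trend
point_a_off_line = (20, 3.0)     # like Figure 3: does not conform

_, slope_bulk, r2_bulk = fit_line(bulk_x, bulk_y)
print(f"without A    slope={slope_bulk:.3f}  R^2={r2_bulk:.3f}")

for label, (ax, ay) in [("A on line", point_a_on_line),
                        ("A off line", point_a_off_line)]:
    xs, ys = bulk_x + [ax], bulk_y + [ay]
    a, b, r2 = fit_line(xs, ys)
    h = leverage(xs, len(xs) - 1)
    resid = ay - (a + b * ax)
    print(f"{label:11s}  slope={b:.3f}  R^2={r2:.3f}  "
          f"leverage={h:.3f}  residual of A={resid:.3f}")
```

In both placements A has very high leverage (about 0.94). When A sits on the trend, the slope barely moves while R-square roughly doubles, and the residual for A is close to zero. When A sits off the trend, the slope collapses toward zero, which is the sense in which that point is influential for the coefficients.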
What Should You Do About Them?
Working with outliers in quantitative data can pose rather difficult
decisions. Neither ignoring nor deleting them at will is a good solution.
If you do nothing, you may end up with a model that describes essentially
none of the data - neither the bulk of the data nor the outliers. Even
though your numbers may be perfectly legitimate, if they lie outside the
range of most of the data, they can cause potential computational
anomalies and resulting inference problems.
Accommodation. Accommodation of outliers uses techniques to mitigate their
harmful effects. One of its strengths is that identification of outliers
does not need to precede accommodation. These techniques can often be
used without prior determination that outliers exist. However, keep in
mind that identification and accommodation do not compete; rather, they
reinforce each other. A few possible approaches to accommodating outliers
are listed below.
Nonparametric Methods. One very effective way to work with data is to use
methods which are robust in the presence of outliers. Nonparametric
statistical methods fit into this type of analysis and should be more
widely applied to continuous or interval data than they currently
are. When outliers are not a problem, simulation studies have indicated
that their ability to detect significant differences is only slightly
smaller than that of the corresponding parametric methods. Various forms
of robust regression models and computer-intensive approaches also
deserve attention.
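Many nonparametric methods work with the ranks of the data rather than the raw values, which is where much of their resistance to outliers comes from. The short Python sketch below illustrates this with the two rows from the earlier table; the helper ranks is written for this illustration and assumes there are no tied values.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

real_data = [1, 3, 5, 9, 12]
data_with_error = [1, 3, 5, 9, 120]

print(ranks(real_data))
print(ranks(data_with_error))
```

Both rows produce the same ranks, so any statistic computed from the ranks alone gives the same answer whether the largest value is 12 or 120.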
Transformations. Data transformations are one way to soften the impact of
outliers since the most commonly used expressions, square roots and
logarithms, shrink larger values to a much greater extent than they shrink
smaller values. However, transformations may not fit into the theory of
the model or they may affect its interpretation. Taking the log of a
variable does more than make a distribution less skewed; it changes the
relationship between the original variable and the other variables in your
model. In addition, many transformations require non-negative data or
data that are greater than zero, so they are not always the answer.
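As a rough illustration of how these transformations pull in a large value, the Python sketch below applies a square root and a base-10 logarithm to the "Data with Error" row from the earlier table. The helper gap_to_spread is a hypothetical measure invented for this sketch: the gap between the two largest values divided by the spread of the remaining values.

```python
import math

data_with_error = [1, 3, 5, 9, 120]

def gap_to_spread(values):
    """Gap between the two largest values, relative to the spread of the rest."""
    ordered = sorted(values)
    return (ordered[-1] - ordered[-2]) / (ordered[-2] - ordered[0])

raw = [float(x) for x in data_with_error]
roots = [math.sqrt(x) for x in data_with_error]
logs = [math.log10(x) for x in data_with_error]

for name, values in [("raw", raw), ("sqrt", roots), ("log10", logs)]:
    shown = "  ".join(f"{v:7.3f}" for v in values)
    print(f"{name:5s}  {shown}   gap/spread={gap_to_spread(values):.2f}")

# Logarithms are undefined for zero and negative values.
try:
    math.log10(0)
except ValueError as exc:
    print("log10(0) fails:", exc)
```

On the raw scale the largest value sits almost 14 spreads away from its neighbour; after a square root the figure is about 4, and after a logarithm it is a little over 1. The final lines show the restriction mentioned above: the logarithm cannot be applied to a zero value.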
Deletion. Only as a last resort should outliers be deleted, and then only
if they are found to be errors that can't be corrected or lie so far
outside the range of the remainder of the data that they distort
statistical inferences. When in doubt, you can report model results both
with and without the outliers to see how much they change. Data
transformations and deletion are important tools but shouldn't be viewed
as a cure-all for computational problems associated with outliers.
Transformations and/or outlier elimination should be an informed choice,
not a routine task.
Summary
This article briefly deals with the problems of outliers, their detection,
and approaches to data analysis. It's presented with the hope that
looking for unusual values will always be a regular part of your data
analysis, and that your research objectives and knowledge of your subject
matter will help you decide what to do with them once you find them.
Always apply exploratory data analysis techniques that look for both
univariate and multivariate outliers and then evaluate how they affect
the results with and without transformations, accommodation, and deletion.
This will help you reach conclusions that are in line with your research
objectives. A "common sense" approach is often the best solution.
Questions to ask yourself concerning potential outliers:
1. Are any of the values of the predictors unusual?
2. Is the response variable unusual (relative to the model of other data)?
3. Does the value of one response have a big impact on predictions of
other response variables? That is, does it have a big impact on the
parameter estimates?
Dixon's Method for Detecting Outliers (suitable for small samples).
Dixon's Method has several variants (for lack of a better word). Some of
them are:
1. Single upper outlier x(n) in a normal sample with unknown sigma^2
[x(n)-x(n-1)]/[x(n)-x(1)]
This is effective when there is at most one outlying value; otherwise it
is vulnerable to masking effects.
2. 2-sided test for extreme outlier in a normal sample with unknown sigma^2
max{[x(n)-x(n-1)]/[x(n)-x(1)], [x(2)-x(1)]/[x(n)-x(1)]}
This of course is the 2-sided form of #1 above.
3. Upper outlier pair x(n),x(n-1) in a normal sample with unknown sigma^2
[x(n)-x(n-2)]/[x(n)-x(2)]
Avoids possible masking of x(1). Can be used in the case of a single
outlier. Avoids masking of x(n-1) if there is more than one outlier.
There are other variants of this depending on how you set up the test.
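The ratios above are simple to compute once the data are sorted. The Python sketch below is an illustration only: the function name dixon_ratios and its dictionary keys are hypothetical, and the two-sided form is taken as the larger of the upper and lower single-outlier ratios, both divided by x(n)-x(1). It computes the test statistics only; deciding whether a ratio is significant requires tabulated critical values from the references, which are not reproduced here.

```python
def dixon_ratios(values):
    """Compute three Dixon-type ratios; assumes at least four values
    with x(n) > x(2), so that no denominator is zero."""
    x = sorted(values)
    x1, x2 = x[0], x[1]
    xn, xn1, xn2 = x[-1], x[-2], x[-3]
    upper = (xn - xn1) / (xn - x1)               # variant 1
    lower = (x2 - x1) / (xn - x1)                # lower-tail counterpart
    return {
        "single_upper": upper,
        "two_sided": max(upper, lower),          # variant 2
        "upper_pair": (xn - xn2) / (xn - x2),    # variant 3
    }

print(dixon_ratios([1, 3, 5, 9, 120]))
```

For the five-value example with the entry error, the first ratio is 111/119 (about 0.93) and the third is 115/117 (about 0.98). Both are close to their upper limit of 1, which reflects how far 120 stands from the other values.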
One book in the Wiley series in probability and mathematical
statistics covers all sorts of outlier tests. Also, consider
Outliers in Statistical Data by Barnett and Lewis (Wiley).
A reference for Dixon's outlier test:
Dixon, W.J., "Processing data for outliers", Biometrics, vol. IX (1953),
pp. 74-89.
Dixon's tests are no longer recommended. There are better choices
available today.
REFERENCES
ASTM Method E 178 on Dealing With Outlying Observations.
Barnett, V., and T. Lewis (1984). "Outliers in Statistical Data", 2nd
edition, Chichester: Wiley.
Barnett, V., and T. Lewis (1994). "Outliers in Statistical Data", 3rd
edition, New York: Wiley.
Beckman, R. J., and R. D. Cook (1983). "Outlier...s", Technometrics, vol.
25, pp. 119-149.
Blaedel, W. J., Meloche, V. W., Ramsay, J. A., "A comparison of criteria
for the rejection of measurements," J. Chem. Educ., December 1951,
643-647.
Chatterjee and Hadi (1988) Sensitivity Analysis in Linear Regression, New
York: Wiley.
Cook, R. D. (1977). "Detection of influential observations in linear
regression" Technometrics 19, 15-18.
Cook, R. D. and S. Weisberg (1982). "Residuals and Influence in
Regression". Chapman and Hall: New York.
Cook, D., and Weisberg, S., An Introduction to Regression Graphics, Wiley.
Dean, R. B., Dixon, W. J., "Simplified statistics for small numbers of
observations," J. Anal. Chem., 23, 1951, 636-638.
Dixon, W. J., "Analysis of extreme values," Ann. Math. Stat., 21, 1950,
488-506.
Dixon, W. J., "Ratios involving extreme values," Ann. Math. Stat., 22,
1951, 68-78.
Hadi (1994), "A Modification of a Method for the Detection of Outliers in
Multivariate Samples", JRSSB 56:2, 393-396.
Hadi and Simonoff (1993), "Procedures for the Identification of Multiple
Outliers in Linear Models", JASA, 88:424, 1264-1272.
Hampel, Ronchetti, Rousseeuw, and Stahel, "Robust Statistics", John Wiley
& Sons, 1986
Hawkins D. M. (1980) Identification of Outliers, Chapman and Hall, 1980.
Huber, "Robust Statistical Procedures", SIAM, 1977 (A new chapter was
added in 1996).
Hu Yuzhu, Smeyers-Verbeke, J., Massart, D. L., "Outlier detection in
calibration," Chemometrics and Intelligent Laboratory Systems, 9, 1990,
31-44.
Jones, M.C., and Sibson, R., What is Projection Pursuit, J. R. Statistical
Society, Series A, Vol. 150, Part 1 1987, pp. 1-36 (Outlier discussion by
Tukey, J.W.,on pg. 33).
Lavine, M. (1991) Problems in Extrapolation Illustrated With Space Shuttle
O-Ring Data, JASA, (86), 919-921.
Miller, J. N., "Outliers in experimental data and their treatment,"
Analyst, 118, May 1993, 455-461.
Mitschele, J., "Small sample statistics," J. Chem. Educ., 66 (6), June
1991, 470-473, and references.
Rosner's multiple outlier test, Technometrics 25, No. 2, May 1983, 165-172.
Rousseeuw, P. J., and A. M. Leroy (1987). "Robust Regression and Outlier
Detection". Wiley, New York.
Tietjen, G. L. and R. H. Moore (1972). "Some Grubbs-Type Statistics for
the Detection of Several Outliers", Technometrics, v14, (3), 583-597.
Weisberg, S. (1985). "Applied Linear Regression", 2nd ed. Wiley, New York.
Wilcox, Rand R. (1998) "How Many Discoveries Have Been Lost by Ignoring
Modern Statistical Methods?" American Psychologist, Vol 53, No. 3,
300-314.
Youden, W. J., as reported in column ("Out of the Editor's Basket") item
"The Best Two Out of Three?" in J. Chem. Educ., December 1949, 673-674.

