sigma = sqrt(1/N * sum((x_i - mean)^2))
That's pretty much the one everyone uses.
Another strategy is to use the interquartile distance:
IQD = x_75% - x_25%
For a Gaussian distribution (which is the assumption the standard deviation makes), the cumulative distribution function of the normal distribution puts the quartiles at about +/- 0.6745 sigma, so the IQD is about 1.349 sigma. That lets you get a "sigma" from the IQD:
sigma_IQD ~= 0.7413 * IQD
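As a quick check on where that constant comes from, here is a minimal sketch using only the Python standard library. The names `IQD_TO_SIGMA` and `sigma_iqd` are made up for illustration, and the quartiles come from `statistics.quantiles` with its default method, which may differ slightly from other percentile conventions.

```python
import random
from statistics import NormalDist, quantiles

# For a unit Gaussian the 75th percentile is at z = 0.6745 (and the 25th at -z),
# so IQD = 2 * z * sigma and sigma = IQD / (2 * z).
Z75 = NormalDist().inv_cdf(0.75)
IQD_TO_SIGMA = 1.0 / (2.0 * Z75)  # about 0.7413


def sigma_iqd(data):
    """Effective sigma from the interquartile distance (illustrative helper)."""
    q25, _, q75 = quantiles(data, n=4)  # the three quartile cut points
    return IQD_TO_SIGMA * (q75 - q25)


# Pure Gaussian sample with mean 0 and stddev 1: the estimate should land near 1.
rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(10000)]
print(IQD_TO_SIGMA, sigma_iqd(sample))
```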
This measure is somewhat more robust against outliers, as you can pile a bunch of points outside the inner half of the distribution, and they won't change the IQD much. However, if you have a bias where all your outliers are on one side (say, all greater than the median of the distribution you care about), then it can only tolerate about 25% of your points being outliers before the quartile on that side lands among them.
One way to fix this is to use the median absolute deviation:
MAD = median( abs(x_i - median(x_i)))
It's not too difficult to directly prove that for a symmetric distribution (Gaussian, or a Gaussian + unbiased outlier distribution)
MAD_symmetric = 0.5 * IQD_symmetric
However, if the outliers are biased, MAD is able to accept up to 50% outliers before it really breaks down. As above, you can create an effective sigma:
sigma_MAD ~= 1.4826 * MAD
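The MAD version can be sketched the same way, again with only the Python standard library. The names `MAD_TO_SIGMA`, `mad`, and `sigma_mad` are made up for illustration. The last lines check numerically, on a pure Gaussian sample, that MAD comes out close to half the IQD, which only holds approximately for a finite sample.

```python
import random
from statistics import NormalDist, median, quantiles

# For a unit Gaussian, half the points lie within 0.6745 sigma of the median,
# so MAD = 0.6745 * sigma and sigma = MAD / 0.6745.
MAD_TO_SIGMA = 1.0 / NormalDist().inv_cdf(0.75)  # about 1.4826


def mad(data):
    """Median absolute deviation about the median (illustrative helper)."""
    med = median(data)
    return median([abs(x - med) for x in data])


def sigma_mad(data):
    """Effective sigma from the MAD."""
    return MAD_TO_SIGMA * mad(data)


# Symmetric test case: a pure Gaussian with mean 0 and stddev 1.
rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(10000)]
q25, _, q75 = quantiles(sample, n=4)
iqd = q75 - q25
print(sigma_mad(sample))        # should be near 1
print(mad(sample), 0.5 * iqd)   # should be nearly equal
```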
Finally, you can histogram all the data, and do a Gaussian fit to that histogram to determine the width. This has the problem that you need to have a decent number of samples before the fit will converge well. It also can be more computationally expensive, since you have to fit a function.
OK, now let's look at a bunch of plots. For these, I created 10000 random samples, with some fraction "outliers." The real distribution is a Gaussian of mean 0 and stddev 1.0. The outliers are drawn from a flat distribution from -5 to 5 (for the unbiased case) or from 0 to 10 (for the biased case).
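To make that setup concrete, here is a rough sketch of how such a contaminated sample could be generated and fed to the three closed-form estimators. It is not the code behind the plots: the function names, the seed, and the choice of contamination fractions are all assumptions, and the histogram fit is left out because it needs a fitting routine.

```python
import random
from statistics import NormalDist, median, pstdev, quantiles

Z75 = NormalDist().inv_cdf(0.75)  # 0.6745, the Gaussian 75th percentile in units of sigma


def contaminated_sample(n, outlier_fraction, biased, rng):
    """n points: a unit Gaussian core plus a flat outlier population (hypothetical helper)."""
    n_out = int(round(n * outlier_fraction))
    # Outlier ranges as described in the text: 0 to 10 if biased, -5 to 5 if not.
    lo, hi = (0.0, 10.0) if biased else (-5.0, 5.0)
    core = [rng.gauss(0.0, 1.0) for _ in range(n - n_out)]
    outliers = [rng.uniform(lo, hi) for _ in range(n_out)]
    return core + outliers


def estimates(data):
    """Standard deviation, sigma_IQD, and sigma_MAD for one sample."""
    q25, _, q75 = quantiles(data, n=4)
    med = median(data)
    mad = median([abs(x - med) for x in data])
    return {
        "stddev": pstdev(data),                # the 1/N form of the standard deviation
        "sigma_IQD": (q75 - q25) / (2 * Z75),  # same as 0.7413 * IQD
        "sigma_MAD": mad / Z75,                # same as 1.4826 * MAD
    }


rng = random.Random(42)
for biased in (True, False):
    for frac in (0.0, 0.1, 0.3):
        est = estimates(contaminated_sample(10000, frac, biased, rng))
        label = "biased" if biased else "unbiased"
        values = ", ".join(f"{name}={value:.3f}" for name, value in est.items())
        print(f"{label:8s} outliers={frac:.0%}: {values}")
```

Values near 1.0 mean the estimator is recovering the true width of the Gaussian core.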
[Figure: estimator results for N = 10000. Panels: Biased | Unbiased]
- For the biased outlier distribution, IQD fails around 25% contamination. MAD is better until around 50%. Histogram fits are generally better, but fail catastrophically around 60%.
- For the unbiased outliers, IQD and MAD are identical as suggested above. Histograms are still better.
- Standard deviation is universally bad when any outliers are present.
[Figure: estimator results for N = 100. Panels: Biased | Unbiased]
This is the same set of plots, now done for N = 100. Histograms are now good only in the unbiased case with less than 50% contamination. For the biased case, they're not better than the standard deviation above 20% contamination.
So that's largely why I like using a sigma based on the MAD. It's pretty much universally better than the regular standard deviation when outliers are present. Fitting Gaussians to histograms can do a better job, but that uses more complicated math, so it may not be useful if you don't have fitting code ready to go.