Wednesday, April 15, 2009

Nonparametric Statistics

I keep forgetting about this thing, which I'd feel worse about if other people weren't forgetting as well.

Today I fought with nonparametric statistics. If you've never worked with them, they're a class of statistical tests that make no assumptions about the underlying distribution. The standard forms tell you whether the mean or median or something like that is consistent between two samples, and there are weak ways to check for differing variances as well. However, I've been trying to use them to test whether two variables are correlated, as well as whether two samples are drawn from the same distribution, whatever that distribution may be.
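
For concreteness, here's a minimal sketch of the kinds of tests I mean, using scipy.stats; the data is made up, and the sample sizes just mirror mine:

```python
# A sketch of the standard nonparametric tests, run on synthetic data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, size=150)   # small sample
y = rng.normal(loc=0.2, size=1500)  # large sample

# Location: is the "center" of the two samples consistent?
print(stats.mannwhitneyu(x, y))     # Mann-Whitney U test

# Scale: one of the (weak) checks for differing variances.
print(stats.ansari(x, y))           # Ansari-Bradley test

# Same distribution at all? Two-sample Kolmogorov-Smirnov.
print(stats.ks_2samp(x, y))

# Correlation between two paired variables: Kendall's tau.
a = rng.normal(size=150)
b = 0.5 * a + rng.normal(size=150)
print(stats.kendalltau(a, b))
```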

The complication is that I have 150 and 1500 data points in my two samples. Almost all of these nonparametric methods are designed for small-N cases, although no one ever claims they're not good for large N. The problem arises when you attempt to calculate the significance of the statistic. Just about all of these things assume the p-value comes from a normal distribution with z ~ sqrt(N) * R. Since P(z < 2.33) ≈ 99%, a significant statistic value is anything above roughly R = 2.33 / sqrt(N). For my larger sample that cutoff is R ≈ 2.33 / sqrt(1500) ≈ 0.06, a correlation nobody would care about on its own. Therefore, with the large-N samples I'm working with, it's hard to get anything that isn't significant. This may sound great, but if everything shows a significant relation, then the statistic has lost its value, because everything gives the same answer.
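
To see how bad it gets, here's a quick sketch: the cutoff shrinks like 2.33 / sqrt(N), and a deliberately feeble correlation at my larger sample size still sails past it. The data is synthetic and the numbers are illustrative only:

```python
# Demonstration that large N makes everything "significant".
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# The ~99% cutoff on the statistic shrinks like R = 2.33 / sqrt(N):
for N in (150, 1500):
    print(f"N = {N:4d}: R above {2.33 / np.sqrt(N):.3f} reads as significant")

# A pair of variables with a deliberately feeble true correlation (~0.1),
# at the larger of my two sample sizes:
a = rng.normal(size=1500)
b = 0.1 * a + rng.normal(size=1500)
tau, p = stats.kendalltau(a, b)
print(f"tau = {tau:.3f}, p = {p:.1e}")  # tau is tiny, yet p << 0.01
```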

The end result of all this is that nonparametric statistics are a worse class of lies than regular parametric statistics, and you should be wary of anyone who says "Wilcoxon-Mann-Whitney" or "Kendall's tau." These rank up there with the KS test as "a test you can cling to when you really have nothing valid to claim."
