Wednesday, January 15, 2014

Wednesday: I am irrationally concerned with good statistics

First though, there's this Rick and Morty show.  Rick's a crazy scientist, and Morty is his grandson who ends up getting dragged on impossible adventures.  Great stuff.  In this week's episode they get abducted by aliens and put into a simulated world for reasons.  However, it's not a good simulation, and Rick points out the obvious problems:

Like the guy putting a bun between two hot dogs, and the old lady walking her cat.
Morty's all "maybe people are just weird, and that old lady is straight up proper crazy."  So Rick has to pull out the big guns:
Morty acknowledges that poptarts probably wouldn't live in a toaster, as that doesn't make much sense.

Nor would they need to get in their car and go to work.

Rick points out the actual flaw in things: people don't drive around in cars that look like their house, but are smaller and have wheels.  Cars and houses are not the same shape, so this is just complete nonsense.
Yes, they had to be naked there, due to weird alien reasons.
The poptarts show up later, as well.

Ok, statistics again.  The problem is that I saw this article today, which basically complains that "no one really means to use standard deviation, as people intrinsically want to use the mean absolute deviation" which is, of course, complete bullshit.

First, no one would ever do mean absolute deviation in their head.  Here are some numbers: {-1 2 3 -5 1 400}.  If you had to guess another number that would belong to this set, you're going to guess like "dunno, zero maybe?"  You know that 400 is probably bullshit, so you cut it out.  People don't do real means when they filter data.  It's some combination of a mode and median.  Choose a number that doesn't seem crazy.
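To put numbers on that example, here is a minimal sketch using only Python's standard library.  The variable names are mine, and "mean absolute deviation" is taken here to mean the average distance from the mean, which is an assumption about what the article intends.

```python
import statistics

# The example set from the paragraph above.
data = [-1, 2, 3, -5, 1, 400]

mean = statistics.fmean(data)
median = statistics.median(data)

# Mean absolute deviation about the mean (assumed definition).
mean_ad = statistics.fmean([abs(x - mean) for x in data])

# Median absolute deviation about the median.
median_ad = statistics.median([abs(x - median) for x in data])

print(f"mean      = {mean:.2f}")       # 66.67
print(f"median    = {median}")         # 1.5
print(f"mean AD   = {mean_ad:.2f}")    # 111.11
print(f"median AD = {median_ad}")      # 2.0
```

The single 400 drags the mean to a value nobody would guess and inflates the mean absolute deviation with it, while the median-based numbers stay close to the "dunno, zero maybe?" answer.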

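Second, for a Gaussian the mean absolute deviation works out to about 0.8 sigma, which covers a bit under 58% of the samples (it's the median absolute deviation that marks the 50% point).  Why that point?  The standard deviation is more inclusive, as it tells you that most (erf(1/sqrt(2)) = 68.change%) samples fall within one sigma of the central value.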
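As a quick check on which fraction of a Gaussian each width encloses, here is a small sketch.  It assumes a pure Gaussian, and the helper name gaussian_coverage is made up for illustration.

```python
import math
from statistics import NormalDist

def gaussian_coverage(k):
    """Fraction of a Gaussian lying within k sigma of the mean."""
    return math.erf(k / math.sqrt(2))

# Widths expressed in units of sigma, for a pure Gaussian.
one_sigma = 1.0
mean_ad_in_sigma = math.sqrt(2 / math.pi)          # E|x - mu| / sigma
median_ad_in_sigma = NormalDist().inv_cdf(0.75)    # 75th percentile of N(0, 1)

for label, k in [("sigma", one_sigma),
                 ("mean AD", mean_ad_in_sigma),
                 ("median AD", median_ad_in_sigma)]:
    print(f"{label:>9}: k = {k:.4f}, coverage = {gaussian_coverage(k):.4f}")
```

Under that assumption one sigma covers about 68.3% of the samples, the mean absolute deviation covers roughly 57.5%, and the median absolute deviation covers 50% by construction.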

Third, all that obvious shit about moments analysis.

Anyway, time for plots.  These are the same idea as the ones from the previous post, just remade with more samples and different stats.  The horizontal lines are the true uncontaminated distribution sigma and the true fully contaminated sigma (sigma_uniform = sqrt((b - a)^2 / 12), because math).  First thing to note:  Actual sigma cleanly switches between the two extremes, as it really should.  Gaussian fits are best, but IQD and MAD are comparable up to the 50% contamination point.  MeanAD doesn't seem particularly good.  The full contamination end is biased, as I'm using a parametric model (that it's a Gaussian distribution).
Biased samples.  This nicely shows that IQD fails before MAD, and that Gaussian fits are reasonable up to 60% contamination.  MeanAD is again off doing its own thing.  Median >>> mean for outlier rejection.
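For anyone who wants to poke at this without the plots, here is a rough sketch of the kind of experiment described above.  It is not the code behind the plots.  The unit Gaussian core, the uniform contamination on [-10, 10], the sample size, the scale factors, and every function name are assumptions of mine.  IQD is taken to mean the interquartile distance and MAD the median absolute deviation.  The Gaussian fit and the biased-sample case are not reproduced.

```python
import math
import random
import statistics

def sigma_uniform(a, b):
    """Standard deviation of a uniform distribution on [a, b]."""
    return math.sqrt((b - a) ** 2 / 12)

def contaminated_sample(n, fraction, a=-10.0, b=10.0, rng=None):
    """Draw n points: N(0, 1) core, with `fraction` of them replaced by U(a, b)."""
    if rng is None:
        rng = random.Random(0)
    return [rng.uniform(a, b) if rng.random() < fraction else rng.gauss(0.0, 1.0)
            for _ in range(n)]

def sigma_estimates(samples):
    """Four sigma estimates, each scaled so they agree for a pure Gaussian."""
    mean = statistics.fmean(samples)
    med = statistics.median(samples)
    q1, _, q3 = statistics.quantiles(samples, n=4)
    return {
        "stdev": statistics.pstdev(samples),
        "MAD": 1.4826 * statistics.median([abs(x - med) for x in samples]),
        "IQD": (q3 - q1) / 1.349,
        "MeanAD": math.sqrt(math.pi / 2)
                  * statistics.fmean([abs(x - mean) for x in samples]),
    }

if __name__ == "__main__":
    rng = random.Random(42)
    print(f"sigma_uniform(-10, 10) = {sigma_uniform(-10, 10):.3f}")
    for fraction in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0):
        est = sigma_estimates(contaminated_sample(20000, fraction, rng=rng))
        row = "  ".join(f"{name}={value:.2f}" for name, value in est.items())
        print(f"contamination {fraction:.1f}: {row}")
```

With these assumptions the plain standard deviation climbs toward sigma_uniform as soon as any contamination is present, while the median-based estimates stay much closer to the core sigma at low contamination.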

