Monday, August 6, 2012

Monday: Why can't I find anyone else who uses this statistic?

I pulled it from a paper in grad school, and have been using it ever since because it's just too useful not to.  Start with the assumption that you're sampling from a Gaussian distribution with unknown mean and variance.  Next, add an unknown amount of contamination, so that you get really wonky values every so often.  If you calculate just the regular sample mean and sample variance, you can get bad answers because those wonky values skew both estimates badly.  Instead, substitute the median for the mean, and you've got a pretty decent estimate of the center of the distribution.  Now, for the standard deviation, take the interquartile distance (IQD, the distance between the 75th percentile and the 25th percentile), and multiply it by 0.7413.  You come up with this number by noting that for a Gaussian distribution, sigma = IQD / (norminv(0.75) - norminv(0.25)), and norminv(0.75) - norminv(0.25) = 1.349 for the standard normal.  I've been calling this Q[uartile]-sigma in all my code since reading about it.
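For the curious, here's roughly what that looks like in code. This is a minimal sketch using only Python's standard library, not the code from the paper or from my own pipeline. The names `QSIGMA_SCALE`, `qsigma`, and `robust_center_and_sigma` are made up for illustration, and the choice of the "inclusive" quantile method is an assumption (other percentile conventions give slightly different answers on small samples).

```python
from statistics import NormalDist, median, quantiles

# Scale factor 1 / (norminv(0.75) - norminv(0.25)) for a standard normal,
# which works out to about 0.7413.
_std_normal = NormalDist()
QSIGMA_SCALE = 1.0 / (_std_normal.inv_cdf(0.75) - _std_normal.inv_cdf(0.25))


def qsigma(values):
    """Estimate sigma as the scaled interquartile distance."""
    # quantiles(..., n=4) returns the three cut points: Q1, median, Q3.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return QSIGMA_SCALE * (q3 - q1)


def robust_center_and_sigma(values):
    """Return (median, Q-sigma) as stand-ins for (mean, standard deviation)."""
    return median(values), qsigma(values)
```

Calling `robust_center_and_sigma(samples)` on a list of numbers gives back the pair you'd use in place of the sample mean and sample standard deviation.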

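Although, now that I look at this, I realize that that's basically just calculating the MAD estimate of sigma in a different way.  For a symmetric distribution, the median absolute deviation is half the interquartile distance, so 1.4826 * MAD and 0.7413 * IQD come out to the same thing.  And look, there on the wiki page, it says the same thing.  Well.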

The point being, if you have values {1,2,3,4,5,6,7,8,9000}, then the mean is going to be horribly skewed due to a contamination of just 11% of the data.  Ditto with the regular standard deviation.  However, using Qsigma (or the MAD estimate, since they're apparently equivalent) lets you have up to ~50% crap data before the statistic starts to break down, if you're lucky and have it equally bad in both directions.  Otherwise it's just 25%.
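To put numbers on that example, here's a small sketch comparing the regular statistics against the robust ones. As before, this is standard-library Python with illustrative names (`Z75`, `qsigma`, `mad_sigma`), and the "inclusive" quantile method is my assumption, not something the original recipe specifies.

```python
from statistics import NormalDist, mean, median, quantiles, stdev

# norminv(0.75) for a standard normal, about 0.6745.
Z75 = NormalDist().inv_cdf(0.75)


def qsigma(values):
    """Sigma estimate from the interquartile distance (IQD * 0.7413)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return (q3 - q1) / (2.0 * Z75)


def mad_sigma(values):
    """Sigma estimate from the median absolute deviation (MAD * 1.4826)."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    return mad / Z75


data = [1, 2, 3, 4, 5, 6, 7, 8, 9000]

plain_mean = mean(data)        # 1004, dragged far away by the single outlier
plain_sigma = stdev(data)      # in the thousands
robust_center = median(data)   # 5
robust_sigma = qsigma(data)    # about 2.97

print(plain_mean, plain_sigma)
print(robust_center, robust_sigma, mad_sigma(data))
```

On this particular sample the Q-sigma and MAD estimates agree, because the quartiles sit symmetrically around the median. On a lopsided sample they can differ, so treat the equivalence as a property of symmetric distributions rather than of every data set.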

Anyway, here's a squirrel:
Walnuts!
A biker having a very bad day:
Deer probably isn't too happy about it either.
And some books:
Books!
The continuation of yesterday's dinner is that:

  1. You cannot buy pints of regular milk apparently.
  2. Little kid 8oz milk-boxes don't expect you to not use the straw.
  3. My mac&cheese did just need a bit more milk to smooth out.
  4. I cook everything en papillote nowadays.

2 comments:

  1. If your values are 1,2,..,8,9000, you've fucked something up big time and maybe you should stop being a scientist. :-P

  2. Not everyone has perfect data. I can think of three ways you can get something like that right now:

    1) crosstalk artifact shifting light from a bright star to an empty region, but only in one image.
    2) cosmic ray hitting the detector on a bias frame, making the value way out of range.
    3) comparison of two fits, where one fit occasionally messes up and returns an invalid result.

    Also: shouldn't you be asleep by now?
    :-P
