Thursday, April 4, 2013

Thursday: Ill-defined distributions

Here's a scenario to consider. You have four samples (generically, 2*N), but due to the way they're obtained, each pair of samples has similar values (hence the 2*N choice). At what point do you exclude one pair from the distribution and use only the remaining one to constrain your measurements? The first idea would be to use the observational uncertainties (in terms of the variance) to determine whether the two pairs are consistent to within those errors. However, if they're not, you're still left with two values, neither of which has a strong prior for acceptance. I guess the next idea would be to use neighboring values (because, as usual, I'm talking about images or something like that). Then you could define a set of five values, (V,dV/dx+,dV/dx-,dV/dy+,dV/dy-), or alternatively (V,dV/dx,dV/dy,d^2V/dx^2,d^2V/dy^2), which should contain the same information. This gives a reasonably robust way to exclude point-like errors, but fails on larger-scale things.
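
To make that a bit more concrete, here's a rough sketch in Python/numpy of what the pair-consistency check and the neighbor-derivative test might look like for one pixel. The array names, the 3-sigma threshold, and the "differs from all four neighbors" rule are just my placeholders, not anything settled.

import numpy as np

def pairs_consistent(v1, v2, s1, s2, nsigma=3.0):
    # Are the two pair values consistent within their combined
    # observational uncertainty?
    return abs(v1 - v2) < nsigma * np.sqrt(s1**2 + s2**2)

def neighbor_vector(img, x, y):
    # (V, dV/dx+, dV/dx-, dV/dy+, dV/dy-) from the four nearest neighbors.
    V = img[y, x]
    return np.array([V,
                     img[y, x + 1] - V,   # dV/dx+
                     V - img[y, x - 1],   # dV/dx-
                     img[y + 1, x] - V,   # dV/dy+
                     V - img[y - 1, x]])  # dV/dy-

def point_like_outlier(img, x, y, sigma, nsigma=3.0):
    # Flag a value as point-like if it sits far from all four neighbors;
    # a larger-scale feature drags its neighbors along with it, so this
    # test (by construction) won't catch those.
    vec = neighbor_vector(img, x, y)
    return np.all(np.abs(vec[1:]) > nsigma * sigma)
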

Second scenario: you have a single realization of a distribution in one dimension. Again, the distribution is sampled in pairs (2*N total), and it can contain up to N modes. What's the best way to choose how many of those to accept as valid? My first thought is Gaussian mixture models, but I have a reasonable suspicion that the true distribution isn't Gaussian. It looks more like a lognormal, but fitting lognormal mixture models results in a single mode being preferred in all cases. Clearly that's not helpful. Maybe do a k-means thing, and then determine whether the separation between the means is inconsistent with the sigma distribution of each mode? That seems like it's just doing GMM but skipping all the math.
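
One hedged way to frame the mode-counting question is to fit mixtures with 1..N components and let an information criterion pick the count; fitting log(samples) with Gaussian components is the same as a lognormal mixture in the original variable. A rough sketch using scikit-learn's GaussianMixture (the samples array, the BIC choice, and n_max are placeholders, not the actual data or a settled criterion), with the caveat that for only 2*N samples the penalty term will tend to favor few components:

import numpy as np
from sklearn.mixture import GaussianMixture

def pick_n_modes(samples, n_max, lognormal=True):
    # Fit mixtures with 1..n_max components and keep the BIC-preferred one.
    # Fitting log(samples) with Gaussian components corresponds to a
    # lognormal mixture in the original variable.
    x = np.log(samples) if lognormal else np.asarray(samples, dtype=float)
    x = x.reshape(-1, 1)
    fits = [GaussianMixture(n_components=n).fit(x) for n in range(1, n_max + 1)]
    bics = [f.bic(x) for f in fits]
    best = int(np.argmin(bics))
    return best + 1, fits[best]

# e.g. for 2*N = 8 samples arranged in pairs:
# n_modes, model = pick_n_modes(samples, n_max=4)
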

"What's going on over here? Oh! Hello Mr. Castle! I'm the Moon!"
