astrochris: Wednesday: Blind ignorance and a pile of math

Wednesday, July 24, 2013

The problem is identifying the shift between two spectra. My pile of math immediately shouts out "CROSS CORRELATION!" because that of course is the one true answer for such things. However, maybe you don't believe me when I say that. So, let's come up with a simulation using Perl.

Imagine a spectrum extending from 1000 to 2000 with 100 randomly placed lines whose locations you know. Use those line positions to make a series of Gaussians of identical amplitude and width. This is the "real" spectrum. Next, you have a "test" spectrum, which has 100 lines as well, but they probably don't match the real spectrum exactly. In the simulation, this is defined as follows: at iteration 0, the test spectrum consists of 100 new randomly placed lines. After each iteration, a random test line is replaced with the known line from the real spectrum list.
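The setup above can be sketched in a few lines. This is a minimal Python sketch, not the Perl used for the actual simulation. The function names, the Gaussian width of 1.0, the 0.5-unit sampling grid, and the random seed are all assumptions made for illustration. It also assumes each test line is replaced exactly once, in random order, so that iteration k has exactly k lines in common with the real spectrum; the description above doesn't say whether a line can be picked twice.

```python
import math
import random

def gaussian_spectrum(line_positions, grid, width=1.0, amplitude=1.0):
    """Sample a sum of identical Gaussians (one per line) on the grid."""
    return [
        sum(amplitude * math.exp(-0.5 * ((x - p) / width) ** 2)
            for p in line_positions)
        for x in grid
    ]

def iterate_test_lines(real_lines, lo, hi, shift, rng):
    """Yield the test line list for iterations 0, 1, ..., n-1.

    Iteration 0 is entirely random lines. After each iteration, one
    not-yet-replaced test line is swapped for the matching real line,
    offset by the true shift (assumption: each index is used once).
    """
    n = len(real_lines)
    test = [rng.uniform(lo, hi) for _ in range(n)]
    order = list(range(n))
    rng.shuffle(order)
    for idx in order:
        yield list(test)
        test[idx] = real_lines[idx] + shift

rng = random.Random(0)  # fixed seed so the run is repeatable
real_lines = [rng.uniform(1000.0, 2000.0) for _ in range(100)]
iterations = list(iterate_test_lines(real_lines, 1000.0, 2000.0, 11.5, rng))

grid = [1000.0 + 0.5 * i for i in range(2001)]  # 1000 to 2000 in 0.5 steps
real_spectrum = gaussian_spectrum(real_lines, grid)
test_spectrum_at_50 = gaussian_spectrum(iterations[50], grid)
```

With this bookkeeping, `iterations[0]` shares no lines with the real list and `iterations[99]` shares 99, which lines up with the iteration numbers used in the plots.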

For this test, the test spectrum was defined to have a shift of 11.5 units, and a search was performed over the range -100 to +100 using the same cross-correlation/binary-tree algorithm I implemented in the actual science code. For each iteration and shift value, plot up the correlation coefficient:

Note that there's a ridge in the center that moves from a peak near zero to a peak near 11 at about iteration 20.
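For anyone who wants to try the shift search, here is a rough Python sketch. It is not the science code: the binary-tree search isn't shown in this post, so the sketch swaps in a plain brute-force scan over every shift from -100 to +100 in steps of the grid spacing. The names `make_spectrum`, `cross_correlate`, and `best_shift`, the 0.5-unit grid, and the width of 1.0 are all made up for the example. The sanity check at the bottom uses the easy case where every test line is a shifted copy of a real line.

```python
import math
import random

STEP = 0.5  # grid spacing (assumed); shifts are scanned in multiples of it

def make_spectrum(lines, lo=1000.0, hi=2000.0, width=1.0):
    """Sum of unit-amplitude Gaussians sampled every STEP units."""
    n = int(round((hi - lo) / STEP)) + 1
    return [
        sum(math.exp(-0.5 * ((lo + i * STEP - p) / width) ** 2) for p in lines)
        for i in range(n)
    ]

def cross_correlate(real, test, k):
    """corr[k] = sum over i of real[i] * test[i + k], skipping out-of-range i."""
    n = len(real)
    start = max(0, -k)
    stop = min(n, n - k)
    return sum(real[i] * test[i + k] for i in range(start, stop))

def best_shift(real, test, max_shift=100.0):
    """Brute-force scan: return the shift with the largest correlation."""
    kmax = int(round(max_shift / STEP))
    scores = {k: cross_correlate(real, test, k) for k in range(-kmax, kmax + 1)}
    k_best = max(scores, key=scores.get)
    return k_best * STEP, scores

rng = random.Random(42)
real_lines = [rng.uniform(1000.0, 2000.0) for _ in range(100)]
test_lines = [p + 11.5 for p in real_lines]  # every line in common

real = make_spectrum(real_lines)
test = make_spectrum(test_lines)
shift, scores = best_shift(real, test)
print("best shift:", shift)
```

A brute-force scan is the slow, boring way to do this, but it evaluates every shift, so it makes a handy cross-check for anything cleverer.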

This directly illustrates the power of the cross-correlation technique: when there are no common lines (iteration = 0), the peaks are basically randomly located, and there isn't a very clear best value. However, as more common lines are added, the random peaks lose power, and transfer that power to the peak centered on the best shift value. Here's a slightly clearer picture showing iterations 0, 50, and 99:

The diagonal lines are an artifact of the binary tree search and me not bothering to sort the data file before plotting.

This nicely shows the improvement as the number of common lines increases. The math shows why this has to be the case:

corr[shift] = sum_over_x(R(x) * T(x+shift))

If R and T are completely random, no value of shift should be any different than any other. However, if there are common lines at R(x) and T(x+shift), then those values will reinforce, and the correlation is larger. Even when the number of common lines is small, they still overwhelm the random effects.
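The reinforcement argument can be checked numerically without building the spectra at all. For two Gaussians of the same width w, the integral of their product is proportional to exp(-d^2 / (4 w^2)), where d is the distance between their centers. So the correlation at a given shift is just a sum over (real, test) line pairs, and each common line adds exactly 1 at the true shift. The Python sketch below uses that shortcut with the constant factor dropped. The function name, the width, the seed, and the handful of off-peak shifts are assumptions chosen for illustration.

```python
import math
import random

def pair_correlation(real_lines, test_lines, shift, width=1.0):
    """Correlation of two equal-width Gaussian line spectra at one shift.

    Closed form with the constant factor width*sqrt(pi) dropped: each
    (real, test) pair contributes exp(-(q - p - shift)^2 / (4 width^2)).
    """
    total = 0.0
    for p in real_lines:
        for q in test_lines:
            d = q - p - shift
            if abs(d) < 6.0 * width:  # farther pairs add under 2e-4 each
                total += math.exp(-d * d / (4.0 * width * width))
    return total

TRUE_SHIFT = 11.5
rng = random.Random(7)
real_lines = [rng.uniform(1000.0, 2000.0) for _ in range(100)]
random_lines = [rng.uniform(1000.0, 2000.0) for _ in range(100)]
off_peak_shifts = [-100.0, -60.0, -20.0, 40.0, 80.0]

results = {}
for n_common in (0, 50, 99):
    # First n_common test lines are shifted real lines; the rest stay random.
    test_lines = ([p + TRUE_SHIFT for p in real_lines[:n_common]]
                  + random_lines[n_common:])
    peak = pair_correlation(real_lines, test_lines, TRUE_SHIFT)
    background = max(pair_correlation(real_lines, test_lines, s)
                     for s in off_peak_shifts)
    results[n_common] = (peak, background)
    print(f"{n_common:3d} common lines: corr at true shift = {peak:6.1f}, "
          f"largest off-peak value = {background:6.1f}")
```

The value at the true shift can never be smaller than the number of common lines, while the off-peak values come only from chance near-coincidences between unrelated lines.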

Chopped.
I kind of want to play this game now.
Kyle sent this link to me during the middle of the work day. He's lucky I had lots of time compiling and testing code today, so I could look at bears on and off for like an hour.

Below the cut: bears!