Thursday, August 29, 2013

Dear Tomorrow Me,

Hey, Present-Day Me here.  Just wanted to remind you not to forget that this algorithm:

@box = ();
$prior = NAN;
$v = -100;
for ($i = 0; $i <= $#data; $i++) {
    if ($i <= 40) {
        push @box, $data[$i];
    }
    else {
        $min = 0;
        $max = $#box + 1;
        do {
            $j = int(($max + $min) /2);
            if ($box[$j] < $prior) {
                $min = $j;
            }
            elsif ($box[$j] > $prior) {
                $max = $j;
            }
        } while ($box[$j] != $prior);
        $box[$j] = $data[$i];

    @box = sort {$a <=> $b} @box;
    $v = $box[int($#box / 2)];

    $prior = $data[$i];
}

Is a way more efficient way to do the running medians.  The idea being: fill the box up to full size.  Next, we sort the box.  That gives this median.  On the next cycle, do a binary search through the sorted box for the point we want to remove.  I've fucked it up here, as I really need to set it to $prior = $data[$i - $boxsize] or something.  In any case, once you find that point, pull it out, and insert the new point into that position.  The rest of the array stays sorted, so the sort operation really only needs to push this new point into the correct location.  

1377855882 start
1377855882 end read
1377855900 end direct
1377855906 end optimal

So, the direct method takes 18 seconds for 1e6 samples, and this optimal method takes 6, so like a factor of three speedup.

I'm unclear if this can be similarly extended to the MAD calculation, as that's going to require backing out the old median and pushing in the new one in addition to the replacement.  Probably not a deal breaker, but you'll have to sort that out yourself, ok?

Love,
Present-Day Me

PS: Here's a sleepy bear:
"zzzzz"

No comments:

Post a Comment