Tuesday, October 7, 2014

Tuesday: We finish each other's sandwiches!

That is not what you were going to say, liar.

-Or-
This is actually going to be a math post about Markov chains.

Last night Julie was talking about Markov chains, which led to me sending her boring math stuff, and then remembering that I wrote a thing to produce simulated tweets from my twitter archive.  And then this morning while reading the arXiv/stats listings, I found this paper, which uses board games like Monopoly to talk about MCs.  

In any case, one of the things Julie mentioned today was this KimKierkegaardashian thing that merges two different people into one tweet.  I don't think the KK twitter is actually generated by an MC, as it's way less random that MC generators tend to come up with.

Also that's where the title/gif come from.  Same concept.

Anyway, this made me wonder about the math problem.  If you did want to do that with MCs, and still keep something similar, how would that work?  I was going to put a test example here, but coming up with two or more useful corpora is harder than I thought it would be while driving home.  My test case was going to be "Brock vs Brock," but doing a quick quote pull, and then running it through (Julie's) MC code showed that there wasn't enough text to get a decent chain.  Too many of the words were only used once, so the chains generated were largely just the input.

So, let's do it theoretically instead of experimentally.  MCs are just a matrix, T, defining the transition probabilities from word 1 to word 2.  1->2 is not likely to be the same as 2->1, and the matrix is probably going to be sparse (no corpus is likely to contain the phrase "the angrily" as that's grammatically wrong or "taco hater" as that doesn't exist).  Start from a random initial word, denoted by a state vector, s, indicating which word was chosen, and then the new state is just s' = random(Ts), where the random() chooses from the set of probabilities Ts generates.

What does this have to do with the Brock vs Brock thing?  The simplest thing would be to just take T_brock and add it to T_otherbrock.  This gives transitions that are something like an average of the two Brocks.  This might work, but it's not likely to be funny.  It also won't have the discontinuity in thought the KK or Hans/Anna things do.

So, let's take advantage of the fact that T can be arbitrarily large, and simply index both Brocks differently.  This constructs T', which is just a block matrix with T_brock and T_otherbrock aligned diagonally.  Leaving it like this allows an initial word from Brock to create a chain, but it will be bound to just that one Brock.  There is no jumping between the two, as the other two quadrants of T' are zeros.  That's boring.  To allow jumps from one corpus to the other, some values in those quadrants need to be populated.

My thought (and this is what would have been nice to have an experimental set to work with) is that choosing words that are excessively popular in both corpora would be where to break.  Removing the indexing different for these words will make them common to both, and allow chains that start in one to migrate to another.  Using popular words will ensure that there is a transition with some decent probability, but that you won't get a bunch of switches.  In addition, I suspect conjunctions and articles are going to end up being the common words, and that seems like it'd generate the funniest chains.



No comments:

Post a Comment