Markov Chains in Python

What is a Markov Chain?

(This was originally posted here on my original blog but since I'm not sure how much longer that will be around I'm reposting it. I'm also updating it slightly.)


This is my first markov chain! I was very excited to see it producing odd english. Here’s a quote from it: 

I felt only for i can be swept through to tone. all the language between professional gentlemen, the disparition

Amazingly, the basic premise is simple enough that I was able to figure out how to build one without a tutorial. I’m sure there are a bunch of things I could do to get the program to create more pleasing quotes but thus far I am happy.

So what is a markov chain? Essentially in this context it's a collection of words and a set of probabilities that one word will transfer to the next. That's it. You get a word, you look at what percent of the time that word transitions to the others and choose one of the other words it's connected to using the associated probabilities to choose between them.


You’ll notice that I did do a lil optimization already (I'll leave that up to you to find). I base the next word on the word before it. I also allow the punctuation to remain where it is. I decided to do this because I have seen standard Markov Chains and while funny, they’re pretty bad readability wise. I was hoping that this would produce slightly less silly results.

If you would like to load your own sample files you can grab book text from Project Gutenberg (located here: http://www.gutenberg.org/)

The Code 

I’ve pasted the code at the link below which should be able to be copy and pasted directly into your Python editor:

A lil Refactoring...


So after I wrote my own first go at a Markov chain I decided to look up how someone else did it in Python. Some things the other person did better included some pythonisms and others were algorithmic improvements.

My algorithm was actually too complex. I was finding one word and then picking the next word based on a weighting (that was based on how often the successive word appeared after the current word). All I really needed was to group the words into pairs and then just randomly choose one of the words that appear after the given pair. The quality of the results went up drastically.

Here is the new algorithm (heavily borrowed concepts from http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/194364

Hungry for More Knowledge!

If you'd like to check out another Python project I've done that uses markov chains to generate music check this link.

Have fun! 

Using Dynamic Programming to Diff Strings

My boss left a note on my last article noting how similar my solution to pattern matching was to file diff algorithms he had been reading about. So naturally I had to take it as a great chance to explore how to do a basic diff between two strings. The solution I've created would be pretty easy to upgrade to support whole lines of text rather than individual letters but I'll leave that as an exercise to you. I'll probably do it if/when I get a spare moment I just doubt I'll blog about it. :)

What's the Real Problem?

We need to figure out what the minimum amount of editing is that's needing to convert string #1 (the "old string") into string #2 (the "new string"). If we didn't need to minimize the editing we could just say delete all of the old string and insert each character of the new string. That's pretty useless though. That's would be the equivalent of your favorite SCM software telling you that the difference between file A and file B is the whole file. Like I said, useless.

So what do I mean when I say editing distance? Imagine you only have three ways to modify a string: insert, delete, and flat out changing letters in your new string to match letters in your old string. So for example to change "ba" into "fg" you would need to swap the b for an f and the a for a g. That would mean that your minimum editing distance would be two. Another example would be to change "adbed" to "abcde". This would have a minimum editing distance of 3 since you would delete the first "d", change the "e" to a "c", and then add an "e" to the end of the string.

Getting Technical

Remember the technique I used last time for string pattern matching? No? Read it. I'm using the same basic idea here with using a matrix, placing scores for the different combinations of two characters and figuring out how to traverse this matrix in an optimal fashion.

The trick in this instance is use the old string as your column headings and the new string as your row headings. The values in the cells that make up the table should be comprised of how many editing steps it would take for you to change the given old character (in its current position!) into the new character. Since, for us, deleting, inserting, or swapping a character all have the same cost (the value 1)  and since the edit distance between two identical characters is the value 0 initially it would seem as though we'll just be traversing a matrix of 0's and 1's. However in this case since we are looking for the minimum edit distance we need to take into account the minimum edit distance from the previous step.

Remember that we will be traversing this list from one corner of the matrix to the other. This is important because it limits the number of cells we need to take into account to figure out the minimum edit distance up to our cell.

In an attempt to quickly clarify what I mean let's first look at my code for calculating the edit distance of each pair of characters (which represents a single cell in our matrix at position oldCharacterIndex, newCharacterIndex):

private static int _CalculateMinimumCellEditDistance(List<List<int>> matrix, string oldString, int oldCharacterIndex, string newString, int newCharacterIndex)
{
    var oldChar = oldString[oldCharacterIndex];
    var newChar = newString[newCharacterIndex];

    var localEditDistance = oldChar == newChar ? 0 : 1;

    var leftArrayValue = localEditDistance + _GetCellValue(matrix, oldCharacterIndex - 1, newCharacterIndex);
    var topLeftArrayValue = localEditDistance + _GetCellValue(matrix, oldCharacterIndex - 1, newCharacterIndex - 1);
    var topArrayValue = localEditDistance + _GetCellValue(matrix, oldCharacterIndex, newCharacterIndex - 1);

    var minEditDistance = Math.Min(Math.Min(leftArrayValue, topLeftArrayValue), topArrayValue);

    return minEditDistance;
}

That makes perfect sense if you don't think about the left and top edges. Here we have to introduce a lil finagling to calculate those cells appropriately. Basically the following code means that if you're in the top row looking up the edit distance in column number oldCharacterIndex, the only way to have gotten there is to have made (oldCharacterIndex + 1) count deletions. By the same token, if we're in the left-most column looking for the value in row number newCharacterIndex, the only way to have gotten there is to have made (newCharacterIndex + 1) count insertions.

private static int _GetCellValue(List<List<int>> matrix, int oldCharacterIndex, int newCharacterIndex)
{
    if (oldCharacterIndex < 0 && newCharacterIndex < 0)
        return 0;
    if (oldCharacterIndex < 0)
        return newCharacterIndex + 1;
    if (newCharacterIndex < 0)
        return oldCharacterIndex + 1;
    return matrix[oldCharacterIndex][newCharacterIndex];
}

Now I'd like to point out, I'm 100% certain that this is **A** "right" way but my code might have bugs so take it with a grain of salt. The main purpose of this post is to show another way of tackling a different dynamic programming problem and the thought process I went through.

Once I've scored each pair of characters I then set out to traverse the given matrix by going to the smallest edits possible. I also created a class to handle tracking the deletes and edits that would be needed along the way. This class can be used to play back my changes to tell someone how they could change the old string into the newer one. Also, since dynamic programming calls for us to work through the problem backwards, this class reorders the steps to read correctly.

Now, once again we create a matrix of how well each cell performs according to the above method. In our context a cell's value should essentially be thought of as the number of steps required to convert the character at the cell's column index into the character at cell's row index. In my calculation I'm also taking into account shifting the character into the appropriate position. That means that every cell has a been scored in such a way that they all solved their own version of the problem as though they were the only portion of it. What's beautiful about this is that by bringing these results together we can find the optimal string editing just by following the smallest values backwards from the lower right cell of our matrix. Basically you start at what has to be your end point and assume that you took the optimal path there from the previous three optional cells (the cell immediately to the left, to the top-left, and to the top). If you have a tie between top or left just choose whichever you like. In the case of a tie between the top-left and one of the others though I default to the top-left since that's roughly half the number of cells (and thus string edit steps) we'll need to process.

Quick Example

To help pull together all of this here's an edit distance matrix from my program for diffing "bcdeffghi" and "abcdefghij". This string is useful because it exercises all 3 diff operations (insert, swap, and delete):

    b  c  d  e   f  f   g  h  i
 a 1  2  3  4  5  6  7  8  9
 b 1  2  3  4  5  6  7  8  9
 c 2  1  2  3  4  5  6  7  8
 d 3  2  1  2  3  4  5  6  7
 e 4  3  2  1  2  3  4  5  6
 f  5  4  3  2  1  2  3  4  5
 g 6  5  4  3  2  2  2  3  4
 h 7  6  5  4  3  3  3  2  3
 i  8  7  6  5  4  4  4  3  2
 j  9  8  7  6  5  5  5  4  3

The lower right hand corner (3 in this case) is our optimal edit distance. This means the string on the top of the table can be transformed into the string on the left of the table with exactly three separate steps.

As always here's the link to my code (It's in the StringMatching folder along with the pattern matcher):