Markov Chains in Python

What is a Markov Chain?

(This was originally posted here on my original blog but since I'm not sure how much longer that will be around I'm reposting it. I'm also updating it slightly.)


This is my first markov chain! I was very excited to see it producing odd english. Here’s a quote from it: 

I felt only for i can be swept through to tone. all the language between professional gentlemen, the disparition

Amazingly, the basic premise is simple enough that I was able to figure out how to build one without a tutorial. I’m sure there are a bunch of things I could do to get the program to create more pleasing quotes but thus far I am happy.

So what is a markov chain? Essentially in this context it's a collection of words and a set of probabilities that one word will transfer to the next. That's it. You get a word, you look at what percent of the time that word transitions to the others and choose one of the other words it's connected to using the associated probabilities to choose between them.


You’ll notice that I did do a lil optimization already (I'll leave that up to you to find). I base the next word on the word before it. I also allow the punctuation to remain where it is. I decided to do this because I have seen standard Markov Chains and while funny, they’re pretty bad readability wise. I was hoping that this would produce slightly less silly results.

If you would like to load your own sample files you can grab book text from Project Gutenberg (located here: http://www.gutenberg.org/)

The Code 

I’ve pasted the code at the link below which should be able to be copy and pasted directly into your Python editor:

A lil Refactoring...


So after I wrote my own first go at a Markov chain I decided to look up how someone else did it in Python. Some things the other person did better included some pythonisms and others were algorithmic improvements.

My algorithm was actually too complex. I was finding one word and then picking the next word based on a weighting (that was based on how often the successive word appeared after the current word). All I really needed was to group the words into pairs and then just randomly choose one of the words that appear after the given pair. The quality of the results went up drastically.

Here is the new algorithm (heavily borrowed concepts from http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/194364

Hungry for More Knowledge!

If you'd like to check out another Python project I've done that uses markov chains to generate music check this link.

Have fun!