String Pattern Matching: Welcome to Dynamic Programming!

The Challenge

So here's the puzzle you're provided: Given a string pattern tell if another given string matches the pattern. The character "." is a wild card that can only match one character and a "*" is a wild card that can match as many characters as possible. Nothing matches a missing character.

For the pattern "a.c*f", for example, "abcdef" would match as would "azcrwgfdjkgfdkjhdfjkfhf" but "bbcdef" would not.

How would you go about solving this problem? When I first set out to solve this problem I got horribly stuck. I tried to use just a for loop and then got totally lost in that the possible permutations branched all over the place. It seemed nigh intractable (for me of course since I knew this was solvable since regex does even more coolness than this).

Side note: That is a great way to tell what computer science challenges deserve your attention most. If you literally can't even come close to tackling a problem understanding how to finally conquer it will make you a better developer by an order of magnitude each problem IMO.

Think about this for a bit cuz up next is the solution.

A Speedy Savior

So first off, one solution would be to create a string for every possible permutation of the pattern (up to the test string's length) and check if one of those patterns matches your test string. There are two issues with this approach: A) It takes up a lot of space and B) it would take a LOT of time.

To prove it would take a lot of time consider that our solution would essentially need to consider every letter in the place of the wild card. That's 26 additional runs PER single wild card... in the case of the asterisk that could be 26^N where N is the number of letters your wild card might take up. In the larger example above, with that technique, there'd be at least 766,467,265,200,361,890,474,622,976 permutations just to the wild card alone.

Thankfully as it turns out there's a better way: Dynamic programming! Ok so what is dynamic programming and why does it apply here? Basically, dynamic programming can be helpful whenever you have a problem that breaks down into individual subproblems that can be combined together to form a complete answer. Once you've solved all of the individual subproblems you start from the finishing line and make your program assume that you found your optimal solution and try to work out what the previous step must've been given that the current step is optimal. 

What some of you might be thinking is... what if my string doesn't match? Assuming correctness seems wrong in that case huh? You definitely don't want to return a false positive but that won't happen. What will happen is that if you're program isn't able to match the strings, then the algorithm will be left incomplete and will never reach the block of code that says to return true. Essentially we only say there is a match if the algorithm starts at our finish line and ends up at the starting line we specify. If that path is broken, then our algorithm will return false.

Another benefit of this method is that the worst case runtime is O(mn) or for you non-computer science types, the number of steps the algorithm will take is approximately equal to the length of the pattern string multiplied by the length of the test string (the string we're matching against). In our large sample above that would be about 115 steps to figure out the solution (at the worst)!

Forming a Recurring Subsequence Problem

This is really the hardest part. You need to think about the problem from several different angles until you can see it as a sum of several smaller problems. Once you see this subsolution though the programming just pops out at you.

In our example with the strings, I had been thinking about how I could break this up for the better of the day in the back of head during idle cycles. Finally when I sat down to draw out some ideas (I'm very visual) I realized that I could create a table with the pattern string forming the columns and the test string forming the rows. Once that was done I marked the cells that matched with a T (for True) and an F if they didn't. I saw that in cases where I expected a match, I could traverse the table from the upper left corner to the lower right by traversing cells diagonally (or vertically in the case of an asterisk or a period).

I know this is hard to visualize so I'm going to *try* to give a visual here... bare with me.  :)

1 is True and 0 is False BTW...

Pattern: ".a*.j*"

Test string: "cadeajmn"

.a*.j*
c101101
a111101
d101101
e101101
a101101
j101111
m101101
n101101

That is the test that made me have my "a-ha" moment. So now looking at this primitative visualization let's see how our pattern matching rules show up here. First notice that the asterisks and periods have their entire columns set to true. The reason for this is that they can stand in for any letter so given appropriate positioning they **could** be true in any of those instances. I got a little tripped up here. I thought that this matrix should represent the fact that my period can only represent one letter but that turned out to be unnecessarily complicated during this phase. The 2nd phase, as you'll see, makes that much easier.

From here see if you can start at the finish line though and work your way back to the start. Now while it *is* possible in this table that you could move from cell to cell horizontally, our phase 2 rules won't allow for this. If we did allow this that would mean that two pattern characters could match to one test character and hopefully you can see why that makes no sense in our scenario. Also remember that you can only traverse cells vertically if you're in an asterisk column (since that means that that single pattern character matched multiple test strings and that's the only pattern character capable of doing that by definition).

Note that we know if our traversal was successful because at some point, on some path, we will arrive at cell (0,0). If we miss the starting line and go off the table, then we don't have a match.

These last two paragraphs we've basically listed out the rules for our recursive algorithm. More clearly, they are as follows:

Given a cell (i, j) where i is the column, j is the row

  • if we reach cell (0,0) AND this cell's value is true we have a match
  • if we reach a cell where either i or j are negative we have no match (for this path, regardless of the value)
  • if none of the above but the current cell is true then test cell (i-1,j-1) (the cell to the upper left)
  •   if that doesn't have a complete path and we are in an asterisk column try the row above

Here are those same rules but in code:

private static bool _HasCompletePath(IList<List<bool>> matrix, int patternIndex, int stringIndex, string pattern){    var result = false;     if(patternIndex < 0 || stringIndex < 0)        return false;     if (patternIndex == 0 && stringIndex == 0 && matrix[patternIndex][stringIndex])    {        result = true;    }    else if(matrix[patternIndex][stringIndex])    {        var tempResult = _HasCompletePath(matrix, patternIndex - 1, stringIndex - 1, pattern);        if(!tempResult && _HasCompletePath(matrix, patternIndex, stringIndex-1, pattern) && pattern[patternIndex] == '*')        {            tempResult = true;        }        result |= tempResult;    }     return result;}

Wrapping up

I don't expect this to make perfect sense to someone who has never written a dynamic programming algorithm in their life. My main hope is that by me sharing the insights I gained while I'm still new to this that this will help others to learn this stuff a little easier. If nothing else, it should at least provide a new non-formalized non-mathematical perspective (which is REALLY difficult to find on the web).

Full source code is available here: http://github.com/jcbozonier/Dynamic-Programming-Sample/tree/master 

Creating Music Procedurally (AKA My Computer Makes Art)

(download)

My Latest Project

This week I decided to work on a little script for the session I'll be holding at Portland Code Camp called "Artistic Expression Through Code". The first night I developed the meat of the script: A markov chain based algorithm that you could recite a song to which it would then use to create its own song. This first version didn't have support for timings so everything was quarter beats. It didn't sound as bad as random computer noise, but it didn't sound like a song. It was very... computer-like.

Tonight I decided to implement a second markov chain to track the timings of the recited song and use those along with the notes to create real songs. The results this time weren't too bad at all. (I've attached a sample of the song to this post)

Give it a listen. I think you'll be somewhat surprised that that's a completely computer generated song.

How'd You Do That?!

There's a fairly simple computer science technique known as a "Markov Chain". Don't let the whole reference to computer science fool you, it's really not tough to grasp. Basically I created a table of numbers that answers the following question: When note X was played what percentage of the time were the other notes played next? So just imagine a table with all of the notes from a to g# laid out on top (these are the notes that we last played) and vertical axis is all of the same notes but this axis represents the probability that that particular note was played next.

Here's a sample of what my program generates:
     a a# b c  c# d d# e  f   f# g  g#
   [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
    [0, 0, 0, 6, 0, 1, 0, 0, 0, 0, 2, 0], 
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
    [0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0], 
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
    [0, 0, 0, 1, 0, 2, 0, 3, 1, 0, 0, 0], 
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0], 
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
    [0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 2, 0], 
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]

You'll have to just imagine that same list of notes listed along the left hand edge of that table since it's too hard for me to put it there myself in this editor. :)

Here's the timings table (1, 1/2, 1/4, 1/8, 1/16 notes):
  1  2  4   8 16  
 [0, 0, 0,  0, 0], 
 [0, 0, 0,  1, 0], 
 [0, 0, 3,  5, 0], 
 [0, 2, 4, 11, 0], 
 [0, 0, 0,  0, 0]

But that's it, that's the magic that drives the whole process. As you can see, I'm actually not storing percentages but instead just the count of the number of times a note led to a different note. It works out to be the same thing in the end.

Grok'd

So to summarize, my program was able to generate the music because I fed it a sample musical score and it figured out what percentage of the time a c note led to another note. Here's a step by step run through:
  • Give program a note and a timing.
  • When I give it a second note/timing it notes in it's table that the first note led to the 2nd note one time. It also notes that the first timing led to this 2nd timing one time. (note I don't attempt to relate notes/timings, it's not important surprisingly).
  • Enter a third note.
  • The program then notes in its table that the 2nd note led to the 3rd note one time.
  • Continue ad inifinitum
So that's how we set the system up. Next this is how we get the program to come up with its own song:
  • Choose some random note/timing to start.
  • Ask the computer to suggest what a good note/timing would be to follow those.
  • print out the note/timing (in case it's a work of genius ;)
  • play each note using the python library I'm using, pysynth.
I know that's pretty general and a pretty quick overview but I'll be giving a more in depth explanation at Portland Code Camp.

Here's a link to my git repo with all of my Python code: http://github.com/jcbozonier/MarkovMusic/tree/master

That's all for now!

New Alloy Features Planned

First I want to touch on what Alloy is and what the guiding vision behind it is.
 
Put simply, Alloy is a messaging application written for Microsoft Windows. It allows you to IM ur friends on google talk and also let's you send tweets via Twitter. Ok, so what's the big deal?
 
It's the way Alloy does this that is special. Alloy not only connects you to these services it also connects them together in a way that is intuitive for you. When send out a tweet Alloy also takes that and posts it as your available message on google talk. Why? at it's core that's what a tweet is, a status update.
 
Alloy is all about understanding the context in which different messaging services are used and making intelligent decisions for you about how it shares your information and cross posts your messages.
 
Here's another example: have you noticed how when you want to paste a large amount of text (think code or patagraphs of text) you have to stop yourself and think, "that's not appropriate here. I need to paste that onto pastie first and then paste the link in my tweet." that's too much thinking in my opinion.
 
In Alloy, you simply paste whatever globs of text you want to send to friends and it will handle pasting it somewhere else for you. Alloy is about freeing you the user from having to think about anything but the message you want to communicate.
 
That's really what Alloy is.
 
So, what will Alloy be?
 
Currently the highest priority task on Alloy before I'd even consider it a beta is the addition of a contact list of some sort. My friends and I have some great ideas about how to do this and it's just a matter if choosing a way and getting it done.
 
Once that's done I'd like to include support for twitter searches and also some light facebook integration (mainly revolving around status updates).
 
My last high priority item is animation. I'd love to see our UI more fluidly indcate it's intent to you.
 
So that's all. Hopefully zomeo e else out there ends up discovering Alloy and enjoys it as much as I do. If you try it out but decide to not use it please shoot me an email with all of your criticism. I'd love to know what it would take to convert you into an avid Alloy user.

First Public Release of Alloy!

Click here to download:
Alloy.Setup-0.1.0.115.msi (1009 KB)


Alpha Version of Alloy Released!

I just installed the latest copy for myself and then realized.. hey! This is good enough that it hasn't been crashing for me and a lot of the annoyances are gone. Maybe it's time to post an alpha release? Yup!

I will come back later on and update this post regarding what exactly Alloy is and such but for now I just want to ensure that I can manage my releases like this here on posterous.