String Searching Algorithm: Rabin-Karp

We’ve all faced the problem of finding a substring inside a string, or a string in a text, or a set of pattern strings in a text (whatever flavour you like). Since this is a common “problem”, chances are your language of choice already has a very good implementation that you can use.

But it might be the case that you really have to implement it yourself, or you just want to understand how such a thing might work. There are several alternatives, such as the Knuth-Morris-Pratt algorithm, the Boyer-Moore algorithm and the Rabin-Karp algorithm, just to name a few. The first two are probably more “famous”, so we will focus on the third one. We’ll find out how it works because, although it has a slower worst case, it’s the algorithm of choice when we need multi-pattern search.

Let’s think about a simple, naive solution first. We could go through our string comparing it with our substring, incrementing the initial comparison position as we go, until we either find a match or reach the end.

def naive_search(substring, string):
    # Slide over the string, comparing a slice of the same size
    # against the substring at every position.
    for i in range(len(string)-len(substring)+1):
        if string[i:i+len(substring)] == substring:
            return i

    return -1
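
For example, running it against a string we’ll reuse later in this post:

print naive_search("Lelo", "garbageLelogarbage")   # prints 7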

Quite simple, and it would work fine for small strings, but with larger strings it wouldn’t work so well: in the worst case we compare the whole substring at every position, which is O(nm) for a string of length n and a substring of length m. And now imagine if we had to search for multiple patterns and/or multiple times.

At the core of the Rabin-Karp algorithm is a hash function. For our purposes, this means we will have a function that takes a string and turns it into a number. Cool, uh? We’ll dive into that later. Let’s take a look at a possible implementation of the Rabin-Karp algorithm.

def rabin_karp(substring, string):
    if substring is None or string is None:
        return -1
    if substring == "" or string == "":
        return -1
    if len(substring) > len(string):
        return -1

    hs = RollingHash(string, len(substring))
    hsub = RollingHash(substring, len(substring))

    for i in range(len(string)-len(substring)+1):
        # Only compare the actual text when the hashes match.
        if hs.digest() == hsub.digest():
            if hs.text() == substring:
                return i
        hs.update()

    return -1
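
Once the RollingHash class (defined below) is in place, a quick sanity check:

print rabin_karp("Lelo", "garbageLelogarbage")   # prints 7
print rabin_karp("Lele", "garbageLelogarbage")   # prints -1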

So what’s happening here? First, we do a few checks before we get our hands dirty. Then we create two hash objects, one for the substring and one for the string. The class is called RollingHash (there’s a reason for that name; more on it later) and it has a few methods to help us:

  • digest returns the hash value;
  • text returns, as expected, the actual text.

By now you might be thinking: why the hell do we need the actual text, if we’re using the “all mighty” hash? The thing is, our hash function suffers from collisions. What is that, you might ask? Well, let’s analyse the code below and understand what’s going on: how our hashing really works, what we mean by collisions, and why the class is called RollingHash (where the magic happens).

class RollingHash:
    def __init__(self, string, size):
        self.str  = string
        self.hash = 0

        # The initial hash: the sum of the ASCII codes of the
        # first `size` characters.
        for i in xrange(0, size):
            self.hash += ord(self.str[i])

        self.init = 0
        self.end  = size

    def update(self):
        # Slide the window one position to the right: remove the
        # leftmost character and add the new rightmost one.
        if self.end <= len(self.str) - 1:
            self.hash -= ord(self.str[self.init])
            self.hash += ord(self.str[self.end])
            self.init += 1
            self.end  += 1

    def digest(self):
        return self.hash

    def text(self):
        return self.str[self.init:self.end]

See that for loop in the __init__ method? That’s the real hashing. What we’re doing is very simple: we take the ASCII code of each letter and add them up. You’re already seeing where the collisions come from, aren’t you? Words with the same letters will have the same hash value! Although that might seem like a problem, it really isn’t for our purpose because, when it happens, we check the actual text (and it doesn’t happen often).
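
For instance, two anagrams collide, since addition doesn’t care about order:

print RollingHash("abc", 3).digest() == RollingHash("cba", 3).digest()   # True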

So, where does the Rolling part come from? Let’s take a look at the update method. If we just hashed as in the __init__ method at every position, we would be following the naive approach. But because we already have a hash value, we can improve on that: subtract the ASCII value of the first letter and add the ASCII value of the next one. This way we don’t have to compute the whole thing again. Confused? Let’s analyse this in a little more detail:

[garb]ageLelogarbage -> hash value: 412

g[arba]geLelogarbage -> current_hash_value - ord('g') + ord('a')

g[arba]geLelogarbage -> hash value: 406

Let’s assume that we are searching for the substring Lelo, so we consider windows of size 4. Hashing the first 4 letters (garb) yields the value 412, but that’s not a match (Lelo’s hash value is 396). To move forward we don’t need to recompute everything, because the next window of size 4 shares 3 letters with the previous one. So, we subtract the value of 'g' and add the value of 'a'. We’re saving time and effort.
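
A quick check of those numbers with the class above:

h = RollingHash("garbageLelogarbage", 4)
print h.digest(), h.text()   # 412 garb
h.update()
print h.digest(), h.text()   # 406 arba
print RollingHash("Lelo", 4).digest()   # 396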

The trick here is the rolling part. This is at the centre of the Rabin-Karp algorithm, and although it’s a simple “trick” it reveals itself as being quite powerful.

The importance of using the right data structure

Have you ever read The Algorithm Design Manual? No? Well, I’m not saying you should, but it might be useful and it wouldn’t hurt. The name might give you a hint about its content, but for me it’s the way that content is explained that makes this book great.

I think the way the author presents his concepts, giving real-life examples, makes all the difference. And of course, the famous Catalog of Algorithmic Problems is an invaluable resource.

On top of all that, the author went to the trouble of writing an entire chapter about data structures. If you go through that chapter, you soon realize the importance they have for the author. More: he even states that, sometimes, you might not even need to change or improve your algorithm. You probably just need to use the right data structure for the task at hand.

We could go on and argue about which data structure is best for each task (read the book!), but I will instead demonstrate the importance of using the right one for the job with a practical example. For that we will use Problem 23 from Project Euler, which reads like this:

Non-abundant sums

Problem 23

A perfect number is a number for which the sum of its proper divisors is exactly equal to the number. For example, the sum of the proper divisors of 28 would be 1 + 2 + 4 + 7 + 14 = 28, which means that 28 is a perfect number.

A number n is called deficient if the sum of its proper divisors is less than n and it is called abundant if this sum exceeds n.

As 12 is the smallest abundant number, 1 + 2 + 3 + 4 + 6 = 16, the smallest number that can be written as the sum of two abundant numbers is 24. By mathematical analysis, it can be shown that all integers greater than 28123 can be written as the sum of two abundant numbers. However, this upper limit cannot be reduced any further by analysis even though it is known that the greatest number that cannot be expressed as the sum of two abundant numbers is less than this limit.

Find the sum of all the positive integers which cannot be written as the sum of two abundant numbers.

Check this for a possible solution (a sketch along the same lines follows the list below). So, what do we have here? A few things:

  • a function to get the proper divisors of a number;
  • a main loop that goes up to the limit, finding out if each number is abundant; if a number cannot be written as the sum of two abundant numbers, add it to the sum;
  • s stores the sum;
  • abundant_numbers is a list that stores the abundant numbers.
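
Since the actual code lives behind the link above, here’s a minimal sketch of that approach (the helper name proper_divisors and the exact structure are my own):

def proper_divisors(n):
    return [d for d in range(1, n // 2 + 1) if n % d == 0]

LIMIT = 28123
s = 0
abundant_numbers = []

for i in range(1, LIMIT + 1):
    # Record i if it is abundant.
    if sum(proper_divisors(i)) > i:
        abundant_numbers.append(i)
    # Add i to the total if it cannot be written as the sum of
    # two abundant numbers.
    if not any((i - a in abundant_numbers) for a in abundant_numbers):
        s += i

print s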

Let’s just run this and check how long it takes:

$ time python problem23.py
4179871

real 3m20.546s
user 3m20.437s
sys 0m0.008s

Right answer! But 3m20s? I can’t live with that. I start to get itchy and I have to find a way to reduce that time. The first thing that comes to mind is that our approach might not be the most appropriate. Maybe there’s a better way to find a number’s proper divisors. Maybe there’s a more efficient way to find out if a number is the sum of two abundant numbers. But let’s take a step back and check our data structures first.

The only one we’re using is a list to store our abundant numbers. So, let’s take a look at this line of code:

 if not any((i-a in abundant_numbers) for a in abundant_numbers)

You can imagine that, as the list grows in size, so does the time it takes to do each verification. Since it is a plain list, poor Python has no other way besides iterating through all its items to see if one matches what we want, so every membership test is O(n). I’m starting to think that if we had a data structure with a more efficient membership check, that huge time would decrease drastically.

What about a set? Sets are nothing more than unordered collections of unique elements. And good old Python even has a built-in implementation, which behind the scenes uses dictionaries, so membership tests are O(1) on average. The code changes needed to make use of sets are minimal, check here.
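
For reference, the change amounts to something like this (relative to the sketch above):

abundant_numbers = set()       # was: abundant_numbers = []
# ... the loop stays the same ...
abundant_numbers.add(i)        # was: abundant_numbers.append(i)

Let’s run this and find out how much we improved: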

$ time python sets_problem23.py
4179871

real 0m12.476s
user 0m12.461s
sys 0m0.008s

Wow! We only changed two lines of code (not even entire lines!) and we went from 3m20s to roughly 12.5s. Now that’s a good improvement!

The lesson here? Using the right data structure makes all the difference. It’s not just me saying it, Mr. Skiena says it too (read the book!).

Python decorators

In Python, functions are first class objects. This simply means that, like everything else in Python, they’re objects. That’s a handy feature, because it means we can pass functions as arguments to other functions, and we can also use functions as return values of other functions. Confusing, uh? Let’s see that with an example.

def func_received():
    print "I was received"

def receiver(new_func):
    new_func()

def return_function():
    def func_to_return():
        print "I was returned"

    return func_to_return

receiver(func_received)

func_returned = return_function()
func_returned()

If we run this, without any surprise we get:

$ python example.py
I was received
I was returned

As you can see, we can use functions as arguments or return values, just like everything else. So what does this have to do with decorators? Well, they’re simply callables that take a function as an argument and return a replacement function as a result. Let’s see some code.

def dec(func):
    def inner_func(*args, **kwargs):
        res = func()
        return res + " and I was decorated!"

    return inner_func

def func():
    return "I am a function"

decorated = dec(func)
print decorated()

Again, by running, we get:

$ python decorators.py
I am a function and I was decorated!

This can be quite a handy feature. Imagine a web application where you want to be sure that some action only takes place if a user is logged in. You could always do that verification right there, but this way you have a simple and easy way to achieve it. The only thing that’s required is a decorator function that does exactly that.
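
A minimal sketch of that idea; the names here (login_required, the user dictionary and its logged_in flag) are made up for illustration:

def login_required(func):
    def wrapper(user, *args, **kwargs):
        # Hypothetical check: bail out if the user isn't logged in.
        if not user.get("logged_in"):
            return "Please log in first"
        return func(user, *args, **kwargs)
    return wrapper

def dashboard(user):
    return "Welcome, %s!" % user["name"]

dashboard = login_required(dashboard)
print dashboard({"name": "Lelo", "logged_in": False})   # Please log in first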

In fact, it would be very nice if we had some piece of syntactic sugar to help us. And since Python 2.4 we have! The @ syntax applies a decorator to a function. Cool, uh? Some code to make it clearer.

def dec(func):
    def inner_func(*args, **kwargs):
        number = args[0] * 10
        res = func(number)
        return res + " and I was decorated!"

    return inner_func

@dec
def func(number):
    return "I am a function %d" % number

print func(1)   # I am a function 10 and I was decorated!

Now it’s a lot better and clearer. But, as an astute reader, you noticed those intruding *args and **kwargs. Just to exemplify that we can do something with them, I multiplied the first argument by 10. Well, it wouldn’t be very helpful if we couldn’t get at the input of our decorated functions, would it? So, those arguments are:

  • *args stores positional arguments
  • **kwargs stores all uncaptured keyword arguments. This simply means that **kwargs does for dictionaries and key/value pairs exactly what *args does for iterables and positional parameters.

The names args and kwargs are not part of Python syntax, but they’re a useful convention.
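
A tiny demonstration of how the two behave together:

def show(*args, **kwargs):
    print args, kwargs

show(1, 2, three=3)   # prints: (1, 2) {'three': 3}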

As we can see, decorators can be quite a handy feature. If you use Django, for example, you will find them flying around, and now you have an idea of what’s going on.

Sorting: can we do better than O(n log n)?

Generally, when we think about sorting the elements of a list, we’re thinking about comparison sorting algorithms. Well known algorithms of this kind are, for example, Quicksort and Merge Sort. So what do we mean when we say O(n log n)?

This type of notation (called big O notation) describes how an algorithm’s running time grows in terms of its input size. We’re normally interested in either the average or the worst case.

For comparison sorting algorithms, there is a fundamental limit for the worst case: O(n log n). Merge Sort achieves this limit. You can check this example for sorting a list of integers. So the question remains: can we do better? It may seem that the question doesn’t make any sense, given that I just told you there’s an actual limit. But actually… we can!
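
The linked example isn’t reproduced here, but a minimal Merge Sort for a list of integers might look like this:

def merge_sort(lst):
    if len(lst) <= 1:
        return lst
    mid = len(lst) // 2
    left = merge_sort(lst[:mid])
    right = merge_sort(lst[mid:])

    # Merge the two sorted halves.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:]

print merge_sort([5, 2, 9, 1, 7])   # [1, 2, 5, 7, 9]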

First of all, we’re talking about comparison sorting algorithms; there are other types of sorting algorithms (for example, integer sorting algorithms). So how can we do better? The simple “lesson” of this post is: it depends on the problem.

This type of algorithm gives us the tools for generic sorting, applicable to any kind of data, as long as we can compare two elements: that is, we can say that one is greater than, less than, or equal to the other. What we have to focus on is the kind of problem we have at hand. Its characteristics might enable us to come up with something better than O(n log n). One example is almost sorted data, where insertion sort runs in nearly linear time.

Let’s demonstrate this with a well known practical example: the Dutch national flag problem, proposed by Edsger Dijkstra. The problem reads like this:

The flag of the Netherlands consists of three colours: red, white and blue. Given balls of these three colours arranged randomly in a line (the actual number of balls does not matter), the task is to arrange them such that all balls of the same colour are together and their collective colour groups are in the correct order.

Let’s order them Red, White, Blue. If we use an algorithm like Merge Sort, we can guarantee that in the worst case it will be O(n log n). But, as you were expecting, we can do better. We can actually do it in O(n). Yes, that’s right, linear time. Let’s see how, using Python.

# Compare function for the problem: Red sorts first, Blue sorts last.
def comp_function(a, b):
    if a == 'R':
        return -1
    elif a == 'B':
        return 1
    else:
        return 0

# Swap two elements of an array
def swap(array, i, j):
    temp = array[i]
    array[i] = array[j]
    array[j] = temp

def dutch_national_flag(array, comp_function):
    start = 0
    end = len(array) - 1
    i = 0

    while i <= end:
        if comp_function(array[i], 'R') == -1:
            # Red: move it to the front.
            swap(array, i, start)
            start += 1
            i += 1
        elif comp_function(array[i], 'B') == 1:
            # Blue: move it to the back and re-check the swapped-in element.
            swap(array, i, end)
            end -= 1
        else:
            # White: leave it in the middle.
            i += 1
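
A quick run on a small example:

balls = ['W', 'B', 'R', 'R', 'W', 'B']
dutch_national_flag(balls, comp_function)
print balls   # ['R', 'R', 'W', 'W', 'B', 'B']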

As we can see, we go through the list of elements once. There are 2 “tricks” that we’re using here to achieve O(n):

  • we don’t care about the White elements (poor fellas). We only care about putting Red elements at the beginning and Blue ones at the end;
  • to achieve the previous point, we have an algorithm-specific compare function and we have to keep track of the indexes where we want to swap to (both at the start and end).

Basically, we’re swapping Red elements to the beginning and Blue ones to the end. Kind of makes sense, doesn’t it? The only small detail we have to be aware of is that, when we swap a Blue element to the end, we might be swapping it with another Blue element. So, when we make that swap, we need to decrease the end counter and re-check the same element.

With this simple example, we can see that in real world problems, the problem’s specificity might lead us to a more efficient solution. We should be aware of this fact when considering sorting for a very specific situation.