Merge Sort: let’s sort!

Sorting has always been a popular subject in Computer Science. Back in 1945, John von Neumann came up with Merge Sort. It’s an efficient divide and conquer algorithm, and we’ll dive right into it.

The general flow of the algorithm is as follows: (1) divide the list into n lists of size 1 (n being the size of the original list), (2) recursively merge them back together to produce one sorted list.

As usual, things click better for me with an example. Let’s transform this (kind of) abstract explanation into a concrete one. We’ll use a simple, yet descriptive, example: sorting lists of integers. Let’s start with (1).

def merge_sort(lst):
    # a list of 0 or 1 elements is already sorted
    if len(lst) <= 1:
        return lst

    # split in half, sort each half recursively and merge the two results
    mid = len(lst) // 2
    return merge(merge_sort(lst[:mid]), merge_sort(lst[mid:]))

Here we have part (1). Basically, at each step we check whether our list is already of size 1 (or smaller); if it is, we return it as-is. If not, we split it in half, call merge_sort on each half and pass both sorted halves to a function called merge. How deep does this recursion go? About log n levels, because we halve the list at each step.
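
Just to make the splitting step concrete, here’s one halving on a made-up list:

lst = [38, 27, 43, 3]
mid = len(lst) // 2
print(lst[:mid], lst[mid:])   # prints [38, 27] [43, 3]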

Next, phase number (2). We need to merge everything back together.

def merge(l1, l2):
    result = []
    i, j = 0, 0

    # walk both lists, always taking the smaller of the two current elements
    while i < len(l1) and j < len(l2):
        if l1[i] > l2[j]:
            result.append(l2[j])
            j += 1
        else:
            result.append(l1[i])
            i += 1

    # at least one list is exhausted; append whatever is left of the other
    result += l1[i:]
    result += l2[j:]

    return result

So, what’s going on here? We know that we start with lists of size 1, which means that at each step each of the two lists is already sorted on its own. We just need to stitch them together: we walk through both lists (until we reach the end of at least one of them), taking the smallest remaining element at each step. When one of them runs out, we simply append the remaining elements of the other to the result.
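
Just to see the stitching in action, here’s a quick check with two made-up, already-sorted halves:

print(merge([3, 27, 38], [9, 43]))   # prints [3, 9, 27, 38, 43]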

We already know that the recursion goes about log n levels deep. At each level, the calls to merge do up to n comparisons in total, because they need to figure out where all the elements fit together. So Merge Sort is an O(n log n) comparison sorting algorithm.
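
Putting the two parts together, a quick sanity check on a made-up list:

print(merge_sort([38, 27, 43, 3, 9, 82, 10]))   # prints [3, 9, 10, 27, 38, 43, 82]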

Concurrent vs Parallel

In a world where we hear and talk a lot about making code run concurrently or in parallel, there’s sometimes a bit of confusion between the two. It happens quite often that we use one term when referring to the other, or even use them interchangeably. Let’s shed some light on the matter.

When we say that we have concurrency in our code, it means that we have tasks running in periods of time that overlap. That doesn’t mean they run at exactly the same time. When we have parallel tasks, they do run at the same time.

In a multi-core world it might seem that concurrency without parallelism doesn’t make much sense, but as with everything, we should pick the right approach for the job at hand. Imagine, for example, a very simple web application where one thread handles requests and another handles database queries: they can run concurrently. Parallelism, on the other hand, has become very useful in the Big Data era, where we need to process huge amounts of data.
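
To make that web application idea a bit more concrete, here’s a minimal sketch; handle_requests and run_db_queries are placeholder names that just simulate the real work, not an actual framework:

from threading import Thread
import time

def handle_requests():
    # stand-in for a loop that serves incoming requests
    for _ in range(3):
        time.sleep(0.1)

def run_db_queries():
    # stand-in for a loop that executes database queries
    for _ in range(3):
        time.sleep(0.1)

t1 = Thread(target=handle_requests)
t2 = Thread(target=run_db_queries)
t1.start()
t2.start()
t1.join()
t2.join()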

Let’s see an example of each, run them and compare the run times.

Concurrent:

from threading import Thread

LIMIT = 50000000

def cycle(n):
    # busy-loop: count from n up to LIMIT
    while n < LIMIT:
        n += 1

# two threads, each counting through half of the range
t1 = Thread(target=cycle, args=(LIMIT // 2,))
t2 = Thread(target=cycle, args=(LIMIT // 2,))
t1.start()
t2.start()
t1.join()
t2.join()

Parallel:

from multiprocessing import Process

LIMIT = 50000000

def cycle(n):
    # busy-loop: count from n up to LIMIT
    while n < LIMIT:
        n += 1

if __name__ == '__main__':
    # guard so child processes don't re-run this block on spawn-based platforms
    # two processes, each counting through half of the range
    p1 = Process(target=cycle, args=(LIMIT // 2,))
    p2 = Process(target=cycle, args=(LIMIT // 2,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()

Now, the times to run:

$ time python concurrent.py

real    0m4.174s
user    0m3.729s
sys     0m2.272s

$ time python parallel.py

real    0m1.764s
user    0m3.422s
sys     0m0.027s

As we can see, the parallel code runs much faster than the concurrent one, which, given what was said earlier, makes sense, doesn’t it? In this example, we can only gain time if the tasks actually run simultaneously.

Your programming language of choice will give you the tools needed to implement both approaches. Analyze your problem, devise a strategy and start coding!

P.S. Please note that a plain sequential (single-threaded) implementation would actually run faster than the concurrent one, due to Python’s GIL.
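
For reference, the plain sequential version of the same workload just calls cycle twice, one after the other; no threads, no processes:

LIMIT = 50000000

def cycle(n):
    # same busy-loop as before
    while n < LIMIT:
        n += 1

cycle(LIMIT // 2)
cycle(LIMIT // 2)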