Merge Sort: let’s sort!

Sorting has always been a popular subject in Computer Science. Back in 1945, John von Neumann came up with Merge Sort. It’s an efficient divide-and-conquer algorithm, and we’ll dive right into it.

The general flow of the algorithm is as follows: (1) divide the list into n lists of size 1 (n being the size of the original list), (2) recursively merge them back together to produce one sorted list.

As usual, I understand things better with an example, so let’s transform this (kind of) abstract explanation into a concrete one. We’ll use a simple, yet descriptive, example: sorting lists of integers. Let’s start with (1).

def merge_sort(items):

    if len(items) <= 1:
        return items

    # // keeps the slice indices integers on Python 3 as well.
    mid = len(items) // 2
    return merge(merge_sort(items[:mid]), merge_sort(items[mid:]))

Here we have part 1. Basically, at each step we check if our list is already of size 1 and, if it is, we return it. If not, we split it in half, call merge_sort on each half and pass both halves to a function merge. How deep does this recursion go? log n levels, because we halve the list each time.

Next, phase number (2). We need to merge everything back together.

def merge(l1, l2):
    result = []
    i, j = 0, 0

    while i < len(l1) and j < len(l2):
        if l1[i] > l2[j]:
            result.append(l2[j])
            j += 1
        else:
            result.append(l1[i])
            i += 1

    result += l1[i:]
    result += l2[j:]

    return result

So, what’s going on here? We know that we start with lists of size 1. That means that at each step, each of the 2 lists is already sorted on its own. We just need to stitch them together. We go through both lists (until we reach the end of at least one) and take the smallest element at each step. When one of them ends, we just add the remaining elements of the other to the result.

We already know that the recursion has log n levels. At each level, the calls to merge do a total of O(n) comparisons, because merge needs to figure out where all the elements fit together. So Merge Sort is an O(n log n) comparison sorting algorithm.
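Putting the two phases together, here is a self-contained version of the snippets above that you can run directly (using // so the split index stays an integer on Python 3 too):

```python
def merge(l1, l2):
    """Merge two already-sorted lists into one sorted list."""
    result = []
    i, j = 0, 0
    while i < len(l1) and j < len(l2):
        if l1[i] > l2[j]:
            result.append(l2[j])
            j += 1
        else:
            result.append(l1[i])
            i += 1
    # One list is exhausted; append the leftovers of the other.
    result += l1[i:]
    result += l2[j:]
    return result

def merge_sort(items):
    """Sort a list by recursively splitting and merging."""
    if len(items) <= 1:
        return items
    mid = len(items) // 2
    return merge(merge_sort(items[:mid]), merge_sort(items[mid:]))

print(merge_sort([5, 2, 4, 7, 1, 3, 2, 6]))  # [1, 2, 2, 3, 4, 5, 6, 7]
```

A quick sanity check like this is a good habit whenever you implement a classic algorithm yourself.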


Concurrent vs Parallel

In a world where we hear and talk a lot about making code run concurrently or in parallel, there’s sometimes a bit of confusion between the two. It happens that many times we use one term when referring to the other, or even use them interchangeably. Let’s shed some light on the matter.

When we say that we have concurrency in our code, it means that we have tasks running in periods of time that overlap. That doesn’t mean they run at the exact same time. When we have parallel tasks, they do run at the same time.

In a multi-core world it might seem that concurrency doesn’t make sense but, as with everything, we should pick the right approach for the job at hand. Imagine, for example, a very simple web application where one thread handles requests and another one handles database queries: they can run concurrently. Parallelism has become very useful in the Big Data era, where we need to process huge amounts of data.

Let’s see an example of each, run them and compare run times.

Concurrent:

from threading import Thread

LIMIT = 50000000

def cycle(n):
    while n < LIMIT:
        n += 1

t1 = Thread(target=cycle, args=(LIMIT // 2,))
t2 = Thread(target=cycle, args=(LIMIT // 2,))
t1.start()
t2.start()
t1.join()
t2.join()

Parallel:

from multiprocessing import Process

LIMIT = 50000000

def cycle(n):
    while n < LIMIT:
        n += 1

p1 = Process(target=cycle, args=(LIMIT // 2,))
p2 = Process(target=cycle, args=(LIMIT // 2,))
p1.start()
p2.start()
p1.join()
p2.join()

Now, the times to run:

$ time python concurrent.py

real    0m4.174s
user    0m3.729s
sys     0m2.272s

$ time python parallel.py

real    0m1.764s
user    0m3.422s
sys     0m0.027s

As we can see, the parallel code runs much faster than the concurrent one, which, according to what was said previously, makes sense, doesn’t it? In this example, we can only gain time if the tasks run simultaneously.

Your programming language of choice will give you the tools needed to implement both approaches. Analyze your problem, devise a strategy and start coding!

P.S. Please note that a plain sequential implementation would run faster than the concurrent one due to Python’s GIL.
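That P.S. is easy to check for yourself. A minimal sketch of the sequential version, using the same cycle loop as the examples above, needs no threads or processes at all:

```python
LIMIT = 50000000

def cycle(n):
    # Same busy loop as in the threaded and multiprocessing versions.
    while n < LIMIT:
        n += 1

# Run both halves back to back in a single thread.
cycle(LIMIT // 2)
cycle(LIMIT // 2)
```

On a CPython interpreter, this plain loop typically beats the two-thread version: the GIL means the threads never execute Python bytecode simultaneously anyway, so they only add switching overhead.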

Django and Jenkins

If you’ve read (and followed) two of my previous posts, A small help to get you into Continuous Integration and Let’s link Jenkins and Github together, by now you have a Jenkins server linked to a Github repository. While those two posts were a little bit more generic, this one will focus on building Django projects. Let’s call it Part 3 of this series.

Building a Django project in a CI environment involves several steps: installing all dependencies (virtualenv is a must), rebuilding your database (you should always be in a position where you can deploy from scratch and that involves, of course, rebuilding your database), running tests, generating reports, etc. Please read this excellent article about Continuous Integration by Martin Fowler. It’s worth your time!

As you can see, there are a lot of steps involved, so it would be best if we script it all once and use it many times, wouldn’t it? Well, with Django that’s even simpler, because django-jenkins allows “Plug and play continuous integration with Django and Jenkins”. Sweet! Let’s add the following packages to our requirements file:

  • django-jenkins
  • coverage – code coverage measurement from Python
  • pylint – Python code static checker

Let’s update our settings file with the following settings:

INSTALLED_APPS += (
    'django_jenkins',
)

JENKINS_TASKS = (
    'django_jenkins.tasks.run_pylint',
    'django_jenkins.tasks.with_coverage',
)

PROJECT_APPS=(
    'demo_app',
)

Armed with these tools, django-jenkins “knows” what to do: it knows how to run the tests and how to generate the reports. PROJECT_APPS tells it to build reports only for our apps, excluding Django’s own code. What we need now is to tell Jenkins what to do. Let’s do that.

The first thing we need to do is install the required plugins: Violations, for parsing the pylint reports, and Cobertura, for the code coverage reports. As we’ve seen in the previous posts, that’s done via Manage Jenkins -> Manage Plugins -> Available.

The next steps involve polling the Github repository and adding a build step. Click Configure and, under Poll SCM, let’s make it poll every ten minutes (cron syntax). In the Build section, select Execute shell and we’ll add a shell script to automate the process.


Next step: build script. Add this script to the text area:

#!/usr/bin/env bash

virtualenv ve
source ./ve/bin/activate
pip install -r requirements.txt
python manage.py syncdb
python manage.py jenkins

Let’s break down this script into steps:

  • first, we create the environment to install all our dependencies;
  • next we install all dependencies from our requirements file;
  • following, we build our database. In this example we simply sync our models;
  • at last we  run django-jenkins.

This last step will generate the reports. We now need to tell Jenkins where they live so that they can be parsed: test results, test coverage reports and pylint reports. Again in Configure, go to Add post-build action and select:

  • Publish JUnit test  result report
  • Report Violations
  • Publish Cobertura coverage report

When django-jenkins runs, it creates a reports folder where the reports are generated. We just need to tell Jenkins to find the required reports there.


Now, every 10 minutes Jenkins will poll Github and, if there are changes, it will build and generate reports.


The evolution in the graphs is the result of several builds. Please note that if your app has no tests, the build will always fail.

Now you’re ready to go. CI world is at your feet. Conquer it!

Stacks and Queues: containers for all!

Stacks and Queues are two types of containers and as the name says, they’re used to store content. Predictable, uh? So what’s the difference between them, you might ask. Well Sir (or Madam), it’s the way data is retrieved.

Stacks support what we call LIFO (Last In, First Out). Elements are inserted at the top/end of the container, an operation usually called push, and retrieved from the same position, an operation usually called pop.

stack

Let’s see how we could implement this in Python:

class Stack:
    """ Simple stack implementation. """
    def __init__(self):
        self.stack = []

    def push(self, elem):
        """ Add an element to the stack. """
        self.stack.append(elem)

    def pop(self):
        """ Remove and return the element at the top of the stack. """
        return self.stack.pop()

    def get_stack(self):
        """ Get current stack. """
        return self.stack

This generic implementation is simple but serves as an example of how we could implement a Stack.
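To see the LIFO order in action, a plain Python list already gives us the push/pop behaviour the class above wraps:

```python
stack = []          # our container

stack.append(1)     # push 1
stack.append(2)     # push 2
stack.append(3)     # push 3

top = stack.pop()   # pop returns the last element pushed
print(top)          # 3
print(stack)        # [1, 2]
```

The last element in is the first one out, exactly as promised.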

Then we have Queues, which are similar but support what we call FIFO (First In, First Out). Elements are inserted at the bottom/end of the container, usually called enqueue, and retrieved from the first position, usually called dequeue.

queue

Let’s see how we could implement this. Yes, that’s right, in Python:

class Queue:
    """ Simple queue implementation. """
    def __init__(self):
        self.queue = []

    def enqueue(self, elem):
        """ Add an element to the queue. """
        self.queue.append(elem)

    def dequeue(self):
        """ Remove element from queue. """
        return self.queue.pop(0)

    def get_queue(self):
        """ Get current queue. """
        return self.queue

Similar implementations, but as expected a different way of retrieving the data. As we can see, both containers can be implemented on top of lists/arrays (though note that pop(0) takes O(n) time, since every remaining element shifts one position). Also, we made use of Python’s append and pop methods for lists in order to insert and retrieve elements. Why reinvent the wheel?
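One caveat on efficiency for the queue: because list.pop(0) shifts every remaining element, larger queues are better served by Python’s collections.deque, which gives O(1) appends and pops at both ends:

```python
from collections import deque

queue = deque()
queue.append('a')        # enqueue at the right end
queue.append('b')
queue.append('c')

first = queue.popleft()  # dequeue from the left end in O(1)
print(first)             # a
print(list(queue))       # ['b', 'c']
```

For the small examples in this post the list-based version is perfectly fine; deque only starts to matter as the queue grows.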

Please note that these data structures accept any kind of valid data simultaneously (integers, floats, arrays, strings, etc.).

Let’s link Jenkins and Github together!

If you read A small help to get you into Continuous Integration you now have a server running Jenkins and you’re ready to start doing CI. Let’s hook it up with Github.

First step is to install the necessary plugins. Let’s go over to Manage Jenkins -> Manage Plugins.

Here, click Available and type Github in the search box. Select the Github plugins.


To be sure, let’s restart Jenkins. You might need to refresh the page once it’s over.

 


Let’s create an example project. On the main screen click New Item. Give your project a name and select Build a free-style software project.


Go to your Github repository and on the right side select SSH and copy the URL.


In Jenkins again, under Source code management select Git  and paste the URL in Repository URL.


If you move the cursor to any other place you’ll notice the red error message. This one was on purpose, to make you aware of the next step. What it means is that Jenkins was unable to connect to Github. We need to let Github “know” about your Jenkins server. Click Save and let’s see how to do that.

Go to the folder where the Vagrant files live and type:

$ vagrant ssh

Once you’re in, let’s switch to the jenkins user by typing:

$ sudo su - jenkins

You’re now ready to generate an ssh key. Type the following command and follow the instructions:

$ ssh-keygen

If you followed the default instructions, you now have a public key. Get it by typing:

$ cat .ssh/id_rsa.pub

You should see something similar to this:

ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCq2Asfk/kdYaBf5h4cX+BFRDEtZFnIv2tcBPklN+obUvclhEVt0n7yZ+MLsTDownSSOrPZwqxnIj0FRUJ4Hj8Hx/jZlCf7V5s3kIA+FHvHIfaBiKAhCdnqtNflHe03bO0MTciSlYcQVwAR8JYBwk/8Alr2/sR7Rbwu+05NiTJ0xb6Y54OTtYHitHqrDaHKMaJSkLRnjzMlZ3vcHpckgGyyZd8NiRNRX0XL2pDG21C+nQyGLu9GjKJh0ixxk3E5lgpvj/w4pFOxozIswpTzf6oXqaELoK3Y0zHwNvFyAgY3Thn+tWGPlD3a0OLcsNqB8Pa8XP04Bo0fRW/6L+1trLE5 jenkins@precise64

Copy your public key and go over to your Github project. On the right-hand side click Settings and then Deploy Keys -> Add deploy key. In the form, give a name to your key and paste the public key you got from your Jenkins server.


Back in your shell, let’s check that we can access Github by typing (say yes when asked):

$ ssh -T git@github.com

You should receive a message similar to:

Hi mccricardo/demo_project! You’ve successfully authenticated, but GitHub does not provide shell access.

That’s all folks. If you don’t trust me, click Configure on the left and verify that the error message has disappeared. You’re now one step further into CI.

__init__.py: what in the world is it for?

It’s very common, when looking into Python code, to see some __init__.py files around. Most of the time you will see them empty, many times you will see imports or more generic stuff, and sometimes a bunch of code in there. If you check some Django code, say models, you’ll see more than you would initially expect. So, what exactly is __init__.py for? Let’s check the docs:

The __init__.py files are required to make Python treat the directories as containing packages; this is done to prevent directories with a common name, such as string, from unintentionally hiding valid modules that occur later on the module search path. In the simplest case, __init__.py can just be an empty file, but it can also execute initialization code for the package or set the __all__ variable, described later.

What does this mean exactly? Well, if a directory has an __init__.py, Python will know to treat it as a package. This way the different modules within your package can be imported in a similar manner as plain modules.

Leaving this file empty is common, and viewed by many as a good practice, if your package’s modules and sub-packages have no need to share any code. A common practice is to import selected classes, functions, etc. into the package level, making them importable directly from the package. Also common is to make subpackages and modules available through the __all__ variable: when it is present, a star import (from package import *) brings in the modules listed there.
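To make this concrete, here is a small runnable sketch that builds a throwaway package on disk (the sound/effects names are made up for the example) and shows __init__.py promoting a function to package level:

```python
import os
import sys
import tempfile

# Build a throwaway package on disk to show __init__.py in action.
root = tempfile.mkdtemp()
pkg = os.path.join(root, 'sound')
os.makedirs(pkg)

# The package-level __init__.py re-exports a helper and sets __all__.
with open(os.path.join(pkg, '__init__.py'), 'w') as f:
    f.write("from sound.effects import echo\n__all__ = ['effects']\n")

# A plain module inside the package.
with open(os.path.join(pkg, 'effects.py'), 'w') as f:
    f.write("def echo(word):\n    return word * 2\n")

sys.path.insert(0, root)
import sound  # importing the package runs its __init__.py

# echo lives in sound/effects.py, but __init__.py made it package level.
print(sound.echo('hi'))  # hihi
```

Without that import line in __init__.py, callers would have to write from sound.effects import echo themselves.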

Sometimes, as in the Django example above, __init__.py can contain a lot more code. The first time you import any module from a package, the code inside its __init__.py runs. It’s up to you to decide what is worth putting at package level or not. A recurrent issue with including too much code in __init__.py appears as the complexity of a project grows: there might be several sub-packages in a deep directory structure, and importing a single item will require executing every __init__.py found while traversing that structure.

As you can see, __init__.py plays an important role in Python. You can read more about packages and modules in the official documentation for Modules.

A small help to get you into Continuous Integration

The more I use CI (Continuous Integration), the more I like it. Sure, it can be a “pain” in the beginning, but the gains are enormous. You can easily start getting results in a short period of time and improve your software development process.

If your team is not using CI, that’s no excuse for you. I’m here to help! You can easily set up, on your local machine, a VM (Virtual Machine) and run Jenkins inside it. Then you can start using it. My purpose is not to convince anyone to use CI, but instead to make it easier to start.

For this purpose, I’ll give you a hand. Check my repository on GitHub: vagrant_jenkins. The README is self-explanatory and you can use it as a reference to get everything up and running. Let’s use this post to give a little insight into what’s happening there.

If you follow the (few) instructions, by the end of the process you’ll have a server running Jenkins. Cool uh? For a small effort you can start “playing” around. This setup is composed of two files: the Vagrantfile and bootstrap.sh. The Vagrantfile is used by Vagrant to configure the VM itself and bootstrap.sh is used to provision your server (basically install and configure everything you need).

# -*- mode: ruby -*-
# vi: set ft=ruby :

# Vagrantfile API/syntax version. Don't touch unless you know what you're doing!
VAGRANTFILE_API_VERSION = "2"

Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|

  config.vm.box = "precise64"
  config.vm.box_url = "http://files.vagrantup.com/precise64.box"

  config.vm.provision :shell, :path => "bootstrap.sh"
  config.vm.network "forwarded_port", guest: 8080, host:8080

  config.vm.provider :virtualbox do |vb|
      vb.customize ["modifyvm", :id, "--memory", "1024"]
  end
end

You can see here that Vagrant will download a “box” corresponding to Ubuntu Precise Pangolin and use it as the OS. We also tell the VM that it should provision using our file, forward a port so we can access Jenkins from our host, and how much memory we want. Next comes bootstrap.sh.


#!/usr/bin/env bash


echo "Add Jenkins key"
wget -q -O - http://pkg.jenkins-ci.org/debian/jenkins-ci.org.key | sudo apt-key add -
sh -c 'echo deb http://pkg.jenkins-ci.org/debian binary/ > /etc/apt/sources.list.d/jenkins.list'

echo "Add PosgreSQL key"
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
sh -c 'echo deb http://apt.postgresql.org/pub/repos/apt/ precise-pgdg main > /etc/apt/sources.list.d/pgdg.list'

echo "Update repos"
apt-get update

echo "Install OpenJDK and Jenkins"
apt-get install -y openjdk-7-jdk jenkins git

echo "Install other packages"
apt-get install -y python-pip python-virtualenv libpq-dev python-dev

echo "Install Postgres"
apt-get install -y postgresql-9.3 postgresql-contrib-9.3

This one is even simpler. We’re just fetching and installing all the packages we need. The required ones to run Jenkins are Java and Jenkins itself. The others you see there are some I normally use. If you don’t need them, just remove the corresponding code.

Please be patient the first time you run the instructions. There’s a lot being done, so it might take a while.

By the end of it, you’ll have a basic setup which you can then use to manage your builds. Start using it, share it with your colleagues, managers, supervisors, etc. and you’ll see they’ll be interested in no time.