Pairwise and multiple alignments

This page is part of the bioinformatic algorithms.

The code found on this page can be run in Python v. 2.4-2.6.x.

Pairwise alignments

How close are two DNA or protein sequences related? It is possible to find an exact score for this with pairwise alignments. All it requires is the two sequences, a scoring scheme for the matches and mismatches, and of cause some code.

The time complexity of calculating a pairwise alignment squared, but if the optimal alignment is desired, the calculation also requires squared space. But the squared space can be reduced using Hirschbergs method, reducing the space complexity to linear.

This was a part of the exercises in the course Algorithms in Bioinformatics at BiRC, while studying bioinformatics.

A pairwise alignment is interesting because it can show the similarities between two DNA or amino acid sequences. With the correct score scheme, it can also be used to calculate the Levenshtein distance (or simply, edit distance).

In the linked package, there are three different concepts, global alignment, local alignment and global alignment in linear space, using Hirschbergs algorithm.

You can read more about it in the report accompanying the exercise. Included in the package are some simple fasta and phylip parsers.

Multiple alignments

We can extend the pairwise alignment, to align multiple sequences. As in the pairwise alignment, the computation of a multiple alignment requires that each step compares every sequence, so instead of the 2D matrix, a kD matrix is requires, where k is the number of sequences. Naturally, finding the exact solution no longer has a time complexity of n·n, but nk! And if naïvely implemented, it also requires nk space! The exact solution is implemented in sp_exact.py in the package.

Can we improve the time complexity?

Luckily we can, although it no longer can be guaranteed that the found solution is the best! It uses a heurestic, which is an idea of how the sequences might be connected and building on this, so it is an approximation that might find a good solution.

In the implemented algorithm, found in sp_approx.py, the main idea is, that one of sequences are closer related to the rest. Therefore we start by computing the score of a pairwise alignment between all combinations of the sequences. E.g. for the sequences A, B, C and D, we compute the score of the pairwise alignments of (A,B), (A,C), (A,D), (B,C), (B,D) and (C,D), that is "k choose 2" alignments.

Distances and startrees of four sequences.Next, for all sequences we construct a startree with the sequence as the center and the score from the pairwise alignment as the distance. This is illustrated to the right, where each pairwise alignment is shown as a line, and the score of the alignment is the length of the line. From these, the startrees are constructed and the size of the startree is defined as the summed lengths of the lines. Although difficult to see, the startree with 'C' as centrum is the smallest, so we use this as the sequence we build upon for the last step.

The last step is to compute pairwise alignements between the "center sequence" and the rest, and for each of these alignments, all previous alignments are extended with the gaps the last computed alignment introduces.

Last modified 12:09 2009-12-18