BioMed CentralBMC Bioinformatics

ss
Open AcceSoftware
Compressing DNA sequence databases with coil
W Timothy J White* and Michael D Hendy

Address: Allan Wilson Centre for Molecular Ecology and Evolution, Massey University, Palmerston North, New Zealand

Email: W Timothy J White* - w.t.white@massey.ac.nz; Michael D Hendy - m.hendy@massey.ac.nz

* Corresponding author    

Abstract
Background: Publicly available DNA sequence databases such as GenBank are large, and are
growing at an exponential rate. The sheer volume of data being dealt with presents serious storage
and data communications problems. Currently, sequence data is usually kept in large "flat files,"
which are then compressed using standard Lempel-Ziv (gzip) compression – an approach which
rarely achieves good compression ratios. While much research has been done on compressing
individual DNA sequences, surprisingly little has focused on the compression of entire databases
of such sequences. In this study we introduce the sequence database compression software coil.

Results: We have designed and implemented a portable software package, coil, for compressing
and decompressing DNA sequence databases based on the idea of edit-tree coding. coil is geared
towards achieving high compression ratios at the expense of execution time and memory usage
during compression – the compression time represents a "one-off investment" whose cost is
quickly amortised if the resulting compressed file is transmitted many times. Decompression
requires little memory and is extremely fast. We demonstrate a 5% improvement in compression
ratio over state-of-the-art general-purpose compression tools for a large GenBank database file
containing Expressed Sequence Tag (EST) data. Finally, coil can efficiently encode incremental
additions to a sequence database.

Conclusion: coil presents a compelling alternative to conventional compression of flat files for the
storage and distribution of DNA sequence databases having a narrow distribution of sequence
lengths, such as EST data. Increasing compression levels for databases having a wide distribution of
sequence lengths is a direction for future work.

Background
The advent of the Sanger sequencing method enabled
DNA sequence data to be collected and manipulated on
computers, paving the way for explosive growth in the
new field of bioinformatics. Publicly available DNA
sequence databases such as GenBank play a crucial role in
collecting and disseminating the raw data needed by
researchers in the field. This database currently contains
168 Gb of sequence data [1] section 2.2.8, and is expected

to continue to grow at an exponential rate, doubling in
size roughly every 14 months [2]. The volume of data
being dealt with now presents serious storage and data
communications problems. Currently, sequence data is
usually kept in large "flat files," which are then com-
pressed using standard Lempel-Ziv compression [3] (e.g.
with gzip [4]). Unfortunately this approach rarely
achieves good compression ratios: typically, gzip fails to

Published: 20 May 2008

BMC Bioinformatics 2008, 9:242 doi:10.1186/1471-2105-9-242

Received: 23 October 2007
Accepted: 20 May 2008

This article is available from: http://www.biomedcentral.com/1471-2105/9/242

© 2008 White and Hendy; licensee BioMed Central Ltd. 
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), 
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Page 1 of 15
(page number not for citation purposes)

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=18489794
http://www.biomedcentral.com/1471-2105/9/242
http://creativecommons.org/licenses/by/2.0
http://www.biomedcentral.com/
http://www.biomedcentral.com/info/about/charter/


BMC Bioinformatics 2008, 9:242 http://www.biomedcentral.com/1471-2105/9/242
match the "compression" afforded by simply encoding
each base using 2 bits [5].

Previous work concerning the compression of biological
(DNA or protein) sequences can be divided into two cate-
gories: techniques developed for efficiently compressing
sequence data for the sake of reduced resource consump-
tion (disk space or network usage) [5-9]; and investiga-
tions of the usefulness of compressibility as a measure of
information content, for the purpose of making infer-
ences about sequences (such as the relatedness of two
sequences) [10,11]. In this article we will focus on work in
the former category. Examining this body of work reveals
two distinct approaches:

• Compressing individual biological sequences

• Compressing databases of biological sequences

Compressing individual biological sequences
It is now widely recognised that DNA data is inherently
difficult to compress below the level of 2 bits per base
achievable through direct encoding [5,6,9]. Much
research has gone into developing algorithms for more
effectively compressing individual DNA sequences. These
include BioCompress [6], BioCompress-2 [7], GenCom-
press [8], the CTW+LZ algorithm [5], and DNACompress
[9]. Perhaps the best of these is DNACompress, which
employs the PatternHunter [12] sequence search algo-
rithm to discover patterns of approximate repeats or
approximate palindromic repeats in sequence data. DNA-
Compress achieved compression averaging 13.7% on a
sample set of DNA sequences and is substantially faster
than earlier algorithms. Grumbach and Tahi [7] allude to
a "vertical" mode of compression for compressing multi-
ple sequences in a database, however they do not elabo-
rate on how this might be accomplished.

While these single-sequence algorithms are interesting
from a theoretical point of view, and are certainly becom-
ing increasingly practical in the modern world of genome-
scale analysis, a great deal of everyday bioinformatics
work continues to entail the communication and storage
of multiple sequences, and the modest compression gains
afforded by these algorithms are ultimately not sufficient
to justify their adoption for large databases.

Compressing databases of biological sequences
Strelets and Lim [13] describe a program, SAGITTARIUS,
for compressing PIR-format [14] protein sequence data-
bases. Their system uses standard dictionary-style com-
pression of sequence entry metadata, and a novel
alignment-based compression strategy for the protein
sequence data itself. A small number of sequences is
maintained in memory as the reference sequence accumula-

tor, and each sequence in the database is aligned with each
sequence in this list. If any alignment produces a strong
match, the input sequence is recoded using symbols
describing insertions and deletions to enable recovery
from its close match in the accumulator; otherwise, the
sequence is output verbatim and added to the accumula-
tor, overwriting the oldest incumbent sequence if the
accumulator is full. Sequences to be output are com-
pressed using run-length encoding and Huffman encod-
ing, and the shorter of the two encodings is chosen. Thus
the accumulator represents a window of recently encoun-
tered interesting sequences. The authors set the size of the
accumulator at three sequences, and were able to achieve
2.50:1 compression, significantly better than PKZIP© [15]
at 2.13:1.

Strelets and Lim [13] were interested in producing a com-
pressed database that could be used interactively in much
the same way as the original database. This was facilitated
in part by the fact that their approach never requires recur-
sive decoding of sequences – each sequence is encoded in
terms of at most one other sequence, which is itself avail-
able "as-is," (i.e. not compressed in terms of another
sequence). While useful for interactive operations, it is
clear that avoiding recursive encoding must limit the over-
all level of compression obtained. Since we are targeting
maximum compression, coil differs from that of [13] in
this respect. Another difficulty arises in the assumption
that similar sequences are likely to appear near each other
in the input file. This is crucial in order to be able to limit
the size of the accumulator and thereby the runtime. The
authors found that increasing the size of the accumulator
past three sequences increased the runtime but made no
substantial improvement in compression, which
appeared to justify their assumption. Unfortunately,
while this neat localisation of similar sequences may have
been true of the PIR database in 1995, it is certainly not
true of the large nucleotide databases of today, and we
chose not to make this assumption.

Li, Jaroszewski and Godzik have taken a similar approach
to the related problem of producing non-redundant pro-
tein databases with their CD-HI [16] and CD-HIT [17]
packages. More recently, Li and Godzik have extended this
approach to DNA sequences with the cd-hit-est program
[18]. Their main advance over [13] is in employing short-
word filters to rapidly determine that two sequences can-
not be similar, which significantly reduces the number of
full alignments necessary. Despite impressive speed on
small-to-medium datasets, they report that clustering 6
billion ESTs at 95% similarity takes 139 hours [18].

The program nrdb [19] locates and removes exact dupli-
cate sequences from a DNA database in FASTA format.
While this program is clearly a step in the right direction,
Page 2 of 15
(page number not for citation purposes)


BMC Bioinformatics 2008, 9:242 http://www.biomedcentral.com/1471-2105/9/242
many sequences in a typical database are almost but not
quite exact duplicates of other sequences (perhaps differ-
ing at one or two positions), highlighting opportunities
for further improvements.

Compression and the maximum parsimony criterion
A phylogenetic tree is a Steiner tree estimating the evolu-
tionary history of a set of taxa (species, genes or individu-
als). Maximum parsimony is a criterion for building
phylogenetic trees from DNA sequences that aims to iden-
tify the tree or trees containing the fewest point mutations
along their edges, where a point mutation, or edit, is an
insertion or deletion of a single nucleotide or a substitu-
tion of one nucleotide for another. (Most implementa-
tions of maximum parsimony search consider only point
substitutions, for reasons of computational efficiency.)
We then note:

If we are given the complete sequence at one node of a tree, as
well as all edge mutations, we can reconstruct the sequences at
all the nodes.

Thus a maximum parsimony tree represents an optimal
solution to storing sequence data in the form of a list of
edit operations on a tree rooted at a single reference
sequence. This can be an efficient compression if the
sequences are closely related, so that the number of edit
operations is small in comparison with the total sequence
length. Within a large database, it is expected that there
will be large groups of closely related sequences – for
example, the DNA encoding a particular gene from many
different species. More precisely, we expect that many
sequences will be highly similar to at least one other
sequence in the database. If this is the case a considerable

saving in storage space can be achieved by identifying
such groups, determining good trees for them, and encod-
ing each group as a single root sequence plus a series of
"deltas" along the tree edges. We have called this
approach edit-tree coding. Figure 1 illustrates how
sequences within a database are processed according to
this scheme.

In practice, it soon becomes apparent that even a heuristic
maximum parsimony search on subsets of sequences is
not computationally feasible for a large database. The
standard maximum parsimony tree evaluation algorithm
requires all sequences to be aligned. Both alignment and
the subsequent parsimony searches are hard problems
[20,21]. Fortunately, when dealing with data compression
we are not concerned about exactly maximising some
function – our requirement is a method which is fast and
performs well on typical cases. A practical alternative to
maximum parsimony search is to construct an approxi-
mation to the minimum spanning tree on the sequences,
where the metric is the edit distance between two
sequences – the number of single-character insertions,
deletions or replacements required to transform one
sequence into the other. Unlike Steiner trees, minimum
spanning trees do not introduce new internal vertices, and
computation is fast: an algorithm having time complexity
almost linear in the number of edges exists [22]. The total
tree length is bounded by twice that of the maximum par-
simony tree. By judicious selection of algorithms and data
structures, we have developed heuristics and approxima-
tions that make this task feasible for databases having
sizes in the gigabyte range, despite having essentially
quadratic time complexity in the size of the database.

Goals of coil
Our goal was to develop a software package, coil, for com-
pressing and decompressing DNA sequence databases
based on edit-tree coding. The primary intention is to
reduce the bandwidth required to transmit large amounts
of DNA sequence data from a central repository to many
recipients, and also to reduce disk space requirements for
archival storage of such data. While it is desirable to ena-
ble efficient searching of a compressed database, and
progress has been made in this area [23-25], we have not
attempted to do so here. Instead, coil is geared towards
maximising compression ratios. This is achieved at the
expense of execution time and memory usage – but note
that the compression time represents a "one-off invest-
ment" whose cost is quickly amortised if the resulting
compressed file is transmitted many times. Decompres-
sion requires little memory and takes O(D) time for data
sets of size D nucleotides.

coil primarily targets sequence databases containing
many short sequences of roughly equal length, such as

Edit-tree coding of similar sequence groupsFigure 1
Edit-tree coding of similar sequence groups. Circles 
represent DNA sequences in a database; the straight-line dis-
tance between circles represents the edit distance between 
sequences. Initially (a) we are presented with the input data-
base. In the first step (b), groups of similar sequences are dis-
covered. In the second step (c), each group is edit-tree 
coded independently by determining a reasonable tree, 
selecting a root sequence (coloured black) and recording the 
necessary edits along each edge. Some sequences are not suf-
ficiently similar to any other sequence to be delta-encoded – 
these sequences will be recorded verbatim.

(a) (b) (c)
Page 3 of 15
(page number not for citation purposes)


BMC Bioinformatics 2008, 9:242 http://www.biomedcentral.com/1471-2105/9/242
Expressed Sequence Tag (EST) databases. Many such data-
bases exist: at the time of printing, GenBank contains 23
Gb of EST sequence data in FASTA format, comprising
43,380,458 sequences in total [1]. Targeting these data-
bases simplifies design and implementation: many oper-
ations on pairs of sequences take time quadratic in the
length of their operands, so shorter sequences are impor-
tant for high performance; also, assuming reasonably
short sequences means that little attention need be paid to
the intra-sequence positions of regions of similarity
between two sequences.

coil reads and write FASTA format databases [26]. This
simple format is easy to work with manually, easy to pro-
gram input and output routines for, and widely used.
Additionally, the format contains a minimum amount of
additional context information for each sequence, which
allows us to focus on compressing the sequence data.

Implementation
coil consists of a small group of C programs that perform
the compression and decompression steps described
below, as well as a Perl script which simplifies the com-
pression process by automating sequences of steps and
providing sensible defaults where helpful. Use of these
programs is described later.

Overview
Hereafter, unless otherwise qualified, D denotes the
number of nucleotides (characters) in a database, N the
number of sequences, and L = D/N the average length of
a sequence. As in the C language, the notation a % b is
used to indicate taking the remainder of a modulo b, for
some non-negative integer a and positive integer b.

Conceptually, the process of compressing a database with
coil proceeds through the following stages:

1. Creating a similarity graph that pairs sequences of high
similarity. The similarity graph is an edge-weighted undi-
rected graph in which vertices represent sequences and
edges exist between highly similar sequences, with edge
weights indicating similarity strength.

2. Extracting an encoding graph from the similarity graph.
The encoding graph is a set of rooted directed trees (for-
mally, arborescences) whose arcs correspond to a subset
of the edges in the similarity graph.

3. Encoding each tree in the encoding graph. For each tree,
the root sequence is stored verbatim (raw-encoded); an in-
order traversal is then used to delta-encode each other
sequence in the tree in terms of its parent sequence.

4. A multi-platform general-purpose compression pro-
gram, such as gzip or bzip2 [27], is applied to extract fur-
ther compression gains.

A typical usage pattern in a Unix-like environment would
be to use the tar archive program to collect the files of step
3 together, and pipe the resulting file through bzip2 -9
(the -9 command-line option requests maximum com-
pression).

Decompression of a coil archive amounts to inverting the
delta-encoding of the final compression stage: for each
encoded tree, the root sequence is written out, following
which an in-order traversal recovers every other sequence
using the (already recovered) parent sequence and the
encoded delta information. This takes place after the gen-
eral-purpose compression step is undone. Note that in
general, the order of sequences in the recovered FASTA file
will be different than in the original FASTA file; if this is
undesirable, program options can be set to restore the
original order (i.e. an exact copy will be produced).

All of these steps are explained in more detail below.

Edit distances and similarity graphs
A common way to quantify the similarity between two
strings a and b is to compute the Levenshtein distance: the
smallest number of single-character insertions, deletions
and substitutions required to transform a into b. Ideally,
we would compute exact distances between every pair of
sequences in the database and output a complete graph
with perfect similarity information. But since computa-
tion of the Levenshtein distance between two strings of
length m and n takes O(mn) time [28] in the general case,
a database of size D containing N roughly equal-length
sequences would require O(N2)O(D2/N2) = O(D2) com-
parisons. When database sizes are in the gigabyte range,
quadratic-time algorithms are not viable.

Instead, coil uses a more efficient related similarity meas-
ure derived by counting the number of length-k sub-
strings, or k-tuples, two strings have in common. For small
k, calculation of k-tuple similarity scores can be made very
fast by using a k-tuple index data structure (described
below) to obtain a list of all sequences in the database that
contain a given k-tuple in constant time.

The k-tuple index
A nucleotide (A, C, G or T) can be encoded as a 2-bit inte-
ger, and consequently a k-tuple of nucleotides has a natu-
ral representation as an integer of 2k bits. In coil, the
leftmost nucleotide occupies the most significant bits. The
k-tuple index data structure, which is prepared in a pre-
processing step using the program make_index, consists
of two files: a k-tuple sequence list file ending with the
Page 4 of 15
(page number not for citation purposes)


BMC Bioinformatics 2008, 9:242 http://www.biomedcentral.com/1471-2105/9/242
extension .ktl, which contains (D - N(k - 1))/s integer
sequence numbers (seqnums); and a k-tuple index file end-
ing with the extension .kti, which contains 4k integer off-
sets into the first file. s is a "slide" parameter used to
reduce the size of the k-tuple sequence list file, at the cost
of reduced accuracy: only k-tuples beginning at sequence
positions divisible by s are entered into the index. As Fig-
ure 2 shows, the ith entry in the k-tuple index file points
to the beginning of the list of seqnums that contain k-
tuple i, which continues until the seqnum list for the (i +
1)th entry begins.

These files are built in O(D + 4k) time using a bucket sort
algorithm that performs two passes over the raw sequence
data. The algorithm is similar to that used to build the k-
tuple indices used by SSAHA [29]. Note that unlike in
SSAHA, we do not record the intra-sequence positions of
k-tuples in the k-tuple index, nor do we ever record a given
seqnum more than once in a given k-tuple's seqnum list;
instead we rely on our assumption that the database con-
tains short sequences to ensure that there is a low proba-
bility of a sequence containing more than one instance of
a particular k-tuple. Should this not be the case, the effi-
cacy of the algorithm will be reduced, however correctness
will not be compromised.

The bucket sort algorithm requires both files to be able to
fit in memory simultaneously. If this is not possible,
make_index produces multiple pairs of output files: each
pair is an index on a segment of the database that will just
fit in the amount of memory specified.

It is worth mentioning that empirically, the sizes of seq-
num lists in a k-tuple index built from DNA sequence data
are highly nonuniform, with some k-tuples appearing sev-
eral orders of magnitude more frequently than others.
These k-tuples cause many spurious hits that slow down
the similarity graph construction step. coil follows the
smart practice described in [29] of completely eliminating
k-tuple seqnum lists that exceed a user-specified size: this
has the double effect of reducing index file sizes and dra-
matically improving the selectivity, and hence the speed,
of the next stage.

Creating the similarity graph
Constructing the similarity graph is the main compression
bottleneck in coil. The main contribution made by coil is
in engineering an algorithm to efficiently compute pair-
wise approximate sequence similarity scores using a com-
bination of the raw sequence data and the k-tuple index,
which is implemented in the find_edges program. We first
introduce a "naïve" comparison algorithm, and several
variants which each proved unsatisfactory.

The naïve algorithm is parameterised by k, s and b. b is a
small integer which is used to limit the total number of

edges in the similarity graph to bN; it is necessary to avoid
storing O(N2) edges in the similarity graph. In most test-
ing, b was set to 10. The pseudocode for the algorithm fol-
lows:

• For each query sequence q in the database:

 Create an empty linked list of (seqnum, hit
count) pairs, M.

 For each k-tuple t in q:

▪ Look up the list of sequences that contain t
starting at a position divisible by s using the k-
tuple index.

▪ Merge this list into M.

▪ Keep track of the number of times each
sequence in M has had a k-tuple in common
with q.

 For each pair (i, c) of the b pairs having the high-
est hit counts in M:

▪ Create the edge (q, i) in the similarity graph
and assign it weight c.

The seqnum lists read from the k-tuple index are in seq-
num order, and M is maintained in this order also. The
merge step is the usual list merge, except that whenever
pairs having the same seqnum are to be merged, the result

Example k-tuple index structure for k = 4Figure 2
Example k-tuple index structure for k = 4.

AAAA
AAAC
AAAG
AAAT
AACA
AACC
AACG
…

TTTT

0
1
2
3
4
5
6
7
8
9
10
11
12
…

675
676
677

0
1
2
3
4
5
6
…

255

3
4
9
24
2
4
5
1
5
2
6
21
22
…
8
10
13

K-tuple Index K-tuple Sequence List Table

Pos PosK-tup Seq #

0
4
7
9
14
20
31
…

675

Start
Page 5 of 15
(page number not for citation purposes)


BMC Bioinformatics 2008, 9:242 http://www.biomedcentral.com/1471-2105/9/242
is a single pair whose hit count is equal to the sum of the
hit counts of the arguments to the comparison. The intui-
tion here is that if two sequences share many k-tuples,
they are likely to be similar. In fact, it is relatively straight-
forward to show that if a string of length n is at edit dis-
tance d from another string, then the two strings must
share at least n - (d + 1)k + 1 k-tuples [30]; so we can rea-
sonably expect a correlation in the reverse direction. The
output of the algorithm is a representation of the similar-
ity graph in edge-list format.

Unfortunately, the above algorithm suffers from severe
performance degradation due to random k-tuple matches
clogging M and slowing down list merges. M soon fills
with many pairs containing low hit counts, representing
sequences that are not significantly similar to q but share
one or two k-tuples with q by chance. In fact it can be
shown that under reasonable assumptions about the dis-
tribution of k-tuples in the database, the repeated list
merging introduces an O(D3) factor into the running
time.

SSAHA [29] overcomes the clogging problem by choosing
k to be high enough that very few chance matches occur;
however this is only a viable approach if enough memory
is available as memory requirements are exponential in k.
The impressive search speeds described in [29] were
obtained on a computer with 16 Gb of RAM and with k set
to 14 or 15. Requiring this amount of memory for coil
would immediately put the program out of range of
almost all computers in use today.

Another way to ameliorate the situation is to convert the
list M into a form of hashtable by maintaining r separate
pair lists M0 ... Mr-1, and merging the seqnum list for the
ith k-tuple in q into the list Mi%r. After all k-tuples have
been scanned, a final merge step combines the r lists. Even
better, partition by seqnum rather than k-tuple position:
send each seqnum i to the list Mi%r. The latter technique is
more resilient to variations in seqnum list sizes. Choosing
r = 2h for some positive integer h enables fast calculation
of the remainder through bitwise operations. While these
modifications do improve the running time of the naïve
algorithm, as Table 1 shows, there remains much work to
be done before this algorithm will be feasible for gigabyte-
sized databases. We describe below a way to eliminate the
time spent processing unpromising hits

Letting go of perfection: the leaky move-to-front 
hashtable
The algorithms described in the preceding subsection all
compute the complete list of (seqnum, hit count) pairs for
a given query sequence q, including the "noise" matches
with small hit counts, even though we end up keeping

only the best b such matches. To avoid getting bogged
down with noise matches, we modify the seqnum-hash-
ing hashtable from the previous subsection by replacing
each of the r variable-length linked lists in the hashtable
with a small fixed-size array of size f. Instead of maintain-
ing these arrays in seqnum order, a move-to-front disci-
pline is used: whenever a seqnum i arrives, we scan the
array Mi%r for an occurrence. If it is found, it is moved to
the front of the array, its hit count is incremented and all
preceding elements are shunted down one position. If it is
not found, it is inserted at the front of the array with hit
count 1; all existing elements are shunted down one posi-
tion to make room, with the last pair being deleted
("pushed off the end").

Intuitively, the success of this algorithm hinges on the fol-
lowing key assumption:

If a database sequence is genuinely similar to the query
sequence, its seqnum will turn up often enough that it will
never be pushed off the end of the list.

There are several reasons for the improved performance of
this algorithm:

Frequently occurring seqnums are found more quickly and require 
fewer updates
A frequently occurring seqnum x is more likely to have
been recently referenced and hence is more likely near the
front of its array. Thus when x next occurs, the scan will
not need to proceed very far down the array. Also note that
only those elements that precede x in the array need to be
shunted back – later elements remain in their original
positions.

Table 1: Execution time for find_edges variations on a small 
dataset

Algorithm Parameters Execution Time (s)

SSAHA maxGap = 0, 118.58
maxInsert = 0
maxGap = 20, 118.72
maxInsert = 20

Basic 97.98
Batch merging c = 16 113.03

c = 32 84.35
c = 64 71.09

Recursive merging 80.55
Hashtable h = 12 56.02

h = 13 56.24
h = 14 57.92

The dataset used, month.est_mouse, is a monthly update of the 
Genbank Mouse EST dataset comprising 31,401 sequences having 
average length 438 nucleotides.
Page 6 of 15
(page number not for citation purposes)


BMC Bioinformatics 2008, 9:242 http://www.biomedcentral.com/1471-2105/9/242
Packing the seqnum and count values
A "side effect" of the hashtable structure is that there is no
need to explicitly record the low h bits of each seqnum in
each array, since every seqnum in array Mi must have these
bits equal to i. Thus these bits are available for other uses.
Since each array entry consists of a (seqnum, count) pair,
and since count rarely exceeds L/s, it makes sense to store
the count value in these bits. For our default choice of k
and s parameters, this is reasonable for h ≥ 10. (If it is
important to deal properly with the rare occasions that
that more 1023 k-tuples match between two sequences,
we can simply saturate the count at 2h - 1.)

Fixed-size multidimensional arrays require very low memory 
overhead
The leaky move-to-front hashtable data structure is a 2h ×
f array. Because of their highly regular structure, accessing
data items in fixed-size multidimensional arrays requires
only multiplications and additions using the fixed dimen-
sion sizes, and does not require any pointers or special list
termination symbols to be explicitly stored in memory,
reducing the memory allocation overhead.

Computers like fixed-size arrays
Linked lists are efficient in theory, but in practice, compu-
ter hardware has long been designed for efficient process-
ing of contiguous arrays of elements, and the "pointer-
chasing" inherent in working with linked lists inevitably
introduces comparatively large overheads. In particular,
items at consecutive positions in a linked list may occupy
widely separated memory addresses – a problem known
as poor spatial locality of reference. In these circumstances,
when iterating through the items in a linked list, memory
cache hardware cannot predict which bytes will be read or
written next, and performance suffers. In contrast, a scan
through an array accesses memory bytes in sequential
order, and will benefit from cache line fills that read con-
tiguous blocks of memory into cache.

Assuming pessimistically that every seqnum must endure
a full f comparison and move operations, analysis gives a
time complexity of O(fD2/s4k) for this algorithm.

Pentium 4 optimised version
Many modern CPUs use pipelining to increase instruction
throughput. We have developed an implementation of
the find_edges algorithm optimised for the heavily pipe-
lined Pentium 4 processor [see Additional file 1].

Extracting the encoding graph
Once a similarity graph has been created with find_edges,
the next step is to extract from it an encoding graph that
distinguishes groups of similar sequences and describes
how they are to be encoded. Then each group is com-
pressed independently. Both steps are performed by the

program encode. For the time being, we assume the avail-
ability of a subroutine for delta-encoding one sequence in
terms of another that produces a "black box" block of data
bytes; this algorithm is described in the subsection "Delta-
encoding Sequences".

First we note some structural properties of the encoding
graph. Each sequence in the database will be either raw-
encoded, or delta-encoded in terms of one other
sequence: this implies that each vertex in the encoding
graph will have at most one in-edge. Then by prohibiting
cycles it is easy to show that the encoding graph will be a
forest of directed trees, each having edges directed away
from a root vertex. Since the decision about which vertex
to choose as the root has little bearing on the speed or
compression level achieved for a tree, coil selects the low-
est-numbered sequence.

To be effective in compressing sequence databases, coil
needs to produce an encoding graph in which highly sim-
ilar sequences are linked by an edge whenever possible.
More precisely, we want to maximise the total similarity
score of the encoding graph, subject to the constraints that
it be a subgraph of the similarity graph, and also a forest.
This is the maximum spanning forest problem, which is
equivalent to the minimum spanning forest problem using
negative edge weights, which in turn is a generalisation of
the heavily studied minimum spanning tree (MST) prob-
lem. Happily, several algorithms exist for efficiently solv-
ing these problems [22,31,32].

Since the similarity graph produced by find_edges is very
sparse (containing at most bN edges) and is already in
edge-list format, we employ Kruskal's O(|E|log |E|) algo-
rithm [31]. Kruskal's algorithm is very simple to state:

1. Read in the similarity graph edge list.

2. Sort edges by similarity score.

3. For each edge in the sorted list:

• If this edge would not introduce a cycle, add it to the
encoding graph forest.

Importantly, both sorting and cycle-testing can be per-
formed efficiently. Edge sorting is accomplished in O(|E|
+ max(score)) time using bucket sort. (Note that the score
of an edge between sequences of lengths x and y is at most
min(x, y)/s). Cycle-testing is performed using the fast
union/find data structure described in [33]. This data
structure manages an equivalence relation on a set: here,
the classes are the connected components of the encoding
graph, which combine as edges are added. Determining
whether an edge would induce a cycle amounts to testing
Page 7 of 15
(page number not for citation purposes)


BMC Bioinformatics 2008, 9:242 http://www.biomedcentral.com/1471-2105/9/242
whether its two endpoints are in the same class. m such
tests can be performed on a set of size n in O(mα (n, m))
time [34], where α(n, m) is the extremely slow-growing
inverse Ackermann function: effectively constant time per
test.

Delta-encoding sequences
Once the encoding graph has been created, encoding of
the individual trees can begin. For each tree, the root
sequence is output verbatim, and an in-order tree traversal
then delta-encodes every other sequence in terms of its
parent. It is well-known that such a traversal can be used
to encode a rooted tree as a string containing only vertex
identifiers and parentheses. Conversion in both directions
can be accomplished without requiring random access to
the characters of the string, implying that an in-memory
tree data structure can be efficiently "streamed" to or from
a sequential storage medium (such as a disk file) in time
linear in the number of nodes.

An edit script is a list of edit operations, which we here take
to be single-character insertions, deletions and substitu-
tions. Our task is to find a minimal-length edit script for
converting one string a of length n into another string b of
length m. This problem can be solved in O(nm) time using
a straightforward dynamic programming approach, in
which we successively compute optimal edit scripts for
pairs of prefixes of a and b in terms of previously com-
puted solutions. Several algorithms exist that are asymp-
totically faster for certain input distributions [28,35]. In
particular, an algorithm of Myers [35] can solve a variant
of this problem in which only insertions and deletions are
allowed in O(nd) time and space, where d is the edit dis-
tance (and thus the size of the edit script). Since the
encoding stage of coil deals only with sequences already
deemed to be similar by heuristics, this algorithm was
chosen for implementation. Another attractive feature of
the Myers algorithm is that it considers possible edit
scripts in increasing order of edit distance, and can be ter-
minated when the edit distance reaches some predeter-
mined maximum distance dmax. In coil's encoding stage,
this is used to bound the runtime of the algorithm: if the
edit distance between a pair of sequences exceeds a user-
specified figure (defaulting to 150), the algorithm termi-
nates early and a trivial edit script having length a + b is
produced.

Once an edit script has been found, it must be compactly
encoded into data bytes. coil uses a simple scheme in
which the most significant bit (MSB) of a byte specifies
whether an insertion or deletion is to take place, and the
remaining seven bits specify the offset (with respect to the
source string) from the previous edit operation. If an edit
operation is more than 126 characters along from the pre-
vious edit operation, a special code byte, having its lowest

seven bits equal to 127, is emitted, indicating that the next
byte should be read and 126 should be added to that
byte's value to form the offset. This code byte may occur
multiple times, adding 126 to the total offset each time.
Since deletion operations identify character positions
within the source string while insertion operations iden-
tify positions between characters, special care must be
taken to handle string positions and offsets in a manner
that permits unambiguous decoding.

In the case of an insertion operation, the character to be
inserted is not recorded in-place but written to a separate
file. This breaks the "edit script as black box" design prin-
ciple, however separating the edit script and nucleotide
data streams in this way makes the distributions of bytes
in each stream more predictable, resulting in compression
gains that cannot be overlooked.

Although the encoding described is fairly compact, it is
clearly not optimal: for example, we expect position off-
sets to be tightly clustered around zero, implying that an
encoding in which lower offset values were represented
with fewer bits would yield higher space savings. How-
ever, this and any other detectable redundancy will be
eliminated when the coil archive files are passed through
an external general-purpose compression program.

Sequence buffering
We have developed a simple buffering system that enables
maximally efficient random access to sequence data [see
Additional file 2].

Incremental compression
Large sequence databases such as GenBank [36] are not
static. They are being updated daily, and there is a need for
database users to access the latest versions. The solution
found by most organisations distributing these databases
is to make available daily or weekly updates in the form of
deltas – lists of sequences added, changed or removed
from the original database release. End users who already
have the main database installed can download the
updates and apply these changes to their local database
copies to produce up-to-date versions. These database del-
tas are much smaller files, often less than 100 Mb in size,
and coil's usual mode of compression performs poorly on
such small files.

A common update performed on a database is the addi-
tion of one or more new sequences. coil therefore sup-
ports incremental compression: the ability to efficiently
encode one sequence database, the increment, in terms of
another baseline database. We presume a user who down-
loads a database delta already has the original baseline
database, so we can "refer back" to baseline sequences
from within the encoding graph of the compressed incre-
Page 8 of 15
(page number not for citation purposes)


BMC Bioinformatics 2008, 9:242 http://www.biomedcentral.com/1471-2105/9/242
ment. This approach makes available a large pool of can-
didate root sequences that can be used for efficient delta-
encoding of each sequence in an increment.

In the remainder of this section we will refer to the mode
of compression discussed in previous sections as stan-
dalone compression. For concreteness we will talk about
compressing an increment database named incr.fasta in
terms of a baseline database named base.fasta. It may be
helpful to first read the subsection "Using coil", which
describes the overall workflow and the individual files
read and written by the various programs in the coil pack-
age.

Implementation of incremental compression
Suppose that there are B sequences in the baseline data-
base base.fasta, and I sequences in the increment
incr.fasta. Compression of the baseline entails ordinary
standalone compression; the only difference is that the k-
tuple index files, and also the base.coil.seqnos file, need to
be retained for compression (though not for decompres-
sion) of the increment. To compress the increment, k-
tuple index files are produced from the increment data-
base itself. However, we require the seqnums of the incre-
ment to be distinct from those of the baseline, so the -i B
command-line option must be used with the make_index
and find_edges programs to offset the starting seqnum by
B.

The encode program is then run with the command-line
option -i base, to indicate that the input file should be
encoded in terms of the coil archive base. When this
option is chosen, the Kruskal maximum spanning forest
algorithm is modified to avoid adding an edge between
two components which both contain baseline sequences.
This is easy to accomplish, since all baseline sequences
have seqnums less than B, and as components are identi-
fied by their lowest-numbered seqnum any component
that contains a baseline sequence will be represented by
the seqnum of that sequence, effectively limiting the
involvement of baseline sequences to being the roots of
components in the encoding graph.

Once encode has produced an encoding graph, it needs
access to the sequence data so that delta-encoding can be
performed. Obtaining the sequence data for an increment
seqnum can be accomplished in the usual way (via the
sequence buffering system), but this is not the case for
baseline sequences.

The necessary baseline sequences are obtained from the
baseline coil archive by decompressing the entire baseline
database in-memory, but writing out only those baseline
sequences which are the roots of trees in the increment
encoding graph (which we call buds). During this step,

baseline sequences will be visited in decode order; by first
sorting the list of required baseline seqnums into this
order, extraction can take the form of a list merge. The
sorting step involves inverting the permutation of base-
line seqnums recorded in the base.coil.seqnos file (a lin-
ear-time and -space operation), hence the requirement
that this file be retained after compression of the baseline
database. Then the encode program traverses and encodes
all trees rooted in baseline sequences in decode order. For
each such tree, the (strictly increasing) position of the
baseline root sequence in the decode order is written to
the file incr.bud, which is included in the increment
archive to facilitate decompression. Finally, the program
traverses and encodes all trees rooted in increment
sequences in the usual fashion.

When decompressing an increment with the program
decode, the command-line option -i base is used to spec-
ify that the coil archive base should be used as the base-
line. decode first decompresses the baseline sequences at
the positions listed in incr.bud and uses these sequences
as roots for decoding the initial segment of the increment
archive; then the remaining trees are decompressed as
usual.

Using coil
Compressing a FASTA database using coil involves run-
ning several C programs that work together to produce a
number of output files. Some of these files, collectively
termed the coil archive, are required for recovering the
original data, while the remainder may be discarded once
compression is complete. Alternatively, the user may run
a Perl script which automates these steps. The final step
requires the files comprising the coil archive to be com-
pressed by a general-purpose compression program. Fig-
ure 3 shows the complete process of using coil for
standalone compression of a database; incremental com-
pression is similar, but requires that all output files pro-
duced during baseline compression (including the
baseline coil archive itself) are also available. Incremental
compression produces one additional, small output file
ending with the extension .bud which must be included in
the final archive.

Decompression
Decompressing a coil archive is simple: first "undo" the
general-purpose compression used to compress the
archive, then run the program decode. Decompression of
an increment requires the name of the baseline database
be specified on the command line with the -i switch. The
process takes O(D) time and requires O(max(seqLen) *
max(treeDepth)) memory. max(treeDepth) is typically
small, but could be bounded using a simple adjustment to
the Kruskal algorithm if necessary.
Page 9 of 15
(page number not for citation purposes)


BMC Bioinformatics 2008, 9:242 http://www.biomedcentral.com/1471-2105/9/242
If the x.coil.seqnos file has been stored with the coil
archive, then it is possible to recover the original sequence
order at decompression time using the -x switch to the
decode program; otherwise, sequences in the FASTA out-
put file will appear in a different order.

For maximum portability across platforms, all files con-
taining binary integer data use the little-endian storage
format. All reading and writing of such files occurs
through platform-specific load_vector() and
save_vector() functions.

All programs accept the -h and --help switches, which can
be used to display usage information. A brief description
of each program can also be found in the README.txt file
included in the coil software package.

Results and discussion
To investigate the compression ratio achieved and run-
ning time required by coil for datasets of various sizes,
tests were performed on randomly chosen subsets of
sequences from a version of the GenBank est_mouse data-
base, which contains 1,729,518,522 nucleotides in
3,852,398 sequences. Twelve dataset sizes were examined,
with three test datasets produced for each size. Each data-
set having a name of the form emsn contains 3,852,398 ×
n/100 sequences randomly selected from the est_mouse
database. For the 100% size level, a single dataset (the
original est_mouse database) was run three times, giving
an indication of the noise level involved in execution time
measurements.

A number of alternative compression programs were
tested in addition to coil:

1. bz2: The general-purpose compressor bzip2 [27] with
compression level 9.

2. nrdb+bz2: Elimination of duplicate sequences with the
nrdb program, followed by bzip2 with compression level
9.

3. PPMdi: The PPMd general-purpose compressor variant
I described in [37], with model order 8 and RAM usage
256 Mb (the most allowed by the program).

4. 7z: the LZMA compression mode of the freely available
general-purpose compression program 7-Zip [38]. This
was the only other program we found that was capable of
utilising 1 Gb of RAM during compression.

Compression ratio vs. DB sizeFigure 4
Compression ratio vs. DB size. The compression ratios 
of all tested algorithms increase as the input size increases; 
those of coil and 7-Zip increase faster than the rest.

0

0.05

0.1

0.15

0.2

0.25

0.3

0.01 0.1 1

Fraction of est_mouse sequences

C
o

m
p

re
ss

ed
 s

iz
e 

/ O
ri

g
in

al
 s

iz
e

bz2

7z

PPMdi

coil

Using coil to compress a FASTA databaseFigure 3
Using coil to compress a FASTA database. As few 
two-file k-tuple index segments are produced as memory 
allows.

x.seq

make_index

x1.ktl

x1.kti

x2.ktl

x2.kti

x3.ktl

x3.kti

find_edges

find_edges

find_edges

x1.edges x2.edges x3.edges

x.fastaextract_seqs

combine_edges

x.edges

x.names

x.nidx

encode x.coil.ins

x.coil.child

x.coil.names

x.coil.seq

x.coil.es

x.coil.tar.bz2

x.coil.seqnos

select_lines

tar

bzip2

x.idx
Page 10 of 15
(page number not for citation purposes)


BMC Bioinformatics 2008, 9:242 http://www.biomedcentral.com/1471-2105/9/242
We also attempted to compress our datasets with the pro-
gram DNACompress [9], however we found that we were
unable to compress datasets larger than 14.5 million
bases using this program. Unfortunately this is smaller
than the smallest dataset we used in testing, and we were
forced to abandon this attempt.

The tests on the emsn datasets specified restrictions for 7-
Zip and coil to use at most 1 Gb of RAM. It should be
noted that the other programs all use substantially less
memory than this; in particular, PPMdi is limited to 256
Mb. To enable a fairer comparison with the PPMdi pro-
gram, a further series of tests was carried out on the 140
Mb FASTA dataset rfam_full using 256 Mb settings for
each program, and the -x option to coil.pl to enforce in-
order sequence recovery. Due to problems attempting to
compile nrdb on Windows, the nrdb+bz2 measurements
were performed on a different computer: a Linux 3.2 GHz
Xeon machine with 4 Gb of RAM. To measure decompres-
sion speed for nrdb+bz2, a simple C program, unnrdb,
was written to expand the multi-header FASTA files pro-
duced by nrdb.

All coil runs used the parameter values k = 12, s = 8, f = 4
and h = 10, and were run on a 2 GHz Intel Core 2 Duo
computer with 2 Gb RAM running Microsoft Windows
XP. Most parameter values were chosen by earlier experi-
mentation, however the choice of f and h received extra
attention. Since it is important for the speed of the
find_edges program that its hashtable data structure fit in
cache memory, and it is not obvious how to trade off the
f and h parameters for a fixed memory size, preliminary
testing was conducted with several values of these param-
eters, suggesting (f = 4, h = 10) is best for the case where
the hashtable is limited to 16 Kb in size – small enough to
fit in the first-level cache of any modern computer system.

Table 2 shows the sizes of the resulting compressed files.
We immediately see that it is a race between coil and 7-
Zip's LZMA compression mode: these two compressors
easily outstrip all others, with the gap widening as file
sizes increase. This is depicted graphically in figure 4. It
appears that the 1 Gb of RAM available to both these com-
pressors makes a big difference when compressing files
containing many sparse repeats. There is never more than
a 6% difference in file sizes between these two compres-
sors, with 7-Zip performing best on smaller files. coil
edges out 7-Zip on the larger files, eventually claiming a
5% improvement on the largest dataset tested, ems100.

Looking at the execution times in Table 3, a similar tran-
sition takes place: coil is faster than 7-Zip up until around
ems50, at which point the quadratic nature of find_edges
starts to dominate. coil compresses the ems25 datasets
faster and better than does 7-Zip by a small margin.

Finally, the decompression times shown in Table 4 show
that coil is somewhat slower than the other programs,
though still essentially linear-time as expected. Not shown
in the table is coil's frugal memory usage during decom-
pression – the maximum memory usage while decom-
pressing ems100 is just 4.5 Mb, in comparison to the 89
Mb used by 7-Zip and the 270 Mb used by PPMdi.

With respect to the rfam_full datasets, coil outperformed
the nearest competition – again, 7-Zip – by around 3% in
terms of compression ratio, though requiring more than
twice as much time to do so. PPMdi performed poorly,
producing a file more than twice the size of that produced
by coil or 7-Zip. This is especially surprising given that
these other programs were operating with the same 256
Mb RAM constraints as PPMdi for this dataset. bzip2 does
substantially better with only 8 Mb of RAM at its disposal.

Surprisingly, although the optimised Pentium 4 version of
find_edges produced a speed improvement of 25% on the
Pentium 4 computer on which we performed initial test-
ing, using this version of the program actually decreased
performance by 6% on the Core 2 Duo platform. Only
one test was run using this version of the program, indi-
cated by a row with an asterisk in Tables 2, 3 and 4; all
other results shown use the regular version of find_edges.

Conclusion
We have demonstrated that the concept of edit-tree cod-
ing can be applied to produce a practical compression tool
for sequence databases. The execution time required is not
negligible and appears to grow quadratically with data-
base size, but adequate compression on large EST data-
bases can nevertheless be achieved on "everyday" modern
computers. Furthermore, concern over compression time
diminishes when it is considered to be amortised over the
many decompressions that may take place in the targeted
field of one-source-many-sinks operations. Decompression
is acceptably fast, uses very little memory and can be per-
formed on any computer with a C compiler. Source code
portability and binary compatibility of compressed files
has been tested on two widely used platforms, Linux and
Win32.

There remains a wide scope for experimentation with coil
and fine-tuning of algorithms and parameters. For exam-
ple, one avenue not pursued here is the extent to which fil-
tering of common k-tuples affects execution time and
matching accuracy. It may be that the most commonly
occurring 80% of k-tuples can be removed without dra-
matically affecting overall compression. While this kind
of search space pruning would never be acceptable in a
program like SSAHA that is specifically designed to find
matches between sequences, we only care about accuracy
Page 11 of 15
(page number not for citation purposes)


BMC Bioinformatics 2008, 9:242 http://www.biomedcentral.com/1471-2105/9/242
of sequence matches where it noticeably improves com-
pression.

Availability and requirements
Project name: coil

Project home page: http://awcmee.massey.ac.nz/Data/
wtwhite/coil-1.1.2

Operating system(s): Tested on Windows and Linux.
Binaries are additionally provided for Windows. Source
code should compile in any UNIX-like environment.

Programming languages: ANSI C, Perl

Other requirements: Perl 5.6 or higher

Licence: BSD-style license

Table 2: Compressed file sizes

Dataset FASTA bz2 nrdb+bz2 7z PPMdi coil

ems1 23292780 5876747 5871445 4870989 5331953 4990193
23199910 5853780 5852865 4854519 5311350 4981279
23201245 5852837 5852772 4857631 5312411 4988747

ems2 46519702 11576074 11574420 9057588 10475531 9432789
46428669 11557030 11556573 9023980 10454826 9410376
46390115 11547516 11549117 9036246 10445740 9426594

ems3 69631679 17211495 17205793 12922145 15537092 13607729
69647486 17212318 17208461 12907737 15543489 13592739
69715954 17231912 17225610 12920294 15558845 13623246

ems4 92905691 22841127 22810035 16600091 20601712 17625302
93012024 22868732 22849091 16611294 20629724 17655369
92850447 22813494 22799324 16587471 20585812 17584008

ems5 116125238 28428297 28415051 20245345 25636473 21509065
116249077 28451622 28426520 20260429 25663621 21547174
116117128 28413464 28397742 20239745 25630456 21496207

ems10 232365230 56136032 56054164 37932764 50662993 39774087
232226017 56101887 56085818 37910566 50643774 39711435
232230440 56099503 56030860 37871106 50622855 39685294

ems15 348404276 83539894 83461996 55591757 75411889 56758484
348435883 83529794 83463158 55594352 75435650 56771053
348292392 83453434 83396104 55580937 75374710 56768193

ems20 464825178 110838776 110755872 72989089 100113255 72984372
464778933 110777795 110650470 72991749 100083039 73004561
464532828 110766213 110653180 72918789 100046482 72978434

ems25 581105516 137940393 137814551 89636246 124600275 88816000
580758935 137898843 137748733 89647136 124521398 88829572
580693026 137884675 137756070 89594767 124526386 88745435

ems50 1161787240 272394718 271857439 169833915 244302747 164139098
1161908810 272481687 271896055 169824808 244355206 164069331
1161582289 272310746 271844248 169812165 244255108 164093038

ems75 1742471477 405262890 404293340 247835911 362403056 236517596
1742664959 405243466 404268271 247921410 362419128 236506851
1742458336 405281768 404397179 247684455 362394809 236572552

ems100 2323234744 533757352 324292321 478735224 308211386
2323234744 533757352 324292321 478735224 308211685
2323234744 533757352 324292321 478735224 308211677

ems100* 2323234744 308212275
rfam_full 140518668 4413613 4113889 9504648 3995880

140518668 4413613 4113889 9504648 3996447
140518668 4413613 4113889 9504648 3995925

All sizes are in bytes. The FASTA column shows the size of the original uncompressed FASTA file. The smallest file in each row is shown in bold. * 
This row shows the result of using version of find_edges optimised for the Pentium 4. nrdb+bz2 failed to compress the ems100 dataset because the 
size of the FASTA file exceeded 2 Gb. All coil runs performed on the rfam_full dataset used the -x option to enable in-order recovery of sequences. 
nrdb+bz2 was not used with the rfam_full dataset because it is incapable of restoring this order.
Page 12 of 15
(page number not for citation purposes)

http://awcmee.massey.ac.nz/Data/wtwhite/coil-1.1.2
http://awcmee.massey.ac.nz/Data/wtwhite/coil-1.1.2


BMC Bioinformatics 2008, 9:242 http://www.biomedcentral.com/1471-2105/9/242
Abbreviations
EST: Expressed Sequence Tag. In order to produce a pro-
tein, a cell first copies the segment of DNA encoding that
protein (the gene) to a complementary messenger RNA
molecule. ESTs are DNA sequences obtained by extracting
and sequencing the messenger RNA molecules from a cell.
Because only a single sequence read is performed on each
molecule, ESTs are limited to approximately 800 bases in
length. 

FASTA: A simple text file format for storing multiple DNA
or protein sequences. Each sequence begins with a single
line starting with the character ">" and containing the
sequence name, followed by any number of lines contain-
ing the sequence data. 

MST: Minimum Spanning Tree. Given a connected, edge-
weighted graph, a minimum spanning tree of the graph is
a subgraph that (a) contains all vertices of the graph, (b)

Table 3: Compression execution time

coil

Dataset bz2 nrdb+ bz2 7z PPMdi find_edges encode tar+bz other total

ems1 5.6 7.7 43.3 5.6 3.5 9.3 1.2 10.8 24.8
5.5 9.2 48.2 4.7 3.6 8.1 1.0 10.4 23.1
5.7 10.2 43.5 4.3 3.6 8.2 1.2 12.5 25.5

ems2 10.3 15.1 96.5 9.9 9.5 20.0 1.0 19.1 49.5
10.3 17.7 101.4 8.5 9.7 20.2 1.0 19.0 49.8
10.1 16.8 95.6 8.3 9.7 20.3 1.1 20.8 51.9

ems3 15.2 22.6 154.8 14.7 17.0 36.5 1.9 27.7 83.1
16.7 24.6 162.9 12.7 17.3 35.4 2.2 28.0 82.9
15.4 22.4 154.2 12.8 17.3 34.5 3.1 29.2 84.2

ems4 20.3 32.0 216.5 20.0 25.8 50.6 3.6 34.3 114.3
20.5 33.4 221.1 17.4 26.7 50.3 3.0 37.1 117.1
20.0 31.1 215.0 17.0 26.0 49.1 3.3 39.3 117.7

ems5 30.3 39.1 276.7 25.3 35.5 65.4 4.2 42.7 147.7
25.5 43.5 280.4 21.5 35.7 65.5 4.1 45.2 150.5
25.3 38.7 275.9 21.4 35.8 64.6 4.1 46.2 150.7

ems10 62.5 85.1 573.8 49.8 100.5 179.3 8.4 100.4 388.6
60.7 88.0 580.8 45.5 102.1 176.0 9.0 84.4 371.4
50.6 80.1 575.4 43.5 100.5 160.8 9.4 87.7 358.4

ems15 94.3 117.8 871.1 69.0 197.9 271.8 12.6 118.2 600.6
76.7 136.5 876.5 64.7 198.5 276.9 13.7 130.5 619.5
89.8 119.6 869.5 64.5 196.1 275.8 13.8 133.4 619.1

ems20 101.5 169.9 1163.0 92.8 317.1 393.5 16.7 176.0 903.5
101.7 179.7 1169.3 86.9 321.6 393.5 18.5 212.2 945.8
120.5 158.8 1161.3 84.9 319.6 399.7 16.9 215.6 951.8

ems25 133.0 207.7 1482.2 116.0 471.7 503.2 22.5 280.8 1278.1
152.3 220.7 1438.9 105.9 470.9 467.2 22.3 218.4 1178.8
171.0 196.3 1456.4 106.0 468.0 504.4 23.4 248.0 1243.8

ems50 306.2 411.2 2882.0 215.4 1657.4 1172.3 105.4 716.3 3651.4
340.0 452.4 2893.3 209.2 1658.2 1170.7 104.5 583.7 3517.1
291.1 411.6 2888.3 207.2 1655.5 1174.7 107.9 671.8 3609.9

ems75 500.7 712.4 4328.8 314.4 3517.1 1814.9 167.8 1173.4 6673.2
506.8 618.1 4304.5 311.9 3502.1 1810.2 164.7 992.9 6469.8
508.7 593.5 4298.9 317.7 3490.7 1798.6 165.0 1116.4 6570.6

ems100 668.6 5760.8 408.5 6064.4 2552.4 223.9 1421.8 10262.6
634.1 5707.5 404.1 6042.3 2524.8 219.3 1429.0 10215.3
689.2 5773.6 403.3 6114.1 2496.4 217.5 1546.2 10374.1

ems100* 6446.3 2515.4 218.6 1505.8 10686.1
rfam_full 32.8 75.8 7.9 114.8 12.8 4.0 40.7 172.3

29.6 75.3 7.9 113.9 12.3 4.3 38.2 168.7
29.6 75.5 7.8 114.5 12.4 4.2 36.0 167.1

All durations are in seconds. The rightmost five columns break down the execution of coil by its main component programs; the "other" column 
includes the time needed for the programs extract_seqs, make_index and select_lines.
*This row shows the result of using the Pentium 4-optimised version of find_edges – surprisingly, this version of find_edges is actually about 6% 
slower than the original version on this CPU.
Page 13 of 15
(page number not for citation purposes)


BMC Bioinformatics 2008, 9:242 http://www.biomedcentral.com/1471-2105/9/242
retains connectivity and (c) has minimum total weight
among all such subgraphs. It follows that in a graph with
positive edge weights, such a subgraph is always a tree.

Authors' contributions
WTJW conceived the concept of edit-tree coding, designed
and implemented the coil software, performed perform-
ance measurement, and produced an early draft of the
manuscript. MDH provided design advice and made sig-
nificant contributions to the final version of the manu-
script.

Additional material

Acknowledgements
The authors would like to acknowledge the helpful comments of D. Penny 
in preparing this manuscript.

References
1. NCBI: NCBI-GenBank Flat File Release 159 Release Notes.

[ftp://ftp.ncbi.nih.gov/genbank/release.notes/gb159.release.notes].
2. NCBI News: GenBank Passes the 100 Gigabase Mark.  NCBI

News  [http://www.ncbi.nlm.nih.gov/Web/Newsltr/V14N2/
100gig.html].

3. Ziv J, Lempel A: Universal Algorithm for Sequential Data
Compression.  IEEE Transactions on Information Theory 1977,
23:337-343.

4. Gailly J, Adler M: gzip (GNU zip) compression utility.   [http://
www.gnu.org/software/gzip/].

5. Matsumoto T, Sadakane K, Imai H: Biological sequence compres-
sion algorithms: December 18-19; Tokyo.  Universal Academy
Press; 2000:43-52. 

6. Grumbach S, Tahi F: Compression of DNA sequences: 30
March-2 April; Snowbird, Utah.  Edited by: Storer JA and Cohn
M. IEEE Computer Society Press; 1993:340-350. 

7. Grumbach S, Tahi F: A New Challenge for Compression Algo-
rithms - Genetic Sequences.  Inf Process Manage 1994,
30:875-886.

8. Chen X, Kwong S, Li M: A compression algorithm for DNA
sequences.  IEEE Engineering in Medicine and Biology Magazine 2001,
20:61-66.

9. Chen X, Li M, Ma B, Tromp J: DNACompress: fast and effective
DNA sequence compression.  Bioinformatics 2002, 18:1696-1698.

10. Li M, Badger JH, Chen X, Kwong S, Kearney P, Zhang HY: An infor-
mation-based sequence distance and its application to whole
mitochondrial genome phylogeny.  Bioinformatics 2001,
17:149-154.

11. Kocsor A, Kertesz-Farkas A, Kajan L, Pongor S: Application of
compression-based distance measures to protein sequence
classification: a methodological study.  Bioinformatics 2006,
22:407-412.

12. Ma B, Tromp J, Li M: PatternHunter: faster and more sensitive
homology search.  Bioinformatics 2002, 18:440-445.

13. Strelets VB, Lim HA: Compression of Protein-Sequence Data-
bases.  Comput Appl Biosci 1995, 11:557-561.

14. Wu CH, Yeh LSL, Huang HZ, Arminski L, Castro-Alvear J, Chen YX,
Hu ZZ, Kourtesis P, Ledley RS, Suzek BE, Vinayaka CR, Zhang J,
Barker WC: The Protein Information Resource.  Nucleic Acids
Res 2003, 31:345-347.

15. Katz P: PKZIP.  1.1th edition. 1990 [http://www.pkware.com/]. Mil-
waukee, WI, USA, PKWARE, Inc.

16. Li WZ, Jaroszewski L, Godzik A: Clustering of highly homolo-
gous sequences to reduce the size of large protein databases.
Bioinformatics 2001, 17:282-283.

Additional file 1
Appendix 1 – Pentium IV optimised find_edges. Describes the version of 
the find_edges program optimised for the Pentium IV processor.
Click here for file
[http://www.biomedcentral.com/content/supplementary/1471-
2105-9-242-S1.doc]

Additional file 2
Appendix 2 – Sequence Buffering System. Describes the system used for 
efficiently obtaining random access to sequence data in memory-con-
strained environment.
Click here for file
[http://www.biomedcentral.com/content/supplementary/1471-
2105-9-242-S2.doc]

Table 4: Decompression execution time

Dataset bz2 nrdb+bz2 7z PPMdi coil

ems1 1.8 3.0 1.5 4.6 3.6
1.8 3.1 1.5 3.9 4.5
1.8 3.1 1.5 4.3 4.5

ems2 3.5 6.2 3.2 8.7 6.5
3.5 6.1 3.2 7.9 6.6
3.4 6.0 3.1 8.6 7.1

ems3 5.2 9.0 4.8 13.2 10.4
5.2 9.3 4.9 11.7 11.0
5.2 9.0 4.9 13.2 10.3

ems4 6.9 12.0 6.5 17.5 14.3
7.0 12.3 6.4 15.6 13.6
6.9 12.0 6.5 17.6 13.7

ems5 8.7 15.2 7.8 22.1 16.8
8.6 15.3 7.3 19.7 17.7
8.6 14.9 8.1 22.0 18.3

ems10 17.0 29.8 15.1 44.9 36.3
17.0 30.6 14.8 40.1 36.8
17.1 29.9 17.6 44.5 36.9

ems15 25.2 44.6 22.1 65.9 52.3
25.5 46.0 22.0 59.1 53.7
25.5 44.4 24.4 65.5 52.8

ems20 33.8 59.5 30.3 87.1 68.2
34.0 60.7 29.8 78.1 70.2
34.1 59.4 32.0 86.2 69.1

ems25 41.9 74.4 36.7 107.6 91.1
42.1 76.3 38.9 100.9 85.9
42.4 74.0 36.2 106.7 86.8

ems50 126.6 147.7 71.7 210.4 286.1
128.6 152.0 71.9 202.8 274.9
129.0 148.4 74.5 209.7 280.9

ems75 187.7 223.2 110.3 312.0 511.3
190.5 228.3 111.3 306.7 471.3
191.2 221.5 116.3 352.4 464.9

ems100 247.1 142.1 324.2 646.1
248.6 137.6 404.9 674.2
252.4 143.5 531.9 700.8

ems100* 649.9
rfam_full 9.6 7.9 9.1 60.7

6.3 6.4 9.1 59.9
6.1 6.2 9.1 60.2

All durations are in seconds. *This row shows the result of using the 
Pentium 4-optimised version of find_edges.
Page 14 of 15
(page number not for citation purposes)

http://www.biomedcentral.com/content/supplementary/1471-2105-9-242-S1.doc
http://www.biomedcentral.com/content/supplementary/1471-2105-9-242-S2.doc
ftp://ftp.ncbi.nih.gov/genbank/release.notes/gb159.release.notes
http://www.ncbi.nlm.nih.gov/Web/Newsltr/V14N2/100gig.html
http://www.ncbi.nlm.nih.gov/Web/Newsltr/V14N2/100gig.html
http://www.gnu.org/software/gzip/
http://www.gnu.org/software/gzip/
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=12490460
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=12490460
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=11238070
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=11238070
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=11238070
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=16317070
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=16317070
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=16317070
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=11934743
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=11934743
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=8590180
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=8590180
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=12520019
http://www.pkware.com/
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=11294794
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=11294794


BMC Bioinformatics 2008, 9:242 http://www.biomedcentral.com/1471-2105/9/242
Publish with BioMed Central   and  every 
scientist can read your work free of charge

"BioMed Central will be the most significant development for 
disseminating the results of biomedical research in our lifetime."

Sir Paul Nurse, Cancer Research UK

Your research papers will be:

available free of charge to the entire biomedical community

peer reviewed and published immediately upon acceptance

cited in PubMed and archived on PubMed Central 

yours — you keep the copyright

Submit your manuscript here:
http://www.biomedcentral.com/info/publishing_adv.asp

BioMedcentral

17. Li WZ, Jaroszewski L, Godzik A: Tolerating some redundancy
significantly speeds up clustering of large protein databases.
Bioinformatics 2002, 18:77-82.

18. Li WZ, Godzik A: Cd-hit: a fast program for clustering and
comparing large sets of protein or nucleotide sequences.  Bio-
informatics 2006, 22:1658-1659.

19. nrdb   [http://blast.wustl.edu/pub/nrdb/]
20. Thompson JD, Higgins DG, Gibson TJ: Clustal-W - Improving the

Sensitivity of Progressive Multiple Sequence Alignment
through Sequence Weighting, Position-Specific Gap Penal-
ties and Weight Matrix Choice.  Nucleic Acids Res 1994,
22:4673-4680.

21. Foulds LR, Graham RL: The Steiner problem in phylogeny is
NP-complete.  Advances in Applied Mathematics 1982, 3:43-49.

22. Chazelle B: A minimum spanning tree algorithm with Inverse
Ackermann type complexity.  Journal of the ACM 2000,
47:1028-1047.

23. Ferragina P, Manzini G: Indexing compressed text.  J ACM 2005,
52:552-581.

24. Russo LMS, Oliveira AL: A compressed self-index using a Ziv-
Lempel dictionary.  In String Processing and Information Retrieval, Pro-
ceedings Volume 4209. Berlin, SPRINGER-VERLAG BERLIN;
2006:163-180. 

25. Foschini L, Grossi R, Gupta A, Vitter JS: When indexing equals
compression: Experiments with compressing suffix arrays
and applications.  ACM Trans Algorithms 2006, 2:611-639.

26. Lipman DJ, Pearson WR: Rapid and Sensitive Protein Similarity
Searches.  Science 1985, 227:1435-1441.

27. Seward J: bzip2 and libbzip2 - A program and library for data
compression.  1.0.3 edition. 1997.

28. Hunt JW, Szymanski TG: A Fast Algorithm for Computing
Longest Common Subsequences.  Communications of the ACM
1977, 20:350-353.

29. Ning ZM, Cox AJ, Mullikin JC: SSAHA: A fast search method for
large DNA databases.  Genome Res 2001, 11:1725-1729.

30. Burkhardt S, Karkkainen J: One-gapped q-gram filters for Leven-
shtein distance.  In Combinatorial Pattern Matching Volume 2373. Ber-
lin, SPRINGER-VERLAG BERLIN; 2002:225-234. 

31. Kruskal JB Jr.: On the Shortest Spanning Subtree of a Graph
and the Traveling Salesman Problem.  Proceedings of the Ameri-
can Mathematical Society 1956, 7:48-50.

32. Prim RC: Shortest Connection Networks and Some General-
izations.  Bell System Technical Journal 1957, 36:1389-1401.

33. Moret B, Shapiro H: Algorithms from P to NP: Design and Effi-
ciency.  Redwood City, CA, Benjamin/Cummings; 1991. 

34. Tarjan RE: Efficiency of a Good but Not Linear Set Union
Algorithm.  J ACM 1975, 22:215-225.

35. Myers EW: An O(ND) Difference Algorithm and its Varia-
tions.  Algorithmica 1986, 1:251-266.

36. GenBank Sequence Database   [http://www.ncbi.nlm.nih.gov/
Genbank/index.html]

37. Shkarin D: PPM: One Step to Practicality.  2002:202-211.
38. 7-Zip   [http://www.7-zip.org]
Page 15 of 15
(page number not for citation purposes)

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=11836214
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=11836214
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=16731699
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=16731699
http://blast.wustl.edu/pub/nrdb/
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=7984417
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=7984417
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=7984417
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=2983426
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=2983426
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=11591649
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=11591649
http://www.ncbi.nlm.nih.gov/Genbank/index.html
http://www.ncbi.nlm.nih.gov/Genbank/index.html
http://www.7-zip.org
http://www.biomedcentral.com/
http://www.biomedcentral.com/info/publishing_adv.asp
http://www.biomedcentral.com/

	Coverpage template_cc license
	White-2008-Compressing DNA sequ.pdf
	Abstract
	Background
	Results
	Conclusion

	Background
	Compressing individual biological sequences
	Compressing databases of biological sequences
	Compression and the maximum parsimony criterion
	Goals of coil

	Implementation
	Overview
	Edit distances and similarity graphs
	The k-tuple index
	Creating the similarity graph
	Letting go of perfection: the leaky move-to-front hashtable
	Frequently occurring seqnums are found more quickly and require fewer updates
	Packing the seqnum and count values
	Fixed-size multidimensional arrays require very low memory overhead
	Computers like fixed-size arrays

	Pentium 4 optimised version
	Extracting the encoding graph
	Delta-encoding sequences
	Sequence buffering
	Incremental compression
	Implementation of incremental compression
	Using coil
	Decompression

	Results and discussion
	Conclusion
	Availability and requirements
	Abbreviations
	Authors' contributions
	Additional material
	Acknowledgements
	References