Compressing DNA sequence databases with coil

White, W. Timothy J.; Hendy, Michael D.

dc.contributor.author	White, W. Timothy J.
dc.contributor.author	Hendy, Michael D.
dc.date.accessioned	2010-11-23T03:35:31Z
dc.date.accessioned	2016-03-06T22:26:12Z
dc.date.accessioned	2016-09-07T13:57:51Z
dc.date.available	NO_RESTRICTION	en_US
dc.date.available	2010-11-23T03:35:31Z
dc.date.available	2016-03-06T22:26:12Z
dc.date.available	2016-09-07T13:57:51Z
dc.date.issued	2008-05-20
dc.description.abstract	Background: Publicly available DNA sequence databases such as GenBank are large, and are growing at an exponential rate. The sheer volume of data being dealt with presents serious storage and data communications problems. Currently, sequence data is usually kept in large "flat files," which are then compressed using standard Lempel-Ziv (gzip) compression – an approach which rarely achieves good compression ratios. While much research has been done on compressing individual DNA sequences, surprisingly little has focused on the compression of entire databases of such sequences. In this study we introduce the sequence database compression software coil. Results: We have designed and implemented a portable software package, coil, for compressing and decompressing DNA sequence databases based on the idea of edit-tree coding. coil is geared towards achieving high compression ratios at the expense of execution time and memory usage during compression – the compression time represents a "one-off investment" whose cost is quickly amortised if the resulting compressed file is transmitted many times. Decompression requires little memory and is extremely fast. We demonstrate a 5% improvement in compression ratio over state-of-the-art general-purpose compression tools for a large GenBank database file containing Expressed Sequence Tag (EST) data. Finally, coil can efficiently encode incremental additions to a sequence database. Conclusion: coil presents a compelling alternative to conventional compression of flat files for the storage and distribution of DNA sequence databases having a narrow distribution of sequence lengths, such as EST data. Increasing compression levels for databases having a wide distribution of sequence lengths is a direction for future work.	en_US
dc.identifier.citation	White, W. T. J., & Hendy, M. D. (2008). Compressing DNA sequence databases with coil. BMC Bioinformatics, 9. doi: 10.1186/1471-2105-9-242	en_US
dc.identifier.harvested	Massey_Dark
dc.identifier.issn	1471-2105
dc.identifier.uri	http://hdl.handle.net/10179/9717
dc.language.iso	en	en_US
dc.publisher	BioMed Central	en_US
dc.relation.isbasedon	BioMed Central	en_US
dc.relation.isformatof	http://www.biomedcentral.com/1471-2105/9/242	en_US
dc.subject	DNA sequence	en_US
dc.subject	Databases	en_US
dc.subject.other	Fields of Research::280000 Information, Computing and Communication Sciences::280300 Computer Software	en_US
dc.title	Compressing DNA sequence databases with coil	en_US
dc.type	Journal Article	en_US

Files

Original bundle

Name: 2008_White and Hendy.pdf
Size: 388.65 KB
Format: Adobe Portable Document Format

Collections

Journal Articles