• Login
    View Item 
    •   Home
    • Massey Documents by Type
    • Journal Articles
    • View Item
    •   Home
    • Massey Documents by Type
    • Journal Articles
    • View Item
    JavaScript is disabled for your browser. Some features of this site may not work without it.

    Compressing DNA sequence databases with coil

    Icon
    View/Open Full Text
    2008_White and Hendy.pdf (388.6Kb)
    Abstract
    Background: Publicly available DNA sequence databases such as GenBank are large, and are growing at an exponential rate. The sheer volume of data being dealt with presents serious storage and data communications problems. Currently, sequence data is usually kept in large "flat files," which are then compressed using standard Lempel-Ziv (gzip) compression – an approach which rarely achieves good compression ratios. While much research has been done on compressing individual DNA sequences, surprisingly little has focused on the compression of entire databases of such sequences. In this study we introduce the sequence database compression software coil. Results: We have designed and implemented a portable software package, coil, for compressing and decompressing DNA sequence databases based on the idea of edit-tree coding. coil is geared towards achieving high compression ratios at the expense of execution time and memory usage during compression – the compression time represents a "one-off investment" whose cost is quickly amortised if the resulting compressed file is transmitted many times. Decompression requires little memory and is extremely fast. We demonstrate a 5% improvement in compression ratio over state-of-the-art general-purpose compression tools for a large GenBank database file containing Expressed Sequence Tag (EST) data. Finally, coil can efficiently encode incremental additions to a sequence database. Conclusion: coil presents a compelling alternative to conventional compression of flat files for the storage and distribution of DNA sequence databases having a narrow distribution of sequence lengths, such as EST data. Increasing compression levels for databases having a wide distribution of sequence lengths is a direction for future work.
    Citation
    White, W. T. J., & Hendy, M. D. (2008). Compressing DNA sequence databases with coil. Bmc Bioinformatics, 9. doi: 10.1186/1471-2105-9-242
    Date
    2008-05-20
    Author
    White, W. Timothy J.
    Hendy, Michael D.
    Publisher
    BioMed Central
    URI
    http://hdl.handle.net/10179/9717
    Collections
    • Journal Articles
    Metadata
    Show full item record

    Copyright © Massey University
    | Contact Us | Feedback | Copyright Take Down Request | Massey University Privacy Statement
    DSpace software copyright © Duraspace
    v5.7-2020.1-beta1
     

     

    Tweets by @Massey_Research
    Information PagesContent PolicyDepositing content to MROCopyright and Access InformationDeposit LicenseDeposit License SummaryTheses FAQFile FormatsDoctoral Thesis Deposit

    Browse

    All of MROCommunities & CollectionsBy Issue DateAuthorsTitlesSubjectsThis CollectionBy Issue DateAuthorsTitlesSubjects

    My Account

    LoginRegister

    Statistics

    View Usage Statistics

    Copyright © Massey University
    | Contact Us | Feedback | Copyright Take Down Request | Massey University Privacy Statement
    DSpace software copyright © Duraspace
    v5.7-2020.1-beta1