Research Projects

  • Sequence Mapping

    Sequence Mapping is the problem of mapping very short reads to a reference genome.

    High-throughput sequencing (HTS) has become an invaluable technology for many applications, e.g. the detection of single-nucleotide polymorphisms and structural variations. In most of these applications, mapping sequenced "reads" to their potential genomic origin is the first fundamental step for subsequent analyses. Many tools have been developed to address this problem. Because of the large volume of HTS data available, much emphasis has been placed on speed and memory usage.

    We introduced two novel methods, mrsFAST and drFAST, to map HTS short reads to the reference genome. These methods are cache oblivious and guarantee perfect sensitivity, that is, they report every mapping location within the allowed error threshold. Both are specifically designed to address the bottleneck of multi-mapping for the purpose of structural variation detection.
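
    To make the idea of "perfect sensitivity" concrete, the following is a minimal sketch of seed-and-verify mapping based on the pigeonhole principle: if a read is allowed at most e mismatches and is cut into e + 1 non-overlapping seeds, at least one seed must match the reference exactly. This is an illustration only, not the mrsFAST or drFAST implementation; the function names, the plain hash index and the Hamming-distance error model are assumptions made for the example.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read, reference, index, k, max_mismatches):
    """Return ALL start positions where the read matches within max_mismatches.

    Hypothetical helper: uses max_mismatches + 1 disjoint seeds of length k,
    so any location with at most max_mismatches errors shares at least one
    exact seed with the read and is therefore found.
    """
    if (max_mismatches + 1) * k > len(read):
        raise ValueError("read too short for this seed length and error budget")
    hits = set()
    for i in range(max_mismatches + 1):
        seed = read[i * k:(i + 1) * k]
        for ref_pos in index.get(seed, ()):
            start = ref_pos - i * k  # where the whole read would begin
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            if hamming(read, window) <= max_mismatches:
                hits.add(start)
    return sorted(hits)


reference = "ACGTACGTTTGACGTACGAA"
index = build_index(reference, k=4)
print(map_read("ACGTACGA", reference, index, k=4, max_mismatches=1))  # [0, 11]
```

    The read is reported at both locations (one exact, one with a single mismatch), which is the multi-mapping behaviour that structural variation detection relies on.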

  • Sequence Compression

    Sequence Compression is about compressing biological data in raw and processed form.

    As HTS data grow in size, data management and storage are becoming major logistical obstacles to adopting HTS platforms. The need for ever-increasing monetary investment almost signalled the end of the Sequence Read Archive hosted at the National Center for Biotechnology Information, which holds most of the sequence data generated worldwide. One way to reduce the storage requirements of HTS data is compression. Currently, most HTS data are compressed with general-purpose algorithms such as gzip, which are not designed for data generated by HTS platforms. Recently, a number of fast and efficient compression algorithms have been designed specifically for HTS data to address some of the issues in data management, storage and communication.

    We address the storage and communication problems in HTS data by introducing SCALCE, a "boosting" scheme based on the Locally Consistent Parsing technique. SCALCE re-orders the data to increase the locality of reference and thereby improves the performance of well-known compression methods in terms of both speed and space.
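
    The re-ordering idea can be sketched as follows: group reads that share a common "core" substring so that similar reads sit next to each other, then hand the result to a general-purpose compressor. This is a simplified illustration, not SCALCE itself: the lexicographically smallest k-mer is used here as a stand-in for a Locally Consistent Parsing core, and the function names and the choice of zlib are assumptions.

```python
import zlib


def core(read, k):
    """Stand-in for an LCP core: the lexicographically smallest k-mer of the read."""
    if len(read) < k:
        return read
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def reorder_reads(reads, k=4):
    """Place reads that share a core next to each other (original order is not kept)."""
    return sorted(reads, key=lambda r: (core(r, k), r))


def compressed_size(reads):
    """Size in bytes of the newline-joined reads after zlib compression (level 9)."""
    return len(zlib.compress("\n".join(reads).encode("ascii"), 9))


reads = ["TTGACGTA", "ACGTTTGA", "GGGACGTC", "CCCCAAAA"]
boosted = reorder_reads(reads)
print(boosted)
print(compressed_size(reads), compressed_size(boosted))
```

    On a handful of reads the size difference is negligible; any benefit from grouping would only be expected to show on large read sets, where overlapping reads end up within the compressor's window.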

  • Transcriptomics Structural Variation Discovery

    Discovery of structural variation using RNA-Seq.

    Computational identification of genomic structural variants via high-throughput sequencing is an important problem for which a number of highly sophisticated solutions have been recently developed. With the advent of high-throughput transcriptome sequencing (RNA-Seq), the problem of identifying structural alterations in the transcriptome is now attracting significant attention.

    We introduce two novel algorithmic formulations for identifying transcriptomic structural variants by aligning transcripts to the reference genome while accounting for such variation. The first formulation is based on a nucleotide-level alignment model; the second, potentially faster formulation is based on chaining fragments shared between each transcript and the reference genome. Based on these formulations, we introduce a novel transcriptome-to-genome alignment tool, Dissect (DIScovery of Structural Alteration Event Containing Transcripts), which can identify and characterize transcriptomic events such as duplications, inversions, rearrangements and fusions. Dissect is suitable for whole-transcriptome structural variation discovery problems involving sufficiently long reads or accurately assembled contigs.
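
    As a rough illustration of the fragment-chaining idea, the sketch below selects the highest-scoring set of shared fragments that appear in the same order in the transcript and in the genome. It covers only the plain colinear case and is not the Dissect formulation, which additionally has to account for events such as inversions and duplications; the fragment representation (transcript start, genome start, length), the scoring by total matched length and the function name are all assumptions.

```python
def chain_fragments(fragments):
    """Best colinear chain of exact-match fragments.

    fragments: list of (t_start, g_start, length) tuples with length > 0.
    Returns (total matched length, chain as a list of fragments).
    Simple O(n^2) dynamic programme; a hypothetical helper for illustration.
    """
    frags = sorted(fragments)  # order by transcript start
    n = len(frags)
    if n == 0:
        return 0, []
    best = [f[2] for f in frags]  # best chain score ending at each fragment
    prev = [-1] * n               # back-pointers for chain recovery
    for j in range(n):
        tj, gj, lj = frags[j]
        for i in range(j):
            ti, gi, li = frags[i]
            # fragment i must end before j starts, in both sequences
            if ti + li <= tj and gi + li <= gj and best[i] + lj > best[j]:
                best[j] = best[i] + lj
                prev[j] = i
    end = max(range(n), key=best.__getitem__)
    chain = []
    k = end
    while k != -1:
        chain.append(frags[k])
        k = prev[k]
    chain.reverse()
    return best[end], chain


fragments = [(0, 100, 20), (25, 130, 30), (20, 500, 10), (60, 170, 15)]
print(chain_fragments(fragments))
# (65, [(0, 100, 20), (25, 130, 30), (60, 170, 15)])
```

    The fragment at genome position 500 is left out of the chain because it is not colinear with the others; in a variation-aware formulation, such off-chain fragments are the kind of signal that could point to a rearrangement or fusion.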