Filter by type:

Sort by year:

HASLR: Fast Hybrid Assembly of Long Reads

Ehsan Haghshenas, Hossein Asghari, Jens Stoye, Cedric Chauve, and Faraz Hach
Journal Papers iScience. 2020 Aug 21;23(8).
Presentation RECOMB-Seq 2020 Regular Talk (10th Annual RECOMB Satellite Workshop on Massively Parallel Sequencing), Virtual Conference, June 26-27 2020


Third-generation sequencing technologies from companies such as Oxford Nanopore and Pacific Biosciences have paved the way for building more contiguous and potentially gap-free assemblies. The larger effective length of their reads has provided a means to overcome the challenges of short to mid-range repeats. Currently, accurate long read assemblers are computationally expensive, whereas faster methods are not as accurate. Moreover, despite recent advances in third-generation sequencing, researchers still tend to generate accurate short reads for many of the analysis tasks. Here, we present HASLR, a hybrid assembler that uses error-prone long reads together with high-quality short reads to efficiently generate accurate genome assemblies. Our experiments show that HASLR is not only the fastest assembler but also the one with the lowest number of misassemblies on most of the samples, while being on par with other assemblers in terms of contiguity and accuracy.
Availability: HASLR is an open source tool available at

CircMiner: Accurate and Rapid Detection of Circular RNA through Splice-Aware Pseudo-Alignment Scheme

Hossein Asghari, Yen-Yi Lin, Yang Xu, Ehsan Haghshenas, Colin C Collins, and Faraz Hach
Journal Papers Bioinformatics. 2020 Apr 7;-. doi:


Motivation: The ubiquitous abundance of circular RNAs (circRNAs) has been revealed by performing high-throughput sequencing in a variety of eukaryotes. circRNAs are related to some diseases such as cancer in which they act as oncogenes or tumor-suppressors, and therefore have the potential to be used as biomarkers or therapeutic targets. Accurate and rapid detection of circRNAs from short reads remains computationally challenging. This is due to the fact that identifying chimeric reads, which is essential for finding back-splice junctions, is a complex process. The sensitivity of discovery methods, to a high degree, relies on the underlying mapper that is used for finding chimeric reads. Furthermore, all the available circRNA discovery pipelines are resource intensive.
Results: We introduce CircMiner, a novel stand-alone circRNA detection method that rapidly identifies and filters out linear RNA-Seq reads and detects back-splice junctions. CircMiner employs a rapid pseudoalignment technique to identify linear reads that originate from transcripts, genes, or the genome. CircMiner further processes the remaining reads to identify the back-splice junctions and detect circRNAs with single-nucleotide resolution. We evaluated the efficacy of CircMiner using simulated datasets generated from known back-splice junctions and showed that CircMiner has superior accuracy and speed compared to the existing circRNA detection tools. Additionally, on two RNase R treated cell line datasets, CircMiner was able to detect most of consistent, high confidence circRNAs compared to untreated samples of the same cell line.
Availability: CircMiner is implemented in C++ and is available online at

PhISCS: a combinatorial approach for subperfect tumor phylogeny reconstruction via integrative use of single-cell and bulk sequencing data

Salem Malikic, Farid Rashidi Mehrabadi, Simone Ciccolella, Md. Khaledur Rahman, Camir Ricketts, Ehsan Haghshenas, Daniel Seidman, Faraz Hach , Iman Hajirasouliha, and S. Cenk Sahinalp
Journal PapersGenome Res. 2019 Nov;29(11):1860-1877. doi: 10.1101/gr.234435.118
Presentation RECOMB 2020 Highlight Talk (24TH Internation Conference on Research in Computationl Molecular Biology), Virtual Conference, June 22-25 2020


Available computational methods for tumor phylogeny inference via single-cell sequencing (SCS) data typically aim to identify the most likely perfect phylogeny tree satisfying the infinite sites assumption (ISA). However, the limitations of SCS technologies including frequent allele dropout and variable sequence coverage may prohibit a perfect phylogeny. In addition, ISA violations are commonly observed in tumor phylogenies due to the loss of heterozygosity, deletions, and convergent evolution. In order to address such limitations, we introduce the optimal subperfect phylogeny problem which asks to integrate SCS data with matching bulk sequencing data by minimizing a linear combination of potential false negatives (due to allele dropout or variance in sequence coverage), false positives (due to read errors) among mutation calls, and the number of mutations that violate ISA (real or because of incorrect copy number estimation). We then describe a combinatorial formulation to solve this problem which ensures that several lineage constraints imposed by the use of variant allele frequencies (VAFs, derived from bulk sequence data) are satisfied. We express our formulation both in the form of an integer linear program (ILP) and—as a first in tumor phylogeny reconstruction—a Boolean constraint satisfaction problem (CSP) and solve them by leveraging state-of-the-art ILP/CSP solvers. The resulting method, which we name PhISCS, is the first to integrate SCS and bulk sequencing data while accounting for ISA violating mutations. In contrast to the alternative methods, typically based on probabilistic approaches, PhISCS provides a guarantee of optimality in reported solutions. Using simulated and real data sets, we demonstrate that PhISCS is more general and accurate than all available approaches.

Alignment-free Clustering of UMI Tagged DNA Molecules

Baraa Orabi, Emre Erhan, Brian McConeghy, Stanislav V Volik, Stephane Le Bihan, Robert Bell, Colin C Collins, Cedric Chauve and Faraz Hach
Journal Papers Bioinformatics. 2019 Jun 1;35(11):1829-1836. doi: 10.1093/bioinformatics/bty888.
Presentation RECOMB-Seq 2018 Regular Talk (8th Annual RECOMB Satellite Workshop on Massively Parallel Sequencing), Paris, France, April 19-20 2018


Motivation: Next Generation Sequencing has led to the availability of massive genomic datasets whose processing raises many challenges, including the handling of sequencing errors. This is especially pertinent in cancer genomics, for example, for detecting low allele frequency variations from circulating tumour DNA. Barcode tagging of DNA molecules with Unique Molecular Identifiers (UMI) attempts to mitigate sequencing errors; UMI tagged molecules are PCR amplified, and the PCR copies of UMI tagged molecules are sequenced independently. However, the PCR and sequencing steps can generate errors in the sequenced reads that can be located in the barcode and/or the DNA sequence. Analyzing UMI tagged sequencing data requires an initial clustering step, with the aim of grouping reads sequenced from PCR duplicates of the same UMI tagged molecule into a single cluster, and the size of the current datasets requires this clustering process to be resource-efficient.
Results: We introduce Calib, a computational tool that clusters paired-end reads from UMI tagged sequencing experiments using substitution-error-dominant sequencing platforms such as Illumina. Calib clusters are defined as connected components of a graph whose edges are defined in terms of both barcode similarity and read sequence similarity. The graph is constructed efficiently using locality sensitive hashing and MinHashing techniques. Calib's default clustering parameters are optimized empirically, for different UMI and read lengths, using a simulation module that is packaged with Calib. Compared to other tools, Calib has the best accuracy on simulated data, while maintaining reasonable runtime and memory footprint. On a real dataset, Calib runs with far less resources than alignment-based methods, and its clusters reduce the number of likely false positive in downstream mutation calling.
Availability: Calib is implemented in C++ and its simulation module is implemented in Python. Calib is available on our GitHub repository at

Structural variation and fusion detection using targeted sequencing data from circulating cell free DNA

Alexander R Gawroński, Yen-Yi Lin, Brian McConeghy, Stephane LeBihan, Hossein Asghari, Can Koçkan, Baraa Orabi, Nabil Adra, Roberto Pili, Colin C Collins, S Cenk Sahinalp and Faraz Hach
Journal Papers Nucleic Acids Res. 2019 Apr 23;47(7):e38. doi: 10.1093/nar/gkz067.


Motivation: Cancer is a complex disease that involves rapidly evolving cells, often forming multiple distinct clones. In order to effectively understand progression of a patient-specific tumor, one needs to comprehensively sample tumor DNA at multiple time points, ideally obtained through inexpensive and minimally invasive techniques. Current sequencing technologies make the 'liquid biopsy' possible, which involves sampling a patient's blood or urine and sequencing the circulating cell free DNA (cfDNA). A certain percentage of this DNA originates from the tumor, known as circulating tumor DNA (ctDNA). The ratio of ctDNA may be extremely low in the sample, and the ctDNA may originate from multiple tumors or clones. These factors present unique challenges for applying existing tools and workflows to the analysis of ctDNA, especially in the detection of structural variations which rely on sufficient read coverage to be detectable.
Results: Here we introduce SViCT , a structural variation (SV) detection tool designed to handle the challenges associated with cfDNA analysis. SViCT can detect breakpoints and sequences of various structural variations including deletions, insertions, inversions, duplications and translocations. SViCT extracts discordant read pairs, one-end anchors and soft-clipped/split reads, assembles them into contigs, and re-maps contig intervals to a reference genome using an efficient k-mer indexing approach. The intervals are then joined using a combination of graph and greedy algorithms to identify specific structural variant signatures. We assessed the performance of SViCT and compared it to state-of-the-art tools using simulated cfDNA datasets with properties matching those of real cfDNA samples. The positive predictive value and sensitivity of our tool was superior to all the tested tools and reasonable performance was maintained down to the lowest dilution of 0.01% tumor DNA in simulated datasets. Additionally, SViCT was able to detect all known SVs in two real cfDNA reference datasets (at 0.6-5% ctDNA) and predict a novel structural variant in a prostate cancer cohort.
Availability: SViCT is available at

BAP1 haploinsufficiency predicts a distinct immunogenic class of malignant peritoneal mesothelioma

Raunak Shrestha, Noushin Nabavi, Yen-Yi Lin, Fan Mo, Shawn Anderson, Stanislav Volik, Hans H. Adomat, Dong Lin, Hui Xue, Xin Dong, Robert Shukin, Robert H. Bell, Brian McConeghy, Anne Haegert, Sonal Brahmbhatt, Estelle Li, Htoo Zarni Oo, Antonio Hurtado-Coll, Ladan Fazli, Joshua Zhou, Yarrow McConnell, Andrea McCart, Andrew Lowy, Gregg B. Morin, Tianhui Chen, Mads Daugaard, S. Cenk Sahinalp, Faraz Hach, Stephane Le Bihan, Martin E. Gleave, Yuzhuo Wang, Andrew Churg and Colin C. Collins
Journal Papers Genome Med. 2019 Feb 18;11(1):8. doi: 10.1186/s13073-019-0620-3.


Backgroung: Malignant peritoneal mesothelioma (PeM) is a rare and fatal cancer that originates from the peritoneal lining of the abdomen. Standard treatment of PeM is limited to cytoreductive surgery and/or chemotherapy, and no effective targeted therapies for PeM exist. Some immune checkpoint inhibitor studies of mesothelioma have found positivity to be associated with a worse prognosis.
Methods: o search for novel therapeutic targets for PeM, we performed a comprehensive integrative multi-omics analysis of the genome, transcriptome, and proteome of 19 treatment-naïve PeM, and in particular, we examined BAP1 mutation and copy number status and its relationship to immune checkpoint inhibitor activation.
Results: We found that PeM could be divided into tumors with an inflammatory tumor microenvironment and those without and that this distinction correlated with haploinsufficiency of BAP1. To further investigate the role of BAP1, we used our recently developed cancer driver gene prioritization algorithm, HIT'nDRIVE, and observed that PeM with BAP1 haploinsufficiency form a distinct molecular subtype characterized by distinct gene expression patterns of chromatin remodeling, DNA repair pathways, and immune checkpoint receptor activation. We demonstrate that this subtype is correlated with an inflammatory tumor microenvironment and thus is a candidate for immune checkpoint blockade therapies.
Conclusions: Our findings reveal BAP1 to be a potential, easily trackable prognostic and predictive biomarker for PeM immunotherapy that refines PeM disease classification. BAP1 stratification may improve drug response rates in ongoing phases I and II clinical trials exploring the use of immune checkpoint blockade therapies in PeM in which BAP1 status is not considered. This integrated molecular characterization provides a comprehensive foundation for improved management of a subset of PeM patients.

lordFAST: sensitive and Fast Alignment Search Tool for LOng noisy Read sequencing Data.

Ehsan Haghshenas, S Cenk Sahinalp and Faraz Hach
Journal Papers Bioinformatics. 2019 Jan 1;35(1):20-27. doi: 10.1093/bioinformatics/bty544.


Motivation: Recent advances in genomics and precision medicine have been made possible through the application of high throughput sequencing (HTS) to large collections of human genomes. Although HTS technologies have proven their use in cataloging human genome variation, computational analysis of the data they generate is still far from being perfect. The main limitation of Illumina and other popular sequencing technologies is their short read length relative to the lengths of (common) genomic repeats. Newer (single molecule sequencing - SMS) technologies such as Pacific Biosciences and Oxford Nanopore are producing longer reads, making it theoretically possible to overcome the difficulties imposed by repeat regions. Unfortunately, because of their high sequencing error rate, reads generated by these technologies are very difficult to work with and cannot be used in many of the standard downstream analysis pipelines. Note that it is not only difficult to find the correct mapping locations of such reads in a reference genome, but also to establish their correct alignment so as to differentiate sequencing errors from real genomic variants. Furthermore, especially since newer SMS instruments provide higher throughput, mapping and alignment need to be performed much faster than before, maintaining high sensitivity.
Results: We introduce lordFAST, a novel long-read mapper that is specifically designed to align reads generated by PacBio and potentially other SMS technologies to a reference. lordFAST not only has higher sensitivity than the available alternatives, it is also among the fastest and has a very low memory footprint.
Availability and Implementation: lordFAST is implemented in C++ and supports multi-threading. The source code of lordFAST is available at

Fast characterization of segmental duplications in genome assemblies

Ibrahim Numanagić, Alim S Gökkaya, Lillian Zhang, Bonnie Berger, Can Alkan and Faraz Hach
Journal Papers Bioinformatics. 2018 Sep 1;34(17):i706-i714
Presentation ECCB 2018 (17TH European Conference on Computational Biology), Athens, Greece, September 8-12 2018


Motivation: Segmental duplications (SDs) or low-copy repeats, are segments of DNA > 1 Kbp with high sequence identity that are copied to other regions of the genome. SDs are among the most important sources of evolution, a common cause of genomic structural variation and several are associated with diseases of genomic origin including schizophrenia and autism. Despite their functional importance, SDs present one of the major hurdles for de novo genome assembly due to the ambiguity they cause in building and traversing both state-of-the-art overlap-layout-consensus and de Bruijn graphs. This causes SD regions to be misassembled, collapsed into a unique representation, or completely missing from assembled reference genomes for various organisms. In turn, this missing or incorrect information limits our ability to fully understand the evolution and the architecture of the genomes. Despite the essential need to accurately characterize SDs in assemblies, there has been only one tool that was developed for this purpose, called Whole-Genome Assembly Comparison (WGAC); its primary goal is SD detection. WGAC is comprised of several steps that employ different tools and custom scripts, which makes this strategy difficult and time consuming to use. Thus there is still a need for algorithms to characterize within-assembly SDs quickly, accurately, and in a user friendly manner.
Results: Here we introduce SEgmental Duplication Evaluation Framework (SEDEF) to rapidly detect SDs through sophisticated filtering strategies based on Jaccard similarity and local chaining. We show that SEDEF accurately detects SDs while maintaining substantial speed up over WGAC that translates into practical run times of minutes instead of weeks. Notably, our algorithm captures up to 25% 'pairwise error' between segments, whereas previous studies focused on only 10%, allowing us to more deeply track the evolutionary history of the genome.
Availability and Implementation: SEDEF is available at

Dynamic Alignment-Free and Reference-Free Read Compression

Guillaume Holley, Roland Wittler, Jens Stoye and Faraz Hach
Journal Papers J Comput Biol. 2018 Jul;25(7):825-836.


The advent of high throughput sequencing (HTS) technologies raises a major concern about storage and transmission of data produced by these technologies. In particular, large-scale sequencing projects generate an unprecedented volume of genomic sequences ranging from tens to several thousands of genomes per species. These collections contain highly similar and redundant sequences, also known as pangenomes. The ideal way to represent and transfer pangenomes is through compression. A number of HTS-specific compression tools have been developed to reduce the storage and communication costs of HTS data, yet none of them is designed to process a pangenome. In this article, we present dynamic alignment-free and reference-free read compression (DARRC), a new alignment-free and reference-free compression method. It addresses the problem of pangenome compression by encoding the sequences of a pangenome as a guided de Bruijn graph. The novelty of this method is its ability to incrementally update DARRC archives with new genome sequences without full decompression of the archive. DARRC can compress both single-end and paired-end read sequences of any length using all symbols of the IUPAC nucleotide code. On a large Pseudomonas aeruginosa data set, our method outperforms all other tested tools. It provides a 30% compression ratio improvement in single-end mode compared with the best performing state-of-the-art HTS-specific compression method in our experiments.

Computational identification of micro-structural variations and their proteogenomic consequences in cancer.

Yen-Yi Lin, Alexander Gawronski, Faraz Hach, Sujun Li, Ibrahim Numanagić, Iman Sarrafi, Swati Mishra, Andrew McPherson, Colin C Collins, Milan Radovich, Haixu Tang and S Cenk Sahinalp
Journal Papers Bioinformatics. 2018 May 15;34(10):1672-1681


Purpose: Rapid advancement in high throughput genome and transcriptome sequencing (HTS) and mass spectrometry (MS) technologies has enabled the acquisition of the genomic, transcriptomic and proteomic data from the same tissue sample. We introduce a computational framework, ProTIE, to integratively analyze all three types of omics data for a complete molecular profile of a tissue sample. Our framework features MiStrVar, a novel algorithmic method to identify micro structural variants (microSVs) on genomic HTS data. Coupled with deFuse, a popular gene fusion detection method we developed earlier, MiStrVar can accurately profile structurally aberrant transcripts in tumors. Given the breakpoints obtained by MiStrVar and deFuse, our framework can then identify all relevant peptides that span the breakpoint junctions and match them with unique proteomic signatures. Observing structural aberrations in all three types of omics data validates their presence in the tumor samples
Results: We have applied our framework to all The Cancer Genome Atlas (TCGA) breast cancer Whole Genome Sequencing (WGS) and/or RNA-Seq datasets, spanning all four major subtypes, for which proteomics data from Clinical Proteomic Tumor Analysis Consortium (CPTAC) have been released. A recent study on this dataset focusing on SNVs has reported many that lead to novel peptides. Complementing and significantly broadening this study, we detected 244 novel peptides from 432 candidate genomic or transcriptomic sequence aberrations. Many of the fusions and microSVs we discovered have not been reported in the literature. Interestingly, the vast majority of these translated aberrations, fusions in particular, were private, demonstrating the extensive inter-genomic heterogeneity present in breast cancer. Many of these aberrations also have matching out-of-frame downstream peptides, potentially indicating novel protein sequence and structure.
Availability and Implementation: MiStrVar is available for download at, and ProTIE is available at

Mutational Analysis of Gene Fusions Predicts Novel MHC Class I-Restricted T-Cell Epitopes and Immune Signatures in a Subset of Prostate Cancer.

Jennifer L. Kalina , David S. Neilson, Yen-Yi Lin, Phineas T. Hamilton, Alexandra P. Comber, Emma M.H. Loy, S. Cenk Sahinalp, Colin C. Collins, Faraz Hach and Julian J. Lum
Journal Papers Clin Cancer Res. 2017 Dec 15;23(24):7596-7607.


Purpose: Gene fusions are frequently found in prostate cancer and may result in the formation of unique chimeric amino acid sequences (CASQ) that span the breakpoint of two fused gene products. This study evaluated the potential for fusion-derived CASQs to be a source of tumor neoepitopes, and determined their relationship to patterns of immune signatures in prostate cancer patients.Experimental Design: A computational strategy was used to identify CASQs and their corresponding predicted MHC class I epitopes using RNA-Seq data from The Cancer Genome Atlas of prostate tumors. In vitro peptide-specific T-cell expansion was performed to identify CASQ-reactive T cells. A multivariate analysis was used to relate patterns of in silico-predicted tumor-infiltrating immune cells with prostate tumors harboring these mutational events.
Results: Eighty-seven percent of tumors contained gene fusions with a mean of 12 per tumor. In total, 41% of fusion-positive tumors were found to encode CASQs. Within these tumors, 87% gave rise to predicted MHC class I-binding epitopes. This observation was more prominent when patients were stratified into low- and intermediate/high-risk categories. One of the identified CASQ from the recurrent TMPRSS2:ERG type VI fusion contained several high-affinity HLA-restricted epitopes. These peptides bound HLA-A*02:01 in vitro and were recognized by CD8+ T cells. Finally, the presence of fusions and CASQs were associated with expression of immune cell infiltration.
Conclusions: Mutanome analysis of gene fusion-derived CASQs can give rise to patient-specific predicted neoepitopes. Moreover, these fusions predicted patterns of immune cell infiltration within a subgroup of prostate cancer patients.

Discovery and genotyping of novel sequence insertions in many sequenced individuals.

Pinar Kavak, YenYi Lin, Ibrahim Numanagic, Hossein Asghari, Tunga Güngör, Can Alkan and Faraz Hach.
Journal PapersBioinformatics. 2017 Jul 15;33(14):i161-i169
Presentation ISMB 2017 (25th Conference on Intelligent Systems for Molecular Biology and 16th European Conference on Computational Biology), Prague, Czech Republic, July 21-25 2017


Motivation: Despite recent advances in algorithms design to characterize structural variation using high-throughput short read sequencing (HTS) data, characterization of novel sequence insertions longer than the average read length remains a challenging task. This is mainly due to both computational difficulties and the complexities imposed by genomic repeats in generating reliable assemblies to accurately detect both the sequence content and the exact location of such insertions. Additionally, de novo genome assembly algorithms typically require a very high depth of coverage, which may be a limiting factor for most genome studies. Therefore, characterization of novel sequence insertions is not a routine part of most sequencing projects.
Results: Here, we present Pamir, a new algorithm to efficiently and accurately discover and genotype novel sequence insertions using either single or multiple genome sequencing datasets. Pamir is able to detect breakpoint locations of the insertions and calculate their zygosity (i.e. heterozygous versus homozygous) by analyzing multiple sequence signatures, matching one-end-anchored sequences to small-scale de novo assemblies of unmapped reads, and conducting strand-aware local assembly. We test the efficacy of Pamir on both simulated and real data, and demonstrate its potential use in accurate and routine identification of novel sequence insertions in genome projects.
Availability and implementation: Pamir is available at

SiNVICT: ultra-sensitive detection of single nucleotide variants and indels in circulating tumour DNA.

Can Kockan, Faraz Hach, Iman Sarrafi, Robert H. Bell, Brian McConeghy, Kevin Beja, Anne Haegert, Alexander W. Wyatt, Stanislav V. Volik, Kim N. Chi, Colin C. Collins and S. Cenk Sahinalp
Journal Papers Bioinformatics. 2017 Jan 1;33(1):26-34


Motivation: Successful development and application of precision oncology approaches require robust elucidation of the genomic landscape of a patient's cancer and, ideally, the ability to monitor therapy-induced genomic changes in the tumour in an inexpensive and minimally invasive manner. Thanks to recent advances in sequencing technologies, 'liquid biopsy', the sampling of patient's bodily fluids such as blood and urine, is considered as one of the most promising approaches to achieve this goal. In many cancer patients, and especially those with advanced metastatic disease, deep sequencing of circulating cell free DNA (cfDNA) obtained from patient's blood yields a mixture of reads originating from the normal DNA and from multiple tumour subclones - called circulating tumour DNA or ctDNA. The ctDNA/cfDNA ratio as well as the proportion of ctDNA originating from specific tumour subclones depend on multiple factors, making comprehensive detection of mutations difficult, especially at early stages of cancer. Furthermore, sensitive and accurate detection of single nucleotide variants (SNVs) and indels from cfDNA is constrained by several factors such as the sequencing errors and PCR artifacts, and mapping errors related to repeat regions within the genome. In this article, we introduce SiNVICT, a computational method that increases the sensitivity and specificity of SNV and indel detection at very low variant allele frequencies. SiNVICT has the capability to handle multiple sequencing platforms with different error properties; it minimizes false positives resulting from mapping errors and other technology specific artifacts including strand bias and low base quality at read ends. SiNVICT also has the capability to perform time-series analysis, where samples from a patient sequenced at multiple time points are jointly examined to report locations of interest where there is a possibility that certain clones were wiped out by some treatment while some subclones gained selective advantage.
Results: We tested SiNVICT on simulated data as well as prostate cancer cell lines and cfDNA obtained from castration-resistant prostate cancer patients. On both simulated and biological data, SiNVICT was able to detect SNVs and indels with variant allele percentages as low as 0.5%. The lowest amounts of total DNA used for the biological data where SNVs and indels could be detected with very high sensitivity were 2.5 ng on the Ion Torrent platform and 10 ng on Illumina. With increased sequencing and mapping accuracy, SiNVICT might be utilized in clinical settings, making it possible to track the progress of point mutations and indels that are associated with resistance to cancer therapies and provide patients personalized treatment. We also compared SiNVICT with other popular SNV callers such as MuTect, VarScan2 and Freebayes. Our results show that SiNVICT performs better than these tools in most cases and allows further data exploration such as time-series analysis on cfDNA sequencing data.
Results: We tested SiNVICT on simulated data as well as prostate cancer cell lines and cfDNA obtained from castration-resistant prostate cancer patients. On both simulated and biological data, SiNVICT was able to detect SNVs and indels with variant allele percentages as low as 0.5%. The lowest amounts of total DNA used for the biological data where SNVs and indels could be detected with very high sensitivity were 2.5 ng on the Ion Torrent platform and 10 ng on Illumina. With increased sequencing and mapping accuracy, SiNVICT might be utilized in clinical settings, making it possible to track the progress of point mutations and indels that are associated with resistance to cancer therapies and provide patients personalized treatment. We also compared SiNVICT with other popular SNV callers such as MuTect, VarScan2 and Freebayes. Our results show that SiNVICT performs better than these tools in most cases and allows further data exploration such as time-series analysis on cfDNA sequencing data.
Availability and implementation: SiNVICT is available at: Supplementary information: Supplementary data are available at Bioinformatics online.

Comparison of high-throughput sequencing data compression tools

Ibrahim Numanagic, James K Bonfield, Faraz Hach, Jan Voges, Jorn Ostermann, Claudio Alberti, Marco Mattavelli, S. Cenk Sahinalp
Journal Papers Nat Methods. 2016 Dec;13(12):1005-1008


High-throughput sequencing (HTS) data are commonly stored as raw sequencing reads in FASTQ format or as reads mapped to a reference, in SAM format, both with large memory footprints. Worldwide growth of HTS data has prompted the development of compression methods that aim to significantly reduce HTS data size. Here we report on a benchmarking study of available compression methods on a comprehensive set of HTS data using an automated framework.

CoLoRMap: Correcting Long Reads by Mapping short reads.

Ehsan Haghshenas, Faraz Hach, S. Cenk Sahinalp, Cedric Chauve
Journal Papers Bioinformatics. 2016 Sep 1;32(17):i545-i551
Presentation ECCB'16 (15th European Conference on Computational Biology), The Hauge, Netherlands, September 3-7 2016


Motivation: Second generation sequencing technologies paved the way to an exceptional increase in the number of sequenced genomes, both prokaryotic and eukaryotic. However, short reads are difficult to assemble and often lead to highly fragmented assemblies. The recent developments in long reads sequencing methods offer a promising way to address this issue. However, so far long reads are characterized by a high error rate, and assembling from long reads require a high depth of coverage. This motivates the development of hybrid approaches that leverage the high quality of short reads to correct errors in long reads.
Results: We introduce CoLoRMap, a hybrid method for correcting noisy long reads, such as the ones produced by PacBio sequencing technology, using high-quality Illumina paired-end reads mapped onto the long reads. Our algorithm is based on two novel ideas: using a classical shortest path algorithm to find a sequence of overlapping short reads that minimizes the edit score to a long read and extending corrected regions by local assembly of unmapped mates of mapped short reads. Our results on bacterial, fungal and insect data sets show that CoLoRMap compares well with existing hybrid correction methods.
Availability and Implementation: The source code of CoLoRMap is freely available for non-commercial use at

Robustness of Massively Parallel Sequencing Platforms

Pinar Kavak, Bayram Yuksel, Soner Aksu, M. Oguzhan Kulekci, Tunga Gungor, Faraz Hach, S. Cenk Sahinalp, Turkish Human Genome Project, Can Alkan, Mahmut Şamil Sagıroglu
Journal PapersPLoS One. 2015 Sep 18;10(9):e0138259


The improvements in high throughput sequencing technologies (HTS) made clinical sequencing projects such as ClinSeq and Genomics England feasible. Although there are significant improvements in accuracy and reproducibility of HTS based analyses, the usability of these types of data for diagnostic and prognostic applications necessitates a near perfect data generation. To assess the usability of a widely used HTS platform for accurate and reproducible clinical applications in terms of robustness, we generated whole genome shotgun (WGS) sequence data from the genomes of two human individuals in two different genome sequencing centers. After analyzing the data to characterize SNPs and indels using the same tools (BWA, SAMtools, and GATK), we observed significant number of discrepancies in the call sets. As expected, the most of the disagreements between the call sets were found within genomic regions containing common repeats and segmental duplications, albeit only a small fraction of the discordant variants were within the exons and other functionally relevant regions such as promoters. We conclude that although HTS platforms are sufficiently powerful for providing data for first-pass clinical tests, the variant predictions still need to be confirmed using orthogonal methods before using in clinical applications.

Spatial genomic heterogeneity within localized, multifocal prostate cancer

Paul C Boutros, Michael Fraser, Nicholas J Harding, Richard de Borja, Dominique Trudel, Emilie Lalonde, Alice Meng, Pablo H Hennings-Yeomans, Andrew McPherson, Veronica Y Sabelnykova, Amin Zia, Natalie S Fox, Julie Livingstone, Yu-Jia Shiah, Jianxin Wang, Timothy A Beck, Cherry L Have, Taryne Chong, Michelle Sam, Jeremy Johns, Lee Timms, Nicholas Buchner, Ada Wong, John D Watson, Trent T Simmons, Christine P'ng, Gaetano Zafarana, Francis Nguyen, Xuemei Luo, Kenneth C Chu, Stephenie D Prokopec, Jenna Sykes, Alan Dal Pra, Alejandro Berlin, Andrew Brown, Michelle A Chan-Seng-Yue, Fouad Yousif, Robert E Denroche, Lauren C Chong, Gregory M Chen, Esther Jung, Clement Fung, Maud H W Starmans, Hanbo Chen, Shaylan K Govind, James Hawley, Alister D'Costa, Melania Pintilie, Daryl Waggott, Faraz Hach, Philippe Lambin, Lakshmi B Muthuswamy, Colin Cooper, Rosalind Eeles, David Neal, Bernard Tetu, Cenk Sahinalp, Lincoln D Stein, Neil Fleshner, Sohrab P Shah, Colin C Collins, Thomas J Hudson, John D McPherson, Theodorus van der Kwast and Robert G Bristow
Journal PapersNat Genet. 2015 Jul;47(7):736-45


Herein we provide a detailed molecular analysis of the spatial heterogeneity of clinically localized, multifocal prostate cancer to delineate new oncogenes or tumor suppressors. We initially determined the copy number aberration (CNA) profiles of 74 patients with index tumors of Gleason score 7. Of these, 5 patients were subjected to whole-genome sequencing using DNA quantities achievable in diagnostic biopsies, with detailed spatial sampling of 23 distinct tumor regions to assess intraprostatic heterogeneity in focal genomics. Multifocal tumors are highly heterogeneous for single-nucleotide variants (SNVs), CNAs and genomic rearrangements. We identified and validated a new recurrent amplification of MYCL, which is associated with TP53 deletion and unique profiles of DNA damage and transcriptional dysregulation. Moreover, we demonstrate divergent tumor evolution in multifocal cancer and, in some cases, tumors of independent clonal origin. These data represent the first systematic relation of intraprostatic genomic heterogeneity to predicted clinical outcome and inform the development of novel biomarkers that reflect individual prognosis.

Fast and accurate mapping of Complete Genomics reads

Donghyuk Lee, Farhad Hormozdiari, Hongyi Xin, Faraz Hach, Onur Mutlu and Can Alkan
Journal PapersMethods. 2015 Jun;79-80:3-10


Many recent advances in genomics and the expectations of personalized medicine are made possible thanks to power of high throughput sequencing (HTS) in sequencing large collections of human genomes. There are tens of different sequencing technologies currently available, and each HTS platform have different strengths and biases. This diversity both makes it possible to use different technologies to correct for shortcomings; but also requires to develop different algorithms for each platform due to the differences in data types and error models. The first problem to tackle in analyzing HTS data for resequencing applications is the read mapping stage, where many tools have been developed for the most popular HTS methods, but publicly available and open source aligners are still lacking for the Complete Genomics (CG) platform. Unfortunately, Burrows-Wheeler based methods are not practical for CG data due to the gapped nature of the reads generated by this method. Here we provide a sensitive read mapper (sirFAST) for the CG technology based on the seed-and-extend paradigm that can quickly map CG reads to a reference genome. We evaluate the performance and accuracy of sirFAST using both simulated and publicly available real data sets, showing high precision and recall rates.

DeeZ: reference-based compression by local assembly

Faraz Hach, Ibrahim Numanagic and S Cenk Sahinalp
Journal PapersNature Methods, 2014 Oct 30;11(11):1082-4
Presentation RECOMB-Seq 2015 HighLight Talk (5th Annual RECOMB Satellite Workshop on Massively Parallel Sequencing), Warsaw, Poland, April 10-21 2015

mrsFAST-Ultra: a compact, SNP-aware mapper for high performance sequencing applications

Faraz Hach, Iman Sarrafi, Farhad Hormozdiari, Can Alkan, Evan E. Eichler and S. Cenk Sahinalp.
Journal Papers Nucl. Acids Res. (1 July 2014) 42 (W1): W494-W500.


High throughput sequencing (HTS) platforms generate unprecedented amounts of data that introduce challenges for processing and downstream analysis. While tools that report the 'best' mapping location of each read provide a fast way to process HTS data, they are not suitable for many types of downstream analysis such as structural variation detection, where it is important to report multiple mapping loci for each read. For this purpose we introduce mrsFAST-Ultra, a fast, cache oblivious, SNP-aware aligner that can handle the multi-mapping of HTS reads very efficiently. mrsFAST-Ultra improves mrsFAST, our first cache oblivious read aligner capable of handling multi-mapping reads, through new and compact index structures that reduce not only the overall memory usage but also the number of CPU operations per alignment. In fact the size of the index generated by mrsFAST-Ultra is 10 times smaller than that of mrsFAST. As importantly, mrsFAST-Ultra introduces new features such as being able to (i) obtain the best mapping loci for each read, and (ii) return all reads that have at most n mapping loci (within an error threshold), together with these loci, for any user specified n. Furthermore, mrsFAST-Ultra is SNP-aware, i.e. it can map reads to reference genome while discounting the mismatches that occur at common SNP locations provided by db-SNP; this significantly increases the number of reads that can be mapped to the reference genome. Notice that all of the above features are implemented within the index structure and are not simple post-processing steps and thus are performed highly efficiently. Finally, mrsFAST-Ultra utilizes multiple available cores and processors and can be tuned for various memory settings. Our results show that mrsFAST-Ultra is roughly five times faster than its predecessor mrsFAST. In comparison to newly enhanced popular tools such as Bowtie2, it is more sensitive (it can report 10 times or more mappings per read) and much faster (six times or more) in the multi-mapping mode. Furthermore, mrsFAST-Ultra has an index size of 2GB for the entire human reference genome, which is roughly half of that of Bowtie2. mrsFAST-Ultra is open source and it can be accessed at

Optimal resolution of ambiguous RNA-Seq multimappings in the presence of novel isoforms

Phuong Dao, Ibrahim Numanagic, Yen-Yi Lin, Faraz Hach, Emre Karakoc, Nilgun Donmez, Colin Collins, Evan E. Eichler and S. Cenk Sahinalp
Journal Papers Bioinformatics (2014) 30 (5): 644-651.


Motivation: RNA-Seq technology is promising to uncover many novel alternative splicing events, gene fusions and other variations in RNA transcripts. For an accurate detection and quantification of transcripts, it is important to resolve the mapping ambiguity for those RNA-Seq reads that can be mapped to multiple loci: >17% of the reads from mouse RNA-Seq data and 50% of the reads from some plant RNA-Seq data have multiple mapping loci. In this study, we show how to resolve the mapping ambiguity in the presence of novel transcriptomic events such as exon skipping and novel indels towards accurate downstream analysis. We introduce ORMAN ( O ptimal R esolution of M ultimapping A mbiguity of R N A-Seq Reads), which aims to compute the minimum number of potential transcript products for each gene and to assign each multimapping read to one of these transcripts based on the estimated distribution of the region covering the read. ORMAN achieves this objective through a combinatorial optimization formulation, which is solved through well-known approximation algorithms, integer linear programs and heuristics.
Results: On a simulated RNA-Seq dataset including a random subset of transcripts from the UCSC database, the performance of several state-of-the-art methods for identifying and quantifying novel transcripts, such as Cufflinks, IsoLasso and CLIIQ, is significantly improved through the use of ORMAN. Furthermore, in an experiment using real RNA-Seq reads, we show that ORMAN is able to resolve multimapping to produce coverage values that are similar to the original distribution, even in genes with highly non-uniform coverage.
Availability: ORMAN is available at

SCALCE: boosting sequence compression algorithms using locally consistent encoding

Faraz Hach, Ibrahim Numanagic, Can Alkan and S Cenk Sahinalp
Journal Papers Bioinformatics (2012) 28 (23): 3051-3057.
Presentation RECOMB-Seq 2012 (2nd Annual RECOMB Satellite Workshop on Massively Parallel Sequencing), Barcelona, Spain, April 19-20 2012


Motivation: The high throughput sequencing (HTS) platforms generate unprecedented amounts of data that introduce challenges for the computational infrastructure. Data management, storage and analysis have become major logistical obstacles for those adopting the new platforms. The requirement for large investment for this purpose almost signalled the end of the Sequence Read Archive hosted at the National Center for Biotechnology Information (NCBI), which holds most of the sequence data generated world wide. Currently, most HTS data are compressed through general purpose algorithms such as gzip. These algorithms are not designed for compressing data generated by the HTS platforms; for example, they do not take advantage of the specific nature of genomic sequence data, that is, limited alphabet size and high similarity among reads. Fast and efficient compression algorithms designed specifically for HTS data should be able to address some of the issues in data management, storage and communication. Such algorithms would also help with analysis provided they offer additional capabilities such as random access to any read and indexing for efficient sequence similarity search. Here we present SCALCE, a 'boosting' scheme based on Locally Consistent Parsing technique, which reorganizes the reads in a way that results in a higher compression speed and compression rate, independent of the compression algorithm in use and without using a reference genome.
Results: Our tests indicate that SCALCE can improve the compression rate achieved through gzip by a factor of 4.19-when the goal is to compress the reads alone. In fact, on SCALCE reordered reads, gzip running time can improve by a factor of 15.06 on a standard PC with a single core and 6 GB memory. Interestingly even the running time of SCALCE + gzip improves that of gzip alone by a factor of 2.09. When compared with the recently published BEETL, which aims to sort the (inverted) reads in lexicographic order for improving bzip2, SCALCE + gzip provides up to 2.01 times better compression while improving the running time by a factor of 5.17. SCALCE also provides the option to compress the quality scores as well as the read names, in addition to the reads themselves. This is achieved by compressing the quality scores through order-3 Arithmetic Coding (AC) and the read names through gzip through the reordering SCALCE provides on the reads. This way, in comparison with gzip compression of the unordered FASTQ files (including reads, read names and quality scores), SCALCE (together with gzip and arithmetic encoding) can provide up to 3.34 improvement in the compression rate and 1.26 improvement in running time.
Availability: Our algorithm, SCALCE (Sequence Compression Algorithm using Locally Consistent Encoding), is implemented in C++ with both gzip and bzip2 compression options. It also supports multithreading when gzip option is selected, and the pigz binary is available. It is available at

CLIIQ: Accurate Comparative Detection and Quantification of Expressed Isoforms in a Population

Yen-Yi Lin, Phuong Dao, Faraz Hach, Marzieh Bakhshi, Fan Mo, Anna Lapuk, Colin Collins and S. Cenk Sahinalp
Conference PapersProc. of WABI 2012: Lecture Notes in Computer Science Volume 7534, 2012, pp 178-189
Presentation WABI 2012 (12th Workshop on Algorithms in Bioinformatics), September 10-12, Ljubljana, Slovenia


The recently developed RNA-Seq technology provides a high-throughput and reasonably accurate way to analyze the transcriptomic landscape of a tissue. Unfortunately, from a computational perspective, identification and quantification of a gene's isoforms from RNA-Seq data remains to be a non-trivial problem. We propose CLIIQ, a novel computational method for identification and quantification of expressed isoforms from multiple samples in a population. Motivated by ideas from compressed sensing literature, CLIIQ is based on an integer linear programming formulation for identifying and quantifying the most parsimonious set of isoforms. We show through simulations that, on a single sample, CLIIQ provides better results in isoform identification and quantification to alternative popular tools. More importantly, CLIIQ has an option to jointly analyze multiple samples, which significantly outperforms other tools in both isoform identification and quantification.

Dissect: detection and characterization of novel structural alterations in transcribed sequences

Deniz Yorukoglu, Faraz Hach, Lucas Swanson, Colin C. Collins, Inanc Birol and S. Cenk Sahinalp
Journal PapersBioinformatics (2012) 28 (12): i179-i187
PresentationISMB 2012 (20th Annual International Conference on Intelligent Systems for Molecular Biology), Long Beach, USA, July 19-23 2012
Award Won Ian Lawson Van Toch Memorial Award for Outstanding Student Paper


Motivation: Computational identification of genomic structural variants via high-throughput sequencing is an important problem for which a number of highly sophisticated solutions have been recently developed. With the advent of high-throughput transcriptome sequencing (RNA-Seq), the problem of identifying structural alterations in the transcriptome is now attracting significant attention. In this article, we introduce two novel algorithmic formulations for identifying transcriptomic structural variants through aligning transcripts to the reference genome under the consideration of such variation. The first formulation is based on a nucleotide-level alignment model; a second, potentially faster formulation is based on chaining fragments shared between each transcript and the reference genome. Based on these formulations, we introduce a novel transcriptome-to-genome alignment tool, Dissect (DIScovery of Structural Alteration Event Containing Transcripts), which can identify and characterize transcriptomic events such as duplications, inversions, rearrangements and fusions. Dissect is suitable for whole transcriptome structural variation discovery problems involving sufficiently long reads or accurately assembled contigs.
Results: We tested Dissect on simulated transcripts altered via structural events, as well as assembled RNA-Seq contigs from human prostate cancer cell line C4-2. Our results indicate that Dissect has high sensitivity and specificity in identifying structural alteration events in simulated transcripts as well as uncovering novel structural alterations in cancer transcriptomes.
Availability: Dissect is available for public use at:

From sequence to molecular pathology, and a mechanism driving the neuroendocrine phenotype in prostate cancer

Anna V Lapuk, Chunxiao Wu, Alexander W Wyatt, Andrew McPherson, Brian J McConeghy, Sonal Brahmbhatt, Fan Mo, Amina Zoubeidi, Shawn Anderson, Robert H Bell, Anne Haegert, Robert Shukin, Yuzhuo Wang, Ladan Fazli, Antonio Hurtado-Coll, Edward C Jones, Faraz Hach, Fereydoun Hormozdiari, Iman Hajirasouliha, Paul C Boutros, Robert G Bristow, Yongjun Zhao, Marco A Marra, Andrea Fanjul, Christopher A Maher, Arul M Chinnaiyan, Mark A Rubin, Himisha Beltran, S Cenk Sahinalp, Martin E Gleave, Stanislav V Volik and Colin C Collins
Journal PapersJ Pathol. 2012 Jul;227(3):286-97


The current paradigm of cancer care relies on predictive nomograms which integrate detailed histopathology with clinical data. However, when predictions fail, the consequences for patients are often catastrophic, especially in prostate cancer where nomograms influence the decision to therapeutically intervene. We hypothesized that the high dimensional data afforded by massively parallel sequencing (MPS) is not only capable of providing biological insights, but may aid molecular pathology of prostate tumours. We assembled a cohort of six patients with high-risk disease, and performed deep RNA and shallow DNA sequencing in primary tumours and matched metastases where available. Our analysis identified copy number abnormalities, accurately profiled gene expression levels, and detected both differential splicing and expressed fusion genes. We revealed occult and potentially dormant metastases, unambiguously supporting the patients' clinical history, and implicated the REST transcriptional complex in the development of neuroendocrine prostate cancer, validating this finding in a large independent cohort. We massively expand on the number of novel fusion genes described in prostate cancer; provide fresh evidence for the growing link between fusion gene aetiology and gene expression profiles; and show the utility of fusion genes for molecular pathology. Finally, we identified chromothripsis in a patient with chronic prostatitis. Our results provide a strong foundation for further development of MPS-based molecular pathology.

Integrated genome and transcriptome sequencing identifies a novel form of hybrid and aggressive prostate cancer

Chunxiao Wu, Alexander W Wyatt, Anna V Lapuk, Andrew McPherson, Brian J McConeghy, Robert H Bell, Shawn Anderson, Anne Haegert, Sonal Brahmbhatt, Robert Shukin, Fan Mo, Estelle Li, Ladan Fazli, Antonio Hurtado-Coll, Edward C Jones, Yaron S Butterfield, Faraz Hach, Fereydoun Hormozdiari, Iman Hajirasouliha, Paul C Boutros, Robert G Bristow, Steven JM Jones, Martin Hirst, Marco A Marra, Christopher A Maher, Arul M Chinnaiyan, S Cenk Sahinalp, Martin E Gleave, Stanislav V Volik and Colin C Collins
Journal PapersJ Pathol. 2012 May;227(1):53-61


Next-generation sequencing is making sequence-based molecular pathology and personalized oncology viable. We selected an individual initially diagnosed with conventional but aggressive prostate adenocarcinoma and sequenced the genome and transcriptome from primary and metastatic tissues collected prior to hormone therapy. The histology-pathology and copy number profiles were remarkably homogeneous, yet it was possible to propose the quadrant of the prostate tumour that likely seeded the metastatic diaspora. Despite a homogeneous cell type, our transcriptome analysis revealed signatures of both luminal and neuroendocrine cell types. Remarkably, the repertoire of expressed but apparently private gene fusions, including C15orf21:MYC, recapitulated this biology. We hypothesize that the amplification and over-expression of the stem cell gene MSI2 may have contributed to the stable hybrid cellular identity. This hybrid luminal-neuroendocrine tumour appears to represent a novel and highly aggressive case of prostate cancer with unique biological features and, conceivably, a propensity for rapid progression to castrate-resistance. Overall, this work highlights the importance of integrated analyses of genome, exome and transcriptome sequences for basic tumour biology, sequence-based molecular pathology and personalized oncology.

Sensitive and fast mapping of di-base encoded reads

Farhad Hormozdiari, Faraz Hach S. Cenk Sahinalp, Evan E. Eichler and Can Alkan
Journal PapersBioinformatics (2011) 27 (14): 1915-1921


Motivation: Discovering variation among high-throughput sequenced genomes relies on efficient and effective mapping of sequence reads. The speed, sensitivity and accuracy of read mapping are crucial to determining the full spectrum of single nucleotide variants (SNVs) as well as structural variants (SVs) in the donor genomes analyzed.
Results: We present drFAST, a read mapper designed for di-base encoded 'color-space' sequences generated with the AB SOLiD platform. drFAST is specially designed for better delineation of structural variants, including segmental duplications, and is able to return all possible map locations and underlying sequence variation of short reads within a user-specified distance threshold. We show that drFAST is more sensitive in comparison to all commonly used aligners such as Bowtie, BFAST and SHRiMP. drFAST is also faster than both BFAST and SHRiMP and achieves a mapping speed comparable to Bowtie.
Availability: The source code for drFAST is available at

Comrad: detection of expressed rearrangements by integrated analysis of RNA-Seq and low coverage genome sequence data

Andrew McPherson, Chunxiao Wu, Iman Hajirasouliha, Fereydoun Hormozdiari, Faraz Hach, Anna Lapuk, Stanislav Volik, Sohrab Shah, Colin Collins and S. Cenk Sahinalp
Journal PapersBioinformatics (2011) 27 (11): 1481-1488.
Presentation ISMB-HitSeq 2011 (High Throughput Sequencing Analysis and Algorithms, Special Interest Group of ISMB 2011), Vienna, Austria, July 15-16 2011
Award Won Best Paper Award


Motivation: Comrad is a novel algorithmic framework for the integrated analysis of RNA-Seq and whole genome shotgun sequencing (WGSS) data for the purposes of discovering genomic rearrangements and aberrant transcripts. The Comrad framework leverages the advantages of both RNA-Seq and WGSS data, providing accurate classification of rearrangements as expressed or not expressed and accurate classification of the genomic or non-genomic origin of aberrant transcripts. A major benefit of Comrad is its ability to accurately identify aberrant transcripts and associated rearrangements using low coverage genome data. As a result, a Comrad analysis can be performed at a cost comparable to that of two RNA-Seq experiments, significantly lower than an analysis requiring high coverage genome data.
Results: We have applied Comrad to the discovery of gene fusions and read-throughs in prostate cancer cell line C4-2, a derivative of the LNCaP cell line with androgen-independent characteristics. As a proof of concept, we have rediscovered in the C4-2 data 4 of the 6 fusions previously identified in LNCaP. We also identified six novel fusion transcripts and associated genomic breakpoints, and verified their existence in LNCaP, suggesting that Comrad may be more sensitive than previous methods that have been applied to fusion discovery in LNCaP. We show that many of the gene fusions discovered using Comrad would be difficult to identify using currently available techniques.
Availability: A C++ and Perl implementation of the method demonstrated in this article is available at

Alu repeat discovery and characterization within human genomes

Fereydoun Hormozdiari, Can Alkan, Mario Ventura, Iman Hajirasouliha, Maika Malig, Faraz Hach, Deniz Yorukoglu, Phuong Dao, Marzieh Bakhshi, S. Cenk Sahinalp and Evan E. Eichler
Journal PapersGenome Res. 2011 Jun;21(6):840-9


Human genomes are now being rapidly sequenced, but not all forms of genetic variation are routinely characterized. In this study, we focus on Alu retrotransposition events and seek to characterize differences in the pattern of mobile insertion between individuals based on the analysis of eight human genomes sequenced using next-generation sequencing. Applying a rapid read-pair analysis algorithm, we discover 4342 Alu insertions not found in the human reference genome and show that 98% of a selected subset (63/64) experimentally validate. Of these new insertions, 89% correspond to AluY elements, suggesting that they arose by retrotransposition. Eighty percent of the Alu insertions have not been previously reported and more novel events were detected in Africans when compared with non-African samples (76% vs. 69%). Using these data, we develop an experimental and computational screen to identify ancestry informative Alu retrotransposition events among different human populations.

mrsFAST: a cache-oblivious algorithm for short-read mapping

Faraz Hach, Fereydoun Hormozdiari, Can Alkan, Farhad Hormozdiari, Inanc Birol, Evan E Eichler and S Cenk Sahinalp
Journal PapersNature Methods, 2010 Aug;7(8):576-7

Next-generation VariationHunter: combinatorial algorithms for transposon insertion discovery

Fereydoun Hormozdiari, Iman Hajirasouliha, Phuong Dao, Faraz Hach, Deniz Yorukoglu, Can Alkan, Evan E. Eichler and S. Cenk Sahinalp
Journal PapersBioinformatics (2010) 26 (12): i350-i357
Presentation ISMB (18th Annual International Conference Intelligent Systems for Molecular Biology), Boston, USA, July 11-13 2010


Recent years have witnessed an increase in research activity for the detection of structural variants (SVs) and their association to human disease. The advent of next-generation sequencing technologies make it possible to extend the scope of structural variation studies to a point previously unimaginable as exemplified by the 1000 Genomes Project. Although various computational methods have been described for the detection of SVs, no such algorithm is yet fully capable of discovering transposon insertions, a very important class of SVs to the study of human evolution and disease. In this article, we provide a complete and novel formulation to discover both loci and classes of transposons inserted into genomes sequenced with high-throughput sequencing technologies. In addition, we also present 'conflict resolution' improvements to our earlier combinatorial SV detection algorithm (VariationHunter) by taking the diploid nature of the human genome into consideration. We test our algorithms with simulated data from the Venter genome (HuRef) and are able to discover >85% of transposon insertion events with precision of >90%. We also demonstrate that our conflict resolution algorithm (denoted as VariationHunter-CR) outperforms current state of the art (such as original VariationHunter, BreakDancer and MoDIL) algorithms when tested on the genome of the Yoruba African individual (NA18507).
Availability: The implementation of algorithm is available at

Faster Phylogenetic Inference with MXG

David G. Mitchell, Faraz Hach, Raheleh Mohebali
Conference PapersProc of LPAR 2007. Lecture Notes in Computer Science Volume 4790, 2007, pp 423-437


We apply the logic-based declarative programming approach of Model Expansion (MX) to a phylogenetic inference task. We axiomatize the task in multi-sorted first-order logic with cardinality constraints. Using the model expansion solver MXG and SAT+cardinality solver MXC, we compare the performance of several MX axiomatizations on a challenging set of test instances. Our methods perform orders of magnitude faster than previously reported declarative solutions. Our best solution involves polynomial-time pre-processing, redundant axioms, and symmetry-breaking axioms. We also discuss our method of test instance generation, and the role of pre-processing in declarative programming.

MXG: A Model Expansion Grounder and Solver

Raheleh Mohebali, Faraz Hach, and David Mitchell,
Technical ReportsAccepted to LPAR 2007, 2007.


We describe MXG, a solver for NP search problems expressed as model expansion (MX). Problems are specified in an extension of first-order logic, and solved by grounding. That is, MXG combines a high-level specification with an instance and produces a propositional formula encoding the solutions. It calls a SAT (or extended SAT) solver to find solutions. MXG is distinguished from other grounding software in its use of a grounding algorithm based on a generalization of the relational algebra.

Model expansion as a framework for modelling and solving search problems

David Mitchell, Eugenia Ternovska, Faraz Hach, and Raheleh Mohebal
Technical ReportsTR 2006-24, School of Computing Science, Simon Fraser University, Burnaby, BC, Canada


We propose a framework for modelling and solving search problems using logic, and describe a project whose goal is to produce practically effective, general purpose tools for representing and solving search problems based on this framework. The mathematical foundation lies in the areas of finite model theory and descriptive complexity, which provide us with many classical results, as well as powerful techniques not available to many other approaches with similar goals. We describe the mathematical foundations; explain an extension to classical logic with inductive definitions that we consider central; give a summary of complexity and expressiveness properties; describe an approach to implementing solvers based on grounding; present grounding algorithms based on an extension of the relational algebra; describe an implementation of our framework which includes use of inductive definitions, sorts and order; and give experimental results comparing the performance of our implementation with ASP solvers and another solver based on the same framework.