Filter by type:

Sort by year:

Freddie: annotation-independent detection and discovery of transcriptomic alternative splicing isoforms using long-read sequencing

Baraa Orabi, Ning Xie, Brian McConeghy, Xuesen Dong, Cedric Chauve, Faraz Hach
Journal Papers Nucleic Acids Res . 2022 Dec 8;gkac1112. doi: 10.1093/nar/gkac1112. Online ahead of print.
Presentation RECOMB 2021 Regular Talk (25th Annual International Conference on Research in Computational Molecular Biology), Virtual Conference, August 29-September 1 2021


Alternative splicing (AS) is an important mechanism in the development of many cancers, as novel or aberrant AS patterns play an important role as an independent onco-driver. In addition, cancer-specific AS is potentially an effective target of personalized cancer therapeutics. However, detecting AS events remains a challenging task, especially if these AS events are novel. This is exacerbated by the fact that existing transcriptome annotation databases are far from being comprehensive, especially with regard to cancer-specific AS. Additionally, traditional sequencing technologies are severely limited by the short length of the generated reads, which rarely spans more than a single splice junction site. Given these challenges, transcriptomic long-read (LR) sequencing presents a promising potential for the detection and discovery of AS. We present Freddie, a computational annotation-independent isoform discovery and detection tool. Freddie takes as input transcriptomic LR sequencing of a sample alongside its genomic split alignment and computes a set of isoforms for the given sample. It then partitions the input reads into sets that can be processed independently and in parallel. For each partition, Freddie segments the genomic alignment of the reads into canonical exon segments. The goal of this segmentation is to be able to represent any potential isoform as a subset of these canonical exons. This segmentation is formulated as an optimization problem and is solved with a dynamic programming algorithm. Then, Freddie reconstructs the isoforms by jointly clustering and error-correcting the reads using the canonical segmentation as a succinct representation. The clustering and error-correcting step is formulated as an optimization problem-the Minimum Error Clustering into Isoforms (MErCi) problem-and is solved using integer linear programming (ILP). We compare the performance of Freddie on simulated datasets with other isoform detection tools with varying dependence on annotation databases. We show that Freddie outperforms the other tools in its accuracy, including those given the complete ground truth annotation. We also run Freddie on a transcriptomic LR dataset generated in-house from a prostate cancer cell line with a matched short-read RNA-seq dataset. Freddie results in isoforms with a higher short-read cross-validation rate than the other tested tools.

Genetic determinants of chromatin reveal prostate cancer risk mediated by context-dependent gene regulation

Sylvan C Baca, Cassandra Singler, Soumya Zacharia, Ji-Heui Seo, Tunc Morova, Faraz Hach, Yi Ding, Tommer Schwarz, Chia-Chi Flora Huang, Jacob Anderson, André P Fay, Cynthia Kalita, Stefan Groha, Mark M Pomerantz, Victoria Wang, Simon Linder, Christopher J Sweeney, Wilbert Zwart, Nathan A Lack, Bogdan Pasaniuc, David Y Takeda, Alexander Gusev, Matthew L Freedman
Journal PapersNat Genet . 2022 Sep;54(9):1364-1375. doi: 10.1038/s41588-022-01168-y. Epub 2022 Sep 7


Many genetic variants affect disease risk by altering context-dependent gene regulation. Such variants are difficult to study mechanistically using current methods that link genetic variation to steady-state gene expression levels, such as expression quantitative trait loci (eQTLs). To address this challenge, we developed the cistrome-wide association study (CWAS), a framework for identifying genotypic and allele-specific effects on chromatin that are also associated with disease. In prostate cancer, CWAS identified regulatory elements and androgen receptor-binding sites that explained the association at 52 of 98 known prostate cancer risk loci and discovered 17 additional risk loci. CWAS implicated key developmental transcription factors in prostate cancer risk that are overlooked by eQTL-based approaches due to context-dependent gene regulation. We experimentally validated associations and demonstrated the extensibility of CWAS to additional epigenomic datasets and phenotypes, including response to prostate cancer treatment. CWAS is a powerful and biologically interpretable paradigm for studying variants that influence traits by affecting transcriptional regulation.

Fast and accurate matching of cellular barcodes across short-reads and long-reads of single-cell RNA-seq experiments

Ghazal Ebrahimi, Baraa Orabi, Meghan Robinson, Cedric Chauve, Ryan Flannigan, Faraz Hach
Journal PapersiScience . 2022 Jun 7;25(7):104530.
Presentation RECOMB-Seq 2022 Regular Talk (The 12th RECOMB Satellite Workshop on Massively Parallel Sequencing), La Jolla, USA, May 20-21 2022


Single-cell RNA sequencing allows for characterizing the gene expression landscape at the cell type level. However, because of its use of short-reads, it is severely limited at detecting full-length features of transcripts such as alternative splicing. New library preparation techniques attempt to extend single-cell sequencing by utilizing both long-reads and short-reads. These techniques split the library material, after it is tagged with cellular barcodes, into two pools: one for short-read sequencing and one for long-read sequencing. However, the challenge of utilizing these techniques is that they require matching the cellular barcodes sequenced by the erroneous long-reads to the cellular barcodes detected by the short-reads. To overcome this challenge, we introduce scTagger, a computational method to match cellular barcodes data from long-reads and short-reads. We tested scTagger against another state-of-the-art tool on both real and simulated datasets, and we demonstrate that scTagger has both significantly better accuracy and time efficiency.

Fast characterization of segmental duplication structure in multiple genome assemblies

Hamza Išerić, Can Alkan, Faraz Hach, Ibrahim Numanagić
Journal PapersAlgorithms Mol Biol . 2022 Mar 18;17(1):4.
Presentation WABI 2021 Regular Talk (Workshop on Algorithms in Bioinformatics), Virtual Conference, August 2-4 2021


Motivation. The increasing availability of high-quality genome assemblies raised interest in the characterization of genomic architecture. Major architectural elements, such as common repeats and segmental duplications (SDs), increase genome plasticity that stimulates further evolution by changing the genomic structure and inventing new genes. Optimal computation of SDs within a genome requires quadratic-time local alignment algorithms that are impractical due to the size of most genomes. Additionally, to perform evolutionary analysis, one needs to characterize SDs in multiple genomes and find relations between those SDs and unique (non-duplicated) segments in other genomes. A naïve approach consisting of multiple sequence alignment would make the optimal solution to this problem even more impractical. Thus there is a need for fast and accurate algorithms to characterize SD structure in multiple genome assemblies to better understand the evolutionary forces that shaped the genomes of today.
Results. Here we introduce a new approach, BISER, to quickly detect SDs in multiple genomes and identify elementary SDs and core duplicons that drive the formation of such SDs. BISER improves earlier tools by (i) scaling the detection of SDs with low homology to multiple genomes while introducing further 7-33 speed-ups over the existing tools, and by (ii) characterizing elementary SDs and detecting core duplicons to help trace the evolutionary history of duplications to as far as 300 million years.

Genion, an accurate tool to detect gene fusion from long transcriptomics reads

Fatih Karaoglanoglu, Cedric Chauve, Faraz Hach
Journal PapersBMC Genomics . 2022 Feb 14;23(1):129.


Background. The advent of next-generation sequencing technologies empowered a wide variety of transcriptomics studies. A widely studied topic is gene fusion which is observed in many cancer types and suspected of having oncogenic properties. Gene fusions are the result of structural genomic events that bring two genes closely located and result in a fused transcript. This is different from fusion transcripts created during or after the transcription process. These chimeric transcripts are also known as read-through and trans-splicing transcripts. Gene fusion discovery with short reads is a well-studied problem, and many methods have been developed. But the sensitivity of these methods is limited by the technology, especially the short read length. Advances in long-read sequencing technologies allow the generation of long transcriptomics reads at a low cost. Transcriptomic long-read sequencing presents unique opportunities to overcome the shortcomings of short-read technologies for gene fusion detection while introducing new challenges.
Results. We present Genion, a sensitive and fast gene fusion detection method that can also detect read-through events. We compare Genion against a recently introduced long-read gene fusion discovery method, LongGF, both on simulated and real datasets. On simulated data, Genion accurately identifies the gene fusions and its clustering accuracy for detecting fusion reads is better than LongGF. Furthermore, our results on the breast cancer cell line MCF-7 show that Genion correctly identifies all the experimentally validated gene fusions.

An androgen receptor switch underlies lineage infidelity in treatment-resistant prostate cancer

Alastair Davies, Shaghayegh Nouruzi, Dwaipayan Ganguli, Takeshi Namekawa, Daksh Thaper, Simon Linder, Fatih Karaoğlanoğlu, Meltem E Omur, Soojin Kim, Maxim Kobelev, Sahil Kumar, Olena Sivak, Chiara Bostock, Jennifer Bishop, Marlous Hoogstraat, Amina Talal, Suzan Stelloo, Henk van der Poel, Andries M Bergman, Musaddeque Ahmed, Ladan Fazli, Haojie Huang, Wayne Tilley, David Goodrich, Felix Y Feng, Martin Gleave, Housheng Hansen He, Faraz Hach, Wilbert Zwart, Himisha Beltran, Luke Selth, Amina Zoubeidi
Journal PapersNat Cell Biol . 2021 Sep;23(9):1023-1034.


Cancers adapt to increasingly potent targeted therapies by reprogramming their phenotype. Here we investigated such a phenomenon in prostate cancer, in which tumours can escape epithelial lineage confinement and transition to a high-plasticity state as an adaptive response to potent androgen receptor (AR) antagonism. We found that AR activity can be maintained as tumours adopt alternative lineage identities, with changes in chromatin architecture guiding AR transcriptional rerouting. The epigenetic regulator enhancer of zeste homologue 2 (EZH2) co-occupies the reprogrammed AR cistrome to transcriptionally modulate stem cell and neuronal gene networks—granting privileges associated with both fates. This function of EZH2 was associated with T350 phosphorylation and establishment of a non-canonical polycomb subcomplex. Our study provides mechanistic insights into the plasticity of the lineage-infidelity state governed by AR reprogramming that enabled us to redirect cell fate by modulating EZH2 and AR, highlighting the clinical potential of reversing resistance phenotypes.

Functional mapping of androgen receptor enhancer activity

Chia-Chi Flora Huang, Shreyas Lingadahalli, Tunc Morova, Dogancan Ozturan, Eugene Hu, Ivan Pak Lok Yu, Simon Linder, Marlous Hoogstraat, Suzan Stelloo, Funda Sar, Henk van der Poel, Umut Berkay Altintas, Mohammadali Saffarzadeh, Stephane Le Bihan, Brian McConeghy, Bengul Gokbayrak, Felix Y Feng, Martin E Gleave, Andries M Bergman, Colin Collins, Faraz Hach, Wilbert Zwart, Eldon Emberly, Nathan A Lack
Journal PapersGenome Biol . 2021 May 11;22(1):149.


Background. Androgen receptor (AR) is critical to the initiation, growth, and progression of prostate cancer. Once activated, the AR binds to cis-regulatory enhancer elements on DNA that drive gene expression. Yet, there are 10-100× more binding sites than differentially expressed genes. It is unclear how or if these excess binding sites impact gene transcription.
Results. To characterize the regulatory logic of AR-mediated transcription, we generated a locus-specific map of enhancer activity by functionally testing all common clinical AR binding sites with Self-Transcribing Active Regulatory Regions sequencing (STARRseq). Only 7% of AR binding sites displayed androgen-dependent enhancer activity. Instead, the vast majority of AR binding sites were either inactive or constitutively active enhancers. These annotations strongly correlated with enhancer-associated features of both in vitro cell lines and clinical prostate cancer samples. Evaluating the effect of each enhancer class on transcription, we found that AR-regulated enhancers frequently interact with promoters and form central chromosomal loops that are required for transcription. Somatic mutations of these critical AR-regulated enhancers often impact enhancer activity.
Conclusion. Using a functional map of AR enhancer activity, we demonstrated that AR-regulated enhancers act as a regulatory hub that increases interactions with other AR binding sites and gene promoters.

Detecting High Scoring Local Alignments in Pangenome Graphs

Tizian Schulz, Roland Wittler, Sven Rahmann, Faraz Hach, Jens Stoye
Journal PapersBioinformatics . 2021 Aug 15;37(16):2266-2274.


Motivation. Increasing amounts of individual genomes sequenced per species motivate the usage of pangenomic approaches. Pangenomes may be represented as graphical structures, e.g. compacted colored de Bruijn graphs, which offer a low memory usage and facilitate reference-free sequence comparisons. While sequence-to-graph mapping to graphical pangenomes has been studied for some time, no local alignment search tool in the vein of BLAST has been proposed yet.
Results. We present a new heuristic method to find maximum scoring local alignments of a DNA query sequence to a pangenome represented as a compacted colored de Bruijn graph. Our approach additionally allows a comparison of similarity among sequences within the pangenome. We show that local alignment scores follow an exponential-tail distribution similar to BLAST scores, and we discuss how to estimate its parameters to separate local alignments representing sequence homology from spurious findings. An implementation of our method is presented, and its performance and usability are shown. Our approach scales sublinearly in running time and memory usage with respect to the number of genomes under consideration. This is an advantage over classical methods that do not make use of sequence similarity within the pangenome.

HASLR: Fast Hybrid Assembly of Long Reads

Ehsan Haghshenas, Hossein Asghari, Jens Stoye, Cedric Chauve, and Faraz Hach
Journal Papers iScience. 2020 Aug 21;23(8).
Presentation RECOMB-Seq 2020 Regular Talk (10th Annual RECOMB Satellite Workshop on Massively Parallel Sequencing), Virtual Conference, June 26-27 2020


Third-generation sequencing technologies from companies such as Oxford Nanopore and Pacific Biosciences have paved the way for building more contiguous and potentially gap-free assemblies. The larger effective length of their reads has provided a means to overcome the challenges of short to mid-range repeats. Currently, accurate long read assemblers are computationally expensive, whereas faster methods are not as accurate. Moreover, despite recent advances in third-generation sequencing, researchers still tend to generate accurate short reads for many of the analysis tasks. Here, we present HASLR, a hybrid assembler that uses error-prone long reads together with high-quality short reads to efficiently generate accurate genome assemblies. Our experiments show that HASLR is not only the fastest assembler but also the one with the lowest number of misassemblies on most of the samples, while being on par with other assemblers in terms of contiguity and accuracy.
Availability: HASLR is an open source tool available at

CircMiner: Accurate and Rapid Detection of Circular RNA through Splice-Aware Pseudo-Alignment Scheme

Hossein Asghari, Yen-Yi Lin, Yang Xu, Ehsan Haghshenas, Colin C Collins, and Faraz Hach
Journal Papers Bioinformatics. 2020 Apr 7;-. doi:


Motivation: The ubiquitous abundance of circular RNAs (circRNAs) has been revealed by performing high-throughput sequencing in a variety of eukaryotes. circRNAs are related to some diseases such as cancer in which they act as oncogenes or tumor-suppressors, and therefore have the potential to be used as biomarkers or therapeutic targets. Accurate and rapid detection of circRNAs from short reads remains computationally challenging. This is due to the fact that identifying chimeric reads, which is essential for finding back-splice junctions, is a complex process. The sensitivity of discovery methods, to a high degree, relies on the underlying mapper that is used for finding chimeric reads. Furthermore, all the available circRNA discovery pipelines are resource intensive.
Results: We introduce CircMiner, a novel stand-alone circRNA detection method that rapidly identifies and filters out linear RNA-Seq reads and detects back-splice junctions. CircMiner employs a rapid pseudoalignment technique to identify linear reads that originate from transcripts, genes, or the genome. CircMiner further processes the remaining reads to identify the back-splice junctions and detect circRNAs with single-nucleotide resolution. We evaluated the efficacy of CircMiner using simulated datasets generated from known back-splice junctions and showed that CircMiner has superior accuracy and speed compared to the existing circRNA detection tools. Additionally, on two RNase R treated cell line datasets, CircMiner was able to detect most of consistent, high confidence circRNAs compared to untreated samples of the same cell line.
Availability: CircMiner is implemented in C++ and is available online at

Identification of gene signature for treatment response to guide precision oncology in clear-cell renal cell carcinoma

Ninadh M D'Costa, Davide Cina, Raunak Shrestha, Robert H Bell, Yen-Yi Lin, Hossein Asghari, Cesar U Monjaras-Avila, Christian Kollmannsberger, Faraz Hach, Claudia I Chavez-Munoz, Alan I So
Journal Papers Sci Rep . 2020 Feb 6;10(1):2026.


Clear-cell renal cell carcinoma (ccRCC) is a common therapy resistant disease with aberrant angiogenic and immunosuppressive features. Patients with metastatic disease are treated with targeted therapies based on clinical features: low-risk patients are usually treated with anti-angiogenic drugs and intermediate/high-risk patients with immune therapy. However, there are no biomarkers available to guide treatment choice for these patients. A recently published phase II clinical trial observed a correlation between ccRCC patients’ clustering and their response to targeted therapy. However, the clustering of these groups was not distinct. Here, we analyzed the gene expression profile of 469 ccRCC patients, using featured selection technique, and have developed a refined 66-gene signature for improved sub-classification of patients. Moreover, we have identified a novel comprehensive expression profile to distinguish between migratory stromal and immune cells. Furthermore, the proposed 66-gene signature was validated using a different cohort of 64 ccRCC patients. These findings are foundational for the development of reliable biomarkers that may guide treatment decision-making and improve therapy response in ccRCC patients.

PhISCS: a combinatorial approach for subperfect tumor phylogeny reconstruction via integrative use of single-cell and bulk sequencing data

Salem Malikic, Farid Rashidi Mehrabadi, Simone Ciccolella, Md. Khaledur Rahman, Camir Ricketts, Ehsan Haghshenas, Daniel Seidman, Faraz Hach , Iman Hajirasouliha, and S. Cenk Sahinalp
Journal PapersGenome Res. 2019 Nov;29(11):1860-1877. doi: 10.1101/gr.234435.118
Presentation RECOMB 2020 Highlight Talk (24TH Internation Conference on Research in Computationl Molecular Biology), Virtual Conference, June 22-25 2020


Available computational methods for tumor phylogeny inference via single-cell sequencing (SCS) data typically aim to identify the most likely perfect phylogeny tree satisfying the infinite sites assumption (ISA). However, the limitations of SCS technologies including frequent allele dropout and variable sequence coverage may prohibit a perfect phylogeny. In addition, ISA violations are commonly observed in tumor phylogenies due to the loss of heterozygosity, deletions, and convergent evolution. In order to address such limitations, we introduce the optimal subperfect phylogeny problem which asks to integrate SCS data with matching bulk sequencing data by minimizing a linear combination of potential false negatives (due to allele dropout or variance in sequence coverage), false positives (due to read errors) among mutation calls, and the number of mutations that violate ISA (real or because of incorrect copy number estimation). We then describe a combinatorial formulation to solve this problem which ensures that several lineage constraints imposed by the use of variant allele frequencies (VAFs, derived from bulk sequence data) are satisfied. We express our formulation both in the form of an integer linear program (ILP) and—as a first in tumor phylogeny reconstruction—a Boolean constraint satisfaction problem (CSP) and solve them by leveraging state-of-the-art ILP/CSP solvers. The resulting method, which we name PhISCS, is the first to integrate SCS and bulk sequencing data while accounting for ISA violating mutations. In contrast to the alternative methods, typically based on probabilistic approaches, PhISCS provides a guarantee of optimality in reported solutions. Using simulated and real data sets, we demonstrate that PhISCS is more general and accurate than all available approaches.

Alignment-free Clustering of UMI Tagged DNA Molecules

Baraa Orabi, Emre Erhan, Brian McConeghy, Stanislav V Volik, Stephane Le Bihan, Robert Bell, Colin C Collins, Cedric Chauve and Faraz Hach
Journal Papers Bioinformatics. 2019 Jun 1;35(11):1829-1836. doi: 10.1093/bioinformatics/bty888.
Presentation RECOMB-Seq 2018 Regular Talk (8th Annual RECOMB Satellite Workshop on Massively Parallel Sequencing), Paris, France, April 19-20 2018


Motivation: Next Generation Sequencing has led to the availability of massive genomic datasets whose processing raises many challenges, including the handling of sequencing errors. This is especially pertinent in cancer genomics, for example, for detecting low allele frequency variations from circulating tumour DNA. Barcode tagging of DNA molecules with Unique Molecular Identifiers (UMI) attempts to mitigate sequencing errors; UMI tagged molecules are PCR amplified, and the PCR copies of UMI tagged molecules are sequenced independently. However, the PCR and sequencing steps can generate errors in the sequenced reads that can be located in the barcode and/or the DNA sequence. Analyzing UMI tagged sequencing data requires an initial clustering step, with the aim of grouping reads sequenced from PCR duplicates of the same UMI tagged molecule into a single cluster, and the size of the current datasets requires this clustering process to be resource-efficient.
Results: We introduce Calib, a computational tool that clusters paired-end reads from UMI tagged sequencing experiments using substitution-error-dominant sequencing platforms such as Illumina. Calib clusters are defined as connected components of a graph whose edges are defined in terms of both barcode similarity and read sequence similarity. The graph is constructed efficiently using locality sensitive hashing and MinHashing techniques. Calib's default clustering parameters are optimized empirically, for different UMI and read lengths, using a simulation module that is packaged with Calib. Compared to other tools, Calib has the best accuracy on simulated data, while maintaining reasonable runtime and memory footprint. On a real dataset, Calib runs with far less resources than alignment-based methods, and its clusters reduce the number of likely false positive in downstream mutation calling.
Availability: Calib is implemented in C++ and its simulation module is implemented in Python. Calib is available on our GitHub repository at

Structural variation and fusion detection using targeted sequencing data from circulating cell free DNA

Alexander R Gawroński, Yen-Yi Lin, Brian McConeghy, Stephane LeBihan, Hossein Asghari, Can Koçkan, Baraa Orabi, Nabil Adra, Roberto Pili, Colin C Collins, S Cenk Sahinalp and Faraz Hach
Journal Papers Nucleic Acids Res. 2019 Apr 23;47(7):e38. doi: 10.1093/nar/gkz067.


Motivation: Cancer is a complex disease that involves rapidly evolving cells, often forming multiple distinct clones. In order to effectively understand progression of a patient-specific tumor, one needs to comprehensively sample tumor DNA at multiple time points, ideally obtained through inexpensive and minimally invasive techniques. Current sequencing technologies make the 'liquid biopsy' possible, which involves sampling a patient's blood or urine and sequencing the circulating cell free DNA (cfDNA). A certain percentage of this DNA originates from the tumor, known as circulating tumor DNA (ctDNA). The ratio of ctDNA may be extremely low in the sample, and the ctDNA may originate from multiple tumors or clones. These factors present unique challenges for applying existing tools and workflows to the analysis of ctDNA, especially in the detection of structural variations which rely on sufficient read coverage to be detectable.
Results: Here we introduce SViCT , a structural variation (SV) detection tool designed to handle the challenges associated with cfDNA analysis. SViCT can detect breakpoints and sequences of various structural variations including deletions, insertions, inversions, duplications and translocations. SViCT extracts discordant read pairs, one-end anchors and soft-clipped/split reads, assembles them into contigs, and re-maps contig intervals to a reference genome using an efficient k-mer indexing approach. The intervals are then joined using a combination of graph and greedy algorithms to identify specific structural variant signatures. We assessed the performance of SViCT and compared it to state-of-the-art tools using simulated cfDNA datasets with properties matching those of real cfDNA samples. The positive predictive value and sensitivity of our tool was superior to all the tested tools and reasonable performance was maintained down to the lowest dilution of 0.01% tumor DNA in simulated datasets. Additionally, SViCT was able to detect all known SVs in two real cfDNA reference datasets (at 0.6-5% ctDNA) and predict a novel structural variant in a prostate cancer cohort.
Availability: SViCT is available at

BAP1 haploinsufficiency predicts a distinct immunogenic class of malignant peritoneal mesothelioma

Raunak Shrestha, Noushin Nabavi, Yen-Yi Lin, Fan Mo, Shawn Anderson, Stanislav Volik, Hans H. Adomat, Dong Lin, Hui Xue, Xin Dong, Robert Shukin, Robert H. Bell, Brian McConeghy, Anne Haegert, Sonal Brahmbhatt, Estelle Li, Htoo Zarni Oo, Antonio Hurtado-Coll, Ladan Fazli, Joshua Zhou, Yarrow McConnell, Andrea McCart, Andrew Lowy, Gregg B. Morin, Tianhui Chen, Mads Daugaard, S. Cenk Sahinalp, Faraz Hach, Stephane Le Bihan, Martin E. Gleave, Yuzhuo Wang, Andrew Churg and Colin C. Collins
Journal Papers Genome Med. 2019 Feb 18;11(1):8. doi: 10.1186/s13073-019-0620-3.


Backgroung: Malignant peritoneal mesothelioma (PeM) is a rare and fatal cancer that originates from the peritoneal lining of the abdomen. Standard treatment of PeM is limited to cytoreductive surgery and/or chemotherapy, and no effective targeted therapies for PeM exist. Some immune checkpoint inhibitor studies of mesothelioma have found positivity to be associated with a worse prognosis.
Methods: o search for novel therapeutic targets for PeM, we performed a comprehensive integrative multi-omics analysis of the genome, transcriptome, and proteome of 19 treatment-naïve PeM, and in particular, we examined BAP1 mutation and copy number status and its relationship to immune checkpoint inhibitor activation.
Results: We found that PeM could be divided into tumors with an inflammatory tumor microenvironment and those without and that this distinction correlated with haploinsufficiency of BAP1. To further investigate the role of BAP1, we used our recently developed cancer driver gene prioritization algorithm, HIT'nDRIVE, and observed that PeM with BAP1 haploinsufficiency form a distinct molecular subtype characterized by distinct gene expression patterns of chromatin remodeling, DNA repair pathways, and immune checkpoint receptor activation. We demonstrate that this subtype is correlated with an inflammatory tumor microenvironment and thus is a candidate for immune checkpoint blockade therapies.
Conclusions: Our findings reveal BAP1 to be a potential, easily trackable prognostic and predictive biomarker for PeM immunotherapy that refines PeM disease classification. BAP1 stratification may improve drug response rates in ongoing phases I and II clinical trials exploring the use of immune checkpoint blockade therapies in PeM in which BAP1 status is not considered. This integrated molecular characterization provides a comprehensive foundation for improved management of a subset of PeM patients.

lordFAST: sensitive and Fast Alignment Search Tool for LOng noisy Read sequencing Data.

Ehsan Haghshenas, S Cenk Sahinalp and Faraz Hach
Journal Papers Bioinformatics. 2019 Jan 1;35(1):20-27. doi: 10.1093/bioinformatics/bty544.


Motivation: Recent advances in genomics and precision medicine have been made possible through the application of high throughput sequencing (HTS) to large collections of human genomes. Although HTS technologies have proven their use in cataloging human genome variation, computational analysis of the data they generate is still far from being perfect. The main limitation of Illumina and other popular sequencing technologies is their short read length relative to the lengths of (common) genomic repeats. Newer (single molecule sequencing - SMS) technologies such as Pacific Biosciences and Oxford Nanopore are producing longer reads, making it theoretically possible to overcome the difficulties imposed by repeat regions. Unfortunately, because of their high sequencing error rate, reads generated by these technologies are very difficult to work with and cannot be used in many of the standard downstream analysis pipelines. Note that it is not only difficult to find the correct mapping locations of such reads in a reference genome, but also to establish their correct alignment so as to differentiate sequencing errors from real genomic variants. Furthermore, especially since newer SMS instruments provide higher throughput, mapping and alignment need to be performed much faster than before, maintaining high sensitivity.
Results: We introduce lordFAST, a novel long-read mapper that is specifically designed to align reads generated by PacBio and potentially other SMS technologies to a reference. lordFAST not only has higher sensitivity than the available alternatives, it is also among the fastest and has a very low memory footprint.
Availability and Implementation: lordFAST is implemented in C++ and supports multi-threading. The source code of lordFAST is available at

Fast characterization of segmental duplications in genome assemblies

Ibrahim Numanagić, Alim S Gökkaya, Lillian Zhang, Bonnie Berger, Can Alkan and Faraz Hach
Journal Papers Bioinformatics. 2018 Sep 1;34(17):i706-i714
Presentation ECCB 2018 (17TH European Conference on Computational Biology), Athens, Greece, September 8-12 2018


Motivation: Segmental duplications (SDs) or low-copy repeats, are segments of DNA > 1 Kbp with high sequence identity that are copied to other regions of the genome. SDs are among the most important sources of evolution, a common cause of genomic structural variation and several are associated with diseases of genomic origin including schizophrenia and autism. Despite their functional importance, SDs present one of the major hurdles for de novo genome assembly due to the ambiguity they cause in building and traversing both state-of-the-art overlap-layout-consensus and de Bruijn graphs. This causes SD regions to be misassembled, collapsed into a unique representation, or completely missing from assembled reference genomes for various organisms. In turn, this missing or incorrect information limits our ability to fully understand the evolution and the architecture of the genomes. Despite the essential need to accurately characterize SDs in assemblies, there has been only one tool that was developed for this purpose, called Whole-Genome Assembly Comparison (WGAC); its primary goal is SD detection. WGAC is comprised of several steps that employ different tools and custom scripts, which makes this strategy difficult and time consuming to use. Thus there is still a need for algorithms to characterize within-assembly SDs quickly, accurately, and in a user friendly manner.
Results: Here we introduce SEgmental Duplication Evaluation Framework (SEDEF) to rapidly detect SDs through sophisticated filtering strategies based on Jaccard similarity and local chaining. We show that SEDEF accurately detects SDs while maintaining substantial speed up over WGAC that translates into practical run times of minutes instead of weeks. Notably, our algorithm captures up to 25% 'pairwise error' between segments, whereas previous studies focused on only 10%, allowing us to more deeply track the evolutionary history of the genome.
Availability and Implementation: SEDEF is available at

Dynamic Alignment-Free and Reference-Free Read Compression

Guillaume Holley, Roland Wittler, Jens Stoye and Faraz Hach
Journal Papers J Comput Biol. 2018 Jul;25(7):825-836.


The advent of high throughput sequencing (HTS) technologies raises a major concern about storage and transmission of data produced by these technologies. In particular, large-scale sequencing projects generate an unprecedented volume of genomic sequences ranging from tens to several thousands of genomes per species. These collections contain highly similar and redundant sequences, also known as pangenomes. The ideal way to represent and transfer pangenomes is through compression. A number of HTS-specific compression tools have been developed to reduce the storage and communication costs of HTS data, yet none of them is designed to process a pangenome. In this article, we present dynamic alignment-free and reference-free read compression (DARRC), a new alignment-free and reference-free compression method. It addresses the problem of pangenome compression by encoding the sequences of a pangenome as a guided de Bruijn graph. The novelty of this method is its ability to incrementally update DARRC archives with new genome sequences without full decompression of the archive. DARRC can compress both single-end and paired-end read sequences of any length using all symbols of the IUPAC nucleotide code. On a large Pseudomonas aeruginosa data set, our method outperforms all other tested tools. It provides a 30% compression ratio improvement in single-end mode compared with the best performing state-of-the-art HTS-specific compression method in our experiments.

Computational identification of micro-structural variations and their proteogenomic consequences in cancer.

Yen-Yi Lin, Alexander Gawronski, Faraz Hach, Sujun Li, Ibrahim Numanagić, Iman Sarrafi, Swati Mishra, Andrew McPherson, Colin C Collins, Milan Radovich, Haixu Tang and S Cenk Sahinalp
Journal Papers Bioinformatics. 2018 May 15;34(10):1672-1681


Purpose: Rapid advancement in high throughput genome and transcriptome sequencing (HTS) and mass spectrometry (MS) technologies has enabled the acquisition of the genomic, transcriptomic and proteomic data from the same tissue sample. We introduce a computational framework, ProTIE, to integratively analyze all three types of omics data for a complete molecular profile of a tissue sample. Our framework features MiStrVar, a novel algorithmic method to identify micro structural variants (microSVs) on genomic HTS data. Coupled with deFuse, a popular gene fusion detection method we developed earlier, MiStrVar can accurately profile structurally aberrant transcripts in tumors. Given the breakpoints obtained by MiStrVar and deFuse, our framework can then identify all relevant peptides that span the breakpoint junctions and match them with unique proteomic signatures. Observing structural aberrations in all three types of omics data validates their presence in the tumor samples
Results: We have applied our framework to all The Cancer Genome Atlas (TCGA) breast cancer Whole Genome Sequencing (WGS) and/or RNA-Seq datasets, spanning all four major subtypes, for which proteomics data from Clinical Proteomic Tumor Analysis Consortium (CPTAC) have been released. A recent study on this dataset focusing on SNVs has reported many that lead to novel peptides. Complementing and significantly broadening this study, we detected 244 novel peptides from 432 candidate genomic or transcriptomic sequence aberrations. Many of the fusions and microSVs we discovered have not been reported in the literature. Interestingly, the vast majority of these translated aberrations, fusions in particular, were private, demonstrating the extensive inter-genomic heterogeneity present in breast cancer. Many of these aberrations also have matching out-of-frame downstream peptides, potentially indicating novel protein sequence and structure.
Availability and Implementation: MiStrVar is available for download at, and ProTIE is available at

Mutational Analysis of Gene Fusions Predicts Novel MHC Class I-Restricted T-Cell Epitopes and Immune Signatures in a Subset of Prostate Cancer.

Jennifer L. Kalina , David S. Neilson, Yen-Yi Lin, Phineas T. Hamilton, Alexandra P. Comber, Emma M.H. Loy, S. Cenk Sahinalp, Colin C. Collins, Faraz Hach and Julian J. Lum
Journal Papers Clin Cancer Res. 2017 Dec 15;23(24):7596-7607.


Purpose: Gene fusions are frequently found in prostate cancer and may result in the formation of unique chimeric amino acid sequences (CASQ) that span the breakpoint of two fused gene products. This study evaluated the potential for fusion-derived CASQs to be a source of tumor neoepitopes, and determined their relationship to patterns of immune signatures in prostate cancer patients.Experimental Design: A computational strategy was used to identify CASQs and their corresponding predicted MHC class I epitopes using RNA-Seq data from The Cancer Genome Atlas of prostate tumors. In vitro peptide-specific T-cell expansion was performed to identify CASQ-reactive T cells. A multivariate analysis was used to relate patterns of in silico-predicted tumor-infiltrating immune cells with prostate tumors harboring these mutational events.
Results: Eighty-seven percent of tumors contained gene fusions with a mean of 12 per tumor. In total, 41% of fusion-positive tumors were found to encode CASQs. Within these tumors, 87% gave rise to predicted MHC class I-binding epitopes. This observation was more prominent when patients were stratified into low- and intermediate/high-risk categories. One of the identified CASQ from the recurrent TMPRSS2:ERG type VI fusion contained several high-affinity HLA-restricted epitopes. These peptides bound HLA-A*02:01 in vitro and were recognized by CD8+ T cells. Finally, the presence of fusions and CASQs were associated with expression of immune cell infiltration.
Conclusions: Mutanome analysis of gene fusion-derived CASQs can give rise to patient-specific predicted neoepitopes. Moreover, these fusions predicted patterns of immune cell infiltration within a subgroup of prostate cancer patients.

Discovery and genotyping of novel sequence insertions in many sequenced individuals.

Pinar Kavak, YenYi Lin, Ibrahim Numanagic, Hossein Asghari, Tunga Güngör, Can Alkan and Faraz Hach.
Journal PapersBioinformatics. 2017 Jul 15;33(14):i161-i169
Presentation ISMB 2017 (25th Conference on Intelligent Systems for Molecular Biology and 16th European Conference on Computational Biology), Prague, Czech Republic, July 21-25 2017


Motivation: Despite recent advances in algorithms design to characterize structural variation using high-throughput short read sequencing (HTS) data, characterization of novel sequence insertions longer than the average read length remains a challenging task. This is mainly due to both computational difficulties and the complexities imposed by genomic repeats in generating reliable assemblies to accurately detect both the sequence content and the exact location of such insertions. Additionally, de novo genome assembly algorithms typically require a very high depth of coverage, which may be a limiting factor for most genome studies. Therefore, characterization of novel sequence insertions is not a routine part of most sequencing projects.
Results: Here, we present Pamir, a new algorithm to efficiently and accurately discover and genotype novel sequence insertions using either single or multiple genome sequencing datasets. Pamir is able to detect breakpoint locations of the insertions and calculate their zygosity (i.e. heterozygous versus homozygous) by analyzing multiple sequence signatures, matching one-end-anchored sequences to small-scale de novo assemblies of unmapped reads, and conducting strand-aware local assembly. We test the efficacy of Pamir on both simulated and real data, and demonstrate its potential use in accurate and routine identification of novel sequence insertions in genome projects.
Availability and implementation: Pamir is available at

SiNVICT: ultra-sensitive detection of single nucleotide variants and indels in circulating tumour DNA.

Can Kockan, Faraz Hach, Iman Sarrafi, Robert H. Bell, Brian McConeghy, Kevin Beja, Anne Haegert, Alexander W. Wyatt, Stanislav V. Volik, Kim N. Chi, Colin C. Collins and S. Cenk Sahinalp
Journal Papers Bioinformatics. 2017 Jan 1;33(1):26-34


Motivation: Successful development and application of precision oncology approaches require robust elucidation of the genomic landscape of a patient's cancer and, ideally, the ability to monitor therapy-induced genomic changes in the tumour in an inexpensive and minimally invasive manner. Thanks to recent advances in sequencing technologies, 'liquid biopsy', the sampling of patient's bodily fluids such as blood and urine, is considered as one of the most promising approaches to achieve this goal. In many cancer patients, and especially those with advanced metastatic disease, deep sequencing of circulating cell free DNA (cfDNA) obtained from patient's blood yields a mixture of reads originating from the normal DNA and from multiple tumour subclones - called circulating tumour DNA or ctDNA. The ctDNA/cfDNA ratio as well as the proportion of ctDNA originating from specific tumour subclones depend on multiple factors, making comprehensive detection of mutations difficult, especially at early stages of cancer. Furthermore, sensitive and accurate detection of single nucleotide variants (SNVs) and indels from cfDNA is constrained by several factors such as the sequencing errors and PCR artifacts, and mapping errors related to repeat regions within the genome. In this article, we introduce SiNVICT, a computational method that increases the sensitivity and specificity of SNV and indel detection at very low variant allele frequencies. SiNVICT has the capability to handle multiple sequencing platforms with different error properties; it minimizes false positives resulting from mapping errors and other technology specific artifacts including strand bias and low base quality at read ends. SiNVICT also has the capability to perform time-series analysis, where samples from a patient sequenced at multiple time points are jointly examined to report locations of interest where there is a possibility that certain clones were wiped out by some treatment while some subclones gained selective advantage.
Results: We tested SiNVICT on simulated data as well as prostate cancer cell lines and cfDNA obtained from castration-resistant prostate cancer patients. On both simulated and biological data, SiNVICT was able to detect SNVs and indels with variant allele percentages as low as 0.5%. The lowest amounts of total DNA used for the biological data where SNVs and indels could be detected with very high sensitivity were 2.5 ng on the Ion Torrent platform and 10 ng on Illumina. With increased sequencing and mapping accuracy, SiNVICT might be utilized in clinical settings, making it possible to track the progress of point mutations and indels that are associated with resistance to cancer therapies and provide patients personalized treatment. We also compared SiNVICT with other popular SNV callers such as MuTect, VarScan2 and Freebayes. Our results show that SiNVICT performs better than these tools in most cases and allows further data exploration such as time-series analysis on cfDNA sequencing data.
Results: We tested SiNVICT on simulated data as well as prostate cancer cell lines and cfDNA obtained from castration-resistant prostate cancer patients. On both simulated and biological data, SiNVICT was able to detect SNVs and indels with variant allele percentages as low as 0.5%. The lowest amounts of total DNA used for the biological data where SNVs and indels could be detected with very high sensitivity were 2.5 ng on the Ion Torrent platform and 10 ng on Illumina. With increased sequencing and mapping accuracy, SiNVICT might be utilized in clinical settings, making it possible to track the progress of point mutations and indels that are associated with resistance to cancer therapies and provide patients personalized treatment. We also compared SiNVICT with other popular SNV callers such as MuTect, VarScan2 and Freebayes. Our results show that SiNVICT performs better than these tools in most cases and allows further data exploration such as time-series analysis on cfDNA sequencing data.
Availability and implementation: SiNVICT is available at: Supplementary information: Supplementary data are available at Bioinformatics online.

Comparison of high-throughput sequencing data compression tools

Ibrahim Numanagic, James K Bonfield, Faraz Hach, Jan Voges, Jorn Ostermann, Claudio Alberti, Marco Mattavelli, S. Cenk Sahinalp
Journal Papers Nat Methods. 2016 Dec;13(12):1005-1008


High-throughput sequencing (HTS) data are commonly stored as raw sequencing reads in FASTQ format or as reads mapped to a reference, in SAM format, both with large memory footprints. Worldwide growth of HTS data has prompted the development of compression methods that aim to significantly reduce HTS data size. Here we report on a benchmarking study of available compression methods on a comprehensive set of HTS data using an automated framework.

CoLoRMap: Correcting Long Reads by Mapping short reads.

Ehsan Haghshenas, Faraz Hach, S. Cenk Sahinalp, Cedric Chauve
Journal Papers Bioinformatics. 2016 Sep 1;32(17):i545-i551
Presentation ECCB'16 (15th European Conference on Computational Biology), The Hauge, Netherlands, September 3-7 2016


Motivation: Second generation sequencing technologies paved the way to an exceptional increase in the number of sequenced genomes, both prokaryotic and eukaryotic. However, short reads are difficult to assemble and often lead to highly fragmented assemblies. The recent developments in long reads sequencing methods offer a promising way to address this issue. However, so far long reads are characterized by a high error rate, and assembling from long reads require a high depth of coverage. This motivates the development of hybrid approaches that leverage the high quality of short reads to correct errors in long reads.
Results: We introduce CoLoRMap, a hybrid method for correcting noisy long reads, such as the ones produced by PacBio sequencing technology, using high-quality Illumina paired-end reads mapped onto the long reads. Our algorithm is based on two novel ideas: using a classical shortest path algorithm to find a sequence of overlapping short reads that minimizes the edit score to a long read and extending corrected regions by local assembly of unmapped mates of mapped short reads. Our results on bacterial, fungal and insect data sets show that CoLoRMap compares well with existing hybrid correction methods.
Availability and Implementation: The source code of CoLoRMap is freely available for non-commercial use at

Robustness of Massively Parallel Sequencing Platforms

Pinar Kavak, Bayram Yuksel, Soner Aksu, M. Oguzhan Kulekci, Tunga Gungor, Faraz Hach, S. Cenk Sahinalp, Turkish Human Genome Project, Can Alkan, Mahmut Şamil Sagıroglu
Journal PapersPLoS One. 2015 Sep 18;10(9):e0138259


The improvements in high throughput sequencing technologies (HTS) made clinical sequencing projects such as ClinSeq and Genomics England feasible. Although there are significant improvements in accuracy and reproducibility of HTS based analyses, the usability of these types of data for diagnostic and prognostic applications necessitates a near perfect data generation. To assess the usability of a widely used HTS platform for accurate and reproducible clinical applications in terms of robustness, we generated whole genome shotgun (WGS) sequence data from the genomes of two human individuals in two different genome sequencing centers. After analyzing the data to characterize SNPs and indels using the same tools (BWA, SAMtools, and GATK), we observed significant number of discrepancies in the call sets. As expected, the most of the disagreements between the call sets were found within genomic regions containing common repeats and segmental duplications, albeit only a small fraction of the discordant variants were within the exons and other functionally relevant regions such as promoters. We conclude that although HTS platforms are sufficiently powerful for providing data for first-pass clinical tests, the variant predictions still need to be confirmed using orthogonal methods before using in clinical applications.

Spatial genomic heterogeneity within localized, multifocal prostate cancer

Paul C Boutros, Michael Fraser, Nicholas J Harding, Richard de Borja, Dominique Trudel, Emilie Lalonde, Alice Meng, Pablo H Hennings-Yeomans, Andrew McPherson, Veronica Y Sabelnykova, Amin Zia, Natalie S Fox, Julie Livingstone, Yu-Jia Shiah, Jianxin Wang, Timothy A Beck, Cherry L Have, Taryne Chong, Michelle Sam, Jeremy Johns, Lee Timms, Nicholas Buchner, Ada Wong, John D Watson, Trent T Simmons, Christine P'ng, Gaetano Zafarana, Francis Nguyen, Xuemei Luo, Kenneth C Chu, Stephenie D Prokopec, Jenna Sykes, Alan Dal Pra, Alejandro Berlin, Andrew Brown, Michelle A Chan-Seng-Yue, Fouad Yousif, Robert E Denroche, Lauren C Chong, Gregory M Chen, Esther Jung, Clement Fung, Maud H W Starmans, Hanbo Chen, Shaylan K Govind, James Hawley, Alister D'Costa, Melania Pintilie, Daryl Waggott, Faraz Hach, Philippe Lambin, Lakshmi B Muthuswamy, Colin Cooper, Rosalind Eeles, David Neal, Bernard Tetu, Cenk Sahinalp, Lincoln D Stein, Neil Fleshner, Sohrab P Shah, Colin C Collins, Thomas J Hudson, John D McPherson, Theodorus van der Kwast and Robert G Bristow
Journal PapersNat Genet. 2015 Jul;47(7):736-45


Herein we provide a detailed molecular analysis of the spatial heterogeneity of clinically localized, multifocal prostate cancer to delineate new oncogenes or tumor suppressors. We initially determined the copy number aberration (CNA) profiles of 74 patients with index tumors of Gleason score 7. Of these, 5 patients were subjected to whole-genome sequencing using DNA quantities achievable in diagnostic biopsies, with detailed spatial sampling of 23 distinct tumor regions to assess intraprostatic heterogeneity in focal genomics. Multifocal tumors are highly heterogeneous for single-nucleotide variants (SNVs), CNAs and genomic rearrangements. We identified and validated a new recurrent amplification of MYCL, which is associated with TP53 deletion and unique profiles of DNA damage and transcriptional dysregulation. Moreover, we demonstrate divergent tumor evolution in multifocal cancer and, in some cases, tumors of independent clonal origin. These data represent the first systematic relation of intraprostatic genomic heterogeneity to predicted clinical outcome and inform the development of novel biomarkers that reflect individual prognosis.

Fast and accurate mapping of Complete Genomics reads

Donghyuk Lee, Farhad Hormozdiari, Hongyi Xin, Faraz Hach, Onur Mutlu and Can Alkan
Journal PapersMethods. 2015 Jun;79-80:3-10


Many recent advances in genomics and the expectations of personalized medicine are made possible thanks to power of high throughput sequencing (HTS) in sequencing large collections of human genomes. There are tens of different sequencing technologies currently available, and each HTS platform have different strengths and biases. This diversity both makes it possible to use different technologies to correct for shortcomings; but also requires to develop different algorithms for each platform due to the differences in data types and error models. The first problem to tackle in analyzing HTS data for resequencing applications is the read mapping stage, where many tools have been developed for the most popular HTS methods, but publicly available and open source aligners are still lacking for the Complete Genomics (CG) platform. Unfortunately, Burrows-Wheeler based methods are not practical for CG data due to the gapped nature of the reads generated by this method. Here we provide a sensitive read mapper (sirFAST) for the CG technology based on the seed-and-extend paradigm that can quickly map CG reads to a reference genome. We evaluate the performance and accuracy of sirFAST using both simulated and publicly available real data sets, showing high precision and recall rates.

DeeZ: reference-based compression by local assembly

Faraz Hach, Ibrahim Numanagic and S Cenk Sahinalp
Journal PapersNature Methods, 2014 Oct 30;11(11):1082-4
Presentation RECOMB-Seq 2015 HighLight Talk (5th Annual RECOMB Satellite Workshop on Massively Parallel Sequencing), Warsaw, Poland, April 10-21 2015

mrsFAST-Ultra: a compact, SNP-aware mapper for high performance sequencing applications

Faraz Hach, Iman Sarrafi, Farhad Hormozdiari, Can Alkan, Evan E. Eichler and S. Cenk Sahinalp.
Journal Papers Nucl. Acids Res. (1 July 2014) 42 (W1): W494-W500.


High throughput sequencing (HTS) platforms generate unprecedented amounts of data that introduce challenges for processing and downstream analysis. While tools that report the 'best' mapping location of each read provide a fast way to process HTS data, they are not suitable for many types of downstream analysis such as structural variation detection, where it is important to report multiple mapping loci for each read. For this purpose we introduce mrsFAST-Ultra, a fast, cache oblivious, SNP-aware aligner that can handle the multi-mapping of HTS reads very efficiently. mrsFAST-Ultra improves mrsFAST, our first cache oblivious read aligner capable of handling multi-mapping reads, through new and compact index structures that reduce not only the overall memory usage but also the number of CPU operations per alignment. In fact the size of the index generated by mrsFAST-Ultra is 10 times smaller than that of mrsFAST. As importantly, mrsFAST-Ultra introduces new features such as being able to (i) obtain the best mapping loci for each read, and (ii) return all reads that have at most n mapping loci (within an error threshold), together with these loci, for any user specified n. Furthermore, mrsFAST-Ultra is SNP-aware, i.e. it can map reads to reference genome while discounting the mismatches that occur at common SNP locations provided by db-SNP; this significantly increases the number of reads that can be mapped to the reference genome. Notice that all of the above features are implemented within the index structure and are not simple post-processing steps and thus are performed highly efficiently. Finally, mrsFAST-Ultra utilizes multiple available cores and processors and can be tuned for various memory settings. Our results show that mrsFAST-Ultra is roughly five times faster than its predecessor mrsFAST. In comparison to newly enhanced popular tools such as Bowtie2, it is more sensitive (it can report 10 times or more mappings per read) and much faster (six times or more) in the multi-mapping mode. Furthermore, mrsFAST-Ultra has an index size of 2GB for the entire human reference genome, which is roughly half of that of Bowtie2. mrsFAST-Ultra is open source and it can be accessed at

Optimal resolution of ambiguous RNA-Seq multimappings in the presence of novel isoforms

Phuong Dao, Ibrahim Numanagic, Yen-Yi Lin, Faraz Hach, Emre Karakoc, Nilgun Donmez, Colin Collins, Evan E. Eichler and S. Cenk Sahinalp
Journal Papers Bioinformatics (2014) 30 (5): 644-651.


Motivation: RNA-Seq technology is promising to uncover many novel alternative splicing events, gene fusions and other variations in RNA transcripts. For an accurate detection and quantification of transcripts, it is important to resolve the mapping ambiguity for those RNA-Seq reads that can be mapped to multiple loci: >17% of the reads from mouse RNA-Seq data and 50% of the reads from some plant RNA-Seq data have multiple mapping loci. In this study, we show how to resolve the mapping ambiguity in the presence of novel transcriptomic events such as exon skipping and novel indels towards accurate downstream analysis. We introduce ORMAN ( O ptimal R esolution of M ultimapping A mbiguity of R N A-Seq Reads), which aims to compute the minimum number of potential transcript products for each gene and to assign each multimapping read to one of these transcripts based on the estimated distribution of the region covering the read. ORMAN achieves this objective through a combinatorial optimization formulation, which is solved through well-known approximation algorithms, integer linear programs and heuristics.
Results: On a simulated RNA-Seq dataset including a random subset of transcripts from the UCSC database, the performance of several state-of-the-art methods for identifying and quantifying novel transcripts, such as Cufflinks, IsoLasso and CLIIQ, is significantly improved through the use of ORMAN. Furthermore, in an experiment using real RNA-Seq reads, we show that ORMAN is able to resolve multimapping to produce coverage values that are similar to the original distribution, even in genes with highly non-uniform coverage.
Availability: ORMAN is available at

SCALCE: boosting sequence compression algorithms using locally consistent encoding

Faraz Hach, Ibrahim Numanagic, Can Alkan and S Cenk Sahinalp
Journal Papers Bioinformatics (2012) 28 (23): 3051-3057.
Presentation RECOMB-Seq 2012 (2nd Annual RECOMB Satellite Workshop on Massively Parallel Sequencing), Barcelona, Spain, April 19-20 2012


Motivation: The high throughput sequencing (HTS) platforms generate unprecedented amounts of data that introduce challenges for the computational infrastructure. Data management, storage and analysis have become major logistical obstacles for those adopting the new platforms. The requirement for large investment for this purpose almost signalled the end of the Sequence Read Archive hosted at the National Center for Biotechnology Information (NCBI), which holds most of the sequence data generated world wide. Currently, most HTS data are compressed through general purpose algorithms such as gzip. These algorithms are not designed for compressing data generated by the HTS platforms; for example, they do not take advantage of the specific nature of genomic sequence data, that is, limited alphabet size and high similarity among reads. Fast and efficient compression algorithms designed specifically for HTS data should be able to address some of the issues in data management, storage and communication. Such algorithms would also help with analysis provided they offer additional capabilities such as random access to any read and indexing for efficient sequence similarity search. Here we present SCALCE, a 'boosting' scheme based on Locally Consistent Parsing technique, which reorganizes the reads in a way that results in a higher compression speed and compression rate, independent of the compression algorithm in use and without using a reference genome.
Results: Our tests indicate that SCALCE can improve the compression rate achieved through gzip by a factor of 4.19-when the goal is to compress the reads alone. In fact, on SCALCE reordered reads, gzip running time can improve by a factor of 15.06 on a standard PC with a single core and 6 GB memory. Interestingly even the running time of SCALCE + gzip improves that of gzip alone by a factor of 2.09. When compared with the recently published BEETL, which aims to sort the (inverted) reads in lexicographic order for improving bzip2, SCALCE + gzip provides up to 2.01 times better compression while improving the running time by a factor of 5.17. SCALCE also provides the option to compress the quality scores as well as the read names, in addition to the reads themselves. This is achieved by compressing the quality scores through order-3 Arithmetic Coding (AC) and the read names through gzip through the reordering SCALCE provides on the reads. This way, in comparison with gzip compression of the unordered FASTQ files (including reads, read names and quality scores), SCALCE (together with gzip and arithmetic encoding) can provide up to 3.34 improvement in the compression rate and 1.26 improvement in running time.
Availability: Our algorithm, SCALCE (Sequence Compression Algorithm using Locally Consistent Encoding), is implemented in C++ with both gzip and bzip2 compression options. It also supports multithreading when gzip option is selected, and the pigz binary is available. It is available at

CLIIQ: Accurate Comparative Detection and Quantification of Expressed Isoforms in a Population

Yen-Yi Lin, Phuong Dao, Faraz Hach, Marzieh Bakhshi, Fan Mo, Anna Lapuk, Colin Collins and S. Cenk Sahinalp
Conference PapersProc. of WABI 2012: Lecture Notes in Computer Science Volume 7534, 2012, pp 178-189
Presentation WABI 2012 (12th Workshop on Algorithms in Bioinformatics), September 10-12, Ljubljana, Slovenia


The recently developed RNA-Seq technology provides a high-throughput and reasonably accurate way to analyze the transcriptomic landscape of a tissue. Unfortunately, from a computational perspective, identification and quantification of a gene's isoforms from RNA-Seq data remains to be a non-trivial problem. We propose CLIIQ, a novel computational method for identification and quantification of expressed isoforms from multiple samples in a population. Motivated by ideas from compressed sensing literature, CLIIQ is based on an integer linear programming formulation for identifying and quantifying the most parsimonious set of isoforms. We show through simulations that, on a single sample, CLIIQ provides better results in isoform identification and quantification to alternative popular tools. More importantly, CLIIQ has an option to jointly analyze multiple samples, which significantly outperforms other tools in both isoform identification and quantification.

Dissect: detection and characterization of novel structural alterations in transcribed sequences

Deniz Yorukoglu, Faraz Hach, Lucas Swanson, Colin C. Collins, Inanc Birol and S. Cenk Sahinalp
Journal PapersBioinformatics (2012) 28 (12): i179-i187
PresentationISMB 2012 (20th Annual International Conference on Intelligent Systems for Molecular Biology), Long Beach, USA, July 19-23 2012
Award Won Ian Lawson Van Toch Memorial Award for Outstanding Student Paper


Motivation: Computational identification of genomic structural variants via high-throughput sequencing is an important problem for which a number of highly sophisticated solutions have been recently developed. With the advent of high-throughput transcriptome sequencing (RNA-Seq), the problem of identifying structural alterations in the transcriptome is now attracting significant attention. In this article, we introduce two novel algorithmic formulations for identifying transcriptomic structural variants through aligning transcripts to the reference genome under the consideration of such variation. The first formulation is based on a nucleotide-level alignment model; a second, potentially faster formulation is based on chaining fragments shared between each transcript and the reference genome. Based on these formulations, we introduce a novel transcriptome-to-genome alignment tool, Dissect (DIScovery of Structural Alteration Event Containing Transcripts), which can identify and characterize transcriptomic events such as duplications, inversions, rearrangements and fusions. Dissect is suitable for whole transcriptome structural variation discovery problems involving sufficiently long reads or accurately assembled contigs.
Results: We tested Dissect on simulated transcripts altered via structural events, as well as assembled RNA-Seq contigs from human prostate cancer cell line C4-2. Our results indicate that Dissect has high sensitivity and specificity in identifying structural alteration events in simulated transcripts as well as uncovering novel structural alterations in cancer transcriptomes.
Availability: Dissect is available for public use at:

From sequence to molecular pathology, and a mechanism driving the neuroendocrine phenotype in prostate cancer

Anna V Lapuk, Chunxiao Wu, Alexander W Wyatt, Andrew McPherson, Brian J McConeghy, Sonal Brahmbhatt, Fan Mo, Amina Zoubeidi, Shawn Anderson, Robert H Bell, Anne Haegert, Robert Shukin, Yuzhuo Wang, Ladan Fazli, Antonio Hurtado-Coll, Edward C Jones, Faraz Hach, Fereydoun Hormozdiari, Iman Hajirasouliha, Paul C Boutros, Robert G Bristow, Yongjun Zhao, Marco A Marra, Andrea Fanjul, Christopher A Maher, Arul M Chinnaiyan, Mark A Rubin, Himisha Beltran, S Cenk Sahinalp, Martin E Gleave, Stanislav V Volik and Colin C Collins
Journal PapersJ Pathol. 2012 Jul;227(3):286-97


The current paradigm of cancer care relies on predictive nomograms which integrate detailed histopathology with clinical data. However, when predictions fail, the consequences for patients are often catastrophic, especially in prostate cancer where nomograms influence the decision to therapeutically intervene. We hypothesized that the high dimensional data afforded by massively parallel sequencing (MPS) is not only capable of providing biological insights, but may aid molecular pathology of prostate tumours. We assembled a cohort of six patients with high-risk disease, and performed deep RNA and shallow DNA sequencing in primary tumours and matched metastases where available. Our analysis identified copy number abnormalities, accurately profiled gene expression levels, and detected both differential splicing and expressed fusion genes. We revealed occult and potentially dormant metastases, unambiguously supporting the patients' clinical history, and implicated the REST transcriptional complex in the development of neuroendocrine prostate cancer, validating this finding in a large independent cohort. We massively expand on the number of novel fusion genes described in prostate cancer; provide fresh evidence for the growing link between fusion gene aetiology and gene expression profiles; and show the utility of fusion genes for molecular pathology. Finally, we identified chromothripsis in a patient with chronic prostatitis. Our results provide a strong foundation for further development of MPS-based molecular pathology.

Integrated genome and transcriptome sequencing identifies a novel form of hybrid and aggressive prostate cancer

Chunxiao Wu, Alexander W Wyatt, Anna V Lapuk, Andrew McPherson, Brian J McConeghy, Robert H Bell, Shawn Anderson, Anne Haegert, Sonal Brahmbhatt, Robert Shukin, Fan Mo, Estelle Li, Ladan Fazli, Antonio Hurtado-Coll, Edward C Jones, Yaron S Butterfield, Faraz Hach, Fereydoun Hormozdiari, Iman Hajirasouliha, Paul C Boutros, Robert G Bristow, Steven JM Jones, Martin Hirst, Marco A Marra, Christopher A Maher, Arul M Chinnaiyan, S Cenk Sahinalp, Martin E Gleave, Stanislav V Volik and Colin C Collins
Journal PapersJ Pathol. 2012 May;227(1):53-61


Next-generation sequencing is making sequence-based molecular pathology and personalized oncology viable. We selected an individual initially diagnosed with conventional but aggressive prostate adenocarcinoma and sequenced the genome and transcriptome from primary and metastatic tissues collected prior to hormone therapy. The histology-pathology and copy number profiles were remarkably homogeneous, yet it was possible to propose the quadrant of the prostate tumour that likely seeded the metastatic diaspora. Despite a homogeneous cell type, our transcriptome analysis revealed signatures of both luminal and neuroendocrine cell types. Remarkably, the repertoire of expressed but apparently private gene fusions, including C15orf21:MYC, recapitulated this biology. We hypothesize that the amplification and over-expression of the stem cell gene MSI2 may have contributed to the stable hybrid cellular identity. This hybrid luminal-neuroendocrine tumour appears to represent a novel and highly aggressive case of prostate cancer with unique biological features and, conceivably, a propensity for rapid progression to castrate-resistance. Overall, this work highlights the importance of integrated analyses of genome, exome and transcriptome sequences for basic tumour biology, sequence-based molecular pathology and personalized oncology.

Sensitive and fast mapping of di-base encoded reads

Farhad Hormozdiari, Faraz Hach S. Cenk Sahinalp, Evan E. Eichler and Can Alkan
Journal PapersBioinformatics (2011) 27 (14): 1915-1921


Motivation: Discovering variation among high-throughput sequenced genomes relies on efficient and effective mapping of sequence reads. The speed, sensitivity and accuracy of read mapping are crucial to determining the full spectrum of single nucleotide variants (SNVs) as well as structural variants (SVs) in the donor genomes analyzed.
Results: We present drFAST, a read mapper designed for di-base encoded 'color-space' sequences generated with the AB SOLiD platform. drFAST is specially designed for better delineation of structural variants, including segmental duplications, and is able to return all possible map locations and underlying sequence variation of short reads within a user-specified distance threshold. We show that drFAST is more sensitive in comparison to all commonly used aligners such as Bowtie, BFAST and SHRiMP. drFAST is also faster than both BFAST and SHRiMP and achieves a mapping speed comparable to Bowtie.
Availability: The source code for drFAST is available at

Comrad: detection of expressed rearrangements by integrated analysis of RNA-Seq and low coverage genome sequence data

Andrew McPherson, Chunxiao Wu, Iman Hajirasouliha, Fereydoun Hormozdiari, Faraz Hach, Anna Lapuk, Stanislav Volik, Sohrab Shah, Colin Collins and S. Cenk Sahinalp
Journal PapersBioinformatics (2011) 27 (11): 1481-1488.
Presentation ISMB-HitSeq 2011 (High Throughput Sequencing Analysis and Algorithms, Special Interest Group of ISMB 2011), Vienna, Austria, July 15-16 2011
Award Won Best Paper Award


Motivation: Comrad is a novel algorithmic framework for the integrated analysis of RNA-Seq and whole genome shotgun sequencing (WGSS) data for the purposes of discovering genomic rearrangements and aberrant transcripts. The Comrad framework leverages the advantages of both RNA-Seq and WGSS data, providing accurate classification of rearrangements as expressed or not expressed and accurate classification of the genomic or non-genomic origin of aberrant transcripts. A major benefit of Comrad is its ability to accurately identify aberrant transcripts and associated rearrangements using low coverage genome data. As a result, a Comrad analysis can be performed at a cost comparable to that of two RNA-Seq experiments, significantly lower than an analysis requiring high coverage genome data.
Results: We have applied Comrad to the discovery of gene fusions and read-throughs in prostate cancer cell line C4-2, a derivative of the LNCaP cell line with androgen-independent characteristics. As a proof of concept, we have rediscovered in the C4-2 data 4 of the 6 fusions previously identified in LNCaP. We also identified six novel fusion transcripts and associated genomic breakpoints, and verified their existence in LNCaP, suggesting that Comrad may be more sensitive than previous methods that have been applied to fusion discovery in LNCaP. We show that many of the gene fusions discovered using Comrad would be difficult to identify using currently available techniques.
Availability: A C++ and Perl implementation of the method demonstrated in this article is available at

Alu repeat discovery and characterization within human genomes

Fereydoun Hormozdiari, Can Alkan, Mario Ventura, Iman Hajirasouliha, Maika Malig, Faraz Hach, Deniz Yorukoglu, Phuong Dao, Marzieh Bakhshi, S. Cenk Sahinalp and Evan E. Eichler
Journal PapersGenome Res. 2011 Jun;21(6):840-9


Human genomes are now being rapidly sequenced, but not all forms of genetic variation are routinely characterized. In this study, we focus on Alu retrotransposition events and seek to characterize differences in the pattern of mobile insertion between individuals based on the analysis of eight human genomes sequenced using next-generation sequencing. Applying a rapid read-pair analysis algorithm, we discover 4342 Alu insertions not found in the human reference genome and show that 98% of a selected subset (63/64) experimentally validate. Of these new insertions, 89% correspond to AluY elements, suggesting that they arose by retrotransposition. Eighty percent of the Alu insertions have not been previously reported and more novel events were detected in Africans when compared with non-African samples (76% vs. 69%). Using these data, we develop an experimental and computational screen to identify ancestry informative Alu retrotransposition events among different human populations.

mrsFAST: a cache-oblivious algorithm for short-read mapping

Faraz Hach, Fereydoun Hormozdiari, Can Alkan, Farhad Hormozdiari, Inanc Birol, Evan E Eichler and S Cenk Sahinalp
Journal PapersNature Methods, 2010 Aug;7(8):576-7

Next-generation VariationHunter: combinatorial algorithms for transposon insertion discovery

Fereydoun Hormozdiari, Iman Hajirasouliha, Phuong Dao, Faraz Hach, Deniz Yorukoglu, Can Alkan, Evan E. Eichler and S. Cenk Sahinalp
Journal PapersBioinformatics (2010) 26 (12): i350-i357
Presentation ISMB (18th Annual International Conference Intelligent Systems for Molecular Biology), Boston, USA, July 11-13 2010


Recent years have witnessed an increase in research activity for the detection of structural variants (SVs) and their association to human disease. The advent of next-generation sequencing technologies make it possible to extend the scope of structural variation studies to a point previously unimaginable as exemplified by the 1000 Genomes Project. Although various computational methods have been described for the detection of SVs, no such algorithm is yet fully capable of discovering transposon insertions, a very important class of SVs to the study of human evolution and disease. In this article, we provide a complete and novel formulation to discover both loci and classes of transposons inserted into genomes sequenced with high-throughput sequencing technologies. In addition, we also present 'conflict resolution' improvements to our earlier combinatorial SV detection algorithm (VariationHunter) by taking the diploid nature of the human genome into consideration. We test our algorithms with simulated data from the Venter genome (HuRef) and are able to discover >85% of transposon insertion events with precision of >90%. We also demonstrate that our conflict resolution algorithm (denoted as VariationHunter-CR) outperforms current state of the art (such as original VariationHunter, BreakDancer and MoDIL) algorithms when tested on the genome of the Yoruba African individual (NA18507).
Availability: The implementation of algorithm is available at

Faster Phylogenetic Inference with MXG

David G. Mitchell, Faraz Hach, Raheleh Mohebali
Conference PapersProc of LPAR 2007. Lecture Notes in Computer Science Volume 4790, 2007, pp 423-437


We apply the logic-based declarative programming approach of Model Expansion (MX) to a phylogenetic inference task. We axiomatize the task in multi-sorted first-order logic with cardinality constraints. Using the model expansion solver MXG and SAT+cardinality solver MXC, we compare the performance of several MX axiomatizations on a challenging set of test instances. Our methods perform orders of magnitude faster than previously reported declarative solutions. Our best solution involves polynomial-time pre-processing, redundant axioms, and symmetry-breaking axioms. We also discuss our method of test instance generation, and the role of pre-processing in declarative programming.

MXG: A Model Expansion Grounder and Solver

Raheleh Mohebali, Faraz Hach, and David Mitchell,
Technical ReportsAccepted to LPAR 2007, 2007.


We describe MXG, a solver for NP search problems expressed as model expansion (MX). Problems are specified in an extension of first-order logic, and solved by grounding. That is, MXG combines a high-level specification with an instance and produces a propositional formula encoding the solutions. It calls a SAT (or extended SAT) solver to find solutions. MXG is distinguished from other grounding software in its use of a grounding algorithm based on a generalization of the relational algebra.

Model expansion as a framework for modelling and solving search problems

David Mitchell, Eugenia Ternovska, Faraz Hach, and Raheleh Mohebal
Technical ReportsTR 2006-24, School of Computing Science, Simon Fraser University, Burnaby, BC, Canada


We propose a framework for modelling and solving search problems using logic, and describe a project whose goal is to produce practically effective, general purpose tools for representing and solving search problems based on this framework. The mathematical foundation lies in the areas of finite model theory and descriptive complexity, which provide us with many classical results, as well as powerful techniques not available to many other approaches with similar goals. We describe the mathematical foundations; explain an extension to classical logic with inductive definitions that we consider central; give a summary of complexity and expressiveness properties; describe an approach to implementing solvers based on grounding; present grounding algorithms based on an extension of the relational algebra; describe an implementation of our framework which includes use of inductive definitions, sorts and order; and give experimental results comparing the performance of our implementation with ASP solvers and another solver based on the same framework.