Improved statistical methodology for high-throughput omics data analysis

Deng, Wenjiang

thesis

posted on 2024-09-03, 04:49 authored by Wenjiang Deng

Over the last two decades, high-throughput omics technologies have transformed biological and biomedical research. Rapid advances in sequencing techniques have produced a large volume of omics data, and researchers have developed a wide range of computational tools to manage and analyze it. Although these tools have led to significant discoveries, processing and interpreting omics data efficiently and accurately remains a major challenge.

In this thesis, we aim to develop novel statistical methodologies and algorithms for omics data analysis. We apply the methods to both simulated and real data from different types of cancer. Based on evaluation and comparison with existing tools, we find that our methods achieve higher accuracy and better performance in analyzing different types of omics data.

In Study I, we build an analysis pipeline that integrates multiple levels of omics data to identify potential driver genes in neuroblastoma. The pipeline uses gene expression profiles, microarray-based comparative genomic hybridization data, and a functional gene interaction network to detect cancer-related driver genes. We identify a total of 66 patient-specific and four common driver genes. The genes are summarized into a driver-gene score (DGscore) for each patient. We find that patients with a low DGscore have better survival than those with a high DGscore (p = 0.006).

In Study II, we develop a novel method named XAEM to quantify isoform-level expression using RNA sequencing data. The method has two major components. First, we construct a design matrix X as the starting parameter in the quantification model. Second, we use an alternating Expectation Maximization algorithm to estimate the design matrix X and the isoform expression b iteratively. We compare XAEM with several quantification methods using both simulated and real data. The results show that XAEM achieves higher accuracy for multiple-isoform genes and obtains substantially better rediscovery rates in the differential-expression analysis.
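The alternating scheme described for Study II can be illustrated with a small sketch. This is not the XAEM implementation: it assumes a Poisson count model in which a matrix Y of read counts (read classes by samples) has expectation X times B, and it uses the standard multiplicative EM updates for that model. The function names (`fitted`, `deviance`, `update_B`, `update_X`, `normalize`, `alternating_em`) and the toy data are hypothetical.

```python
import math

def fitted(X, B):
    """Expected counts mu[i][s] = sum_j X[i][j] * B[j][s]."""
    return [[sum(X[i][j] * B[j][s] for j in range(len(B)))
             for s in range(len(B[0]))] for i in range(len(X))]

def deviance(Y, mu):
    """Poisson deviance up to a factor of 2 (assumes all fitted counts > 0)."""
    total = 0.0
    for y_row, mu_row in zip(Y, mu):
        for y, m in zip(y_row, mu_row):
            total += m - y + (y * math.log(y / m) if y > 0 else 0.0)
    return total

def update_B(X, Y, B):
    """EM-style update of the abundances B with the design matrix X held fixed."""
    mu = fitted(X, B)
    new_B = []
    for j in range(len(B)):
        col_sum = sum(row[j] for row in X)
        new_B.append([
            B[j][s] * sum(X[i][j] * Y[i][s] / mu[i][s]
                          for i in range(len(X))) / col_sum
            for s in range(len(B[0]))
        ])
    return new_B

def update_X(X, Y, B):
    """EM-style update of X with B held fixed; zero entries of X stay zero."""
    mu = fitted(X, B)
    new_X = [row[:] for row in X]
    for j in range(len(B)):
        total_j = sum(B[j])
        for i in range(len(X)):
            ratio = sum(B[j][s] * Y[i][s] / mu[i][s] for s in range(len(B[0])))
            new_X[i][j] = X[i][j] * ratio / total_j
    return new_X

def normalize(X, B):
    """Rescale each column of X to sum to 1 and move the scale into B."""
    X = [row[:] for row in X]
    B = [row[:] for row in B]
    for j in range(len(B)):
        c = sum(row[j] for row in X)
        for row in X:
            row[j] /= c
        B[j] = [v * c for v in B[j]]
    return X, B

def alternating_em(X0, Y, n_iter=200):
    """Alternate the two updates, starting from X0 and equal abundances."""
    n_iso, n_smp = len(X0[0]), len(Y[0])
    B = [[sum(row[s] for row in Y) / n_iso for s in range(n_smp)]
         for _ in range(n_iso)]
    X, B = normalize(X0, B)
    for _ in range(n_iter):
        B = update_B(X, Y, B)
        X = update_X(X, Y, B)
        X, B = normalize(X, B)
    return X, B

# Toy data: 3 read classes, 2 isoforms, 3 samples (all values invented).
Y = [[50, 150, 25], [200, 200, 125], [150, 50, 100]]
X0 = [[0.6, 0.0], [0.4, 0.7], [0.0, 0.3]]  # starting design matrix
X_hat, B_hat = alternating_em(X0, Y)
```

The normalization step reflects a design choice in this sketch: a bilinear model is only identified up to a per-isoform scale, so fixing each column of X to sum to one leaves the fitted counts unchanged while making B interpretable as abundances.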

In Study III, we extend the algorithm from Study II and develop an approach named MAX to quantify mutant-allele expression at the isoform level. For a given gene and a list of mutations, we first generate the mutant reference by incorporating all possible mutant isoforms derived from the wild-type isoform. The alternating Expectation Maximization algorithm is then applied to estimate the isoform abundance. We apply MAX to a real dataset of acute myeloid leukemia. Using the mutant-allele expression, we discover a subgroup of NPM1-mutated patients that shows a better response to a kinase inhibitor.
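One way to read the mutant-reference step in Study III is as an enumeration of mutation combinations applied to a wild-type sequence. The sketch below makes that reading explicit under strong assumptions: mutations are single-base substitutions given as (0-based position, reference base, alternate base), and "all possible mutant isoforms" means every non-empty combination of them. The function name `mutant_isoforms` and the label format are hypothetical and are not taken from MAX.

```python
from itertools import combinations

def mutant_isoforms(wild_type, mutations):
    """Return {label: sequence} for every non-empty combination of substitutions.

    wild_type: isoform sequence as a string.
    mutations: list of (pos0, ref, alt) single-base substitutions (assumed format).
    """
    for pos, ref, _alt in mutations:
        if wild_type[pos] != ref:
            raise ValueError(f"reference base mismatch at position {pos + 1}")

    results = {}
    for k in range(1, len(mutations) + 1):
        for combo in combinations(mutations, k):
            positions = [pos for pos, _ref, _alt in combo]
            if len(set(positions)) < len(positions):
                continue  # skip combinations that hit the same base twice
            seq = list(wild_type)
            for pos, _ref, alt in combo:
                seq[pos] = alt
            # Label uses 1-based positions, e.g. "C2T+A5G"
            label = "+".join(f"{ref}{pos + 1}{alt}" for pos, ref, alt in combo)
            results[label] = "".join(seq)
    return results

# Toy example with an invented sequence and two invented substitutions.
reference = mutant_isoforms("ACGTAC", [(1, "C", "T"), (4, "A", "G")])
```

With n substitutions this produces up to 2^n - 1 sequences, so a practical pipeline would likely need to cap or prune the combinations for heavily mutated genes.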

In Study IV, we build a pipeline to detect fusion genes at the DNA level using whole-exome sequencing data. The pipeline is applied to three comprehensive datasets of acute myeloid leukemia and prostate cancer patients. Comparing against the detection results from RNA sequencing data, we find that several major fusion events in these two cancer types are validated in some of the patients. However, the overall results indicate that it is challenging to identify chimeric genes using exome sequencing data due to its inherent limitations.

Altogether, we have developed several statistical and bioinformatics tools to analyze different types of omics data, which demonstrate higher accuracy and better performance than other competing approaches. The results in this thesis will provide novel insights into omics data analysis and facilitate significant discoveries in cancer research.

List of scientific papers

I. Chen Suo*, Wenjiang Deng*, Trung Nghia Vu, Mingrui Li, Leming Shi, Yudi Pawitan. Accumulation of potential driver genes with genomic alterations predicts survival of high-risk neuroblastoma patients. Biology Direct. 2018;13(1):14. *Contributed equally.
https://doi.org/10.1186/s13062-018-0218-5

II. Wenjiang Deng, Tian Mou, Krishna R Kalari, Nifang Niu, Liewei Wang, Yudi Pawitan and Trung Nghia Vu. Alternating EM algorithm for a bilinear model in isoform quantification from RNA-seq data. Bioinformatics. 2020;36(3):805-812.
https://doi.org/10.1093/bioinformatics/btz640

III. Wenjiang Deng, Tian Mou, Yudi Pawitan and Trung Nghia Vu. Quantification of mutant-allele expression at isoform level in cancer from RNA-seq data. [Submitted]

IV. Wenjiang Deng, Sarath Murugan, Johan Lindberg, Venkatesh Chellappa, Xia Shen, Yudi Pawitan and Trung Nghia Vu. Fusion gene detection using whole-exome sequencing data in cancer patients. [Manuscript]

History

Defence date

2021-09-17

Department

Department of Medical Epidemiology and Biostatistics

Publisher/Institution

Karolinska Institutet

Main supervisor

Pawitan, Yudi

Co-supervisors

Vu, Trung Nghia; Shen, Xia

Publication year

2021

Thesis type

Doctoral thesis

ISBN

978-91-8016-279-1

Number of supporting papers

4

Language

eng

Original publication date

2021-07-19

Author name in thesis

Deng, Wenjiang

Original department name

Department of Medical Epidemiology and Biostatistics

Place of publication

Stockholm

Keywords

Thesis
