Karolinska Institutet

Machine learning and data-parallel processing for viral metagenomics

thesis
posted on 2024-09-02, 21:56 authored by Zurab Bzhalava

More than 2 million cancer cases around the world each year are caused by viruses. In addition, there are epidemiological indications that other cancer-associated viruses may also exist. However, identifying highly divergent and as yet unknown viruses in human biospecimens is one of the biggest challenges in bioinformatics. Modern Next Generation Sequencing (NGS) technologies can sequence biospecimens from clinical cohorts directly, with unprecedented speed and depth. These technologies generate billions of bases at rapidly decreasing cost, but current bioinformatics tools cannot process these massive datasets efficiently. Thus, the objective of this thesis was to facilitate both the detection of highly divergent viruses among the generated sequences and the large-scale analysis of human metagenomic datasets.

To re-analyze human sample-derived sequences that conventional alignment-based methods had classified as being of “unknown” origin, we used a methodology based on profile Hidden Markov Models (HMM), which can capture evolutionary changes by using multiple sequence alignments. We thus identified 510 sequences that were classified as distantly related to viruses. Many of these sequences were homologous to large viruses such as Herpesviridae and Mimiviridae, but some were also related to small circular viruses such as Circoviridae. We found that bioinformatics analysis using viral profile HMMs can extend the classification of previously unknown sequences and consequently the detection of viruses in biospecimens from humans.

Different organisms use synonymous codons differently to encode the same amino acids. To investigate whether codon usage bias could predict the presence of viruses in metagenomic sequencing data originating from human samples, we trained Random Forest and Artificial Neural Network models on Relative Synonymous Codon Usage (RSCU) frequencies. Our analysis showed that machine learning techniques based on RSCU could identify putative viral sequences with an area under the ROC curve of 0.79 and provide important information for taxonomic classification.
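As a rough illustration of the RSCU features mentioned above, the sketch below computes RSCU for a single sequence using the common textbook definition: the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. The function name `rscu`, the choice of reading frame 0, the use of the standard genetic code, and the handling of absent amino acids (value 0.0) are assumptions made for this example and are not taken from the thesis.

```python
from collections import Counter
from itertools import product

# Standard genetic code with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(seq):
    """Return a dict mapping each of the 64 codons to its RSCU value.

    Assumptions (illustrative only): the sequence is read in frame 0,
    codons containing ambiguous bases are skipped, stop codons are
    treated as one synonymous family, and families with no observed
    codons get 0.0.
    """
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)

    # Group codons into synonymous families by encoded amino acid.
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)

    values = {}
    for family in families.values():
        total = sum(counts[c] for c in family)
        for c in family:
            # Observed count relative to uniform usage within the family.
            values[c] = counts[c] * len(family) / total if total else 0.0
    return values


if __name__ == "__main__":
    features = rscu("ATGTTTTTTTTCGGAGGAGGT")
    print(features["TTT"], features["TTC"])
```

Within each observed family the RSCU values sum to the number of synonymous codons, so a value above 1 marks a codon used more often than uniform usage would predict. A 64-element vector of this kind is one plausible input representation for a Random Forest or neural network classifier.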

To identify viral genomes among raw metagenomic sequences, we developed ViraMiner, a deep learning-based method that uses Convolutional Neural Networks with two convolutional branches. Using sequences 300 base pairs in length, ViraMiner achieved an area under the ROC curve of 0.923, a considerable improvement over previous machine learning methods for virus sequence classification. The proposed architecture is, to the best of our knowledge, the first deep learning tool that can detect viral genomes in raw metagenomic sequences originating from a variety of human samples.

To enable large-scale analysis of massive metagenomic sequencing data, we used Apache Hadoop and Apache Spark to develop ViraPipe, a scalable parallel bioinformatics pipeline for viral metagenomics. In the metagenome analysis, ViraPipe executed on 23 nodes was 11 times faster than the sequential pipeline executed on a single node. The new distributed workflow contains several standard bioinformatics tools and can scale to terabytes of data by drawing on more computing power from the nodes.

To analyze terabytes of RNA-seq data originating from head and neck squamous cell carcinoma samples, we used our parallel bioinformatics pipeline ViraPipe and the most recent version of the HPV sequence database. We detected transcription of HPV viral oncogenes in 92/500 cancers. HPV 16 was the most common HPV type, followed by HPV 33. If these cancers are indeed caused by HPV, we estimated that vaccination might prevent about 36 000 head and neck cancer cases in the United States every year.
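To put the ViraPipe speedup reported above in context, the following sketch derives two standard scaling metrics from the stated figures (11 times faster on 23 nodes than on a single node). Parallel efficiency and the Karp-Flatt serial fraction are generic textbook measures applied here for illustration; the thesis abstract does not report them, and the function names are hypothetical.

```python
def parallel_efficiency(speedup, nodes):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / nodes


def karp_flatt_serial_fraction(speedup, nodes):
    """Experimentally determined serial fraction (Karp-Flatt metric).

    Equivalent to solving Amdahl's law, S = 1 / (s + (1 - s) / N), for s.
    """
    return (1.0 / speedup - 1.0 / nodes) / (1.0 - 1.0 / nodes)


if __name__ == "__main__":
    speedup = 11.0  # from the abstract: 11 times faster
    nodes = 23      # from the abstract: executed on 23 nodes

    print(f"efficiency: {parallel_efficiency(speedup, nodes):.2f}")
    print(f"serial fraction: {karp_flatt_serial_fraction(speedup, nodes):.3f}")
```

Under these assumptions the efficiency comes out a little below one half, and the implied non-parallel share of the work is roughly 5%. Such figures are only indicative, since they depend on dataset size, cluster configuration, and which pipeline stages were included in the timing.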

In conclusion, the work in this thesis improves the prospects for biomedical researchers to classify the sequence contents of ultra-deep datasets, conduct large-scale analysis of metagenome studies, and detect the presence of viral genomes in human biospecimens. Hopefully, this work will contribute to our understanding of the biodiversity of viruses in humans, which in turn can help explore infectious causes of human disease.

List of scientific papers

I. BZHALAVA Z, Hultin E, Dillner J. Extension of the viral ecology in humans using viral profile hidden Markov models. PLOS ONE. 2018; 13(1):1–12.
https://doi.org/10.1371/journal.pone.0190938

II. BZHALAVA Z#, Tampuu A#, Bała P, Vicente R, Dillner J. Machine Learning for detection of viral sequences in human metagenomic datasets. BMC Bioinformatics. 2018. 19(1): p. 336. #Equal contributions.
https://doi.org/10.1186/s12859-018-2340-x

III. Tampuu A#, BZHALAVA Z#, Dillner J, Vicente R. ViraMiner: Deep learning on raw DNA sequences for identifying viral genomes in human samples. PLOS ONE. 2019;14(9): e0222271. #Equal contributions.
https://doi.org/10.1371/journal.pone.0222271

IV. Maarala AI, BZHALAVA Z, Dillner J, Heljanko K, Bzhalava D. ViraPipe: scalable parallel pipeline for viral metagenome analysis from next generation sequencing reads. Bioinformatics. Volume 34, Issue 6, 15 March 2018, Pages 928–935.
https://doi.org/10.1093/bioinformatics/btx702

V. BZHALAVA Z, Arroyo Mühr LS and Dillner J. Transcription of Human Papillomavirus Oncogenes in Head and Neck Squamous Cell Carcinomas. [Manuscript]

History

Defence date

2020-04-03

Department

  • Department of Laboratory Medicine

Publisher/Institution

Karolinska Institutet

Main supervisor

Dillner, Joakim

Co-supervisors

Sundström, Karin; Bała, Piotr

Publication year

2020

Thesis type

  • Doctoral thesis

ISBN

978-91-7831-708-0

Number of supporting papers

5

Language

  • eng

Original publication date

2020-03-13

Author name in thesis

Bzhalava, Zurab

Original department name

Department of Laboratory Medicine

Place of publication

Stockholm
