Deciphering the complexity of biological systems using machine learning applied to big data
With the rise of the omics era and recent technological advances in biology and medicine, the volume of generated data is growing exponentially. This growth, together with the increasing complexity and heterogeneity of the data, creates challenges in data analysis, information extraction, and interpretation of results.
This thesis aims to reveal hidden structure and knowledge in complex biological data using machine learning techniques and pipelines, and to apply them in fundamental biological and clinical studies. Specifically, the aims were to map the subcellular localization of proteins by building a mass spectrometry (MS)- and machine learning-based pipeline, to automate that pipeline by developing a software package, to reveal the proteome subtypes of non-small cell lung cancer (NSCLC), and to build classifier pipelines for NSCLC subtyping.
In study I we developed a mass spectrometry (MS)- and machine learning-based pipeline to generate a proteome-wide resource of protein subcellular localization across multiple human cancer cell lines. Furthermore, we analyzed protein-domain and variant-dependent localization, and tested proteome-wide relocalization driven by EGFR inhibition. In study II we performed in-depth quantitative molecular phenotyping of an NSCLC cohort and used unsupervised learning-based clustering to reveal six subtypes with distinct mutation, immune infiltration and proliferation profiles. Additionally, supervised learning-based classifiers (support vector machine and k-top scoring pairs) were built and validated on independent datasets. In study III we present comprehensive wet-lab and dry-lab protocols for performing protein subcellular localization analysis. Specifically, an R package was developed for the downstream analysis, including preprocessing of the MS output data, building the classifier, and visualizing the output. In study IV we prototype and develop an MS-based, peptide-centric machine learning classifier for subtyping and stratification of NSCLC patients for improved therapy selection.
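To make the k-top scoring pairs (k-TSP) idea named in study II concrete, the following is a minimal two-class sketch in plain Python. It is not the thesis's implementation: the function names `fit_ktsp` and `predict_ktsp`, the toy data, and the restriction to two classes are all assumptions made for illustration (the six NSCLC subtypes would need a multi-class extension such as one-vs-rest). The rule compares the relative order of two features within a sample, which makes it insensitive to per-sample scaling.

```python
from itertools import combinations

def fit_ktsp(X, y, k=3):
    """Select up to k disjoint feature pairs (hypothetical helper).

    X: list of samples, each a list of feature values.
    y: list of class labels, 0 or 1.
    Returns a list of (i, j, cls), where cls is the class in which
    x[i] < x[j] is observed more often.
    """
    class0 = [x for x, label in zip(X, y) if label == 0]
    class1 = [x for x, label in zip(X, y) if label == 1]
    scored = []
    for i, j in combinations(range(len(X[0])), 2):
        # Frequency of the ordering x[i] < x[j] within each class.
        p0 = sum(x[i] < x[j] for x in class0) / len(class0)
        p1 = sum(x[i] < x[j] for x in class1) / len(class1)
        scored.append((abs(p0 - p1), i, j, 0 if p0 > p1 else 1))
    # Highest-scoring pairs first; Python's sort is stable for ties.
    scored.sort(key=lambda t: t[0], reverse=True)
    pairs, used = [], set()
    for _, i, j, cls in scored:
        if i in used or j in used:
            continue  # keep pairs disjoint
        pairs.append((i, j, cls))
        used.update((i, j))
        if len(pairs) == k:
            break
    return pairs

def predict_ktsp(pairs, x):
    """Majority vote over the selected pairs; ties go to class 0."""
    votes_for_1 = sum(cls if x[i] < x[j] else 1 - cls for i, j, cls in pairs)
    return 1 if 2 * votes_for_1 > len(pairs) else 0

# Toy data (invented): class 0 has f0 < f1 and f2 > f3; class 1 the reverse.
X = [[1, 5, 9, 2], [2, 6, 8, 3], [1, 4, 7, 1],
     [6, 2, 1, 8], [7, 1, 2, 9], [5, 3, 3, 7]]
y = [0, 0, 0, 1, 1, 1]
pairs = fit_ktsp(X, y, k=2)
predictions = [predict_ktsp(pairs, x) for x in X]
```

An odd `k` avoids tied votes; with an even number of selected pairs, this sketch resolves ties in favor of class 0.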
In summary, the work presented in this thesis shows how to strategically analyze big and complex biological and clinical data to extract useful biological insights. Furthermore, pipelines for data-driven approaches, including machine learning and data science methods, were developed to understand the biology and to transfer that knowledge into the clinic. The development of such methods and pipelines for the analysis of big biological data will be a core element of a range of applications in biology and medicine.
List of scientific papers
I. Lukas Minus Orre*, Mattias Vesterlund*, Yanbo Pan*, Taner Arslan*, Yafeng Zhu, Alejandro Fernandez Woodbridge, Oliver Frings, Erik Fredlund, and Janne Lehtiö. SubCellBarCode: Proteome-wide Mapping of Protein Localization and Relocalization. Molecular Cell. 2019 Jan 3;73(1):166-182.e7. *These authors contributed equally.
https://doi.org/10.1016/j.molcel.2018.11.035
II. Janne Lehtiö, Taner Arslan, Ioannis Siavelis, Yanbo Pan, Fabio Socciarelli, Olena Berkovska, Husen M. Umer, Georgios Mermelekas, Mohammad Pirmoradian, Mats Jönsson, Hans Brunnström, Odd Terje Brustugun, Krishna Pinganksha Purohit, Richard Cunningham, Hassan Foroughi Asl, Sofi Isaksson, Elsa Arbajian, Mattias Aine, Anna Karlsson, Marija Kotevska, Carsten Gram Hansen, Vilde Drageset Haakensen, Åslaug Helland, David Tamborero, Henrik J. Johansson, Rui M. Branca, Maria Planck, Johan Staaf, and Lukas M. Orre. Proteogenomics of non-small cell lung cancer reveals molecular subtypes associated with specific therapeutic targets and immune evasion mechanisms. [Accepted]
https://doi.org/10.1038/s43018-021-00259-9
III. Taner Arslan*, Yanbo Pan*, Georgios Mermelekas, Mattias Vesterlund, Lukas M. Orre, and Janne Lehtiö. SubCellBarCode: Integrated workflow for robust spatial proteomics by mass spectrometry. *These authors contributed equally. [Manuscript]
IV. Taner Arslan, Olena Berkovska, Georgios Mermelekas, Janne Lehtiö, and Lukas M. Orre. Stratification of non-small-cell lung cancer patients by a peptide-level mass spectrometry-based classification assay. [Manuscript]
History
Defence date
2021-10-22
Department
- Department of Oncology-Pathology
Publisher/Institution
Karolinska Institutet
Main supervisor
Orre, Lukas M.
Co-supervisors
Lehtiö, Janne
Publication year
2021
Thesis type
- Doctoral thesis
ISBN
978-91-8016-306-4
Number of supporting papers
4
Language
- eng