Deciphering the complexity of biological systems using machine learning applied to big data
Author: Arslan, Taner
Date: 2021-10-22
Location: Lecture hall Atrium, Nobels väg 12B, Karolinska Institutet, Campus Solna
Time: 09.00
Department: Inst för onkologi-patologi / Dept of Oncology-Pathology
Abstract
With the rise of the omics era and recent technological advances in biology and medicine, the volume of generated data is growing exponentially. This growth, together with the increasing complexity and heterogeneity of the data, creates challenges in data analysis, information extraction, and interpretation of results.
This thesis aims to reveal the hidden structure and knowledge in complex biological data and to apply them in fundamental biological and clinical studies using machine learning techniques and pipelines. Specifically, it aims to map the subcellular localization of proteins by building a mass spectrometry (MS)- and machine learning-based pipeline, to automate that pipeline by developing a software package, to reveal the proteome subtypes of non-small cell lung cancer (NSCLC), and to build classifier pipelines for NSCLC subtyping.
In study I, we developed an MS- and machine learning-based pipeline to generate a proteome-wide resource of protein subcellular localization across multiple human cancer cell lines. Furthermore, we analyzed protein-domain- and variant-dependent localization, and tested proteome-wide relocalization driven by EGFR inhibition. In study II, we performed in-depth quantitative molecular phenotyping of an NSCLC cohort and used unsupervised clustering to reveal six subtypes with distinct mutation, immune infiltration, and proliferation profiles. Additionally, supervised classifiers (support vector machine and k-top scoring pairs) were built and validated on independent datasets. In study III, we present comprehensive wet-lab and dry-lab protocols for protein subcellular localization analysis. Specifically, an R package was developed for the downstream analysis, including preprocessing of the MS output data, building the classifier, and visualizing the output. In study IV, we prototype and develop an MS-based, peptide-centric machine learning classifier for subtyping and stratification of NSCLC patients for improved therapy selection.
In summary, the work presented in this thesis shows how to strategically analyze large, complex biological and clinical data to extract useful biological insights. Furthermore, pipelines for data-driven approaches, including machine learning and data science methods, were developed to understand the biology and to transfer that knowledge into the clinic. Developing such methods and building pipelines for the analysis of big biological data will be a core element of a range of applications in biology and medicine.
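For readers unfamiliar with the k-top scoring pairs (k-TSP) classifier named in study II, the general idea can be sketched as follows. This is a minimal illustration and not the implementation used in the thesis: it is simplified to two classes (the thesis reports six subtypes), the function names `fit_ktsp` and `predict_ktsp` are hypothetical, and the toy values are made up for demonstration only. Each feature pair is scored by how differently the ordering "feature i < feature j" occurs in the two classes; the k best disjoint pairs then vote on a new sample.

```python
from itertools import combinations


def fit_ktsp(samples, labels, k=3):
    """Pick the k highest-scoring disjoint feature pairs (two-class sketch)."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    n_features = len(samples[0])

    def freq(i, j, cls):
        # Fraction of samples of class `cls` in which feature i < feature j.
        rows = [s for s, y in zip(samples, labels) if y == cls]
        return sum(r[i] < r[j] for r in rows) / len(rows)

    scored = []
    for i, j in combinations(range(n_features), 2):
        p0, p1 = freq(i, j, classes[0]), freq(i, j, classes[1])
        # Observing i < j votes for the class in which that ordering is more common.
        vote_if_less = classes[0] if p0 > p1 else classes[1]
        scored.append((abs(p0 - p1), i, j, vote_if_less))
    scored.sort(key=lambda t: t[0], reverse=True)

    # Greedily keep the best pairs that do not reuse a feature.
    pairs, used = [], set()
    for _score, i, j, vote_if_less in scored:
        if i in used or j in used:
            continue
        pairs.append((i, j, vote_if_less))
        used.update((i, j))
        if len(pairs) == k:
            break
    return classes, pairs


def predict_ktsp(model, sample):
    """Majority vote over the selected pairs (ties go to the first class)."""
    classes, pairs = model
    votes = {c: 0 for c in classes}
    for i, j, vote_if_less in pairs:
        other = classes[1] if vote_if_less == classes[0] else classes[0]
        votes[vote_if_less if sample[i] < sample[j] else other] += 1
    return max(classes, key=lambda c: votes[c])


# Made-up toy data: four features per sample, two classes.
samples = [
    [1.0, 2.0, 5.0, 3.0],
    [0.5, 1.5, 4.0, 2.0],
    [2.0, 3.0, 6.0, 1.0],
    [3.0, 1.0, 2.0, 4.0],
    [2.5, 0.5, 1.0, 3.0],
    [4.0, 2.0, 0.5, 5.0],
]
labels = ["A", "A", "A", "B", "B", "B"]

model = fit_ktsp(samples, labels, k=2)
prediction_a = predict_ktsp(model, [1.0, 3.0, 4.0, 2.0])
prediction_b = predict_ktsp(model, [5.0, 1.0, 1.0, 6.0])
```

Because the rule depends only on the relative ordering within each pair, it is insensitive to monotone rescaling of a sample, which is one commonly cited reason for using rank-based classifiers on omics measurements. An odd k is typically chosen to avoid tied votes.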
List of papers:
I. Lukas Minus Orre*, Mattias Vesterlund*, Yanbo Pan*, Taner Arslan*, Yafeng Zhu, Alejandro Fernandez Woodbridge, Oliver Frings, Erik Fredlund, and Janne Lehtiö. SubCellBarCode: Proteome-wide Mapping of Protein Localization and Relocalization. Molecular Cell. 2019 Jan 3;73(1):166-182.e7. *These authors contributed equally.
II. Janne Lehtiö, Taner Arslan, Ioannis Siavelis, Yanbo Pan, Fabio Socciarelli, Olena Berkovska, Husen M. Umer, Georgios Mermelekas, Mohammad Pirmoradian, Mats Jönsson, Hans Brunnström, Odd Terje Brustugun, Krishna Pinganksha Purohit, Richard Cunningham, Hassan Foroughi Asl, Sofi Isaksson, Elsa Arbajian, Mattias Aine, Anna Karlsson, Marija Kotevska, Carsten Gram Hansen, Vilde Drageset Haakensen, Åslaug Helland, David Tamborero, Henrik J. Johansson, Rui M. Branca, Maria Planck, Johan Staaf, and Lukas M. Orre. Proteogenomics of non-small cell lung cancer reveals molecular subtypes associated with specific therapeutic targets and immune evasion mechanisms. [Accepted]
III. Taner Arslan*, Yanbo Pan*, Georgios Mermelekas, Mattias Vesterlund, Lukas M. Orre, and Janne Lehtiö. SubCellBarCode: Integrated workflow for robust spatial proteomics by mass spectrometry. *These authors contributed equally. [Manuscript]
IV. Taner Arslan, Olena Berkovska, Georgios Mermelekas, Janne Lehtiö, and Lukas M. Orre. Stratification of non-small-cell lung cancer patients by a peptide-level mass spectrometry-based classification assay. [Manuscript]
Institution: Karolinska Institutet
Supervisor: Orre, Lukas M.
Co-supervisor: Lehtiö, Janne
Issue date: 2021-10-01
Publication year: 2021
ISBN: 978-91-8016-306-4