Karolinska Institutet

Mass spectrometry based proteomics : data analysis and applications

thesis
posted on 2024-09-03, 03:26 authored by Yafeng Zhu

Mass spectrometry (MS) based proteomics has become a widely used high-throughput method for investigating protein expression and functional regulation. Once able to study only dozens of proteins, state-of-the-art MS proteomics techniques can now identify and quantify ten thousand proteins. Nevertheless, MS proteomics still faces problems in investigating protein variants derived from alternative splicing, detecting peptides from novel coding sequences, identifying peptide variants arising from genetic changes, and statistically analyzing quantitative proteome data. The work presented in this thesis starts from these problems and contributes solutions to them.

In standard shotgun proteomics studies, protein identifications are inferred from a list of identified peptides using the Occam's razor principle, which outputs a minimal list of proteins sufficient to explain the peptide evidence. This protein inference process creates a potential problem for protein-level quantification: if the inferred proteins do not correctly represent the peptide populations, the result is a mixture of quantitative signals from different splice variants. Paper I presents a tool for investigating splice variants using MS proteomics data. By clustering the quantitative patterns of peptides and showing their transcript positions, it can reveal splice-variant-specific peptides with different quantitative signals. The tool was applied to a comprehensive proteomics dataset of A431 cells treated with Gefitinib (an EGFR inhibitor). For certain genes, we observed that splice-variant-centric quantification differs from traditional protein-centric or gene-centric quantification, suggesting that splice variants are differentially regulated after Gefitinib treatment.
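The parsimony step described above can be illustrated with a minimal sketch. It is not the implementation used in Paper I or in any particular search engine: it assumes a greedy set-cover heuristic, and the function name `parsimonious_proteins` and the peptide and isoform identifiers are hypothetical.

```python
from typing import Dict, List, Set


def parsimonious_proteins(protein_to_peptides: Dict[str, Set[str]]) -> List[str]:
    """Greedy approximation of a minimal protein list explaining all peptides."""
    # Every identified peptide starts out unexplained.
    unexplained: Set[str] = set().union(*protein_to_peptides.values())
    selected: List[str] = []
    while unexplained:
        # Pick the protein that explains the most remaining peptides.
        # Iterating over sorted names makes ties resolve alphabetically.
        best = max(
            sorted(protein_to_peptides),
            key=lambda name: len(protein_to_peptides[name] & unexplained),
        )
        selected.append(best)
        unexplained -= protein_to_peptides[best]
    return selected


# Hypothetical example: two isoforms of one gene share pep2 and pep3.
example = {
    "GENE1-isoA": {"pep1", "pep2", "pep3"},
    "GENE1-isoB": {"pep2", "pep3"},
    "GENE2": {"pep4"},
}
result = parsimonious_proteins(example)
print(result)
```

In this toy input the shared peptides are all attributed to the longer isoform and the second isoform is dropped from the list, so any signal that actually came from it would be merged into the reported protein's quantification.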

Previously, MS proteomics has been used to refine genome annotation. However, these applications were limited to validating and confirming predicted gene models. In Paper II, we demonstrate an integrative genome annotation workflow that combines MS proteomics data and RNA-sequencing (RNA-seq) data to perform evidence-based whole-genome annotation of a newly sequenced commensal yeast. The workflow annotated protein-coding genes more accurately than the conventional approach of using RNA-seq data alone. The study shows that proteomics data used in combination with RNA-seq data can produce a more accurate and complete whole-genome annotation.

Paper III presents an integrative proteogenomics analysis workflow. Compared to standard proteomics, which analyzes known proteins in a reference database, proteogenomics aims to discover peptides from novel coding sequences and disease-relevant mutations. Identifying novel coding sequences in well-annotated genomes, such as the human genome, is particularly challenging for several reasons. First, protein-coding sequences make up only 2%-3% of the human genome. There are approximately one million peptides from known coding genes, and the novel peptides from undiscovered coding loci constitute a minor part of the total peptide population. This means that the vast majority of experimental spectra are produced by known peptides. Second, peptide identification in MS proteomics relies on correct matching of experimental spectra to in silico generated spectra of the peptides in the search space. Detecting novel peptides requires correct spectrum matching for both known and novel peptides, and the process inevitably produces false positives. Previously, conservative criteria and manual curation have been applied to ensure the quality of findings. The workflow in Paper III improves the reliability of proteogenomics findings through automated, extensive data curation and evidence searching in orthogonal data. In the analysis of proteomics data from a cancer cell line and five normal human tissues, the workflow successfully detected novel peptides from unknown coding regions and peptide variants from non-synonymous single nucleotide polymorphisms (nsSNPs) and mutations, with multiple sources of evidence provided. Moreover, our quantitative MS data indicated that certain pseudogenes and lncRNAs were expressed and translated in a tissue-specific manner.

Paper IV addresses the statistical analysis of quantitative proteomics data. Currently, there is no consensus on which statistical methods to use for analyzing labelled and label-free proteomics data. One of the main reasons is the lack of a statistical tool that combines high performance, ease of use, and broad applicability to various proteomics datasets. The presented statistical method, DEqMS, is a robust and universal tool for assessing differential protein expression in quantitative MS proteomics. In its statistical significance test, DEqMS takes into account the dependence of variance on the number of peptides/PSMs used for protein quantification. Compared to existing methods on several benchmarking datasets, DEqMS showed both high statistical accuracy and general applicability.
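The variance dependence mentioned above can be illustrated with a small simulation. This is not the DEqMS method itself, only a sketch of the effect it accounts for: it assumes a protein's log-ratio is summarized as the median of its PSM log-ratios and that PSM noise is Gaussian. The function name and the parameter values (`psm_sd`, `n_proteins`, `seed`) are hypothetical choices for illustration.

```python
import random
import statistics


def simulate_protein_variance(n_psms, n_proteins=2000, psm_sd=0.5, seed=0):
    """Variance of protein-level estimates when each uses n_psms PSMs."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_proteins):
        # Each PSM log-ratio is the true value (0) plus Gaussian noise.
        psm_log_ratios = [rng.gauss(0.0, psm_sd) for _ in range(n_psms)]
        # Summarize the protein as the median of its PSM log-ratios.
        estimates.append(statistics.median(psm_log_ratios))
    return statistics.variance(estimates)


# Compare proteins quantified with few versus many PSMs.
variance_by_count = {n: simulate_protein_variance(n) for n in (1, 3, 10, 30)}
for n, var in variance_by_count.items():
    print(f"{n:>2} PSMs: variance of protein estimate = {var:.4f}")
```

Under these assumptions, proteins quantified from a single PSM show much larger variance than those quantified from many, which is why a test that assumes one common variance for all proteins can be miscalibrated.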

In summary, the work included in this thesis contributes improved data interpretation and applications of MS proteomics data in the analysis of splice variants, genome annotation, proteogenomics studies, and statistical analysis of protein expression changes. The development of these methods facilitates a wide range of applications of MS proteomics data in systems biology research.

List of scientific papers

I. Yafeng Zhu, Lina Hultin-Rosenberg, Jenny Forshed, Rui M. M. Branca, Lukas M. Orre, and Janne Lehtiö. SpliceVista, a tool for splice variant identification and visualization in shotgun proteomics data. Molecular & Cellular Proteomics. 2014 Jun;13(6):1552-62.
https://doi.org/10.1074/mcp.M113.031203

II. Yafeng Zhu, Pär G Engström, Christian Tellgren-Roth, Charles Baudo, John C Kennell, Sheng Sun, R. Blake Billmyre, Markus S. Schröder, Anna Andersson, Tina Holm, Benjamin Sigurgeirsson, Guangxi Wu, Sundar Ram Sankaranarayanan, Rahul Siddharthan, Kaustuv Sanyal, Joakim Lundeberg, Björn Nystedt, Teun Boekhout, Thomas L Dawson Jr., Joseph Heitman, Annika Scheynius, Janne Lehtiö. Proteogenomics produces comprehensive and highly accurate protein-coding gene annotation in a complete genome assembly of Malassezia sympodialis. Nucleic Acids Research. 2017 Mar 17;45(5):2629-2643.
https://doi.org/10.1093/nar/gkx006

III. Yafeng Zhu, Lukas M. Orre, Henrik J. Johansson, Mikael Huss, Jorrit Boekel, Mattias Vesterlund, Alejandro Fernandez-Woodbridge, Rui M. M. Branca & Janne Lehtiö. Discovery of coding regions in human genome using an integrated proteogenomics analysis workflow. Nature Communications. 2018 Mar 2;9(1):903.
https://doi.org/10.1038/s41467-018-03311-y

IV. Yafeng Zhu, Lukas M. Orre, Georgios Mermelekas, Henrik J. Johansson, Alina Malyutina, Simon Anders, Janne Lehtiö. DEqMS: a robust and universal statistical method for quantitative mass spectrometry proteomics. [Manuscript]

History

Defence date

2018-05-25

Department

  • Department of Oncology-Pathology

Publisher/Institution

Karolinska Institutet

Main supervisor

Lehtiö, Janne

Co-supervisors

Forshed, Jenny

Publication year

2018

Thesis type

  • Doctoral thesis

ISBN

978-91-7831-044-9

Number of supporting papers

4

Language

  • eng

Original publication date

2018-05-08

Author name in thesis

Zhu, Yafeng

Original department name

Department of Oncology-Pathology

Place of publication

Stockholm
