Predicting transmembrane topology and signal peptides with hidden Markov models
Transmembrane proteins make up a large and important class of proteins. About 20% of all genes encode transmembrane proteins. They control the passage of both substances and information into and out of a cell. Yet basic knowledge about membrane insertion and folding is sparse, and our ability to identify, over-express, purify, and crystallize transmembrane proteins lags far behind what is possible for water-soluble proteins.
It is difficult to determine the three-dimensional structures of transmembrane proteins. Therefore, researchers normally attempt to determine their topology, i.e. which parts of the protein are buried in the membrane, and on which side of the membrane the other parts are located.
Proteins destined for export have an N-terminal sequence known as a signal peptide that is inserted into the membrane and cleaved off. The same mechanism that inserts transmembrane proteins into their membranes also handles the export of proteins with signal peptides. Transmembrane helices and signal peptides thus have many features in common.
In silico methods for predicting transmembrane topology and methods for predicting signal peptides from amino acid sequence are a fast and relatively accurate alternative to biochemical experiments. A methodology called hidden Markov models (HMMs) has proved particularly useful for these and other prediction tasks.
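As a rough illustration of how an HMM assigns a label to every residue, the following is a minimal sketch of Viterbi decoding. It assumes a toy two-state model (membrane "M" versus non-membrane "o") that emits only a hydrophobic or non-hydrophobic residue class. The states, the residue grouping, and all probabilities are made up for illustration and are not the parameters of any predictor described in this thesis.

```python
import math

# Assumption: a crude hydrophobic residue class, chosen for illustration only.
HYDROPHOBIC = set("AILMFVWC")

# Toy model parameters (hypothetical, not taken from any published predictor).
STATES = ("o", "M")
START = {"o": 0.9, "M": 0.1}
TRANS = {"o": {"o": 0.9, "M": 0.1}, "M": {"o": 0.1, "M": 0.9}}
# Probability of emitting a hydrophobic residue in each state.
EMIT_HYDROPHOBIC = {"o": 0.3, "M": 0.8}


def emit(state, residue):
    """Emission probability of a residue, reduced to its hydrophobicity class."""
    p = EMIT_HYDROPHOBIC[state]
    return p if residue in HYDROPHOBIC else 1.0 - p


def viterbi(seq):
    """Return the most probable state path for seq under the toy model."""
    if not seq:
        return ""
    # Log-probability of the best path ending in each state at position 0.
    score = {s: math.log(START[s]) + math.log(emit(s, seq[0])) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for residue in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(emit(s, residue)))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))
```

Calling `viterbi` on an amino acid string returns a label string of the same length, one label per residue. Real topology predictors use many more states (for example separate inside and outside loop states and length-constrained helix models), so this sketch only conveys the decoding principle.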
In this thesis, properties of transmembrane topology predictors and signal peptide predictors are investigated. The thesis includes three novel HMM-based prediction methods:
i) A combined transmembrane topology and signal peptide predictor, Phobius. The paper shows that cross predictions, i.e. signal peptides predicted as transmembrane helices and vice versa, are a common problem. About 10% of the genes in E. coli have overlapping signal peptide and transmembrane helix predictions by conventional predictors. We were able to dramatically lower the number of these false cross predictions.
ii) A method for detecting remote G protein-coupled receptor (GPCR) families, GPCRHMM. GPCRs are a very large and divergent superfamily of transmembrane proteins. We designed a hidden Markov model based on the topological regions of the superfamily. We searched five genomes and predicted 120 previously unannotated sequences as possible GPCRs. The majority of these predictions (102) were in C. elegans, but 4 were found in human and 7 in mouse. We also conclude that a family of odorant receptors in Drosophila are not GPCRs.
iii) A method to improve HMM predictions of generic sequence features (such as transmembrane segments or signal peptides) by including information from homologs. We show that Phobius performed significantly better with this homology-aware decoder than with other decoders.
We also assessed the difficulty of benchmark sets used in transmembrane topology prediction. By studying the level of agreement between different predictors applied to typical benchmark sets and to whole-proteome sets, we concluded that the benchmark sets are far easier to predict than real proteome data. In other words, the accuracies reported in benchmark studies are exaggerated.
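The idea of comparing predictor agreement across data sets can be sketched as follows. This is a minimal example that assumes agreement means all predictors report the same number of transmembrane helices for a protein. That criterion, the per-residue label format ("M" for membrane residues), and the function names are assumptions made for illustration and are not necessarily those used in the thesis.

```python
def count_helices(topology):
    """Count contiguous runs of 'M' in a label string such as 'iiMMMMooMMMMii'."""
    count, inside = 0, False
    for label in topology:
        if label == "M" and not inside:
            count += 1
        inside = label == "M"
    return count


def agreement_rate(predictions):
    """Fraction of proteins on which every predictor gives the same helix count.

    predictions maps a predictor name to a dict {protein_id: topology string}.
    Only proteins predicted by all predictors are considered.
    """
    per_predictor = list(predictions.values())
    if not per_predictor:
        return 0.0
    shared = set.intersection(*(set(p) for p in per_predictor))
    if not shared:
        return 0.0
    agreed = sum(
        1 for pid in shared
        if len({count_helices(p[pid]) for p in per_predictor}) == 1
    )
    return agreed / len(shared)
```

Computing `agreement_rate` once for a benchmark set and once for a whole-proteome set gives two numbers that can be compared directly. A markedly higher rate on the benchmark set would be consistent with the benchmark being the easier of the two.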
The thesis also includes a paper presenting a hypothesis of the transmembrane topology of presenilin, a protein involved in the development of Alzheimer's disease. By comparing the output of several transmembrane topology predictors with experimental results from previous studies, a novel nine-transmembrane topology with an extracellular C-terminus was elucidated.
List of scientific papers
I. Kall L, Sonnhammer EL (2002). Reliability of transmembrane predictions in whole-genome data. FEBS Lett. 532(3): 415-8.
https://pubmed.ncbi.nlm.nih.gov/12482603
II. Kall L, Krogh A, Sonnhammer EL (2004). A combined transmembrane topology and signal peptide prediction method. J Mol Biol. 338(5): 1027-36.
https://pubmed.ncbi.nlm.nih.gov/15111065
III. Henricson A, Kall L, Sonnhammer EL (2005). A novel transmembrane topology of presenilin based on reconciling experimental and computational evidence. FEBS J. 272(11): 2727-33.
https://pubmed.ncbi.nlm.nih.gov/15943807
IV. Kall L, Krogh A, Sonnhammer EL (2005). An HMM posterior decoder for sequence feature prediction that includes homology information. Bioinformatics. Suppl 1: i251-i257.
https://pubmed.ncbi.nlm.nih.gov/15961464
V. Wistrand M, Kall L, Sonnhammer EL (2006). A general model of G protein-coupled receptor sequences and its application to detect remote homologs. Protein Sci. 15(3): 509-21. Epub 2006 Feb 1
https://pubmed.ncbi.nlm.nih.gov/16452613
History
Defence date
2006-04-07
Department
- Department of Cell and Molecular Biology
Publisher/Institution
Karolinska Institutet
Publication year
2006
Thesis type
- Doctoral thesis
ISBN-10
91-7140-719-7
Number of supporting papers
5
Language
- eng