Hidden Markov models for remote protein homology detection
Author: Wistrand, Markus
Date: 2006-01-13
Location: Lecture hall at the Department of Physiology and Pharmacology, Nanna Svartz väg 2
Time: 13.00
Department: Centrum för Genomik och Bioinformatik (CGB) / Center for Genomics Research
Abstract
Genome sequencing projects are advancing at a staggering pace and produce large amounts of sequence data every day. However, the experimental characterization of the encoded genes and proteins is lagging far behind. Interpretation of genomic sequences therefore relies largely on computational algorithms and on transferring annotation from characterized proteins to related uncharacterized proteins. Detection of evolutionary relationships between sequences - protein homology detection - has become one of the main fields of computational biology. Arguably the most successful technique for modeling protein homology is the Hidden Markov Model (HMM), which is based on a probabilistic framework.
This thesis describes improvements to protein homology detection methods. The main part of the work is devoted to profile HMMs, which are used in database searches to identify homologous protein sequences that belong to the same protein family. The key step is model estimation, which aims to create a model that generalizes an often limited and biased training set to the entire protein family, including members that have not yet been observed. This thesis addresses several issues in model estimation: i) prior probability settings, pointing at a conflict between modeling true positives and achieving high discrimination; ii) discriminative training, by proposing an algorithm that adapts model parameters using non-homologous sequences; and iii) key HMM parameters, assessing the relative importance of different aspects of the estimation process, leading to an optimized procedure. Taken together, the work extends our knowledge of theoretical aspects of profile HMMs and can immediately be used for improved protein homology detection by profile HMMs.
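As a rough illustration of what a prior probability setting does during model estimation, the sketch below blends observed transition counts with Dirichlet pseudocounts. It is not the procedure from the thesis or from SAM or HMMER; the function name, the transition labels and the pseudocount values are assumptions chosen only for the example.

```python
def posterior_mean_transitions(counts, prior):
    """Estimate transition probabilities from counts plus Dirichlet pseudocounts.

    counts: observed transition counts from the training alignment
            (hypothetical labels such as "M->M", "M->I", "M->D").
    prior:  pseudocounts for the same labels (illustrative values only).
    """
    total = sum(counts.get(key, 0) + prior[key] for key in prior)
    return {key: (counts.get(key, 0) + prior[key]) / total for key in prior}


# A small, biased training set that never shows an insertion or deletion.
observed = {"M->M": 8, "M->I": 0, "M->D": 0}
pseudocounts = {"M->M": 2.0, "M->I": 1.0, "M->D": 1.0}  # assumed values
probabilities = posterior_mean_transitions(observed, pseudocounts)
```

With these assumed numbers the unobserved transitions keep a small non-zero probability, so unseen family members with insertions or deletions are not ruled out. Larger pseudocounts pull the estimate further from the training data, which is one way to picture the trade-off between fitting known members and discriminating against non-members.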
If related sequences are highly divergent, standard methods often fail to detect homology. The superfamily of G protein-coupled receptors (GPCRs) can be divided into families with an almost complete lack of sequence similarity that nevertheless share the same seven membrane-spanning topology. It would not be possible to construct a profile HMM that models the entire superfamily. We instead analyzed the GPCR superfamily and found conserved features in the amino acid distributions and lengths of membrane and non-membrane regions. Based on those high-level features we estimated an HMM (GPCRHMM), with the specific goal of detecting remotely related GPCRs. GPCRHMM is, at comparable error rates, much more sensitive than other strategies for GPCR discovery. In a search of five genomes we predicted 120 sequences that lacked previous annotation as possible GPCRs. The majority of these predictions (102) were in C. elegans, but 4 were also found in human and 7 in mouse.
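The idea of scoring a sequence by region-level composition rather than by position-specific residues can be sketched with a toy two-state HMM. This is not the GPCRHMM architecture: the two states, the hydrophobic residue set and every probability below are assumptions made up for illustration. The sketch computes the forward log-likelihood and compares it with a simple background model.

```python
import math

HYDROPHOBIC = set("AILMFVWC")  # assumed residue class for the example
STATES = ("membrane", "loop")
START = {"membrane": 0.5, "loop": 0.5}
TRANS = {
    "membrane": {"membrane": 0.9, "loop": 0.1},
    "loop": {"membrane": 0.1, "loop": 0.9},
}
P_HYDROPHOBIC = {"membrane": 0.7, "loop": 0.3}  # per-state emission (toy values)
NULL_P_HYDROPHOBIC = 0.4                        # background model (toy value)


def emission(state, residue):
    """Probability of the residue's class (hydrophobic or not) in a state."""
    p = P_HYDROPHOBIC[state]
    return p if residue in HYDROPHOBIC else 1.0 - p


def log_sum_exp(values):
    """Numerically stable log of a sum of exponentials."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def forward_log_likelihood(seq):
    """Forward algorithm in log space: log P(seq | toy HMM)."""
    if not seq:
        raise ValueError("sequence must not be empty")
    log_alpha = {
        s: math.log(START[s]) + math.log(emission(s, seq[0])) for s in STATES
    }
    for residue in seq[1:]:
        log_alpha = {
            s: math.log(emission(s, residue))
            + log_sum_exp([log_alpha[r] + math.log(TRANS[r][s]) for r in STATES])
            for s in STATES
        }
    return log_sum_exp(list(log_alpha.values()))


def null_log_likelihood(seq):
    """Log-likelihood under an independent background composition."""
    return sum(
        math.log(NULL_P_HYDROPHOBIC if r in HYDROPHOBIC else 1.0 - NULL_P_HYDROPHOBIC)
        for r in seq
    )


def log_odds(seq):
    """Log-odds of the toy HMM against the background model."""
    return forward_log_likelihood(seq) - null_log_likelihood(seq)


blocky = log_odds("LLLLLLLLLLKKKKKKKKKK")       # one hydrophobic run, one polar run
alternating = log_odds("LKLKLKLKLKLKLKLKLKLK")  # same composition, no runs
```

Both example sequences have the same overall composition, but only the one with contiguous hydrophobic and polar runs scores above the background model, since the toy HMM rewards staying in one region type. A realistic model would use full amino acid distributions and explicit length modeling for each region.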
List of papers:
I. Wistrand M, Sonnhammer EL (2004). Transition priors for protein hidden Markov models: an empirical study towards maximum discrimination. J Comput Biol. 11(1): 181-93.
II. Wistrand M, Sonnhammer EL (2004). Improving profile HMM discrimination by adapting transition probabilities. J Mol Biol. 338(4): 847-54.
III. Wistrand M, Sonnhammer EL (2005). Improved profile HMM performance by assessment of critical algorithmic features in SAM and HMMER. BMC Bioinformatics. 6(1): 99.
IV. Wistrand M, Käll L, Sonnhammer ELL (2005). A general model of G protein-coupled receptor sequences and its application to detect remote homologs. [Accepted]
Issue date: 2005-12-23
Publication year: 2006
ISBN: 91-7140-598-4