Karolinska Institutet

Hidden Markov models for remote protein homology detection

thesis
posted on 2024-09-03, 01:21 authored by Markus Wistrand

Genome sequencing projects are advancing at a staggering pace and are daily producing large amounts of sequence data. However, the experimental characterization of the encoded genes and proteins is lagging far behind. Interpretation of genomic sequences therefore largely relies on computational algorithms and on transferring annotation from characterized proteins to related uncharacterized proteins. Detection of evolutionary relationships between sequences - protein homology detection - has become one of the main fields of computational biology. Arguably the most successful technique for modeling protein homology is the Hidden Markov Model (HMM), which is based on a probabilistic framework.

This thesis describes improvements to protein homology detection methods. The main part of the work is devoted to profile HMMs, which are used in database searches to identify homologous protein sequences that belong to the same protein family. The key step is model estimation, which aims to create a model that generalizes an often limited and biased training set to the entire protein family, including members that have not yet been observed. This thesis addresses several issues in model estimation: i) prior probability settings, pointing at a conflict between modeling true positives and high discrimination; ii) discriminative training, by proposing an algorithm that adapts model parameters from non-homologous sequences; and iii) key HMM parameters, assessing the relative importance of different aspects of the estimation process, leading to an optimized procedure. Taken together, the work extends our knowledge of theoretical aspects of profile HMMs and can immediately be used for improved protein homology detection by profile HMMs.

If related sequences are highly divergent, standard methods often fail to detect homology. The superfamily of G protein-coupled receptors (GPCRs) can be divided into families that almost completely lack sequence similarity, yet share the same seven membrane-spanning topology. It would not be possible to construct a profile HMM that models the entire superfamily. We instead analyzed the GPCR superfamily and found conserved features in the amino acid distributions and lengths of membrane and non-membrane regions. Based on those high-level features we estimated an HMM (GPCRHMM), with the specific goal of detecting remotely related GPCRs. GPCRHMM is, at comparable error rates, much more sensitive than other strategies for GPCR discovery. In a search of five genomes we predicted 120 sequences that lacked previous annotation as possible GPCRs. The majority of these predictions (102) were in C. elegans, but 4 were also found in human and 7 in mouse.

List of scientific papers

I. Wistrand M, Sonnhammer EL (2004). Transition priors for protein hidden Markov models: an empirical study towards maximum discrimination. J Comput Biol. 11(1): 181-93.
https://doi.org/10.1089/106652704773416957

II. Wistrand M, Sonnhammer EL (2004). Improving profile HMM discrimination by adapting transition probabilities. J Mol Biol. 338(4): 847-54.
https://doi.org/10.1016/j.jmb.2004.03.023

III. Wistrand M, Sonnhammer EL (2005). Improved profile HMM performance by assessment of critical algorithmic features in SAM and HMMER. BMC Bioinformatics. 6(1): 99.
https://doi.org/10.1186/1471-2105-6-99

IV. Wistrand M, Kall L, Sonnhammer ELL (2005). A general model of G protein-coupled receptor sequences and its application to detect remote homologs. [Accepted]
https://doi.org/10.1110/ps.051745906

History

Defence date

2006-01-13

Department

  • Department of Cell and Molecular Biology

Publisher/Institution

Karolinska Institutet

Publication year

2006

Thesis type

  • Doctoral thesis

ISBN-10

91-7140-598-4

Number of supporting papers

4

Language

  • eng

Original publication date

2005-12-23

Author name in thesis

Wistrand, Markus

Original department name

Center for Genomics Research

Place of publication

Stockholm
