Karolinska Institutet
Orthology and protein domain architecture evolution

thesis
posted on 2024-09-02, 15:15 authored by Volker Hollich

A major factor behind protein evolution is the ability of proteins to evolve new domain architectures that encode new functions. Protein domains are widely considered to constitute the "atoms" of protein chains, acting as building blocks of proteins as well as evolutionary units. A small number of domains are found in many different domain combinations, while the majority of domains co-occur with very few types of other domains.

Domain architectures are not necessarily created only once during evolution. Cases of convergent evolution show how a favourable domain architecture has evolved multiple times independently. A basic concept for understanding evolution at the gene level is orthology.

Two genes are orthologous if they have evolved from the same gene in the last common ancestor of the species and have thus been created by a speciation event. Paralogous genes result from a duplication event that produced two gene copies within the same species. The concept of orthology can be transferred from genes to protein domains and utilised to explain recombination of protein domains and the evolution of domain architectures.
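The distinction between orthologs and paralogs can be illustrated with a small sketch. It is not the method used in the thesis; it assumes a simple species-overlap rule, under which the last common ancestor of two genes in a gene tree is treated as a duplication if its child subtrees share a species, and as a speciation otherwise. The `Node` class, the helper names and the toy tree are hypothetical.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import List, Optional, Set


@dataclass
class Node:
    """Gene tree node; leaves carry a gene name and a species label."""
    name: str = ""
    species: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def leaf_names(node: Node) -> Set[str]:
    """Names of all genes (leaves) below this node."""
    if not node.children:
        return {node.name}
    return set().union(*(leaf_names(c) for c in node.children))


def species_below(node: Node) -> Set[str]:
    """Species represented among the leaves below this node."""
    if not node.children:
        return {node.species}
    return set().union(*(species_below(c) for c in node.children))


def last_common_ancestor(node: Node, gene_a: str, gene_b: str) -> Node:
    """Descend while a single child subtree still contains both genes."""
    for child in node.children:
        if {gene_a, gene_b} <= leaf_names(child):
            return last_common_ancestor(child, gene_a, gene_b)
    return node


def classify_pair(root: Node, gene_a: str, gene_b: str) -> str:
    """Return 'orthologs' or 'paralogs' under the species-overlap assumption."""
    if gene_a == gene_b or not {gene_a, gene_b} <= leaf_names(root):
        raise ValueError("need two distinct genes that are both in the tree")
    lca = last_common_ancestor(root, gene_a, gene_b)
    subtree_species = [species_below(c) for c in lca.children]
    # Shared species between child subtrees imply a duplication at this node.
    for left, right in combinations(subtree_species, 2):
        if left & right:
            return "paralogs"
    return "orthologs"


def leaf(name: str, species: str) -> Node:
    return Node(name=name, species=species)


# Toy gene tree: a duplication at the root, followed by a speciation in each copy.
tree = Node(children=[
    Node(children=[leaf("humanA", "human"), leaf("mouseA", "mouse")]),
    Node(children=[leaf("humanB", "human"), leaf("mouseB", "mouse")]),
])

print(classify_pair(tree, "humanA", "mouseA"))
print(classify_pair(tree, "humanA", "humanB"))
```

In the toy tree, humanA and mouseA meet at a speciation node, whereas humanA and humanB (and likewise humanA and mouseB) meet at the root duplication.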

The focus of this work is to augment the understanding of domain architecture evolution and its functional implications. We have examined, evaluated and improved existing methods as well as developed new approaches. The concept of orthology plays a major role in this work. Orthology is often inferred from phylogenetic trees that are based on pairwise distance estimations of protein sequences. The Scoredist protein sequence distance estimator has been developed as one part of this thesis. It combines robustness with low computational complexity and can be calibrated towards various evolutionary models.

Accurate phylogenetic trees are crucial for many applications, hence the appropriate tree reconstruction algorithm should be chosen with care. The strengths and weaknesses of many current tree reconstruction algorithms were assessed, and the findings underscore the value of the Scoredist estimator. The Pfam protein families database comprises a large number of protein families and domains. As part of this thesis it has been enhanced by search and query tools, such as PfamAlyzer or the browser-based domain query, that can be applied to whole domain architectures instead of individual domains only.

We have developed a Maximum Parsimony algorithm for the prediction of ancestral domain architectures. In contrast to previous approaches, it employs gene trees rather than species trees. The algorithm was a starting point for an extensive study of the domain architectures present in Pfam for 50 completely sequenced species. Sampling widely across the kingdoms of life, the study sought to find and analyse cases where a domain architecture had been created multiple times. The algorithm proved robust to potential biases from horizontal gene transfer. Convergent evolution of domain architectures was found more frequently than by previous approaches. No strong biases driving convergent evolution were found. It therefore seems to be a random process in much the same way as evolution through duplication and recombination, yet less frequent.
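As a rough illustration of parsimony-based ancestral reconstruction on a gene tree, the sketch below applies the classic Fitch algorithm to a single binary character: whether a gene carries a given domain. It is a simplification and not the algorithm developed in the thesis, which works on whole domain architectures. The tree encoding (nested tuples), the gene names `g1` to `g6`, the presence/absence data and the tie-break towards "absent" are all assumptions made for the example.

```python
from typing import Dict, FrozenSet, List, Tuple, Union

# A gene tree is either a leaf name or a pair of subtrees (hypothetical encoding).
Tree = Union[str, Tuple["Tree", "Tree"]]


def fitch_sets(tree: Tree, leaf_states: Dict[str, bool],
               sets: Dict[Tree, FrozenSet[bool]]) -> int:
    """Bottom-up Fitch pass: fill candidate state sets, return the change count."""
    if isinstance(tree, str):
        sets[tree] = frozenset([leaf_states[tree]])
        return 0
    left, right = tree
    changes = fitch_sets(left, leaf_states, sets) + fitch_sets(right, leaf_states, sets)
    common = sets[left] & sets[right]
    if common:
        sets[tree] = common
    else:
        # Children disagree: keep both candidates and count one change.
        sets[tree] = sets[left] | sets[right]
        changes += 1
    return changes


def fitch_assign(tree: Tree, sets: Dict[Tree, FrozenSet[bool]],
                 parent_state, states: Dict[Tree, bool]) -> None:
    """Top-down pass: keep the parent's state when allowed, else pick from the set."""
    options = sets[tree]
    # Assumed tie-break: prefer False (domain absent) when the state is ambiguous.
    state = parent_state if parent_state in options else min(options)
    states[tree] = state
    if not isinstance(tree, str):
        for child in tree:
            fitch_assign(child, sets, state, states)


def find_gains(tree: Tree, states: Dict[Tree, bool],
               parent_state: bool, gains: List[Tree]) -> None:
    """Collect nodes where the domain is present but absent in the parent."""
    if states[tree] and not parent_state:
        gains.append(tree)
    if not isinstance(tree, str):
        for child in tree:
            find_gains(child, states, states[tree], gains)


def reconstruct(tree: Tree, leaf_states: Dict[str, bool]):
    sets: Dict[Tree, FrozenSet[bool]] = {}
    states: Dict[Tree, bool] = {}
    changes = fitch_sets(tree, leaf_states, sets)
    fitch_assign(tree, sets, None, states)
    gains: List[Tree] = []
    find_gains(tree, states, states[tree], gains)
    return changes, states, gains


# Toy data: the domain is present in g1, g2 and g5 only.
gene_tree = ((("g1", "g2"), "g3"), (("g4", "g5"), "g6"))
has_domain = {"g1": True, "g2": True, "g3": False,
              "g4": False, "g5": True, "g6": False}

changes, states, gains = reconstruct(gene_tree, has_domain)
print(changes, gains)
```

In this toy case the most parsimonious reconstruction places the domain as absent at the root and infers two independent gains, one on the branch leading to `("g1", "g2")` and one on the branch leading to `g5`. Repeated gains of this kind are what a study of convergent evolution would look for.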

List of scientific papers

I. Hollich V, Storm CE, Sonnhammer EL (2002). OrthoGUI: graphical presentation of Orthostrapper results. Bioinformatics. 18(9): 1272-3.
https://pubmed.ncbi.nlm.nih.gov/12217923

II. Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer EL, Studholme DJ, Yeats C, Eddy SR (2004). The Pfam protein families database. Nucleic Acids Res. 32 (Database issue): D138-41.
https://pubmed.ncbi.nlm.nih.gov/14681378

III. Finn RD, Mistry J, Schuster-Bockler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R, Eddy SR, Sonnhammer EL, Bateman A (2006). Pfam: clans, web tools and services. Nucleic Acids Res. 34 (Database issue): D247-51.
https://pubmed.ncbi.nlm.nih.gov/16381856

IV. Hollich V, Sonnhammer ELL (2006). PfamAlyzer: Domain-centric homology search. [Submitted]

V. Sonnhammer EL, Hollich V (2005). Scoredist: a simple and robust protein sequence distance estimator. BMC Bioinformatics. 6(1): 108.
https://pubmed.ncbi.nlm.nih.gov/15857510

VI. Hollich V, Milchert L, Arvestad L, Sonnhammer EL (2005). Assessment of protein distance measures and tree-building methods for phylogenetic tree reconstruction. Mol Biol Evol. 22(11): 2257-64. Epub 2005 Jul 27.
https://pubmed.ncbi.nlm.nih.gov/16049194

VII. Hollich V, Henricson A, Sonnhammer ELL (2006). Gene tree based analysis of domain architecture evolution. [Submitted]

History

Defence date

2006-06-12

Department

  • Department of Cell and Molecular Biology

Publication year

2006

Thesis type

  • Doctoral thesis

ISBN-10

91-7140-783-9

Number of supporting papers

7

Language

  • eng

Original publication date

2006-05-22

Author name in thesis

Hollich, Volker

Original department name

Department of Cell and Molecular Biology

Place of publication

Stockholm
