Karolinska Institutet

Genomics and bioinformatics approaches to functional gene annotation

thesis
posted on 2024-09-03, 04:58 authored by Danielle Kemmer

Biomedical research has been undergoing a quasi-revolution with the dawn of the genomics era. The flood of sequence data from the various genome projects, the task of cataloging the entire coding portion of a genome instead of identifying and characterizing individual genes, as well as technical demands accompanying these developments have posed great challenges to the research community. Although the entire human genome sequence has been virtually recorded, fundamental issues remain about the precise number of protein coding genes, as well as their functional characterization.

Available resources for the study of human gene function include large genome annotation pipelines, expression profiling data, and protein interaction screens. To gain biological insights from this maze of data, one must both find mechanisms to organize the information and assess the quality of the results.

This thesis focuses on the functional annotation of sparsely characterized human genes and their encoded proteins. The work includes four stages:

I. Gene expression profiling
II. Assessment of the level of characterization of human genes
III. Projection of protein networks from lower eukaryotes onto human
IV. Integration of computational and experimental results for data mining

Initially, a cross-platform comparison of gene expression profiling techniques was carried out to assess the performance of cutting-edge high-throughput methods and conventional approaches in terms of sensitivity, reliability, and throughput. In this study, we demonstrated that correlation between the different methods was poor and that multi-technique validation was therefore justified. Nonetheless, the strongest correlation with the new reference data in our report, i.e., a collection of traditional Northern blots, was observed for microarray-based technologies.
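As an illustration of what a cross-platform comparison against a reference set can look like, the following minimal sketch computes a Spearman rank correlation between a reference profile and each platform. It is not the analysis used in the thesis: the choice of Spearman correlation, the platform names, and all expression values are assumptions made for the example.

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical expression levels for five genes (illustrative values only).
reference = [1.0, 2.0, 3.0, 4.0, 5.0]          # stands in for the Northern blot set
platforms = {
    "platform_a": [1.1, 2.3, 2.9, 4.2, 5.5],   # same gene ordering as the reference
    "platform_b": [5.0, 1.0, 4.0, 2.0, 3.0],   # largely discordant ordering
}

scores = {name: spearman(reference, values) for name, values in platforms.items()}
for name, rho in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: rho = {rho:.2f}")
```

A rank-based measure is used here because different platforms report expression on different scales; other agreement measures would be equally defensible.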

The assessment of the level of functional characterization of human genes was addressed in the second study, where we developed a scoring system to quantify the annotation status of each human gene. We created a metric to predict the characterization status of human genes based on a set of predictors from the GeneLynx database (http://www.genelynx.org). This scoring function will not only assist the targeted analysis of groups of sparsely annotated genes and proteins, but will also prove useful in monitoring long-term gene annotation efforts and the overall annotation status of the human genome.

Comparative genomics efforts to transfer gene annotation from proteins in amenable model organisms onto human proteins are currently restricted by the limited availability of experimental data. Nonetheless, we demonstrated how protein networks could be effectively projected from lower eukaryotes onto human and how the confidence in these projections increased with redundantly detected protein interactions. This so-called Interolog Analysis offers promise for reliable inference of protein function. The bioinformatics system we created (Ulysses) provides a novel intuitive interface for biologists studying human proteins. As data depth and coverage increase over time, this system will prove valuable in the extended prediction of high-confidence functional associations for a large portion of human genes.
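The idea of projecting interactions through orthologs and trusting redundantly detected pairs more can be sketched as follows. This is a simplified illustration under stated assumptions, not the Ulysses implementation: the function name, the identifiers, the ortholog tables, and the use of the number of supporting species as a confidence proxy are all hypothetical.

```python
from collections import defaultdict


def project_interologs(interactions_by_species, orthologs_by_species):
    """Map model-organism interactions onto human protein pairs.

    Returns a dict from an unordered human pair (frozenset) to the set of
    species in which an orthologous interaction was observed.
    """
    support = defaultdict(set)
    for species, pairs in interactions_by_species.items():
        to_human = orthologs_by_species.get(species, {})
        for protein_a, protein_b in pairs:
            for human_a in to_human.get(protein_a, ()):
                for human_b in to_human.get(protein_b, ()):
                    # Assumption: self-interactions are skipped for simplicity.
                    if human_a != human_b:
                        support[frozenset((human_a, human_b))].add(species)
    return support


# Hypothetical interaction data and ortholog mappings (illustrative only).
interactions = {
    "yeast": [("Y1", "Y2"), ("Y2", "Y3")],
    "worm": [("W1", "W2")],
    "fly": [("F1", "F2")],
}
orthologs = {
    "yeast": {"Y1": ["H1"], "Y2": ["H2"], "Y3": ["H3"]},
    "worm": {"W1": ["H2"], "W2": ["H1"]},
    "fly": {"F1": ["H1"], "F2": ["H4"]},
}

support = project_interologs(interactions, orthologs)

# Rank projected human pairs by how many species support them.
ranked = sorted(support.items(), key=lambda item: -len(item[1]))
for pair, species in ranked:
    print(sorted(pair), "supported by", sorted(species))
```

In this toy data the pair H1/H2 is supported by both yeast and worm, so it ranks above pairs seen in a single organism, mirroring the observation that confidence grows with redundant detection.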

The fusion of experimental data and computational predictions is a central goal of functional genomics. We constructed a bioinformatics workbench for the study of uncharacterized human gene families. By assembling bioinformatics resources and experimental results in a common space, the NovelFam3000 system facilitates functional characterization. Working with a collection of uncharacterized genes, we demonstrated how bioinformatics methods can lead to novel inferences about cellular function of specific protein families.

This thesis unites the identification of uncharacterized human genes, the assessment of genomics data quality, and the application of high-throughput data for the inference of protein function.

List of scientific papers

I. Kemmer D, Faxen M, Hodges E, Lim J, Herzog E, Ljungstrom E, Lundmark A, Olsen MK, Podowski R, Sonnhammer ELL, Nilsson P, Reimers M, Lenhard B, Roberds SL, Wahlestedt C, Hoog C, Agarwal P, Wasserman WW (2004). Exploring the foundation of genomics: a northern blot reference set for the comparative analysis of expression profiling techniques. Comparative and Functional Genomics. 5: 584-95.
https://doi.org/10.1002/cfg.443

II. Podowski RM, Kemmer D, Brumm J, Wahlestedt C, Lenhard B, Wasserman WW (2006). Gene characterization index: a metric for accessing how well we understand our genes. [Manuscript]

III. Kemmer D, Huang Y, Shah SP, Lim J, Brumm J, Yuen MM, Ling J, Xu T, Wasserman WW, Ouellette BF (2005). Ulysses - an application for the projection of molecular interactions across species. Genome Biol. 6(12): R106.
https://doi.org/10.1186/gb-2005-6-12-r106

IV. Kemmer D, Podowski R, Lim J, Arenillas D, Hodges E, Roth P, Sonnhammer ELL, Hoog C, Wasserman WW (2005). NovelFam3000 - uncharacterized protein domains conserved across model organisms. [Submitted]

History

Defence date

2006-02-23

Department

  • Department of Cell and Molecular Biology

Publication year

2006

Thesis type

  • Doctoral thesis

ISBN-10

91-7140-636-0

Number of supporting papers

4

Language

  • eng

Original publication date

2006-02-02

Author name in thesis

Kemmer, Danielle

Original department name

Center for Genomics Research

Place of publication

Stockholm
