Karolinska Institutet

In silico prediction of cis-regulatory elements

thesis
posted on 2024-09-02, 16:12 authored by Albin Sandelin

As one of the most fundamental processes for all life forms, transcriptional regulation remains an intriguing and challenging subject for biomedical research. Experimental efforts towards understanding the regulation of genes are laborious and expensive, but can be substantially accelerated with the use of computational predictions. The growing number of fully sequenced metazoan genomes, in combination with the increasing use of high-throughput methods such as microarrays, has increased the necessity of combining computational methods with laboratory methods. Computational ("in silico") methods for the prediction of transcription factor binding sites are mature, yet critical problems remain unsolved. In particular, the rate of falsely predicted sites is unacceptably high with current methods, because the binding sites targeted by transcription factors are small and degenerate. In addition to raising the false prediction rate, this property limits the ability of pattern discovery algorithms to find mediating binding sites in promoters of co-expressed genes. The latter problem constitutes a bottleneck when analyzing regulatory sequences in complex eukaryotes, as regulatory sequences are generally spread over extended genomic regions.

This thesis describes the development of algorithms and resources for transcription factor binding site analysis, addressing two problems: site prediction, where a model describing the binding properties of a transcription factor is applied to a sequence to find functional binding sites; and pattern discovery, where over-represented patterns are sought in sets of promoters.
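Site prediction with a binding model can be illustrated with a minimal sketch. It assumes the model is a position weight matrix built from a count matrix, with a uniform background and a fixed pseudocount; the toy counts, the threshold, and the function names are hypothetical and are not taken from the thesis or from JASPAR. Only the forward strand is scanned.

```python
import math

# Hypothetical toy count matrix for a 4-bp site (one list entry per position).
# The values are invented for illustration only.
COUNTS = {
    "A": [8, 0, 0, 1],
    "C": [0, 0, 8, 1],
    "G": [0, 8, 0, 1],
    "T": [0, 0, 0, 5],
}


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert counts to log2 odds scores against a uniform background."""
    width = len(counts["A"])
    matrix = []
    for i in range(width):
        total = sum(counts[base][i] for base in "ACGT") + 4 * pseudocount
        column = {}
        for base in "ACGT":
            frequency = (counts[base][i] + pseudocount) / total
            column[base] = math.log2(frequency / background)
        matrix.append(column)
    return matrix


def scan(sequence, matrix, threshold):
    """Return (start, site, score) for every window scoring >= threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous symbols such as N
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


if __name__ == "__main__":
    pwm = log_odds_matrix(COUNTS)
    for start, site, score in scan("TTAGCTAA", pwm, threshold=4.0):
        print(start, site, round(score, 3))
```

Because such a model is short and tolerant of mismatches, lowering the threshold quickly increases the number of windows reported, which is the false prediction problem described above.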

Initially, an open-access database (JASPAR) was created, holding high-quality models for transcription factor binding sites. The database formed part of the foundation for the subsequent project (ConSite), where a set of methods was developed for utilizing cross-species comparison in binding site prediction (phylogenetic footprinting) to enhance predictive selectivity. In this study, we showed that ~85% of false predictions were removed when only promoter regions conserved between human and mouse were analyzed. The current statistical framework for modeling the binding properties of transcription factors is inadequate for some regulatory proteins, most notably the medically important nuclear hormone receptors. A Hidden Markov Model framework capable of both predicting and classifying nuclear hormone receptor response elements was developed.
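The idea behind phylogenetic footprinting can be sketched as a filter that keeps a predicted site only if it lies entirely inside a conserved stretch of a pairwise alignment. The sketch below is an illustration under stated assumptions, not the ConSite implementation: the sliding-window identity measure, the window size, the cutoff, and all function names are hypothetical choices.

```python
def conserved_columns(aln_a, aln_b, window=4, min_identity=0.75):
    """Flag alignment columns covered by at least one window whose
    fraction of identical, ungapped columns reaches min_identity."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    match = [a == b and a != "-" for a, b in zip(aln_a, aln_b)]
    conserved = [False] * len(match)
    for start in range(len(match) - window + 1):
        if sum(match[start:start + window]) / window >= min_identity:
            for col in range(start, start + window):
                conserved[col] = True
    return conserved


def seq_to_columns(aln_a):
    """Map each ungapped position of sequence A to its alignment column."""
    return [col for col, symbol in enumerate(aln_a) if symbol != "-"]


def filter_hits(hit_starts, aln_a, aln_b, site_width,
                window=4, min_identity=0.75):
    """Keep predicted sites (start positions in ungapped sequence A)
    whose every position falls in a conserved alignment column."""
    conserved = conserved_columns(aln_a, aln_b, window, min_identity)
    columns = seq_to_columns(aln_a)
    kept = []
    for start in hit_starts:
        positions = range(start, start + site_width)
        if all(conserved[columns[p]] for p in positions):
            kept.append(start)
    return kept


if __name__ == "__main__":
    # Toy alignment, invented for illustration: only the left part is conserved.
    human = "AGCTAAGGCCTT"
    mouse = "AGCTATCAGACA"
    print(filter_hits([0, 6], human, mouse, site_width=4))
```

In this toy alignment the predicted site at position 0 is retained and the one at position 6 is discarded, which mirrors how restricting analysis to conserved regions removes predictions in non-conserved sequence.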

In a case study, we showed that nuclear receptor genes have a high potential for cross- or auto-regulation, using the pufferfish genome as a predictive platform. Pattern discovery in promoters of multi-cellular eukaryotes is limited by the low strength of patterns buried in extended genomic sequence. Methods for improving both sensitivity and evaluation of resulting patterns were developed. We showed that comparison of newly found patterns to databases of experimentally verified profiles is a meaningful complement to other means of evaluating patterns. Furthermore, we showed that structural constraints shared by families of transcription factors can be integrated as prior expectations in pattern-finding algorithms for a significant increase in sensitivity.
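Comparing a newly found pattern with a database of verified profiles can be sketched as follows. This is a deliberately simple stand-in, not the comparison method used in the thesis: it assumes profiles are lists of per-position base frequencies, slides the query along each reference without gaps, and ranks references by the smallest mean Euclidean distance between columns. The profile names and helper functions are hypothetical.

```python
import math


def consensus_profile(consensus):
    """Hypothetical helper: build an idealized frequency profile
    that puts all weight on the consensus base at each position."""
    return [{b: 1.0 if b == base else 0.0 for b in "ACGT"}
            for base in consensus]


def column_distance(col_a, col_b):
    """Euclidean distance between two base-frequency columns."""
    return math.sqrt(sum((col_a[b] - col_b[b]) ** 2 for b in "ACGT"))


def best_ungapped_distance(query, reference, min_overlap=3):
    """Smallest mean column distance over all ungapped offsets that
    overlap by at least min_overlap columns (None if no such offset)."""
    best = None
    first = -(len(query) - min_overlap)
    last = len(reference) - min_overlap
    for offset in range(first, last + 1):
        pairs = [(query[i], reference[i + offset])
                 for i in range(len(query))
                 if 0 <= i + offset < len(reference)]
        if len(pairs) < min_overlap:
            continue
        mean = sum(column_distance(a, b) for a, b in pairs) / len(pairs)
        if best is None or mean < best:
            best = mean
    return best


def rank_matches(query, database):
    """Return (distance, name) pairs sorted from most to least similar."""
    scored = []
    for name, profile in database.items():
        distance = best_ungapped_distance(query, profile)
        if distance is not None:
            scored.append((distance, name))
    return sorted(scored)


if __name__ == "__main__":
    # Invented profiles; the names do not refer to real database entries.
    database = {
        "hypothetical_X": consensus_profile("TACGT"),
        "hypothetical_Y": consensus_profile("TTTT"),
    }
    for distance, name in rank_matches(consensus_profile("ACG"), database):
        print(name, round(distance, 3))
```

A low distance to a verified profile suggests which transcription factor family a discovered pattern may belong to, which is the kind of evidence the comparison is meant to add alongside other evaluation measures.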

List of scientific papers

I. Sandelin A, Alkema W, Engstrom P, Wasserman WW, Lenhard B (2004). JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res. 32: Database issue:D91-4.
https://pubmed.ncbi.nlm.nih.gov/14681366

II. Lenhard B, Sandelin A, Mendoza L, Engstrom P, Jareborg N, Wasserman WW (2003). Identification of conserved regulatory elements by comparative genome analysis. J Biol. 2(2): Epub 2003 May 22.
https://pubmed.ncbi.nlm.nih.gov/12760745

III. Sandelin A, Wasserman WW, Lenhard B (2004). ConSite: web-based prediction of regulatory elements using cross-species comparison. Nucleic Acids Res. [Accepted]

IV. Sandelin A, Wasserman WW (2004). Prediction of nuclear hormone receptor response elements. [Submitted]

V. Sandelin A, Hoglund A, Lenhard B, Wasserman WW (2003). Integrated analysis of yeast regulatory sequences for biologically linked clusters of genes. Funct Integr Genomics. 3(3): 125-34. Epub 2003 Jun 25.
https://pubmed.ncbi.nlm.nih.gov/12827523

VI. Sandelin A, Wasserman WW (2004). Constrained binding site diversity within families of transcription factors enhances pattern discovery bioinformatics. J Mol Biol. [Accepted]

History

Defence date

2004-05-10

Department

  • Department of Cell and Molecular Biology

Publication year

2004

Thesis type

  • Doctoral thesis

ISBN-10

91-7349-879-3

Number of supporting papers

6

Language

  • eng

Original publication date

2004-04-19

Author name in thesis

Sandelin, Albin

Original department name

Center for Genomics Research

Place of publication

Stockholm
