Karolinska Institutet

Solving repeat problems in shotgun sequencing

thesis
posted on 2024-09-02, 22:15 authored by Erik Arner

Shotgun sequencing is the most powerful strategy for large-scale sequencing. Two main approaches exist: clone-by-clone and whole genome shotgun (WGS). In the clone-by-clone strategy, overlapping clones are amplified and then sheared at random. In the WGS approach, a sufficient number of cells from the target organism is obtained, and the random shearing is performed on the extracted DNA.

In both approaches, the resulting fragments are cloned and the fragment ends are subsequently sequenced, producing sequence reads. If a sufficient amount of sequence has been obtained, the reads will overlap in a way that makes it possible to deduce their correct order. A number of computer programs have been developed for this task. However, none of these programs is capable of producing correct assemblies if the target sequence contains repeats. This is because assembly algorithms are in general greedy: when faced with several alternatives for the positioning of a read, the algorithm fits the read at the first available position that meets the criteria for inclusion in the assembly. In the resulting assemblies the repeat regions are typically collapsed into a few copies with abnormally high shotgun coverage. This occurs even when the repeat copies differ from each other, since the assembly programs are unable to distinguish the subtle differences between repeat elements from the errors introduced by the sequencing apparatus.
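
The collapse described above can be illustrated with a toy model. The following is a minimal sketch, not the behavior of any particular assembler: it assumes equal-length reads that are already aligned to each other, and the names `hamming`, `greedy_piles`, and `max_mismatches` are hypothetical. Each read joins the first pile whose first read it matches within the error tolerance.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def greedy_piles(reads, max_mismatches=1):
    """Toy greedy placement (illustrative assumption, not a real assembler).

    Each read is put into the first existing pile whose first read is
    within max_mismatches of it; otherwise it starts a new pile.
    """
    piles = []
    for read in reads:
        for pile in piles:
            if hamming(read, pile[0]) <= max_mismatches:
                pile.append(read)
                break
        else:
            # No pile accepted the read, so it founds a new one.
            piles.append([read])
    return piles


# Two repeat copies that differ at a single base (position 6).
copy_a = "ACGTACGTAC"
copy_b = "ACGTACTTAC"
reads = [copy_a] * 3 + [copy_b] * 3

collapsed = greedy_piles(reads, max_mismatches=1)
separated = greedy_piles(reads, max_mismatches=0)
```

With a tolerance of one mismatch, which a placement rule needs in order to absorb sequencing errors, all six reads end up in a single pile with doubled coverage. With zero tolerance the two copies separate, but real reads with errors would then fail to be placed at all.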

The work presented here is aimed at solving the repeat problem by detecting and utilizing single base differences between nearly identical repeats. In paper I, a statistical method for detecting repeat differences in the presence of sequencing errors was developed, implemented, and tested on simulated data. We showed that it is possible to obtain high specificity as well as sensitivity compared to other methods, by evaluating coinciding deviations from consensus in pairs of columns in multiple alignments.
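
The idea of looking for coinciding deviations from consensus in pairs of columns can be sketched as follows. This is an illustration only: the statistical test of paper I is not reproduced here, and the fixed count threshold `min_coincidences`, the function names, and the input format (a list of equal-length aligned read strings with `-` for gaps) are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def consensus(column):
    """Most common base in a column, ignoring gap characters."""
    counts = Counter(b for b in column if b != "-")
    return counts.most_common(1)[0][0] if counts else None


def coinciding_deviations(alignment, min_coincidences=2):
    """Return (col_a, col_b, reads) for column pairs in which at least
    min_coincidences reads deviate from consensus in both columns.

    The fixed threshold is a hypothetical stand-in for a proper
    statistical test.
    """
    n_cols = len(alignment[0])
    deviating = {}
    for col in range(n_cols):
        column = [read[col] for read in alignment]
        cons = consensus(column)
        rows = {i for i, b in enumerate(column) if b != "-" and b != cons}
        if rows:
            deviating[col] = rows

    pairs = []
    for a, b in combinations(sorted(deviating), 2):
        shared = deviating[a] & deviating[b]
        if len(shared) >= min_coincidences:
            pairs.append((a, b, sorted(shared)))
    return pairs


# Reads 0-3 come from one repeat copy, reads 4-5 from another that
# differs at columns 2 and 6. Read 0 also carries a lone error at column 4.
alignment = [
    "ACGTGCGT",
    "ACGTACGT",
    "ACGTACGT",
    "ACGTACGT",
    "ACCTACTT",
    "ACCTACTT",
]
pairs = coinciding_deviations(alignment)
```

In this example the isolated error in column 4 is not reported because no read deviates there together with another column, whereas columns 2 and 6 are flagged together since the same two reads deviate in both.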

In paper II, a finishing tool (DNPTrapper) that visualizes the differences and enables manual and semi-automatic resolution of repeat regions was constructed and tested with simulated data as well as real data from the Trypanosoma cruzi WGS project. The results showed that, using DNPTrapper, it is possible to resolve and analyze complicated repeat regions previously considered difficult or even impossible to resolve. Finally, in paper III, five repeated genes in T. cruzi were analyzed using DNPTrapper. Different repeat characteristics in the parasite were described, and it was shown that thorough analysis of repeat regions is required for correcting erroneous consensus sequences of repeated genes in the assembly.

List of scientific papers

I. Tammi MT, Arner E, Britton T, Andersson B (2002). Separation of nearly identical repeats in shotgun assemblies using defined nucleotide positions, DNPs. Bioinformatics. 18(3): 379-88.
https://pubmed.ncbi.nlm.nih.gov/11934736

II. Arner E, Tammi MT, Tran AN, Kindlund E, Andersson B (2006). DNPTrapper: an assembly editing tool for finishing and analysis of complex repeat regions. BMC Bioinformatics. 7: 155.
https://pubmed.ncbi.nlm.nih.gov/16549006

III. Arner E, Kindlund E, Nilsson D, Farzana F, Ferella M, Tammi MT, Andersson B (2006). Database of Trypanosoma cruzi repeated genes: 20 000 novel coding sequences. [Manuscript]

History

Defence date

2006-11-23

Department

  • Department of Cell and Molecular Biology

Publication year

2006

Thesis type

  • Doctoral thesis

ISBN-10

91-7140-996-3

Number of supporting papers

3

Language

  • eng

Original publication date

2006-11-02

Author name in thesis

Arner, Erik

Original department name

Department of Cell and Molecular Biology

Place of publication

Stockholm
