Solving repeat problems in shotgun sequencing
Author: Arner, Erik
Date: 2006-11-23
Location: Lecture hall at the Department of Microbiology, Tumor and Cell Biology (MTC), Theorells väg 1, Karolinska Institutet, Solna
Time: 10.00
Department: Department of Cell and Molecular Biology (CMB)
Abstract
Shotgun sequencing is the most powerful strategy for large-scale sequencing. Two main approaches exist: clone-by-clone and whole genome shotgun (WGS). In the clone-by-clone strategy, overlapping clones are amplified and then sheared at random. In the WGS approach, a sufficient number of cells from the target organism is obtained, and the random shearing is performed on the extracted DNA.
In both approaches, the resulting fragments are cloned and the fragment ends are subsequently sequenced, producing sequence reads. If enough sequence has been obtained, the reads will overlap in a way that makes it possible to deduce their correct order. A number of computer programs have been developed for this task. However, none of these programs is capable of producing correct assemblies if the target sequence contains repeats. This is because assembly algorithms are generally greedy: when faced with several alternatives for positioning a read, the algorithm fits the read at the first available position that meets the criteria for inclusion in the assembly. The resulting assemblies typically have degenerate repeat regions, truncated into a few copies with abnormally high shotgun coverage. This phenomenon occurs even when the repeat copies differ from each other, since the assembly programs are unable to distinguish the subtle differences between repeat elements from the errors produced by the sequencing apparatus.
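The greedy behaviour described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the algorithm of any particular assembler: the string `assembly` stands in for a partially built consensus, and the names `greedy_place` and `max_mismatches` are hypothetical. A read is accepted at the first offset where it fits within the mismatch tolerance, so a read from the second repeat copy ends up stacked on the first copy.

```python
def mismatches(a, b):
    """Count positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def greedy_place(read, assembly, max_mismatches=1):
    """Return the first offset where `read` fits, or None.

    Hypothetical inclusion criterion: at most `max_mismatches`
    differences, as if they were sequencing errors.
    """
    for offset in range(len(assembly) - len(read) + 1):
        window = assembly[offset:offset + len(read)]
        if mismatches(read, window) <= max_mismatches:
            return offset
    return None


copy_a = "ACGTACGTAC"
copy_b = "ACGTACTTAC"  # differs from copy_a at a single base
assembly = "TTTTT" + copy_a + "GGGGG" + copy_b + "CCCCC"

# One read from each repeat copy; copy_a starts at offset 5, copy_b at 20.
placements = [greedy_place(r, assembly) for r in (copy_a, copy_b)]
print(placements)  # both reads land on the first copy
```

In this toy case the first copy receives double coverage and the second receives none, which mirrors the collapsed, over-covered repeat regions described in the abstract.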
The work presented here aims to solve the repeat problem by detecting and using single-base differences between nearly identical repeats. In paper I, a statistical method for detecting repeat differences in the presence of sequencing errors was developed, implemented, and tested on simulated data. We showed that, by evaluating coinciding deviations from consensus in pairs of columns in multiple alignments, it is possible to obtain high specificity as well as sensitivity compared to other methods.
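The idea of coinciding deviations in pairs of columns can be sketched as follows. This is a simplified illustration, not the statistical method of paper I: it replaces the statistical test with a plain count threshold, and the names `coinciding_deviations` and `min_coincidences` are hypothetical. It assumes equal-length aligned reads, `-` for gaps, and at least one base in every column.

```python
from collections import Counter
from itertools import combinations


def consensus(column):
    """Most common base in a column, ignoring gaps ('-')."""
    counts = Counter(b for b in column if b != "-")
    return counts.most_common(1)[0][0]


def coinciding_deviations(alignment, min_coincidences=2):
    """Map column pairs to the reads that deviate from consensus in both.

    A pair is reported only if at least `min_coincidences` reads share
    the deviation (a stand-in for a proper statistical test).
    """
    n_cols = len(alignment[0])
    columns = [[read[i] for read in alignment] for i in range(n_cols)]
    cons = [consensus(col) for col in columns]
    # For each column, the row indices of reads that differ from consensus.
    deviating = [
        {r for r, b in enumerate(col) if b != "-" and b != cons[i]}
        for i, col in enumerate(columns)
    ]
    hits = {}
    for i, j in combinations(range(n_cols), 2):
        shared = deviating[i] & deviating[j]
        if len(shared) >= min_coincidences:
            hits[(i, j)] = sorted(shared)
    return hits


reads = [
    "ACGTACGT",  # repeat copy A
    "ACGTCCGT",  # copy A with an isolated error in column 4
    "ACGTACGT",  # copy A
    "ACGTACGT",  # copy A
    "ACTTACAT",  # repeat copy B, differs in columns 2 and 6
    "ACTTACAT",  # copy B
]
print(coinciding_deviations(reads))  # {(2, 6): [4, 5]}
```

The isolated error in column 4 is not reported because no other column shares it, whereas the two true repeat differences coincide in the same reads and are flagged together.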
In paper II, a finishing tool (DNPTrapper) that visualizes the differences and enables manual and semi-automatic resolution of repeat regions was constructed and tested with simulated data as well as real data from the Trypanosoma cruzi WGS project. The results showed that DNPTrapper makes it possible to resolve and analyze complicated repeat regions previously considered difficult or even impossible to resolve. Finally, in paper III, five repeated genes in T. cruzi were analyzed using DNPTrapper. Different repeat characteristics in the parasite were described, and it was shown that thorough analysis of repeat regions is required to correct erroneous consensus sequences of repeated genes in the assembly.
List of papers:
I. Tammi MT, Arner E, Britton T, Andersson B (2002). Separation of nearly identical repeats in shotgun assemblies using defined nucleotide positions, DNPs. Bioinformatics. 18(3): 379-88.
II. Arner E, Tammi MT, Tran AN, Kindlund E, Andersson B (2006). DNPTrapper: an assembly editing tool for finishing and analysis of complex repeat regions. BMC Bioinformatics. 7: 155.
III. Arner E, Kindlund E, Nilsson D, Farzana F, Ferella M, Tammi MT, Andersson B (2006). Database of Trypanosoma cruzi repeated genes: 20 000 novel coding sequences. [Manuscript]
Issue date: 2006-11-02
Publication year: 2006
ISBN: 91-7140-996-3