Methods and applications in DNA sequence alignments
DNA sequence alignment is one of the most common bioinformatics tasks. Alignment analysis of eukaryotic genomes is challenging because the datasets are large and because repeat sequences complicate the analysis. This thesis describes new methods that we have developed for DNA sequence alignment to address these problems. We have applied these methods in chicken and Trypanosoma cruzi genome analysis projects, and the thesis also describes the results of these projects.
Most alignment programs use a seed-and-extend method, where subsequences (seeds) are used to locate potential alignments, which are then verified. There is a tradeoff between sensitivity and specificity in the seeding process: short seeds are inefficient at eliminating spurious matches, while long seeds are more likely to miss true alignments in the presence of sequencing errors and polymorphisms. We developed an approximate seed matching algorithm that reduces the impact of this tradeoff by allowing mismatches within the seeds. Approximate seed matching allows the use of long seeds, which results in high specificity in the seeding and a faster alignment program. At the same time, sequencing errors and polymorphisms between the sequences do not reduce sensitivity.
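The idea of matching long seeds while tolerating a few mismatches can be illustrated with a small sketch. This is not the algorithm developed in the thesis; it is a generic pigeonhole filter, chosen only because it is short. It assumes substitutions only (no indels), and the function names `hamming`, `build_index` and `approximate_seed_hits` are hypothetical. A seed with at most d mismatches is cut into d + 1 pieces, so at least one piece must match exactly; exact piece hits are then verified against the full seed.

```python
from collections import defaultdict


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def build_index(reference, piece_len):
    """Map every substring of length piece_len to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - piece_len + 1):
        index[reference[i:i + piece_len]].append(i)
    return index


def approximate_seed_hits(reference, seed, max_mismatches):
    """Return sorted start positions where seed matches the reference
    with at most max_mismatches substitutions (illustrative sketch)."""
    pieces = max_mismatches + 1
    piece_len = len(seed) // pieces
    if piece_len == 0:
        raise ValueError("seed too short for this many mismatches")
    index = build_index(reference, piece_len)
    hits = set()
    for p in range(pieces):
        offset = p * piece_len
        piece = seed[offset:offset + piece_len]
        # An exact hit of one piece proposes a candidate seed position.
        for pos in index.get(piece, []):
            start = pos - offset
            if start < 0 or start + len(seed) > len(reference):
                continue
            # Verify the whole seed against the candidate window.
            window = reference[start:start + len(seed)]
            if hamming(window, seed) <= max_mismatches:
                hits.add(start)
    return sorted(hits)


reference = "GGACGTTAGCCATG"
seed = "ACGTAAGC"  # differs from reference[2:10] at one position
print(approximate_seed_hits(reference, seed, 1))
print(approximate_seed_hits(reference, seed, 0))
```

With one mismatch allowed, the eight-base seed is still located even though it differs from the reference at one position; with exact matching it is missed, which is the sensitivity loss that long exact seeds suffer.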
The chicken is both an important agricultural source of protein and a model organism in biological research. The genome sequencing of the wild ancestor of domestic chickens has offered an opportunity to study genetic factors involved in domestication. Sequences from three domestic chicken breeds were available for comparison to the genome sequence. We used these data to search for signs of selective sweeps between wild and domestic chickens by looking for regions with low diversity within domestic breeds. The results showed no evidence of large, domestic-specific sweeps. These findings indicate substantial sequence variation within chicken breeds.
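A search for low-diversity regions of the kind described above can be sketched as a window scan. The sketch below is only an illustration under stated assumptions: it uses SNP count per non-overlapping window as a crude proxy for diversity, and the function name `low_diversity_windows`, the window size and the threshold are all hypothetical rather than taken from the thesis.

```python
def low_diversity_windows(snp_positions, chrom_length, window, max_snps):
    """Return (start, end, snp_count) for each non-overlapping window
    holding at most max_snps SNPs. Positions are 0-based; end is exclusive."""
    n_windows = (chrom_length + window - 1) // window
    counts = [0] * n_windows
    for pos in snp_positions:
        if 0 <= pos < chrom_length:
            counts[pos // window] += 1
    return [
        (i * window, min((i + 1) * window, chrom_length), count)
        for i, count in enumerate(counts)
        if count <= max_snps
    ]


# Toy data: SNPs cluster at both ends, leaving the middle without variation.
snps = [5, 12, 18, 95, 99]
print(low_diversity_windows(snps, chrom_length=100, window=25, max_snps=0))
```

A real analysis would also have to account for sequencing coverage and for the expected SNP density, since a window can look invariant simply because few reads cover it.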
Copy number variation is emerging as an important source of genotypic and phenotypic variation in humans. We investigated the presence of such structural variation in the chicken genome through array comparative genomic hybridization of different chicken breeds. The results show extensive copy number variation, in some cases unique to domestic chickens.
Trypanosoma cruzi is a protozoan parasite which causes Chagas disease. It has interesting biological features, including a genome structure with many repeated genes. Genes are often repeated in tandem arrays, including surface antigen genes and housekeeping genes. The genome assembly shows numerous gaps and collapsed gene copies. We investigated the copy number of the annotated genes and found the gene content of T. cruzi to be even more repetitive than previously thought.
The genome analysis studies described in this thesis validated the DNA sequence alignment methods we have developed, and have provided important information for the chicken and T. cruzi research communities.
List of scientific papers
I. Tammi MT, Arner E, Kindlund E, Andersson B (2003). "Correcting errors in shotgun sequences." Nucleic Acids Res 31(15): 4663-72
https://pubmed.ncbi.nlm.nih.gov/12888528
II. Kindlund E, Tammi MT, Arner E, Nilsson D, Andersson B (2007). "GRAT-genome-scale rapid alignment tool." Comput Methods Programs Biomed Feb 8: Epub ahead of print
https://pubmed.ncbi.nlm.nih.gov/17292508
III. Wong GK, Liu B, Wang J, Zhang Y, Yang X, Zhang Z, Meng Q, Zhou J, Li D, Zhang J, Ni P, Li S, Ran L, Li H, Zhang J, Li R, Li S, Zheng H, Lin W, Li G, Wang X, Zhao W, Li J, Ye C, Kindlund E, International Chicken Polymorphism Map Consortium, et al. (2004). "A genetic variation map for chicken with 2.8 million single-nucleotide polymorphisms." Nature 432(7018): 717-22
https://pubmed.ncbi.nlm.nih.gov/15592405
IV. Kindlund E, Rubin CJ, Stromstedt L, Andersson B, Andersson L (2007). "Detection of copy number variation in the domestic chicken and its wild ancestor." (Manuscript)
V. Arner E, Kindlund E, Nilsson D, Farzana F, Ferella M, Tammi MT, Andersson B (2007). "Database of Trypanosoma cruzi repeated genes: 20 000 novel coding sequences." (Submitted)
History
Defence date
2007-03-23
Department
- Department of Cell and Molecular Biology
Publisher/Institution
Karolinska Institutet
Publication year
2007
Thesis type
- Doctoral thesis
ISBN
978-91-7357-143-2
Number of supporting papers
5
Language
- eng