Graduation Date

Summer 8-13-2021

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Programs

Biomedical Informatics

First Advisor

Chittibabu Guda

Second Advisor

Kusum Kharbanda

Third Advisor

Kenneth Bayles

Fourth Advisor

Sanjukta Bhowmick

Abstract

Microbes are ubiquitous in nature and play vital roles in processes such as metabolism in the human body, photosynthesis in plants, and decomposition of waste in the environment. Hence, it is essential to understand how the composition of microbial communities affects ecosystems in environments ranging from ocean floors to hot springs to the human body. Microbial communities present at different human body sites are of particular importance because of their implications for the cause and prevention of human diseases. Traditional approaches limit microbial research to species that can be successfully cultured in the lab. With the advent of next-generation sequencing (NGS) technologies, our ability to study the composition and function of microbial communities has increased rapidly, without the need to culture isolated species. More importantly, strain-level diversity is what uniquely identifies an individual's microbiome. In many cases, strain-level variation determines whether a microbe causes disease, resists antibacterial drugs, or is completely harmless. Hence, we must be able to identify microbes at the strain level to effectively design personalized treatment regimens for patients. Many tools have been developed to identify taxonomic composition from short-read sequencing data of metagenomic samples. They are alignment-based, long k-mer-based, or SNP/SNV-based, and they use generic databases of genomes containing all known microbial species. However, most of these methods were designed to predict higher-level taxa and hence are not suitable for strain-level prediction. These methods are also very sensitive to the quality of the reference genomes and to the uniformity of sequencing coverage, while a vast majority of publicly available microbial genomes are incomplete. Due to these limitations, existing methods do not perform well at identifying taxa at the strain level.

We developed a tool called StrainIQ (Strain Identification and Quantification) to identify and quantify microbial species at the strain level using whole-genome sequencing (WGS) data from metagenomic samples. StrainIQ takes advantage of the discriminative nature of unique and weighted common n-grams present in complete or draft assemblies of microbial genomes. Additionally, StrainIQ leverages body site-specific reference genome information to increase the specificity of the prediction. Comparison with popular existing tools shows that StrainIQ consistently predicts strains with higher sensitivity and specificity than other methods. Similarly, StrainIQ estimates abundance more accurately than other methods. We also developed a standalone version of the StrainIQ tool and made it publicly available on GitHub (https://github.com/sanpande/StrainIQ).
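
To make the idea of unique and weighted common n-grams concrete, the following is a minimal illustrative sketch and not the StrainIQ implementation. The toy genomes, the reads, the n-gram length of 4, and the inverse-frequency weighting (an n-gram shared by k strains contributes 1/k to each) are all assumptions made for illustration; the abstract does not specify the weighting scheme StrainIQ actually uses.

```python
from collections import Counter, defaultdict


def ngrams(seq, n):
    """Return the set of all length-n substrings of seq."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def build_index(genomes, n):
    """Map each n-gram to the set of strains whose genome contains it."""
    index = defaultdict(set)
    for strain, seq in genomes.items():
        for gram in ngrams(seq, n):
            index[gram].add(strain)
    return index


def score_reads(reads, index, n):
    """Accumulate per-strain evidence from the reads.

    Assumed weighting: an n-gram found in k strains adds 1/k to each of
    them, so n-grams unique to one strain carry full weight.
    """
    scores = Counter()
    for read in reads:
        for i in range(len(read) - n + 1):
            strains = index.get(read[i:i + n])
            if strains:
                for strain in strains:
                    scores[strain] += 1.0 / len(strains)
    return scores


def relative_abundance(scores):
    """Normalize scores so that they sum to 1 (empty dict if no evidence)."""
    total = sum(scores.values())
    return {s: v / total for s, v in scores.items()} if total else {}


# Hypothetical toy reference genomes that differ only near the 3' end.
genomes = {
    "strain_A": "ACGTACGTTTGA",
    "strain_B": "ACGTACGTCCGA",
}
# Hypothetical reads: two cover the region unique to strain_A, one is shared.
reads = ["CGTTTG", "ACGTAC", "GTTTGA"]

N = 4
index = build_index(genomes, N)
scores = score_reads(reads, index, N)
abundance = relative_abundance(scores)
print(abundance)
```

In this toy setting the reads covering the strain-specific region dominate the score, while the shared read is split evenly between both strains. A realistic pipeline would additionally need to handle reverse complements, sequencing errors, and genome-length normalization, none of which are modeled here.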
