Discovering the Unknown: Improving Detection of Novel Species and Genera from Short Reads
[Abstract] High-throughput sequencing technologies enable metagenome profiling, the simultaneous sequencing of multiple microbial species present within an environmental sample. Since metagenomic data includes sequence fragments (“reads”) from organisms that are absent from any database, new algorithms must be developed for the identification and annotation of novel sequence fragments. Homology-based techniques have been modified to detect novel species and genera, but composition-based methods have not been adapted. We develop a detection technique that can discriminate between “known” and “unknown” taxa, which can be used with composition-based methods as well as a hybrid method. Unlike previous studies, we rigorously evaluate all algorithms for their ability to detect novel taxa. First, we show that the integration of a detector with a composition-based method performs significantly better than homology-based methods for the detection of novel species and genera, with best performance at finer taxonomic resolutions. Most importantly, we evaluate all the algorithms by introducing an “unknown” class and show that the modified version of PhymmBL has similar or better overall classification performance than the other modified algorithms, especially at the species level and for ultrashort reads. Finally, we evaluate the performance of several algorithms on a real acid mine drainage dataset.
[Publication date] [Publishing institution]
[Validity level] [Subject classification] Basic medicine
[Keywords] [Timeliness]