Computational Strategies for Proteogenomics Analyses
[摘要] Proteogenomics is an area of proteomics concerning the detection of novel peptides and peptide variants nominated by genomics and transcriptomics experiments. While the term primarily refers to studies utilizing a customized protein database derived from select sequencing experiments, proteogenomics methods can also be applied in the quest for identifying previously unobserved, or missing, proteins in a reference protein database. The identification of novel peptides is difficult and results can be dominated by false positives if conventional computational and statistical approaches for shotgun proteomics are directly applied without consideration of the challenges involved in proteogenomics analyses. In this dissertation, I systematically distill the sources of false positives in peptide identification and present potential remedies, including computational strategies that are necessary to make these approaches feasible for large datasets.In the first part, I analyze high scoring decoys, which are false identifications with high assigned confidences, using multiple peptide identification strategies to understand how they are generated and develop strategies for reducing false positives. I also demonstrate that modified peptides can cause violations in the target-decoy assumptions, which is a cornerstone for error rate estimation in shotgun proteomics, leading to potential underestimation in the number of false positives. Second, I address computational bottlenecks in proteogenomics workflows through the development of two database search engines: EGADS and MSFragger. EGADS aims to address issues relating to the large sequence space involved in proteogenomics studies by using graphical processing units to accelerate both in-silico digestion and similarity scoring. MSFragger implements a novel fragment ion index and searching algorithm that vastly speeds up spectra similarity calculations. For the identification of modified peptides using the open search strategy, MSFragger is over 150X faster than conventional database search tools. Finally, I will discuss refinements to the open search strategy for detecting modified peptides and tools for improved collation and annotation. Using the speed afforded by MSFragger, I perform open searching on several large-scale proteomics experiments, identifying modified peptides on an unprecedented scale and demonstrating its utility in diverse proteomics applications.The ability to rapidly and comprehensively identify modified peptides allows for the reduction of false positives in proteogenomics. It also has implications in discovery proteomics by allowing for the detection of both common and rare (including novel) biological modifications that are often not considered in large scale proteomics experiments. The ability to account for all chemically modified peptides may also improve protein abundance estimates in quantitative proteomics.
[发布日期] [发布机构] University of Michigan
[效力级别] Statistics and Numeric Data [学科分类]
[关键词] proteogenomics;Statistics and Numeric Data;Science;Bioinformatics [时效性]