Data analytics, interpretation and machine learning for environmental forensics using peak mapping methods
[Abstract] The driving motivation of this work is to develop mathematically robust and computationally efficient algorithms that help chemists with pattern matching. Environmental chemistry today faces difficult computational and interpretational challenges posed by vast and ever-increasing data repositories. A driving factor behind these challenges is the set of little-known, intricate relationships among the analytes that constitute complex mixtures spanning a range of target and non-target compounds. While the end goals of different environmental applications are diverse, computationally speaking many data interpretation bottlenecks arise from a lack of efficient algorithms and robust mathematical frameworks to identify, cluster and interpret compound peaks. There is a compelling need for compound-cognizant quantitative interpretation that accounts for the full informational range of gas chromatographic (and mass spectrometric) datasets. Traditional target-oriented analyses focus only on the dominant compounds of the chemical mixture, and are thus agnostic of the contribution of unknown non-target analytes. At the other extreme, statistical methods prevalent in chemometric interpretation ignore compound identity altogether and consider only the multivariate data statistics, and are thus agnostic of intrinsic relationships between the well-known target analytes and the unknown non-target analytes. Thus, both schools of thought (target-based or statistical) in current-day chemical data analysis and interpretation fall short of quantifying the complex interaction between major and minor compound peaks in molecular mixtures commonly encountered in environmental toxin studies. Such insights cannot be revealed by these standard techniques; they require a deeper analysis of these patterns within a quantitative mathematical framework that is at once compound-cognizant and comprehensive in its coverage of all peaks, major and minor.
This thesis aims to meet this grand challenge using a combination of signal processing, pattern recognition and data engineering techniques. We focus on petroleum biomarker analysis and polychlorinated biphenyl (PCB) congener studies in human breast milk as our target applications. We propose a novel approach to chemical data analytics and interpretation that bridges the gap between the target-cognizant traditional analysis of environmental chemistry and the compound-agnostic computational methods of chemometric data engineering. Specifically, we propose computational methods for target-cognizant data analytics that also account for local unknown analytes allied to the established target peaks. The key intuition behind our methods is based on the underlying topography of the gas chromatographic landscape: we extend recent peak mapping methods and propose novel peak clustering and peak neighborhood allocation methods to achieve our data analytic aims. Data-driven results based
[Publication date] [Publishing institution] University of Iowa
[Validity level] [Subject classification]
[Keywords] [Timeliness]