Short tandem repeat (STR) profile authentication via machine learning techniques
[Abstract] Short tandem repeat (STR) DNA profiles have multiple uses in forensic analysis, kinship identification, and human biometrics. However, as biotechnology progresses, there is growing concern that STR profiles can be created using standard laboratory techniques such as whole genome amplification (WGA) and molecular cloning. Such technologies can be used to synthesize any STR profile without a physical sample, requiring only knowledge of the desired genetic sequence. Therefore, to preserve the credibility of DNA as a forensic tool, it is imperative to develop means to authenticate STR profiles. The leading technique in the field, methylation analysis, is accurate but expensive and time-consuming, and it degrades the forensic sample so that further analysis is not possible. Machine learning offers techniques to address the need for more effective STR profile authentication. In this work, a set of features was identified at both the channel and profile levels of STR electropherograms. A number of supervised and unsupervised machine learning algorithms were then used to predict whether a given STR electropherogram was authentic or synthesized by laboratory techniques. With the aid of the LNKnet machine learning toolkit, various classifiers were trained with the default set of parameters and the full set of features to quantify their baseline performance. Particular emphasis was placed on detecting profiles generated by WGA. A greedy forward-backward search algorithm was implemented to determine the most useful subset of features from the initial group. Though the optimal set of features varied by classifier, a trend was observed indicating that the inter-locus imbalance error, stutter count, and range of peak widths for a profile were particularly useful features. These were selected by over two-thirds of the classifiers. The signal-to-noise ratio was also a useful feature, selected by seven of the 16 classifiers.
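The greedy forward-backward search described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function name `forward_backward_select`, the feature names, and the toy error function with its numbers are all assumptions made up for the example. In practice the error function would train and evaluate a classifier on the candidate feature subset.

```python
from typing import Callable, FrozenSet, Iterable, Set, Tuple


def forward_backward_select(
    features: Iterable[str],
    error_fn: Callable[[FrozenSet[str]], float],
    max_rounds: int = 100,
) -> Tuple[Set[str], float]:
    """Greedy search: alternately add, then drop, one feature at a time,
    accepting a change only if it strictly lowers the error."""
    pool = sorted(features)
    selected: Set[str] = set()
    best = error_fn(frozenset(selected))
    for _ in range(max_rounds):
        improved = False
        # Forward step: try adding each unused feature, keep the best one.
        trials = [(error_fn(frozenset(selected | {f})), f)
                  for f in pool if f not in selected]
        if trials:
            err, feat = min(trials)
            if err < best:
                selected.add(feat)
                best, improved = err, True
        # Backward step: try dropping each selected feature, keep the best drop.
        trials = [(error_fn(frozenset(selected - {f})), f)
                  for f in sorted(selected)]
        if trials:
            err, feat = min(trials)
            if err < best:
                selected.remove(feat)
                best, improved = err, True
        if not improved:
            break  # neither step helped, so the search has converged
    return selected, best


# Hypothetical toy error function (invented numbers, not thesis results):
# each "useful" feature lowers the error, each other feature adds a penalty.
USEFUL = {"inter_locus_imbalance": 0.15, "stutter_count": 0.12,
          "peak_width_range": 0.10}


def toy_error(subset: FrozenSet[str]) -> float:
    gain = sum(USEFUL.get(f, 0.0) for f in subset)
    penalty = 0.01 * sum(1 for f in subset if f not in USEFUL)
    return 0.5 - gain + penalty


all_features = list(USEFUL) + ["irrelevant_a", "irrelevant_b"]
selected, best = forward_backward_select(all_features, toy_error)
```

Because the search is greedy, it can stop at a subset that is only locally optimal, which is one plausible reason the selected features would differ from classifier to classifier.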
The selected features were in turn used to tune the parameters of the machine learning algorithms and to compare their performance. From a set of 16 initial classifiers, the K-nearest neighbors, condensed K-nearest neighbors, multi-layer perceptron, Parzen window, and support vector machine classifiers achieved the best performance. These classification algorithms all attained error rates of approximately ten percent, where the error rate is defined as the percentage of profiles misclassified; the highest-performing classifier achieved an error rate of less than eight percent. Overall, the classifiers performed well at detecting artificial profiles but had more difficulty accurately distinguishing natural profiles. There were many false positives for the artificial class, since profiles in this category took on a greater range of feature values. Finally, preliminary steps were taken to form classifier committees. However, combining the top-performing classifiers via a majority vote did not significantly improve performance. The results of this work demonstrate the feasibility of a completely software-based approach to profile authentication. They confirm that machine learning techniques are a useful tool to trigger further investigation of profile authenticity via more expensive approaches.
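The error-rate definition and the majority-vote committee mentioned above can be illustrated with a short sketch. The function names and the label lists are hypothetical, invented only for the example, and are not data or results from the thesis. The sketch assumes two classes and an odd number of committee members, so a vote cannot tie.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """Combine per-classifier label lists into one committee label per profile."""
    committee = []
    for votes in zip(*predictions):  # votes cast for a single profile
        committee.append(Counter(votes).most_common(1)[0][0])
    return committee


def error_rate(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Percentage of profiles whose predicted label differs from the true label."""
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return 100.0 * wrong / len(actual)


# Invented labels for four profiles and three hypothetical classifiers.
actual = ["natural", "artificial", "artificial", "natural"]
clf_a = ["natural", "artificial", "artificial", "artificial"]
clf_b = ["artificial", "artificial", "artificial", "natural"]
clf_c = ["natural", "artificial", "natural", "artificial"]

committee = majority_vote([clf_a, clf_b, clf_c])
committee_error = error_rate(committee, actual)
member_errors = [error_rate(c, actual) for c in (clf_a, clf_b, clf_c)]
```

In this toy case the committee matches its best members without beating them. A vote can only correct a mistake when most members get that profile right, so classifiers that err on the same profiles gain little from being combined.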
[Publication date] [Publisher] Massachusetts Institute of Technology
[Validity level] [Subject classification]
[Keywords] [Timeliness]