Improving protein structure prediction using amino acid contact & distance prediction
[Abstract] As more and more protein sequences are generated, interpreting these data has become one of the most pressing tasks in bioinformatics. This thesis concerns the prediction of the 3D structure of a protein from its sequence alone, a long-standing problem in computational biology. A commonly adopted intermediate step is to predict pairwise amino acid contacts from the query sequence. Because current algorithms, which include statistical models and machine learning techniques, are relatively simple, the accuracy of contact prediction is still low for many proteins. These algorithms are also unable to predict amino acid distances (that is, separations longer than the contact range). The resulting shortage of accurate geometric constraints makes 3D structure prediction difficult for many proteins. To address these limitations in amino acid constraint and structure prediction, this thesis proposes DeepCDpred, a state-of-the-art amino acid contact and distance prediction algorithm based on deep neural networks. For a given query protein sequence, the geometric constraints predicted by DeepCDpred are fed into a Rosetta ab initio modelling protocol for protein structure prediction. In addition, a neural network-based method is proposed to evaluate the quality of predicted structures. The accuracy of the amino acid contact and distance predictions, the quality of the structure predictions and the accuracy of the confidence score predictions were evaluated on a test set of 108 protein chains with known experimental structures. No sequence in the test set shares more than 25% sequence identity with any sequence in the training set used to train DeepCDpred. The contact prediction accuracy of DeepCDpred is slightly lower than that of a recently published method, RaptorX, but higher than that of all other methods considered in this thesis. 
Thanks to the additional predicted distance constraints and the Rosetta ab initio modelling protocol, the quality of structure prediction achieved with the algorithms proposed in this study is better than that of the RaptorX server. A blind test on a protein that had yet to be released was also used to validate the effectiveness of DeepCDpred. The protein classes of structures predicted with amino acid contact constraints from MetaPSICOV (the contact predictor with which DeepCDpred is most often compared in this thesis) are analysed and compared with predictions based on contact constraints from DeepCDpred, and with predictions based on both contact and distance constraints from DeepCDpred. An online server, http://proteincoevolution.bham.ac.uk, has been developed and released to make the proposed methods for amino acid contact and distance prediction, structure prediction and structure confidence prediction accessible to general users, and it is expected to benefit the research community.
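The abstract does not state how contact prediction accuracy is computed. As a rough illustration only, the sketch below uses a convention that is common in the field: two residues count as being in contact when their representative atoms lie within a distance cutoff, and accuracy is the fraction of the top-ranked predicted pairs that are true contacts. The 8 Å cutoff, the minimum sequence separation, the function names and the toy coordinates are all assumptions made for this example, not details taken from the thesis.

```python
import math

def true_contacts(coords, threshold=8.0, min_sep=3):
    """Return residue pairs (i, j), i < j, closer than `threshold`.

    `coords` holds one representative 3D point per residue. The cutoff
    and the minimum sequence separation are assumed values.
    """
    contacts = set()
    n = len(coords)
    for i in range(n):
        for j in range(i + min_sep, n):
            if math.dist(coords[i], coords[j]) < threshold:
                contacts.add((i, j))
    return contacts

def top_k_precision(scores, contacts, k):
    """Fraction of the k highest-scoring predicted pairs that are true contacts."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    if not ranked:
        return 0.0
    return sum(pair in contacts for pair in ranked) / len(ranked)

# Toy hairpin-like chain of 8 residues (hypothetical coordinates, in angstroms).
coords = [
    (0, 0, 0), (4, 0, 0), (8, 0, 0), (12, 0, 0),
    (12, 5, 0), (8, 5, 0), (4, 5, 0), (0, 5, 0),
]
contacts = true_contacts(coords)

# Hypothetical predictor output: pair -> predicted contact probability.
scores = {(0, 7): 0.9, (1, 6): 0.8, (3, 7): 0.7, (2, 5): 0.6, (0, 3): 0.1}
precision = top_k_precision(scores, contacts, k=4)
```

In this toy case three of the four top-ranked pairs are true contacts. Published benchmarks usually tie k to the sequence length and report results separately for different sequence separation ranges; those choices are omitted here for brevity.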
[Publication date] [Publisher] University: University of Birmingham; Department: School of Biosciences
[Validity level] [Subject classification]
[Keywords] Q Science; QH Natural history; QH301 Biology [Timeliness]