SMARTS Approach to Chemical Data Mining and Physicochemical Property Prediction.
[Abstract] The calculation of physicochemical and biological properties is essential to modern drug discovery. Chemical spaces whose dimensions are defined by these descriptors have been used for scaffold hopping to discover new lead and drug-like molecules. Broadening the boundaries of structure-based drug design, these molecules are expected to share the same physiological target and have similar efficacy as known drug molecules occupying the same region of chemical property space.
In the past few decades, physicochemical and ADMET (absorption, distribution, metabolism, elimination, and toxicity) property predictors have been the subject of increased focus in academia and the pharmaceutical industry. Given the ever-increasing attention paid to data mining and property prediction, we first discuss the sources of experimental pKa values and the current methodologies used for pKa prediction in proteins and small molecules. Of particular concern are the scope, statistical validity, overall accuracy, and predictive power of these methods. These concerns are not limited to pKa prediction; they apply to all empirical predictive methodologies.
In a bottom-up approach, we explored the influence of freely generated SMARTS string representations of molecular fragments on chelation and cytotoxicity. Later investigations, involving the derivation of predictive models, use stepwise regression to determine the optimal pool of SMARTS strings having the greatest influence on the property of interest. By applying a unique scoring system to sets of highly generalized SMARTS strings, we have constructed well-balanced regression trees with predictive accuracy exceeding that of many published and commercially available models for cytotoxicity, pKa, and aqueous solubility. The methodology is robust, extremely adaptable, and can handle any molecular dataset with experimental data.
This story details our struggles with data gathering and curation, and the development of a machine learning methodology able to derive and validate highly accurate regression trees capable of extremely fast property predictions. Regression trees created by our method are well suited to calculating descriptors for large in silico molecular libraries, facilitating data mining of chemical spaces in search of new lead molecules in drug discovery.
[Publication date] [Publishing institution] University of Michigan
[Validity level] Chemoinformatics [Subject classification]
[Keywords] Cheminformatics;Chemoinformatics;Chemical Data Mining;Physicochemical Property Prediction;Chemical Spaces;Drug Discovery;Computer Science;Pharmacy and Pharmacology;Chemistry;Science (General);Engineering;Health Sciences;Science;Medicinal Chemistry [Timeliness]