CuMiDa: An Extensively Curated Microarray Database for Benchmarking and Testing of Machine Learning Approaches in Cancer Research
[摘要] The employment of machine learning (ML) approaches to extract gene expression information from microarray studies has increased in the past years, specially on cancer-related works. However, despite this continuous interest in applying ML in cancer biomedical research, there are no curated repositories focused only on providing quality data sets exclusively for benchmarking and testing of such techniques for cancer research. Thus, in this work, we present the Curated Microarray Database (CuMiDa), a database composed of 78 handpicked microarray data sets for Homo sapiens that were carefully examined from more than 30,000 microarray experiments from the Gene Expression Omnibus using a rigorous filtering criteria. All data sets were individually submitted to background correction, normalization, sample quality analysis and were manually edited to eliminate erroneous probes. All data sets were tested using principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) analyses to observe sample division and were additionally tested using various ML approaches to provide a base accuracy for the major techniques employed for microarray data sets. CuMiDa is a database created solely for benchmarking and testing of ML approaches applied to cancer research.
[发布日期] [发布机构]
[效力级别] [学科分类] 生物科学(综合)
[关键词] benchmarking;cancer;classification;curation;machine learning;microarray;supervised learning;unsupervised learning [时效性]