The synthetic student : a machine learning model to simulate MOOC data
[Abstract] It's now possible to take all of your favorite courses online. With growing popularity, Massive Open Online Courses (MOOCs) offer a learning opportunity to anyone with a computer, as well as an opportunity for researchers to investigate student learning through the accumulation of data about student-course interactions. Unfortunately, efforts to mine student data for information are currently limited by privacy concerns over how the data can be distributed. In this thesis, we present a generative model that learns from student data at the click-by-click level. When fully trained, this model is able to generate synthetic student data at the click-by-click level that can be released to the public. To develop a model at such granularity, we had to learn problem submission tendencies, characterize time spent viewing webpages and problem submission grades, and analyze how student activity transitions from week to week. We further developed a novel multi-level time-series model that goes beyond the classic Markov model and HMM methods used by most state-of-the-art ML methods for weblogs, and showed that our model performs better than these methods. After training our model on a 6.002x course on edX, we generated synthetic data and found that a classifier that predicts student dropout is 93% as effective (by AUC) when trained on the simulated data as when trained on the real data. Lastly, we found that using features learned by our model improves dropout prediction performance by 9.5%.
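For orientation, the classic first-order Markov baseline that the abstract compares against can be sketched as follows. This is only an illustrative sketch of that baseline, not the thesis's multi-level time-series model; the event names (`video`, `problem`, `forum`), the `START`/`END` padding tokens, and the helper names `fit_markov` and `sample_sequence` are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def fit_markov(sequences):
    """Estimate first-order transition probabilities from event sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Pad with hypothetical START/END tokens so that session
        # boundaries are learned along with event-to-event transitions.
        padded = ["START"] + list(seq) + ["END"]
        for prev, nxt in zip(padded, padded[1:]):
            counts[prev][nxt] += 1
    # Normalize each row of counts into a probability distribution.
    return {
        prev: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
        for prev, nxts in counts.items()
    }


def sample_sequence(model, rng, max_len=50):
    """Generate one synthetic event sequence by walking the chain."""
    state, out = "START", []
    while len(out) < max_len:
        choices = model[state]
        state = rng.choices(list(choices), weights=list(choices.values()))[0]
        if state == "END":
            break
        out.append(state)
    return out


# Toy click streams standing in for real student logs (made-up data).
real = [
    ["video", "problem", "problem"],
    ["video", "video", "forum"],
    ["problem", "video", "problem"],
]
model = fit_markov(real)
synthetic = [sample_sequence(model, random.Random(i)) for i in range(3)]
```

A chain like this only remembers the previous event, which is the limitation the abstract says the multi-level model is meant to go beyond.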
[Publication date] [Publishing institution] Massachusetts Institute of Technology
[Validity level] [Subject classification]
[Keywords] [Timeliness]