Regularised iterative multiple correspondence analysis in multiple imputation
[Abstract] Non-response is a prevalent problem in survey data, and various techniques for handling missing data have been studied and published. The application of a regularised iterative multiple correspondence analysis (RIMCA) algorithm in single imputation (SI) has been suggested for handling missing data in survey analysis. Multiple correspondence analysis (MCA) is appropriate as an imputation procedure for survey data because it is concerned with the relationships among the variables in the data; missing data can therefore be imputed by exploiting the relationship between observed and missing data. The RIMCA algorithm expresses MCA as a weighted principal component analysis (PCA) of a data triplet consisting of a weighted data matrix, a metric and a diagonal matrix containing row masses, respectively. Performing PCA on a triplet involves the generalised singular value decomposition of the weighted data matrix. Standard singular value decomposition (SVD) does not suffice here, since the weighting imposes constraints on the rows and columns. The success of this algorithm lies in the fact that all eigenvalues are shrunk and the last components are omitted; thus a 'double shrinkage' occurs, which reduces variance and stabilises predictions. RIMCA seems to overcome overfitting and underfitting problems with regard to categorical missing data in surveys. Applying the RIMCA algorithm in multiple imputation (MI) was appealing, since MI has advantages over SI, such as more accurate estimates and valid inferences when the multiple imputed datasets are combined. The aim of this study was to establish the performance of RIMCA in MI. This was pursued through two objectives: to determine whether RIMCA in MI outperforms RIMCA in SI, and to determine the accuracy of predictions made from RIMCA in MI as an imputation model. Real and simulated data were used.
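The generalised SVD and the 'double shrinkage' described above can be illustrated with a minimal Python sketch. It is not the implementation used in the study. It assumes that the metric and the row masses are both diagonal and are passed as vectors, and that the shrinkage takes the form commonly used in regularised iterative PCA, in which the noise variance is estimated as the mean of the discarded eigenvalues. The function names `gsvd_triplet` and `shrunk_reconstruction` are hypothetical.

```python
import numpy as np

def gsvd_triplet(X, col_metric, row_masses):
    """Generalised SVD of the triplet (X, M, D) for diagonal M and D.

    Returns U, s, V such that X = U diag(s) V', U' D U = I and V' M V = I.
    col_metric and row_masses hold the diagonals of M and D (assumption).
    """
    d_sqrt = np.sqrt(row_masses)
    m_sqrt = np.sqrt(col_metric)
    # Absorb the weights, then apply an ordinary SVD to the rescaled matrix.
    Y = d_sqrt[:, None] * X * m_sqrt[None, :]
    P, s, Qt = np.linalg.svd(Y, full_matrices=False)
    # Undo the rescaling so the vectors are orthonormal in the D and M metrics.
    U = P / d_sqrt[:, None]
    V = Qt.T / m_sqrt[:, None]
    return U, s, V

def shrunk_reconstruction(X, col_metric, row_masses, n_components):
    """Rank-limited reconstruction of X with shrunk singular values.

    Double shrinkage: trailing components are dropped, and each retained
    eigenvalue is reduced by an estimated noise variance (assumed form).
    Retained eigenvalues are assumed to be strictly positive.
    """
    U, s, V = gsvd_triplet(X, col_metric, row_masses)
    eig = s ** 2
    # Noise variance: mean of the discarded eigenvalues (zero if none are discarded).
    noise = eig[n_components:].mean() if n_components < len(eig) else 0.0
    kept = eig[:n_components]
    shrunk = np.clip(kept - noise, 0.0, None) / np.sqrt(kept)
    return (U[:, :n_components] * shrunk) @ V[:, :n_components].T
```

In an iterative imputation scheme, a reconstruction of this kind would replace the current values of the missing cells at each step until the imputed values stabilise; that loop is omitted here.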
A simulation protocol was followed in which data were drawn from multivariate Normal distributions with both high and low correlation structures. Varying percentages of missing values and two missingness mechanisms, missing completely at random (MCAR) and missing at random (MAR), were created in the data, as is done by Josse et al. (2012). The first objective was achieved by applying RIMCA in both SI and MI to real and simulated data. The performance of RIMCA in SI and MI was compared with regard to the obtained mean estimates and confidence intervals. For the real data, the estimates were compared to the mean estimates of the incomplete data, whereas for the simulated data the estimates obtained from the imputation procedures could be compared to the true mean values and confidence intervals. The second objective was achieved by calculating the apparent error rates of predictions made by the RIMCA algorithm in SI and MI in simulated datasets. Along with the apparent error rates, approximate overall success rates were calculated in order to establish the accuracy of the imputations made by SI and MI. The results of this study show that the confidence intervals provided by MI were wider in most cases, which confirmed the incorporation of additional variance. For some of the variables, the confidence intervals from the SI procedures were statistically different from the true confidence intervals, which shows that SI was not suitable for imputation in these instances. Overall, the mean estimates provided by MI were closer to the true values for both the simulated and the real data. A summary of the bias, mean square errors and coverage of the imputation techniques over a thousand simulations was provided, which also confirmed that RIMCA in MI was a better model than RIMCA in SI in the contexts provided by this research.
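The wider confidence intervals under MI follow from how estimates from several imputed datasets are usually combined. The sketch below uses the standard pooling rules attributed to Rubin, which the abstract does not name explicitly, so their use here is an assumption. The function name `pool_rubin` and the input values in the usage note are hypothetical and are not results from the study.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool one estimate per imputed dataset using Rubin's rules (assumed).

    estimates: point estimates, one per imputed dataset.
    variances: the corresponding squared standard errors.
    Returns (pooled estimate, within variance, between variance, total variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    pooled = mean(estimates)
    within = mean(variances)        # average sampling variance
    between = variance(estimates)   # sample variance across imputations
    # The between-imputation term is the extra variance that SI ignores.
    total = within + (1 + 1 / m) * between
    return pooled, within, between, total
```

For example, `pool_rubin([2.0, 2.2, 1.8], [0.04, 0.05, 0.03])` gives a total variance larger than the average within-imputation variance of 0.04, which is why intervals built from the pooled variance are wider than those from a single imputed dataset.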
[Publication date] [Publishing institution] University of the Free State
[Validity level] [Subject classification]
[Keywords] [Timeliness]