A Study on Outlier Detection and Its Effects: Using the South Korean Residential Condition Survey
[Abstract] This study analyzes outliers, which significantly affect data reliability in the current Big Data era. The main dataset is the Housing Condition Survey (HCS) conducted by the Ministry of Land, Infrastructure and Transport of Korea. We focused on respondents' residential satisfaction, detected outliers, and empirically analyzed the impact factors. We first applied the Mahalanobis distance method, a common approach to outlier detection, but found it to be unreliable. As an alternative, we used the Naive Bayes classification method to detect outliers and to verify the impact factors. This choice rests on the fact that the high correlation between respondents' demographic characteristics and their residential satisfaction is a critical element in Naive Bayes classification. The findings are as follows. First, about 2,400 samples (12% of the total) in the 2014 Housing Condition Survey were detected as outliers. Second, respondents commonly showed a tendency toward positive over-estimation on the residential satisfaction questions in the HCS. Third, to reduce the occurrence of outliers in the HCS, it is necessary to lessen the stress on respondents by avoiding long questions in table form.
[Publication date] [Publishing institution] Department of Architecture, Faculty of Architecture, Konkuk University, 120 Neungdong-ro, Gwangjin-gu, Seoul 143-701, Republic of Korea
[Validity level] Radio electronics [Subject classification] Computer science (general)
[Keywords] Condition surveys; Critical elements; Demographic characteristics; Empirical analysis; Infrastructure and transport; Mahalanobis distance method; Naive Bayes classification; Residential satisfaction [Timeliness]