Categorical Feature Encoding Techniques for Improved Classifier Performance when Dealing with Imbalanced Data of Fraudulent Transactions
[Abstract] Fraudulent transaction data tend to have several high-cardinality categorical features. Preprocessing becomes complicated when the categories in such features have no natural order or meaningful mapping to numerical values. Although many encoding techniques exist, their impact on massive, highly imbalanced datasets has not been thoroughly evaluated. Our study uses two transaction datasets in which frauds make up less than 1% of records. Six encoding methods were employed, each belonging to either the target-agnostic or the target-based group. The experiments involved several machine-learning techniques, including ensemble learning and both linear and non-linear learning approaches. Our study emphasizes the importance of carefully selecting an encoding approach suited to both the imbalanced dataset and the machine-learning algorithm. Target-based encoding techniques can enhance model performance significantly. Among the encoding methods assessed, the James-Stein and Weight of Evidence (WOE) encoders were the most effective, whereas the CatBoost encoder may not be optimal for imbalanced datasets. Moreover, it is crucial to bear in mind the curse of dimensionality when employing encoding techniques such as hashing and One-Hot encoding.
[Subject classification] Automation Engineering
[Keywords] imbalanced data; classifier; feature encoding; high-cardinality; fraud detection
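
To make the target-based group mentioned in the abstract concrete, the following is a minimal sketch of a Weight of Evidence encoder in plain Python. It is not the authors' implementation: the function names `fit_woe` and `transform_woe`, the additive `smoothing` term, the fallback value of 0.0 for unseen categories, and the toy merchant data are all assumptions made for illustration. Each category is mapped to the log ratio of its share among fraudulent records to its share among legitimate ones.

```python
import math
from collections import Counter


def fit_woe(categories, labels, smoothing=0.5):
    """Learn a {category: WOE} map from binary labels (1 = fraud, 0 = legitimate).

    The smoothing term (an assumption of this sketch) keeps the ratio finite
    when a category never appears in one of the two classes.
    """
    pos, neg = Counter(), Counter()
    for category, label in zip(categories, labels):
        if label == 1:
            pos[category] += 1
        else:
            neg[category] += 1

    total_pos = sum(pos.values())
    total_neg = sum(neg.values())

    woe = {}
    for category in set(pos) | set(neg):
        # Smoothed share of this category within each class.
        share_pos = (pos[category] + smoothing) / (total_pos + 2 * smoothing)
        share_neg = (neg[category] + smoothing) / (total_neg + 2 * smoothing)
        woe[category] = math.log(share_pos / share_neg)
    return woe


def transform_woe(categories, woe, default=0.0):
    """Replace each category by its learned WOE; unseen categories get `default`."""
    return [woe.get(category, default) for category in categories]


# Hypothetical toy data: merchant IDs with a small fraction of frauds.
merchants = ["A"] * 6 + ["B"] * 3 + ["C"] * 1
is_fraud = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

woe_map = fit_woe(merchants, is_fraud)
encoded = transform_woe(merchants + ["D"], woe_map)  # "D" was never seen in training
```

Unlike One-Hot encoding, this produces a single numeric column regardless of cardinality. Because the mapping is computed from the labels, it should be fitted on training data only and then applied to validation and test data, otherwise the target leaks into the features.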