Preparing lessons: Improve knowledge distillation with better supervision

[Abstract] Knowledge distillation (KD) is widely applied in the training of efficient neural networks. A compact model trained to mimic the representation of a cumbersome model for the same task generally performs better than one trained only with the ground truth labels. Previous KD-based works mainly focus on two aspects: (1) designing various feature representations for knowledge transfer; (2) introducing different training mechanisms such as progressive learning or adversarial learning. In this paper, we revisit the standard KD and observe that training with the teacher's logits might suffer from incorrect and uncertain supervision. To tackle these problems, we propose two novel approaches, Logits Adjustment (LA) and Dynamic Temperature Distillation (DTD), which deal with incorrect logits and uncertain logits respectively. Specifically, LA rectifies the incorrect logits according to the ground truth label and certain rules, while DTD treats the temperature of KD as a dynamic, sample-wise parameter rather than a static, global hyper-parameter, so that it reflects the uncertainty of each sample's logits. By iteratively updating the sample-wise temperature, the student model can pay more attention to the samples that confuse the teacher model. Experiments on CIFAR-10/100, CINIC-10 and Tiny ImageNet verify that the proposed methods yield encouraging improvements over the standard KD. Furthermore, given their simple implementations, LA and DTD can be easily attached to many KD-based frameworks and bring improvements without extra cost in training time or computing resources. (c) 2021 Published by Elsevier B.V.
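The two ideas in the abstract can be illustrated with a minimal sketch. The abstract does not specify the rules LA applies or how DTD updates the temperature, so everything here is an assumption for illustration, not the authors' implementation: `adjust_logits` uses one plausible rule (swapping the teacher's wrongly predicted top logit with the ground-truth logit), and `batch_kd_loss` simply accepts a hypothetical per-sample list of temperatures in place of a single global value.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Log of the temperature-scaled softmax for one sample's logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_sum for s in scaled]


def adjust_logits(teacher_logits, label):
    """Assumed rule (illustrative only): if the teacher's top class is not
    the ground-truth class, swap those two logits so the label ranks first."""
    adjusted = list(teacher_logits)
    top = max(range(len(adjusted)), key=adjusted.__getitem__)
    if top != label:
        adjusted[top], adjusted[label] = adjusted[label], adjusted[top]
    return adjusted


def kd_loss(student_logits, teacher_logits, temperature):
    """Standard KD term: T^2 * KL(teacher || student) on softened outputs."""
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return temperature ** 2 * kl


def batch_kd_loss(student_batch, teacher_batch, labels, temperatures):
    """Mean KD loss with adjusted teacher logits and one temperature per
    sample (how the temperatures are chosen is left open here)."""
    losses = [
        kd_loss(s, adjust_logits(t, y), temp)
        for s, t, y, temp in zip(student_batch, teacher_batch, labels, temperatures)
    ]
    return sum(losses) / len(losses)
```

With a single shared value in `temperatures` and no incorrect teacher predictions, `batch_kd_loss` reduces to the standard KD soft-target term, which is the baseline the abstract compares against.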

[Publication date] 2021-09-24


[Keywords] Knowledge distillation; Label regularization; Hard example mining
