Towards Language-Guided Visual Recognition via Dynamic Convolutions
[Abstract] In this paper, we aim to establish a unified, end-to-end multi-modal network by exploring language-guided visual recognition. To this end, we first propose a novel multi-modal convolution module called Language-guided Dynamic Convolution (LaConv). Its convolution kernels are dynamically generated from natural language information, which helps extract differentiated visual features for different multi-modal examples. Based on the LaConv module, we further build a fully language-driven convolution network, termed LaConvNet, which unifies visual recognition and multi-modal reasoning in one forward structure. To validate LaConv and LaConvNet, we conduct extensive experiments on seven benchmark datasets covering three vision-and-language tasks, i.e., visual question answering, referring expression comprehension and referring expression segmentation. The experimental results not only show that LaConvNet performs competitively with or better than existing multi-modal networks, but also demonstrate its merits as a unified structure, including a compact network, low computational cost and high generalization ability. Our source code is released in the SimREC project.
[Validity level] Early Access
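The abstract states that LaConv generates its convolution kernels from natural language information, so that the same image yields different visual features under different text inputs. The following is a minimal sketch of that general idea only, not the authors' implementation: it assumes a 1x1 kernel, a single linear map from a sentence embedding to the kernel weights, and plain nested lists in place of tensors. All names (`linear`, `generate_kernel`, `dynamic_conv1x1`, `gen_weight`) are hypothetical and chosen for illustration.

```python
def linear(vec, weight):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def generate_kernel(lang_vec, gen_weight, c_out, c_in):
    """Map a sentence embedding to a (c_out x c_in) 1x1 convolution kernel.

    gen_weight is an assumed learnable matrix with c_out * c_in rows and
    len(lang_vec) columns; the paper's actual generator may differ.
    """
    flat = linear(lang_vec, gen_weight)  # length c_out * c_in
    return [flat[o * c_in:(o + 1) * c_in] for o in range(c_out)]


def dynamic_conv1x1(feature_map, kernel):
    """Apply a 1x1 convolution to an H x W x c_in feature map."""
    return [[linear(pixel, kernel) for pixel in row] for row in feature_map]


# Toy example: one pixel with two channels, two different "sentences".
gen_weight = [[1.0, 0.0], [0.0, 1.0]]  # assumed generator weights
image = [[[3.0, 5.0]]]                 # 1 x 1 x 2 feature map
kernel_a = generate_kernel([1.0, 0.0], gen_weight, c_out=1, c_in=2)
kernel_b = generate_kernel([0.0, 1.0], gen_weight, c_out=1, c_in=2)
features_a = dynamic_conv1x1(image, kernel_a)
features_b = dynamic_conv1x1(image, kernel_b)
```

In this toy setting the two sentence embeddings produce different kernels, so the same input pixel is mapped to different output features. A static convolution, by contrast, would apply one fixed kernel regardless of the text.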