Grounding natural language phrases in images and video
[Abstract] Grounding language in images has been shown to improve performance on many image-language tasks. To spur research on this topic, this dissertation introduces a new dataset that provides ground-truth annotations for the locations of the noun phrase chunks in image captions. I begin by introducing a constituent task termed phrase localization, where the goal is to localize an entity known to exist in an image when provided with a natural language query. To address this task, I introduce a model that learns a set of models, each of which captures a different concept useful for the task. These concepts can be predefined, such as attributes gleaned from adjectives, or learned automatically in a single end-to-end neural network. I also address the more challenging detection-style task, where the goal is both to localize a phrase and to determine whether it is associated with an image. Multiple applications of the models presented in this work demonstrate their value beyond the phrase localization task.
[Keywords] Computer Vision, Natural Language Processing, Phrase Grounding