Machine Generated Deep Image Captioning with Style
Date
2020
Publisher
UMT, Lahore
Abstract
A powerful tool for perceiving the physical world is sight. The study of computer
vision aims to provide sight to artificial agents, enabling them to understand complex
visual scenes. As a core topic in artificial intelligence and machine learning, it has
been the focus of extensive research but is far from solved: humans still
outperform artificial vision systems in most tasks. Communication between
humans is primarily through language. Designing an agent that can communicate via
language is an important goal for human-agent interaction and for building agents that
can learn from the vast repositories of human knowledge. With these aims in mind,
natural language processing is a core topic in artificial intelligence and machine learning.
Like computer vision, natural language processing has been the focus of extensive
research, but remains an open problem. This thesis seeks to connect two core topics in
machine intelligence: vision and language. Several topics exist at this intersection;
this research focuses on automatic image captioning: generating natural
language descriptions of image content. Automatic captioning involves both the
image understanding problem from computer vision and the natural language
generation problem from natural language processing. To improve communication,
the researcher endeavours to add an extra layer to automatic captioning in the form of
linguistic style. Stylistic variations in language have a range of useful applications,
such as reaching a broad audience, reducing misinformation, and engaging viewers.
With these applications in mind, the research develops and evaluates novel methods
capable of generating stylised captions for natural images.
Previous research into image caption generation has focused on generating purely
descriptive captions; in this research the focus is on generating visually relevant
captions with a distinct linguistic style. Captions with style have the potential to ease
communication and add a new layer of personalisation. First, the researcher considers
naming variations in image captions and proposes a method for predicting
context-dependent names that takes into account visual and linguistic information. This
method makes use of a large-scale image caption dataset, which the researcher also
uses to explore and report naming conventions for hundreds of
people. Next, the researcher proposes the SentiCap model, which relies on recent
advances in artificial neural networks to generate visually relevant image captions
with positive or negative sentiment. To balance descriptiveness and sentiment, the
SentiCap model dynamically switches between two recurrent neural networks, one
tuned for descriptive words and one for sentiment words. As the first published model
for generating captions with sentiment, SentiCap has influenced a number of
subsequent works. The researcher then investigates the sub-task of modelling styled
sentences without images. The specific task chosen is sentence simplification:
rewriting news article sentences to make them easier to understand. For this task, the
researcher designs a neural sequence-to-sequence model that can work with limited
training data, using novel adaptations for word copying and for sharing word
embeddings. Finally, the researcher presents SemStyle, a system for generating
visually relevant image captions in the style of an arbitrary text corpus. A shared term
space allows a neural network for vision and content planning to communicate with a
network for styled language generation. SemStyle achieves competitive results in
human and automatic evaluations of descriptiveness and style.
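The word-level switching idea behind SentiCap can be illustrated with a toy sketch. Everything below is invented for illustration: the bigram tables, the fixed gate heuristic, and the fallback rule are stand-ins, whereas the thesis model uses two recurrent neural networks with a learned switching probability at each word.

```python
# Toy sketch of SentiCap-style word-level switching: at each step a
# gate decides whether the next word comes from a descriptive model
# or a sentiment model. Bigram tables and the gate are invented
# stand-ins for the two recurrent networks and learned switch.

DESCRIPTIVE = {  # factual continuations
    "<s>": ["a"], "a": ["dog"], "dog": ["on"], "on": ["the"],
    "the": ["beach"], "beach": ["</s>"],
}
SENTIMENT = {  # continuations that inject positive sentiment
    "a": ["happy"], "happy": ["dog"],
    "the": ["sunny"], "sunny": ["beach"],
}

def gate(word):
    """Switch probability after `word`. Learned in SentiCap;
    here a fixed heuristic: switch after determiners."""
    return 0.9 if word in ("a", "the") else 0.0

def generate(with_sentiment=True, max_len=12):
    words, w = [], "<s>"
    while w != "</s>" and len(words) < max_len:
        use_sent = with_sentiment and gate(w) > 0.5 and w in SENTIMENT
        table = SENTIMENT if use_sent else DESCRIPTIVE
        # fall back to the other model so generation can continue
        nxt = table.get(w) or SENTIMENT.get(w) or DESCRIPTIVE.get(w) or ["</s>"]
        w = nxt[0]
        if w != "</s>":
            words.append(w)
    return " ".join(words)

print(generate(with_sentiment=False))  # a dog on the beach
print(generate(with_sentiment=True))   # a happy dog on the sunny beach
```

The fallback lookup mirrors the balance the thesis describes: sentiment words are woven into an otherwise descriptive caption rather than replacing it.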
As a whole, this thesis presents two complete systems for styled caption generation
that are the first of their kind and demonstrate, for the first time, that automatic style
transfer for image captions is achievable. Contributions also include novel ideas for
object naming and sentence simplification. This thesis opens up inquiries into highly
personalised image captions; large scale visually grounded concept naming; and more
generally, styled text generation with content control.
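The shared-term-space idea behind SemStyle, described above, can be sketched as a two-stage pipeline. The term list and output templates below are invented placeholders; in the thesis both stages are neural networks that communicate through a shared space of semantic terms.

```python
# Toy sketch of SemStyle's two-stage design: a content stage maps
# image input to style-neutral semantic terms, and a language stage
# realises those terms in a target style. Terms and templates here
# are hypothetical stand-ins for the two neural networks.

def content_stage(image_tags):
    """Stand-in for the vision/content-planning network: keep only
    tags that fall in the shared term space, in canonical order."""
    shared_terms = ["dog", "run", "beach"]  # hypothetical term space
    return [t for t in shared_terms if t in image_tags]

def language_stage(terms, style):
    """Stand-in for the styled language generator: realise the same
    terms differently depending on the target corpus style."""
    if style == "descriptive":
        return "a {} {}s on the {}.".format(*terms)
    if style == "story":
        return "the {0} {1}s along the {2}, free at last.".format(*terms)
    raise ValueError("unknown style: " + style)

terms = content_stage({"dog", "run", "beach", "sand"})
print(language_stage(terms, "descriptive"))  # a dog runs on the beach.
print(language_stage(terms, "story"))        # the dog runs along the beach, free at last.
```

Decoupling the stages this way is what lets the style be swapped without retraining the content model: only the language stage needs to change.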