Machine Generated Deep Image Captioning with Style
Date
2020
Publisher
UMT, Lahore
Abstract
A powerful tool for perceiving the physical world is sight. The study of computer
vision aims to provide sight to artificial agents, enabling them to understand complex
visual scenes. As a core topic in artificial intelligence and machine learning, it has
been the focus of extensive research but is far from solved: humans still
outperform artificial vision systems in most tasks. Communication between
humans is primarily through language. Designing an agent that can communicate via
language is an important goal for human-agent interaction and for building agents that
can learn from the vast repositories of human knowledge. With these aims in mind,
natural language processing is a core topic in artificial intelligence and machine learning.
Like computer vision, natural language processing has been the focus of extensive
research, but remains an open problem. This thesis seeks to connect two core topics in
machine intelligence: vision and language. Several topics exist at this intersection;
this research focuses on automatic image captioning: generating natural
language descriptions of image content. Automatic captioning involves both the
image understanding problem from computer vision and the natural language
generation problem from natural language processing. To improve communication,
the researcher endeavours to add an extra layer to automatic captioning in the form of
linguistic style. Stylistic variations in language have a range of useful applications,
such as reaching a broad audience, reducing misinformation, and engaging viewers.
With these applications in mind, the research develops and evaluates novel methods
capable of generating stylised captions for natural images.
Previous research into image caption generation has focused on generating purely
descriptive captions; in this research the focus is on generating visually relevant
captions with a distinct linguistic style. Captions with style have the potential to ease
communication and add a new layer of personalisation. First, the researcher considers
naming variations in image captions and proposes a method for predicting
context-dependent names that takes into account visual and linguistic information. This
method makes use of a large-scale image caption dataset, which the researcher also
uses to explore and report naming conventions for hundreds of
people. Next, the researcher proposes the SentiCap model, which relies on recent
advances in artificial neural networks to generate visually relevant image captions
with positive or negative sentiment. To balance descriptiveness and sentiment, the
SentiCap model dynamically switches between two recurrent neural networks, one
tuned for descriptive words and one for sentiment words. As the first published model
for generating captions with sentiment, SentiCap has influenced a number of
subsequent works. The researcher then investigates the sub-task of modelling styled
sentences without images. The specific task chosen is sentence simplification:
rewriting news article sentences to make them easier to understand. For this task, the
researcher designs a neural sequence-to-sequence model that can work with limited
training data, using novel adaptations for word copying and for sharing word
embeddings. Finally, the researcher presents SemStyle, a system for generating
visually relevant image captions in the style of an arbitrary text corpus. A shared term
space allows a neural network for vision and content planning to communicate with a
network for styled language generation. SemStyle achieves competitive results in
human and automatic evaluations of descriptiveness and style.
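The word-level switching idea behind SentiCap can be illustrated with a toy sketch. Everything below is invented for illustration: the bigram tables, the fixed gate heuristic, and the fallback rule are stand-ins, whereas the thesis model uses two recurrent neural networks with a learned switching probability at each word.

```python
# Toy sketch of SentiCap-style word-level switching: at each step a
# gate decides whether the next word comes from a descriptive model
# or a sentiment model. Bigram tables and the gate are invented
# stand-ins for the two recurrent networks and learned switch.

DESCRIPTIVE = {  # factual continuations
    "<s>": ["a"], "a": ["dog"], "dog": ["on"], "on": ["the"],
    "the": ["beach"], "beach": ["</s>"],
}
SENTIMENT = {  # continuations that inject positive sentiment
    "a": ["happy"], "happy": ["dog"],
    "the": ["sunny"], "sunny": ["beach"],
}

def gate(word):
    """Switch probability after `word`. Learned in SentiCap;
    here a fixed heuristic: switch after determiners."""
    return 0.9 if word in ("a", "the") else 0.0

def generate(with_sentiment=True, max_len=12):
    words, w = [], "<s>"
    while w != "</s>" and len(words) < max_len:
        use_sent = with_sentiment and gate(w) > 0.5 and w in SENTIMENT
        table = SENTIMENT if use_sent else DESCRIPTIVE
        # fall back to the other model so generation can continue
        nxt = table.get(w) or SENTIMENT.get(w) or DESCRIPTIVE.get(w) or ["</s>"]
        w = nxt[0]
        if w != "</s>":
            words.append(w)
    return " ".join(words)

print(generate(with_sentiment=False))  # a dog on the beach
print(generate(with_sentiment=True))   # a happy dog on the sunny beach
```

The fallback lookup mirrors the balance the thesis describes: sentiment words are woven into an otherwise descriptive caption rather than replacing it.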
As a whole, this thesis presents two complete systems for styled caption generation
that are the first of their kind and demonstrate, for the first time, that automatic style
transfer for image captions is achievable. Contributions also include novel ideas for
object naming and sentence simplification. This thesis opens up inquiries into highly
personalised image captions; large scale visually grounded concept naming; and more
generally, styled text generation with content control.
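The shared-term-space idea behind SemStyle, described above, can be sketched as a two-stage pipeline. The term list and output templates below are invented placeholders; in the thesis both stages are neural networks that communicate through a shared space of semantic terms.

```python
# Toy sketch of SemStyle's two-stage design: a content stage maps
# image input to style-neutral semantic terms, and a language stage
# realises those terms in a target style. Terms and templates here
# are hypothetical stand-ins for the two neural networks.

def content_stage(image_tags):
    """Stand-in for the vision/content-planning network: keep only
    tags that fall in the shared term space, in canonical order."""
    shared_terms = ["dog", "run", "beach"]  # hypothetical term space
    return [t for t in shared_terms if t in image_tags]

def language_stage(terms, style):
    """Stand-in for the styled language generator: realise the same
    terms differently depending on the target corpus style."""
    if style == "descriptive":
        return "a {} {}s on the {}.".format(*terms)
    if style == "story":
        return "the {0} {1}s along the {2}, free at last.".format(*terms)
    raise ValueError("unknown style: " + style)

terms = content_stage({"dog", "run", "beach", "sand"})
print(language_stage(terms, "descriptive"))  # a dog runs on the beach.
print(language_stage(terms, "story"))        # the dog runs along the beach, free at last.
```

Decoupling the stages this way is what lets the style be swapped without retraining the content model: only the language stage needs to change.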