Affective Image Captioning: Extraction and Semantic Arrangement of Image Information with Deep Neural Networks

In recent years, the Internet has become a major channel for the exchange of visual information. Popular social platforms have reported an average of 80 million photo uploads a day. These images are often accompanied by a short, user-provided piece of text called an image caption. Deep Learning techniques have made significant advances towards the automatic generation of factual image captions. However, captions written by humans are much more than mere factual image descriptions. This work takes a step towards enhancing a machine's ability to generate image captions with human-like properties. We name this field Affective Image Captioning to differentiate it from other areas of research focused on generating factual descriptions. To deepen our understanding of human-generated captions, we first perform a large-scale crowd-sourcing study on a subset of the Yahoo Flickr Creative Commons 100 Million dataset (YFCC100M). Three thousand random image-caption pairs were evaluated by native English speakers with respect to dimensions such as focus, intent, emotion, meaning, and visibility. Our findings indicate three important underlying properties of human captions: subjectivity, sentiment, and variability. Based on these results, we develop Deep Learning models to address each of these dimensions. For the subjectivity dimension, we propose the Focus-Aspect-Value (FAV) model, along with the new task of aspect detection, to structure the process of capturing subjectivity, and we introduce a novel dataset, aspects-DB, built following this model. To implement the model, we propose a novel architecture called Tensor Fusion. Our experiments show that Tensor Fusion outperforms the state-of-the-art cross-residual networks (XResNet) on aspect detection. For the sentiment dimension, we propose two models: the Concept & Syntax Transition Network (CAST) and Show & Tell with Emotions (STEM). CAST uses a graphical structure to generate captions with sentiment, while STEM uses a neural network to inject adjectives into a neutral caption. Achieving a score of 93% in human evaluation, these models were ranked among the top three at the ACMMM Grand Challenge 2016. For the last dimension, variability, we take a generative approach based on Generative Adversarial Networks (GANs) combined with multimodal fusion. Our modified GAN, with two discriminators, is trained using Reinforcement Learning. We also show that the properties of the generated caption variations can be controlled with an external signal; using sentiment as this signal, we outperform state-of-the-art sentiment captioning models.
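For illustration only, the sketch below shows one common way a tensor-style fusion of two modalities can be realized: an image embedding and an aspect/context embedding are combined via their outer product before classification into aspect values. The module name, dimensions, and classifier head are assumptions made for this sketch and do not reproduce the dissertation's actual Tensor Fusion architecture.

    # Minimal, hypothetical sketch of tensor-style multimodal fusion for
    # aspect detection. All names and sizes are illustrative assumptions.
    import torch
    import torch.nn as nn

    class TensorFusionSketch(nn.Module):
        def __init__(self, img_dim=512, ctx_dim=64, num_aspects=10):
            super().__init__()
            # Map the flattened outer product of the two modality vectors
            # to one logit per aspect value.
            self.classifier = nn.Linear(img_dim * ctx_dim, num_aspects)

        def forward(self, img_feat, ctx_feat):
            # img_feat: (batch, img_dim) image embedding, e.g. from a CNN backbone
            # ctx_feat: (batch, ctx_dim) embedding of the focused aspect/context
            fused = torch.einsum('bi,bj->bij', img_feat, ctx_feat)  # outer product
            return self.classifier(fused.flatten(start_dim=1))      # aspect logits

    # Usage with random tensors standing in for real features:
    model = TensorFusionSketch()
    img = torch.randn(4, 512)
    ctx = torch.randn(4, 64)
    print(model(img, ctx).shape)  # torch.Size([4, 10])

The outer product lets every image feature interact multiplicatively with every context feature, which is one rationale for tensor-based fusion over simple concatenation; whether the dissertation's Tensor Fusion follows this exact scheme is not stated in the abstract.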



Metadata
Author: Tushar Karayil
URN: urn:nbn:de:hbz:386-kluedo-60233
Subtitle: Generating Human-Like Image Descriptions with Deep Learning Methods
Advisor: Andreas Dengel
Document type: Dissertation
Language of publication: English
Date of publication (online): 22.07.2020
Year of first publication: 2020
Publishing institution: Technische Universität Kaiserslautern
Degree-granting institution: Technische Universität Kaiserslautern
Date of acceptance of the thesis: 10.06.2020
Date of publication (server): 22.07.2020
Free keywords / tags: Automatic Image Captioning; Deep Learning; Natural Language Processing; Neural Networks
Number of pages: XV, 138
Departments / organisational units: Kaiserslautern - Fachbereich Informatik
DDC subject groups: 0 General works, computer science, information science / 004 Computer science
License: Creative Commons 4.0 - Attribution, NonCommercial, NoDerivatives (CC BY-NC-ND 4.0)