Audio Analysis for Multimodal Learning

Multimodal learning has received much attention in recent years, and many advances in multimedia analysis and search have been achieved. However, most of this work has focused on visual content. The audio signal, arguably the second most important modality, also carries useful information that can improve the overall performance of machine learning models. This has indeed been the case for a variety of applications, e.g., action recognition and scene classification. This work is dedicated to the challenges of environmental audio-visual analysis and to the design of cross-modal fusion methods that solve tasks such as classification and retrieval in a truly multimodal fashion.

The thesis presents a comprehensive overview of the deep learning landscape in audio-visual learning, encompassing essential tasks such as representation learning, correspondence learning, audio-visual object localization, sound source separation, and audio-visual generation. Additionally, specific audio-centric learning objectives, including audio-visual multimodal classification, regular and zero-shot environmental sound classification, and multimodal retrieval, are explored in depth to provide a holistic understanding of the current state of the art in deep learning for audio-visual analysis. Building on this foundation, the thesis examines techniques and methods tailored to these tasks, leveraging insights gained from the current research landscape. The research objective is addressed by investigating the effects of applying visual-domain models to audio-specific tasks, the effectiveness of audio-specific architectural enhancements in deep learning models, and the shift from transfer learning to audio-visual contrastive learning. The experimental results demonstrate significant improvements in performance and robustness achieved through these approaches. In conclusion, this doctoral thesis contributes to the field of audio analysis in multimodal learning by uncovering the untapped potential of audio signals and by highlighting future research directions that can further advance the field. By bridging the gap between audio and visual modalities, it opens up new avenues for more effective and comprehensive multimodal learning systems.
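To make the last of these directions concrete: audio-visual contrastive learning, as commonly practiced, aligns paired audio and visual inputs in a shared embedding space. The following PyTorch sketch of a symmetric InfoNCE-style contrastive loss illustrates one way such an objective can be set up; it is not the thesis's exact formulation, and the function name, temperature value, and [batch, dim] embedding shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def audio_visual_contrastive_loss(audio_emb: torch.Tensor,
                                  visual_emb: torch.Tensor,
                                  temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE-style loss over a batch of paired
    audio/visual embeddings of shape [batch, dim] (illustrative sketch)."""
    # L2-normalize both modalities so dot products become cosine similarities.
    audio_emb = F.normalize(audio_emb, dim=-1)
    visual_emb = F.normalize(visual_emb, dim=-1)

    # Pairwise similarity matrix: entry (i, j) compares audio i with visual j.
    logits = audio_emb @ visual_emb.t() / temperature

    # Matching audio-visual pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: audio-to-visual and visual-to-audio.
    loss_a2v = F.cross_entropy(logits, targets)
    loss_v2a = F.cross_entropy(logits.t(), targets)
    return (loss_a2v + loss_v2a) / 2
```

Once the two modalities are aligned in such a space, zero-shot environmental sound classification reduces to nearest-neighbor matching between an audio embedding and the embeddings of the candidate class labels.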
Metadata
Author: Andrey Guzhov
URN: urn:nbn:de:hbz:386-kluedo-88041
DOI: https://doi.org/10.26204/KLUEDO/8804
Advisor: Andreas Dengel
Document Type: Doctoral Thesis
Cumulative document: No
Language of publication: English
Date of Publication (online): 2025/03/06
Year of first Publication: 2025
Publishing Institution: Rheinland-Pfälzische Technische Universität Kaiserslautern-Landau
Granting Institution: Rheinland-Pfälzische Technische Universität Kaiserslautern-Landau
Acceptance Date of the Thesis: 2024/09/20
Date of Publication (Server): 2025/03/10
Page Number: X, 131
Faculties / Organisational entities: Kaiserslautern - Fachbereich Informatik
CCS-Classification (computer science): I. Computing Methodologies / I.2 ARTIFICIAL INTELLIGENCE
DDC-Classification: 0 General works, computer science, information science / 004 Computer science
Licence (German): Creative Commons 4.0 - Attribution (CC BY 4.0)