Audio Analysis for Multimodal Learning
- Multimodal learning has received much attention in recent years, and many advances in multimedia analysis and search have been achieved. However, most of this work has focused on the visual content. The audio signal, arguably the second most important modality, also carries a great deal of useful information that can improve the overall performance of machine learning models. This has indeed been the case for a variety of applications, e.g., action recognition and scene classification. This work is dedicated to the challenging aspects of environmental audio-visual analysis and to the design of cross-modal fusion methods that solve tasks such as classification and retrieval in a truly multimodal fashion. The thesis presents a comprehensive overview of the deep learning landscape in audio-visual learning, encompassing essential tasks such as representation learning, correspondence learning, audio-visual object localization, sound source separation, and audio-visual generation. Additionally, specific audio-centric learning objectives, including audio-visual multimodal classification, regular and zero-shot environmental sound classification, and multimodal retrieval, are explored in depth to provide a holistic understanding of the current state of the art in deep learning for audio-visual analysis. Building on this foundation, the thesis delves into techniques and methods tailored to these tasks, leveraging insights from the current research landscape. The research objective is addressed by investigating the effects of using visual-domain models for audio-specific tasks, the effectiveness of audio-specific architectural enhancements in deep learning models, and the shift from transfer learning to audio-visual contrastive learning. The experimental results demonstrate the significant improvements in performance and robustness achieved through these approaches. In conclusion, this doctoral thesis contributes to the field of audio analysis in multimodal learning by uncovering the untapped potential of audio signals and highlighting future research directions that can further advance the field. By bridging the gap between audio and visual modalities, this thesis opens up new avenues for more effective and comprehensive multimodal learning systems.
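To make the mentioned shift from transfer learning to audio-visual contrastive learning concrete, below is a minimal sketch of a symmetric InfoNCE-style contrastive loss between paired audio and visual embeddings. All names, dimensions, and the temperature value are illustrative assumptions for this sketch, not the implementation used in the thesis.

```python
# Minimal sketch of a symmetric audio-visual contrastive (InfoNCE-style) loss.
# Encoder choices, embedding dimension, and temperature are assumptions made
# for illustration only.
import torch
import torch.nn.functional as F

def audio_visual_contrastive_loss(audio_emb: torch.Tensor,
                                  visual_emb: torch.Tensor,
                                  temperature: float = 0.07) -> torch.Tensor:
    """Symmetric cross-entropy over cosine similarities of paired embeddings.

    audio_emb, visual_emb: (batch, dim) outputs of modality-specific encoders;
    row i of each tensor is assumed to come from the same audio-visual clip.
    """
    # L2-normalize so the dot product equals cosine similarity.
    audio_emb = F.normalize(audio_emb, dim=-1)
    visual_emb = F.normalize(visual_emb, dim=-1)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = audio_emb @ visual_emb.t() / temperature

    # Matching audio-visual pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average of audio-to-visual and visual-to-audio retrieval losses.
    loss_a2v = F.cross_entropy(logits, targets)
    loss_v2a = F.cross_entropy(logits.t(), targets)
    return (loss_a2v + loss_v2a) / 2
```

Training such an objective pulls embeddings of co-occurring audio and video together while pushing apart mismatched pairs in the batch, which is what enables the zero-shot classification and cross-modal retrieval settings the abstract refers to.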
Author: | Andrey Guzhov |
---|---|
URN: | urn:nbn:de:hbz:386-kluedo-88041 |
DOI: | https://doi.org/10.26204/KLUEDO/8804 |
Advisor: | Andreas Dengel |
Document Type: | Doctoral Thesis |
Cumulative document: | No |
Language of publication: | English |
Date of Publication (online): | 2025/03/06 |
Year of first Publication: | 2025 |
Publishing Institution: | Rheinland-Pfälzische Technische Universität Kaiserslautern-Landau |
Granting Institution: | Rheinland-Pfälzische Technische Universität Kaiserslautern-Landau |
Acceptance Date of the Thesis: | 2024/09/20 |
Date of Publication (Server): | 2025/03/10 |
Page Number: | X, 131 |
Faculties / Organisational entities: | Kaiserslautern - Department of Computer Science |
CCS-Classification (computer science): | I. Computing Methodologies / I.2 ARTIFICIAL INTELLIGENCE |
DDC-Classification: | 0 General works, computer science, information science / 004 Computer science |