Towards Image and Video Understanding in Challenging Scenarios
Learning-based solutions have revolutionized the field of Artificial Intelligence (AI), pushing it to new frontiers across a variety of domains. These advances owe much to highly curated datasets that enable the training and testing of complex deep learning models, yielding excellent accuracy in controlled academic environments. However, performance in such lab settings does not fully reflect the complexities and challenges encountered in real-life applications. To ensure the practical applicability of learning-based solutions, it is crucial to understand their limitations and performance under diverse and challenging conditions, bridging the gap between academic benchmarks and the complexities of the real world.

In this thesis, we take a step toward better understanding the limitations of current deep learning models in handling corner cases and challenging scenarios, with a focus on computer vision tasks across the image and video domains. Through this exploration, we identify areas and ways to improve the robustness of computer vision solutions.

In the image domain, we study image classification and analyze model behavior when processing images containing background noise and clutter. We extend this study to salient image classification, a scenario in which multiple objects are present in a scene and the model is expected to classify the most prominent, or salient, one.

In the video domain, we explore the fundamental task of spatiotemporal feature correspondence learning, which has diverse applications such as video object segmentation and tracking. Our investigation delves into challenging scenarios, including tracking smaller objects, managing occlusion, coping with crowded scenes, and efficiently processing longer videos.
Furthermore, we investigate self-supervised learning methods for spatiotemporal correspondence learning, motivated by the high cost of annotating video data for this task and the challenges that arise when training on small datasets. Finally, we study the problem of out-of-domain generalization on video data, a critical issue affecting the applicability of learning-based solutions. To this end, we evaluate several ways of using self-supervised learning to mitigate the adverse effects of domain shift, enabling the model to perform well in new, unseen domains. We hope this work fosters advancements in the field of AI by providing insights and directions for designing more robust models that deliver enhanced performance in diverse and complex scenarios.
| Author | Fatemeh Azimi |
|---|---|
| URN | urn:nbn:de:hbz:386-kluedo-90343 |
| DOI | https://doi.org/10.26204/KLUEDO/9034 |
| Advisors | Andreas Dengel, Didier Stricker |
| Document Type | Doctoral Thesis |
| Cumulative document | No |
| Language of publication | English |
| Date of Publication (online) | 2025/05/24 |
| Year of first Publication | 2025 |
| Publishing Institution | Rheinland-Pfälzische Technische Universität Kaiserslautern-Landau |
| Granting Institution | Rheinland-Pfälzische Technische Universität Kaiserslautern-Landau |
| Acceptance Date of the Thesis | 2024/12/09 |
| Date of Publication (Server) | 2025/05/27 |
| Page Number | IX, 173 |
| Faculties / Organisational entities | Kaiserslautern - Fachbereich Informatik |
| DDC-Classification | 0 General works, computer science, information / 004 Computer science |
| Licence (German) | |