Controllable Deep Image Synthesis

Deep learning breakthroughs have significantly advanced computer vision in the past decade, particularly in image synthesis, which involves generating and manipulating images. Image synthesis has numerous practical applications, including art generation, editing, virtual reality, video games, and computer-aided design. While learning the unconditional distribution of natural images is interesting, gaining control over the image generation process by learning a conditional distribution is essential for practical applications. This thesis presents new methods of controllable generative models for high-quality deep image synthesis, building upon and extending the progress made in generative deep learning over the past decade. The central research question is how to advance controllable deep image synthesis, which is explored through five main dimensions: understanding the current state-of-the-art, integrating existing approaches, enabling fine-grained user inputs, improving models by focusing on important image areas, and developing an efficient generation algorithm. Natural language, the primary medium through which we communicate thoughts, ideas, and feelings, is arguably the most flexible and intuitive interface for controllable image synthesis. Thus, the first part of this research reviews text-to-image synthesis models and highlights open challenges such as generating complex scenes. The second part develops hybrid models that enhance image quality and alignment by integrating text-to-image synthesis with visual question answering and proposes a framework of robust generative networks. The third part focuses on precise control over image regions, covering attribute-controlled and dense text-to-image synthesis from free-form region descriptions. The fourth part introduces methods that prioritize key image areas, such as dynamic attention-guided diffusion and a curriculum learning approach that progressively blurs object regions to stabilize training and improve quality. Finally, the last part proposes an efficient algorithm leveraging pre-trained models for high-resolution text-based image generation. In summary, this thesis contributes to the field of controllable deep image synthesis, providing new methods and insights for developing advanced generative models.
Metadata
Author: Stanislav Frolov
URN: urn:nbn:de:hbz:386-kluedo-92927
DOI: https://doi.org/10.26204/KLUEDO/9292
Advisor: Andreas Dengel
Document Type: Doctoral Thesis
Cumulative document: No
Language of publication: English
Date of Publication (online): 2025/11/04
Year of first Publication: 2025
Publishing Institution: Rheinland-Pfälzische Technische Universität Kaiserslautern-Landau
Granting Institution: Rheinland-Pfälzische Technische Universität Kaiserslautern-Landau
Acceptance Date of the Thesis: 2025/10/06
Date of the Publication (Server): 2025/11/07
Tag: artificial intelligence; computer vision; generative models; image synthesis; machine learning
Page Number: XII, 184
Faculties / Organisational entities: Kaiserslautern - Fachbereich Informatik
CCS-Classification (computer science): I. Computing Methodologies
DDC-Classification: 0 General works, computer science, information science / 004 Computer science
Licence (German): Creative Commons 4.0 - Attribution (CC BY 4.0)