Learning View Synthesis from Minimal Scene Specifications

Computer-generated images, videos, and visual effects are indispensable resources for digital content creation that enable artists to create engaging visual stories. However, creating compelling stories often requires photorealistic 3D models designed by skilled digital artists. Novel View Synthesis (NVS) has emerged as a cheaper means to achieve photorealism: NVS uses a collection of photographs to render a scene from novel camera poses without relying on expensive models of geometry, materials, or light. This thesis introduces a number of view synthesis approaches with varying levels of input complexity, ranging from multi-view stereo and sparse multi-view spherical images to 2D semantic maps and simple textual descriptions. The proposed methods improve the state of the art in terms of accuracy, efficiency, and usability.

In the multi-view stereo input setting, the proposed FastNVS method significantly outperforms existing techniques in terms of speed and accuracy. FastNVS achieves this by decomposing the NVS problem into two structured prediction tasks: proxy geometry estimation and texture inpainting.

In the more challenging setting of spherical input images, prior work relies on the Multisphere Images (MSI) scene representation. MSI-based methods achieve fast rendering speeds but are limited to modeling low-dimensional color values per sphere. To alleviate this, we propose a novel scene representation called Soft Occlusion Multisphere Images (SOMSI) that enables modeling high-dimensional appearance features in MSI. This is achieved by assigning appearance features to a small number of occlusion levels instead of a large number of MSI spheres. SOMSI produces novel views of significantly higher quality while retaining the fast rendering times of traditional MSI.

Furthermore, the usability of view synthesis methods is enhanced by introducing novel techniques that require minimal user input and grant users control over the 3D scene. These methods also give users great creative freedom by enabling them to create novel 3D scenes for which no input images exist. The first method in this line of research is GVSNet, which allows users to create novel views from a single input 2D semantic sketch. Users can also manipulate the geometry of existing scenes by editing the input semantic map.

However, creating and editing a semantic map can easily become a tedious task. To further simplify the creative process, a novel approach is proposed that maps textual scene descriptions into renderable scene representations. The proposed method, Text2MPI, is a diffusion model trained to generate compact Multiplane Image (MPI) representations from text. Text2MPI generates crisp, photorealistic novel views that are 3D-consistent and match the input text description. Furthermore, the proposed model harnesses the vast generalization capability of 2D diffusion models by integrating 2D scene priors into its training procedure.
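The layered representations mentioned in the abstract (MSI, SOMSI, MPI) all render a novel view by compositing a stack of semi-transparent layers. As a rough illustration only, the following minimal sketch shows the standard back-to-front "over" compositing that such layered representations typically use; the function name, array shapes, and toy data are assumptions chosen for illustration and are not taken from the thesis.

    import numpy as np

    def composite_layers(colors, alphas):
        """Back-to-front "over" compositing of L semi-transparent layers.

        colors: (L, H, W, C) per-layer colors (or appearance features),
                ordered from the farthest layer to the nearest one
        alphas: (L, H, W, 1) per-layer opacities in [0, 1]
        Returns an (H, W, C) composited image (or feature map).
        """
        out = np.zeros_like(colors[0])
        for color, alpha in zip(colors, alphas):
            # Each nearer layer partially covers whatever lies behind it.
            out = alpha * color + (1.0 - alpha) * out
        return out

    # Toy example (illustrative values): 32 layers, RGB colors, 64x64 pixels.
    L, H, W = 32, 64, 64
    colors = np.random.rand(L, H, W, 3).astype(np.float32)
    alphas = np.random.rand(L, H, W, 1).astype(np.float32)
    image = composite_layers(colors, alphas)
    print(image.shape)  # (64, 64, 3)

In an MPI the layers are fronto-parallel planes at fixed depths, while in an MSI they are concentric spheres around the capture point. According to the abstract, SOMSI keeps this kind of compositing fast by storing high-dimensional appearance features at only a few occlusion levels rather than on every sphere.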

Metadata
Author: Tewodros Amberbir Habtegebrial
URN: urn:nbn:de:hbz:386-kluedo-97269
DOI: https://doi.org/10.26204/KLUEDO/9726
Advisor: Didier Stricker
Document Type: Doctoral Thesis
Cumulative document: No
Language of publication: English
Date of Publication (online): 2026/03/16
Year of first Publication: 2026
Publishing Institution: Rheinland-Pfälzische Technische Universität Kaiserslautern-Landau
Granting Institution: Rheinland-Pfälzische Technische Universität Kaiserslautern-Landau
Acceptance Date of the Thesis: 2025/04/11
Date of the Publication (Server): 2026/03/16
Page Number: 148
Faculties / Organisational entities: Kaiserslautern - Department of Computer Science
CCS-Classification (computer science): J. Computer Applications
DDC-Classification: 6 Technology, medicine, applied sciences / 600 Technology
Licence: Creative Commons 4.0 - Attribution (CC BY 4.0)