Structural Information Extraction from Document Images: Addressing Challenges in Layout Analysis, Table Detection, and Classification
Paper documents remain a vital part of our daily lives, and the need for automated
systems to analyze and extract valuable information from these documents is
increasingly important. Recent advancements in artificial intelligence have raised user
expectations for the extraction of structural information from document images,
going beyond the traditional goal of extracting raw text from documents. Typically,
document understanding systems comprise multiple components, including layout
analysis, table detection, and document classification, each of which presents unique
challenges. These challenges include handling complex and varied layouts, addressing
the issue of imbalanced datasets, and developing systems that can adapt and
learn over time. Layout analysis is a critical component of document understanding,
as it involves organizing and structuring the various elements of a document, such
as text, tables, and figures. Accurate table recognition is also essential, as it enables
the effective extraction and interpretation of structured data.
This research improves the accuracy, robustness, and efficiency of document analysis,
addressing current shortcomings in structural information extraction from documents
through novel datasets, model architectures, and learning strategies.
The dissertation presents multiple contributions to the field of document
understanding. First, we developed a CNN-based method for layout analysis, achieving
a 3 percent improvement over baseline techniques on PubLayNet. Second, we
introduced a continual learning strategy employing experience-replay techniques,
which reduced catastrophic forgetting in table detection by 15 percent. Third, we
presented a novel dataset and developed an asymmetric convolution-based neural
network, improving table ruling line recognition. To mitigate class imbalance in
document classification, we integrated visual and textual features with a customized
loss function, resulting in a 13 percent increase in accuracy. We also studied the
use of Large Language Models (LLMs) for document understanding. We developed a
technique for fine-tuning LLMs by structuring the input as HTML, which yields
results on par with state-of-the-art methods while requiring less computational
power. Finally, we empirically evaluated a three-phase prompt engineering strategy
for zero-shot information extraction, which yielded promising outcomes.