Generic Text Recognition using Long Short-Term Memory Networks
- The task of printed Optical Character Recognition (OCR), though considered ``solved'' by many, still poses several challenges. The complex grapheme structure of many scripts, such as Devanagari and Urdu Nastaleeq, greatly lowers the performance of state-of-the-art OCR systems.
Moreover, the digitization of historical and multilingual documents still require much probing. Lack of benchmark datasets further complicates the development of reliable OCR systems. This thesis aims to find the answers to some of these challenges using contemporary machine learning technologies. Specifically, the Long Short-Term Memory (LSTM) networks, have been employed to OCR modern as well historical monolingual documents. The excellent OCR results obtained on these have led us to extend their application for multilingual documents.
The first major contribution of this thesis is to demonstrate the usability of LSTM networks for monolingual documents. The LSTM networks yield very good OCR results on various modern and historical scripts, without using sophisticated features and post-processing techniques. The set of modern scripts include modern English, Urdu Nastaleeq and Devanagari. To address the challenge of OCR of historical documents, this thesis focuses on Old German Fraktur script, medieval Latin script of the 15th century, and Polytonic Greek script. LSTM-based systems outperform the contemporary OCR systems on all of these scripts. To cater for the lack of ground-truth data, this thesis proposes a new methodology, combining segmentation-based and segmentation-free OCR approaches, to OCR scripts for which no transcribed training data is available.
Another major contribution of this thesis is the development of a novel multilingual OCR system. A unified framework for dealing with different types of multilingual documents has been proposed. The core motivation behind this generalized framework is the human reading ability to process multilingual documents, where no script identification takes place.
In this design, the LSTM networks recognize multiple scripts simultaneously without the need to identify different scripts. The first step in building this framework is the realization of a language-independent OCR system which recognizes multilingual text in a single step. This language-independent approach is then extended to script-independent OCR that can recognize multiscript documents using a single OCR model. The proposed generalized approach yields low error rate (1.2%) on a test corpus of English-Greek bilingual documents.
In summary, this thesis aims to extend the research in document recognition, from modern Latin scripts to Old Latin, to Greek and to other ``under-privilaged'' scripts such as Devanagari and Urdu Nastaleeq.
It also attempts to add a different perspective in dealing with multilingual documents.