Building Natural Language Generation and Understanding Systems in Data Constrained Settings
- In recent years, deep learning has made substantial improvements in various fields like image understanding, Natural Language Processing (NLP), etc. These huge advancements have led to the release of many commercial applications which aim to help users carry out their daily tasks. Personal digital assistants are one such successful application of NLP, having a diverse userbase from all age groups. NLP tasks like Natural Language Understanding (NLU) and Natural Language Generation (NLG) are core components for building these assistants. However, like any other deep learning model, the growth of NLU & NLG models is directly coupled with tremendous amounts of training examples, which are expensive to collect due to annotator costs. Therefore, this work investigates the methodologies to build NLU and NLG systems in a data-constrained setting.
We evaluate the problem of limited training data in multiple scenarios like limited or no data available when building a new system, availability of a few labeled examples when adding a new feature to an existing system, and changes in the distribution of test data during the lifetime of a deployed system.
Motivated by the standard methods to handle data-constrained settings, we propose novel approaches to generate data and exploit latent representations to overcome performance drops emerging from limited training data.We propose a framework to generate high-quality synthetic data when few training examples are available for a newly added feature for dialogue agents. Our interpretation-to-text model uses existing training data for bootstrapping new features and improves the accuracy of downstream tasks of intent classification and slot labeling. Following, we study a few-shot setting and observe that generation systems face a low semantic coverage problem. Hence, we present an unsupervised NLG algorithm that ensures that all relevant semantic information is present in the generated text.
We also study to see if we really need all training examples for learning a generalized model. We propose a data selection method that selects the most informative training examples to train Visual Question Answering (VQA) models without erosion of accuracy. We leverage the already available inter-annotator agreement and design a diagnostic tool, called (EaSe), that leverages the entropy and semantic similarity of answer patterns.
Finally, we discuss two empirical studies to understand the feature space of VQA models and show how language model pre-training and exploiting multimodal embedding space allows for building data constrained models ensuring minimal or no accuracy losses.