An Approach to Data Integration and Explorative Query Processing in Scientific Data Management Platforms
Technological advances in bioscience research are producing vast amounts of complex, heterogeneous data from individual studies. Efficient research data management (RDM) solutions based on the FAIR principles can help research groups standardize and package their study-specific results into uniquely identifiable, easily traceable digital objects. Exploring the interdependencies among datasets that originate from different research disciplines requires a generic, data-centric RDM solution that addresses the inherent challenges and manages complex datasets effectively.
This thesis introduces PLANTdataHUB, an end-to-end RDM ecosystem for plant science data that generates FAIR digital objects known as Annotated Research Contexts (ARCs). Building on PLANTdataHUB, we present an incremental approach to developing a set of search and exploration applications for the plant science community. The goal is to facilitate interdisciplinary data analysis among participating research groups, enabling knowledge discovery, collaboration, and innovation.
Our research focuses on developing a framework for exploring large-scale, multi-model plant science datasets. A key contribution is a novel key-value index store within the polystore architecture. We propose a fast, scalable, space-efficient, and flexible indexing scheme that uses purpose-built bitmaps for exploratory data analysis and supports containment, point, and range queries. This index store complements the query processing mechanism, enabling cross-model queries to be executed across multiple data sources. Additionally, we extend index management by modeling it as a two-player game that addresses attribute selection and cost-based refinement while adapting to the query workload.
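As a rough illustration of how a key-value bitmap index can answer containment, point, and range queries, the following Python sketch maps each (attribute, value) key to a bitmap of record identifiers. It is not the indexing scheme developed in this thesis: the class `BitmapIndex`, its method names, and the sample records are hypothetical, containment is interpreted here as "the record carries the attribute at all", and plain Python integers stand in for purpose-built bitmaps.

```python
from collections import defaultdict


class BitmapIndex:
    """Toy key-value bitmap index: (attribute, value) -> bitmap of record ids."""

    def __init__(self):
        # A Python int acts as an arbitrary-length bitmap;
        # bit i is set when record i matches the key.
        self.bitmaps = defaultdict(int)

    def add(self, record_id, record):
        for attribute, value in record.items():
            self.bitmaps[(attribute, value)] |= 1 << record_id

    def point_query(self, attribute, value):
        # Records whose attribute equals the given value.
        return self.bitmaps.get((attribute, value), 0)

    def range_query(self, attribute, low, high):
        # Records whose attribute value lies in [low, high]:
        # the OR of all matching per-value bitmaps.
        result = 0
        for (attr, value), bitmap in self.bitmaps.items():
            if attr == attribute and low <= value <= high:
                result |= bitmap
        return result

    def containment_query(self, attribute):
        # Records that contain the attribute, whatever its value.
        result = 0
        for (attr, _), bitmap in self.bitmaps.items():
            if attr == attribute:
                result |= bitmap
        return result


def to_ids(bitmap):
    """Decode a bitmap into the sorted list of record ids it represents."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Made-up sample records, for illustration only.
records = [
    {"species": "Arabidopsis thaliana", "height_cm": 12},
    {"species": "Zea mays", "height_cm": 210},
    {"species": "Arabidopsis thaliana", "height_cm": 15, "tissue": "leaf"},
]

index = BitmapIndex()
for record_id, record in enumerate(records):
    index.add(record_id, record)

same_species = index.point_query("species", "Arabidopsis thaliana")
short_plants = index.range_query("height_cm", 10, 20)
has_tissue = index.containment_query("tissue")

# Predicates combine with bitwise AND/OR on the bitmaps.
combined = same_species & short_plants & has_tissue
print(to_ids(combined))
```

Because every query returns a bitmap, conjunctions and disjunctions of predicates reduce to bitwise operations, which is one reason bitmap indexes suit exploratory workloads. A production index would avoid the linear scan over keys in the range and containment queries, for example by keeping keys sorted per attribute.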
Furthermore, we implement two search applications: the ARC Metadata Registry, which facilitates integrated metadata exploration, and ARCXplore, which enables in-situ, on-demand querying of ARC datasets.