Query Processing and Cardinality Estimation over Data Summaries
- In this thesis, we propose methods and optimizations for efficient query processing and cardinality estimation over data summaries. For this reason, we put forth algorithms that work on a substantial amount of static data and methods tailored for streaming environments, where the data is constantly being generated and arrives over time. When working in streaming environments, we first address the problem of natural join computation over streams of semi-structured documents. We initially propose an efficient and scalable partitioning algorithm that uses the main principles of association analysis to identify co-occurring attribute-value pairs within the documents. To compute the join pairs, we put forth a novel natural join algorithm that comes in two flavors. The join algorithm utilizes a frequent pattern tree (FP-tree) to store a compact representation of the schema-free documents. Differently, we address the problem of data stream summarization on edge devices to reduce network traffic and speed up further processing. Instead of emitting the whole data stream, our SoftSieving approach computes item-based summaries using core items. Core items of a data stream are the items with the highest values for a given monotone submodular utility function. When working with vast static data, we address approximate query processing and cardinality estimation. We propose a versatile approach to lightweight, approximate query processing by learning compact but tunably precise representations of larger quantities of tuples coined bubbles. Bubbles are tunable regarding the compactness of enclosed tuples as well as the granularity of statistics and the way they are instantiated. Our solution models the content of the bubbles using Bayesian networks and deep autoregressive models. We have also identified an optimization for deep autoregressive models for the task of cardinality estimation. We overcome the limitations of such models when handling queries involving range predicates by circumventing the typically applied, time-consuming estimation algorithm. Consequently, we present our hybrid approach, Grid-AR, that fuses deep autoregressive models with a grid-based structure.