Exploring Data through Ranked Entities
- Ranking lists are an essential methodology to succinctly summarize outstanding items, computed over database tables or crowdsourced in dedicated websites. In this thesis, we propose the usage of automatically generated, entity-centric rankings to discover insights in data. We present PALEO, a framework for data exploration through reverse engineering top-k database queries, that is, given a database and a sample top-k input list, our approach, aims at determining an SQL query that returns results similar to the provided input when executed over the database. The core problem consist of finding selection predicates that return the given items, determining the correct ranking criteria, and evaluating the most promising candidate queries first. PALEO operates on subset of the base data, uses data samples, histograms, descriptive statistics, and further proposes models that assess the suitability of candidate queries which facilitate limitation of false positives. Furthermore, this thesis presents COMPETE, a novel approach that models and computes dominance over user-provided input entities, given a database of top-k rankings. The resulting entities are found superior or inferior with tunable degree of dominance over the input set---a very intuitive, yet insightful way to explore pros and cons of entities of interest. Several notions of dominance are defined which differ in computational complexity and strictness of the dominance concept---yet, interdependent through containment relations. COMPETE is able to pick the most promising approach to satisfy a user request at minimal runtime latency, using a probabilistic model that is estimating the result sizes. The individual flavors of dominance are cast into a stack of algorithms over inverted indices and auxiliary structures, enabling pruning techniques to avoid significant data access over large datasets of rankings.