Exploring Data through Ranked Entities

  • Ranking lists are an essential methodology to succinctly summarize outstanding items, computed over database tables or crowdsourced in dedicated websites. In this thesis, we propose the usage of automatically generated, entity-centric rankings to discover insights in data. We present PALEO, a framework for data exploration through reverse engineering top-k database queries, that is, given a database and a sample top-k input list, our approach, aims at determining an SQL query that returns results similar to the provided input when executed over the database. The core problem consist of finding selection predicates that return the given items, determining the correct ranking criteria, and evaluating the most promising candidate queries first. PALEO operates on subset of the base data, uses data samples, histograms, descriptive statistics, and further proposes models that assess the suitability of candidate queries which facilitate limitation of false positives. Furthermore, this thesis presents COMPETE, a novel approach that models and computes dominance over user-provided input entities, given a database of top-k rankings. The resulting entities are found superior or inferior with tunable degree of dominance over the input set---a very intuitive, yet insightful way to explore pros and cons of entities of interest. Several notions of dominance are defined which differ in computational complexity and strictness of the dominance concept---yet, interdependent through containment relations. COMPETE is able to pick the most promising approach to satisfy a user request at minimal runtime latency, using a probabilistic model that is estimating the result sizes. The individual flavors of dominance are cast into a stack of algorithms over inverted indices and auxiliary structures, enabling pruning techniques to avoid significant data access over large datasets of rankings.

Volltext Dateien herunterladen

Metadaten exportieren

Weitere Dienste

Suche bei Google Scholar
Metadaten
Verfasser*innenangaben:Kiril Panev
URN:urn:nbn:de:hbz:386-kluedo-55944
Betreuer*in:Sebastian Michel
Dokumentart:Dissertation
Sprache der Veröffentlichung:Englisch
Datum der Veröffentlichung (online):15.04.2019
Jahr der Erstveröffentlichung:2019
Veröffentlichende Institution:Technische Universität Kaiserslautern
Titel verleihende Institution:Technische Universität Kaiserslautern
Datum der Annahme der Abschlussarbeit:10.04.2019
Datum der Publikation (Server):16.04.2019
Seitenzahl:VII, 156
Fachbereiche / Organisatorische Einheiten:Kaiserslautern - Fachbereich Informatik
DDC-Sachgruppen:0 Allgemeines, Informatik, Informationswissenschaft / 004 Informatik
Lizenz (Deutsch):Creative Commons 4.0 - Namensnennung, nicht kommerziell, keine Bearbeitung (CC BY-NC-ND 4.0)