On-Demand ETL for Real-Time Analytics

In recent years, business intelligence applications have become more real-time, and traditional data warehouse tables have become fresher as streaming ETL jobs refresh them continuously, within seconds. In addition, a new type of federated system has emerged that unifies domain-specific computation engines to address a wide range of complex analytical applications; such systems need streaming ETL to migrate data across computation systems. From daily sales reports to up-to-the-second cross-sell and up-sell campaign activities, we observe a variety of latency and freshness requirements in these analytical applications. Streaming ETL jobs with regular batches are therefore not flexible enough for such a mixed workload. Jobs with small batches can cause resource overprovisioning for queries with low freshness needs, while jobs with large batches starve queries with high freshness needs. We therefore argue that ETL jobs should adapt themselves to varying SLA demands by setting appropriate batches as needed. The major contributions are summarized as follows.

  • We defined a consistency model for “On-Demand ETL” that specifies the correct batches a query requires in order to see a consistent state. Furthermore, we proposed an “Incremental ETL Pipeline” that reduces the performance impact of on-demand ETL processing.
  • We introduced a distributed, incremental ETL pipeline (called HBelt) for distributed warehouse systems. HBelt aims to provide consistent, distributed snapshot maintenance for concurrent table scans across different analytics jobs.
  • We addressed the elasticity of the incremental ETL pipeline to guarantee that ETL jobs with batches of varying sizes finish within strict deadlines. To this end, we proposed Elastic Queue Middleware and HBaqueue, which replace memory-based data exchange queues with a scalable distributed store, HBase.
  • We also implemented lazy maintenance logic in the extraction and loading phases to make these two phases workload-aware.

In addition, we discuss how the “On-Demand ETL” idea can be exploited in analytic flows running on heterogeneous execution engines.
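The core idea of sizing an ETL batch by the freshness a query actually asks for can be illustrated with a small sketch. This is not the thesis's implementation (it does not model HBelt, HBaqueue, or the consistency model in detail); the names `Delta`, `OnDemandETL`, `capture`, and `query`, and the `max_staleness` parameter, are hypothetical and chosen only for illustration. It assumes captured source changes arrive in commit-time order and that a query tolerates data up to `max_staleness` time units old.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Delta:
    """A captured source change (hypothetical structure)."""
    commit_time: float
    row: dict


@dataclass
class OnDemandETL:
    """Toy model: deltas stay pending until some query needs them."""
    pending: List[Delta] = field(default_factory=list)
    warehouse: List[dict] = field(default_factory=list)
    loaded_up_to: float = 0.0  # watermark of source time already loaded

    def capture(self, delta: Delta) -> None:
        # Extraction side: buffer the change instead of loading it eagerly.
        self.pending.append(delta)

    def query(self, now: float, max_staleness: float) -> Tuple[List[dict], int]:
        # The query must see every change committed at or before this time.
        required = now - max_staleness
        # The batch contains exactly the pending deltas the query depends on.
        batch = [d for d in self.pending if d.commit_time <= required]
        self.pending = [d for d in self.pending if d.commit_time > required]
        self.warehouse.extend(d.row for d in batch)
        self.loaded_up_to = max(self.loaded_up_to, required)
        # Return a snapshot of the table and the size of the batch applied.
        return list(self.warehouse), len(batch)
```

In this sketch, a query with a loose freshness requirement triggers a small batch or none at all, while a query that demands up-to-the-second data forces all pending deltas to be loaded first, so the batch size follows the workload rather than a fixed schedule.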

Metadata
Author: Weiping Qu
URN: urn:nbn:de:hbz:386-kluedo-62522
DOI: https://doi.org/10.26204/KLUEDO/6252
Advisor: Stefan Dessloch
Document type: Dissertation
Language of publication: English
Date of publication (online): 3 February 2021
Year of first publication: 2021
Publishing institution: Technische Universität Kaiserslautern
Degree-granting institution: Technische Universität Kaiserslautern
Date of thesis acceptance: 3 February 2021
Date of publication (server): 4 February 2021
Number of pages: XIII, 161
Departments / Organizational units: Kaiserslautern - Department of Computer Science
CCS classification (computer science): H. Information Systems / H.2 DATABASE MANAGEMENT (E.5) / H.2.5 Heterogeneous Databases
DDC subject groups: 0 General, computer science, information science / 004 Computer science
License (German): Creative Commons 4.0 - Attribution (CC BY 4.0)