On-Demand ETL for Real-Time Analytics

Qu, Weiping

doi:10.26204/KLUEDO/6252

In recent years, business intelligence applications have become more real-time, and traditional data warehouse tables have become fresher as streaming ETL jobs refresh them continuously, within seconds. In addition, a new type of federated system has emerged that unifies domain-specific computation engines to address a wide range of complex analytical applications; such systems need streaming ETL to migrate data across computation systems. From daily sales reports to up-to-the-second cross-/up-sell campaign activities, we observed varying latency and freshness requirements in these analytical applications. Streaming ETL jobs with regular batches are therefore not flexible enough for such a mixed workload: jobs with small batches overprovision resources for queries with low freshness needs, while jobs with large batches starve queries with high freshness needs. We therefore argue that ETL jobs should adapt themselves to varying SLA demands by setting appropriate batches as needed.

The major contributions are summarized as follows.
• We defined a consistency model for “On-Demand ETL” which determines the correct batches that queries need in order to see consistent states. Furthermore, we proposed an “Incremental ETL Pipeline” which reduces the performance impact of on-demand ETL processing.
• We introduced a distributed, incremental ETL pipeline (called HBelt) for distributed warehouse systems. HBelt aims at providing consistent, distributed snapshot maintenance for concurrent table scans across different analytics jobs.
• We addressed the elasticity of the incremental ETL pipeline to guarantee that ETL jobs with batches of varying sizes can finish within strict deadlines. To this end, we proposed Elastic Queue Middleware and HBaqueue, which replace memory-based data exchange queues with a scalable distributed store, HBase.
• We also implemented lazy maintenance logic in the extraction and loading phases to make these two phases workload-aware.

In addition, we discuss how our “On-Demand ETL” thinking can be exploited in analytic flows running on heterogeneous execution engines.
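
The idea of setting batches according to each query's freshness demand can be illustrated with a minimal sketch. It is not the algorithm from the thesis: the `Delta` record, the `select_batch` function, and the rule that a query must see every change committed at or before its arrival time minus its tolerated staleness are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Delta:
    """A captured source change waiting to be loaded (hypothetical structure)."""
    commit_ts: float  # commit time of the change at the source, in seconds
    payload: str


def select_batch(pending: List[Delta], query_ts: float,
                 max_staleness: float) -> Tuple[List[Delta], List[Delta]]:
    """Split pending deltas into (must apply now, may stay deferred).

    Assumption: a query arriving at query_ts that tolerates max_staleness
    seconds of staleness must see every change committed at or before
    query_ts - max_staleness.
    """
    cutoff = query_ts - max_staleness
    ordered = sorted(pending, key=lambda d: d.commit_ts)
    batch = [d for d in ordered if d.commit_ts <= cutoff]
    deferred = [d for d in ordered if d.commit_ts > cutoff]
    return batch, deferred


pending = [Delta(1.0, "a"), Delta(5.0, "b"), Delta(9.0, "c")]

# Report-style query: tolerates 60 s of staleness, so no delta is forced.
relaxed_batch, _ = select_batch(pending, query_ts=10.0, max_staleness=60.0)

# Campaign-style query: tolerates 2 s, so changes up to t=8 must be applied.
strict_batch, strict_deferred = select_batch(pending, query_ts=10.0, max_staleness=2.0)
```

Under these assumptions the relaxed query triggers no ETL work, while the strict query forces only the two oldest deltas to be applied and leaves the newest one deferred, which is the kind of per-query batch sizing the abstract argues for.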

Author:	Weiping Qu
URN:	urn:nbn:de:hbz:386-kluedo-62522
DOI:	https://doi.org/10.26204/KLUEDO/6252
Advisor:	Stefan Dessloch
Document Type:	Doctoral Thesis
Language of publication:	English
Date of Publication (online):	2021/02/03
Year of first Publication:	2021
Publishing Institution:	Technische Universität Kaiserslautern
Granting Institution:	Technische Universität Kaiserslautern
Acceptance Date of the Thesis:	2021/02/03
Date of the Publication (Server):	2021/02/04
Page Number:	XIII, 161
Faculties / Organisational entities:	Kaiserslautern - Fachbereich Informatik
CCS-Classification (computer science):	H. Information Systems / H.2 DATABASE MANAGEMENT (E.5) / H.2.5 Heterogeneous Databases
DDC-Classification:	0 Allgemeines, Informatik, Informationswissenschaft / 004 Informatik
Licence (German):	Creative Commons 4.0 - Namensnennung (CC BY 4.0)
