Incremental Recomputations in Materialized Data Integration

  • Data integration aims at providing uniform access to heterogeneous data managed by distributed source systems. Data sources can range from legacy systems, databases, and enterprise applications to web-scale data management systems. The materialized approach to data integration extracts data from the sources, transforms and consolidates it, and loads it into an integration system, where it is persistently stored and can be queried and analyzed. To support materialized data integration, so-called Extract-Transform-Load (ETL) systems have been built and are widely used to populate data warehouses today. While ETL is considered state-of-the-art in enterprise data warehousing, a new paradigm known as MapReduce has recently gained popularity for web-scale data transformations such as web indexing or PageRank computation. The input data of both ETL and MapReduce programs keeps changing over time, for instance as business transactions are processed or the web is crawled. Hence, the results of ETL and MapReduce programs become stale and need to be recomputed from time to time. Recurrent computations over changing input data can be performed in two ways: the result may either be recomputed from scratch or recomputed in an incremental fashion. The idea behind the latter approach is to update the existing result in response to incremental changes in the input data. This is typically more efficient than full recomputation, because reprocessing unchanged portions of the input data can often be avoided. Incremental recomputation techniques have been studied by the database research community, mainly in the context of materialized view maintenance, and have been adopted by all major commercial database systems today. However, neither today's ETL tools nor MapReduce systems support incremental recomputation. The situation of ETL and MapReduce programmers today is thus comparable to that of database programmers in the early 1990s. This thesis makes an effort to transfer incremental recomputation techniques to the ETL and MapReduce environments. This poses interesting research challenges, because these environments differ fundamentally from the relational world with regard to query and programming models, change data capture, transactional guarantees, and consistency models. However, as this thesis will show, incremental recomputations are feasible in ETL and MapReduce and may lead to considerable efficiency improvements.

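To illustrate the contrast between full and incremental recomputation described in the abstract, the following minimal Python sketch maintains a grouped count either by recomputing it from scratch or by applying a delta of inserted and deleted rows. It is an illustrative sketch of the general idea, not code from the thesis; the function and variable names are hypothetical.

```python
from collections import Counter

def full_recompute(rows):
    # Recompute the grouped count from scratch over the entire input.
    return Counter(key for key, _ in rows)

def incremental_update(result, inserted, deleted):
    # Hypothetical incremental path: update the existing result using only
    # the captured changes (inserted and deleted rows); unchanged input
    # is never reprocessed.
    for key, _ in inserted:
        result[key] += 1
    for key, _ in deleted:
        result[key] -= 1
        if result[key] == 0:
            del result[key]  # drop groups that became empty
    return result

# Usage: the incremental path processes only the delta, not the full input.
base = [("a", 1), ("a", 2), ("b", 3)]
counts = full_recompute(base)                                # Counter({'a': 2, 'b': 1})
counts = incremental_update(counts, [("b", 4)], [("a", 1)])  # Counter({'b': 2, 'a': 1})
```

The incremental variant is cheaper whenever the delta is small relative to the full input, which is the efficiency argument the abstract makes for ETL and MapReduce workloads.
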

Metadata
Author: Thomas Jörg
URN: urn:nbn:de:hbz:386-kluedo-34973
Advisor: Stefan Deßloch
Document type: Dissertation
Language of publication: English
Date of publication (online): 29.04.2013
Year of first publication: 2013
Publishing institution: Technische Universität Kaiserslautern
Degree-granting institution: Technische Universität Kaiserslautern
Date of thesis acceptance: 18.01.2013
Date of publication (server): 30.04.2013
Number of pages: IX, 188
Departments / organizational units: Kaiserslautern - Fachbereich Informatik
CCS classification (computer science): H. Information Systems / H.2 DATABASE MANAGEMENT (E.5) / H.2.5 Heterogeneous Databases
DDC subject groups: 0 Generalities, computer science, information science / 004 Computer science
License (German): Standard according to the KLUEDO guidelines of 10.09.2012