This thesis is about data warehousing technologies for large-scale and right-time data.
Today, due to the exponential growth of data, it has become a common practice
for many enterprises to process hundreds of gigabytes of data per day. Traditionally,
data warehousing populates data from heterogeneous sources into a central data
warehouse (DW) by Extract-Transform-Load (ETL) at regular time intervals, e.g.,
monthly, weekly, or daily. For large-scale data, however, this approach becomes challenging and makes it hard to support near real-time/right-time business decisions. This thesis considers
some of these challenges and makes the following contributions:
First, this thesis presents a new and efficient way to store triples from an OWL
Lite ontology known from the Semantic Web field. In contrast to classic triple-stores
where data in the triple format (subject, predicate, object) is stored in a few big tables with few columns, the presented triple-store spreads the data over more
tables that may have many columns. The triple-store is optimized by an extensive
use of bulk techniques, which makes it very efficient to insert and extract data. The
DBMS-based solution also makes it easy to integrate the triples with other, non-triple data.
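As a minimal illustration of the layout difference (the table and predicate names are illustrative, not the thesis's actual schema), triples can be pivoted from one narrow table into wider, per-subject rows:

```python
# Sketch: pivot (subject, predicate, object) triples into wide rows,
# one per subject, as a property-table style triple-store would.
# All names here are illustrative, not the thesis's actual schema.
triples = [
    ("p1", "name", "Alice"),
    ("p1", "age", "30"),
    ("p2", "name", "Bob"),
]

# Classic triple-store: one narrow table, one row per triple.
narrow_table = triples

# Wider layout: one row per subject, one column per predicate.
wide_table = {}
for s, p, o in triples:
    wide_table.setdefault(s, {})[p] = o

print(wide_table)
# {'p1': {'name': 'Alice', 'age': '30'}, 'p2': {'name': 'Bob'}}
```

The wide layout favors bulk insertion and extraction of all facts about a subject in one row, at the cost of more (and sparser) tables.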
Second, this thesis presents a middle-ware system for live DW data. Processing
live DW data is one of the trickiest problems in data warehousing. An innovative
method is proposed for processing live DW data: the data is accumulated in an intermediate data store, and modifications are performed on the fly when the data is materialized or queried. The data is made available in the DW exactly when needed, and users get bulk-load speeds combined with INSERT-like data availability.
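The accumulate-then-bulk-load idea can be sketched as follows; the class and method names are illustrative, not the middleware's actual interface:

```python
# Sketch: rows are buffered in an intermediate store and only flushed
# (bulk-loaded) into the DW table when the data is actually queried.
class RightTimeBuffer:
    def __init__(self):
        self._pending = []   # intermediate data store
        self._table = []     # stands in for the DW table

    def insert(self, row):
        # INSERT-like availability: a cheap append, no per-row DW round-trip.
        self._pending.append(row)

    def _flush(self):
        # Bulk-load speed: one batch operation instead of many inserts.
        self._table.extend(self._pending)
        self._pending.clear()

    def query(self):
        # The data is materialized exactly when it is needed.
        self._flush()
        return list(self._table)

buf = RightTimeBuffer()
buf.insert({"id": 1})
buf.insert({"id": 2})
print(buf.query())  # [{'id': 1}, {'id': 2}]
```

In the real system the intermediate store and the DW are separate components, but the flow is the same: cheap inserts, deferred bulk loading, and on-demand materialization.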
Third, this thesis presents the first dimensional ETL programming framework
using MapReduce. Parallel ETL is needed for large-scale data, but it is not easy to
implement. The presented framework makes this easy by offering high-level
ETL-specific constructs, including those for star schema, snowflake schema, slowly
changing dimensions (SCDs) and very large dimensions. The framework can achieve
high programming efficiency, i.e., only a few statements are needed to implement a parallel ETL program, and good scalability for processing different DW schemas.
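A hypothetical sketch of such high-level constructs (the names are illustrative, not the framework's actual API): a dimension and a fact table are declared in a few statements, and surrogate-key lookup and insertion are handled behind the scenes.

```python
# Illustrative dimensional ETL constructs; not the framework's real API.
class Dimension:
    def __init__(self, name, attributes):
        self.name, self.attributes = name, attributes
        self._members = {}  # natural-key tuple -> surrogate key

    def ensure(self, row):
        # Look the member up; insert it with a fresh surrogate key
        # if it is not present yet.
        key = tuple(row[a] for a in self.attributes)
        return self._members.setdefault(key, len(self._members) + 1)

class FactTable:
    def __init__(self, name, keyrefs, measures):
        self.name, self.keyrefs, self.measures = name, keyrefs, measures
        self.rows = []

    def insert(self, row):
        self.rows.append({k: row[k] for k in self.keyrefs + self.measures})

# Declaring a simple star schema takes only a few statements:
datedim = Dimension("datedim", ["date"])
sales = FactTable("sales", ["dateid"], ["amount"])

row = {"date": "2012-01-01", "amount": 100}
row["dateid"] = datedim.ensure(row)
sales.insert(row)
print(sales.rows)  # [{'dateid': 1, 'amount': 100}]
```

The framework described in the thesis additionally parallelizes these operations with MapReduce and provides constructs for snowflake schemas, SCDs, and very large dimensions.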
Finally, this thesis presents scalable dimensional ETL for cloud warehouses. Today, organizations show growing interest in moving their data warehousing systems to the cloud; however, current data warehousing systems are not yet particularly
suited for the cloud. The presented framework exploits Hadoop to parallelize ETL
execution and Hive as the warehouse system. It has a shared-nothing architecture,
and supports scalable ETL operations on clustered commodity machines. For implementing dimensional ETL, the framework achieves higher programmer productivity than Hive, as well as better performance.
In summary, this thesis discusses several aspects of the current challenges and problems of data warehousing, including integrating Web data, near real-time/right-time data warehousing, handling the exponential growth of data, and cloud data warehousing. The thesis proposes a variety of technologies to deal with these specific challenges.
Number of pages: 196
Publication status: Published - 2012