Graph-based Data Integration for System Integrity and Scalable Analytics

PhD Thesis

Arshad, B. 2022. Graph-based Data Integration for System Integrity and Scalable Analytics. PhD Thesis
Authors: Arshad, B.
Type: PhD Thesis

Data from heterogeneous sources need to be brought together for numerous purposes, ranging from data consolidation, reporting, and analytics to long-term preservation. Bringing these data sources together is commonly known as data integration, a well-researched problem in nearly every domain. As the need to integrate heterogeneous data sources increases, so does the number of solutions addressing it. Data integration in turn offers numerous insights into the data. Many existing solutions integrate two data sources; however, most of them disregard the volume and velocity of change as the sources evolve, and even more so when newer sources are added. Additionally, resolving duplicate entities as these sources are integrated becomes a complex and time-consuming process. To overcome these issues, holistic data integration using graphs and entity resolution is needed to integrate multiple data sources while ensuring consistency. Against this background, a strong emphasis on performance and scalability is imperative. As an example, ensuring structural consistency when integrating multiple sources can be aided by deduplicating records. Data quality issues also need to be dealt with when multiple sources are involved. This thesis addresses these issues by designing a holistic integration and clustering approach that yields consistency as a result. First, the primary mechanism for converting data sources to a unified graph is presented. These graphs act as the starting point for the integration process.

Unlike previous systems, the approach is tested by designing a data integration model based on distributed processing of graphs. The process begins with transforming the existing data-sets into a graph format and generating additional graphs synthetically. Once these graphs are loaded into an in-memory environment, the proposed holistic clustering approach generates unified clusters from the various data sources, each representing matching real-world entities. The quality of the results is enhanced by exploiting the structure of the entities, which makes it possible to identify erroneous existing links and uncover additional ones. An initial assessment on a clinical variation data-set indicates the efficacy of this approach. Moreover, the system must be able to address the dynamicity of data sources. In data integration particularly, constantly evolving data sources bring numerous entity changes and the addition of new data sources. Prior attempts at entity resolution focused primarily on the static clustering of entities from a limited number of data sources. Therefore, in order to resolve dynamic data entities and incorporate additional data sources, this thesis assesses new ways to address incremental entity clustering in an evolving environment. A comprehensive evaluation on real-world and synthetically generated data-sets is used to assess the effectiveness, scalability and performance of the incremental clustering techniques.
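To make the idea of incremental entity clustering concrete, the following is a minimal single-machine sketch, not the thesis implementation. It assumes that matching has already produced pairwise links between entity ids, and it swaps in a plain union-find (transitive closure) structure as the clustering rule; the class name `EntityClusters`, the id scheme such as `s1:v1`, and the sample links are all hypothetical.

```python
from collections import defaultdict


class EntityClusters:
    """Union-find over entity ids; each set is one cluster of matching records."""

    def __init__(self):
        self.parent = {}

    def add(self, entity_id):
        # Register an entity as its own singleton cluster if it is new.
        self.parent.setdefault(entity_id, entity_id)

    def find(self, entity_id):
        # Walk up to the cluster root, then compress the path behind us.
        self.add(entity_id)
        root = entity_id
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[entity_id] != root:
            self.parent[entity_id], entity_id = root, self.parent[entity_id]
        return root

    def link(self, a, b):
        # Merge the clusters of two entities that were judged to match.
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def clusters(self):
        # Group every known entity under its root and return sorted clusters.
        groups = defaultdict(set)
        for entity_id in list(self.parent):
            groups[self.find(entity_id)].add(entity_id)
        return sorted(sorted(group) for group in groups.values())


# Static step: links found between two hypothetical sources s1 and s2.
clusters = EntityClusters()
for a, b in [("s1:v1", "s2:v7"), ("s1:v2", "s2:v9")]:
    clusters.link(a, b)
initial = clusters.clusters()

# Incremental step: a third source arrives; only its new links are applied,
# and the existing clusters are updated in place rather than rebuilt.
clusters.link("s3:v4", "s2:v7")
clusters.add("s3:v5")
updated = clusters.clusters()
```

The point of the sketch is that adding a source touches only the clusters its entities link to. A distributed version on Spark would need a different formulation, which this example does not attempt.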

This thesis proposes a distributed holistic approach to integrate disparate sources using entity resolution and clustering techniques. The implementation is based on the open-source Apache Spark framework. In contrast to prior techniques, both static and dynamic settings are improved by the suggested optimisations. These include blocking techniques that limit the search space. In addition, the approach recognises links among existing connections and helps identify new ones. To further reduce comparisons, each cluster is represented compactly by a fused representative. The broad assessment of the distributed clustering approach shows high viability for clinical data-sets together with scalability on a multi-machine Apache Spark cluster.
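The two comparison-reducing ideas mentioned above, blocking and fused cluster representatives, can be illustrated with a small sketch in plain Python rather than Spark. Everything specific here is an assumption made for illustration: the blocking key (first three letters of a `name` field), the fusion rule (keep the longest value per field), the function names, and the made-up records.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key: first three characters of the name, lower-cased.
    return record["name"][:3].lower()


def candidate_pairs(records):
    """Return id pairs that share a blocking key and come from different sources."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    pairs = set()
    for block in blocks.values():
        for left, right in combinations(block, 2):
            if left["source"] != right["source"]:
                pairs.add(tuple(sorted((left["id"], right["id"]))))
    return pairs


def fuse(cluster):
    """Collapse a cluster into one representative, keeping the longest value per field."""
    representative = {}
    for record in cluster:
        for field, value in record.items():
            if field in ("id", "source"):
                continue
            if len(str(value)) > len(str(representative.get(field, ""))):
                representative[field] = value
    return representative


# Made-up records from three hypothetical sources A, B and C.
records = [
    {"id": "a1", "source": "A", "name": "alpha-001", "label": "variant alpha"},
    {"id": "b1", "source": "B", "name": "Alpha 001 (rev)", "label": ""},
    {"id": "b2", "source": "B", "name": "beta-17", "label": "variant beta"},
    {"id": "c1", "source": "C", "name": "BETA 17", "label": "variant beta"},
]

pairs = candidate_pairs(records)            # 2 candidate pairs instead of all 6
representative = fuse([records[0], records[1]])
```

With a representative per cluster, a newly arriving record needs to be compared only against one fused record per candidate cluster instead of against every member.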

Overall, the presented holistic data integration approach produces good results and scales well to a wider range of entities and data sources. The methodology interconnects entities from different sources while simultaneously providing a compact representation with which to assemble a unified graph. This graph is then subjected to changes mimicking those in an evolving data integration environment. The proposed architecture yields better performance than traditional graph-based approaches. The resulting integrated data are subjected to comparative analysis to determine the accuracy of the approach in comparison to data-sets integrated traditionally. The resulting solution enables holistic data integration that aids consistency for systems and analytics that depend on a combination of information from a variety of distinct data sources. Clinical data-sets are used as a running example throughout the thesis; they provide a starting point, but the approach can be adapted to other domains, as detailed in Chapter 7.

Keywords: Data Integration, Graphs, Entity Resolution, Consistency
Publisher: College of Science & Engineering, University of Derby
Output status: Submitted
Publication process dates
Deposited: 10 Nov 2022