Graph-based Data Integration for System Integrity and Scalable Analytics
PhD Thesis
Authors | Arshad, B. |
---|---|
Type | PhD Thesis |
Abstract | Data from heterogeneous sources need to be brought together for numerous purposes, ranging from data consolidation, reporting, and analytics to long-term preservation. Bringing these data sources together is commonly known as data integration, a well-researched problem in nearly every domain. As the need to integrate heterogeneous data sources increases, so does the number of solutions addressing it. Integrated data, in turn, offer numerous insights. Many existing data integration solutions can integrate two data sources; however, most of them disregard the volume and velocity of change as the sources evolve, and more so when newer sources are added. Additionally, resolving duplicate entities as these sources are integrated becomes a complex and time-consuming process. To overcome these issues, holistic data integration using graphs and entity resolution is needed to integrate multiple data sources while ensuring consistency. Against this background, a strong emphasis on performance and scalability is imperative. As an example, structural consistency when integrating multiple sources can be aided by deduplicating records. Subsequently, data quality issues need to be dealt with when multiple sources are involved. This thesis proposes a distributed holistic approach to integrate disparate sources using entity resolution and clustering techniques. Unlike previous systems, the approach is tested by designing a data integration model based on distributed processing of graphs. The process begins with transforming the existing data sets to a graph format and generating additional [...] The implementation is based on open-source Apache Spark. In contrast to prior techniques, both static and dynamic settings are enhanced by the suggested optimisations.
The approach additionally uses blocking techniques to limit the search space. It also recognises links in previous connections and helps identify new ones. To further reduce comparisons, clusters are represented compactly by fused representatives. A broad evaluation of the distributed clustering approach shows high viability for clinical data sets, together with scalability on a multi-machine Apache Spark cluster. Overall, the presented holistic data integration approach produces good results and scales well to a wider range of entities and data sources. The methodology interconnects entities from different sources while simultaneously providing a compact representation from which a unified graph is assembled. This graph is then subjected to changes that mimic an evolving data integration environment. The proposed architecture yields better performance than traditional graph-based approaches. The resultant integrated data are subjected to comparative [...] |
Keywords | Data Integration, Graphs, Entity Resolution, Consistency |
Year | 2022 |
Publisher | College of Science & Engineering, University of Derby |
Digital Object Identifier (DOI) | https://doi.org/10.48773/9v0wz |
File | File access level: Controlled |
Output status | Submitted |
Publication process dates | |
Deposited | 10 Nov 2022 |
https://repository.derby.ac.uk/item/9v0wz/graph-based-data-integration-for-system-integrity-and-scalable-analytics
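
The abstract mentions blocking to limit the search space and fused cluster representatives to reduce comparisons. The sketch below illustrates those two ideas only in general terms. It is not the thesis's implementation, which is described as distributed and based on Apache Spark; this is single-machine Python. The records, field names, blocking key, similarity measure, threshold, and fusion rule are all assumptions made up for the example.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records from two sources; field names are illustrative only.
RECORDS = [
    {"id": "a1", "source": "A", "name": "John Smith", "city": "Derby"},
    {"id": "b1", "source": "B", "name": "Jon Smith", "city": "Derby"},
    {"id": "a2", "source": "A", "name": "Mary Jones", "city": "Leeds"},
    {"id": "b2", "source": "B", "name": "Maria Lopez", "city": "Leeds"},
]


def blocking_key(record):
    """Assumed blocking key: lower-cased city plus first letter of the name."""
    return (record["city"].lower(), record["name"][0].lower())


def similar(a, b, threshold=0.85):
    """Assumed matcher: character-level similarity of the lower-cased names."""
    ratio = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return ratio >= threshold


def candidate_pairs(records):
    """Yield only pairs that share a block instead of all pairs."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    for block in blocks.values():
        yield from combinations(block, 2)


def cluster(records):
    """Group matching records with a small union-find over record ids."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in candidate_pairs(records):
        if similar(a, b):
            parent[find(a["id"])] = find(b["id"])

    clusters = defaultdict(list)
    for record in records:
        clusters[find(record["id"])].append(record)
    return list(clusters.values())


def fuse(members):
    """Assumed fusion rule: keep the longest name and list all member ids."""
    return {
        "name": max((r["name"] for r in members), key=len),
        "city": members[0]["city"],
        "members": sorted(r["id"] for r in members),
    }


if __name__ == "__main__":
    for representative in (fuse(c) for c in cluster(RECORDS)):
        print(representative)
```

With a fused representative per cluster, a newly arriving record would only need to be compared against one entry per cluster rather than against every member, which is the kind of comparison saving the abstract refers to.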