Integrated real-time distributed stream-disk processing architecture for un-structured big data
Date
2024
Publisher
UMT, Lahore
Abstract
Real-time ETL (Extract-Transform-Load) is a crucial component in meeting the growing demand for quicker business decisions in numerous contemporary applications. Because of the volume and velocity of the data, the foundation of real-time ETL is the extraction of unstructured data streams from multiple sources and their transformation using distributed disk data. Heterogeneity is another key aspect, since complex networks and smart devices produce live streams that are heterogeneous by nature. The heterogeneous stream-disk join is a significant research topic in real-time processing applications because it directly affects data analytics. Multiple issues, including stream loss, scalability, disk access cost, and data accuracy, must be considered during heterogeneous stream-disk join transformation. As a result, developing an architecture for the fundamental ETL building blocks in real time remains quite challenging. In this work, two architectures are proposed that can help organizations improve their current decision-making systems. The work focuses particularly on speeding up stream-disk joins (the transformation step), which are the most expensive operation in stream processing because they require frequent disk access. The first architecture performs real-time ETL without regard to the format of the data sources, converting the unstructured data stream after combining it with distributed disk data. It is capable of performing analytics on heterogeneous data at native speed without losing the flexibility of data from multiple sources. To overcome the issue of the heterogeneous unstructured stream-disk join, the second architecture is proposed: an integrated distributed heterogeneous stream-disk join architecture, DHSDJArch, which prevents stream data loss while maintaining a balance between heterogeneous distributed data sources and the accuracy of the stream-disk join.
This four-phase distributed architecture is proposed for multi-objective optimization when transforming heterogeneous, incomplete streams. To prevent stream loss, a log retention configuration is proposed based on the characteristics of the distributed event streaming platform (DESP). Specifically, two transformations are proposed: one to pre-process heterogeneous streams, and one to join the pre-processed stream with distributed disk data by performing real-time disk access while compensating for the differences between data sources and the streaming application. Additionally, a data pipeline is described for the stream-disk join that uses partition-based input and a best-effort in-memory database strategy to reduce the number of disk accesses. The proposed architectures address problems including real-time processing in distributed environments, heterogeneous streams, ignored unmatched streams, disk overhead, and stream data loss. On both local and distributed workstations, experimental results using a stream generator and real-world datasets demonstrate that the proposed architectures greatly improve throughput, especially for large numbers of stream tuples with huge datasets. According to the experimental results, throughput scales linearly with the number of input streams and dataset sizes. The performance criteria considered in this study corroborate the functioning of the proposed architectures in terms of accuracy, log retention policy, scaling, stability, and cloud data storage. This work makes two contributions by developing and evaluating two architectures focused on unstructured big data generated from homogeneous or heterogeneous multiple sources. Rigorous evaluation shows that no existing architecture dominates the overall performance of real-time distributed stream processing under the conditions of unstructured heterogeneous big data streams.
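To make the idea of a stream-disk join backed by a best-effort in-memory store more concrete, the following is a minimal illustrative sketch in Python. It is not the thesis's DHSDJArch implementation; the class `CachedDiskLookup`, the function `join_partition`, the LRU eviction policy, and the `id` join key are all assumptions introduced here for illustration. It shows one partition of stream tuples being joined against a slow "disk" lookup, with a small cache absorbing repeated keys and unmatched tuples kept rather than dropped.

```python
from collections import OrderedDict


class CachedDiskLookup:
    """Hypothetical best-effort LRU cache in front of a slow disk lookup."""

    def __init__(self, disk_lookup, capacity=1024):
        self._disk_lookup = disk_lookup  # callable: key -> row dict or None
        self._capacity = capacity
        self._cache = OrderedDict()
        self.disk_reads = 0  # counts how often the disk was actually hit

    def get(self, key):
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        self.disk_reads += 1
        value = self._disk_lookup(key)
        self._cache[key] = value  # misses (None) are cached as well
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # evict least recently used
        return value


def join_partition(stream_tuples, lookup, key_field="id"):
    """Join one partition of stream tuples with disk rows.

    Returns (matched, unmatched) so that unmatched tuples are not lost.
    """
    matched, unmatched = [], []
    for record in stream_tuples:
        disk_row = lookup.get(record.get(key_field))
        if disk_row is None:
            unmatched.append(record)
        else:
            matched.append({**disk_row, **record})
    return matched, unmatched


# Toy data: a dict stands in for the distributed disk store (assumption).
disk_table = {1: {"name": "sensor-a"}, 2: {"name": "sensor-b"}}
lookup = CachedDiskLookup(disk_table.get, capacity=2)
partition = [{"id": 1, "v": 10}, {"id": 1, "v": 11}, {"id": 3, "v": 12}]
matched, unmatched = join_partition(partition, lookup)
```

In this toy run, the second tuple with key 1 is served from memory, so three stream tuples cost only two disk reads. In a real deployment the cache policy, partitioning scheme, and disk store would depend on the streaming platform and storage layer in use.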