Result: Metadata-Driven ETL Framework for Automated Schema Evolution and Impact Analysis

Title:
Metadata-Driven ETL Framework for Automated Schema Evolution and Impact Analysis
Source:
Journal of Computer Science and Technology Studies; Vol. 7 No. 7; 846-852 ; 2709-104X
Publisher Information:
Al-Kindi Center for Research and Development
Publication Year:
2025
Document Type:
Journal article (article in journal/newspaper)
File Description:
application/pdf
Language:
English
DOI:
10.32996/jcsts.2025.7.7.91
Accession Number:
edsbas.F9440CEF
Database:
BASE

Further Information

Enterprise data systems face significant obstacles in maintaining Extract, Transform, and Load (ETL) operations across diverse data repositories. Schema changes are a persistent challenge in modern data engineering, because evolving organizational requirements continually alter the structure of data. Traditional schema change management relies heavily on manual processes, which create bottlenecks that delay critical business operations. This study introduces a metadata-driven ETL framework that addresses these obstacles through automated schema evolution detection and impact analysis. The framework uses schema registries and version tracking systems to maintain detailed metadata catalogs, enabling immediate identification of structural changes across data repositories. The architecture consists of four core components: Schema Registry Service, Change Detection Engine, Impact Analysis Module, and Pipeline Orchestration Layer. The implementation follows microservices design patterns running on Microsoft Azure Kubernetes Service, with Apache Spark for scalable data processing and Delta Lake for reliable data storage. Testing in enterprise settings is reported to show high success rates for automatically handling schema drift without manual intervention. The distributed architecture allows horizontal scaling across multiple processing nodes while maintaining sub-second response times for schema change detection and impact analysis.
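
The abstract does not describe how change detection or impact analysis is implemented, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes schemas are stored as simple column-to-type mappings and that lineage is available as an adjacency list. All names (`SchemaChange`, `diff_schemas`, `impacted_assets`, and the table names) are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class SchemaChange:
    """One detected difference between two schema versions (hypothetical type)."""
    kind: str                      # "added", "removed", or "type_changed"
    column: str
    old_type: Optional[str] = None
    new_type: Optional[str] = None


def diff_schemas(old: Dict[str, str], new: Dict[str, str]) -> List[SchemaChange]:
    """Compare two column -> type mappings and list the differences by column name."""
    changes = []
    for col in sorted(old.keys() | new.keys()):
        if col not in old:
            changes.append(SchemaChange("added", col, None, new[col]))
        elif col not in new:
            changes.append(SchemaChange("removed", col, old[col], None))
        elif old[col] != new[col]:
            changes.append(SchemaChange("type_changed", col, old[col], new[col]))
    return changes


def impacted_assets(lineage: Dict[str, List[str]], source: str) -> List[str]:
    """Breadth-first walk of a lineage graph: every asset downstream of `source`."""
    seen = {source}
    order = []
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Illustrative data only: two versions of a made-up "raw.orders" schema
# and a made-up lineage graph (assumptions, not taken from the paper).
old_schema = {"id": "bigint", "amount": "int", "region": "string"}
new_schema = {"id": "bigint", "amount": "decimal(18,2)", "currency": "string"}
lineage = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["mart.revenue", "mart.orders_daily"],
    "mart.revenue": ["dashboard.finance"],
}

changes = diff_schemas(old_schema, new_schema)
impacted = impacted_assets(lineage, "raw.orders") if changes else []
```

In this sketch, a non-empty `changes` list plays the role of a detected schema drift event, and `impacted` lists the downstream assets an orchestration layer would need to review or update. A production system would more likely read schemas from a registry and track lineage at column level, which this example does not attempt.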