RDF-Connect: a declarative framework for streaming and cross-environment data processing pipelines
Further Information
Data processing pipelines are a crucial component of any data-centric system today. Machine learning, data integration, and knowledge graph publishing are examples where data processing pipelines are needed. Furthermore, most production systems require pipelines that support continuous operation and streaming-based processing for low-latency computations over large volumes of data. However, creating and maintaining data processing pipelines is challenging, and considerable effort is usually spent on ad-hoc scripting, which limits reusability across systems. Existing solutions are not interoperable out of the box and do not allow for easy integration of different execution environments (e.g., Java, Python, JavaScript, or Rust) while maintaining streaming operation. For example, combining Python-, JavaScript-, and Java-based libraries natively in a single pipeline is not straightforward. An interoperable and declarative mechanism could enable continuous communication and integrated execution of data processing functions across different execution environments.

We introduce RDF-Connect, a declarative framework based on semantic standards that instantiates pipelines whose data processing functions run in different execution environments and communicate through well-known protocols. We describe its architecture and demonstrate its use in an RDF knowledge graph creation, validation, and publishing use case. The declarative nature of our approach facilitates the reusability and maintainability of data processing pipelines. We currently support JavaScript and JVM-based environments, but we aim to extend RDF-Connect to other rich ecosystems such as Python, and to lower-level languages such as Rust to take advantage of system-level performance gains.
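As a minimal sketch of what such a declarative pipeline description could look like, consider the following Turtle snippet. The vocabulary used here (the rdfc: prefix and the Pipeline, Processor, runner, reader, writer, and Channel terms) is a hypothetical illustration of the approach assumed for this example, not the actual RDF-Connect ontology.

    @prefix rdfc: <https://example.org/rdf-connect#> .   # hypothetical vocabulary
    @prefix ex:   <https://example.org/my-pipeline#> .

    # A hypothetical three-stage pipeline: create, validate, publish.
    ex:kgPipeline a rdfc:Pipeline ;
        rdfc:consistsOf ex:mapper, ex:validator, ex:publisher .

    # A JVM-based processor that maps source data to RDF.
    ex:mapper a rdfc:Processor ;
        rdfc:runner rdfc:JvmRunner ;      # executes in a JVM environment
        rdfc:writer ex:mappedTriples .    # output channel

    # A JavaScript-based processor that validates the produced RDF.
    ex:validator a rdfc:Processor ;
        rdfc:runner rdfc:NodeRunner ;     # executes in a JavaScript environment
        rdfc:reader ex:mappedTriples ;    # input channel
        rdfc:writer ex:validTriples .

    # A processor that publishes the validated knowledge graph.
    ex:publisher a rdfc:Processor ;
        rdfc:runner rdfc:NodeRunner ;
        rdfc:reader ex:validTriples .

    # Channels connect processors across environments via a streaming protocol.
    ex:mappedTriples a rdfc:Channel .
    ex:validTriples  a rdfc:Channel .

Under this sketch, the entire pipeline is captured in a single machine-readable RDF description: processors declared for different runtimes are wired together through channels, so a stage could be swapped or reused across pipelines without rewriting any glue code.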