Result: UDF-Centric Dataflow Systems for Supporting User-Defined Functions in Collaborative Data Science, AI, and ML

Title:
UDF-Centric Dataflow Systems for Supporting User-Defined Functions in Collaborative Data Science, AI, and ML
Publisher Information:
eScholarship, University of California 2025-08-16
Document Type:
Electronic Resource
Availability:
Open access content
CC-BY
Note:
application/pdf
Other Numbers:
CDLER oai:escholarship.org:ark:/13030/qt9tv0t1tr
qt9tv0t1tr
1533730944
Contributing Source:
UC MASS DIGITIZATION
From OAIster®, provided by the OCLC Cooperative.
Accession Number:
edsoai.on1533730944
Database:
OAIster

Further Information

Data science tools, spanning from data collection to analysis and visualization, and leveraging advanced techniques such as artificial intelligence (AI), machine learning (ML), and large language models (LLMs), are now indispensable across a wide range of fields. Addressing today’s complex problems demands interdisciplinary collaboration among domain experts, data engineers, computer scientists, and statisticians, as no single field holds all the necessary expertise. There is an increasing demand for systems that let teams bring their own code across languages, collaborate modularly, inspect and interact with running computations at fine granularity, and manage heterogeneous resources in a resource-aware way. For the past few years, we have been building Texera, an open-source system to support collaborative data science using GUI-based workflows. This dissertation extends Texera with first-class support for user-defined functions (UDFs) and builds UDF-centric systems to meet these needs. We first present UDFlow, a framework for supporting UDFs in dataflow systems. It provides a unified API that supports tuple-, batch-, and table-oriented execution, enabling collaborators to express UDF logic at whatever granularity their task requires. The API is also expressive enough to handle UDFs with multiple input and output ports. It allows collaborators to use Python, R, Scala, and Java UDFs together in a single workflow. We discuss execution support for host-language UDFs as well as foreign-language UDFs (e.g., Python, R) that run in sidecar processes. We showcase the UDF UI and supporting services, which provide an IDE-like experience to ease UDF development. We then propose Udon, a novel UDF debugger that supports line-by-line debugging on dataflow systems. Udon allows users to set breakpoints, inspect code, and modify code while a UDF is executing, even on a single tuple. It includes a novel debug-aware UDF execution model to ensure the