Result: Data Lineage Analysis for PySpark and Python ORM Libraries ; Analýza datových toků pro PySpark a ORM knihovny jazyka Python

Title:

Data Lineage Analysis for PySpark and Python ORM Libraries ; Analýza datových toků pro PySpark a ORM knihovny jazyka Python

Authors:

Jurčo, Andrej

Contributors:

Parízek, Pavel; Škoda, Petr

Publisher Information:

Univerzita Karlova, Matematicko-fyzikální fakulta

Publication Year:

2023

Collection:

Charles University: CU Digital repository / Univerzita Karlova: Digitální repozitář UK

Document Type:

Dissertation thesis

File Description:

application/pdf; application/zip

Language:

English

Relation:

http://hdl.handle.net/20.500.11956/181592; 247480

Availability:

https://hdl.handle.net/20.500.11956/181592

Accession Number:

edsbas.D455595C

Database:

BASE

More Information

In the world of ETL tools and data processing, Python is one of the main languages used in practice. Python scripts that define data manipulations usually rely on the same framework, PySpark, the Python API for Spark, alongside database libraries and their ORM features. These ORM features work in a similar way in most of the relevant libraries. MANTA Flow, a highly automated data lineage analysis tool, was recently extended with a Python language scanner, and that scanner is now being extended to support more commonly used frameworks. In this work, we analyzed the PySpark library and the SQLAlchemy ORM technology in order to extend MANTA's Python scanner with support for these two frequently used tools. In the case of the PySpark library, we designed and implemented the core of a plugin for the Python scanner that supports elementary functionality. The plugin is capable of analyzing the various DataFrame input and output options available in PySpark for both file and database data sources, and it is able to propagate data flows through transformations with a reasonable level of over-approximation, as demonstrated in the work. In the case of the SQLAlchemy ORM, we designed a solution that would allow the scanner to analyze the ORM source code and its core could be used to. ; Vo svete ETL nástrojov a spracovania dát je Python jedným z najčastejšie používaných jazykov. Skripty napísané v jazyku Python, ktoré definujú manipuláciu s dátami, zvyčajne používajú rovnakú knižnicu, PySpark, čo je Python API pre framework Spark, spoločne s databázovými knižnicami, využívajúc ich ORM funkcionalitu. Táto funkcionalita zvyčajne funguje podobným spôsobom vo väčšine relevantných knižníc. Nedávno bol MANTA Flow, vysoko automatizovaný nástroj na analýzu data lineage, rozšírený o skener jazyka Python a teraz je vo fáze rozširovania o podporu bežných frameworkov.
V tejto práci sme analyzovali knižnicu PySpark a technológiu SQLAlchemy ORM s cieľom rozšíriť Python skener firmy MANTA o podporu ...
