Result: Static Analysis of Data Transformations in Jupyter Notebooks

Title:
Static Analysis of Data Transformations in Jupyter Notebooks
Contributors:
Corvallis Srl, Institut Polytechnique de Paris (IP Paris), Département d'informatique - ENS-PSL (DI-ENS), École normale supérieure - Paris (ENS-PSL), Université Paris Sciences et Lettres (PSL)-Université Paris Sciences et Lettres (PSL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS), Analyse Statique par Interprétation Abstraite (ANTIQUE), Université Paris Sciences et Lettres (PSL)-Université Paris Sciences et Lettres (PSL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-École normale supérieure - Paris (ENS-PSL), Université Paris Sciences et Lettres (PSL)-Université Paris Sciences et Lettres (PSL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Centre Inria de Paris, Institut National de Recherche en Informatique et en Automatique (Inria)
Source:
12th ACM SIGPLAN International Workshop on the State Of the Art in Program Analysis (SOAP 2023). :8-13
Publisher Information:
CCSD; ACM, 2023.
Publication Year:
2023
Collection:
collection:ENS-PARIS
collection:CNRS
collection:INRIA
collection:INRIA-ROCQ
collection:TESTALAIN1
collection:INRIA2
collection:PSL
collection:INRIA-PSL
collection:IP_PARIS
collection:ENS-PSL
collection:DIENS
Original Identifier:
HAL: hal-04249950
Document Type:
Conference conferenceObject
Conference papers
Language:
English
Relation:
info:eu-repo/semantics/altIdentifier/doi/10.1145/3589250.3596145
DOI:
10.1145/3589250.3596145
Rights:
info:eu-repo/semantics/OpenAccess
URL: http://creativecommons.org/licenses/by/
Accession Number:
edshal.hal.04249950v1
Database:
HAL

Further Information

Jupyter notebooks used to pre-process and polish raw data for data science and machine learning processes are challenging to analyze. Their data-centric code manipulates dataframes through calls to library functions with complex semantics, and the properties to track over these dataframes vary widely depending on the verification task. This paper presents a novel abstract domain that simplifies writing analyses for such programs by extracting from the notebook a unique control-flow graph (CFG) that contains all transformations applied to the data. Several properties can then be determined by analyzing this CFG, which is simpler than the original Python code. We present a first use case that exploits our analysis to infer the required shape of the dataframes manipulated by the notebook.

CCS Concepts: • Theory of computation → Program analysis; Abstraction; • Software and its engineering → Automated static analysis.
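
To make the "required shape" use case in the abstract concrete, the following is a minimal sketch and not the paper's abstract domain or CFG extraction. It assumes straight-line notebook code, a single dataframe variable named `df`, columns accessed only as string-literal subscripts, and Python 3.9 or later (for the `ast.Subscript` layout). The names `required_columns` and `_column` are hypothetical helpers introduced for illustration. The sketch reports the columns a notebook reads before defining them, which is one simple reading of the shape the input data must have.

```python
import ast


def _column(node, frame):
    """Return the column name if node has the form frame["name"], else None."""
    if (isinstance(node, ast.Subscript)
            and isinstance(node.value, ast.Name)
            and node.value.id == frame
            and isinstance(node.slice, ast.Constant)
            and isinstance(node.slice.value, str)):
        return node.slice.value
    return None


def required_columns(cells, frame="df"):
    """Collect columns of `frame` that are read before the notebook writes them.

    Assumption: cells run top to bottom and contain no branches or loops,
    so top-level statements can be visited in order.
    """
    required, defined = set(), set()
    tree = ast.parse("\n".join(cells))
    for stmt in tree.body:
        # Columns created by this statement, e.g. df["total"] = ...
        written = set()
        if isinstance(stmt, ast.Assign):
            for target in stmt.targets:
                col = _column(target, frame)
                if col is not None:
                    written.add(col)
        # Columns read anywhere in the statement (Load context only).
        for node in ast.walk(stmt):
            col = _column(node, frame)
            if col is not None and isinstance(node.ctx, ast.Load):
                if col not in defined:
                    required.add(col)
        # A column becomes available only after the statement completes.
        defined |= written
    return required


# Hypothetical notebook cells, given as source strings.
cells = [
    'import pandas as pd\ndf = pd.read_csv("data.csv")',
    'df["total"] = df["price"] * df["qty"]',
    'df["share"] = df["total"] / df["total"].sum()',
]
print(sorted(required_columns(cells)))  # → ['price', 'qty']
```

Here `total` is not reported because the notebook itself creates it before reading it, so only `price` and `qty` must be present in the input file. A real analysis would additionally need to handle control flow, aliasing, and the semantics of library calls, which is the difficulty the abstract describes.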