Extracting and Comparing Concepts Emerging from Software Code, Documentation and Tests
Traceability in software engineering is the ability to connect artifacts that were built or designed at different points in time. Given the variety of tasks, tools and formats across the software lifecycle, an outstanding challenge for traceability studies is dealing with the heterogeneity of the artifacts, the links between them and the means of extracting each. Using a unified approach for extracting keywords from textual information, this paper compares the concepts extracted from three software artifacts of the same system: source code, documentation and tests. The objectives are to detect similarities among the emerging concepts, and to show the degree of alignment and synchronisation the artifacts possess. Using the components of three projects from the Apache Software Foundation, this paper extracts the concepts from 'base' source code, documentation, and tests (kept separate from the source code). The extraction is based on the keywords present in each artifact; we then run multiple comparisons (calculating cosine similarities on features extracted with word embeddings) to detect how far the sets of concepts are similar or overlap. For similarities between code and tests, we found that pre-trained language models of increasing dimension and corpus size correlate with higher similarity magnitudes, with higher averages and smaller ranges. Pre-trained FastText embeddings scored the highest average, 97.33%, with the lowest range, 21.8, across all projects. Our approach was also able to quickly detect outliers, possibly indicating traceability drift within modules. For similarities involving documentation, scores dropped considerably compared to the per-module code-test similarities, falling to below 5%.
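As a minimal sketch of the comparison step described above, the snippet below averages pre-trained FastText embeddings over each artifact's keyword set and computes the cosine similarity between the resulting concept vectors. The gensim model name and the keyword lists are illustrative assumptions, not details taken from the paper itself.

```python
import numpy as np
import gensim.downloader as api

# Assumed model: gensim's 300-dimensional FastText vectors trained on
# Wikipedia + news; the paper's exact pre-trained model may differ.
model = api.load("fasttext-wiki-news-subwords-300")

def concept_vector(keywords):
    """Average the embeddings of the keywords extracted from one artifact."""
    vectors = [model[w] for w in keywords if w in model]
    return np.mean(vectors, axis=0) if vectors else None

def cosine_similarity(a, b):
    """Cosine similarity between two concept vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical keyword sets for one module's code and its tests.
code_keywords = ["parser", "token", "stream", "buffer"]
test_keywords = ["parse", "tokenize", "input", "assert"]

similarity = cosine_similarity(concept_vector(code_keywords),
                               concept_vector(test_keywords))
print(f"code-tests concept similarity: {similarity:.2%}")
```

Averaging keyword embeddings into a single vector per artifact is one simple way to realise the comparison; per-module scores computed this way can then be aggregated to obtain the averages and ranges reported above.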