Result: Detecting non-natural language artifacts for de-noising bug reports.

Title:

Detecting non-natural language artifacts for de-noising bug reports.

Authors:

Hirsch T; Institute of Software Technology, Graz University of Technology, Inffeldgasse 16b, 8010 Graz, Austria., Hofer B; Institute of Software Technology, Graz University of Technology, Inffeldgasse 16b, 8010 Graz, Austria.

Source:

Automated software engineering [Autom Softw Eng] 2022; Vol. 29 (2), pp. 52. Date of Electronic Publication: 2022 Aug 24.

Publication Type:

Journal Article

Language:

English

Journal Info:

Publisher: Springer Netherlands Country of Publication: Netherlands NLM ID: 9918453686306676 Publication Model: Print-Electronic Cited Medium: Internet ISSN: 1573-7535 (Electronic) Linking ISSN: 09288910 NLM ISO Abbreviation: Autom Softw Eng Subsets: PubMed not MEDLINE

Imprint Name(s):

Publication: <2009-> : [Dordrecht] : Springer Netherlands
Original Publication: [Dordrecht] : Kluwer Academic Publishers

References:

Sensors (Basel). 2019 Jul 05;19(13):. (PMID: 31284398)
Biometrics. 1977 Mar;33(1):159-74. (PMID: 843571)

Contributed Indexing:

Keywords: Artifact removal; Bug reports; Data cleaning; De-noising; Issue tickets; NLP

Entry Date(s):

Date Created: 20220906 Latest Revision: 20221101

Update Code:

20250114

PubMed Central ID:

PMC9439617

DOI:

10.1007/s10515-022-00350-0

PMID:

36065351

Database:

MEDLINE

More Information

Textual documents produced in the software engineering process are a popular target for natural language processing (NLP) and information retrieval (IR) approaches. However, issue tickets often contain artifacts such as code snippets, log outputs, and stack traces. These artifacts not only inflate issue ticket sizes; the noise can also pose a real problem for some NLP approaches and therefore has to be removed during pre-processing. In this paper, we present a machine-learning-based approach to classify textual content into natural language and non-natural language artifacts at line level. We show how data from GitHub issue trackers can be used for automated training set generation, and present a custom preprocessing approach for the task of artifact removal. The training sets are automatically created from Markdown-annotated issue tickets and project documentation files. We use these generated training sets to train a Markdown-agnostic model that is able to classify un-annotated content. We evaluate our approach on issue tickets from projects written in C++, Java, JavaScript, PHP, and Python. Our approach achieves ROC-AUC scores between 0.92 and 0.96 for language-specific models. A multi-language model trained on the issue tickets of all languages achieves ROC-AUC scores between 0.92 and 0.95. The provided models are intended to be used as noise reduction pre-processing steps for NLP and IR approaches working on issue tickets.
(© The Author(s) 2022.)
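
The abstract describes generating line-level training data from Markdown-annotated issue tickets. A minimal sketch of that idea follows, assuming that lines inside fenced code blocks count as artifacts and all other non-empty lines count as natural language. The function name `label_lines`, the 0/1 label encoding, and the choice to drop the fence markers (so that a downstream model does not learn the Markdown itself) are illustrative assumptions, not the authors' published implementation.

```python
from typing import List, Tuple

# Markdown code fence marker, built programmatically for readability.
FENCE = "`" * 3


def label_lines(markdown_text: str) -> List[Tuple[str, int]]:
    """Return (line, label) pairs: 1 = artifact, 0 = natural language.

    Assumption: fenced code blocks mark artifacts. Fence marker lines
    and blank lines are dropped from the output.
    """
    labeled: List[Tuple[str, int]] = []
    in_fence = False
    for line in markdown_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(FENCE):
            # Toggle state and skip the marker so the output carries
            # no Markdown annotation.
            in_fence = not in_fence
            continue
        if not stripped:
            continue
        labeled.append((line, 1 if in_fence else 0))
    return labeled


if __name__ == "__main__":
    # Hypothetical issue ticket text used only for illustration.
    ticket = (
        "The app crashes on start.\n"
        + FENCE + "java\n"
        + 'Exception in thread "main"\n'
        + "  at Foo.bar(Foo.java:1)\n"
        + FENCE + "\n"
        + "Please fix."
    )
    for text, label in label_lines(ticket):
        print(label, text)
```

The resulting pairs could then feed any line-level text classifier; the feature extraction and model used in the paper are not specified in this record.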

Conflict of interest: The work described in this paper has been funded by the Austrian Science Fund (FWF): P 32653-N (Automated Debugging in Use). The authors have no other relevant financial or non-financial interests to disclose. The authors have no competing interests.
