Optimizing LLMs for Microservices Logs Analysis through Prompt Engineering

Title:
Optimizing LLMs for Microservices Logs Analysis through Prompt Engineering
Authors:
Publisher Information:
Uppsala universitet, Institutionen för informatik och media
Publication Year:
2025
Collection:
Uppsala University: Publications (DiVA)
Document Type:
Bachelor thesis
File Description:
application/pdf
Language:
English
Rights:
info:eu-repo/semantics/openAccess
Accession Number:
edsbas.FCE5148D
Database:
BASE

Further Information

The increasing complexity of cloud-native and microservices-based architectures has made log analysis a critical task for ensuring system reliability and operational efficiency. Traditional rule-based approaches to log analysis are limited in scalability and adaptability, particularly within the fast-paced environments of continuous integration and continuous deployment. In industrial settings such as Ericsson, microservices testing generates over 1,300 issue tickets each month, each linked to large volumes of distributed logs. Manually inspecting and diagnosing these logs is time-consuming and resource-intensive, creating a significant operational burden. Recent advances in large language models have introduced promising opportunities for automating log interpretation, summarization, and failure diagnosis. This study investigates the impact of prompt engineering on the performance of large language models in summarizing microservices logs, with a specific focus on Ericsson’s continuous integration and deployment pipeline and logs produced by the Java Common Auto Tester framework. Prompt engineering is defined here as the strategic design of input text to guide a language model toward producing accurate and contextually relevant outputs. Six prompting strategies were evaluated, including minimal instruction, role reframing, few-shot prompting, and chain-of-thought prompting, using real issue tickets and their associated log data. To evaluate performance, the research combined automatic text comparison methods, such as Recall-Oriented Understudy for Gisting Evaluation (ROUGE) and semantic similarity scoring, with expert feedback from engineers working in continuous integration and deployment. The results showed that minimal instruction and role reframing prompts produced the clearest, most relevant, and most diagnostically useful summaries, while few-shot and chain-of-thought approaches were less effective due to verbosity and limited generalization. This thesis contributes to ongoing research on the use of large language models in software ...
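
To make the named strategies concrete, the sketch below expresses four of them as prompt templates in Python. Nothing here is taken from the thesis itself: the wording of the prompts, the sample log line, and the few-shot exemplar are all illustrative assumptions.

    # Illustrative prompt templates for four of the six evaluated strategies.
    # The exact prompts used in the thesis are not reproduced in the abstract,
    # so every string below is a hypothetical stand-in.
    LOG_EXCERPT = "2025-03-14 10:02:11 ERROR OrderService - timeout calling payment-api"

    # Minimal instruction: a bare task description, no persona or examples.
    minimal_instruction = f"Summarize the following microservices log:\n{LOG_EXCERPT}"

    # Role reframing: cast the model as a domain expert before stating the task.
    role_reframing = (
        "You are a senior engineer reviewing CI/CD test logs at a telecom company.\n"
        f"Summarize the following log and state the likely failure cause:\n{LOG_EXCERPT}"
    )

    # Few-shot: prepend a worked log/summary pair before the real input.
    few_shot = (
        "Log: ERROR UserService - connection refused to db:5432\n"
        "Summary: UserService failed because the database was unreachable.\n\n"
        f"Log: {LOG_EXCERPT}\n"
        "Summary:"
    )

    # Chain-of-thought: ask for step-by-step reasoning before the diagnosis.
    chain_of_thought = (
        "Read the log below, reason step by step about which component failed "
        f"and why, then give a one-sentence diagnosis.\n{LOG_EXCERPT}"
    )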
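
The abstract also names the two automatic metrics used alongside expert feedback. A minimal sketch of how such a comparison could be computed is shown below, assuming the rouge-score and sentence-transformers Python packages; the thesis does not specify its actual tooling, so the package and model choices are assumptions.

    # Hypothetical evaluation of a generated summary against a reference,
    # using ROUGE overlap and embedding-based semantic similarity.
    from rouge_score import rouge_scorer
    from sentence_transformers import SentenceTransformer, util

    reference = "The test failed because OrderService timed out calling payment-api."
    candidate = "OrderService could not reach payment-api and the request timed out."

    # ROUGE-1 / ROUGE-L: unigram and longest-common-subsequence overlap.
    scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
    rouge = scorer.score(reference, candidate)
    print("ROUGE-1 F1:", round(rouge["rouge1"].fmeasure, 3))
    print("ROUGE-L F1:", round(rouge["rougeL"].fmeasure, 3))

    # Semantic similarity: cosine similarity between sentence embeddings.
    model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
    emb = model.encode([reference, candidate], convert_to_tensor=True)
    print("Cosine similarity:", round(util.cos_sim(emb[0], emb[1]).item(), 3))

Higher overlap and similarity scores indicate a generated summary closer to the reference, which is why the study pairs these automatic measures with engineer feedback when judging diagnostic usefulness.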