Automated retrieval of enterprise URLs for official statistics: Comparing machine learning and generative AI approaches.
The identification of official enterprise websites is increasingly important for enhancing statistical registers and integrating digital data sources into official statistics. However, manual URL retrieval is labor-intensive and infeasible at large scale. This study compares two types of automated approaches for retrieving business URLs: a machine learning pipeline developed primarily in Python, and a pipeline relying on generative AI models (ChatGPT, Gemini, Perplexity, You.com, Meta's Llama 3.1 8B). Using a stratified sample of 500 enterprises from the Italian population of the survey on ICT usage in enterprises, we benchmark these approaches against manually verified URLs serving as a gold standard. The results are evaluated primarily in terms of accuracy in recovering the correct URLs, but attention is also given to computational and operational costs, to provide practical insights for potential implementation within statistical production processes. The machine learning approach achieves the highest accuracy but requires significant computational resources. Among the generative AI models, ChatGPT performs best, offering a more cost-effective alternative. Integration strategies, including majority voting and method hierarchies, are explored to improve performance. Depending on accuracy requirements and resource constraints, these approaches present viable options for enhancing business registers with web-based information. [ABSTRACT FROM AUTHOR]
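The abstract does not say how majority voting, method hierarchies, or the accuracy comparison are implemented, so the following is only a minimal sketch of one plausible reading, not the authors' pipeline. It assumes each method returns at most one candidate URL per enterprise, that two URLs count as the same website when their host names match after dropping a leading "www.", and that ties in the vote are broken by a priority order of methods. The function names, the method labels ("ml", "chatgpt", "gemini") and the example.com/example.org addresses are all hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def normalize(url):
    """Reduce a URL to a bare lowercase host (assumed matching rule)."""
    if not url or not url.strip():
        return None
    url = url.strip()
    if "://" not in url:
        url = "//" + url  # lets urlsplit read scheme-less text as a host
    host = (urlsplit(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host or None


def hierarchy(candidates, priority):
    """Return the answer of the most trusted method that produced one."""
    for method in priority:
        host = normalize(candidates.get(method))
        if host:
            return host
    return None


def majority_vote(candidates, priority):
    """Pick the host proposed by most methods; break ties by priority."""
    hosts = {m: normalize(u) for m, u in candidates.items()}
    counts = Counter(h for h in hosts.values() if h)
    if not counts:
        return None
    top = max(counts.values())
    tied = {h for h, c in counts.items() if c == top}
    if len(tied) == 1:
        return next(iter(tied))
    for method in priority:  # hierarchy used only as a tie-breaker
        if hosts.get(method) in tied:
            return hosts[method]
    return None  # tie among methods that are not in the priority list


def accuracy(predicted, gold):
    """Share of enterprises whose predicted host equals the gold-standard host."""
    hits = sum(1 for key, url in gold.items() if predicted.get(key) == normalize(url))
    return hits / len(gold)


# Hypothetical candidates for a single enterprise.
candidates = {
    "ml": "https://www.example.com/",
    "chatgpt": "example.com",
    "gemini": "http://example.org",
}
priority = ["ml", "chatgpt", "gemini"]
voted = majority_vote(candidates, priority)
```

In this sketch the hierarchy doubles as the tie-breaker for the vote, which is a design choice made for illustration; the two strategies could equally be evaluated separately against the gold standard with `accuracy`.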