A Comparative Analysis of Python Text Matching Libraries: A Multilingual Evaluation of Capabilities, Performance and Resource Utilization
Python text-matching libraries have become essential tools in data cleaning and natural language processing, yet their performance, accuracy, and resource efficiency in multilingual scenarios have not been thoroughly examined. This study evaluates five major libraries (FuzzyWuzzy, RapidFuzz, Difflib, Levenshtein, and Jellyfish) on a dataset of 50,000 test cases in English, Spanish, French, German, and Italian. We introduce controlled variations in text complexity, error type, and string length, and measure processing speed, matching accuracy, and resource consumption. The results reveal significant performance differences among the libraries. RapidFuzz processes text 40% faster than the other libraries while keeping memory usage low, although its performance varies with language and error type. Levenshtein achieves higher accuracy when handling non-Latin characters, while FuzzyWuzzy performs consistently well across different text lengths. Difflib, despite being part of the Python standard library, runs slower and consumes more resources. Jellyfish specializes in phonetic matching but struggles with long text inputs. Memory usage ranges from 20 MB to 200 MB for identical workloads, revealing substantial efficiency differences. These findings help developers select the most suitable library for their specific needs and computational constraints. The study also contributes a standardized evaluation framework and a multilingual benchmarking dataset, enabling researchers to compare text-matching methods more effectively, and offers a practical guide to the key performance trade-offs in real-world applications.
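To make the three measured quantities concrete, the following is a minimal sketch of how speed, peak memory, and accuracy might be collected for one library. It is not the study's evaluation framework: the `benchmark` helper, the 0.75 threshold, and the four hand-made test cases are assumptions chosen for illustration, and only the standard-library `difflib` is used so that the snippet runs without extra dependencies. Note that `tracemalloc` tracks allocations made through Python's memory allocator, so it may under-report memory used inside C extensions.

```python
import difflib
import time
import tracemalloc


def difflib_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1] from the standard library's SequenceMatcher."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def benchmark(scorer, cases, threshold=0.75):
    """Run scorer over (a, b, is_match) cases.

    Hypothetical helper: reports wall-clock time, peak traced memory,
    and the fraction of cases where (score >= threshold) equals the label.
    """
    tracemalloc.start()
    start = time.perf_counter()
    correct = 0
    for a, b, is_match in cases:
        predicted = scorer(a, b) >= threshold
        correct += predicted == is_match
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "seconds": elapsed,
        "peak_bytes": peak,
        "accuracy": correct / len(cases),
    }


# Illustrative cases only; these are not taken from the study's dataset.
CASES = [
    ("información", "informacion", True),
    ("Straße", "Strasse", True),
    ("bibliothèque", "bibliotheque", True),
    ("colour", "kitchen", False),
]

result = benchmark(difflib_ratio, CASES)
print(result)
```

Any other scorer with the same `(a, b) -> float` shape could be passed to `benchmark` in place of `difflib_ratio`, provided its score is first scaled to the 0 to 1 range, which is how a side-by-side comparison of several libraries on identical workloads could be set up.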