Result: Improving Document Digitization with Machine Learning-Based OCR

Title:
Improving Document Digitization with Machine Learning-Based OCR
Source:
International Journal on Science and Technology. 16
Publisher Information:
International Research Publication and Journals, 2025.
Publication Year:
2025
Document Type:
Journal Article
ISSN:
2229-7677
DOI:
10.71097/ijsat.v16.i1.1890
Accession Number:
edsair.doi...........46d610eb0ead66ee80a560a51207f49a
Database:
OpenAIRE

Further Information

In today’s digital era, the extraction of text from unstructured formats such as images, PDFs, and handwritten documents is critical for digitization and automation. Traditional methods often struggle with scalability, complex layouts, and multi-language support. This project addresses these challenges by combining machine learning, optical character recognition (OCR), the AWS Textract model, and a microservices architecture to create a robust, scalable, and efficient text extraction system. The proposed solution uses Java Spring Boot for backend development, PostgreSQL for secure data storage, and containerized microservices for modularity and scalability. The system preprocesses input to improve image quality and employs deep learning algorithms for accurate text recognition. Parallel processing and task queuing ensure high throughput and low latency for both real-time and bulk operations. By converting unstructured data into structured formats such as JSON or CSV, the system integrates readily into existing workflows. This study presents the design, functionality, and benefits of this approach to text extraction and its contribution to efficiency in document management and automation.
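
The abstract's description of task queuing, parallel processing, and structured output can be illustrated with a minimal sketch. It is written in Python for brevity and is not the authors' Java Spring Boot implementation. The function `recognize_text` is a hypothetical placeholder for the OCR step (which the abstract attributes to AWS Textract and deep learning models); here it only echoes text supplied with each sample document so that the example runs offline. The names `process_batch`, `to_json`, and `to_csv`, the CSV column layout, and the sample documents are likewise assumptions made for illustration.

```python
import csv
import io
import json
from concurrent.futures import ThreadPoolExecutor


def recognize_text(document):
    """Hypothetical stand-in for the OCR step.

    A real system would send the document image to an OCR engine here.
    This placeholder returns the lines supplied with the sample document.
    """
    return {"id": document["id"], "lines": list(document["lines"])}


def process_batch(documents, max_workers=4):
    """Queue documents on a thread pool and collect results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of the input documents.
        return list(pool.map(recognize_text, documents))


def to_json(results):
    """Serialize the recognized text as a JSON array of documents."""
    return json.dumps(results, ensure_ascii=False)


def to_csv(results):
    """Write one CSV row per recognized line: document id, line number, text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["document_id", "line_number", "text"])
    for result in results:
        for number, text in enumerate(result["lines"], start=1):
            writer.writerow([result["id"], number, text])
    return buffer.getvalue()


# Sample input (invented for this sketch, not taken from the study).
documents = [
    {"id": "invoice-1", "lines": ["Invoice 1001", "Total: 42.00"]},
    {"id": "letter-7", "lines": ["Dear customer,"]},
]
results = process_batch(documents)
json_output = to_json(results)
csv_output = to_csv(results)
```

A thread pool suits this kind of workload when the recognition step mostly waits on a remote service. A deployment along the lines the abstract describes would more likely place a message queue between separate services, which this single-process sketch does not attempt to model.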