R&D to ROI. Custom OCR dilemma solved by validating business decisions in only 2 months

Adriana Kowalczyk

Faced with the crucial decision of either extending collaboration with their existing service provider or venturing into uncharted territory with a custom OCR solution, our client embarked on a transformative R&D journey. In just two months, the company validated its assumptions and reached an informed, cost-efficient business decision.

Business challenge: to custom, or not to custom

The client (under NDA) deals with many paper documents that end users scan and digitize via a mobile app. Unfortunately, some documents are of very poor quality: photos taken with bad cameras, unreadable handwriting, etc. Extracting information from them is therefore extremely difficult; sometimes even humans struggle to decipher them, let alone machines.

To organize this process, it is necessary to extract and classify various data types (names, surnames, dates, currencies, products, companies, and, most importantly, costs and taxes) from receipts, bills, invoices, and transportation tickets.

The company had to decide whether to extend collaboration with the existing service provider, which would entail committing to a year-long contract, or to develop a custom solution (including infrastructure) with a fixed cost, independent of user traffic volume.

The main goal of R&D was to verify ASAP whether it was worth proceeding with the custom solution. Would the business be able to spend less, earn more, and scale better? Or would off-the-shelf software be the better choice?

OCR automates the reading and processing of document data

Did this receipt go through the washing machine, or was it slowly crumpled in a pocket?

The challenge was processing multiple types of input with completely different data structures:

  • handwritten receipts,
  • photos of printed, system-generated receipts,
  • the same photos, but taken in difficult conditions, at the wrong angle, etc.,
  • PDFs with structured information.

Manual intervention is needed when errors occur, like misreading numbers (10,000€ instead of 100,00€) or misplacing tax values. Manual checks and corrections add extra costs, slowing employees down and taking up time they could spend on more crucial tasks.

Example of a handwritten receipt for a group taxi order

Currently, the company uses a decent external tool. Still, the pricing model depends on how often its API is used, which turned out to be too expensive in the long term.

OCR is a complicated technology, but we took up the challenge of quickly designing something cheap but equally effective.  

If the metrics showed that the custom solution was no better, we would stop further R&D.

Research & development hypothesis

The TSH R&D team, composed of software architects, concurrently tackles research projects across diverse industries. This approach enables us to test different solutions to common problems early on, swiftly identifying viable options. Before committing to a project, we routinely conduct smaller proofs of concept to assess its feasibility and avoid wasting time and resources.

Given how rapidly the options evolve, this kind of validation is pivotal in AI, ML, and data management. What was impossible six months ago may be a feasible advantage today.

Failing to invest in R&D and verification risks being tied to outdated technologies, leading to substantial costs for future rebuilding.

Our NDA client didn’t hire us for simple research

They wanted to know if it was possible to implement their solution under specific custom conditions and to solve the problem fast (due to time constraints) without investing in a considerable team. The primary requirement? The new custom solution must perform at least as well as the current one.

The company initially assumed that using the custom solution could be cheaper. 

Even before we started, we had our suspicions about what was (not) possible. Their current external tool has an entire dedicated company behind it, making it challenging to match or surpass within tight deadlines and budget limits.

The success criteria:

  • successful digital photo processing,
  • compatible with all popular payroll systems,
  • maximum cost efficiency via the control algorithm,
  • built-in metric estimations to compare with the current off-the-shelf solution,
  • information processed in a few seconds (a requirement drawn from customers’ feedback).

So, to solve this problem, we asked three questions:

  1. How do we lower the costs of the NDA solution? 
  2. What advantages and disadvantages will a new system bring? 
  3. Will the management board turn it into real profit? 

First ideas. Research phase

Firstly, we focused on the technology stack – what programming language and tools we should use. 

Is it worth focusing on Node.js/JavaScript or Python? 

Python offers many ready-made open-source implementations related to data management, artificial intelligence, machine learning, and computer vision, so we chose it.

Research and code with SageMaker Studio Lab

SageMaker Studio Lab is an AWS tool for quick prototyping and experimentation based on Jupyter Notebooks. It provides free runtime (several hours a day) with both CPU and GPU. This is important because graphics-related operations are orders of magnitude faster on the GPU (at no additional cost!).

The idea was that if we proved that the custom solution was viable, the code could be used to build a real API, which then would be implemented in the client’s infrastructure.

We planned to:

  • improve image quality with Pillow and OpenCV,
  • extract text from images using TesseractOCR, KerasOCR, PaddleOCR, or EasyOCR (depending on which tool performed better in our tests),
  • extract information from the text using simple regular expressions (regexes), then spaCy NLP and SpanMarker NER.
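
As a rough sketch of how these three stages could chain together, the snippet below uses the libraries named above; the file name, helper names, and the regex are our illustration, not the project’s actual code.

```python
# A rough sketch of the planned three-stage flow (illustrative names).
import re

from PIL import Image, ImageOps
import pytesseract

def improve_quality(img: Image.Image) -> Image.Image:
    # Phase 1: Pillow-based cleanup; grayscale + autocontrast as a stand-in.
    return ImageOps.autocontrast(ImageOps.grayscale(img))

def extract_text(img: Image.Image) -> str:
    # Phase 2: TesseractOCR via the pytesseract wrapper.
    return pytesseract.image_to_string(img)

def extract_info(text: str) -> list[str]:
    # Phase 3: simple regexes first; spaCy NLP / SpanMarker NER later.
    return re.findall(r"\d+[.,]\d{2}", text)

print(extract_info(extract_text(improve_quality(Image.open("receipt.jpg")))))
```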

During the research phase, we abandoned the following ideas:

  • Extracting information from the text using an LLM that would be hosted on our infrastructure (NOT ChatGPT – the data is too sensitive). It turned out that free and open-source local LLMs (e.g., GPT4All) give poor results, work slowly, and require enormous computing power.
  • The AI/ML model LayoutLM – in addition to extracting text from images, it also locates elements in the image and performs classification on this basis. Much larger players use it when developing other solutions of this type. Unfortunately, since it is a more complicated solution, we couldn’t test it properly within the R&D timeframe. Training the ML network with data from the client would be required to increase its effectiveness. Generally, we feel this may be a promising direction.

We save money for our clients even before writing a single line of code

Chosen solution – Optical Character Recognition (OCR)

We used methods and algorithms from computer vision (e.g., OCR, image quality improvement) and text processing.

Our solution was to divide the whole problem into three phases: 

  1. Improving the quality of images.
  2. Extracting text from images.
  3. Extracting information from text.

This allowed us to work on each phase independently and measure its effects. In case of problems, only one part of the process needs replacing.

For this purpose, we built a pipeline that made it easy to replace the strategies implemented at each stage. It also accelerated our capacity to compare results among different approaches.

At the beginning of the pipeline, we chose the OCR library, the directory to process, where saved files should go, and other options (incl. cropping and resizing).

Running OCR was just one milestone in our journey. Each step had its own logic, and the proper strategy was chosen based on the configuration. Running specific notebooks via Jupyter Notebooks was the easiest approach for us.
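
To illustrate the swappable-strategy idea, here is a hypothetical sketch; the registry names and config keys are our own invention, not the notebook code itself.

```python
# A hypothetical sketch of keeping each pipeline stage swappable.
from typing import Callable, Dict

import cv2
import numpy as np
import pytesseract

Step = Callable[[np.ndarray], np.ndarray]

def to_grayscale(img: np.ndarray) -> np.ndarray:
    return cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

def otsu_threshold(img: np.ndarray) -> np.ndarray:
    _, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary

PREPROCESSING: Dict[str, Step] = {"grayscale": to_grayscale, "otsu": otsu_threshold}
OCR_ENGINES: Dict[str, Callable[[np.ndarray], str]] = {
    "tesseract": pytesseract.image_to_string,
}

def run_pipeline(image: np.ndarray, config: dict) -> str:
    # Each stage is looked up by name, so a single config change swaps
    # the strategy without touching the rest of the pipeline.
    for name in config["preprocessing"]:
        image = PREPROCESSING[name](image)
    return OCR_ENGINES[config["ocr"]](image)

# Example configuration: grayscale + Otsu binarization, then Tesseract.
result = run_pipeline(cv2.imread("receipt.jpg"),
                      {"preprocessing": ["grayscale", "otsu"], "ocr": "tesseract"})
```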

1. Improving the quality of images

The better the image quality, the more accurate the text extraction.

To achieve this goal, we employed a multi-step image transformation. Initially, we converted images to grayscale to eliminate unnecessary colors, followed by a thresholding algorithm for creating binary images. Proper thresholding is crucial, as incorrect values result in data loss due to overly bright or dark images.

The standard global thresholding algorithm proved ineffective for images with high noise or uneven lighting, making it challenging to choose an appropriate threshold value. A local thresholding technique addressed this issue by calculating threshold values for each pixel based on the mean and standard deviation of the surrounding pixels. We tested OpenCV’s Adaptive Thresholding, Otsu’s Binarization, and SciKit’s Niblack and Sauvola Thresholding, all of which produced superior results.
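
A minimal sketch of the comparison, assuming a grayscale uint8 input; the window size and k parameter are typical starting values rather than the project’s tuned settings.

```python
# Comparing global vs. local thresholding on a grayscale image.
import cv2
from skimage.filters import threshold_sauvola

gray = cv2.imread("receipt.jpg", cv2.IMREAD_GRAYSCALE)

# Global: Otsu picks a single threshold for the whole image.
_, otsu = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Local: OpenCV adaptive thresholding, per-neighbourhood mean.
adaptive = cv2.adaptiveThreshold(
    gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY, 31, 10
)

# Local: Sauvola uses the neighbourhood mean and standard deviation,
# which copes better with uneven lighting.
sauvola = ((gray > threshold_sauvola(gray, window_size=25, k=0.2)) * 255).astype("uint8")
```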

Cropping and de-skewing were significant steps in image transformation, enhancing further processing by removing unnecessary background elements and correcting angles. This was particularly crucial for improving OCR accuracy in challenging cases.
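
One common de-skewing recipe, sketched below with cv2.minAreaRect, estimates the tilt of the foreground pixels and rotates the image back; this is our assumption of the general approach, and OpenCV’s angle convention varies between versions, so the normalization may need adjusting.

```python
# De-skewing: estimate tilt from the foreground, then rotate back.
import cv2
import numpy as np

gray = cv2.imread("receipt.jpg", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Foreground pixel coordinates as (x, y) points for minAreaRect.
coords = np.column_stack(np.where(binary > 0)[::-1]).astype(np.float32)
angle = cv2.minAreaRect(coords)[-1]
if angle > 45:          # map the rect angle to a small correction angle
    angle -= 90

h, w = gray.shape
rotation = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
deskewed = cv2.warpAffine(gray, rotation, (w, h),
                          flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)
```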

Additional experiments focused on enhancing contrast, brightness, noise reduction, blurring, and sharpening. While these steps provided minor accuracy improvements based on the source image, they contributed to overall image quality.
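
These experiments can be approximated with Pillow’s ImageEnhance and ImageFilter modules; the enhancement factors below are illustrative starting points, not the values we settled on.

```python
# Contrast, brightness, denoising, and sharpening with Pillow.
from PIL import Image, ImageEnhance, ImageFilter

img = Image.open("receipt.jpg")
img = ImageEnhance.Contrast(img).enhance(1.5)    # boost contrast
img = ImageEnhance.Brightness(img).enhance(1.1)  # slightly brighten
img = img.filter(ImageFilter.MedianFilter(3))    # reduce salt-and-pepper noise
img = img.filter(ImageFilter.SHARPEN)            # sharpen edges
img.save("receipt_enhanced.jpg")
```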

Each transformation step generated and saved intermediate images, allowing visual assessment of results and configuration adjustment. 

Structured PDF files required no enhancements, as data could be directly extracted using the PyPDF library. This streamlined the process, ensuring faster execution and higher data quality. Alternatively, rendering PDF files to images for OCR was considered but deemed suboptimal.
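
A minimal sketch of direct extraction with the pypdf library (the file name is illustrative):

```python
# Structured PDFs: read the embedded text directly, no OCR needed.
from pypdf import PdfReader

reader = PdfReader("invoice.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(text)
```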

Optical Character Recognition (OCR) metrics

Before cropping, our OCR times hovered around 0.87 seconds. Once we snipped away the unnecessary bits, they went down to 0.60 seconds, a dazzling 31% boost! Don’t underestimate the power of a trim. It might sound like a minor improvement, but in OCR land, every pixel counts.

Now, let’s talk accuracy. We’re rocking 85% to 90% accuracy with machine-written invoices or top-notch photos. However, when we step into the murky waters of low-quality files, accuracy drops to just over 60%.

Processing images step by step

Source image before processing.
The same image after applying local thresholding algorithms.
Applying the edge detection algorithm to detect receipt contours.
The next step is to compute a hull mask to extract the final image.
Final result – a cropped image.

2. Extracting text from images

We initially experimented with various OCR tools such as TesseractOCR, KerasOCR, PaddleOCR, and EasyOCR to extract text from photos. Most OCRs function automatically with minimal configuration adjustments. After evaluating their performance, we found that TesseractOCR offered the best balance of speed and quality. Consequently, we continued using TesseractOCR for the remainder of our Proof of Concept.
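
The comparison boiled down to running the same images through each engine and measuring speed and quality; the sketch below shows the kind of side-by-side timing check this involves, with an illustrative file name.

```python
# Timing the same image through two OCR engines side by side.
import time

import easyocr
import pytesseract

def timed(label: str, fn):
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.2f}s")
    return result

tesseract_text = timed("tesseract",
                       lambda: pytesseract.image_to_string("receipt.jpg"))

reader = easyocr.Reader(["en"])          # model download happens on first run
easyocr_text = timed("easyocr",
                     lambda: "\n".join(reader.readtext("receipt.jpg", detail=0)))
```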

Extracting text from handwriting requires a separate approach and strategy due to its complexity. 

Handwritten receipts posed a distinct challenge, as their complexity exceeded what conventional OCR tools could handle. After multiple attempts, we determined that training our own model was necessary for improved accuracy. Yet a new problem emerged: configuring our pipeline to identify handwritten receipts and decide when to switch to a different model.

To address this, we simultaneously passed images through multiple models and selected the best results. While this approach eliminates the need to manually check if the image contains handwritten text, it demands significantly more processing power. Unfortunately, time constraints prevented us from fully implementing this solution, and it remains a challenge we hope to tackle in the future.
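
A sketch of that idea is below; the thread-based fan-out is our illustration, and the selection heuristic (longest text wins) is a placeholder for a real confidence-based criterion.

```python
# Run several OCR models at once and keep the "best" output.
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

def best_of_all(image_path: str,
                engines: Dict[str, Callable[[str], str]]) -> str:
    # Run every engine on the same image in parallel threads.
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        futures = {name: pool.submit(fn, image_path)
                   for name, fn in engines.items()}
        results = {name: future.result() for name, future in futures.items()}
    # Placeholder: keep the longest extracted text. A real system would
    # score results by model confidence or downstream validation.
    return max(results.values(), key=len)

# Usage (assuming tesseract_ocr / handwriting_ocr wrappers exist):
# text = best_of_all("receipt.jpg",
#                    {"printed": tesseract_ocr, "handwritten": handwriting_ocr})
```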

3. Extracting information from text

Due to time constraints, we implemented basic data extraction using regular expressions, resulting in a comprehensive interface linking documents and entities.
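
A minimal sketch of this regex-based extraction; the patterns are illustrative and would need tuning per document type and locale.

```python
# Basic field extraction from OCR output with regular expressions.
import re

TOTAL = re.compile(r"(?:total|suma)\s*[:=]?\s*(\d+[.,]\d{2})", re.IGNORECASE)
DATE = re.compile(r"\b(\d{2}[./-]\d{2}[./-]\d{4})\b")

def extract_fields(text: str) -> dict:
    total = TOTAL.search(text)
    date = DATE.search(text)
    return {"total": total.group(1) if total else None,
            "date": date.group(1) if date else None}

print(extract_fields("Receipt 12/03/2024 ... TOTAL: 149,99"))
```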

For future exploration, we are considering options like spaCy NLP and SpanMarker NER.
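
A hedged sketch of what the spaCy follow-up could look like (it assumes the small English model has been downloaded with `python -m spacy download en_core_web_sm`):

```python
# Named-entity recognition over extracted text with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Invoice from Acme GmbH, issued 12 March 2024, total 149.99 EUR.")
for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. ORG, DATE, MONEY
```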

A recent find is the Azure OpenAI Service, featuring GPT-4, a top-tier language model. Its advantage lies in Microsoft’s infrastructure, offering secure integration for companies handling sensitive data, like our partner. GPT-4 excels in data understanding and reasoning, making it a promising choice for extracting data from receipts. However, drawbacks include cost and longer processing times. Given the dynamic nature of the LLM market, these factors may change at any moment.

Hypothesis busted

Our R&D did not confirm NDA’s hypothesis that a custom solution would be cheaper at this product stage. 

Why? In short, this R&D assumed specific deadlines and the current business scale; the hypothesis could still be valid under different conditions.

We combined existing models, algorithms, and concepts from past projects, complemented by specific resources for this research. Integrating these foundational elements creates a functional solution. 

Our proof-of-concept solution performed well in a two-month timeframe. While there is room for improvement, we now have a clear direction and numerous ideas for further development.

The final decision

Our partner transitioned to Azure AI Document Intelligence

The company already operated on Microsoft cloud services and was ready to go. Moreover, Microsoft extends preferential conditions to startups, which brings significant financial savings. Despite a similar pricing model, the overall cost is lower than that of competitors.

Our DevOps team estimated the Azure AI costs at around 92€ monthly for two environments.

The implementation will be efficiently executed with minimal changes to code and infrastructure.

Only 2 months from start to finish

In the dynamic landscape of AI, investing in Research & Development is both a quick way to verify your hypotheses and a proactive strategy to stay ahead of the curve. It’s a shame, but most software development companies lack internal R&D departments with decisive and knowledgeable people who can help in just a few days.

Sometimes, an initial idea may not be THE idea, but that doesn’t mean the time spent verifying it was wasted. On the contrary, spending a bit extra on proper data-driven R&D can save you many more resources in the future!

The value lies in making the correct short and long-term business decisions.

For their R&D, our partner needed an affordable team, up to date with tech solutions and business opportunities, that works efficiently. They had no time to dilly-dally: the deadline was tight, with no wiggle room for delays.

The collaboration was seamlessly managed, sparing the client from direct involvement. The R&D was conducted by a dedicated software architect with over a decade of experience and the competencies, knowledge, and agility to fit their project.

This approach proved cost-effective and time-efficient for the client. Bi-weekly meetings with our partner’s team allowed us to present advancements and deliberate on future steps. 

Our R&D team will tell you honestly if your idea is viable

Azure has its perks, but custom builds can scale better and pay off more in the long term. Every project is unique. Our architects are here to assess what’s right for you.
