14 February 2025
How to protect sensitive data when building a digital solution based on LLMs? – PII anonymization case study
Companies dream of using powerful AI data processing to acquire more clients, provide better customer service, and much more. But they are also wary of AI-related data privacy risks and compliance requirements. As a result, many withhold or limit the scope of their AI initiatives. What if we told you that you can have your cake and eat it, too? Our client protected its data while cutting document processing time by up to 95% with AI.
It seems like all we hear about is AI. Yet, according to Boston Consulting Group, 74% of companies struggle with AI adoption.
Our experience tells us that businesses often limit the scope of AI projects because they don’t want to put data privacy at risk.
Learn:
- how you can secure your sensitive data as you tap into the potential of Large Language Models like those from OpenAI,
- how we implemented full data anonymization for a client who sped up document processing by up to 95% with an OpenAI-based OCR solution.
It all started with an effort to help admins improve their productivity during the customer onboarding process.
How far can you push productivity without AI?
For 5 years, we’d been working with a UK company that develops pension dashboards. Each employed Brit could use a dashboard to view all of the retirement pension plans they paid into during their professional career.
To onboard each individual, admins had to manually input dozens of documents, losing time on analyzing records and typing.
But once the client acquired a pension document provider with a customer base of their own (e.g., an insurance provider), they needed to onboard thousands of such individuals at once!
The client’s ability to grow was tied to how fast they could process documents. As a result, they searched for different ways to boost their efficiency.
During our cooperation, we helped the client cut the onboarding time of big business clients from 3 months to 3 days by:
- introducing new document templates,
- improving integration with third parties through APIs to obtain some data automatically.
But the drive for efficiency continued.
Cutting onboarding time with AI… and then what?
Soon, we started talking about how Artificial Intelligence could process document data even faster and further reduce manual labor.
We created a Serverless application that combines Optical Character Recognition with an LLM to extract specific fields from documents. But there was a catch: the LLM couldn’t have access to users’ personal or sensitive data. A dealbreaker?
The MVP processed a document in 1 minute and 40 seconds, a task that would take 15 minutes of manual work.
But if we ever wanted the solution to go live, we needed to figure out an efficient and scalable way to protect all the Personally Identifiable Information (PII).
Data anonymization for our client
PII is any type of information that can be used to identify a specific individual. There are many types of PII, but some of the most common include:
- date of birth,
- home address,
- phone number,
- credit card number,
- biometric data (e.g., fingerprints or palm prints),
- medical records.
When you anonymize a piece of data, you remove all identifiers that can be used to associate a person with the rest of the record, such as a monetary value or an insurance provider’s name.
To strengthen your anonymization effort, you might also mask specific characters or words by replacing them with others.
After you complete all the steps to anonymize your data, you can send it to an LLM for processing.
The basic idea is not hard, but when your app generates tons of records, data anonymization requires careful planning and testing. It will be different for each application or feature you want to anonymize.
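The replace-then-send idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the client's implementation: the function names, the placeholder format, and the sample record are all made up for the example, and the `restore` step (mapping placeholders back to real values after the LLM responds) is our assumption about how such a flow could be closed.

```python
from typing import Dict, Tuple


def anonymize(text: str, pii: Dict[str, str]) -> Tuple[str, Dict[str, str]]:
    """Replace known PII values with labeled placeholders.

    `pii` maps a label (e.g. "NAME") to the value to hide. The returned
    mapping stays on your side and is never sent to the LLM.
    """
    mapping: Dict[str, str] = {}
    # Replace longer values first so a short value cannot split a longer one.
    for label, value in sorted(pii.items(), key=lambda item: len(item[1]), reverse=True):
        placeholder = f"[{label}]"
        if value in text:
            text = text.replace(value, placeholder)
            mapping[placeholder] = value
    return text, mapping


def restore(text: str, mapping: Dict[str, str]) -> str:
    """Put the original values back into text returned by the LLM."""
    for placeholder, value in mapping.items():
        text = text.replace(placeholder, value)
    return text


# Hypothetical sample record, invented for this example.
record = "Jane Doe, born 12/03/1975, holds a pension pot of GBP 48,200."
safe_record, mapping = anonymize(record, {"NAME": "Jane Doe", "DATE_OF_BIRTH": "12/03/1975"})
print(safe_record)  # [NAME], born [DATE_OF_BIRTH], holds a pension pot of GBP 48,200.
```

Only `safe_record` would leave your infrastructure. The monetary value survives, but nothing in the text ties it to a person.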

Data anonymization technologies
These were some of our key technology picks for the anonymization work:
Python & Serverless
The basic OCR solution was a Serverless app written in Python leveraging AWS Step Functions & Lambdas.
GPT-4o mini
It’s one of the OpenAI LLMs. We chose it as the processing solution’s engine after we considered the speed and cost of processing.
AWS & REST microservice
The whole data anonymization functionality could be organized as a separate, dedicated Python Flask microservice exposing an anonymization endpoint, hosted on AWS and managed with AWS App Runner.
We also chose the spaCy library, written in Python, for Natural Language Processing.
Let’s take a closer look at the exact data anonymization process.
Implementing data anonymization
By looking at how we implemented data anonymization, you’ll see how ensuring data protection fits into the larger process of building an AI feature.
- We identified the PII data that required anonymization
There are many types of documents that need processing. They may share some document fields but also have unique ones. Some of the most common data types we chose included first name, last name, middle name, date of birth, and National Insurance number.

- We defined and recognized data patterns
To make sure that the OCR solution knew where the PII data was, we used the following steps:
- text identification to detect and isolate text regions within an image,
- image processing to improve the quality of scanned documents to boost recognition capability,
- character classification to map characters and words to their corresponding alphanumeric or symbolic values.
That’s already the base for an anonymization solution, but we needed to improve it further.
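Once OCR has turned a scan into plain text, fixed-format fields can be located with patterns. The sketch below assumes regular expressions are used for this; the article does not say which technique the project relied on. The two patterns are deliberately simplified (the National Insurance pattern checks only the overall shape, not the official prefix rules), and `PII_PATTERNS` and `find_pii` are hypothetical names.

```python
import re

# Simplified, illustrative patterns; real documents need broader coverage.
PII_PATTERNS = {
    # UK National Insurance number: two letters, six digits, suffix A-D,
    # optionally written in space-separated pairs.
    "NATIONAL_INSURANCE_NUMBER": re.compile(
        r"\b[A-Z]{2}\s?\d{2}\s?\d{2}\s?\d{2}\s?[A-D]\b"
    ),
    # Dates written as DD/MM/YYYY.
    "DATE_OF_BIRTH": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
}


def find_pii(text):
    """Return (label, matched_text, start, end) tuples sorted by position."""
    hits = []
    for label, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group(), match.start(), match.end()))
    return sorted(hits, key=lambda hit: hit[2])


# Hypothetical OCR output, invented for this example.
for hit in find_pii("NI number: QQ 12 34 56 C, date of birth: 12/03/1975"):
    print(hit)
```

Patterns like these work for fields with a rigid format. Names and addresses have no fixed shape, which is where a Named Entity Recognition model becomes useful.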

- We built up the anonymization capability for each data type individually
We developed a Named Entity Recognition (NER) model to handle each data type differently, thus improving overall data processing quality. Some tools make this task a lot easier. For example, the aforementioned spaCy library helped us recognize various named entities or data types, such as a person, a country, a nationality, or a book title.
Then, we created a generalized algorithm that distinguishes between data types and an individual anonymization module for each type.
Our data anonymization service was now complete, but there were still a couple of steps to clear before it was ready to serve the client and its users.
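One way to picture "a generalized algorithm plus one module per data type" is a small registry that maps each entity label to its own anonymization function. The sketch below is our illustration under that assumption, not the project's code: the labels, the masking rules, and the `[REDACTED]` fallback are all hypothetical, and the entity spans are assumed to come from an upstream recognizer such as an NER model.

```python
import re
from typing import Callable, Dict, List, Tuple

# Registry mapping a data type label to its anonymization module.
ANONYMIZERS: Dict[str, Callable[[str], str]] = {}


def anonymizer(label: str):
    """Decorator that registers an anonymization module for one data type."""
    def register(func: Callable[[str], str]) -> Callable[[str], str]:
        ANONYMIZERS[label] = func
        return func
    return register


@anonymizer("PERSON")
def anonymize_person(value: str) -> str:
    return "[PERSON]"


@anonymizer("NATIONAL_INSURANCE_NUMBER")
def anonymize_ni_number(value: str) -> str:
    # Mask every letter and digit but keep the layout of the number.
    return re.sub(r"[A-Za-z0-9]", "X", value)


def redact(value: str) -> str:
    """Fallback for data types that have no dedicated module yet."""
    return "[REDACTED]"


def anonymize_entities(text: str, entities: List[Tuple[str, int, int]]) -> str:
    """Replace each (label, start, end) span using the module for its label."""
    # Work from the end of the text so earlier offsets stay valid.
    for label, start, end in sorted(entities, key=lambda entity: entity[1], reverse=True):
        handler = ANONYMIZERS.get(label, redact)
        text = text[:start] + handler(text[start:end]) + text[end:]
    return text


# Hypothetical input; spans are computed here instead of coming from a model.
text = "Jane Doe, NI number QQ 12 34 56 C"
ni = "QQ 12 34 56 C"
spans = [("PERSON", 0, 8), ("NATIONAL_INSURANCE_NUMBER", text.index(ni), text.index(ni) + len(ni))]
print(anonymize_entities(text, spans))  # [PERSON], NI number XX XX XX XX X
```

With this layout, supporting a new data type means writing one more decorated function. The dispatch loop does not change.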

- We integrated the anonymization service into the client’s app
To allow the Serverless OCR application to communicate with the anonymization service, we used a REST API.
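From the OCR application's side, such an integration can be as small as one HTTP call made before any text reaches the LLM. The sketch below uses only the Python standard library. The service URL, the `/anonymize` path, and the JSON field names `text` and `anonymized_text` are assumptions made for the example; the article does not describe the actual contract.

```python
import json
import urllib.request

# Hypothetical service URL and JSON contract; not taken from the project.
ANONYMIZER_URL = "https://anonymizer.example.com/anonymize"


def build_anonymize_request(text: str, url: str = ANONYMIZER_URL) -> urllib.request.Request:
    """Build the POST request that carries the raw OCR text to the service."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def anonymize_remote(text: str, url: str = ANONYMIZER_URL, timeout: float = 10.0) -> str:
    """Send text to the anonymization service and return the anonymized version."""
    request = build_anonymize_request(text, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.loads(response.read().decode("utf-8"))
    return payload["anonymized_text"]
```

Keeping request construction separate from sending makes the contract easy to test without a running service.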
- We conducted thorough end-to-end testing of the anonymization process
We performed testing iteratively as we moved the data anonymization feature through the MVP phase toward a production-ready solution. To facilitate testing and observability, we set up monitoring.
- Deploy!
The anonymization solution went live.
So, what did we achieve here?
Deliverables – technology & business
From a technological standpoint, the client received:
- An efficient and safe OCR solution
The document processing application was capable of automatically parsing a document in under a minute. The first PoC extracted 15-20 document fields in 40 seconds without ever exposing sensitive PII to the LLM.
- Built-in scalability
Business requirements could evolve and change the structure and sheer quantity of documents in the future. Because we built a generalized process for identifying different data types, we were able to add new data types simply by creating new anonymization modules.
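The timing figures quoted in this article can be related to the "up to 95%" claim with simple arithmetic. The short check below assumes the 15-minute manual baseline applies to the same kind of document in both measurements; `time_saved` is a helper name introduced only for this example.

```python
def time_saved(manual_seconds: float, automated_seconds: float) -> float:
    """Share of manual processing time removed by automation, in percent."""
    return (manual_seconds - automated_seconds) / manual_seconds * 100


manual = 15 * 60  # 15 minutes of manual work per document

print(round(time_saved(manual, 100), 1))  # MVP, 1 min 40 s: 88.9
print(round(time_saved(manual, 40), 1))   # first PoC, 40 s: 95.6
```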
These technological achievements allowed the client to:
- Improve customer onboarding speed
The anonymization feature ensured the client could fast-track document processing for client onboarding without putting sensitive PII data at risk.
- Develop a positive attitude toward AI
This was the client’s first AI project, and they approached it with a sense of responsibility for their clients’ data. In the process of implementing it, they didn’t need to deny themselves the full potential of AI. They gained the right knowledge and attitude to tackle even more amazing AI-based projects in the future.
Truly, the drive for efficiency never ends, but it can also benefit the customers if you take security precautions.

Don’t be the last to see the full potential of AI
You may be afraid of endangering your sensitive information during AI development. It’s a major challenge to AI innovation.
In any organization, you’ll find people who will rightfully point out this danger to you.
There are already companies who have done the homework and realized that they can upgrade their best business use cases with AI and never endanger data. They have the knowledge, facts, and experience to relieve internal doubts and champion AI initiatives.
Our work on the data anonymization tool helped the client validate an AI-driven product idea securely. If your company doesn’t want to be among the last ones to experiment with AI, you may want to acquire developers experienced with anonymized data and data anonymization techniques.
If you have skilled data security and AI experts on your side who can safeguard a hugely successful AI initiative from data privacy issues, you can grow your business faster. Your experts will custom-build a protection mechanism as you play to your strengths with AI.
And if your team wants to consult on AI adoption, consider trying our workshop.
The GenAI Rapid Prototyping Sprint™ is a 2-day AI workshop that will help you quickly discover how to use AI models to generate business value.