How to protect sensitive data when building a digital solution based on LLMs? – PII anonymization case study

Przemysław Gacka

Companies dream of using powerful AI data processing to acquire more clients, provide better customer service, and much more. But they are also wary of AI-related data privacy risks and compliance requirements. As a result, many withhold or limit the scope of their AI initiatives. But what if we told you that you can have this cake and eat it, too? Our client protected its data while cutting up to 95% of document processing time with AI.

It seems like all we hear about is AI. Yet, according to Boston Consulting Group, 74% of companies struggle with AI adoption.

Our experience tells us that businesses may limit the scope of AI projects because they desire full data integrity.

Learn:

  • how you can secure your sensitive data as you tap into the potential of Large Language Models like those from OpenAI,
  • how we implemented full data anonymization for a client who sped up document processing by up to 95% with an OpenAI-based OCR solution.

It all started with an effort to help admins improve their productivity during the customer onboarding process.

How far can you push productivity without AI?

For 5 years, we’d been working with a UK company that develops pension dashboards. Each employed Brit could use a dashboard to view all of the retirement pension plans they paid into during their professional career.

To onboard each individual, admins had to manually input dozens of documents, losing time on analyzing records and typing.

But once the client acquired a pension document provider with a customer base of their own (e.g., an insurance provider), they needed to onboard thousands of such individuals at once!

The client’s ability to grow was tied to how fast they could process documents. As a result, they searched for different ways to boost their efficiency.

During our cooperation, we helped the client cut the onboarding time of big business clients from 3 months to 3 days by:

  • introducing new document templates,
  • improving integration with third parties through APIs to obtain some data automatically.

But the drive for efficiency continued.

Cutting onboarding time with AI… and then what?

Soon, we started talking about how Artificial Intelligence could help process document data even faster to limit manual labor even more.

We created a Serverless application that combines Optical Character Recognition with an LLM to extract specific fields from documents. But there was a catch – the LLM couldn’t have access to users’ personal or sensitive data. A dealbreaker?

The MVP processed a document in 1 minute and 40 seconds, compared with the 15 minutes the same document took to process manually.

But if we ever wanted the solution to go live, we needed to figure out an efficient and scalable way to protect all the Personally Identifiable Information (PII).

Data anonymization for our client

PII is any information that can be used to identify a specific individual. There are many types of PII, but some of the most common include:

  • date of birth,
  • home address,
  • phone number,
  • credit card number,
  • biometric data (e.g., fingerprints or palm prints),
  • medical records.

When you anonymize a piece of data, you remove all identifiers that could be used to associate a person with the rest of the record, such as a monetary value or an insurance provider’s name.

To strengthen your anonymization effort, you might also mask specific characters or words by replacing them with others.

After you complete all the steps to anonymize your data, you can send it for processing to an LLM.
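The replacement idea can be sketched in a few lines of Python. This is a minimal illustration, not the client's implementation: the function names, the placeholder format, and the step of restoring the original values after the LLM responds are all assumptions made for the example.

```python
def anonymize(text: str, pii_values: list[tuple[str, str]]) -> tuple[str, dict[str, str]]:
    """Swap each (label, value) pair for a numbered placeholder.

    Returns the masked text plus a lookup table that never leaves your system.
    Assumes no PII value is a substring of another one.
    """
    mapping: dict[str, str] = {}
    for index, (label, value) in enumerate(pii_values, start=1):
        placeholder = f"[{label}_{index}]"
        text = text.replace(value, placeholder)
        mapping[placeholder] = value
    return text, mapping


def restore(text: str, mapping: dict[str, str]) -> str:
    """Put the original values back into text that contains placeholders."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text


masked, lookup = anonymize(
    "Jane Doe, born 04/07/1975, lives at 1 High Street.",
    [("NAME", "Jane Doe"), ("DATE_OF_BIRTH", "04/07/1975")],
)
print(masked)
```

Only the masked text would be sent to the LLM; the lookup table stays on your side, so the model sees the structure of the document without the values that identify a person.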

The basic idea is not hard, but when your app generates tons of records, data anonymization requires careful planning and testing. It will be different for each application or feature whose data you want to anonymize.

Data anonymization technologies

These were some of our key technology picks for the anonymization work:

Python & Serverless

The basic OCR solution was a Serverless app written in Python leveraging AWS Step Functions & Lambdas.

GPT-4o mini

It’s one of the OpenAI LLMs. We chose it as the processing solution’s engine after we considered the speed and cost of processing.

AWS & REST microservice

The whole data anonymization functionality could be organized as a separate, dedicated Python Flask microservice that exposes an anonymization endpoint, hosted on AWS and managed with AWS App Runner.

spaCy

We also chose spaCy, a Python library for Natural Language Processing.

Let’s take a closer look at the exact data anonymization process.

Implementing data anonymization

By looking at how we implemented data anonymization, you’ll see how ensuring data protection fits into the larger process of building an AI feature.

  1. We identified the PII data that required anonymization

There are many types of documents that need processing. They may share some document fields but also have unique ones. The most common data types we chose to anonymize included first name, middle name, last name, date of birth, and National Insurance number.

  2. We defined and recognized data patterns

To make sure that the OCR solution knew where the PII was, we used the following steps:

  • text identification to detect and isolate text regions within an image,
  • image processing to improve the quality of scanned documents to boost recognition capability,
  • character classification to map characters and words to their corresponding alphanumeric or symbolic values.

That’s already the base for an anonymization solution, but we needed to improve it further.
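Once OCR has turned a scan into text, recognizing a data pattern can be as simple as a regular expression per data type. The sketch below is only an illustration under our own assumptions: the pattern shapes are simplified (the official National Insurance number rules exclude some prefixes that this pattern accepts), and the names `PiiSpan`, `PII_PATTERNS`, and `locate_pii` are hypothetical.

```python
import re
from typing import NamedTuple


class PiiSpan(NamedTuple):
    label: str
    start: int
    end: int
    value: str


# Simplified, illustrative shapes; real documents need patterns tuned to them.
PII_PATTERNS = {
    "NATIONAL_INSURANCE_NUMBER": re.compile(
        r"\b[A-Z]{2}\s?\d{2}\s?\d{2}\s?\d{2}\s?[A-D]\b"
    ),
    "DATE_OF_BIRTH": re.compile(r"\b\d{1,2}[/.-]\d{1,2}[/.-]\d{4}\b"),
}


def locate_pii(text: str) -> list[PiiSpan]:
    """Return every pattern match with its character offsets, in reading order."""
    spans = [
        PiiSpan(label, match.start(), match.end(), match.group(0))
        for label, pattern in PII_PATTERNS.items()
        for match in pattern.finditer(text)
    ]
    return sorted(spans, key=lambda span: span.start)


for span in locate_pii("NI number: QQ 12 34 56 C, DOB: 01/02/1980"):
    print(span.label, span.value)
```

Returning offsets rather than just the matched strings makes it possible to mask each occurrence in place later on.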

  3. We built up the anonymization capability for each data type individually

We developed a Named Entity Recognition (NER) model to handle each data type differently, thus improving overall data processing quality. Some tools make this task a lot easier. For example, the aforementioned spaCy library helped us recognize various named entities or data types, such as a person, a country, a nationality, or a book title.
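To give a feel for how spaCy reports named entities, here is a minimal sketch. It assumes the pre-trained `en_core_web_sm` English pipeline rather than the custom model described above, and the helper names are hypothetical. In spaCy's English pipelines, `PERSON`, `GPE`, `NORP`, and `WORK_OF_ART` are the labels that roughly correspond to a person, a country, a nationality, and a book title.

```python
# Labels of the pre-trained English pipelines that match the entity kinds above.
LABELS_OF_INTEREST = {"PERSON", "GPE", "NORP", "WORK_OF_ART"}


def find_entities(text: str, model_name: str = "en_core_web_sm") -> list[tuple[int, int, str]]:
    """Run spaCy NER and return (start_char, end_char, label) for the labels of interest."""
    # Imported here so that redact() below can be used without spaCy installed.
    import spacy

    # A real service would load the pipeline once at startup, not on every call.
    nlp = spacy.load(model_name)
    return [
        (ent.start_char, ent.end_char, ent.label_)
        for ent in nlp(text).ents
        if ent.label_ in LABELS_OF_INTEREST
    ]


def redact(text: str, entities: list[tuple[int, int, str]]) -> str:
    """Replace each entity span with its label, right to left so offsets stay valid."""
    for start, end, label in sorted(entities, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


# Entity offsets written by hand here; find_entities() would normally supply them.
print(redact("Jane Doe is British.", [(0, 8, "PERSON"), (12, 19, "NORP")]))
```

What the pre-trained pipeline actually detects depends on the model and the text, which is one reason a model tailored to your own documents may be worth the effort.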

Then, we created a generalized algorithm that distinguishes between data types and an individual anonymization module for each type.
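One way to structure such a design is a registry that maps each data type to its own anonymization function. The following is a sketch under our own assumptions, not the production code: the decorator, the data type names, and the masking rules are all hypothetical.

```python
from typing import Callable

Anonymizer = Callable[[str], str]
ANONYMIZERS: dict[str, Anonymizer] = {}


def anonymizer_for(data_type: str) -> Callable[[Anonymizer], Anonymizer]:
    """Register a function as the anonymization module for one data type."""
    def register(func: Anonymizer) -> Anonymizer:
        ANONYMIZERS[data_type] = func
        return func
    return register


@anonymizer_for("date_of_birth")
def mask_date(value: str) -> str:
    # Keep separators and length so the value still looks like a date.
    return "".join("0" if ch.isdigit() else ch for ch in value)


@anonymizer_for("national_insurance_number")
def mask_nino(value: str) -> str:
    # Letters become X and digits become 0; spacing is preserved.
    return "".join(
        "X" if ch.isalpha() else "0" if ch.isdigit() else ch for ch in value
    )


def anonymize_field(data_type: str, value: str) -> str:
    """Dispatch to the module registered for the type; refuse unknown types."""
    module = ANONYMIZERS.get(data_type)
    if module is None:
        raise ValueError(f"no anonymization module for {data_type!r}")
    return module(value)


print(anonymize_field("date_of_birth", "04/07/1975"))
```

With this shape, supporting a new data type means writing one more decorated function; the dispatching code does not change. Refusing unknown types, instead of passing them through untouched, keeps an unhandled field from reaching the LLM by accident.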

Our data anonymization service was now complete, but there were still a couple of steps to clear before it was ready to serve the client and its users.

  4. We integrated the anonymization service into the client’s app

To allow the Serverless OCR application to communicate with the anonymization service, we used a REST API.
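From the caller's side, such an integration might look like the sketch below, which uses only the Python standard library. The endpoint URL, the `{"text": ...}` payload, and the function names are assumptions for illustration; the article does not describe the actual API contract.

```python
import json
import urllib.request

# Hypothetical endpoint; the real service URL and payload shape are not public.
ANONYMIZER_URL = "https://anonymizer.example.com/anonymize"


def build_anonymize_request(text: str, url: str = ANONYMIZER_URL) -> urllib.request.Request:
    """Prepare a JSON POST request carrying the text to anonymize."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def anonymize_remotely(text: str, timeout: float = 10.0) -> dict:
    """Send the text to the anonymization service and return its JSON response."""
    request = build_anonymize_request(text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

A Lambda in the OCR workflow could call `anonymize_remotely` right after text extraction, so that only the response of the anonymization service is ever forwarded to the LLM.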

  5. We conducted thorough end-to-end testing of the anonymization process

We performed testing iteratively as we moved the data anonymization feature through the MVP phase toward a production-ready solution. To facilitate testing and observability, we set up monitoring.
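One check that fits naturally into end-to-end tests is a leak detector: given the PII values of a test document, verify that none of them survives in the text bound for the LLM. The helper below is a hypothetical sketch of that idea, not a description of the test suite used in the project.

```python
import re


def _normalize(text: str) -> str:
    """Drop whitespace and case so 'QQ 12 34 56 C' and 'qq123456c' compare equal."""
    return re.sub(r"\s+", "", text).casefold()


def find_leaks(outgoing_text: str, known_pii: list[str]) -> list[str]:
    """Return every known PII value that still appears in the outgoing text."""
    haystack = _normalize(outgoing_text)
    return [value for value in known_pii if _normalize(value) in haystack]


leaks = find_leaks(
    "Member [NAME_1], NI no. qq 12 34 56 c",
    ["Jane Doe", "QQ123456C"],
)
print(leaks)
```

An empty result is what a test would assert. The same function could also feed a monitoring alert, though that use is our suggestion rather than something the project description covers.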

  6. Deploy!

The anonymization solution went live.

So, what did we achieve here?

Deliverables – technology & business

From a technological standpoint, the client received:

  • An efficient and safe OCR solution

The document processing application was capable of automatically parsing a document in under a minute. The first PoC extracted 15-20 document fields in 40 seconds without ever exposing sensitive PII to the LLM.

  • Built-in scalability

Business requirements could evolve and change the structure and sheer quantity of documents in the future. Because we built a generalized process for identifying different data types, we were able to add new data types simply by creating new anonymization modules.

These technological achievements allowed the client to:

  • Improve customer onboarding speed

The anonymization feature ensured the client could fast-track document processing for client onboarding without putting sensitive PII data at risk.

  • Find a positive attitude about AI

This was the client’s first AI project, and they approached it with a sense of responsibility for their customers’ data. In the process of implementing it, they didn’t need to deny themselves the full potential of AI. They gained the right knowledge and attitude to tackle even more ambitious AI-based projects in the future.

Truly, the drive for efficiency never ends, but it can also benefit the customers if you take security precautions.

Don’t be the last to see the full potential of AI

You may be afraid of endangering your sensitive information during AI development. It’s a major challenge to AI innovation.

In any organization, you’ll find people who will rightfully point out this danger to you.

There are already companies that have done their homework and realized that they can upgrade their best business use cases with AI without endangering their data. They have the knowledge, facts, and experience to relieve internal doubts and champion AI initiatives.

Our work on the data anonymization tool helped the client validate an AI-driven product idea securely. If your company doesn’t want to be among the last ones to experiment with AI, you may want to acquire developers experienced with anonymized data and data anonymization techniques.

If you have skilled data security and AI experts on your side who can safeguard a hugely successful AI initiative from data integrity issues, you can grow your business faster. Your experts will custom-build a protection mechanism as you play to your strengths with AI.

And if your team wants to consult on AI adoption, consider trying our workshop.

The GenAI Rapid Prototyping Sprint™ is a 2-day AI workshop that will help you quickly discover how to use AI models to generate business value.
