Fogger: Open-source tool for GDPR-friendly data masking

As a software developer, you like to focus on software development. Unfortunately, nowadays, you also need to struggle with a bunch of data privacy-related stuff – even in the staging environment. It would be nice to automate some tasks with a free tool for data masking, right? Well, here it is.

Problems with masking sensitive data

When a new application is being developed, we developers need a set of data to work with. Usually, this comes from fixtures: randomly generated data that tries to mimic the real world. But then the application is deployed to production, and it turns out that real-life data isn’t so pure and simple.

Real-life users can create things that no developer has ever dreamt of. We need a copy of this data in our development environment.

But that’s not all! We have a new kid on the block, a hot topic (at least here, in Europe): the General Data Protection Regulation, or GDPR. Now, you cannot simply take the data, put it on your development machine and play with it. You need to make sure that no sensitive information (like names, emails or credit card numbers) is compromised.

There is a plethora of tools that can help you with that, usually tailored to big corporate projects, but the most common solution among startups and SMEs is a custom-made export script. Such a script masks the data in the database, replacing sensitive information with safe, randomly generated substitutions. But developing it is problematic: it requires time and effort (and therefore money), and it’s cumbersome, boring and prone to errors. And whenever the schema changes, the script needs to be updated.

Improve data masking with Fogger

We struggled with writing export scripts at The Software House for quite a while. But finally we said: “No more, let’s prepare a generic solution”, a tool that would be able to mask any schema with just a little configuration. And that’s how Fogger was born.

How does it work? What makes it so neat?

Fogger starts by analysing your database schema and prepares a boilerplate configuration file for you.

This file is basically a list of all the tables and columns with masking strategy definitions. The strategies are blank at first: you need to fill in the desired masking strategy next to each column containing sensitive data. For example, assigning the faker strategy with the safeEmail method to an email column would replace all the emails with random ones (using example.com or similar domains).
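To make the idea of a generated boilerplate more concrete, here is a minimal Python sketch of the same approach against an in-memory SQLite database. It is not Fogger's code or its real file format: the `build_boilerplate` function and the `maskStrategy` and `options` key names are assumptions made up for this illustration.

```python
import sqlite3


def build_boilerplate(conn):
    """List every table and column with a blank masking strategy (hypothetical layout)."""
    config = {"tables": {}}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk)
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        config["tables"][table] = {
            "columns": {
                col[1]: {"maskStrategy": None, "options": {}} for col in columns
            }
        }
    return config


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")

config = build_boilerplate(conn)

# The developer then fills in a strategy only for the sensitive columns.
config["tables"]["users"]["columns"]["email"] = {
    "maskStrategy": "faker",
    "options": {"method": "safeEmail"},
}
```

Columns that keep a blank strategy, such as `id` here, would simply be copied as they are.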

What’s more, Fogger will read metadata from column comments. So, for example, if you put fogger::faker{method: "safeEmail"} in a column’s comment during development, the boilerplate will already have the strategy filled in. This way, you can define how your data should be masked right from the beginning of the development process, long before the time comes to mask it.
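As an illustration of how such a comment could be turned into a strategy definition, here is a small Python sketch. The parsing rules are an assumption based only on the single example above (a strategy name after `fogger::`, followed by optional `key: value` pairs in braces); the function name and the returned dictionary layout are hypothetical, and Fogger's own parser may well behave differently.

```python
import re

# Strategy name after "fogger::", then an optional {key: value, ...} block.
COMMENT_PATTERN = re.compile(r"fogger::(?P<strategy>\w+)(?:\{(?P<options>[^}]*)\})?")


def parse_column_comment(comment):
    """Return a strategy definition found in a column comment, or None."""
    match = COMMENT_PATTERN.search(comment or "")
    if match is None:
        return None
    options = {}
    raw = match.group("options") or ""
    # Naive split: good enough for simple pairs, breaks on values containing commas.
    for pair in filter(None, (part.strip() for part in raw.split(","))):
        key, _, value = pair.partition(":")
        options[key.strip()] = value.strip().strip("\"'")
    return {"maskStrategy": match.group("strategy"), "options": options}


email_rule = parse_column_comment('fogger::faker{method: "safeEmail"}')
bare_rule = parse_column_comment("fogger::starify")
no_rule = parse_column_comment("primary contact address")
```

Here `email_rule` carries the `faker` strategy with its `method` option, `bare_rule` has an empty options dictionary, and `no_rule` is `None` because the comment contains no marker.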

The available masking strategies are starify, hashify and faker. The last one is especially great, as it uses the powerful fzaninotto/Faker library with all its methods.
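Judging by their names, starify replaces a value with asterisks and hashify replaces it with a hash. The Python sketch below shows what such strategies could look like; the default length, the choice of SHA-256 and the `salt` parameter are assumptions for illustration, not Fogger's documented behaviour.

```python
import hashlib


def starify(value, length=10):
    """Hide the value behind a fixed-length run of asterisks (length is an assumption)."""
    return "*" * length


def hashify(value, salt=""):
    """Replace the value with a hex digest; equal inputs always give equal outputs."""
    return hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()


masked_name = starify("Jane Doe")
masked_card = hashify("4111111111111111", salt="staging-only")
```

A fixed star length avoids leaking how long the original value was. Hashing keeps equal values equal after masking, but a hash of low-entropy data without a secret salt can be reversed by brute force, so it is better suited to values that are hard to guess.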

Fogger masks data in a consistent manner. For example, when a random value is saved in place of a real email address, the substitution is kept in the cache for future reference. If Fogger then finds the same email again during masking (be it in the same table or not), it replaces it with the same substitution. And when a masked column is part of a foreign key constraint, all the other columns that are part of that constraint are masked too.
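The caching idea can be sketched in a few lines of Python. This is only an illustration: a plain dictionary stands in for the cache, and the `ConsistentMasker` class, the sample tables and the sequential fake emails are all made up for the example.

```python
import itertools


class ConsistentMasker:
    """Give every distinct original value exactly one substitution."""

    def __init__(self, generate):
        self._generate = generate  # callable producing a fresh fake value
        self._cache = {}           # original value -> substitution

    def mask(self, value):
        if value not in self._cache:
            self._cache[value] = self._generate()
        return self._cache[value]


counter = itertools.count(1)
masker = ConsistentMasker(lambda: f"user{next(counter)}@example.com")

users = [{"email": "anna@corp.test"}, {"email": "bob@corp.test"}]
orders = [{"customer_email": "anna@corp.test"}]

masked_users = [{"email": masker.mask(u["email"])} for u in users]
masked_orders = [{"customer_email": masker.mask(o["customer_email"])} for o in orders]
```

Anna's address appears in both tables and receives the same substitution in each, so rows that matched on the email before masking still match afterwards.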

Subsetting and excluding tables

In addition to data masking, you can define subsetting strategies for tables. If, for example, a table has millions of records and you’re interested in only a few thousand rows, you can subset the table with one of the available strategies: head, tail and range. Head and tail give you records from the beginning or the end of the table, respectively. Range lets you filter the table by the values of any column (e.g. a date column, to get only rows from October to December).
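Expressed as SQL, the three strategies could look roughly like the queries built by the Python sketch below. The `subset_query` helper, its parameters and the sample `events` table are assumptions for illustration; the article does not show how Fogger builds its own queries.

```python
import sqlite3


def subset_query(table, strategy, *, order_by="id", limit=None,
                 column=None, low=None, high=None):
    """Build a SELECT for one subsetting strategy (identifiers must be trusted)."""
    if strategy == "head":
        return f"SELECT * FROM {table} ORDER BY {order_by} ASC LIMIT ?", (limit,)
    if strategy == "tail":
        return f"SELECT * FROM {table} ORDER BY {order_by} DESC LIMIT ?", (limit,)
    if strategy == "range":
        sql = f"SELECT * FROM {table} WHERE {column} BETWEEN ? AND ? ORDER BY {order_by}"
        return sql, (low, high)
    raise ValueError(f"unknown subsetting strategy: {strategy}")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, created_at TEXT)")
# One row per month of 2018, with ISO dates so string comparison sorts correctly.
conn.executemany(
    "INSERT INTO events (id, created_at) VALUES (?, ?)",
    [(month, f"2018-{month:02d}-15") for month in range(1, 13)],
)


def ids(strategy, **options):
    sql, params = subset_query("events", strategy, **options)
    return [row[0] for row in conn.execute(sql, params)]


head_ids = ids("head", limit=3)
tail_ids = ids("tail", limit=3)
autumn_ids = ids("range", column="created_at", low="2018-10-01", high="2018-12-31")
```

With this sample data, head and tail return the first and last three rows by `id`, and the range query keeps only the rows dated October to December.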

Last but not least, you can exclude whole tables. If your database contains tables with data that you don’t need – for example, log tables – you can exclude them. Fogger will copy the table’s schema, but not the data.

Usually, when subsetting and excluding, you can easily corrupt your database by removing entries that are referred to in other tables through foreign key constraints. But don’t worry – Fogger will refine the database at the end and put the constraints back in place, so the resulting database will be clean and consistent.
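The article does not describe how this refinement works internally. One plausible cleanup step, sketched here in Python purely as an assumption, is to delete child rows whose parent row did not survive subsetting. The `remove_orphans` helper and the two sample tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY);
CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER);
-- Pretend subsetting kept only users 1 and 2; order 12 still points at user 3.
INSERT INTO users (id) VALUES (1), (2);
INSERT INTO orders (id, user_id) VALUES (10, 1), (11, 2), (12, 3), (13, NULL);
""")


def remove_orphans(conn, child, fk_column, parent, parent_key="id"):
    """Delete child rows that reference a missing parent; return how many were removed."""
    cursor = conn.execute(
        f"DELETE FROM {child} "
        f"WHERE {fk_column} IS NOT NULL "
        f"AND {fk_column} NOT IN (SELECT {parent_key} FROM {parent})"
    )
    return cursor.rowcount


removed = remove_orphans(conn, "orders", "user_id", "users")
remaining = [row[0] for row in conn.execute("SELECT id FROM orders ORDER BY id")]
```

After this step every remaining `orders.user_id` is either NULL or points at an existing user, so a foreign key constraint on that column can be restored without errors.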

Fogger is not a standalone, do-everything tool. It needs access to Redis, which provides both the cache mentioned earlier and queueing (on Redis lists). The queueing is necessary to process chunks of a database using workers. It also enables horizontal scaling, as you can run multiple workers in parallel.
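The chunk-and-worker pattern can be sketched without Redis at all. In the Python example below, the standard library's `queue.Queue` and a few threads stand in for a Redis list and separate worker processes; the chunk size, the function names and the fake "masking" step are assumptions for illustration only.

```python
import queue
import threading


def chunk_ranges(total_rows, chunk_size):
    """Split row offsets 0..total_rows into half-open (start, end) chunks."""
    return [
        (start, min(start + chunk_size, total_rows))
        for start in range(0, total_rows, chunk_size)
    ]


def worker(tasks, results, lock):
    """Keep taking chunks off the shared queue until none are left."""
    while True:
        try:
            start, end = tasks.get_nowait()
        except queue.Empty:
            return
        masked = [f"row-{i}-masked" for i in range(start, end)]  # placeholder work
        with lock:
            results.extend(masked)


tasks = queue.Queue()
for chunk in chunk_ranges(total_rows=10, chunk_size=4):
    tasks.put(chunk)

results, lock = [], threading.Lock()
threads = [
    threading.Thread(target=worker, args=(tasks, results, lock)) for _ in range(3)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Because each chunk is taken from the queue by exactly one worker, adding more workers spreads the chunks across them without any row being processed twice. The order in which results arrive is not deterministic.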

Fogger is designed to run as a Docker container. You can easily integrate it with your infrastructure or run it separately, providing it with access to a database (or a dump of it).

Now, it’s your turn!

We know that masking sensitive data in databases is a real problem, experienced by many of you out there. Because of that, we’ve decided to share Fogger with the world as an open-source project. This way, everyone can benefit from it – including you. And we hope that you can help us make it better by contributing on GitHub.

Start data masking with Fogger

