Building a Data Lake to safely derive insight from sensitive data

4 min read
January 9, 2025
November 10, 2022

Working with consumers in debt means we handle highly sensitive data. This includes names, addresses and contact details but, more crucially, financial and health-related data.

All companies have a responsibility to ensure they are protecting their customers’ data from hacks and leaks. However, given that a large proportion of our customers are in vulnerable circumstances, ensuring our data is as secure as possible is a top priority.

Our data is stored in a database in the cloud on AWS (Amazon Web Services) with the following security measures:

  • Encryption at rest - This protects against a physical attack. If someone broke into one of Amazon’s data centres and stole the hard drive storing our data they wouldn’t be able to read it.
  • Everything is in a VPC (Virtual Private Cloud) - No one can get access to our data via the internet unless they are part of our organisation.
  • Application Encryption - If someone does manage to log in through our organisation and tries to read the database, they won’t be able to decipher any information without an encryption key.

Unfortunately, security comes at a price...

...And that price is felt by the Data team. As a data-driven organisation, we leverage data for all of our decision-making. This includes deploying automated machine learning processes, providing live dashboards to our clients as well as deep dive analysis. Therefore, we need to be able to access up-to-date data without compromising its security.

A bad way to do this is to give members of the data team a read-only API key to the database. Data scientists can then download the data they need and build models or perform analysis. Unfortunately, this brings with it a huge wave of vulnerabilities:

  • Employees are responsible for handling the keys to the data - anyone with access to these can download the data
  • The data will likely be saved elsewhere unencrypted - meaning all sensitive information is in plain text
  • To make life easier the database might have to be accessible from outside the VPC

Better security measures can be enforced with a more structured ETL (extract, transform, load) process, but the same issues apply:

  • The platform’s security has to be weakened to allow for the ETL process to pull data
  • It is the responsibility of the data extractor to not pull sensitive information

Push, don’t pull

These issues can be avoided by pushing data out of the platform instead of pulling it. This allows for rigid control over what data is released from the database. Furthermore, since we are pushing data out of the platform there is no need for API keys and the database can stay closed off from the outside world.

This process can be achieved with AWS Kinesis and Kinesis Firehose. I like to think of Kinesis as a tunnel where data is thrown into one end and magically appears at the other end. From the platform, we push data into Kinesis and Firehose delivers it to AWS S3 as a JSON object. To ensure the data team have up-to-date data, whenever a record is created or updated in the database it is also pushed to Kinesis.

By pushing the data we have rigid control of the specific fields which are released and can also derive insight from sensitive fields. For example, we can create and push a field like has_email (to indicate if a customer has an associated email address or not) instead of releasing the email address itself.
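As a rough illustration of the push approach, here is a minimal Python sketch. The field names, the allowlist and both function names are assumptions made up for this example and are not taken from our platform code. The only real API used is the `put_record` call on a boto3 Kinesis client (created with `boto3.client("kinesis")`), which is passed in as an argument.

```python
import json
from typing import Any, Mapping

# Hypothetical allowlist: only these fields ever leave the platform.
ALLOWED_FIELDS = ("customer_id", "account_status", "updated_at")


def to_safe_record(customer: Mapping[str, Any]) -> dict[str, Any]:
    """Build the outbound record: allowlisted fields plus derived flags."""
    record = {name: customer.get(name) for name in ALLOWED_FIELDS}
    # Derive a non-sensitive signal instead of releasing the email itself.
    record["has_email"] = bool(customer.get("email"))
    return record


def push_record(kinesis_client: Any, stream_name: str, customer: Mapping[str, Any]) -> None:
    """Send one safe record to a Kinesis data stream as JSON.

    `kinesis_client` is assumed to be a boto3 Kinesis client.
    """
    record = to_safe_record(customer)
    kinesis_client.put_record(
        StreamName=stream_name,
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=str(record["customer_id"]),
    )
```

Because the record is built from an allowlist, a new sensitive column added to the database later is not released unless someone deliberately adds it to `ALLOWED_FIELDS`.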

Apache Spark and ELT

To process the dump of JSON files generated from Kinesis into an optimised and useful format in our data lake, we use Apache Spark. Spark is a parallel computation engine and the de facto standard for big data processing. Using Spark, we collate and convert the JSON files into Delta format, allowing for significantly faster queries, schema enforcement and rollbacks. It is common to adopt ELT over ETL for this stage of a data pipeline.
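The conversion step could look something like the PySpark sketch below. It is not our production job: the schema, the function name and the paths are hypothetical, and it assumes `spark` is a SparkSession that already has Delta Lake configured.

```python
from typing import Any

# Hypothetical schema for the pushed records, written as a Spark DDL string.
RAW_SCHEMA = (
    "customer_id STRING, account_status STRING, "
    "has_email BOOLEAN, updated_at TIMESTAMP"
)


def json_to_delta(spark: Any, source_path: str, target_path: str) -> None:
    """Read the JSON dump with an explicit schema and append it to a Delta table."""
    # `spark` is assumed to be a SparkSession with Delta Lake available.
    events = spark.read.schema(RAW_SCHEMA).json(source_path)
    events.write.format("delta").mode("append").save(target_path)
```

A call would look like `json_to_delta(spark, "s3://<bucket>/raw/", "s3://<bucket>/delta/customers")`, with the bucket names as placeholders. Passing a schema up front saves Spark from scanning the files to infer one.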

ETL stands for ‘Extract, Transform, Load’ and is the process of taking data from a location, transforming it in some way and saving it in another location. Since we have already decided which fields are sent through Kinesis, and we are confident they contain no sensitive information, we can use ELT instead.

ELT is the cool younger brother of ETL and stands for ‘Extract, Load, Transform’. This fairly new paradigm is a process by which you take data from a source and immediately save it into an optimised format in your desired location with no immediate transformations. Then you can transform it further down the pipeline.

Although the differences between ETL and ELT may seem small, the benefits of ELT can be quite profound. With ELT you save all of your data and then worry about what to do with it later. This allows for easier auditing and greater flexibility to create new downstream processes. The data pipeline is also more reliable since you are isolating the two processes:

  • Getting data from source into your desired location (data warehouse/lake/lakehouse)
  • Performing some business logic transformation
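The separation of the two processes above can be shown with a toy Python sketch. It uses an in-memory list in place of a real data lake, and the function and field names are invented for this example.

```python
from collections import Counter
from typing import Any, Iterable


def load(incoming: Iterable[dict[str, Any]], raw_store: list[dict[str, Any]]) -> int:
    """Step 1: append records to the raw store exactly as they arrived."""
    before = len(raw_store)
    raw_store.extend(dict(record) for record in incoming)
    return len(raw_store) - before


def email_coverage(raw_store: Iterable[dict[str, Any]]) -> dict[bool, int]:
    """Step 2: a business-logic transformation that only reads the raw store."""
    return dict(Counter(bool(record.get("has_email")) for record in raw_store))
```

Since `email_coverage` never modifies the raw store, a bug in it cannot damage the loaded data, and new transformations can be added later against the same records.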

Putting it together

Through this process, we now have up-to-date data in a useful format in our data lakehouse. We can also be confident it contains no PII or sensitive information, yet we are still able to draw meaning and insight from the sensitive fields. Using this data we can build and deploy machine learning workloads, serve live analytics to our colleagues and clients, and perform deep dive analysis into our product.

And we have achieved this without compromising the security of the platform database.