Data lakes are a popular solution for data storage, and for good reason.
Data lakes are flexible and cost-effective: they support multiple query engines and many object formats without requiring you to manage resources like disks, CPUs, and memory.
In a data lake, data is simply stored in an object store, and you can use a managed query engine for a complete pay-per-usage solution.
With terabytes of compressed data added every day by automated processes, and with hundreds of users adding more data, regular cleanup of unused data is a must-have process.
Before deleting unused data, you first need to monitor the data lake's usage.
Usage can come through multiple query engines, and data can also be accessed directly through the object store.
Monitoring this vast amount of data is tricky and can be expensive, so we set out to develop a method to collect usage data in an efficient way.
To monitor the usage of our data lake tables, we decided to store the query log history from the different engines and analyze it.
We used this usage data to find unused tables, which led us to a cleanup process.
This process saves us storage costs, reduces the data lake's attack surface, and gives us a cleaner work environment.
In this blog post, you'll learn about data lake table usage monitoring and cleanup.
Collecting usage data requires effort for each individual engine, and the collected data differs from one engine to another, but it's important to monitor all access options.
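To illustrate the kind of normalization this involves, here is a minimal Python sketch that turns one engine's log record into per-table usage events. The record layout, the field names, and the simple FROM/JOIN regex are assumptions for illustration, not the exact parsing of any specific engine's log format; real SQL needs a proper parser to handle CTEs, subqueries, and quoted identifiers.

```python
import re
from dataclasses import dataclass
from datetime import datetime

# Hypothetical normalized record; each engine's real log layout differs.
@dataclass
class QueryLogRecord:
    engine: str        # e.g. "athena", "spark", "presto"
    user: str
    query: str
    timestamp: datetime

# Naive pattern: identifiers appearing after FROM or JOIN.
# Good enough for a sketch, not for production SQL parsing.
TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)

def extract_usage_events(record: QueryLogRecord) -> list[dict]:
    """Turn one engine log record into per-table usage events."""
    tables = set(TABLE_REF.findall(record.query))
    return [
        {"table": t.lower(), "engine": record.engine,
         "user": record.user, "accessed_at": record.timestamp}
        for t in tables
    ]
```

Storing events in this shared shape is what makes it possible to query usage across engines later, regardless of where the access originated.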
Once the usage data is stored and accessible, you can monitor the usage of the data lake's objects to detect access patterns, perform cost analysis, monitor user access, and more.
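Continuing the sketch above, the stored events can be aggregated into per-table statistics and used to flag tables that look unused. The 90-day idle window and the assumption that timestamps are timezone-aware UTC datetimes are illustrative choices, not values from our process.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

def table_usage_stats(events: list[dict]) -> dict[str, dict]:
    """Aggregate usage events into per-table access statistics."""
    stats: dict[str, dict] = defaultdict(
        lambda: {"accesses": 0, "users": set(), "last_access": None})
    for e in events:
        s = stats[e["table"]]
        s["accesses"] += 1
        s["users"].add(e["user"])
        if s["last_access"] is None or e["accessed_at"] > s["last_access"]:
            s["last_access"] = e["accessed_at"]
    return stats

def unused_tables(stats: dict[str, dict], known_tables: set[str],
                  max_idle_days: int = 90) -> set[str]:
    """Tables with no recorded access inside the idle window.
    Tables missing from the usage data entirely are also flagged."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_idle_days)
    idle = {t for t, s in stats.items() if s["last_access"] < cutoff}
    never_used = known_tables - stats.keys()
    return idle | never_used
```

Comparing against the full catalog of known tables matters: a table that never appears in any engine's logs is invisible to the aggregation alone.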
After detecting unused tables, you should collect information to decide if you can remove any of the data.
For tables created by our ETL processes, we join the usage data with our infrastructure data to get the average daily size of data entering the data lake. Other useful data points about unused tables include table size, the number of objects, and when the data was last updated. This additional information, such as resource consumption and last modification time, helps us focus on the heavier tables and decide whether we can remove data.
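For tables backed by an object store, much of this metadata can be collected directly. Below is a minimal sketch using boto3 against S3, assuming one prefix per table (a common but not universal layout); the bucket and prefix names in the usage comment are placeholders.

```python
import boto3

def table_storage_stats(bucket: str, prefix: str) -> dict:
    """Total size, object count, and last modification time for a table
    stored under a single S3 prefix."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    total_bytes, object_count, last_modified = 0, 0, None
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            total_bytes += obj["Size"]
            object_count += 1
            if last_modified is None or obj["LastModified"] > last_modified:
                last_modified = obj["LastModified"]
    return {"size_bytes": total_bytes, "objects": object_count,
            "last_modified": last_modified}

# Example with placeholder names:
# stats = table_storage_stats("my-datalake-bucket", "warehouse/events_raw/")
```

Ranking the flagged tables by size and staleness lets you start the cleanup where it pays off most.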
Using the monitoring data we collected, we detected unused tables and removed hundreds of them.
If you want to do this on your own data lake, you need to understand how your data is being accessed, audit that access, and finally implement usage monitoring.