Data lakes are a popular solution for data storage, and for good reason.
Data lakes are flexible and cost-effective: they support multiple query engines and many object formats without requiring you to manage resources like disks, CPUs, and memory.
In a data lake, data is simply stored in an object store, and you can use a managed query engine for a complete pay-per-usage solution.
With terabytes of compressed data added every day by automated processes, and with hundreds of users adding more data, regular cleanup of unused data is a must-have process.
Before deleting unused data, you first need to monitor the data lake's usage.
Usage can come through multiple query engines, and data can also be accessed directly through the object store.
Monitoring this vast amount of data is tricky and can be expensive, so we set out to develop a method to collect usage data in an efficient way.
To monitor the usage of our data lake tables, we decided to store the query log history from the different engines and analyze it.
We used this usage data to find unused tables, which led us to a cleanup process.
This process saves us storage costs, reduces the data lake's attack surface, and gives us a cleaner work environment.
In this blog post, you'll learn about data lake table usage monitoring and cleanup.
Collecting usage data requires effort for each individual engine, and the collected data differs from one engine to another, but it's important to monitor all access options.
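To illustrate the kind of normalization this involves, here is a minimal Python sketch that turns one engine's log record into per-table usage events. The record layout, the field names, and the simple FROM/JOIN regex are assumptions for illustration, not the exact parsing of any specific engine's log format; real SQL needs a proper parser to handle CTEs, subqueries, and quoted identifiers.

```python
import re
from dataclasses import dataclass
from datetime import datetime

# Hypothetical normalized record; each engine's real log layout differs.
@dataclass
class QueryLogRecord:
    engine: str        # e.g. "athena", "spark", "presto"
    user: str
    query: str
    timestamp: datetime

# Naive pattern: identifiers appearing after FROM or JOIN.
# Good enough for a sketch, not for production SQL parsing.
TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)

def extract_usage_events(record: QueryLogRecord) -> list[dict]:
    """Turn one engine log record into per-table usage events."""
    tables = set(TABLE_REF.findall(record.query))
    return [
        {"table": t.lower(), "engine": record.engine,
         "user": record.user, "accessed_at": record.timestamp}
        for t in tables
    ]
```

Storing events in this shared shape is what makes it possible to query usage across engines later, regardless of where the access originated.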
Once the usage data is stored and accessible, you can monitor the usage of the data lake's objects to detect access patterns, perform cost analysis, monitor user access, and more.
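Continuing the sketch above, the stored events can be aggregated into per-table statistics and used to flag tables that look unused. The 90-day idle window and the assumption that timestamps are timezone-aware UTC datetimes are illustrative choices, not values from our process.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

def table_usage_stats(events: list[dict]) -> dict[str, dict]:
    """Aggregate usage events into per-table access statistics."""
    stats: dict[str, dict] = defaultdict(
        lambda: {"accesses": 0, "users": set(), "last_access": None})
    for e in events:
        s = stats[e["table"]]
        s["accesses"] += 1
        s["users"].add(e["user"])
        if s["last_access"] is None or e["accessed_at"] > s["last_access"]:
            s["last_access"] = e["accessed_at"]
    return stats

def unused_tables(stats: dict[str, dict], known_tables: set[str],
                  max_idle_days: int = 90) -> set[str]:
    """Tables with no recorded access inside the idle window.
    Tables missing from the usage data entirely are also flagged."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_idle_days)
    idle = {t for t, s in stats.items() if s["last_access"] < cutoff}
    never_used = known_tables - stats.keys()
    return idle | never_used
```

Comparing against the full catalog of known tables matters: a table that never appears in any engine's logs is invisible to the aggregation alone.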
After detecting unused tables, you should collect information to decide if you can remove any of the data.
For tables created by our ETL processes, we join the usage data with our infrastructure data to get the average daily size of data entering the data lake. Other useful data points about unused tables include table size, the number of objects, and when the data was last updated. This additional information, such as resource consumption and last modification time, helps us focus on the heavier tables and decide whether we can remove data.
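For tables backed by an object store, much of this metadata can be collected directly. Below is a minimal sketch using boto3 against S3, assuming one prefix per table (a common but not universal layout); the bucket and prefix names in the usage comment are placeholders.

```python
import boto3

def table_storage_stats(bucket: str, prefix: str) -> dict:
    """Total size, object count, and last modification time for a table
    stored under a single S3 prefix."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    total_bytes, object_count, last_modified = 0, 0, None
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            total_bytes += obj["Size"]
            object_count += 1
            if last_modified is None or obj["LastModified"] > last_modified:
                last_modified = obj["LastModified"]
    return {"size_bytes": total_bytes, "objects": object_count,
            "last_modified": last_modified}

# Example with placeholder names:
# stats = table_storage_stats("my-datalake-bucket", "warehouse/events_raw/")
```

Ranking the flagged tables by size and staleness lets you start the cleanup where it pays off most.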
Using the monitoring data we collected, we detected unused tables and removed hundreds of them.
If you want to do this on your own data lake, you need to understand how your data is being accessed, audit that access, and finally implement usage monitoring.