Optimizing Data Lake Usage with Effective Object Management

Data lakes are a popular solution for data storage, and for good reason.
Data lakes are flexible and cost-effective: they support multiple query engines and many object formats without requiring you to manage resources such as disks, CPUs, and memory.
In a data lake, data is simply stored in an object store, and you can use a managed query engine for a complete pay-per-usage solution.
With terabytes of compressed data added every day by automated processes, and hundreds of users adding more, regular cleanup of unused data is a must.
Before deleting unused data, you first need to monitor the data lake's usage.
There are multiple query engines, and it is also possible to access data directly through the object store.
Monitoring this vast amount of data is tricky and can be expensive, so we set out to develop a method to collect usage data in an efficient way.
Our goal was to monitor usage at the level of data lake tables.
We decided to store the log history from the different query engines and use it to track table usage.
We then used the usage data to find unused tables, which became the basis of a cleanup process.
This process saves us storage costs, reduces the data lake's attack surface, and gives us a cleaner work environment.
In this blog post, you'll learn about data lake table usage monitoring and cleanup.
Collecting usage data requires effort for each individual engine, and the collected data differs from one engine to another, but it's important to monitor all access options.
Once the usage data is stored and accessible, you can monitor how the data lake's objects are used in order to detect access patterns, perform cost analysis, monitor user access, and more.
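As a minimal sketch of how stored usage data can surface unused tables, the Python example below merges access records from several engines, keeps the most recent access per table, and flags tables with no access inside a chosen window. The record layout, the engine and table names, the catalog list, and the 90-day threshold are all assumptions made for illustration; they are not taken from the original post, and real engine logs would first need to be normalized into a shape like this.

```python
from datetime import datetime, timedelta

# Hypothetical, already-normalized access records: (engine, table, access time).
# Real logs differ per engine and must be parsed into a common shape first.
ACCESS_LOG = [
    ("engine_a", "sales.orders", datetime(2024, 1, 30)),
    ("engine_b", "sales.orders", datetime(2023, 9, 2)),
    ("object_store", "tmp.export_2022", datetime(2023, 3, 14)),
    ("engine_a", "marketing.clicks", datetime(2024, 1, 12)),
]

# Hypothetical list of every table known to the catalog.
ALL_TABLES = [
    "sales.orders",
    "tmp.export_2022",
    "marketing.clicks",
    "legacy.audit_old",
]


def last_access_by_table(records):
    """Return the most recent access time per table, across all engines."""
    latest = {}
    for _engine, table, accessed_at in records:
        if table not in latest or accessed_at > latest[table]:
            latest[table] = accessed_at
    return latest


def find_unused_tables(all_tables, records, now, max_idle_days=90):
    """Return tables never accessed, or not accessed within max_idle_days."""
    cutoff = now - timedelta(days=max_idle_days)
    latest = last_access_by_table(records)
    return sorted(
        table
        for table in all_tables
        if latest.get(table) is None or latest[table] < cutoff
    )


unused = find_unused_tables(ALL_TABLES, ACCESS_LOG, now=datetime(2024, 2, 1))
print(unused)
```

Taking the latest access across every engine, including direct object store access, matters here: a table that looks idle in one engine's logs may still be read through another path.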
After detecting unused tables, you should collect information to decide if you can remove any of the data.
For tables created by ETL processes, we join this data with our infrastructure data to get the average daily size of the data entering the data lake.
Other examples of data we can collect about unused tables are the table size, the number of objects, and the time the data was last updated.
The additional information we collect on unused tables, like the resource consumption and last modification time, helps us to focus on the heavier tables and to decide if we can remove data.
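The prioritization described above can be sketched as follows: join the list of unused tables with per-table statistics, drop tables that were modified recently, and sort the rest by size so the heaviest candidates come first. The `TableStats` fields, the sample figures, and the 30-day "not recently modified" rule are hypothetical choices for this example, not the author's actual criteria.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class TableStats:
    """Hypothetical per-table statistics gathered from the object store."""
    name: str
    size_bytes: int
    object_count: int
    last_modified: date


def rank_cleanup_candidates(unused, stats_by_table, today, min_unmodified_days=30):
    """Return stats for unused tables not modified recently, largest first."""
    candidates = []
    for name in unused:
        stats = stats_by_table.get(name)
        if stats is None:
            continue  # no statistics collected for this table; skip it
        if (today - stats.last_modified).days >= min_unmodified_days:
            candidates.append(stats)
    return sorted(candidates, key=lambda s: s.size_bytes, reverse=True)


# Illustrative sample data only.
STATS = {
    "tmp.export_2022": TableStats("tmp.export_2022", 5 * 10**12, 48_000, date(2022, 12, 1)),
    "legacy.audit_old": TableStats("legacy.audit_old", 2 * 10**11, 3_100, date(2021, 6, 15)),
    "staging.daily_feed": TableStats("staging.daily_feed", 12 * 10**12, 210_000, date(2024, 1, 31)),
}

ranked = rank_cleanup_candidates(
    ["legacy.audit_old", "staging.daily_feed", "tmp.export_2022"],
    STATS,
    today=date(2024, 2, 1),
)
for stats in ranked:
    print(stats.name, stats.size_bytes, stats.object_count, stats.last_modified)
```

In this sketch, a table that is unread but still being written (such as the hypothetical `staging.daily_feed`) is left out of the ranked list, since it likely calls for a separate decision about the process that feeds it rather than a simple deletion.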
Using the monitoring data we collected, we detected unused tables and removed hundreds of them.
If you want to do the same on your own data lake, you need to understand how your data is accessed, audit that access, and finally implement usage monitoring.


This Cyber News was published on www.imperva.com. Publication date: Thu, 01 Feb 2024 17:13:04 +0000


