In this post, we will share our experience hunting for new threats by processing Kaspersky Security Network (KSN) global threat data with ML tools to identify subtle new Indicators of Compromise (IoCs). The KSN infrastructure is designed to receive and process complex global cyberthreat data, transforming it into actionable threat intelligence that powers our products. Evaluating ML models on this global threat data has led us to reveal thousands of new advanced threats. We will also discuss the challenges of implementing machine learning and of interpreting threat hunting results.

In the ever-evolving landscape of cybersecurity, logs, that is, information collected from sources such as network devices, endpoints, and applications, play a crucial role in identifying and responding to threats. By analyzing this data, organizations can detect anomalies, pinpoint malicious activity, and mitigate potential cyberattacks before they cause significant damage. Machine learning (ML), a subset of artificial intelligence (AI), enables systems to learn from data and improve their performance over time without being explicitly programmed; with its ability to process and analyze large datasets, it offers a powerful way to enhance threat detection capabilities. Through the analysis of vast and complex logs, ML models can identify subtle patterns and IoCs, providing organizations with a powerful tool to strengthen their security posture. Their ability to analyze vast amounts of data in real time also ensures that potential threats are identified and addressed more quickly, minimizing the window of vulnerability.

As we started this study, we kept in mind that the use of ML in log analysis enables the discovery of previously unknown cyberthreats by processing vast amounts of data and uncovering hidden patterns. In effect, ML "reconstructs the cyber-reality" by transforming raw telemetry data into actionable insights that reflect the true state of a network's or system's security. Ultimately, this ability to partially reconstruct the cyber-reality from logs helps organizations stay ahead of cyberthreats by offering a clearer, more precise view of their security posture and enabling faster, better-informed decision-making. We utilize a variety of ML models and methods that are key to automating threat detection, recognizing anomalies, and improving the accuracy of malware identification.

A machine learning dataset is a collection of data used to train, validate, and test ML models. It consists of examples, each containing features (input variables) and, in supervised learning tasks, corresponding labels (output variables or targets). The choice of dataset, its quality, and how it is prepared and split into training, validation, and test sets can significantly affect the model's ability to learn and to generalize to new data. Preprocessing is a crucial step of the machine learning pipeline in which raw data is transformed into a format suitable for training: it involves cleaning the data, handling missing values, converting variables into a scaled and normalized numerical representation, and ensuring that the data is in a consistent, standardized format.

When dealing with text data, a common approach is to first transform the raw text into numerical features using a technique such as TF–IDF and then apply an ML algorithm such as Random Forest to classify or analyze the data. TF–IDF converts raw text into a set of machine-readable numerical features that can then be fed to a model, and it is known to be efficient and versatile. Random Forest, in turn, is known for its accuracy, its reduced risk of overfitting, its ability to capture complex, non-linear relationships between features, and the insight it provides into the importance of the various features in the dataset.
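To make this workflow concrete, here is a minimal sketch in Python with scikit-learn (an assumed toolset chosen for illustration, not the actual pipeline used in this study). The sample log lines, labels, and parameters are invented for the example.

```python
# Illustrative sketch only: a TF-IDF + Random Forest text classifier for log lines.
# The sample data, labels and parameters below are assumptions made for the example.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical labelled dataset: raw telemetry strings and a benign (0) / suspicious (1) label.
logs = [
    "powershell.exe -enc JABjAGwAaQBlAG4AdA",           # encoded PowerShell command
    "svchost.exe -k netsvcs",                            # common benign service host
    "rundll32.exe C:\\Users\\Public\\payload.dll,Run",   # DLL launched from a user folder
    "chrome.exe --type=renderer",                        # benign browser child process
]
labels = [1, 0, 1, 0]

# Hold out part of the data so the model can be evaluated on examples it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    logs, labels, test_size=0.25, random_state=42
)

# TF-IDF turns raw text into numerical features; Random Forest then learns
# non-linear decision rules over those features.
model = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))),
    ("clf", RandomForestClassifier(n_estimators=200, random_state=42)),
])
model.fit(X_train, y_train)
print(model.score(X_test, y_test))   # accuracy on the held-out examples
```

In practice, hyperparameters such as the n-gram range and the number of trees would be tuned against a validation set before the model is judged on the held-out test data.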
The combination of TF–IDF with Random Forest can handle high-dimensional data while providing the robustness and scalability needed to process data with millions of entries daily. The resulting model can process and learn from millions of data points in real time, pointing out subtle indicators that may signal the presence of a new or advanced threat. The model is first trained and tested before being deployed to examine larger amounts of data. The outcomes observed during training play a crucial role in guiding the development, refinement, and optimization of the model: they provide the feedback that helps data scientists and ML engineers make informed decisions, guide adjustments, and ensure that the final model is robust, generalizes well, and meets the desired criteria.

A machine learning model reaches maturity when it performs consistently well on the kind of tasks it was designed for, meeting the performance criteria set during its development. When a model is ready, it can be integrated into a production environment, where it starts making predictions on new data. To maintain model maturity, incremental learning is often needed: an ongoing process of updating and refining the model by incorporating new data over time. This is particularly important in dynamic fields where data distributions and patterns shift and models must keep up with these changes, which is exactly the case with the cybersecurity threat landscape. Such continuous learning allows ML models to keep detecting subtle and novel cyberthreats, providing a more robust defense.
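Continuing from the sketch above, the following snippet illustrates, under the same assumptions, how a fitted pipeline could be persisted, used to score incoming telemetry, and periodically refreshed. The file name, threshold, and helper functions are hypothetical, not a description of our production setup; Random Forest has no built-in incremental update, so continuous learning is approximated here by retraining on the accumulated labelled corpus.

```python
# Illustrative sketch: persisting the trained pipeline, scoring new telemetry,
# and periodically retraining it on accumulated data. Names and thresholds are assumptions.
import joblib

joblib.dump(model, "threat_model.joblib")      # persist the fitted TF-IDF + Random Forest pipeline
scorer = joblib.load("threat_model.joblib")    # load it in the serving environment

def triage(new_events, threshold=0.8):
    """Return the incoming telemetry lines whose estimated probability of being malicious is high."""
    probabilities = scorer.predict_proba(new_events)[:, 1]
    return [event for event, p in zip(new_events, probabilities) if p >= threshold]

def retrain(all_logs, all_labels):
    """Periodic retraining on the full, updated corpus stands in for incremental learning here."""
    scorer.fit(all_logs, all_labels)
    joblib.dump(scorer, "threat_model.joblib")
```

How often to retrain, and on how much history, is a trade-off between model freshness and compute cost.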
There are challenges, however. Random Forest is highly effective at identifying patterns, but this strength can come at the cost of interpretability, particularly with larger models, and Random Forest models can become computationally demanding when dealing with high-dimensional data and large datasets.
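One way to partially offset this opacity is to inspect which features the forest actually relies on. The sketch below, again assuming the fitted pipeline from the earlier example, maps the forest's impurity-based importances back to the TF–IDF n-grams; printing the top ten is an arbitrary choice.

```python
# Illustrative sketch: recovering the most influential TF-IDF features from the fitted pipeline.
import numpy as np

vectorizer = model.named_steps["tfidf"]
forest = model.named_steps["clf"]

feature_names = vectorizer.get_feature_names_out()   # the character n-grams produced by TF-IDF
importances = forest.feature_importances_            # impurity-based importance per feature

# Print the ten features the forest leans on most.
for idx in np.argsort(importances)[::-1][:10]:
    print(f"{feature_names[idx]!r}: {importances[idx]:.4f}")
```

Importance scores of this kind do not explain individual predictions, but they help analysts sanity-check what the model has learned.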
Looking ahead, one promising area is the integration of deep learning techniques, which can automatically extract and learn complex patterns from raw data. We already use deep learning in some of our products, and applying it to threat hunting could further improve detection accuracy and uncover even more sophisticated threats. Additionally, federated learning presents a significant opportunity for collaborative threat detection across organizations while preserving data privacy: by allowing models to learn from decentralized data without sharing the data itself, it could facilitate the creation of more robust and generalizable threat detection models. Another area of exploration is reinforcement learning, where models continuously adapt and improve by interacting with dynamic cybersecurity environments. These technologies will not only improve detection accuracy but also enable more proactive and collaborative defense strategies, allowing organizations to stay ahead of the ever-evolving cyberthreat landscape.

The journey of refining ML models through meticulous dataset preparation, preprocessing, and model implementation has highlighted the importance of leveraging these technologies to build robust, adaptable, and scalable solutions. As we continue to explore and enhance these capabilities, the potential of machine learning to reshape cybersecurity and to protect against increasingly sophisticated threats becomes ever more apparent.

This Cyber News was published on securelist.com. Publication date: Wed, 02 Oct 2024 10:43:07 +0000