Armorblox was acquired by Cisco to further their AI-first Security Cloud by bringing generative AI experiences to Cisco's security solutions.
A new mission soon came my way: build generative AI assistants that let cybersecurity administrators quickly find the answers they seek, and therefore make their lives easier.
The AI assistant can help with troubleshooting, such as locating policies, summarizing existing configurations, surfacing relevant documentation, and more.
The first and most obvious challenge has been evaluation of the model.
There are several ways a model's responses can be evaluated.
An innovative method that was proposed early on by the community was using LLMs to evaluate LLMs. This works wonders for generalized use cases, but can fall short when assessing models tailored for niche tasks.
For niche use cases to perform well, they require access to unique or proprietary data that standard models like GPT-4 have never seen, which leaves a general-purpose judge with little basis for scoring those responses.
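To make the idea concrete, here is a minimal sketch of the LLM-as-judge pattern using OpenAI's Python SDK; the rubric, judge model, and example data are hypothetical, not our actual evaluation harness.

```python
# Minimal LLM-as-judge sketch (hypothetical rubric, model, and example data).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading an AI assistant for security administrators.
Question: {question}
Assistant answer: {answer}
Score the answer from 1 (unusable) to 5 (fully correct and helpful),
then briefly explain the score. Respond as: SCORE: <n> REASON: <text>"""

def judge_response(question: str, answer: str) -> str:
    """Ask a general-purpose model to grade another model's answer."""
    result = client.chat.completions.create(
        model="gpt-4",  # the judge model; any capable chat model works
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,  # deterministic grading
    )
    return result.choices[0].message.content

# The judge has no access to our proprietary policy data, so it cannot
# verify product-specific claims -- exactly the gap described above.
print(judge_response("Which policy blocks external senders?",
                     "Policy 'Inbound-Block-3' quarantines all external senders."))
```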
As the pool of real user data available for validation grows, automated metrics will matter more. With real user questions we can benchmark against real use cases, and those metrics become a stronger signal of model quality.
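As a rough sketch of what that could look like, the snippet below scores assistant answers against vetted reference answers for real user questions; the token-overlap metric and the sample data are placeholders rather than our actual pipeline.

```python
# Sketch of an automated metric over real user questions (placeholder data and metric).

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a model answer and a reference answer."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    common = sum(min(pred.count(t), ref.count(t)) for t in set(pred))
    if not common:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)

validation_set = [  # real user questions paired with vetted answers
    {"question": "How do I export the audit log?",
     "reference": "Go to Reports, select Audit Log, and click Export as CSV."},
]

def evaluate(answer_fn):
    """Average the metric across the validation set; answer_fn calls the assistant."""
    scores = [token_f1(answer_fn(ex["question"]), ex["reference"]) for ex in validation_set]
    return sum(scores) / len(scores)
```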
The first set of use cases for our AI assistant is aimed at making users more efficient, either by compiling and presenting data coherently or by making information more accessible.
Once the AI assistant summarizes a user's rule configuration, for example, the natural next question is how to alter it.
The AI assistant then gives them guided steps to configure the policy as desired.
This has already given me insight into some hallucinations and poor assumptions that the AI assistant is making.
Engaging domain experts as a proxy for real customers at pre-launch to test the AI assistant has proven invaluable.
Instituting a regular team ritual to review and act on this feedback ensures continued alignment with expectations for the model responses.
Prioritizing the feedback we receive is extremely important: the impact on the user experience and the potential loss of trust in the AI assistant are the core criteria, along with how frequently the issue occurs.
The pathways for addressing evaluation gaps are varied: prompt engineering, trying different models, or augmented-model strategies such as knowledge graphs.
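As one illustration of the augmented-model pathway, the sketch below grounds the prompt with facts pulled from a toy knowledge graph before calling the model; the graph contents and lookup helper are invented for this example.

```python
# Toy knowledge-graph augmentation: invented facts, purely illustrative.
KNOWLEDGE_GRAPH = {
    # (subject, relation) -> object
    ("Inbound-Block-3", "type"): "email policy",
    ("Inbound-Block-3", "action"): "quarantine external senders",
}

def related_facts(entity: str) -> list[str]:
    """Pull every fact about an entity out of the graph as plain sentences."""
    return [f"{s} {r}: {o}" for (s, r), o in KNOWLEDGE_GRAPH.items() if s == entity]

def build_prompt(question: str, entity: str) -> str:
    """Prepend grounded facts so the model answers from known data, not guesses."""
    facts = "\n".join(related_facts(entity))
    return f"Use only these facts:\n{facts}\n\nQuestion: {question}"

print(build_prompt("What does Inbound-Block-3 do?", "Inbound-Block-3"))
```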
As the solution evolves into a tangible, demoable product, latency (the time it takes for a response to be returned to the user) becomes increasingly important.
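One lightweight way to track that, assuming a streaming chat completion client, is to record both time-to-first-token and total response time; the model name below is a placeholder.

```python
# Rough latency instrumentation for a streamed response (placeholder model name).
import time
from openai import OpenAI

client = OpenAI()

def timed_answer(question: str) -> dict:
    start = time.monotonic()
    first_token_at = None
    chunks = []
    stream = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": question}],
        stream=True,  # stream so the user sees text before the full answer is done
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content or ""
        if delta and first_token_at is None:
            first_token_at = time.monotonic()
        chunks.append(delta)
    end = time.monotonic()
    return {
        "answer": "".join(chunks),
        "time_to_first_token_s": (first_token_at or end) - start,
        "total_time_s": end - start,
    }
```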
It's been an exciting start to the journey of building products with LLMs and I can't wait to learn more as we continue building and shipping awesome AI products.
Recently, OpenAI released its Assistants API, which will enable developers to more easily access the potential of LLMs to operate as agents with multiple tools and larger contexts.
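At the time of writing the Assistants API is in beta, so the surface may change; the sketch below shows the basic assistant / thread / run flow from OpenAI's Python SDK, with a placeholder instruction and the built-in code interpreter tool.

```python
# Basic Assistants API flow (beta at the time of writing; names are placeholders).
import time
from openai import OpenAI

client = OpenAI()

assistant = client.beta.assistants.create(
    name="Security Admin Helper",
    instructions="Help administrators find and explain security policies.",
    model="gpt-4-1106-preview",
    tools=[{"type": "code_interpreter"}],  # one of the built-in tools
)

thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id, role="user",
    content="Summarize my inbound email policies.",
)

run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)
while run.status in ("queued", "in_progress"):  # poll until the run finishes
    time.sleep(1)
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

for message in client.beta.threads.messages.list(thread_id=thread.id):
    print(message.role, message.content[0].text.value)
```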