Revolutionizing Cybersecurity Search with Sysdig Sage™

Published by: Flavio Mutti

Published: June 24, 2025

We are excited to introduce Sysdig Sage for Search, our AI-based graph search assistant. It extends Sysdig’s AI capabilities following the launch of Sysdig Sage for Cloud Detection and Response. Designed to assist cybersecurity professionals, Sysdig Sage for Search redefines how we interact with and extract insights from complex security data.

Traditional cybersecurity tools often fall short when it comes to handling the growing complexity of modern environments. The challenge lies in making sense of vast amounts of data while providing actionable insights in real time. This is where Sysdig’s AI search engine stands out.

Introducing Sysdig Sage for Search

At the heart of Sysdig Sage for Search is a powerful search engine that combines cutting-edge AI technology with deep domain expertise. Designed specifically for cybersecurity, this engine simplifies how professionals interact with complex infrastructure and security data.

Sysdig Sage for Search enables users to express security questions in natural language, which are automatically interpreted and translated into formal SysQL queries against a graph-based datastore. This allows security teams to seamlessly explore relationships, entities, and events without needing to write or understand query syntax.

The system empowers analysts with an intuitive interface that bridges the gap between high-level investigation goals and low-level data, accelerating workflows such as incident response, policy validation, and behavioral analysis.

Key innovations

SysQL: A proprietary query language

Sysdig introduces SysQL, a novel query language tailored specifically for the cybersecurity domain. Unlike generic query languages, SysQL is user-friendly and intuitive, enabling professionals to ask complex questions without needing advanced technical expertise.

SysQL is a query language designed specifically for exploring Kubernetes and cloud resources, as well as risks and findings relevant to cloud security posture management (CSPM). It provides capabilities to query entities related to cloud infrastructure, security conditions, vulnerabilities, and compliance controls.

Advantages of SysQL:

1. Specificity to cloud and Kubernetes:

SysQL is tailored for querying resources and vulnerabilities within cloud environments and Kubernetes setups, making it more efficient for CSPM-related operations than using generic query languages.

2. Entity-relationship structure:

SysQL allows querying based on entities and their relationships. This is particularly useful for understanding how different components in the infrastructure interact and affect each other.

3. Built-in security features:

SysQL supports queries targeting security findings, vulnerabilities, and configurations, making it a valuable tool for security analysis and posture management.

4. Usability:

The language of SysQL is designed to be user-friendly, allowing operators to easily specify the data they want to target or filter, simplifying the task of complex cloud security operations.

5. Comprehensive query operations:

Similar to other query languages, SysQL offers a range of operators like MATCH, WHERE, RETURN, ORDER BY, and more, providing robust support for filtering, sorting, and limiting query results.

Overall, SysQL consolidates cloud-native and Kubernetes resource queries with security and compliance analysis, enabling more relevant and context-driven exploration compared to generic database query languages.
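To make the clause structure concrete, here is a minimal Python sketch that assembles a query string from the operators listed above. The entity, relationship, and field names (Workload, AFFECTED_BY, Vulnerability.severity) are hypothetical placeholders, and the rendered string only illustrates the clause order; it has not been checked against the actual SysQL grammar.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class QuerySketch:
    """Collects MATCH/WHERE/RETURN/ORDER BY/LIMIT parts and renders one string."""
    match: str
    where: List[str] = field(default_factory=list)
    returns: List[str] = field(default_factory=list)
    order_by: Optional[str] = None
    limit: Optional[int] = None

    def render(self) -> str:
        parts = [f"MATCH {self.match}"]
        if self.where:
            # Multiple filters are joined with AND in this sketch.
            parts.append("WHERE " + " AND ".join(self.where))
        parts.append("RETURN " + ", ".join(self.returns))
        if self.order_by:
            parts.append(f"ORDER BY {self.order_by}")
        if self.limit is not None:
            parts.append(f"LIMIT {self.limit}")
        return " ".join(parts)


# Hypothetical entity, relationship, and field names, for illustration only.
query = QuerySketch(
    match="Workload AFFECTED_BY Vulnerability",
    where=["Vulnerability.severity = 'Critical'"],
    returns=["Workload.name", "Vulnerability.name"],
    order_by="Workload.name",
    limit=5,
)
print(query.render())
```

The point of the sketch is the shape: one pattern to match, optional filters, an explicit result list, and optional ordering and limiting.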

For further details about SysQL, please refer to the Sysdig documentation.

Fine-tuned LLM

Sysdig’s AI engine is powered by a large language model (LLM) fine-tuned on more than a hundred thousand carefully crafted domain-specific questions. This model translates user queries into SysQL, ensuring precise and actionable results.

To enable natural language access to our security data, we fine-tuned a custom LLM specifically designed to understand and generate structured queries over our security graph. This model doesn’t just parse sentences — it interprets user intent in the context of cloud and container security, generating precise, semantically rich queries that power our search engine. The LLM has been trained on a dataset containing about 34k SysQL queries and 135k questions in natural language.

By training the model on real-world examples and refining it through continuous evaluation, we ensure it stays aligned with how practitioners investigate risk and exposure in modern cloud environments.

The model creation and training process includes:

AI-driven dataset generation: The dataset is generated using an AI-driven builder that combines human-curated queries, SysQL query templates, and the SysQL knowledge graph to produce realistic text-to-SysQL query pairs. This process creates a large, diverse dataset split into training, validation, and test sets, enabling the model to learn how real users interact with security data.
Fine-tuning: A domain-specific large language model is fine-tuned using the generated training and validation datasets. With around 135k natural language questions and 34k SysQL queries, the model learns to translate user intent into precise, executable queries that align with the structure and semantics of the SysQL knowledge graph.
Model evaluation: The fine-tuned model is evaluated on both AI-generated and human-curated test datasets to ensure quality, accuracy, and alignment with real-world use cases. A data scientist reviews the model’s output to continuously refine performance and ensure that it meets the practical needs of cloud and container security professionals.
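The dataset-generation step can be pictured with a small Python sketch. It expands question and query templates over a list of entities and severities, then splits the resulting pairs into training, validation, and test sets. The templates, entity names, query syntax, and the 80/10/10 split are all assumptions made for illustration; the actual builder also draws on human-curated queries and the SysQL knowledge graph, which this sketch does not model.

```python
import random
from itertools import product

# Hypothetical templates: two phrasings map to the same query pattern,
# so the dataset ends up with more questions than distinct queries.
QUERY = ("MATCH {entity} AFFECTED_BY Vulnerability "
         "WHERE Vulnerability.severity = '{severity}' RETURN {entity}")
TEMPLATES = [
    ("Show {entity} with {severity} vulnerabilities", QUERY),
    ("List {entity} affected by {severity} vulnerabilities", QUERY),
]
ENTITIES = ["Workload", "Image", "Host"]          # placeholder entity names
SEVERITIES = ["Critical", "High", "Medium", "Low"]


def build_pairs():
    """Expand every template over every entity/severity combination."""
    pairs = []
    for (question, query), entity, severity in product(TEMPLATES, ENTITIES, SEVERITIES):
        pairs.append({
            "question": question.format(entity=entity.lower() + "s",
                                        severity=severity.lower()),
            "query": query.format(entity=entity, severity=severity),
        })
    return pairs


def split(pairs, seed=0, train=0.8, validation=0.1):
    """Shuffle reproducibly, then cut into train/validation/test lists."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


pairs = build_pairs()
train_set, validation_set, test_set = split(pairs)
print(len(pairs), len(train_set), len(validation_set), len(test_set))
```

Mapping several phrasings to one query pattern is one plausible reason a dataset would hold more questions than queries.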

Key advantages:

Domain specialization: By grounding the model in a security-specific query language, we get far better precision and relevance than general-purpose models provide.
High-quality training data: Combining human expertise with template-based generation ensures accuracy while scaling the dataset.
Test-time control: Evaluation on a curated “golden dataset” allows for consistent tracking of performance across iterations.
Explainability: SysQL’s structured format makes it easier to inspect, debug, and validate the generated queries — a critical feature in security applications.
Scalability: The templated and pattern-based query generation enables rapid adaptation to new schemas or data models.
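As a rough illustration of tracking performance on a golden dataset, the Python sketch below scores a model by exact match after whitespace normalization. The metric, the example questions, and the stand-in model outputs are assumptions for the example; the post does not say which metrics are actually used.

```python
def normalize(query: str) -> str:
    """Collapse whitespace so formatting differences do not count as errors."""
    return " ".join(query.split())


def exact_match_accuracy(golden, generate) -> float:
    """Fraction of golden questions whose generated query matches the expected one."""
    hits = sum(
        normalize(generate(question)) == normalize(expected)
        for question, expected in golden
    )
    return hits / len(golden)


# Hypothetical golden pairs and model outputs, for illustration only.
GOLDEN = [
    ("critical workloads", "MATCH Workload WHERE Workload.risk = 'Critical' RETURN Workload"),
    ("all images", "MATCH Image RETURN Image"),
    ("all hosts", "MATCH Host RETURN Host"),
    ("all clusters", "MATCH Cluster RETURN Cluster"),
]
MODEL_OUTPUT = {
    # Differs from the golden query only in whitespace, so it still counts.
    "critical workloads": "MATCH Workload  WHERE Workload.risk = 'Critical'\nRETURN Workload",
    "all images": "MATCH Image RETURN Image",
    # Returns a different field, so it counts as a miss.
    "all hosts": "MATCH Host RETURN Host.name",
    "all clusters": "MATCH Cluster RETURN Cluster",
}

accuracy = exact_match_accuracy(GOLDEN, MODEL_OUTPUT.get)
print(f"exact-match accuracy: {accuracy:.2f}")
```

Because the golden set stays fixed, the same number can be compared across model iterations.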

Why this matters:

Faster investigation: Analysts get to insights quickly, without needing to learn a query language.
Security-aware results: The model is purpose-built to understand cloud-native threats and relationships.
Trust through structure: Outputs are explainable and inspectable — critical in high-stakes security workflows.
Built to evolve: As cloud threats change, the model adapts, making our AI engine more resilient and future-proof.

Innovative inference pipeline

Sysdig’s inference pipeline combines the power of the custom LLM with a cybersecurity knowledge graph. This unique approach allows the engine to handle complex user requests, providing insights that go beyond surface-level analysis.

Once trained, our custom LLM becomes the heart of a real-time inference pipeline that transforms natural language into executable security graph queries. When a user asks a question such as “Show me the resources affected by critical vulnerabilities for clusters named ‘cluster-name’ and prioritize by number of packages in use,” the system first moderates the input, then passes it to a query generator that interprets intent and assembles a valid SysQL query.

We introduced a robust, multi-stage inference pipeline for translating natural language questions into formal SysQL queries. The system is designed to handle ambiguity, syntax errors, and unsupported query patterns using a combination of LLM generation, programmatic feedback loops, and semantic post-processing.

Pipeline overview

1. Moderation and filtering. The incoming question undergoes moderation to filter out:
   a. Out-of-context questions that do not pertain to the current application scope.
   b. In-context but unsupported queries due to system or schema limitations.
2. Initial query generation. The natural language question is passed to an LLM to generate a candidate SysQL query.
3. Syntactic validation and correction loop. The query is submitted to a SysQL interpreter. If it fails syntactic validation:
   a. The interpreter’s error message and relevant schema-level information (entities, relationships) are injected back into the LLM’s prompt.
   b. The LLM attempts to regenerate a corrected version of the query using this enriched context.
4. Semantic post-processing. Once a syntactically valid query is produced, semantic enumeration issues (e.g., column name ambiguity) are corrected using AI-driven techniques.
5. Iterative attempts. Steps 2–4 are repeated up to K times to generate multiple candidate queries. Between attempts:
   a. The system generates correlated suggestions (e.g., rephrasings or schema hints).
   b. These are used to re-prompt the fine-tuned LLM for another round of query generation.
6. Query refinements. A final refinement step selects the top-1 query from the K candidates and optimizes it for best alignment with the input question and schema context.
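The control flow of the pipeline can be sketched in Python as follows. Every component here is a stand-in: the LLM, the SysQL interpreter, the suggestion generator, and the scoring function are passed in as plain callables, and the names (Pipeline, max_repairs, fake_generate, and so on) are hypothetical. The sketch shows only how the stages fit together, under those assumptions, not how Sysdig implements them.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Pipeline:
    moderate: Callable[[str], bool]            # True when the question is in scope and supported
    generate: Callable[[str, str], str]        # (question, extra prompt context) -> candidate query
    validate: Callable[[str], Optional[str]]   # None when valid, otherwise an error message
    post_process: Callable[[str], str]         # semantic clean-up of a valid query
    suggest: Callable[[str, List[str]], str]   # hints used to re-prompt the next attempt
    score: Callable[[str, str], float]         # alignment between question and query
    k: int = 3                                 # number of candidate attempts
    max_repairs: int = 2                       # correction rounds per attempt

    def run(self, question: str) -> Optional[str]:
        # Moderation and filtering: reject out-of-scope or unsupported questions.
        if not self.moderate(question):
            return None
        candidates: List[str] = []
        context = ""
        for _ in range(self.k):
            # Initial query generation.
            query = self.generate(question, context)
            # Syntactic validation and correction loop.
            error = self.validate(query)
            repairs = 0
            while error is not None and repairs < self.max_repairs:
                query = self.generate(question, f"{context}\nerror: {error}")
                error = self.validate(query)
                repairs += 1
            # Semantic post-processing, only for syntactically valid queries.
            if error is None:
                candidates.append(self.post_process(query))
            # Suggestions feed the next attempt.
            context = self.suggest(question, candidates)
        if not candidates:
            return None
        # Final refinement: keep the best-scoring candidate.
        return max(candidates, key=lambda q: self.score(question, q))


def fake_generate(question: str, context: str) -> str:
    # Stand-in for the LLM: the first draft is broken, the repaired draft is valid.
    if "error:" in context:
        return "MATCH Workload AFFECTED_BY Vulnerability RETURN Workload"
    return "MATCH Workload AFFECTED_BY Vulnerability RETURN"


def fake_validate(query: str) -> Optional[str]:
    # Stand-in for the interpreter: a trailing RETURN with no field is an error.
    return "RETURN needs at least one field" if query.endswith("RETURN") else None


pipeline = Pipeline(
    moderate=lambda q: "vulnerab" in q.lower(),
    generate=fake_generate,
    validate=fake_validate,
    post_process=str.strip,
    suggest=lambda question, candidates: "",
    score=lambda question, query: -len(query),
)
print(pipeline.run("Which workloads have critical vulnerabilities?"))
print(pipeline.run("What is the weather today?"))
```

Keeping each stage behind a callable makes it easy to swap a stub for a real model or interpreter when testing the loop in isolation.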

Key advantages of the inference pipeline

Resilient to imperfect input: Recovers from incomplete or imprecise questions through layered correction mechanisms.
Accurate query generation: Validates and refines generated queries to ensure they are executable and relevant.
Domain-aware interpretation: Leverages cloud security context to resolve entities, relationships, and fields correctly.
Iterative refinement: Applies structured fallbacks to maximize success without user intervention.
Seamless integration: Connects directly with the backend search engine, delivering results with minimal latency.
Explainability: Produces queries that are transparent and inspectable, enabling trust in automated results.

Why this matters

Robustness: The system can gracefully handle ambiguity or partially formed inputs.
Precision at scale: Multiple validation and correction steps ensure queries are accurate and meaningful.
Adaptability: The pipeline improves over time as new security concepts and query patterns emerge.
Low friction: Users get high-quality results without needing to reformulate their questions.

Seamless integration

The AI-powered search engine is deeply integrated into Sysdig’s platform, particularly within the assistant chat experiences. This integration turns complex cloud and Kubernetes data into accessible, conversational insights, reducing friction and helping users to take action faster.

Embedded where it matters

Whether you’re navigating through Sysdig’s Search UI or interacting with the assistant chat, the AI engine is always on hand to:

Understand natural language questions
Translate them into precise SysQL queries
Return curated, structured answers directly from your environment

This tight integration means security teams can explore their infrastructure and risk landscape without needing to learn a query language or dig through multiple dashboards.

Why this matters

This integration bridges the gap between raw cloud data and the people who need to act on it. By enabling searchable, explainable, and actionable insights, it transforms how teams approach:

Cloud and Kubernetes resource inspection
Vulnerability triage and remediation
Security posture monitoring
Inventory and asset visibility

What the AI assistant can do

The assistant is not a static chatbot; it’s a domain-aware security search interface capable of:

Analyzing cloud and Kubernetes resources across AWS, GCP, and Azure
Surfacing security insights, like failed controls, risky configurations, and exposed assets
Executing and explaining SysQL queries
Summarizing infrastructure inventory by resource type, account, or region
Guiding investigations through follow-up suggestions and contextual understanding

In short, it turns security and operations questions into actions with context, precision, and zero manual digging.

Real-world use case

Scenario

Meet Alex, a cloud security analyst at a mid-sized enterprise. It’s Monday morning, and she’s reviewing the company’s weekly vulnerability status.

Alex starts with the familiar search bar interface and types:

“Show in-use vulns with fix available for more than 30 days.”

Immediately, she sees a list of known vulnerabilities that haven’t been remediated in time: a red flag. One entry catches her eye: CVE-2025-22871.

Instead of switching tools or hunting for documentation, Alex clicks into the Sysdig Sage chat:

“Which workloads are affected by CVE-2025-22871?”

The assistant quickly responds with a breakdown of affected workloads. Alex clicks on the provided link to view them in detail; no need to write complex queries or dig through dashboards.

Then, she asks:

“Tell me more about the coredns workload.”

The assistant pulls up contextual information: the version in use, recent changes, risk exposure, and even deployment timelines.

Next, she digs deeper:

“Can you explain this query?”

Rather than just throwing a SysQL query at her, the assistant offers a step-by-step explanation of how the results were derived, improving Alex’s confidence in the data.

Finally, Alex asks:

“Should I fix this CVE?”

The assistant assesses risk based on exploitability, exposure time, and availability of a fix. It offers a clear recommendation: “Yes. This CVE is actively exploitable, and a patch is available. Delaying may increase your risk posture.”

Outcome and value

By the end of this short interaction, Alex has:

Identified a critical vulnerability
Understood its impact across workloads
Interpreted the technical details of the query
Received a prioritized recommendation for action

All of this has been accomplished without needing to know the underlying query language.

Outperforming the competition

We evaluated Sysdig’s AI-powered graph search against a leading competing solution in the domain of cloud and Kubernetes security (CSPM/KSPM/VM/Inventory). Both systems aim to translate user intent expressed in natural language into structured, executable queries over a security graph.

The competing system clearly demonstrates strong engineering and domain modeling. It produces syntactically valid queries, often capturing a significant portion of the user intent. However, our analysis shows that Sysdig Sage delivers a fundamentally more accurate and operationally effective solution, due to key differences in semantic understanding, entity modeling, and query construction.

Entity modeling aligned with operational reality

Search with Sysdig Sage consistently identifies and models the correct primary entity based on the user’s question. For instance, when the question is about “workloads affected by critical vulnerabilities,” Sysdig Sage returns the workloads as the main result, with related vulnerability data attached. The competing system, while often able to recognize relevant concepts (e.g., vulnerabilities, findings), tends to orient the query around the wrong object, such as findings or container images, leading to mismatches between the intent and the output. We suspect this difference stems from our opinionated concept modeling.

Accurate graph traversal and contextual filtering

Search with Sysdig Sage constructs precise graph traversal paths that align with actual runtime relationships: workloads to containers to images to vulnerabilities. It also supports contextual filtering on fields like exposure, region, fix availability, and usage status.

In contrast, the competing system often introduces semantically ambiguous or unnecessary intermediate steps. While these queries are syntactically sound, they do not always reflect real-world deployment semantics (e.g., distinguishing between an image in a registry and a container running in production).
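The traversal path described in this section (workloads to containers to images to vulnerabilities) can be illustrated with a tiny in-memory graph in Python. The node names, vulnerability identifiers, and attributes are invented for the example, and the dictionary-based graph is only a stand-in for a real graph datastore.

```python
# Hypothetical graph: each node maps to the nodes it points at along the
# runtime chain workload -> container -> image -> vulnerability.
GRAPH = {
    "workload:checkout": ["container:checkout-1"],
    "container:checkout-1": ["image:checkout"],
    "image:checkout": ["vuln:A", "vuln:B"],
    "workload:batch": ["container:batch-1"],
    "container:batch-1": ["image:batch"],
    "image:batch": ["vuln:C"],
}

# Hypothetical vulnerability attributes used for contextual filtering.
VULNS = {
    "vuln:A": {"severity": "critical", "fix_available": True},
    "vuln:B": {"severity": "critical", "fix_available": False},
    "vuln:C": {"severity": "low", "fix_available": True},
}


def vulnerabilities_for(workload):
    """Follow workload -> container -> image -> vulnerability edges."""
    found = []
    for container in GRAPH.get(workload, []):
        for image in GRAPH.get(container, []):
            found.extend(GRAPH.get(image, []))
    return found


def affected_workloads(predicate):
    """Return workloads as the primary entity, with matching vulnerabilities attached."""
    result = {}
    for node in GRAPH:
        if not node.startswith("workload:"):
            continue
        matching = [v for v in vulnerabilities_for(node) if predicate(VULNS[v])]
        if matching:
            result[node] = matching
    return result


fixable_critical = affected_workloads(
    lambda v: v["severity"] == "critical" and v["fix_available"]
)
print(fixable_critical)
```

Because the traversal starts from running workloads, an image that sits in a registry without a container pointing at it never shows up in the result.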

Aggregation, sorting, and limiting built-in

Security investigations often require more than just listing resources: Users frequently want aggregated summaries, ordered risk prioritization, or scoped subsets (e.g., “top 5,” “grouped by region,” etc.). Sysdig Sage fully supports these constructs directly in the initial query.

By comparison, the competing system often omits GROUP BY, ORDER BY, or LIMIT logic, requiring manual editing or post-processing to make the results usable.
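To show what grouping, ordering, and limiting add to a plain listing, the following Python sketch answers a question like “top 2 regions by number of critical findings” over a handful of made-up records. The record fields and values are assumptions for illustration; in the product these operations are expressed in the generated query rather than in client code.

```python
from collections import Counter

# Hypothetical findings, one record per affected workload.
FINDINGS = [
    {"workload": "checkout", "region": "us-east-1", "severity": "critical"},
    {"workload": "cart", "region": "us-east-1", "severity": "critical"},
    {"workload": "search", "region": "us-east-1", "severity": "critical"},
    {"workload": "billing", "region": "eu-west-1", "severity": "critical"},
    {"workload": "reports", "region": "eu-west-1", "severity": "critical"},
    {"workload": "batch", "region": "ap-south-1", "severity": "critical"},
    {"workload": "docs", "region": "ap-south-1", "severity": "high"},
]


def top_regions(findings, severity, n):
    """Group by region, order by count descending, and keep the first n."""
    counts = Counter(f["region"] for f in findings if f["severity"] == severity)
    return counts.most_common(n)


top_two = top_regions(FINDINGS, "critical", 2)
print(top_two)
```

Without the grouping and limit, the same question would return six rows that the analyst would have to count and sort by hand.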

Better alignment between input and output

One of the core strengths of search with Sysdig Sage is preserving semantic symmetry: The structure and content of the output precisely match the intent of the question. Whether a user asks about specific CVEs, workloads, clusters, or vulnerable images, Sysdig Sage ensures that the query and its results remain centered on that concept.

In multiple test cases, the competing system returned results that diverged from the expected shape — for example, returning only vulnerability IDs when the user asked for affected resources, or missing key attributes such as cluster or namespace names.

Consistency and executability

Every Sysdig Sage-generated query tested was immediately executable and returned valid, meaningful results. This highlights a key differentiator: Sysdig Sage not only generates correct syntax, but also operationally accurate semantics, ensuring reliable answers to real-world security questions.

Summary of findings

Aspect	Sysdig Sage	Competing System
Primary entity focus	Correctly modeled (e.g., workload, resource)	Sometimes misaligned
Graph traversal semantics	Accurate and minimal	Often redundant or imprecise
Aggregation and grouping	Fully supported	Often absent or requires edits
Sorting and top-N queries	Supported and accurate	Frequently missing
Result shape	Matches user intent	Partial or misaligned
Manual refinement needed	Rarely	Frequently

Final thoughts

We acknowledge the strong capabilities of the competing system, which has pushed the field forward and set a solid baseline for natural language querying in the security domain. Their work has contributed meaningfully to reducing the gap between natural language and structured security insights.

That said, we believe Sysdig Sage offers a fundamentally better solution: one that more accurately understands user intent, translates that intent into executable graph queries, and delivers immediately useful, scoped, and actionable results, all without the need for manual correction or refinement.

Outpace cloud threats with Sysdig Sage

Sysdig’s AI-based search engine is more than just a tool. It’s a game-changer for the cybersecurity industry. By combining cutting-edge AI with deep domain expertise, Sysdig helps professionals to stay ahead of threats and make smarter decisions. Experience the future of cybersecurity search with Sysdig Sage. Request a personalized demo!
