data_critic: add subagent feature that performs data analysis #300

JimFuller-RedHat · 2025-10-27T07:41:25Z

What

Adds a data subagent that performs data analysis.

added an optional data_agent which is used by data_critic tool (which uses cve.CVEDataCritic feature)
added AEGIS_DATA_ENABLED,AEGIS_DATA_LLM_HOST, AEGIS_DATA_LLM_MODEL, AEGIS_DATA_OPENAPI_KEY env var (recc. using gemini-2.5-flash-lite)
added AEGIS_AGENT_MAX_RETRIES env var for generic setting on all agents
minor changes

Note: set DATA llm agent env vars to use something like gemini-2.5-flash-lite as data analysis should be very quick.

Why

Understanding data quality improves assessment of confidence in analysis. It can also directly help analysis by choosing more generic analysis where data quality is low.

How to Test

Related Tickets

https://issues.redhat.com/browse/AEGIS-74

Summary by Sourcery

Introduce a new data critic subagent and toolset for quantitative data quality analysis, integrate it into the CVE feature pipeline, adjust configuration and schema for numeric data_quality, and update documentation and logging support.

New Features:

Add CVEDataCritic feature to perform quantitative data quality assessments
Introduce a dedicated data_critic toolset and data subagent for standalone data analysis

Enhancements:

Integrate the data critic tool into CVE analysis workflows to factor data quality into confidence scoring
Change the data_quality metadata field to a 0.0–1.0 numeric scale and update feature schema accordingly
Enhance LoggingToolset to support argument suppression and apply it to the data toolset

Documentation:

Add environment variables for enabling data critic and configuring the data LLM
Update CHANGELOG to document the new data_quality assessment feature

Summary by Sourcery

Introduce a new data critic subagent with its own toolset and agent to perform quantitative data quality analysis, integrate it into the CVE feature pipeline, migrate the data_quality field to a numeric scale, add environment-driven configuration, and refine logging and documentation.

New Features:

Add CVEDataCritic feature for quantitative CVE data quality assessment
Introduce a dedicated data critic toolset and separate agent for standalone data analysis

Enhancements:

Integrate the data critic tool into CVE analysis workflows to factor data quality into confidence scoring
Convert the data_quality metadata field to a numeric 0.0–1.0 scale and update the feature schema accordingly
Enhance LoggingToolset to support argument suppression for cleaner tool call logs

Documentation:

Add environment variables for enabling the data critic subagent and configuring the data LLM host, model, and API key
Update CHANGELOG to document the new data_quality assessment feature

sourcery-ai

Hey there - I've reviewed your changes - here's some feedback:

Changing data_quality from a Literal enum to a float will break any downstream consumers expecting the old values; please add a migration path or compatibility layer.
The data_toolset is always wrapped and injected even when AEGIS_DATA_ENABLED is false, resulting in an empty toolset in agents; consider conditionally registering and wrapping it only when enabled.
You’ve duplicated the ‘use data_critic tool’ instructions in both SuggestImpact and SuggestRemediation—extract that into a shared directive or helper to avoid drift.

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- Changing data_quality from a Literal enum to a float will break any downstream consumers expecting the old values; please add a migration path or compatibility layer.
- The data_toolset is always wrapped and injected even when AEGIS_DATA_ENABLED is false, resulting in an empty toolset in agents; consider conditionally registering and wrapping it only when enabled.
- You’ve duplicated the ‘use data_critic tool’ instructions in both SuggestImpact and SuggestRemediation—extract that into a shared directive or helper to avoid drift.

## Individual Comments

### Comment 1
<location> `src/aegis_ai/toolsets/tools/data_critic/__init__.py:22-28` </location>
<code_context>
+        return await cwe_data_critic.exec(
+            cve_data.cve_id, static_context=cve_data.model_dump()
+        )
+    except ValueError:
+        return CVEDataCriticToolInput(status="failure", **cve_data)
+
+
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Catching only ValueError may miss other exceptions during analysis.

Consider catching additional exception types or logging unexpected errors to improve robustness.

```suggestion
    try:
        cwe_data_critic = cve.CVEDataCritic(agent=data_critic_agent)
        return await cwe_data_critic.exec(
            cve_data.cve_id, static_context=cve_data.model_dump()
        )
    except ValueError:
        return CVEDataCriticToolInput(status="failure", **cve_data)
    except Exception as exc:
        logger.exception("Unexpected error during CVE data analysis")
        return CVEDataCriticToolInput(status="failure", **cve_data)
```
</issue_to_address>

### Comment 2
<location> `src/aegis_ai/toolsets/tools/data_critic/__init__.py:28` </location>
<code_context>
+            cve_data.cve_id, static_context=cve_data.model_dump()
+        )
+    except ValueError:
+        return CVEDataCriticToolInput(status="failure", **cve_data)
+
+
</code_context>

<issue_to_address>
**issue (bug_risk):** Returning CVEDataCriticToolInput on failure may not match expected output schema.

Returning CVEDataCriticToolInput instead of CVEDataCriticOutput on error may cause type mismatches. Please return a default CVEDataCriticOutput with an error status.
</issue_to_address>

### Comment 3
<location> `src/aegis_ai/agents/data.py:16-25` </location>
<code_context>
+        "seed": 42,
+        "response_format": {"type": "json_object"},
+    },
+    system_prompt="""
+    You are a Red Hat product security assistant assessing data quality.
+    Goals: Perform a fast and concise analysis of data quality.
+    Rules:
+    - Prefer facts over speculation; cite only provided context/tools.
+    - Never use any tools.
+    - Keep answers short and directly useful.
+    - Output must match the requested JSON schema when provided.
+    Safety: refuse harmful or unethical requests.
+    """,
+)
</code_context>

<issue_to_address>
**question:** System prompt prohibits tool usage, but agent is used for tool-based analysis.

Please update the prompt or agent configuration to avoid conflicting instructions and ensure consistent behavior.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

src/aegis_ai/toolsets/tools/data_critic/__init__.py

src/aegis_ai/agents/data.py

sourcery-ai

Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents

Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `src/aegis_ai/toolsets/tools/data_critic/__init__.py:27-36` </location>
<code_context>
+    pass
+
+
+async def analyse_cve_data(cve_id: CVEID, cve_data: Any) -> CVEDataCriticOutput:
+    from aegis_ai.agents.data import default as data_critic_agent
+
+    try:
+        cwe_data_critic = cve.CVEDataCritic(agent=data_critic_agent)
+        return await cwe_data_critic.exec(
+            cve_data.cve_id, static_context=cve_data.model_dump()
+        )
+    except Exception:
+        logger.warning("Unexpected error during CVE data analysis.")
+        return CVEDataCriticOutput(
+            data_quality=0,
+            confidence=0,
</code_context>

<issue_to_address>
**suggestion:** Exception handling in analyse_cve_data may mask underlying errors.

Log the exception details or stack trace to improve error traceability and simplify debugging.

```suggestion
    except Exception as exc:
        logger.exception("Unexpected error during CVE data analysis.")
        return CVEDataCriticOutput(
            data_quality=0,
            confidence=0,
            tools_used=[],
            cve_id=cve_id,
            disclaimer=CVEDataCriticOutput.disclaimer,
            explanation=f"Unexpected error during CVE data analysis: {exc}",
        )
```
</issue_to_address>

### Comment 2
<location> `src/aegis_ai/features/cve/__init__.py:24` </location>
<code_context>
+class CVEDataCritic(Feature):
+    """Based on current CVE information and context assess quality of CVE information."""
+
+    async def exec(self, cve_id: CVEID, static_context: Any = None, usage=None):
+        prompt = AegisPrompt(
+            user_instruction="Your task is to examine the provided CVE data and generate a fast and brief data assessment of its quality, confidence, completeness and internal consistency.",
</code_context>

<issue_to_address>
**question:** The exec method signature for CVEDataCritic differs from other Feature classes.

If the usage parameter is required for agent compatibility, document this difference and notify all consumers. Otherwise, align the method signature with other Feature classes for consistency.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

src/aegis_ai/toolsets/tools/data_critic/__init__.py

src/aegis_ai/features/cve/__init__.py

JimFuller-RedHat · 2025-10-27T10:28:15Z

/retest

JimFuller-RedHat · 2025-10-27T11:51:11Z

when this is enabled:

AEGIS_DATA_ENABLED="true"
AEGIS_DATA_LLM_HOST="https://generativelanguage.googleapis.com"
AEGIS_DATA_LLM_MODEL="gemini-2.5-flash-lite"
AEGIS_DATA_OPENAPI_KEY=INSERTAPIKEY

and eval tests are run - they timeout due to call to data subagent (which is set to use gemini-2.5-flash-lite) - for reasons unknown - otherwise works with normal cli invoke.

kdudka · 2025-10-27T15:24:07Z

and eval tests are run - they timeout due to call to data subagent (which is set to use gemini-2.5-flash-lite) - for reasons unknown - otherwise works with normal cli invoke.

The reason is that the run_if_safe() method in the Feature class is not reentrant. When you reach the function the second time, the llm_sem semaphore is already entered, which may cause indefinite blocking. This can be easily reproduced if you export AEGIS_LLM_MAX_JOBS=1. Then each call to data_critic_cve_data_critic() is terminated by the timeout.

JimFuller-RedHat · 2025-10-27T15:25:17Z

ah thx @kdudka - will fix

src/aegis_ai/toolsets/tools/data_critic/__init__.py

kdudka · 2025-10-27T16:15:29Z

src/aegis_ai/toolsets/__init__.py

+    )
+
 # chain logging wrappers
+data_toolset = LoggingToolset(data_toolset, no_args=True)


The use of no_args=True does not take any effect because WrapperToolset is a data class and you set the flag on a different object than you read it. The following patch makes it work as expected:

--- a/src/aegis_ai/toolsets/__init__.py +++ b/src/aegis_ai/toolsets/__init__.py @@ -7,6 +7,8 @@ import os import logging import time +from dataclasses import dataclass + from pydantic_ai.common_tools.tavily import tavily_search_tool from pydantic_ai.mcp import MCPServerStdio from pydantic_ai.toolsets import FunctionToolset, CombinedToolset @@ -38,7 +40,10 @@ from aegis_ai.toolsets.tools.data_critic import data_critic_toolset logger = logging.getLogger(__name__) +@dataclass class LoggingToolset(WrapperToolset): + no_args: bool + def __init__(self, wrapped, no_args: bool = False): super().__init__(wrapped) self.no_args: bool = no_args

I also suggest to apply the following patch to make the code readable:

--- a/src/aegis_ai/toolsets/__init__.py +++ b/src/aegis_ai/toolsets/__init__.py @@ -42,18 +42,16 @@ logger = logging.getLogger(__name__) @dataclass class LoggingToolset(WrapperToolset): - no_args: bool + print_args: bool - def __init__(self, wrapped, no_args: bool = False): + def __init__(self, wrapped, print_args: bool = True): super().__init__(wrapped) - self.no_args: bool = no_args + self.print_args = print_args async def call_tool(self, name: str, tool_args: dict, ctx: RunContext, tool): # type: ignore[override] # log tool call entry - args = str(tool_args) if tool_args else "" + args = str(tool_args) if self.print_args and tool_args else "" prefix = f"[tool call] {name}({args})" - if self.no_args: - prefix = f"[tool call] {name}()" start = time.time() logger.info(f"{prefix} started") @@ -221,7 +219,7 @@ if use_data_tools in truthy: ) # chain logging wrappers -data_toolset = LoggingToolset(data_toolset, no_args=True) +data_toolset = LoggingToolset(data_toolset, print_args=False) public_toolset = LoggingToolset(public_toolset) redhat_cve_toolset = LoggingToolset(redhat_cve_toolset) public_cve_toolset = LoggingToolset(public_cve_toolset)

xquery · 2025-10-31T10:59:09Z

@sourcery-ai resolve

sourcery-ai · 2025-10-31T10:59:12Z

Hi @xquery! 👋

Only authors and team members can run @sourcery-ai commands on public repos.

JimFuller-RedHat self-assigned this Oct 27, 2025

JimFuller-RedHat marked this pull request as draft October 27, 2025 07:41

sourcery-ai bot reviewed Oct 27, 2025

View reviewed changes

src/aegis_ai/toolsets/tools/data_critic/__init__.py Outdated Show resolved Hide resolved

src/aegis_ai/toolsets/tools/data_critic/__init__.py Outdated Show resolved Hide resolved

src/aegis_ai/agents/data.py Show resolved Hide resolved

JimFuller-RedHat force-pushed the data-critic-third branch 4 times, most recently from e6157ac to e1262b7 Compare October 27, 2025 09:20

JimFuller-RedHat requested a review from kdudka October 27, 2025 09:20

JimFuller-RedHat marked this pull request as ready for review October 27, 2025 09:20

sourcery-ai bot reviewed Oct 27, 2025

View reviewed changes

src/aegis_ai/toolsets/tools/data_critic/__init__.py Show resolved Hide resolved

src/aegis_ai/features/cve/__init__.py Show resolved Hide resolved

data_critic: add data subagent feature that performs data analysis

876b200

JimFuller-RedHat force-pushed the data-critic-third branch from e1262b7 to 876b200 Compare October 27, 2025 09:22

kdudka requested changes Oct 27, 2025

View reviewed changes

data_critic: add subagent feature that performs data analysis #300

Are you sure you want to change the base?

data_critic: add subagent feature that performs data analysis #300

Uh oh!

Conversation

JimFuller-RedHat commented Oct 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Why

How to Test

Related Tickets

Summary by Sourcery

Summary by Sourcery

Uh oh!

sourcery-ai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sourcery-ai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

JimFuller-RedHat commented Oct 27, 2025

Uh oh!

JimFuller-RedHat commented Oct 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kdudka commented Oct 27, 2025

Uh oh!

JimFuller-RedHat commented Oct 27, 2025

Uh oh!

Uh oh!

kdudka Oct 27, 2025

Choose a reason for hiding this comment

Uh oh!

xquery commented Oct 31, 2025

Uh oh!

sourcery-ai bot commented Oct 31, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

JimFuller-RedHat commented Oct 27, 2025 •

edited

Loading

JimFuller-RedHat commented Oct 27, 2025 •

edited

Loading