Skip to content

Conversation

@JimFuller-RedHat
Copy link
Collaborator

@JimFuller-RedHat JimFuller-RedHat commented Oct 27, 2025

What

Adds a data subagent that performs data analysis.

  • added an optional data_agent which is used by data_critic tool (which uses cve.CVEDataCritic feature)
  • added AEGIS_DATA_ENABLED,AEGIS_DATA_LLM_HOST, AEGIS_DATA_LLM_MODEL, AEGIS_DATA_OPENAPI_KEY env var (recc. using gemini-2.5-flash-lite)
  • added AEGIS_AGENT_MAX_RETRIES env var for generic setting on all agents
  • minor changes

Note: set DATA llm agent env vars to use something like gemini-2.5-flash-lite as data analysis should be very quick.

Why

Understanding data quality improves assessment of confidence in analysis. It can also directly help analysis by choosing more generic analysis where data quality is low.

How to Test

Related Tickets

https://issues.redhat.com/browse/AEGIS-74

Summary by Sourcery

Introduce a new data critic subagent and toolset for quantitative data quality analysis, integrate it into the CVE feature pipeline, adjust configuration and schema for numeric data_quality, and update documentation and logging support.

New Features:

  • Add CVEDataCritic feature to perform quantitative data quality assessments
  • Introduce a dedicated data_critic toolset and data subagent for standalone data analysis

Enhancements:

  • Integrate the data critic tool into CVE analysis workflows to factor data quality into confidence scoring
  • Change the data_quality metadata field to a 0.0–1.0 numeric scale and update feature schema accordingly
  • Enhance LoggingToolset to support argument suppression and apply it to the data toolset

Documentation:

  • Add environment variables for enabling data critic and configuring the data LLM
  • Update CHANGELOG to document the new data_quality assessment feature

Summary by Sourcery

Introduce a new data critic subagent with its own toolset and agent to perform quantitative data quality analysis, integrate it into the CVE feature pipeline, migrate the data_quality field to a numeric scale, add environment-driven configuration, and refine logging and documentation.

New Features:

  • Add CVEDataCritic feature for quantitative CVE data quality assessment
  • Introduce a dedicated data critic toolset and separate agent for standalone data analysis

Enhancements:

  • Integrate the data critic tool into CVE analysis workflows to factor data quality into confidence scoring
  • Convert the data_quality metadata field to a numeric 0.0–1.0 scale and update the feature schema accordingly
  • Enhance LoggingToolset to support argument suppression for cleaner tool call logs

Documentation:

  • Add environment variables for enabling the data critic subagent and configuring the data LLM host, model, and API key
  • Update CHANGELOG to document the new data_quality assessment feature

@JimFuller-RedHat JimFuller-RedHat self-assigned this Oct 27, 2025
@JimFuller-RedHat JimFuller-RedHat marked this pull request as draft October 27, 2025 07:41
Copy link

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey there - I've reviewed your changes - here's some feedback:

  • Changing data_quality from a Literal enum to a float will break any downstream consumers expecting the old values; please add a migration path or compatibility layer.
  • The data_toolset is always wrapped and injected even when AEGIS_DATA_ENABLED is false, resulting in an empty toolset in agents; consider conditionally registering and wrapping it only when enabled.
  • You’ve duplicated the ‘use data_critic tool’ instructions in both SuggestImpact and SuggestRemediation—extract that into a shared directive or helper to avoid drift.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- Changing data_quality from a Literal enum to a float will break any downstream consumers expecting the old values; please add a migration path or compatibility layer.
- The data_toolset is always wrapped and injected even when AEGIS_DATA_ENABLED is false, resulting in an empty toolset in agents; consider conditionally registering and wrapping it only when enabled.
- You’ve duplicated the ‘use data_critic tool’ instructions in both SuggestImpact and SuggestRemediation—extract that into a shared directive or helper to avoid drift.

## Individual Comments

### Comment 1
<location> `src/aegis_ai/toolsets/tools/data_critic/__init__.py:22-28` </location>
<code_context>
+        return await cwe_data_critic.exec(
+            cve_data.cve_id, static_context=cve_data.model_dump()
+        )
+    except ValueError:
+        return CVEDataCriticToolInput(status="failure", **cve_data)
+
+
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Catching only ValueError may miss other exceptions during analysis.

Consider catching additional exception types or logging unexpected errors to improve robustness.

```suggestion
    try:
        cwe_data_critic = cve.CVEDataCritic(agent=data_critic_agent)
        return await cwe_data_critic.exec(
            cve_data.cve_id, static_context=cve_data.model_dump()
        )
    except ValueError:
        return CVEDataCriticToolInput(status="failure", **cve_data)
    except Exception as exc:
        logger.exception("Unexpected error during CVE data analysis")
        return CVEDataCriticToolInput(status="failure", **cve_data)
```
</issue_to_address>

### Comment 2
<location> `src/aegis_ai/toolsets/tools/data_critic/__init__.py:28` </location>
<code_context>
+            cve_data.cve_id, static_context=cve_data.model_dump()
+        )
+    except ValueError:
+        return CVEDataCriticToolInput(status="failure", **cve_data)
+
+
</code_context>

<issue_to_address>
**issue (bug_risk):** Returning CVEDataCriticToolInput on failure may not match expected output schema.

Returning CVEDataCriticToolInput instead of CVEDataCriticOutput on error may cause type mismatches. Please return a default CVEDataCriticOutput with an error status.
</issue_to_address>

### Comment 3
<location> `src/aegis_ai/agents/data.py:16-25` </location>
<code_context>
+        "seed": 42,
+        "response_format": {"type": "json_object"},
+    },
+    system_prompt="""
+    You are a Red Hat product security assistant assessing data quality.
+    Goals: Perform a fast and concise analysis of data quality.
+    Rules:
+    - Prefer facts over speculation; cite only provided context/tools.
+    - Never use any tools.
+    - Keep answers short and directly useful.
+    - Output must match the requested JSON schema when provided.
+    Safety: refuse harmful or unethical requests.
+    """,
+)
</code_context>

<issue_to_address>
**question:** System prompt prohibits tool usage, but agent is used for tool-based analysis.

Please update the prompt or agent configuration to avoid conflicting instructions and ensure consistent behavior.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

@JimFuller-RedHat JimFuller-RedHat force-pushed the data-critic-third branch 4 times, most recently from e6157ac to e1262b7 Compare October 27, 2025 09:20
@JimFuller-RedHat JimFuller-RedHat marked this pull request as ready for review October 27, 2025 09:20
Copy link

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `src/aegis_ai/toolsets/tools/data_critic/__init__.py:27-36` </location>
<code_context>
+    pass
+
+
+async def analyse_cve_data(cve_id: CVEID, cve_data: Any) -> CVEDataCriticOutput:
+    from aegis_ai.agents.data import default as data_critic_agent
+
+    try:
+        cwe_data_critic = cve.CVEDataCritic(agent=data_critic_agent)
+        return await cwe_data_critic.exec(
+            cve_data.cve_id, static_context=cve_data.model_dump()
+        )
+    except Exception:
+        logger.warning("Unexpected error during CVE data analysis.")
+        return CVEDataCriticOutput(
+            data_quality=0,
+            confidence=0,
</code_context>

<issue_to_address>
**suggestion:** Exception handling in analyse_cve_data may mask underlying errors.

Log the exception details or stack trace to improve error traceability and simplify debugging.

```suggestion
    except Exception as exc:
        logger.exception("Unexpected error during CVE data analysis.")
        return CVEDataCriticOutput(
            data_quality=0,
            confidence=0,
            tools_used=[],
            cve_id=cve_id,
            disclaimer=CVEDataCriticOutput.disclaimer,
            explanation=f"Unexpected error during CVE data analysis: {exc}",
        )
```
</issue_to_address>

### Comment 2
<location> `src/aegis_ai/features/cve/__init__.py:24` </location>
<code_context>
+class CVEDataCritic(Feature):
+    """Based on current CVE information and context assess quality of CVE information."""
+
+    async def exec(self, cve_id: CVEID, static_context: Any = None, usage=None):
+        prompt = AegisPrompt(
+            user_instruction="Your task is to examine the provided CVE data and generate a fast and brief data assessment of its quality, confidence, completeness and internal consistency.",
</code_context>

<issue_to_address>
**question:** The exec method signature for CVEDataCritic differs from other Feature classes.

If the usage parameter is required for agent compatibility, document this difference and notify all consumers. Otherwise, align the method signature with other Feature classes for consistency.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

@JimFuller-RedHat
Copy link
Collaborator Author

/retest

@JimFuller-RedHat
Copy link
Collaborator Author

JimFuller-RedHat commented Oct 27, 2025

when this is enabled:

AEGIS_DATA_ENABLED="true"
AEGIS_DATA_LLM_HOST="https://generativelanguage.googleapis.com"
AEGIS_DATA_LLM_MODEL="gemini-2.5-flash-lite"
AEGIS_DATA_OPENAPI_KEY=INSERTAPIKEY

and eval tests are run - they timeout due to call to data subagent (which is set to use gemini-2.5-flash-lite) - for reasons unknown - otherwise works with normal cli invoke.

@kdudka
Copy link
Collaborator

kdudka commented Oct 27, 2025

and eval tests are run - they timeout due to call to data subagent (which is set to use gemini-2.5-flash-lite) - for reasons unknown - otherwise works with normal cli invoke.

The reason is that the run_if_safe() method in the Feature class is not reentrant. When you reach the function the second time, the llm_sem semaphore is already entered, which may cause indefinite blocking. This can be easily reproduced if you export AEGIS_LLM_MAX_JOBS=1. Then each call to data_critic_cve_data_critic() is terminated by the timeout.

@JimFuller-RedHat
Copy link
Collaborator Author

ah thx @kdudka - will fix

)

# chain logging wrappers
data_toolset = LoggingToolset(data_toolset, no_args=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of no_args=True does not take any effect because WrapperToolset is a data class and you set the flag on a different object than you read it. The following patch makes it work as expected:

--- a/src/aegis_ai/toolsets/__init__.py
+++ b/src/aegis_ai/toolsets/__init__.py
@@ -7,6 +7,8 @@ import os
 import logging
 import time

+from dataclasses import dataclass
+
 from pydantic_ai.common_tools.tavily import tavily_search_tool
 from pydantic_ai.mcp import MCPServerStdio
 from pydantic_ai.toolsets import FunctionToolset, CombinedToolset
@@ -38,7 +40,10 @@ from aegis_ai.toolsets.tools.data_critic import data_critic_toolset
 logger = logging.getLogger(__name__)


+@dataclass
 class LoggingToolset(WrapperToolset):
+    no_args: bool
+
     def __init__(self, wrapped, no_args: bool = False):
         super().__init__(wrapped)
         self.no_args: bool = no_args

I also suggest to apply the following patch to make the code readable:

--- a/src/aegis_ai/toolsets/__init__.py
+++ b/src/aegis_ai/toolsets/__init__.py
@@ -42,18 +42,16 @@ logger = logging.getLogger(__name__)

 @dataclass
 class LoggingToolset(WrapperToolset):
-    no_args: bool
+    print_args: bool

-    def __init__(self, wrapped, no_args: bool = False):
+    def __init__(self, wrapped, print_args: bool = True):
         super().__init__(wrapped)
-        self.no_args: bool = no_args
+        self.print_args = print_args

     async def call_tool(self, name: str, tool_args: dict, ctx: RunContext, tool):  # type: ignore[override]
         # log tool call entry
-        args = str(tool_args) if tool_args else ""
+        args = str(tool_args) if self.print_args and tool_args else ""
         prefix = f"[tool call] {name}({args})"
-        if self.no_args:
-            prefix = f"[tool call] {name}()"
         start = time.time()
         logger.info(f"{prefix} started")

@@ -221,7 +219,7 @@ if use_data_tools in truthy:
     )

 # chain logging wrappers
-data_toolset = LoggingToolset(data_toolset, no_args=True)
+data_toolset = LoggingToolset(data_toolset, print_args=False)
 public_toolset = LoggingToolset(public_toolset)
 redhat_cve_toolset = LoggingToolset(redhat_cve_toolset)
 public_cve_toolset = LoggingToolset(public_cve_toolset)

@xquery
Copy link

xquery commented Oct 31, 2025

@sourcery-ai resolve

@sourcery-ai
Copy link

sourcery-ai bot commented Oct 31, 2025

Hi @xquery! 👋

Only authors and team members can run @sourcery-ai commands on public repos.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants