CSAF interface for ingestion #1137

olav-gb · 2025-08-14T13:50:37Z

What

Adds a general CSAF interface to increase usability of ingested data.
Adds a small doc to give some general insights about what a callee may need to consider when dealing with CSAFs.

Why

We ingest CSAFs from various vendors at vt-dev, possibly helpful for other teams as well.
Does not intend to "interpret" how specific vendors may or may not use it, this remains in the domain (and responsibility) of the callee code. Especially w.r.t. non-compliancy of sub-structures.
Does not intend to be complete, current implementation represents the degree of completeness required to deal with the vendors we are already using and actively investigating beyond that.

References

https://docs.oasis-open.org/csaf/csaf/v2.0/os/csaf-v2.0-os.html

Checklist

Tests: Written for models and the main interfaces

github-actions · 2025-08-14T13:51:09Z

Conventional Commits Report

😢 No conventional commits found.

👉 Learn more about the conventional commits usage at Greenbone.

codecov · 2025-08-14T14:00:45Z

Codecov Report

❌ Patch coverage is 81.12583% with 57 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.95%. Comparing base (e8850f2) to head (7d2a542).
⚠️ Report is 2 commits behind head on main.

Files with missing lines	Patch %	Lines
pontos/csaf/_csaf.py	74.82%	25 Missing and 10 partials ⚠️
pontos/csaf/models/relationship.py	81.53%	6 Missing and 6 partials ⚠️
pontos/csaf/models/revision.py	69.23%	4 Missing ⚠️
pontos/csaf/_utils.py	50.00%	1 Missing and 2 partials ⚠️
pontos/csaf/models/vulnerability.py	93.18%	2 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1137      +/-   ##
==========================================
- Coverage   90.32%   89.95%   -0.37%     
==========================================
  Files         111      117       +6     
  Lines        7315     7617     +302     
  Branches      840      898      +58     
==========================================
+ Hits         6607     6852     +245     
- Misses        500      538      +38     
- Partials      208      227      +19

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

pontos/csaf/_utils.py

bjoernricks · 2025-08-14T15:40:58Z

pontos/csaf/models/relationship.py

+from pontos.csaf import RelationshipCategory
+
+
+class Relationship(dict):


I am really not sure if extending dict is a good idea. Instead I would derive from one of the collection ABCs https://docs.python.org/3/library/collections.abc.html#collections-abstract-base-classes. Most likely Mapping

The reasons I went with dict, as well as not further restring what is and what is not used are the following:

The resp. input data structures (Relationship, Revision, CSAF, and many others) are explicitly a dictionary when ingesting the CSAF JSON. I.e., this choice more explicitly ingests the data in its original form.

The callee may need access to more rarely used structures that may or may not exist at various location of the CSAF (depending on the vendor's implementation choices/degree of compliancy etc.).

This code does not intend to cover all the potential locations, resp. sub-structures and fields via individual classes and methods. Doing so would require more interpretation about how the vendor uses the CSAF, which is (i) complex in terms of implementation that works for all CSAF inputs as expected, (ii) would result in producing significantly higher parsing overhead due to the amount of checks to be done to ensure correctness across all vendors, (iii) break for certain non-compliant vendors.

Due to (i) the variety in how the data interpreter may need to use it, providing the dict's own methods seems to me a greater benefit than explicitly denying them.

Similarly, aspects like equality should be exactly equal when the original ingested data, i.e., the JSON dict, is equal. This is explicitly covered by deriving from dict.

Feel free to look at the previously linked specification, to consider the implementation effort to cover everything of potential need, even if all vendors were compliant.

Note that:

The callee already has complete access to the raw data, given that the callee provides the raw CSAF json (published openly accessible by the resp. vendor) that this code then aims to make more easily usable. Thus, he already has complete control and responsibility about the data. I.e., the data provider and user, in our use case at least, are the same.

Some of these points are also explicitly documented in the CSAF class.

Just don't derive from dict. If you have unknown user specific data, store it in your class and restrict the access. Quick example:

class Something: def __init__(self, *, known_prop: str, other_data: dict[str, Any]) -> None: self._data = other_data self.know_prop = known_prop def __getitem__(self, key: string) -> Any: return self._data[key]

If you derive from something you earn all the interfaces from this API. Always restrict your interfaces.

Hopefully done. In the end, the user already has explicit access to the relevant sub-dicts (product_tree, vulnerabilities) via the resp. properties. I minimized the interfaces wherever possible. Not sure yet, if it'll require an extension in the future, but at least the Csaf class itself doesn't inherit all interfaces.

pontos/csaf/models/revision.py

bjoernricks

Hi, some additional notes on my comments.

Please provide Model classes similar to something like https://github.com/greenbone/pontos/blob/main/pontos/nvd/models/cpe.py (or alternatively you could use pydantic) for defining your models. When using dicts as your model it's very very difficult to know which properties are actually available, required and which data type they use. Therefore always define all necessary properties on the model. If you are really really sure you want a dict only, please take a look at TypedDict. Models are an intermediate layer between the original data and your API. They allow to validate and transform the original data. API users should not be required to know any details of the internals of a Model. Even if they are documented.

Please always also define the shape of a dict when using it as an argument. For example dict[str, str]. For API users it's hard to guess what's actually expected. if it's json structure then we should have some type somewhere in pontos for that too.

olav-gb · 2025-08-15T07:44:04Z

Models are an intermediate layer between the original data and your API. They allow to validate and transform the original data.

Validation in this case is (i) highly complex due to variability in the original specification, and (ii) will result in inability to ingest certain vendor's CSAFs that are non-compliant with the standard in at least certain sub-structures. Transformation of the security advisory data itself is not really the goal we intend for the API, instead, this is handled by those callees that provide and in turn interpret the data on a per-vendor basis. For some vendors, getting the needed data out is almost trivial and would not require this model, for others, it becomes too complex and case-specific leading to high parsing overhead for any other vendor without the ability to provide guarantees of success.

CSAFs are security advisories; in the end, the callee still has to decide what exact information he needs. Especially with regards to the product_tree, i.e., descriptions of the vulnerable/ fixed products, the formatting of the relevant data is highly vendor specific. E.g., in case of several Linux distros, going purely by the CSAF specification, will not allow the user to actually retrieve the vulnerable packages because they are stated as product_id, which simply is a unique string identifier in this document, that may be "0", may be an exact package name with version and architecture, or anything in between.

API users should not be required to know any details of the internals of a Model.

As a result, the callee will inevitably have to consider what exact data the vendor provides at which places. This API simply makes accessing and using some of the data that remains important across vendors easier. Artificially restricting the remaining data would require the callee to keep both the original raw data and the resp. model in the same context, significantly reducing readability and usability.

olav-gb · 2025-08-15T07:47:48Z

There already is a PyPI for pure validation of the data structure:
https://codes.dilettant.life/docs/csaf/index.html

That is already > 2k lines in code purely for verification, and it's not even a complete implementation.

olav-gb · 2025-08-15T08:35:08Z

For context, we would use this in our notus-generator repo and likely will go the route of separating the complex data interpretation cases from our current repo in a separate one; for the "trivial" interpretation cases, the code of this PR is sufficient by adding minimal per-vendor interpretation for in each of those generators. Depending on the vendor, different fields of the CSAFs are accessible or not and needed or not, since the vast majority are technically non-mandatory and thus up to the vendor to decide how to use. Historically, the minimal parsing and interpretation was sufficient, but for some vendors we would like to cover, this no longer holds, requiring more explicit handling of this complex data structure.

The generators ingests the raw security advisories (raw website/CSAF/CVRF/...) from the resp. vendor's public API/..., and do the needed interpretation for each vendor to in turn to then produce the .notus files (advisories and vulnerable product specifications) ingested by notus-scanner.

olav-gb · 2025-08-18T12:54:54Z

Removed inheritance to the best degree possible, clarified typing to the best degree possible/that seems reasonable without violating the specification.

I've taken a second close look at the Model class. Feel free to correct me if I'm wrong, but given that each of the CSAF classes is really just a dictionary handed over from the CSAF, it doesn't seem to make sense to provide and parse all (potential) fields for a dedicated and potentially long __init__(). From the current use case perspective, we only ingest CSAF jsons, i.e., we have the complete dict at hand at once, instead of for producing them, for which there might be caching/ varying constructors by the callee for each of the sub-contents in each dict instead. If we were to produce CSAFs, then more explicit __init__() functions with resp. Model subtyping would seem reasonable to me.

amy-gb · 2025-10-10T07:40:59Z

We've decided to close this since it will be more efficient to implement it internally, and there also doesn't seem to be much interest inn using it across teams. Thank you @bjoernricks and @y0urself for your time!

olav-gb requested a review from a team as a code owner August 14, 2025 13:50

greenbonebot enabled auto-merge (rebase) August 14, 2025 13:50

bjoernricks reviewed Aug 14, 2025

View reviewed changes

pontos/csaf/_utils.py Outdated Show resolved Hide resolved

bjoernricks reviewed Aug 14, 2025

View reviewed changes

pontos/csaf/models/revision.py Show resolved Hide resolved

bjoernricks reviewed Aug 15, 2025

View reviewed changes

olav-gb requested a review from bjoernricks September 8, 2025 09:49

olav-gb added 4 commits October 2, 2025 07:08

Initial CSAF commit

565b04b

Backports type-hinting, I001

8210796

Fixes: I001

7ebf340

Fixes: CSAF Typing, interface-minimization

7d2a542

y0urself force-pushed the vt-dev/csaf branch from 6f5e395 to 7d2a542 Compare October 2, 2025 05:08

y0urself requested a review from a team as a code owner October 2, 2025 05:08

amy-gb closed this Oct 10, 2025

auto-merge was automatically disabled October 10, 2025 07:40
Pull request was closed

		from pontos.csaf import RelationshipCategory


		class Relationship(dict):

CSAF interface for ingestion #1137

CSAF interface for ingestion #1137

Uh oh!

Conversation

olav-gb commented Aug 14, 2025

What

Why

References

Checklist

Uh oh!

github-actions bot commented Aug 14, 2025

Conventional Commits Report

Uh oh!

codecov bot commented Aug 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

bjoernricks Aug 14, 2025

Choose a reason for hiding this comment

Uh oh!

olav-gb Aug 15, 2025

Choose a reason for hiding this comment

Uh oh!

bjoernricks Aug 15, 2025

Choose a reason for hiding this comment

Uh oh!

olav-gb Aug 18, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

bjoernricks left a comment

Choose a reason for hiding this comment

Uh oh!

olav-gb commented Aug 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

olav-gb commented Aug 15, 2025

Uh oh!

olav-gb commented Aug 15, 2025

Uh oh!

olav-gb commented Aug 18, 2025

Uh oh!

amy-gb commented Oct 10, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codecov bot commented Aug 14, 2025 •

edited

Loading

olav-gb commented Aug 15, 2025 •

edited

Loading