-
Notifications
You must be signed in to change notification settings - Fork 13
CSAF interface for ingestion #1137
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Conventional Commits Report😢 No conventional commits found. 👉 Learn more about the conventional commits usage at Greenbone. |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #1137 +/- ##
==========================================
- Coverage 90.32% 89.95% -0.37%
==========================================
Files 111 117 +6
Lines 7315 7617 +302
Branches 840 898 +58
==========================================
+ Hits 6607 6852 +245
- Misses 500 538 +38
- Partials 208 227 +19 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
pontos/csaf/models/relationship.py
Outdated
from pontos.csaf import RelationshipCategory | ||
|
||
|
||
class Relationship(dict): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am really not sure if extending dict is a good idea. Instead I would derive from one of the collection ABCs https://docs.python.org/3/library/collections.abc.html#collections-abstract-base-classes. Most likely Mapping
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reasons I went with dict, as well as not further restring what is and what is not used are the following:
- The resp. input data structures (Relationship, Revision, CSAF, and many others) are explicitly a dictionary when ingesting the CSAF JSON. I.e., this choice more explicitly ingests the data in its original form.
- The callee may need access to more rarely used structures that may or may not exist at various location of the CSAF (depending on the vendor's implementation choices/degree of compliancy etc.).
- This code does not intend to cover all the potential locations, resp. sub-structures and fields via individual classes and methods. Doing so would require more interpretation about how the vendor uses the CSAF, which is (i) complex in terms of implementation that works for all CSAF inputs as expected, (ii) would result in producing significantly higher parsing overhead due to the amount of checks to be done to ensure correctness across all vendors, (iii) break for certain non-compliant vendors.
- Due to (i) the variety in how the data interpreter may need to use it, providing the dict's own methods seems to me a greater benefit than explicitly denying them.
- Similarly, aspects like equality should be exactly equal when the original ingested data, i.e., the JSON dict, is equal. This is explicitly covered by deriving from dict.
- Feel free to look at the previously linked specification, to consider the implementation effort to cover everything of potential need, even if all vendors were compliant.
Note that:
- The callee already has complete access to the raw data, given that the callee provides the raw CSAF json (published openly accessible by the resp. vendor) that this code then aims to make more easily usable. Thus, he already has complete control and responsibility about the data. I.e., the data provider and user, in our use case at least, are the same.
Some of these points are also explicitly documented in the CSAF
class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just don't derive from dict. If you have unknown user specific data, store it in your class and restrict the access. Quick example:
class Something:
def __init__(self, *, known_prop: str, other_data: dict[str, Any]) -> None:
self._data = other_data
self.know_prop = known_prop
def __getitem__(self, key: string) -> Any:
return self._data[key]
If you derive from something you earn all the interfaces from this API. Always restrict your interfaces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hopefully done. In the end, the user already has explicit access to the relevant sub-dicts (product_tree, vulnerabilities) via the resp. properties. I minimized the interfaces wherever possible. Not sure yet, if it'll require an extension in the future, but at least the Csaf class itself doesn't inherit all interfaces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, some additional notes on my comments.
Please provide Model classes similar to something like https://github.com/greenbone/pontos/blob/main/pontos/nvd/models/cpe.py (or alternatively you could use pydantic) for defining your models. When using dicts as your model it's very very difficult to know which properties are actually available, required and which data type they use. Therefore always define all necessary properties on the model. If you are really really sure you want a dict only, please take a look at TypedDict. Models are an intermediate layer between the original data and your API. They allow to validate and transform the original data. API users should not be required to know any details of the internals of a Model. Even if they are documented.
Please always also define the shape of a dict when using it as an argument. For example dict[str, str]
. For API users it's hard to guess what's actually expected. if it's json structure then we should have some type somewhere in pontos for that too.
Validation in this case is (i) highly complex due to variability in the original specification, and (ii) will result in inability to ingest certain vendor's CSAFs that are non-compliant with the standard in at least certain sub-structures. Transformation of the security advisory data itself is not really the goal we intend for the API, instead, this is handled by those callees that provide and in turn interpret the data on a per-vendor basis. For some vendors, getting the needed data out is almost trivial and would not require this model, for others, it becomes too complex and case-specific leading to high parsing overhead for any other vendor without the ability to provide guarantees of success. CSAFs are security advisories; in the end, the callee still has to decide what exact information he needs. Especially with regards to the
As a result, the callee will inevitably have to consider what exact data the vendor provides at which places. This API simply makes accessing and using some of the data that remains important across vendors easier. Artificially restricting the remaining data would require the callee to keep both the original raw data and the resp. model in the same context, significantly reducing readability and usability. |
There already is a PyPI for pure validation of the data structure: That is already > 2k lines in code purely for verification, and it's not even a complete implementation. |
For context, we would use this in our The generators ingests the raw security advisories (raw website/CSAF/CVRF/...) from the resp. vendor's public API/..., and do the needed interpretation for each vendor to in turn to then produce the |
Removed inheritance to the best degree possible, clarified typing to the best degree possible/that seems reasonable without violating the specification. I've taken a second close look at the |
We've decided to close this since it will be more efficient to implement it internally, and there also doesn't seem to be much interest inn using it across teams. Thank you @bjoernricks and @y0urself for your time! |
Pull request was closed
What
Why
References
Checklist