RFC: Ontology Support in Cartography #1579

jychp · 2025-05-21T13:02:42Z

jychp
May 21, 2025
Collaborator

version 1.1 (2025/06/01)

Abstract

This RFC proposes introducing a lightweight ontology system into Cartography to enable cross-module analysis, improve semantic interoperability, and preserve the tool’s existing speed and flexibility. It builds on two concepts that partially exist today but are neither formalized nor widely adopted: semantic labels (e.g., :Database, :ComputeInstance) and abstract identity nodes (e.g., :Human, :IP). These additions aim to provide a shared semantic layer without compromising the simplicity and modularity that make Cartography effective.

Motivation & Goals

Cartography’s current design prioritizes fidelity to source data models. This has enabled fast module development and easy onboarding for contributors. However, the lack of a shared schema or unified identity layer creates friction when trying to:

Analyze access relationships across multiple cloud providers
Correlate equivalent entities represented differently by various data sources (e.g., users, IP addresses)
Scale the platform with consistent semantics as new modules are introduced

A lightweight ontology would introduce just enough structure to support higher-level analysis—without compromising the current ease of use or development velocity. This approach maintains simplicity for both contributors and users while enabling more powerful and semantically rich queries when needed.

Goals

Introduce minimal, non-breaking changes to Cartography’s schema and ingestion logic

Enable high-level semantic queries, such as:

(:Human {email: "xx"})-[:USES]->(:User)-[:HAS_ACCESS]->(:ComputeInstance)
(:ComputeInstance)-[:STORES_DATA_IN]->(:DataStore)
(:DNS)-[:RESOLVES_TO]->(:IP)

Make it easier to reason across modules and navigate the graph at a conceptual level
Preserve explicit, contributor-friendly configuration and logic to keep onboarding simple and transparent

Non-Goals

Replace existing per-module schemas or enforce a strict global schema
Integrate a full reasoning engine or introduce heavyweight schema validation

Related subjects

Proposed Design

To avoid the pitfalls of inconsistent naming and to maximize interoperability with other tools, naming will follow the MITRE D3FEND taxonomy (whenever possible): https://d3fend.mitre.org/dao/artifact/d3f:User/

Throughout this RFC, the Human node is used as an example for clarity, since it already exists. However, this node is not part of the D3FEND ontology, which instead uses the concept of a User (with an AWSUser considered an Account). Using User instead of Human aligns with the D3FEND taxonomy and also allows the ontology to be implemented without impacting the current behavior of the tool—preserving full backward compatibility.

1. Semantic Labels

Cartography’s custom ORM already supports additional labels via the extra_labels parameter. This proposal mainly focuses on defining and documenting these semantic labels, and gradually updating existing modules to include them. Since these labels are additive, the change is fully backward-compatible.

Nodes will be allowed to carry extra semantic labels that represent broader categories or abstract types.

Example:

labels:
  - AzureCosmosDBSQLDatabase
  - SQLDatabase
  - Datastore

This behaves similarly to type inheritance in Python, enabling semantic queries like:

MATCH (n:SQLDatabase) RETURN n

across providers and modules.
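
The inheritance analogy can be made concrete with a small standalone Python sketch. The classes below are purely illustrative (they are not Cartography models), and `labels_for` is a hypothetical helper: it derives a label list from a class hierarchy, most specific first, which gives the same shape as the label list above.

```python
class Datastore:
    """Broadest category: anything that stores data."""


class SQLDatabase(Datastore):
    """Intermediate category: SQL-compatible databases."""


class AzureCosmosDBSQLDatabase(SQLDatabase):
    """Module-specific node type."""


def labels_for(node_type: type) -> list[str]:
    """Return the node's labels, most specific first, by walking the class hierarchy."""
    return [cls.__name__ for cls in node_type.__mro__ if cls is not object]


print(labels_for(AzureCosmosDBSQLDatabase))
# ['AzureCosmosDBSQLDatabase', 'SQLDatabase', 'Datastore']
```

Just as `isinstance(x, SQLDatabase)` matches every subclass, `MATCH (n:SQLDatabase)` matches every node carrying that label, whichever module created it.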

2. Abstract Nodes

Why Abstract Nodes (vs. Just Labels)?

In many cases, a single real-world entity—or closely related entities—may be represented across multiple modules. In such cases, semantic labels alone are not sufficient.

To borrow a Python analogy:

A semantic label is like a class name.
An abstract node is more like a shared instance that links multiple representations of the same entity.

In other words, semantic labels define a type, while abstract nodes provide a common anchor to connect multiple module-specific instances of that type.

Examples of Abstract Nodes

:Human (used to link identities across systems)
:IP
:CVE
:Group
:UserDevice
:Secret

Illustrative Example:

A GitHubUser and a GSuiteUser node may both carry the :User label, allowing a query like:
```
MATCH (u:User) RETURN u
```
But this does not tell us if these two nodes refer to the same person.

Introducing a shared :Human node allows this kind of query:

MATCH (:Human {email: "j.doe@example.com"})-[:USES]->(u:User) RETURN u

While one might consider enforcing a shared field (e.g. email) across all User nodes to enable this kind of join, doing so would impose global schema requirements, which goes against Cartography’s flexible, module-driven design.

Instead, abstract nodes act as virtual anchors. Each module can map its entities to them explicitly, without schema enforcement.
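
As a rough in-memory illustration of the anchor idea (no Neo4j involved), the sketch below groups module-specific user nodes under one Human per email. `LINK_FIELDS`, `link_users_to_humans`, the property names, and the choice to lowercase emails before comparing them are all assumptions made for the example, not part of the proposal.

```python
from collections import defaultdict

# Hypothetical per-module mapping: which property of each module-specific
# node holds the value used to link it to a Human (here, an email address).
LINK_FIELDS = {
    "GitHubUser": "email",
    "GSuiteUser": "primary_email",
}


def link_users_to_humans(users: list[dict]) -> dict[str, list[str]]:
    """Group module-specific user nodes under one Human anchor per email.

    Returns a mapping of Human email -> ids of the User nodes it USES.
    Nodes without a usable email are skipped rather than guessed at.
    """
    humans: dict[str, list[str]] = defaultdict(list)
    for user in users:
        field = LINK_FIELDS.get(user["label"])
        email = user.get(field) if field else None
        if email:
            humans[email.lower()].append(user["id"])
    return dict(humans)


users = [
    {"label": "GitHubUser", "id": "gh-1", "email": "j.doe@example.com"},
    {"label": "GSuiteUser", "id": "gs-7", "primary_email": "J.Doe@example.com"},
    {"label": "GitHubUser", "id": "gh-2", "email": None},
]
print(link_users_to_humans(users))
# {'j.doe@example.com': ['gh-1', 'gs-7']}
```

Each module keeps its own property name; only the mapping knows how to reach the shared anchor, so no field is imposed on the User nodes themselves.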

Ontology Node Creation

Any module that interacts with ontology nodes should be technically capable of creating them. However, when several modules could create the same type of ontology node, the user should explicitly control, via configuration, which of them actually does.

This allows users to define the source of truth for each type of abstract node.

Examples:

For :Human, it makes sense to allow only identity provider modules (e.g., Okta, GSuite) to create these nodes, while other modules simply link to them.
For :Secret, creation should be restricted to trusted sources (e.g., OpenAI, HashiCorp Vault), not consumers (e.g., GitHub Secrets module).
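
The create-versus-link rule described in these examples could be expressed as a simple lookup. This is only a sketch: `CREATION_POLICY`, `allowed_action`, and the module names are hypothetical and stand in for whatever configuration format is eventually chosen.

```python
# Hypothetical user configuration: for each ontology node type, the modules
# trusted to create it. Any other module may only link to existing nodes.
CREATION_POLICY: dict[str, set[str]] = {
    "Human": {"okta", "gsuite"},
    "Secret": {"vault"},
}


def allowed_action(module: str, node_type: str) -> str:
    """Return 'create' for trusted sources and 'link' for everyone else."""
    trusted = CREATION_POLICY.get(node_type, set())
    return "create" if module in trusted else "link"


print(allowed_action("okta", "Human"))     # create
print(allowed_action("github", "Human"))   # link
print(allowed_action("github", "Secret"))  # link
```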

Ontology Node Structure

Each ontology node should include an additional Ontology label to support query filtering (e.g., to query only ontology nodes or, conversely, only 'real' nodes), and should follow a minimal and consistent structure.

id: Unique identifier
firstseen: Timestamp
lastupdated: Update Tag
created_by: Module name responsible for creation

Fields that contain identifying values (e.g., Human.email) should be minimal and defined explicitly. Mappings to those fields should be handled automatically based on the trusted source module.

Note: The extra label enables ontology-specific cleanup processes without impacting regular nodes.
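
To make the structure above easier to picture, here is a minimal sketch of an ontology node as a plain Python dataclass, together with the two Cypher filters that the extra `Ontology` label enables. The `OntologyNode` class and the sample values are illustrative assumptions; an actual implementation would go through Cartography's schema classes rather than a dataclass like this.

```python
from dataclasses import dataclass, field


@dataclass
class OntologyNode:
    """Minimal shape of an ontology node, following the field list above."""

    id: str            # unique identifier (e.g. the email for a Human)
    firstseen: int     # timestamp of first ingestion
    lastupdated: int   # update tag of the sync that last touched the node
    created_by: str    # module responsible for creation
    labels: list[str] = field(default_factory=lambda: ["Ontology"])


# Cypher filters made possible by the extra label.
ONLY_ONTOLOGY_NODES = "MATCH (n:Ontology) RETURN n"
ONLY_REAL_NODES = "MATCH (n) WHERE NOT n:Ontology RETURN n"

human = OntologyNode(
    id="j.doe@example.com",
    firstseen=1748736000,
    lastupdated=1748736000,
    created_by="okta",
    labels=["Human", "Ontology"],
)
print(human.labels)  # ['Human', 'Ontology']
```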

Proposed Implementation

Since the extra_labels mechanism is already supported by Cartography’s ORM, and its use in an ontological model is more about documentation and naming conventions than technical implementation, this section focuses on implementing abstract nodes.

The goals of this implementation are:

Enable easy linkage between module-specific nodes and abstract ontology nodes.
Make ontology relationships explicit and self-contained within each model.
Enforce ontology node creation rights to trusted, user-configured sources.

Automatic Linking to Ontology Nodes

To keep modules simple, ontology management should be centralized and externalized from individual modules. This approach offers several benefits:

Keeps logic simpler at the module level
Users relying on a single module aren’t forced to deal with higher-level design decisions
Enables a single, centralized ontology definition

All components of the ORM can be reused to support this approach. Ontology nodes can be defined under models/ontology, including their mappings. For example, for a Human node in the ontology, it’s possible to:

Define a corresponding CartographyNodeProperties
Define a pair of CartographyRelProperties for each link to a real-world object (e.g., AnthropicUser, AWSUser, etc.)
Define a CartographyNodeSchema with the extra label Ontology, and declare all relationships in the other_relationships property

Relationships between ontology nodes can follow the MITRE D3FEND taxonomy

The current cleanup mechanism should be reviewed and tested to ensure that links between ontology nodes and real nodes don’t prevent proper deletion of real nodes

The NodeSchema dataclass already provides everything needed to define ontology nodes. A derived class could be introduced for typing clarity, but no additional logic is required.

Ontology creation and cleanup should be handled in a dedicated module (similar to how the analysis module works). This ensures clear separation from other modules and allows the ontology to remain optional while it's still experimental.

This module should run just before the analysis phase, ensuring that all ontology nodes are available and can be leveraged during analysis.

CLI Support for Source of Truth

A new CLI parameter will allow users to explicitly define which modules are trusted to create ontology nodes:

--source-of-truth=okta,azuread,gsuite
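
A minimal sketch of how such a flag could be parsed with the standard `argparse` module. The program name and the `parse_source_of_truth` helper are hypothetical, and this is not Cartography's actual CLI code; it only shows the comma-separated value being turned into an ordered list of module names.

```python
import argparse


def parse_source_of_truth(value: str) -> list[str]:
    """Split a comma-separated module list, dropping blanks and duplicates."""
    modules: list[str] = []
    for name in value.split(","):
        name = name.strip().lower()
        if name and name not in modules:
            modules.append(name)
    return modules


parser = argparse.ArgumentParser(prog="cartography-sketch")
parser.add_argument(
    "--source-of-truth",
    type=parse_source_of_truth,
    default=[],
    help="Comma-separated modules trusted to create ontology nodes.",
)

args = parser.parse_args(["--source-of-truth=okta,azuread,gsuite"])
print(args.source_of_truth)  # ['okta', 'azuread', 'gsuite']
```

Keeping the order of the list leaves room for treating earlier modules as higher priority, should that ever be needed.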

Since ontology node ingestion happens in a dedicated module rather than within each individual module, the full set of available ORM utilities can be reused.
The process can follow a simple pattern:

For each module listed in the source of truth, a database query is executed to retrieve existing nodes (this can be done using a straightforward mapping)
The retrieved node list is optionally transformed, then used to create the corresponding abstract nodes

Here’s a naive example for the Human ontology node:

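from typing import Any, Dict

import neo4j

from cartography.client.core.tx import load
from cartography.util import timeit

# HumanOntologySchema is the CartographyNodeSchema proposed above; it would be defined under models/ontology.

# For each module, the node label to read from and the source property behind each ontology field.
MAPPING = {
    'okta': {
        'humans': {
            'node_label': 'OktaUser',
            'fields': {
                'email': 'email',
            },
        },
    },
    'otherModule': {
        'humans': {
            'node_label': 'OtherUser',
            'fields': {
                'email': 'login',
            },
        },
    },
}


@timeit
def sync(
    neo4j_session: neo4j.Session,
    update_tag: int,
    common_job_parameters: Dict[str, Any],
) -> None:
    # Assumption: the parsed --source-of-truth list is passed in through common_job_parameters.
    source_of_truth = common_job_parameters.get('source_of_truth', [])
    existing_users = get(neo4j_session, source_of_truth)
    load_humans(neo4j_session, existing_users, update_tag)


@timeit
def get(neo4j_session: neo4j.Session, source_of_truth: list[str]) -> list[dict]:
    result: list[dict] = []
    for intel in source_of_truth:
        # Modules without a mapping for this ontology node are skipped.
        module_mapping = MAPPING.get(intel, {}).get('humans')
        if module_mapping is None:
            continue
        node_label = module_mapping['node_label']
        email_field = module_mapping['fields']['email']
        # This part can easily be extended to propagate more fields (first_name, last_name, etc.).
        # This is a naive approach: a helper that builds this query and handles missing fields would probably be required.
        query = f"MATCH (n:{node_label}) RETURN n.{email_field} AS email"
        for record in neo4j_session.run(query):
            result.append({
                "email": record["email"],
            })
    return result


@timeit
def load_humans(
    neo4j_session: neo4j.Session,
    nodes: list[dict],
    update_tag: int,
) -> None:
    load(
        neo4j_session,
        HumanOntologySchema(),
        nodes,
        lastupdated=update_tag,
    )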
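
The comment in `get` notes that a query-building helper would probably be needed. One possible shape for it is sketched below, under stated assumptions: `build_ontology_query` is a hypothetical name, the identifier pattern is deliberately conservative, and treating the first mapped field as the identifying one is a simplification. Labels and property names are interpolated into the query text rather than passed as ordinary parameters, so the sketch validates them first.

```python
import re

# Conservative pattern for labels and property names that are safe to
# interpolate into a Cypher string.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_ontology_query(node_label: str, fields: dict[str, str]) -> str:
    """Build a MATCH query returning each source property under its ontology name.

    `fields` maps ontology field name -> source node property, as in MAPPING.
    """
    if not fields:
        raise ValueError("At least one field is required")
    for name in (node_label, *fields.keys(), *fields.values()):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"Unsafe identifier in ontology mapping: {name!r}")
    returns = ", ".join(f"n.{src} AS {dst}" for dst, src in fields.items())
    # Skip nodes where the first mapped field (the identifying one) is missing.
    first_src = next(iter(fields.values()))
    return f"MATCH (n:{node_label}) WHERE n.{first_src} IS NOT NULL RETURN {returns}"


print(build_ontology_query("OtherUser", {"email": "login"}))
# MATCH (n:OtherUser) WHERE n.login IS NOT NULL RETURN n.login AS email
```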

It’s likely worth defining the mapping in a YAML file that can be overridden by the user. This provides greater flexibility and allows for customization without modifying the core codebase. Especially when it comes to handling propagated fields, this approach allows users to customize which properties are carried over to the ontology nodes without hardcoding that logic.

This allows:

Centralized, zero-effort control of ontology creation policies.
No changes to existing ingestion scripts.
Clean and self-contained implementation logic.
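
A user-overridable mapping could be applied with a recursive merge along these lines. This is a sketch only: `DEFAULT_MAPPING`, `merge_mapping`, and the override contents are hypothetical, and reading the YAML file itself is left out; the override is shown as the dictionary it would produce once parsed.

```python
import copy

DEFAULT_MAPPING = {
    "okta": {"humans": {"node_label": "OktaUser", "fields": {"email": "email"}}},
}


def merge_mapping(default: dict, override: dict) -> dict:
    """Recursively overlay user-provided values on the default mapping."""
    merged = copy.deepcopy(default)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_mapping(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# What a user file might contain once parsed (YAML parsing itself is omitted).
user_override = {
    "okta": {"humans": {"fields": {"first_name": "first_name"}}},
    "otherModule": {"humans": {"node_label": "OtherUser", "fields": {"email": "login"}}},
}

mapping = merge_mapping(DEFAULT_MAPPING, user_override)
print(mapping["okta"]["humans"]["fields"])
# {'email': 'email', 'first_name': 'first_name'}
```

With this kind of merge, a user can add a propagated field or a whole new module without restating the defaults.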

Conclusion

Benefits

| Feature | Current | Proposed |
| --- | --- | --- |
| Fast module development | ✅ | ✅ |
| Schema fidelity | ✅ | ✅ |
| Cross-module queries | ❌ | ✅ |
| Contributor simplicity | ✅ | ✅ |
| Semantic reasoning | ❌ | ✅ |

By enabling abstract ontological linkage, we unlock the ability to:

Write cross-module queries (e.g. retrieving all users regardless of source),
Build richer visualizations and semantic aggregations,
Simplify downstream analytics and AI/ML applications that rely on shared concepts like Human, Asset, or Organization.

Drawbacks

Introduces slight complexity into the ingestion model, particularly around understanding when and how the ontology_node field is used.
Relies on contributors to use ontology_node and shared labels (:Human, :Asset, etc.) consistently.
- Misuse could lead to data fragmentation or incorrect assumptions in shared queries.

Alternatives Considered

| Approach | Reason for Rejection |
| --- | --- |
| Full Ontology (OWL, RDF, SHACL, etc.) | Too heavyweight, adds external complexity, steep learning curve, poor fit for Neo4j ingestion |
| Status Quo (No Ontology) | Limits data reusability across modules, prevents shared concepts, leads to siloed graphs |
| Centralized matchers / helpers | Adds hidden logic in ingestion code, moves ontology semantics out of models, harder to maintain |

Future Work

To realize this vision, the following steps are proposed:

Use :User as a pilot ontology node and migrate all modules producing human-like data (existing Human nodes will be kept for backward compatibility until a date that still needs to be specified)
Document the use of semantic labels (:Account, :Organization, etc.) and encourage their usage
Create and maintain a lightweight ontology documentation (e.g., as a markdown spec or graph visualization)
Expand the ontology support to other key concepts: Asset, Group, Device, etc.

Final words

This proposal introduces an elegant, low-overhead abstraction for semantic linkage across Cartography modules. By embedding ontology metadata directly in schemas, we:

Preserve Cartography’s declarative, modular design,
Enable cross-source data unification without redundant code,
Lay the groundwork for smarter queries, visualizations, and AI integrations.

The adoption effort is incremental and backwards-compatible. It starts with shared concepts like Human, but sets the stage for broader ontological cohesion across the Cartography graph.

d-aggarwal · 2025-05-24T07:41:37Z

d-aggarwal
May 24, 2025
Collaborator

I really appreciate the use of semantic labels and abstract node concepts—it makes cross-module reasoning much more intuitive and scalable.

One small question: Since email is used as the unique identifier for the Human node, how are potential conflicts handled when different sources provide differing data for the same email address?
For example, a person might have slightly different information across platforms, even though the email remains the same. Does _load_abstracted_nodes perform any sort of validation or merging in such cases, or is the latest write assumed to overwrite any previous data?

Just curious to understand the design decision here—thanks in advance!

jychp · 2025-05-24T08:06:40Z

jychp
May 24, 2025
Collaborator Author

@d-aggarwal this is exactly the kind of scenario the RFC addresses by introducing the concept of a "source of truth". Data is only synced from that single authoritative source.

At this stage, the plan is to create abstract nodes, like the Human, with minimal information, in this case just the email. That said, it's fairly straightforward to add more standard fields to the abstract node later on. I just haven’t had a strong use case for it yet, which is why I’ve left it out for now.

d-aggarwal · 2025-05-24T09:07:39Z

d-aggarwal
May 24, 2025
Collaborator

Thanks for the explanation! That makes sense—having a clear source of truth helps avoid conflicts.

chandanchowdhury · 2025-05-24T21:21:48Z

chandanchowdhury
May 24, 2025
Collaborator

@jychp Fantastic proposal :)

Do you think we can adopt MITRE D3FEND™ DAO, OCSF, or something else? Do you have a preference for one over the other?
It can help us avoid the hardest thing in computer science, i.e. naming things, while following an established industry standard.

Example: https://schema.ocsf.io/1.5.0/objects/user or https://d3fend.mitre.org/dao/artifact/d3f:User/

jychp · 2025-05-25T07:37:12Z

jychp
May 25, 2025
Collaborator Author

We should definitely rely on an existing framework for naming.

@chandanchowdhury yeah, I'm so excited about that :)

In a previous project I led, I used D3FEND, which was released during that time; OCSF wasn’t around yet.

I believe D3FEND is the stronger choice:

It’s more comprehensive (OCSF lacks references to cloud-native objects)
OCSF refers to D3FEND, so mapping to OCSF later will be easy if needed
It’s already graph-based by design
It opens the door to seamless integration with other MITRE frameworks (trust me, mapping attacks with TTPs in a D3FEND-compatible graph unlocks incredible possibilities)

jychp · 2025-05-30T17:46:56Z

jychp
May 30, 2025
Collaborator Author

Comment from @achantavy: for the abstract node, we can also move the mapping outside the module model definition to keep the module as simple as possible.

For example: we could define in the ontology that an AWSUser can be linked to a human using the email property, instead of embedding that logic directly in the module.

I actually really like this idea because:

it keeps things simpler at the module level
people who only want to use a single module aren’t burdened with higher-level design choices
we get a centralized ontology definition in one place

jychp · 2025-06-01T09:15:53Z

jychp
Jun 1, 2025
Collaborator Author

I've updated the RFC with your feedback, thanks!
This new version is simpler while delivering the same results, requires fewer code changes, and better aligns with the philosophy of the project.

jychp · 2025-06-15T12:47:34Z

jychp
Jun 15, 2025
Collaborator Author

Just pushed an early POC for that RFC: #1633
