Hi Modulus Community!
I want to propose that Modulus incorporate a core taxonomy and ontology for PDE data and model declarations.
Context
I’ve been working with various parts of NVIDIA Modulus and noticed that data compatibility between different operator-learning methods (FNO, AFNO, PINNs, etc.) often requires manual reformatting—resampling unstructured meshes onto grids, extracting point clouds, handling boundary data, etc. These steps are crucial but can be inconsistent or ad hoc. Moreover, for advanced AutoML or workflow pipelines, we lack a consistent way to match a given dataset to the right neural surrogate or automate conversions when needed.
Idea
Rather than building a large "modulus-transform" library in-house, I propose we define a Core Taxonomy & Ontology for describing PDE data—and expand that idea to let each model also declare its data requirements. This would be a minimal but standardized way for data sets to “announce”:
… plus other optional fields like boundary, is_transient, etc. (See the Appendix below for a detailed field listing.)
Key Motivations
AutoML Workflows: Make it easier to select the right model for a given dataset or PDE problem—potentially enabling a “descriptor → model” matching engine in the future.
Automated Transformations: We can enable automated transformations (e.g., unstructured mesh → uniform grid) by exposing an ontology interface that external libraries can plug into. This fosters more sophisticated workflows that build on Modulus.
Interoperability: Encourage clearer data definitions so diverse neural operator implementations (FNO, WNO, DiffusionNet, etc.) and research repositories can collaborate seamlessly.
Why This Matters
Clarity: Each dataset can include a small YAML/JSON descriptor summarizing how it’s structured (e.g., dimension, geometry type, boundary info).
Automatic Checking: If a user tries to feed an unstructured mesh into AFNO, Modulus can detect a mismatch and suggest external tools for mesh→grid conversions.
No Need for a Full Library: We simply define the interface—transformation tasks can be performed by external open-source projects (PyVista, VTK, Open3D) if desired.
Easier Model Selection: Each Modulus operator can declare a snippet of which data types it supports. If your dataset descriptor doesn’t match, you receive a clear alert (e.g., “Needs a 2D uniform grid”) or a pointer to an alternative model.
Friction Points So Far
Mismatch in Data Requirements: Some models need uniform grids (FNO, WNO), others want unstructured meshes (DiffusionNet), and others use collocation points (PINNs). Without explicit descriptors, errors or haphazard re-sampling scripts abound.
Manual, Ad-Hoc Conversions: We frequently code one-off transformations (mesh → grid, point → grid, etc.), guess boundary handling, and end up cluttering our workflow with repetitive tasks instead of focusing on PDE modeling.
Lack of Interoperability: Switching from one surrogate model to another is unclear if we don’t know which data conversions are feasible or necessary. You might have a robust unstructured mesh but discover the model needs a uniform 2D grid.
Impact on Experimentation & Collaboration: Without a formal descriptor stating dimension, geometry, and boundary flags, HPC teams or new collaborators struggle to replicate or extend existing projects. Even trying “the same dataset on two PDE surrogates” can become a hassle.
Envisioned Changes
Data Descriptor: A standard .yaml or .json file accompanies each dataset.
For example:
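As a hedged sketch, here is what such a descriptor could contain for a 3D unstructured surface mesh. It is shown as a Python dict for brevity (on disk it would be the equivalent `.yaml`/`.json` file), and all field names and values are illustrative, following the taxonomy proposed in the Appendix:

```python
# Hypothetical dataset descriptor (illustrative values; on disk this would
# be the equivalent .yaml or .json file).
descriptor_mesh = {
    "dimension": 3,
    "geometry_type": "mesh",        # "grid" | "mesh" | "point"
    "uniform": False,               # unstructured connectivity
    "cell_type": "triangle",
    "representation": {"vertices": "[N, 3]", "faces": "[M, 3]"},
    "boundary": True,               # boundary faces are labeled
    "channels": 4,                  # e.g., pressure + 3 velocity components
    "is_transient": False,
}
```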
Model Declarations: Each built-in operator (FNO, PINNs, etc.) has a compact snippet describing its accepted formats (e.g., “grid, uniform=true, dimension=2 or 3”).
Integrating External Tools: If mismatch occurs, you can invoke a minimal “adapter” or CLI hook (e.g., “Mesh to Uniform Grid” with PyVista). Modulus itself only interfaces with these transformations, so we avoid building or maintaining a massive transformation library in-house.
Benefits
Streamlined User Experience: No guesswork on array shapes or boundary labeling—datasets explicitly state their structure.
Multi-Model Experimentation: If you want to try AFNO and DiffusionNet on the same domain, you know what conversions (if any) are needed.
Reproducibility: PDE data sets shared with a descriptor remove ambiguity—everyone knows exactly how the geometry is stored.
Request & Next Steps
Feedback: Is this feasible, or do you see potential snags? Any must-have fields missing in the descriptor?
Suggestions: Are there library routines (PyVista/VTK/Open3D) worth referencing as “recommended” transforms?
Use Cases: Please share if you’ve faced painful integration or confusion about data formats in Modulus—that’s exactly what this proposal aims to fix.
Below, I’ll outline in the Appendix how a Core Taxonomy & Ontology can define the minimal fields for PDE data, along with short examples of model declarations and a “master table” of popular PDE surrogates. Let me know your thoughts on the approach!
Thanks!
Georg Maerz
Appendix
Table of Contents:
1. Motivation, Use Cases, and Benefits
2. Core Idea and Proposed Solution Outline
   - 2.A. The Concept: “Descriptors” for Datasets and Models
   - 2.B. Model Declarations: “Accepted Formats”
   - 2.C. Benefits of This Approach
   - 2.D. Implementation Sketch
3. Deep-Dive on Taxonomy & Ontology
   - 3.1. Taxonomy Fields
   - 3.2. Data Representation
   - 3.3. External Tools: “Adapters”
4. Overview of Model Classification (Master Table of Models vs. Accepted Data + Four Example Models)
   - 4.1. Master Table
   - 4.2. Four Quick Examples
1. Motivation, Use Cases, and Benefits
A well-defined taxonomy for physics-based AI data isn’t just an academic exercise—it enables practical solutions in real-world workflows, from industrial design optimization to academic PDE research. By clearly identifying how each dataset (and model) is structured, we can reduce guesswork, streamline transformations, and facilitate AutoML scenarios. Below are key ways this ontology helps:
1.1 Automated Model Selection
Scenario: A user has a 3D non-uniform surface mesh and wants to see which neural PDE surrogates (FNO, WNO, DiffusionNet, etc.) can handle it directly. How the Taxonomy Helps:
By tagging the dataset with fields like dimension: 3, geometry_type: "mesh", uniform: false, etc., a tool (or “ontology engine”) can instantly tell which models list “3D unstructured mesh” in their accepted data.
Benefit: Quick, automatic identification of surrogates that match the dataset—and if none match, the system suggests a transformation step or an alternative approach. This fosters an AutoML-style pipeline for PDE data.
1.2 Data Transformations and Interchange
Scenario: Converting from a volumetric mesh (e.g., tetrahedral elements) to a uniform grid (for a spectral-based model), or from a surface mesh to a point cloud (for a point-based model). How the Taxonomy Helps:
The ontology describes both the source (e.g., “3D_non_uniform_volumetric_mesh”) and the intended target (“3D_grid”).
Benefit: A user can rely on consistent “source → target” descriptors to invoke standard transformation utilities (e.g., VTK, PyVista) without writing ad-hoc scripts each time—reducing friction in multi-step PDE workflows.
1.3 Multi-Model Experimentation
Scenario: You want to compare how FNO, WNO, and DiffusionNet perform on the same dataset. How the Taxonomy Helps:
A single descriptor can specify the domain geometry, boundary info, and channels. An “ontology engine” can see if FNO/WNO require uniform grids or if DiffusionNet uses an unstructured mesh.
Benefit: Allows fair, robust experimentation—users can systematically check which conversions are needed (if any) to apply multiple models to the same PDE domain.
1.4 Reproducibility and Collaboration
Scenario: A research team shares PDE data and training scripts on GitHub. Another team wants to replicate or extend the results using a different PDE solver or neural architecture. How the Taxonomy Helps:
If the dataset is labeled with a standard descriptor (e.g., “3D_point_cloud with boundary labeling, is_transient: true”), new collaborators know exactly what data shape they’re dealing with.
Benefit: Fewer misunderstandings about array layouts, boundary conditions, or spacing. Studies become simpler to reproduce—everyone is on the same page about data definitions.
1.5 Extensibility for New Models
Scenario: A novel PDE surrogate emerges, requiring specialized data (multi-block structured grids, spherical geodesic tiling, etc.). How the Taxonomy Helps:
That method can be added to the “Model vs. Data Table,” specifying exactly which geometry types it accepts. If new geometry types are needed, they can be incorporated with minimal disruption to existing definitions.
Benefit: The taxonomy evolves naturally, preserving consistency as the community adds or modifies PDE surrogates.
1.6 Support for Industrial & HPC Workflows
Scenario: In aerospace or automotive industries, massive HPC simulations produce million-element meshes or large spatiotemporal datasets. Engineers want to apply neural surrogates for design optimization or digital twins. How the Taxonomy Helps:
Because each dataset is explicitly classified, HPC engineers can build pipelines that automatically convert solver outputs into ML-ready formats—or feed them into an AutoML system for PDE surrogates.
Benefit: Scalability—industrial workflows no longer rely on one-off data manipulations. The taxonomy ensures a consistent approach that scales to big data scenarios.
2. Core Idea and Proposed Solution Outline
2.A. The Concept: “Descriptors” for Datasets and Models
Rather than continuing with one-off data scripts, we introduce a lightweight “descriptor” file for each PDE dataset and for each entry in the model zoo. This descriptor (in JSON, YAML, or similar) captures minimal metadata:
dimension (e.g., 1, 2, or 3 for the spatial domain)
geometry_type ("grid", "mesh", or "point")
uniform (true/false for spacing/connectivity)
representation (e.g., [N, H, W, C] for grids, (vertices, faces) for meshes)
boundary (true if boundary conditions/labels are explicitly stored)
channels (number of PDE variables or feature channels)
coordinate_mapping (string or null: how discrete indices map to physical coordinates, e.g., "implicit uniform" or the name of a coordinate array)
cell_type (string or null: the element shape in a mesh, e.g., "triangle", "tetra", "quad"; null if not applicable, such as for grids)
plus optional fields like decimation_level, is_transient, etc.
The fields above are the so-called Taxonomy. (These fields are detailed in Section 3, Taxonomy Fields.)
Why? So each dataset “announces” how it is structured, and each model in the model zoo “announces” which inputs it accepts (i.e., which transformation a workflow would need to perform).
If your data is a 2D uniform grid, the descriptor might look like this:
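A minimal sketch of such a descriptor (shown as a Python dict; on disk it would be the equivalent YAML/JSON file; values are illustrative):

```python
# Descriptor for a 2D uniform grid dataset (illustrative values).
descriptor_2d_grid = {
    "dimension": 2,
    "geometry_type": "grid",
    "uniform": True,
    "representation": {"array_layout": "[N, H, W, C]"},
    "coordinate_mapping": "implicit uniform",
    "boundary": False,
    "channels": 1,
    "is_transient": False,
}
```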
2.B. Model Declarations: “Accepted Formats”
On the flip side, each neural PDE model or operator in Modulus can declare which data format(s) it supports. For example, a Fourier-based operator like FNO might say:
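A sketch of such a declaration (the format and names are hypothetical, not an existing Modulus API):

```python
# Accepted formats a Fourier-based operator like FNO might declare:
# each entry is one data layout the model can consume directly.
fno_accepted_formats = [
    {"geometry_type": "grid", "uniform": True, "dimension": 2},
    {"geometry_type": "grid", "uniform": True, "dimension": 3},
]
```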
When you load FNO, Modulus checks: “Does your dataset’s descriptor match geometry_type: "grid", dimension: 2, uniform: true?” If yes—great. If not, it says “FNO expects a 2D uniform grid, but your data is a 3D unstructured mesh—please re-sample or choose a different model.”
This avoids building a massive transformation library in-house. We only define the interface—the “language” each dataset speaks and each model requires. Actual mesh→grid, grid→point-cloud transformations can be done with open-source libraries like PyVista, VTK, or Open3D, if needed.
Some models can accept multiple formats if they are flexible (e.g., an “FNO_2D” plus an “FNO_3D” variant, or partially uniform grids).
2.C. Benefits of This Approach
Immediate Clarity: If you see dimension: 3 and geometry_type: "mesh", you know you’re dealing with an unstructured domain in 3D. Models that only do uniform grids or collocation points are off the table (or need re-sampling).
AutoML-Style Pipelines: In principle, one could build an automated “data→model” matching system. If Modulus sees dimension=2/uniform grid, it might suggest “FNO or AFNO.” If geometry_type=“mesh,” it might suggest “DiffusionNet” or “MeshGraphNet.”
Lightweight: We’re not rewriting code to handle each possible transform. We’re just documenting data structures and letting the user (or external scripts) handle conversions if needed.
Scalability: As new PDE surrogates come online (DeepONet, PDE-Transformer, etc.), they add a snippet describing accepted data. As new PDE data sets appear, they provide a descriptor. Everything remains consistent without a huge refactor.
2.D. Implementation Sketch
Data Loaders in Modulus:
Could parse a .yaml or .json descriptor for each dataset.
Compare it against the “accepted_formats” of the chosen model.
Either proceed or prompt a “format mismatch” warning.
Choose a PDE model in Modulus. If it matches, train. If not, convert externally or pick another model.
No Full Library:
If your data is a surface mesh but you want a 2D uniform grid for WNO, Modulus might just say: “Mismatch. Try PyVista to voxelize the mesh.”
We avoid huge in-house transformations, focusing on interfaces and easy checks.
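The matching step described above could be sketched as follows (all names hypothetical; this is an illustration of the check, not an existing Modulus API):

```python
def is_compatible(descriptor, accepted_formats):
    """Return True if the dataset descriptor satisfies any accepted format.

    A format matches when every field it constrains has an allowed value
    in the descriptor (a constraint may list several allowed values).
    """
    for fmt in accepted_formats:
        ok = True
        for field, allowed in fmt.items():
            allowed_values = allowed if isinstance(allowed, (list, tuple)) else [allowed]
            if descriptor.get(field) not in allowed_values:
                ok = False
                break
        if ok:
            return True
    return False

# Hypothetical declaration and descriptors:
fno_formats = [{"geometry_type": "grid", "uniform": True, "dimension": [2, 3]}]
mesh_data = {"geometry_type": "mesh", "uniform": False, "dimension": 3}
grid_data = {"geometry_type": "grid", "uniform": True, "dimension": 2}

assert not is_compatible(mesh_data, fno_formats)  # would trigger a mismatch warning
assert is_compatible(grid_data, fno_formats)      # proceed to training
```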
Next: I’ll share a short “master table” summarizing models vs. data structures, plus four quick examples (AFNO, PINNs, WNO, DiffusionNet) showing how each might declare accepted formats. Then we can discuss feedback, potential pitfalls, and next steps!
3. Deep-Dive on Taxonomy & Ontology
3.1. Taxonomy Fields
We introduce a minimal set of fields—listed in the table below—to consistently describe the shape, connectivity, and additional metadata of PDE data. Each field is a key–value pair capturing an aspect of the dataset’s domain geometry or PDE variables.
| Field | Meaning & Possible Values |
| --- | --- |
| `dimension` | Integer in {1, 2, 3, …}. The spatial dimensionality of the domain, e.g., 2 for a planar field, 3 for volumetric. |
| `geometry_type` | Categorical: one of `"point"`, `"grid"`, `"mesh"`. Describes whether data is raw points, a structured lattice, or an unstructured mesh. |
| `uniform` | Boolean: `true` if spacing or topology is regular, `false` if non-uniform/unstructured. |
| `representation` | Nested object clarifying how coordinates, adjacency, or array layouts are stored (e.g., `[N, H, W, C]`, adjacency lists, etc.). |
| `is_transient` | Boolean: `true` if the data includes multiple time steps within one descriptor, `false` otherwise. |
| `boundary` | Boolean: indicates whether the data explicitly labels boundary nodes, faces, or conditions. Useful for PDE setups requiring BC enforcement. |
| `cell_type` | String or `null`: describes the element shape in a mesh (e.g., `"triangle"`, `"tetra"`, `"quad"`). `null` if not applicable (e.g., grid). |
| `decimation` | Boolean: indicates whether the data has been downsampled (coarsened). Often relevant for HPC or multi-resolution pipelines. |
| `decimation_level` | Integer or float: optional field quantifying the ratio or factor of decimation. |
| `channels` | Integer or string: how many PDE variables or feature channels each point/element holds (e.g., velocity components, scalar fields). |
| `coordinate_mapping` | String or `null`: how discrete indices map to physical coordinates (e.g., `"implicit uniform"`, or the name of a coordinate array). |
Why These Fields?
Real PDE data can vary wildly in geometry (points vs. grids vs. meshes), uniformity (structured vs. unstructured), and required PDE metadata (boundary conditions, material properties, etc.).
These fields strike a balance between minimalism (so it’s easy to fill out) and completeness (to meaningfully distinguish different data formats).
3.2. Data Representation
While the taxonomy fields (dimension, geometry type, uniformity, etc.) describe the conceptual layout of a dataset, the actual storage of PDE data can vary widely. In practice, these variations determine how easily data can be loaded, transformed, or fed into a physics-based AI model. Below, we outline typical representations for grids, meshes, and point sets, along with transient data handling.
Uniform Grids (Structured)
Shape: For a 2D field, data might be stored as a 3D array (N, H, W) (or (N, H, W, C) if there are multiple channels).
Index-to-Coordinate Mapping: Often implicit—for example, each grid cell or point sits at (x0 + i·Δx, y0 + j·Δy).
Implementation Detail: If stored in NumPy or PyTorch, the tensor shape might be [..., H, W], where the exact order of dimensions depends on user convention (e.g., channels_last vs. channels_first).
Taxonomy Example:
geometry_type: "grid",
uniform: true,
representation: array_layout: "[N, H, W, C]",
coordinate_mapping: "implicit uniform".
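The "implicit uniform" mapping can be sketched in a few lines: grid indices (i, j) map to physical coordinates without storing any coordinate array (origin and spacing values below are illustrative):

```python
# Implicit uniform coordinate mapping for a structured grid (illustrative values).
x0, y0 = 0.0, 0.0   # physical origin
dx, dy = 0.5, 0.25  # grid spacing per axis

def index_to_coord(i, j):
    """Map grid indices (i, j) to physical coordinates (x, y)."""
    return (x0 + i * dx, y0 + j * dy)

assert index_to_coord(0, 0) == (0.0, 0.0)
assert index_to_coord(3, 2) == (1.5, 0.5)
```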
Non-Uniform Grids (Structured but Variable Spacing)
Shape: Similar array structure (N, H, W, …), but rows/columns can have unequal spacing Δx_i, Δy_j.
Index-to-Coordinate Mapping: Typically stored in separate arrays (e.g., a 1D array for x coordinates and another for y, or a 2D coordinate mesh).
Taxonomy Example:
geometry_type: "grid",
uniform: false,
coordinate_mapping: "[x(i), y(j)]" (separate coordinate arrays per axis), etc.
Unstructured Meshes
Vertices: An array storing the spatial coordinates of each node, e.g. (N, 2) for 2D or (N, 3) for 3D.
Faces/Cells: A separate array listing which vertex indices make up each element. For surfaces, these might be triangles (M, 3) or quads (M, 4). For volumetric meshes, tetrahedra (M, 4) or hexahedra (M, 8).
Adjacency: Optional but often used for graph neural networks or to speed up neighbor queries. Could be stored as a list of edges, a node→neighbors dictionary, or a sparse matrix.
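The vertices/faces/adjacency layout above can be sketched concretely (plain Python lists stand in for (N, 3) and (M, 3) arrays; the two-triangle mesh is illustrative):

```python
# Unstructured surface mesh: vertex coordinates plus triangular faces
# given as index triples into the vertex list.
vertices = [
    (0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (0.0, 1.0, 0.0),
    (1.0, 1.0, 0.0),
]
faces = [(0, 1, 2), (1, 3, 2)]  # each entry indexes into `vertices`

# Optional adjacency (node -> neighbors), derivable from the faces;
# useful for graph neural networks or neighbor queries.
adjacency = {i: set() for i in range(len(vertices))}
for a, b, c in faces:
    for u, v in ((a, b), (b, c), (a, c)):
        adjacency[u].add(v)
        adjacency[v].add(u)

assert adjacency[1] == {0, 2, 3}  # vertex 1 touches vertices 0, 2, and 3
```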
Point Clouds
Shape: Typically an array (N, d), where d is the embedding dimension (2D or 3D).
No Connectivity: The points have no explicit adjacency, so PDE-based operations (if any) might rely on nearest-neighbor searches or custom approaches.
Taxonomy Example:
geometry_type: "point",
uniform: false (usually random or sensor-based),
representation: array_layout: "[N, d]".
Transient Data (Multiple Time Steps)
Single vs. Multi-File: Some pipelines store each time step in a separate file; others stack them in a 4D or 5D array (e.g., (N_time, H, W, C)).
Taxonomy Attribute: is_transient: true indicates that the dataset descriptor includes multiple time frames in one structure.
Implementation Detail: An index in the first dimension might correspond to time, e.g., (t, x, y, channels).
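A tiny sketch of that single-structure layout (plain nested lists stand in for an (N_time, H, W, C) array; shapes are illustrative):

```python
# Transient dataset stored in one structure, indexed (t, x, y, c).
N_time, H, W, C = 3, 4, 4, 2
field = [[[[0.0] * C for _ in range(W)] for _ in range(H)] for _ in range(N_time)]

snapshot_t1 = field[1]  # the full spatial field at time step t = 1
assert len(field) == N_time
assert len(snapshot_t1) == H and len(snapshot_t1[0]) == W
```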
Boundary or Auxiliary Annotations
For PDE boundary conditions, a user may add arrays labeling boundary nodes or prescribing Dirichlet/Neumann values. This might appear as:
A boolean mask array boundary_mask for each node or grid cell.
A dictionary specifying which edges/faces are walls, inlets, or symmetry boundaries.
Taxonomy: boundary: true, plus a note in representation describing how boundary info is stored.
Decimation / Multi-Scale
If the data has been downsampled (for instance, from a high-resolution CFD simulation to a coarser grid), an additional field like decimation_level can track how aggressive the reduction was.
Multi-scale or multi-resolution workflows may store hierarchical meshes or multiple grid sizes, but for simplicity, each descriptor focuses on a single resolution.
How This Connects to the Taxonomy
Each of these representation strategies ties back to the fields in the data descriptor. For instance:
dimension + geometry_type clarify if we’re dealing with (N, H, W) arrays on a uniform grid or (vertices, faces) in a mesh.
uniform + representation indicate if we have consistent spacing or must store coordinates.
boundary signals if PDE boundary annotations are included.
is_transient determines whether time steps are embedded in the same data structure.
By consistently encoding these details, we can quickly see whether a dataset (say, an unstructured surface mesh with boundary info) is compatible with a given model (e.g., a graph-based PDE surrogate) or if we need a data transformation (e.g., re-sampling that mesh onto a uniform grid for a Fourier-based operator).
Overall, a clear data representation—in line with the taxonomy fields—makes physics-based ML pipelines more automatable and transparent, ensuring that each step from raw solver output (or sensor measurement) to neural network training is well-defined.
The next section, External Tools: “Adapters,” outlines how external libraries can carry out the transformations implied by the taxonomy, so that Modulus only defines the interface rather than maintaining the transformation code itself.
3.3 External Tools: “Adapters”
VTK / PyVista
Open-source libraries that handle mesh → grid or mesh → point cloud transformations.
Modulus could provide adapters: Python classes or scripts that read the descriptor, call PyVista/VTK operations (e.g., “surface to volumetric,” “voxelization,” etc.), then generate an updated descriptor for the output.
Open3D / PCL
Tools for point clouds: downsampling, normal estimation, boundary detection.
If Modulus sees “geometry_type: point,” “uniform: false,” and a user wants a structured grid, an adapter calls Open3D’s nearest-neighbor interpolation or something similar.
Custom HPC Codes
In large-scale HPC contexts, specialized C++ or parallel meshing codes might perform the conversions.
Modulus can define a minimal CLI or Python interface that passes the descriptor fields so the external tool can run the appropriate pipeline.
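A minimal sketch of what such an adapter interface could look like (all names are hypothetical; the actual voxelization would be delegated to PyVista/VTK/Open3D, which is stubbed out here):

```python
# Hypothetical adapter interface: each adapter declares the descriptor fields
# it consumes (source) and produces (target), and delegates the real work
# to an external tool.
class Adapter:
    source = {}  # descriptor fields the adapter requires
    target = {}  # descriptor fields the output will have

    def applies_to(self, descriptor):
        return all(descriptor.get(k) == v for k, v in self.source.items())

    def run(self, data, descriptor):
        raise NotImplementedError

class MeshToUniformGrid(Adapter):
    source = {"geometry_type": "mesh"}
    target = {"geometry_type": "grid", "uniform": True}

    def run(self, data, descriptor):
        # A real implementation would call an external tool here
        # (e.g., a PyVista voxelization), then emit the updated descriptor.
        new_descriptor = {**descriptor, **self.target, "cell_type": None}
        return data, new_descriptor

adapter = MeshToUniformGrid()
assert adapter.applies_to({"geometry_type": "mesh", "dimension": 3})
_, out = adapter.run(None, {"geometry_type": "mesh", "dimension": 3})
assert out["geometry_type"] == "grid" and out["uniform"] is True
```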
4. Overview of Model Classification (Master Table of Models vs. Accepted Data + Four Example Models)
To illustrate how each PDE surrogate can declare its data requirements—and how users know if their dataset fits—here is a master table showing typical “Accepted Data” for several well-known models in physics-based AI. After that, four quick sub-sections give more detailed examples.
4.1. Master Table
| Model | Accepted Data | Notes |
| --- | --- | --- |
| AFNO | 2D/3D uniform grid (structured) | Adaptive Fourier Neural Operator. Extends FNO with adaptive frequency weighting. Ideal for global PDE phenomena on a regular lattice. |
| FNO | 2D/3D (often uniform grid, sometimes partially non-uniform) | Fourier Neural Operator. Uses global Fourier transforms; typically no explicit boundary labeling. Good for parametric PDE families on grids. |
| WNO | 2D/3D uniform grid (structured, wavelet-based) | Wavelet Neural Operator. Replaces the FFT with wavelet transforms for local/multi-scale features. Still data-driven; typically no direct PDE boundary labeling. |
| DiffusionNet | 2D/3D unstructured mesh (often surface, can be volumetric) | Graph-like diffusion approach. Requires explicit vertices, faces, adjacency. Good for manifold PDEs or shape analysis on complex geometries. |
| — | — | Large-scale GNN for weather/climate. Time-evolving data, geodesic tiling. Designed for planet-scale PDE forecasting. |
| PointNet | 2D/3D point cloud (no adjacency) | Classification/regression on raw points. Not inherently PDE-oriented, but can adapt if PDE fields are stored as scattered points. |
| PINNs | Collocation points in 1D/2D/3D + boundary/initial-condition info | Physics-Informed Neural Networks. PDE constraints in the loss. Works well even with minimal labeled data, focusing on PDE residuals at domain points. |
4.2. Four Quick Examples
Here’s a brief demonstration of how four of these models might specify their “accepted_formats,” plus a minimal example descriptor that satisfies each:
Result: PINNs rely on PDE residual enforcement at collocation points + boundary. If your data is basically a set of scattered points in 3D, with boundary labels, it’s fully compatible.
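That PINN case can be sketched end to end (declaration format and values are hypothetical, following the proposed taxonomy):

```python
# What a PINN model entry might declare, and a descriptor that satisfies it.
pinn_accepted = {
    "geometry_type": "point",  # collocation points
    "dimension": [1, 2, 3],
    "boundary": True,          # boundary/initial-condition info required
}
dataset = {
    "dimension": 3,
    "geometry_type": "point",
    "uniform": False,
    "boundary": True,
    "channels": 1,
}

# Field-by-field compatibility check: scattered 3D points with boundary
# labels satisfy every constraint PINNs declare.
assert dataset["geometry_type"] == pinn_accepted["geometry_type"]
assert dataset["dimension"] in pinn_accepted["dimension"]
assert dataset["boundary"] == pinn_accepted["boundary"]
```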
reacted with thumbs up emoji reacted with thumbs down emoji reacted with laugh emoji reacted with hooray emoji reacted with confused emoji reacted with heart emoji reacted with rocket emoji reacted with eyes emoji
Uh oh!
There was an error while loading. Please reload this page.
-
Hi Modulus Community!
I want to propose that Modulus incorporates a core taxonomy and ontology for PDE data and model declarations.
Context
I’ve been working with various parts of NVIDIA Modulus and noticed that data compatibility between different operator-learning methods (FNO, AFNO, PINNs, etc.) often requires manual reformatting—resampling unstructured meshes onto grids, extracting point clouds, handling boundary data, etc. These steps are crucial but can be inconsistent or ad hoc. Moreover, for advanced AutoML or workflow pipelines, we lack a consistent way to match a given dataset to the right neural surrogate or automate conversions when needed.
Idea
Rather than building a large "
modulus-transform
" library in-house, I propose we define a Core Taxonomy & Ontology for describing PDE data—and expand that idea to let each model also declare its data requirements. This would be a minimal but standardized way for data sets to “announce”:(See the Appendix below for a detailed field listing.)
Key Motivations
Why This Matters
Friction Points So Far
Envisioned Changes
Data Descriptor: A standard
.yaml
or.json
file accompanies each dataset.For example:
Model Declarations: Each built-in operator (FNO, PINNs, etc.) has a compact snippet describing its accepted formats (e.g., “grid, uniform=true, dimension=2 or 3”).
Integrating External Tools: If mismatch occurs, you can invoke a minimal “adapter” or CLI hook (e.g., “Mesh to Uniform Grid” with PyVista). Modulus itself only interfaces with these transformations, so we avoid building or maintaining a massive transformation library in-house.
Benefits
Request & Next Steps
Below, I’ll outline in the Appendix how a Core Taxonomy & Ontology can define the minimal fields for PDE data, along with short examples of model declarations and a “master table” of popular PDE surrogates. Let me know your thoughts on the approach!
Thanks!
Georg Maerz
Appendix
Table-of-content:
1. Motivation, Use Cases, and Benefits
A well-defined taxonomy for physics-based AI data isn’t just an academic exercise—it enables practical solutions in real-world workflows, from industrial design optimization to academic PDE research. By clearly identifying how each dataset (and model) is structured, we can reduce guesswork, streamline transformations, and facilitate AutoML scenarios. Below are key ways this ontology helps:
1.1 Automated Model Selection
Scenario: A user has a 3D non-uniform surface mesh and wants to see which neural PDE surrogates (FNO, WNO, DiffusionNet, etc.) can handle it directly.
How the Taxonomy Helps:
dimension: 3
,geometry_type: "mesh"
,uniform: false
, etc., a tool (or “ontology engine”) can instantly tell which models list “3D unstructured mesh” in their accepted data.1.2 Data Transformations and Interchange
Scenario: Converting from a volumetric mesh (e.g., tetrahedral elements) to a uniform grid (for a spectral-based model), or from a surface mesh to a point cloud (for a point-based model).
How the Taxonomy Helps:
1.3 Multi-Model Experimentation
Scenario: You want to compare how FNO, WNO, and DiffusionNet perform on the same dataset.
How the Taxonomy Helps:
1.4 Reproducibility and Collaboration
Scenario: A research team shares PDE data and training scripts on GitHub. Another team wants to replicate or extend the results using a different PDE solver or neural architecture.
How the Taxonomy Helps:
1.5 Extensibility for New Models
Scenario: A novel PDE surrogate emerges, requiring specialized data (multi-block structured grids, spherical geodesic tiling, etc.).
How the Taxonomy Helps:
1.6 Support for Industrial & HPC Workflows
Scenario: In aerospace or automotive industries, massive HPC simulations produce million-element meshes or large spatiotemporal datasets. Engineers want to apply neural surrogates for design optimization or digital twins.
How the Taxonomy Helps:
2. Core Idea and Proposed Solution Outline
2.A. The Concept: “Descriptors” for Datasets
Rather than continuing with one-off data scripts, we introduce a lightweight “descriptor” file for each PDE dataset and for each entry in a model zoo. This descriptor (in JSON, YAML, or similar) captures minimal metadata:
- dimension (1, 2, or 3 for the spatial domain)
- geometry_type ("grid", "mesh", or "point")
- uniform (true/false for spacing/connectivity)
- representation ([N, H, W, C] for grids, (vertices, faces) for meshes)
- boundary (true if boundary conditions/labels are explicitly stored)
- coordinate_mapping (String or null: how we map discrete indices to physical coordinates, e.g., "implicit uniform", or the name of a coordinate array)
- cell_type (String or null: describes element shape in a mesh, e.g., "triangle", "tetra", "quad"; null if not applicable, e.g., grid)
- optional fields such as decimation_level, is_transient, etc.

The fields above are the so-called Taxonomy. (These fields are detailed in Section 3, Taxonomy Fields.)
Why? So each dataset “announces” how it’s structured or each model in the model zoo "announces" which input it can take (i.e., which transformation a workflow should do).
If your data is a 2D uniform grid, the descriptor might look like this:
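For illustration, such a descriptor might look like the following (the field values are examples following the proposed taxonomy, not a finalized schema):

```yaml
# Example dataset descriptor (illustrative, not a fixed schema)
dimension: 2
geometry_type: "grid"
uniform: true
representation:
  array_layout: "[N, H, W, C]"
coordinate_mapping: "implicit uniform"
boundary: false
is_transient: false
```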
2.B. Model Declarations: “Accepted Formats”
On the flip side, each neural PDE model or operator in Modulus can declare what data format(s) it supports. For example, a Fourier-based operator like FNO might say:
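A sketch of such a declaration, using the proposed taxonomy fields (this is not existing Modulus metadata):

```yaml
# FNO model declaration (illustrative)
model: "FNO"
accepted_formats:
  - geometry_type: "grid"
    dimension: 2
    uniform: true
  - geometry_type: "grid"
    dimension: 3
    uniform: true
```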
When you load FNO, Modulus checks: “Does your dataset’s descriptor match geometry_type: "grid", dimension: 2, uniform: true?” If yes—great. If not, it says “FNO expects a 2D uniform grid, but your data is a 3D unstructured mesh—please re-sample or choose a different model.”
This avoids building a massive transformation library in-house. We only define the interface—the “language” each dataset speaks and each model requires. Actual mesh→grid or grid→point-cloud transformations can be done with open-source libraries like PyVista, VTK, or Open3D, if needed.
Some models can accept multiple formats if they are flexible (e.g., “FNO_2D” plus “FNO_3D”, or partially uniform data).
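As a sketch of how this matching could work (the helper function and field names below are illustrative assumptions, not current Modulus API), the check reduces to comparing a dataset descriptor against each declared accepted format:

```python
# Hypothetical sketch: match a dataset descriptor against a model's
# declared accepted formats. Field names follow the proposed taxonomy;
# nothing here is existing Modulus API.

def is_compatible(descriptor, accepted_formats):
    """Return True if the descriptor satisfies at least one accepted format.

    A format matches when every field it constrains has the same value in
    the dataset descriptor; fields a format omits are unconstrained.
    """
    return any(
        all(descriptor.get(field) == value for field, value in fmt.items())
        for fmt in accepted_formats
    )

# An FNO-like model declaring two accepted grid formats:
fno_accepts = [
    {"geometry_type": "grid", "dimension": 2, "uniform": True},
    {"geometry_type": "grid", "dimension": 3, "uniform": True},
]

grid_data = {"dimension": 2, "geometry_type": "grid", "uniform": True}
mesh_data = {"dimension": 3, "geometry_type": "mesh", "uniform": False}

print(is_compatible(grid_data, fno_accepts))  # True
print(is_compatible(mesh_data, fno_accepts))  # False: re-sample or pick another model
```

A selection tool could run this check over an entire model zoo to produce the “which models fit my data” answer from Section 1.1.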
2.C. Benefits of This Approach
If a descriptor says dimension: 3 and geometry_type: "mesh", you know you’re dealing with an unstructured domain in 3D. Models that only accept uniform grids or collocation points are off the table (or need re-sampling).
2.D. Implementation Sketch
Provide a .yaml or .json descriptor for each dataset, and have each model declare its accepted formats using the same core fields (dimension, geometry_type, uniform, etc.).
Next: I’ll share a short “master table” summarizing models vs. data structures, plus four quick examples (AFNO, PINNs, WNO, DiffusionNet) showing how each might declare accepted formats. Then we can discuss feedback, potential pitfalls, and next steps!
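The implementation sketch above could start as small as a loader that verifies the core taxonomy fields are present (a minimal sketch; the function name and required-field set are my assumptions, and a YAML loader would work the same way as the JSON example hinted at in the comment):

```python
# Minimal fields the proposed taxonomy would require; optional fields such
# as is_transient or decimation_level are allowed but not enforced here.
REQUIRED_FIELDS = {"dimension", "geometry_type", "uniform", "representation"}

def validate_descriptor(descriptor):
    """Check that the core taxonomy fields are present; return the descriptor."""
    missing = REQUIRED_FIELDS - descriptor.keys()
    if missing:
        raise ValueError(f"descriptor missing required fields: {sorted(missing)}")
    return descriptor

# In practice the dict would come from a .json/.yaml file, e.g.:
#   descriptor = validate_descriptor(json.load(open("dataset.json")))
example = {
    "dimension": 2,
    "geometry_type": "grid",
    "uniform": True,
    "representation": {"array_layout": "[N, H, W, C]"},
}
validate_descriptor(example)  # passes; raises ValueError on missing fields
```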
3. Deep-Dive on Taxonomy & Ontology
3.1. Taxonomy Fields
We introduce a minimal set of fields—listed below—to consistently describe the shape, connectivity, and additional metadata of PDE data. Each field is a key–value pair capturing an aspect of the dataset’s domain geometry or PDE variables.
- dimension: spatial dimension of the domain (1, 2, or 3).
- geometry_type: "grid", "mesh", or "point".
- uniform: true if spacing or topology is regular, false if non-uniform/unstructured.
- representation: how the data is laid out ([N, H, W, C], adjacency lists, etc.).
- boundary: true if boundary conditions/labels are explicitly stored.
- is_transient: true if the data includes multiple time steps within one descriptor, false otherwise.
- cell_type: element shape in a mesh ("triangle", "tetra", "quad"); null if not applicable (e.g., grid).
- coordinate_mapping: how discrete indices map to physical coordinates ("implicit uniform", or the name of a coordinate array).

Why These Fields?
3.2. Data Representation
While the taxonomy fields (dimension, geometry type, uniformity, etc.) describe the conceptual layout of a dataset, the actual storage of PDE data can vary widely. In practice, these variations determine how easily data can be loaded, transformed, or fed into a physics-based AI model. Below, we outline typical representations for grids, meshes, and point sets, along with transient data handling.
Uniform Grids (Structured)
Data stored as arrays [..., H, W], where the exact order of dimensions depends on user convention (e.g., channels_last vs. channels_first). Descriptor: geometry_type: "grid", uniform: true, representation: array_layout: "[N, H, W, C]", coordinate_mapping: "implicit uniform".
Non-Uniform Grids (Structured but Variable Spacing)
Coordinates stored explicitly (e.g., one array for x coordinates and another for y, or a 2D coordinate mesh). Descriptor: geometry_type: "grid", uniform: false, representation: coordinate_mapping: "[x(i), y(j)]", etc.
Unstructured Meshes
Vertices stored as (N, 2) for 2D or (N, 3) for 3D; faces as triangles (M, 3) or quads (M, 4). For volumetric meshes, tetrahedra (M, 4) or hexahedra (M, 8). Descriptor: geometry_type: "mesh", uniform: false, representation: vertices: (N, 3), faces: (M, 3), adjacency: "list".
Point Clouds
An array (N, d), where d is the embedding dimension (2D or 3D). Descriptor: geometry_type: "point", uniform: false (usually random or sensor-based), representation: array_layout: "[N, d]".
Transient Data (Multiple Time Steps)
is_transient: true indicates that the dataset descriptor includes multiple time frames in one structure, e.g., arrays shaped (t, x, y, channels).
Boundary or Auxiliary Annotations
For example, a boundary_mask for each node or grid cell. Descriptor: boundary: true, plus a note in representation describing how boundary info is stored.
Decimation / Multi-Scale
An optional decimation_level can track how aggressive the reduction was.
3.2.2. How This Connects to the Taxonomy
Each of these representation strategies ties back to the fields in the data descriptor: for instance, (N, H, W) arrays on a uniform grid, or (vertices, faces) in a mesh.
By consistently encoding these details, we can quickly see whether a dataset (say, an unstructured surface mesh with boundary info) is compatible with a given model (e.g., a graph-based PDE surrogate) or if we need a data transformation (e.g., re-sampling that mesh onto a uniform grid for a Fourier-based operator).
Overall, a clear data representation—in line with the taxonomy fields—makes physics-based ML pipelines more automatable and transparent, ensuring that each step from raw solver output (or sensor measurement) to neural network training is well-defined.
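To make the “compatible or transform” decision concrete, here is a toy dispatcher. The transformation names (resample_mesh_to_grid, etc.) are hypothetical placeholders for steps an external adapter library would perform, not Modulus functions:

```python
# Toy sketch: decide whether data can be fed to a model directly, or which
# conversion an external adapter (PyVista, Open3D, ...) should perform.
# Transform names below are illustrative placeholders.

KNOWN_TRANSFORMS = {
    ("mesh", "grid"): "resample_mesh_to_grid",
    ("mesh", "point"): "extract_point_cloud",
    ("grid", "point"): "flatten_grid_to_points",
}

def plan_pipeline(dataset_geometry, model_geometry):
    """Return None if no conversion is needed, else the adapter step to run."""
    if dataset_geometry == model_geometry:
        return None
    try:
        return KNOWN_TRANSFORMS[(dataset_geometry, model_geometry)]
    except KeyError:
        raise ValueError(
            f"no known transform {dataset_geometry} -> {model_geometry}"
        )

print(plan_pipeline("mesh", "grid"))  # resample_mesh_to_grid
print(plan_pipeline("grid", "grid"))  # None: feed the model directly
```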
Below is the “External Tools: Adapters” part of the document. It outlines how external libraries can plug into the workflows described above (model selection, data transformations, multi-model experimentation), and highlights how a standardized data description benefits the broader physics-based AI community.
3.3 External Tools: “Adapters”
VTK / PyVista
Open3D / PCL
High-Performance Tools (e.g., HPC meshing libraries)
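One way such tools could plug in is through a small adapter interface that Modulus defines; everything else (the actual conversion) lives in the external library. The class and attribute names below are my assumptions, a sketch of the contract rather than a proposed API:

```python
from abc import ABC, abstractmethod

class DataAdapter(ABC):
    """Hypothetical plug-in interface for external conversion tools.

    An adapter declares which taxonomy constraints its input must satisfy
    and which fields describe its output; a workflow engine can then chain
    adapters to bridge a dataset descriptor and a model declaration.
    """

    consumes: dict = {}   # taxonomy constraints on the input data
    produces: dict = {}   # taxonomy fields describing the produced data

    @abstractmethod
    def convert(self, data):
        """Transform `data` (e.g., via PyVista/Open3D under the hood)."""

class MeshToGridAdapter(DataAdapter):
    consumes = {"geometry_type": "mesh"}
    produces = {"geometry_type": "grid", "uniform": True}

    def convert(self, data):
        # A real implementation would call an external resampling library;
        # here we only tag the payload to illustrate the contract.
        return {"geometry_type": "grid", "uniform": True, "payload": data}
```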
4. Overview of Model Classification (Master Table of Models vs. Accepted Data + Four Example Models)
To illustrate how each PDE surrogate can declare its data requirements—and how users know if their dataset fits—here’s a master table showing typical “Accepted Data” for eight well-known models in physics-based AI. After that, we’ll do four quick sub-sections to give more detailed examples.
4.1. Master Table
- Extends FNO with adaptive frequency weighting.
- Ideal for global PDE phenomena on a regular lattice.
- Uses global Fourier transforms; typically no explicit boundary labeling.
- Good for parametric PDE families on grids.
- Replaces FFT with wavelet transforms for local/multi-scale features.
- Still data-driven, typically no direct PDE boundary labeling.
- Requires explicit vertices, faces, adjacency.
- Good for manifold PDEs or shape analysis on complex geometries.
- Node-edge message passing.
- Handles complex domain connectivity (fluid-structure interaction, etc.).
- Time-evolving data, geodesic tiling.
- Designed for planet-scale PDE forecasting.
- Not inherently PDE-oriented, but can adapt if PDE fields are stored as scattered points.
- PDE constraints in the loss.
- Works well even with minimal labeled data, focusing on PDE residuals at domain points.
4.2. Four Quick Examples
Here’s a brief demonstration of how four of these models might specify their “accepted_formats,” plus a minimal example descriptor that satisfies each:
A. AFNO (Adaptive Fourier Neural Operator)
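A plausible AFNO declaration plus a matching descriptor (values are illustrative assumptions):

```yaml
# AFNO: uniform regular lattices (illustrative)
model: "AFNO"
accepted_formats:
  - geometry_type: "grid"
    dimension: 2
    uniform: true
---
# A descriptor that satisfies it
dimension: 2
geometry_type: "grid"
uniform: true
representation:
  array_layout: "[N, H, W, C]"
coordinate_mapping: "implicit uniform"
```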
B. PINNs
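For PINNs, which sample collocation points and enforce PDE residuals in the loss, the declaration might read (again, illustrative values):

```yaml
# PINNs: collocation points in the domain (illustrative)
model: "PINN"
accepted_formats:
  - geometry_type: "point"
    dimension: 2
  - geometry_type: "point"
    dimension: 3
---
dimension: 2
geometry_type: "point"
uniform: false
representation:
  array_layout: "[N, d]"
boundary: true  # boundary points labeled for BC loss terms
```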
C. WNO (Wavelet Neural Operator)
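WNO replaces the FFT with wavelet transforms but still operates on structured grids, so its sketch looks much like FNO’s (illustrative values):

```yaml
# WNO: wavelet transforms on structured grids (illustrative)
model: "WNO"
accepted_formats:
  - geometry_type: "grid"
    uniform: true
---
dimension: 2
geometry_type: "grid"
uniform: true
representation:
  array_layout: "[N, H, W, C]"
```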
D. DiffusionNet
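DiffusionNet needs explicit mesh connectivity, so its declaration would constrain cell_type as well (illustrative values):

```yaml
# DiffusionNet: surface meshes with explicit connectivity (illustrative)
model: "DiffusionNet"
accepted_formats:
  - geometry_type: "mesh"
    dimension: 3
    cell_type: "triangle"
---
dimension: 3
geometry_type: "mesh"
uniform: false
cell_type: "triangle"
representation:
  vertices: "(N, 3)"
  faces: "(M, 3)"
```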