Description
The change in #1593 made the marquez-api jar incompatible with code that depended on the LineageEvent class and its related classes. Any code that depended on those models must now be rewritten to rely on the OpenLineage.* models, which have a very different construction model, so the rewrite requires major effort.
Moreover, the current OpenLineage API has introduced new fields in the InputDataset and OutputDataset models, which were never present in the Marquez implementation of the OpenLineage models. The LineageEvent model is annotated with @JsonIgnoreProperties so any new fields in the JSON are simply dropped during deserialization. Therefore, simply reverting the LineageEvent models would make the Marquez backend incompatible with the new OpenLineage models as new facets would be dropped from the model before storing.
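To make the data loss concrete, here is a minimal sketch. The `Dataset` class is a hypothetical stand-in for a Marquez model, not the real one; it assumes the annotation is used as `@JsonIgnoreProperties(ignoreUnknown = true)`, and `inputFacets` is used only as an illustrative field that the model does not declare.

```java
import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import java.io.UncheckedIOException;

public class DroppedFieldsSketch {

  // Hypothetical stand-in for a Marquez dataset model (assumption, not the real class).
  @JsonIgnoreProperties(ignoreUnknown = true)
  public static class Dataset {
    public String namespace;
    public String name;
  }

  /** Deserializes the JSON into the model and serializes it back, as a store path might. */
  public static String roundTrip(String json) {
    ObjectMapper mapper = new ObjectMapper();
    try {
      Dataset dataset = mapper.readValue(json, Dataset.class);
      return mapper.writeValueAsString(dataset);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    String event = "{\"namespace\":\"ns\",\"name\":\"t\",\"inputFacets\":{\"x\":1}}";
    // The undeclared "inputFacets" field is silently ignored on read,
    // so it is absent from the re-serialized output.
    System.out.println(roundTrip(event));
  }
}
```

Nothing fails or warns here, which is why the loss is easy to miss: the event is accepted and the extra field simply never reaches storage.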
I think we should revert #1593 and alter the models to support unknown fields. Some options for this are:
- Add a `Map<String, Object>` field annotated with `@JsonAnySetter` so that any unknown fields are added to the map, rather than dropped.
  - This is little work up front and offers backward and forward compatibility, as any unknown fields are automatically supported. There is some maintainability concern, as we need to update the Marquez model alongside the OL one.
- Extend or wrap (using `@JsonUnwrapped`) Jackson `ObjectNode` so that objects are automatically deserialized into JsonNodes and setters/getters are written to work with expected properties in a compatible API
  - This is the most up-front work, but offers the most compatibility and least maintenance. Each model is backward and future compatible with any event POSTed and will always be serialized back into an exact replica of the original event. Accessor methods must be hand-written to replace the lombok-generated ones in order to maintain API compatibility.
- Wrap new `OpenLineage` model classes with existing Marquez models
  - This provides the binary compatibility we need, while avoiding the maintenance issue of synchronizing the Marquez models with the OpenLineage ones. The payload would always be deserialized into `OpenLineage` models (so we can receive and store the data even if the Marquez model is never updated). However, we still need to maintain the compatibility layer (the accessor methods) and we are still limited to the fields defined in the version of the OL library deployed with Marquez. Moreover, the OL API for constructing events is a bit cumbersome to use in a case like this. Each model class must be instantiated by an instance of the `OpenLineage` class, which is instantiated with the appropriate `producer` field. Thus, we can't simply instantiate a new `Job` or `JobFacet` and expect the accompanying `OpenLineage.Job` or `OpenLineage.JobFacets` class to be instantiated, as there needs to be a shared `OpenLineage` instance to actually create the instances. This is easy enough to accomplish for model instances that are created purely from Marquez (e.g., a static utility instance), but makes it very difficult to build a processing workflow, such as one that clones a model and adds a new facet (and maintains the original models' `producer` fields) before handing off to another processor.
- Write custom deserializer to automatically add raw JSON string to `LineageEvent` object
  - This is the least work and solves the most immediate problem: that data serialized and stored in the `lineage_events` table is incomplete. However, it makes processing objects that have unknown fields impossible, e.g., a workflow that copies a `LineageEvent` and adds another facet to the `Run` before passing on to storage or another processor would immediately lose information. It also does not offer any additional maintainability support, as the Marquez models must always be updated to synchronize with the OL models.
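As a rough illustration of the second option, the sketch below wraps a Jackson `ObjectNode` rather than extending it. Note that it swaps in a delegating `@JsonCreator` plus `@JsonValue` instead of `@JsonUnwrapped`, purely because that pairing is simple to show; the `Job` class and its accessors are hypothetical and not the Marquez model.

```java
import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonValue;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;
import java.io.IOException;
import java.io.UncheckedIOException;

public class ObjectNodeWrapperSketch {

  // Hypothetical job model backed by the raw JSON tree (assumption, not the real class).
  public static class Job {
    private final ObjectNode node;

    // Jackson binds the whole JSON object to an ObjectNode and passes it in.
    @JsonCreator(mode = JsonCreator.Mode.DELEGATING)
    public Job(ObjectNode node) {
      this.node = node;
    }

    // Hand-written accessors replace the lombok-generated ones.
    public String getName() {
      JsonNode name = node.get("name");
      return name == null ? null : name.asText();
    }

    public void setName(String name) {
      node.put("name", name);
    }

    // Serialize the backing tree itself, so unknown fields are written back too.
    @JsonValue
    public ObjectNode toJson() {
      return node;
    }
  }

  /** Reads a job, changes one known property, and writes it back. */
  public static String renameAndSerialize(String json, String newName) {
    ObjectMapper mapper = new ObjectMapper();
    try {
      Job job = mapper.readValue(json, Job.class);
      job.setName(newName);
      return mapper.writeValueAsString(job);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Because the model never copies fields out of the tree, a processor can modify the properties it knows about and pass everything else through untouched, at the cost of writing each accessor by hand.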
Of the four options, the first offers the most compatibility and flexibility while maintaining forward/backward compatibility and keeping the maintenance burden relatively low.
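A minimal sketch of that first option might look like the following. The `Dataset` class, the `additionalProperties` name, and the `inputFacets` field are illustrative assumptions rather than the actual Marquez model; `@JsonAnyGetter` is added alongside `@JsonAnySetter` so the captured fields are also written back out.

```java
import com.fasterxml.jackson.annotation.JsonAnyGetter;
import com.fasterxml.jackson.annotation.JsonAnySetter;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

public class AnySetterSketch {

  // Hypothetical stand-in for a Marquez dataset model (assumption, not the real class).
  public static class Dataset {
    public String namespace;
    public String name;

    // Holds every JSON property that has no matching declared field.
    private final Map<String, Object> additionalProperties = new LinkedHashMap<>();

    @JsonAnySetter
    public void setAdditionalProperty(String key, Object value) {
      additionalProperties.put(key, value);
    }

    // Flattens the captured properties back into the object on serialization.
    @JsonAnyGetter
    public Map<String, Object> getAdditionalProperties() {
      return additionalProperties;
    }
  }

  /** Deserializes the JSON into the model and serializes it back. */
  public static String roundTrip(String json) {
    ObjectMapper mapper = new ObjectMapper();
    try {
      Dataset dataset = mapper.readValue(json, Dataset.class);
      return mapper.writeValueAsString(dataset);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    String event = "{\"namespace\":\"ns\",\"name\":\"t\",\"inputFacets\":{\"x\":1}}";
    // The undeclared "inputFacets" field survives the round trip.
    System.out.println(roundTrip(event));
  }
}
```

Existing callers keep using the declared fields as before, while code that needs a newer field can read it from the map until the model is updated to expose it directly.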