[server][common][controller][vpj] Materialized view projection and filter support #1647
Compare: 7098b0f to 4eb2f54
  for (Map.Entry<String, VeniceViewWriter> viewWriter: viewWriters.entrySet()) {
-   if (viewWriter.getValue() instanceof ChangeCaptureViewWriter) {
+   if (viewWriter.getValue().getViewWriterType() == VeniceViewWriter.ViewWriterType.CHANGE_CAPTURE_VIEW) {
Nit: we can remove one call of viewWriter.getValue().getViewWriterType() if we hoist it before the if/else.
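A hedged sketch of the nit: compute the type once in a local variable and let the branches reuse it. ViewWriterType and the method names below are simplified stand-ins for the Venice classes, not the real API.

```java
class ViewDispatchSketch {
  enum ViewWriterType { CHANGE_CAPTURE_VIEW, MATERIALIZED_VIEW }

  static String dispatch(ViewWriterType viewWriterType) {
    // The call site would compute the type once, e.g.:
    // ViewWriterType viewWriterType = viewWriter.getValue().getViewWriterType();
    // so the getter is not invoked again in each branch.
    if (viewWriterType == ViewWriterType.CHANGE_CAPTURE_VIEW) {
      return "change-capture";
    } else {
      return "materialized";
    }
  }
}
```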
@@ -3179,9 +3194,19 @@ private PubSubMessageProcessedResult processMessage(
    KafkaMessageEnvelope kafkaValue = consumerRecord.getValue();
    byte[] keyBytes = kafkaKey.getKey();
    MessageType msgType = MessageType.valueOf(kafkaValue.messageType);
-   Lazy<GenericRecord> valueProvider;
+   Lazy<GenericRecord> valueProvider = Lazy.of(() -> null);
Can we make a static final variable in the Lazy class for the null Lazy (Lazy.of(() -> null))?
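A minimal stand-in for the Lazy class to illustrate the suggestion: a single shared static final constant for the "always null" Lazy, instead of allocating Lazy.of(() -> null) at every call site. Venice's real Lazy implementation differs in detail; this is only a sketch of the idea.

```java
import java.util.function.Supplier;

final class Lazy<T> {
  // One shared instance for the null case; safe to reuse because it is immutable.
  @SuppressWarnings("rawtypes")
  private static final Lazy NULL = new Lazy<>(() -> null);

  private final Supplier<T> supplier;

  private Lazy(Supplier<T> supplier) {
    this.supplier = supplier;
  }

  static <T> Lazy<T> of(Supplier<T> supplier) {
    return new Lazy<>(supplier);
  }

  // Callers use Lazy.ofNull() instead of allocating a fresh Lazy.of(() -> null).
  @SuppressWarnings("unchecked")
  static <T> Lazy<T> ofNull() {
    return (Lazy<T>) NULL;
  }

  T get() {
    return supplier.get();
  }
}
```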
  if (hasFilterByFieldsMaterializedView) {
    // Copy the currValue since it will be used for in-place update(s).
    GenericRecord oldValue = new GenericData.Record((GenericData.Record) currValue, true);
You meant the filtering logic can modify the current value, right?
Is it guaranteed to always modify the current value? If not, I think we can do a lazy deep copy.
  if (hasViewWriters()) {
    Put newPut = writeComputeResultWrapper.getNewPut();
    Map<String, Set<Integer>> viewPartitionMap = null;
    consumerRecordWrapper.setProcessedResult(new PubSubMessageProcessedResult(writeComputeResultWrapper));
Why do we need to set the ProcessedResult here?
  // NR pass-through records are expected to carry view partition map in the message header
  viewPartitionMap = ViewUtils.extractViewPartitionMap(consumerRecord.getPubSubMessageHeaders());
  // Native replication (NR) pass-through mode
  partitionConsumptionState.addMessageToPendingMessages(consumerRecordWrapper);
We don't need to buffer it for non-projection/filtering views, right?
@@ -46,11 +47,24 @@ public MaterializedViewWriter(
        new MaterializedView(props.getCombinedProperties().toProperties(), version.getStoreName(), extraViewParameters);
    materializedViewTopicName =
        internalView.getTopicNamesAndConfigsForVersion(version.getNumber()).keySet().stream().findAny().get();
-   this.veniceWriter = Lazy.of(
+   veniceWriter = Lazy.of(
Let us try to make veniceWriter final. For testing purposes, we can offer another constructor that lets the user of the class pass in a VeniceWriter, instead of exposing a setter.
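A hedged sketch of that suggestion: keep the writer field final and add a test-only constructor for injection. The class and type names below are simplified stand-ins (a Supplier of String in place of Lazy<VeniceWriter<...>>), not the real Venice signatures.

```java
import java.util.function.Supplier;

class MaterializedViewWriterSketch {
  // Final: assigned exactly once in a constructor, never via a setter.
  private final Supplier<String> veniceWriter;

  MaterializedViewWriterSketch() {
    // Production path: build the writer lazily on first use.
    this.veniceWriter = () -> "real-writer";
  }

  // Test-only constructor: the test injects its own writer.
  MaterializedViewWriterSketch(Supplier<String> testWriter) {
    this.veniceWriter = testWriter;
  }

  String writerFor() {
    return veniceWriter.get();
  }
}
```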
  writerProps.put(PUSH_JOB_GUID_MOST_SIGNIFICANT_BITS, jobProps.getProperty(PUSH_JOB_GUID_MOST_SIGNIFICANT_BITS));
  writerProps.put(PUSH_JOB_GUID_LEAST_SIGNIFICANT_BITS, jobProps.getProperty(PUSH_JOB_GUID_LEAST_SIGNIFICANT_BITS));
  // If materialized views are present we will disable speculative execution and use default GUID generator
  if (this.props.getString(PUSH_JOB_VIEW_CONFIGS, "").isEmpty()) {
Will it cause any issues if we always use deterministic GUID generator?
    return viewParameters.get(MaterializedViewParameters.MATERIALIZED_VIEW_PROJECTION_SCHEMA.name());
  }

  public boolean isValueProviderNeeded() {
For a view with only filtering rules, we also need the ValueProvider, right?
  for (Schema.Field field: resultRecord.getSchema().getFields()) {
    inputRecordField = inputRecord.getSchema().getField(field.name());
    if (inputRecordField != null) {
      resultRecord.put(field.pos(), inputRecord.get(inputRecordField.pos()));
This might not work.
Assume:
1. Schema1 has two fields: f1, f2, and f1 has two sub-fields: f11, f12.
2. Schema2 has field f1, and f1 has three sub-fields: f11, f12, f13, all with default values.
3. Schema1 is the schema used by `inputRecord`.
4. Schema2 is the schema used by `resultRecord`.
With the logic here, resultRecord will carry f1 with only f11 and f12, but the schema of f1 contains f11, f12, and f13, so serialization of resultRecord will fail.
We should always keep in mind that default values only apply at deserialization time, not serialization time.
To be safe, we need to use SerDe to get a correct resultRecord: serialize inputRecord with inputRecord.schema into bytes, then deserialize the bytes with resultRecord.schema.
One more way is to always deserialize the record with the value schema that was used to generate the projection schema. That means we would need to persist that value schema in the view config and use it as the reader schema in SIT. This would guarantee the simple logic in this function works all the time, and it might be the most efficient way to handle projection/filtering, since schema evolution can remove filtering fields too.
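The SerDe round trip suggested above can be sketched as follows, assuming Apache Avro is on the classpath. The class and helper-method names are illustrative, not Venice APIs; the point is that Avro's schema resolution fills in defaults (e.g. f13) at deserialization time when the reader schema differs from the writer schema.

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

class ProjectionViaSerde {
  static GenericRecord project(GenericRecord inputRecord, Schema projectionSchema) throws Exception {
    // Serialize with the writer schema (inputRecord's own schema).
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(inputRecord.getSchema()).write(inputRecord, encoder);
    encoder.flush();
    // Deserialize with the projection (reader) schema: Avro schema resolution
    // applies field defaults here, so fields absent from the writer schema are filled in.
    DatumReader<GenericRecord> reader =
        new GenericDatumReader<>(inputRecord.getSchema(), projectionSchema);
    return reader.read(null, DecoderFactory.get().binaryDecoder(out.toByteArray(), null));
  }
}
```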
  if (oldValue == null) {
    return true;
  }
  if (newValue == null) {
Please check the above comment.
@@ -225,6 +226,7 @@ enum LatchStatus {

  // veniceWriterLazyRef could be set and get in different threads, mark it volatile.
  private volatile Lazy<VeniceWriter<byte[], byte[], byte[]>> veniceWriterLazyRef;
+ private final LinkedList<PubSubMessageProcessedResultWrapper> pendingMessagesToLocalVT;
Any reason you wanted to use LinkedList instead of List itself?
  }

  public List<PubSubMessageProcessedResultWrapper> getAndClearPendingMessagesToLocalVT() {
    LinkedList<PubSubMessageProcessedResultWrapper> pendingMessages = (LinkedList) pendingMessagesToLocalVT.clone();
LinkedList.clone only does a shallow copy. Is this what we wanted?
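One way to sidestep the clone() semantics question is to swap the buffer out instead of copying it: hand the caller the old list and start a fresh one. This is a hedged, stdlib-only sketch; the class and method names are illustrative, not the Venice code.

```java
import java.util.ArrayList;
import java.util.List;

class PendingBuffer<T> {
  private List<T> pending = new ArrayList<>();

  synchronized void add(T message) {
    pending.add(message);
  }

  // Drain: return the accumulated list and replace it with an empty one.
  // No copy is made, so there is no shallow-vs-deep copy ambiguity.
  synchronized List<T> getAndClear() {
    List<T> drained = pending;
    pending = new ArrayList<>();
    return drained;
  }
}
```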
@@ -431,17 +431,15 @@ private AbstractVeniceWriter<byte[], byte[], byte[]> createCompositeVeniceWriter
    version.setRmdChunkingEnabled(rmdChunkingEnabled);
    // Default deser and decompress function for simple partitioner where value provider is never going to be used.
    BiFunction<byte[], Integer, GenericRecord> valueExtractor = (valueBytes, valueSchemaId) -> null;
-   boolean complexPartitionerConfigured = false;
+   boolean valueExtractorConfigured = false;
Make it volatile? My guess is that these booleans will be read by multiple threads; otherwise you wouldn't need such a boolean.
-   if (materializedView.getViewPartitioner()
-       .getPartitionerType() == VenicePartitioner.VenicePartitionerType.COMPLEX
-       && !complexPartitionerConfigured) {
+   if (materializedView.isValueProviderNeeded() && !valueExtractorConfigured) {
Is it possible that two threads both see valueExtractorConfigured as false and both enter this protected section? If so, would the following be better?
if (materializedView.isValueProviderNeeded() && valueExtractorConfigured.compareAndSet(false, true)) {
  ....
}
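The compareAndSet pattern suggested above can be sketched with an AtomicBoolean guard: exactly one thread wins the compare-and-set and performs the one-time configuration, even under concurrent calls. Names here (ValueExtractorGuard, maybeConfigure) are illustrative, not the Venice code.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class ValueExtractorGuard {
  private final AtomicBoolean valueExtractorConfigured = new AtomicBoolean(false);
  volatile int configureCount = 0; // stands in for the one-time configuration work

  void maybeConfigure() {
    // compareAndSet(false, true) returns true for exactly one caller;
    // every other caller (including concurrent ones) skips the section.
    if (valueExtractorConfigured.compareAndSet(false, true)) {
      configureCount++;
    }
  }
}
```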
Problem Statement
Consumers of a materialized view might not be interested in every data field and update event. The ability to perform projection and filtering could reduce footprint and improve ingestion performance for MV consumers.
Solution
Add projection and filtering support for materialized views (MV) to be more efficient about unwanted data for view consumers. Projection can be enabled by setting projection fields in the materialized view parameters. Similarly, filtering can be enabled by setting filter-by fields. These two features can be enabled separately or together. If enabled together, the filter-by fields are automatically included in the projecting fields. Here is an example MV configuration to illustrate the idea:
Record containing fields: {a, b, c, d, e}
Projecting fields: {b, c}
Filtering fields: {a}
The only filtering option for now is to skip the write if none of the filter-by fields changed. Filtering is also only applied during hybrid ingestion, since a change filter does not make sense on a batch push. With the above setup, we will project and write all batch data to the MV ({a, b, c}). RT updates (full PUT or UPDATE) will project and write the resulting record to the MV ({a, b, c}) only if the value of field (a) differs from the old value. All DELETE events will be written to the MV (no filtering).
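The change-filter semantics above can be modeled with a small stdlib-only sketch. This is an illustrative simplification (plain Maps standing in for Avro GenericRecords; the real logic lives in MaterializedViewWriter): write when any filter-by field changed, and never filter inserts or DELETE events.

```java
import java.util.Map;
import java.util.Objects;
import java.util.Set;

class ChangeFilter {
  static boolean shouldWrite(Map<String, Object> oldValue,
                             Map<String, Object> newValue,
                             Set<String> filterByFields) {
    if (oldValue == null || newValue == null) {
      // No old value (insert) or no new value (DELETE): never filtered.
      return true;
    }
    for (String field : filterByFields) {
      if (!Objects.equals(oldValue.get(field), newValue.get(field))) {
        return true; // at least one filter-by field changed
      }
    }
    return false; // skip: none of the filter-by fields changed
  }
}
```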
Code changes
In order to achieve the above behavior there are several changes:
1. Previously we used pub-sub message headers to perform forwarding to handle chunks during NR pass-through in remote regions. This strategy will not work with projection because, in order to perform projection on batch data in remote regions, we need the remote partition leaders to assemble the chunks during NR pass-through. We are replacing the forwarding strategy with InMemoryChunkAssembler. To ensure leaders don't resume in between chunks, we also buffer and delay writing the chunks to the drainer until we have a fully assembled record and have produced it to the view topic(s). The view partition header code is left untouched in VPJ to remove deployment or rollback order requirements, i.e. VPJ can get ahead of the server. If the server gets ahead of VPJ that's fine too, because the server's new chunking support can function on its own. We can clean up everything once everything is deployed and stable (no more rollbacks).
2. Added enforcement in the controller to ensure view configs are immutable. The projection schema is generated when adding a new materialized view and stored with the view config. Since there can only be one schema version per view, the znode size should be manageable with compression. If this becomes a concern we can also store it separately or generate it on the fly. We also verify the filter-by fields and projection fields to ensure they exist in the latest superset or value schema and have default values.
3. Projection is performed in ComplexVeniceWriter as part of complexPut so both VPJ and leaders can use the same code for projection. Filtering is performed in MaterializedViewWriter since the current offering of change filter is applicable only to hybrid writes.
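The buffer-and-delay behavior in change 1 can be sketched with a stdlib-only model: chunks are held back, and only when the final chunk arrives is the assembled record released (at which point it can be produced to the view topic(s) and then written to the drainer). Class and method names are illustrative; the real code uses InMemoryChunkAssembler.

```java
import java.util.ArrayList;
import java.util.List;

class ChunkBuffer {
  private final List<byte[]> pendingChunks = new ArrayList<>();

  // Returns the fully assembled record once the last chunk arrives, else null
  // (meaning: keep buffering, nothing goes to the drainer yet).
  byte[] addChunk(byte[] chunk, boolean isLastChunk) {
    pendingChunks.add(chunk);
    if (!isLastChunk) {
      return null;
    }
    int total = pendingChunks.stream().mapToInt(c -> c.length).sum();
    byte[] assembled = new byte[total];
    int pos = 0;
    for (byte[] c : pendingChunks) {
      System.arraycopy(c, 0, assembled, pos, c.length);
      pos += c.length;
    }
    pendingChunks.clear(); // ready for the next chunked record
    return assembled;
  }
}
```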
Concurrency-Specific Checks
Both reviewer and PR author to verify:
- Locks (synchronized, RWLock) are used where needed.
- Concurrent collections (ConcurrentHashMap, CopyOnWriteArrayList) are used where appropriate.
How was this PR tested?
Integration tests; will add unit tests once there is some consensus on the changes.
Does this PR introduce any user-facing or breaking changes?