feat(search): lineage search performance #13545

Merged 4 commits on Jun 4, 2025

david-leifker (Collaborator) commented May 18, 2025

SearchAcrossLineage Performance Optimization:
For complex lineage graphs with high fanout, this optimization delivers performance improvements of >30x, reducing response times from >30s to <1s for standard lineage visualization queries.

Key improvements:

  • Replaced aggregation-based queries with standard search queries for entity-limited exploration. The previous aggregation approach (group_by_source and group_by_destination) didn't scale well with large graph indices.
  • Parallelized upstream/downstream searches. Incoming and outgoing relationship queries now execute concurrently using CompletableFutures, roughly halving the time for bidirectional lineage traversal. More importantly, one-directional traversal (the typical case) no longer pays for the query in the direction it does not need.
  • Implemented search_after pagination to handle large fanout. The previous implementation was constrained to 1k relationships per hop due to Elasticsearch aggregation bucket limits.
  • Added adaptive page sizing that dynamically adjusts based on the number of entities being explored and configured limits, preventing oversized queries while maintaining efficiency.
  • Introduced per-entity exploration limits to ensure fair distribution of the hop limit across multiple input entities, preventing a single high-fanout entity from consuming the entire budget.
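
To illustrate the parallelization bullet above, here is a minimal sketch of running one query per requested direction with CompletableFuture. The class name `ParallelLineageSketch`, the `Direction` enum, and the `query` function are hypothetical stand-ins, not the code in this PR; the real queries go to Elasticsearch rather than to an in-memory function.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Hypothetical sketch: names and shapes are assumptions, not DataHub's API.
final class ParallelLineageSketch {
    enum Direction { INCOMING, OUTGOING }

    /** Runs one query per requested direction concurrently and merges the results. */
    static List<String> explore(List<Direction> directions,
                                Function<Direction, List<String>> query) {
        // Only requested directions get a future, so a one-directional
        // traversal issues a single query instead of two.
        List<CompletableFuture<List<String>>> futures = new ArrayList<>();
        for (Direction d : directions) {
            futures.add(CompletableFuture.supplyAsync(() -> query.apply(d)));
        }
        List<String> merged = new ArrayList<>();
        for (CompletableFuture<List<String>> f : futures) {
            // join() blocks until that direction finishes; iterating the list
            // keeps results in the order the directions were requested.
            merged.addAll(f.join());
        }
        return merged;
    }

    public static void main(String[] args) {
        // Fake query standing in for an Elasticsearch search call.
        Function<Direction, List<String>> fake = d ->
            d == Direction.INCOMING
                ? List.of("urn:li:dataJob:upstream")
                : List.of("urn:li:dataset:downstream");
        System.out.println(explore(List.of(Direction.INCOMING, Direction.OUTGOING), fake));
    }
}
```

With both directions requested, total latency is roughly that of the slower query rather than the sum of the two.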

Technical details:

  • Removed the complex aggregation query structure (FilterAggregationBuilder, TermsAggregationBuilder, TopHitsAggregationBuilder) in favor of direct search with pagination
  • Added intelligent result batching that respects both global page size limits and per-entity exploration constraints
  • Implemented efficient tracking of discovered entities per input URN to enforce hop limits fairly across multiple starting points
  • Query optimization is now applied (when enabled) to reduce nested boolean query complexity before execution
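
The page-sizing and per-entity tracking described above could look roughly like the following. This is a sketch under assumptions: the formula in `pageSize` and the names `FairExplorationSketch`, `perEntityLimit`, and `maxPageSize` are illustrative only, and the PR may compute these values differently.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: the formula and names are assumptions, not the PR's code.
final class FairExplorationSketch {
    /** Requests enough hits to cover every entity's budget, capped by a configured maximum. */
    static int pageSize(int entityCount, int perEntityLimit, int maxPageSize) {
        long wanted = (long) entityCount * perEntityLimit; // long avoids int overflow
        return (int) Math.max(1, Math.min(maxPageSize, wanted));
    }

    /**
     * Keeps at most perEntityLimit distinct neighbors per input URN.
     * Each edge is a two-element array: [inputUrn, neighborUrn].
     */
    static Map<String, Set<String>> collect(List<String[]> edges, int perEntityLimit) {
        Map<String, Set<String>> discovered = new LinkedHashMap<>();
        for (String[] edge : edges) {
            Set<String> seen = discovered.computeIfAbsent(edge[0], k -> new LinkedHashSet<>());
            // An entity that has used up its budget stops accepting neighbors,
            // so one high-fanout entity cannot crowd out the others.
            if (seen.size() < perEntityLimit) {
                seen.add(edge[1]);
            }
        }
        return discovered;
    }

    public static void main(String[] args) {
        List<String[]> edges = List.of(
            new String[] {"a", "x"}, new String[] {"a", "y"},
            new String[] {"a", "z"}, new String[] {"b", "q"});
        System.out.println(collect(edges, 2));
        System.out.println(pageSize(3, 100, 1000));
    }
}
```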

The optimization particularly benefits scenarios with:

  • High-fanout entities (100s to 1000s of relationships)
  • Multi-entity lineage exploration where limits need to be distributed fairly
  • Deep lineage traversal requiring pagination beyond the 1k limit
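
For the pagination case, the search_after pattern can be sketched as a cursor loop: each request passes the sort key of the last hit from the previous page. The example below uses an in-memory list as a stand-in for an Elasticsearch index sorted on a unique key; `SearchAfterSketch`, `fetchAll`, and `fakeIndex` are hypothetical names for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

// Hypothetical sketch: an in-memory stand-in, not an Elasticsearch client.
final class SearchAfterSketch {
    /** Pages through results using the last hit's sort key as the cursor, up to maxResults. */
    static List<String> fetchAll(BiFunction<String, Integer, List<String>> fetchPage,
                                 int pageSize, int maxResults) {
        List<String> all = new ArrayList<>();
        String cursor = null; // null means "start from the beginning"
        while (all.size() < maxResults) {
            int size = Math.min(pageSize, maxResults - all.size());
            List<String> page = fetchPage.apply(cursor, size);
            all.addAll(page);
            if (page.size() < size) {
                break; // a short page means the index is exhausted
            }
            cursor = page.get(page.size() - 1);
        }
        return all;
    }

    /** Returns up to `size` keys strictly greater than `after` from a sorted list. */
    static BiFunction<String, Integer, List<String>> fakeIndex(List<String> sortedKeys) {
        return (after, size) -> {
            List<String> page = new ArrayList<>();
            for (String key : sortedKeys) {
                if (page.size() == size) {
                    break;
                }
                if (after == null || key.compareTo(after) > 0) {
                    page.add(key);
                }
            }
            return page;
        };
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            keys.add(String.format("urn-%05d", i));
        }
        // 2500 relationships are retrieved in pages of 1000, past the old 1k cap.
        System.out.println(fetchAll(fakeIndex(keys), 1000, 10000).size());
    }
}
```

The cursor only works if the sort key is unique and stable across requests; otherwise hits can be skipped or repeated between pages.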

[Screenshot, 2025-05-20 7:19:07 PM]

Query Optimization:
Our shared query builder can produce deeply nested boolean queries when composed across different components. To address this, we've added optimization logic that simplifies query structure by flattening redundant nesting and unwrapping boolean queries that contain only a single clause. Enable this feature with the ELASTICSEARCH_SEARCH_GRAPH_QUERY_OPTIMIZATION environment variable. While performance gains are unverified, the resulting queries are notably more readable.

An example query

Before:

{
    "bool": {
        "filter": [
            {
                "bool": {
                    "should": [
                        {
                            "bool": {
                                "should": [
                                    {
                                        "bool": {
                                            "filter": [
                                                {
                                                    "terms": {
                                                        "source.urn": [
                                                            "urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD)"
                                                        ]
                                                    }
                                                },
                                                {
                                                    "term": {
                                                        "relationshipType": {
                                                            "value": "DownstreamOf"
                                                        }
                                                    }
                                                },
                                                {
                                                    "terms": {
                                                        "destination.entityType": [
                                                            "dataset"
                                                        ]
                                                    }
                                                }
                                            ]
                                        }
                                    },
                                    {
                                        "bool": {
                                            "filter": [
                                                {
                                                    "terms": {
                                                        "destination.urn": [
                                                            "urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD)"
                                                        ]
                                                    }
                                                },
                                                {
                                                    "term": {
                                                        "relationshipType": {
                                                            "value": "DataProcessInstanceProduces"
                                                        }
                                                    }
                                                },
                                                {
                                                    "terms": {
                                                        "source.entityType": [
                                                            "dataProcessInstance"
                                                        ]
                                                    }
                                                }
                                            ]
                                        }
                                    },
                                    {
                                        "bool": {
                                            "filter": [
                                                {
                                                    "terms": {
                                                        "destination.urn": [
                                                            "urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD)"
                                                        ]
                                                    }
                                                },
                                                {
                                                    "term": {
                                                        "relationshipType": {
                                                            "value": "Produces"
                                                        }
                                                    }
                                                },
                                                {
                                                    "terms": {
                                                        "source.entityType": [
                                                            "dataJob"
                                                        ]
                                                    }
                                                }
                                            ]
                                        }
                                    }
                                ],
                                "minimum_should_match": "1"
                            }
                        }
                    ],
                    "minimum_should_match": "1"
                }
            }
        ]
    }
}

After:

{
    "bool": {
        "should": [
            {
                "bool": {
                    "filter": [
                        {
                            "terms": {
                                "source.urn": [
                                    "urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD)"
                                ]
                            }
                        },
                        {
                            "term": {
                                "relationshipType": {
                                    "value": "DownstreamOf"
                                }
                            }
                        },
                        {
                            "terms": {
                                "destination.entityType": [
                                    "dataset"
                                ]
                            }
                        }
                    ]
                }
            },
            {
                "bool": {
                    "filter": [
                        {
                            "terms": {
                                "destination.urn": [
                                    "urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD)"
                                ]
                            }
                        },
                        {
                            "term": {
                                "relationshipType": {
                                    "value": "DataProcessInstanceProduces"
                                }
                            }
                        },
                        {
                            "terms": {
                                "source.entityType": [
                                    "dataProcessInstance"
                                ]
                            }
                        }
                    ]
                }
            },
            {
                "bool": {
                    "filter": [
                        {
                            "terms": {
                                "destination.urn": [
                                    "urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD)"
                                ]
                            }
                        },
                        {
                            "term": {
                                "relationshipType": {
                                    "value": "Produces"
                                }
                            }
                        },
                        {
                            "terms": {
                                "source.entityType": [
                                    "dataJob"
                                ]
                            }
                        }
                    ]
                }
            }
        ],
        "minimum_should_match": "1"
    }
}
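
The before/after pair above can be reproduced with a small recursive rewrite. The sketch below works on plain `Map`/`List` structures rather than Elasticsearch's QueryBuilder classes, and `BoolFlattenSketch` is a hypothetical name; it is not the optimizer added in this PR. It also assumes relevance scores are not used, since unwrapping a `filter` wrapper changes scoring but not which documents match.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch on JSON-like maps; not the PR's optimizer.
final class BoolFlattenSketch {
    /** Recursively unwraps bool queries that only wrap a single filter/must/should clause. */
    @SuppressWarnings("unchecked")
    static Map<String, Object> flatten(Map<String, Object> query) {
        Object body = query.get("bool");
        if (query.size() != 1 || !(body instanceof Map)) {
            return query; // leaf query such as term/terms: nothing to simplify
        }
        Map<String, Object> bool = new LinkedHashMap<>((Map<String, Object>) body);
        // Simplify children first so nested wrappers collapse bottom-up.
        for (String key : List.of("filter", "must", "should", "must_not")) {
            Object clauses = bool.get(key);
            if (clauses instanceof List) {
                List<Object> simplified = new ArrayList<>();
                for (Object clause : (List<Object>) clauses) {
                    simplified.add(flatten((Map<String, Object>) clause));
                }
                bool.put(key, simplified);
            }
        }
        Map<String, Object> single = singleClause(bool);
        if (single != null) {
            return single;
        }
        Map<String, Object> result = new LinkedHashMap<>();
        result.put("bool", bool);
        return result;
    }

    /** Returns the only clause if the bool is a pure single-clause wrapper, else null. */
    @SuppressWarnings("unchecked")
    private static Map<String, Object> singleClause(Map<String, Object> bool) {
        List<String> keys = new ArrayList<>(bool.keySet());
        keys.remove("minimum_should_match");
        if (keys.size() != 1) {
            return null; // several clause types (or extra options): keep as-is
        }
        String key = keys.get(0);
        if (!List.of("filter", "must", "should").contains(key)) {
            return null; // must_not cannot be unwrapped
        }
        Object msm = bool.get("minimum_should_match");
        if (msm != null && (!key.equals("should") || !"1".equals(String.valueOf(msm)))) {
            return null; // be conservative about unusual minimum_should_match values
        }
        Object clauses = bool.get(key);
        if (!(clauses instanceof List) || ((List<Object>) clauses).size() != 1) {
            return null;
        }
        return (Map<String, Object>) ((List<Object>) clauses).get(0);
    }
}
```

Applied to the "Before" query, the outer `filter` wrapper and the single-clause `should` wrapper both collapse, leaving the three-clause `should` shown in "After".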

@github-actions github-actions bot added the docs, product, devops, and smoke_test labels May 18, 2025

alwaysmeticulous bot commented May 18, 2025

🔴 Meticulous spotted visual differences in 91 of 1358 screens tested: view and approve differences detected.

Meticulous evaluated ~8 hours of user flows against your PR.

Last updated for commit 3e5c628. This comment will update as new commits are pushed.

codecov bot commented May 18, 2025

Codecov Report

Attention: Patch coverage is 83.68201% with 78 lines in your changes missing coverage. Please review.

❌ Unsupported file format

Upload processing failed due to unsupported file format. Please review the parser error message:

Error parsing JUnit XML in /home/runner/work/datahub/datahub/metadata-io/build/test-results/test/TEST-com.linkedin.metadata.graph.search.elasticsearch.SearchGraphServiceElasticSearchTest.xml at 117:1058

Caused by:
    RuntimeError: Error converting computed name to ValidatedString
    
    Caused by:
        string is too long

For more help, visit our troubleshooting guide.

Files with missing lines | Patch % | Lines
...nkedin/metadata/graph/elastic/ESGraphQueryDAO.java | 87.82% | 20 Missing and 18 partials ⚠️
...va/com/linkedin/metadata/search/utils/ESUtils.java | 85.18% | 4 Missing and 12 partials ⚠️
...linkedin/metadata/search/LineageSearchService.java | 36.36% | 3 Missing and 4 partials ⚠️
...emmetadata/ElasticSearchSystemMetadataService.java | 0.00% | 3 Missing ⚠️
metadata-ingestion/src/datahub/cli/delete_cli.py | 0.00% | 2 Missing ⚠️
...din/gms/factory/entity/RollbackServiceFactory.java | 0.00% | 2 Missing ⚠️
...edin/metadata/resources/entity/EntityResource.java | 0.00% | 2 Missing ⚠️
.../com/linkedin/gms/servlet/RestliServletConfig.java | 0.00% | 2 Missing ⚠️
...-react/src/app/lineageV2/useSearchAcrossLineage.ts | 0.00% | 1 Missing ⚠️
...ata/search/elasticsearch/ElasticSearchService.java | 0.00% | 1 Missing ⚠️
... and 4 more

📢 Thoughts on this report? Let us know!

@david-leifker david-leifker force-pushed the search-max-page-size branch 2 times, most recently from 26ef3b8 to efc5965 Compare May 18, 2025 21:26
@david-leifker david-leifker force-pushed the search-max-page-size branch from efc5965 to e67002e Compare May 19, 2025 20:24
@david-leifker david-leifker force-pushed the search-max-page-size branch from e67002e to c7202f8 Compare May 20, 2025 20:41
@david-leifker david-leifker force-pushed the search-max-page-size branch from c7202f8 to a341978 Compare May 21, 2025 00:14
@david-leifker david-leifker changed the title fix(search): search result page size limit feat(search): lineage search performance May 21, 2025
@david-leifker david-leifker force-pushed the search-max-page-size branch from a341978 to 556daa2 Compare May 21, 2025 03:18
@david-leifker david-leifker force-pushed the search-max-page-size branch from 556daa2 to 3a5bf5a Compare May 21, 2025 17:51
jmacryl (Collaborator) commented May 30, 2025

* remove aggregation from limited entities query
* prevent slow responses from large page sizes (prevent 10k queries)
* parallelize upstream/downstream 1-hop search
* implement pagination to remove 1k upstream/downstream limit
* add configuration options to application.yaml
@david-leifker david-leifker force-pushed the search-max-page-size branch from e7c8c3e to 3e5c628 Compare June 1, 2025 03:08
@david-leifker david-leifker merged commit 799a9f2 into master Jun 4, 2025
78 of 81 checks passed
@david-leifker david-leifker deleted the search-max-page-size branch June 4, 2025 18:54
Labels: devops, docs, pending-submitter-merge, product, smoke_test