multi-graph server: support hosting multiple graphs in one process #540

karasikov · 2025-10-03T00:05:55Z

Support hosting multiple graphs in one server process.
Filter which graphs to query with the graphs field in the request JSON.
In multi-graph mode, all query requests share the same thread pool for querying individual graphs.

Usage:
./metagraph server_query indexes.txt --mmap --parallel 50

where indexes.txt is a CSV file with one line per graph chunk, giving the graph name, the graph path, and the annotation path:

human,human_chunk1.dbg,human_chunk1.row_diff_flat.annodbg
human,human_chunk2.dbg,human_chunk2.row_diff_flat.annodbg
microbe,microbe.dbg,microbe.row_diff_flat.annodbg
...
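
As a rough illustration of the layout above, the sketch below groups the lines of such a file by graph name and builds a request body restricted to some of the hosted graphs. Only the three-column layout and the graphs field come from this PR; the helper names and the FASTA field are assumptions made for the example, and this is not the server's actual parser.

```python
import csv
import io
import json
from collections import defaultdict


def read_index_list(text):
    """Group (graph, annotation) path pairs by graph name."""
    chunks = defaultdict(list)
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        if len(row) != 3:
            raise ValueError(f"expected name,graph,annotation but got: {row}")
        name, graph_path, anno_path = (field.strip() for field in row)
        chunks[name].append((graph_path, anno_path))
    return dict(chunks)


def build_request(fasta, graphs=None):
    """Build a query body; graphs=None means no filter is sent."""
    body = {"FASTA": fasta}  # field name assumed for illustration
    if graphs is not None:
        body["graphs"] = list(graphs)
    return json.dumps(body)


INDEXES = """\
human,human_chunk1.dbg,human_chunk1.row_diff_flat.annodbg
human,human_chunk2.dbg,human_chunk2.row_diff_flat.annodbg
microbe,microbe.dbg,microbe.row_diff_flat.annodbg
"""

hosted = read_index_list(INDEXES)
request_body = build_request(">q1\nACGT", graphs=["human"])
```

Here the two human lines end up under one name, which matches the idea that several chunks can be served as a single named graph.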

metagraph/src/cli/config/config.cpp

metagraph/src/cli/server.cpp

adamant-pwn · 2025-10-10T20:47:35Z

metagraph/api/python/metagraph/client.py

               abundance_sum: bool = False,
               query_counts: bool = False,
               query_coords: bool = False,
+               graphs: Union[None, List[str]] = None,


This might be a bit confusing next to self.graphs in this class. Are we sure we want to keep this naming? Maybe rename the graphs parameter in the API to labels, or similar?

That aside, how are MultiGraphClients actually used? Do we theoretically expect a situation in which, e.g., the first graph serves label A while the second graph serves labels B and C, and we want to query just label A or just labels A and B?

Judging by the unit test below, this would fail. Is that the outcome we want, rather than, e.g., returning empty output (and possibly a warning in the logs) when querying a label that we don't know?

Yeah, true. We can try to come up with a different name. Otherwise, MultiGraphClient is essentially a GraphClient with a ThreadPool, and it's not used all that much right now. Maybe it's simple enough to remove. Anyone can always add a ThreadPool on top at any time.
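
For reference, layering a thread pool over independent clients takes only a few lines, which is the point made above. The sketch below uses plain callables in place of real GraphClient objects, so every name in it is hypothetical. It also shows the lenient alternative raised in the review, where an unknown name produces a logged warning and no results instead of an error.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

logger = logging.getLogger("multi_client_sketch")


def query_all(clients, sequence, names=None, max_workers=4):
    """Query the selected clients in parallel and return {name: result}.

    `clients` maps a name to a callable taking a sequence; it stands in
    for real client objects and is an assumption of this sketch.
    """
    if names is None:
        selected = list(clients)
    else:
        selected = [n for n in names if n in clients]
        for unknown in sorted(set(names) - set(clients)):
            # Lenient behaviour: warn and skip instead of raising.
            logger.warning("unknown name %r, skipping", unknown)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(clients[name], sequence) for name in selected}
        return {name: future.result() for name, future in futures.items()}


fake_clients = {
    "human": lambda seq: [("human_label", len(seq))],
    "microbe": lambda seq: [("microbe_label", len(seq))],
}
results = query_all(fake_clients, "ACGT", names=["human", "plant"])
```

With this behaviour, asking for "human" and an unknown "plant" returns only the "human" entry; a strict variant would raise on the unknown name instead.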

metagraph/integration_tests/test_api.py

metagraph/src/cli/server.cpp

adamant-pwn · 2025-10-10T21:51:28Z

metagraph/src/cli/server.cpp

+                                assert(json.size() == result.size());
+                                for (Json::ArrayIndex i = 0; i < result.size(); ++i) {
+                                    for (auto&& value : json[i]["results"]) {
+                                        result[i]["results"].append(std::move(value));


Should we put some limitations on final result size, and e.g. throw an exception if we detect that it became too large to handle? Also, might be nice to collect some statistics on how much time we spend querying graphs vs merging results, as the latter is effectively single-threaded due to locking.

It just does a push back to a vector. It should be extremely fast. Space-wise also. I think there is no need to worry about that right now.

Pushing to a vector should be fast, yes; I'm just concerned about the growth rate of the output. If a query is something that occurs everywhere and produces, say, 1 MiB of output on one chunk, that would grow to ~2 GiB over 2k+ chunks, and then we also send it back over HTTP, right? Do we have something in place to guarantee scenarios like this don't happen? Or are we fine with them?
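
The arithmetic behind this concern, and one possible guard, can be sketched as follows. The cap value, the helper name, and the use of serialized JSON length as the size measure are all assumptions for illustration; nothing like this is part of the PR.

```python
import json

MIB = 1024 ** 2
GIB = 1024 ** 3

# 1 MiB per chunk over 2048 chunks is exactly 2 GiB of merged output.
projected = 1 * MIB * 2048
assert projected == 2 * GIB


class ResultTooLarge(Exception):
    """Raised when the merged response would exceed the configured cap."""


def merge_with_cap(chunk_results, max_bytes):
    """Concatenate per-chunk result lists, failing once the cap is passed."""
    merged, total = [], 0
    for chunk in chunk_results:
        for item in chunk:
            total += len(json.dumps(item))  # approximate serialized size
            if total > max_bytes:
                raise ResultTooLarge(f"merged output exceeds {max_bytes} bytes")
            merged.append(item)
    return merged


small = merge_with_cap([[{"label": "a"}], [{"label": "b"}]], max_bytes=1024)
```

Failing early like this trades completeness for a bounded response; truncating with a warning would be another option.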

Speaking of that,

Is it guaranteed that the order of elements (over which Json::ArrayIndex iterates) is the same in all chunk results, and not possibly shuffled due to multithreading? I remember it being shuffled in AWS tests when the output was in TSV format, so I had to sort it before merging.

We apply config.num_top_labels to all chunks individually, right? But should we maybe also apply it during merging, repeatedly discarding matches outside of num_top_labels, given that we are trying to simulate a single combined graph over all chunks? This would also ensure the result size stays on the order of magnitude of one chunk when this option is provided.

AFAIK, when we use num_top_labels, individual chunks return them in sorted order by the number of matches. I assume we might also need to re-sort them at the end after merging here, to preserve this format? Or is it ensured by JSON already?

I also don't know if there are other query parameters that we may need to account for during the merging stage.

That's a great point, I need to add the num_top_labels filter.

It's a bit trickier than I thought because of the many different result types (counts/coords/presence bitmap/etc.).
Let's do this in another PR, so we don't delay merging this.

I'm thinking of two things:

Always return sorted results in the server mode

Filter by top_labels
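
A minimal sketch of those two steps for the simplest result type, assuming each chunk returns one entry per query sequence with a header and a list of per-label match counts. The field names seq_description, label, and kmer_count are placeholders chosen for this example and may not match the server's JSON. Labels are assumed to be distinct across chunks, and counts, coordinates, and presence bitmaps would each need their own handling.

```python
def merge_chunks(chunk_responses, num_top_labels=None):
    """Merge per-chunk responses, then sort and truncate per sequence."""
    if not chunk_responses:
        return []
    merged = [
        {"seq_description": entry["seq_description"], "results": []}
        for entry in chunk_responses[0]
    ]
    for response in chunk_responses:
        if len(response) != len(merged):
            raise ValueError("chunks returned different numbers of sequences")
        for target, entry in zip(merged, response):
            # Guard against chunks returning sequences in a different order.
            if entry["seq_description"] != target["seq_description"]:
                raise ValueError("sequence headers do not match across chunks")
            target["results"].extend(entry["results"])
    for target in merged:
        # Most matches first; ties broken by label for a stable output.
        target["results"].sort(key=lambda r: (-r["kmer_count"], r["label"]))
        if num_top_labels is not None:
            del target["results"][num_top_labels:]
    return merged


chunk_a = [{"seq_description": "q1", "results": [
    {"label": "A", "kmer_count": 5}, {"label": "B", "kmer_count": 2}]}]
chunk_b = [{"seq_description": "q1", "results": [
    {"label": "C", "kmer_count": 9}]}]
top2 = merge_chunks([chunk_a, chunk_b], num_top_labels=2)
```

Sorting after the merge keeps the output ordered by match count no matter which chunk finished first, and truncating afterwards keeps the response close to the size of a single chunk's output.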

metagraph/src/cli/server.cpp

metagraph/src/cli/server_utils.cpp

support hosting multiple graphs and filter which to query

f8605d2

adamant-pwn reviewed Oct 3, 2025

metagraph/src/cli/config/config.cpp (outdated, resolved)

adamant-pwn reviewed Oct 3, 2025

metagraph/src/cli/server.cpp (resolved)

karasikov added 9 commits October 3, 2025 14:42

catch bad requests

0427741

catch wrong cli usage

26b0add

support parallel query with multiple hosted graphs per server

1033512

minor

ff5a2f6

Merge branch 'master' into mk/api

54864a0

Merge branch 'master' into mk/api

ce50e85

better logging, number the requests

a61dd8d

use a joint thread pool for different graphs

d92faf8

implemented /stats and /column_labels GET requests for multi-graph servers

ade6d23

karasikov requested a review from adamant-pwn October 10, 2025 20:23

karasikov changed the title from "support hosting multiple graphs in one process" to "multi-graph server: support hosting multiple graphs in one process" Oct 10, 2025

cleanup: --threads-each is not used anymore in server

248466e

adamant-pwn reviewed Oct 10, 2025

karasikov added 2 commits October 11, 2025 15:12

Merge branch 'master' into mk/api

61ba537

cleanup

091af41

karasikov requested a review from adamant-pwn October 11, 2025 21:29

karasikov added 4 commits October 11, 2025 23:43

cleanup: moved read_labels() to load_annotation

c7d33eb

fix

c60b6a2

check that merged results belong to sequences with the same headers

1736489

minor

c13a8f1

karasikov merged commit cbb3891 into master Oct 14, 2025
17 checks passed

karasikov deleted the mk/api branch October 14, 2025 07:59
