-
Notifications
You must be signed in to change notification settings - Fork 20
multi-graph server: support hosting multiple graphs in one process #540
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
abundance_sum: bool = False, | ||
query_counts: bool = False, | ||
query_coords: bool = False, | ||
graphs: Union[None, List[str]] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be a bit confusing compared to self.graphs
in this class. Are we sure we want to keep this naming? Maybe rename the parameter graphs
in API to labels
, or similar?
That aside, how are MultiGraphClient
s actually used? Do we theoretically expect a situation in which e.g. first graph serves label A
while the second graph serves labels B
and C
, and we want to query just label A
or just labels A
and B
?
Judging by the unit test below, this would fail. Is it the outcome we want, rather than e.g. returning empty output (and possibly some kind of warning in logs) when querying a label that we don't know?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, true. We can try to come up with a different name. Otherwise, MultiGraphClient is essentially a GraphClient with a ThreadPool, and it's not used all that migh right now. Maybe it's simple enough to remove. Anyone can always add a ThreadPool on top any time
assert(json.size() == result.size()); | ||
for (Json::ArrayIndex i = 0; i < result.size(); ++i) { | ||
for (auto&& value : json[i]["results"]) { | ||
result[i]["results"].append(std::move(value)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we put some limitations on final result size, and e.g. throw an exception if we detect that it became too large to handle? Also, might be nice to collect some statistics on how much time we spend querying graphs vs merging results, as the latter is effectively single-threaded due to locking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It just does a push back to a vector. It should be extremely fast. Space-wise also. I think there is no need to worry about that right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pushing to vector should be fast, yes, I'm just concerned about the growth rate of the output. if a query is something that occurs everywhere, and it produces, say, 1 MiB of output on one chunk, it would elevate to ~2 GiB over 2k+ chunks, and then we also send it back over HTTP, right? Do we have something in place to guarantee scenarios like this don't happen? Or are we fine with them?
Speaking of that,
- Is it guaranteed that the order of elements (over which
Json::ArrayIndex
go) is the same in all chunk results, and not possibly shuffled due to multithreading? I remember it being shuffled in AWS tests when output was in tsv format, so I had to additionally sort before merging. - We apply
config.num_top_labels
to all chunks individually, right? But should we maybe also apply it during merging, repeatedly discarding matches outside ofnum_top_labels
? Given that we try to simulate a single combined graph on all chunks. This would also ensure result size stays on the order of magnitude of one chunk when this option is provided. - AFAIK, when we use
num_top_labels
, individual chunks return them in sorted order by the number of matches. I assume we might also need to re-sort them at the end after merging here, to preserve this format? Or is it ensured by JSON already?
I also don't know if there are other query parameters that we may need to account for during the merging stage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a great point, I need to add the num_top_labels filter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit trickier than I thought because of the many different result types (counts/coords/presence bitmap/etc.).
Let's do this in another PR, so we don't delay merging this.
I'm thinking of two things:
- Always return sorted results in the server mode
- Filter by top_labels
graphs
in the request jsonUsage:
./metagraph server_query indexes.txt --mmap --parallel 50
where indexes.txt is a csv file with graph names and paths: