reusing vars across federated queries #4529

nguyenm100 · 2023-05-01T21:19:00Z

Hey folks, I'm trying to join some datasets via federated queries.

When I do something like

select * where {
    service <service1> {
        select ?fk where { ?fk :hasId "123" }
    }

    service <service2> {
        ?subj :hasFKey ?fk.
        ?subj :otherProp ?prop1 .
    }
}

This seems to be fast. However, adding VALUES to the subquery in service1 slows it down considerably (even if I don't use the ?val var):

select * where {
    service <service1> {
        VALUES (?val) { ("id1") ("id2") }
        select ?fk where { ?fk :hasId "123" }
    }

    service <service2> {
        ?subj :hasKey ?fk.
        ?subj :otherProp ?pro1 .
    }
}

There's some discussion on SO about the engine pulling back all the values locally for the join. I couldn't see whether there is a way to bypass this in RDF4J (e.g. passing the variable from one subquery to another), though it seems to work for the first query, where only one value was used.

https://stackoverflow.com/questions/45356326/sparql-speed-up-federated-query talks about some "reuse.vars.in.subselects" for the Sesame protocol, but perhaps it's only a GraphDB thing?

Is there a way to "pass along variables" to federated SPARQL? (I tried making the service1 call a subquery for the service2 call, and it was still slow.)

abrokenjester · 2023-05-02T01:24:26Z
Maintainer

> However, adding VALUES to subquery/service1 slows it down considerably (even if i don't use the ?val var):
>
>     ...
>     service <service1> {
>         VALUES (?val) { ("id1") ("id2") }
>         select ?fk where { ?fk :hasId "123" }
>     }

First of all: why are you adding a VALUES clause if you don't use the var anywhere? What this likely causes is that the service endpoint you are querying is evaluating the query several times (once for every supplied binding in the VALUES clause) - which would explain why it's slower. Also: have you tried sticking the values clause inside the subselect?
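For illustration, moving the VALUES clause inside the subselect might look like the sketch below. It assumes that ?val is meant to stand in for the literal "123" in the :hasId pattern, which the thread does not confirm. The PREFIX declaration is a placeholder, and <service1> and <service2> are the stand-in endpoint names from the original question.

```sparql
# Placeholder prefix; the thread does not give the real namespace.
PREFIX : <http://example.org/>

select * where {
    service <service1> {
        select ?fk where {
            # Assumption: the ids in VALUES replace the hard-coded "123"
            VALUES (?val) { ("id1") ("id2") }
            ?fk :hasId ?val .
        }
    }
    service <service2> {
        ?subj :hasFKey ?fk .
        ?subj :otherProp ?prop1 .
    }
}
```

Written this way, the VALUES rows constrain the :hasId pattern inside the subselect, and only ?fk is projected out of service1.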

> there's some discussion on SO about the engine pulling back all the values locally for the join. I couldn't see if there was a way to bypass this in rdf4j (e.g. passing the variable from one subquery to another though it seems to work for the first query when only 1 value was used).
>
> https://stackoverflow.com/questions/45356326/sparql-speed-up-federated-query talks about some "reuse.vars.in.subselects" for the sesame protocol.. but perhaps it's only a graphdb thing?

I'm not sure that's relevant to the issue you are experiencing. But perhaps I don't fully understand the problem.

2 replies

abrokenjester May 2, 2023
Maintainer

@aschwarte10 do you have any thoughts on this as our "all things federated" expert? :)

aschwarte10 May 2, 2023
Collaborator

One idea that comes to mind (I have not yet had time to go through the implementation):

I seem to remember that multiple SERVICE clauses in RDF4J are joined using bind joins, and I think these even use VALUES clauses in that case. This means that the service endpoint will actually see two VALUES clauses in the query. This may then lead to cross products, though I'm not sure how big these cross products can be in practice. Maybe these cross products explain the performance drop.
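As a small illustration of how such a cross product arises (this is not necessarily the query RDF4J actually sends to the endpoint): when a VALUES clause binds a variable that nothing else in the pattern uses, SPARQL join semantics pair every VALUES row with every row from the rest of the pattern. The PREFIX below is a placeholder.

```sparql
# Placeholder prefix; the thread does not give the real namespace.
PREFIX : <http://example.org/>

select ?val ?fk where {
    # Two rows for ?val, which no other pattern mentions
    VALUES (?val) { ("id1") ("id2") }
    # Suppose this subselect yields N rows for ?fk
    { select ?fk where { ?fk :hasId "123" } }
}
# No shared variable, so the join is a cross product: 2 x N rows
```

If a second VALUES block with M rows were added on another unrelated variable, the same reasoning would give 2 x M x N rows.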

nguyenm100 · 2023-05-02T01:39:56Z
Author

That was an oversight, but I thought it was interesting to note. I did try using the values, and the query is slow only when both services are present (I issued the query from a third machine). If I run the query in isolation (just service1), it is fast with or without the VALUES clause. And as noted, removing VALUES, even with both services in the query, is fast. One dataset has about 5m triples and the other has 18m. The result set is only 4 records.

nguyenm100 · 2023-05-02T08:14:51Z
Author

It would be great to understand a few things with respect to federation.

Order of operations: I suppose subqueries are always executed first, but are service endpoints always executed sequentially? Are they ever executed in parallel?
Can variable bindings be passed from one service endpoint to another?
Can joins be done remotely, or are they always done locally?

I feel like federation is pivotal to the core promise of semtech, so I would love to understand the RDF4J implementation better. Thanks
