Behavior of RDF4J IRI validation for IRIs containing square brackets #3777

aschwarte10 · 2022-04-05T14:43:49Z

aschwarte10
Apr 5, 2022
Collaborator

Hi all,

we have been running into a case where a generated IRI contained square brackets. For the sake of the example, let's use http://example.com/m/a_[abc]

We observed that RDF4J behaves differently for different operations and query types. Specifically, when going through RDF4J we get validation errors, which led us to look into the spec. Commercial databases also seem to behave inconsistently for such IRIs.

My reading of the IRI spec (https://www.ietf.org/rfc/rfc3987.txt) is that square brackets are part of the "reserved characters", which are not allowed in the path part of an IRI. This would mean that the RDF4J validation error is correct and expected.
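As a rough cross-check of this reading, the sketch below uses only the JDK's java.net.URI. That class follows the older URI grammar (RFC 2396 with amendments) rather than RFC 3987, so it only stands in for RDF4J's own validation and may differ from it in other cases. The class and method names are made up for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class IriPathCheck {

    // Returns true if java.net.URI parses the string, false if it rejects it.
    static boolean isAcceptedByJavaUri(String candidate) {
        try {
            new URI(candidate);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Raw square brackets in the path are rejected.
        System.out.println(isAcceptedByJavaUri("http://example.com/m/a_[abc]"));     // false
        // The percent-encoded form of the same path is accepted.
        System.out.println(isAcceptedByJavaUri("http://example.com/m/a_%5Babc%5D")); // true
    }
}
```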

I would like to get confirmation on this and also want to point out inconsistent behavior in RDF4J (see below). I can technically understand why this may happen, but from a user perspective it leads to confusion. Is this maybe something that should be fixed in RDF4J?

My test scenario:

(with memory store)

            // fails with parse error
            conn.prepareUpdate("INSERT DATA { <http://example.com/m/a_[abc]> a <http://example.com/Something> }")
                    .execute();

            // fails with parse error (URISyntaxException)
            conn.add(Values.iri("http://example.com/m/a_[abc]"), RDF.TYPE, Values.iri("http://example.com/Something"));


            // avoid IRI validation
            conn.add(SimpleValueFactory.getInstance().createIRI("http://example.com/m/a_[abc]"), RDF.TYPE,
                    Values.iri("http://example.com/Something"));

            // query works
            conn.prepareTupleQuery("SELECT * WHERE { <http://example.com/m/a_[abc]> ?p ?o }").evaluate()
                    .forEach(bs -> System.out.println(bs));

            // query works (result parsing)
            conn.prepareTupleQuery("SELECT * WHERE { ?s ?p ?o }").evaluate().forEach(bs -> System.out.println(bs));

            // query works in memory store, but fails when using SPARQL repository
            conn.prepareGraphQuery("CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o }").evaluate()
                    .forEach(st -> System.out.println(st));

with a SPARQL repository
(assuming I was able to insert the statement with the broken IRI directly into the database)

            // query works (result parsing)
            conn.prepareTupleQuery("SELECT * WHERE { ?s ?p ?o }").evaluate().forEach(bs -> System.out.println(bs));

            // query fails with parse exception
            conn.prepareGraphQuery("CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o }").evaluate()
                    .forEach(st -> System.out.println(st));

From my test I see that the parsers used for graph query results (specifically the RDFParsers) are stricter than the tuple result parsers.

From a user perspective I would expect that if the CONSTRUCT query is failing, the SELECT query should do the same.

How do you see this? And is my understanding correct that we are talking about invalid IRIs in the first place (which are accepted currently by different databases)?

Thanks,
Andreas

abrokenjester · 2022-04-09T01:58:23Z

abrokenjester
Apr 9, 2022
Maintainer

Thanks for bringing this up @aschwarte10. You are correct that this IRI is invalid according to the RFC: square brackets are not allowed in the path of an IRI. The only way to carry such characters in an IRI would be to percent-encode them: http://example.com/m/a_%5Babc%5D is a valid IRI. There actually is a way in RDF4J to do this kind of conversion of reserved chars:

String iri = ParsedIRI.create("http://example.com/m/a_[abc]").toString();
System.out.println(iri);

result:
http://example.com/m/a_%5Babc%5D
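For readers without RDF4J on the classpath, the same escaping can be approximated with the JDK alone. This is a sketch and not what ParsedIRI does internally: it assumes the IRI has already been split into scheme, host and path, and it relies on the multi-argument java.net.URI constructor quoting characters that are illegal in a path. The class and method names are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class BracketEscaping {

    // Builds a URI string from its components. The multi-argument constructor
    // percent-encodes characters that are not legal in the path component.
    static String encodePath(String scheme, String host, String path) {
        try {
            return new URI(scheme, host, path, null).toString();
        } catch (URISyntaxException e) {
            // Wrap the checked exception so callers need no throws clause.
            throw new IllegalArgumentException("Cannot build URI from components", e);
        }
    }

    public static void main(String[] args) {
        // prints http://example.com/m/a_%5Babc%5D
        System.out.println(encodePath("http", "example.com", "/m/a_[abc]"));
    }
}
```

Unlike ParsedIRI.create, this takes pre-split components rather than one string, so it is only useful when the parts of the IRI are already known.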

The general philosophy on IRI validation in RDF4J is that by default, we strictly adhere to the specs in all parsers. However, we allow bypassing these checks in several places, for two reasons:

1. If we validate IRIs at every entry point into the system (including all possible API calls), we introduce a huge performance penalty. So we rely on the source systems (the parsers or whatever client tooling is calling the base APIs) to do these checks.
2. We should to some extent allow for use cases where people want to work with RDF data that contains occasional mistakes like these.

Looking at your examples, I quite agree there is some inconsistency in behavior though. I think your first three examples are behaving correctly, and are consistent as well, but I'm not sure that the SPARQL query in your fourth example should be allowed: I would expect the SPARQL parser to protest. I traced it down and it turns out we use a non-validating ValueFactory in the SPARQL parser. I think that may be an oversight.

As for the discrepancy on processing the results of construct-queries: the in-memory store does not need to do any parsing of the result statements, as they come straight from the store. In contrast, the SPARQLRepository retrieves the result data in serialized form, and uses an RDF parser to deserialize it client-side. We should perhaps look into configuring that result parser to be more lenient, for consistency's sake.
