Skip to content

Conversation

kingcrimsontianyu
Copy link
Contributor

@kingcrimsontianyu kingcrimsontianyu commented Sep 25, 2025

Background

Initial problem

There is currently an unsolved problem in libcurl, which somehow is mislabeled as merged/solved in curl/curl#13754. For AWS S3 that requires credentials, if an object key name contains =, libcurl will fail with an HTTP 403 response. This problem does not occur to public S3 objects. This can be reproduced using the curl program:

#!/usr/bin/env bash

# version: curl 8.15.0-DEV
curl_bin=<my_curl_program_loc>

# ..........S3 private..........
region=$(aws configure get region)
user_password=$(aws configure get aws_access_key_id):$(aws configure get aws_secret_access_key)
# curl can handle this. The object key name does not contain =
url="https://<private-bucket>.s3.<region>.amazonaws.com/witcher/2MiB.bin"
# curl cannot handle this. The object key name contains =
url="https://<private-bucket>.s3.<region>.amazonaws.com/witcher/key=value_2MiB.bin"

$curl_bin -s $url \
--aws-sigv4 "aws:amz:$region:s3" \
--user "$user_password" \
-o /dev/null -w "%{http_code}\n" -v

# ..........S3 public..........
# curl can handle both
url="https://<public-bucket>.s3.<region>.amazonaws.com/witcher/2MiB.bin"
url="https://<public-bucket>.s3.<region>.amazonaws.com/witcher/key=value_2MiB.bin"
$curl_bin -s $url \
-o /dev/null -w "%{http_code}\n" -v

Additional problem

It has been found that beyond = alone, other special characters such as !*'() in a private S3 object will also cause libcurl error. In addition, some characters such as + in a public S3 object will cause the same error.

This PR

This PR addresses this problem by handling special characters listed in the AWS object key naming guidelines, for both private and public S3 object names. The KvikIO-specific object key naming guidelines are added to the remote file documentation.

Specifically, this PR introduces utility classes UrlBuilder (to complement the existing UrlParser), which builds a URL according to the user-provided components, and UrlEncoder which uses a compile-time, percent-encoding lookup table to encode selected characters.

Closes #823

@kingcrimsontianyu kingcrimsontianyu added improvement Improves an existing functionality non-breaking Introduces a non-breaking change c++ Affects the C++ API of KvikIO labels Sep 25, 2025
@copy-pr-bot
Copy link

copy-pr-bot bot commented Sep 25, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@kingcrimsontianyu kingcrimsontianyu changed the base branch from branch-25.10 to branch-25.12 September 25, 2025 05:26
@copy-pr-bot
Copy link

copy-pr-bot bot commented Sep 25, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@kingcrimsontianyu
Copy link
Contributor Author

kingcrimsontianyu commented Oct 1, 2025

Updated documentation

image

@kingcrimsontianyu kingcrimsontianyu marked this pull request as ready for review October 1, 2025 15:04
@kingcrimsontianyu kingcrimsontianyu requested a review from a team as a code owner October 1, 2025 15:04
Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kingcrimsontianyu. The docs look good, and I confirmed that this fixes the issue we saw with = earlier.

I think as we discussed earlier, we weren't able to reproduce this issue with our moto-based tests because it handles authentication differently. So I think the tests in test_url.cpp are sufficient.

@kingcrimsontianyu
Copy link
Contributor Author

Hmm. The URL cpp test failed on ARM + conda. Looking into this ...

@kingcrimsontianyu
Copy link
Contributor Author

The test failure was caused by a misuse of string view, that resulted in UB. This has been fixed.
Also an oversight of mine, I forgot to include the control characters. They have now been added to the list of encoded characters.

@kingcrimsontianyu
Copy link
Contributor Author

Redid the test that uses pylibcudf to read an exotically named S3 object "https://<bucket-private-or-public>.s3.<region>.amazonaws.com/witcher/!-_.*'()&$@=;:+%20,%3F/!-_.*'()&$@=;:+%20,%3F1_mil.parquet"

@kingcrimsontianyu
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit a5661b9 into rapidsai:branch-25.12 Oct 2, 2025
77 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

c++ Affects the C++ API of KvikIO improvement Improves an existing functionality non-breaking Introduces a non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

403 Error when reading S3 URLs with special characters

3 participants