
feat: Add JNI-based Hadoop FileSystem support for S3 and other Hadoop-compatible stores #1992


Draft
wants to merge 7 commits into main

Conversation

drexler-sky (Contributor)

Which issue does this PR close?

Closes #.

Rationale for this change

This PR adds support for Approach 2 (JNI-based Hadoop FileSystem access) to enable S3 reads via the native DataFusion Parquet scanner. The original discussion of both approaches can be found in issue #1766.

What changes are included in this PR?

Added JNI-based integration for accessing Hadoop FileSystem in Comet

Introduced a new config flag: spark.comet.use_jni_object_store to toggle this feature

How are these changes tested?

A new test is included.

codecov-commenter commented Jul 7, 2025

Codecov Report

Attention: Patch coverage is 9.09091% with 50 lines in your changes missing coverage. Please review.

Project coverage is 58.01%. Comparing base (f09f8af) to head (97e6aee).
Report is 312 commits behind head on main.

Files with missing lines Patch % Lines
...n/java/org/apache/comet/parquet/JniHDFSBridge.java 0.00% 50 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1992      +/-   ##
============================================
+ Coverage     56.12%   58.01%   +1.88%     
- Complexity      976     1152     +176     
============================================
  Files           119      134      +15     
  Lines         11743    13095    +1352     
  Branches       2251     2432     +181     
============================================
+ Hits           6591     7597    +1006     
- Misses         4012     4264     +252     
- Partials       1140     1234      +94     


@parthchandra (Contributor) left a comment

Thanks @drexler-sky for this PR! I've been working on something similar and there is overlap as well as some differences in our respective methods. In particular I am looking at how to integrate with the native_iceberg_compat mode, as well as reusing the SeekableInputStream that native_iceberg_compat already has.
I'll put my code on GitHub in a day or so and we can discuss how to get to an implementation that works for both paths.

@@ -292,4 +299,104 @@ public static native void currentColumnBatch(
* @param handle
*/
public static native void closeRecordBatchReader(long handle);

Contributor:

Can we move these to a new file, say JniHDFSBridge?

Contributor Author:

Yes, fixed.

@@ -50,6 +50,7 @@ lazy_static = "1.4.0"
prost = "0.13.5"
jni = "0.21"
snap = "1.1"
chrono = { version = "0.4", default-features = false, features = ["clock"] }
Contributor:

Maybe use the dependency defined in the main Cargo.toml (i.e., make this a workspace dependency)?

Contributor Author:

Fixed.

/// # Returns
/// Returns `Ok(usize)` with the file size in bytes on success, or an `ObjectStoreError`
/// if the operation fails.
pub fn call_get_length(
Contributor:

just get_length perhaps?

Contributor Author:

Changed to get_length

/// # Returns
/// Returns `Ok(Vec<u8>)` containing the requested bytes on success, or an
/// `ObjectStoreError` if the operation fails.
pub fn call_read(
Contributor:

just read, or maybe read_range?

Contributor Author:

Changed to read

&self,
_prefix: Option<&Path>,
) -> BoxStream<'static, Result<ObjectMeta, ObjectStoreError>> {
futures::stream::empty().boxed()
Contributor:

Did you intend to return an empty stream here?

Contributor Author:

Changed to todo!()
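For reference, a minimal sketch (not the PR's actual code) of a third option for the unsupported list path, assuming the object_store and futures crates: surface object_store's NotImplemented error so callers fail loudly instead of panicking via todo!() or silently seeing zero results.

use futures::stream::{self, BoxStream, StreamExt};
use object_store::{path::Path, Error as ObjectStoreError, ObjectMeta};

// Sketch only: report "not implemented" as a stream error rather than
// panicking (todo!()) or yielding an empty listing.
fn list_unsupported(
    _prefix: Option<&Path>,
) -> BoxStream<'static, Result<ObjectMeta, ObjectStoreError>> {
    stream::once(async { Err(ObjectStoreError::NotImplemented) }).boxed()
}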

@@ -384,6 +392,17 @@ pub(crate) fn prepare_object_store_with_configs(

let (object_store, object_store_path): (Box<dyn ObjectStore>, Path) = if scheme == "hdfs" {
parse_hdfs_url(&url)
} else if scheme == "s3" && use_jni {
Contributor:

Ah, this is a tricky bit. If use_jni is true we should be able to use any file system available to Hadoop. We really need to scan the config for all spark.hadoop.fs.customfs.impl entries and register the JNI-based object store for every customfs.

Contributor Author:

Thanks! I updated the code to scan for spark.hadoop.fs.<scheme>.impl configs and use JniObjectStore for any matching scheme when use_jni is true.
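As a rough illustration of that scan (assumed shape only, not the PR's exact code; the helper name hadoop_schemes is illustrative), one way to collect every scheme that has a spark.hadoop.fs.&lt;scheme&gt;.impl entry:

use std::collections::HashSet;

// Collect schemes for which the Spark/Hadoop config declares a FileSystem
// implementation, so the JNI object store can be registered for all of them.
fn hadoop_schemes<'a>(configs: impl Iterator<Item = (&'a str, &'a str)>) -> HashSet<String> {
    configs
        .filter_map(|(key, _value)| {
            // Spark forwards these as e.g. "spark.hadoop.fs.customfs.impl".
            let key = key.strip_prefix("spark.hadoop.")?;
            let rest = key.strip_prefix("fs.")?;
            let scheme = rest.strip_suffix(".impl")?;
            // Skip nested keys such as "fs.s3a.impl.disable.cache".
            (!scheme.contains('.')).then(|| scheme.to_string())
        })
        .collect()
}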

@@ -0,0 +1,332 @@
// Licensed to the Apache Software Foundation (ASF) under one
Contributor:

Can we name this jni_hdfs.rs since this is a JNI-based implementation of an HDFS object store?

Contributor Author:

Changed

@parthchandra (Contributor) left a comment

Thanks for making the changes @drexler-sky. I did a more detailed pass and have some more comments.

Comment on lines +35 to +47
static JVM: OnceCell<JavaVM> = OnceCell::new();

pub fn init_jvm(env: &JNIEnv) {
let _ = JVM.set(env.get_java_vm().expect("Failed to get JavaVM"));
}

fn get_jni_env<'a>() -> jni::AttachGuard<'a> {
JVM.get()
.expect("JVM not initialized")
.attach_current_thread()
.expect("Failed to attach thread")
}

Contributor:

We are already executing this code in a JVM, and the JNIEnv is available to us in Java_org_apache_comet_Native_executePlan (jni_api.rs). We could probably pass that in all the way here.
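A hedged sketch of that idea (the entry-point name and store type below are illustrative, not the actual Comet symbols): grab the JavaVM from the JNIEnv the JVM already passes to the native function and hand it down explicitly, instead of stashing it in a global OnceCell.

use std::sync::Arc;
use jni::{objects::JClass, JNIEnv, JavaVM};

// Hypothetical store type, just to show the plumbing.
struct JniHdfsStore {
    vm: Arc<JavaVM>,
}

// Hypothetical entry point (the real one mentioned above is
// Java_org_apache_comet_Native_executePlan in jni_api.rs).
#[no_mangle]
pub extern "system" fn Java_org_apache_comet_Native_exampleEntry(env: JNIEnv, _class: JClass) {
    // Every JNIEnv knows its JavaVM; no global state or JNI_GetCreatedJavaVMs needed.
    let vm = Arc::new(env.get_java_vm().expect("failed to get JavaVM"));
    let _store = JniHdfsStore { vm };
    // ...pass `_store` (or the Arc<JavaVM>) down to where the object store is built...
}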

Ok(jmap)
}

pub fn jni_error(e: jni::errors::Error) -> ObjectStoreError {
Contributor:

jvm_bridge/mod.rs has a jni_map_error macro. It also has the very useful jni_call and jni_static_call. Perhaps we can use those and make the code consistent?

let (object_store, object_store_path): (Box<dyn ObjectStore>, Path) = if scheme == "hdfs" {
parse_hdfs_url(&url)
} else if use_jni && hadoop_schemes.contains(scheme) {
Contributor:

I think this check should come first, even before the s3a scheme has been renamed to s3. Note that a user could override the implementation to be used for the hdfs scheme.
Either way, if the use_jni flag is set, we should use JNI if the scheme is hdfs or if the config specifies an implementation for the scheme.
Also, we can rename s3a to s3 only if use_jni is false; this way we won't be renaming s3a back and forth.
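A sketch of the ordering being suggested (control flow only, with placeholder return labels, assuming a precomputed set of schemes that have fs.&lt;scheme&gt;.impl configured):

use std::collections::HashSet;

// Decide the JNI path first; only rewrite s3a -> s3 when falling back to the
// built-in object_store implementations, so the scheme is never renamed back and forth.
fn choose_store(scheme: &str, use_jni: bool, hadoop_schemes: &HashSet<String>) -> &'static str {
    if use_jni && (scheme == "hdfs" || hadoop_schemes.contains(scheme)) {
        "jni-hadoop-filesystem"
    } else if scheme == "hdfs" {
        "native-hdfs"
    } else {
        let scheme = if scheme == "s3a" { "s3" } else { scheme };
        match scheme {
            "s3" => "native-s3",
            _ => "other",
        }
    }
}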


test("Comet uses JNI object store when use_jni is true") {
spark.conf.set("spark.comet.use_jni_object_store", "true")
runParquetScanAndAssert()
Contributor:

Is there a way we can verify the JNI implementation did get called?

@parthchandra (Contributor):

@comphead @Kontinuation you might be interested in looking at this PR.

@comphead (Contributor) left a comment

Thanks @drexler-sky for the PR.
Please help me understand how this object store is different from https://github.com/apache/datafusion-comet/blob/main/native/hdfs/src/object_store/hdfs.rs ?

@Kontinuation (Member):

I took a look at the datafusion-comet-objectstore-hdfs module in Comet and found that it largely overlaps with the Hadoop FileSystem bridge we are building here. A better approach is to reuse datafusion-comet-objectstore-hdfs but find a way to pass additional Hadoop configurations to it. Users may configure credentials for accessing the storage in the Spark configuration, so passing them correctly when constructing the ObjectStore is necessary.

datafusion-comet-objectstore-hdfs works as follows:

                                               FFI            JNI
datafusion-comet-objectstore-hdfs --> fs-hdfs -----> libhdfs -----> Hadoop File System (JVM)

libhdfs and fs-hdfs should be able to support all Hadoop File System implementations, not just HDFS. The current problem is that fs-hdfs does not provide a way to instantiate an HdfsFs instance using custom Hadoop configurations. libhdfs does provide hdfsBuilderConfSetStr, so we need to open up new APIs in fs-hdfs to make use of it.
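To make that concrete, here is a hypothetical sketch of what such a config-aware fs-hdfs API could look like (none of these names exist in fs-hdfs today, and the config values are illustrative); each set call would map to one hdfsBuilderConfSetStr call in libhdfs.

use std::collections::HashMap;

// Hypothetical builder: accepts Hadoop configuration programmatically instead of
// relying only on environment variables or *-site.xml files on the classpath.
struct HdfsFsBuilder {
    uri: String,
    conf: HashMap<String, String>,
}

impl HdfsFsBuilder {
    fn new(uri: &str) -> Self {
        Self { uri: uri.to_string(), conf: HashMap::new() }
    }

    // Would translate to hdfsBuilderConfSetStr(builder, key, value) in libhdfs.
    fn set(mut self, key: &str, value: &str) -> Self {
        self.conf.insert(key.to_string(), value.to_string());
        self
    }

    fn connect(self) {
        // FFI into libhdfs would happen here; omitted in this sketch.
        let _ = (self.uri, self.conf);
    }
}

// Usage: forward Spark's spark.hadoop.* entries straight through.
fn example() {
    HdfsFsBuilder::new("s3a://bucket/path")
        .set("fs.s3a.aws.credentials.provider", "com.example.MyCredentialsProvider")
        .set("fs.s3a.endpoint", "s3.us-west-2.amazonaws.com")
        .connect();
}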

BTW, is there any concern enabling hdfs support by default and switching the default fs-hdfs dependency to fs-hdfs3?

[features]
default = ["hdfs", "try_spawn_blocking"]
hdfs = ["fs-hdfs"]
hdfs3 = ["fs-hdfs3"]

@comphead (Contributor) commented Jul 9, 2025

I took a look at the datafusion-comet-objectstore-hdfs module in Comet and found that it largely overlaps with the Hadoop FileSystem bridge we are building here. A better approach is to reuse datafusion-comet-objectstore-hdfs but find a way to pass additional Hadoop configurations to it. Users may configure credentials for accessing the storage in the Spark configuration, so passing them correctly when constructing the ObjectStore is necessary.

datafusion-comet-objectstore-hdfs works as follows:

                                               FFI            JNI
datafusion-comet-objectstore-hdfs --> fs-hdfs -----> libhdfs -----> Hadoop File System (JVM)

libhdfs and fs-hdfs should be able to support all Hadoop File System implementations, not just HDFS. The current problem is that fs-hdfs does not provide a way to instantiate an HdfsFs instance using custom Hadoop configurations. libhdfs does provide hdfsBuilderConfSetStr, so we need to open up new APIs in fs-hdfs to make use of it.

BTW, is there any concern enabling hdfs support by default and switching the default fs-hdfs dependency to fs-hdfs3?

[features]
default = ["hdfs", "try_spawn_blocking"]
hdfs = ["fs-hdfs"]
hdfs3 = ["fs-hdfs3"]

Thanks @Kontinuation, I was about to create a PR to enable hdfs support by default. cc @kazuyukitanimura
One stopper for me was that HDFS is less popular now and probably doesn't have that many consumers to justify enabling it by default. But we probably should do it; hopefully the binary wouldn't be much bigger in this case.

For fs-hdfs3, let me check that.

Regarding configuring the Hadoop client from the Rust side, I used the command line:

    // LIBHDFS3_CONF=/Users/ovoievodin/tmp/hadoop-3.2.4/etc/hadoop/ JAVA_HOME="/opt/homebrew/opt/openjdk@11" RUST_BACKTRACE=1 RUSTFLAGS="-L /opt/homebrew/opt/openjdk@11/libexec/openjdk.jdk///Contents/Home/lib/server" cargo test --lib tests::test_read_hdfs -- --nocapture

So Rust was able to get the Hadoop client configurations.
If running from Comet, the Hadoop configuration can be passed through Spark by adding hdfs-site.xml to the classpath or using spark.hadoop.* params like in https://datafusion.apache.org/comet/user-guide/datasources.html#using-experimental-native-datafusion-reader

But if it could be improved, feel free to extend the hdfs crate so the user can instantiate HdfsFs correctly.

@drexler-sky (Contributor Author):

@comphead @Kontinuation

Thank you for the comments. That all makes sense to me.

Here’s the plan I propose:

  1. Extend fs-hdfs to support passing in a Hadoop configuration programmatically, instead of relying solely on environment variables.

  2. Update datafusion-comet-objectstore-hdfs to accept a config map and forward it to fs-hdfs.

  3. In Comet, extract relevant Hadoop configurations from SparkConf and pass them through.

While it’s currently possible to set configurations via environment variables or spark.hadoop.*, I believe enabling a more explicit and programmatic approach will improve flexibility and user experience.

Does this approach sound reasonable to everyone? If so, I’ll start with step 1 and submit a PR to arrow-rs once it’s ready.

@parthchandra (Contributor):

datafusion-comet-objectstore-hdfs --> fs-hdfs -----> libhdfs -----> Hadoop File System (JVM)

I did not realize that fs-hdfs was using libhdfs. How is libhdfs packaged into the build? Do we have to worry about the libc version and different platforms or is this taken care of by the fs-hdfs crate?

@parthchandra (Contributor):

I have a couple of use cases in mind that I'm hoping this will cover -

  1. Custom credentials providers - Say a credentials provider extending org.apache.hadoop.fs.s3a.auth.AbstractSessionCredentialsProvider
  2. Custom FileSystem implementations. i.e. implementations of org.apache.hadoop.fs.FileSystem

I'm assuming that as long as we are using libhdfs, these cases are covered?

@comphead (Contributor) commented Jul 9, 2025

@comphead @Kontinuation

Thank you for the comments. That all makes sense to me.

Here’s the plan I propose:

  1. Extend fs-hdfs to support passing in a Hadoop configuration programmatically, instead of relying solely on environment variables.
  2. Update datafusion-comet-objectstore-hdfs to accept a config map and forward it to fs-hdfs.
  3. In Comet, extract relevant Hadoop configurations from SparkConf and pass them through.

While it’s currently possible to set configurations via environment variables or spark.hadoop.*, I believe enabling a more explicit and programmatic approach will improve flexibility and user experience.

Does this approach sound reasonable to everyone? If so, I’ll start with step 1 and submit a PR to arrow-rs once it’s ready.

Thanks @drexler-sky, I think this approach makes a lot of sense. You probably also want to cover the case where multiple config sources are set (env var, spark.hadoop.*, and programmatic) and decide which one should override the others.

@parthchandra I'm not sure if libhdfs is packaged.

With and without the hdfs feature, libcomet.dylib has the same size of 69M, so static linking is unlikely. And checking dynamic libs:

with hdfs (libjvm linked)

otool -L  native/target/release/libcomet.dylib 
native/target/release/libcomet.dylib:
        /opt/homebrew/opt/openjdk@11/libexec/openjdk.jdk/Contents/Home/lib/server/libjvm.dylib (compatibility version 1.0.0, current version 1.0.0)
        /System/Library/Frameworks/Security.framework/Versions/A/Security (compatibility version 1.0.0, current version 61439.120.27)
        /System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation (compatibility version 150.0.0, current version 3502.1.255)
        /usr/lib/libiconv.2.dylib (compatibility version 7.0.0, current version 7.0.0)
        /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1351.0.0)
MacBook-Pro-110:arrow-datafusion-comet ovoievodin$ ls -la native/target/release/

without hdfs

        /System/Library/Frameworks/Security.framework/Versions/A/Security (compatibility version 1.0.0, current version 61439.120.27)
        /System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation (compatibility version 150.0.0, current version 3502.1.255)
        /usr/lib/libiconv.2.dylib (compatibility version 7.0.0, current version 7.0.0)
        /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1351.0.0)

@comphead (Contributor) commented Jul 9, 2025

I have a couple of use cases in mind that I'm hoping this will cover -

  1. Custom credentials providers - Say a credentials provider extending org.apache.hadoop.fs.s3a.auth.AbstractSessionCredentialsProvider
  2. Custom FileSystem implementations. i.e. implementations of org.apache.hadoop.fs.FileSystem

I'm assuming that as long as we are using libhdfs, these cases are covered?

Hope so. My understanding is that libhdfs is a wrapper, so authentication like Kerberos should work.

@parthchandra (Contributor):

@parthchandra I'm not sure if libhdfs is packaged.

With and without the hdfs feature, libcomet.dylib has the same size of 69M, so static linking is unlikely. And checking dynamic libs.

@comphead were you able to run this on a system with no hadoop/hdfs installed? I have an old version of hadoop on my mac and libhdfs.a is less than 500K so it is entirely possible to be statically linked in. I don't know if binaries are available for mac, btw, so I don't know how that would work.

@parthchandra (Contributor):

Update on this:

  1. fs-hdfs includes its own version of libhdfs as part of its build so we do not need to worry about that
  2. fs-hdfs recognizes only file, hdfs, and viewfs as schemes. This is not a showstopper as it is easily addressed.

With the above addressed and with appropriate modifications to Comet to always use the hdfs object store, I was able to access an s3a uri. The read failed but it may be unrelated (I will investigate).

I'll verify with a custom credentials provider as well.

So it appears that using the datafusion-comet-objectstore-hdfs object store is a viable path forward. @drexler-sky I agree with @comphead and @Kontinuation, so if you want to go ahead with your proposed plan, please do.

@Kontinuation (Member) commented Jul 12, 2025

There are some problems with the approach of using fs-hdfs (libhdfs).

Problems

1. Linking against libjvm.so

As we discovered before, using fs-hdfs makes libcomet dynamically link against libjvm.so. libjvm.so is only necessary when we want to call the Java invocation API; these APIs are for creating new Java VMs, finding existing VMs, etc. They are not needed by JNI modules, because JNI native functions accept a JNIEnv parameter, so there's no need for Java VM discovery or creation.

Here are the problems with linking against libjvm.so

  1. This is not a common practice when building JNI modules. I have inspected some commonly used JNI libraries including snappy-java, lz4-java, zstd-jni, arrow_cdata_jni, blas, etc. None of them link against libjvm.so.
  2. This introduces additional friction for development. When the hdfs feature is enabled, all the test binaries will link against libjvm.so, and there's a high chance that the tests won't run out of the box. These are the failures I got when running cargo test --features hdfs:

Linux:

     Running unittests src/lib.rs (target/debug/deps/comet-b5e7808e7551d6a8)
/home/kontinuation/documents/github/datafusion-comet/native/target/debug/deps/comet-b5e7808e7551d6a8: error while loading shared libraries: libjvm.so: cannot open shared object file: No such file or directory
error: test failed, to rerun pass `-p datafusion-comet --lib`

macOS:

Running unittests src/lib.rs (target/debug/deps/comet-2a4d21387bc9c825)
dyld[61398]: Library not loaded: @rpath/libjvm.dylib
  Referenced from: <D3FE1520-33E0-3D0F-BDDB-AF7ADB755CBE> /Users/bopeng/workspace/github/datafusion-comet/native/target/debug/deps/comet-2a4d21387bc9c825
  Reason: tried: '/Users/bopeng/workspace/github/datafusion-comet/native/target/debug/build/aws-lc-sys-93a6aa5d6483d3f7/out/libjvm.dylib' (no such file), '/Users/bopeng/workspace/github/datafusion-comet/native/target/debug/build/blake3-0740d36ff36fef54/out/libjvm.dylib' (no such file), '/Users/bopeng/workspace/github/datafusion-comet/native/target/debug/build/fs-hdfs3-ef462506593c877b/out/libjvm.dylib' (no such file), '/Users/bopeng/workspace/github/datafusion-comet/native/target/debug/build/psm-0620469b47746bcc/out/libjvm.dylib' (no such file), '/Users/bopeng/workspace/github/datafusion-comet/native/target/debug/build/ring-c54438584ee3a8bc/out/libjvm.dylib' (no such file), '/Users/bopeng/workspace/github/datafusion-comet/native/target/debug/build/zstd-sys-948618cdd9c42d49/out/libjvm.dylib' (no such file), '/Users/bopeng/workspace/github/datafusion-comet/native/target/debug/deps/libjvm.dylib' (no such file), '/Users/bopeng/workspace/github/datafusion-comet/native/target/debug/libjvm.dylib' (no such file), '/Users/bopeng/.rustup/toolchains/stable-aarch64-apple-darwin/lib/rustlib/aarch64-apple-darwin/lib/libjvm.dylib' (no such file), '/Users/bopeng/.rustup/toolchains/stable-aarch64-apple-darwin/lib/libjvm.dylib' (no such file), '/Users/bopeng/lib/libjvm.dylib' (no such file), '/usr/local/lib/libjvm.dylib' (no such file), '/usr/lib/libjvm.dylib' (no such file, not in dyld cache)
error: test failed, to rerun pass `-p datafusion-comet --lib`

We have to point LD_LIBRARY_PATH or DYLD_LIBRARY_PATH to the directory containing libjvm to run those tests. This is quite inconvenient.

2. Problem with AttachCurrentThread and DetachCurrentThread

The JNIEnv retrieval and thread-attachment management policies of Comet and libhdfs do not play well with each other.

  • Comet uses JavaVM::attach_current_thread to obtain a JNIEnv for calling Java methods in multiple places; it automatically calls AttachCurrentThread if necessary, and returns an AttachGuard that automatically calls DetachCurrentThread when the JNIEnv is dropped.
  • libhdfs always calls AttachCurrentThread when obtaining a JNIEnv, and caches the JNIEnv handle in thread-local variables. It assumes that DetachCurrentThread will never be called and that the cached JNIEnv will always be valid.

There might be cases where Comet calls DetachCurrentThread at some point and libhdfs then tries to reuse the stale cached thread-local JNIEnv. This may happen in theory; I'm not sure if this problem has been observed in practice.

Solutions

I am trying to solve these problems by adding a new no_jvm_invocation feature to fs-hdfs (datafusion-contrib/fs-hdfs#26). When this feature is enabled, libjvm.so won't be linked. It adds a new API set_java_vm and requires the user to provide a handle to a JavaVM before calling other functions. This eliminates the need for discovering VMs by calling JNI_GetCreatedJavaVMs, which is an invocation API that requires libjvm.

The policy for managing JNIEnv is also revised:

  • It tries to obtain a JNIEnv from the JavaVM by calling (*vm)->GetEnv. If this succeeds, the thread was already attached by the caller and may be detached by the caller in the future, so libhdfs won't cache the JNIEnv in thread-local storage in this case.
  • If (*vm)->GetEnv fails, it calls AttachCurrentThread to obtain a JNIEnv and caches it in thread-local storage. The cached JNIEnv is reused within this thread. When the thread terminates, DetachCurrentThread is called to free the resources.

This policy ensures that only the caller of AttachCurrentThread calls DetachCurrentThread, which is an easier principle to follow than the current mess.
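Expressed with the Rust jni crate purely as an analogy (the actual change lives in the libhdfs/fs-hdfs C code), the revised policy is roughly:

use jni::{JNIEnv, JavaVM};

// Analogy of the revised policy: reuse the caller's attachment when it exists,
// otherwise attach for the lifetime of the thread so detachment only happens
// when the thread itself exits, never in the middle of other callers' work.
fn with_env<T>(vm: &JavaVM, f: impl FnOnce(&mut JNIEnv) -> T) -> jni::errors::Result<T> {
    match vm.get_env() {
        // Thread already attached by the caller: borrow its JNIEnv, never detach.
        Ok(mut env) => Ok(f(&mut env)),
        // Not attached: attach "permanently"; the jni crate detaches only when
        // this thread terminates.
        Err(_) => {
            let mut env = vm.attach_current_thread_permanently()?;
            Ok(f(&mut env))
        }
    }
}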

Other Considerations

  • The complexity of this approach may be higher than what we expected before
  • The code quality of libhdfs is ... questionable. I'm not sure if it is a good idea to build general Hadoop FileSystem support upon it anymore.

@parthchandra (Contributor):

We have to point LD_LIBRARY_PATH or DYLD_LIBRARY_PATH to the directory containing libjvm to run those tests. This is quite inconvenient.

I did not run into this issue when trying to use fs-hdfs with S3. I did have to add S3 to fs-hdfs (I'll submit a PR for this) and hit a few minor issues, but nothing major. I do recall we had the requirement to have libjvm in the library load path early on, but don't recall what changed to remove the requirement.

@parthchandra (Contributor):

I'm not sure if it is a good idea to build general Hadoop FileSystem support upon it anymore.

With @drexler-sky and @Kontinuation's changes we might be in a better position to evaluate this approach in a production environment.


@parthchandra (Contributor):

I am trying to solve these problems by adding a new no_jvm_invocation feature to fs-hdfs (datafusion-contrib/fs-hdfs#26). When this feature is enabled, libjvm.so won't be linked. It adds a new API set_java_vm and requires the user to provide a handle to a JavaVM before calling other functions. This eliminates the need for discovering VMs by calling JNI_GetCreatedJavaVMs, which is an invocation API that requires libjvm.

+1

@Kontinuation (Member) commented Jul 15, 2025

I found another problem when setting up a GitHub workflow for fs-hdfs:
https://github.com/Kontinuation/fs-hdfs/actions/runs/16287559163/job/45989546965#step:9:461

HdfsFile::read and HdfsFile::read_with_pos do not guarantee that the output buffer will be filled when there's more data available than the capacity of the output buffer; a common practice is to read data in a loop until we have read enough data or hit EOF.

The get and read_range functions in Comet's hdfs support are not aware of this, so we need to fix this problem as well.

UPDATE: The CI failure is not caused by incomplete reads, but caused by hardcoding the value of O_APPEND on a specific platform instead of defining it in a portable way. But my point still holds.
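For illustration, a read-fully loop of the kind described above (a sketch assuming a positional read that returns the number of bytes actually read; the real fs-hdfs HdfsFile signature may differ):

// Read exactly buf.len() bytes starting at pos, looping over short reads.
fn read_fully(
    read_at: impl Fn(u64, &mut [u8]) -> std::io::Result<usize>,
    mut pos: u64,
    buf: &mut [u8],
) -> std::io::Result<()> {
    let mut filled = 0;
    while filled < buf.len() {
        let n = read_at(pos, &mut buf[filled..])?;
        if n == 0 {
            // EOF before the requested range was satisfied.
            return Err(std::io::Error::new(
                std::io::ErrorKind::UnexpectedEof,
                "hit EOF before reading the full range",
            ));
        }
        filled += n;
        pos += n as u64;
    }
    Ok(())
}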

@parthchandra (Contributor):

The get and read_range functions in Comet's hdfs support are not aware of this, so we need to fix this problem as well.

Yes, I saw this too. I will have a PR ready today.

@mbutrovich (Contributor):

Should we set this to draft as we discuss the design going forward?

drexler-sky marked this pull request as draft on July 20, 2025 04:39
6 participants