Skip to content

Conversation

codope
Copy link
Member

@codope codope commented Jul 13, 2025

Description

Implements LSM timeline reading capability for the hudi-rs core to support both v6 and v8 timeline formats, following Java implementation patterns with timeline layout abstraction. This is the foundation PR and we will follow up with future work on LSM timeline history reading, support for multiple layers of timeline compaction and manifest parsing for compacted timeline files.

Core Implementation

  • Timeline Loader Architecture: Added TimelineLoader to abstract different timeline storage formats.
  • Timeline Builder: Implemented TimelineBuilder to handle configuration resolution and loader instantiation based on table configuration.
  • Multi-loader Orchestration: Updated Timeline to support querying both active and archived/compacted timeline data.
Timeline {
    active_loader: TimelineLoader,
    archived_loader: Option<TimelineLoader>,
}
  • Layout v1: Uses LayoutOneActive (reads from .hoodie/ directory) and LayoutOneArchived (future)
  • Layout v2: Uses LayoutTwoActive (foundation for .hoodie/timeline/) and LayoutTwoCompacted (future)

Configuration Integration

  • Version Detection: Automatic timeline layout selection based on hoodie.table.version and hoodie.timeline.layout.version
  • Graceful Fallback: Handles invalid table configurations by defaulting to active timeline layout.
  • Backward Compatibility: Maintains full compatibility with existing v6 table APIs.

Related to #377

How are the changes test-covered

  • 250/250 core tests passing
  • Integration tests for table reading, snapshot queries, and file slices
  • Backward compatibility verified with existing test data

@codope codope requested a review from xushiyan as a code owner July 13, 2025 06:36
@codope codope linked an issue Jul 13, 2025 that may be closed by this pull request
Copy link

codecov bot commented Jul 13, 2025

Codecov Report

❌ Patch coverage is 41.25874% with 84 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.59%. Comparing base (72fe192) to head (4f2eded).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
crates/core/src/timeline/loader.rs 40.00% 39 Missing ⚠️
crates/core/src/timeline/lsm_tree.rs 0.00% 14 Missing ⚠️
crates/core/src/timeline/mod.rs 45.00% 11 Missing ⚠️
crates/core/src/config/internal.rs 0.00% 9 Missing ⚠️
crates/core/src/timeline/builder.rs 66.66% 9 Missing ⚠️
crates/core/src/config/table.rs 66.66% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #395      +/-   ##
==========================================
- Coverage   89.33%   86.59%   -2.74%     
==========================================
  Files          49       52       +3     
  Lines        2747     2835      +88     
==========================================
+ Hits         2454     2455       +1     
- Misses        293      380      +87     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@codope codope mentioned this pull request Jul 13, 2025
4 tasks
Copy link
Member Author

@codope codope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few note to reviewers.

storage: Arc<Storage>,
}

impl ActiveTimeline {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's take the opportunity to re-think some of the model design:

from the callers perspective, they should not care if the commits should come from active timeline history/ or archived/ because these are all timeline internal logic flow. If the caller requests require looking into compacted or archived commits, the timeline instance will need to make a call and fetch the data as needed. So at the interface level, we don't need an explicit ActiveTimeline. We can just keep it Timeline as the front-facing model.

#[derive(Debug, Clone)]
pub enum TimelineLayoutType {
Active(super::active::ActiveTimeline),
Lsm(super::lsm::LsmTimeline),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We think of Timeline as a top level interface where callers can interact with, the layout is an internal config that resolved based on table version and timeline layout version (as you resolved in create_layout(). It'd be a bit odd to see a "Type" struct having a Timeline instance basically for choosing a Timeline to use.

How about we design it like:

  • Timeline remains as the top-level interface
  • When a Timeline is created, it resolves HudiConfigs, and create an internal model call TimelineLoader based on the configs
  • TimelineLoader can be an enum, it has these items:
LayoutOneActive
LayoutOneArchived
LayoutTwoActive
LayoutTwoCompacted

Each loader will implement the logic to load its intended commit files.

  • When some APIs are invoked and need to load archived or compacted commits, the Timeline instance will use the needed Loaders to do IO
  • Timeline is responsible for collecting and stitching the Loaders' return data and returning final data to callers
  • We also avoid calling a LSM timeline since LSM tree is just one impl of Timeline which should be hidden away from Timeline's level

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, better suggestion overall! Few things if you can clarify:

We also avoid calling a LSM timeline since LSM tree is just one impl of Timeline which should be hidden away from Timeline's level

Should the LSM tree logic be a separate module that the loaders use, or should the loaders themselves be the LSM tree implementations? So, in the context of TimelineLoader::LayoutTwoActive, would this variant contain an LSMTree instance, or would it implement LSM tree logic directly?

Also, should the current Timeline::load_instants() method remain as-is, or would you prefer a different public API? For example, another API that takes a boolean to decide whether to load archived/compacted timeline. Alternatively, we could just add another API keeping load_instants() as-is.

Copy link
Member

@xushiyan xushiyan Jul 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@codope cool! keeping lsm tree specific in lsm_tree.rs is good. may not need to worry about re-using it for now - it can be used by the loaders exclusively, but somewhat positioned for future re-use

Looking closer - so load_instants() is actually a more generic helper function, it's the using of a selector in one of the init functions (new_from_storage()) that defines the behavior of timeline's instants loading. So depends on where load_instants() gets called, we just need to pass in the desired TimelineSelector

@xushiyan xushiyan added this to the release-0.5.0 milestone Jul 14, 2025
@codope codope force-pushed the lsm-timeline-p1 branch from 98b19ae to 7b34918 Compare July 14, 2025 14:42
codope added 5 commits August 4, 2025 11:01
…d and new table versions

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
This commit introduces the foundational architecture for supporting LSM timelines in hudi-rs, enabling compatibility with both v6 (active timeline) and v8 (LSM timeline) Hudi table formats.

The key changes include:

- A `TimelineLoader` enum to abstract different timeline storage formats.
- A `TimelineBuilder` to handle configuration resolution and loader instantiation based on `hoodie.table.version` and `hoodie.timeline.layout.version`.
- Refactored `Timeline` to use the new loader architecture while maintaining backward compatibility with existing APIs.
- Graceful fallback for invalid table configurations to ensure test stability.

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Implements the orchestration logic within the `Timeline` struct to support querying both active and archived/compacted timeline data.

Adds a new `load_instants_with_archive` method which queries the active loader and, if requested, the archived loader, then merges the results. Existing `get_completed_*` methods are updated to use this new orchestration layer while maintaining their original behavior.

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
- Renamed lsm.rs to lsm_tree.rs to better reflect its purpose
- Expanded LSMTree struct with core logic for manifest reading
- Integrated LSMTree with TimelineLoader for v8 timeline support
- Updated TimelineLoader to handle LayoutTwoActive using the new LSM tree logic

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
codope added 2 commits August 4, 2025 11:57
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
selector: &TimelineSelector,
storage: &Storage,
desc: bool,
include_archived: bool,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the current TimelineSelector has this property too, which was my initial thought. But let's try to go with simple UX for the API: callers do not need to know if they need to load archived or not, which is timeline's internal impl - they just ask for a list of instants for a given time range (everything else that a timeline selector holds):

  • if no specific time given, load active instants
  • if a start or end time given, and falls into archived range, then load relevant archived + active instants

Timeline loader can internally decide what to load.

The only case that we might need users to set include_archived is to set it false when they might load for a time range that could be in archived but for perf reason they choose to skip archived. This is a rare advanced usage we should be good to ignore now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. We’ll keep public API simple and make archived loading an internal concern driven by the selector range; removed the public flag for now.

let mut instants = Vec::new();

for file_info in files {
if file_info.name.starts_with("history/") || file_info.name.starts_with('_') {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we can enhance the storage.list_files() to support a listing for only immediate files under the prefix, then we dont need explicit check and just grab the instant files. What is the case with having _ prefix?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it also occurs to me that we may need to enhance the listing by ignoring .crc files by default.. we'll track this separately.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ensured sorted results; removed _ filter; ignoring .crc. Will do a shallow-listing enhancement on Storage in a follow-up.

}

pub struct LSMTree {
storage: Arc<Storage>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to take any hudi configs for LSM tree specific settings?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll pass configs into the LSM reader when adding archived reads to honor timeline-specific settings.

instants.extend(archived_instants);

// Re-sort after merging
instants.sort_unstable();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we be sure about loading results' order- archived instants already sorted right? then we can just concat with active ones?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Loaders now return sorted lists; we'll replace the final sort with a merge when archived reading lands.

use serde::{Deserialize, Serialize};
use std::sync::Arc;

pub const LSM_TIMELINE_DIR: &str = ".hoodie/timeline";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to respect "hoodie.timeline.history.path", and "hoodie.timeline.history.path" right? same for v6, we need to handle "hoodie.archivelog.folder". but we can deal with these in PR for archived timeline. the corresponding timeline loader should be responsible to take these from configs and own it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted; i'll wire in the configs but archived loaders will actually resolve paths from configs in the next PR.

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
- Made load_archived_instants pub(crate)
- Removed include_archived from public surface; added internal load_instants_internal used by Timeline methods
- LayoutTwoActive now sorts results, ignores .crc, and drops the _ prefix filter
- LSM manifest docs
- Dropped v5 support in builder
- Added TimelineSelector::has_time_filter() so Timeline knows when a time range is requested.
- Updated Timeline::load_instants(...) to automatically include archived results when a time filter is present and an archived loader exists. Active is always queried first, archived appended if available.

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
@xushiyan
Copy link
Member

resuming review on this over the next 2 days

Copy link
Member

@xushiyan xushiyan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have some suggestions for code simplification. once addressed, we should be good to move forward

Comment on lines +546 to +547
(HudiTableConfig::BasePath, "file:///tmp/base".to_string()),
(HudiTableConfig::TableVersion, "6".to_string()),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
(HudiTableConfig::BasePath, "file:///tmp/base".to_string()),
(HudiTableConfig::TableVersion, "6".to_string()),
(HudiTableConfig::BasePath, "file:///tmp/base"),
(HudiTableConfig::TableVersion, "6"),

same as above. the api supports Into<String>()

)),
// Fallbacks are handled in callers; no defaults here
Self::ArchiveLogFolder => None,
Self::TimelineHistoryPath => None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these 2 do have default values right? why not provide them through this API?

let mut instants = Vec::new();

for file_info in files {
if file_info.name.starts_with("history/") || file_info.name.ends_with(".crc") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if file_info.name.starts_with("history/") || file_info.name.ends_with(".crc") {
// TODO: make `storage.list_files` api support such filtering, like ignore crc and return files only
if file_info.name.starts_with("history/") || file_info.name.ends_with(".crc") {

if file_info.name.starts_with("history/") || file_info.name.ends_with(".crc") {
continue;
}
match selector.try_create_instant(file_info.name.as_str()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this API needs to handle v8 instants as the completed ones have completion ts

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@codope this is the only blocker for merging. if we need to pass v8 timeline loading test cases, this parsing needs to be there.

Comment on lines +114 to +119
// Config wiring: if archived read not enabled, return behavior based on policy
let storage = match self {
TimelineLoader::LayoutOneArchived(storage)
| TimelineLoader::LayoutTwoArchived(storage) => storage.clone(),
_ => return Ok(Vec::new()), // Active loaders don't have archived parts
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this storage clone does not need happen here?

Comment on lines +122 to +127
let enabled = configs
.get_or_default(TimelineArchivedReadEnabled)
.to::<bool>();
if !enabled {
let behavior = configs
.get_or_default(TimelineArchivedUnavailableBehavior)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like we just need 1 config? TimelineArchivedUnavailableBehavior default to continue will just either load archived or do nothing. even for internal config, we need to keep it simple for easy maintenance

let archive_dir = configs
.try_get(ArchiveLogFolder)
.map(|v| v.to::<String>())
.unwrap_or_else(|| ".hoodie/archived".to_string());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here this default value can be provided through default_value() ? the try_get() API handles it

for file_info in files {
if file_info.name.ends_with(".crc") {
continue;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we can skip checking .crc in this pr for simplicity. there is an open PR addresses this from storage level

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Read LSM tree timeline

2 participants