-
Notifications
You must be signed in to change notification settings - Fork 52
feat: add LSM timeline support with timeline layout management for old and new table versions #395
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #395 +/- ##
==========================================
- Coverage 89.33% 86.59% -2.74%
==========================================
Files 49 52 +3
Lines 2747 2835 +88
==========================================
+ Hits 2454 2455 +1
- Misses 293 380 +87 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few note to reviewers.
crates/core/src/timeline/active.rs
Outdated
storage: Arc<Storage>, | ||
} | ||
|
||
impl ActiveTimeline { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's take the opportunity to re-think some of the model design:
from the callers perspective, they should not care if the commits should come from active timeline history/
or archived/
because these are all timeline internal logic flow. If the caller requests require looking into compacted or archived commits, the timeline instance will need to make a call and fetch the data as needed. So at the interface level, we don't need an explicit ActiveTimeline
. We can just keep it Timeline
as the front-facing model.
crates/core/src/timeline/layout.rs
Outdated
#[derive(Debug, Clone)] | ||
pub enum TimelineLayoutType { | ||
Active(super::active::ActiveTimeline), | ||
Lsm(super::lsm::LsmTimeline), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We think of Timeline
as a top level interface where callers can interact with, the layout is an internal config that resolved based on table version and timeline layout version (as you resolved in create_layout()
. It'd be a bit odd to see a "Type" struct having a Timeline instance basically for choosing a Timeline to use.
How about we design it like:
Timeline
remains as the top-level interface- When a Timeline is created, it resolves
HudiConfigs
, and create an internal model callTimelineLoader
based on the configs TimelineLoader
can be an enum, it has these items:
LayoutOneActive
LayoutOneArchived
LayoutTwoActive
LayoutTwoCompacted
Each loader will implement the logic to load its intended commit files.
- When some APIs are invoked and need to load archived or compacted commits, the Timeline instance will use the needed
Loader
s to do IO Timeline
is responsible for collecting and stitching theLoader
s' return data and returning final data to callers- We also avoid calling a LSM timeline since LSM tree is just one impl of Timeline which should be hidden away from Timeline's level
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, better suggestion overall! Few things if you can clarify:
We also avoid calling a LSM timeline since LSM tree is just one impl of Timeline which should be hidden away from Timeline's level
Should the LSM tree logic be a separate module that the loaders use, or should the loaders themselves be the LSM tree implementations? So, in the context of TimelineLoader::LayoutTwoActive
, would this variant contain an LSMTree
instance, or would it implement LSM tree logic directly?
Also, should the current Timeline::load_instants()
method remain as-is, or would you prefer a different public API? For example, another API that takes a boolean to decide whether to load archived/compacted timeline. Alternatively, we could just add another API keeping load_instants()
as-is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@codope cool! keeping lsm tree specific in lsm_tree.rs
is good. may not need to worry about re-using it for now - it can be used by the loaders exclusively, but somewhat positioned for future re-use
Looking closer - so load_instants()
is actually a more generic helper function, it's the using of a selector
in one of the init functions (new_from_storage()
) that defines the behavior of timeline's instants loading. So depends on where load_instants()
gets called, we just need to pass in the desired TimelineSelector
…d and new table versions Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
This commit introduces the foundational architecture for supporting LSM timelines in hudi-rs, enabling compatibility with both v6 (active timeline) and v8 (LSM timeline) Hudi table formats. The key changes include: - A `TimelineLoader` enum to abstract different timeline storage formats. - A `TimelineBuilder` to handle configuration resolution and loader instantiation based on `hoodie.table.version` and `hoodie.timeline.layout.version`. - Refactored `Timeline` to use the new loader architecture while maintaining backward compatibility with existing APIs. - Graceful fallback for invalid table configurations to ensure test stability. Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Implements the orchestration logic within the `Timeline` struct to support querying both active and archived/compacted timeline data. Adds a new `load_instants_with_archive` method which queries the active loader and, if requested, the archived loader, then merges the results. Existing `get_completed_*` methods are updated to use this new orchestration layer while maintaining their original behavior. Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
- Renamed lsm.rs to lsm_tree.rs to better reflect its purpose - Expanded LSMTree struct with core logic for manifest reading - Integrated LSMTree with TimelineLoader for v8 timeline support - Updated TimelineLoader to handle LayoutTwoActive using the new LSM tree logic Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
crates/core/src/timeline/mod.rs
Outdated
selector: &TimelineSelector, | ||
storage: &Storage, | ||
desc: bool, | ||
include_archived: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the current TimelineSelector has this property too, which was my initial thought. But let's try to go with simple UX for the API: callers do not need to know if they need to load archived or not, which is timeline's internal impl - they just ask for a list of instants for a given time range (everything else that a timeline selector holds):
- if no specific time given, load active instants
- if a start or end time given, and falls into archived range, then load relevant archived + active instants
Timeline loader can internally decide what to load.
The only case that we might need users to set include_archived
is to set it false
when they might load for a time range that could be in archived but for perf reason they choose to skip archived. This is a rare advanced usage we should be good to ignore now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. We’ll keep public API simple and make archived loading an internal concern driven by the selector range; removed the public flag for now.
crates/core/src/timeline/loader.rs
Outdated
let mut instants = Vec::new(); | ||
|
||
for file_info in files { | ||
if file_info.name.starts_with("history/") || file_info.name.starts_with('_') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we can enhance the storage.list_files()
to support a listing for only immediate files under the prefix, then we dont need explicit check and just grab the instant files. What is the case with having _
prefix?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it also occurs to me that we may need to enhance the listing by ignoring .crc
files by default.. we'll track this separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ensured sorted results; removed _
filter; ignoring .crc
. Will do a shallow-listing enhancement on Storage in a follow-up.
} | ||
|
||
pub struct LSMTree { | ||
storage: Arc<Storage>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to take any hudi configs for LSM tree specific settings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll pass configs into the LSM reader when adding archived reads to honor timeline-specific settings.
crates/core/src/timeline/mod.rs
Outdated
instants.extend(archived_instants); | ||
|
||
// Re-sort after merging | ||
instants.sort_unstable(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we be sure about loading results' order- archived instants already sorted right? then we can just concat with active ones?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Loaders now return sorted lists; we'll replace the final sort with a merge when archived reading lands.
use serde::{Deserialize, Serialize}; | ||
use std::sync::Arc; | ||
|
||
pub const LSM_TIMELINE_DIR: &str = ".hoodie/timeline"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to respect "hoodie.timeline.history.path"
, and "hoodie.timeline.history.path"
right? same for v6, we need to handle "hoodie.archivelog.folder"
. but we can deal with these in PR for archived timeline. the corresponding timeline loader should be responsible to take these from configs and own it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noted; i'll wire in the configs but archived loaders will actually resolve paths from configs in the next PR.
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
- Made load_archived_instants pub(crate) - Removed include_archived from public surface; added internal load_instants_internal used by Timeline methods - LayoutTwoActive now sorts results, ignores .crc, and drops the _ prefix filter - LSM manifest docs - Dropped v5 support in builder - Added TimelineSelector::has_time_filter() so Timeline knows when a time range is requested. - Updated Timeline::load_instants(...) to automatically include archived results when a time filter is present and an archived loader exists. Active is always queried first, archived appended if available. Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
ea11bad
to
4f2eded
Compare
resuming review on this over the next 2 days |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
have some suggestions for code simplification. once addressed, we should be good to move forward
(HudiTableConfig::BasePath, "file:///tmp/base".to_string()), | ||
(HudiTableConfig::TableVersion, "6".to_string()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(HudiTableConfig::BasePath, "file:///tmp/base".to_string()), | |
(HudiTableConfig::TableVersion, "6".to_string()), | |
(HudiTableConfig::BasePath, "file:///tmp/base"), | |
(HudiTableConfig::TableVersion, "6"), |
same as above. the api supports Into<String>()
)), | ||
// Fallbacks are handled in callers; no defaults here | ||
Self::ArchiveLogFolder => None, | ||
Self::TimelineHistoryPath => None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these 2 do have default values right? why not provide them through this API?
let mut instants = Vec::new(); | ||
|
||
for file_info in files { | ||
if file_info.name.starts_with("history/") || file_info.name.ends_with(".crc") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if file_info.name.starts_with("history/") || file_info.name.ends_with(".crc") { | |
// TODO: make `storage.list_files` api support such filtering, like ignore crc and return files only | |
if file_info.name.starts_with("history/") || file_info.name.ends_with(".crc") { |
if file_info.name.starts_with("history/") || file_info.name.ends_with(".crc") { | ||
continue; | ||
} | ||
match selector.try_create_instant(file_info.name.as_str()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think this API needs to handle v8 instants as the completed ones have completion ts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@codope this is the only blocker for merging. if we need to pass v8 timeline loading test cases, this parsing needs to be there.
// Config wiring: if archived read not enabled, return behavior based on policy | ||
let storage = match self { | ||
TimelineLoader::LayoutOneArchived(storage) | ||
| TimelineLoader::LayoutTwoArchived(storage) => storage.clone(), | ||
_ => return Ok(Vec::new()), // Active loaders don't have archived parts | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this storage clone does not need happen here?
let enabled = configs | ||
.get_or_default(TimelineArchivedReadEnabled) | ||
.to::<bool>(); | ||
if !enabled { | ||
let behavior = configs | ||
.get_or_default(TimelineArchivedUnavailableBehavior) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like we just need 1 config? TimelineArchivedUnavailableBehavior
default to continue will just either load archived or do nothing. even for internal config, we need to keep it simple for easy maintenance
let archive_dir = configs | ||
.try_get(ArchiveLogFolder) | ||
.map(|v| v.to::<String>()) | ||
.unwrap_or_else(|| ".hoodie/archived".to_string()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here this default value can be provided through default_value()
? the try_get()
API handles it
for file_info in files { | ||
if file_info.name.ends_with(".crc") { | ||
continue; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think we can skip checking .crc
in this pr for simplicity. there is an open PR addresses this from storage level
Description
Implements LSM timeline reading capability for the hudi-rs core to support both v6 and v8 timeline formats, following Java implementation patterns with timeline layout abstraction. This is the foundation PR and we will follow up with future work on LSM timeline history reading, support for multiple layers of timeline compaction and manifest parsing for compacted timeline files.
Core Implementation
TimelineLoader
to abstract different timeline storage formats.TimelineBuilder
to handle configuration resolution and loader instantiation based on table configuration.Timeline
to support querying both active and archived/compacted timeline data.LayoutOneActive
(reads from.hoodie/
directory) andLayoutOneArchived
(future)LayoutTwoActive
(foundation for.hoodie/timeline/
) andLayoutTwoCompacted
(future)Configuration Integration
hoodie.table.version
andhoodie.timeline.layout.version
Related to #377
How are the changes test-covered