# Vectored Timeline Get

Created on: 2024-01-02
Author: Christian Schwarz

# Summary

A brief RFC / GitHub Epic describing a vectored version of the `Timeline::get` method that is at the heart of Pageserver.

# Motivation

During basebackup, we issue many `Timeline::get` calls for SLRU pages that are *adjacent* in key space.
For an example, see
https://github.com/neondatabase/neon/blob/5c88213eaf1b1e29c610a078d0b380f69ed49a7e/pageserver/src/basebackup.rs#L281-L302.

Each of these `Timeline::get` calls must traverse the layer map to gather reconstruct data (`Timeline::get_reconstruct_data`) for the requested page number (`blknum` in the example).
For each layer visited by layer map traversal, we do a `DiskBtree` point lookup.
If it's negative (no entry), we resume layer map traversal.
If it's positive, we collect the result in our reconstruct data bag.
If the reconstruct data bag contents suffice to reconstruct the page, we're done with `get_reconstruct_data` and move on to walredo.
Otherwise, we resume layer map traversal.

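In pseudocode, the current single-key flow looks roughly like this (a simplified sketch; the helper names and signatures are abbreviations for illustration, not the actual API):

```rust
// Simplified sketch of the current per-key flow (helper names are hypothetical).
fn get_reconstruct_data(key: Key, request_lsn: Lsn) -> ReconstructData {
    let mut bag = ReconstructData::new();
    let mut cont_lsn = request_lsn;
    loop {
        // Layer map traversal: find the next layer that may contain `key`.
        let layer = layer_map_search(key, cont_lsn);
        // DiskBtree point lookup within that layer.
        match layer.point_lookup(key, cont_lsn) {
            // Negative lookup: resume layer map traversal below this layer.
            None => {}
            // Positive lookup: collect into the reconstruct data bag.
            Some(data) => {
                bag.extend(data);
                if bag.suffices_to_reconstruct_page() {
                    return bag; // done: hand off to walredo
                }
            }
        }
        cont_lsn = layer.lsn_range().start; // resume traversal below this layer
    }
}
```
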
Doing this many `Timeline::get` calls is quite inefficient because:

1. We do the layer map traversal repeatedly, even if, e.g., all the data sits in the same image layer at the bottom of the stack.
2. We may visit many `DiskBtree` inner pages multiple times for point lookups of different keys.
   This is likely particularly bad for L0s, which span the whole key space and hence must be visited by layer map traversal but may not contain the data we're looking for.
3. Anecdotally, keys adjacent in keyspace and written simultaneously also end up physically adjacent in the layer files [^1].
   So, to provide the reconstruct data for N adjacent keys, we would actually only _need_ to issue a single large read to the filesystem, instead of the N reads we currently do.
   The filesystem, in turn, ideally stores the layer file physically contiguously, so our large read will turn into one IOP toward the disk.

[^1]: https://www.notion.so/neondatabase/Christian-Investigation-Slow-Basebackups-Early-2023-12-34ea5c7dcdc1485d9ac3731da4d2a6fc?pvs=4#15ee4e143392461fa64590679c8f54c9

# Solution

We should have a vectored (aka batched, aka scatter-gather style) alternative API for `Timeline::get`. Having such an API unlocks:

* more efficient basebackup
* batched IO during compaction (useful for strides of unchanged pages)
* page_service: expose vectored get_page_at_lsn for compute (=> good for seqscan / prefetch)
* if [on-demand SLRU downloads](https://github.com/neondatabase/neon/pull/6151) land before vectored `Timeline::get`, they will still benefit from this API

# DoD

There is a new variant of `Timeline::get`, called `Timeline::get_vectored`.
It takes as arguments an `lsn: Lsn` and a `src: &[KeyVec]`, where `struct KeyVec { base: Key, count: usize }`.

It is up to the implementor to figure out a suitable and efficient way to return the reconstructed page images.
It is sufficient to simply return a `Vec<Bytes>`, but more efficient solutions can likely be found after studying all the callers of `Timeline::get`.
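
For concreteness, the API could look roughly like the sketch below. The `RequestContext` and `PageReconstructError` parts mirror the existing `Timeline::get` signature; the `Vec<Bytes>` return type is just the baseline option mentioned above, and whether the method is async in exactly this form is left open:

```rust
pub struct KeyVec {
    pub base: Key,
    pub count: usize,
}

impl Timeline {
    /// Sketch only, not the final API: fetch the page images for all keys
    /// described by `src`, as of `lsn`.
    pub async fn get_vectored(
        &self,
        src: &[KeyVec],
        lsn: Lsn,
        ctx: &RequestContext,
    ) -> Result<Vec<Bytes>, PageReconstructError> {
        // Naive placeholder: a real implementation is the subject of this RFC.
        todo!()
    }
}
```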

Functionally, the behavior of `Timeline::get_vectored` is equivalent to:

```rust
// Pseudocode: spells out the semantics, not an efficient implementation.
let keys_iter = src.iter().flat_map(|kv| kv.base..kv.base + kv.count);
let mut out = Vec::new();
for key in keys_iter {
    let data = timeline.get(key, lsn)?;
    out.push(data);
}
return out;
```

However, unlike the above, an ideal solution will:

* Visit each `struct Layer` at most once.
* For each visited layer, call `Layer::get_value_reconstruct_data` at most once.
  * This means: read each `DiskBtree` page at most once.
* Facilitate merging of the reads we issue to the OS and eventually NVMe.

Each of the items above represents a significant amount of work.

## Performance

Ideally, the **base performance** of a vectored get of a single page should be identical to that of the current `Timeline::get`.
A reasonable constant overhead over the current `Timeline::get` is acceptable.

The performance improvement for the vectored use case must be demonstrated in some way, e.g., using the `pagebench` basebackup benchmark against a tenant with many SLRU segments.

# Implementation

High-level set of tasks / changes to be made:

- **Get clarity on API**:
  - Define a naive `Timeline::get_vectored` implementation & adopt it across pageserver.
  - The tricky thing here will be the return type (e.g., `Vec<Bytes>` vs. `impl Stream`).
  - Start with something simple to explore the different usages of the API.
    Then iterate with peers until we have something that is good enough.
- **Vectored Layer Map traversal**
  - Vectored `LayerMap::search` (take 1 LSN and N `Key`s instead of just 1 LSN and 1 `Key`).
  - Refactor `Timeline::get_reconstruct_data` to hold & return state for N `Key`s instead of 1.
  - The slightly tricky part here is what to do about `cont_lsn` [after we've found some reconstruct data for some keys](https://github.com/neondatabase/neon/blob/d066dad84b076daf3781cdf9a692098889d3974e/pageserver/src/tenant/timeline.rs#L2378-L2385) but still need more.
    Likely we'll need to keep track of `cont_lsn` per key and continue the next iteration at the `max(cont_lsn)` of all keys that still need data (see the first sketch after this list).
- **Vectored `Layer::get_value_reconstruct_data` / `DiskBtree`**
  - Current code calls it [here](https://github.com/neondatabase/neon/blob/d066dad84b076daf3781cdf9a692098889d3974e/pageserver/src/tenant/timeline.rs#L2378-L2384).
  - Delta layers use `DiskBtreeReader::visit()` to collect the `(offset, len)` pairs for the delta record blobs to load.
  - Image layers use `DiskBtreeReader::get` to get the offset of the image blob to load. Underneath, that's just a `::visit()` call.
  - What needs to happen to `DiskBtree::visit()`? (See the second sketch after this list.)
    * Minimally:
      * Take a single `KeyVec` instead of a single `Key` as argument, i.e., take a single contiguous key range to visit.
      * Change the visit code to invoke the callback for all values in the `KeyVec`'s key range.
      * This should be good enough for what we've seen when investigating basebackup slowness, because there, the key ranges are contiguous.
    * Ideally:
      * Take a `&[KeyVec]` and sort it;
      * during B-tree traversal, peek at the next `KeyVec` range to determine whether we need to descend or back out.
      * NB: this should be a straightforward extension of the minimal solution above, as we'll already be checking for "is there more key range in the requested `KeyVec`".
- **Facilitate merging of the reads we issue to the OS and eventually NVMe.** (See the third sketch after this list.)
  - `DiskBtree::visit` produces a set of offsets which we then read from a `VirtualFile` [here](https://github.com/neondatabase/neon/blob/292281c9dfb24152b728b1a846cc45105dac7fe0/pageserver/src/tenant/storage_layer/delta_layer.rs#L772-L804).
  - [Delta layer reads](https://github.com/neondatabase/neon/blob/292281c9dfb24152b728b1a846cc45105dac7fe0/pageserver/src/tenant/storage_layer/delta_layer.rs#L772-L804)
    - We hit (and rely on) `PageCache` and `VirtualFile` here (not great under pressure).
  - [Image layer reads](https://github.com/neondatabase/neon/blob/292281c9dfb24152b728b1a846cc45105dac7fe0/pageserver/src/tenant/storage_layer/image_layer.rs#L429-L435)
  - What needs to happen is the **vectorization of the `blob_io` interface and then the `VirtualFile` API**.
  - That is tricky because:
    - the `VirtualFile` API, which sits underneath `blob_io`, is being touched by the ongoing [io_uring work](https://github.com/neondatabase/neon/pull/5824);
    - there's the question of how IO buffers will be managed; currently this area relies heavily on `PageCache`, but there's controversy around the future of `PageCache`.
  - The guiding principle here should be to avoid coupling this work to the `PageCache`.
    - I.e., treat `PageCache` as an extra hop in the I/O chain, rather than as an integral part of buffer management.
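
First sketch: per-key `cont_lsn` bookkeeping for the vectored traversal. The state type and helper below are hypothetical illustrations of the idea; the point is only the `max(cont_lsn)` selection:

```rust
// Hypothetical per-key traversal state for the vectored case.
struct KeyTraversalState {
    key: Key,
    cont_lsn: Lsn,  // where layer map traversal must continue for this key
    complete: bool, // true once this key's reconstruct data suffices
}

/// The next traversal iteration continues at the max `cont_lsn` over all
/// keys that still need data, so no layer is skipped for any pending key.
fn next_cont_lsn(states: &[KeyTraversalState]) -> Option<Lsn> {
    states
        .iter()
        .filter(|s| !s.complete)
        .map(|s| s.cont_lsn)
        .max()
}
```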
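
Second sketch: the minimal `DiskBtree` change. The existing `visit()` callback already decides whether traversal continues, so a contiguous range visit can likely be expressed as a forward visit that starts at the range's base key and stops once past its end (key encoding/decoding and the exact `visit` signature are glossed over here):

```rust
// Sketch: collect (key, offset) for all entries in [range_start, range_end).
// As in DiskBtreeReader, the callback returning false stops the traversal.
let mut hits: Vec<(Key, u64)> = Vec::new();
tree_reader
    .visit(&range_start_bytes, VisitDirection::Forwards, |raw_key, offset| {
        let key = decode_key(raw_key); // hypothetical decode helper
        if key >= range_end {
            return false; // past the requested range: stop
        }
        hits.push((key, offset));
        true // still inside the range: keep visiting
    })
    .await?;
```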
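
Third sketch: the read-merging step itself is conceptually simple once the `(offset, len)` pairs are known; the hard part is the buffer management discussed above. A self-contained illustration of the merging:

```rust
/// Merge (offset, len) reads that are adjacent or overlapping,
/// so that N small reads become a few large ones.
fn merge_reads(mut reads: Vec<(u64, u64)>) -> Vec<(u64, u64)> {
    reads.sort_unstable_by_key(|&(offset, _)| offset);
    let mut merged: Vec<(u64, u64)> = Vec::new();
    for (offset, len) in reads {
        match merged.last_mut() {
            // Adjacent or overlapping: extend the previous read to cover this one.
            Some((m_off, m_len)) if offset <= *m_off + *m_len => {
                *m_len = (*m_len).max(offset + len - *m_off);
            }
            _ => merged.push((offset, len)),
        }
    }
    merged
}
```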

Let's see how much we can improve by doing the first three items in the list above, then revisit.

## Rollout / Feature Flags

No feature flags are required for this epic.

At the end of this epic, `Timeline::get` forwards to `Timeline::get_vectored`, i.e., it's an all-or-nothing type of change.

It is encouraged to deliver this feature incrementally, i.e., in many small PRs over multiple weeks.
That will help isolate performance regressions across weekly releases.

# Interaction With Sharding

[Sharding](https://github.com/neondatabase/neon/pull/5432) splits up the key space; see functions `is_key_local` / `key_to_shard_number`.

Just as with `Timeline::get`, callers of `Timeline::get_vectored` are responsible for ensuring that they only ask for blocks of the given `struct Timeline`'s shard.

Given that this is already the case, there shouldn't be significant interaction/interference with sharding.

However, let's have a safety check for this constraint (error or assertion), because there are currently few affordances at the higher layers of Pageserver for sharding<=>keyspace interaction.
For example, `KeySpace` is not broken up by shard stripe, so if someone naively converted the compaction code to issue a vectored get for a keyspace range, it would violate this constraint.
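
A minimal sketch of such a check, assuming the `is_key_local` helper from the sharding PR is reachable from `Timeline` (the exact placement and names here are assumptions):

```rust
// Hypothetical check at the top of Timeline::get_vectored: every requested
// key must belong to this Timeline's shard.
for kv in src {
    let mut key = kv.base;
    for _ in 0..kv.count {
        // `is_key_local` is the sharding helper named above; whether it is
        // reachable via `self.shard_identity` is an assumption.
        assert!(
            self.shard_identity.is_key_local(&key),
            "vectored get called with a key outside this shard"
        );
        key = key.next(); // advance to the next key in key space
    }
}
```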
