# Vectored Timeline Get

Created on: 2024-01-02
Author: Christian Schwarz

# Summary

A brief RFC / GitHub Epic describing a vectored version of the `Timeline::get` method that is at the heart of Pageserver.

# Motivation

During basebackup, we issue many `Timeline::get` calls for SLRU pages that are *adjacent* in key space.
For an example, see
https://github.com/neondatabase/neon/blob/5c88213eaf1b1e29c610a078d0b380f69ed49a7e/pageserver/src/basebackup.rs#L281-L302.

Each of these `Timeline::get` calls must traverse the layer map to gather reconstruct data (`Timeline::get_reconstruct_data`) for the requested page number (`blknum` in the example).
For each layer visited by layer map traversal, we do a `DiskBtree` point lookup.
If it's negative (no entry), we resume layer map traversal.
If it's positive, we collect the result in our reconstruct data bag.
If the reconstruct data bag contents suffice to reconstruct the page, we're done with `get_reconstruct_data` and move on to walredo.
Otherwise, we resume layer map traversal.

Doing this many `Timeline::get` calls is quite inefficient because:

1. We do the layer map traversal repeatedly, even if, e.g., all the data sits in the same image layer at the bottom of the stack.
2. We may visit many DiskBtree inner pages multiple times for point lookup of different keys.
   This is likely particularly bad for L0s which span the whole key space and hence must be visited by layer map traversal, but
   may not contain the data we're looking for.
3. Anecdotally, keys adjacent in keyspace and written simultaneously also end up physically adjacent in the layer files [^1].
   So, to provide the reconstruct data for N adjacent keys, we would actually only _need_ to issue a single large read to the filesystem, instead of the N reads we currently do.
   The filesystem, in turn, ideally stores the layer file physically contiguously, so our large read will turn into one IOP toward the disk.

[^1]: https://www.notion.so/neondatabase/Christian-Investigation-Slow-Basebackups-Early-2023-12-34ea5c7dcdc1485d9ac3731da4d2a6fc?pvs=4#15ee4e143392461fa64590679c8f54c9

# Solution

We should have a vectored aka batched aka scatter-gather style alternative API for `Timeline::get`. Having such an API unlocks:

* more efficient basebackup
* batched IO during compaction (useful for strides of unchanged pages)
* page_service: expose vectored get_page_at_lsn for compute (=> good for seqscan / prefetch)
  * even if [on-demand SLRU downloads](https://github.com/neondatabase/neon/pull/6151) land before vectored `Timeline::get`, they will still benefit from this API

# DoD

There is a new variant of `Timeline::get`, called `Timeline::get_vectored`.
It takes as arguments an `lsn: Lsn` and a `src: &[KeyVec]` where `struct KeyVec { base: Key, count: usize }`.

It is up to the implementor to figure out a suitable and efficient way to return the reconstructed page images.
It is sufficient to simply return a `Vec<Bytes>`, but more efficient solutions can likely be found after studying all the callers of `Timeline::get`.

Functionally, the behavior of `Timeline::get_vectored` is equivalent to

```rust
// Conceptual sketch: `Key` is not a plain integer, so `base..base + count`
// stands in for "the `count` keys starting at `base`".
let keys_iter = src
    .iter()
    .flat_map(|KeyVec { base, count }| (base..base + count));
let mut out = Vec::new();
for key in keys_iter {
    let data = self.get(key, lsn)?; // `self` is the `Timeline`
    out.push(data);
}
return out;
```

However, unlike above, an ideal solution will

* Visit each `struct Layer` at most once.
* For each visited layer, call `Layer::get_value_reconstruct_data` at most once.
  * This means: read each `DiskBtree` page at most once.
* Facilitate merging of the reads we issue to the OS and eventually NVMe.

Each of these items above represents a significant amount of work.

## Performance

Ideally, the **base performance** of a vectored get of a single page should be identical to that of the current `Timeline::get`.
A reasonable constant overhead over the current `Timeline::get` is acceptable.

The performance improvement for the vectored use case should be demonstrated in some way, e.g., using the `pagebench` basebackup benchmark against a tenant with a lot of SLRU segments.

# Implementation

High-level set of tasks / changes to be made:

- **Get clarity on API**:
  - Define a naive `Timeline::get_vectored` implementation & adopt it across pageserver.
  - The tricky thing here will be the return type (e.g. `Vec<Bytes>` vs `impl Stream`).
  - Start with something simple to explore the different usages of the API.
    Then iterate with peers until we have something that is good enough.
- **Vectored Layer Map traversal**
  - Vectored `LayerMap::search` (take 1 LSN and N `Key`s instead of just 1 LSN and 1 `Key`)
  - Refactor `Timeline::get_reconstruct_data` to hold & return state for N `Key`s instead of 1
    - The slightly tricky part here is what to do about `cont_lsn` [after we've found some reconstruct data for some keys](https://github.com/neondatabase/neon/blob/d066dad84b076daf3781cdf9a692098889d3974e/pageserver/src/tenant/timeline.rs#L2378-L2385)
      but still need more.
      Likely we'll need to keep track of `cont_lsn` per key and continue the next iteration at the `max(cont_lsn)` of all keys that still need data (see the per-key state sketch after this list).
- **Vectored `Layer::get_value_reconstruct_data` / `DiskBtree`**
  - Current code calls it [here](https://github.com/neondatabase/neon/blob/d066dad84b076daf3781cdf9a692098889d3974e/pageserver/src/tenant/timeline.rs#L2378-L2384).
  - Delta layers use `DiskBtreeReader::visit()` to collect the `(offset,len)` pairs for delta record blobs to load.
  - Image layers use `DiskBtreeReader::get` to get the offset of the image blob to load. Underneath, that's just a `::visit()` call.
  - What needs to happen to `DiskBtree::visit()`?
    * Minimally:
      * take a single `KeyVec` instead of a single `Key` as argument, i.e., take a single contiguous key range to visit.
      * Change the visit code to invoke the callback for all values in the `KeyVec`'s key range.
      * This should be good enough for what we've seen when investigating basebackup slowness, because there, the key ranges are contiguous (see the range-visit sketch after this list).
    * Ideally:
      * Take a `&[KeyVec]`, sort it;
      * during Btree traversal, peek at the next `KeyVec` range to determine whether we need to descend or back out.
      * NB: this should be a straightforward extension of the minimal solution above, as we'll already be checking for "is there more key range in the requested `KeyVec`".
- **Facilitate merging of the reads we issue to the OS and eventually NVMe.**
  - The `DiskBtree::visit` produces a set of offsets, which we then read from a `VirtualFile` [here](https://github.com/neondatabase/neon/blob/292281c9dfb24152b728b1a846cc45105dac7fe0/pageserver/src/tenant/storage_layer/delta_layer.rs#L772-L804).
  - [Delta layer reads](https://github.com/neondatabase/neon/blob/292281c9dfb24152b728b1a846cc45105dac7fe0/pageserver/src/tenant/storage_layer/delta_layer.rs#L772-L804)
    - We hit (and rely on) the `PageCache` and `VirtualFile` here (not great under pressure).
  - [Image layer reads](https://github.com/neondatabase/neon/blob/292281c9dfb24152b728b1a846cc45105dac7fe0/pageserver/src/tenant/storage_layer/image_layer.rs#L429-L435)
  - What needs to happen is the **vectorization of the `blob_io` interface and then the `VirtualFile` API** (see the read-merging sketch after this list).
  - That is tricky because
    - the `VirtualFile` API, which sits underneath `blob_io`, is being touched by ongoing [io_uring work](https://github.com/neondatabase/neon/pull/5824)
    - there's the question of how IO buffers will be managed; currently this area relies heavily on `PageCache`, but there's controversy around the future of `PageCache`.
      - The guiding principle here should be to avoid coupling this work to the `PageCache`.
      - I.e., treat `PageCache` as an extra hop in the I/O chain, rather than as an integral part of buffer management.
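
The sketches below illustrate three of the items above; all names in them are hypothetical stand-ins, not proposed code. First, the per-key `cont_lsn` bookkeeping for the vectored layer map traversal: a minimal sketch (with integer stand-ins for `Key`/`Lsn` and made-up `KeyState` / `VectoredTraversalState` types) of resuming the next traversal iteration at the maximum `cont_lsn` over all keys that still need reconstruct data.

```rust
use std::collections::BTreeMap;

/// Integer stand-ins for pageserver's `Key` and `Lsn` types (illustration only).
type Key = u64;
type Lsn = u64;

/// Hypothetical per-key traversal state: where the next layer map search for
/// this key has to continue, and whether we already have enough reconstruct data.
#[derive(Clone, Copy)]
struct KeyState {
    cont_lsn: Lsn,
    complete: bool,
}

/// Hypothetical state for one vectored `get_reconstruct_data` traversal.
struct VectoredTraversalState {
    keys: BTreeMap<Key, KeyState>,
}

impl VectoredTraversalState {
    fn new(keys: impl IntoIterator<Item = Key>, request_lsn: Lsn) -> Self {
        let keys = keys
            .into_iter()
            .map(|k| (k, KeyState { cont_lsn: request_lsn, complete: false }))
            .collect();
        Self { keys }
    }

    /// Where the next layer map search iteration should resume:
    /// the maximum `cont_lsn` over all keys that still need data.
    fn next_cont_lsn(&self) -> Option<Lsn> {
        self.keys
            .values()
            .filter(|s| !s.complete)
            .map(|s| s.cont_lsn)
            .max()
    }
}

fn main() {
    let mut state = VectoredTraversalState::new([10, 11, 12], 1000);
    // Pretend key 11 was fully reconstructed from an image layer, while keys
    // 10 and 12 still need older data below LSNs 900 and 700, respectively.
    state.keys.get_mut(&11).unwrap().complete = true;
    state.keys.get_mut(&10).unwrap().cont_lsn = 900;
    state.keys.get_mut(&12).unwrap().cont_lsn = 700;
    // The next iteration resumes at max(cont_lsn) of the incomplete keys.
    assert_eq!(state.next_cont_lsn(), Some(900));
}
```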
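
Second, the "minimal" `DiskBtree::visit()` change: visit a contiguous key range instead of a single key. The sketch uses a sorted in-memory slice as a stand-in for B-tree leaf entries; the point is the stopping condition, i.e., invoke the callback for every entry inside the `KeyVec`'s range and back out once we are past its end.

```rust
/// Integer stand-in for pageserver's `Key` type (illustration only).
type Key = u64;

/// Mirrors the RFC's `struct KeyVec { base: Key, count: usize }`.
struct KeyVec {
    base: Key,
    count: usize,
}

/// Visit all `(key, value_offset)` entries whose key falls into `range`,
/// assuming `entries` is sorted by key (as B-tree leaf entries are).
/// Returns the number of entries visited.
fn visit_range(
    entries: &[(Key, u64)],
    range: &KeyVec,
    mut callback: impl FnMut(Key, u64),
) -> usize {
    let end = range.base + range.count as u64;
    // A real B-tree descends to the first leaf slot with key >= range.base;
    // on the sorted-slice stand-in, partition_point finds that slot.
    let start_idx = entries.partition_point(|&(k, _)| k < range.base);
    let mut visited = 0;
    for &(key, offset) in &entries[start_idx..] {
        if key >= end {
            // Past the requested range: back out instead of scanning further.
            break;
        }
        callback(key, offset);
        visited += 1;
    }
    visited
}

fn main() {
    let entries = [(3, 100), (5, 200), (6, 300), (9, 400)];
    let kv = KeyVec { base: 4, count: 4 }; // the key range [4, 8)
    let mut hits = Vec::new();
    let visited = visit_range(&entries, &kv, |key, offset| hits.push((key, offset)));
    assert_eq!(visited, 2);
    assert_eq!(hits, vec![(5, 200), (6, 300)]);
}
```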
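
Third, the read-merging idea, kept independent of how the `blob_io` / `VirtualFile` vectorization and buffer management questions get resolved: given the sorted `(offset, len)` pairs produced by the index lookups, coalesce adjacent (or nearly adjacent) blobs into fewer, larger reads before issuing them to the OS. The names and the `max_gap` heuristic are illustrative only.

```rust
/// A physical read we would issue to the layer file: `len` bytes at `offset`.
#[derive(Debug, PartialEq)]
struct Read {
    offset: u64,
    len: u64,
}

/// Coalesce `(offset, len)` blob locations into larger reads.
/// Blobs whose gap to the current read is at most `max_gap` bytes are merged;
/// a small amount of over-read is usually cheaper than an extra IOP.
fn merge_reads(mut blobs: Vec<(u64, u64)>, max_gap: u64) -> Vec<Read> {
    blobs.sort_unstable_by_key(|&(offset, _)| offset);
    let mut reads: Vec<Read> = Vec::new();
    for (offset, len) in blobs {
        match reads.last_mut() {
            Some(last) if offset <= last.offset + last.len + max_gap => {
                // Extend the current read to also cover this blob.
                last.len = last.len.max(offset + len - last.offset);
            }
            _ => reads.push(Read { offset, len }),
        }
    }
    reads
}

fn main() {
    // Two adjacent blobs and one blob far away in the file.
    let blobs = vec![(0, 4096), (4096, 512), (1 << 20, 8192)];
    let reads = merge_reads(blobs, 0);
    assert_eq!(
        reads,
        vec![
            Read { offset: 0, len: 4608 },
            Read { offset: 1 << 20, len: 8192 },
        ]
    );
}
```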

Let's see how much we can improve by doing the first three items in the above list first, and then revisit.

## Rollout / Feature Flags

No feature flags are required for this epic.

At the end of this epic, `Timeline::get` forwards to `Timeline::get_vectored`, i.e., it's an all-or-nothing type of change.

It is encouraged to deliver this feature incrementally, i.e., do many small PRs over multiple weeks.
That will help isolate performance regressions across weekly releases.

# Interaction With Sharding

[Sharding](https://github.com/neondatabase/neon/pull/5432) splits up the key space; see functions `is_key_local` / `key_to_shard_number`.

Just as with `Timeline::get`, callers of `Timeline::get_vectored` are responsible for ensuring that they only ask for blocks of the given `struct Timeline`'s shard.

Given that this is already the case, there shouldn't be significant interaction/interference with sharding.

However, let's have a safety check for this constraint (error or assertion) because there are currently few affordances at the higher layers of Pageserver for sharding<=>keyspace interaction.
For example, `KeySpace` is not broken up by shard stripe, so if someone naively converted the compaction code to issue a vectored get for a keyspace range, it would violate this constraint.
A sketch of such a check follows below.
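
A minimal sketch of such a check, with made-up stand-in types (the real check would use the `is_key_local` / `key_to_shard_number` machinery from the sharding work; the modulo mapping below is purely illustrative): validate that every requested key is local to this shard before any layer map traversal happens, and fail loudly otherwise.

```rust
/// Integer stand-in for pageserver's `Key` type (illustration only).
type Key = u64;

/// Stand-in for the shard identity; the real mapping is the stripe-based
/// `key_to_shard_number`, not this modulo toy.
struct ShardIdentity {
    shard_number: u64,
    shard_count: u64,
}

impl ShardIdentity {
    /// Stand-in for `is_key_local`: does this shard own the given key?
    fn is_key_local(&self, key: Key) -> bool {
        self.shard_count <= 1 || key % self.shard_count == self.shard_number
    }
}

/// Hypothetical error type for the vectored get path.
#[derive(Debug)]
struct NonLocalKeyError {
    key: Key,
}

/// The safety check: before any layer map traversal, reject requests that
/// include keys this `Timeline`'s shard does not own.
fn check_keys_are_local(
    shard: &ShardIdentity,
    keys: impl IntoIterator<Item = Key>,
) -> Result<(), NonLocalKeyError> {
    for key in keys {
        if !shard.is_key_local(key) {
            // Could also be a debug assertion; an error keeps production safe.
            return Err(NonLocalKeyError { key });
        }
    }
    Ok(())
}

fn main() {
    let shard = ShardIdentity { shard_number: 1, shard_count: 4 };
    assert!(check_keys_are_local(&shard, [1, 5, 9]).is_ok());
    // Key 2 maps to shard 2, so a vectored get for it on shard 1 must fail.
    assert!(check_keys_are_local(&shard, [1, 2]).is_err());
}
```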