Generational compaction #5583
Draft: jcoglan wants to merge 10 commits into apache:main from neighbourhoodie:feat/generational-compaction
To support a generational storage model, the `#st` struct needs to have multiple file handles open. Whereas we currently back a shard with a single file, `db.suffix.couch`, the generational model augments this with a set of "generation" files named `db.1.suffix.couch`, `db.2.suffix.couch`, etc. The original `db.suffix.couch` file is henceforth referred to as "gen-0".

Each of these file handles needs to be monitored by the incref/decref functions, so we replace the `fd` and `fd_monitor` fields with a pair of `{fd, monitor}` stored in the `fd` field. The new `gen_fds` field stores a list of such pairs, and points at the `db.{1,2,...}.couch` files.

The number of generational files opened is determined by a new field in the DB header named `max_generation`. This defaults to 0 so that all existing databases stay on the current storage model and must opt in to generational storage.

Here we also add a set of functions that the engine and compactor will need for managing generational files (a sketch of the path helper follows this list):

- `generation_file_path()`: returns the path to the Nth generation file; returns the normal `db.suffix.couch` path for gen-0.
- `open_generation_file()`: opens and monitors the Nth generation file.
- `open_generation_files()`: opens and monitors all the files for generations 1 to N.
- `maybe_open_generation_files()`: opens and monitors all the generation files unless the `compacting` option is set; the compactor does not need to re-open the generation files as it shares the existing handles with the engine (i.e. we don't open multiple handles to the same file).
- `open_additional_generation_file()`: when compacting the highest generation, we open an extra temporary file for its live data to be moved into; if `max_generation` = M then this causes `gen_fds` to contain M+1 file handles.
- `reopen_generation_file()`: once the file `db.N.couch` has been compacted into `db.N+1.couch`, this function removes and reopens the existing `db.N.couch` file so that it becomes empty.
- `delete_generational_files()`: when deleting the database, this removes all the generational files.
- `get_fd()`: returns the file handle for the Nth generation, or the original gen-0 `db.suffix.couch` file.
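As an illustration of the naming scheme, here is a minimal sketch of what `generation_file_path()` might do; the body is hypothetical and assumes the generation number is spliced in after the database name, as in `db.1.suffix.couch`:

```erlang
%% Hypothetical sketch: map a gen-0 path and a generation number to
%% the generation file's path, e.g. "db.suffix.couch" and 1 to
%% "db.1.suffix.couch".
generation_file_path(FilePath, 0) ->
    %% Gen-0 is the original db.suffix.couch file itself.
    FilePath;
generation_file_path(FilePath, Gen) when is_integer(Gen), Gen > 0 ->
    Dir = filename:dirname(FilePath),
    [DbName, Rest] = string:split(filename:basename(FilePath), "."),
    filename:join(Dir, lists:flatten(io_lib:format("~s.~B.~s", [DbName, Gen, Rest]))).
```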
In the generational storage model, all new docs/revs continue to be written to "gen-0", the `db.suffix.couch` file. On compaction, live data is "promoted" to the next generation; data in `db.couch` is moved to `db.1.couch`, data in `db.1.couch` to `db.2.couch`, etc. Therefore, doc body and attachment pointers need to include a representation of which file they reside in. This is accomplished by storing a pair of `{Gen, Ptr}` instead of just `Ptr` when a body/attachment is written to generation 1 or above.

When writing to gen-0, we continue to store just the pointer, rather than wrapping it in `{0, Ptr}`. This means we continue to write backwards-compatible data for databases that have not opted in to generational storage, and it ensures we can continue to read existing data, as pointers stored in gen-0 look the same as they always have.
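A minimal sketch of this pointer convention (the helper names here are hypothetical, not the ones in the patch):

```erlang
%% Tag a body/attachment pointer with its generation. Gen-0 pointers
%% stay bare so non-generational databases keep the existing format.
tag_ptr(0, Ptr) -> Ptr;
tag_ptr(Gen, Ptr) when is_integer(Gen), Gen > 0 -> {Gen, Ptr}.

%% Normalise a stored pointer to a {Gen, Ptr} pair; a bare pointer
%% is, by the convention above, gen-0 data.
untag_ptr({Gen, Ptr}) when is_integer(Gen), Gen > 0 -> {Gen, Ptr};
untag_ptr(Ptr) -> {0, Ptr}.
```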
This commit implements the generational compaction scheme wherein live data is "promoted" to a higher generation by the compactor. Each compaction run targets a specific generation N, from 0 up to the database's maximum generation M. If a database has gen-0 file `db.couch`, then the compactor works as follows (a sketch of the promotion rule follows this list):

- The compactor still creates `db.couch.compact.data` and `db.couch.compact.meta` files. If N = M then it also opens the file `db.M.couch.compact.maxgen`, and this file is added to the end of `gen_fds`, creating a temporary generation M+1 file.
- The compactor shares the `gen_fds` file handles with the main DB engine, so that only one file handle exists for these files at a time. Since only the compactor writes to generational files, it may be safe for it to open its own handles, but that is not currently implemented.
- All the *structure* of the database -- the by-id and by-seq trees, purge history, metadata, etc. -- remains in the gen-0 file; that is, the new structure continues to be built in `db.couch.compact.data`. Only *data*, i.e. document bodies and attachments, is ever stored in a higher generation.
- If an attachment is currently stored in gen N, then it is copied into gen N+1. If it resides in a different non-zero generation, it remains where it is. If it resides in gen-0, and N > 0, then it is copied to `db.couch.compact.data`, since the original `db.couch` file will be discarded at the end of compaction.
- Document bodies follow the same rule, with one addition: if they contain any attachment pointers that have been moved by the previous rule, then a new copy of the document must be stored with updated attachment pointers. If the document is currently in gen N, then it is copied to gen N+1 with updated attachments. Otherwise, a fresh copy is written to its current generation -- either a generational file, or `db.couch.compact.data`.
- If N = M = 0, then doc/attachment data is copied from `db.couch` to `db.couch.compact.data`, rather than to `db.1.couch`. This means compaction continues to work as it currently does for existing databases.
- When compaction is complete, `db.couch.compact.data` is moved to `db.couch`. If N > 0 then `db.N.couch` is removed and reopened. Any live data it contained should now reside in `db.N+1.couch`. If N = M, then `db.M.couch.compact.maxgen` is moved to `db.M.couch`, and `gen_fds` reverts to its normal size.
- When N = M, i.e. we are compacting the max generation, the target generation will be the M+1 entry in `gen_fds`, but this file will eventually be moved to `db.M.couch`. Therefore we need to write pointers to this file's data with generation M, even though it is at position M+1 in `gen_fds` when it is being written to.
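The routing these rules describe reduces to picking a target generation from an item's current one. A minimal sketch, with hypothetical names (`SrcGen` is where the item currently lives, `N` the generation being compacted, `M` the database's `max_generation`):

```erlang
%% target_generation(SrcGen, N, M): where data currently stored in
%% SrcGen should be written when compacting generation N of a
%% database whose max_generation is M.
target_generation(0, 0, 0) ->
    %% N = M = 0: classic compaction; everything is rewritten into
    %% db.couch.compact.data, i.e. stays in gen-0.
    0;
target_generation(Gen, Gen, _M) ->
    %% Data in the generation being compacted is promoted. When
    %% Gen = M this is the temporary M+1 slot in gen_fds, whose file
    %% later becomes db.M.couch, so its pointers are written as M.
    Gen + 1;
target_generation(0, N, _M) when N > 0 ->
    %% Gen-0 data is rewritten into db.couch.compact.data, since the
    %% original db.couch is discarded when compaction finishes.
    0;
target_generation(Gen, _N, _M) ->
    %% Data in any other non-zero generation stays where it is.
    Gen.
```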
This adds a parameter named `gen` to the `PUT /db` and `POST /db/_compact` endpoints. On `PUT /db` it sets the `max_generation` of the database being created; on `POST /db/_compact` it selects which generation to compact. The parameter defaults to zero in both endpoints.
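For illustration, reading the parameter on the handler side might look like this sketch (it assumes `chttpd:qs_value/3` as the query-string accessor; the helper name is hypothetical):

```erlang
%% Read the "gen" query parameter, defaulting to 0 as described above.
gen_param(Req) ->
    list_to_integer(chttpd:qs_value(Req, "gen", "0")).
```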
In order for smoosh to trigger compactions of generations above 0, we need to store per-generation size information, rather than just storing the total for all the shard's files. The key changes are (a sketch of the size-merging logic follows this list):

- `#full_doc_info.sizes` can now store a list of `#size_info` records rather than a single record.
- `couch_db_updater:add_sizes()` uses the generation of the leaf pointer to build a list of `#size_info`, one for each generation. If there is only a single generation, then a single `#size_info` is returned, so that we continue to store a single `#size_info` record for non-generational databases and maximise backwards compatibility.
- In `couch_bt_engine`: `get_partition_info()` sums the sizes of each generation to return the total size of the partition shard; `split_sizes()` and `join_sizes()` can work on a list of `#size_info` as well as a single record; and `reduce_sizes()` can merge two lists of `#size_info` records.
- `couch_db_updater:flush_trees()` and `couch_bt_engine_compactor:copy_docs()` fold the attachment sizes into the active and external sizes when the end result is a multi-generation list of sizes.
- `couch_db:get_size_info()` returns a list of `#size_info` records. The first one is calculated for gen-0 as normal, i.e. the active size is obtained by adding all the tree sizes to the size of the stored data. For higher generations, the active size is just the size of the stored data.
- `fabric_db_info:merge_results()` continues to return a single object for the `sizes` of non-generational databases, but returns an array of per-generation size info for generational ones.
- `couch_db_updater:estimate_size()` sums the sizes of all generations to estimate the total size.
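A minimal sketch of the list-merging behaviour described for `reduce_sizes()`, assuming the `#size_info{}` record from `couch_db.hrl` with `active` and `external` fields (the list handling shown is illustrative, not the patch's actual code):

```erlang
-record(size_info, {active = 0, external = 0}).

%% Merge two sizes, each either a single #size_info{} (the classic,
%% non-generational form) or a per-generation list of them.
reduce_sizes(#size_info{} = A, #size_info{} = B) ->
    #size_info{active = A#size_info.active + B#size_info.active,
               external = A#size_info.external + B#size_info.external};
reduce_sizes(As, Bs) ->
    merge_lists(to_list(As), to_list(Bs)).

to_list(#size_info{} = S) -> [S];
to_list(L) when is_list(L) -> L.

%% Zip the per-generation lists pairwise, keeping the tail of the
%% longer list unchanged.
merge_lists([], Bs) -> Bs;
merge_lists(As, []) -> As;
merge_lists([A | As], [B | Bs]) -> [reduce_sizes(A, B) | merge_lists(As, Bs)].
```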
Now that we store per-generation size information, we can make smoosh trigger compaction when any generation passes a channel's thresholds. We achieve this by adjusting the events that smoosh reacts to, so that it considers a specific generation for compaction (see the sketch after this list):

- When the `updated` event occurs, enqueue the affected database at generation 0, since all new data is written to gen-0.
- In `couch_bt_engine:finish_compaction_int()`, we return the compaction's target generation in the result. In `couch_db_engine:finish_compaction()` we use this value to emit a `compacted_into_generation` event. This notifies smoosh that the target generation has gained new data and should be considered for compaction into the generation above it.
- The generation is then fed into `find_channel()` and `get_priority()` so that these functions examine the correct size information when deciding whether to trigger compaction.
- We also include the source generation in the compaction's "key" to identify which generation of a DB is being compacted, so that it resumes correctly after pausing or crashing.
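A sketch of the event plumbing (the event name comes from this commit; the function names and key shape are assumptions):

```erlang
%% Notify smoosh that TargetGen gained data from a finished
%% compaction; couch_event:notify/2 is CouchDB's existing event bus.
notify_compaction_finished(DbName, TargetGen) ->
    couch_event:notify(DbName, {compacted_into_generation, TargetGen}),
    ok.

%% A compaction key carrying the source generation, so a paused or
%% crashed job resumes against the right generation.
compaction_key(DbName, SrcGen) ->
    {DbName, SrcGen}.
```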
(This is a draft we're opening for discussion. The bulk of required information on design background, analysis and implementation is in the commits, including some design docs added to the repo. We will flesh this PR out as the feature gets closer to being ready.)
Overview
This PR implements a "generational" storage model in `couch_bt_engine`, which @janl and I have been working on. Its aim is to improve the performance of compaction on large databases with seldom-changing documents, where every compaction run currently has to copy a mostly-unchanged set of data into the new file.

The generational model splits a shard's data storage into multiple generations, where the usual `db.couch` file is "generation 0". On compaction, live data in this file is promoted into generation 1. The next time generation 0 is compacted, it does not have to copy the same set of data again, as much of it will have been moved to another file.

Further detail on the design and analysis is in design docs we have committed to the repo; see https://github.com/neighbourhoodie/couchdb/blob/feat/generational-compaction/src/couch/doc/generational-compaction. The commit messages give further details about the implementation.
Open questions
Testing recommendations
Related Issues or Pull Requests
Checklist
- `rel/overlay/etc/default.ini`
- `src/docs` folder