Generational compaction #5583

Draft: wants to merge 10 commits into main

23 changes: 21 additions & 2 deletions src/chttpd/src/chttpd_db.erl
@@ -294,7 +294,8 @@ handle_compact_req(#httpd{method = 'POST'} = Req, Db) ->
chttpd:validate_ctype(Req, "application/json"),
case Req#httpd.path_parts of
[_DbName, <<"_compact">>] ->
ok = fabric:compact(Db),
SrcGen = list_to_integer(chttpd:qs_value(Req, "gen", "0")),
ok = fabric:compact({Db, SrcGen}),
send_json(Req, 202, {[{ok, true}]});
[DbName, <<"_compact">>, DesignName | _] ->
case ddoc_cache:open(DbName, <<"_design/", DesignName/binary>>) of
@@ -427,9 +428,10 @@ handle_design_info_req(Req, _Db, _DDoc) ->
create_db_req(#httpd{} = Req, DbName) ->
couch_httpd:verify_is_server_admin(Req),
ShardsOpt = parse_shards_opt(Req),
MaxGenOpt = parse_generations_opt(Req),
EngineOpt = parse_engine_opt(Req),
DbProps = parse_partitioned_opt(Req),
Options = lists:append([ShardsOpt, [{props, DbProps}], EngineOpt]),
Options = lists:append([ShardsOpt, MaxGenOpt, [{props, DbProps}], EngineOpt]),
DocUrl = absolute_uri(Req, "/" ++ couch_util:url_encode(DbName)),
case fabric:create_db(DbName, Options) of
ok ->
@@ -853,6 +855,19 @@ db_req(#httpd{method = 'GET', path_parts = [_, <<"_revs_limit">>]} = Req, Db) ->
send_json(Req, fabric:get_revs_limit(Db));
db_req(#httpd{path_parts = [_, <<"_revs_limit">>]} = Req, _Db) ->
send_method_not_allowed(Req, "PUT,GET");
db_req(#httpd{method = 'PUT', path_parts = [_, <<"_max_generation">>]} = Req, Db) ->
Options = [{user_ctx, Req#httpd.user_ctx}],
case chttpd:json_body(Req) of
MaxGen when is_integer(MaxGen), MaxGen > 0 ->
case fabric:set_max_generation(Db, MaxGen, Options) of
ok ->
send_json(Req, {[{<<"ok">>, true}]});
Error ->
throw(Error)
end;
_ ->
throw({bad_request, "`max_generation` must be a positive integer"})
end;
db_req(#httpd{method = 'PUT', path_parts = [_, <<"_purged_infos_limit">>]} = Req, Db) ->
Options = [{user_ctx, Req#httpd.user_ctx}],
case chttpd:json_body(Req) of
@@ -1983,6 +1998,10 @@ parse_shards_opt(Param, Req, Default) ->
false -> throw({bad_request, Err})
end.

parse_generations_opt(Req) ->
Val = chttpd:qs_value(Req, "gen", "0"),
[{max_generation, list_to_integer(Val)}].

parse_engine_opt(Req) ->
case chttpd:qs_value(Req, "engine") of
undefined ->
241 changes: 241 additions & 0 deletions src/couch/doc/generational-compaction/crash-safety.md
@@ -0,0 +1,241 @@
# Crash safety

In CouchDB 3.4 the process for finishing compaction in `finish_compaction` goes like this:

0. state during compaction:

db.couch
db.couch.compact.data
db.couch.compact.meta

1. rename the data file:

db.couch
db.couch.compact
db.couch.compact.meta

2. delete the original DB file:

db.couch.compact
db.couch.compact.meta

3. rename the .compact file:

db.couch
db.couch.compact.meta

4. delete the meta file:

db.couch

Crash safety: if a crash occurs after step 2, then on restart the
`db.couch.compact` file is taken to be the canonical file, renamed to
`db.couch`, and we continue as normal.
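
Expressed as file operations, this finish sequence is just renames and deletes.
Below is a minimal sketch in Erlang: a hypothetical helper for illustration,
not the actual `couch_db_updater` code, assuming `DbFile` is the path to
`db.couch`.

    %% Hypothetical sketch of the 3.4 finish sequence. Each rename and
    %% delete is atomic, so a crash always lands between numbered steps.
    finish_compaction_sketch(DbFile) ->
        ok = file:rename(DbFile ++ ".compact.data", DbFile ++ ".compact"), % step 1
        ok = file:delete(DbFile),                                          % step 2
        ok = file:rename(DbFile ++ ".compact", DbFile),                    % step 3
        ok = file:delete(DbFile ++ ".compact.meta").                       % step 4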

Question: why can't we get the same behaviour by directly renaming
`db.couch.compact.data` to `db.couch`, producing the same state as after step
3? A rename within the same directory is atomic, and we should have stopped
writing to the original `db.couch` file in order to perform this switch-over.

Answer: this might not hold on Windows, where you cannot remove a file that is
in use: we have to close `db.couch`, remove it, and then rename the compacted
file into its place. The rename of `.compact.data` to `.compact` signals a
state where compaction has completed and we were in the middle of swapping the
files over, so nothing new has been written to `db.couch` and we can resume by
using `db.couch.compact`. This is not necessarily the case if
`db.couch.compact.data` still exists.

After `open_db_file` returns, the code in `init` that reads/creates the DB
header also deletes all the compaction files (`.compact`, `.compact.data`,
`.compact.meta`) if the header was newly created, rather than read from existing
file data.
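
The recovery half of the 3.4 scheme can be sketched in the same style; again a
hypothetical helper rather than the real `open_db_file`:

    %% Hypothetical sketch of recovery on open: if db.couch is missing but
    %% db.couch.compact exists, the crash happened between steps 2 and 3,
    %% so the .compact file is canonical and is renamed into place.
    open_db_file_sketch(DbFile) ->
        Compact = DbFile ++ ".compact",
        case {filelib:is_file(DbFile), filelib:is_file(Compact)} of
            {false, true} -> ok = file:rename(Compact, DbFile);
            _ -> ok
        end,
        file:open(DbFile, [read, write, raw, binary]).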

For generational compaction, there are several cases to consider.


## A. Compacting gen 0

In this case we're appending to `db.1.couch`; references to new data written
there end up in `db.couch.compact.data`, along with references to _existing_
data that was already in `db.{1,2,...}.couch`. The following files exist during
compaction:

db.couch
db.1.couch
db.couch.compact.data
db.couch.compact.meta

On completion, none of the generational files will be removed. Therefore all
pointers in `db.couch` and `db.couch.compact.data` remain valid and we are free
to use either file as our canonical DB file. The original cleanup procedure can
be used without modification.


## B. Compacting gen 1, 2, ...

One generation up from 0, we have data being moved from gen 1 to 2, or in
general from G to G+1. Files existing during compaction are:

db.couch
db.1.couch
db.2.couch
db.couch.compact.data
db.couch.compact.meta

When compacting gen 1, if `db.couch` points to gen 1 then
`db.couch.compact.data` will refer to gen 2. Also, data from `db.couch` has to
be moved to `db.couch.compact.data` in order to avoid being lost on completion.
All pointers to gen 2 and above remain unmodified.

On completion, `db.1.couch` will be removed on the grounds that all its data has
been moved to `db.2.couch` and nothing new has been written to `db.1.couch`.
This is because the compactor is the only thing that writes to `db.G.couch` and
only one compaction per shard runs at a time.

However, it is only safe to remove `db.1.couch` when nothing refers to it any
more, i.e. after `db.couch.compact.data` has been renamed to
`db.couch.compact`. Once this has happened, any future DB access will use
either `db.couch.compact` or its contents once moved to `db.couch`, never the
original `db.couch`, and so the old pointers into `db.1.couch` have expired.

At any point before `db.couch.compact.data` is renamed, the data in `db.1.couch`
is still being referenced, and so it cannot be removed.

Therefore we can extend the cleanup process to the following (sketched in code after the list):

1. Rename `db.couch.compact.data` to `db.couch.compact`
2. Delete `db.1.couch`
3. Delete `db.couch`
4. Rename `db.couch.compact` to `db.couch`
5. Delete `db.couch.compact.meta`
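
In the same style (hypothetical helpers; `Gen` is the generation being
emptied, here 1):

    %% Hypothetical sketch of cleanup after compacting generation Gen.
    %% The new step is deleting db.Gen.couch, which must happen after
    %% the rename has made db.couch.compact the canonical file.
    finish_gen_compaction_sketch(DbFile, Gen) ->
        ok = file:rename(DbFile ++ ".compact.data", DbFile ++ ".compact"), % 1
        ok = file:delete(gen_file(DbFile, Gen)),                           % 2
        ok = file:delete(DbFile),                                          % 3
        ok = file:rename(DbFile ++ ".compact", DbFile),                    % 4
        ok = file:delete(DbFile ++ ".compact.meta").                       % 5

    %% "db.couch" plus generation G gives "db.G.couch".
    gen_file(DbFile, G) ->
        filename:rootname(DbFile, ".couch") ++ "." ++ integer_to_list(G) ++ ".couch".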

The reason for putting the deletion of `db.1.couch` as early as possible is to
reduce the set of crash scenarios where this file remains in place. If
`db.1.couch` remains after a crash, this is not _unsafe_ (i.e. it does not cause
a consistency problem or data loss) but it does leave a pile of unreferenced
data that needs to be cleaned up. The simplest way to achieve this would be to
re-run compaction of gen 1, which could be triggered by noticing the file has an
active size of 0. Possibly Smoosh could prioritise the generation with the
largest proportion of garbage when deciding what to compact next.


## C. Compacting the last generation

Say the DB has a maximum generation of 2. This means that normally, the existing
files are:

db.couch
db.1.couch
db.2.couch

During compaction of gen 2, a temporary additional generation is created along
with the usual compaction files:

db.couch
db.1.couch
db.2.couch
db.2.couch.compact.maxgen
db.couch.compact.data
db.couch.compact.meta

Live data is copied from `db.2.couch` to `db.2.couch.compact.maxgen`, but the
pointers stored in `db.couch.compact.data` refer to gen 2, with the intention
that `db.2.couch.compact.maxgen` will eventually be renamed to `db.2.couch`
rather than letting the generation number grow indefinitely.

This requires a further change to the cleanup process:

1. Rename `db.couch.compact.data` to `db.couch.compact`
2. Delete `db.2.couch`
3. Rename `db.2.couch.compact.maxgen` to `db.2.couch`
4. Delete `db.couch`
5. Rename `db.couch.compact` to `db.couch`
6. Delete `db.couch.compact.meta`

A crash after steps 1, 2, or 3 produces one of these states:

    Step 1                    Step 2                    Step 3
    ------------------------- ------------------------- -------------------------
    db.couch                  db.couch                  db.couch
    db.2.couch (old)                                    db.2.couch (new)
    db.2.couch.compact.maxgen db.2.couch.compact.maxgen
    db.couch.compact          db.couch.compact          db.couch.compact

These states are ambiguous: all of them will cause `db.couch` to be used when
the DB is re-opened, but it is not clear whether the data in `db.2.couch` is
valid for that file or not. We would need to determine which data to preserve
based on the presence of all the other files, which is complicated. In state 1
you can continue using the old `db.2.couch`, but in the other states you need
to decide to open `db.couch.compact` instead and possibly clean up the
`db.2.couch.compact.maxgen` file. This suggests removing `db.couch` earlier in
the process, giving the revised sequence below (a code sketch follows the
list):

1. Rename `db.couch.compact.data` to `db.couch.compact`
2. Delete `db.couch`
3. Delete `db.2.couch`
4. Rename `db.2.couch.compact.maxgen` to `db.2.couch`
5. Rename `db.couch.compact` to `db.couch`
6. Delete `db.couch.compact.meta`
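
A sketch of this revised sequence (hypothetical helper, reusing `gen_file/2`
from the earlier sketch; `M` is the maximum generation, here 2):

    %% Hypothetical sketch of cleanup for a max-generation compaction.
    %% db.couch is deleted as early as possible so that after a crash its
    %% absence unambiguously selects the db.couch.compact recovery branch.
    finish_maxgen_compaction_sketch(DbFile, M) ->
        MFile = gen_file(DbFile, M),
        ok = file:rename(DbFile ++ ".compact.data", DbFile ++ ".compact"), % 1
        ok = file:delete(DbFile),                                          % 2
        ok = file:delete(MFile),                                           % 3
        ok = file:rename(MFile ++ ".compact.maxgen", MFile),               % 4
        ok = file:rename(DbFile ++ ".compact", DbFile),                    % 5
        ok = file:delete(DbFile ++ ".compact.meta").                       % 6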

Having done that, we need to extend the recovery path where we fail to find
`db.couch` and check for `db.couch.compact`. The possible crash states of the
above process are:

    Step 1                    Step 2                    Step 3                    Step 4
    ------------------------- ------------------------- ------------------------- -------------------------
    db.couch
    db.2.couch (old)          db.2.couch (old)                                    db.2.couch (new)
    db.2.couch.compact.maxgen db.2.couch.compact.maxgen db.2.couch.compact.maxgen
    db.couch.compact          db.couch.compact          db.couch.compact          db.couch.compact

In the first state, we would open `db.couch` on restart, and this refers to data
in the _old_ `db.2.couch`. We can continue to use this and either leave
`db.2.couch.compact.maxgen` in place for some future compaction of gen 2, or we
can delete it as we're not using any data in it.

In all the other states we will use `db.couch.compact` and therefore need to
complete the process of moving `db.2.couch.compact.maxgen` to `db.2.couch`
before using it. If `db.2.couch` exists, we remove it, and then we rename
`db.2.couch.compact.maxgen` to `db.2.couch`. This process is safe if we crash
after removing the old `db.2.couch`.

We could also try to resolve these two questions independently:

1. What to do if both `db.couch` and `db.couch.compact` exist
2. What to do if `db.G.couch` where `G > max_generation` exists

But as we've seen, the cleanup operations for these two questions create states
where the mutual ordering of their operations matters, and it would be wise to
minimise the set of such states. We therefore suggest the following recovery
routine (sketched in code after the list):

1. Attempt to open `db.couch`. If this succeeds, remove any generation files
above `max_generation`. Otherwise...

2. If `db.M+1.couch` exists, where `M = max_generation`, remove `db.M.couch`
   and then rename `db.M+1.couch` to `db.M.couch`.

3. Rename `db.couch.compact` to `db.couch`

4. Open `db.couch`
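
A sketch of this routine in the same style (hypothetical helpers, reusing
`gen_file/2`; `M` is the database's `max_generation`; `filelib:is_file/1`
stands in for "attempt to open"):

    %% Hypothetical sketch of the suggested recovery routine.
    recover_sketch(DbFile, M) ->
        case filelib:is_file(DbFile) of
            true ->
                %% 1. db.couch survived: remove generations above M
                %% (only M+1 shown here).
                delete_if_exists(gen_file(DbFile, M + 1));
            false ->
                %% 2. Finish promoting db.M+1.couch to db.M.couch...
                case filelib:is_file(gen_file(DbFile, M + 1)) of
                    true ->
                        delete_if_exists(gen_file(DbFile, M)),
                        ok = file:rename(gen_file(DbFile, M + 1),
                                         gen_file(DbFile, M));
                    false ->
                        ok
                end,
                %% 3. ...then promote the .compact file.
                ok = file:rename(DbFile ++ ".compact", DbFile)
        end,
        %% 4. Open db.couch as usual.
        file:open(DbFile, [read, write, raw, binary]).

    delete_if_exists(File) ->
        case file:delete(File) of
            ok -> ok;
            {error, enoent} -> ok
        end.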

This process works correctly if the generation files are cleaned up _before_ the
rename of `db.couch.compact` to `db.couch` in `finish_compaction`. Otherwise the
`.compact` file that indicates the partial completion of this process may not
exist following a crash and this makes it harder to tell which generation files
are valid on recovery.

This also highlights that letting users modify `max_generation` for a DB is not
safe while compaction is happening, because it may lead to confusion during
recovery that could cause data loss (e.g. mistaken deletion of
`db.2.couch.compact.maxgen`).

If we allow post-creation changes to `max_generation`, then:

- Increasing it is "free"; all existing data remains valid but it simply becomes
_possible_ for future compactions to create higher generations.

- Decreasing it requires a "reversed" compaction to move data from higher
  generations to lower ones, followed by deletion of the emptied generation(s).
  This change should therefore be thought of as requesting a compaction with
  special properties.