Commit 3c3ad93

docs: Improve pruning doc based on Adam's review comments
1 parent 6504d97 commit 3c3ad93

1 file changed

+32
-15
lines changed

docs/implementation/pruning.md

Lines changed: 32 additions & 15 deletions
@@ -1,12 +1,15 @@
11
## Pruning deployments
22

3-
Pruning is an operation that deletes data from a deployment that is only
4-
needed to respond to queries at block heights before a certain block. In
5-
GraphQL, those are only queries with a constraint `block { number: <n> } }`
6-
or a similar constraint by block hash where `n` is before the block to
7-
which the deployment is pruned. Queries that are run at a block height
8-
greater than that are not affected by pruning, and there is no difference
9-
between running these queries against an unpruned and a pruned deployment.
3+
Subgraphs, by default, store a full version history for entities, allowing
4+
consumers to query the subgraph as of any historical block. Pruning is an
5+
operation that deletes entity versions from a deployment older than a
6+
certain block, so it is no longer possible to query the deployment as of
7+
prior blocks. In GraphQL, those are only queries with a constraint `block: {
8+
number: <n> }` or a similar constraint by block hash where `n` is before
9+
the block to which the deployment is pruned. Queries that are run at a
10+
block height greater than that are not affected by pruning, and there is no
11+
difference between running these queries against an unpruned and a pruned
12+
deployment.
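
To make the distinction concrete, here is a minimal sketch of which time-travel queries stay answerable after pruning. The entity name `tokens`, both helper names, and the choice to treat the block the deployment is pruned to as still queryable are assumptions made for illustration; none of them come from graph-node.

```python
def block_constraint_query(entity: str, block_number: int) -> str:
    """Build a GraphQL query that reads `entity` as of a given block number."""
    return f"{{ {entity}(block: {{ number: {block_number} }}) {{ id }} }}"


def is_answerable(block_number: int, earliest_block: int) -> bool:
    """Return True if a query at `block_number` is unaffected by pruning.

    `earliest_block` is the block the deployment has been pruned to.
    Assumption: a query exactly at that block is still answerable.
    """
    return block_number >= earliest_block


if __name__ == "__main__":
    earliest = 1_000  # hypothetical block the deployment was pruned to
    for n in (500, 1_000, 2_000):
        status = "ok" if is_answerable(n, earliest) else "pruned"
        print(block_constraint_query("tokens", n), "->", status)
```

Queries without any block constraint run at the latest block and so always fall into the unaffected case.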
1013

1114
Because pruning reduces the amount of data in a deployment, it reduces the
1215
amount of storage needed for that deployment, and is beneficial for both
@@ -54,14 +57,28 @@ existing tables into new tables and then replaces the existing tables with
5457
these much smaller tables. Which strategy to use is determined for each
5558
table individually, and governed by the settings for
5659
`GRAPH_STORE_HISTORY_REBUILD_THRESHOLD` and
57-
`GRAPH_STORE_HISTORY_DELETE_THRESHOLD`: if we estimate that we will remove
58-
more than `REBUILD_THRESHOLD` of the table, the table will be rebuilt. If
59-
we estimate that we will remove a fraction between `REBUILD_THRESHOLD` and
60-
`DELETE_THRESHOLD` of the table, unneeded entity versions will be
61-
deleted. If we estimate to remove less than `DELETE_THRESHOLD`, the table
62-
is not changed at all. With both strategies, operations are broken into
63-
batches that should each take `GRAPH_STORE_BATCH_TARGET_DURATION` seconds
64-
to avoid causing very long-running transactions.
60+
`GRAPH_STORE_HISTORY_DELETE_THRESHOLD`, both numbers between 0 and 1: if we
61+
estimate that we will remove more than `REBUILD_THRESHOLD` of the table,
62+
the table will be rebuilt. If we estimate that we will remove a fraction
63+
between `REBUILD_THRESHOLD` and `DELETE_THRESHOLD` of the table, unneeded
64+
entity versions will be deleted. If we estimate that we will remove less than
65+
`DELETE_THRESHOLD`, the table is not changed at all. With both strategies,
66+
operations are broken into batches that should each take
67+
`GRAPH_STORE_BATCH_TARGET_DURATION` seconds to avoid causing very
68+
long-running transactions.
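
The per-table decision described above can be sketched as a small function. This is an illustration of the rule as stated in the text, not graph-node's implementation: the names `Strategy` and `choose_strategy` are hypothetical, the threshold values in the usage lines are arbitrary examples, and treating a fraction exactly equal to a threshold as belonging to the lower band is an assumption.

```python
from enum import Enum


class Strategy(Enum):
    REBUILD = "rebuild"  # copy needed rows into a new table and swap it in
    DELETE = "delete"    # delete unneeded entity versions in place
    SKIP = "skip"        # leave the table unchanged


def choose_strategy(
    removal_fraction: float,
    rebuild_threshold: float,
    delete_threshold: float,
) -> Strategy:
    """Pick a pruning strategy for one table.

    `removal_fraction` is the estimated share of the table (0 to 1) that
    pruning would remove. Both thresholds are numbers between 0 and 1.
    """
    if not 0.0 <= delete_threshold <= rebuild_threshold <= 1.0:
        raise ValueError("expected 0 <= delete_threshold <= rebuild_threshold <= 1")
    if removal_fraction > rebuild_threshold:
        return Strategy.REBUILD
    if removal_fraction > delete_threshold:
        return Strategy.DELETE
    return Strategy.SKIP


if __name__ == "__main__":
    # Example thresholds chosen only for illustration.
    for fraction in (0.9, 0.2, 0.01):
        print(fraction, choose_strategy(fraction, 0.5, 0.05).value)
```

Because the choice is made for each table individually, a single prune run may rebuild some tables, delete from others, and skip the rest.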
69+
70+
Pruning, in most cases, runs in parallel with indexing and does not block
71+
it. When the rebuild strategy is used, pruning does block indexing while it
72+
copies non-final entities from the existing table to the new table.
73+
74+
The initial prune started by `graphman prune` prints a progress report on
75+
the console. For the ongoing prune runs that are periodically performed,
76+
the following information is logged: a message `Start pruning historical
77+
entities` which includes the earliest and latest block, a message `Analyzed
78+
N tables`, and a message `Finished pruning entities` with details about how
79+
much was deleted or copied and how long that took. Pruning analyzes tables,
80+
if that seems necessary, because its estimates of how much of a table is
81+
likely not needed are based on Postgres statistics.
6582

6683
### Caveats
6784
