Commit 705db27

docs: Explain how pruning works and how it is configured
1 parent fb0aca5 commit 705db27

2 files changed, 83 additions and 0 deletions

docs/implementation/README.md

Lines changed: 1 addition & 0 deletions
@@ -9,3 +9,4 @@ the code should go into comments.
 * [Time-travel Queries](./time-travel.md)
 * [SQL Query Generation](./sql-query-generation.md)
 * [Adding support for a new chain](./add-chain.md)
+* [Pruning](./pruning.md)

docs/implementation/pruning.md

Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
## Pruning deployments

Pruning is an operation that deletes data from a deployment that is only
needed to respond to queries at block heights before a certain block. In
GraphQL, those are only queries with a constraint `block: { number: <n> }`
or a similar constraint by block hash, where `n` is before the block to
which the deployment is pruned. Queries that are run at a block height
greater than that are not affected by pruning, and there is no difference
between running these queries against an unpruned and a pruned deployment.

Because pruning reduces the amount of data in a deployment, it reduces the
amount of storage needed for that deployment, and is beneficial for both
query performance and indexing speed. Especially compared to the default of
keeping all history for a deployment, it can often shrink the deployment's
data dramatically and speed up queries considerably. See
[caveats](#caveats) below for the downsides.

The block `b` to which a deployment is pruned is controlled by
`history_blocks`, the number of blocks of history to retain; `b` is
calculated internally using `history_blocks` and the latest block of the
deployment when the prune operation is performed. When pruning finishes, it
updates the `earliest_block` for the deployment. The `earliest_block` can
be retrieved through the `index-node` status API, and `graph-node` will
return an error for any query that tries to time-travel to a point before
`earliest_block`. The value of `history_blocks` must be greater than
`ETHEREUM_REORG_THRESHOLD` to make sure that reverts can never conflict
with pruning.
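
The relationship between `history_blocks`, the pruned block `b`, and
`earliest_block` can be sketched as follows. This is only an illustration:
the text does not give the formula for `b`, so the plain subtraction below
is an assumption, and the function names and the reorg threshold value in
the example are hypothetical rather than taken from `graph-node`.

```python
def prune_target(latest_block: int, history_blocks: int, reorg_threshold: int) -> int:
    """Return the block `b` to prune to.

    Assumption: b = latest_block - history_blocks, clamped at block 0.
    """
    # Documented constraint: history_blocks must exceed the reorg threshold
    # so that reverts can never reach into pruned data.
    if history_blocks <= reorg_threshold:
        raise ValueError("history_blocks must be greater than the reorg threshold")
    return max(latest_block - history_blocks, 0)


def check_query_block(query_block: int, earliest_block: int) -> None:
    """Reject time-travel queries that target a block before earliest_block."""
    if query_block < earliest_block:
        raise ValueError(
            f"block {query_block} is before earliest_block {earliest_block}"
        )


# Hypothetical numbers: head at block 1,000,000, keep 10,000 blocks of history.
earliest = prune_target(1_000_000, 10_000, reorg_threshold=250)
print(earliest)  # 990000
check_query_block(995_000, earliest)  # allowed, no exception
```

A query at block 989,999 would raise in this sketch, which mirrors the error
that `graph-node` returns for queries before `earliest_block`.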

Pruning is started by running `graphman prune`. That command will perform
an initial prune of the deployment and set the subgraph's `history_blocks`
setting, which is used to periodically check whether the deployment has
accumulated more history than that. Whenever the deployment does contain
more history than that, the deployment is automatically repruned. If
ongoing pruning is not desired, pass the `--once` flag to `graphman
prune`. Once enabled, ongoing pruning can be turned off by setting
`history_blocks` to a very large value with the `--history` flag.

Repruning is performed whenever the deployment has more than
`history_blocks * GRAPH_STORE_HISTORY_SLACK_FACTOR` blocks of history. The
environment variable `GRAPH_STORE_HISTORY_SLACK_FACTOR` therefore controls
how often repruning is performed: after the initial pruning, a reprune
happens every `history_blocks * (GRAPH_STORE_HISTORY_SLACK_FACTOR - 1)`
blocks. For example, with `GRAPH_STORE_HISTORY_SLACK_FACTOR=1.5` and
`history_blocks` set to 10,000, a reprune will happen every 5,000 blocks.
The slack factor should be set high enough that repruning occurs relatively
infrequently and does not cause too much database work.
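
The reprune arithmetic can be checked with a short sketch. The function
names are hypothetical; only the threshold
`history_blocks * GRAPH_STORE_HISTORY_SLACK_FACTOR` comes from the text, and
the interval `history_blocks * (slack_factor - 1)` follows from it, which is
consistent with the 5,000 block example.

```python
def needs_reprune(current_history: int, history_blocks: int, slack_factor: float) -> bool:
    """True once the deployment holds more than history_blocks * slack_factor blocks."""
    return current_history > history_blocks * slack_factor


def reprune_interval(history_blocks: int, slack_factor: float) -> float:
    """Blocks between reprunes: history grows from history_blocks
    up to history_blocks * slack_factor before the next reprune."""
    return history_blocks * (slack_factor - 1)


print(reprune_interval(10_000, 1.5))       # 5000.0
print(needs_reprune(15_000, 10_000, 1.5))  # False, exactly at the limit
print(needs_reprune(15_001, 10_000, 1.5))  # True
```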

Pruning uses two different strategies for how to remove unneeded data:
rebuilding tables and deleting old entity versions. Deleting old entity
versions is straightforward: this strategy deletes rows from the underlying
tables. Rebuilding tables copies the data that should be kept from the
existing tables into new tables and then replaces the existing tables with
these much smaller tables. Which strategy to use is determined for each
table individually, and governed by the settings for
`GRAPH_STORE_HISTORY_REBUILD_THRESHOLD` and
`GRAPH_STORE_HISTORY_DELETE_THRESHOLD`: if we estimate that we will remove
more than `REBUILD_THRESHOLD` of the table, the table will be rebuilt. If
we estimate that we will remove a fraction between `DELETE_THRESHOLD` and
`REBUILD_THRESHOLD` of the table, unneeded entity versions will be
deleted. If we estimate that we will remove less than `DELETE_THRESHOLD`,
the table is not changed at all. With both strategies, operations are
broken into batches that should each take
`GRAPH_STORE_BATCH_TARGET_DURATION` seconds to avoid causing very
long-running transactions.
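
The per-table decision described above can be sketched as a small function.
This is an illustration, not the `graph-node` implementation: the names are
hypothetical, the threshold values in the example are chosen for
demonstration and are not claimed to be the defaults, and the handling of a
fraction that exactly equals a threshold is an assumption.

```python
from enum import Enum


class Strategy(Enum):
    REBUILD = "rebuild"  # copy the rows to keep into a new table, then swap tables
    DELETE = "delete"    # delete old entity versions in place
    NONE = "none"        # leave the table untouched


def choose_strategy(
    removal_fraction: float,
    rebuild_threshold: float,
    delete_threshold: float,
) -> Strategy:
    """Pick a pruning strategy from the estimated fraction of rows to remove."""
    if removal_fraction > rebuild_threshold:
        return Strategy.REBUILD
    # Assumption: a fraction exactly equal to delete_threshold still deletes.
    if removal_fraction >= delete_threshold:
        return Strategy.DELETE
    return Strategy.NONE


# Illustrative thresholds only.
for fraction in (0.8, 0.2, 0.01):
    print(fraction, choose_strategy(fraction, rebuild_threshold=0.5, delete_threshold=0.05))
```

With these example thresholds, a table expected to lose 80% of its rows is
rebuilt, one expected to lose 20% has old versions deleted, and one expected
to lose 1% is skipped.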

### Caveats

Pruning is a user-visible operation and does affect some of the things that
can be done with a deployment:

* Because it removes history, pruning restricts how far back time-travel
  queries can be performed. This will only be an issue for entities that
  keep lifetime statistics about some object (e.g., a token) and are used
  to produce time series: after pruning, it is only possible to produce a
  time series that goes back no more than `history_blocks` blocks. Pruning
  is very beneficial, though, for entities that keep daily or similar
  statistics about some object, as it removes data that is not needed once
  the time period is over and does not affect how far back time series
  based on these objects can be retrieved.
* Pruning restricts how far back a graft can be performed. Because it
  removes history, it becomes impossible to graft more than
  `history_blocks` before the current deployment head.
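
The graft restriction amounts to a simple bound check. The sketch below uses
hypothetical function names and a conservative bound: a deployment may
retain somewhat more than `history_blocks` between reprunes, so the
authoritative limit is the `earliest_block` reported by the status API.

```python
def earliest_graftable_block(head_block: int, history_blocks: int) -> int:
    """Conservative lower bound for a graft point on a pruned deployment."""
    return max(head_block - history_blocks, 0)


def can_graft_at(graft_block: int, head_block: int, history_blocks: int) -> bool:
    """True if graft_block lies within the retained history and not past the head."""
    return earliest_graftable_block(head_block, history_blocks) <= graft_block <= head_block


print(can_graft_at(995_000, head_block=1_000_000, history_blocks=10_000))  # True
print(can_graft_at(900_000, head_block=1_000_000, history_blocks=10_000))  # False
```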
