|
| 1 | +## Pruning deployments |
| 2 | + |
| 3 | +Pruning is an operation that deletes data from a deployment that is only |
| 4 | +needed to respond to queries at block heights before a certain block. In |
| 5 | +GraphQL, those are only queries with a constraint `block: { number: <n> }` |
| 6 | +or a similar constraint by block hash where `n` is before the block to |
| 7 | +which the deployment is pruned. Queries that are run at a block height |
| 8 | +greater than that are not affected by pruning, and there is no difference |
| 9 | +between running these queries against an unpruned and a pruned deployment. |
| 10 | + |
| 11 | +Because pruning reduces the amount of data in a deployment, it reduces the |
| 12 | +amount of storage needed for that deployment, and is beneficial for both |
| 13 | +query performance and indexing speed. Especially compared to the default of |
| 14 | +keeping all history for a deployment, it can often reduce the amount of |
| 15 | +data for a deployment by a very large amount and speed up queries |
| 16 | +considerably. See [caveats](#caveats) below for the downsides. |
| 17 | + |
| 18 | +The block `b` to which a deployment is pruned is controlled by how many |
| 19 | +blocks `history_blocks` of history to retain; `b` is calculated internally |
| 20 | +using `history_blocks` and the latest block of the deployment when the |
| 21 | +prune operation is performed. When pruning finishes, it updates the |
| 22 | +`earliest_block` for the deployment. The `earliest_block` can be retrieved |
| 23 | +through the `index-node` status API, and `graph-node` will return an error |
| 24 | +for any query that tries to time-travel to a point before |
| 25 | +`earliest_block`. The value of `history_blocks` must be greater than |
| 26 | +`ETHEREUM_REORG_THRESHOLD` to make sure that reverts can never conflict |
| 27 | +with pruning. |
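
The relationship between `history_blocks`, the deployment head, and `earliest_block` can be sketched roughly as follows. This is only an illustration: the text above says `b` is calculated internally, so the formula `latest_block - history_blocks` is an assumption, and the function names `prune_target_block` and `check_time_travel` are hypothetical, not part of `graph-node`.

```python
def prune_target_block(latest_block: int, history_blocks: int, reorg_threshold: int) -> int:
    """Return the assumed block `b` to which a deployment would be pruned.

    Assumption: b = latest_block - history_blocks (the real calculation is
    internal to graph-node and may differ in detail).
    """
    # history_blocks must exceed the reorg threshold so that reverts can
    # never reach into pruned history.
    if history_blocks <= reorg_threshold:
        raise ValueError("history_blocks must be greater than the reorg threshold")
    return max(0, latest_block - history_blocks)


def check_time_travel(query_block: int, earliest_block: int) -> None:
    """Reject a query that tries to time-travel before earliest_block."""
    if query_block < earliest_block:
        raise ValueError(
            f"block {query_block} is before earliest_block {earliest_block}"
        )


# Hypothetical numbers: head at block 1,000,000, keep 10,000 blocks of history.
earliest = prune_target_block(1_000_000, 10_000, reorg_threshold=250)
check_time_travel(995_000, earliest)  # within retained history, no error
```

A query at a block below `earliest` would raise here, mirroring the error `graph-node` returns for such queries.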
| 28 | + |
| 29 | +Pruning is started by running `graphman prune`. That command will perform |
| 30 | +an initial prune of the deployment and set the subgraph's `history_blocks` |
| 31 | +setting which is used to periodically check whether the deployment has |
| 32 | +accumulated more history than that. Whenever the deployment does contain |
| 33 | +more history than that, the deployment is automatically repruned. If |
| 34 | +ongoing pruning is not desired, pass the `--once` flag to `graphman |
| 35 | +prune`. Ongoing pruning can be turned off by setting `history_blocks` to a |
| 36 | +very large value with the `--history` flag. |
| 37 | + |
| 38 | +Repruning is performed whenever the deployment has more than |
| 39 | +`history_blocks * GRAPH_STORE_HISTORY_SLACK_FACTOR` blocks of history. The |
| 40 | +environment variable `GRAPH_STORE_HISTORY_SLACK_FACTOR` therefore controls |
| 41 | +how often repruning is performed: with |
| 42 | +`GRAPH_STORE_HISTORY_SLACK_FACTOR=1.5` and `history_blocks` set to 10,000, |
| 43 | +a reprune will happen every 5,000 blocks. After the initial pruning, a |
| 44 | +reprune therefore happens every `history_blocks * |
| 45 | +(GRAPH_STORE_HISTORY_SLACK_FACTOR - 1)` blocks. This value should be set high |
| 46 | +enough so that repruning occurs relatively infrequently to not cause too |
| 47 | +much database work. |
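
The arithmetic behind the 5,000-block example can be checked with a short sketch. The function names are hypothetical and the code only restates the rule described above: repruning triggers once the deployment holds more than `history_blocks * GRAPH_STORE_HISTORY_SLACK_FACTOR` blocks of history.

```python
def needs_reprune(blocks_of_history: int, history_blocks: int, slack_factor: float) -> bool:
    """True once the deployment has accumulated more history than allowed."""
    return blocks_of_history > history_blocks * slack_factor


def blocks_between_reprunes(history_blocks: int, slack_factor: float) -> float:
    """Blocks that pass between two reprunes after the initial prune.

    A prune cuts history back to history_blocks; the next one triggers at
    history_blocks * slack_factor, so the gap is the difference of the two.
    """
    return history_blocks * slack_factor - history_blocks


# Values from the example in the text.
interval = blocks_between_reprunes(10_000, 1.5)
print(interval)
```

With a slack factor close to 1 the interval shrinks toward zero and repruning runs almost continuously, which is why a larger factor keeps database work down.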
| 48 | + |
| 49 | +Pruning uses two different strategies for how to remove unneeded data: |
| 50 | +rebuilding tables and deleting old entity versions. Deleting old entity |
| 51 | +versions is straightforward: this strategy deletes rows from the underlying |
| 52 | +tables. Rebuilding tables copies the data that should be kept from the |
| 53 | +existing tables into new tables and then replaces the existing tables with |
| 54 | +these much smaller tables. Which strategy to use is determined for each |
| 55 | +table individually, and governed by the settings for |
| 56 | +`GRAPH_STORE_HISTORY_REBUILD_THRESHOLD` and |
| 57 | +`GRAPH_STORE_HISTORY_DELETE_THRESHOLD`: if we estimate that we will remove |
| 58 | +more than `REBUILD_THRESHOLD` of the table, the table will be rebuilt. If |
| 59 | +we estimate that we will remove a fraction between `REBUILD_THRESHOLD` and |
| 60 | +`DELETE_THRESHOLD` of the table, unneeded entity versions will be |
| 61 | +deleted. If we estimate that we will remove less than `DELETE_THRESHOLD`, the table |
| 62 | +is not changed at all. With both strategies, operations are broken into |
| 63 | +batches that should each take `GRAPH_STORE_BATCH_TARGET_DURATION` seconds |
| 64 | +to avoid causing very long-running transactions. |
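
The per-table decision described above can be sketched as a small function. This is an illustration only: `choose_strategy` and the `Strategy` enum are hypothetical names, the threshold values in the usage lines are made-up examples rather than documented defaults, and how `graph-node` treats estimates that fall exactly on a threshold is an assumption here.

```python
from enum import Enum


class Strategy(Enum):
    REBUILD = "rebuild"  # copy rows to keep into a new table, then swap tables
    DELETE = "delete"    # delete old entity versions in place
    SKIP = "skip"        # leave the table unchanged


def choose_strategy(removal_fraction: float,
                    rebuild_threshold: float,
                    delete_threshold: float) -> Strategy:
    """Pick a pruning strategy for one table.

    removal_fraction is the estimated share of the table (0.0 to 1.0)
    that pruning would remove.
    """
    if removal_fraction > rebuild_threshold:
        return Strategy.REBUILD
    # Assumption: an estimate exactly equal to delete_threshold still deletes.
    if removal_fraction >= delete_threshold:
        return Strategy.DELETE
    return Strategy.SKIP


# Illustrative thresholds, not necessarily the real defaults.
for fraction in (0.8, 0.2, 0.01):
    print(fraction, choose_strategy(fraction, rebuild_threshold=0.5, delete_threshold=0.05))
```

Because the decision is made per table, a single prune run can rebuild one table, delete from another, and skip a third.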
| 65 | + |
| 66 | +### Caveats |
| 67 | + |
| 68 | +Pruning is a user-visible operation and does affect some of the things that |
| 69 | +can be done with a deployment: |
| 70 | + |
| 71 | +* because it removes history, it restricts how far back time-travel queries |
| 72 | + can be performed. This will only be an issue for entities that keep |
| 73 | + lifetime statistics about some object (e.g., a token) and are used to |
| 74 | + produce time series: after pruning, it is only possible to produce a time |
| 75 | + series that goes back no more than `history_blocks` blocks. It is very |
| 76 | + beneficial though for entities that keep daily or similar statistics |
| 77 | + about some object as it removes data that is not needed once the time |
| 78 | + period is over, and does not affect how far back time series based on |
| 79 | + these objects can be retrieved. |
| 80 | +* it restricts how far back a graft can be performed. Because it removes |
| 81 | + history, it becomes impossible to graft more than `history_blocks` blocks before |
| 82 | + the current deployment head. |