-
Notifications
You must be signed in to change notification settings - Fork 3.3k
Fault Tolerant Execution
The current all-or-nothing architecture makes it significantly more difficult to tolerate faults. The likelihood of a failure grows with the time it takes to complete a query. In addition to increased likelihood of a failure the impact of failing a long running query is much higher as it often results in a significant waste of time and resources.
Due to the nature of streaming exchange in Trino all tasks are interconnected. A failure of any task results in a query failure. To support long running queries Trino has to be able to tolerate task failures.
The idea of this approach is to suspend query execution once in a while and store the current state of all tasks and operators to a persistent storage. In case of a fault a saved snapshot can be restored and a query can be resumed.
Usually this approach is taken under an assumption of the state being “not that large” so the snapshots can be taken at high enough frequency.
Pros
- Exchanges remains “cheap”
- Partitions can be assigned dynamically
Cons
- All tasks must be restarted from a checkpoint in case of a failure
- Might be impossible to make progress when the system experiences faults more often that a snapshot can be taken. For example when running on spot instances with a 2 minutes withdrawal notice a snapshot has to be taken more frequently than once in 2 minutes.
- When the workload grows so does the state. At some point taking frequent snapshots might become unmanageable. Doesn’t help with running queries that require more memory than available in a cluster as it doesn’t enable more granular partitioning and an iterative task execution
- Doesn't provide additional capabilities to support speculative execution
- “fault” is often a non-binary thing. A task that experiences a significant slowdown is technically not “failed”, yet it might be better to speculatively start another copy of that task to avoid waiting for an instance that might never finish.
- Doesn’t enable adaptive query optimizations as fundamentally the model of execution remains the same (an entire query is executed all at once)