Description
Search before asking
- I have searched the issues and found no similar issues.
Version
v1.2.776-nightly
What's Wrong?
Since Databend is a distributed system that executes queries on multiple nodes, the loss of a node mid-query should be handled gracefully so that the database is not left in an indeterminate state.
Currently, when a query node is lost during query execution, a "Broken Pipe" error is returned to the client, and the client has no way to know what state the tables are in at that point.
A client-side workaround that tries to determine the state after the failure might be possible, but it would be complex and likely inaccurate, because a client cannot see the internal state of Databend objects at the time of failure. In any sufficiently large distributed deployment, node failures will occur regularly, so the loss of a node should be expected and handled transparently to outside systems and clients. We experience "Broken Pipe" errors on a daily basis.
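To illustrate why a client-side workaround only goes so far, here is a minimal sketch of a retry wrapper. Everything in it is an assumption for illustration: `run_with_retry`, `is_read_only`, and the `execute` callable are hypothetical names, not part of any Databend client, and it assumes the driver surfaces the failure as a Python `BrokenPipeError` or `ConnectionError`. It can safely replay read-only statements, but for writes it can only re-raise, since the client cannot tell whether the statement was partially applied.

```python
import time

# Hypothetical helper names; not part of any Databend client library.
# Crude allow-list: statements starting with these keywords do not change table state.
READ_ONLY_PREFIXES = ("select", "show", "explain", "describe")


def is_read_only(sql: str) -> bool:
    """Return True only for statements that are safe to replay."""
    return sql.lstrip().lower().startswith(READ_ONLY_PREFIXES)


def run_with_retry(execute, sql, attempts=3, delay_s=0.0):
    """Call execute(sql), retrying on connection loss for read-only statements.

    `execute` is any caller-supplied callable that raises BrokenPipeError or
    ConnectionError when the connection drops (an assumption about how the
    driver reports a lost query node).
    """
    last_error = None
    for _ in range(attempts):
        try:
            return execute(sql)
        except (BrokenPipeError, ConnectionError) as err:
            if not is_read_only(sql):
                # A write may have been partially applied; replaying it
                # blindly is unsafe, so surface the error to the caller.
                raise
            last_error = err
            time.sleep(delay_s)
    # All attempts failed; re-raise the last connection error.
    raise last_error
```

The allow-list is deliberately conservative (for example, a query starting with `WITH` is not retried), which is part of the point: anything beyond simple reads needs knowledge of table state that only the server has.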
Less importantly, this bug also prevents auto-scaling of the query pods. The query pods can be scaled out without error, but scaling back in is very likely to cause a "Broken Pipe" error on any system in active use, because query pods are terminated without regard to the queries they are running.
How to Reproduce?
- Create a Databend cluster with at least two query pods running on different nodes
- Execute a long-running query
- Terminate one or more of the nodes hosting the query pods
- Observe the query error, most likely a "Broken Pipe"
Are you willing to submit PR?
- Yes I am willing to submit a PR!