Transaction mutations partially applied when TiKV nodes crashing? #497

@jmhrpr

Cluster version: 8.5.1
Client version:

name = "tikv-client"
version = "0.3.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "048968e4e3d04db472346770cc19914c6b5ae206fa44677f6a0874d54cd05940"

I have a workload where I am ingesting lots of data into TiKV, one transaction after another in quick succession. In one transaction I write/delete many KVs. In a later transaction (in this specific case ~24h later) I try to read one of the keys I wrote, but no value is found, and I did not perform any other operation on that key in between. Other keys written in the same transaction as the missing KV are found. Some of the TiKV nodes are crashing around that time due to the heavy workload. Which keys go missing is not deterministic, but when we repeat the workload we again see cases where some writes or deletes within a transaction do not seem to be applied. When no nodes crash during the workload we don't seem to hit the issue (this could be a coincidence).

  1. .put(key) within transaction (default optimistic)
  2. Begin committing the transaction writing the KV
  3. 10-20 or so logs about connecting to specifically one of the nodes, which is likely crashed/restarting. These 2 lines repeat:
tikv_client::pd::client: connect to tikv endpoint: <node 0>
tikv_client::common::security: connect to rpc server at endpoint: <node 0>
  4. Client returns that the commit was successful

Not related to this specific transaction, but to give some context on what else was happening: ~30 seconds later, a different transaction (submitted later, not the one which wrote the missing KV) gets a "failed to commit secondary keys" error due to TxnLockNotFound. ~2 mins after that we see a heart beat error for TxnNotFound. Then a transaction errors with gRPC api error: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {} }.

Here are the logs from the cluster around (30s either side) the time I called .commit() on the transaction which was supposed to write this missing KV (I called commit at 2025-07-12T19:28:37.261):

logs_adjusted.json

The missing key was 03ee44790100f9465cf9b0426aa7020a0685c66bd10a859b78e2da1dea940ba5e113600083b37e051e0b67eb17348f48d3e6e6e8b009d8ea48854d721488667d6000000000029fd4fc727f8594309b02da47145f019993219e0db644bdfd81ea6890e0c717836bdc370000000000000001 and it was written in transaction with start ts (Timestamp { physical: 1752348517228, logical: 14, suffix_bits: 0 }) and commit ts Timestamp { physical: 1752348517328, logical: 35, suffix_bits: 0 }.
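In case it helps anyone line these timestamps up with the cluster logs, here is a minimal sketch of how I understand them. It assumes the usual PD TSO layout (physical part in milliseconds since the Unix epoch, shifted left by 18 bits, with the logical counter in the low bits) and that suffix_bits can be ignored since it is 0 here. The function names are my own helpers, not part of tikv-client.

```rust
// Assumption: PD's TSO reserves the low 18 bits for the logical counter.
const LOGICAL_BITS: u32 = 18;

/// Pack the physical (ms since Unix epoch) and logical parts into one u64 version.
fn compose_version(physical_ms: u64, logical: u64) -> u64 {
    (physical_ms << LOGICAL_BITS) | logical
}

/// Split a u64 version back into (physical_ms, logical).
fn split_version(version: u64) -> (u64, u64) {
    (version >> LOGICAL_BITS, version & ((1u64 << LOGICAL_BITS) - 1))
}

/// Format the UTC time of day (HH:MM:SS.mmm) of a physical timestamp.
fn utc_time_of_day(physical_ms: u64) -> String {
    let ms = physical_ms % 1000;
    let secs = (physical_ms / 1000) % 86_400;
    format!(
        "{:02}:{:02}:{:02}.{:03}",
        secs / 3600,
        (secs % 3600) / 60,
        secs % 60,
        ms
    )
}

fn main() {
    // Values copied from the start ts and commit ts above.
    let start = compose_version(1752348517228, 14);
    let commit = compose_version(1752348517328, 35);
    println!("start  version = {} at {} UTC", start, utc_time_of_day(1752348517228));
    println!("commit version = {} at {} UTC", commit, utc_time_of_day(1752348517328));
}
```

Under those assumptions the start ts falls at 19:28:37.228 UTC and the commit ts at 19:28:37.328 UTC, i.e. within the log window I attached.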

How do I go about debugging this, or what information can I provide to help? I'm not sure if this is a client issue or something else.
