ydb/docs/en/core/concepts/cluster/_includes/common_scheme_ydb/tablets.md
+1 -1 Lines changed: 1 addition & 1 deletion
Original file line number
Diff line number
Diff line change
@@ -12,7 +12,7 @@ User logic is located between the basic tablet and the user and lets you process
12
12
13
13
### How does a tablet store data and what are they like? {#storage}
14
14
15
-
A basic tablet is an LSM tree that holds all of its table data. One level below the basic tablet is BlobStorage that, roughly speaking, is KeyValue storage that stores binary large objects (blobs). *BLOB* is a binary fragment from 1 byte to 10 MB in size, which has a fixed ID (that is usually called *BlobId* and is of the TLogoBlobID type) and contains related data. Storage is immutable, meaning that only one value corresponds to each ID and it cannot change over time. You can write and read a blob and then delete it when you no longer need it.
15
+
A basic tablet is an [LSM tree](../../../glossary.md#lsm-tree) that holds all of its table data. One level below the basic tablet is BlobStorage, which, roughly speaking, is a key-value store for binary large objects (blobs). A *blob* is a binary fragment from 1 byte to 10 MB in size that has a fixed ID (usually called *BlobId*, of the TLogoBlobID type) and contains related data. The storage is immutable: exactly one value corresponds to each ID, and it cannot change over time. You can write a blob, read it, and delete it when you no longer need it.
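The write-once behavior described above can be illustrated with a toy in-memory store. This is only a sketch under stated assumptions: the class name `ImmutableBlobStore`, its methods, and the interpretation of 10 MB as 10 × 1024 × 1024 bytes are hypothetical and are not part of the BlobStorage interface.

```python
class ImmutableBlobStore:
    """Toy write-once key-value store (hypothetical, not the BlobStorage API)."""

    # Assumption: "10 MB" is read here as 10 * 1024 * 1024 bytes.
    MAX_BLOB_SIZE = 10 * 1024 * 1024

    def __init__(self):
        self._blobs = {}

    def put(self, blob_id, data: bytes):
        if not 1 <= len(data) <= self.MAX_BLOB_SIZE:
            raise ValueError("blob size must be between 1 byte and 10 MB")
        existing = self._blobs.get(blob_id)
        if existing is not None and existing != data:
            # Only one value may ever correspond to a given ID.
            raise ValueError("a blob ID cannot be rebound to different data")
        self._blobs[blob_id] = data

    def get(self, blob_id) -> bytes:
        return self._blobs[blob_id]

    def delete(self, blob_id):
        # Deleting is allowed once the blob is no longer needed.
        self._blobs.pop(blob_id, None)
```

Rejecting a second write with different data is one way to model immutability; a real system may enforce it differently.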
16
16
17
17
To learn more about blobs and distributed storage, see [here](../../distributed_storage.md).
ydb/docs/en/core/concepts/glossary.md
+25 -3 Lines changed: 25 additions & 3 deletions
@@ -217,7 +217,7 @@ An **actor system** is a C++ library with {{ ydb-short-name }}'s [implementation
217
217
218
218
#### Actor service {#actor-service}
219
219
220
-
An **actor service** is an actor that has a well-known name and is usually run in a single instance on a [node](#node).
220
+
An **actor service** is an [actor](#actor) that has a well-known name and is usually run in a single instance on a [node](#node).
221
221
222
222
#### ActorId {#actorid}
223
223
@@ -272,6 +272,20 @@ A **tablet generation** is a number identifying the reincarnation of the tablet
272
272
273
273
A **tablet local database** or **local database** is a set of data structures and related code that manages the tablet's state and the data it stores. Logically, the local database state is represented by a set of tables very similar to relational tables. Modification of the state of the local database is performed by local tablet transactions generated by the tablet's user actor.
274
274
275
+
Each local database table is stored using the [LSM tree](#lsm-tree) data structure.
276
+
277
+
#### Log-structured merge-tree {#lsm-tree}
278
+
279
+
A **[log-structured merge-tree](https://en.wikipedia.org/wiki/Log-structured_merge-tree)** or **LSM tree** is a data structure designed to optimize write and read performance in storage systems. It is used in {{ ydb-short-name }} for storing [local database](#local-database) tables and [VDisk](#vdisk) data.
280
+
281
+
#### MemTable {#memtable}
282
+
283
+
All data written to [local database](#local-database) tables is initially stored in an in-memory data structure called a **MemTable**. When the MemTable reaches a predefined size, it is flushed to disk as an immutable [SST](#sst).
284
+
285
+
#### Sorted string table {#sst}
286
+
287
+
A **sorted string table** or **SST** is an immutable data structure that stores table rows sorted by key, facilitating efficient key lookups and range queries. Each SST is composed of a contiguous series of small data pages, typically around 7 KiB each, which further optimizes reading data from disk. An SST typically represents a part of an [LSM tree](#lsm-tree).
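The relationship between a MemTable and an SST can be sketched as follows. This is a minimal illustration, not the LocalDB implementation: the class names, the `flush_threshold` and `rows_per_page` parameters, and the choice to size pages by row count rather than by bytes are all assumptions made for brevity.

```python
import bisect


class SortedStringTable:
    """Immutable rows sorted by key and split into small pages (toy model)."""

    def __init__(self, sorted_rows, rows_per_page):
        # Assumption: pages are cut by row count; the glossary describes
        # pages of roughly 7 KiB instead.
        self.pages = [
            sorted_rows[i:i + rows_per_page]
            for i in range(0, len(sorted_rows), rows_per_page)
        ]
        # The first key of each page acts as a tiny index for lookups.
        self.first_keys = [page[0][0] for page in self.pages]

    def get(self, key):
        # Find the only page that can contain the key, then scan it.
        page_index = bisect.bisect_right(self.first_keys, key) - 1
        if page_index < 0:
            return None
        for row_key, value in self.pages[page_index]:
            if row_key == key:
                return value
        return None


class MemTable:
    """In-memory buffer that is flushed to an SST once it is full."""

    def __init__(self, flush_threshold):
        self.rows = {}
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.rows[key] = value

    def is_full(self):
        return len(self.rows) >= self.flush_threshold

    def flush(self, rows_per_page=2):
        sst = SortedStringTable(sorted(self.rows.items()), rows_per_page)
        self.rows = {}
        return sst
```

Because each page covers a contiguous key range, a point lookup touches a single page, which is the property the page layout is meant to provide.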
288
+
275
289
#### Tablet pipe {#tablet-pipe}
276
290
277
291
A **Tablet pipe** or **TabletPipe** is a virtual connection that can be established with a tablet. It includes resolving the [tablet leader](#tablet-leader) by [TabletID](#tabletid). It is the recommended way to work with the tablet. The term **open a pipe to a tablet** describes the process of resolving (searching) a tablet in a cluster and establishing a virtual communication channel with it.
@@ -284,6 +298,14 @@ A **TabletID** is a cluster-wide unique [tablet](#tablet) identifier.
284
298
285
299
The **bootstrapper** is the primary mechanism for launching tablets, used for service tablets (for example, for [Hive](#hive), [DS controller](#ds-controller), root [SchemeShard](#scheme-shard)). The [Hive](#hive) tablet initializes the rest of the tablets.
286
300
301
+
### Shared cache {#shared-cache}
302
+
303
+
A **shared cache** is an [actor](#actor) that stores data pages recently accessed and read from [distributed storage](#distributed-storage). Caching these pages reduces disk I/O operations and accelerates data retrieval, enhancing overall system performance.
304
+
305
+
### Memory controller {#memory-controller}
306
+
307
+
A **memory controller** is an [actor](#actor) that manages {{ ydb-short-name }} [memory limits](../deploy/configuration/config.md#memory-controller).
308
+
287
309
### Tablet types {#tablet-types}
288
310
289
311
[Tablets](#tablet) can be considered a framework for building reliable components operating in a distributed system. {{ ydb-short-name }} has multiple components implemented using this framework, listed below.
@@ -353,7 +375,7 @@ Due to its nature, the state storage service operates in a best-effort manner. F
353
375
354
376
#### Compaction {#compaction}
355
377
356
-
**Compaction** is the internal background process of rebuilding [LSM tree](https://en.wikipedia.org/wiki/Log-structured_merge-tree) data. The data in [VDisks](#vdisk) and [local databases](#local-database) are organized in the form of an LSM tree. Therefore, there is a distinction between **VDisk compaction** and **Tablet compaction**. The compaction process is usually quite resource-intensive, so efforts are made to minimize the overhead associated with it, for example, by limiting the number of concurrent compactions.
378
+
**Compaction** is the internal background process of rebuilding [LSM tree](#lsm-tree) data. The data in [VDisks](#vdisk) and [local databases](#local-database) are organized in the form of an LSM tree. Therefore, there is a distinction between **VDisk compaction** and **Tablet compaction**. The compaction process is usually quite resource-intensive, so efforts are made to minimize the overhead associated with it, for example, by limiting the number of concurrent compactions.
357
379
358
380
#### gRPC proxy {#grpc-proxy}
359
381
@@ -405,7 +427,7 @@ PDisk contains a scheduler that provides device bandwidth sharing between severa
405
427
406
428
#### Skeleton {#skeleton}
407
429
408
-
A **Skeleton** is an actor that provides an interface to a [VDisk](#vdisk).
430
+
A **Skeleton** is an [actor](#actor) that provides an interface to a [VDisk](#vdisk).
ydb/docs/en/core/concepts/mvcc.md
+4 -4 Lines changed: 4 additions & 4 deletions
@@ -20,7 +20,7 @@ A simple and naive way of adding MVCC to a sorted KV store is to store multiple
20
20
21
21
## Why YDB needs MVCC
22
22
23
-
YDB table shards store data in a sorted KV store, implemented as a write-optimized LSM-tree (Log-Structured Merge-tree), and historically they did not use MVCC. Since the order of transactions is predetermined externally (using Coordinators, somewhat similar to sequencers in the original Calvin paper), YDB heavily relies on reordering transaction execution at each participant, which is correct as long as such reordering cannot be observed externally, and it doesn't change the final outcome. Without MVCC reordering is impeded by read-write conflicts, e.g. when a write cannot start execution until a particularly wide read is complete. With MVCC writes no longer need to wait for conflicting reads to complete, and reads only ever need to wait for preceding conflicting writes to commit. This makes the out-of-order engine's job easier and improves the overall throughput.
23
+
YDB table shards store data in a sorted KV store, implemented as a write-optimized [LSM tree](glossary.md#lsm-tree), and historically they did not use MVCC. Since the order of transactions is predetermined externally (using Coordinators, somewhat similar to sequencers in the original Calvin paper), YDB heavily relies on reordering transaction execution at each participant, which is correct as long as such reordering cannot be observed externally and doesn't change the final outcome. Without MVCC, reordering is impeded by read-write conflicts, e.g., when a write cannot start execution until a particularly wide read is complete. With MVCC, writes no longer need to wait for conflicting reads to complete, and reads only ever need to wait for preceding conflicting writes to commit. This makes the out-of-order engine's job easier and improves the overall throughput.
24
24
25
25
| Timestamp | Statement | Without MVCC | With MVCC | Description |
26
26
| --- | --- | --- | --- | --- |
@@ -37,17 +37,17 @@ After implementing MVCC using global versions (shared with deterministic distrib
37
37
38
38
## How YDB stores MVCC data
39
39
40
-
DataShard tablets currently store a single table partition in a write-optimized LSM-tree, where for each primary key we store row operation with a set of column updates. During searches, we merge updates from multiple levels and get the final row state. Compactions similarly merge updates from multiple levels and write a resulting aggregate row update.
40
+
DataShard tablets currently store a single table partition in a write-optimized LSM tree, where for each primary key we store a row operation with a set of column updates. During searches, we merge updates from multiple levels and get the final row state. Compactions similarly merge updates from multiple levels and write a resulting aggregate row update.
41
41
42
-
One of our design goals when adding MVCC was minimal degradation to existing workloads, and that meant queries, especially range queries, with the most recent version needed to be fast. That meant using common approaches like adding a version suffix to keys was out of the question. Instead, when a row in an SST (sorted string table, part of an LSM-tree) has multiple versions we only store the most recent version in the main data page, marking it with a flag signaling "history" data is present. Older row versions are stored in a special "history" companion SST, where for each marked row id we store row versions in descending order. When we read from a snapshot, we detect if the most recent row version is too recent, and perform a binary search in the history SST instead. Once we found a row version corresponding to a snapshot we apply its updates to the final row state. We also use the fact that LSM-tree levels roughly correspond to their write time, allowing us to stop searching once the first matching row is found for a given snapshot. For each level below that we only need to apply the most recent row to the final row state, which limits the number of merges to at most the number of levels, which is usually small.
42
+
One of our design goals when adding MVCC was minimal degradation to existing workloads, and that meant queries, especially range queries, with the most recent version needed to be fast. That meant using common approaches like adding a version suffix to keys was out of the question. Instead, when a row in an [SST](glossary.md#sst) (sorted string table, part of an LSM tree) has multiple versions, we only store the most recent version in the main data page, marking it with a flag signaling that "history" data is present. Older row versions are stored in a special "history" companion SST, where for each marked row id we store row versions in descending order. When we read from a snapshot, we detect if the most recent row version is too recent, and perform a binary search in the history SST instead. Once we find a row version corresponding to a snapshot, we apply its updates to the final row state. We also use the fact that LSM tree levels roughly correspond to their write time, allowing us to stop searching once the first matching row is found for a given snapshot. For each level below that, we only need to apply the most recent row to the final row state, which limits the number of merges to at most the number of levels, which is usually small.
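The snapshot lookup within a single SST can be sketched roughly as below. This is a simplified model under stated assumptions: versions are plain integers rather than (Step, TxId) pairs, the history is kept per key in a dictionary rather than in a companion SST, and the names `VersionedSst`, `add_row`, and `read` are hypothetical.

```python
import bisect


class VersionedSst:
    """Toy SST: newest version in the main part, older ones in 'history'."""

    def __init__(self):
        self.main = {}     # key -> (version, row, has_history)
        self.history = {}  # key -> (negated versions ascending, rows)

    def add_row(self, key, versions):
        """versions: non-empty list of (version, row) pairs in any order."""
        ordered = sorted(versions, key=lambda pair: pair[0], reverse=True)
        newest_version, newest_row = ordered[0]
        older = ordered[1:]
        self.main[key] = (newest_version, newest_row, bool(older))
        if older:
            # Versions are stored in descending order; negating them yields
            # an ascending list that bisect can search directly.
            self.history[key] = (
                [-version for version, _ in older],
                [row for _, row in older],
            )

    def read(self, key, snapshot):
        entry = self.main.get(key)
        if entry is None:
            return None
        version, row, has_history = entry
        if version <= snapshot:
            return row  # fast path: the most recent version is visible
        if not has_history:
            return None
        negated_versions, rows = self.history[key]
        # First history entry whose version is <= snapshot.
        index = bisect.bisect_left(negated_versions, -snapshot)
        return rows[index] if index < len(rows) else None
```

Reads at the latest version never touch the history structure, which reflects the design goal of keeping the common path fast.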
Rows in SSTs are effectively deltas; nonetheless, they are stored pre-merged from the viewpoint of a given SST, which helps with both search and compaction complexity. Let's imagine a hypothetical situation where the user writes 1 million updates to some key K, each time modifying one of a multitude of columns. As a write-optimized storage, we prefer blind writes: we don't read the full row before updating and writing a new updated row; instead, we write an update that says "update column C for key K". If we didn't store pre-merged state at each level, there would soon be 1 million deltas for the key K, each at a different version. Then each read would potentially need to consider applying all 1 million deltas to the row. Instead, we merge updates at the same level into aggregate updates, starting with the MemTable (where the previous row state is always in memory and we don't need to read from disk). When compacting several levels into a new SST, we only need to iterate over each update version and merge it with the most recent version in the SSTs below. This limits merge complexity both at compaction (the number of merges for each version is limited by the number of levels) and at read time, while still allowing us to perform blind writes.
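Pre-merging within one level can be illustrated with a small sketch. It assumes integer versions that arrive in increasing order and models column updates as dictionaries; `PreMergingMemTable` and `merge_with_lower_level` are hypothetical names, not LocalDB identifiers.

```python
class PreMergingMemTable:
    """Toy MemTable that keeps each version pre-merged with earlier ones."""

    def __init__(self):
        # key -> list of (version, merged_update), newest first
        self.chains = {}

    def write(self, key, version, update):
        # Blind write: nothing is read from disk. The previous state of this
        # level is already in memory, so merging with it is cheap.
        chain = self.chains.setdefault(key, [])
        merged = dict(chain[0][1]) if chain else {}
        merged.update(update)
        chain.insert(0, (version, merged))

    def read(self, key, snapshot):
        # The first visible entry already aggregates all older updates of
        # this level, so no further merging within the level is needed.
        for version, merged in self.chains.get(key, []):
            if version <= snapshot:
                return merged
        return None


def merge_with_lower_level(upper_update, lower_row):
    """Apply an aggregate update from an upper level over a lower-level row."""
    result = dict(lower_row or {})
    result.update(upper_update)
    return result
```

With this layout a read performs at most one merge per level, regardless of how many individual updates were written to the key.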
47
47
48
48

49
49
50
-
Eventually, we mark version ranges as deleted and no longer readable, after which compactions allow us to garbage collect unnecessary row versions automatically (unreachable versions are skipped over and not emitted when writing new SSTs). We also store a small per-version histogram for each SST, so we can detect when too much unnecessary data accumulates in the LSM-tree and trigger additional compactions for garbage collection.
50
+
Eventually, we mark version ranges as deleted and no longer readable, after which compactions allow us to garbage collect unnecessary row versions automatically (unreachable versions are skipped over and not emitted when writing new SSTs). We also store a small per-version histogram for each SST, so we can detect when too much unnecessary data accumulates in the LSM tree and trigger additional compactions for garbage collection.
ydb/docs/en/core/contributor/localdb-uncommitted-txs.md
+3 -3 Lines changed: 3 additions & 3 deletions
@@ -1,6 +1,6 @@
1
1
# LocalDB: persistent uncommitted changes
2
2
3
-
Tablets may need to store a potentially large amount of data over a potentially long time, and then either commit or rollback all accumulated changes atomically without keeping them in-memory. To support this [LocalDB](https://github.com/ydb-platform/ydb/blob/main/ydb/core/tablet_flat/flat_database.h) allows table changes to be marked with a unique 64-bit transaction id (TxId), which are stored in the given table alongside committed data, but not visible until the given TxId is committed. The commit or rollback itself is atomic and very cheap, with committed data eventually integrated into the table as if written normally in the first place.
3
+
Tablets may need to store a potentially large amount of data over a potentially long time, and then either commit or roll back all accumulated changes atomically without keeping them in memory. To support this, [LocalDB](../concepts/glossary.md#local-database) allows table changes to be marked with a unique 64-bit transaction id (TxId). Such changes are stored in the given table alongside committed data, but are not visible until the given TxId is committed. The commit or rollback itself is atomic and very cheap, with committed data eventually integrated into the table as if written normally in the first place.
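The visibility rule can be sketched with a toy table in which commit and rollback only record the TxId's status and never copy data. This is an illustration under stated assumptions, not LocalDB's interface: `TxAwareTable` and its methods are hypothetical, and, as a simplification, the latest visible entry in write order wins on read.

```python
class TxAwareTable:
    """Toy table that stores uncommitted changes next to committed data."""

    def __init__(self):
        # key -> list of (tx_id or None, value) in write order;
        # tx_id is None for data that was written as committed.
        self.rows = {}
        self.committed_txs = set()
        self.removed_txs = set()

    def write(self, key, value, tx_id=None):
        self.rows.setdefault(key, []).append((tx_id, value))

    def commit(self, tx_id):
        # Cheap and atomic: only the status of the TxId changes.
        self.committed_txs.add(tx_id)

    def rollback(self, tx_id):
        # Recorded so that a later cleanup could drop the entries.
        self.removed_txs.add(tx_id)

    def read(self, key):
        visible = None
        for tx_id, value in self.rows.get(key, []):
            if tx_id is None or tx_id in self.committed_txs:
                visible = value
        return visible
```

A background process such as compaction could later fold committed entries into ordinary rows and discard removed ones.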
4
4
5
5
This feature is used as a building block for various other features:
6
6
@@ -26,7 +26,7 @@ Redo log (see [flat_redo_writer.h](https://github.com/ydb-platform/ydb/blob/main
26
26
27
27
## Storing uncommitted changes in MemTables
28
28
29
-
[MemTable](https://github.com/ydb-platform/ydb/blob/0adff98ae52cb826f7fb9705503e430b9812994f/ydb/core/tablet_flat/flat_mem_warm.h#L180) in LocalDB is a relatively small in-memory sorted tree that maps table keys to values. MemTable value is a chain of MVCC (partial) rows, each tagged with a row version (a pair of Step and TxId which is a global timestamp). Rows are normally pre-merged across the given MemTable. For example, let's suppose there have been the following operations for some key K:
29
+
[MemTable](../concepts/glossary.md#memtable) in LocalDB is a relatively small in-memory sorted tree that maps table keys to values. A MemTable value is a chain of MVCC (partial) rows, each tagged with a row version (a pair of Step and TxId, which is a global timestamp). Rows are normally pre-merged across the given MemTable. For example, let's suppose there have been the following operations for some key K:
30
30
31
31
| Version | Operation |
32
32
--- | ---
@@ -86,7 +86,7 @@ Notice how the new record has its state pre-merged, including the previously com
86
86
87
87
## Compacting uncommitted changes
88
88
89
-
Compaction takes some parts from the table, merges them in a sorted order, and writes as a new SST, which replaces compacted data. When compacting MemTable it also implies compacting the relevant redo log, and includes `EvRemoveTx`/`EvCommitTx` events, which affect change visibility and must also end up in persistent storage. LocalDB writes TxStatus blobs (see [flat_page_txstatus.h](https://github.com/ydb-platform/ydb/blob/main/ydb/core/tablet_flat/flat_page_txstatus.h)), which store a list of committed and removed transactions, and replace the compacted redo log in regard to `EvRemoveTx`/`EvCommitTx` events. Compaction uses the latest transaction status maps, but it filters them leaving only those transactions that are mentioned in the relevant MemTables or previous TxStatus pages, so that it matches the compacted redo log.
89
+
Compaction takes some parts from the table, merges them in sorted order, and writes them as a new [SST](../concepts/glossary.md#sst), which replaces the compacted data. Compacting a MemTable also implies compacting the relevant redo log, which includes `EvRemoveTx`/`EvCommitTx` events that affect change visibility and must also end up in persistent storage. LocalDB writes TxStatus blobs (see [flat_page_txstatus.h](https://github.com/ydb-platform/ydb/blob/main/ydb/core/tablet_flat/flat_page_txstatus.h)), which store a list of committed and removed transactions and replace the compacted redo log with regard to `EvRemoveTx`/`EvCommitTx` events. Compaction uses the latest transaction status maps, but filters them, leaving only those transactions that are mentioned in the relevant MemTables or previous TxStatus pages, so that the result matches the compacted redo log.
90
90
91
91
Data pages (see [flat_page_data.h](https://github.com/ydb-platform/ydb/blob/main/ydb/core/tablet_flat/flat_page_data.h)) store uncommitted deltas from MemTables (or other SSTs) aggregated by their TxId in the same order just before the primary record. Records may have MVCC flags (HasHistory, IsVersioned, IsErased), which specify whether MVCC fields and data are present. Delta records have an [IsDelta flag](https://github.com/ydb-platform/ydb/blob/0adff98ae52cb826f7fb9705503e430b9812994f/ydb/core/tablet_flat/flat_page_data.h#L98), which is really a HasHistory flag without other MVCC flags. Since this combination was never used by previous versions (the HasHistory flag was only ever used together with the IsVersioned flag, because you could not have history rows without a versioned record), it clearly identifies the record as an uncommitted delta. Delta records have a [TDelta](https://github.com/ydb-platform/ydb/blob/0adff98ae52cb826f7fb9705503e430b9812994f/ydb/core/tablet_flat/flat_page_data.h#L66) info immediately after the fixed record data, which specifies the TxId of the uncommitted delta.