Commit 0031c64

Merge pull request #9143 from lichuang/rename_optimistic
feat: use `analyze table` instead of `optimize table statistic`
2 parents 8aab437 + 0bf47c1 commit 0031c64

32 files changed: 471 additions and 217 deletions

docs/doc/14-sql-commands/00-ddl/20-table/60-optimize-table.md

Lines changed: 3 additions & 78 deletions
@@ -8,7 +8,7 @@ The objective of optimizing a table in Databend is to compact or purge its histo
88
Databend's Time Travel feature relies on historical data. If you purge historical data from a table with the command `OPTIMIZE TABLE <your_table> PURGE` or `OPTIMIZE TABLE <your_table> ALL`, the table will not be eligible for time travel. The command removes all snapshots (except the most recent one) and their associated segments, block files, and table statistic file.
99
:::
1010

11-
## What are Snapshot, Segment, Block and Table statistic file?
11+
## What are Snapshot, Segment, and Block?
1212

1313
Snapshot, segment, and block are the concepts Databend uses for data storage. Databend uses them to construct a hierarchical structure for storing table data.
1414

@@ -20,8 +20,6 @@ A snapshot is a JSON file that does not save the table's data but indicate the s
2020

2121
A segment is a JSON file that organizes the storage blocks (at least 1, at most 1,000) where the data is stored. If you run [FUSE_SEGMENT](../../../15-sql-functions/111-system-functions/fuse_segment.md) against a snapshot with the snapshot ID, you can find which segments are referenced by the snapshot.
2222

23-
A table statistic file is a JSON file that save table statistic data, such as distinct values of table column.
24-
2523
Databend saves actual table data in parquet files and considers each parquet file as a block. If you run [FUSE_BLOCK](../../../15-sql-functions/111-system-functions/fuse_block.md) against a snapshot with the snapshot ID, you can find which blocks are referenced by the snapshot.
2624

2725
Databend creates a unique ID for each database and table for storing the snapshot, segment, and block files and saves them to your object storage in the path `<bucket_name>/[root]/<db_id>/<table_id>/`. Each snapshot, segment, and block file is named with a UUID (32-character lowercase hexadecimal string).
@@ -31,7 +29,6 @@ Databend creates a unique ID for each database and table for storing the snapsho
3129
| Snapshot | JSON | `<32bitUUID>_<version>.json` | `<bucket_name>/[root]/<db_id>/<table_id>/_ss/` |
3230
| Segment | JSON | `<32bitUUID>_<version>.json` | `<bucket_name>/[root]/<db_id>/<table_id>/_sg/` |
3331
| Block | parquet | `<32bitUUID>_<version>.parquet` | `<bucket_name>/[root]/<db_id>/<table_id>/_b/` |
34-
| Table statistic | JSON | `<32bitUUID>_<version>.json` | `<bucket_name>/[root]/<db_id>/<table_id>/_ts/` |
3532

3633
## Table Optimization Considerations
3734

@@ -67,12 +64,13 @@ Optimizing a table could be time-consuming, especially for large ones. Databend
6764
## Syntax
6865

6966
```sql
70-
OPTIMIZE TABLE [database.]table_name [ PURGE | COMPACT | ALL | STATISTIC ] [SEGMENT] [LIMIT <segment_count>]
67+
OPTIMIZE TABLE [database.]table_name [ PURGE | COMPACT | ALL ] [SEGMENT] [LIMIT <segment_count>]
7168
```
7269

7370
- `OPTIMIZE TABLE <table_name> PURGE`
7471

7572
Purges the historical data of the table. Only the latest snapshot (including the segments, blocks, and table statistic file referenced by this snapshot) will be kept.
73+
(For more information about the table statistic file, see [ANALYZE TABLE](./80-analyze-table.md).)
7674

7775
- `OPTIMIZE TABLE <table_name> COMPACT [LIMIT <segment_count>]`
7876

@@ -97,13 +95,6 @@ OPTIMIZE TABLE [database.]table_name [ PURGE | COMPACT | ALL | STATISTIC ] [SEGM
9795

9896
Works the same way as `OPTIMIZE TABLE <table_name> PURGE`.
9997

100-
- `OPTIMIZE TABLE <table_name> STATISTIC`
101-
102-
Estimates the number of distinct values of each column in a table.
103-
104-
- It does not display the estimated results after execution. To show the estimated results, use the function [FUSE_STATISTIC](../../../15-sql-functions/111-system-functions/fuse_statistic.md).
105-
- The command does not identify distinct values by comparing them but by counting the number of storage segments and blocks. This might lead to a significant difference between the estimated results and the actual value, for example, multiple blocks holding the same value. In this case, Databend recommends compacting the storage segments and blocks to merge them as much as possible before you run the estimation.
106-
10798
## Examples
10899

109100
This example compacts and purges historical data from a table:
@@ -162,70 +153,4 @@ mysql> select snapshot_id, segment_count, block_count, row_count from fuse_snaps
162153
+----------------------------------+---------------+-------------+-----------+
163154
| 4f33a63031424ed095b8c2f9e8b15ecb | 16 | 16 | 10000005 |
164155
+----------------------------------+---------------+-------------+-----------+
165-
```
166-
167-
This example estimates the number of distinct values for each column in a table and shows the results with the function FUSE_STATISTIC:
168-
169-
```sql
170-
create table t(a uint64);
171-
172-
insert into t values (5);
173-
insert into t values (6);
174-
insert into t values (7);
175-
176-
select * from t order by a;
177-
178-
----
179-
5
180-
6
181-
7
182-
183-
-- FUSE_STATISTIC will not return any results until you run an estimation with OPTIMIZE TABLE.
184-
select * from fuse_statistic('db_09_0020', 't');
185-
186-
optimize table `t` statistic;
187-
188-
select * from fuse_statistic('db_09_0020', 't');
189-
190-
----
191-
(0,3);
192-
193-
194-
insert into t values (5);
195-
insert into t values (6);
196-
insert into t values (7);
197-
198-
select * from t order by a;
199-
200-
----
201-
5
202-
5
203-
6
204-
6
205-
7
206-
7
207-
208-
-- FUSE_STATISTIC returns results of your last estimation. To get the most recent estimated values, run the estimation again.
209-
-- OPTIMIZE TABLE does not identify distinct values by comparing them but by counting the number of storage segments and blocks.
210-
select * from fuse_statistic('db_09_0020', 't');
211-
212-
----
213-
(0,3);
214-
215-
optimize table `t` statistic;
216-
217-
select * from fuse_statistic('db_09_0020', 't');
218-
219-
----
220-
(0,6);
221-
222-
-- Best practice: Compact the table before running the estimation.
223-
optimize table t compact;
224-
225-
optimize table `t` statistic;
226-
227-
select * from fuse_statistic('db_09_0020', 't');
228-
229-
----
230-
(0,3);
231156
```
Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
1+
---
2+
title: ANALYZE TABLE
3+
---
4+
5+
The objective of analyzing a table in Databend is to calculate table statistics, such as the number of distinct values in each column.
6+
7+
## What is a Table Statistic File?
8+
9+
A table statistic file is a JSON file that saves table statistics, such as the number of distinct values in each column.
10+
11+
Databend creates a unique ID for each database and table and saves the table statistic files to your object storage in the path `<bucket_name>/[root]/<db_id>/<table_id>/`. Each table statistic file is named with a UUID (32-character lowercase hexadecimal string).
12+
13+
| File | Format | Filename | Storage Folder |
14+
|----------|---------|---------------------------------|----------------------------------------------------------------------------|
15+
| Table statistic | JSON | `<32bitUUID>_<version>.json` | `<bucket_name>/[root]/<db_id>/<table_id>/_ts/` |
16+
17+
## Syntax
18+
```sql
19+
ANALYZE TABLE [database.]table_name
20+
```
21+
22+
- `ANALYZE TABLE <table_name>`
23+
24+
Estimates the number of distinct values of each column in a table.
25+
26+
- It does not display the estimated results after execution. To show the estimated results, use the function [FUSE_STATISTIC](../../../15-sql-functions/111-system-functions/fuse_statistic.md).
27+
- The command does not identify distinct values by comparing them but by counting the number of storage segments and blocks. The estimate can therefore differ significantly from the actual value, for example, when multiple blocks hold the same value. In this case, Databend recommends compacting the storage segments and blocks to merge them as much as possible before you run the estimation.
28+
29+
## Examples
30+
31+
This example estimates the number of distinct values for each column in a table and shows the results with the function FUSE_STATISTIC:
32+
33+
```sql
34+
create table t(a uint64);
35+
36+
insert into t values (5);
37+
insert into t values (6);
38+
insert into t values (7);
39+
40+
select * from t order by a;
41+
42+
----
43+
5
44+
6
45+
7
46+
47+
-- FUSE_STATISTIC will not return any results until you run an estimation with ANALYZE TABLE.
48+
select * from fuse_statistic('db_09_0020', 't');
49+
50+
analyze table `t`;
51+
52+
select * from fuse_statistic('db_09_0020', 't');
53+
54+
----
55+
(0,3);
56+
57+
58+
insert into t values (5);
59+
insert into t values (6);
60+
insert into t values (7);
61+
62+
select * from t order by a;
63+
64+
----
65+
5
66+
5
67+
6
68+
6
69+
7
70+
7
71+
72+
-- FUSE_STATISTIC returns results of your last estimation. To get the most recent estimated values, run the estimation again.
73+
-- ANALYZE TABLE does not identify distinct values by comparing them but by counting the number of storage segments and blocks.
74+
select * from fuse_statistic('db_09_0020', 't');
75+
76+
----
77+
(0,3);
78+
79+
analyze table `t`;
80+
81+
select * from fuse_statistic('db_09_0020', 't');
82+
83+
----
84+
(0,6);
85+
86+
-- Best practice: Compact the table before running the estimation.
87+
optimize table t compact;
88+
89+
analyze table `t`;
90+
91+
select * from fuse_statistic('db_09_0020', 't');
92+
93+
----
94+
(0,3);
95+
```
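
The example above, where the estimate moves from 3 to 6 after a second round of inserts and returns to 3 after compaction, can be modeled with a small sketch. The Rust code below is not Databend's implementation. It assumes, purely for illustration, that each `INSERT` produced its own block, that the estimate is the sum of per-block distinct counts, and that compaction merges every block into one. The names `Block`, `estimate_distinct`, and `compact` are hypothetical.

```rust
use std::collections::HashSet;

/// A block is modeled as the values stored for a single column.
type Block = Vec<u64>;

/// Hypothetical estimator: sums each block's own distinct count and never
/// compares values across blocks (an assumption made for illustration).
fn estimate_distinct(blocks: &[Block]) -> usize {
    blocks
        .iter()
        .map(|block| block.iter().collect::<HashSet<_>>().len())
        .sum()
}

/// Hypothetical compaction: merges all blocks into one block.
fn compact(blocks: &[Block]) -> Vec<Block> {
    vec![blocks.iter().flatten().copied().collect()]
}
```

Under these assumptions, six single-row blocks holding 5, 6, 7, 5, 6, 7 give an estimate of 6, while the compacted table gives 3, which matches the sequence of `FUSE_STATISTIC` results in the example.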

docs/doc/15-sql-functions/111-system-functions/fuse_statistic.md

Lines changed: 1 addition & 1 deletion
@@ -16,4 +16,4 @@ FUSE_STATISTIC('<database_name>', '<table_name>')
1616

1717
## Examples
1818

19-
You're most likely to use this function together with `OPTIMIZE TABLE <table_name> STATISTIC` to generate and check the statistical information of a table. For more explanations and examples, see [OPTIMIZE TABLE](../../14-sql-commands/00-ddl/20-table/60-optimize-table.md).
19+
You're most likely to use this function together with `ANALYZE TABLE <table_name>` to generate and check the statistical information of a table. For more explanations and examples, see [ANALYZE TABLE](../../14-sql-commands/00-ddl/20-table/80-analyze-table.md).

src/query/ast/src/ast/format/ast_format.rs

Lines changed: 11 additions & 0 deletions
@@ -1371,6 +1371,17 @@ impl<'ast> Visitor<'ast> for AstFormatVisitor {
13711371
self.children.push(node);
13721372
}
13731373

1374+
fn visit_analyze_table(&mut self, stmt: &'ast AnalyzeTableStmt<'ast>) {
1375+
let mut children = Vec::new();
1376+
self.visit_table_ref(&stmt.catalog, &stmt.database, &stmt.table);
1377+
children.push(self.children.pop().unwrap());
1378+
1379+
let name = "AnalyzeTable".to_string();
1380+
let format_ctx = AstFormatContext::with_children(name, children.len());
1381+
let node = FormatTreeNode::with_children(format_ctx, children);
1382+
self.children.push(node);
1383+
}
1384+
13741385
fn visit_exists_table(&mut self, stmt: &'ast ExistsTableStmt<'ast>) {
13751386
self.visit_table_ref(&stmt.catalog, &stmt.database, &stmt.table);
13761387
let child = self.children.pop().unwrap();

src/query/ast/src/ast/statements/statement.rs

Lines changed: 2 additions & 0 deletions
@@ -105,6 +105,7 @@ pub enum Statement<'a> {
105105
RenameTable(RenameTableStmt<'a>),
106106
TruncateTable(TruncateTableStmt<'a>),
107107
OptimizeTable(OptimizeTableStmt<'a>),
108+
AnalyzeTable(AnalyzeTableStmt<'a>),
108109
ExistsTable(ExistsTableStmt<'a>),
109110

110111
// Views
@@ -295,6 +296,7 @@ impl<'a> Display for Statement<'a> {
295296
Statement::RenameTable(stmt) => write!(f, "{stmt}")?,
296297
Statement::TruncateTable(stmt) => write!(f, "{stmt}")?,
297298
Statement::OptimizeTable(stmt) => write!(f, "{stmt}")?,
299+
Statement::AnalyzeTable(stmt) => write!(f, "{stmt}")?,
298300
Statement::ExistsTable(stmt) => write!(f, "{stmt}")?,
299301
Statement::CreateView(stmt) => write!(f, "{stmt}")?,
300302
Statement::AlterView(stmt) => write!(f, "{stmt}")?,

src/query/ast/src/ast/statements/table.rs

Lines changed: 22 additions & 2 deletions
@@ -411,6 +411,28 @@ impl Display for OptimizeTableStmt<'_> {
411411
}
412412
}
413413

414+
#[derive(Debug, Clone, PartialEq)]
415+
pub struct AnalyzeTableStmt<'a> {
416+
pub catalog: Option<Identifier<'a>>,
417+
pub database: Option<Identifier<'a>>,
418+
pub table: Identifier<'a>,
419+
}
420+
421+
impl Display for AnalyzeTableStmt<'_> {
422+
fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
423+
write!(f, "ANALYZE TABLE ")?;
424+
write_period_separated_list(
425+
f,
426+
self.catalog
427+
.iter()
428+
.chain(&self.database)
429+
.chain(Some(&self.table)),
430+
)?;
431+
432+
Ok(())
433+
}
434+
}
435+
414436
#[derive(Debug, Clone, PartialEq, Eq)]
415437
pub struct ExistsTableStmt<'a> {
416438
pub catalog: Option<Identifier<'a>>,
@@ -462,7 +484,6 @@ pub enum CompactTarget {
462484
pub enum OptimizeTableAction<'a> {
463485
All,
464486
Purge,
465-
Statistic,
466487
Compact {
467488
target: CompactTarget,
468489
limit: Option<Expr<'a>>,
@@ -474,7 +495,6 @@ impl<'a> Display for OptimizeTableAction<'a> {
474495
match self {
475496
OptimizeTableAction::All => write!(f, "ALL"),
476497
OptimizeTableAction::Purge => write!(f, "PURGE"),
477-
OptimizeTableAction::Statistic => write!(f, "STATISTIC"),
478498
OptimizeTableAction::Compact { target, limit } => {
479499
match target {
480500
CompactTarget::Block => {
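
To see how the `Display` implementation for `AnalyzeTableStmt` renders the optional catalog and database, the following self-contained sketch may help. It is a simplified stand-in rather than the code from this commit: plain `String` replaces `Identifier<'a>`, and `write_period_separated_list` is re-implemented here as a hypothetical helper that only mirrors the role of the real function.

```rust
use std::fmt::{self, Display, Formatter};

/// Simplified stand-in for the statement in the diff (String instead of Identifier).
#[derive(Debug, Clone, PartialEq)]
pub struct AnalyzeTableStmt {
    pub catalog: Option<String>,
    pub database: Option<String>,
    pub table: String,
}

/// Hypothetical helper: writes the items separated by periods.
fn write_period_separated_list<'a>(
    f: &mut Formatter<'_>,
    items: impl IntoIterator<Item = &'a String>,
) -> fmt::Result {
    for (i, item) in items.into_iter().enumerate() {
        if i > 0 {
            write!(f, ".")?;
        }
        write!(f, "{item}")?;
    }
    Ok(())
}

impl Display for AnalyzeTableStmt {
    fn fmt(&self, f: &mut Formatter<'_>) -> fmt::Result {
        write!(f, "ANALYZE TABLE ")?;
        // Chaining the two Options skips whichever parts are absent.
        write_period_separated_list(
            f,
            self.catalog
                .iter()
                .chain(&self.database)
                .chain(Some(&self.table)),
        )
    }
}
```

Chaining `Option` values as iterators means that a missing catalog or database contributes no element, so no stray period is written.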

src/query/ast/src/parser/statement.rs

Lines changed: 13 additions & 1 deletion
@@ -515,6 +515,18 @@ pub fn statement(i: Input) -> IResult<StatementMsg> {
515515
})
516516
},
517517
);
518+
let analyze_table = map(
519+
rule! {
520+
ANALYZE ~ TABLE ~ #peroid_separated_idents_1_to_3
521+
},
522+
|(_, _, (catalog, database, table))| {
523+
Statement::AnalyzeTable(AnalyzeTableStmt {
524+
catalog,
525+
database,
526+
table,
527+
})
528+
},
529+
);
518530
let exists_table = map(
519531
rule! {
520532
EXISTS ~ TABLE ~ #peroid_separated_idents_1_to_3
@@ -991,6 +1003,7 @@ pub fn statement(i: Input) -> IResult<StatementMsg> {
9911003
| #rename_table : "`RENAME TABLE [<database>.]<table> TO <new_table>`"
9921004
| #truncate_table : "`TRUNCATE TABLE [<database>.]<table> [PURGE]`"
9931005
| #optimize_table : "`OPTIMIZE TABLE [<database>.]<table> (ALL | PURGE | COMPACT [SEGMENT])`"
1006+
| #analyze_table : "`ANALYZE TABLE [<database>.]<table>`"
9941007
| #exists_table : "`EXISTS TABLE [<database>.]<table>`"
9951008
),
9961009
rule!(
@@ -1449,7 +1462,6 @@ pub fn optimize_table_action(i: Input) -> IResult<OptimizeTableAction> {
14491462
alt((
14501463
value(OptimizeTableAction::All, rule! { ALL }),
14511464
value(OptimizeTableAction::Purge, rule! { PURGE }),
1452-
value(OptimizeTableAction::Statistic, rule! { STATISTIC }),
14531465
map(
14541466
rule! { COMPACT ~ (SEGMENT)? ~ ( LIMIT ~ ^#expr )?},
14551467
|(_, opt_segment, opt_limit)| OptimizeTableAction::Compact {
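
The new `analyze_table` rule maps one to three period-separated identifiers onto `catalog`, `database`, and `table`. The sketch below illustrates that mapping with a hand-written parser. It does not use the `rule!` macro or the project's tokenizer, it ignores quoting and trailing semicolons, and `parse_analyze_table` is a hypothetical name introduced only for this example.

```rust
/// Simplified stand-in for the AST node (String instead of Identifier).
#[derive(Debug, PartialEq)]
pub struct AnalyzeTableStmt {
    pub catalog: Option<String>,
    pub database: Option<String>,
    pub table: String,
}

/// Hypothetical parser for `ANALYZE TABLE [[catalog.]database.]table`.
pub fn parse_analyze_table(sql: &str) -> Option<AnalyzeTableStmt> {
    let mut words = sql.split_whitespace();
    // Keywords are matched case-insensitively.
    if !words.next()?.eq_ignore_ascii_case("ANALYZE") {
        return None;
    }
    if !words.next()?.eq_ignore_ascii_case("TABLE") {
        return None;
    }
    let name = words.next()?;
    // Anything after the table reference is rejected.
    if words.next().is_some() {
        return None;
    }
    let parts: Vec<&str> = name.split('.').collect();
    if parts.iter().any(|part| part.is_empty()) {
        return None;
    }
    // The last identifier is always the table; earlier ones are optional.
    match parts.as_slice() {
        [table] => Some(AnalyzeTableStmt {
            catalog: None,
            database: None,
            table: table.to_string(),
        }),
        [database, table] => Some(AnalyzeTableStmt {
            catalog: None,
            database: Some(database.to_string()),
            table: table.to_string(),
        }),
        [catalog, database, table] => Some(AnalyzeTableStmt {
            catalog: Some(catalog.to_string()),
            database: Some(database.to_string()),
            table: table.to_string(),
        }),
        _ => None,
    }
}
```

With this sketch, `parse_analyze_table("analyze table db.t")` yields a statement with no catalog, database `db`, and table `t`, while a four-part name is rejected.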

src/query/ast/src/visitors/visitor.rs

Lines changed: 2 additions & 0 deletions
@@ -436,6 +436,8 @@ pub trait Visitor<'ast>: Sized {
436436

437437
fn visit_optimize_table(&mut self, _stmt: &'ast OptimizeTableStmt<'ast>) {}
438438

439+
fn visit_analyze_table(&mut self, _stmt: &'ast AnalyzeTableStmt<'ast>) {}
440+
439441
fn visit_exists_table(&mut self, _stmt: &'ast ExistsTableStmt<'ast>) {}
440442

441443
fn visit_create_view(&mut self, _stmt: &'ast CreateViewStmt<'ast>) {}

src/query/ast/src/visitors/visitor_mut.rs

Lines changed: 2 additions & 0 deletions
@@ -439,6 +439,8 @@ pub trait VisitorMut: Sized {
439439

440440
fn visit_optimize_table(&mut self, _stmt: &mut OptimizeTableStmt<'_>) {}
441441

442+
fn visit_analyze_table(&mut self, _stmt: &mut AnalyzeTableStmt<'_>) {}
443+
442444
fn visit_exists_table(&mut self, _stmt: &mut ExistsTableStmt<'_>) {}
443445

444446
fn visit_create_view(&mut self, _stmt: &mut CreateViewStmt<'_>) {}

src/query/ast/src/visitors/walk.rs

Lines changed: 1 addition & 0 deletions
@@ -346,6 +346,7 @@ pub fn walk_statement<'a, V: Visitor<'a>>(visitor: &mut V, statement: &'a Statem
346346
Statement::RenameTable(stmt) => visitor.visit_rename_table(stmt),
347347
Statement::TruncateTable(stmt) => visitor.visit_truncate_table(stmt),
348348
Statement::OptimizeTable(stmt) => visitor.visit_optimize_table(stmt),
349+
Statement::AnalyzeTable(stmt) => visitor.visit_analyze_table(stmt),
349350
Statement::ExistsTable(stmt) => visitor.visit_exists_table(stmt),
350351
Statement::CreateView(stmt) => visitor.visit_create_view(stmt),
351352
Statement::AlterView(stmt) => visitor.visit_alter_view(stmt),
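
The visitor changes follow a common pattern: the trait gains a no-op default method, and `walk_statement` gains a match arm that dispatches to it, so existing visitors keep compiling and only interested visitors override the hook. A reduced sketch of that pattern follows. The types are trimmed-down stand-ins without lifetimes, and `AnalyzedTables` is a hypothetical visitor that is not part of the commit.

```rust
pub struct OptimizeTableStmt {
    pub table: String,
}

pub struct AnalyzeTableStmt {
    pub table: String,
}

pub enum Statement {
    OptimizeTable(OptimizeTableStmt),
    AnalyzeTable(AnalyzeTableStmt),
}

pub trait Visitor: Sized {
    // Default bodies are empty, so implementors override only what they need.
    fn visit_optimize_table(&mut self, _stmt: &OptimizeTableStmt) {}
    fn visit_analyze_table(&mut self, _stmt: &AnalyzeTableStmt) {}
}

/// Dispatches each statement variant to the matching visitor hook.
pub fn walk_statement<V: Visitor>(visitor: &mut V, statement: &Statement) {
    match statement {
        Statement::OptimizeTable(stmt) => visitor.visit_optimize_table(stmt),
        Statement::AnalyzeTable(stmt) => visitor.visit_analyze_table(stmt),
    }
}

/// Hypothetical visitor that records the tables named in ANALYZE TABLE statements.
#[derive(Default)]
pub struct AnalyzedTables {
    pub names: Vec<String>,
}

impl Visitor for AnalyzedTables {
    fn visit_analyze_table(&mut self, stmt: &AnalyzeTableStmt) {
        self.names.push(stmt.table.clone());
    }
}
```

Walking a list that contains one `OptimizeTable` and one `AnalyzeTable` statement with `AnalyzedTables` records only the analyzed table, because the optimize hook keeps its empty default.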
