
Commit b3872b9

Update create_statistics.sgml
1 parent 0b98792 commit b3872b9

1 file changed: +206 -12 lines changed

postgresql/doc/src/sgml/ref/create_statistics.sgml

Lines changed: 206 additions & 12 deletions
@@ -40,16 +40,24 @@ ____________________________________________________________________________-->
 <refsynopsisdiv>
 <!--==========================orignal english content==========================
 <synopsis>
+CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="parameter">statistics_name</replaceable>
+ON ( <replaceable class="parameter">expression</replaceable> )
+FROM <replaceable class="parameter">table_name</replaceable>
+
 CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="parameter">statistics_name</replaceable>
 [ ( <replaceable class="parameter">statistics_kind</replaceable> [, ... ] ) ]
-ON <replaceable class="parameter">column_name</replaceable>, <replaceable class="parameter">column_name</replaceable> [, ...]
+ON { <replaceable class="parameter">column_name</replaceable> | ( <replaceable class="parameter">expression</replaceable> ) }, { <replaceable class="parameter">column_name</replaceable> | ( <replaceable class="parameter">expression</replaceable> ) } [, ...]
 FROM <replaceable class="parameter">table_name</replaceable>
 </synopsis>
 ____________________________________________________________________________-->
 <synopsis>
+CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="parameter">statistics_name</replaceable>
+ON ( <replaceable class="parameter">expression</replaceable> )
+FROM <replaceable class="parameter">table_name</replaceable>
+
 CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="parameter">statistics_name</replaceable>
 [ ( <replaceable class="parameter">statistics_kind</replaceable> [, ... ] ) ]
-ON <replaceable class="parameter">column_name</replaceable>, <replaceable class="parameter">column_name</replaceable> [, ...]
+ON { <replaceable class="parameter">column_name</replaceable> | ( <replaceable class="parameter">expression</replaceable> ) }, { <replaceable class="parameter">column_name</replaceable> | ( <replaceable class="parameter">expression</replaceable> ) } [, ...]
 FROM <replaceable class="parameter">table_name</replaceable>
 </synopsis>

@@ -75,6 +83,28 @@ ____________________________________________________________________________-->
 被发出该命令的用户所有。
 </para>

+<!--==========================orignal english content==========================
+<para>
+The <command>CREATE STATISTICS</command> command has two basic forms. The
+first form allows univariate statistics for a single expression to be
+collected, providing benefits similar to an expression index without the
+overhead of index maintenance. This form does not allow the statistics
+kind to be specified, since the various statistics kinds refer only to
+multivariate statistics. The second form of the command allows
+multivariate statistics on multiple columns and/or expressions to be
+collected, optionally specifying which statistics kinds to include. This
+form will also automatically cause univariate statistics to be collected on
+any expressions included in the list.
+</para>
+____________________________________________________________________________-->
+<para>
+<command>CREATE STATISTICS</command>命令有两种基本形式。
+第一种形式允许对被收集的单个表达式的单变量统计信息,提供了类似于表达式索引的好处,而不需要索引维护的开销。
+这种形式不允许指定统计类型,因为不同的统计类型引用只针对多元统计。
+此命令的第二种形式允许收集多个列和/或表达式的多元统计信息,可选地指定需要包括的统计信息类型。
+这种格式也会自动使得列表中包含的任何表达式上的单变量统计信息被收集。
+</para>
+
 <!--==========================orignal english content==========================
 <para>
 If a schema name is given (for example, <literal>CREATE STATISTICS
@@ -146,24 +176,26 @@ ____________________________________________________________________________-->
 <listitem>
 <!--==========================orignal english content==========================
 <para>
-A statistics kind to be computed in this statistics object.
+A multivariate statistics kind to be computed in this statistics object.
 Currently supported kinds are
 <literal>ndistinct</literal>, which enables n-distinct statistics,
 <literal>dependencies</literal>, which enables functional
 dependency statistics, and <literal>mcv</literal> which enables
 most-common values lists.
 If this clause is omitted, all supported statistics kinds are
-included in the statistics object.
+included in the statistics object. Univariate expression statistics are
+built automatically if the statistics definition includes any complex
+expressions rather than just simple column references.
 For more information, see <xref linkend="planner-stats-extended"/>
 and <xref linkend="multivariate-statistics-examples"/>.
 </para>
 ____________________________________________________________________________-->
 <para>
-在此统计对象中计算的统计种类。目前支持的种类是启用n-distinct统计的
-<literal>ndistinct</literal>,启用功能依赖性统计的<literal>dependencies</literal>,以及启用最常见的值列表的<literal>mcv</literal>。
+在此统计对象中计算的多变量统计种类。
+目前支持的种类是启用n-distinct统计的<literal>ndistinct</literal>,启用功能依赖性统计的<literal>dependencies</literal>,以及启用最常见的值列表的<literal>mcv</literal>。
 如果省略该子句,则统计对象中将包含所有支持的统计类型。
-有关更多信息,请参阅<xref linkend="planner-stats-extended"/>和
-<xref linkend="multivariate-statistics-examples"/>。
+如果统计信息定义包含任何复杂表达式而不仅仅是简单的列引用,单变量表达式统计会自动构建。
+有关更多信息,请参阅<xref linkend="planner-stats-extended"/>和<xref linkend="multivariate-statistics-examples"/>。
 </para>
 </listitem>
 </varlistentry>
@@ -177,16 +209,43 @@ ____________________________________________________________________________-->
 <!--==========================orignal english content==========================
 <para>
 The name of a table column to be covered by the computed statistics.
-At least two column names must be given; the order of the column names
-is insignificant.
+This is only allowed when building multivariate statistics. At least
+two column names or expressions must be specified, and their order is
+not significant.
 </para>
 ____________________________________________________________________________-->
 <para>
-被计算的统计信息包含的表格列的名称。至少必须给出两个列名,列名的顺序可以忽略。
+被计算的统计信息包含的表格列的名称。
+这里只在建立多变量统计信息时才被允许。
+至少必须指定两个列名或表达式,它们的顺序是不重要的。
 </para>
 </listitem>
 </varlistentry>

+<varlistentry>
+<!--==========================orignal english content==========================
+<term><replaceable class="parameter">expression</replaceable></term>
+____________________________________________________________________________-->
+<term><replaceable class="parameter">表达式</replaceable></term>
+<listitem>
+<!--==========================orignal english content==========================
+<para>
+An expression to be covered by the computed statistics. This may be
+used to build univariate statistics on a single expression, or as part
+of a list of multiple column names and/or expressions to build
+multivariate statistics. In the latter case, separate univariate
+statistics are built automatically for each expression in the list.
+</para>
+____________________________________________________________________________-->
+<para>
+由计算统计信息包含的表达式。
+这可以用于在单个表达式上构建单变量统计信息,或者作为多个列名和/或表达式的列表的一部分来构建多变量统计信息。
+在后一种情况中,将为列表中的每个表达式自动构建单独的单变量统计信息。
+</para>
+</listitem>
+</varlistentry>
+
+
 <varlistentry>
 <!--==========================orignal english content==========================
 <term><replaceable class="parameter">table_name</replaceable></term>
@@ -225,6 +284,19 @@ ____________________________________________________________________________-->
 你必须是表的所有者才能创建读取它的统计对象。不过,一旦创建,
 统计对象的所有权与基础表无关。
 </para>
+
+<!--==========================orignal english content==========================
+<para>
+Expression statistics are per-expression and are similar to creating an
+index on the expression, except that they avoid the overhead of index
+maintenance. Expression statistics are built automatically for each
+expression in the statistics object definition.
+</para>
+____________________________________________________________________________-->
+<para>
+表达式统计信息是对每个表达式的,就像在表达式上创建索引,只是它们避免了索引维护的开销。
+表达式统计信息是为统计对象定义中的每个表达式自动构建的。
+</para>
 </refsect1>

 <refsect1 id="sql-createstatistics-examples">
@@ -305,7 +377,7 @@ EXPLAIN ANALYZE SELECT * FROM t1 WHERE (a = 1) AND (b = 0);
 <!--==========================orignal english content==========================
 <para>
 Create table <structname>t2</structname> with two perfectly correlated columns
-(containing identical data), and a MCV list on those columns:
+(containing identical data), and an MCV list on those columns:

 <programlisting>
 CREATE TABLE t2 (
@@ -359,6 +431,128 @@ EXPLAIN ANALYZE SELECT * FROM t2 WHERE (a = 1) AND (b = 2);
 MCV列表为计划器提供了关于表中普遍出现的特定值的更详细的信息,以及表中未显示的值组合的选择性上限,允许它在这两种情况下产生更好的估计值。
 </para>

+<!--==========================orignal english content==========================
+<para>
+Create table <structname>t3</structname> with a single timestamp column,
+and run queries using expressions on that column. Without extended
+statistics, the planner has no information about the data distribution for
+the expressions, and uses default estimates. The planner also does not
+realize that the value of the date truncated to the month is fully
+determined by the value of the date truncated to the day. Then expression
+and ndistinct statistics are built on those two expressions:
+
+<programlisting>
+CREATE TABLE t3 (
+a timestamp
+);
+
+INSERT INTO t3 SELECT i FROM generate_series('2020-01-01'::timestamp,
+'2020-12-31'::timestamp,
+'1 minute'::interval) s(i);
+
+ANALYZE t3;
+
+-&minus; the number of matching rows will be drastically underestimated:
+EXPLAIN ANALYZE SELECT * FROM t3
+WHERE date_trunc('month', a) = '2020-01-01'::timestamp;
+
+EXPLAIN ANALYZE SELECT * FROM t3
+WHERE date_trunc('day', a) BETWEEN '2020-01-01'::timestamp
+AND '2020-06-30'::timestamp;
+
+EXPLAIN ANALYZE SELECT date_trunc('month', a), date_trunc('day', a)
+FROM t3 GROUP BY 1, 2;
+
+-&minus; build ndistinct statistics on the pair of expressions (per-expression
+-&minus; statistics are built automatically)
+CREATE STATISTICS s3 (ndistinct) ON date_trunc('month', a), date_trunc('day', a) FROM t3;
+
+ANALYZE t3;
+
+-&minus; now the row count estimates are more accurate:
+EXPLAIN ANALYZE SELECT * FROM t3
+WHERE date_trunc('month', a) = '2020-01-01'::timestamp;
+
+EXPLAIN ANALYZE SELECT * FROM t3
+WHERE date_trunc('day', a) BETWEEN '2020-01-01'::timestamp
+AND '2020-06-30'::timestamp;
+
+EXPLAIN ANALYZE SELECT date_trunc('month', a), date_trunc('day', a)
+FROM t3 GROUP BY 1, 2;
+</programlisting>
+
+Without expression and ndistinct statistics, the planner has no information
+about the number of distinct values for the expressions, and has to rely
+on default estimates. The equality and range conditions are assumed to have
+0.5% selectivity, and the number of distinct values in the expression is
+assumed to be the same as for the column (i.e. unique). This results in a
+significant underestimate of the row count in the first two queries. Moreover,
+the planner has no information about the relationship between the expressions,
+so it assumes the two <literal>WHERE</literal> and <literal>GROUP BY</literal>
+conditions are independent, and multiplies their selectivities together to
+arrive at a severe overestimate of the group count in the aggregate query.
+This is further exacerbated by the lack of accurate statistics for the
+expressions, forcing the planner to use a default ndistinct estimate for the
+expression derived from ndistinct for the column. With such statistics, the
+planner recognizes that the conditions are correlated, and arrives at much
+more accurate estimates.
+</para>
+____________________________________________________________________________-->
+<para>
+使用单个时间戳列创建表<structname>t3</structname>,并用该列上的表达式运行查询。
+没有扩展的统计信息,计划器无法获知表达式数据分布的相关信息,然后使用默认的估计值。
+计划器也没有认识到按月截断日期的值完全取决于按天截断日期的值。
+然后表达式和模糊统计构建在这两个表达式之上:
+
+
+<programlisting>
+CREATE TABLE t3 (
+a timestamp
+);
+
+INSERT INTO t3 SELECT i FROM generate_series('2020-01-01'::timestamp,
+'2020-12-31'::timestamp,
+'1 minute'::interval) s(i);
+
+ANALYZE t3;
+
+-- the number of matching rows will be drastically underestimated:
+EXPLAIN ANALYZE SELECT * FROM t3
+WHERE date_trunc('month', a) = '2020-01-01'::timestamp;
+
+EXPLAIN ANALYZE SELECT * FROM t3
+WHERE date_trunc('day', a) BETWEEN '2020-01-01'::timestamp
+AND '2020-06-30'::timestamp;
+
+EXPLAIN ANALYZE SELECT date_trunc('month', a), date_trunc('day', a)
+FROM t3 GROUP BY 1, 2;
+
+-- build ndistinct statistics on the pair of expressions (per-expression
+-- statistics are built automatically)
+CREATE STATISTICS s3 (ndistinct) ON date_trunc('month', a), date_trunc('day', a) FROM t3;
+
+ANALYZE t3;
+
+-- now the row count estimates are more accurate:
+EXPLAIN ANALYZE SELECT * FROM t3
+WHERE date_trunc('month', a) = '2020-01-01'::timestamp;
+
+EXPLAIN ANALYZE SELECT * FROM t3
+WHERE date_trunc('day', a) BETWEEN '2020-01-01'::timestamp
+AND '2020-06-30'::timestamp;
+
+EXPLAIN ANALYZE SELECT date_trunc('month', a), date_trunc('day', a)
+FROM t3 GROUP BY 1, 2;
+</programlisting>
+
+没有表达式和模糊统计信息,规划器就没有表达式的不同值的数量所相关的信息,并且不得不依赖默认估计值。
+相等和范围条件假设有0.5%的选择度,并且表达式中不同值的数量被假设为与列相同(也就是独一无二的)。
+这将导致前两个查询中的行数严重低估。
+此外,计划器没有关于表达式之间关系的信息,所以它假设两个<literal>WHERE</literal>和<literal>GROUP BY</literal>条件是独立的,并将它们的选择相乘,以得到对聚合查询中的组数的严重高估。
+由于缺乏表达式准确的统计信息,这种情况进一步加剧了,强迫计划器使用默认的ndistinct估计,对于从列的ndistinct派生的表达式。
+有了这些统计信息,规划器就能认识到这些条件是有相互关系的,并得出更准确的估计。
+</para>
+
 </refsect1>

 <refsect1>
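
Note on the t3 example in the last hunk: the arithmetic behind "drastically underestimated" can be checked outside the database. The sketch below is plain Python, not PostgreSQL code. It counts the rows that the `generate_series` call produces, applies the 0.5% default selectivity quoted in the added documentation text as a simple multiplier (an assumption made for illustration; the planner's real computation has more steps), and compares the result with the true number of January rows. All variable names are hypothetical.

```python
from datetime import datetime, timedelta

# Bounds and step copied from the generate_series() call in the t3 example.
start = datetime(2020, 1, 1)
end = datetime(2020, 12, 31)
step = timedelta(minutes=1)

# generate_series includes both endpoints, hence the + 1.
total_rows = (end - start) // step + 1

# Rows where date_trunc('month', a) = '2020-01-01', i.e. every minute of January.
january_rows = (datetime(2020, 2, 1) - start) // step

# Default selectivity quoted in the documentation text (0.5%). Using it as a
# plain multiplier is an assumption of this sketch, not the planner's exact logic.
DEFAULT_SELECTIVITY = 0.005
estimated_rows = round(total_rows * DEFAULT_SELECTIVITY)

print(f"rows in t3:             {total_rows}")
print(f"actual January rows:    {january_rows}")
print(f"default-based estimate: {estimated_rows}")
print(f"underestimate factor:   {january_rows / estimated_rows:.1f}x")
```

With these inputs the table holds 525,601 rows, of which 44,640 fall in January, while the default-based estimate is about 2,628 rows. That gap is what the expression statistics built by `CREATE STATISTICS s3` are meant to close.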
