Commit f1daf95

Update glossary and add lots of detail (#5)
This commit defines several key terms related to query optimization that we found were heavily overloaded last semester, leading to lots of confusion. Since this document will evolve over time, there are still many TODOs that will have to be fixed later.

* update glossary and add lots of detail
* add directory
* clean style
* refactor and make section for each term
* Fill in all main sections with something
* clean up ending but leave TODOs
1 parent de96ad3 commit f1daf95

1 file changed

+288
-32
lines changed

docs/src/architecture/glossary.md

Lines changed: 288 additions & 32 deletions
@@ -1,76 +1,332 @@
11
# Glossary
22

3-
Definitions in query optimization can get very overloaded. Below is the language optd developers speak.
3+
We have found internally that definitions in query optimization have become overloaded. This
4+
document defines key names and definitions for concepts that are required in optimization.
5+
6+
Many of the names and definitions will be inspired by the Cascades framework. However, there are a
7+
few important differences that need to be addressed considering our memo table will be persistent.
8+
9+
# Contents
10+
11+
- [Memo Table]
12+
- [Expression]
13+
- [Relational Expression]
14+
- [Logical Expression]
15+
- [Physical Expression]
16+
- [Scalar Expression]
17+
- **[Equivalence of Expressions](#expression-equivalence)**
18+
- [Group]
19+
- [Relational Group]
20+
- [Scalar Group]
21+
- [Query Plan]
22+
- [Logical Plan]
23+
- [Physical Plan]
24+
- [Operator] / [Plan Node]
25+
- [Relational Operator]
26+
- [Logical Operator]
27+
- [Physical Operator]
28+
- [Scalar Operator]
29+
- [Property]
30+
- [Logical Property]
31+
- [Physical Property]
32+
- ? Derived Property ?
33+
- [Rule]
34+
- [Transformation Rule]
35+
- [Implementation Rule]
36+
37+
[EQOP]: https://www.microsoft.com/en-us/research/publication/extensible-query-optimizers-in-practice/
38+
[Memo Table]: #memo-table
39+
[Expression]: #expression
40+
[Relational Expression]: #relational-expression
41+
[Logical Expression]: #logical-expression
42+
[Physical Expression]: #physical-expression
43+
[Scalar Expression]: #scalar-expression
44+
[Group]: #group
45+
[Relational Group]: #relational-group
46+
[Scalar Group]: #scalar-group
47+
[Query Plan]: #query-plan
48+
[Logical Plan]: #logical-plan
49+
[Physical Plan]: #physical-plan
50+
[Plan Node]: #operator
51+
[Operator]: #operator
52+
[Relational Operator]: #relational-operator
53+
[Logical Operator]: #logical-operator
54+
[Physical Operator]: #physical-operator
55+
[Scalar Operator]: #scalar-operator
56+
[Property]: #property
57+
[Logical Property]: #logical-property
58+
[Physical Property]: #physical-property
59+
[Rule]: #rule
60+
[Transformation Rule]: #transformation-rule
61+
[Implementation Rule]: #implementation-rule
62+
[Enforcer Rule]: #enforcer-rule
63+
[Enforcer Operator]: #enforcer-operator
464

5-
### Relational operator
6-
A **relation operator** (`RelNode`) describes an operation that can be evaluated to obtain a bag of tuples. In other literature this is also referred to as a query plan. A relational operator can be either logical or physical.
65+
# Comparison with Cascades
766

8-
### Scalar operator
67+
In the Cascades framework, an expression is a tree of operators. In `optd`, we are instead defining
68+
a logical or physical [Query Plan] to be a tree or DAG of [Operator]s. An expression in `optd`
69+
strictly refers to the representation of an operator in the [Memo Table], not in query plans.
970

10-
A **scalar operator** (`ScalarNode`) describes an operation that can be evaluated to obtain a single value. In other literature this is also referred to as a sql expression or a row expression.
71+
See the [section below](#expression) on the kinds of expressions for more
72+
information.
1173

12-
## Cascades
74+
Most other terms in `optd` are similar to Cascades or are self-explanatory.
1375

14-
### Expressions
76+
<br>
1577

16-
A **logical expression** is a tree/DAG of logical operators.
78+
# Memo Table Terms
1779

18-
A **physical expression** is a tree/DAG of physical operators.
80+
This section describes names and definitions of concepts related to the memo table.
1981

20-
The term **expression** in the context of Cascades can refer to either a relational or a scalar expression.
82+
## Memo Table
2183

22-
### Properties
84+
The memo table is the data structure used for dynamic programming in a top-down plan enumeration
85+
search algorithm. The memo table consists of a mutually recursive data structure made up of
86+
[Expression]s and [Group]s.
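
As a rough illustration of the mutual recursion described above, the following sketch models a memo table in which groups hold expressions and expressions refer back to groups by identifier. All names here (`GroupId`, `MemoExpr`, `Memo`, `add_expr`) are hypothetical and are not taken from the `optd` codebase; a real memo table would also merge groups and track properties.

```rust
use std::collections::HashMap;

/// Hypothetical identifier for a group in the memo table.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct GroupId(usize);

/// Hypothetical memo expression: an operator name plus child *groups*.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct MemoExpr {
    op: String,
    children: Vec<GroupId>,
}

/// Hypothetical memo table: groups own expressions, expressions point at groups.
#[derive(Default)]
struct Memo {
    /// `groups[i]` holds the expressions that belong to `GroupId(i)`.
    groups: Vec<Vec<MemoExpr>>,
    /// Deduplication index from an expression to the group that contains it.
    index: HashMap<MemoExpr, GroupId>,
}

impl Memo {
    /// Adds `expr` to a new group, or returns the existing group if the
    /// exact same expression has already been recorded.
    fn add_expr(&mut self, expr: MemoExpr) -> GroupId {
        if let Some(&id) = self.index.get(&expr) {
            return id;
        }
        let id = GroupId(self.groups.len());
        self.groups.push(vec![expr.clone()]);
        self.index.insert(expr, id);
        id
    }
}
```

Looking up an expression before inserting it is what lets the search reuse earlier work instead of re-deriving the same sub-plan.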
2387

24-
**Properties** are metadata computed (and sometimes stored) for each node in an expression.
25-
Properties of an expression may be **required** by the original SQL query or **derived** from **physical properties of one of its inputs.**
88+
## Expression
2689

90+
An expression is the representation of a non-materialized operator _inside_ of the [Memo Table].
2791

28-
**Logical properties** describe the structure and content of data returned by an expression.
92+
There are 2 types of expressions: [Relational Expression]s and [Scalar Expression]s. A [Relational
93+
Expression] can be either a [Logical Expression] or a [Physical Expression].
2994

30-
- Examples: row count, operator type,statistics, whether relational output columns can contain nulls.
95+
Note that different kinds of expressions can have the same names as [Operator]s or [Plan Node]s, but
96+
expressions solely indicate non-materialized relational or scalar operators in the [Memo Table].
3197

32-
**Physical properties** are characteristics of an expression that
33-
impact its layout, presentation, or location, but not its logical content.
98+
Operators outside of the [Memo Table] should _**not**_ be referred to as expressions, and should
99+
instead be referred to as [Operator]s or [Plan Node]s.
34100

35-
- Examples: order and data distribution.
101+
Notably, when we refer to an expression, _we are specifically talking about the representation of_
102+
_operators inside the memo table_. A logical operator from an incoming logical plan should _not_
103+
be called a [Logical Expression], and similarly a physical execution operator in the final output
104+
physical plan should also _not_ be called a [Physical Expression].
36105

106+
Another way to think about this is that expressions are _not_ materialized, and plan nodes and
107+
operators inside query plans _are_ materialized. Operators inside of query plans (both logical and
108+
physical) should be referred to as either logical or physical [Operator]s or logical or physical
109+
[Plan Node]s.
37110

38-
### Equivalence
111+
Another key difference between expressions and [Plan Node]s is that expressions have 0 or more
112+
**Group Identifiers** as children, and [Plan Node]s have 0 or more other [Plan Node]s as children.
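
The difference in child types can be sketched in Rust as follows. The type names (`GroupId`, `Expression`, `PlanNode`) and the `size` helper are assumptions made for illustration only, not `optd` definitions.

```rust
/// Hypothetical identifier of a group in the memo table.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct GroupId(u32);

/// A memo-table expression: its children are group identifiers.
#[derive(Debug)]
struct Expression {
    op: &'static str,
    children: Vec<GroupId>,
}

/// A plan node: its children are other, fully materialized plan nodes.
#[derive(Debug)]
struct PlanNode {
    op: &'static str,
    children: Vec<PlanNode>,
}

impl PlanNode {
    /// Counts the nodes in the plan rooted at `self`.
    fn size(&self) -> usize {
        1 + self.children.iter().map(PlanNode::size).sum::<usize>()
    }
}
```

A plan node can be walked recursively on its own, as `size` does, whereas an expression can only be expanded by looking up its child groups in the memo table.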
39113

40-
Two logical expressions are equivalent if the logical properties of the two expressions are the same. They should produce the same set of rows and columns.
114+
## Relational Expression
41115

42-
Two physical expressions are equivalent if their logical and physical properties are the same.
116+
A relational expression is either a [Logical Expression] or a [Physical Expression].
43117

44-
Logical expression with a required physical property is equivalent to a physical expression if the physical expression has the same logical property and delivers the physical property.
118+
When we say "relational", we mean representations of operations in the relational algebra of SQL.
45119

120+
Relational expressions differ from [Scalar Expression]s in that the result of algebraically
121+
evaluating a relational expression produces a bag of tuples instead of a single scalar value.
46122

47-
### Group
123+
See the following sections for more information.
48124

49-
A **group** consists of equivalent logical expressions.
125+
## Logical Expression
50126

51-
A **relational group** consists of logically equivalent logical relational operators.
127+
A logical expression is a kind of [Relational Expression].
52128

53-
A **scalar group** consists of logically equivalent logical scalar operators.
129+
TODO(connor) Add more details.
54130

55-
### Rule
131+
Examples of logical expressions include Logical Scan, Logical Join, or Logical Sort expressions
132+
(which can be shortened to Scan, Join, or Sort).
56133

57-
a **rule** in Cascades transforms an expression into equivalent expressions. It has the following interface.
134+
## Physical Expression
135+
136+
A physical expression is a kind of [Relational Expression].
137+
138+
TODO(connor) Add more details.
139+
140+
Examples of physical expressions include Table Scan, Index Scan, Hash Join, or Sort Merge Join.
141+
142+
## Scalar Expression
143+
144+
A scalar expression is a kind of [Expression].
145+
146+
A scalar expression describes an operation that can be evaluated to obtain a single value. This can
147+
also be referred to as a SQL expression, a row expression, or a SQL predicate.
148+
149+
TODO(everyone) Figure out the semantics of what a scalar expression really is.
150+
151+
Examples of scalar expressions include the expressions `t1.a < 42` or `t1.b = t2.c`.
152+
153+
## Expression Equivalence
154+
155+
Two [Logical Expression]s are equivalent if the [Logical Property]s of the two expressions are the
156+
same. In other words, the [Logical Plan]s they represent produce the same set of rows and columns.
157+
158+
Two [Physical Expression]s are equivalent if their [Logical Property]s and [Physical Property]s are the same.
159+
In other words, the [Physical Plan]s they represent produce the same set of rows and columns, in the
160+
exact same order and distribution.
161+
162+
TODO This next part is unclear?
163+
164+
A [Logical Expression] with a required [Physical Property] is equivalent to a [Physical Expression]
165+
if the [Physical Expression] has the same [Logical Property] and delivers the [Physical Property].
166+
167+
## Group
168+
169+
A **group** is a set of equivalent [Expression]s.
170+
171+
We follow the definition of groups in the Volcano and Cascades frameworks. From the [EQOP] Microsoft
172+
article (Section 2.2, page 205):
173+
174+
> In the memo, each class of equivalent expressions is called an _equivalence class_ or a _group_,
175+
> and all equivalent expressions within the class are called _group expressions_ or simply
176+
> _expressions_.
177+
178+
## Relational Group
179+
180+
A relational group is a set of 1 or more equivalent [Logical Expression]s and 0 or more equivalent
181+
[Physical Expression]s.
182+
183+
For a given relational group, the first step of optimization is exploration, in which equivalent
184+
[Logical Expression]s are added to the group via [Transformation Rule]s. Once the search space for
185+
the group has been exhausted (all possible transformation rules have been applied to all logical
186+
expressions in the group), the group can be physically optimized. At this point, the search
187+
algorithm will apply [Implementation Rule]s to cost and find the best execution plan.
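
The two phases described above can be sketched as a small worklist loop. This is a simplified illustration under several assumptions: expressions are plain strings, the `transform` and `implement` closures stand in for transformation and implementation rules, and costing is left out entirely. `RelGroup` and `optimize_group` are hypothetical names.

```rust
use std::collections::HashSet;

/// Hypothetical relational group; expressions are plain strings in this sketch.
#[derive(Default)]
struct RelGroup {
    logical: Vec<String>,
    physical: Vec<String>,
}

/// Applies `transform` to every logical expression until no new ones appear
/// (exploration), then applies `implement` to each logical expression.
fn optimize_group(
    group: &mut RelGroup,
    transform: impl Fn(&str) -> Vec<String>,
    implement: impl Fn(&str) -> Vec<String>,
) {
    let mut seen: HashSet<String> = group.logical.iter().cloned().collect();
    let mut next = 0;
    // Exploration: the worklist grows as new equivalent expressions are found.
    while next < group.logical.len() {
        let current = group.logical[next].clone();
        for candidate in transform(current.as_str()) {
            if seen.insert(candidate.clone()) {
                group.logical.push(candidate);
            }
        }
        next += 1;
    }
    // Implementation: only starts once exploration has reached a fixpoint.
    for expr in &group.logical {
        group.physical.extend(implement(expr.as_str()));
    }
}
```

The `seen` set is what makes exploration terminate: a rule such as join commutativity would otherwise keep regenerating expressions that are already in the group.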
188+
189+
TODO Add more details.
190+
191+
TODO Add example.
192+
193+
## Scalar Group
194+
195+
A scalar group consists of equivalent [Scalar Expression]s.
196+
197+
TODO Add more details.
198+
199+
TODO Add example.
200+
201+
<br>
202+
203+
# Plan Enumeration and Search Concepts
204+
205+
This section describes names and definitions of concepts related to the general plan enumeration and
206+
search of optimal query plans.
207+
208+
## Query Plan
209+
210+
A query plan is a tree or DAG of relational and scalar operators. We can consider query optimization
211+
to be a function from an unoptimized query plan to an optimized query plan. More specifically, the
212+
input plan is generally a [Logical Plan] and the output plan is always a [Physical Plan].
213+
214+
We generally consider query plans to either be completely logical or completely physical. However,
215+
when dealing with rule matching and rule application to enumerate different but equivalent query
216+
plans, we also deal with partially materialized query plans that can be a mix of both logical and
217+
physical operators (as well as group identifiers and other scalar operators).
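
One possible way to picture a partially materialized plan is as a tree whose leaves may still be unexpanded group identifiers. The `PartialPlan` enum and its method below are hypothetical and only meant to make the idea concrete; scalar operators are omitted for brevity.

```rust
/// Hypothetical partially materialized plan.
#[derive(Debug)]
enum PartialPlan {
    /// A materialized logical operator.
    Logical { op: &'static str, children: Vec<PartialPlan> },
    /// A materialized physical operator.
    Physical { op: &'static str, children: Vec<PartialPlan> },
    /// An unexpanded reference to a memo group.
    Group(u32),
}

impl PartialPlan {
    /// Returns true when no group identifiers remain in the plan.
    fn is_fully_materialized(&self) -> bool {
        match self {
            PartialPlan::Group(_) => false,
            PartialPlan::Logical { children, .. }
            | PartialPlan::Physical { children, .. } => {
                children.iter().all(PartialPlan::is_fully_materialized)
            }
        }
    }
}
```

For example, a logical join whose left input is a physical table scan and whose right input is `Group(7)` is a valid `PartialPlan` but is not fully materialized.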
218+
219+
TODO Add more details about partially materialized plans.
220+
221+
## Logical Plan
222+
223+
A logical plan is a tree or DAG of [Logical Operator]s that can be evaluated to produce a bag of
224+
tuples. This can also be referred to as a logical query plan. The [Operator]s that make up this
225+
logical plan can be considered logical plan nodes.
226+
227+
## Physical Plan
228+
229+
A physical plan is a tree or DAG of [Physical Operator]s that can be evaluated by an execution
230+
engine to produce a table. This can also be referred to as a physical query plan. The [Operator]s
231+
that make up this physical plan can be considered physical plan nodes.
232+
233+
## Operator
234+
235+
An operator is the materialized version of an [Expression]. Like expressions, there are both
236+
relational operators and scalar operators.
237+
238+
See the following sections for more information.
239+
240+
## Relational Operator
241+
242+
A relational operator is a node in a [Query Plan] (which is a tree or DAG), and is the materialized
243+
version of a [Relational Expression].
244+
245+
## Logical Operator
246+
247+
A logical operator is a node in a [Logical Plan] (which is a tree or DAG), and is the materialized
248+
version of a [Logical Expression].
249+
250+
## Physical Operator
251+
252+
A physical operator is a node in a [Physical Plan] (which is a tree or DAG), and is the materialized
253+
version of a [Physical Expression].
254+
255+
## Scalar Operator
256+
257+
A scalar operator is a node in a [Query Plan] that describes a scalar expression, and can be
258+
considered the materialized version of a [Scalar Expression].
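
To make "evaluated to obtain a single value" concrete, here is a minimal sketch of a scalar operator tree for predicates such as `t1.a < 42`. The `ScalarOp` enum, its variants, and the choice to address columns by position and encode booleans as integers are all assumptions made for this example.

```rust
/// Hypothetical scalar operator tree; variant names are illustrative only.
#[derive(Debug)]
enum ScalarOp {
    /// Reference to a column of the input row, by position.
    Column(usize),
    /// Integer literal.
    Literal(i64),
    /// `left < right`.
    LessThan(Box<ScalarOp>, Box<ScalarOp>),
    /// `left = right`.
    Equal(Box<ScalarOp>, Box<ScalarOp>),
}

impl ScalarOp {
    /// Evaluates the operator against one row, producing a single value.
    /// Booleans are encoded as 0 or 1 to keep the sketch small.
    fn eval(&self, row: &[i64]) -> i64 {
        match self {
            ScalarOp::Column(i) => row[*i],
            ScalarOp::Literal(v) => *v,
            ScalarOp::LessThan(l, r) => (l.eval(row) < r.eval(row)) as i64,
            ScalarOp::Equal(l, r) => (l.eval(row) == r.eval(row)) as i64,
        }
    }
}
```

With this encoding, `t1.a < 42` becomes `LessThan(Column(0), Literal(42))` when `a` is the first column of the row.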
259+
260+
## Property
261+
262+
A property is metadata computed (and sometimes stored) for a given relational expression.
263+
264+
Properties of an expression may be _required_ by the original SQL query or _derived_ from the
265+
[Physical Property] of one of its inputs.
266+
267+
TODO Add more details.
268+
269+
## Logical Property
270+
271+
A logical property describes the structure and content of data returned by a given expression.
272+
273+
Examples: row count, operator type, statistics, whether relational output columns can contain nulls.
274+
275+
TODO Clean up and add more details.
276+
277+
## Physical Property
278+
279+
A physical property is a characteristic of an expression that impacts its layout, presentation, or
280+
location, but not its logical content.
281+
282+
Examples: order and data distribution.
283+
284+
TODO Clean up and add more details.
285+
286+
## Rule
287+
288+
A rule transforms a query plan or sub-plan into an equivalent plan.
289+
290+
Rules should have an interface similar to the following:
58291

59292
```rust
60293
trait Rule {
61294
    /// Checks whether the rule is applicable on the input expression.
62295
    fn check_pattern(expr: Expr) -> bool;
296+
63297
    /// Transforms the expression into one or more equivalent expressions.
64298
    fn transform(expr: Expr) -> Vec<Expr>;
65299
}
66300
```
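
As an illustration of how such an interface might be used, the sketch below implements inner-join commutativity against a similar trait. It is not the `optd` rule API: the `Expr` enum is a toy type invented for this example, and the trait here borrows its argument (`&Expr`) so the input does not have to be consumed.

```rust
/// Hypothetical expression type for this sketch only.
#[derive(Clone, Debug, PartialEq)]
enum Expr {
    Scan(String),
    Join(Box<Expr>, Box<Expr>),
}

/// Same shape as the trait above, but taking the expression by reference.
trait Rule {
    /// Checks whether the rule is applicable on the input expression.
    fn check_pattern(expr: &Expr) -> bool;

    /// Transforms the expression into zero or more equivalent expressions.
    fn transform(expr: &Expr) -> Vec<Expr>;
}

/// Join commutativity: `Join(a, b)` is equivalent to `Join(b, a)`.
struct JoinCommute;

impl Rule for JoinCommute {
    fn check_pattern(expr: &Expr) -> bool {
        matches!(expr, Expr::Join(_, _))
    }

    fn transform(expr: &Expr) -> Vec<Expr> {
        match expr {
            Expr::Join(left, right) => vec![Expr::Join(right.clone(), left.clone())],
            // The rule does not apply, so it produces no alternatives.
            _ => vec![],
        }
    }
}
```

Returning an empty vector for a non-matching input keeps `transform` total, so a caller that skips `check_pattern` still gets a well-defined result.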
67301

68-
A **transformation rule** transforms a **part** of the logical expression into logical expressions. This is also called a logical to logical transformation in other systems.
302+
TODO Actually figure out the interface for rules since it's probably not going to look like that.
303+
304+
TODO Clean up and add more details.
69305

70-
A **implementation rule** transforms a **part** of a logical expression to an equivalent physical expression with physical properties.
306+
## Transformation Rule
71307

72-
In Cascades, you don't need to materialize the entire query tree when applying rules. Instead, you can materialize expressions on demand while leaving unrelated parts of the tree as group identifiers.
308+
A transformation rule transforms a _part_ of a logical expression into one or more equivalent logical expressions.
309+
310+
This is also called a logical to logical transformation in other systems.
311+
312+
TODO Clean up and add more details.
313+
314+
## Implementation Rule
315+
316+
An implementation rule transforms a _part_ of a logical expression to an equivalent physical
317+
expression with physical properties.
318+
319+
In Cascades, you don't need to materialize the entire query tree when applying rules. Instead, you
320+
can materialize expressions on demand while leaving unrelated parts of the tree as group identifiers.
73321

74322
In other systems, there are physical-to-physical expression transformations for execution-engine-specific optimization, physical property enforcement, or distributed planning. At the moment, we are **not** considering physical-to-physical transformations.
75323

76-
**Enforcer rule:** *TODO!*
324+
TODO Clean up and add more details.
325+
326+
## Enforcer Rule
327+
328+
TODO Write this section.
329+
330+
## Enforcer Operator
331+
332+
TODO Write this section.
