Commit 4549ae9

Merge branch 'main' into agg-v2
2 parents 0e5d681 + 4d2aa87

File tree: 3 files changed (+259, -21 lines)

docs/doc/60-contributing/02-roadmap.md

Lines changed: 31 additions & 21 deletions
```diff
@@ -12,41 +12,51 @@ This is Databend Roadmap 2022 :rocket:, sync from the [#3706](https://github.com/datafuselabs/databend/issues/3706)
 
 # Main tasks
 
-### 1. Query
+Roadmap 2021: https://github.com/datafuselabs/databend/issues/746
+
+# Main tasks
+
+### 1. Query
 
 
 | Task | Status | Release Target | Comments |
 | ----------------------------------------------- | --------- | -------------- | --------------- |
-| [Query Cluster Track #747](https://github.com/datafuselabs/databend/issues/747) | PROGRESS | | |
-| [RBAC Privileges #2793](https://github.com/datafuselabs/databend/issues/2793) | PROGRESS | | |
-| [ New Planner Framework #1217](https://github.com/datafuselabs/databend/issues/1218)| PROGRESS | | [RFC](https://databend.rs/doc/contributing/rfcs/new-sql-planner-framework)|
-| [ Database Sharing #3430](https://github.com/datafuselabs/databend/issues/3430)| PROGRESS | | |
-| [ STAGE Command #2976](https://github.com/datafuselabs/databend/issues/2976)| PROGRESS | | |
-| [ COPY Command #4104](https://github.com/datafuselabs/databend/issues/4104)| PROGRESS | | |
+| [Query Cluster Track #747](https://github.com/datafuselabs/databend/issues/747) | DONE | | |
+| [RBAC Privileges #2793](https://github.com/datafuselabs/databend/issues/2793) | DONE | | |
+| [ New Planner Framework #1217](https://github.com/datafuselabs/databend/issues/1218)| DONE | | [RFC](https://databend.rs/doc/contributing/rfcs/new-sql-planner-framework)|
+| [ Database Sharing #3430](https://github.com/datafuselabs/databend/issues/3430)| DONE | | |
+| [ STAGE Command #2976](https://github.com/datafuselabs/databend/issues/2976)| DONE | | |
+| [ COPY Command #4104](https://github.com/datafuselabs/databend/issues/4104)| DONE | | |
 | [Index Design #3711](https://github.com/datafuselabs/databend/issues/3711) | PROGRESS | | |
-| [Push-Based + Pull-Based processor](https://github.com/datafuselabs/databend/issues/3379)| PROGRESS | | |
-| [Semi-structured Data Types #3916](https://github.com/datafuselabs/databend/issues/3916) | PROGRESS | | |
+| [Push-Based + Pull-Based processor](https://github.com/datafuselabs/databend/issues/3379)| DONE | | |
+| [Semi-structured Data Types #3916](https://github.com/datafuselabs/databend/issues/3916) | DONE | | |
+| [Table Cluster Key #4268](https://github.com/datafuselabs/databend/issues/4268) | DONE | | |
+| Transactions | DONE | | |
 | [Support Fulltext Index #3915](https://github.com/datafuselabs/databend/issues/3915) | PLANNING | | |
-| [Table Cluster Key #4268](https://github.com/datafuselabs/databend/issues/4268) | PLANNING | | |
-| Tansactions | PLANNING | | |
-| Window Functions | PLANNING | | |
-| Lambda Functions | PLANNING | | |
-| Array Functions | PLANNING | | |
+| [Hive External Data Source #4826](https://github.com/datafuselabs/databend/issues/4826) | DONE | | |
+| [Window Functions](https://github.com/datafuselabs/databend/issues/4653) | DONE | | |
+| Lambda Functions | DONE | | |
+| Array Functions | DONE | | |
 | Compile Aggregate Functions(JIT) | PLANNING | | |
-| Common Table Expressions | PLANNING | | [MySQL CTE](https://dev.mysql.com/doc/refman/8.0/en/with.html#common-table-expressions) |
-| External Cache | PLANNING | | |
+| [Common Table Expressions #6246](https://github.com/datafuselabs/databend/issues/6246) | DONE | | [MySQL CTE](https://dev.mysql.com/doc/refman/8.0/en/with.html#common-table-expressions) |
+| [External Cache #6786](https://github.com/datafuselabs/databend/issues/6786) | PROGRESS | | |
 | External Table | PLANNING | | [Snowflake ET](https://docs.snowflake.com/en/sql-reference/sql/create-external-table.html)|
-| Update&Delete | PLANNING | | |
+| Delete | DONE | | |
+| Update | PROGRESS | | |
 | Streaming Ingestion | PLANNING | | |
-| Streaming Analytics | PLANNING | | |
+| [Resource Quota](https://github.com/datafuselabs/databend/issues/6935) | PROGRESS | | |
+| [LakeHouse](https://github.com/datafuselabs/databend/issues/7592) | PROGRESS | | v0.9 |
 
 
 ### 2. Testing
 
 | Task | Status | Release Target | Comments |
 | ----------------------------------------------- | --------- | -------------- | --------------- |
-| [ Continuous Benchmarking #3084](https://github.com/datafuselabs/databend/issues/3084) | PROGRESS | | |
+| [ Continuous Benchmarking #3084](https://github.com/datafuselabs/databend/issues/3084) | DONE | | https://perf.databend.rs |
+
 
 # Releases
-- [x] #2525
-- [x] #2257
+- [x] [Release proposal: Nightly v0.8 #4591](https://github.com/datafuselabs/databend/issues/4591)
+- [x] [Release proposal: Nightly v0.7 #3428](https://github.com/datafuselabs/databend/issues/3428)
+- [x] [Release proposal: Nightly v0.6 #2525](https://github.com/datafuselabs/databend/issues/2525)
+- [x] [Release proposal: Nightly v0.5 #2257](https://github.com/datafuselabs/databend/issues/2257)
```
Lines changed: 228 additions & 0 deletions
(new file)
---
title: Designing and Using JSON in Databend
description: JSON
slug: json-datatypes
date: 2022-09-14
tags: [databend, JSON]
authors:
  - name: baishen
    url: https://github.com/b41sh
    image_url: https://github.com/b41sh.png
---
JSON (JavaScript Object Notation) is a commonly used semi-structured data type. With its self-describing schema, a JSON value can hold data of any type, including multi-level nested structures such as Array and Object. JSON offers high flexibility and easy dynamic expansion compared with structured data types, which must strictly follow the fields of a tabular schema.

As data volumes have grown rapidly in recent years, many platforms have adopted semi-structured data types such as JSON to get the most out of their data: JSON is shared by various platforms through open interfaces, and public datasets and application logs are commonly stored in JSON format.

Databend supports structured data types as well as JSON. This post dives deep into the JSON data type in Databend.
![](../static/img/blog/json.png)

## Working with JSON in Databend

Databend stores semi-structured data as the VARIANT (also called JSON) data type:

```sql
CREATE TABLE test
(
  id INT32,
  v1 VARIANT,
  v2 JSON
);
```
JSON values are generated by calling the `parse_json` or `try_parse_json` function. The input string must be in standard JSON format, covering Null, Boolean, Number, String, Array, and Object. If the string is not valid JSON, `parse_json` returns an error, while `try_parse_json` returns NULL.
```sql
INSERT INTO test VALUES
(1, parse_json('{"a":{"b":1,"c":[1,2]}}'), parse_json('[["a","b"],{"k":"a"}]')),
(2, parse_json('{"a":{"b":2,"c":[3,4]}}'), parse_json('[["c","d"],{"k":"b"}]'));

SELECT * FROM test;
+----+-------------------------+-----------------------+
| id | v1                      | v2                    |
+----+-------------------------+-----------------------+
|  1 | {"a":{"b":1,"c":[1,2]}} | [["a","b"],{"k":"a"}] |
|  2 | {"a":{"b":2,"c":[3,4]}} | [["c","d"],{"k":"b"}] |
+----+-------------------------+-----------------------+
```
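
The two functions differ only in how they handle malformed input. Here is a minimal sketch of the contrast, using a deliberately truncated JSON string (the exact error message depends on your Databend version):

```sql
SELECT parse_json('{"a":1,');     -- fails with a JSON parsing error
SELECT try_parse_json('{"a":1,'); -- returns NULL instead of an error
```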
JSON usually holds data of Array or Object type. Because of the nested hierarchical structure, internal elements are accessed through a JSON PATH. The syntax supports the following delimiters:

- `:`: Colon obtains an element in an object by key.

- `.`: Dot obtains an element in an object by key. Do NOT use a dot as the first delimiter in a statement, or Databend will interpret it as the separator between the table name and the column name.

- `[]`: Brackets obtain an element in an object by key, or an element in an array by index.

You can mix the three types of delimiters above.
```sql
SELECT v1:a.c, v1:a['b'], v1['a']:c, v2[0][1], v2[1].k FROM test;

+--------+-----------+-----------+----------+---------+
| v1:a.c | v1:a['b'] | v1['a']:c | v2[0][1] | v2[1].k |
+--------+-----------+-----------+----------+---------+
| [1,2]  | 1         | [1,2]     | "b"      | "a"     |
| [3,4]  | 2         | [3,4]     | "d"      | "b"     |
+--------+-----------+-----------+----------+---------+
```
The internal elements extracted through a JSON PATH are also of JSON type, and they can be converted to basic types with the `cast` function or the conversion operator `::`.
```sql
SELECT cast(v1:a.c[0] AS int64), v1:a.b::int32, v2[0][1]::string FROM test;

+--------------------------+---------------+------------------+
| cast(v1:a.c[0] as int64) | v1:a.b::int32 | v2[0][1]::string |
+--------------------------+---------------+------------------+
|                        1 |             1 | b                |
|                        3 |             2 | d                |
+--------------------------+---------------+------------------+
```
## Parsing JSON from GitHub

Many public datasets are stored in JSON format. We can import this data into Databend for parsing. The following walkthrough uses the GitHub events dataset as an example.

The GitHub events dataset (downloaded from [GH Archive](https://www.gharchive.org/)) uses the following JSON format:
```json
{
  "id":"23929425917",
  "type":"PushEvent",
  "actor":{
    "id":109853386,
    "login":"teeckyar-bot",
    "display_login":"teeckyar-bot",
    "gravatar_id":"",
    "url":"https://api.github.com/users/teeckyar-bot",
    "avatar_url":"https://avatars.githubusercontent.com/u/109853386?"
  },
  "repo":{
    "id":531248561,
    "name":"teeckyar/Times",
    "url":"https://api.github.com/repos/teeckyar/Times"
  },
  "payload":{
    "push_id":10982315959,
    "size":1,
    "distinct_size":1,
    "ref":"refs/heads/main",
    "head":"670e7ca4085e5faa75c8856ece0f362e56f55f09",
    "before":"0a2871cb7e61ce47a6790adaf09facb6e1ef56ba",
    "commits":[
      {
        "sha":"670e7ca4085e5faa75c8856ece0f362e56f55f09",
        "author":{
          "email":"support@teeckyar.ir",
          "name":"teeckyar-bot"
        },
        "message":"1662804002 Timehash!",
        "distinct":true,
        "url":"https://api.github.com/repos/teeckyar/Times/commits/670e7ca4085e5faa75c8856ece0f362e56f55f09"
      }
    ]
  },
  "public":true,
  "created_at":"2022-09-10T10:00:00Z",
  "org":{
    "id":106163581,
    "login":"teeckyar",
    "gravatar_id":"",
    "url":"https://api.github.com/orgs/teeckyar",
    "avatar_url":"https://avatars.githubusercontent.com/u/106163581?"
  }
}
```
From the data above, we can see that the `actor`, `repo`, `payload`, and `org` fields have a nested structure and can be stored as JSON. The other fields can be stored as basic data types. So we can create a table like this:
```sql
CREATE TABLE `github_data`
(
  `id` VARCHAR,
  `type` VARCHAR,
  `actor` JSON,
  `repo` JSON,
  `payload` JSON,
  `public` BOOLEAN,
  `created_at` TIMESTAMP(0),
  `org` JSON
);
```

Use the COPY INTO command to load the data:

```sql
COPY INTO github_data
FROM 'https://data.gharchive.org/2022-09-10-10.json.gz'
FILE_FORMAT = (
  compression = auto
  type = NDJSON
);
```
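
Before running analytics, it can help to sanity-check the load. This quick check is our addition, not part of the original walkthrough; the counts depend on whatever events that hour's archive contains:

```sql
-- Confirm that rows arrived and peek at a few of them.
SELECT count(*) FROM github_data;
SELECT id, type, created_at FROM github_data LIMIT 3;
```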

The following code returns the top 10 projects with the most commits:

```sql
SELECT repo:name,
       count(id)
FROM github_data
WHERE type = 'PushEvent'
GROUP BY repo:name
ORDER BY count(id) DESC
LIMIT 10;

+----------------------------------------------------------+-----------+
| repo:name                                                 | count(id) |
+----------------------------------------------------------+-----------+
| "Lombiq/Orchard"                                          |      1384 |
| "maique/microdotblog"                                     |       970 |
| "Vladikasik/statistic"                                    |       738 |
| "brokjad/got_config"                                      |       592 |
| "yanonono/booth-update"                                   |       537 |
| "networkoperator/demo-cluster-manifests"                  |       433 |
| "kn469/web-clipper-bed"                                   |       312 |
| "ufapg/jojo"                                              |       306 |
| "bj5nj7oh/bj5nj7oh"                                       |       291 |
| "appseed-projects2/500f32d3-8019-43ee-8f2a-a273163233fb"  |       247 |
+----------------------------------------------------------+-----------+
```
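
Note that `repo:name` is extracted as a JSON value, which is why the names above are printed with surrounding quotes. If you prefer plain text, a small variation on the query (our sketch, not part of the original post) casts the element with the `::` operator introduced earlier:

```sql
-- Cast the JSON string to a plain string to drop the quotes.
SELECT repo:name::string AS repo_name,
       count(id)
FROM github_data
WHERE type = 'PushEvent'
GROUP BY repo:name::string
ORDER BY count(id) DESC
LIMIT 10;
```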

The following code returns the top 10 users with the most forks:

```sql
SELECT actor:login,
       count(id)
FROM github_data
WHERE type = 'ForkEvent'
GROUP BY actor:login
ORDER BY count(id) DESC
LIMIT 10;

+-----------------------------------+-----------+
| actor:login                       | count(id) |
+-----------------------------------+-----------+
| "actions-marketplace-validations" |       191 |
| "alveraboquet"                    |        59 |
| "ajunlonglive"                    |        50 |
| "Shutch420"                       |        13 |
| "JusticeNX"                       |        13 |
| "RyK-eR"                          |        12 |
| "DroneMad"                        |        10 |
| "UnqulifiedEngineer"              |         9 |
| "PeterZs"                         |         8 |
| "lgq2015"                         |         8 |
+-----------------------------------+-----------+
```

## Performance Optimization

JSON data is generally saved in plain text and must be parsed into a `serde_json::Value` enum every time it is read. Compared with basic data types, JSON takes more parsing time and occupies more memory.

Databend has improved the read performance of JSON data in the following ways:

- To speed up parsing and reduce memory usage, Databend stores JSON data as JSONB in binary format and uses the built-in `j_entry` structure to hold the data type and offset position of each element.

- Databend adds virtual columns to speed up queries: frequently queried fields and fields of the same data type are extracted and stored as separate virtual columns. Data is read directly from the virtual columns at query time, giving the same performance as querying basic data types.

website/static/img/blog/json.png

20.5 KB