Commit 98271aa

Merge pull request #9208 from ariesdevil/stream-load-doc
docs: Update streaming load docs and tests
2 parents 8072be9 + 9435d48 commit 98271aa

24 files changed: +49 −52 lines

docs/doc/11-integrations/00-api/03-streaming-load.md

Lines changed: 10 additions & 10 deletions

@@ -14,20 +14,20 @@ The Streaming Load API is used to read data from your local files and load it in

To create a request with the Streaming Load API, follow the format below:

```bash
-curl -H "<parameter>:<value>" [-H "<parameter>:<value>"...] -F "upload=@<file_location>" [-F "upload=@<file_location>"] -XPUT http://<user_name>:[password]@<http_handler_host>:<http_handler_port>/v1/streaming_load
+curl -H "insert_sql:<value>" -F "upload=@<file_location>" [-F "upload=@<file_location>"] -XPUT http://<user_name>:[password]@<http_handler_host>:<http_handler_port>/v1/streaming_load
```

## Explaining Argument `-H`

-The request usually includes many occurrences of the argument `-H` and each is followed by one of the following parameters to tell Databend how to handle the file you're loading data from. Please note that `insert_sql` is required, and other parameters are optional.
+The request usually includes many occurrences of the argument `-H` and each is followed by one of the following parameters to tell Databend how to handle the file you're loading data from. Please note that `insert_sql` is required.

-| Parameter               | Values | Supported Formats | Examples |
-|-------------------------|--------|-------------------|----------|
-| insert_sql              | [INSERT_statement] + format [file_format] | All | -H "insert_sql: insert into ontime format CSV" |
-| format_skip_header      | Tells Databend how many lines at the beginning of the file to skip for header.<br /> 0 (default): No lines to skip;<br /> 1: Skip the first line;<br /> N: Skip the first N lines. | CSV / TSV / NDJSON | -H "format_skip_header: 1" |
-| format_compression      | Tells Databend the compression format of the file.<br /> NONE (default): Do NOT decompress the file;<br /> AUTO: Automatically decompress the file by suffix;<br /> You can also use one of these values to explicitly specify the compression format: GZIP \| BZ2 \| BROTLI \| ZSTD \| DEFLATE \| RAW_DEFLATE. | CSV / TSV / NDJSON | -H "format_compression:auto" |
-| format_field_delimiter  | Tells Databend the characters used in the file to separate fields.<br /> Default for CSV files: `,`.<br /> Default for TSV files: `\t`.<br /> Hive output files use the [SOH control character (\x01)](https://en.wikipedia.org/wiki/C0_and_C1_control_codes#SOH) as the field delimiter. | CSV / TSV | -H "format_field_delimiter:," |
-| format_record_delimiter | Tells Databend the new line characters used in the file to separate records.<br /> Default: `\n`. | CSV / TSV | -H "format_record_delimiter:\n" |
-| format_quote            | Tells Databend the quote characters for strings in CSV file.<br /> Default: "" (double quotes). | CSV | |
+| Parameter  | Values                             | Supported Formats | Examples |
+|------------|------------------------------------|-------------------|----------|
+| insert_sql | [INSERT_statement] + [FILE_FORMAT] | All               | -H "insert_sql: insert into ontime file_format = (type = 'CSV' skip_header = 1 compression = 'bz2')" |
+
+> FILE_FORMAT = ( TYPE = { CSV | TSV | NDJSON | PARQUET | XML } [ formatTypeOptions ] )
+>
+> The `formatTypeOptions` contains the same options as those for the [COPY_INTO](../../14-sql-commands/10-dml/dml-copy-into-table.md) command.

## Alternatives to Streaming Load API

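Taken together, the changes to this page replace the per-option `-H` headers with a single `file_format` clause inside `insert_sql`. For context, a complete request under the new syntax might look like the following minimal sketch; the table `books`, the file `./books.csv`, and the local endpoint with user `root` on port 8000 are hypothetical defaults, not part of this diff:

```bash
# Minimal sketch of the new file_format syntax (hypothetical table and file).
# Assumes a local Databend HTTP handler on 127.0.0.1:8000 and user `root`.
curl -u root: -XPUT \
  -H "insert_sql: insert into books file_format = (type = 'CSV' skip_header = 1 field_delimiter = ',' record_delimiter = '\n')" \
  -F "upload=@./books.csv" \
  "http://127.0.0.1:8000/v1/streaming_load"
```
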
docs/doc/12-load-data/02-local.md

Lines changed: 2 additions & 2 deletions
@@ -43,7 +43,7 @@ CREATE TABLE books
Create and send the API request with the following scripts:

```bash
-curl -XPUT 'http://root:@127.0.0.1:8000/v1/streaming_load' -H 'insert_sql: insert into book_db.books format CSV' -H 'format_skip_header: 0' -H 'format_field_delimiter: ,' -H 'format_record_delimiter: \n' -F 'upload=@"./books.csv"'
+curl -XPUT 'http://root:@127.0.0.1:8000/v1/streaming_load' -H 'insert_sql: insert into book_db.books file_format = (type = "CSV" skip_header = 0 field_delimiter = "," record_delimiter = "\n")' -F 'upload=@"./books.csv"'
```

Response Example:
@@ -101,7 +101,7 @@ CREATE TABLE bookcomments
Create and send the API request with the following scripts:

```bash
-curl -XPUT 'http://root:@127.0.0.1:8000/v1/streaming_load' -H 'insert_sql: insert into book_db.bookcomments(title,author,date)format CSV' -H 'format_skip_header: 0' -H 'format_field_delimiter: ,' -H 'format_record_delimiter: \n' -F 'upload=@"./books.csv"'
+curl -XPUT 'http://root:@127.0.0.1:8000/v1/streaming_load' -H 'insert_sql: insert into book_db.bookcomments(title,author,date) file_format = (type = "CSV" skip_header = 0 field_delimiter = "," record_delimiter = "\n")' -F 'upload=@"./books.csv"'
```

Notice that the `insert_sql` part above specifies the columns (title, author, and date) to match the loaded data.
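
For reference, a hypothetical `books.csv` whose rows carry three comma-separated fields would line up with that `(title, author, date)` column list; the sample rows below are illustrative only and are not part of this diff:

```bash
# Illustrative sample only: each row supplies three fields, matching the
# (title, author, date) column list named in insert_sql above.
cat > books.csv <<'EOF'
Transaction Processing,Jim Gray,1992
Readings in Database Systems,Michael Stonebraker,2004
EOF
```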

docs/doc/21-use-cases/01-analyze-ontime-with-databend-on-ec2-and-s3.md

Lines changed: 1 addition & 1 deletion
@@ -48,7 +48,7 @@ unzip t_ontime.csv.zip
```

```shell title='Load TSV files into Databend'
-curl -H "insert_sql:insert into ontime format TSV" -H "format_skip_header:0" -F "upload=@t_ontime.csv" -XPUT http://root:@127.0.0.1:8000/v1/streaming_load
+curl -H "insert_sql:insert into ontime file_format = (type = 'TSV' skip_header = 0)" -F "upload=@t_ontime.csv" -XPUT http://root:@127.0.0.1:8000/v1/streaming_load
```

:::tip

docs/doc/21-use-cases/05-analyze-hits-dataset-with-databend.md

Lines changed: 1 addition & 1 deletion
@@ -47,7 +47,7 @@ gzip -d hits_1m.csv.gz
```

```shell title='Load CSV files into Databend'
-curl -H "insert_sql:insert into hits format TSV" -F "upload=@./hits_1m.tsv" -XPUT http://user1:abc123@127.0.0.1:8000/v1/streaming_load
+curl -H "insert_sql:insert into hits file_format = (type = 'TSV')" -F "upload=@./hits_1m.tsv" -XPUT http://user1:abc123@127.0.0.1:8000/v1/streaming_load
```

## Step 3. Queries

tests/suites/0_stateless/13_tpch/13_0000_prepare.sh

Lines changed: 2 additions & 1 deletion
@@ -111,5 +111,6 @@ tar -zxf ${CURDIR}/data/tpch.tar.gz -C ${CURDIR}/data
for t in customer lineitem nation orders partsupp part region supplier
do
echo "$t"
-curl -s -u root: -XPUT "http://localhost:${QUERY_HTTP_HANDLER_PORT}/v1/streaming_load" -H 'insert_sql: insert into '$t' format CSV' -H 'format_skip_header: 0' -H 'format_field_delimiter:|' -H 'format_record_delimiter: \n' -F 'upload=@"'${CURDIR}'/data/tests/suites/0_stateless/13_tpch/data/'$t'.tbl"' > /dev/null 2>&1
+insert_sql="insert into $t file_format = (type = 'CSV' skip_header = 0 field_delimiter = '|' record_delimiter = '\n')"
+curl -s -u root: -XPUT "http://localhost:${QUERY_HTTP_HANDLER_PORT}/v1/streaming_load" -H "insert_sql: ${insert_sql}" -F 'upload=@"'${CURDIR}'/data/tests/suites/0_stateless/13_tpch/data/'$t'.tbl"' > /dev/null 2>&1
done
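
A note on the pattern introduced above: hoisting the SQL into the `insert_sql` shell variable keeps the quoting manageable, because the `file_format` options need embedded single quotes while the `-H` argument must stay double-quoted for `${insert_sql}` to expand. A minimal standalone sketch of the same pattern (the table name, data path, and port are hypothetical):

```bash
# Sketch of the quoting pattern: single quotes live inside the SQL string,
# and the -H argument is double-quoted so ${insert_sql} still expands.
t="region"  # hypothetical table name
insert_sql="insert into $t file_format = (type = 'CSV' field_delimiter = '|')"
curl -s -u root: -XPUT "http://localhost:8000/v1/streaming_load" \
  -H "insert_sql: ${insert_sql}" \
  -F "upload=@./data/${t}.tbl"
```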

tests/suites/1_stateful/01_load_v2/01_0000_streaming_load.result

Lines changed: 0 additions & 2 deletions
@@ -10,8 +10,6 @@
199 2020.0 769
--ndjson
198 2020.0 767
---csv using file_format
-199 2020.0 769
--parquet less
199 2020.0 769
--parquet mismatch schema

tests/suites/1_stateful/01_load_v2/01_0000_streaming_load.sh

Lines changed: 8 additions & 14 deletions
@@ -38,43 +38,37 @@ fi

# load csv
echo "--csv"
-curl -H "insert_sql:insert into ontime_streaming_load format Csv" -H "format_skip_header:1" -F "upload=@/tmp/ontime_200.csv" -u root: -XPUT "http://localhost:${QUERY_HTTP_HANDLER_PORT}/v1/streaming_load" > /dev/null 2>&1
+curl -H "insert_sql:insert into ontime_streaming_load file_format = (type = 'CSV' skip_header = 1)" -F "upload=@/tmp/ontime_200.csv" -u root: -XPUT "http://localhost:${QUERY_HTTP_HANDLER_PORT}/v1/streaming_load" > /dev/null 2>&1
echo "select count(1), avg(Year), sum(DayOfWeek) from ontime_streaming_load;" | $MYSQL_CLIENT_CONNECT
echo "truncate table ontime_streaming_load" | $MYSQL_CLIENT_CONNECT

echo "--csv.gz"
# load csv gz
-curl -H "insert_sql:insert into ontime_streaming_load format Csv" -H "format_skip_header:1" -H "format_compression:gzip" -F "upload=@/tmp/ontime_200.csv.gz" -u root: -XPUT "http://localhost:${QUERY_HTTP_HANDLER_PORT}/v1/streaming_load" > /dev/null 2>&1
+curl -H "insert_sql:insert into ontime_streaming_load file_format = (type = 'CSV' skip_header = 1 compression = 'gzip')" -F "upload=@/tmp/ontime_200.csv.gz" -u root: -XPUT "http://localhost:${QUERY_HTTP_HANDLER_PORT}/v1/streaming_load" > /dev/null 2>&1
echo "select count(1), avg(Year), sum(DayOfWeek) from ontime_streaming_load;" | $MYSQL_CLIENT_CONNECT
echo "truncate table ontime_streaming_load" | $MYSQL_CLIENT_CONNECT

# load csv zstd
echo "--csv.zstd"
-curl -H "insert_sql:insert into ontime_streaming_load format Csv" -H "format_skip_header:1" -H "format_compression:zstd" -F "upload=@/tmp/ontime_200.csv.zst" -u root: -XPUT "http://localhost:${QUERY_HTTP_HANDLER_PORT}/v1/streaming_load" > /dev/null 2>&1
+curl -H "insert_sql:insert into ontime_streaming_load file_format = (type = 'CSV' skip_header = 1 compression = 'zstd')" -F "upload=@/tmp/ontime_200.csv.zst" -u root: -XPUT "http://localhost:${QUERY_HTTP_HANDLER_PORT}/v1/streaming_load" > /dev/null 2>&1
echo "select count(1), avg(Year), sum(DayOfWeek) from ontime_streaming_load;" | $MYSQL_CLIENT_CONNECT
echo "truncate table ontime_streaming_load" | $MYSQL_CLIENT_CONNECT

# load csv bz2
echo "--csv.bz2"
-curl -H "insert_sql:insert into ontime_streaming_load format Csv" -H "format_skip_header:1" -H "format_compression:bz2" -F "upload=@/tmp/ontime_200.csv.bz2" -u root: -XPUT "http://localhost:${QUERY_HTTP_HANDLER_PORT}/v1/streaming_load" > /dev/null 2>&1
+curl -H "insert_sql:insert into ontime_streaming_load file_format = (type = 'CSV' skip_header = 1 compression = 'bz2')" -F "upload=@/tmp/ontime_200.csv.bz2" -u root: -XPUT "http://localhost:${QUERY_HTTP_HANDLER_PORT}/v1/streaming_load" > /dev/null 2>&1
echo "select count(1), avg(Year), sum(DayOfWeek) from ontime_streaming_load;" | $MYSQL_CLIENT_CONNECT
echo "truncate table ontime_streaming_load" | $MYSQL_CLIENT_CONNECT

# load parquet
echo "--parquet"
-curl -H "insert_sql:insert into ontime_streaming_load format Parquet" -F "upload=@/tmp/ontime_200.parquet" -u root: -XPUT "http://localhost:${QUERY_HTTP_HANDLER_PORT}/v1/streaming_load" > /dev/null 2>&1
+curl -H "insert_sql:insert into ontime_streaming_load file_format = (type = 'Parquet')" -F "upload=@/tmp/ontime_200.parquet" -u root: -XPUT "http://localhost:${QUERY_HTTP_HANDLER_PORT}/v1/streaming_load" > /dev/null 2>&1
echo "select count(1), avg(Year), sum(DayOfWeek) from ontime_streaming_load;" | $MYSQL_CLIENT_CONNECT
echo "truncate table ontime_streaming_load" | $MYSQL_CLIENT_CONNECT

# load ndjson
echo "--ndjson"
-curl -H "insert_sql:insert into ontime_streaming_load format NdJson" -H "format_skip_header:1" -F "upload=@/tmp/ontime_200.ndjson" -u root: -XPUT "http://localhost:${QUERY_HTTP_HANDLER_PORT}/v1/streaming_load" > /dev/null 2>&1
-echo "select count(1), avg(Year), sum(DayOfWeek) from ontime_streaming_load;" | $MYSQL_CLIENT_CONNECT
-echo "truncate table ontime_streaming_load" | $MYSQL_CLIENT_CONNECT
-
-# load csv using file_format syntax
-echo "--csv using file_format"
-curl -H "insert_sql:insert into ontime_streaming_load file_format = (type = 'CSV' skip_header = 1)" -F "upload=@/tmp/ontime_200.csv" -u root: -XPUT "http://localhost:${QUERY_HTTP_HANDLER_PORT}/v1/streaming_load" > /dev/null 2>&1
+curl -H "insert_sql:insert into ontime_streaming_load file_format = (type = 'NdJson' skip_header = 1)" -F "upload=@/tmp/ontime_200.ndjson" -u root: -XPUT "http://localhost:${QUERY_HTTP_HANDLER_PORT}/v1/streaming_load" > /dev/null 2>&1
echo "select count(1), avg(Year), sum(DayOfWeek) from ontime_streaming_load;" | $MYSQL_CLIENT_CONNECT
echo "truncate table ontime_streaming_load" | $MYSQL_CLIENT_CONNECT

@@ -90,13 +84,13 @@ echo 'CREATE TABLE ontime_less

echo "--parquet less"
-curl -s -H "insert_sql:insert into ontime_less format Parquet" -F "upload=@/tmp/ontime_200.parquet" -u root: -XPUT "http://localhost:${QUERY_HTTP_HANDLER_PORT}/v1/streaming_load" > /dev/null 2>&1
+curl -s -H "insert_sql:insert into ontime_less file_format = (type = 'Parquet')" -F "upload=@/tmp/ontime_200.parquet" -u root: -XPUT "http://localhost:${QUERY_HTTP_HANDLER_PORT}/v1/streaming_load" > /dev/null 2>&1
echo "select count(1), avg(Year), sum(DayOfWeek) from ontime_less;" | $MYSQL_CLIENT_CONNECT

# load parquet with mismatch schema
echo "--parquet mismatch schema"
cat $CURDIR/../ddl/ontime.sql | sed 's/ontime/ontime_test_mismatch/g' | sed 's/DATE/VARCHAR/g' | $MYSQL_CLIENT_CONNECT
-curl -s -H "insert_sql:insert into ontime_test_mismatch format Parquet" -F "upload=@/tmp/ontime_200.parquet" -u root: -XPUT "http://localhost:${QUERY_HTTP_HANDLER_PORT}/v1/streaming_load" | grep -c 'parquet schema mismatch'
+curl -s -H "insert_sql:insert into ontime_test_mismatch file_format = (type = 'Parquet')" -F "upload=@/tmp/ontime_200.parquet" -u root: -XPUT "http://localhost:${QUERY_HTTP_HANDLER_PORT}/v1/streaming_load" | grep -c 'parquet schema mismatch'

echo "drop table ontime_streaming_load;" | $MYSQL_CLIENT_CONNECT

tests/suites/1_stateful/01_load_v2/01_0000_streaming_load_books.sh

Lines changed: 2 additions & 2 deletions
@@ -14,11 +14,11 @@ echo "CREATE TABLE books
);" | $MYSQL_CLIENT_CONNECT

# load csv
-curl -H "insert_sql:insert into books format CSV" -F "upload=@${CURDIR}/books.csv" -u root: -XPUT "http://localhost:${QUERY_HTTP_HANDLER_PORT}/v1/streaming_load" > /dev/null 2>&1
+curl -H "insert_sql:insert into books file_format = (type = 'CSV')" -F "upload=@${CURDIR}/books.csv" -u root: -XPUT "http://localhost:${QUERY_HTTP_HANDLER_PORT}/v1/streaming_load" > /dev/null 2>&1
echo "select count(), count_if(title is null), count_if(author is null), count_if(date is null), count_if(publish_time is null) from books " | $MYSQL_CLIENT_CONNECT

# load tsv
-curl -H "insert_sql:insert into books format TSV" -F "upload=@${CURDIR}/books.tsv" -u root: -XPUT "http://localhost:${QUERY_HTTP_HANDLER_PORT}/v1/streaming_load" > /dev/null 2>&1
+curl -H "insert_sql:insert into books file_format = (type = 'TSV')" -F "upload=@${CURDIR}/books.tsv" -u root: -XPUT "http://localhost:${QUERY_HTTP_HANDLER_PORT}/v1/streaming_load" > /dev/null 2>&1
echo "select count(), count_if(title is null), count_if(author is null), count_if(date is null), count_if(publish_time is null) from books " | $MYSQL_CLIENT_CONNECT

tests/suites/1_stateful/01_load_v2/01_0004_streaming_parquet_int96.sh

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ if [ $? -ne 0 ]; then
fi

# load parquet
-curl -H "insert_sql:insert into mytime format Parquet" -F "upload=@/tmp/mytime.parquet" -u root: -XPUT "http://localhost:${QUERY_HTTP_HANDLER_PORT}/v1/streaming_load" > /dev/null 2>&1
+curl -H "insert_sql:insert into mytime file_format = (type = 'Parquet')" -F "upload=@/tmp/mytime.parquet" -u root: -XPUT "http://localhost:${QUERY_HTTP_HANDLER_PORT}/v1/streaming_load" > /dev/null 2>&1
echo "select * from mytime" | $MYSQL_CLIENT_CONNECT
echo "drop table mytime;" | $MYSQL_CLIENT_CONNECT

tests/suites/1_stateful/01_load_v2/01_0004_streaming_variant_load.sh

Lines changed: 2 additions & 1 deletion
@@ -32,12 +32,13 @@ if [ $? -ne 0 ]; then
fi

# load csv
+# todo(ariesdevil): change to the new syntax when format_quote lands
curl -H "insert_sql:insert into variant_test format Csv" -H "format_skip_header:0" -H 'format_field_delimiter: ,' -H 'format_record_delimiter: \n' -H "format_quote: \'" -F "upload=@/tmp/json_sample1.csv" -u root: -XPUT "http://localhost:${QUERY_HTTP_HANDLER_PORT}/v1/streaming_load" > /dev/null 2>&1
curl -H "insert_sql:insert into variant_test format Csv" -H "format_skip_header:0" -H 'format_field_delimiter: |' -H 'format_record_delimiter: \n' -H "format_quote: \'" -F "upload=@/tmp/json_sample2.csv" -u root: -XPUT "http://localhost:${QUERY_HTTP_HANDLER_PORT}/v1/streaming_load" > /dev/null 2>&1
echo "select * from variant_test order by Id asc;" | $MYSQL_CLIENT_CONNECT

# load ndjson
-curl -H "insert_sql:insert into variant_test2 format NdJson" -H "format_skip_header:0" -F "upload=@/tmp/json_sample.ndjson" -u root: -XPUT "http://localhost:${QUERY_HTTP_HANDLER_PORT}/v1/streaming_load" > /dev/null 2>&1
+curl -H "insert_sql:insert into variant_test2 file_format = (type = 'NdJson' skip_header = 0)" -F "upload=@/tmp/json_sample.ndjson" -u root: -XPUT "http://localhost:${QUERY_HTTP_HANDLER_PORT}/v1/streaming_load" > /dev/null 2>&1
echo "select * from variant_test2 order by b asc;" | $MYSQL_CLIENT_CONNECT

echo "drop table variant_test;" | $MYSQL_CLIENT_CONNECT
