- Java JDK 1.8
- Scala 2.12.10
- Maven 3.8.6
- Python version >= 3.9
- Python package requirements: docopt, requests, flask, openpyxl, pandas, matplotlib, numpy
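The Python dependencies can be installed in one step, for example (versions unpinned; a virtual environment is optional):

```bash
# install the required Python packages listed above
pip install docopt requests flask openpyxl pandas matplotlib numpy
```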
- Preprocessing [optional].
  - Statistics: To generate new statistics (`cost.csv`), we offer the DuckDB-version scripts `query/preprocess.sh` and `query/gen_cost.sh`. Modify the configurations in them and execute the scripts. For the web UI, please move the generated statistics files to the folders `graph/q1/`, `tpch/q2/`, `lsqb/q1/`, `job/1a/`, and `custom/q1/`, respectively; for command-line operation, move them to the corresponding query folders.
  - Plan: We also provide conversion of DuckDB plans. Modify the DuckDB and Python paths in `gen_plan.sh`, then execute the following command. Afterwards, the original DuckDB plan is saved as `db_plan.json`, and the newly generated plan as `plan.json`, which is suitable for our parser. Here `${DB_FILE_PATH}` denotes a persistent DuckDB database. If you want to use the self-defined plan, change the parameter to `timeout=0` in the `requests.post` call at `main.py:223` (see the locating sketch after the command below).
```bash
$ ./gen_plan.sh ${DB_FILE_PATH} ${QUERY_DIRECTORY}
# e.g.
$ ./gen_plan.sh ~/test_db job
```
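Since `main.py` may change over time, the line number 223 can drift. A quick way to locate the call (a sketch using standard `grep`; the argument names in the comment are placeholders, not the actual code):

```bash
# locate the requests.post call whose timeout parameter should become timeout=0
grep -n "requests.post" main.py
# then edit that call so it reads like: requests.post(url, ..., timeout=0)
```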
- We provide two execution modes; the default is web-ui execution. To switch, modify the value of `EXEC_MODE` at Line 767 in `main.py`.
- Execute `main.py` to launch the Python backend rewriter component.

```bash
$ python main.py
```
- Execute the Java backend parser component through the command `java -jar sparksql-plus-web-jar-with-dependencies.jar`. The jar is built from SparkSQLPlus, which is included as a submodule. [Optional] You can also build the jar file yourself.
  - Please use the following commands to initialize and update the submodule.

```bash
$ git submodule init
$ git submodule update [--remote]
```

or

```bash
$ git submodule update --init --recursive
```
- Open the webpage at `http://localhost:8848`.
- Begin submitting queries for execution on the webpage.
- Modify the Python path (`PYTHON_ENV`) in `auto_rewrite.sh`.
- Execute the following command to get the rewritten queries. The rewrite time is shown in `rewrite_time.txt`.
  - OPTIONS
    - Mode: set the code-generation mode, D (DuckDB) / M (MySQL) [default: D]
    - Yannakakis/Yannakakis-Plus: set Y for Yannakakis, N for Yannakakis-Plus [default: N]
```bash
$ bash start_parser.sh
Parser started.
$ ./auto_rewrite.sh ${DDL_NAME} ${QUERY_DIR} [OPTIONS]
# e.g.
$ ./auto_rewrite.sh lsqb lsqb M N
```
- If you want to run a single query, change the code commented `# NOTE: single query keeps here` in the function `init_global_vars` (Lines 587-589 in `main.py`), and comment out the code block labeled `# NOTE: auto-rewrite keeps here` (the code between the two blank lines, Lines 610-629 in `main.py`).
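The line numbers above may drift as `main.py` evolves; the `NOTE` comments are the reliable anchors. A quick way to find both blocks (a sketch using standard `grep`):

```bash
# print the current locations of the two marked code blocks in main.py
grep -n "NOTE: single query keeps here" main.py
grep -n "NOTE: auto-rewrite keeps here" main.py
```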
- Change to the directory where you want to install DuckDB.
- Download the *.zip or *.tar.gz file from https://github.com/duckdb/duckdb/releases/tag/v1.0.0.
- Extract the contents to obtain the `duckdb` executable.
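For example, on Linux x86-64 this might look as follows (the asset name is our assumption from the DuckDB release page; pick the archive matching your platform):

```bash
# download and extract the DuckDB v1.0.0 command-line binary (Linux x86-64)
wget https://github.com/duckdb/duckdb/releases/download/v1.0.0/duckdb_cli-linux-amd64.zip
unzip duckdb_cli-linux-amd64.zip   # yields the "duckdb" executable
./duckdb --version                 # sanity check
```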
- Change to the directory where you want to install PostgreSQL.
- Install PostgreSQL 16.2 according to the instructions at https://www.postgresql.org/download/.
- Create a database `test`. You may use another name for the database.
- Make sure you can access the database via `psql -d {db_name} -U {your_name} -p {your_port}` (without a password).
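For instance, with the PostgreSQL client tools on your `PATH`, creating and checking the database might look like this (port and user are assumptions; adjust to your setup):

```bash
# create the database and verify password-less access (default port assumed)
createdb -p 5432 test
psql -d test -U "$(whoami)" -p 5432 -c '\conninfo'
```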
- Change to the directory where you want to install Spark.
- Download Spark 3.5.1 from https://archive.apache.org/dist/spark/spark-3.5.1/.
- Extract the downloaded package.
- Set the environment variables, adjusting them to your file path.

```bash
export SPARK_HOME="/path/to/spark-3.5.1xxx"
export PATH="${SPARK_HOME}/bin":"${PATH}"
```
- Run `bash download_graph.sh` to download a graph from SNAP. Other input data can also be used, as long as the columns are separated by commas; see the sketch below.
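For reference, a hypothetical input file could look like this (two columns assumed; only the comma separator is required):

```bash
# peek at the first rows of a downloaded or custom edge file (file name assumed)
head -n 3 graph_edges.csv
# 1,2
# 1,3
# 2,4
```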
- Clone the LSQB dataset generation tool from https://github.com/ldbc/lsqb.
- Follow the instructions and generate the data at scale factor 30.
- Alternatively, download from lsqb_30.
- Clone the TPC-H dataset generation tool from https://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp.
- Follow the instructions and generate the data at scale factor 100.
- Alternatively, download from tpch_100.
- Run `bash download_job.sh` to download the JOB data from DuckDB Support.
- Alternatively, download from job_100.
- Make sure you have already created the experiment data in the previous steps.
- Go to the DuckDB installation location and execute `duckdb` to enter the DuckDB environment.
- Load data.
  - Graph data: execute `.open graph_db`. Change `PATH_TO_GRAPH_DATA` to your downloaded graph data and execute the queries in `query/load_graph.sql`.
  - LSQB data: execute `.open lsqb_db`. Change `PATH_TO_LSQB_DATA` to your downloaded LSQB data and execute the queries in `query/load_lsqb.sql`.
  - TPC-H data: execute `.open tpch_db`. Change `PATH_TO_TPCH_DATA` to your downloaded TPC-H data and execute the queries in `query/load_tpch.sql`.
  - JOB data: execute `.open job_db`. Change `PATH_TO_JOB_DATA` to your downloaded JOB data and execute the queries in `query/load_job.sql`.
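Put together, loading the graph dataset might look like the session below (`.open` and `.read` are DuckDB shell meta-commands; remember to replace `PATH_TO_GRAPH_DATA` in the script first):

```bash
# start the DuckDB shell, then run the meta-commands shown as comments
./duckdb
#   .open graph_db               (create/open the persistent database)
#   .read query/load_graph.sql   (execute the load script)
```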
- Make sure you have already created the experiment data in the previous steps.
- Go to the PostgreSQL installation location and execute `postgresql-16.2/bin/psql -p ${port} -d test`.
- Load data: execute the queries in `query/load_graph.sql`, `query/load_lsqb.sql`, `query/load_tpch.sql`, and `query/load_job.sql`.
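Equivalently, each load script can be run non-interactively with `psql -f` (a sketch; database name and port as configured above):

```bash
# run all four load scripts against the test database
for f in load_graph load_lsqb load_tpch load_job; do
  postgresql-16.2/bin/psql -p ${port} -d test -f "query/${f}.sql"
done
```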
- Change the specifications in `query/config.properties` for the DuckDB and PostgreSQL settings and other common settings.
- Execute `query/auto_run_duckdb.sh` to run the DuckDB experiments and `query/auto_run_pg.sh` to run the PostgreSQL experiments.
```bash
# syntax:
./auto_run_duckdb.sh ${DATABASE} ${INPUT_DIRECTORY} [parallelism]
./auto_run_pg.sh ${INPUT_DIRECTORY} [parallelism]

# e.g.:
# Execute queries under graph in database graph with default parallelism with DuckDB
./auto_run_duckdb.sh graph graph
# Execute queries under lsqb in database lsqb with parallelism=72 with DuckDB
./auto_run_duckdb.sh lsqb lsqb 72
# Execute queries under tpch with default parallelism with PostgreSQL
./auto_run_pg.sh tpch
# Execute queries under job with parallelism=72 with PostgreSQL
./auto_run_pg.sh job 72
```
- The queries for parallelism, scale & selectivity are under the `query` directory.
- For parallelism testing, the queries are under `query/parallelism_[lsqb|sgpb]`; set the parallelism via

```bash
./auto_run_duckdb.sh parallelism_[lsqb|sgpb] [1|2|4|8|16|32|48]
```
- Note: DuckDB and PostgreSQL have slightly different syntax.
  - (1.1) DuckDB requires `(XXX) IN (SELECT (XXX))`: the columns in the subquery must be wrapped as a tuple with parentheses.
  - (1.2) PostgreSQL requires `(XXX) IN (SELECT XXX)`: no parentheses around the subquery's SELECT list. The current scripts are written for DuckDB.
  - (1.3) If you plan to run them on PostgreSQL, remove the extra parentheses inside the subquery (i.e., change `(SELECT (XXX))` to `(SELECT XXX)`).
  - (2.1) PostgreSQL does not allow column aliases in `CREATE OR REPLACE VIEW`. Therefore, before every rewritten query, add `DROP VIEW IF EXISTS xxx CASCADE;` to discard any previously created view with the same name. DuckDB has no such limitation.
For details, please refer to the SparkSQLRunner README.
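If you prefer to adapt the generated DuckDB scripts for PostgreSQL automatically, a rough `sed` pass can strip the inner parentheses. This is only a sketch for simple, single-line subqueries, and the file and view names are placeholders; review the output before use:

```bash
# rewrite "(SELECT (XXX" to "(SELECT XXX" for PostgreSQL (simple cases only)
sed -E 's/\(SELECT \(([^)]*)\)/(SELECT \1/g' duckdb_q1.sql > pg_q1.sql
# also prepend a guard before each rewritten query, e.g.:
#   DROP VIEW IF EXISTS my_view CASCADE;
```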
- Execute `./auto_summary.sh ${INPUT_DIR}` or `./auto_summary_job.sh ${INPUT_DIR}` to gather results for queries in `INPUT_DIR`. The generated statistics are `summary_*_statistics[_default].csv`.

```bash
# Gather results for queries under the directories graph, lsqb, tpch, and job
./auto_summary.sh graph
./auto_summary.sh lsqb
./auto_summary.sh tpch
./auto_summary_job.sh job
```
- Execute the scripts under `draw/` to do the plotting; the generated figures are written to `draw/*.pdf`.

```bash
# Generate figures (graph.pdf, lsqb.pdf, tpch.pdf) of running times for SGPB, LSQB, and TPC-H, corresponding to Figure 9.
python3 draw_graph.py
# Generate figures (job_duckdb.pdf, job_postgresql.pdf) of running times for JOB, corresponding to Figure 10.
python3 draw_job.py
# Generate the figure (selectivity_scale.pdf) on selectivity & scale, corresponding to Figure 11.
python3 draw_selectivity.py
# Generate figures (thread1.pdf, thread2.pdf) on parallelism, corresponding to Figure 12.
python3 draw_thread.py
```
- Web-based Interface
- Java Parser Backend
- Python Optimizer & Rewriter Backend
- `./query/[graph|lsqb|tpch|job]`: plans for different DBMSs
- `./query/*.sh`: auto-run scripts
- `./query/*.sql`: data-loading scripts
- `./query/[src|Schema]`: files for auto-running SparkSQL
- `./*.py`: code for the rewriter and optimizer
- `./sparksql-plus-web-jar-with-dependencies.jar`: parser jar file
- For queries like `SELECT DISTINCT ...`, please remove the `DISTINCT` keyword before parsing.
- Use the `jps` command to get the PID of the parser (its process name is `jar`), and then kill it, as shown below.
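For example (`jps` prints one `<pid> <name>` line per JVM process; the PID below is illustrative):

```bash
$ jps
12345 jar
$ kill 12345
# or, in one line:
$ kill "$(jps | awk '$2 == "jar" {print $1}')"
```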