- Java JDK 1.8
- Scala 2.12.10
- Maven 3.8.6
- Python version >= 3.9
- Python package requirements: docopt, requests, flask, openpyxl, pandas, matplotlib, numpy
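The Python dependencies can be installed in one step, for example (versions unpinned; a virtual environment is optional):

```bash
# install the required Python packages listed above
pip install docopt requests flask openpyxl pandas matplotlib numpy
```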
- Preprocessing [optional].
  - Statistics: To generate new statistics (`cost.csv`), we offer the DuckDB-version scripts `query/preprocess.sh` and `query/gen_cost.sh`. Modify the configurations in them and execute the scripts. For the web UI, please move the generated statistics files to the folders `graph/q1/`, `tpch/q2/`, `lsqb/q1/`, `job/1a/`, and `custom/q1/`, respectively; for command-line operation, move them to the corresponding query folders.
  - Plan: We also provide conversion of DuckDB plans. Modify the DuckDB and Python paths in `gen_plan.sh`, then execute the following command. Afterwards, the original DuckDB plan is saved as `db_plan.json`, and the newly generated plan as `plan.json`, which is suitable for our parser. Here `${DB_FILE_PATH}` denotes a persistent DuckDB database. If you want to use the self-defined plan, change the parameter to `timeout=0` in the `requests.post` call at `main.py:223` (see the locating sketch after the command below).
```bash
$ ./gen_plan.sh ${DB_FILE_PATH} ${QUERY_DIRECTORY}
# e.g.
$ ./gen_plan.sh ~/test_db job
```
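Since `main.py` may change over time, the line number 223 can drift. A quick way to locate the call (a sketch using standard `grep`; the argument names in the comment are placeholders, not the actual code):

```bash
# locate the requests.post call whose timeout parameter should become timeout=0
grep -n "requests.post" main.py
# then edit that call so it reads like: requests.post(url, ..., timeout=0)
```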
- We provide two execution modes; the default is web-ui execution. To switch, modify the value of `EXEC_MODE` at Line 767 in `main.py`.
- Execute `main.py` to launch the Python backend rewriter component.

```bash
$ python main.py
```
- Execute the Java backend parser component through the command `java -jar sparksql-plus-web-jar-with-dependencies.jar`. The jar is built from SparkSQLPlus, which is included as a submodule. [Optional] You can also build the jar file yourself.
  - Please use the following commands to initialize and update the submodule.

```bash
$ git submodule init
$ git submodule update [--remote]
```

or

```bash
$ git submodule update --init --recursive
```
- Open the webpage at `http://localhost:8848`.
- Begin submitting queries for execution on the webpage.
- Modify the Python path (`PYTHON_ENV`) in `auto_rewrite.sh`.
- Execute the following command to get the rewritten queries. The rewrite time is shown in `rewrite_time.txt`.
  - OPTIONS
    - Mode: set the code-generation mode, D (DuckDB) / M (MySQL) [default: D]
    - Yannakakis/Yannakakis-Plus: set Y for Yannakakis, N for Yannakakis-Plus [default: N]
```bash
$ bash start_parser.sh
Parser started.
$ ./auto_rewrite.sh ${DDL_NAME} ${QUERY_DIR} [OPTIONS]
# e.g.
$ ./auto_rewrite.sh lsqb lsqb M N
```
- If you want to run a single query, change the code commented `# NOTE: single query keeps here` in the function `init_global_vars` (Lines 587-589 in `main.py`), and comment out the code block labeled `# NOTE: auto-rewrite keeps here` (the code between the two blank lines, Lines 610-629 in `main.py`).
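The line numbers above may drift as `main.py` evolves; the `NOTE` comments are the reliable anchors. A quick way to find both blocks (a sketch using standard `grep`):

```bash
# print the current locations of the two marked code blocks in main.py
grep -n "NOTE: single query keeps here" main.py
grep -n "NOTE: auto-rewrite keeps here" main.py
```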
- Change to the directory where you want to install DuckDB.
- Download the *.zip or *.tar.gz file from https://github.com/duckdb/duckdb/releases/tag/v1.0.0.
- Extract the contents to obtain the `duckdb` executable.
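For example, on Linux x86-64 this might look as follows (the asset name is our assumption from the DuckDB release page; pick the archive matching your platform):

```bash
# download and extract the DuckDB v1.0.0 command-line binary (Linux x86-64)
wget https://github.com/duckdb/duckdb/releases/download/v1.0.0/duckdb_cli-linux-amd64.zip
unzip duckdb_cli-linux-amd64.zip   # yields the "duckdb" executable
./duckdb --version                 # sanity check
```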
- Change to the directory where you want to install PostgreSQL.
- Install PostgreSQL 16.2 according to the instructions at https://www.postgresql.org/download/.
- Create a database `test`. You may use another name for the database.
- Make sure you can access the database via `psql -d {db_name} -U {your_name} -p {your_port}` (without a password).
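For instance, with the PostgreSQL client tools on your `PATH`, creating and checking the database might look like this (port and user are assumptions; adjust to your setup):

```bash
# create the database and verify password-less access (default port assumed)
createdb -p 5432 test
psql -d test -U "$(whoami)" -p 5432 -c '\conninfo'
```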
- Change to the directory where you want to install Spark.
- Download Spark 3.5.1 from https://archive.apache.org/dist/spark/spark-3.5.1/.
- Extract the downloaded package.
- Set the environment variables, adjusting them to your file path.

```bash
export SPARK_HOME="/path/to/spark-3.5.1xxx"
export PATH="${SPARK_HOME}/bin":"${PATH}"
```
- Run `bash download_graph.sh` to download a graph from SNAP. Other input data can also be used, as long as the columns are separated by commas; see the sketch below.
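For reference, a hypothetical input file could look like this (two columns assumed; only the comma separator is required):

```bash
# peek at the first rows of a downloaded or custom edge file (file name assumed)
head -n 3 graph_edges.csv
# 1,2
# 1,3
# 2,4
```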
- Clone the LSQB dataset generation tool from https://github.com/ldbc/lsqb.
- Follow the instructions and generate the data at scale factor 30.
- Alternatively, download from lsqb_30.
- Clone the TPC-H dataset generation tool from https://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp.
- Follow the instructions and generate the data at scale factor 100.
- Alternatively, download from tpch_100.
- Run `bash download_job.sh` to download the JOB data from DuckDB Support.
- Alternatively, download from job_100.
- Make sure you have already created the experiment data in the previous steps.
- Go to the DuckDB installation location and execute `duckdb` to enter the DuckDB environment.
- Load data.
  - Graph data: execute `.open graph_db`. Change `PATH_TO_GRAPH_DATA` to your downloaded graph data and execute the queries in `query/load_graph.sql`.
  - LSQB data: execute `.open lsqb_db`. Change `PATH_TO_LSQB_DATA` to your downloaded LSQB data and execute the queries in `query/load_lsqb.sql`.
  - TPC-H data: execute `.open tpch_db`. Change `PATH_TO_TPCH_DATA` to your downloaded TPC-H data and execute the queries in `query/load_tpch.sql`.
  - JOB data: execute `.open job_db`. Change `PATH_TO_JOB_DATA` to your downloaded JOB data and execute the queries in `query/load_job.sql`.
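Put together, loading the graph dataset might look like the session below (`.open` and `.read` are DuckDB shell meta-commands; remember to replace `PATH_TO_GRAPH_DATA` in the script first):

```bash
# start the DuckDB shell, then run the meta-commands shown as comments
./duckdb
#   .open graph_db               (create/open the persistent database)
#   .read query/load_graph.sql   (execute the load script)
```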
- Make sure you have already created the experiment data in the previous steps.
- Go to the PostgreSQL installation location and execute `postgresql-16.2/bin/psql -p ${port} -d test`.
- Load data: execute the queries in `query/load_graph.sql`, `query/load_lsqb.sql`, `query/load_tpch.sql`, and `query/load_job.sql`.
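Equivalently, each load script can be run non-interactively with `psql -f` (a sketch; database name and port as configured above):

```bash
# run all four load scripts against the test database
for f in load_graph load_lsqb load_tpch load_job; do
  postgresql-16.2/bin/psql -p ${port} -d test -f "query/${f}.sql"
done
```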
- Change the specifications in `query/config.properties` for the DuckDB and PostgreSQL settings and other common settings.
- Execute `query/auto_run_duckdb.sh` to run the DuckDB experiments and `query/auto_run_pg.sh` to run the PostgreSQL experiments.
```bash
# syntax:
./auto_run_duckdb.sh ${DATABASE} ${INPUT_DIRECTORY} [parallelism]
./auto_run_pg.sh ${INPUT_DIRECTORY} [parallelism]

# e.g.:
# Execute queries under graph in database graph with default parallelism with DuckDB
./auto_run_duckdb.sh graph graph
# Execute queries under lsqb in database lsqb with parallelism=72 with DuckDB
./auto_run_duckdb.sh lsqb lsqb 72
# Execute queries under tpch with default parallelism with PostgreSQL
./auto_run_pg.sh tpch
# Execute queries under job with parallelism=72 with PostgreSQL
./auto_run_pg.sh job 72
```
- The queries for parallelism, scale & selectivity are under the `query` directory.
- For parallelism testing, the queries are under `query/parallelism_[lsqb|sgpb]`; set the parallelism via

```bash
./auto_run_duckdb.sh parallelism_[lsqb|sgpb] [1|2|4|8|16|32|48]
```
- Note: DuckDB and PostgreSQL have slightly different syntax.
  - (1.1) DuckDB requires `(XXX) IN (SELECT (XXX))`: the columns in the subquery must be wrapped as a tuple with parentheses.
  - (1.2) PostgreSQL requires `(XXX) IN (SELECT XXX)`: no parentheses around the subquery's SELECT list. The current scripts are written for DuckDB.
  - (1.3) If you plan to run them on PostgreSQL, remove the extra parentheses inside the subquery (i.e., change `(SELECT (XXX))` to `(SELECT XXX)`).
  - (2.1) PostgreSQL does not allow column aliases in `CREATE OR REPLACE VIEW`. Therefore, before every rewritten query, add `DROP VIEW IF EXISTS xxx CASCADE;` to discard any previously created view with the same name. DuckDB has no such limitation.
For details, please refer to the SparkSQLRunner README.
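If you prefer to adapt the generated DuckDB scripts for PostgreSQL automatically, a rough `sed` pass can strip the inner parentheses. This is only a sketch for simple, single-line subqueries, and the file and view names are placeholders; review the output before use:

```bash
# rewrite "(SELECT (XXX" to "(SELECT XXX" for PostgreSQL (simple cases only)
sed -E 's/\(SELECT \(([^)]*)\)/(SELECT \1/g' duckdb_q1.sql > pg_q1.sql
# also prepend a guard before each rewritten query, e.g.:
#   DROP VIEW IF EXISTS my_view CASCADE;
```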
- Execute `./auto_summary.sh ${INPUT_DIR}` or `./auto_summary_job.sh ${INPUT_DIR}` to gather results for queries in `INPUT_DIR`. The generated statistics are `summary_*_statistics[_default].csv`.

```bash
# Gather results for queries under the directories graph, lsqb, tpch, and job
./auto_summary.sh graph
./auto_summary.sh lsqb
./auto_summary.sh tpch
./auto_summary_job.sh job
```
- Execute the scripts under `draw/` to do the plotting; the generated figures are written to `draw/*.pdf`.

```bash
# Generate figures (graph.pdf, lsqb.pdf, tpch.pdf) of running times for SGPB, LSQB, and TPC-H, corresponding to Figure 9.
python3 draw_graph.py
# Generate figures (job_duckdb.pdf, job_postgresql.pdf) of running times for JOB, corresponding to Figure 10.
python3 draw_job.py
# Generate the figure (selectivity_scale.pdf) on selectivity & scale, corresponding to Figure 11.
python3 draw_selectivity.py
# Generate figures (thread1.pdf, thread2.pdf) on parallelism, corresponding to Figure 12.
python3 draw_thread.py
```
- Web-based Interface
- Java Parser Backend
- Python Optimizer & Rewriter Backend
- `./query/[graph|lsqb|tpch|job]`: plans for different DBMSs
- `./query/*.sh`: auto-run scripts
- `./query/*.sql`: data-loading scripts
- `./query/[src|Schema]`: files for auto-running SparkSQL
- `./*.py`: code for the rewriter and optimizer
- `./sparksql-plus-web-jar-with-dependencies.jar`: parser jar file
- For queries like `SELECT DISTINCT ...`, please remove the `DISTINCT` keyword before parsing.
- Use the `jps` command to get the PID of the parser (its process name is `jar`), and then kill it, as shown below.
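For example (`jps` prints one `<pid> <name>` line per JVM process; the PID below is illustrative):

```bash
$ jps
12345 jar
$ kill 12345
# or, in one line:
$ kill "$(jps | awk '$2 == "jar" {print $1}')"
```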