Commit 154b208

chore: Add benchmarking scripts (#2025)
1 parent b2ed0ed commit 154b208

19 files changed: 781 additions, 154 deletions

README.md

Lines changed: 1 addition & 1 deletion
@@ -48,7 +48,7 @@ The following chart shows the time it takes to run the 22 TPC-H queries against
 using a single executor with 8 cores. See the [Comet Benchmarking Guide](https://datafusion.apache.org/comet/contributor-guide/benchmarking.html)
 for details of the environment used for these benchmarks.
 
-When using Comet, the overall run time is reduced from 616 seconds to 275 seconds, a 2.2x speedup.
+When using Comet, the overall run time is reduced from 652 seconds to 268 seconds, a 2.4x speedup.
 
 ![](docs/source/_static/images/benchmark-results/0.9.0/tpch_allqueries.png)
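
The speedup figure in the changed README line is the ratio of the two total run times. A quick check, using only the numbers quoted in the diff (the function name `speedup` is illustrative and not part of the commit):

```python
def speedup(baseline_seconds: float, comparison_seconds: float) -> float:
    """Return how many times faster the comparison run is than the baseline."""
    if comparison_seconds <= 0:
        raise ValueError("comparison_seconds must be positive")
    return baseline_seconds / comparison_seconds


# Total run times quoted in the README before and after this commit.
print(f"{speedup(616, 275):.1f}x")  # 2.2x
print(f"{speedup(652, 268):.1f}x")  # 2.4x
```

Both the old and the new sentence are consistent with their own totals when rounded to one decimal place.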

dev/benchmarks/.gitignore

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+*.json
+*.png

dev/benchmarks/README.md

Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Comet Benchmarking Scripts
+
+This directory contains scripts used for generating benchmark results that are published in this repository and in
+the Comet documentation.
+
+## Example usage
+
+Set Spark environment variables:
+
+```shell
+export SPARK_HOME=/opt/spark-3.5.3-bin-hadoop3/
+export SPARK_MASTER=spark://yourhostname:7077
+```
+
+Set path to queries and data:
+
+```shell
+export TPCH_QUERIES=/mnt/bigdata/tpch/queries/
+export TPCH_DATA=/mnt/bigdata/tpch/sf100/
+```
+
+Run Spark benchmark:
+
+```shell
+export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
+sudo ./drop-caches.sh
+./spark-tpch.sh
+```
+
+Run Comet benchmark:
+
+```shell
+export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
+export COMET_JAR=/opt/comet/comet-spark-spark3.5_2.12-0.9.0.jar
+sudo ./drop-caches.sh
+./comet-tpch.sh
+```
+
+Run Gluten benchmark:
+
+```shell
+export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
+export GLUTEN_JAR=/opt/gluten/gluten-velox-bundle-spark3.5_2.12-linux_amd64-1.4.0.jar
+sudo ./drop-caches.sh
+./gluten-tpch.sh
+```
+
+Generating charts:
+
+```shell
+python3 generate-comparison.py --benchmark tpch --labels "Spark 3.5.3" "Comet 0.9.0" "Gluten 1.4.0" --title "TPC-H @ 100 GB (single executor, 8 cores, local Parquet files)" spark-tpch-1752338506381.json comet-tpch-1752337818039.json gluten-tpch-1752337474344.json
+```
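
The chart command in the README above takes result files named like `spark-tpch-1752338506381.json`. The following is a minimal sketch for picking the most recent result file per engine before calling `generate-comparison.py`. It assumes the naming pattern is `<name>-<benchmark>-<number>.json` and that a larger number means a newer run; both are inferred from the example file names only, and `latest_result` is a hypothetical helper that is not part of the commit.

```python
from pathlib import Path
from typing import Optional


def latest_result(directory: str, name: str, benchmark: str) -> Optional[Path]:
    """Return the result file with the largest numeric suffix, or None if there is none.

    Assumption: files are named <name>-<benchmark>-<number>.json.
    """
    prefix = f"{name}-{benchmark}-"
    candidates = []
    for path in Path(directory).glob(f"{prefix}*.json"):
        suffix = path.stem[len(prefix):]
        if suffix.isdigit():  # skip files that do not end in a number
            candidates.append((int(suffix), path))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]


# Look in the current directory, where the scripts write their output (--output .).
for engine in ("spark", "comet", "gluten"):
    print(engine, latest_result(".", engine, "tpch"))
```

The returned paths can then be passed as the positional arguments of the chart command, in the same order as the `--labels` values.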

dev/benchmarks/comet-tpch.sh

Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
+#!/bin/bash
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+
+$SPARK_HOME/sbin/stop-master.sh
+$SPARK_HOME/sbin/stop-worker.sh
+
+$SPARK_HOME/sbin/start-master.sh
+$SPARK_HOME/sbin/start-worker.sh $SPARK_MASTER
+
+$SPARK_HOME/bin/spark-submit \
+    --master $SPARK_MASTER \
+    --jars $COMET_JAR \
+    --driver-class-path $COMET_JAR \
+    --conf spark.driver.memory=8G \
+    --conf spark.executor.instances=1 \
+    --conf spark.executor.cores=8 \
+    --conf spark.cores.max=8 \
+    --conf spark.executor.memory=16g \
+    --conf spark.memory.offHeap.enabled=true \
+    --conf spark.memory.offHeap.size=16g \
+    --conf spark.eventLog.enabled=true \
+    --conf spark.driver.extraClassPath=$COMET_JAR \
+    --conf spark.executor.extraClassPath=$COMET_JAR \
+    --conf spark.plugins=org.apache.spark.CometPlugin \
+    --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
+    --conf spark.comet.exec.replaceSortMergeJoin=true \
+    --conf spark.comet.cast.allowIncompatible=true \
+    tpcbench.py \
+    --name comet \
+    --benchmark tpch \
+    --data $TPCH_DATA \
+    --queries $TPCH_QUERIES \
+    --output . \
+    --iterations 1
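
The script above relies on five environment variables (`SPARK_HOME`, `SPARK_MASTER`, `COMET_JAR`, `TPCH_DATA`, `TPCH_QUERIES`) and fails in less obvious ways if one is unset. As a rough sketch, the same `spark-submit` invocation can be assembled in Python with an explicit check up front. The configuration keys and values are copied from the script; the function name `build_comet_command` and the sample paths are hypothetical.

```python
import os
import shlex
from typing import Dict, List

REQUIRED_VARS = ("SPARK_HOME", "SPARK_MASTER", "COMET_JAR", "TPCH_DATA", "TPCH_QUERIES")


def build_comet_command(env: Dict[str, str]) -> List[str]:
    """Build the spark-submit argument list used by comet-tpch.sh."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ValueError("missing environment variables: " + ", ".join(missing))

    jar = env["COMET_JAR"]
    conf = {
        "spark.driver.memory": "8G",
        "spark.executor.instances": "1",
        "spark.executor.cores": "8",
        "spark.cores.max": "8",
        "spark.executor.memory": "16g",
        "spark.memory.offHeap.enabled": "true",
        "spark.memory.offHeap.size": "16g",
        "spark.eventLog.enabled": "true",
        "spark.driver.extraClassPath": jar,
        "spark.executor.extraClassPath": jar,
        "spark.plugins": "org.apache.spark.CometPlugin",
        "spark.shuffle.manager": "org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager",
        "spark.comet.exec.replaceSortMergeJoin": "true",
        "spark.comet.cast.allowIncompatible": "true",
    }

    cmd = [os.path.join(env["SPARK_HOME"], "bin", "spark-submit"),
           "--master", env["SPARK_MASTER"],
           "--jars", jar,
           "--driver-class-path", jar]
    for key, value in conf.items():
        cmd += ["--conf", f"{key}={value}"]
    cmd += ["tpcbench.py", "--name", "comet", "--benchmark", "tpch",
            "--data", env["TPCH_DATA"], "--queries", env["TPCH_QUERIES"],
            "--output", ".", "--iterations", "1"]
    return cmd


# Sample values for illustration only; real runs would pass dict(os.environ).
sample_env = {
    "SPARK_HOME": "/opt/spark-3.5.3-bin-hadoop3/",
    "SPARK_MASTER": "spark://yourhostname:7077",
    "COMET_JAR": "/opt/comet/comet-spark-spark3.5_2.12-0.9.0.jar",
    "TPCH_DATA": "/mnt/bigdata/tpch/sf100/",
    "TPCH_QUERIES": "/mnt/bigdata/tpch/queries/",
}
print(shlex.join(build_comet_command(sample_env)))
```

The sketch only builds and prints the command; it does not restart the Spark master and worker as the shell script does.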

dev/benchmarks/drop-caches.sh

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+#!/bin/bash
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+
+echo 1 > /proc/sys/vm/drop_caches
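
Writing `1` to `/proc/sys/vm/drop_caches` asks the Linux kernel to free the page cache, so that each benchmark run starts from cold file reads. Dirty pages are not freed, so a common variant flushes them with `sync` first. The following Python sketch shows that variant; the `sync` step and the function name `drop_page_cache` are additions for illustration and are not in the commit.

```python
import os


def drop_page_cache(control_file: str = "/proc/sys/vm/drop_caches") -> None:
    """Flush dirty pages, then ask the kernel to drop the page cache.

    Writing to the default control file requires root privileges on Linux.
    """
    if hasattr(os, "sync"):  # os.sync is only available on Unix
        os.sync()
    with open(control_file, "w") as f:
        f.write("1\n")
```

Calling `drop_page_cache()` with no arguments as root has the same effect as `sudo ./drop-caches.sh`, plus the preceding flush. The `control_file` parameter exists only so the function can be exercised against an ordinary file.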

dev/benchmarks/generate-comparison.py

Lines changed: 229 additions & 0 deletions
@@ -0,0 +1,229 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import argparse
+import json
+import matplotlib.pyplot as plt
+import numpy as np
+
+def geomean(data):
+    return np.prod(data) ** (1 / len(data))
+
+def generate_query_rel_speedup_chart(baseline, comparison, label1: str, label2: str, benchmark: str, title: str):
+    results = []
+    for query in range(1, query_count(benchmark)+1):
+        if query == 999:
+            continue
+        a = np.median(np.array(baseline[str(query)]))
+        b = np.median(np.array(comparison[str(query)]))
+        if a > b:
+            speedup = a/b-1
+        else:
+            speedup = -(1/(a/b)-1)
+        results.append(("q" + str(query), round(speedup*100, 0)))
+
+    results = sorted(results, key=lambda x: -x[1])
+
+    queries, speedups = zip(*results)
+
+    # Create figure and axis
+    if benchmark == "tpch":
+        fig, ax = plt.subplots(figsize=(10, 6))
+    else:
+        fig, ax = plt.subplots(figsize=(35, 10))
+
+    # Create bar chart
+    bars = ax.bar(queries, speedups, color='skyblue')
+
+    # Add text annotations
+    for bar, speedup in zip(bars, speedups):
+        yval = bar.get_height()
+        if yval >= 0:
+            ax.text(bar.get_x() + bar.get_width() / 2.0, min(800, yval+5), f'{yval:.0f}%', va='bottom', ha='center', fontsize=8,
+                    color='blue', rotation=90)
+        else:
+            ax.text(bar.get_x() + bar.get_width() / 2.0, yval, f'{yval:.0f}%', va='top', ha='center', fontsize=8,
+                    color='blue', rotation=90)
+
+    # Add title and labels
+    ax.set_title(label2 + " speedup over " + label1 + " (" + title + ")")
+    ax.set_ylabel('Speedup Percentage (100% speedup = 2x faster)')
+    ax.set_xlabel('Query')
+
+    # Customize the y-axis to handle both positive and negative values better
+    ax.axhline(0, color='black', linewidth=0.8)
+    min_value = (min(speedups) // 100) * 100
+    max_value = ((max(speedups) // 100) + 1) * 100 + 50
+    if benchmark == "tpch":
+        ax.set_ylim(min_value, max_value)
+    else:
+        # TODO improve this
+        ax.set_ylim(-250, 300)
+
+    # Show grid for better readability
+    ax.yaxis.grid(True)
+
+    # Save the plot as an image file
+    plt.savefig(f'{benchmark}_queries_speedup_rel.png', format='png')
+
+def generate_query_abs_speedup_chart(baseline, comparison, label1: str, label2: str, benchmark: str, title: str):
+    results = []
+    for query in range(1, query_count(benchmark)+1):
+        if query == 999:
+            continue
+        a = np.median(np.array(baseline[str(query)]))
+        b = np.median(np.array(comparison[str(query)]))
+        speedup = a-b
+        results.append(("q" + str(query), round(speedup, 1)))
+
+    results = sorted(results, key=lambda x: -x[1])
+
+    queries, speedups = zip(*results)
+
+    # Create figure and axis
+    if benchmark == "tpch":
+        fig, ax = plt.subplots(figsize=(10, 6))
+    else:
+        fig, ax = plt.subplots(figsize=(35, 10))
+
+    # Create bar chart
+    bars = ax.bar(queries, speedups, color='skyblue')
+
+    # Add text annotations
+    for bar, speedup in zip(bars, speedups):
+        yval = bar.get_height()
+        if yval >= 0:
+            ax.text(bar.get_x() + bar.get_width() / 2.0, min(800, yval+5), f'{yval:.1f}', va='bottom', ha='center', fontsize=8,
+                    color='blue', rotation=90)
+        else:
+            ax.text(bar.get_x() + bar.get_width() / 2.0, yval, f'{yval:.1f}', va='top', ha='center', fontsize=8,
+                    color='blue', rotation=90)
+
+    # Add title and labels
+    ax.set_title(label2 + " speedup over " + label1 + " (" + title + ")")
+    ax.set_ylabel('Speedup (in seconds)')
+    ax.set_xlabel('Query')
+
+    # Customize the y-axis to handle both positive and negative values better
+    ax.axhline(0, color='black', linewidth=0.8)
+    min_value = min(speedups) * 2 - 20
+    max_value = max(speedups) * 1.5
+    ax.set_ylim(min_value, max_value)
+
+    # Show grid for better readability
+    ax.yaxis.grid(True)
+
+    # Save the plot as an image file
+    plt.savefig(f'{benchmark}_queries_speedup_abs.png', format='png')
+
+def generate_query_comparison_chart(results, labels, benchmark: str, title: str):
+    queries = []
+    benches = []
+    for _ in results:
+        benches.append([])
+    for query in range(1, query_count(benchmark)+1):
+        if query == 999:
+            continue
+        queries.append("q" + str(query))
+        for i in range(0, len(results)):
+            benches[i].append(np.median(np.array(results[i][str(query)])))
+
+    # Define the width of the bars
+    bar_width = 0.3
+
+    # Define the positions of the bars on the x-axis
+    index = np.arange(len(queries)) * 1.5
+
+    # Create a bar chart
+    if benchmark == "tpch":
+        fig, ax = plt.subplots(figsize=(15, 6))
+    else:
+        fig, ax = plt.subplots(figsize=(35, 6))
+
+    for i in range(0, len(results)):
+        bar = ax.bar(index + i * bar_width, benches[i], bar_width, label=labels[i])
+
+    # Add labels, title, and legend
+    ax.set_title(title)
+    ax.set_xlabel('Queries')
+    ax.set_ylabel('Query Time (seconds)')
+    ax.set_xticks(index + bar_width / 2)
+    ax.set_xticklabels(queries)
+    ax.legend()
+
+    # Save the plot as an image file
+    plt.savefig(f'{benchmark}_queries_compare.png', format='png')
+
+def generate_summary(results, labels, benchmark: str, title: str):
+    timings = []
+    for _ in results:
+        timings.append(0)
+
+    num_queries = query_count(benchmark)
+    for query in range(1, num_queries + 1):
+        if query == 999:
+            continue
+        for i in range(0, len(results)):
+            timings[i] += np.median(np.array(results[i][str(query)]))
+
+    # Create figure and axis
+    fig, ax = plt.subplots()
+    fig.set_size_inches(10, 6)
+
+    # Add title and labels
+    ax.set_title(title)
+    ax.set_ylabel(f'Time in seconds to run all {num_queries} {benchmark} queries (lower is better)')
+
+    times = [round(x,0) for x in timings]
+
+    # Create bar chart
+    bars = ax.bar(labels, times, color='skyblue', width=0.8)
+
+    # Add text annotations
+    for bar in bars:
+        yval = bar.get_height()
+        ax.text(bar.get_x() + bar.get_width() / 2.0, yval, f'{yval}', va='bottom')  # va: vertical alignment
+
+    plt.savefig(f'{benchmark}_allqueries.png', format='png')
+
+def query_count(benchmark: str):
+    if benchmark == "tpch":
+        return 22
+    elif benchmark == "tpcds":
+        return 99
+    else:
+        raise ValueError("invalid benchmark name")
+
+def main(files, labels, benchmark: str, title: str):
+    results = []
+    for filename in files:
+        with open(filename) as f:
+            results.append(json.load(f))
+    generate_summary(results, labels, benchmark, title)
+    generate_query_comparison_chart(results, labels, benchmark, title)
+    if len(files) == 2:
+        generate_query_abs_speedup_chart(results[0], results[1], labels[0], labels[1], benchmark, title)
+        generate_query_rel_speedup_chart(results[0], results[1], labels[0], labels[1], benchmark, title)
+
+if __name__ == '__main__':
+    argparse = argparse.ArgumentParser(description='Generate comparison')
+    argparse.add_argument('filenames', nargs='+', type=str, help='JSON result files')
+    argparse.add_argument('--labels', nargs='+', type=str, help='Labels')
+    argparse.add_argument('--benchmark', type=str, help='Benchmark name (tpch or tpcds)')
+    argparse.add_argument('--title', type=str, help='Chart title')
+    args = argparse.parse_args()
+    main(args.filenames, args.labels, args.benchmark, args.title)
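
The relative speedup chart in `generate-comparison.py` uses a symmetric percentage: a query that runs twice as fast scores +100%, and one that runs twice as slow scores -100%. The sketch below isolates that calculation so it can be tried without matplotlib. It uses `statistics.median` from the standard library in place of `np.median`, the function name `relative_speedup_percent` is illustrative, and the timings are made up for the example rather than taken from any benchmark run.

```python
from statistics import median
from typing import Sequence


def relative_speedup_percent(baseline: Sequence[float], comparison: Sequence[float]) -> float:
    """Symmetric speedup of comparison over baseline, in percent.

    Each argument holds the timings (in seconds) recorded for one query.
    """
    a = median(baseline)
    b = median(comparison)
    if a > b:
        return (a / b - 1) * 100   # comparison is faster: positive value
    return -(b / a - 1) * 100      # comparison is slower or equal: zero or negative


# Made-up timings, keyed by query number as a string (the way the script
# indexes its JSON result files).
baseline_results = {"1": [10.0, 12.0, 11.0], "2": [4.0]}
comparison_results = {"1": [5.5], "2": [8.0]}

for query in ("1", "2"):
    pct = relative_speedup_percent(baseline_results[query], comparison_results[query])
    print(f"q{query}: {pct:.0f}%")
# q1: 100%
# q2: -100%
```

Using the reciprocal ratio for slowdowns keeps the chart balanced: without it, a query that is twice as slow would show as only -50%.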
