-
See the User's Guide for a discussion of the strong and weak scaling tests. Input files here: https://github.com/firemodels/fds/tree/master/Validation/MPI_Scaling_Tests/FDS_Input_Files
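If it helps anyone getting started, the strong scaling cases are normally run with one MPI process per mesh. A minimal sketch (assuming fds and mpiexec are on PATH and the input file has been downloaded from the repository above; the helper name is mine, not part of FDS):

```python
# Minimal sketch: run one strong scaling case with one MPI process
# per mesh. Assumes fds and mpiexec are on PATH and the input file
# is in the working directory.
import subprocess

def run_strong_scaling(n_meshes: int) -> None:
    case = f"strong_scaling_test_{n_meshes:03d}.fds"
    subprocess.run(["mpiexec", "-n", str(n_meshes), "fds", case], check=True)

run_strong_scaling(8)  # e.g. the 8-mesh case on 8 MPI processes
```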
-
The weak scaling tests take about a minute of CPU. The strong scaling tests take between 10 s and 30 min.
-
Dear all, I just finished running several simulations for the strong scaling test, on several VMs of the available c4 family on Google Cloud. I want to share this first set of tests with you because I am not sure exactly how to interpret the results.

Experimental Setting
For each machine, run strong_scaling_test_F.fds for F in (001, 008, 032, 064, 096). This means the number of MPI processes is always 1, but the number of OpenMP threads will be 2, 4, 8, etc., matching the number of vCPUs of the machine. I am assuming that the problem can only be partitioned into up to F partitions, so we should see no speed-up once the number of OpenMP threads exceeds F. (I am not sure if that is the case, though.) I am also not sure whether the amount of work goes up linearly with the test case (i.e. whether strong_scaling_test_008.fds has 8x more work than strong_scaling_test_001.fds). Overall, 8 machines x 5 input files = 40 runs. The code for running all the tests is shown below (Appendix A), as well as all the results (Appendix B).
Initial Results

Here I am just looking separately at:

File strong_scaling_test_032
The speed-ups seem reasonable as you throw more vCPUs at the same problem, up to the number of "viable partitions" (my interpretation). But I don't really understand why the times go up on the larger machines if partitioning does not go beyond 32 parts (i.e. machines with more capacity would simply not be fully utilised, but there should be no extra overhead).
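For reference, my naive expectation here (my assumption, not something from the Guide) is that the ideal speed-up is capped by the partition count:

$$ S(p) = \frac{t_1}{t_p} \approx \min(p, F) $$

where p is the number of OpenMP threads and F the number of meshes, so times should flatten, not rise, beyond p = F.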
VM c4-standard-32

I am not sure I understand these results, unless the amount of work goes up linearly with the test case (i.e. strong_scaling_test_008.fds has 8x more work than strong_scaling_test_001.fds). If it does, then the numbers show that FDS is using threads really effectively: t_096 < 96 x t_001 / 32.

Initial comments

Overall, I am not sure exactly how to interpret these results. Many thanks!!

Appendix A: Code

For loop over machine types x input files, using our own Inductiva API.
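Since the Inductiva script itself is not reproduced here, below is a rough local stand-in for the same sweep; the vCPU list is my assumption about the c4-standard sizes, and locally the machine sizes are emulated by sweeping OMP_NUM_THREADS (which is how FDS picks up its OpenMP thread count).

```python
# Generic stand-in for the sweep (the actual runs used the Inductiva
# API to launch one VM per c4 machine type; that code is not
# reproduced here). The vCPU list below is an assumption.
import os
import subprocess

CASES = ["001", "008", "032", "064", "096"]
VCPUS = [2, 4, 8, 16, 32, 48, 96, 192]  # 8 machine sizes x 5 cases = 40 runs

for vcpus in VCPUS:
    for case in CASES:
        env = dict(os.environ, OMP_NUM_THREADS=str(vcpus))
        subprocess.run(
            ["mpiexec", "-n", "1", "fds", f"strong_scaling_test_{case}.fds"],
            env=env, check=True,
        )
```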
Appendix B: Dump of all results
Note: failures are related to lack of RAM (these machines have 4 GB per vCPU).

Appendix C: Example Log File
-
I think that these tests are missing the point. The MPI scaling tests are intended to test the MPI functionality, not OpenMP. To test OpenMP, it is best to use a single-mesh input file, because the OpenMP threads will work to speed up the 3-D do-loops. When you apply OpenMP threads to cases with different numbers and dimensions of meshes, you are confusing the speed-up offered by OpenMP with the slowdown caused by looping over loops of diminishing size. I suggest you use the MPI scaling tests for MPI, and the OMP scaling tests for OpenMP.
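For example, an OpenMP sweep on a single-mesh case would keep one MPI process and vary only the thread count. A sketch (the input file name is a placeholder for any one-mesh case):

```python
# Sweep OpenMP threads on a single-mesh case with one MPI process.
# "single_mesh_case.fds" is a placeholder; FDS reads the thread
# count from the OMP_NUM_THREADS environment variable.
import os
import subprocess

for threads in [1, 2, 4, 8, 16, 32]:
    env = dict(os.environ, OMP_NUM_THREADS=str(threads))
    subprocess.run(["mpiexec", "-n", "1", "fds", "single_mesh_case.fds"],
                   env=env, check=True)
```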
-
Dear FDS community:
I am new to FDS and my interest is mostly on the computational side of the model.
I am looking for FDS input files, preferably publicly available, to benchmark the speed-ups one can obtain by running FDS on various cloud machines of several generations, AMD vs Intel, with up to 360 vCPUs. I have been unable to find example inputs that can benefit from throwing more CPUs at the problem: parallelisation depends on mesh division, and there seem to be few publicly available examples with a high degree of mesh partitioning.
Would anyone please point me to such an example?
The results of the benchmark are meant to be public, so I am especially interested in having access to publicly available input files, so that other people can also replicate the results!
I apologize if this sounds like a very stupid request but, again, my interest is mostly computational/infrastructural.
Thank you so much in advance!
Best,
L