Description
I am currently trying to run SENSEI with a simulation where I use ADIOS2 for collecting the data on a separate node and running the visualizations there.
I noticed, however, that the memory usage on the receiving node was increasing with every timestep. The case I simulated was not gigantic: the files written by PosthocIO were about 10GB per timestep, which is also roughly the amount by which the memory consumption increased with every timestep. After a few dozen steps the Endpoint crashed because no more memory could be allocated.
As far as I can tell this only happens on the receiving node. When I tried it before with visualizing on the simulating nodes I saw no concerning leaks whatsoever. In the case outlined above the increasing memory was also only visible on the receiving node, the 8 simulating nodes were pretty much constant.
I created a small test example with the oscillator miniapp:
example.slurm:
```bash
#!/bin/bash -x
#SBATCH --job-name=example
#SBATCH --nodes=5
#SBATCH --ntasks-per-node=8
#SBATCH --time=00:30:00
#SBATCH --cpus-per-task=6
...
# Loading modules
...
module load sensei/4.1.0-adios2-catalyst-5.10.1
export PROFILER_ENABLE=1
export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
# Starting Simulation:
rm -rf info.sst
srun -N4 -n32 --cpu-bind=verbose oscillator -b 65536 -s "9999999 9999999 9999999" -p 100 --t-end 3 -j 6 -f transport.xml --sync periodic-3772.osc &
# Starting SENSEI Endpoint
srun -N1 -n1 --cpu-bind=verbose SENSEIEndPoint -t transport.xml -a analysis.xml &> "mpi-$SLURM_JOB_ID.endpoint" &
wait
rm -rf info.sst
```
transport.xml:
```xml
<sensei>
  <transport type="adios2" enabled="1" engine="SST" filename="info" frequency="1">
    <engine_parameters>
      verbose = 5
      RendezvousReaderCount = 1
      RegistrationMethod = File
      OpenTimeoutSecs = 300
      <!--DataTransport = RDMA-->
    </engine_parameters>
    <mesh name="mesh">
      <cell_arrays>data</cell_arrays>
    </mesh>
  </transport>
</sensei>
```
analysis.xml:
```xml
<sensei>
  <analysis type="PosthocIO" enabled="1" frequency="1" output_dir="./posthocIO" file_name="output" mode="paraview">
    <mesh name="mesh">
      <cell_arrays>data</cell_arrays>
    </mesh>
  </analysis>
  <analysis type="catalyst" enabled="1" frequency="1" pipeline="slice" array="data" association="cell" image-filename="./datasets/slice-%ts.png" image-width="1920" image-height="1080" />
</sensei>
```
I made the size of the oscillator inputs pretty large so you would actually be able to see if the memory would increase.
Running this setup, our job reporting shows the following memory consumption for the Endpoint node:
In this case the memory usage increased by around 168GB over the complete 30 minutes, and the Endpoint received (according to the log) 344 timesteps. So the leaked memory should be about 500MB per timestep. (The steps you see in the graph are just an artifact of the job reporting's sampling rate, about once every 1-2 minutes, and not the actual consumption per timestep!)
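As a quick sanity check on that estimate, here is a small Python sketch of the arithmetic. The function names are made up for illustration, and two things are assumptions on my part: that 1GB is counted as 1000MB, and that the 344 timesteps were spread evenly over the 30-minute run.

```python
def leak_per_step_mb(total_growth_gb: float, steps: int, mb_per_gb: float = 1000.0) -> float:
    """Average memory growth per received timestep, in MB."""
    return total_growth_gb * mb_per_gb / steps


def steps_per_sample(run_seconds: float, steps: int, sample_interval_seconds: float) -> float:
    """How many timesteps fall between two job-reporting samples,
    assuming the timesteps are evenly spaced over the run."""
    seconds_per_step = run_seconds / steps
    return sample_interval_seconds / seconds_per_step


# Numbers taken from the run described above.
growth_gb = 168       # memory growth on the Endpoint node
received_steps = 344  # timesteps received according to the Endpoint log
run_seconds = 30 * 60

print(f"leak per timestep: {leak_per_step_mb(growth_gb, received_steps):.1f} MB")
print(f"timesteps per 60 s sample:  {steps_per_sample(run_seconds, received_steps, 60):.1f}")
print(f"timesteps per 120 s sample: {steps_per_sample(run_seconds, received_steps, 120):.1f}")
```

Under these assumptions each job-reporting sample covers somewhere between roughly 11 and 23 timesteps, which would explain why the graph looks like a staircase even though the growth happens on every step.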
SENSEI Version: 4.1.0
ADIOS2 Version: 2.7.1
I already tried going in with a debugger, but I could not find any immediately obvious cause for this. I will try to play around with the ADIOS2 parameters a bit and report any further discoveries.
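Since the job reporting only samples every 1-2 minutes, a finer-grained measurement could help when testing different parameters. Below is a minimal Python sketch of how the resident memory of the Endpoint process could be sampled on the receiving node. It is Linux-only (it reads the `VmRSS` field from `/proc/<pid>/status`), it is not part of SENSEI or ADIOS2, the helper names are hypothetical, and the PID of the `SENSEIEndPoint` process has to be looked up by hand.

```python
import time
from typing import List, Optional


def parse_vmrss_kb(status_text: str) -> Optional[int]:
    """Extract the VmRSS value (in kB) from the text of /proc/<pid>/status.

    Returns None if the field is missing (e.g. for kernel threads)."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like: "VmRSS:     123456 kB"
            return int(line.split()[1])
    return None


def read_rss_kb(pid: int) -> Optional[int]:
    """Read the current resident set size of a process in kB (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_vmrss_kb(f.read())


def sample_rss(pid: int, samples: int, interval_s: float) -> List[Optional[int]]:
    """Take `samples` RSS readings of `pid`, sleeping `interval_s` seconds
    between consecutive readings, and print each one with a timestamp."""
    readings = []
    for i in range(samples):
        rss_kb = read_rss_kb(pid)
        readings.append(rss_kb)
        print(f"{time.time():.1f} pid={pid} rss_kb={rss_kb}")
        if i < samples - 1:
            time.sleep(interval_s)
    return readings
```

For example, calling `sample_rss(endpoint_pid, samples=600, interval_s=1.0)` on the receiving node (with `endpoint_pid` set to the Endpoint's PID) would give about ten minutes of one-second readings that can be compared against the timestep messages in the Endpoint log.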
Thanks in advance,
~Jonathan