Growing Memory Consumption with Endpoint + ADIOS2 #107

@jwindgassen

I am currently trying to run SENSEI with a simulation where I use ADIOS2 to collect the data on a separate node and run the visualizations there.
However, I noticed that the memory usage on the receiving node increased with every timestep. The case I simulated was not gigantic: the files written by PosthocIO were about 10GB per timestep, which is also roughly the amount by which the memory consumption grew with every timestep. After a few dozen steps the Endpoint crashed because no more memory could be allocated.

As far as I can tell, this only happens on the receiving node. When I previously ran the visualization on the simulation nodes, I saw no concerning leaks whatsoever. In the case outlined above, the growing memory was also only visible on the receiving node; the 8 simulation nodes stayed pretty much constant.

I created a small test example with the oscillator miniapp:

example.slurm:

#!/bin/bash -x
#SBATCH --job-name=example
#SBATCH --nodes=5
#SBATCH --ntasks-per-node=8
#SBATCH --time=00:30:00
#SBATCH --cpus-per-task=6
...

# Loading modules
...
module load sensei/4.1.0-adios2-catalyst-5.10.1

export PROFILER_ENABLE=1
export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

# Starting Simulation:
rm -rf info.sst
srun -N4 -n32 --cpu-bind=verbose oscillator -b 65536 -s "9999999 9999999 9999999" -p 100 --t-end 3 -j 6 -f transport.xml --sync periodic-3772.osc &


# Starting SENSEI Endpoint
srun -N1 -n1 --cpu-bind=verbose SENSEIEndPoint -t transport.xml -a analysis.xml &> "mpi-$SLURM_JOB_ID.endpoint" &

wait
rm -rf info.sst

transport.xml:

<sensei>
    <transport type="adios2" enabled="1" engine="SST" filename="info" frequency="1">
        <engine_parameters>
            verbose = 5
            RendezvousReaderCount = 1
            RegistrationMethod = File
            OpenTimeoutSecs = 300
            <!--DataTransport = RDMA-->
        </engine_parameters>
        
        <mesh name="mesh">
            <cell_arrays>data</cell_arrays>
        </mesh>
    </transport>
</sensei>

analysis.xml:

<sensei>
    <analysis type="PosthocIO" enabled="1" frequency="1" output_dir="./posthocIO" file_name="output" mode="paraview">
        <mesh name="mesh">
            <cell_arrays>data</cell_arrays>
        </mesh>
    </analysis>
    
    <analysis type="catalyst" enabled="1" frequency="1" pipeline="slice" array="data" association="cell" image-filename="./datasets/slice-%ts.png" image-width="1920" image-height="1080" />
</sensei>

I made the oscillator inputs pretty large so that any increase in memory would actually be visible.

Running this setup, our job reporting shows the following memory consumption for the Endpoint node:
[image: memory consumption of the Endpoint node over the job runtime]
In this case the memory increased by around 168GB over the complete 30 minutes, and the Endpoint received (according to the log) 344 timesteps, so the leaked memory should be about 500MB per timestep. (The steps you see in the graph are just an artifact of the sampling rate of the job reporting, about once every 1-2 minutes, not the actual consumption per step!)
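
The per-timestep estimate is simply the total growth divided by the number of received steps. A quick check in shell, assuming decimal units (1GB = 1000MB); the function name `leak_per_step_mb` is made up for illustration:

```shell
#!/bin/bash
# Estimate leaked memory per timestep: total growth (GB) / number of steps.
# Assumption: 1 GB = 1000 MB. leak_per_step_mb is a hypothetical helper name.
leak_per_step_mb() {
    awk -v gb="$1" -v steps="$2" 'BEGIN { printf "%.1f\n", gb * 1000 / steps }'
}

# Numbers from the run above: 168 GB growth over 344 received timesteps.
leak_per_step_mb 168 344   # prints 488.4
```

This comes out just under 0.5GB per step, consistent with the "about 500MB" figure.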

SENSEI Version: 4.1.0
ADIOS2 Version: 2.7.1

I already tried going in with a debugger, but I could not find any immediately obvious cause for this. I will try to play around with the ADIOS2 parameters a bit and report any further discoveries.
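
Since the job reporting only samples every 1-2 minutes, a finer-grained view of the Endpoint's memory might help narrow this down. Below is a minimal sketch that polls the resident set size of a process from `/proc`. It assumes a Linux node, and the helper names `rss_kb` and `sample_rss` are hypothetical and not part of SENSEI or ADIOS2:

```shell
#!/bin/bash
# Print the resident set size (VmRSS, in kB) of the process with the given PID.
# Assumption: Linux with /proc mounted. rss_kb is a hypothetical helper name.
rss_kb() {
    awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

# Sample the RSS of a PID a fixed number of times.
# Usage: sample_rss PID INTERVAL_SECONDS COUNT
# Prints one line per sample: "<unix time> <rss in kB>".
sample_rss() {
    local pid="$1" interval="$2" count="$3" i
    for ((i = 0; i < count; i++)); do
        # Stop early if the process has exited.
        [ -r "/proc/$pid/status" ] || break
        echo "$(date +%s) $(rss_kb "$pid")"
        sleep "$interval"
    done
}
```

The sampler has to run on the same node as the Endpoint, for example started from a wrapper around `SENSEIEndPoint` with something like `sample_rss "$(pgrep -n SENSEIEndPoint)" 5 360 > rss.log &`. Comparing the log against the timestep messages in the Endpoint output should show whether the growth really happens once per received step.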

Thanks in advance,
~Jonathan
