Replies: 1 comment · 1 reply
-
See https://jobqueue.dask.org/en/latest/generated/dask_jobqueue.SLURMCluster.html. You'll need something like `cluster = dask_jobqueue.SLURMCluster(...)`, then `cluster.scale(4)` and `client = distributed.Client(cluster)`. For your notebook, you shouldn't be requesting all the resources you need for the job; that will go through the queue separately.
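Spelled out a little more, a minimal sketch of that pattern (the `cores`/`memory`/`walltime` values below are placeholders, not taken from this thread; set them to whatever your SLURM site allows):

```python
import dask_jobqueue
import distributed

# Placeholder resource requests -- adjust to your cluster's partitions and limits.
cluster = dask_jobqueue.SLURMCluster(
    cores=8,              # cores per SLURM job
    memory="32GB",        # memory per SLURM job
    walltime="01:00:00",  # wall-clock limit for each job
    # queue="normal",     # partition name, if your site requires one
)

cluster.scale(4)                      # submit 4 such jobs; each becomes a Dask worker
client = distributed.Client(cluster)  # computations in the notebook now use those workers
```

The JupyterLab session itself then only needs enough resources to drive the notebook; the heavy computation runs in the worker jobs that `cluster.scale()` submits to the SLURM queue.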
-
I have complex data (Fourier coefficients) with the following dimensions: `(noise, time, kx, ky) = (5, 200, 4096, 4096)`. I want to calculate the lagged time correlation between different noises. I have one file per noise that I am loading with `open_mfdataset`, giving `chunks="auto"`, since I can't load everything (250 GB) into memory.

AFAIK, `xr.corr` only works for real variables and does not perform lagged correlation. I could also try to somehow wrap `scipy.signal.correlate`. My first try was to run a simple example for lag 0, which takes about 17 minutes to run.

What I want is a correlation matrix like `(lag, kx, ky)` for each combination of noises I give. I was thinking of using dask to parallelize and speed up this process, but I am already requesting resources from the cluster for the JupyterLab session I am running this in. So I am not sure whether the client I create for dask can somehow interact with SLURM for this, or whether there are better alternatives with the built-in functions we have in `xarray`.
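For the correlation itself, here is a rough sketch of one way to get a `(lag, kx, ky)` result for a pair of noises, assuming the complex variable is called `fk` (that name, the lag convention, and the lack of normalization are all assumptions, not from the thread). Shifting along `time` and averaging `a * conj(b)` stays lazy on dask-backed arrays, so it can run on the SLURMCluster workers:

```python
import numpy as np
import pandas as pd
import xarray as xr

def lagged_corr(a, b, lags, dim="time"):
    """Lagged cross-correlation <a(t) * conj(b(t + lag))> averaged over `dim`.

    Illustrative only: no normalization is applied, and edge samples lost to
    the shift are skipped via NaNs.
    """
    out = []
    for lag in lags:
        shifted = b.shift({dim: -lag})   # b(t + lag); edges are padded with NaN
        prod = a * np.conj(shifted)      # elementwise product, lazy if dask-backed
        out.append(prod.mean(dim, skipna=True))
    return xr.concat(out, dim=pd.Index(list(lags), name="lag"))

# Hypothetical usage with files opened via open_mfdataset(..., chunks="auto"):
# ds = xr.open_mfdataset("noise_*.nc", combine="nested", concat_dim="noise", chunks="auto")
# corr = lagged_corr(ds["fk"].isel(noise=0), ds["fk"].isel(noise=1), lags=range(0, 20))
# corr = corr.compute()   # executes on the Dask cluster if a Client is active
```

Whether this beats wrapping `scipy.signal.correlate` depends on how many lags you need, since the loop recomputes the product for every lag; it is meant only to show the shape of the computation.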