FFT of very large arrays (>100GB) #13842
Replies: 4 comments 6 replies
-
I've made some progress: I can now perform an FFT of a (48_000, 40_000) array on a Google Cloud TPU v2-8 (8 devices with 8GB each).
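The core of the setup is roughly the following (a simplified sketch rather than the exact code; the mesh axis name, the choice to shard along the first axis, and creating the array via `out_shardings` are illustrative and assume a reasonably recent JAX version):

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# The 8 TPU cores of a v2-8, arranged as a 1-D device mesh.
mesh = Mesh(np.array(jax.devices()), axis_names=("shard",))

# Shard the (48_000, 40_000) complex64 array along its first axis,
# so each core holds a (6_000, 40_000) slice (~1.9 GB of the ~15 GB total).
shape = (48_000, 40_000)
sharding = NamedSharding(mesh, P("shard", None))

# Create the array directly in sharded form, so no single 8 GB core
# ever has to hold the whole thing (placeholder data here).
x = jax.jit(lambda: jnp.zeros(shape, jnp.complex64), out_shardings=sharding)()

# FFT along the last (unsharded) axis: each core only needs to
# transform the rows of its own shard.
fft_rows = jax.jit(lambda a: jnp.fft.fft(a, axis=-1))
y = fft_rows(x)
```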
There are still some hurdles, two of which are:
My next step would be to use larger TPU topologies, i.e. slices & pods. I'd greatly appreciate any advice on this effort.
-
Note: the previous comments in this thread are not a prerequisite for this question, although they are part of the same effort.

Gist of the matter
Given an array sharded appropriately (see below) across several devices, I'm unable to instruct JIT/AOT to perform the FFT using only each device's local slice. It seems that each device is loading the entire array (although that isn't needed), evident both from tracing the memory usage and from looking at the compiled HLO output.

Further details
For this example, I'm using a Google Cloud VM with 4 x NVidia T4 GPUs, 16GB each.
Looking at the compiled HLO, it seems that the compiled result is using, at some intermediate stage, the entire array, thus negating the main reason for sharding it (the full HLO outputs are included in the gist linked below).
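For reference, the compiled HLO can be dumped along these lines (a minimal sketch, not the exact code; `x` stands for the sharded global array described above, and the FFT along the last axis is purely illustrative):

```python
import jax
import jax.numpy as jnp

fft_fn = lambda a: jnp.fft.fft(a, axis=-1)   # the operation being jitted

lowered = jax.jit(fft_fn).lower(x)           # x: the sharded global array
print(lowered.as_text())                     # HLO before XLA optimization/partitioning

compiled = lowered.compile()
print(compiled.as_text())                    # optimized HLO -- this is where the
                                             # full-array intermediate shapes show up
```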
Although I'm only taking baby steps in reading HLO code, it seems that although the main function uses the expected input & output parameter shapes, some intermediate tensors are materialized with the full, unsharded shape. This is, obviously, what I'm trying to avoid.

Another indication of this memory usage comes from monitoring it during execution: the plateau at 13:57:35 is synthetically inserted (time.sleep), to discern between the input memory allocation and the FFT. Similarly, the plateau from 13:57:40 onwards is after the FFT.

Manually FFT-ing each shard
To demonstrate (mainly to myself...) that this kind of sharded FFT is indeed feasible, I've iterated over all the shards, sequentially performing the FFT only on each local shard, and finally reassembling the results into a single array (a sketch of this sequential approach is included at the end of this comment). As expected, this method enabled a 4-fold larger array to be used, i.e. c128[32000,32768]. With a synthetic sleep between iterations (to visually discern memory usage), each device is seen to use only its local shard. Obviously, such a method is not desired for real use, mainly due to the sequential nature of the operation (as well as several other reasons).

Full Example
As this message is long enough as it is, I've uploaded the entire example, containing both methods (the desired jit'ing with local shards only, and the sequential independent-devices one), together with the HLO outputs:
https://gist.github.com/ItayKishon-Remondo/d04de8ccc5f19a683538235380ffc5c3

HW & SW Details
Thank you for reading this far. Any help or reference would be greatly appreciated.
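As referenced above, here is a rough sketch of the sequential per-shard method (simplified, not the exact code from the gist; `x` stands for the globally sharded array, and the FFT along the last axis is only illustrative):

```python
import jax.numpy as jnp
import numpy as np

# Sequentially FFT one shard per device and reassemble on the host.
# Purely a feasibility check -- it serializes the devices.
shards = sorted(x.addressable_shards, key=lambda s: s.index[0].start or 0)

out_shards = []
for shard in shards:
    local = shard.data                        # only this device's slice
    fft_local = jnp.fft.fft(local, axis=-1)   # runs on that shard's device
    out_shards.append(np.asarray(fft_local))  # pull the result back to host RAM

result = np.concatenate(out_shards, axis=0)   # reassemble the full array
```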
-
Amazing work! I'm curious, how did it go?
-
Thanks @Findus23 and @ItayKishon-Remondo for this interesting work. I am trying to solve a similar problem: I currently need to calculate thousands of 2D FFTs in parallel, each of image size 256x256, but I would want to scale this up in the future. I currently have access to two 3090s with 48GB of VRAM in total.

One problem I have run into when trying to use CuPy for this on a single GPU, and which causes out-of-memory issues, is that CUDA (the cuFFT package) does not seem to calculate the Fourier transform in place - see the features section of this open-source alternative: https://github.com/vincefn/pyvkfft. This effectively halves the amount of available VRAM, since cuFFT seems to make a copy of the array, which might be part of the reason why it uses more VRAM than it should. This might be irrelevant to JAX, given its immutability requirements.

Anyway, of the solutions posted above, which would you recommend for this problem: the JAX+cuFFTMp approach (https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuFFTMp/JAX_FFT) or the approach shared by you @Findus23 in this gist (https://gist.github.com/Findus23/eb5ecb9f65ccf13152cda7c7e521cbdd)?
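Concretely, the kind of thing I'd like to end up with in JAX looks roughly like this (a rough sketch only; the mesh axis name, batch size, and sharding choice are illustrative, and it assumes a reasonably recent JAX version):

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Two GPUs as a 1-D mesh; shard the batch of images across them.
mesh = Mesh(np.array(jax.devices()), axis_names=("batch",))
sharding = NamedSharding(mesh, P("batch", None, None))

# Thousands of 256x256 complex64 images, sharded along the batch axis
# (placeholder data here).
images = jax.jit(
    lambda: jnp.zeros((4096, 256, 256), jnp.complex64),
    out_shardings=sharding,
)()

# 2D FFT over the last two axes of every image; the batch axis stays
# sharded, so each GPU only transforms its own half of the batch.
batched_fft2 = jax.jit(lambda a: jnp.fft.fft2(a, axes=(-2, -1)))
spectra = batched_fft2(images)
```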
-
Update (17.04.2023): The last question in this thread is independent of the previous ones (albeit related).
You may skip directly to it: #13842 (comment)
Hi,
I apologize in advance for such a lengthy question, and would like to thank you for reading & assisting.
I'm building a wave optics simulation in which one of the primary operations is the FFT. Currently I'm running a preliminary phase (pardon the pun) of the system, where the array size I'm using is (10_000, 10_000), yet for the volume & sampling eventually required I will need at least (100_000, 100_000).
I noticed the memory limit when attempting to use a (20_000, 20_000) array, which for complex64 (single precision) amounts to about 3.0GB. The FFT operation itself, jnp.fft.fft(), seemed to require a peak of about 5x that memory during the calculation, which exceeded my A4000's 16GB.
The expected array size (80GB for 100_000^2, complex64) will probably not fit in a single GPU's RAM, not to mention the other arrays required for other parts of the calculation, as well as the FFT's mid-calculation memory requirements.
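For reference, the raw array sizes above are just this arithmetic:

```python
import numpy as np

itemsize = np.dtype(np.complex64).itemsize      # 8 bytes per complex64 element

print(20_000 ** 2 * itemsize)    # 3.2e9 bytes  (~3 GB)  -- the current limit case
print(100_000 ** 2 * itemsize)   # 8.0e10 bytes (80 GB)  -- the eventual target
```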
I understand that I probably need a grid of GPU/TPUs, e.g. Google's Cloud TPU.
In that case, I have several questions:
I do understand that these are not trivial requirements, and I do hope that such a calculation is even feasible.
Any help, suggestion or reference would be highly appreciated. Thanks!