Just noticed the mlx `Conv2d` weight shape is different from the PyTorch `Conv2d` shape. I'm wondering if there's a specific reason to implement it differently? I'm bringing this up mainly because it causes some issues when loading PyTorch model weights: we have to do a weight conversion or use a custom `Conv2d` instead of the default `nn.Conv2d`. For example, in the CLIP example.
Well, not the only reason, but the first one is that we use NHWC vs NCHW. This means that the matrix multiplication would be with a weight of shape `(O, h, w, C)`, where `h` and `w` are the kernel sizes. PyTorch's would be `(O, C, h, w)`.
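For concreteness, a minimal conversion sketch; the helper name is hypothetical, and it assumes the PyTorch weight has already been pulled out as a NumPy array:

```python
import mlx.core as mx

def torch_to_mlx_conv_weight(w) -> mx.array:
    # Hypothetical helper. `w` is a PyTorch Conv2d weight as a NumPy array
    # (call .detach().numpy() on the tensor first), with shape (O, C, h, w).
    # mx.conv2d expects (O, h, w, C), so move the input-channel axis last.
    return mx.transpose(mx.array(w), (0, 2, 3, 1))
```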
Nice! A couple comments on that:

- You could subclass `nn.Conv2d` and override just the `__call__` method.
- Maybe even smoother: instead of overriding `__call__`, you could override `__setattr__` and just write the weight in the right order then. That at least lets you avoid the transpose on each call to `mx.conv2d`, which could come with a small perf penalty (sketches of both are below).

Though it's probably not worth being so clever about; I also find just pre-transforming the weights to be simple and explicit.
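To make both options concrete, here are minimal sketches. The class names are hypothetical, and both assume the PyTorch-ordered weight is assigned by plain attribute access after construction (bulk loaders like `Module.update` may bypass `__setattr__`):

```python
import mlx.core as mx
import mlx.nn as nn

class TorchLayoutConv2d(nn.Conv2d):
    """Sketch A (hypothetical): keep the weight in PyTorch's (O, C, h, w)
    order and transpose it on every forward pass. Only correct once you
    overwrite `self.weight` with the PyTorch-ordered tensor."""

    def __call__(self, x):
        w = mx.transpose(self.weight, (0, 2, 3, 1))  # -> (O, h, w, C)
        y = mx.conv2d(x, w, self.stride, self.padding)
        if "bias" in self:
            y = y + self.bias
        return y


class TransposeOnLoadConv2d(nn.Conv2d):
    """Sketch B (hypothetical): transpose once when the weight is assigned,
    so `__call__` is untouched and pays no per-step transpose."""

    def __init__(self, *args, **kwargs):
        self._initialized = False  # don't transpose the random init weight
        super().__init__(*args, **kwargs)
        self._initialized = True

    def __setattr__(self, name, value):
        # Assumes weights assigned after construction (e.g. when loading a
        # PyTorch checkpoint) come in (O, C, h, w) order.
        if name == "weight" and getattr(self, "_initialized", False):
            value = mx.transpose(value, (0, 2, 3, 1))
        super().__setattr__(name, value)
```

With sketch B, loading is then just `conv.weight = mx.array(w_torch)` and the transpose happens once at load time rather than on every forward pass.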