seems not faster? #6

@ichejun

Hi there~
I tried this library in my environment (Python 3.12.8, torch 2.6.0+cu124), but it doesn't seem to speed things up; it actually runs slower. Is this an environment problem, or has torch.compile in newer versions of torch simply become more powerful? Thanks~

$python3 test.py 
Output From nn.Linear (compiled):
 tensor([[-0.2332,  0.0908, -0.0438,  ...,  0.1122,  0.0160,  0.1544],
        [ 0.0186,  0.1241, -0.0663,  ..., -0.0545, -0.0704, -0.1599],
        [-0.1022,  0.0374,  0.1137,  ..., -0.0730,  0.0739, -0.2000],
        ...,
        [ 0.1812, -0.1332, -0.2463,  ..., -0.1785,  0.0004,  0.2174],
        [ 0.0446, -0.2808, -0.0799,  ...,  0.0555, -0.1138, -0.0076],
        [ 0.0435, -0.0329, -0.0820,  ..., -0.0171, -0.0221,  0.1840]],
       device='cuda:0', dtype=torch.float16)
Output From CublasLinear:
 tensor([[-2.3328e-01,  9.1309e-02, -4.3945e-02,  ...,  1.1200e-01,
          1.6022e-02,  1.5430e-01],
        [ 1.8524e-02,  1.2421e-01, -6.6589e-02,  ..., -5.4535e-02,
         -7.0435e-02, -1.6052e-01],
        [-1.0205e-01,  3.7170e-02,  1.1316e-01,  ..., -7.3181e-02,
          7.3730e-02, -1.9958e-01],
        ...,
        [ 1.8091e-01, -1.3342e-01, -2.4609e-01,  ..., -1.7834e-01,
         -2.2888e-04,  2.1716e-01],
        [ 4.4464e-02, -2.8101e-01, -7.9956e-02,  ...,  5.5481e-02,
         -1.1414e-01, -7.6065e-03],
        [ 4.3610e-02, -3.2715e-02, -8.2092e-02,  ..., -1.6968e-02,
         -2.2156e-02,  1.8457e-01]], device='cuda:0', dtype=torch.float16)
TORCH F16 W/ F32 ACC (COMPILED): 1038.24 us 132.38 TFLOPS
CUBLAS F16 W/ F16 ACC: 1328.69 us 103.44 TFLOPS
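As a sanity check on the numbers above: the reported TFLOPS follow from the standard GEMM FLOP count, 2·M·N·K operations per matmul, divided by the measured latency. The figures line up with a square 4096×4096×4096 half-precision GEMM (the problem size is an assumption here, inferred back from the reported timings, not taken from test.py):

```python
def tflops(m: int, n: int, k: int, time_us: float) -> float:
    """Effective throughput for an (m, k) x (k, n) GEMM timed at time_us."""
    # A GEMM performs 2*m*n*k floating-point ops
    # (one multiply + one add per accumulated product).
    flops = 2 * m * n * k
    return flops / (time_us * 1e-6) / 1e12

# Assuming a 4096^3 GEMM (hypothetical size, inferred from the output above):
print(f"{tflops(4096, 4096, 4096, 1038.24):.2f}")  # ~132.38 (compiled nn.Linear)
print(f"{tflops(4096, 4096, 4096, 1328.69):.2f}")  # ~103.44 (CublasLinear)
```

So both kernels are timed over the same workload; the gap is real throughput, not a unit mix-up.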
