Hi there~
I tried this library in my environment (Python 3.12.8, torch 2.6.0+cu124), but it didn't seem to speed things up; it actually ran slower. Is this an environment problem, or has `torch.compile` in newer versions of torch become more powerful? Thanks~
```
$ python3 test.py
Output From nn.Linear (compiled):
tensor([[-0.2332, 0.0908, -0.0438, ..., 0.1122, 0.0160, 0.1544],
 [ 0.0186, 0.1241, -0.0663, ..., -0.0545, -0.0704, -0.1599],
 [-0.1022, 0.0374, 0.1137, ..., -0.0730, 0.0739, -0.2000],
 ...,
 [ 0.1812, -0.1332, -0.2463, ..., -0.1785, 0.0004, 0.2174],
 [ 0.0446, -0.2808, -0.0799, ..., 0.0555, -0.1138, -0.0076],
 [ 0.0435, -0.0329, -0.0820, ..., -0.0171, -0.0221, 0.1840]],
 device='cuda:0', dtype=torch.float16)
Output From CublasLinear:
tensor([[-2.3328e-01, 9.1309e-02, -4.3945e-02, ..., 1.1200e-01,
 1.6022e-02, 1.5430e-01],
 [ 1.8524e-02, 1.2421e-01, -6.6589e-02, ..., -5.4535e-02,
 -7.0435e-02, -1.6052e-01],
 [-1.0205e-01, 3.7170e-02, 1.1316e-01, ..., -7.3181e-02,
 7.3730e-02, -1.9958e-01],
 ...,
 [ 1.8091e-01, -1.3342e-01, -2.4609e-01, ..., -1.7834e-01,
 -2.2888e-04, 2.1716e-01],
 [ 4.4464e-02, -2.8101e-01, -7.9956e-02, ..., 5.5481e-02,
 -1.1414e-01, -7.6065e-03],
 [ 4.3610e-02, -3.2715e-02, -8.2092e-02, ..., -1.6968e-02,
 -2.2156e-02, 1.8457e-01]], device='cuda:0', dtype=torch.float16)
TORCH F16 W/ F32 ACC (COMPILED):  1038.24 us  132.38 TFLOPS
CUBLAS F16 W/ F16 ACC:            1328.69 us  103.44 TFLOPS
```
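As a quick cross-check of the figures above, the reported TFLOPS can be recomputed from the elapsed time and the GEMM shape. A minimal sketch, assuming the test multiplies M = N = K = 4096 matrices (the shapes aren't shown in the issue, but the reported numbers are consistent with that size):

```python
# Sanity-check the reported TFLOPS figures. The GEMM shape is an
# assumption (M = N = K = 4096), not something stated in the issue.

def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """TFLOPS for an (m x k) @ (k x n) matmul: 2*m*n*k FLOPs / elapsed time."""
    return 2 * m * n * k / seconds / 1e12

# Reported timings from the benchmark output, converted to seconds.
compiled_t = 1038.24e-6  # torch f16 w/ f32 acc (compiled)
cublas_t   = 1328.69e-6  # cublas f16 w/ f16 acc

print(f"compiled: {gemm_tflops(4096, 4096, 4096, compiled_t):.2f} TFLOPS")
print(f"cublas:   {gemm_tflops(4096, 4096, 4096, cublas_t):.2f} TFLOPS")
```

With these assumed shapes the recomputed values land on 132.38 and 103.44 TFLOPS, matching the printed output, so the timings and throughput numbers are at least internally consistent.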