Issue with SUP basecalling using RTX3080 10GB #1489

@awjga

Dorado version

1.1.1

Dorado subcommand

Basecaller

The issue

dorado basecaller sup input.pod5 > basecalled.bam

Output:
[2025-09-04 15:12:33.631] [info] > Creating basecall pipeline
[2025-09-04 15:12:34.581] [info] Using CUDA devices:
[2025-09-04 15:12:34.581] [info] cuda:0 - NVIDIA GeForce RTX 3080
[2025-09-04 15:12:35.444] [info] Calculating optimized batch size for GPU "NVIDIA GeForce RTX 3080" and model dna_r10.4.1_e8.2_400bps_sup@v5.2.0. Full benchmarking will run for this device, which may take some time.
[2025-09-04 15:12:36.267] [info] cuda:0 using chunk size 12288, batch size 96
[2025-09-04 15:12:36.475] [info] cuda:0 using chunk size 6144, batch size 192
[2025-09-04 15:14:27.768] [warning] Caught Torch error 'CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
', clearing CUDA cache and retrying.
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /builds/machine-learning/torch-builds/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xb0 (0x7f220d205390 in /mnt/SSD1/dorado-1.1.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7f2203da5896 in /mnt/SSD1/dorado-1.1.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3cc (0x7f220d1a639c in /mnt/SSD1/dorado-1.1.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #3: <unknown function> + 0xb245e1d (0x7f220d181e1d in /mnt/SSD1/dorado-1.1.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #4: <unknown function> + 0xb24677e (0x7f220d18277e in /mnt/SSD1/dorado-1.1.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #5: <unknown function> + 0xb259f98 (0x7f220d195f98 in /mnt/SSD1/dorado-1.1.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #6: /mnt/SSD1/dorado-1.1.1-linux-x64/bin/dorado() [0x54d7d4]
frame #7: <unknown function> + 0xc2b23 (0x7f2201bf7b23 in /mnt/SSD1/dorado-1.1.1-linux-x64/bin/../lib/libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x7f2201ed2609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f22018f6353 in /lib/x86_64-linux-gnu/libc.so.6)

System specifications

Operating system: Ubuntu 20.04.6 LTS
CPU: AMD Ryzen 9 3950X 16-Core Processor
GPU: RTX 3080 10 GB
Storage: SSD

Hi,

I am facing an issue when I try to basecall my reads with the SUP model. A few minutes after I start the command, the GPU crashes and I have to reboot the system before I can use the GPU again.
I have also tried lowering the batch size with -b 12, and the same error occurs.
Basecalling the same pod5 files with HAC on the same system works without any issue.
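
The error message itself suggests rerunning with CUDA_LAUNCH_BLOCKING=1. A minimal sketch of that rerun, assuming the same input file and the -b 12 batch size mentioned above. The log file name dorado_sup.log is a hypothetical choice, and the `command -v` guard only skips the run on machines where dorado is not on the PATH:

```shell
# Make CUDA kernel launches synchronous so the failing call is reported
# where it happens, as the Torch error message suggests.
export CUDA_LAUNCH_BLOCKING=1

# Same command and reduced batch size as in the report. stderr is written to
# a hypothetical log file (dorado_sup.log) so the trace survives a reboot.
if command -v dorado >/dev/null 2>&1; then
    dorado basecaller sup input.pod5 -b 12 > basecalled.bam 2> dorado_sup.log
fi
```

With synchronous launches the run will likely be slower, but the stack trace should point closer to the kernel that actually failed.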

Thank you
