Description
New issue checks
- I have read the Dorado Documentation.
- I did not find an existing issue.
Dorado version
1.1.1
Dorado subcommand
Basecaller
The issue
dorado basecaller sup input.pod5 > basecalled.bam
Output:
[2025-09-04 15:12:33.631] [info] > Creating basecall pipeline
[2025-09-04 15:12:34.581] [info] Using CUDA devices:
[2025-09-04 15:12:34.581] [info] cuda:0 - NVIDIA GeForce RTX 3080
[2025-09-04 15:12:35.444] [info] Calculating optimized batch size for GPU "NVIDIA GeForce RTX 3080" and model dna_r10.4.1_e8.2_400bps_sup@v5.2.0. Full benchmarking will run for this device, which may take some time.
[2025-09-04 15:12:36.267] [info] cuda:0 using chunk size 12288, batch size 96
[2025-09-04 15:12:36.475] [info] cuda:0 using chunk size 6144, batch size 192
[2025-09-04 15:14:27.768] [warning] Caught Torch error 'CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA
to enable device-side assertions.
', clearing CUDA cache and retrying.
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA
to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /builds/machine-learning/torch-builds/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xb0 (0x7f220d205390 in /mnt/SSD1/dorado-1.1.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7f2203da5896 in /mnt/SSD1/dorado-1.1.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3cc (0x7f220d1a639c in /mnt/SSD1/dorado-1.1.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #3: + 0xb245e1d (0x7f220d181e1d in /mnt/SSD1/dorado-1.1.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #4: + 0xb24677e (0x7f220d18277e in /mnt/SSD1/dorado-1.1.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #5: + 0xb259f98 (0x7f220d195f98 in /mnt/SSD1/dorado-1.1.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #6: /mnt/SSD1/dorado-1.1.1-linux-x64/bin/dorado() [0x54d7d4]
frame #7: + 0xc2b23 (0x7f2201bf7b23 in /mnt/SSD1/dorado-1.1.1-linux-x64/bin/../lib/libstdc++.so.6)
frame #8: + 0x8609 (0x7f2201ed2609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f22018f6353 in /lib/x86_64-linux-gnu/libc.so.6)
System specifications
Operating system: Ubuntu 20.04.6 LTS
CPU: AMD Ryzen 9 3950X 16-Core Processor
GPU: RTX 3080 10 GB
SSD
Hi,
I am facing an issue when I try to basecall my reads with the sup model: a few minutes after the command starts, the GPU crashes, and I have to reboot the system before I can use the GPU again.
I have also tried lowering the batch size to -b 12, and the same error occurs.
When I basecalled the same pod5 files with HAC on the same system, there was no issue.
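For reference, a rough sketch of the reduced-batch rerun, reconstructed from the command at the top of this report. The file names are the placeholders used there, and setting CUDA_LAUNCH_BLOCKING=1 is an assumption that follows the hint in the error message; it was not part of the runs described above. The guard around the call is only there so the snippet does nothing harmful on a machine without dorado.

```shell
# Sketch only: same command as in the report, with an explicit batch size.
# input.pod5 and basecalled.bam are placeholder names from the report.
batch_size=12

if command -v dorado >/dev/null 2>&1; then
    # CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the
    # reported stack trace should point closer to the failing call.
    CUDA_LAUNCH_BLOCKING=1 dorado basecaller sup input.pod5 -b "$batch_size" > basecalled.bam
else
    echo "dorado not found on PATH"
fi
```

Running with synchronous launches is slower, so this is meant for capturing a more accurate trace rather than for regular basecalling.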
Thank you