RuntimeError: BIPP: Eigensolver error #39

@Kincaidr

Description

Hi,

In the previous issue #38 I managed to run bipp successfully with the command:

srun python test3.py redundant J0040.ms -o run_2/output --column DATA -p 100 -f 2.5 -b 4e34,300,-4e34 -l 2 -n 100 -t 0,10 -c 0,10

This was just a test run. Now I want to make the full image with all timesteps and channels, so I removed -t 0,10 -c 0,10 and changed -n 100 to -n 6000:

srun python test3.py redundant J0040.ms -o run_2/output --column DATA -p 100 -f 2.5 -b 4e34,300,-4e34 -l 2 -n 6000

It runs for a while and then I get the error:

20it [1:32:24, 277.24s/it]
Traceback (most recent call last):
  File "/scratch/kincaid/bipp_run_J0040/test3.py", line 422, in <module>
    imager.collect(wl, fi, S_new, W.data, XYZ.data, uvw)
RuntimeError: BIPP: Eigensolver error
srun: error: kh082: task 0: Exited with exit code 1
srun: Terminating StepId=330158.0
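
Since the failure only shows up after about 1.5 hours, it may help to save the inputs of the failing step so the error can be reproduced in isolation. The following is only a sketch: `call_and_dump_on_error` is a hypothetical helper (not part of bipp), and it assumes the arrays passed to `imager.collect` can be converted with `np.asarray`.

```python
import numpy as np


def call_and_dump_on_error(func, args, dump_path, to_dump):
    """Call func(*args); if it raises RuntimeError, save the arrays in
    `to_dump` (a dict of name -> array) to `dump_path` and re-raise.

    Hypothetical helper for debugging, not a bipp API.
    """
    try:
        return func(*args)
    except RuntimeError:
        # Save only the arrays needed to reproduce the failing step.
        np.savez(dump_path, **{name: np.asarray(a) for name, a in to_dump.items()})
        raise


# Possible use around the failing call from the traceback (assumption: S_new,
# W.data, XYZ.data and uvw are array-like):
#
# call_and_dump_on_error(
#     imager.collect,
#     (wl, fi, S_new, W.data, XYZ.data, uvw),
#     "failed_step.npz",
#     {"S": S_new, "W": W.data, "XYZ": XYZ.data, "uvw": uvw},
# )
```

The exception is re-raised, so the run still stops at the same place; the only difference is that `failed_step.npz` is left behind for inspection.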

The output looks like:

[2025-02-27 18:51:56.999] [bipp] [debug] nufft size 17, direct evaluation
[2025-02-27 18:51:59.626] [bipp] [debug] nufft size 20, direct evaluation
[2025-02-27 18:52:02.791] [bipp] [debug] eigensolver (host) nVis = 3721
[2025-02-27 18:52:02.791] [bipp] [debug] Eigensolver: removing 0 columns / rows
[2025-02-27 18:52:02.792] [bipp] [debug] array "gram":  size (61, 61), min (-0.0009978566769194095, -0.0008183230735699571), max (1.0000000000000002, 0.0008183230735699572), sum (57.99573051064264, -1.3088368085059114e-17), fp classes [normal,zero]
[2025-02-27 18:52:02.846] [bipp] [error] NufftSynthesis.collect() error: BIPP: Eigensolver error
[2025-02-27 18:52:03.116] [bipp] [info] 
 ============================================================================================================
                             #         Total          %   Parent %        Median           Min           Max
------------------------------------------------------------------------------------------------------------
Create NUFFTSynthesis        1     223.09 ms     100.00     100.00     223.09 ms     223.09 ms     223.09 ms

0x12ca9980 collect          21       5.54 ks     100.00     100.00       1.60 ms       1.55 ms       5.54 ks

============================================================================================================
 

[2025-02-27 18:52:03.116] [bipp] [info] 0x3d7cdf0 Context destroyed
[1740678724.185939] [kh082:1861093:0] cuda_copy_iface.c:524  UCX  ERROR cuCtxGetCurrent(&cuda_context) failed: unrecognized error code 4
[1740678724.185968] [kh082:1861093:0]  cuda_ipc_iface.c:537  UCX  ERROR cuCtxGetCurrent(&cuda_context) failed: unrecognized error code 4

I have attached the full err_output.txt

err_output.txt
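
The debug line for the "gram" array above only reports min, max and sum, which does not say whether the matrix is well conditioned. Assuming the eigensolver needs a Hermitian, positive-definite gram matrix (this is an assumption about bipp's internals, not something the log confirms), a saved copy of the matrix could be checked with plain NumPy. `diagnose_gram` and its `tol` parameter are hypothetical names used for illustration.

```python
import numpy as np


def diagnose_gram(G, tol=1e-10):
    """Return basic health checks for a (complex) square gram matrix.

    Hypothetical diagnostic, not a bipp API.
    """
    G = np.asarray(G)
    report = {
        "finite": bool(np.all(np.isfinite(G))),
        "hermitian_error": None,
        "min_eig": None,
        "max_eig": None,
        "positive_definite": False,
    }
    if not report["finite"]:
        # NaN/Inf entries make the eigenvalue check meaningless.
        return report

    # Largest deviation from G == G^H.
    report["hermitian_error"] = float(np.max(np.abs(G - G.conj().T)))

    # Symmetrise before eigvalsh, which assumes a Hermitian input.
    H = 0.5 * (G + G.conj().T)
    eig = np.linalg.eigvalsh(H)  # eigenvalues in ascending order
    report["min_eig"] = float(eig[0])
    report["max_eig"] = float(eig[-1])

    # Positive definite if the smallest eigenvalue is clearly above zero
    # relative to the largest one.
    report["positive_definite"] = bool(eig[0] > tol * max(abs(eig[-1]), 1.0))
    return report
```

A `min_eig` close to zero or negative, or a large `hermitian_error`, would point at the input data for that time step rather than at the solver itself.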
