Hi @amirgon, does ArrowFlight use the UCT or the UCP API?
-
Hi,
I would like to bring up an issue we are seeing with UCX when running Apache ArrowFlight over UCX.
I'm not sure if this is a bug in UCX, in ArrowFlight or in the way we use ArrowFlight with UCX, so I'm opening this for discussion.
The problem is hard to reproduce: it only happens occasionally, when very large amounts of data are transferred over ArrowFlight with the UCX POSIX shared-memory transport.
When the problem happens, communication freezes and UCX shows these errors:
and also:
The last one is unusual, since we use `UCX_POSIX_USE_PROC_LINK=n`, and in that case `uct_posix_shm_open` should be called rather than `uct_posix_file_open`. `(null)` appears in the file name because `posix_config->dir` was null, as expected when using `uct_posix_shm_open`. However, `UCT_POSIX_SEG_FLAG_SHM_OPEN` was unexpectedly read as 0 on that specific segment, although it should have been 1 for all segments.

Digging into this further, we found that the issue was related to memory pool corruption, specifically of the `mm_recv_desc` pool. The corruption caused elements from the allocated list to be linked to the free list, which eventually caused the errors above.
The reason `mm_recv_desc` got corrupted was that it was used from multiple threads. To support zero-copy, the received buffers were released only when they reached their final destination on a different thread, so the allocating thread and the releasing thread are different, while the UCX mpool is not thread safe.
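To illustrate how an unsynchronized release can put an in-use element back on the free list, here is a minimal sketch that replays one bad interleaving on a single thread, so the outcome is deterministic. The `elem_t` and `pool_t` types and the `replay_race` function are hypothetical simplifications for illustration, not the real UCX mpool structures or code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified free-list element; not the real UCX mpool element. */
typedef struct elem {
    struct elem *next;
} elem_t;

/* Hypothetical pool: just the head of a singly linked free list. */
typedef struct {
    elem_t *freelist;
} pool_t;

/*
 * Replays, on one thread, an interleaving in which a "put" (release) is
 * interrupted by a complete "get" (allocate).
 * Returns 1 if the element handed out by "get" is still reachable from
 * the free list afterwards, 0 otherwise.
 */
int replay_race(void)
{
    elem_t e[3];
    pool_t pool;

    /* Initial state: e[0] is being released; the free list is e[1] -> e[2]. */
    e[1].next = &e[2];
    e[2].next = NULL;
    pool.freelist = &e[1];

    /* Releasing thread, step 1: link the freed element to the current head. */
    e[0].next = pool.freelist;          /* e[0] -> e[1] */

    /* Allocating thread runs a whole "get" in between and pops e[1]. */
    elem_t *allocated = pool.freelist;  /* e[1] is now in use */
    pool.freelist = allocated->next;    /* head = e[2] */

    /* Releasing thread, step 2: publish the new head using its stale link. */
    pool.freelist = &e[0];              /* e[0] -> e[1] -> e[2] */

    /* Check whether the in-use element is on the free list. */
    for (elem_t *p = pool.freelist; p != NULL; p = p->next) {
        if (p == allocated) {
            return 1;
        }
    }
    return 0;
}
```

In this replay `replay_race()` returns 1: `e[1]` is both handed out and linked into the free list, so a later "get" could hand the same buffer to a second user.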
To fix that, we added a spinlock around the UCX mpool functions `ucs_mpool_get_inline` and `ucs_mpool_add_to_freelist`, and this seems to resolve the issue without impacting performance. (If this is a valid solution, I can create a PR.)

After this fix the issue no longer blocks us. However, there are still open questions:
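For reference, the idea behind the spinlock fix described above can be sketched as follows. This is not the actual patch and does not use the UCX API: `locked_pool_t`, `locked_pool_get`, `locked_pool_put` and `run_locked_pool_demo` are hypothetical names, and a C11 `atomic_flag` stands in for whatever spinlock primitive a real change would use. For simplicity, both worker threads allocate and release, rather than one allocating and the other releasing.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

#define POOL_SIZE  8
#define ITERATIONS 100000

/* Hypothetical, simplified element and pool; not the real UCX mpool types. */
typedef struct elem {
    struct elem *next;
} elem_t;

typedef struct {
    elem_t     *freelist;
    atomic_flag lock;      /* minimal spinlock guarding the free list */
} locked_pool_t;

static void pool_lock(locked_pool_t *mp)
{
    while (atomic_flag_test_and_set(&mp->lock)) {
        /* spin until the current holder clears the flag */
    }
}

static void pool_unlock(locked_pool_t *mp)
{
    atomic_flag_clear(&mp->lock);
}

/* Pop one element from the free list, or return NULL if it is empty. */
static elem_t *locked_pool_get(locked_pool_t *mp)
{
    pool_lock(mp);
    elem_t *e = mp->freelist;
    if (e != NULL) {
        mp->freelist = e->next;
    }
    pool_unlock(mp);
    return e;
}

/* Push one element back onto the free list. */
static void locked_pool_put(locked_pool_t *mp, elem_t *e)
{
    pool_lock(mp);
    e->next = mp->freelist;
    mp->freelist = e;
    pool_unlock(mp);
}

static void *worker(void *arg)
{
    locked_pool_t *mp = arg;
    for (int i = 0; i < ITERATIONS; i++) {
        elem_t *e = locked_pool_get(mp);
        if (e != NULL) {
            locked_pool_put(mp, e);
        }
    }
    return NULL;
}

/* Runs two threads against one pool and returns the final free-list length. */
int run_locked_pool_demo(void)
{
    static elem_t storage[POOL_SIZE];
    locked_pool_t mp = { .freelist = NULL, .lock = ATOMIC_FLAG_INIT };
    pthread_t threads[2];

    for (int i = 0; i < POOL_SIZE; i++) {
        locked_pool_put(&mp, &storage[i]);
    }
    for (int i = 0; i < 2; i++) {
        if (pthread_create(&threads[i], NULL, worker, &mp) != 0) {
            abort();
        }
    }
    for (int i = 0; i < 2; i++) {
        pthread_join(threads[i], NULL);
    }

    /* Count free elements; the bound stops the walk if the list has a cycle. */
    int count = 0;
    for (elem_t *e = mp.freelist; e != NULL && count <= POOL_SIZE; e = e->next) {
        count++;
    }
    return count;
}
```

After both threads finish, every element should be back on the free list exactly once, so `run_locked_pool_demo()` is expected to return `POOL_SIZE`. The lock is held only for a few pointer operations, which is consistent with the observation that the fix did not visibly affect performance, although that would need to be measured on the real workload.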