-
Couldn't load subscription status.
- Fork 164
Description
Hi.
I'm using libsmb2 on a Windows 11 laptop to connect to a Samba drive on a Linux embedded device. The connection happens through a USB3 cable, where the embedded device is detected as a network card on the laptop using RNDIS.
It has been working well so far, with the exception that if I unplug the USB cable during a transfer, then I have a crash when I delete the context later with smb2_destroy_context.
Now I tried to update the lib to the latest commit on master, as my prior version of libsmb2 was from June 2022.
Hang forever
But now with the latest new version, I get an infinite hang if I unplug the cable during a read operation using the sync API:
- It stays stuck forever in the
wait_for_replyfunction. - Adding a timeout does not change anything, because
fdis never invalidated to -1 and the function does not return. - In
smb2_service_fd, the polling raises flagsPOLLERRandPOLLHUP. ThePOLLHUPblock is never reached due to thegotoinPOLLERRblock. However, in both cases I don't see anything that could invalidate thefdthere.
Fix
EDIT: After checking the diff since June 2022, I see that smb2_service and smb2_service_fd now return a t_socket type instead of an int. But on Windows, this type is defined as unsigned, so returning -1 and checking for < 0 does not work, causing the function to not return and thus the infinite hang.
The faulty commit seems to be 4b1ef1f6bd09f96a9f0b78accfeac6e2c2766f34.
Crash
EDIT2: With the latest version and the hang fixed, if I unplug the USB cable while it's transferring a file, and then I close the SMB client and call smb2_destroy_context, then I get something looking like memory corruption, causing fully random crashes in the rest of the application. Commenting the line smb2_destroy_context in the destructor of my client fixes these crashes.
Any idea what could go wrong in that case ? This seems very difficult to find and fix (on Windows, without valgrind); I'll try to replace the calloc and free functions and put prints, but the nature of the code with multiple callbacks and generic void* types makes it extremely difficult to debug.
Fix
EDIT3: After two full days of analysis, the commit 0dc1cc795512c808b52ecc83f7cd3e530936655d which creates a special case for the close operation is the cause of my crash.
In my workflow, I read a file and then close it using smb2_close() (sync). However, if the USB network interface is removed during the transfer, the read operation fails, but the smb2_close() is still called to simplify the logic, expecting it to fail as well. In this smb2_close(), the wait_for_reply() call fails with code -1 due to an expected error in smb2_service(), also returning -1. However in that case, the callback remains set somewhere in a PDU that was queued. So calling the free() at the end of smb2_close() frees this memory that is still referenced in a PDU callback.
Then during the call to smb2_destroy_context when destroying the client, the PDUs in the queues are cancelled and all callback are called. But as the one for the close operation was already freed, it crashes !
What could be done to fix these issues? Reverting these 2 commits is maybe not the best fix, as one of them was added to fix a memory leak apparently.
In my case I'll fix locally already, because a leak is better than a crash.