
Analyze and improve Authd connection management #28908


Open
4 tasks done
fdalmaup opened this issue Mar 31, 2025 · 5 comments · May be fixed by #29135 or wazuh/qa-integration-framework#338
Labels
level/task type/bug Something isn't working

Comments

@fdalmaup
Member

fdalmaup commented Mar 31, 2025

Description

As part of the efforts to improve agent registration, it was found that the remote server thread may exhibit unexpected behavior when handling multiple simultaneous connections:

/* Thread for remote server */
void* run_remote_server(__attribute__((unused)) void *arg) {

Since delays may arise under heavy load, when multiple connections are being opened at the same time, the socket management based on the select function needs to be reviewed.

Tasks

  • Analysis of the code and reproduction of the potential bug (a script simulating the connections might be required to achieve this; see the client sketch after this list).
  • Improvements proposal.
  • Development of the improvements.
  • Unit tests update.
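
The simulation script mentioned above could look roughly like the following standalone client: it performs the TLS handshake against authd, sleeps a configurable number of milliseconds, and only then sends an enrollment payload. This is an illustrative sketch (OpenSSL assumed, no certificate verification, hard-coded manager address and payload), not the script actually used in the comments below.

/* Hypothetical enrollment client: connect, wait, then send the payload. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <openssl/ssl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv) {
    int delay_ms = argc > 1 ? atoi(argv[1]) : 0;   /* delay between handshake and payload */

    SSL_CTX *ctx = SSL_CTX_new(TLS_client_method());
    SSL *ssl = SSL_new(ctx);

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(1515);                   /* authd enrollment port */
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect");
        return 1;
    }

    SSL_set_fd(ssl, fd);
    if (SSL_connect(ssl) != 1) {
        fprintf(stderr, "TLS handshake failed\n");
        return 1;
    }

    /* The gap between the handshake and the payload is what exercises the
     * server's blocking read on this connection. */
    struct timespec ts = { delay_ms / 1000, (long)(delay_ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);

    const char *msg = "OSSEC A:'TESTAGENT' V:'v4.8.2' G:'default'\n";  /* illustrative payload */
    SSL_write(ssl, msg, (int)strlen(msg));

    char buf[4096];
    int n = SSL_read(ssl, buf, sizeof(buf) - 1);
    if (n > 0) {
        buf[n] = '\0';
        printf("Received: %s", buf);
    } else {
        printf("SSL read failed, error code: %d\n", SSL_get_error(ssl, n));
    }

    SSL_free(ssl);
    SSL_CTX_free(ctx);
    close(fd);
    return 0;
}

Built with -lssl -lcrypto and launched in a loop with different delays, it makes it easy to compare which registrations get a response; interrupting it during the sleep also simulates the abrupt disconnection discussed further down.
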
@fdalmaup fdalmaup added level/task type/bug Something isn't working labels Mar 31, 2025
@wazuhci wazuhci moved this to Backlog in XDR+SIEM/Release 4.13.0 Mar 31, 2025
@MiguelazoDS MiguelazoDS self-assigned this Apr 1, 2025
@wazuhci wazuhci moved this from Backlog to In progress in XDR+SIEM/Release 4.13.0 Apr 1, 2025
@MiguelazoDS
Member

Update

After discussing this with the team and implementing a script that simulates multiple agent connections, I was able to replicate, although not consistently, the errors seen in the shared cloud environment.

According to the analysis made by Octavio Valle, the main reason for this issue is the way authd currently manages connections. Connection handling is sequential, which causes delays and blocking behavior whenever a connection has been accepted but its payload has not yet arrived from the client.

The server blocks on read while waiting for the client's payload.

SSL Error (-1)

This was replicated by killing the client, ending the connection abruptly before the payload was sent. This may indicate that the analysis above is on the right path: the connection succeeded, but the server could not read the payload, which is probably what happens during an agent restart on a congested network.

2025/04/03 17:45:25 wazuh-authd: INFO: Agent key generated for 'SKFHFEWYHZYAQTD' (requested by any)
2025/04/03 17:45:25 wazuh-authd: INFO: New connection from 192.168.33.70
2025/04/03 17:45:26 wazuh-authd: ERROR: SSL Error (-1)
2025/04/03 17:45:34 wazuh-remoted: INFO: (1409): Authentication file changed. Updating.
2025/04/03 17:45:34 wazuh-remoted: INFO: (1410): Reading authentication keys file.

SSL read (unable to receive message).

This was replicated by launching a large number of requests with different delays between the connection and the payload.

On the client side we can see the following: some requests receive the response and some do not.

Sending: OSSEC A:'NUGBSIKKCPPZQVA' V:'v4.8.2' G:'default' K:'abd7a6c8638bc0b3a57db0763cbd1aa94335af6b'
Received: OSSEC K:'1433 NUGBSIKKCPPZQVA any bdb8e1374002149ce40a491fbe37dabb72c61ce1ac532bb928065597d2bd1392'
Sleeping for: 1000 milliseconds
Sending: OSSEC A:'ECFJHKAREPFQZCI' V:'v4.8.2' G:'default' K:'abd7a6c8638bc0b3a57db0763cbd1aa94335af6b'
Received: OSSEC K:'1434 ECFJHKAREPFQZCI any e9d13032fe97059cb7667dc54d9fcf9267de7576ac668ec45f7fb1a085a45703'
Sleeping for: 1000 milliseconds
Sending: OSSEC A:'QSFCZEOFPQNFYBT' V:'v4.8.2' G:'default' K:'abd7a6c8638bc0b3a57db0763cbd1aa94335af6b'
SSL read (unable to receive message). Error code: 5
Sleeping for: 1000 milliseconds
Sending: OSSEC A:'OJBRLAKATRWROHK' V:'v4.8.2' G:'default' K:'abd7a6c8638bc0b3a57db0763cbd1aa94335af6b'
Received: OSSEC K:'1435 OJBRLAKATRWROHK any 8f75ce63a36482f6f784d70850279cdaa3677ce39ed74f283f52d537e10e86c9'

On the server side we can see that the manager never received that request (i.e. QSFCZEOFPQNFYBT); it ended in a timeout.

root@jammy:/home/vagrant/wazuh# grep -A 3 "ECFJHKAREPFQZCI" /var/ossec/logs/ossec.log
2025/04/03 18:13:34 wazuh-authd: INFO: Received request for a new agent (ECFJHKAREPFQZCI) from: 192.168.33.70
2025/04/03 18:13:34 wazuh-authd: INFO: Agent key generated for 'ECFJHKAREPFQZCI' (requested by any)
2025/04/03 18:13:34 wazuh-authd: INFO: New connection from 192.168.33.70
2025/04/03 18:13:36 wazuh-authd: INFO: Client timeout from 192.168.33.70
2025/04/03 18:13:36 wazuh-authd: INFO: New connection from 192.168.33.70
root@jammy:/home/vagrant/wazuh# grep "QSFCZEOFPQNFYBT" /var/ossec/logs/ossec.log
root@jammy:/home/vagrant/wazuh#

@MiguelazoDS
Member

MiguelazoDS commented Apr 4, 2025

Update

Sometimes the error code is 1 and sometimes 5 (presumably SSL_get_error values, i.e. SSL_ERROR_SSL and SSL_ERROR_SYSCALL). Note that this is a custom script, not an actual agent.

Sending: OSSEC A:'SGXJUEUWOKOAJQH' V:'v4.8.2' G:'default' K:'abd7a6c8638bc0b3a57db0763cbd1aa94335af6b'
SSL read (unable to receive message). Hostname: SGXJUEUWOKOAJQH, Error code: 1
If Agent verification is enabled, agent key and certificates may be incorrect!

Sending: OSSEC A:'LTQSVETNNBLVITR' V:'v4.8.2' G:'default' K:'abd7a6c8638bc0b3a57db0763cbd1aa94335af6b'
SSL read (unable to receive message). Hostname: LTQSVETNNBLVITR, Error code: 5
If Agent verification is enabled, agent key and certificates may be incorrect!
Sleeping for: 1110 milliseconds

The select implementation here does not manage a file descriptor per client connection; it is effectively just a "smart" sleep while waiting for incoming connections.
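
A rough sketch of that pattern, for illustration (plain TCP, no TLS, hypothetical code rather than the actual authd loop): select() only guards the accept() on the listening socket, and each accepted connection is then read synchronously.

#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(1515);
    bind(srv, (struct sockaddr *)&addr, sizeof(addr));
    listen(srv, 128);

    for (;;) {
        /* select() watches only the listening socket, so it behaves as a
         * bounded sleep until a new connection arrives... */
        fd_set fdset;
        FD_ZERO(&fdset);
        FD_SET(srv, &fdset);
        struct timeval timeout = { .tv_sec = 1, .tv_usec = 0 };
        if (select(srv + 1, &fdset, NULL, NULL, &timeout) <= 0) {
            continue;
        }

        int client = accept(srv, NULL, NULL);
        if (client < 0) {
            continue;
        }

        /* ...and the accepted connection is handled synchronously: if this
         * client delays or never sends its payload, every connection queued
         * behind it waits here with it. */
        char buf[4096];
        ssize_t n = read(client, buf, sizeof(buf) - 1);
        if (n > 0) {
            buf[n] = '\0';
            printf("payload: %s", buf);
            const char *resp = "OSSEC K:'...'\n";
            write(client, resp, strlen(resp));
        }
        close(client);
    }
}
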

A quick workaround for this behavior is to increase the network timeout value. After changing the auth options in internal_options.conf, the problem no longer happens and all agents are registered. Of course, the server will still block for that amount of time whenever a client delays or never sends its payload.


An approach that I would like to consider here is to make the server non-blocking, managing the connections with epoll_wait instead of blocking on accept.
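
A minimal sketch of that direction, assuming epoll and leaving out the TLS layer and the registration protocol (it only illustrates the shape of the event loop, not the final implementation):

#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_EVENTS 64

static void set_nonblocking(int fd) {
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
}

int main(void) {
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(1515);
    bind(srv, (struct sockaddr *)&addr, sizeof(addr));
    listen(srv, 128);
    set_nonblocking(srv);

    int ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = srv };
    epoll_ctl(ep, EPOLL_CTL_ADD, srv, &ev);

    for (;;) {
        struct epoll_event events[MAX_EVENTS];
        int n = epoll_wait(ep, events, MAX_EVENTS, -1);

        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;

            if (fd == srv) {
                /* Accept every pending connection and register its fd. */
                int client;
                while ((client = accept(srv, NULL, NULL)) >= 0) {
                    set_nonblocking(client);
                    struct epoll_event cev = { .events = EPOLLIN, .data.fd = client };
                    epoll_ctl(ep, EPOLL_CTL_ADD, client, &cev);
                }
            } else {
                /* A client fd became readable: a slow or dead client no
                 * longer blocks the others. */
                char buf[4096];
                ssize_t r = read(fd, buf, sizeof(buf));
                if (r > 0) {
                    const char *resp = "OSSEC K:'...'\n";
                    write(fd, resp, strlen(resp));
                    epoll_ctl(ep, EPOLL_CTL_DEL, fd, NULL);
                    close(fd);
                } else if (r == 0 || (errno != EAGAIN && errno != EWOULDBLOCK)) {
                    /* Connection closed or failed: drop it. */
                    epoll_ctl(ep, EPOLL_CTL_DEL, fd, NULL);
                    close(fd);
                }
            }
        }
    }
}
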

I need to better understand how to manage the SSL connections, because SSL_read can also write during the handshake and other SSL operations.
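
For reference, the handling usually needed there looks something like the helper below, assuming OpenSSL on non-blocking sockets (hypothetical code, not part of the branch): depending on SSL_get_error(), the fd has to be re-armed for reading or for writing.

#include <openssl/ssl.h>
#include <stdint.h>
#include <sys/epoll.h>

/* Attempt one SSL_read on a non-blocking connection. Returns the epoll event
 * mask to re-arm (EPOLLIN or EPOLLOUT), or 0 if the read finished or the
 * connection should be closed. *out_len receives the bytes read. */
uint32_t ssl_read_step(SSL *ssl, char *buf, int cap, int *out_len) {
    *out_len = SSL_read(ssl, buf, cap);

    if (*out_len > 0) {
        return 0;                   /* payload (or part of it) received */
    }

    switch (SSL_get_error(ssl, *out_len)) {
    case SSL_ERROR_WANT_READ:
        return EPOLLIN;             /* wait until the socket is readable */
    case SSL_ERROR_WANT_WRITE:
        return EPOLLOUT;            /* TLS needs to send before it can read */
    default:
        return 0;                   /* real error: the caller closes the fd */
    }
}
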

What I can see is that we already have a wrapper for epoll in notify_op.c/h.

I'll be uploading changes here https://github.com/wazuh/wazuh/tree/enhancement/28908-authd-connection-management

There is still a long way to go to get this done; I need to understand the epoll API in depth.

@MiguelazoDS
Member

MiguelazoDS commented Apr 8, 2025

Update

Although I don't see any "definitely lost" blocks, I do not fully trust those "still reachable" leaks.

==167814== LEAK SUMMARY:
==167814==    definitely lost: 0 bytes in 0 blocks
==167814==    indirectly lost: 0 bytes in 0 blocks
==167814==      possibly lost: 0 bytes in 0 blocks
==167814==    still reachable: 4,774 bytes in 82 blocks
==167814==         suppressed: 0 bytes in 0 blocks

Note

I took a look, and most of them may be required by other modules; they are variables that live inside shared structures.


Trying to reproduce the issue

10 agents


 Sleeping for: 1160 milliseconds
Sleeping for: 1180 milliseconds
Sleeping for: 1150 milliseconds
Sleeping for: 1140 milliseconds
Sleeping for: 1170 milliseconds
Sleeping for: 1130 milliseconds
Sleeping for: 1120 milliseconds
Sleeping for: 1110 milliseconds
Sleeping for: 1100 milliseconds
Sending: OSSEC A:'USCEZZLNPZEPKNM' V:'v4.8.2' G:'default' K:'abd7a6c8638bc0b3a57db0763cbd1aa94335af6b'

Received: OSSEC K:'002 USCEZZLNPZEPKNM any 0d3fe25e0e9a40fa83dc014eb690f3b3a35a26581765cfc1e71d8a8842352400'

[2]    2984504 done       ./client $((1200-i*10))
  Sending: OSSEC A:'SMRVPWFFKOOBUVI' V:'v4.8.2' G:'default' K:'abd7a6c8638bc0b3a57db0763cbd1aa94335af6b'

Received: OSSEC K:'003 SMRVPWFFKOOBUVI any 3cefef13347e3accbcf78c8a846edfa67cf897c7cf05773833fdb4decba46588'

[5]    2984515 done       ./client $((1200-i*10))
  Sending: OSSEC A:'CHWOFNBWALCFOGZ' V:'v4.8.2' G:'default' K:'abd7a6c8638bc0b3a57db0763cbd1aa94335af6b'

Received: OSSEC K:'004 CHWOFNBWALCFOGZ any 782b8d458b60c3dbcd68b0f70ec775b83586694b919a50cfda6850053a900c38'

[3]    2984513 done       ./client $((1200-i*10))
  Sending: OSSEC A:'KBISOEORBUTLKWM' V:'v4.8.2' G:'default' K:'abd7a6c8638bc0b3a57db0763cbd1aa94335af6b'

Received: OSSEC K:'005 KBISOEORBUTLKWM any 633b1ba4dc3a8dd45678a530b4e8a81661e3bcb437ba7952266fee2cb7ddcb91'

[6]    2984516 done       ./client $((1200-i*10))
  Sending: OSSEC A:'PEJCYOOQHCMEAJF' V:'v4.8.2' G:'default' K:'abd7a6c8638bc0b3a57db0763cbd1aa94335af6b'

Received: OSSEC K:'006 PEJCYOOQHCMEAJF any 3ced314f782c080a3a46ffc5f4d45c8e39c41ad60f7723d1f1fbc263b38fad90'

[7]    2984517 done       ./client $((1200-i*10))
  Sending: OSSEC A:'DQXTBASICKQLBBZ' V:'v4.8.2' G:'default' K:'abd7a6c8638bc0b3a57db0763cbd1aa94335af6b'

Received: OSSEC K:'007 DQXTBASICKQLBBZ any c23c4cb1fd49b48afb084574b9fcbe3042143fce6f80a1bf1bb75a6317f587ef'

[4]    2984514 done       ./client $((1200-i*10))
Sending: OSSEC A:'OAYCETNIDIUPGHP' V:'v4.8.2' G:'default' K:'abd7a6c8638bc0b3a57db0763cbd1aa94335af6b'

  Received: OSSEC K:'008 OAYCETNIDIUPGHP any b7caa6de793851445f49b46e5bfbff128f2bdde3de39fc46de7e42211cff92c2'
Sending: OSSEC A:'XWHXCWXNAWDMRZH' V:'v4.8.2' G:'default' K:'abd7a6c8638bc0b3a57db0763cbd1aa94335af6b'


[8]    2984518 done       ./client $((1200-i*10))
Received: OSSEC K:'009 XWHXCWXNAWDMRZH any dbc9ba52be31427aaa5970e101aa86fbd050a0fb1c580200e10bcf8870f33dc1'

[9]    2984519 done       ./client $((1200-i*10))
  Sending: OSSEC A:'SPGUBDHEPCYVXPB' V:'v4.8.2' G:'default' K:'abd7a6c8638bc0b3a57db0763cbd1aa94335af6b'

Received: OSSEC K:'010 SPGUBDHEPCYVXPB any e75459ca88361dc66e8f136329efa9549dbb84e2a54bde732da06bd6bbff23ce'

[10]  - 2984520 done       ./client $((1200-i*10))
  Sending: OSSEC A:'XIYOBECWGWSPKBC' V:'v4.8.2' G:'default' K:'abd7a6c8638bc0b3a57db0763cbd1aa94335af6b'

Received: OSSEC K:'011 XIYOBECWGWSPKBC any d4ded835cf624aa6fa932c31c0ea454b75b56e28a36e569731a4ebf94ff1187e'

[11]  + 2984521 done       ./client $((1200-i*10))


1000 agents (max size epoll queue)

for i in $(seq 1 1000); do; ./client $((1200-i))&; done
root@jammy:/home/vagrant/wazuh/src# agent_control -l | tail -n 20
   ID: 984, Name: QWJDVKNEDHUGWMN, IP: any, Never connected
   ID: 985, Name: JILKSJNAOMYKLEG, IP: any, Never connected
   ID: 986, Name: VYWHXTPYJYUSUTL, IP: any, Never connected
   ID: 987, Name: GMTXGWGRUKSJIYD, IP: any, Never connected
   ID: 988, Name: LCDXKPFATJZDTYD, IP: any, Never connected
   ID: 989, Name: LSERTXEGUCCTACL, IP: any, Never connected
   ID: 990, Name: GBNNVLVYOFPYMDT, IP: any, Never connected
   ID: 991, Name: CFFUGHGETVEQCJM, IP: any, Never connected
   ID: 992, Name: UPNJFROVPYDKQPM, IP: any, Never connected
   ID: 993, Name: GIHNPSFDXGUBKKR, IP: any, Never connected
   ID: 994, Name: JTIAJUOJLJEAJFD, IP: any, Never connected
   ID: 995, Name: QPFYLGMTNFZKHPE, IP: any, Never connected
   ID: 996, Name: UUXYHTQLTMGFAKF, IP: any, Never connected
   ID: 997, Name: DIERUUPDDYPVCYG, IP: any, Never connected
   ID: 998, Name: GQOMUKLKHFIFBPQ, IP: any, Never connected
   ID: 999, Name: VSOFTKBAEATUCMN, IP: any, Never connected
   ID: 1000, Name: YYIHWZHVZIPQZLK, IP: any, Never connected

List of agentless devices:

@MiguelazoDS MiguelazoDS linked a pull request Apr 8, 2025 that will close this issue
@MiguelazoDS
Member

MiguelazoDS commented Apr 8, 2025

Update

Some thread sanitizer warnings

This one seems to be caused by closing the pipe's write fd in thread T2 while the signal handler in the main thread can still write to it (see the sketch after the trace).

WARNING: ThreadSanitizer: data race (pid=212279)
  Write of size 8 at 0x7ba000000050 by thread T2:
    #0 close ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1690 (libtsan.so.0+0x36996)
    #1 run_remote_server os_auth/main-server.c:1032 (wazuh-authd+0x12833)

  Previous read of size 8 at 0x7ba000000050 by main thread:
    #0 write ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:1141 (libtsan.so.0+0x4b310)
    #1 handler os_auth/main-server.c:1212 (wazuh-authd+0x11606)
    #2 main os_auth/main-server.c:660 (wazuh-authd+0xed41)

  Location is file descriptor 5 created by main thread at:
    #0 pipe ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1726 (libtsan.so.0+0x3ea28)
    #1 main os_auth/main-server.c:519 (wazuh-authd+0xea38)

  Thread T2 (tid=212282, running) created by main thread at:
    #0 pthread_create ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:969 (libtsan.so.0+0x605b8)
    #1 main os_auth/main-server.c:637 (wazuh-authd+0xef32)

SUMMARY: ThreadSanitizer: data race os_auth/main-server.c:1032 in run_remote_server
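
One way to avoid this particular race, sketched below under the assumption that the pipe is the shutdown self-pipe: only the signal handler writes to it, the worker thread never closes it, and both ends are closed by the main thread after pthread_join(). Illustrative pattern only, not the actual fix in the branch.

#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static int shutdown_pipe[2];

static void handler(int sig) {
    (void)sig;
    int saved = errno;                     /* also addresses the "spoils errno" warning */
    const char c = 1;
    (void)write(shutdown_pipe[1], &c, 1);  /* write() is async-signal-safe */
    errno = saved;
}

static void *worker(void *arg) {
    (void)arg;
    char c;
    /* Block until the handler signals shutdown. The worker must NOT close the
     * pipe here: the handler could still write to a closed (or reused)
     * descriptor, which is the data race reported above. */
    (void)read(shutdown_pipe[0], &c, 1);
    return NULL;
}

int main(void) {
    pipe(shutdown_pipe);

    struct sigaction sa = {0};
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGINT, &sa, NULL);

    pthread_t tid;
    pthread_create(&tid, NULL, worker, NULL);

    /* The worker returns once the handler pokes the pipe (e.g. on Ctrl+C). */
    pthread_join(tid, NULL);

    /* Only now, when nothing can write to it anymore, close both ends. */
    close(shutdown_pipe[0]);
    close(shutdown_pipe[1]);
    puts("clean shutdown");
    return 0;
}
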

There are some others, predating this implementation, that I would like to discuss with the team:

==================
WARNING: ThreadSanitizer: signal-unsafe call inside of a signal (pid=229740)
    #0 calloc ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:672 (libtsan.so.0+0x31edc)
    #1 w_get_timestamp shared/time_op.c:88 (wazuh-authd+0xc8066)
    #2 main os_auth/main-server.c:660 (wazuh-authd+0xed41)

SUMMARY: ThreadSanitizer: signal-unsafe call inside of a signal shared/time_op.c:88 in w_get_timestamp
==================
==================
WARNING: ThreadSanitizer: signal-unsafe call inside of a signal (pid=229740)
    #0 malloc ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:655 (libtsan.so.0+0x31c57)
    #1 __fopen_internal libio/iofopen.c:65 (libc.so.6+0x7f64d)
    #2 _IO_new_fopen libio/iofopen.c:86 (libc.so.6+0x7f64d)
    #3 wfopen shared/file_op.c:2791 (wazuh-authd+0xaf545)
    #4 main os_auth/main-server.c:660 (wazuh-authd+0xed41)

SUMMARY: ThreadSanitizer: signal-unsafe call inside of a signal libio/iofopen.c:65 in __fopen_internal
==================
==================
WARNING: ThreadSanitizer: signal-unsafe call inside of a signal (pid=229740)
    #0 malloc ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:655 (libtsan.so.0+0x31c57)
    #1 __GI__IO_file_doallocate libio/filedoalloc.c:101 (libc.so.6+0x7eba3)
    #2 fprintf ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:1721 (libtsan.so.0+0x441b2)
    #3 _log_function shared/debug_op.c:235 (wazuh-authd+0xa7a7c)
    #4 main os_auth/main-server.c:660 (wazuh-authd+0xed41)

SUMMARY: ThreadSanitizer: signal-unsafe call inside of a signal libio/filedoalloc.c:101 in __GI__IO_file_doallocate
==================
==================
WARNING: ThreadSanitizer: signal-unsafe call inside of a signal (pid=229740)
    #0 free ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:711 (libtsan.so.0+0x37ab8)
    #1 __GI__IO_setb libio/genops.c:331 (libc.so.6+0x8dc78)
    #2 _log_function shared/debug_op.c:256 (wazuh-authd+0xa7ccc)
    #3 main os_auth/main-server.c:660 (wazuh-authd+0xed41)

SUMMARY: ThreadSanitizer: signal-unsafe call inside of a signal libio/genops.c:331 in __GI__IO_setb
==================
==================
WARNING: ThreadSanitizer: signal-unsafe call inside of a signal (pid=229740)
    #0 free ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:711 (libtsan.so.0+0x37ab8)
    #1 _IO_deallocate_file libio/libioP.h:862 (libc.so.6+0x7ed26)
    #2 _IO_new_fclose libio/iofclose.c:74 (libc.so.6+0x7ed26)
    #3 _log_function shared/debug_op.c:256 (wazuh-authd+0xa7ccc)
    #4 main os_auth/main-server.c:660 (wazuh-authd+0xed41)

SUMMARY: ThreadSanitizer: signal-unsafe call inside of a signal libio/libioP.h:862 in _IO_deallocate_file
==================
2025/04/08 18:00:55 wazuh-authd[229740] main-server.c:1212 at handler(): INFO: (1225): SIGNAL [(2)-(Interrupt)] Received. Exit Cleaning...
==================
WARNING: ThreadSanitizer: signal-unsafe call inside of a signal (pid=229740)
    #0 free ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:711 (libtsan.so.0+0x37ab8)
    #1 _log_function shared/debug_op.c:265 (wazuh-authd+0xa7d37)
    #2 main os_auth/main-server.c:660 (wazuh-authd+0xed41)

SUMMARY: ThreadSanitizer: signal-unsafe call inside of a signal shared/debug_op.c:265 in _log_function
==================
==================
WARNING: ThreadSanitizer: signal handler spoils errno (pid=229740)
    #0 handler os_auth/main-server.c:1203 (wazuh-authd+0x11580)
    #1 main os_auth/main-server.c:660 (wazuh-authd+0xed41)

SUMMARY: ThreadSanitizer: signal handler spoils errno os_auth/main-server.c:1203 in handler
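
The usual way to clear these warnings is to keep the handler async-signal-safe: it only records the signal (and optionally pokes the self-pipe), and the logging/cleanup currently done inside handler() is deferred to normal context once the main loop notices the flag. A minimal sketch of that pattern (illustrative only):

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t received_signal = 0;

static void handler(int sig) {
    /* Async-signal-safe: just remember which signal arrived. */
    received_signal = sig;
}

int main(void) {
    struct sigaction sa = {0};
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGINT, &sa, NULL);
    sigaction(SIGTERM, &sa, NULL);

    while (!received_signal) {
        /* Real code would sit in epoll_wait()/select() here; a self-pipe
         * write from the handler would wake it up immediately. */
        pause();
    }

    /* Ordinary context: calling the logger (fopen/malloc/fprintf) is safe
     * here, unlike inside the signal handler. */
    printf("SIGNAL [(%d)] Received. Exit Cleaning...\n", (int)received_signal);
    return 0;
}
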

@MiguelazoDS
Member

MiguelazoDS commented Apr 9, 2025

Update

Fixing all these cases

https://github.com/wazuh/wazuh/actions/runs/14341591613/job/40201919478

Most of them are related to the newline character in the messages coming from agents, but there is one that causes a crash.


There are two tests that seem to fail due to a timeout:

FAILED test_authd/test_common/test_authd.py::test_ossec_auth_messages[Multiple Group - Exceed max multigrup length] - AssertionError: The manager did not respond to the message sent.
FAILED test_authd/test_common/test_authd.py::test_ossec_auth_messages[Multiple Group - Send a message of exactly 16k] - AssertionError: The manager did not respond to the message sent.

I increased the buffer size to fix those last two failing tests.
