Skip to content

CP-48676: Reuse pool sessions on slave logins #5986

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Oct 1, 2024

Conversation

snwoods
Copy link
Contributor

@snwoods snwoods commented Sep 12, 2024

This PR allows pool sessions to be reused, rather than wasting time creating new ones each time to talk to the slaves.

The average time spent across 15 slave get_servertime xenapi calls on a virtual pool was 20.97ms with reuse-pool-sessions = false.
image
Compared to an average of 17.11ms with reuse-pool-sessions = true
image

In particular, the part of the request that has actually been changed, the login_no_password call has gone from about 2.41ms to 0.008ms, 300x faster than previously.

One part of why this speedup is possible is because we no longer have to do a pool.get_all for each session (to force the newly created session's time to update). If we were to validate the reusable pool session is valid before use, we would still need to do this pool.get_all. However, as we are now no longer destroying the db session after use, it should continue to be valid. As such, to achieve the best speed increase, this option of validating the reusable pool session (validate-reusable-pool-session in xapi.conf) is off by default.

in
new_session_id
in
match (originator, pool, is_local_superuser, uname) with
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of match seems questionable as no destruction of the tuple is used and we still have when and if.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point thanks, I think originally match made sense but I refactored it to the point it no longer does.

@snwoods snwoods force-pushed the private/stevenwo/CP-48676 branch from d76a68f to b52c92f Compare September 13, 2024 14:05
else
let new_session_id = create_session () in
with_lock reusable_pool_session_lock (fun () ->
reusable_pool_session := new_session_id
Copy link
Contributor

@edwintorok edwintorok Sep 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use a lock for writing we also need to use a lock for reading, otherwise this is not different from just having the assignment directly (at least on OCaml 4), and there is a race where a session would be created and potentially leaked on a race.

Usually the lock needs to be placed around the whole read-modify-write cycle.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I do wonder if the performance will be worse though if the threads have to wait for each other to get a session?

@snwoods snwoods force-pushed the private/stevenwo/CP-48676 branch from b52c92f to 510b17d Compare September 23, 2024 10:31
(try Db.Session.destroy ~__context ~self with _ -> ()) ;
Rbac.destroy_session_permissions_tbl ~session_id:self
with_lock reusable_pool_session_lock (fun () ->
if self <> !reusable_pool_session then (
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I presume we also need to protect this read with the lock @edwintorok ? Otherwise one thread could assign to reusable_pool_session just after this is read, and then the session would be destroyed?

@robhoes
Copy link
Member

robhoes commented Sep 23, 2024

One part of why this speedup is possible is because we no longer have to do a pool.get_all for each session (to force the newly created session's time to update).

You are referring to the following code in login_no_password_common:

  (* At this point, the session is created, but with an incorrect time *)
  (* Force the time to be updated by calling an API function with this session *)
  let rpc = Helpers.make_rpc ~__context in
  ignore (Client.Pool.get_all ~rpc ~session_id) ;

But does anyone understand why this is needed at all? That is, in what way in the time "incorrect" and how does the API call make it correct? The session is created in the DB in this function with last_active equal to Date.now (). The API call will trigger Session_check.check on the coordinator, which validates the session and may indeed update last_active, but it only does that if the current value if the DB field is more than 10 min in the past, which I hope won't ever be the case in this scenario. What am I missing?

I guess issues may occur if the pool member doing the login and the coordinator do not have the same time on their clocks. But if there is the case I bet lots of stuff may go wrong...

I found the original ticket now that caused this code to be added. It was due to a "clock skew" of 9 (!) days between the two hosts, which meant that the session got deleted by the GC as it was "not used" for 24 hours :)

@snwoods snwoods force-pushed the private/stevenwo/CP-48676 branch from 510b17d to f732223 Compare September 24, 2024 09:21
@snwoods
Copy link
Contributor Author

snwoods commented Sep 24, 2024

One part of why this speedup is possible is because we no longer have to do a pool.get_all for each session (to force the newly created session's time to update).

You are referring to the following code in login_no_password_common:

  (* At this point, the session is created, but with an incorrect time *)
  (* Force the time to be updated by calling an API function with this session *)
  let rpc = Helpers.make_rpc ~__context in
  ignore (Client.Pool.get_all ~rpc ~session_id) ;

But does anyone understand why this is needed at all? That is, in what way in the time "incorrect" and how does the API call make it correct? The session is created in the DB in this function with last_active equal to Date.now (). The API call will trigger Session_check.check on the coordinator, which validates the session and may indeed update last_active, but it only does that if the current value if the DB field is more than 10 min in the past, which I hope won't ever be the case in this scenario. What am I missing?

I guess issues may occur if the pool member doing the login and the coordinator do not have the same time on their clocks. But if there is the case I bet lots of stuff may go wrong...

I found the original ticket now that caused this code to be added. It was due to a "clock skew" of 9 (!) days between the two hosts, which meant that the session got deleted by the GC as it was "not used" for 24 hours :)

@robhoes I did some testing with removing the pool.get_all when creating a session and it was (obviously) faster than master (by about 3 seconds) but using reusable session was still 1s faster than even that. So it depends if the 9 day clock skew is likely to reappear or not, we could remove the pool.get_all? But I think reusable sessions is still worth it in either case. Removing pool.get_all from session creation would still be worth doing if possible, as it would work in all the occasions where we can't reuse the session i.e. when it's not an internal_xapi_master_to_xapi_slave_login

@snwoods
Copy link
Contributor Author

snwoods commented Sep 26, 2024

Ring3: BVT + BST suiterun 205345 was all green with the latest changes (it looks like moving the lock to the appropriate places fixed the last few errors)

@snwoods
Copy link
Contributor Author

snwoods commented Sep 27, 2024

I've run two pool bootstorm jobs, one for toolstack-next and one for my build tag with these changes, job ids 4116362 and 4116900 respectively. If you search the jaeger kfd endpoint for these job ids, you can see the speedups provided e.g. Xapi_session.login_no_password_common taking 2.79ms on toolstack-next and 0.14ms on my build tag.

@snwoods snwoods force-pushed the private/stevenwo/CP-48676 branch from f732223 to 5779b35 Compare September 27, 2024 14:35
@snwoods
Copy link
Contributor Author

snwoods commented Sep 27, 2024

Rebased on master to fix merge conflict

~rbac_permissions ~db_ref ~client_certificate =
Context.with_tracing ~originator ~__context __FUNCTION__ @@ fun __context ->
let internal_xapi_master_to_xapi_slave_login =
(xapi_internal_originator, true, true, None)
Copy link
Member

@robhoes robhoes Sep 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to just put this tuple into the if-statement below, as on it's own these values are meaningless (what Booleans are these etc.).

if (originator, pool, is_local_superuser, uname) =
   (xapi_internal_originator, true, true, None)

makes this clearer.

Rbac_audit.session_destroy ~__context ~session_id:self ;
(try Db.Session.destroy ~__context ~self with _ -> ()) ;
Rbac.destroy_session_permissions_tbl ~session_id:self
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's log the else-case here, so that it is clear why there is no Session.destroy in the logs as part of a session.logout call.

Prevent this reusable pool session from being destroyed so that it
remains valid. This feature can be toggled with the flag reuse-pool-sessions.

Signed-off-by: Steven Woods <steven.woods@citrix.com>
Add a new flag validate-reusable-pool-session to xapi globs which skips
the reusable pool session validity check if it is false. This saves time
as we are no longer calling pool.get_all for each session.

Signed-off-by: Steven Woods <steven.woods@citrix.com>
@snwoods snwoods force-pushed the private/stevenwo/CP-48676 branch from 5779b35 to c27b1d4 Compare September 30, 2024 15:02
@robhoes robhoes added this pull request to the merge queue Oct 1, 2024
Merged via the queue into xapi-project:master with commit 15721af Oct 1, 2024
15 checks passed
@snwoods snwoods deleted the private/stevenwo/CP-48676 branch October 7, 2024 20:15
github-merge-queue bot pushed a commit that referenced this pull request Mar 14, 2025
This was first introduced in
#5986 but was later reverted
as the mutex it introduced caused a deadlock with the DB mutex. This PR
has now been reworked to use an Atomic variable, removing the need for
the mutex.

This PR allows pool sessions to be reused (if the flag is turned on,
which it is by default). This greatly speeds up communications between
hosts which is important for repetitive calls e.g. one customer was
running get_servertime frequently on all hosts in a pool which was
causing problems due to the length of each call.

When we get this reusable session, we can optionally validate the
session before using it but this increases the duration of the call, so
this is off by default. Excepting a toolstack restart, the reusable
session should always be valid as it is excluded from deletion in
destroy_db_session
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants