Skip to content

CA-392887: set_tls_config immediately after enabling clustering #5644

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

Vincent-lau
Copy link
Contributor

@Vincent-lau Vincent-lau commented May 21, 2024

Previously xapi calls set_tls_config after the join call, which will restart the remote server of xapi-clusterd. In the meantime, another xapi-clusterd might also be joining, which causes distribute_state to be called while the remote server is restarting.

Now move the set_tls_config immediately after the cluster daemon is enabled to reduce the chance of race. Moreover, join_internal already creates a tls_config and passes it to xapi-clusterd, there is no point resetting the tls config immediately afterwards.

This does not, however, solve the whole problem. For that, we need to make sure that distribute_state and set_tls_config cannot happen at the same time. More generally, any remote calls cannot happen while tls_config is running. Hence we need them to hold the same lock. This will be done in xapi-clusterd.

@Vincent-lau Vincent-lau force-pushed the private/shul2/tls-config branch from 6909565 to 68e4040 Compare May 21, 2024 10:21
@@ -239,15 +239,15 @@ let resync_host ~__context ~host =
(Ref.string_of self) ;
Xapi_clustering.Daemon.enable ~__context ;
maybe_switch_cluster_stack_version ~__context ~self ~cluster_stack ;
let verify = Stunnel_client.get_verify_by_default () in
set_tls_config ~__context ~self ~verify ;
(* Note that join_internal and enable both use the clustering lock *)
Client.Client.Cluster_host.enable ~rpc ~session_id ~self
) ;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an implicit else branch here for the "joined and not enabled" case, which now excludes set_tls_config as well. This is correct?

Copy link
Contributor Author

@Vincent-lau Vincent-lau Jun 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good point. If the cluster daemon is not enabled, then in theory xapi-clusterd service itself should not be running, hence any requests to xapi-clusterd should fail. That means not only should we do the set_tls_config inside enabled case, but the observer initialisation should be moved into the enabled path as well. Need to verify that....

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe there is no need to do anything if xapi-clusterd is not enabled, I have resturctured the code for this

@Vincent-lau Vincent-lau force-pushed the private/shul2/tls-config branch from 68e4040 to 3e8cd41 Compare June 18, 2024 15:45
@Vincent-lau Vincent-lau requested a review from lindig June 18, 2024 15:47
@Vincent-lau
Copy link
Contributor Author

requesting a review from @lindig since he originally wrote this code to make sure this refactor is valid

Copy link
Contributor

@lindig lindig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is valid. When we wrote TLS certification code we did not think carefully enough about the sequence.

@Vincent-lau
Copy link
Contributor Author

xrt tests pending, let's see wait for the tests before merging this

@lindig
Copy link
Contributor

lindig commented Jun 27, 2024

@Vincent-lau how are the tests? Can we merge this?

@Vincent-lau Vincent-lau marked this pull request as draft June 27, 2024 13:43
@Vincent-lau
Copy link
Contributor Author

Vincent-lau commented Jun 27, 2024

Let's draft this for now as it might take some time to triage the xrt issues

@Vincent-lau Vincent-lau force-pushed the private/shul2/tls-config branch from 3e8cd41 to a869550 Compare July 2, 2024 09:48
Previously xapi calls `set_tls_config` regardless of whether a host
has joined or enabled, which will restart the remote server of
xapi-clusterd. In the meantime, another xapi-clusterd might also be
joining, which causes `distribute_state` to be called while the
remote server is restarting.

Now remove the `set_tls_config`, this is because `join_internal` already
creates a tls_config and passes it to xapi-clusterd, but xapi-clusterd
does not store that tls_config in its db, it just starts the http server
with that tls config. Modifying xapi-clusterd to store that config will
be done in a separate PR. Moreover, `cluster_host.enable` also calls `set_tls_config`,
which means there is no need to call `set_tls_config` if the cluster
host is joined but not enabled.

Also move the observer and watcher creation into the not joined case,
since cluster_host.enable already calls them and there is no need to
call them if the host is not enabled.

This does not, however, solve the whole problem. For that, we need to
make sure that `distribute_state` and `set_tls_config` cannot happen at
the same time. More generally, any remote calls cannot happen while
`tls_config` is running. Hence we need them to hold the same lock. This
will be done in xapi-clusterd.

Signed-off-by: Vincent Liu <shuntian.liu2@cloud.com>
@Vincent-lau Vincent-lau force-pushed the private/shul2/tls-config branch from a869550 to 5975562 Compare July 2, 2024 09:50
@Vincent-lau Vincent-lau marked this pull request as ready for review July 2, 2024 09:51
@coveralls
Copy link

Coverage Status

coverage: 44.887%. remained the same
when pulling 5975562 on Vincent-lau:private/shul2/tls-config
into f11657e on xapi-project:master.

@Vincent-lau
Copy link
Contributor Author

Vincent-lau commented Jul 2, 2024

A few small refactoring to fix bugs in xrt, should be ready to go now, accompanied by a xapi-clusterd PR

@Vincent-lau Vincent-lau merged commit 5519cf9 into xapi-project:master Jul 8, 2024
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants