Skip to content

CA-399260: Keep both new and old certs during the switchover #6223

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jan 22, 2025

Conversation

changlei-li
Copy link
Contributor

In host-refresh-server-certificate, host generates and applies new pool certificate after stunnel restart. There is 5s that other hosts don't trust the new certificate. Then operations related with xapi:pool SNI tls connection will fail. Both the old and new certificates shall be trusted during the switchover to avoid this. At the end of the refresh procedure, remove_stale_cert will rename the new pem to pem and update ca bundle.

@changlei-li changlei-li marked this pull request as draft January 14, 2025 08:00
@changlei-li changlei-li force-pushed the private/changleli/CA-399260 branch from 039c0e0 to dd3c2de Compare January 14, 2025 08:08

rm -f "$BUNDLE.tmp"
touch "$BUNDLE.tmp"
for TMP_CERT in $TMP_CERTS; do
if cat "$TMP_CERT" >> "$BUNDLE.tmp"; then
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use if to ignore cat xxx.new.pem error

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is for dealing with error conditions, it needs to be more explicit. When is the then branch taken here and when not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The if statement checks whether the cat command succeeds. If the cat command successfully appends the contents of the temporary certificate file to the bundle file, the script then appends an empty line to the bundle file using the echo "" >> "$BUNDLE.tmp" command.
If the cat command fails, it won't stop and go on the next command even if set -e.

As I know, we can ignore the command error in shell script in two ways

  1. cat "$TMP_CERT" >> "$BUNDLE.tmp" || true
  2. if cat "$TMP_CERT" >> "$BUNDLE.tmp"; then

If we don't ignore the cat error, the script will fail and stop, return error to xapi. Then we may get error "message: Subprocess exited with unexpected code 1; stdout = [ ]; stderr = [ cat: 21a92da1-c874-48e2-be70-e49e8ab15af1.new.pem: No such file or directory" in xapi. That is what happend in CA-362358.

Copy link
Contributor

@lindig lindig Jan 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the .. || true idiom better. But what causes the failure in the first place and should we check those conditions? I assume $TMP_CERT did not exist because >> will succeed if the resulting file can be written.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's what I am asking for help in the channel. There is a race condition in CA-362358, new.pem is found but it is renamed by other thread when cat. The convention is to add lock, but it seems no one wanted to add lock for this trifle. Then the new.pem was filtered out in #4893. I want to ignore the shell script error here following this thought.

Copy link
Member

@psafont psafont Jan 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why these certificates are handled in a special way if they are meant to be included in the bundle in any case.

This issue was introduced by #4893

This PR was trying to fix a race condition:

Script to refresh CA bundle fails because a pem is renamed
This is a race condition between the code that refreshes directory of certificates and the generation of new certificates.
What happens is that the latter generates a temporary file ending in "new", the former reads the directory, which contains this file, the latter removes the temporary file and when the former tries to read the file it fails.

The PR removed the inclusion of .new.pem certificates, which we see that is wrong, but we need better coordination between the code that generates new certificates and this refresh of the bundle.

I believe we need to revert the change done to this script isntead of adding more code; and then fix the race between the two actions: we don't want to allow new pool certificates being generated while the pool is already in the process of changing certificates

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Certificates are refreshed every hour during testing but not in production - making this problem more visible
  • The call is "xe host-refresh-server-certificate host=$HOST"
  • Is this the collision between two instances of the same API call or a collision between unrelated API calls?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why these certificates are handled in a special way if they are meant to be included in the bundle in any case.

The new.pem is a temp file in switchover, it may be removed between find and cat. The error cat: /etc/stunnel/certs-pool/xxx.new.pem: No such file or directory can be ignored.

@lindig
Copy link
Contributor

lindig commented Jan 17, 2025

I suggest to close this PR and analyze the situation on the ticket. Xapi implements a lock that ensures only one instance of the bundling script is running. So we need to understand how this race happens. The script could be also easily improved to protect against using names that collide when multiple instances are running.

@changlei-li changlei-li reopened this Jan 20, 2025
@changlei-li
Copy link
Contributor Author

Reopen this PR as I think it's a simple way to solve CA-399260 and not reintroduce #4893.

@changlei-li changlei-li marked this pull request as ready for review January 20, 2025 08:20
@psafont
Copy link
Member

psafont commented Jan 20, 2025

This is adding a new variable and a new loop of using less parameters in the filtering, why?

This should be a partial revert of #4893, but I'm not convinced that the original issue isn't reintroduced.

@changlei-li
Copy link
Contributor Author

This is adding a new variable and a new loop of using less parameters in the filtering, why?

This should be a partial revert of #4893, but I'm not convinced that the original issue isn't reintroduced.

@psafont About your concern:
What failed in #4893? The pool-join failed because update-ca-bundle.sh failed. Detailly, while the script cat .new.pem, the file was removed. Although the moment when find, this file existed. We know the file exited in the certificate refresh switchover and will finally be removed.
If cat .new.pem fails, but the script ignores it and goes on. The pool-join will not fail. In this way, #4893 is not reintroduced.
The if cat "$TMP_CERT" >> "$BUNDLE.tmp"; then shell command is:
If the cat command successfully appends the contents of the temporary certificate file to the bundle file, the script then appends an empty line to the bundle file using the echo "" >> "$BUNDLE.tmp" command.
If the cat command fails, it won't stop and go on the next command even if set -e.
The script error is ignored and won't return to xapi.

In fact, the cat error can be ignored. Because the new.pem is removed at this moment indeed. It needn't be filled in the bundle file.

@@ -11,9 +11,15 @@ regen_bundle () {

mkdir -p "$CERTS_DIR"
CERTS=$(find "$CERTS_DIR" -not -name '*.new.pem' -name '*.pem')
TMP_CERTS=$(find "$CERTS_DIR" -name '*.new.pem')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to call thisNEW_CERTS and otherwise use new as well when only new certs are handled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It both makes sense. Use NEW_CERTS to indicate they're .new.pems and use TMP_CERTS to indicate they're temp files will finally be removed. Anyway, NEW_CERTS shall be more direct. I accept.

@lindig
Copy link
Contributor

lindig commented Jan 21, 2025

I would encourage to make names unique as much as possible. by using $$ in names (this is the PID of the shell) to avoid collisions if ever two instances of this script run in parallel (which should not happen but it also not detected). The script could also try to detect this by touching a file and removing it at the end. This would need a trap to make sure it is always removed.

@changlei-li
Copy link
Contributor Author

I would encourage to make names unique as much as possible. by using $$ in names (this is the PID of the shell) to avoid collisions if ever two instances of this script run in parallel (which should not happen but it also not detected). The script could also try to detect this by touching a file and removing it at the end. This would need a trap to make sure it is always removed.

I think what you say is to add $$ in the temp bundle file, like $BUNDLE.$$.tmp. The script is wrapped by Helpers.update_ca_bundle with a mutex to ensure only one instance. Xapi uses the script via this wrap function all the places. So, it is not necessary to do it now.

In host-refresh-server-certificate, host generates and applies
new pool certificate after stunnel restart. There is 5s that
other hosts don't trust the new certificate. Then operations
related with xapi:pool SNI tls connection will fail. Both
the old and new certificates shall be trusted during the
switchover to avoid this. At the end of the refresh procedure,
remove_stale_cert will rename the new pem to pem and update
ca bundle.
So, both pem and new pem shall be filled into bundle file in
update-ca-bundle.sh. As the new pem is a temp file and has a
risk to be deleted by other instance between find and cat.
Ignore the cat new pem fail in the script. If the new pem
doesn't exist at the moment, just skip it and go on.

Signed-off-by: Changlei Li <changlei.li@cloud.com>
@changlei-li changlei-li force-pushed the private/changleli/CA-399260 branch from dd3c2de to 3406a64 Compare January 22, 2025 03:04
Copy link
Contributor

@lindig lindig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am happy to merge this. I still don't understand how the situation arises; at some point new.pem is renamed to pem and this is where the race happens that a new.pem is suddenly missing.

@lindig lindig added this pull request to the merge queue Jan 22, 2025
Merged via the queue into xapi-project:master with commit 1c9b62a Jan 22, 2025
15 checks passed
@changlei-li changlei-li deleted the private/changleli/CA-399260 branch January 23, 2025 01:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants