-
Notifications
You must be signed in to change notification settings - Fork 292
CA-399260: Keep both new and old certs during the switchover #6223
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CA-399260: Keep both new and old certs during the switchover #6223
Conversation
039c0e0
to
dd3c2de
Compare
scripts/update-ca-bundle.sh
Outdated
|
||
rm -f "$BUNDLE.tmp" | ||
touch "$BUNDLE.tmp" | ||
for TMP_CERT in $TMP_CERTS; do | ||
if cat "$TMP_CERT" >> "$BUNDLE.tmp"; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use if
to ignore cat xxx.new.pem error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is for dealing with error conditions, it needs to be more explicit. When is the then
branch taken here and when not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The if statement checks whether the cat command succeeds. If the cat command successfully appends the contents of the temporary certificate file to the bundle file, the script then appends an empty line to the bundle file using the echo "" >> "$BUNDLE.tmp"
command.
If the cat command fails, it won't stop and go on the next command even if set -e
.
As I know, we can ignore the command error in shell script in two ways
- cat "$TMP_CERT" >> "$BUNDLE.tmp" || true
- if cat "$TMP_CERT" >> "$BUNDLE.tmp"; then
If we don't ignore the cat error, the script will fail and stop, return error to xapi. Then we may get error "message: Subprocess exited with unexpected code 1; stdout = [ ]; stderr = [ cat: 21a92da1-c874-48e2-be70-e49e8ab15af1.new.pem: No such file or directory" in xapi. That is what happend in CA-362358.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the .. || true
idiom better. But what causes the failure in the first place and should we check those conditions? I assume $TMP_CERT
did not exist because >>
will succeed if the resulting file can be written.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's what I am asking for help in the channel. There is a race condition in CA-362358, new.pem is found but it is renamed by other thread when cat. The convention is to add lock, but it seems no one wanted to add lock for this trifle. Then the new.pem was filtered out in #4893. I want to ignore the shell script error here following this thought.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why these certificates are handled in a special way if they are meant to be included in the bundle in any case.
This issue was introduced by #4893
This PR was trying to fix a race condition:
Script to refresh CA bundle fails because a pem is renamed
This is a race condition between the code that refreshes directory of certificates and the generation of new certificates.
What happens is that the latter generates a temporary file ending in "new", the former reads the directory, which contains this file, the latter removes the temporary file and when the former tries to read the file it fails.
The PR removed the inclusion of .new.pem certificates, which we see that is wrong, but we need better coordination between the code that generates new certificates and this refresh of the bundle.
I believe we need to revert the change done to this script isntead of adding more code; and then fix the race between the two actions: we don't want to allow new pool certificates being generated while the pool is already in the process of changing certificates
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Certificates are refreshed every hour during testing but not in production - making this problem more visible
- The call is "xe host-refresh-server-certificate host=$HOST"
- Is this the collision between two instances of the same API call or a collision between unrelated API calls?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why these certificates are handled in a special way if they are meant to be included in the bundle in any case.
The new.pem
is a temp file in switchover, it may be removed between find and cat. The error cat: /etc/stunnel/certs-pool/xxx.new.pem: No such file or directory
can be ignored.
I suggest to close this PR and analyze the situation on the ticket. Xapi implements a lock that ensures only one instance of the bundling script is running. So we need to understand how this race happens. The script could be also easily improved to protect against using names that collide when multiple instances are running. |
Reopen this PR as I think it's a simple way to solve CA-399260 and not reintroduce #4893. |
This is adding a new variable and a new loop of using less parameters in the filtering, why? This should be a partial revert of #4893, but I'm not convinced that the original issue isn't reintroduced. |
@psafont About your concern: In fact, the cat error can be ignored. Because the |
scripts/update-ca-bundle.sh
Outdated
@@ -11,9 +11,15 @@ regen_bundle () { | |||
|
|||
mkdir -p "$CERTS_DIR" | |||
CERTS=$(find "$CERTS_DIR" -not -name '*.new.pem' -name '*.pem') | |||
TMP_CERTS=$(find "$CERTS_DIR" -name '*.new.pem') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better to call thisNEW_CERTS
and otherwise use new
as well when only new certs are handled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It both makes sense. Use NEW_CERTS to indicate they're .new.pem
s and use TMP_CERTS to indicate they're temp files will finally be removed. Anyway, NEW_CERTS shall be more direct. I accept.
I would encourage to make names unique as much as possible. by using |
I think what you say is to add |
In host-refresh-server-certificate, host generates and applies new pool certificate after stunnel restart. There is 5s that other hosts don't trust the new certificate. Then operations related with xapi:pool SNI tls connection will fail. Both the old and new certificates shall be trusted during the switchover to avoid this. At the end of the refresh procedure, remove_stale_cert will rename the new pem to pem and update ca bundle. So, both pem and new pem shall be filled into bundle file in update-ca-bundle.sh. As the new pem is a temp file and has a risk to be deleted by other instance between find and cat. Ignore the cat new pem fail in the script. If the new pem doesn't exist at the moment, just skip it and go on. Signed-off-by: Changlei Li <changlei.li@cloud.com>
dd3c2de
to
3406a64
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am happy to merge this. I still don't understand how the situation arises; at some point new.pem is renamed to pem and this is where the race happens that a new.pem is suddenly missing.
In host-refresh-server-certificate, host generates and applies new pool certificate after stunnel restart. There is 5s that other hosts don't trust the new certificate. Then operations related with xapi:pool SNI tls connection will fail. Both the old and new certificates shall be trusted during the switchover to avoid this. At the end of the refresh procedure, remove_stale_cert will rename the new pem to pem and update ca bundle.