Seems like some metrics would be useful:

- A counter of failures to send messages.
- A gauge of the number of pending messages.

Also, the operator staying healthy while there is an error (Slack is down right now) seems somewhat wrong. From reading the code, send failures should be logged, but I'm not seeing anything in the logs.
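To make the suggestion concrete, here is a minimal sketch of how a failure counter, a pending-messages gauge, error logging, and a health signal could fit together. It is written in plain Python with only the standard library and is not based on the operator's actual code: `Notifier`, `Metrics`, `flush`, `healthy`, and the `unhealthy_after` threshold are all hypothetical names chosen for illustration. A real implementation would presumably export the values through whatever metrics library the operator already uses.

```python
import logging
from collections import deque
from dataclasses import dataclass

log = logging.getLogger("notifier")


@dataclass
class Metrics:
    """Hypothetical in-process stand-in for a real metrics library."""
    send_failures_total: int = 0    # counter: only ever increases
    pending_messages: int = 0       # gauge: current queue length
    consecutive_failures: int = 0   # reset after each successful send


class Notifier:
    """Hypothetical message sender that tracks failures and pending work."""

    def __init__(self, send, unhealthy_after=3):
        self._send = send                      # callable that delivers one message
        self._queue = deque()
        self._unhealthy_after = unhealthy_after
        self.metrics = Metrics()

    def enqueue(self, message):
        self._queue.append(message)
        self.metrics.pending_messages = len(self._queue)

    def flush(self):
        """Try to send queued messages in order; stop at the first failure."""
        while self._queue:
            try:
                self._send(self._queue[0])
            except Exception:
                self.metrics.send_failures_total += 1
                self.metrics.consecutive_failures += 1
                # Log the failure with its traceback so it is visible to operators.
                log.exception("failed to send message; %d still pending",
                              len(self._queue))
                return False
            self._queue.popleft()
            self.metrics.consecutive_failures = 0
            self.metrics.pending_messages = len(self._queue)
        return True

    def healthy(self):
        """Report unhealthy after several consecutive send failures."""
        return self.metrics.consecutive_failures < self._unhealthy_after
```

With a `send` callable that raises while Slack is unreachable, each `flush()` call bumps `send_failures_total`, leaves `pending_messages` unchanged, and logs the exception; once `unhealthy_after` consecutive attempts have failed, `healthy()` returns `False`. Whether an outage of a downstream service should fail the health check at all is a judgment call, so the threshold is only one possible policy.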