Stale Connections Suddenly Increase when there is a Spike on Application #3359


Open
akshayk-ktk opened this issue Apr 28, 2025 · 5 comments

@akshayk-ktk

We are noticing Redis commands taking more than 2 seconds when there is a spike on the application.

I am printing the pool stats in the application and I see that many connections became stale as soon as the spike comes.

What can be done to mitigate this?
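The pool stats mentioned above can be sampled with go-redis's `PoolStats()`; a minimal sketch of such a sampler (the endpoint and the 10-second interval are placeholders, not values from this report):

```go
package main

import (
	"log"
	"time"

	"github.com/redis/go-redis/v9"
)

func main() {
	// Endpoint is a placeholder; point it at the real ElastiCache endpoint.
	client := redis.NewClusterClient(&redis.ClusterOptions{
		Addrs: []string{"<elasticache-endpoint>:6379"},
	})

	// Log the pool counters periodically; StaleConns is the counter that
	// jumps during the spike described above.
	for range time.Tick(10 * time.Second) {
		s := client.PoolStats()
		log.Printf("total=%d idle=%d stale=%d hits=%d misses=%d timeouts=%d",
			s.TotalConns, s.IdleConns, s.StaleConns, s.Hits, s.Misses, s.Timeouts)
	}
}
```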

Expected Behavior

Redis commands should complete within the expected time.

Current Behavior

Redis commands take more than 2 seconds to complete when there is a spike.

Redis Client Configuration

poolSize: 5000
connMaxIdleTime: -1s
dialTimeout: 10s
poolTimeout: 10s
readTimeout: 10s
minIdleConns: 3500
maxIdleConns: 3800
connMaxLifetime: -1s
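Translated into go-redis v9 `ClusterOptions`, the configuration above looks roughly like this (a sketch; the endpoint is a placeholder, and the field names assume v9 naming):

```go
package main

import (
	"time"

	"github.com/redis/go-redis/v9"
)

// newClusterClient mirrors the configuration listed above.
func newClusterClient() *redis.ClusterClient {
	return redis.NewClusterClient(&redis.ClusterOptions{
		Addrs:           []string{"<elasticache-endpoint>:6379"}, // placeholder
		PoolSize:        5000,
		MinIdleConns:    3500,
		MaxIdleConns:    3800,
		ConnMaxIdleTime: -1, // negative: never reap connections for being idle
		ConnMaxLifetime: -1, // negative: never reap connections for age
		DialTimeout:     10 * time.Second,
		PoolTimeout:     10 * time.Second,
		ReadTimeout:     10 * time.Second,
	})
}
```

Note that with `connMaxIdleTime` and `connMaxLifetime` both disabled, connections only leave the pool when they are removed for an error.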

Redis Server

We are using ElastiCache Redis version 7+ in cluster mode, currently with a single shard.

The application runs on EC2 across 4 instances.

Below are the screenshots:

[screenshot]
[screenshot]

@ndyakov
Member

ndyakov commented Apr 29, 2025

@akshayk-ktk based on your configuration, I think the reason for the increased number of stale connections is an error, either during getting/initializing the connection or when putting it back in the pool. Would you be able to check if there is anything reported in the logs? If not, could you check what error value is passed as reason here:

func (p *ConnPool) Remove(_ context.Context, cn *Conn, reason error) {
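One way to surface that value is a local, debug-only log line at the top of `Remove` (a hypothetical patch to go-redis's internal pool, not upstream code; `internal.Logger` is go-redis's internal logger, and the rest of the method body is elided):

```go
// Hypothetical local patch for debugging only.
func (p *ConnPool) Remove(ctx context.Context, cn *Conn, reason error) {
	if reason != nil {
		internal.Logger.Printf(ctx, "pool: removing conn, reason: %v", reason)
	}
	// ... original Remove body unchanged ...
}
```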

@akshayk-ktk
Author

@ndyakov We are using zerolog for logging, with the custom struct below set as the internal logger. We are not seeing any internal error logs here.

import (
	"context"

	"github.com/redis/go-redis/v9"
	"github.com/rs/zerolog"
)

// logger here is our own application package exposing GlobalLogger.

// RedisLogger adapts zerolog to go-redis's internal logging interface.
type RedisLogger struct {
	logger *zerolog.Logger
}

func (rl *RedisLogger) Printf(_ context.Context, format string, v ...interface{}) {
	rl.logger.Info().Bool("go-redis", true).Msgf(format, v...)
}

func init() {
	redis.SetLogger(&RedisLogger{
		logger: logger.GlobalLogger,
	})
}

Let me try adding logs in the Remove method to check what the error values are.

@akshayk-ktk
Author

akshayk-ktk commented Apr 30, 2025

Hey @ndyakov, after adding logs I found some errors. They look like I/O timeout errors.

Does this mean the EC2 network interface is not able to handle the load?

Our EC2 machines are AWS c6g.xlarge.

Elasticache Redis - cache.m6g.large

[screenshot of the logged errors]

@ndyakov
Member

ndyakov commented Apr 30, 2025

@akshayk-ktk I cannot comment on the ElastiCache and AWS setup. You could try multiple shards to see if the database's horizontal scaling improves things. For now it doesn't look like a client issue, but I'll let you decide whether you want to try anything further or we should close this issue.
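On the client side, one generic mitigation for transient I/O timeouts is letting go-redis retry failed commands with exponential backoff (a sketch; the values are assumptions, not a tuned recommendation for this workload):

```go
package main

import (
	"time"

	"github.com/redis/go-redis/v9"
)

// withRetries enables command retries with exponential backoff so that a
// brief network hiccup does not surface as a command failure.
func withRetries(opts *redis.ClusterOptions) *redis.ClusterOptions {
	opts.MaxRetries = 3                           // retry each command up to 3 times
	opts.MinRetryBackoff = 8 * time.Millisecond   // backoff floor
	opts.MaxRetryBackoff = 512 * time.Millisecond // backoff ceiling
	return opts
}
```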

@akshaykhairmode

Hey @ndyakov, we are upgrading the Redis cluster node type and the EC2 instance node types one at a time, so we can monitor whether we still see I/O errors and the stale-connection increase.

I would prefer to keep the issue open with an "under observation" label if possible.

Will report back any observations we get here.
