Sunday, February 5, 2023

Issue with retrying connection

Manuel Astudillo
@manast

An issue has been identified in which workers mysteriously stop processing jobs after a connection loss and a subsequent reconnection.

The issue manifests itself once the connection is re-established: although the workers are connected to Redis again, they stop processing jobs.

We have tracked this issue down to the speed at which reconnections are retried. In the default configuration, the reconnection is retried using an exponential backoff strategy, and the first retries are performed within milliseconds. This triggers the following bug in the Redis client library: ioredis #1718

The following code was used in BullMQ to define the default retry strategy for Redis connections:

this.opts = {
  port: 6379,
  host: "127.0.0.1",
  retryStrategy: function (times: number) {
    return Math.min(Math.exp(times), 20000);
  },
  ...opts,
};

As the code above shows, the delay is Math.exp(times) milliseconds, capped at 20 seconds. With times starting at 1, the first retry is performed after about 2.7 ms, the second after about 7.4 ms, and so on, following an exponential backoff. Because these first retries are so fast, the bug in the Redis client library is triggered.
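To see how short the early delays are, the old strategy can be evaluated for the first few attempts. This is a minimal sketch, assuming the client passes an attempt counter that starts at 1; the names oldRetryStrategy and oldDelays are illustrative and not part of BullMQ.

```typescript
// The previous default strategy, copied from the snippet above.
const oldRetryStrategy = (times: number): number =>
  Math.min(Math.exp(times), 20000);

// Delays in milliseconds for attempts 1..8 (assumption: counting starts at 1).
const oldDelays: number[] = [1, 2, 3, 4, 5, 6, 7, 8].map(oldRetryStrategy);

oldDelays.forEach((ms, i) => {
  console.log(`retry ${i + 1}: ${ms.toFixed(1)} ms`);
});
```

Under that assumption, the first six delays all stay below one second, and only the seventh attempt waits longer than that.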

The issue has been fixed in the latest version of BullMQ (v3.6.2), and the default retry strategy is now as follows:

this.opts = {
  port: 6379,
  host: "127.0.0.1",
  retryStrategy: function (times: number) {
    return Math.max(Math.min(Math.exp(times), 20000), 1000);
  },
  ...opts,
};

Every retry now waits at least 1 second. Once the exponential term exceeds that floor, the delay follows the exponential backoff again, up to the 20-second cap.
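The effect of the floor can be checked by tabulating the fixed strategy. This is a rough sketch, again assuming the attempt counter starts at 1; fixedRetryStrategy and fixedDelays are illustrative names, and the values are rounded to whole milliseconds for readability.

```typescript
// The fixed default strategy, copied from the snippet above.
const fixedRetryStrategy = (times: number): number =>
  Math.max(Math.min(Math.exp(times), 20000), 1000);

// Rounded delays in milliseconds for attempts 1..10.
const fixedDelays: number[] = Array.from({ length: 10 }, (_, i) =>
  Math.round(fixedRetryStrategy(i + 1))
);

console.log(fixedDelays);
// [1000, 1000, 1000, 1000, 1000, 1000, 1097, 2981, 8103, 20000]
```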

However, if you are defining your own custom retry strategy, you should make sure that the first retries are not performed within milliseconds and that each retry waits at least one second, at least until the issue in the Redis client library is fixed.
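One possible way to apply this advice to an existing custom strategy is to wrap it so that every numeric delay is raised to the one-second floor. The sketch below is only an illustration: withMinimumDelay, RetryStrategy, MIN_RETRY_DELAY_MS and linearBackoff are hypothetical names, not BullMQ or ioredis API. It assumes the ioredis convention that a non-numeric return value from retryStrategy stops further reconnection attempts.

```typescript
// Hypothetical helper type: mirrors the shape of a retryStrategy callback.
type RetryStrategy = (times: number) => number | void | null;

// Floor recommended above, in milliseconds.
const MIN_RETRY_DELAY_MS = 1000;

// Wraps a strategy so that numeric delays never drop below minDelayMs.
function withMinimumDelay(
  strategy: RetryStrategy,
  minDelayMs: number = MIN_RETRY_DELAY_MS
): RetryStrategy {
  return (times: number) => {
    const delay = strategy(times);
    // A non-numeric result means "stop retrying"; pass it through unchanged.
    if (typeof delay !== "number") {
      return delay;
    }
    return Math.max(delay, minDelayMs);
  };
}

// Example custom strategy: linear backoff that gives up after 10 attempts.
const linearBackoff: RetryStrategy = (times) =>
  times > 10 ? null : times * 200;

const safeStrategy = withMinimumDelay(linearBackoff);
```

The wrapped function can then be passed as retryStrategy in the connection options given to a queue or worker. Because ...opts is spread after the defaults in the snippets above, a user-supplied retryStrategy replaces the built-in one, so the floor has to be enforced in the custom function itself.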