Commit ad91eed

feat(redis): add resilient redis conn layer
1 parent 2294d52 commit ad91eed

8 files changed

Lines changed: 506 additions & 3 deletions

.github/workflows/push.yml

Lines changed: 9 additions & 0 deletions
@@ -30,6 +30,8 @@ jobs:
         run: |
           echo "running OpenTelemetry Formatter tests"
           vendor/bin/phpunit tests/OpenTelemetry/Formatters/ --log-junit results_opentelemetry_tests.xml
+          echo "running Redis Resilience tests"
+          vendor/bin/phpunit tests/Redis/ --log-junit results_redis_tests.xml
 
       - name: Upload OpenTelemetry Tests Output
         uses: actions/upload-artifact@v4
@@ -38,6 +40,13 @@ jobs:
           path: results_opentelemetry_tests.xml
           retention-days: 5
 
+      - name: Upload Redis Tests Output
+        uses: actions/upload-artifact@v4
+        with:
+          name: results_redis_tests
+          path: results_redis_tests.xml
+          retention-days: 5
+
   integration-tests:
     runs-on: ubuntu-latest
     strategy:
Lines changed: 88 additions & 0 deletions
@@ -0,0 +1,88 @@
+# ADR-001: Resilient Redis Connection with Idempotent Command Retry
+
+**Date:** 2026-02-25
+**Status:** Accepted
+**Authors:** Sebastian Marcet
+
+## Context
+
+Production runs on a managed Valkey (Redis-compatible) instance on DigitalOcean over TLS. Transient connection failures (network hiccups, TLS renegotiation, server maintenance) cause `Predis\Connection\ConnectionException` errors that propagate up through the application and fail business-critical operations.
+
+The immediate trigger was audit log job dispatch (`EmitAuditLogJob`) failing during Doctrine's `onFlush` event inside `MemberService::synchronizeGroups`. The Redis write failure bubbled up through Doctrine's `UnitOfWork::commit()` and caused the entire member synchronization transaction to fail. Audit logging is non-critical and should never break business operations.
+
+Beyond audit logging, any Redis operation (cache reads/writes, session access, rate limiting) is vulnerable to the same transient failures.
+
+## Decision
+
+### 1. Resilient Redis Connection Layer
+
+Introduce a custom Redis driver (`predis_resilient`) that automatically retries **idempotent** commands on `ConnectionException`, with disconnect/reconnect between attempts and exponential backoff.
+
+**Architecture:**
+
+- `ResilientPredisConnection` extends Laravel's `PredisConnection` and overrides `command()` to catch `ConnectionException` and retry only idempotent commands
+- `ResilientPredisConnector` extends `PredisConnector`, calls `parent::connect()` to reuse all config/TLS/option handling, then wraps the `Predis\Client` in `ResilientPredisConnection`
+- `RedisResilienceServiceProvider` registers the driver via `RedisManager::extend('predis_resilient', ...)`
+- Activated by setting `REDIS_CLIENT=predis_resilient` in `.env`; there is no behavior change without this flag
+
+**Retry behavior:**
+
+| Command type | On ConnectionException | Rationale |
+|---|---|---|
+| Idempotent (GET, SET, DEL, HSET, EXPIRE, ...) | Disconnect, reconnect, retry up to N times with exponential backoff | Executing twice leaves the server in the same final state |
+| Non-idempotent (INCR, LPUSH, RPUSH, EVAL, ...) | Rethrow immediately | Command may have executed before the read side failed; retrying could duplicate data |
+
+**Configuration** (per-connection in `config/database.php`):
+
+| Parameter | Env var | Default | Description |
+|---|---|---|---|
+| `retry_limit` | `REDIS_RETRY_LIMIT` | 2 | Max retry attempts (0 disables retries) |
+| `retry_delay` | `REDIS_RETRY_DELAY` | 50 | Base delay in ms, doubled on each retry (50, 100, 200, ...) |
+
+### 2. Job Dispatch Fallback for Audit Logging
+
+Separately, `AuditLogOtlpStrategy` was updated to dispatch `EmitAuditLogJob` via `JobDispatcher::withSyncFallback()` instead of `EmitAuditLogJob::dispatch()`. This catches Redis failures at the job dispatch level and runs the audit log emission synchronously as a fallback, so a Redis failure during audit logging no longer fails the parent business transaction.
+
+## Consequences
+
+### Positive
+
+- Transient Redis failures on idempotent commands (cache GET/SET, session reads, key expiry) are automatically recovered without application-level error handling
+- Non-idempotent commands (queue pushes, counters, list operations) are never retried, preventing data duplication
+- Opt-in activation via env var, so existing deployments are unaffected until the flag is set
+- Per-connection retry settings in `config/database.php` allow tuning (e.g., more retries for cache, fewer for workers); as shipped, every connection reads the same `REDIS_RETRY_LIMIT` and `REDIS_RETRY_DELAY` env vars
+- Redis failures during audit log dispatch can no longer crash business transactions
+
+### Negative
+
+- Retry adds latency on failure: with the defaults (`retry_limit` 2, `retry_delay` 50) the backoff sleeps total 150 ms (50 + 100), or 350 ms (50 + 100 + 200) with `retry_limit` 3, not counting reconnect time
+- `usleep()` in the retry loop blocks the PHP process during backoff; this is acceptable for 2-3 retries but would need async handling at higher retry counts
+- The idempotent command list is manually maintained and must be updated if new Redis commands are used
+
+### Neutral
+
+- Queue push operations (`EVAL` with Lua scripts) are NOT retried by the resilient connection; they remain protected by `JobDispatcher::withSyncFallback` / `withDbFallback` at the application layer
+- The `predis_resilient` driver shares the same Predis `Client` configuration as `predis`: no TLS, auth, or timeout differences
+
+## Alternatives Considered
+
+1. **Predis built-in retry.** Predis does not offer connection retry on command failure (only on initial connect via `aggregate` connections). Rejected.
+
+2. **Retry all commands.** Would risk duplicating non-idempotent operations (double LPUSH, double INCR) when the failure occurs after the command was sent but before the response was read. Rejected.
+
+3. **Catch-and-ignore at every call site.** Would require wrapping every Redis call in try/catch throughout the codebase. Not maintainable. Rejected.
+
+4. **Switch to phpredis extension.** phpredis has built-in retry support, but switching would require changing the entire Redis integration layer and testing all connection configurations. Disproportionate effort for the problem at hand. Not pursued.
+
+## Files
+
+| File | Purpose |
+|---|---|
+| `app/Redis/ResilientPredisConnection.php` | Connection with retry logic |
+| `app/Redis/ResilientPredisConnector.php` | Connector that swaps connection class |
+| `app/Providers/RedisResilienceServiceProvider.php` | Registers `predis_resilient` driver |
+| `config/database.php` | Added `retry_limit`, `retry_delay` to Redis connections |
+| `config/app.php` | Registered service provider |
+| `app/Audit/AuditLogOtlpStrategy.php` | Changed to `JobDispatcher::withSyncFallback` |
+| `tests/Redis/ResilientPredisConnectionTest.php` | 10 unit tests |
+| `.github/workflows/push.yml` | Added Redis tests to CI |
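
The backoff figures in the ADR can be checked with a few lines of plain PHP. This is a minimal sketch, not code from the commit: `backoffSchedule` is a hypothetical helper that applies the same formula as the retry loop (`retry_delay * 2^(attempt - 1)`) and returns the delay slept before each retry.

```php
<?php
declare(strict_types=1);

/**
 * Delays (in ms) slept before each retry: base, 2*base, 4*base, ...
 * Hypothetical helper; it only mirrors the formula used by the retry loop.
 *
 * @return int[]
 */
function backoffSchedule(int $retryLimit, int $retryDelayMs): array
{
    $delays = [];
    for ($attempt = 1; $attempt <= $retryLimit; $attempt++) {
        $delays[] = $retryDelayMs * (2 ** ($attempt - 1));
    }
    return $delays;
}

// Defaults from the configuration table: retry_limit = 2, retry_delay = 50.
$defaults = backoffSchedule(2, 50);
printf("defaults: %s ms, total %d ms\n", implode(' + ', $defaults), array_sum($defaults));

// A third retry adds the 200 ms step.
$three = backoffSchedule(3, 50);
printf("limit 3:  %s ms, total %d ms\n", implode(' + ', $three), array_sum($three));
```

With `retry_limit` set to 2 the loop sleeps twice, so the 200 ms step only appears once `retry_limit` is raised to 3. The totals cover sleep time only; time spent reconnecting comes on top.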
app/Providers/RedisResilienceServiceProvider.php

Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
+<?php namespace App\Providers;
+/**
+ * Copyright 2026 OpenStack Foundation
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ **/
+
+use App\Redis\ResilientPredisConnector;
+use Illuminate\Contracts\Foundation\Application;
+use Illuminate\Redis\RedisManager;
+use Illuminate\Support\ServiceProvider;
+
+/**
+ * Class RedisResilienceServiceProvider
+ *
+ * Registers the "predis_resilient" Redis driver which adds automatic
+ * retry-with-reconnect for idempotent commands on transient failures.
+ *
+ * To activate, set REDIS_CLIENT=predis_resilient in your .env.
+ */
+class RedisResilienceServiceProvider extends ServiceProvider
+{
+    public function boot(): void
+    {
+        $this->app->afterResolving('redis', function (RedisManager $redis, Application $app) {
+            $redis->extend('predis_resilient', function () {
+                return new ResilientPredisConnector();
+            });
+        });
+    }
+}
app/Redis/ResilientPredisConnection.php

Lines changed: 125 additions & 0 deletions
@@ -0,0 +1,125 @@
+<?php namespace App\Redis;
+/**
+ * Copyright 2026 OpenStack Foundation
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ **/
+
+use Illuminate\Redis\Connections\PredisConnection;
+use Illuminate\Support\Facades\Log;
+use Predis\Connection\ConnectionException;
+
+/**
+ * Class ResilientPredisConnection
+ *
+ * Extends the default PredisConnection to add automatic retry with
+ * reconnect for idempotent Redis commands on transient connection failures.
+ *
+ * Non-idempotent commands (INCR, LPUSH, RPUSH, EVAL, etc.) are never
+ * retried because the command may have already been executed on the
+ * server before the read-side of the connection failed.
+ */
+class ResilientPredisConnection extends PredisConnection
+{
+    private int $retryLimit;
+
+    private int $retryDelay;
+
+    /**
+     * Commands that are safe to retry after a connection failure.
+     * A command is safe when executing it twice produces the same result.
+     */
+    private const IDEMPOTENT_COMMANDS = [
+        // reads
+        'GET', 'MGET', 'HGET', 'HGETALL', 'HMGET', 'HEXISTS', 'HLEN', 'HKEYS', 'HVALS',
+        'LLEN', 'LRANGE', 'LINDEX',
+        'SCARD', 'SMEMBERS', 'SISMEMBER',
+        'ZCARD', 'ZCOUNT', 'ZRANGE', 'ZRANGEBYSCORE', 'ZREVRANGEBYSCORE', 'ZSCORE', 'ZRANK', 'ZREVRANK',
+        'EXISTS', 'TYPE', 'TTL', 'PTTL', 'KEYS', 'SCAN', 'HSCAN', 'SSCAN', 'ZSCAN',
+        'INFO', 'PING', 'DBSIZE', 'TIME', 'STRLEN', 'GETRANGE',
+        // idempotent writes
+        'SET', 'SETEX', 'PSETEX', 'MSET', 'SETNX', 'GETSET',
+        'HSET', 'HMSET', 'HSETNX',
+        'DEL', 'HDEL', 'UNLINK',
+        'EXPIRE', 'EXPIREAT', 'PEXPIRE', 'PEXPIREAT', 'PERSIST',
+        'SADD', 'SREM',
+        'ZADD', 'ZREM', 'ZREMRANGEBYSCORE', 'ZREMRANGEBYRANK',
+    ];
+
+    /**
+     * @param \Predis\Client $client
+     * @param int $retryLimit Max number of retries (0 = no retries, behaves like stock PredisConnection)
+     * @param int $retryDelay Base delay in milliseconds between retries (doubled each attempt)
+     */
+    public function __construct($client, int $retryLimit = 2, int $retryDelay = 50)
+    {
+        parent::__construct($client);
+        $this->retryLimit = $retryLimit;
+        $this->retryDelay = $retryDelay;
+    }
+
+    /**
+     * @inheritdoc
+     */
+    public function command($method, array $parameters = [])
+    {
+        try {
+            return parent::command($method, $parameters);
+        } catch (ConnectionException $e) {
+            if (!$this->isIdempotent($method)) {
+                throw $e;
+            }
+            return $this->retryCommand($method, $parameters, $e);
+        }
+    }
+
+    /**
+     * Retry an idempotent command after reconnecting.
+     */
+    private function retryCommand(string $method, array $parameters, ConnectionException $previous): mixed
+    {
+        $lastException = $previous;
+
+        for ($attempt = 1; $attempt <= $this->retryLimit; $attempt++) {
+            $delay = $this->retryDelay * (2 ** ($attempt - 1)); // exponential back-off
+
+            Log::warning('ResilientPredisConnection: retrying command', [
+                'command' => strtoupper($method),
+                'attempt' => $attempt,
+                'max_retries' => $this->retryLimit,
+                'delay_ms' => $delay,
+                'error' => $lastException->getMessage(), // most recent failure, not only the first one
+            ]);
+
+            usleep($delay * 1000);
+
+            try {
+                $this->client->disconnect();
+
+                return parent::command($method, $parameters);
+            } catch (ConnectionException $e) {
+                $lastException = $e;
+            }
+        }
+
+        Log::error('ResilientPredisConnection: all retries exhausted', [
+            'command' => strtoupper($method),
+            'retries' => $this->retryLimit,
+            'error' => $lastException->getMessage(),
+        ]);
+
+        throw $lastException;
+    }
+
+    private function isIdempotent(string $method): bool
+    {
+        return in_array(strtoupper($method), self::IDEMPOTENT_COMMANDS, true);
+    }
+}
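
One caveat about the `IDEMPOTENT_COMMANDS` list: retrying `SETNX` or `GETSET` leaves the same stored data, but the reply the caller sees on the retry can differ from the reply the first attempt would have returned. The sketch below illustrates this with an in-memory stand-in. `FakeRedis` is a hypothetical class written for this example, not Predis and not part of the commit. It assumes the first attempt reached the server and only the reply was lost.

```php
<?php
declare(strict_types=1);

/** Tiny in-memory stand-in for a Redis server (hypothetical, illustration only). */
final class FakeRedis
{
    /** @var array<string, string> */
    private array $data = [];

    /** SETNX semantics: 1 if the key was set, 0 if it already existed. */
    public function setnx(string $key, string $value): int
    {
        if (array_key_exists($key, $this->data)) {
            return 0;
        }
        $this->data[$key] = $value;
        return 1;
    }

    /** GETSET semantics: store the new value, return the previous one (null if none). */
    public function getset(string $key, string $value): ?string
    {
        $old = $this->data[$key] ?? null;
        $this->data[$key] = $value;
        return $old;
    }
}

$redis = new FakeRedis();

// First attempt: the server applies SETNX, but assume the reply never reaches the client.
$lostReply = $redis->setnx('lock:job', 'worker-1');
// Retry after reconnect: the key already exists, so the client is told 0.
$retryReply = $redis->setnx('lock:job', 'worker-1');

// Same pattern for GETSET: the retry returns the value written by the first attempt.
$redis->getset('token', 'new');
$retryOld = $redis->getset('token', 'new');

var_dump($lostReply, $retryReply, $retryOld);
```

A caller that uses `SETNX` as a lock would read the retried reply as "lock not acquired" even though its own first attempt took the lock. Whether that matters depends on how the application uses these commands; callers that inspect the reply may want such commands treated as non-retryable.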
app/Redis/ResilientPredisConnector.php

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
+<?php namespace App\Redis;
+/**
+ * Copyright 2026 OpenStack Foundation
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ **/
+
+use Illuminate\Redis\Connectors\PredisConnector;
+
+/**
+ * Class ResilientPredisConnector
+ *
+ * Wraps the stock PredisConnector, reusing all its config/TLS/option
+ * handling, and swaps the returned connection to ResilientPredisConnection.
+ */
+class ResilientPredisConnector extends PredisConnector
+{
+    /**
+     * @inheritdoc
+     */
+    public function connect(array $config, array $options)
+    {
+        $connection = parent::connect($config, $options);
+
+        $retryLimit = (int) ($config['retry_limit'] ?? 2);
+        $retryDelay = (int) ($config['retry_delay'] ?? 50);
+
+        return new ResilientPredisConnection(
+            $connection->client(),
+            $retryLimit,
+            $retryDelay
+        );
+    }
+}
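
The retry flow implemented by the connection class can be exercised without Laravel, Predis, or a Redis server. The following is a simplified sketch under stated assumptions, not the commit's code. `TransientConnectionError` stands in for `Predis\Connection\ConnectionException`, and `withRetry` is a hypothetical function that takes the operation, the reconnect step, and the sleep step as callables so a test can record delays instead of sleeping.

```php
<?php
declare(strict_types=1);

/** Stand-in for Predis\Connection\ConnectionException (hypothetical name). */
final class TransientConnectionError extends RuntimeException {}

/**
 * Run $operation; on TransientConnectionError, retry up to $retryLimit times.
 * Before each retry, $sleep receives the delay in ms and $reconnect is called.
 */
function withRetry(
    callable $operation,
    callable $reconnect,
    callable $sleep,
    int $retryLimit = 2,
    int $retryDelayMs = 50
): mixed {
    try {
        return $operation();
    } catch (TransientConnectionError $e) {
        $last = $e;
    }

    for ($attempt = 1; $attempt <= $retryLimit; $attempt++) {
        $sleep($retryDelayMs * (2 ** ($attempt - 1))); // 50, 100, 200, ...
        try {
            $reconnect();
            return $operation();
        } catch (TransientConnectionError $e) {
            $last = $e; // keep the most recent failure
        }
    }

    throw $last;
}

// Usage: the operation fails twice, then succeeds on the third call.
$calls = 0;
$reconnects = 0;
$delays = [];

$result = withRetry(
    function () use (&$calls) {
        if (++$calls < 3) {
            throw new TransientConnectionError('connection reset');
        }
        return 'PONG';
    },
    function () use (&$reconnects) { $reconnects++; },
    function (int $ms) use (&$delays) { $delays[] = $ms; }
);

var_dump($result, $calls, $reconnects, $delays);
```

Passing the sleep step in as a callable is a design choice made for this sketch. The class in the commit calls `usleep()` directly, which is simpler but makes the backoff harder to observe in a unit test.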

config/app.php

Lines changed: 1 addition & 0 deletions
@@ -156,6 +156,7 @@
         // App\Providers\BroadcastServiceProvider::class,
         App\Providers\EventServiceProvider::class,
         App\Providers\RouteServiceProvider::class,
+        App\Providers\RedisResilienceServiceProvider::class,
         // Services
         App\Repositories\RepositoriesProvider::class,
         App\Services\FileSystem\Swift\SwiftServiceProvider::class,

config/database.php

Lines changed: 9 additions & 3 deletions
@@ -172,7 +172,6 @@
     */
 
     'redis' => [
-
         'client' => env('REDIS_CLIENT', 'predis'),
         /*
          * @see https://github.com/predis/predis/wiki/Connection-Parameters
@@ -186,8 +185,9 @@
             'scheme' => env('REDIS_SCHEME', 'tcp'),
             'read_write_timeout' => env('REDIS_READ_WRITE_TIMEOUT', -1),
             'timeout' => env('REDIS_TIMEOUT', 30),
+            'retry_limit' => (int) env('REDIS_RETRY_LIMIT', 2),
+            'retry_delay' => (int) env('REDIS_RETRY_DELAY', 50),
         ],
-
         'cache' => [
             'host' => env('REDIS_HOST'),
             'port' => env('REDIS_PORT'),
@@ -196,8 +196,9 @@
             'scheme' => env('REDIS_SCHEME', 'tcp'),
             'read_write_timeout' => env('REDIS_READ_WRITE_TIMEOUT', -1),
             'timeout' => env('REDIS_TIMEOUT', 30),
+            'retry_limit' => (int) env('REDIS_RETRY_LIMIT', 2),
+            'retry_delay' => (int) env('REDIS_RETRY_DELAY', 50),
         ],
-
         'session' => [
             'host' => env('REDIS_HOST'),
             'port' => env('REDIS_PORT'),
@@ -206,6 +207,9 @@
             'scheme' => env('REDIS_SCHEME', 'tcp'),
             'read_write_timeout' => env('REDIS_READ_WRITE_TIMEOUT', -1),
             'timeout' => env('REDIS_TIMEOUT', 30),
+
+            'retry_limit' => (int) env('REDIS_RETRY_LIMIT', 2),
+            'retry_delay' => (int) env('REDIS_RETRY_DELAY', 50),
         ],
 
         'worker' => [
@@ -216,6 +220,8 @@
             'scheme' => env('REDIS_SCHEME', 'tcp'),
             'read_write_timeout' => env('REDIS_READ_WRITE_TIMEOUT', -1),
             'timeout' => env('REDIS_TIMEOUT', 30),
+            'retry_limit' => (int) env('REDIS_RETRY_LIMIT', 2),
+            'retry_delay' => (int) env('REDIS_RETRY_DELAY', 50),
         ],
     ],
     'allow_disabled_pk' => env('DB_ALLOW_DISABLED_PK', false),
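
All four connections in this config read the same `REDIS_RETRY_LIMIT` and `REDIS_RETRY_DELAY` variables, so giving one connection different values means editing the config itself. One possible approach, sketched here only as an illustration, is a per-connection variable that falls back to the shared one. The names `REDIS_CACHE_RETRY_LIMIT` and `REDIS_WORKER_RETRY_LIMIT` and the `envInt` helper are hypothetical and not part of the commit. Plain `getenv()` is used so the snippet runs outside Laravel; inside `config/database.php` the framework's `env()` helper would play that role.

```php
<?php
declare(strict_types=1);

/**
 * Return the integer value of the first env var in $names that is set
 * and non-empty, or $default if none is. Hypothetical helper.
 *
 * @param string[] $names
 */
function envInt(array $names, int $default): int
{
    foreach ($names as $name) {
        $value = getenv($name);
        if ($value !== false && $value !== '') {
            return (int) $value;
        }
    }
    return $default;
}

// Simulated environment: a shared limit plus an override for the cache connection only.
putenv('REDIS_RETRY_LIMIT=2');
putenv('REDIS_CACHE_RETRY_LIMIT=3');
putenv('REDIS_WORKER_RETRY_LIMIT'); // make sure the worker override is unset

$retryLimits = [
    'cache'  => envInt(['REDIS_CACHE_RETRY_LIMIT', 'REDIS_RETRY_LIMIT'], 2),
    'worker' => envInt(['REDIS_WORKER_RETRY_LIMIT', 'REDIS_RETRY_LIMIT'], 2),
];

var_dump($retryLimits);
```

Here the cache connection picks up its own override while the worker connection falls back to the shared value.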
