Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
305 changes: 305 additions & 0 deletions connect_attempts_report.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,305 @@
# Report of connect attempts bugs

- affected versions: 10.1.x and smaller

Relevant configs:

- `proxy.config.http.connect_attempts_max_retries`
- `proxy.config.http.connect_attempts_max_retries_down_server`
- `proxy.config.http.connect_attempts_rr_retries`
- `proxy.config.http.down_server.cache_time` (the `fail_window`)

PRs:

- https://github.com/apache/trafficserver/pull/12846
- https://github.com/apache/trafficserver/pull/12880
- https://github.com/apache/trafficserver/pull/13092
- https://github.com/apache/trafficserver/pull/13102

## Net effect between 10.1.x and later

On 10.1.x and earlier, the three `connect_attempts_*` configs behaved
erratically:

- `connect_attempts_max_retries_down_server` was effectively unreachable
because the `is_down()` check was inverted (#12846).
- Single-address origins ignored down-state entirely — `select_best_http`
unconditionally returned the first (and only) entry (#12880).
- The "mark host DOWN" threshold was gated by `connect_attempts_rr_retries`
instead of `connect_attempts_max_retries` (#13102).
- The fail-window clock used `client_request_time` instead of the actual
failure time, so the window could start in the past or be pre-expired
(#13102).
- After `select_next_rr()`, `dst_addr` was not updated, so RR "switches"
silently kept hitting the same target (#13102).
- `mark_host_failure()` ran at end-of-txn against the post-switch
`dns_info.active`, so failures on A could be charged to B (#13102).
- RR exhaustion terminated retries early even when the current target
still had per-host budget (#13102).

The combined effect of PRs #12846, #12880, #13092, and #13102 is a
coherent model in which each config has a well-defined role tied to an
explicit `UP / DOWN / SUSPECT` state for the active target.

### Worked examples

All examples use the ATS defaults:

| Config | Default |
|----------------------------------------------|---------|
| `connect_attempts_max_retries` | 3 |
| `connect_attempts_max_retries_down_server` | 1 |
| `connect_attempts_rr_retries` | 3 |
| `down_server.cache_time` (`fail_window`) | 60s |

#### Example 1 — Single-address origin goes offline

Client sends repeated requests to an origin whose only A record points to
a down IP. How many connect attempts does ATS make before the IP is
marked DOWN in HostDB?

**10.1.x behavior — IP marked DOWN after 12 connect attempts (3 failed
transactions).**

Why: `mark_host_failure()` is only called from
`do_hostdb_update_if_necessary()`, which is only reached at the end of a
transaction (via `handle_server_connection_not_open`). So `fail_count`
increments **once per failed transaction**, not once per attempt. The
threshold is `connect_attempts_rr_retries` (default `3`) — the wrong
config, but that's what 10.1.x uses. Per-transaction there are `1 +
connect_attempts_max_retries = 4` attempts.

| Attempt | Txn | `fail_count` after | marked DOWN? |
|---------|-----|--------------------|--------------|
| 1 | 1 | 0 | no |
| 2 | 1 | 0 | no |
| 3 | 1 | 0 | no |
| 4 | 1 | 1 (end of txn) | no |
| 5 | 2 | 1 | no |
| 6 | 2 | 1 | no |
| 7 | 2 | 1 | no |
| 8 | 2 | 2 (end of txn) | no |
| 9 | 3 | 2 | no |
| 10 | 3 | 2 | no |
| 11 | 3 | 2 | no |
| **12** | 3 | 3 ≥ 3 | **yes** |

And because `select_best_http()` returns `info[0]` unconditionally for
single-address records (#12880), every one of those 12 attempts actually
hits the network. Even after the IP is marked DOWN, subsequent requests
keep trying it because the single-address code path ignores down-state.

**Later behavior — IP marked DOWN after 4 connect attempts (1 failed
transaction).**

Why: `do_hostdb_update_if_necessary()` is now called after **each**
failure in `handle_response_from_server()`, so `fail_count` increments
**once per attempt**. The threshold is `max_retries + 1 = 4` (for an UP
target). Per-transaction there are still `1 + max_retries = 4` attempts,
so the threshold is reached on the last attempt of the first failed
transaction.

| Attempt | Txn | `fail_count` after | marked DOWN? |
|---------|-----|--------------------|--------------|
| 1 | 1 | 1 | no |
| 2 | 1 | 2 | no |
| 3 | 1 | 3 | no |
| **4** | 1 | 4 ≥ 4 | **yes** |

Subsequent requests within `fail_window` (60s):
`select_best_http()` now honors down-state for single-address records,
so HostDB lookup returns `nullptr` and the SM reports `OriginDown`
immediately — **zero connect attempts**.

After `fail_window` elapses, the entry is SUSPECT. The next request gets
`connect_attempts_max_retries_down_server + 1 = 2` attempts (1 probe +
1 retry). Success → UP; failure → back to DOWN for another 60s.

#### Example 2 — Round-robin with one bad backend (A fails, B/C healthy)

Defaults apply: `rr_retries = 3`, `max_retries = 3`. A is unresponsive,
B and C are fine. Initial active = A.

**10.1.x behavior — client gets 502, and the wrong host is blamed.**

Two compounding bugs: (a) after `select_next_rr()` is called, neither
`dns_info.addr` nor `server_info.dst_addr` is updated, so the next
connect still goes to A's address. (b) `mark_host_failure()` is called
once at end-of-txn against `dns_info.active`, which by that point is
the *post-switch* host (B), so B's `fail_count` is incremented for A's
failures.

| Attempt | dst_addr (target) | dns_info.active | retry_attempts after | `(retry+1) % 3 == 0`? | Result |
|---------|-------------------|-----------------|----------------------|----------------------|--------|
| 1 | A | A | 1 | no | fail |
| 2 | A | A | 2 | no | fail |
| 3 | A | A → switched to B (`select_next_rr`); dst_addr untouched | 3 | **yes** | fail |
| 4 | A (still!) | B | 3 ≥ 3, give up | n/a | fail → 502 to client |

End of txn: `mark_host_failure()` increments `B.fail_count` even
though B was never tried. Repeated requests do this 3 times → **B
gets marked DOWN** while A (the actually-broken host) keeps getting
hit on every transaction.

**Later behavior — client gets 200; A's failures are correctly
recorded against A.**

`do_hostdb_update_if_necessary()` runs after each failure (so the
correct `dns_info.active` is in scope), and after `select_next_rr()`
the SM explicitly assigns `dns_info.addr` and `server_info.dst_addr`
to the new target.

| Attempt | dst_addr (target) | dns_info.active | A.fail_count after | retry_attempts after | RR switch? | Result |
|---------|-------------------|-----------------|--------------------|----------------------|------------|--------|
| 1 | A | A | 1 | 1 | no | fail |
| 2 | A | A | 2 | 2 | no | fail |
| 3 | A | A | 3 | 3 | **yes** → active=B, dst_addr=B | fail |
| **4** | **B** | B | 3 (unchanged) | n/a | n/a | **success → 200** |

A finishes this transaction with `fail_count = 3` (not yet at the
threshold of `max_retries + 1 = 4`). The next request that hashes to
A bumps it to 4 and marks A DOWN; subsequent requests within
`fail_window` skip A entirely until the window expires.

If B and C had also been DOWN (no selectable alternate),
`select_next_rr()` returns false and the SM stays on A using its
remaining per-host retry budget instead of giving up — new in
#13102.

**Later behavior**

- Failure timestamp is `ts_clock::now()` at the moment `mark_down`
is called, so the full `fail_window` is honored regardless of how
long the transaction took to decide the host was bad.


## Summary of bug-fix PRs

### PR #12846 — Fix `HostDBInfo::is_down` condition

**Bug.** `HostDBInfo::is_down()` had an inverted comparison:

```cpp
// Before
return (last_fail != TS_TIME_ZERO) && (last_fail + fail_window < now);
// After
return (last_fail != TS_TIME_ZERO) && (now <= last_fail + fail_window);
```

The pre-fix code reported a host as DOWN only *after* `fail_window` had elapsed —
the opposite of intent. During the fail window (when the host should be
considered blocked) it returned `false`.

**Behavior change.** A failed host is now correctly treated as DOWN for the
entire `fail_window` following `last_failure`, and returns to UP only after
the window has elapsed. Every downstream path that calls `is_down()` (retry
budget, RR selection, negative-cached check) is impacted.

### PR #12880 — Check state of HostDBInfo in `select_best_http`

**Bug.** For single-address (non round-robin) records,
`HostDBRecord::select_best_http()` bypassed the state check:

```cpp
// Before
best_alive = &info[0]; // unconditional
// After
if (info[0].select(now, fail_window)) {
best_alive = &info[0];
}
```

A single-address host that was marked down would still be returned from
HostDB and reused for the next request.

**Behavior change.** Single-address origins now honor `down_server` state the
same way RR origins do: if the only target is DOWN and still within the fail
window, HostDB lookup fails (`nullptr`), the SM reports `OriginDown`, and no
connection attempt is made until the window expires. The gold autest
`dns_host_down` was updated accordingly (second request returns 500 instead
of 502 once the first failure marked the IP down).

### PR #13092 — Clarify `HostDBInfo` state

**Not a bug fix — a model/refactor.** Introduces an explicit tri-state for
upstream health and reshapes the API around it:

```cpp
enum class HostDBInfo::State { UP, DOWN, SUSPECT };
```

- **UP** — no known failure; normal selection.
- **DOWN** — `_last_failure` set, `now` is within `fail_window`; not eligible.
- **SUSPECT** — `fail_window` has elapsed; a probe is permitted. A successful
response transitions it to UP via `mark_up()`; another failure returns it
to DOWN.

API renames: `is_alive` → `is_up`, `mark_active_server_alive` →
`mark_active_server_up`. `mark_down` now takes `fail_window`. `select()` is
replaced by callers using `is_down()` directly. `last_failure` / `fail_count`
become private atomics (`_last_failure`, `_fail_count`).

**Behavior change.** The previous two-state model collapsed SUSPECT into UP
once `fail_window` expired, which meant a recovering host was treated
identically to a never-failed host for retry sizing. With SUSPECT explicit,
callers (notably the retry machinery in #13102) can apply
`connect_attempts_max_retries_down_server` to probing traffic and
`connect_attempts_max_retries` to UP traffic.

### PR #13102 — Fix connect attempt retries

**Bugs.** The retry machinery in
`HttpTransact::handle_response_from_server()` and
`HttpSM::mark_host_failure()` had several interlocking issues:

1. **Wrong threshold used to mark host down.** `mark_host_failure()` called
`increment_fail_count(..., connect_attempts_rr_retries, ...)`. The RR
switch-over config was reused as the "mark down" threshold, so a host was
marked DOWN after `connect_attempts_rr_retries` failures instead of
`connect_attempts_max_retries + 1` total attempts.
2. **Wrong clock for failure timestamp.**
`do_hostdb_update_if_necessary()` recorded the failure at
`t_state.client_request_time` (request-receipt time), not
`ts_clock::now()`. For long requests the fail window started in the past,
sometimes expiring before the failure was even recorded.
3. **RR exhaustion short-circuited all retries.** When the
`connect_attempts_rr_retries` boundary was hit and no other RR member was
selectable, the old code gave up, even though the current target still
had per-host retry budget.
4. **`connect_attempts_max_retries_down_server` had no effect in practice.**
It was only consulted through `is_server_negative_cached()`, which
depended on the inverted `is_down()` from #12846. In many cases it was
either never applied or applied when it shouldn't have been.
5. **`connect_attempts_max_retries_down_server == 0` blocked the probe.**
A separate code path in `do_http_server_open()` refused to connect at all
when the target was negative-cached and the config was zero — preventing
the SUSPECT-state probe that the fail window was designed to allow.
6. **Retry config range allowed overflow.** `[0-255]` with `uint8_t`
threshold = `max_retries + 1` wraps to 0.
7. Rename `proxy.config.http.connect_attempts_max_retries_down_server` with `proxy.config.http.connect_attempts_max_retries_suspicious_server` to clarify behavior.
The `proxy.config.http.connect_attempts_max_retries_down_server` config is deprecated.



**Behavior change.**

- Retry budget is now driven by HostDB state (via the new helper
`HttpTransact::origin_server_connect_attempts_max_retries`):
- `UP` → `connect_attempts_max_retries`
- `DOWN` → 0 (bail out immediately)
- `SUSPECT` → `connect_attempts_max_retries_down_server`
- `mark_host_failure` uses `max_retries + 1` as the attempt budget and
`ts_clock::now()` as the failure time.
- RR switch and per-host retry are separated: if RR has no selectable
alternate, the SM stays on the current target and keeps retrying up to
the per-host budget rather than giving up.
- The `do_http_server_open()` early-bail on "negative cached +
down_server=0" is removed — DOWN vs SUSPECT is now the single source of
truth.
- `proxy.config.http.connect_attempts_max_retries`,
`…_max_retries_down_server`, and `…_rr_retries` clamped to `[0-254]` to
avoid the `uint8_t` overflow when adding 1 for the total-attempt count.
- A new warning is logged at config reconfigure when
`connect_attempts_max_retries_down_server == 0` and
`connect_attempts_rr_retries > 0`, because that combination prevents any
probe of a recovering (SUSPECT) origin.
21 changes: 17 additions & 4 deletions doc/admin-guide/files/records.yaml.en.rst
Original file line number Diff line number Diff line change
Expand Up @@ -1812,15 +1812,28 @@ Origin Server Connect Attempts
The maximum number of connection retries |TS| can make when the origin server is not responding.
Each retry attempt lasts for `proxy.config.http.connect_attempts_timeout`_ seconds. Once the maximum number of retries is
reached, the origin is marked down (as controlled by `proxy.config.http.connect.down.policy`_. After this, the setting
`proxy.config.http.connect_attempts_max_retries_down_server`_ is used to limit the number of retry attempts to the known down origin.
`proxy.config.http.connect_attempts_max_retries_suspect_server`_ is used to limit the number of retry attempts when the origin is
in the SUSPECT state (recovering after `proxy.config.http.down_server.cache_time`_ has elapsed).

.. ts:cv:: CONFIG proxy.config.http.connect_attempts_max_retries_suspect_server INT 1
:reloadable:
:overridable:

Maximum number of connection retries |TS| can make while an origin is in the SUSPECT state (the first request after
`proxy.config.http.down_server.cache_time`_ has elapsed on a previously-down origin). The total attempt budget for a SUSPECT
origin is therefore ``connect_attempts_max_retries_suspect_server + 1`` (the initial probe plus each retry). If any attempt
succeeds, the origin transitions back to UP; if all attempts fail, the origin returns to DOWN for another
`proxy.config.http.down_server.cache_time`_ seconds. Typically smaller than `proxy.config.http.connect_attempts_max_retries`_
so the recovering origin is not flooded.

.. ts:cv:: CONFIG proxy.config.http.connect_attempts_max_retries_down_server INT 1
:reloadable:
:overridable:
:deprecated:

Maximum number of connection attempts |TS| can make while an origin is marked down per request. Typically this value is smaller than
`proxy.config.http.connect_attempts_max_retries`_ so an error is returned to the client faster and also to reduce the load on the down origin.
The timeout interval `proxy.config.http.connect_attempts_timeout`_ in seconds is used with this setting.
This setting is deprecated in favor of :ts:cv:`proxy.config.http.connect_attempts_max_retries_suspect_server`. If the
deprecated setting is set explicitly and the replacement is not, the deprecated value is mirrored forward and a warning is
logged. If both are set explicitly, the new setting wins and the deprecated value is ignored.

.. ts:cv:: CONFIG proxy.config.http.connect_attempts_retry_backoff_base INT 0
:reloadable:
Expand Down
Loading