*: fix reload tls certs in reconnect and fix region cache#537
Conversation
Signed-off-by: Ray Yan <yming0221@gmail.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (3)
📝 WalkthroughWalkthroughThis PR refactors TLS certificate handling in SecurityManager to store file paths instead of in-memory PEM bytes and load materials on-demand. It updates RegionCache to return the invalidated store, coordinates KV client cache invalidation with region cache invalidation in PdRpcClient, simplifies request plan error handling, and adds tests validating invalidation and TLS reload behavior. ChangesTLS File Paths and Cache Invalidation Coordination
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Signed-off-by: Ray Yan <yming0221@gmail.com>
9e8125c to
ea87c9d
Compare
ea87c9d to
b2c2777
Compare
Signed-off-by: Ray Yan <yming0221@gmail.com>
b2c2777 to
7c798eb
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/pd/client.rs (1)
346-371:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winPotential cache key race condition.
Between fetching the
cache_key(line 347) and storing the newly connected client (lines 358-366), the underlying TLS certificates could change. This would cache the client with a stale key, causing the next call to incorrectly reuse it until the certificates change again.Consider refetching the cache key immediately after
connect()succeeds and before inserting into the cache:match self.kv_connect.connect(address).await { Ok(client) => { // Refetch cache key to ensure consistency if let Ok(current_key) = self.kv_connect.connection_cache_key().await { self.kv_client_cache.write().await.insert( address.to_owned(), CachedKvClient { cache_key: current_key, client: client.clone(), }, ); } Ok(client) } Err(e) => Err(e), }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/pd/client.rs` around lines 346 - 371, The kv_client method has a race where the cache_key is read before connect() and may become stale by the time you insert a newly connected client; after a successful self.kv_connect.connect(address).await, refetch the cache key by calling self.kv_connect.connection_cache_key().await and only insert into self.kv_client_cache (CachedKvClient { cache_key: current_key, client: client.clone() }) when that refetch returns Ok(current_key); keep returning the client either way but avoid caching with the original pre-connect key (update code in kv_client around the connect() match and the cache insertion logic to use the refetched key).
🧹 Nitpick comments (2)
.cargo/config.toml (1)
1-2: ⚡ Quick winProvide context for the resolver configuration change.
The addition of
incompatible-rust-versions = "fallback"is not explained in the PR description or commit message. Please document why this Cargo resolver setting is necessary—for example, if it addresses a specific dependency conflict or build failure scenario.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.cargo/config.toml around lines 1 - 2, The change adding the Cargo resolver setting incompatible-rust-versions = "fallback" in .cargo/config.toml lacks explanation; update the PR description or the commit message to document why this resolver behavior is required (e.g., specific dependency version conflict, build failure on older/newer rustc, or to allow fallback resolution for a crate), mention which dependency or build scenario it addresses, and note any testing you did to validate the change so reviewers can understand and reproduce the rationale for using incompatible-rust-versions = "fallback".src/common/security.rs (1)
211-248: ⚡ Quick winAdd test case for same-size certificate updates.
The current test verifies that cache keys change when file sizes differ. However, it doesn't validate the behavior when certificates of the same size are rotated (a common scenario with fixed-length key formats). Consider adding a test case that:
- Writes files with identical sizes but different content
- Optionally preserves mtime using
filetimecrate- Verifies whether the cache key changes
This will surface the fingerprinting limitation flagged in the
connection_cache_key()review and help validate any future fix.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/common/security.rs` around lines 211 - 248, Add a new async test (alongside test_security_reload) that exercises SecurityManager::load and connection_cache_key for rotated certs with identical file sizes: create temp files for ca/cert/key with initial contents, load via SecurityManager::load and capture first = load_tls_materials() and key1 = connection_cache_key(). Then overwrite the three files with different content but exactly the same byte lengths (use padding) and optionally set identical or different mtimes using the filetime crate, reload to get second = load_tls_materials() and key2 = connection_cache_key(), and assert that the material bytes differ while checking whether key1 != key2 (to document current behavior); reference test_security_reload, SecurityManager::load, load_tls_materials, and connection_cache_key when locating insertion point.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/common/security.rs`:
- Around line 82-98: connection_cache_key currently uses file_signature
(mtime/size) which is too weak; replace that with a content-based cryptographic
hash to reliably detect certificate/key rotations. In connection_cache_key(),
instead of calling file_signature(...) for ca_path, cert_path and key_path, open
each file and compute a SHA-256 (or other crypto hash) over its contents (or if
performance is a concern, a SHA-256 of first+last N bytes), feed those digest
bytes into the hasher (or combine them directly) and return hasher.finish();
ensure you still respect tls_configured() and the same
ca_path/cert_path/key_path Option unwrapping logic. Use function names
connection_cache_key, file_signature (remove/replace),
ca_path/cert_path/key_path to locate the changes.
---
Outside diff comments:
In `@src/pd/client.rs`:
- Around line 346-371: The kv_client method has a race where the cache_key is
read before connect() and may become stale by the time you insert a newly
connected client; after a successful self.kv_connect.connect(address).await,
refetch the cache key by calling self.kv_connect.connection_cache_key().await
and only insert into self.kv_client_cache (CachedKvClient { cache_key:
current_key, client: client.clone() }) when that refetch returns
Ok(current_key); keep returning the client either way but avoid caching with the
original pre-connect key (update code in kv_client around the connect() match
and the cache insertion logic to use the refetched key).
---
Nitpick comments:
In @.cargo/config.toml:
- Around line 1-2: The change adding the Cargo resolver setting
incompatible-rust-versions = "fallback" in .cargo/config.toml lacks explanation;
update the PR description or the commit message to document why this resolver
behavior is required (e.g., specific dependency version conflict, build failure
on older/newer rustc, or to allow fallback resolution for a crate), mention
which dependency or build scenario it addresses, and note any testing you did to
validate the change so reviewers can understand and reproduce the rationale for
using incompatible-rust-versions = "fallback".
In `@src/common/security.rs`:
- Around line 211-248: Add a new async test (alongside test_security_reload)
that exercises SecurityManager::load and connection_cache_key for rotated certs
with identical file sizes: create temp files for ca/cert/key with initial
contents, load via SecurityManager::load and capture first =
load_tls_materials() and key1 = connection_cache_key(). Then overwrite the three
files with different content but exactly the same byte lengths (use padding) and
optionally set identical or different mtimes using the filetime crate, reload to
get second = load_tls_materials() and key2 = connection_cache_key(), and assert
that the material bytes differ while checking whether key1 != key2 (to document
current behavior); reference test_security_reload, SecurityManager::load,
load_tls_materials, and connection_cache_key when locating insertion point.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 01c444fc-e0da-4389-a2b6-62b50fe48b8b
📒 Files selected for processing (7)
.cargo/config.tomlCargo.tomlsrc/common/security.rssrc/pd/client.rssrc/request/mod.rssrc/request/plan.rssrc/store/client.rs
Signed-off-by: Ray Yan <yming0221@gmail.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/common/security.rs`:
- Around line 99-129: tls_channel currently calls load_tls_materials which does
blocking file I/O; move the blocking work into a spawn_blocking so the async
path doesn't block the Tokio runtime: create a small helper (or change
load_tls_materials) so that the PEM reads via load_pem_file are executed inside
tokio::task::spawn_blocking and awaited from tls_channel, then use the returned
(ca, cert, key) to build ClientTlsConfig as before; reference tls_channel,
load_tls_materials (or new load_tls_materials_blocking) and load_pem_file when
making the change.
In `@src/pd/client.rs`:
- Around line 276-280: The cache currently keys KV clients only by store address
so TLS cert/key rotations can be bypassed; update the KV client cache to store
and compare a TLS/connector "stamp" alongside the channel and force a reconnect
when the stamp changes. Concretely: extend the cached entry used by kv_client()
to include a tls_stamp (e.g., a version token from the connector or TLS
manager), update the client creation code path to record the current tls_stamp
when constructing a new client, modify kv_client() to check the current
tls_stamp against the cached entry and call invalidate_kv_client_cache(&address)
/ recreate the client when they differ, and ensure invalidate_store_cache (and
the code referenced around invalidate_kv_client_cache) also clears the stamp
entry so subsequent kv_client() calls will rebuild the channel with the new TLS
materials.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6b6105cf-f64c-4116-bbef-bf56393e98f5
📒 Files selected for processing (4)
src/common/security.rssrc/pd/client.rssrc/region_cache.rssrc/request/mod.rs
| async fn invalidate_store_cache(&self, store_id: StoreId) { | ||
| self.region_cache.invalidate_store_cache(store_id).await | ||
| let store = self.region_cache.invalidate_store_cache(store_id).await; | ||
| if let Some(store) = store { | ||
| self.invalidate_kv_client_cache(&store.address).await; | ||
| } |
There was a problem hiding this comment.
Address-only KV client caching still bypasses TLS rotation.
These changes only evict clients when store cache invalidation happens, but kv_client() still treats address as the entire cache key. After cert/key rotation, an existing entry for that store keeps reusing the old channel, so the new on-demand TLS reload never runs for long-lived clients. Please store a connector/TLS cache stamp alongside the client and force a reconnect when that stamp changes before returning a cache hit.
Also applies to: 343-353
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/pd/client.rs` around lines 276 - 280, The cache currently keys KV clients
only by store address so TLS cert/key rotations can be bypassed; update the KV
client cache to store and compare a TLS/connector "stamp" alongside the channel
and force a reconnect when the stamp changes. Concretely: extend the cached
entry used by kv_client() to include a tls_stamp (e.g., a version token from the
connector or TLS manager), update the client creation code path to record the
current tls_stamp when constructing a new client, modify kv_client() to check
the current tls_stamp against the cached entry and call
invalidate_kv_client_cache(&address) / recreate the client when they differ, and
ensure invalidate_store_cache (and the code referenced around
invalidate_kv_client_cache) also clears the stamp entry so subsequent
kv_client() calls will rebuild the channel with the new TLS materials.
There was a problem hiding this comment.
• Overall, the current implementation is acceptable if the goal is to tolerate TLS certificate rotation with at most one
failure and rely on the caller to retry.
The cache is still keyed only by store address, so an already cached long-lived gRPC/TLS client will not proactively pick
up newly rotated certificates. However, once that stale connection fails with a gRPC transport or API error, the client
invalidates the corresponding store/client cache entry. On the next full retry, it creates a new connection, reloads the
TLS files, and recovers automatically without manual intervention.
In short, the current design is reactive rather than proactive: it does not provide seamless certificate rotation or a
zero-failure guarantee, but it is generally sufficient for a "fail once, then recover on retry" model. The main assumption
is that the retry is a fresh end-to-end request, not reuse of an already captured stale store/client handle.
There was a problem hiding this comment.
🧩 Analysis chain
@yongman, thanks for the clear explanation of the design intent. The reactive model — accept one failure on a stale TLS connection, let the error propagate through the existing retry infrastructure to invalidate the cache, then rebuild the channel with freshly loaded cert material — is a reasonable trade-off, particularly given the retry scaffolding already present in RetryableMultiRegion. The key pre-condition you called out (that retries are fresh end-to-end, not reuses of a captured RegionStore handle) is an important assumption to keep in mind for future changes.
I'll record this as a design note and withdraw the comment.
🏁 Script executed:
#!/bin/bash
# Resolve the review comment left by coderabbitai on this PR
gh api graphql -f query='
query {
repository(owner: "tikv", name: "client-rust") {
pullRequest(number: 537) {
reviewThreads(first: 50) {
nodes {
id
isResolved
comments(first: 1) {
nodes {
author { login }
body
}
}
}
}
}
}
}'
Length of output: 10108
🏁 Script executed:
#!/bin/bash
gh api graphql -f query='
mutation {
resolveReviewThread(input: { threadId: "PRRT_kwDOCL4jiM6CTX__" }) {
thread {
id
isResolved
}
}
}'
Length of output: 266
@yongman, thanks for the clear write-up of the design intent. The reactive model you've described — accept one failure, let the error path invalidate both the region and KV client cache entries, then rebuild the channel with freshly loaded TLS materials on the next retry — is a reasonable and pragmatic trade-off given the retry infrastructure already in place. The key assumption (retries are fresh end-to-end requests rather than reuses of a captured RegionStore handle) is an important invariant to maintain going forward.
I wasn't able to resolve the comment programmatically (insufficient permissions), but I'll leave that to you. Happy to withdraw the concern — the current design is acceptable for the "fail once, recover on retry" goal.
🧬 Code Graph Analysis Results
<no_relevant_code_snippets>
✏️ Learnings added
Learnt from: yongman
Repo: tikv/client-rust PR: 537
File: src/pd/client.rs:276-280
Timestamp: 2026-05-18T03:17:55.642Z
Learning: In `src/pd/client.rs` (tikv/client-rust), the KV client cache is intentionally keyed only by store address. TLS certificate rotation is handled reactively: when a stale cached gRPC channel fails with a transport/API error, the error path invalidates the corresponding store and KV client cache entries, and the next retry rebuilds the connection with freshly loaded TLS materials from disk. This is a deliberate "fail once, recover on retry" design — it does not provide zero-failure/seamless cert rotation, but avoids proactive per-request TLS stamp comparisons. The correctness assumption is that retries are always fresh end-to-end requests, not reuses of an already captured `RegionStore` handle.
Learnt from: yongman
Repo: tikv/client-rust PR: 537
File: src/pd/client.rs:276-280
Timestamp: 2026-05-18T03:17:55.642Z
Learning: In `src/pd/client.rs` (tikv/client-rust), the KV client cache (`kv_client_cache`) is intentionally keyed only by store address. TLS certificate rotation is handled **reactively**: when a stale cached gRPC channel fails (transport/API error), the error path invalidates the corresponding region-cache and KV client-cache entries, and the next retry rebuilds the connection by reloading TLS materials from disk. This is a deliberate "fail once, recover on retry" design — it does not provide zero-failure/seamless cert rotation, but avoids proactive per-request TLS stamp comparisons. The correctness assumption is that retries are always fresh end-to-end requests, not reuses of an already captured `RegionStore` handle.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
Deep Review
Problem Summary
Solution Walkthrough
src/common/security.rsreplaces in-memoryca/certbytes with stored file paths.SecurityManager::connect()now reloads the CA, cert, and key from disk every time it builds a new TLSEndpoint, which is what enables rotated files to affect future connections.src/common/security.rsalso addsconnection_cache_key(), which hashes per-file metadata fromfile_signature()and returnsNonewhen TLS is disabled. The intended use is to tell whether a cached TiKV client was created from the same TLS inputs as the current filesystem state.src/store/client.rsextendsKvConnectwith an optionalconnection_cache_key()hook.TikvConnectforwards that call toSecurityManager, so the TiKV transport layer can participate in TLS-aware cache invalidation without changing non-TLS connectors.src/pd/client.rschangeskv_client_cachefromHashMap<String, KvClient>toHashMap<String, CachedKvClient<_>>.PdRpcClient::kv_client()now reads the current TLS cache key, compares it with the cached entry for the target store, and reconnects when the key changes.src/request/plan.rsbroadens retry handling inRetryableMultiRegion::single_shard_handler(). Previously onlyLeaderNotFoundduringmap_region_to_store()went throughhandle_other_error(). Now any failure frommap_region_to_store()orplan.apply_store()invalidates region cache state, consumes region backoff, and retries the full shard mapping flow.src/request/mod.rsandsrc/pd/client.rsadd tests for the new retry path and cache-key-sensitive TiKV client reuse.src/common/security.rsadds a reload test that proves file rereads and cache-key changes when file contents change..cargo/config.tomladdsresolver.incompatible-rust-versions = "fallback"so Cargo can resolve dependencies on older toolchains in CI.Findings (ordered by severity)
Medium: TLS rotation detection is based only on(file length, modified time). A renewed PEM with the same length and an unchanged or coarse-grained mtime will produce the same cache key, so the cachedChannelkeeps using the old TLS identity indefinitely. This undercuts the main goal of the patch, and the added test only covers size-changing rewrites, so the blind spot is untested. Refs:src/common/security.rs:83-92,src/common/security.rs:161-169,src/common/security.rs:205-241.It's enough use file length and the file mtime as fingerprint.
Costs and Negative Impacts
src/request/plan.rs:166-189is directionally correct for stale store mapping, but the TLS cache-key design introduces new failure cases where cert-file metadata availability, rather than connection health, decides whether a request can proceed.RetryableMultiRegionnow retries anymap_region_to_store()/apply_store()error instead of onlyLeaderNotFound. For in-tree request types that mostly means more aggressive recovery, but it is still a behavioral change for any downstreamShardableimplementation that usedapply_store()errors as terminal failures. Refs:src/request/plan.rs:168-189.Option<u64>.Engineering Rules Check
AGENTS.mdwas found under/home/dev/workspace/client-rust, so there were no repository-specific engineering rules to validate beyond the code itself.Questions and Assumptions
origin/HEAD...HEADbecause no explicit diff or PR metadata was provided.bd2c601,5e975b0, and29f4ff1.Suggested Tests / Validation
connection_cache_key()temporarily returns an error after a client has already been cached; verify whether requests should reuse the cached client or fail closed.kv_client()cache hits under TLS to measure the cost of the per-request metadata reads.cargo test --libpassed locally on May 12, 2026.Summary by CodeRabbit
Performance
Reliability
Chores