RSPEED-2943: add authorization monitoring metrics#1639

Closed
major wants to merge 4 commits into lightspeed-core:main from major:split/authz-metrics
Conversation

@major major commented Apr 29, 2026

Stack: 2/3 - Depends on #1638. Merge after #1638, then merge #1640.

Description

Adds bounded Prometheus metrics for authorization checks and latency by protected action. This gives SLO dashboards visibility into authorization success, denial, and unexpected error paths without adding user or request identifiers as labels.
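The bounded-metric design described above can be sketched roughly as follows, assuming `prometheus_client`. The metric names match the PR walkthrough; the bucket boundaries and help strings are illustrative assumptions, not the PR's actual values:

```python
# Rough sketch of bounded authorization metrics (assumes prometheus_client).
# Metric names follow the PR summary; buckets and help text are illustrative.
from prometheus_client import Counter, Histogram

authorization_checks_total = Counter(
    "authorization_checks_total",
    "Authorization checks by protected action and result.",
    ["action", "result"],  # bounded label sets only; no user or request IDs
)

authorization_duration_seconds = Histogram(
    "authorization_duration_seconds",
    "Authorization check latency in seconds.",
    ["action", "result"],
    buckets=(0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0),
)
```

Because the labels are restricted to a small fixed vocabulary, time-series cardinality stays bounded regardless of traffic or user population.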

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Konflux configuration change
  • Unit tests improvement
  • Integration tests improvement
  • End to end tests improvement
  • Benchmarks improvement

Tools used to create PR

  • Assisted-by: OpenCode
  • Generated by: N/A

Related Tickets & Documents

  • Related Issue # RSPEED-2943
  • Closes #

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • `uv run pytest tests/unit/metrics/test_recording.py tests/unit/authorization/test_middleware.py`
  • `uv run radon cc -s src/authorization/middleware.py src/metrics/recording.py`
  • `uv run make verify`

coderabbitai Bot commented Apr 29, 2026

Warning

Rate limit exceeded

@major has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 49 minutes and 28 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: b02527bf-e174-47ff-8edf-8d2f12aceabe

📥 Commits

Reviewing files that changed from the base of the PR and between 4d7fe24 and 4cfe046.

📒 Files selected for processing (26)
  • src/app/endpoints/responses.py
  • src/app/endpoints/responses_telemetry.py
  • src/authentication/api_key_token.py
  • src/authentication/jwk_token.py
  • src/authentication/k8s.py
  • src/authentication/noop.py
  • src/authentication/noop_with_token.py
  • src/authentication/rh_identity.py
  • src/authentication/utils.py
  • src/authorization/middleware.py
  • src/metrics/__init__.py
  • src/metrics/recording.py
  • src/observability/__init__.py
  • src/observability/splunk.py
  • tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-cpu.yaml
  • tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-gpu.yaml
  • tests/e2e-prow/rhoai/scripts/e2e-ops.sh
  • tests/e2e/features/environment.py
  • tests/unit/app/endpoints/test_responses_splunk.py
  • tests/unit/authentication/test_jwk_token.py
  • tests/unit/authentication/test_k8s.py
  • tests/unit/authentication/test_rh_identity.py
  • tests/unit/authentication/test_utils.py
  • tests/unit/authorization/test_middleware.py
  • tests/unit/metrics/test_recording.py
  • tests/unit/observability/test_splunk.py

Walkthrough

Authorization latency is now measured and recorded throughout the authorization middleware. A new metrics module provides helper functions that normalize authorization labels and safely record both check counts and duration observations. Metrics are recorded in a finally block so they are emitted even when authorization errors occur.

Changes

Cohort / File(s) and Summary:

  • Authorization Metrics Recording (src/authorization/middleware.py): Adds latency measurement and metric recording to authorization checks via a new _record_authorization_metrics helper; records check results and duration in a finally block so metrics are emitted even on errors, including when authorization is denied or credentials are missing.
  • Metrics Infrastructure (src/metrics/__init__.py): Introduces Prometheus metrics for authorization: an authorization_checks_total counter and an authorization_duration_seconds histogram tracking authorization attempts by action and result, with configurable duration buckets.
  • Metrics Recording Helpers (src/metrics/recording.py): Adds authorization-specific recording functions that normalize action/result labels to bounded sets, record checks and durations, and gracefully handle metric-recording exceptions with warning logs.
  • Test Coverage (tests/unit/authorization/test_middleware.py, tests/unit/metrics/test_recording.py): Unit tests validating that metric-recording failures do not disrupt the authorization flow, that recording helpers normalize unexpected labels, and that exceptions are logged without propagating.
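The finally-block recording pattern described for the middleware can be sketched as follows. This is a hedged illustration, not the PR's actual code: the function and parameter names are hypothetical (the real helper is `_record_authorization_metrics`), but the core idea is the same, i.e. the result label is decided along each path and recording happens in `finally`, with recording failures logged rather than propagated:

```python
# Hypothetical sketch of middleware-side metric recording; names are assumed.
import asyncio
import logging
import time
from collections.abc import Awaitable, Callable

logger = logging.getLogger(__name__)


async def authorize(
    action: str,
    do_check: Callable[[str], Awaitable[bool]],
    record: Callable[[str, str, float], None],
) -> None:
    """Check authorization for an action, always recording metrics in finally."""
    start = time.monotonic()
    result = "error"  # assume the worst until the check completes
    try:
        if await do_check(action):
            result = "success"
        else:
            result = "denied"
            raise PermissionError(f"authorization denied for {action}")
    finally:
        try:
            record(action, result, time.monotonic() - start)
        except Exception:  # a broken recorder must never mask the real outcome
            logger.warning("failed to record authorization metrics", exc_info=True)
```

The key property the unit tests above verify is visible here: a recorder that raises is caught and logged, so a metrics outage cannot turn a successful authorization into a 500 or suppress a 403.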

Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant AuthMiddleware as Auth Middleware
    participant RoleResolver as Role Resolver
    participant MetricsRecorder as Metrics Recorder
    participant Prometheus as Prometheus Metrics

    Client->>AuthMiddleware: Authorization Request
    activate AuthMiddleware

    Note over AuthMiddleware: Start timing (monotonic)

    AuthMiddleware->>RoleResolver: Resolve Role
    activate RoleResolver
    RoleResolver-->>AuthMiddleware: Role/Result
    deactivate RoleResolver

    alt Authorization Success
        AuthMiddleware->>AuthMiddleware: Set result = "success"
    else Authorization Denied
        AuthMiddleware->>AuthMiddleware: Set result = "denied"
        Note over AuthMiddleware: Raise 403
    end

    AuthMiddleware->>MetricsRecorder: Record Check (action, result)
    activate MetricsRecorder
    MetricsRecorder->>Prometheus: Increment Counter
    Prometheus-->>MetricsRecorder: OK
    deactivate MetricsRecorder

    AuthMiddleware->>MetricsRecorder: Record Duration (action, result, latency)
    activate MetricsRecorder
    MetricsRecorder->>Prometheus: Observe Histogram
    Prometheus-->>MetricsRecorder: OK
    deactivate MetricsRecorder

    AuthMiddleware-->>Client: Response/Exception
    deactivate AuthMiddleware
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ✅ 5 passed

  • Title check (✅ Passed): The title directly and clearly summarizes the main change: adding authorization monitoring metrics. It is concise, specific, and matches the core objective described in the PR description.
  • Docstring Coverage (✅ Passed): Docstring coverage is 100.00%, which exceeds the required threshold of 80.00%.
  • Linked Issues check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Description check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.


@major major mentioned this pull request Apr 29, 2026
19 tasks
@major major force-pushed the split/authz-metrics branch from 0469516 to 4d7fe24 on April 30, 2026 at 02:48

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/metrics/__init__.py`:
- Around line 78-90: The new module-level metrics authorization_checks_total and
authorization_duration_seconds must be annotated as constants using Final;
import Final from typing if not present and add type annotations like
Final[Counter] for authorization_checks_total and Final[Histogram] for
authorization_duration_seconds so they follow the project's constant-typing
rule; leave the instantiation expressions unchanged and ensure the import of
Final is added at the top of the module if missing.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7933fdef-3790-4918-b1ff-b17e571bddad

📥 Commits

Reviewing files that changed from the base of the PR and between ca125c4 and 4d7fe24.

📒 Files selected for processing (5)
  • src/authorization/middleware.py
  • src/metrics/__init__.py
  • src/metrics/recording.py
  • tests/unit/authorization/test_middleware.py
  • tests/unit/metrics/test_recording.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: Pylinter
  • GitHub Check: integration_tests (3.12)
  • GitHub Check: radon
  • GitHub Check: Pyright
  • GitHub Check: build-pr
  • GitHub Check: E2E: server mode / ci / group 3
  • GitHub Check: E2E: server mode / ci / group 2
  • GitHub Check: E2E: server mode / ci / group 1
  • GitHub Check: E2E: library mode / ci / group 2
  • GitHub Check: E2E: library mode / ci / group 3
  • GitHub Check: E2E: library mode / ci / group 1
  • GitHub Check: E2E Tests for Lightspeed Evaluation job
🧰 Additional context used
📓 Path-based instructions (4)
tests/unit/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/unit/**/*.py: Use pytest for all unit and integration tests
Use pytest-mock for AsyncMock objects in unit tests
Use marker pytest.mark.asyncio for async tests

Files:

  • tests/unit/authorization/test_middleware.py
  • tests/unit/metrics/test_recording.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Do not use unittest - pytest is the standard testing framework for this project

Files:

  • tests/unit/authorization/test_middleware.py
  • tests/unit/metrics/test_recording.py
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Use absolute imports for internal modules: from authentication import get_auth_dependency
All modules start with descriptive docstrings explaining purpose
Use logger = get_logger(__name__) from log.py for module logging
Type aliases defined at module level for clarity
Use Final[type] as type hint for all constants
All functions require docstrings with brief descriptions
Complete type annotations for function parameters and return types
Use Union types with modern syntax: str | int
Use Optional[Type] for optional type hints
Use snake_case with descriptive, action-oriented function names (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns: return new data structures instead of modifying parameters
Use async def for I/O operations and external API calls
Use logger.debug() for detailed diagnostic information
Use logger.info() for general information about program execution
Use logger.warning() for unexpected events or potential problems
Use logger.error() for serious problems that prevented function execution
All classes require descriptive docstrings explaining purpose
Use PascalCase for class names with descriptive names and standard suffixes: Configuration, Error/Exception, Resolver, Interface
Abstract classes use ABC with @abstractmethod decorators
Complete type annotations for all class attributes, use specific types, not Any
Follow Google Python docstring conventions for all modules, classes, and functions
Docstring Parameters section documents function parameters
Docstring Returns section documents function return values
Docstring Raises section documents exceptions that may be raised
Use black for code formatting
Use pylint for static analysis with source-roots configuration set to "src"
Use pyright for type checking
Use ruff for fast linting
Use pydocstyle for docstring style validation
Use mypy for additional type checking
Use bandit for security issue detection

Files:

  • src/metrics/__init__.py
  • src/authorization/middleware.py
  • src/metrics/recording.py
src/**/__init__.py

📄 CodeRabbit inference engine (AGENTS.md)

Package __init__.py files contain brief package descriptions

Files:

  • src/metrics/__init__.py
🔇 Additional comments (4)
tests/unit/authorization/test_middleware.py (1)

325-355: Good resilience coverage for metric failure in the success path.

This test correctly verifies that authorization success is not masked by metric failures and that warning logging plus duration recording still occur.

tests/unit/metrics/test_recording.py (1)

191-290: Strong coverage for authorization recorder behavior.

The new parametrized tests and normalization assertions validate both happy-path recording and swallowed metric-update failures cleanly.

src/authorization/middleware.py (1)

32-55: Resilient instrumentation in finally is implemented correctly.

The result-state transitions and isolated metric-recording error handling preserve authorization outcomes while still emitting observability data for all paths.

Also applies to: 154-195

src/metrics/recording.py (1)

18-33: Bounded-label normalization and record helpers look solid.

The new normalization + recorder functions are clear, type-annotated, and aligned with the intended bounded-label metric strategy.

Also applies to: 35-61, 160-195
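The bounded-label normalization praised here can be sketched as follows. The allowed sets and fallback value are assumptions for illustration; the real ones live in src/metrics/recording.py:

```python
# Hypothetical bounded label sets; actual values are defined in the PR itself.
from typing import Final

_KNOWN_ACTIONS: Final[frozenset[str]] = frozenset({"query", "feedback", "config"})
_KNOWN_RESULTS: Final[frozenset[str]] = frozenset({"success", "denied", "error"})
_FALLBACK: Final[str] = "other"


def normalize_authorization_labels(action: str, result: str) -> tuple[str, str]:
    """Clamp labels to known sets so metric cardinality stays bounded."""
    return (
        action if action in _KNOWN_ACTIONS else _FALLBACK,
        result if result in _KNOWN_RESULTS else _FALLBACK,
    )
```

Any unexpected value collapses to a single fallback label, so a misbehaving caller (or an attacker-controlled string) can never inflate the number of Prometheus time series.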

Comment thread src/metrics/__init__.py Outdated
major and others added 4 commits May 8, 2026 07:18
Move telemetry functions from responses.py into a dedicated
responses_telemetry.py module. Centralize fire-and-forget dispatch
logic in observability/splunk.py as dispatch_splunk_event().

No behavioral changes to request handling. Reduces responses.py
from 1201 to 1057 lines.

Signed-off-by: Major Hayden <major@redhat.com>
…t restores llama-stack (lightspeed-core#1628)

* Add diagnostic pod logs on e2e failure and remove disrupt-once optimization

* Increase vLLM max-model-len to 35936 (GPU memory limit)

* Accept 503 as valid port-forward proof in e2e connectivity check

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Major Hayden <major@redhat.com>
Signed-off-by: Major Hayden <major@redhat.com>
@major major force-pushed the split/authz-metrics branch from 91ff835 to 4cfe046 on May 8, 2026 at 12:19

major commented May 8, 2026

Superseded by #1640 which now contains all three metrics commits as a stack.

@major major closed this May 8, 2026

2 participants