[WIP][python] Add user identity-based token cache for RESTTokenFileIO#7564

Closed
shyjsarah wants to merge 4 commits into apache:master from shyjsarah:dev-pythob-cache-token

Contributor

@shyjsarah shyjsarah commented Mar 31, 2026

Purpose

This PR implements user identity-based token cache isolation in RESTTokenFileIO to prevent multiple users from sharing the same data token when accessing the same table location.

Problem: The previous implementation used only the table identifier as the token cache key, which polluted the token cache when multiple users accessed the same table within a single process. All users would share the same data token regardless of their actual authentication credentials.

Solution:

  • Use (path, user_identity) as the token cache key to ensure proper isolation between different users
  • Extract user identity from the authentication provider (Bear Token, DLF AccessKey, ECS Role, STS File, etc.); the DLF-based providers all use the actual access_key_id
  • Implement the double-checked locking pattern for thread-safe token caching
  • Path-based caching avoids issues with table renames (physical location remains constant)
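
The caching scheme described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the names `Token`, `_TOKEN_CACHE`, `_TOKEN_CACHE_LOCK`, and `get_or_refresh` are hypothetical, the expiry check is deliberately simplistic, and a single global lock stands in for the instance lock plus global cache lock that the PR mentions.

```python
import threading
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for a data token returned by the REST server."""
    value: str
    expire_at_millis: int


# Cache key is (path, user_identity), so two users reading the same
# table location never see each other's token.
CacheKey = Tuple[str, str]

_TOKEN_CACHE: Dict[CacheKey, Token] = {}
_TOKEN_CACHE_LOCK = threading.Lock()


def _is_valid(token: Optional[Token], now_millis: int) -> bool:
    return token is not None and token.expire_at_millis > now_millis


def get_or_refresh(
    path: str,
    user_identity: str,
    fetch: Callable[[], Token],
    now_millis: Optional[int] = None,
) -> Token:
    """Return a cached token for (path, user_identity), fetching it if needed."""
    now = int(time.time() * 1000) if now_millis is None else now_millis
    key = (path, user_identity)

    # First check, without the lock: the common case is a cache hit.
    token = _TOKEN_CACHE.get(key)
    if _is_valid(token, now):
        return token

    with _TOKEN_CACHE_LOCK:
        # Second check, under the lock: another thread may have
        # refreshed the token while this one was waiting.
        token = _TOKEN_CACHE.get(key)
        if _is_valid(token, now):
            return token
        token = fetch()
        _TOKEN_CACHE[key] = token
        return token
```

Because the key is built from the physical path rather than the table name, renaming a table leaves the key, and therefore the cached token, unchanged.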

Key Changes:

  1. Add user_identity parameter to RESTTokenFileIO.__init__() - stores user identity for cache key construction
  2. Add _extract_user_identity() method in RESTCatalog to extract user identity from auth provider:
    • Bear Token → "bear:{token}"
    • DLF AccessKey/ECS Role/STS File → "dlf:{access_key_id}" (actual AK from token, e.g., STS.xxx)
  3. Modify try_to_refresh_token() to use (path, user_identity) tuple as cache key
  4. Implement a thread-safe token cache with double-checked locking (instance lock + global cache lock)
  5. Pass user_identity when creating RESTTokenFileIO instances in file_io_for_data()
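
As a rough sketch of the identity mapping in item 2, assuming simplified provider classes: `BearTokenProvider`, `DlfToken`, `DlfProvider`, and the standalone `extract_user_identity` function below are hypothetical stand-ins, not the actual auth provider classes or the `_extract_user_identity()` method from this PR. Falling back to an empty string for unknown providers is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BearTokenProvider:
    """Hypothetical stand-in for a Bear Token auth provider."""
    token: str


@dataclass
class DlfToken:
    """Hypothetical DLF token; only the access key id matters here."""
    access_key_id: str


@dataclass
class DlfProvider:
    """Hypothetical stand-in covering DLF AccessKey, ECS Role, and STS File."""
    current_token: Optional[DlfToken]


def extract_user_identity(provider: object) -> str:
    """Map an auth provider to the identity string used in the cache key."""
    if isinstance(provider, BearTokenProvider):
        return f"bear:{provider.token}"
    if isinstance(provider, DlfProvider) and provider.current_token is not None:
        # All DLF variants use the actual access key id carried by the token
        # (e.g. STS.xxx), not the role name or the token file path.
        return f"dlf:{provider.current_token.access_key_id}"
    # Assumption: unknown providers yield an empty identity.
    return ""
```

For example, `extract_user_identity(DlfProvider(DlfToken("STS.xxx")))` yields `"dlf:STS.xxx"`, which would then be paired with the table path to form the cache key.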

Tests

New Unit Tests (test_token_cache_isolation.py):

  • test_different_users_have_separate_token_cache - Verifies different users get separate tokens for same path
  • test_same_user_reuses_token_cache - Verifies same user reuses cached token
  • test_table_rename_preserves_token_cache - Verifies table rename doesn't break token cache (path-based)
  • test_empty_user_identity_isolation - Verifies empty user identity handling
  • test_token_cache_with_expiry_check - Verifies token expiration checking logic
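
The expiry logic exercised by the last test is not shown in this description. One plausible shape, given purely as an assumption, is to refresh a token slightly before it expires; the `should_refresh` helper and the 60-second `REFRESH_MARGIN_MILLIS` below are hypothetical and not taken from the PR.

```python
import time
from typing import Optional

# Assumption: refresh when less than one minute of validity remains.
REFRESH_MARGIN_MILLIS = 60_000


def should_refresh(
    expire_at_millis: Optional[int],
    now_millis: Optional[int] = None,
) -> bool:
    """Return True if there is no token or it expires within the margin."""
    if expire_at_millis is None:
        return True
    now = int(time.time() * 1000) if now_millis is None else now_millis
    return expire_at_millis - now < REFRESH_MARGIN_MILLIS
```

Passing `now_millis` explicitly keeps the check deterministic in unit tests, which avoids sleeping or patching the clock.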

- Add user_identity parameter to RESTTokenFileIO constructor
- Use (path, user_identity) as token cache key to ensure proper isolation
  between different users accessing the same table location
- Add _extract_user_identity() method in RESTCatalog to extract user identity
  from authentication provider (Bear Token, DLF AccessKey, ECS Role, etc.)
- Implement double-check locking pattern for thread-safe token caching
- Add comprehensive unit tests covering:
  * Multi-user isolation scenarios
  * Token cache reuse for same user
  * Table rename preservation (path-based caching)
  * Token expiry checking

This prevents token cache pollution where different users would share
the same data token when accessing the same table.

Use actual access_key_id from token for all DLF authentication types:
- ECS Role and STS File now extract the real access_key_id (STS.xxx) instead of role name or file path
- Ensures consistent user identity identification across all auth methods
- Simplified logic: all DLF auth types use 'dlf:{access_key_id}' format

Remove trailing whitespace in blank lines to comply with PEP 8.
No functional changes, only formatting fixes.
@shyjsarah shyjsarah closed this Apr 1, 2026