Commit 7921bbe

Document OWN_GIL mode features and usage
CHANGELOG.md:
- Add OWN_GIL Context Mode with feature list
- Add Process-Local Environments for OWN_GIL
- Add Per-Process Event Loop Namespaces
- Add OWN_GIL Test Suites section
- Add Changed section for asyncio compatibility fixes

docs/owngil_internals.md:
- Add Quick Start section with usage examples
- Add Feature Compatibility table
- Add Benchmarking section with example output

docs/scalability.md:
- Add OWN_GIL to mode comparison table
- Add OWN_GIL Mode section with architecture, usage, process-local envs
- Update subinterp section to clarify shared-GIL behavior
- Add "When to use OWN_GIL" guidance
1 parent 7537c69 commit 7921bbe

3 files changed

Lines changed: 178 additions & 11 deletions


CHANGELOG.md

Lines changed: 38 additions & 0 deletions
@@ -67,6 +67,44 @@
- `examples/bench_async_task.erl` - Erlang benchmark runner
- `priv/test_async_task.py` - Python async task implementation

- **OWN_GIL Context Mode** - True parallel Python execution (Python 3.12+)
  - `py_context:start_link(Id, owngil)` - Create context with dedicated pthread and GIL
  - Each OWN_GIL context runs in its own thread with independent Python GIL
  - Enables true CPU parallelism across multiple Python contexts
  - Full feature support: channels, buffers, callbacks, PIDs, reactor, async tasks
  - `py_context:get_nif_ref/1` - Get NIF reference for low-level operations
  - New benchmark: `examples/bench_owngil.erl` comparing SHARED_GIL vs OWN_GIL
  - See [OWN_GIL Internals](docs/owngil_internals.md) for architecture details

- **Process-Local Environments for OWN_GIL** - Namespace isolation within shared contexts
  - `py_context:create_local_env/1` - Create isolated Python namespace for calling process
  - `py_nif:context_exec(Ref, Code, Env)` - Execute with process-local environment
  - `py_nif:context_eval(Ref, Expr, Locals, Env)` - Evaluate with process-local environment
  - `py_nif:context_call(Ref, Mod, Func, Args, Kwargs, Env)` - Call with process-local environment
  - Multiple Erlang processes can share an OWN_GIL context with isolated namespaces
  - Interpreter ID validation prevents cross-interpreter env usage

- **Per-Process Event Loop Namespaces** - Process isolation for event loop API
  - `py_nif:event_loop_exec/2` - Execute code in calling process's namespace
  - `py_nif:event_loop_eval/2` - Evaluate expression in calling process's namespace
  - Functions defined via exec are callable via `create_task` with `__main__` module
  - Automatic cleanup when Erlang process exits
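The per-process event loop namespace API above can be sketched as follows. This is a hypothetical usage sketch: the argument order of `py_nif:event_loop_exec/2` and the exact `py_event_loop:create_task` call shape are assumptions for illustration, not confirmed signatures.

```erlang
%% Hypothetical sketch -- argument order and create_task shape are assumed.
%% Each Erlang process sees its own namespace in the event loop.
ok = py_nif:event_loop_exec(Loop, <<"counter = 0">>),
ok = py_nif:event_loop_exec(Loop, <<
    "async def bump():\n"
    "    global counter\n"
    "    counter += 1\n"
    "    return counter"
>>),
%% Functions defined via exec live in __main__ and can be scheduled:
{ok, Task} = py_event_loop:create_task(Loop, '__main__', bump, []).
%% When this Erlang process exits, its namespace is cleaned up automatically.
```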
- **OWN_GIL Test Suites** - Feature verification
  - `py_context_owngil_SUITE` - Core OWN_GIL functionality (15 tests)
  - `py_owngil_features_SUITE` - Feature integration (44 tests covering channels,
    buffers, callbacks, PIDs, reactor, async tasks, asyncio, local envs)

### Changed

- **Event Loop Lock Ordering** - GIL acquired before `namespaces_mutex` in cleanup paths
  to prevent ABBA deadlocks with normal execution path

- **Asyncio Compatibility** - Fixed for Python 3.12+ with subinterpreters
  - Thread-local event loop context in `process_ready_tasks`
  - Eager task execution handling for Python 3.12+
  - Deprecation warning fix: use `erlang.run()` instead of `erlang.install()`
## 2.1.0 (2026-03-12)
### Added

docs/owngil_internals.md

Lines changed: 84 additions & 0 deletions
@@ -4,6 +4,50 @@
OWN_GIL mode provides true parallel Python execution using Python 3.12+ per-interpreter GIL (`PyInterpreterConfig_OWN_GIL`). Each OWN_GIL context runs in a dedicated pthread with its own subinterpreter and GIL.

## Quick Start

```erlang
%% Create an OWN_GIL context (requires Python 3.12+)
{ok, Ctx} = py_context:start_link(1, owngil),

%% Basic operations work the same as other modes
{ok, 4.0} = py_context:call(Ctx, math, sqrt, [16], #{}),
ok = py_context:exec(Ctx, <<"x = 42">>),
{ok, 42} = py_context:eval(Ctx, <<"x">>),

%% True parallelism: multiple OWN_GIL contexts execute simultaneously
{ok, Ctx2} = py_context:start_link(2, owngil),
%% Ctx and Ctx2 run in parallel with independent GILs

%% Process-local environments for namespace isolation
{ok, Env} = py_context:create_local_env(Ctx),
CtxRef = py_context:get_nif_ref(Ctx),
ok = py_nif:context_exec(CtxRef, <<"my_var = 'isolated'">>, Env),

%% Cleanup
py_context:stop(Ctx),
py_context:stop(Ctx2).
```
## Feature Compatibility

All major erlang_python features work with OWN_GIL mode:

| Feature | Status | Notes |
|---------|--------|-------|
| `py_context:call/5` | Full | Function calls |
| `py_context:eval/2` | Full | Expression evaluation |
| `py_context:exec/2` | Full | Statement execution |
| Channels (`py_channel`) | Full | Bidirectional messaging |
| Buffers (`py_buffer`) | Full | Zero-copy streaming |
| Callbacks (`erlang.call`) | Partial | Uses thread_worker, not re-entrant |
| PIDs (`erlang.Pid`) | Full | Round-trip serialization |
| Send (`erlang.send`) | Full | Fire-and-forget messaging |
| Reactor (`erlang.reactor`) | Full | FD-based protocols |
| Async Tasks | Full | `py_event_loop:create_task` |
| Asyncio | Full | `asyncio.sleep`, `gather`, etc. |
| Process-local envs | Full | Namespace isolation |
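The "Process-local envs" row can be made concrete with a short sketch. The calls follow the Quick Start signatures (`py_context:create_local_env/1`, `py_context:get_nif_ref/1`, `py_nif:context_exec/3`, `py_nif:context_eval/4`); the two spawned processes and the variable name `who` are illustrative only.

```erlang
%% Two Erlang processes share one OWN_GIL context, but each creates its own
%% process-local env, so identical variable names do not collide.
{ok, Ctx} = py_context:start_link(1, owngil),
CtxRef = py_context:get_nif_ref(Ctx),
Parent = self(),
[spawn(fun() ->
     {ok, Env} = py_context:create_local_env(Ctx),  %% env bound to this process
     ok = py_nif:context_exec(CtxRef, <<"who = ", Name/binary>>, Env),
     {ok, Val} = py_nif:context_eval(CtxRef, <<"who">>, #{}, Env),
     Parent ! {isolated, self(), Val}
 end) || Name <- [<<"'proc_a'">>, <<"'proc_b'">>]].
%% Each process reads back only its own binding of `who`.
```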
## Architecture

@@ -395,6 +439,46 @@ Use shared-GIL (subinterp) when:
- High call frequency
- Resource constraints

## Benchmarking

Run the benchmark to compare modes on your system:

```bash
rebar3 compile && escript examples/bench_owngil.erl
```

Example output:
```
========================================================
          OWN_GIL vs SHARED_GIL Benchmark
========================================================

System Information
------------------
Erlang/OTP: 27
Schedulers: 8
Python:     3.14.0
Subinterp:  true

1. Single Context Latency (1000 calls to math.sqrt)
   Mode        us/call   calls/sec
   ----        -------   ---------
   subinterp   2.5       400000
   owngil      10.2      98000

2. Parallel Throughput (4 contexts, 10000 calls each)
   Mode        total_ms   calls/sec
   ----        --------   ---------
   subinterp   100.5      398000
   owngil      28.3       1415000   <- 3.5x faster

3. CPU-Bound Speedup (fibonacci(30) x 4 contexts)
   Mode        total_ms   speedup
   ----        --------   -------
   subinterp   800.2      1.0x
   owngil      205.1      3.9x      <- near-linear scaling
```
## Safety Mechanisms

### Interpreter ID Validation

docs/scalability.md

Lines changed: 56 additions & 11 deletions
@@ -21,22 +21,61 @@ py:num_executors().

| Mode | Python Version | Parallelism | GIL Behavior | Best For |
|------|----------------|-------------|--------------|----------|
| **free_threaded** | 3.13+ (nogil build) | True N-way | None | Maximum throughput |
| **owngil** | 3.12+ | True N-way | Per-interpreter (dedicated thread) | CPU-bound parallel |
| **subinterp** | 3.12+ | None (shared GIL) | Shared GIL (pool) | High call frequency |
| **multi_executor** | Any | GIL contention | Shared, round-robin | I/O-bound, compatibility |

### Free-Threaded Mode (Python 3.13+)

When running on a free-threaded Python build (compiled with `--disable-gil`), erlang_python executes Python calls directly without any executor routing. This provides maximum parallelism for CPU-bound workloads.

### OWN_GIL Mode (Python 3.12+)

Creates dedicated pthreads with independent GILs for true parallel Python execution. Each OWN_GIL context runs in its own thread, enabling CPU parallelism.

**Architecture:**
- Each context gets a dedicated pthread with its own subinterpreter and GIL
- Requests dispatched via mutex/condvar IPC (not dirty schedulers)
- True parallel execution across multiple OWN_GIL contexts
- Higher per-call latency (~10μs vs ~2.5μs) but better parallelism

**Usage:**
```erlang
%% Create OWN_GIL contexts for parallel execution
{ok, Ctx1} = py_context:start_link(1, owngil),
{ok, Ctx2} = py_context:start_link(2, owngil),

%% These execute in parallel with independent GILs
spawn(fun() -> py_context:call(Ctx1, heavy_compute, run, [Data1], #{}) end),
spawn(fun() -> py_context:call(Ctx2, heavy_compute, run, [Data2], #{}) end).
```

**Process-Local Environments:**
```erlang
%% Multiple processes can share an OWN_GIL context with isolated namespaces
{ok, Env} = py_context:create_local_env(Ctx),
CtxRef = py_context:get_nif_ref(Ctx),
ok = py_nif:context_exec(CtxRef, <<"x = 42">>, Env),
{ok, 42} = py_nif:context_eval(CtxRef, <<"x">>, #{}, Env).
```

**When to use OWN_GIL:**
- CPU-bound Python workloads that benefit from parallelism
- Long-running computations
- When you need true concurrent Python execution
- Scientific computing, ML inference, data processing

**See also:** [OWN_GIL Internals](owngil_internals.md) for architecture details.
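Continuing from the Usage snippet above (`Ctx1`, `Ctx2`), the parallelism can be observed directly with `timer:tc/1`. A minimal sketch, assuming a pure-Python `fib` helper defined via `py_context:exec/2` and that exec'd definitions are reachable through the `__main__` module (both assumptions for illustration):

```erlang
%% Define a CPU-bound function in each context (illustrative helper `fib`).
Code = <<"def fib(n):\n"
         "    return n if n < 2 else fib(n-1) + fib(n-2)">>,
ok = py_context:exec(Ctx1, Code),
ok = py_context:exec(Ctx2, Code),

%% Time two fib(30) calls running concurrently on independent GILs.
Parent = self(),
{Micros, _} = timer:tc(fun() ->
    Pids = [spawn(fun() ->
                {ok, _} = py_context:call(C, '__main__', fib, [30], #{}),
                Parent ! {done, self()}
            end) || C <- [Ctx1, Ctx2]],
    [receive {done, P} -> ok end || P <- Pids]
end),
io:format("2x fib(30) in parallel: ~p ms~n", [Micros div 1000]).
```

With shared-GIL subinterpreters the same wall-clock time would be roughly the sum of both calls; with OWN_GIL contexts it approaches the time of a single call.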
### Sub-interpreter Mode (Python 3.12+)

Uses Python's sub-interpreter feature with a shared GIL pool. Multiple contexts share the GIL but have isolated namespaces. Best for high call frequency with low latency.

**Architecture:**
- Pool of pre-created subinterpreters with shared GIL
- Execution on dirty schedulers with `PyThreadState_Swap`
- Lower latency (~2.5μs) but no true parallelism
- Best throughput for short operations

**Note:** Each sub-interpreter has isolated state. Use the [Shared State](#shared-state) API to share data between workers.
4281

@@ -74,11 +113,17 @@ Runs N executor threads that share the GIL. Requests are distributed round-robin
- You're running CPU-bound workloads
- Memory efficiency is important

**Use OWN_GIL (Python 3.12+) when:**
- You need true CPU parallelism across Python contexts
- Running long computations (ML inference, data processing)
- Workload benefits from multiple independent Python interpreters
- You can tolerate higher per-call latency for better throughput

**Use Subinterpreters/Shared-GIL (Python 3.12+) when:**
- You need high call frequency with low latency
- Individual operations are short
- You want namespace isolation without thread overhead
- Memory efficiency is important (shared interpreter pool)

**Use Multi-Executor (Python < 3.12) when:**
- Running on older Python versions
