[POC] Minimal Kernel Autotuning Support #96

### Summary

Add **minimal kernel autotuning support** to TornadoVM to automatically evaluate and select efficient execution configurations (e.g., work-group and grid sizes) at runtime, inspired by Triton’s `triton.autotune`.

---

### Scope (Intentionally Small)

This sub-issue focuses on a **first, minimal autotuning capability**:

- Support autotuning for **work-group and grid dimensions only**
- Limit autotuning to **single-kernel task graphs**
- Perform autotuning **once per kernel per device**
- Cache the best configuration in memory (no persistence required)

This is intended as a foundation for future extensions (tiling, memory layouts, heuristics).
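
The "once per kernel per device" rule and the in-memory cache could be sketched roughly as below. This is a minimal illustration in plain Java, not existing TornadoVM code: `AutotuneCacheSketch`, `Key`, `Config`, and `getOrTune` are hypothetical names, and kernels and devices are identified by plain strings purely for brevity.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical sketch: none of these types exist in TornadoVM.
final class AutotuneCacheSketch {

    /** Cache key: one entry per (kernel, device) pair. */
    record Key(String kernelName, String deviceName) {}

    /** A tuned configuration: global and local work sizes per dimension. */
    record Config(long[] globalSize, long[] localSize) {}

    // In-memory only; the contents are lost when the JVM exits.
    private final Map<Key, Config> cache = new ConcurrentHashMap<>();

    /** Returns the cached configuration, invoking the tuner only on the first request for a key. */
    Config getOrTune(String kernelName, String deviceName, Supplier<Config> tuner) {
        return cache.computeIfAbsent(new Key(kernelName, deviceName), k -> tuner.get());
    }

    /** Number of (kernel, device) pairs tuned so far. */
    int size() {
        return cache.size();
    }
}
```

`ConcurrentHashMap.computeIfAbsent` runs the tuner at most once per key even under concurrent calls, at the cost of blocking other updates to the same bin while tuning runs. Whether that trade-off is acceptable is an open design question.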

---

### Proposed Functionality

1. **Configuration Set**
   - Allow a small, user-defined set of candidate execution configurations
   - Example: different `(globalSize, localSize)` combinations

2. **Runtime Benchmarking**
   - On the first execution of a kernel on a device, run each candidate configuration and measure its execution time

3. **Selection & Caching**
   - Select the fastest configuration
   - Cache the result for subsequent executions on the same device

4. **Transparent Integration**
   - The autotuned configuration replaces the default execution configuration without requiring user-side changes
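
Steps 1 to 3 above might look roughly like the following. This is a sketch under stated assumptions rather than a proposed implementation: `AutotunerSketch`, `Config`, `selectFastest`, and `timeOnce` are hypothetical names, the configuration is one-dimensional for brevity, and the launch passed to `timeOnce` is assumed to block until the kernel has finished (otherwise only the enqueue cost would be measured).

```java
import java.util.List;
import java.util.function.ToLongFunction;

// Hypothetical sketch: the names below are illustrative, not TornadoVM API.
final class AutotunerSketch {

    /** One candidate execution configuration (1D for brevity). */
    record Config(int globalSize, int localSize) {}

    /**
     * Passes every candidate to the timing function once and returns the one
     * with the smallest reported time. Ties go to the earlier candidate.
     */
    static <C> C selectFastest(List<C> candidates, ToLongFunction<C> timeNanos) {
        if (candidates.isEmpty()) {
            throw new IllegalArgumentException("no candidate configurations");
        }
        C best = null;
        long bestTime = Long.MAX_VALUE;
        for (C candidate : candidates) {
            long t = timeNanos.applyAsLong(candidate);
            if (best == null || t < bestTime) {
                best = candidate;
                bestTime = t;
            }
        }
        return best;
    }

    /** Wall-clock time of a single launch, in nanoseconds. */
    static long timeOnce(Runnable launch) {
        long start = System.nanoTime();
        launch.run();
        return System.nanoTime() - start;
    }
}
```

A caller would combine the two helpers, for example `selectFastest(candidates, c -> timeOnce(() -> launch(c)))`, where `launch` stands in for whatever actually executes the kernel with configuration `c`. A single timed run per candidate matches the proposal but is sensitive to noise and first-run overheads such as compilation.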

---

### Motivation

Performance in TornadoVM is sensitive to execution parameters and GPU architecture.  
Today, finding good configurations is manual and hardware-specific.

Even a limited autotuning mechanism would:
- Reduce manual tuning effort
- Improve out-of-the-box performance
- Benefit GPU-heavy workloads (e.g., attention, GEMM) used in **GPULlama3**

---

### Example (Hypothetical)

A kernel is executed with candidate local sizes:
- `(16,16)`
- `(32,8)`
- `(8,32)`

TornadoVM benchmarks each candidate once, selects the fastest, and reuses it for all future executions of that kernel on the same device.
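
One detail worth noting about these three candidates: each contains 256 work-items, so they differ only in shape, and a heuristic based on work-group size alone could not rank them. The sketch below checks this. The 1024 x 1024 global size is an assumption made only for illustration, as is the requirement that the local size divide the global size evenly (OpenCL 1.x imposes it; other backends may not). `CandidateShapes` and `LocalSize` are hypothetical names.

```java
import java.util.List;

// Hypothetical sketch; the global size used with these helpers is an assumption.
final class CandidateShapes {

    /** A 2D local (work-group) size. */
    record LocalSize(int x, int y) {
        int workItems() {
            return x * y;
        }
    }

    /** The three candidates from the example. */
    static final List<LocalSize> CANDIDATES =
            List.of(new LocalSize(16, 16), new LocalSize(32, 8), new LocalSize(8, 32));

    /** True if the local size evenly divides the global size in both dimensions. */
    static boolean divides(LocalSize local, int globalX, int globalY) {
        return globalX % local.x() == 0 && globalY % local.y() == 0;
    }

    /** Number of work-groups launched for the given global size. */
    static int workGroups(LocalSize local, int globalX, int globalY) {
        return (globalX / local.x()) * (globalY / local.y());
    }
}
```

For a 1024 x 1024 global size, all three candidates pass `divides` and launch the same number of work-groups, so any performance difference between them comes from how the shape maps onto the hardware and the kernel's memory access pattern, which is what the benchmarking step measures.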

---

### Out of Scope (For This Sub-Issue)

- Persistent caching across JVM runs
- Large configuration search spaces
- Compiler-driven autotune generation
- Memory tiling or algorithmic variants

---

### Expected Outcome

A small but functional autotuning mechanism that demonstrates feasibility and provides immediate performance benefits, serving as a stepping stone toward a full Triton-like autotuning framework.
