[POC] Minimal Kernel Autotuning Support #96

@mikepapadim

Summary

Add minimal kernel autotuning support to TornadoVM to automatically evaluate and select efficient execution configurations (e.g., work-group and grid sizes) at runtime, inspired by Triton’s triton.autotune.


Scope (Intentionally Small)

This sub-issue focuses on a first, minimal autotuning capability:

  • Support autotuning for work-group and grid dimensions only
  • Limit autotuning to single-kernel task graphs
  • Perform autotuning once per kernel per device
  • Cache the best configuration in-memory (no persistence required)

This is intended as a foundation for future extensions (tiling, memory layouts, heuristics).
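The "once per kernel per device" and in-memory caching points above could be sketched as follows. This is only an illustration under stated assumptions: `TuningCache`, `Key`, `LaunchConfig`, and `getOrTune` are hypothetical names, not existing TornadoVM API, and kernels and devices are identified by plain strings for simplicity.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical sketch: none of these types exist in TornadoVM.
final class TuningCache {
    /** Cache key: one entry per (kernel, device) pair. */
    record Key(String kernelId, String deviceId) {}

    /** A tuned launch configuration: global and local work sizes per dimension. */
    record LaunchConfig(long[] globalSize, long[] localSize) {}

    // In-memory only: the contents are lost when the JVM exits.
    private final Map<Key, LaunchConfig> best = new ConcurrentHashMap<>();

    /** Returns the cached configuration; the tuner runs only on the first request for a key. */
    LaunchConfig getOrTune(String kernelId, String deviceId, Supplier<LaunchConfig> tuner) {
        return best.computeIfAbsent(new Key(kernelId, deviceId), k -> tuner.get());
    }

    /** Number of (kernel, device) pairs tuned so far. */
    int size() {
        return best.size();
    }

    public static void main(String[] args) {
        TuningCache cache = new TuningCache();
        Supplier<LaunchConfig> tuner = () -> {
            System.out.println("tuning...");
            return new LaunchConfig(new long[] {1024, 1024}, new long[] {16, 16});
        };
        cache.getOrTune("matmul", "gpu0", tuner); // runs the tuner
        cache.getOrTune("matmul", "gpu0", tuner); // served from the cache
        System.out.println("cached entries: " + cache.size());
    }
}
```

`ConcurrentHashMap.computeIfAbsent` keeps the tune-once guarantee even if several threads request the same kernel concurrently, at the cost of blocking other updates to that key while tuning runs.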


Proposed Functionality

  1. Configuration Set

    • Allow a small, user-defined set of candidate execution configurations
    • Example: different (globalSize, localSize) combinations
  2. Runtime Benchmarking

    • On first execution, run each configuration and measure execution time
  3. Selection & Caching

    • Select the fastest configuration
    • Cache the result for subsequent executions on the same device
  4. Transparent Integration

    • The autotuned configuration replaces the default execution configuration, with no user-side changes required
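As a rough illustration of steps 1 to 3, the benchmarking and selection loop might look like the sketch below. `MinimalAutotuner`, `Candidate`, `selectFastest`, and `timeNanos` are hypothetical names introduced here, not TornadoVM API. The kernel launch is abstracted behind a timing function, so the sketch does not assume how TornadoVM would run or profile a configuration.

```java
import java.util.List;
import java.util.function.ToLongFunction;

// Hypothetical sketch: none of these types exist in TornadoVM.
final class MinimalAutotuner {
    /** One candidate execution configuration for a 2D kernel. */
    record Candidate(int globalX, int globalY, int localX, int localY) {}

    /**
     * Runs every candidate once through runAndTimeNanos and returns the one
     * with the smallest reported time. On a tie, the earliest candidate wins.
     */
    static Candidate selectFastest(List<Candidate> candidates,
                                   ToLongFunction<Candidate> runAndTimeNanos) {
        if (candidates.isEmpty()) {
            throw new IllegalArgumentException("no candidate configurations");
        }
        Candidate best = null;
        long bestNanos = Long.MAX_VALUE;
        for (Candidate c : candidates) {
            long elapsed = runAndTimeNanos.applyAsLong(c);
            if (best == null || elapsed < bestNanos) {
                best = c;
                bestNanos = elapsed;
            }
        }
        return best;
    }

    /** Wall-clock timing of an arbitrary launch; a device-side profiler may be more accurate. */
    static long timeNanos(Runnable launch) {
        long start = System.nanoTime();
        launch.run();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        List<Candidate> candidates = List.of(
                new Candidate(1024, 1024, 16, 16),
                new Candidate(1024, 1024, 32, 8),
                new Candidate(1024, 1024, 8, 32));
        // Stand-in cost function: simulated timings, not real measurements.
        Candidate best = selectFastest(candidates, c -> (long) c.localX() * 10 + c.localY());
        System.out.println("Selected: " + best);
    }
}
```

In a real integration, the timing function would launch the kernel with the candidate's sizes, for example `c -> timeNanos(() -> launch(c))`, where `launch` is whatever mechanism the runtime provides. The selected candidate would then be stored in the per-device cache described in step 3.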

Motivation

Performance in TornadoVM is sensitive to execution parameters and GPU architecture.
Today, finding good configurations is manual and hardware-specific.

Even a limited autotuning mechanism would:

  • Reduce manual tuning effort
  • Improve out-of-the-box performance
  • Benefit GPU-heavy workloads, such as the attention and GEMM kernels used in GPULlama3

Example (Hypothetical)

A kernel is executed with candidate local sizes:

  • (16,16)
  • (32,8)
  • (8,32)

TornadoVM benchmarks each candidate once, selects the fastest, and reuses it for all future executions of that kernel on the same device.
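Before benchmarking, it may be worth discarding candidates the device cannot launch. The sketch below filters the three local sizes from the example. It assumes a 1024x1024 global size and a maximum work-group size of 256, both invented for illustration, and it assumes that each global dimension must be a multiple of the matching local dimension. That constraint is common in OpenCL-style backends but is not confirmed here for every TornadoVM backend. `CandidateFilter` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: not TornadoVM API.
final class CandidateFilter {
    /**
     * A local size is treated as valid when it has the same rank as the global size,
     * evenly divides it in every dimension, and its total work-item count
     * does not exceed maxWorkGroupSize.
     */
    static boolean isValid(int[] global, int[] local, int maxWorkGroupSize) {
        if (global.length != local.length) {
            return false;
        }
        long groupSize = 1;
        for (int d = 0; d < local.length; d++) {
            if (local[d] <= 0 || global[d] % local[d] != 0) {
                return false;
            }
            groupSize *= local[d];
        }
        return groupSize <= maxWorkGroupSize;
    }

    /** Keeps only the local sizes that pass isValid, preserving their order. */
    static List<int[]> validCandidates(int[] global, List<int[]> locals, int maxWorkGroupSize) {
        List<int[]> kept = new ArrayList<>();
        for (int[] local : locals) {
            if (isValid(global, local, maxWorkGroupSize)) {
                kept.add(local);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        int[] global = {1024, 1024};            // assumed problem size
        List<int[]> locals = List.of(
                new int[] {16, 16},
                new int[] {32, 8},
                new int[] {8, 32});
        int maxWorkGroupSize = 256;             // assumed device limit
        List<int[]> kept = validCandidates(global, locals, maxWorkGroupSize);
        System.out.println("valid candidates: " + kept.size());
    }
}
```

All three candidates in the example contain 256 work-items per group, so they differ only in shape. This makes them a reasonable minimal search space for comparing memory access patterns rather than occupancy.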


Out of Scope (For This Sub-Issue)

  • Persistent caching across JVM runs
  • Large configuration search spaces
  • Compiler-driven autotune generation
  • Memory tiling or algorithmic variants

Expected Outcome

A small but functional autotuning mechanism that demonstrates feasibility and provides immediate performance benefits, serving as a stepping stone toward a full Triton-like autotuning framework.

Labels: enhancement (New feature or request)