JINO-ROHIT/kernels

kernels

writing really fast kernels

note: all the kernels are written for an rtx 4060 ti (16GB VRAM)

total CUDA device: 1
card: NVIDIA GeForce RTX 4060 Ti
CUDA compute capability: 8.9
total global memory: 15.5656 GB
clock rate: 2595 MHz
l2 cache size: 33554432 bytes
total constant memory: 65536 bytes
total shared memory per block: 49152 bytes
total registers per block: 65536
warp size: 32
max threads per SM: 1536
max size of each dimension in a block: 1024 x 1024 x 64
max size of each dimension in a grid: 2147483647 x 65535 x 65535
SM: 34
Tensor Cores: 34 * 4 = 136
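
The figures above are easier to reason about once converted into the units used when sizing a kernel launch. The following is a minimal sketch that only re-derives numbers from the device report; the constant names and the `summarize` helper are hypothetical, and the factor of 4 tensor cores per SM is taken from the "34 * 4" line above rather than queried from the device.

```python
# Values copied from the device report above (NVIDIA GeForce RTX 4060 Ti).
SM_COUNT = 34
WARP_SIZE = 32
MAX_THREADS_PER_SM = 1536
L2_CACHE_BYTES = 33554432
CONSTANT_MEMORY_BYTES = 65536
SHARED_MEMORY_PER_BLOCK_BYTES = 49152
TENSOR_CORES_PER_SM = 4  # assumption: taken from the "34 * 4" line in the report


def summarize():
    """Derive launch-planning figures from the raw device properties."""
    return {
        # 1536 threads / 32 threads per warp
        "max_warps_per_sm": MAX_THREADS_PER_SM // WARP_SIZE,
        # upper bound on threads resident across the whole device
        "max_resident_threads": SM_COUNT * MAX_THREADS_PER_SM,
        "tensor_cores": SM_COUNT * TENSOR_CORES_PER_SM,
        "l2_cache_mib": L2_CACHE_BYTES / 2**20,
        "constant_memory_kib": CONSTANT_MEMORY_BYTES / 2**10,
        "shared_memory_per_block_kib": SHARED_MEMORY_PER_BLOCK_BYTES / 2**10,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value}")
```

For example, a block of 256 threads occupies 8 warps, so at most 6 such blocks fit within the 48 resident warps of one SM, before register and shared memory limits are considered.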

profiler

profile using NVIDIA Nsight Compute (GUI)

sudo -E ncu-ui

for CuTe DSL kernels, run ncu against the python interpreter

sudo ncu $(which python3) <03_naive_vadd.py>
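
The command above passes the absolute interpreter path to `ncu`, presumably because `sudo` may not see the same `PATH` as the user shell. As a rough sketch, the same invocation can be assembled from Python. The `build_ncu_command` helper and its parameters are hypothetical and not part of this repo; `-o` is the standard `ncu` option for writing a report file, and the script name is the example used above.

```python
import sys


def build_ncu_command(script, python=None, report=None, use_sudo=True):
    """Build an argv list equivalent to: sudo ncu $(which python3) <script>."""
    # sys.executable is the absolute path of the running interpreter,
    # which plays the role of $(which python3) in the shell command.
    python = python or sys.executable
    cmd = ["sudo"] if use_sudo else []
    cmd.append("ncu")
    if report is not None:
        # Optional: write the profile to <report>.ncu-rep for later viewing in ncu-ui.
        cmd += ["-o", report]
    cmd += [python, script]
    return cmd


if __name__ == "__main__":
    # Only prints the command; pass the list to subprocess.run to execute it.
    print(" ".join(build_ncu_command("03_naive_vadd.py")))
```

Building the command as a list avoids shell quoting issues if it is later handed to `subprocess.run`.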
