writing really fast kernels
note: all the kernels are written for an RTX 4060 Ti (16 GB VRAM)
total CUDA devices: 1
card: NVIDIA GeForce RTX 4060 Ti
CUDA compute capability: 8.9
total global memory: 15.5656 GB
clock rate: 2595 MHz
l2 cache size: 33554432 bytes (32 MiB)
total constant memory: 65536 bytes (64 KiB)
total shared memory per block: 49152 bytes (48 KiB)
total registers per block: 65536
warp size: 32
max threads per SM: 1536
max size of each dimension in a block: 1024 x 1024 x 64
max size of each dimension in a grid: 2147483647 x 65535 x 65535
SM: 34
Tensor Cores: 34 * 4 = 136
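a rough sketch of what the numbers above imply for launch configs. it only uses values from the device dump; the function names are made up for illustration, and the occupancy figure is an upper bound from the thread limit alone (it ignores register use, shared memory use, and the hardware cap on resident blocks per SM, none of which are listed above).

```python
# Values copied from the device dump above (RTX 4060 Ti, compute capability 8.9).
WARP_SIZE = 32
MAX_THREADS_PER_SM = 1536
SM_COUNT = 34


def warps_per_sm():
    """Maximum number of warps that can be resident on one SM."""
    return MAX_THREADS_PER_SM // WARP_SIZE


def max_resident_threads():
    """Maximum number of threads resident on the whole GPU at once."""
    return MAX_THREADS_PER_SM * SM_COUNT


def resident_blocks_per_sm(block_size):
    """Blocks of `block_size` threads that fit on one SM by thread count only.

    Assumption: registers, shared memory, and the per-SM block cap are ignored,
    so the real number can be lower.
    """
    return MAX_THREADS_PER_SM // block_size


def thread_occupancy(block_size):
    """Fraction of an SM's thread slots filled by whole blocks of `block_size`."""
    return resident_blocks_per_sm(block_size) * block_size / MAX_THREADS_PER_SM


if __name__ == "__main__":
    print("warps per SM:", warps_per_sm())
    print("max resident threads:", max_resident_threads())
    for block_size in (128, 256, 512, 768, 1024):
        print(block_size, resident_blocks_per_sm(block_size),
              round(thread_occupancy(block_size), 3))
```

the point of the sketch: 1536 is not a multiple of 1024, so a 1024-thread block can fill at most two thirds of an SM's thread slots, while 256, 512, or 768 threads per block divide 1536 evenly.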
profile using NVIDIA Nsight Compute
sudo -E ncu-ui
for CuTe DSL kernels
sudo ncu $(which python3) 03_naive_vadd.py