kernels

writing really fast kernels

note: all the kernels are written for an RTX 4060 Ti (16 GB VRAM)

total CUDA device: 1
card: NVIDIA GeForce RTX 4060 Ti
CUDA compute capability: 8.9
total global memory: 15.5656 GB
clock rate: 2595 MHz
l2 cache size: 33554432
total constant memory: 65536
total shared memory per block: 49152
total registers per block: 65536
warp size: 32
max threads per SM: 1536
max size of each dimension in a block: 1024 x 1024 x 64
max size of each dimension in a grid: 2147483647 x 65535 x 65535
SM: 34
Tensor Cores: 34 * 4 = 136
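
A few derived limits follow directly from the numbers above, in the same way the Tensor Core count is worked out. The snippet below is a minimal sketch of that arithmetic, assuming the memory sizes in the device output are in bytes (as `cudaDeviceProp` reports them). The constant names and the `derived_limits` helper are made up for illustration and are not part of this repo.

```python
# Values copied from the device query output above.
# Assumption: all memory sizes are reported in bytes.
SM_COUNT = 34
TENSOR_CORES_PER_SM = 4
WARP_SIZE = 32
MAX_THREADS_PER_SM = 1536
L2_CACHE_BYTES = 33_554_432
CONSTANT_MEMORY_BYTES = 65_536
SHARED_MEMORY_PER_BLOCK_BYTES = 49_152


def derived_limits():
    """Return quantities derived from the raw device properties."""
    return {
        # Same calculation as the "Tensor Cores" line in the device output.
        "tensor_cores": SM_COUNT * TENSOR_CORES_PER_SM,
        # Upper bound on warps that can be resident on one SM.
        "max_warps_per_sm": MAX_THREADS_PER_SM // WARP_SIZE,
        # Upper bound on threads resident across the whole device.
        "max_resident_threads": SM_COUNT * MAX_THREADS_PER_SM,
        # Sizes converted to binary units for readability.
        "l2_cache_mib": L2_CACHE_BYTES / 2**20,
        "constant_memory_kib": CONSTANT_MEMORY_BYTES / 2**10,
        "shared_memory_per_block_kib": SHARED_MEMORY_PER_BLOCK_BYTES / 2**10,
    }


if __name__ == "__main__":
    for name, value in derived_limits().items():
        print(f"{name}: {value}")
```

The warps-per-SM figure is the one to compare against when reading occupancy numbers in the profiler.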

profiler

profile using NVIDIA Nsight Compute (GUI)

sudo -E ncu-ui

for CuTe DSL kernels (command-line profiler)

sudo ncu $(which python3) <03_naive_vadd.py>
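
The command above passes the full interpreter path to `ncu`, most likely because `sudo` may resolve `python3` to a different interpreter than the one in the active environment. The sketch below builds the same command line from Python, using `sys.executable` in place of `$(which python3)`. The `ncu_command` helper and its parameters are hypothetical and not part of this repo; the only `ncu` option used is `-o`, which writes a report file that `ncu-ui` can open.

```python
import sys


def ncu_command(script, report=None, use_sudo=True):
    """Build the argv list for profiling a Python script with ncu.

    Hypothetical helper for illustration only.
    """
    cmd = ["sudo"] if use_sudo else []
    cmd.append("ncu")
    if report is not None:
        # -o tells ncu to export the results to a report file.
        cmd += ["-o", report]
    # sys.executable is the absolute path of the running interpreter,
    # which plays the role of $(which python3) in the shell command.
    cmd += [sys.executable, script]
    return cmd


if __name__ == "__main__":
    print(" ".join(ncu_command("03_naive_vadd.py")))
```

The returned list can be handed to `subprocess.run` unchanged if the profiling run should be launched from a script.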

folders and files

assets
cuda
cute_dsl
cute_kernels
cutlass @ 6b3e607
hopper
notes
ptx
tensara_solutions/easy
.gitignore
.gitmodules
LICENSE
README.md
benchmark.cu
device.cu
pyproject.toml
uv.lock
