NVIDIA/cutlass

C++ 9.3k stars

CUDA Templates and Python DSLs for High-Performance Linear Algebra

README badge: [![ngmi](https://ngmi.review/badge/NVIDIA/cutlass.svg)](https://ngmi.review/repo/NVIDIA/cutlass)
640 Merged PRs
12 days Avg Merge Time
0m Fastest PR
1 year Slowest PR
#971 Global Speed Rank

PR Size Analysis

Lines changed (additions + deletions) vs. review outcomes.

[Charts: PRs by size; Avg review time (hrs); Clean approval rate (%)]

Recent Merged PRs

| # | Title | Author | Time | Reviews | Blocks |
|---|-------|--------|------|---------|--------|
| #3041 | remove mixed_input_fmha_prefill.py | @keithzzzzz | 5.7h | 1 | |
| #3009 | [CuTeDSL] implment a cta-level norm example (both layernorm and rmsnorm) | @yingluosanqian | 6 days | 3 | |
| #3027 | Replace fence proxy to the latest routine code in examples/distributed/all_reduce_tma.py | @aragorn-guan | 1 day | 1 | |
| #2971 | [CuTeDSL]fix tvm-ffi path in from_dlpack | @rsmallblue | 23 days | 5 | |
| #3032 | v4.4 tag release update. | @Junkai-Wu | 1.1h | 1 | |
| #3004 | [CuTeDSL] Add sub_packed_f32x2 operation | @tridao | 8 days | 1 | |
| #3021 | [Cute-DSL] Add option for issue_clc_query without multicast | @tridao | 1 day | 1 | |
| #3022 | [Cute-DSL] Add cute.arch.fmin by calling nvvm | @tridao | 1 day | 1 | |
| #2970 | [CuTeDSL] Distributed example, using TMA load to access remote memory rank-by-rank, reducing in cta, broadcast result to all ranks by multimem TMA store | @aragorn-guan | 20 days | 1 | |
| #2919 | Refactor binary_op functions to remove unused result parameter | @pbelevich | 1 month | 2 | |
| #2999 | v4.4 release update v2. | @Junkai-Wu | 45m | 1 | |
| #2995 | [CuTeDSL] Fix: SM100 block-scale gemm overlapping accumulator | @huanghua1994 | 58m | 1 | |
| #2988 | fix performance inssues in cute-dsl examples for 4.4-ctk13.1 release | @dongxiao92 | 2 days | 1 | |
| #2990 | fix performance regression in cute-dsl examples for 4.4-ctk13.1 release | @myu-guo | 1 day | 1 | |
| #2985 | [CUTEDSL] Update example code nvvm API usage from nvvm enum to str | @XiaoSongXS | 1 day | 1 | |
| #2969 | [NVVM API] Update cutedsl nvvm api change | @XiaoSongXS | 5 days | 1 | |
| #2891 | docs: note when DSL dumps are populated | @ColinPeppler | 1 month | 2 | |
| #2979 | v4.4 update. | @Junkai-Wu | 2.5h | 1 | |
| #2965 | [Bug Fix]Set NumSplitsM to 1 when TileShapeM < 128 in sm90 fp8 blockwise scaling CollectiveMma | @HydraQYH | 3 days | 3 | |
| #2945 | Fix out-of-bounds TMA access in wgmma_tma_sm90 tutorial | @Johnsonms | 12 days | 1 | |