DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

(github.com/deepseek-ai)

2 points by GN⁺ 2025-02-27 | 1 comment

DeepGEMM is a high-performance tensor core kernel library that brings the main compute primitives of modern LLMs (GEMM, fused MoE, MQA scoring, HyperConnection, and others) together in a single CUDA codebase.
All kernels are compiled at runtime as lightweight JIT modules, so no CUDA compilation is needed at install time; the library requires a C++20 compiler, the CUDA Toolkit, PyTorch, and CUTLASS 4.0 or newer.
It borrows some concepts from CUTLASS and CuTe but avoids heavy reliance on templates and algebraic structures; with only a limited number of core kernel functions, it is designed to make learning NVIDIA GPU kernel optimization easier.
Supported features include FP8, FP4, and BF16 GEMM, grouped GEMM, the MQA logits kernel for DeepSeek v3.2, and Mega MoE, which overlaps communication with compute; memory layout constraints differ between SM90 and SM100.
Despite the lightweight design, it aims to match or exceed expert-tuned libraries across a variety of matrix shapes, and one update reports up to 1550 TFLOPS on H800.

DeepGEMM's purpose and design

DeepGEMM is a tensor core kernel library that integrates the key compute primitives used by the latest large language models into a single CUDA codebase.
- GEMM: FP8, FP4, BF16
- Fused MoE with communication overlap: Mega MoE
- MQA scoring for the lightning indexer
- HyperConnection (HC)
All kernels are compiled at runtime as lightweight Just-In-Time (JIT) modules.
- No CUDA compilation is needed during installation
It borrows some concepts from CUTLASS and CuTe.
- But it does not rely heavily on templates or algebraic structures
- It keeps the codebase simple by limiting the number of core kernel functions
Despite the lightweight design, the project states that it matches or outperforms expert-tuned libraries across many matrix shapes.

Major updates

The April 16, 2026 update includes Mega MoE, FP8xFP4 GEMM, the FP4 Indexer, PDL, faster JIT compilation, and more.
- Details are in #304
- The Mega MoE benchmark is in #316
The September 28, 2025 update adds a weighted ReLU MQA logits scoring kernel for the DeepSeek v3.2 lightning indexer.
- Details are in #200
The July 20, 2025 update supports both SM90 and SM100 and is a full refactor around a JIT CPP module with low CPU overhead.
- NVRTC and post-compilation SASS optimization are disabled in this release
- NVRTC is listed as to be supported later
- NVCC 12.9 performs FFMA interleaving automatically, so post-compilation optimization is no longer supported
- Details are in #112
The May 14, 2025 update adds weight gradient kernels for dense and MoE backward passes.
- Details are in #95
The May 7, 2025 update adds NVRTC support with up to 10x faster compilation.
- Enable it with DG_JIT_USE_NVRTC=1
- Performance may drop in some cases
- Details are in #94
The April 18, 2025 update reaches up to 1550 TFLOPS on H800.
- Related items: #74, #78, #81, #86, 340d988

Requirements and installation flow

The runtime environment needs a GPU with the NVIDIA SM90 or SM100 architecture.
Software requirements:
- Python 3.8 or newer
- A compiler with C++20 support
- CUDA Toolkit
  - SM90: CUDA 12.3 or newer (CUDA 12.9 or newer is strongly recommended for best performance)
  - SM100: CUDA 12.9 or newer
- PyTorch 2.1 or newer
- CUTLASS 4.0 or newer
- {fmt} library
For development, clone the repository with its submodules, then run develop.sh to link the required includes and build the CPP JIT module.
To install, run install.sh and then import deep_gemm in your Python project.

GEMM interface and layout constraints

DeepGEMM's GEMM kernels follow the naming convention D = C + A @ B.
Input shapes follow the NT layout.
- fp8_gemm_nt computes D = C + A @ B.T
The SM90 implementation supports only the NT memory layout.
- This corresponds to the row-major, col-major combination
The SM100 implementation supports all of the NT, TN, NN, and TT memory layouts.
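To make the NT convention concrete, here is a minimal pure-Python sketch of what `D = C + A @ B.T` means in terms of shapes. It is only a readability aid: the function name `gemm_nt_reference` is hypothetical, it works on nested lists rather than FP8 tensors, and it ignores scaling factors entirely.

```python
def gemm_nt_reference(a, b, c=None):
    """Return D = C + A @ B.T for row-major nested lists.

    a has shape [m, k]; b has shape [n, k] (so B.T is [k, n]); c has shape [m, n].
    Hypothetical helper for illustration, not part of DeepGEMM.
    """
    m, k, n = len(a), len(a[0]), len(b)
    if any(len(row) != k for row in b):
        raise ValueError("A and B must share the K dimension")
    # Start from a copy of C, or from zeros when no C is given.
    d = [[0.0] * n for _ in range(m)] if c is None else [list(row) for row in c]
    for i in range(m):
        for j in range(n):
            # Row i of A dotted with row j of B, which is column j of B.T.
            d[i][j] += sum(a[i][p] * b[j][p] for p in range(k))
    return d


a = [[1.0, 2.0], [3.0, 4.0]]               # [m=2, k=2]
b = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # [n=3, k=2]
c = [[10.0, 10.0, 10.0], [0.0, 0.0, 0.0]]  # [m=2, n=3]
d = gemm_nt_reference(a, b, c)
```

Note that both operands are indexed along K in their last dimension, which is why the NT form suits a row-major A paired with a B stored as [n, k].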
On both architectures, the LHS scaling factors must be in a TMA-aligned, transposed layout.
- SM90 requires scaling factors in FP32 format
- SM100 requires the packed UE8M0 format, with four UE8M0 values packed into a single torch.int
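The packed format can be pictured with a small sketch. UE8M0 is an unsigned, exponent-only 8-bit code, so each scale is a power of two, 2^(code - 127). The helper names below are hypothetical, rounding the scale up to the next power of two is one possible choice, and the little-endian byte order inside the 32-bit word is an assumption rather than something the summary specifies.

```python
import math

UE8M0_BIAS = 127  # exponent bias of the 8-bit exponent-only format


def to_ue8m0(scale):
    """Encode a positive scale as the smallest power of two >= scale (assumed rounding)."""
    if scale <= 0.0:
        raise ValueError("scale must be positive")
    mantissa, exponent = math.frexp(scale)  # scale == mantissa * 2**exponent, 0.5 <= mantissa < 1
    if mantissa == 0.5:                     # exact power of two: 0.5 * 2**exponent
        exponent -= 1
    code = exponent + UE8M0_BIAS
    if not 0 <= code <= 254:                # 255 is reserved, so keep to 0..254
        raise ValueError("scale out of UE8M0 range")
    return code


def from_ue8m0(code):
    return 2.0 ** (code - UE8M0_BIAS)


def pack4(codes):
    """Pack four 8-bit codes into one signed 32-bit integer (byte order is an assumption)."""
    c0, c1, c2, c3 = codes
    word = c0 | (c1 << 8) | (c2 << 16) | (c3 << 24)
    return word - (1 << 32) if word >= (1 << 31) else word  # map to the signed int32 range


def unpack4(word):
    word &= 0xFFFFFFFF
    return [(word >> shift) & 0xFF for shift in (0, 8, 16, 24)]


codes = [to_ue8m0(s) for s in (1.0, 0.5, 2.0, 0.3)]
packed = pack4(codes)
```

Storing only exponents means applying a scale is a cheap exponent adjustment, at the cost of rounding every scale to a power of two.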
Work such as transposing inputs or casting to FP8 must be handled separately by the user.
- The library ships simple PyTorch utility functions, but they may be slow
- The main focus is GEMM kernel optimization

Dense and Grouped GEMM

Basic non-grouped FP8 GEMM uses the fp8_gemm_{nt, nn, tn, tt} functions.
Unlike the traditional grouped GEMM in CUTLASS, grouped GEMM with a contiguous layout groups only the M axis.
- N and K must stay fixed
- The design targets MoE models, where experts share the same shape
In the training forward pass or inference prefill, each expert may process a different number of tokens.
- Concatenating those tokens into a single tensor is what is called the contiguous layout
- Each expert segment must be aligned to the GEMM M block size
- The alignment requirement is queried with get_mk_alignment_for_contiguous_layout()
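The alignment rule can be sketched in plain Python. The helper names are hypothetical, and the alignment of 4 is chosen only to keep the numbers readable; in practice the value would come from get_mk_alignment_for_contiguous_layout().

```python
def align_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def contiguous_segments(tokens_per_expert, alignment):
    """Lay out per-expert token blocks back to back along M.

    Returns a list of (start_row, valid_rows, padded_rows) per expert and the
    total number of rows in the concatenated tensor.
    """
    segments, offset = [], 0
    for count in tokens_per_expert:
        padded = align_up(count, alignment)
        segments.append((offset, count, padded))
        offset += padded
    return segments, offset


# Three experts receiving 3, 5, and 0 tokens, with an illustrative alignment of 4.
segments, total_rows = contiguous_segments([3, 5, 0], alignment=4)
```

Every segment starts on an aligned row, so a single M block never straddles two experts; the cost is the padding rows between segments.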
A K-axis grouped API is also available for the MoE weight backward pass.
- M and N must stay fixed
- The function is k_grouped_fp8_gemm_tn_contiguous
For inference decoding, when CUDA graphs are enabled and the CPU cannot know the per-expert token counts, masked grouped GEMM is supported.
- Given a mask tensor, the kernel computes only the valid regions
- The function is m_grouped_fp8_gemm_nt_masked
- One example use is feeding it the output of DeepEP's low-latency kernels
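The masked variant can be read as "same shapes for every group, but only the first few rows of each group are real". The pure-Python reference below illustrates that idea; the function name, the argument layout, and the choice to leave skipped rows at 0.0 are assumptions for illustration, not the library's actual signature or behavior.

```python
def grouped_gemm_nt_masked_reference(a, b, masked_m):
    """Per-group D[g] = A[g] @ B[g].T, computed only for the first masked_m[g] rows.

    a: [groups][max_m][k], b: [groups][n][k], masked_m: [groups].
    Rows at or beyond masked_m[g] are left at a 0.0 placeholder.
    """
    d = []
    for a_g, b_g, valid_rows in zip(a, b, masked_m):
        n = len(b_g)
        d_g = [[0.0] * n for _ in a_g]
        for i in range(valid_rows):
            d_g[i] = [sum(x * y for x, y in zip(a_g[i], b_row)) for b_row in b_g]
        d.append(d_g)
    return d


a = [[[1.0, 2.0], [3.0, 4.0]],   # group 0: only the first row is valid
     [[5.0, 6.0], [7.0, 8.0]]]   # group 1: both rows are valid
b = [[[1.0, 1.0]],               # group 0 weights, [n=1, k=2]
     [[1.0, 0.0]]]               # group 1 weights
d = grouped_gemm_nt_masked_reference(a, b, masked_m=[1, 2])
```

Because the tensor shapes never change and only the mask values do, the same launch can be replayed inside a CUDA graph while the per-expert token counts stay on the GPU.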

MQA kernels for the DeepSeek v3.2 Indexer

The V3.2 MQA kernel family comes in a non-paged and a paged version.
- Non-paged is for prefill
- Paged is for decoding
fp8_mqa_logits takes 6 inputs.
- q: E4M3 tensor of shape [seq_len, num_heads, head_dim]
- kv: E4M3 tensor plus float scaling factors
  - The tensor has shape [seq_len_kv, head_dim]
  - The scaling factors have shape [seq_len_kv]
- weights: float tensor of shape [seq_len, num_heads]
- cu_seq_len_k_start, cu_seq_len_k_end: int tensors of shape [seq_len]
- clean_logits: whether to fill unwritten logits with -inf
The output tensor has shape [seq_len, seq_len_kv] and holds token-to-token logits.
For each q token i, the kernel iterates over kv tokens j from cu_seq_len_k_start[i] up to, but not including, cu_seq_len_k_end[i].
- It multiplies kv_j by its scaling factor
- It computes per-head values with q[i, :, :] @ kv_j
- It applies ReLU, multiplies by weights[i, :], and sums to produce a scalar logit
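The steps above can be written out as a slow pure-Python reference. This is a sketch of the described math only: the function name is hypothetical, inputs are plain floats rather than E4M3 values, and the 0.0 placeholder used when clean_logits is off is an assumption (the summary does not say what unwritten entries contain).

```python
def mqa_logits_reference(q, kv, kv_scale, weights, k_start, k_end, clean_logits=True):
    """Weighted-ReLU MQA logits following the loop described in the text.

    q: [seq_len][num_heads][head_dim], kv: [seq_len_kv][head_dim],
    kv_scale: [seq_len_kv], weights: [seq_len][num_heads],
    k_start / k_end: [seq_len] half-open kv ranges per query token.
    """
    fill = float("-inf") if clean_logits else 0.0  # 0.0 is only a placeholder
    logits = [[fill] * len(kv) for _ in q]
    for i, q_i in enumerate(q):
        for j in range(k_start[i], k_end[i]):
            kv_j = [x * kv_scale[j] for x in kv[j]]          # apply the scaling factor
            total = 0.0
            for head, q_head in enumerate(q_i):
                score = sum(x * y for x, y in zip(q_head, kv_j))  # per-head dot product
                total += max(score, 0.0) * weights[i][head]       # ReLU, then head weight
            logits[i][j] = total
    return logits


q = [[[1.0, 0.0], [0.0, 1.0]],
     [[1.0, 1.0], [-1.0, 0.0]]]          # seq_len=2, num_heads=2, head_dim=2
kv = [[2.0, 4.0], [-2.0, 2.0]]           # seq_len_kv=2
kv_scale = [0.5, 1.0]
weights = [[1.0, 2.0], [1.0, 1.0]]
logits = mqa_logits_reference(q, kv, kv_scale, weights, k_start=[0, 0], k_end=[1, 2])
```

In this example the first query only sees kv token 0, so its second logit stays at -inf, while the second query sees both tokens.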
The paged version is the function fp8_paged_mqa_logits.

Mega MoE

Mega MoE fuses several MoE stages into a single mega-kernel.
- EP dispatch
- Linear 1, FP8xFP4
- SwiGLU
- Linear 2, FP8xFP4
- EP combine
Mega MoE overlaps NVLink communication with tensor core compute.
Running it requires a multi-process launch that uses symmetric memory.
The usage flow is as follows.
- Allocate a symmetric memory buffer with deep_gemm.get_symm_buffer_for_mega_moe
  - Requires PyTorch 2.9 or newer
- Transform the weights, including FP4 data and UE8M0 SF, into the required layout with deep_gemm.transform_weights_for_mega_moe
- Before the call, copy the inputs, scaling factors, top-k indices, and top-k weights into the buffer
- Run the fused mega MoE kernel with deep_gemm.fp8_fp4_mega_moe
A complete multi-process setup and benchmarking example is in tests/test_mega_moe.py.
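For orientation, the math that the fused stages compute for a single token can be sketched without any fusion, quantization, or communication. Everything here is illustrative: the function names are hypothetical, the weights are plain floats, and placing the gate rows before the up rows in the first linear layer is an assumption about layout, not something the summary states.

```python
import math


def silu(x):
    return x / (1.0 + math.exp(-x))


def matvec(w, x):
    """Multiply a [rows][cols] matrix by a length-cols vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def moe_token_reference(x, experts, topk_idx, topk_weights):
    """Unfused reference for one token routed to its top-k experts.

    experts[e] = (w1, w2): w1 has 2 * inter rows (gate rows, then up rows; assumed order),
    w2 has len(x) rows of length inter.
    """
    out = [0.0] * len(x)
    for expert, weight in zip(topk_idx, topk_weights):   # "dispatch": visit each routed expert
        w1, w2 = experts[expert]
        hidden = matvec(w1, x)                           # linear 1
        inter = len(hidden) // 2
        act = [silu(g) * u for g, u in zip(hidden[:inter], hidden[inter:])]  # SwiGLU
        y = matvec(w2, act)                              # linear 2
        out = [o + weight * yi for o, yi in zip(out, y)]  # "combine": weighted sum
    return out


experts = [([[1.0], [2.0]], [[3.0]]),   # expert 0: gate=1*x, up=2*x, then times 3
           ([[0.0], [1.0]], [[1.0]])]   # expert 1: gate is always 0, so it contributes 0
out = moe_token_reference([1.0], experts, topk_idx=[0, 1], topk_weights=[0.5, 0.5])
```

In a multi-GPU expert-parallel setup the dispatch and combine steps involve moving tokens between devices, which is the communication that Mega MoE overlaps with the two linear layers.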

Utilities and environment variables

The main utility functions control execution resources, alignment, JIT compilation, and scaling factor transformation.
- deep_gemm.set_num_sms / get_num_sms: set and query the maximum number of SMs to use
- deep_gemm.set_tc_util / get_tc_util: set and query the approximate tensor core utilization ratio
- deep_gemm.set_pdl / get_pdl: enable or disable Programmatic Dependent Launch (PDL)
- deep_gemm.set_mk_alignment_for_contiguous_layout / get_mk_alignment_for_contiguous_layout: set and query the group-level M/K alignment for the contiguous layout
- deep_gemm.transform_sf_into_required_layout: transform scaling factors into the required layout
- deep_gemm.get_tma_aligned_size: query the required TMA-aligned size
JIT-related environment variables control debug output, cache location, compiler selection, and profiling options.
- DG_JIT_DEBUG: print JIT debug information
- DG_PRINT_CONFIGS: print the config selected for each shape
- DG_JIT_CACHE_DIR: cache directory for compiled kernels, default $HOME/.deep_gemm
- DG_JIT_USE_NVRTC: use NVRTC instead of NVCC; compilation is faster, but performance may drop in some cases
- DG_JIT_NVCC_COMPILER: path to the NVCC compiler
- DG_JIT_CPP_STANDARD: C++ standard version, default 20
Debug and profiling environment variables are also provided.
- DG_JIT_DUMP_ASM, DG_JIT_DUMP_PTX, DG_JIT_DUMP_SASS: dump PTX and SASS output
- DG_JIT_WITH_LINEINFO: include source line information for profiling tools
- DG_COMM_KERNEL_DEBUG: zero-initialize the symmetric buffer before each Mega MoE call
- DG_USE_NVIDIA_TOOLS: skip internal profiling when running under external NVIDIA tools
Build options control installation and how kernels are loaded.
- DG_SKIP_CUDA_BUILD: skip building the CUDA extension during installation
- DG_FORCE_BUILD: force a local build instead of downloading a pre-built wheel
- DG_JIT_USE_RUNTIME_API: use the CUDA Runtime API to load kernels; requires CUDA runtime 12.8 or newer
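One way to apply the JIT variables is to build an environment mapping and hand it to a child process. The sketch below uses only variable names from the list above; the helper name is hypothetical, and using "1" as the enabling value for DG_JIT_DEBUG is an assumption (only DG_JIT_USE_NVRTC=1 is stated explicitly).

```python
import os


def deep_gemm_jit_env(cache_dir=None, use_nvrtc=False, debug=False, base=None):
    """Return a copy of the environment with selected DeepGEMM JIT variables set."""
    env = dict(os.environ if base is None else base)
    if cache_dir is not None:
        env["DG_JIT_CACHE_DIR"] = cache_dir   # where compiled kernels are cached
    if use_nvrtc:
        env["DG_JIT_USE_NVRTC"] = "1"         # faster compilation, possibly slower kernels
    if debug:
        env["DG_JIT_DEBUG"] = "1"             # assumed enabling value
    return env


env = deep_gemm_jit_env(cache_dir="/tmp/deep_gemm_cache", use_nvrtc=True, base={})
```

The resulting mapping can be passed as the env argument of subprocess.run when launching a training or benchmark script, which keeps the settings out of the parent process's environment.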

License and citation

The DeepGEMM repository is released under the MIT License.
The project states that it is inspired by CUTLASS.
The citation entry is titled "DeepGEMM: clean and efficient BLAS kernel library on GPU".

1 comment

GN⁺ 2025-02-27

Hacker News comments

The FFMA SASS interleaving is honestly astonishing.
After noticing that CUTLASS FP8 kernel performance improved between NVCC 12.2 and 12.3, someone compared the compiled SASS and found one bit flipped in an interleaving pattern across a series of FADD instructions; by looking at an open-source CUDA assembler implementation, they apparently worked out that it is the yield bit, which lets the current warp step aside so another warp can run.
They then wrote a script that patches the FFMA instructions in the compiled binary. Because register reuse is not possible when a warp yields, it flips the reuse bit along with the yield bit, so that MMA instructions and the promotion FFMA instructions overlap better in fine-grained scaling FP8 GEMM; in some cases this gave more than a 10% speedup, which is quite impressive.
- I read elsewhere that this kind of technique is commonly used when optimizing performance-critical matrix operations.
  Other AI companies probably just had not needed it for this particular problem yet, and in the end everyone may well arrive at roughly the same conclusion.
- Scott Gray discovered exactly this, and a good deal more, on Maxwell back in 2015, and many people have done substantial work on it since.
This is a good example of how far behind current compilers still are at extracting hardware performance from high-level code alone.
It makes you wonder what traditional compiler techniques or AI-based optimization agents would need in order to reach results like this.
- It looks like it would take a huge amount of trial and error inside a reinforcement learning feedback loop.
The reported speedups are relative to their own CUTLASS-based baseline.
I wonder whether anyone has done a direct performance comparison against cuBLAS.
The CUTLASS GEMM results I have seen so far were within about 10% of cuBLAS; if the 2x~2.5x gain claimed in the paper holds up, that would be truly impressive.
- I usually avoid FP8 and prefer I8, but this question made me curious about how well cuBLAS does.
  First, cuBLAS needs the cuBLASLt extension API to handle mixed-precision workloads such as FP8.
  Then, for A x B, reasonable-looking type combinations such as E5M2 x E5M2 are not supported while E5M2 x E4M3 is; and restrictions remain, such as matrix A always having to be in a transposed layout on Ampere, Hopper, and Blackwell.
  I integrated an FP8 cuBLASLt benchmark into my "Less Slow C++" repository <https://github.com/ashvardanian/less_slow.cpp>, adding it to the existing list of cuBLAS and hand-written CUDA/PTX benchmarks.
  I am running it on an H200 GPU, which should perform like an H100, and throughput on square inputs peaks at about 1.35 Peta-ops.
  I measured 2.68T/s at 256, 20.49T/s at 512, 144.23T/s at 1024, 665.68T/s at 2048, 1.26P/s at 4096, 1.34P/s at 8192, and about 1.23P/s at 16384; that is roughly 67% of the figure NVIDIA advertises for dense GEMM <https://resources.nvidia.com/en-us-data-center-overview-mc/e...>.
- I had heard that CUTLASS can deliver even better performance than cuBLAS.
  I assumed the baseline was whichever of cuBLAS and CUTLASS was faster.
Open source like this shows very clearly how much the industry is chasing efficiency.
But the benefit of this software will go less to the general open-source community, people who are learning, experimenting, or trying to serve models on consumer hardware, and more to large companies serving models at scale, that is, to DeepSeek's potential competitors.
- Better efficiency could ultimately benefit everyone, DeepSeek included, in the form of cheaper hardware.
I am not sure that optimizing for ever lower precision is the right direction in the long run.
It implies that the model is actually quite sparse, which may be true right now, but I believe that is not because it fundamentally has to be sparse; rather, it looks that way because some bad ideas are baked into how we train.
- As long as the free sparsity keeps working, we should take advantage of it.
  Training great models that are only achievable at higher precision is a research problem, whereas low-precision training and inference is an engineering problem.
  We have been doing this since the CNN era, at least 9 years, and I think the work will continue for a few more years.
- Since activation functions throw away a large part of the dynamic range of floating-point numbers, it seems fairly clear that keeping a very wide range for activation regions that are already saturated is probably not useful.
Blackwell's native microscaling support (MXFP) may make this moot.
On Hopper this was implemented manually, at a coarser granularity but with FP32 scaling factors.
- Right.
  High-quality public demos like this show nicely where the $NVDA moat really lies.
  General-purpose GPUs are flexible enough that many things the hardware vendor never originally considered, but which are reasonable enough, can be made to work through programming.
  But if the future gradually narrows toward dedicated hardware support and the room for this kind of software optimization disappears, the so-called CUDA moat will break down.
  It is as if NVIDIA is digging away at its own moat just to stay in the game :p
Wow, this is under the MIT license.
It would be nice if big companies adopted this kind of open-source collaboration model too.
I still keep wondering why undocumented instructions exist at all.
Even if they are not fully stable, wouldn't it be better to make them available to users?
Presumably these things are documented internally, so I do not understand why they are not made public.
Security through obscurity does not work, and competitors reverse-engineer everything anyway.
- It probably happens for the same reason that parts of what we build ourselves end up undocumented.
  Either there is not enough time, or people do not want to signal support for unstable or experimental features.
  If the impact is limited to the team next door, it is also much easier to change things later.
- It is not necessarily safe to assume that "these things are documented internally" either.
  They may only be recorded in places like architecture design documents or specs, and those are documents a company naturally would not want to share.
Honestly, this is beyond both my use case and my understanding.
Still, sharing discoveries and improvements like this so that everyone benefits is genuinely admirable and refreshing.
- FFMA stands for Fused Floating-point Multiply-Add, a basic GPU instruction that computes D = A*B + C in a single step.
  It is very important in matrix multiplication and deep learning workloads.
  In NVIDIA's SASS, an FFMA instruction is encoded as a 64-bit or 128-bit instruction, with several control bits that determine its exact behavior.
  When the yield bit is set, it tells the warp scheduler that the current warp may give up execution after this instruction, so the hardware can run another warp to hide latency.
  GPUs get high throughput from massive parallelism: when one warp stalls, for example waiting on memory, another warp can make progress.
  The reuse bit indicates whether a source register can be reused immediately by the next operation, and it must be turned off whenever the yield bit is set.
  That is because when a warp yields, the next warp to run is not necessarily the same one, and another warp may change the register file state; so the hardware cannot guarantee that the register value survives the yield.
  By setting the yield bit in an alternating pattern on FFMA instructions, the compiler can create explicit scheduling points where other warps can make progress, and to preserve correctness the reuse bit of that instruction must be cleared as well.
  This change particularly helps overlap the MMA instructions at the heart of matrix multiplication with the promotion FFMA instructions that convert FP8 to higher precision for accumulation.
  FP8 GEMM typically needs conversion to higher precision for accumulation and then back again, which generates extra FFMAs; this reduces memory bandwidth demand but creates a complex compute pattern mixed with promotion/demotion operations.
  "Fine-grained scaling" probably means carefully managing precision at many points in the computation.
  Tweaking the yield bit this way lets compute operations and format conversion interleave better, so the GPU's execution units are used more efficiently; without this optimization the warp scheduler might not find natural switching opportunities, and compute resources would be underused.
