Flash-MoE - लैपटॉप पर 397 अरब पैरामीटर मॉडल चलाना

(github.com/danveloper)

5 पॉइंट द्वारा GN⁺ 2026-03-23 | अभी कोई टिप्पणी नहीं है. | WhatsApp पर शेयर करें

397B पैरामीटर वाले Mixture-of-Experts मॉडल को MacBook Pro(48GB RAM) पर 4.4 टोकन/सेकंड से अधिक की गति से चलाने वाला C/Metal-आधारित inference engine
पूरे 209GB मॉडल को SSD से स्ट्रीम करते हुए, Python या किसी framework के बिना केवल C और Metal shaders से कार्यान्वित
SSD Expert Streaming, FMA-optimized kernel, Deferred GPU Compute आदि के जरिए GPU·SSD·CPU parallel efficiency को अधिकतम किया गया
4-bit quantization configuration गुणवत्ता और गति के बीच संतुलन बनाती है, और tool calling capability सहित production-level output उत्पन्न करती है
लैपटॉप वातावरण में भी अल्ट्रा-लार्ज MoE मॉडल पर real-time inference संभव बनाने वाला lightweight optimization का उदाहरण

प्रदर्शन परिणाम

4-bit expert (FMA kernel) configuration में 4.36 tok/s, अच्छी गुणवत्ता, कुल 209GB disk उपयोग
4-bit base configuration 3.90 tok/s देता है, यानी FMA optimization से पहले का चरण
2-bit expert (trust OS) configuration 5.74 tok/s तक पहुँचता है, लेकिन JSON output errors के कारण tool calling संभव नहीं
2-bit peak single token 7.05 tok/s तक पहुँचता है, लेकिन वास्तविक उपयोग के लिए उपयुक्त नहीं
4-bit quantization वास्तविक संचालन के लिए उपयुक्त configuration है

हार्डवेयर वातावरण

MacBook Pro (Apple M3 Max), 16-core CPU(12P+4E), 40-core GPU, 16-core ANE
48GB unified memory, bandwidth लगभग 400GB/s
SSD 1TB Apple Fabric, 17.5GB/s sequential read speed
macOS 26.2 (Darwin 25.2.0) वातावरण

मॉडल आर्किटेक्चर

कुल 60 transformer layers: 45 GatedDeltaNet(linear attention) + 15 full attention
हर layer में 512 experts हैं, जिनमें से K=4 प्रति token सक्रिय होते हैं (1 shared expert सहित)
hidden dimension 4096
मुख्य तकनीकें
- SSD Expert Streaming
  - expert weights (4-bit आधार पर 209GB) को NVMe SSD से parallel pread() के जरिए जरूरत पड़ने पर लोड किया जाता है
  - हर layer में सक्रिय सिर्फ 4 experts लोड होते हैं (प्रत्येक लगभग 6.75MB)
  - OS page cache अपने आप caching संभालता है, अलग cache की जरूरत नहीं
  - Apple के “LLM in a Flash” पेपर से प्रेरित संरचना
- FMA-optimized dequant kernel
  - (nibble * scale + bias) * x ऑपरेशन को fma(nibble, scale*x, bias*x) रूप में पुनर्व्यवस्थित किया गया
  - scale*x और bias*x को पहले से calculate करके GPU FMA units एक ही instruction में काम करती हैं
  - simple implementation की तुलना में 12% speed improvement
- Metal Compute Shaders
  - 4-bit/2-bit dequant matrix-vector multiply, SwiGLU activation, RMS normalization, GPU attention(Q@Kᵀ, softmax, scores@V), RoPE, MoE merge+residual+gate आदि को hand-coded Metal kernels से implement किया गया
- Deferred GPU Expert Compute
  - CMD3(expert forward pass) commands को asynchronous submit किया जाता है ताकि GPU के चलने के दौरान CPU अगली layer तैयार करे
  - merge+normalization+residual ऑपरेशन भी GPU पर चलते हैं और सीधे अगली layer को सौंपे जाते हैं
- Accelerate BLAS का उपयोग
  - GatedDeltaNet की recurrent computation के लिए cblas_sscal, cblas_sgemv, cblas_sger का उपयोग
  - scalar code की तुलना में 64% तेज प्रदर्शन
- Trust the OS
  - custom cache हटाकर, OS page cache (LRU-based, लगभग 35GB) expert data caching संभालता है
  - अपना Metal LRU, malloc cache, LZ4 compressed cache — इन सबसे धीमा निकला
  - natural 71% cache hit rate हासिल
unified memory constraints
- Apple Silicon पर SSD DMA और GPU computation एक ही memory controller साझा करते हैं
- parallel execution के दौरान GPU bandwidth saturation से latency बहुत बढ़ जाती है
- GPU → SSD → GPU sequential pipeline ही hardware-optimal form है

Flash-MoE - लैपटॉप पर 397 अरब पैरामीटर मॉडल चलाना

प्रदर्शन परिणाम

हार्डवेयर वातावरण

मॉडल आर्किटेक्चर

मुख्य तकनीकें

SSD Expert Streaming

FMA-optimized dequant kernel

Metal Compute Shaders

Deferred GPU Expert Compute

Accelerate BLAS का उपयोग

Trust the OS

unified memory constraints

संबंधित पढ़ाई

अभी कोई टिप्पणी नहीं है.