Rapid-MLX - Apple Silicon के लिए अल्ट्रा-फास्ट लोकल AI इंजन

xguru · 2026-05-12T09:46:02+09:00

Apple Silicon Mac पर लोकल AI मॉडल चलाने के लिए inference engine, जो Apple के MLX framework पर आधारित native Metal compute kernels का उपयोग करता है Ollama की तुलना में अधिकतम 4.2x तेज inference speed - Phi-4 Mini 14B पर 180 tok/s (Ollama के 56 tok/s की तुलना में 3.2x), Qwen3.5-9B पर 108 tok/s (Ollama के 41 tok/s की तुलना में 2.6x) cached state में TTFT 0.08 सेकंड (Kimi-Linear-48B के आधार पर), और अधिकांश मॉडलों में 0.1~0.3 सेकंड का स्तर 17 tool-calling parsers built-in और model name आधारित auto-detection — 4bit quantized model अगर टूटे हुए tool calls को text के रूप में output करे तब भी उन्हें अपने-आप structured format में restore करता है 16GB MacBook Air (Qwen3.5-4B, 160 tok/s) से 256GB Mac Studio Ultra (DeepSeek V4 Flash 158B, 31 tok/s, 1M context) तक RAM के अनुसार optimal model mapping प्रदान करता है 16GB MacBook Air/Pro: Qwen3.5-4B 4bit → 2.4GB RAM उपयोग, 160 tok/s, chat·coding·tool calling संभव 24GB MacBook Pro: Qwen3.5-9B 4bit → 5.1GB, 108 tok/s, general-purpose model 32GB Mac Mini/Studio: Qwen3.5-27B 4bit (15.3GB, 39 tok/s), Nemotron-Nano 30B 4bit (18GB, 141 tok/s, 100% tool calling), Qwen3.6-35B-A3B 4bit (20GB, 95 tok/s, 256 MoE experts, 262K context) 48~64GB: Qwen3.5-35B-A3B 8bit → 37GB, 83 tok/s, smartness + speed का सबसे अच्छा संतुलन 96GB+: Qwen3.5-122B mxfp4 → 65GB, 57 tok/s, frontier-grade intelligence 128GB+: DeepSeek V4 Flash 158B-A13B 2-bit DQ → 91GB, 56 tok/s, day-0 frontier MoE 192~256GB: Qwen3.5-122B 8bit (130GB, 44 tok/s) या DeepSeek V4 Flash 8-bit (136GB, 31 tok/s, 1M context) 4bit मेमोरी बचाने के लिए (ज़्यादातर मामलों में अनुशंसित), 8bit उच्च-गुणवत्ता inference के लिए, mxfp4 उच्च-गुणवत्ता 4bit format है chain-of-thought मॉडलों की reasoning process को अलग reasoning_content field में विभाजित करने वाला reasoning separation फीचर - Qwen3, DeepSeek-R1, MiniMax, GPT-OSS formats समर्थित standard transformers के लिए KV cache trimming और Qwen3.5 hybrid architecture के लिए DeltaNet state snapshots (~0.1ms restore) के जरिए multi-turn conversation में TTFT 2~5x बेहतर, और यह बिना किसी अलग flag के हमेशा enabled रहता है बड़े context requests में, जहाँ local prefill धीमा हो, वहाँ GPT-5, Claude जैसे cloud LLMs पर अपने-आप switch करने वाला smart cloud routing समर्थित OpenAI API का drop-in replacement — Cursor, Claude Code, Aider, LangChain, PydanticAI, smolagents, Hermes Agent, Open WebUI जैसे OpenAI-compatible apps को localhost:8000/v1 से तुरंत जोड़ा जा सकता है Vision (Gemma 4, Qwen-VL), Audio (TTS/STT), Embeddings, Gradio Chat UI, schema-constrained JSON generation जैसी multimodal और optional extensions समर्थित TurboQuant V-cache (86% memory reduction), KV cache quantization, prefill chunking, tool logits bias जैसी विभिन्न optimization techniques built-in model + agent harness compatibility को मापने वाला MHI (Model-Harness Index) उपलब्ध — Qwopus 27B ने MHI 92 के साथ सबसे अधिक स्कोर किया Speculative Decode (1.5~2.3x), EAGLE-3 (3~6.5x), ReDrafter (1.4~1.5x) जैसी अतिरिक्त acceleration techniques roadmap में शामिल Apache 2.0 license

(github.com/raullenchai)

13 पॉइंट द्वारा xguru 7 시간 전 | 3 टिप्पणियां | WhatsApp पर शेयर करें

Apple Silicon Mac पर लोकल AI मॉडल चलाने के लिए inference engine, जो Apple के MLX framework पर आधारित native Metal compute kernels का उपयोग करता है
Ollama की तुलना में अधिकतम 4.2x तेज inference speed - Phi-4 Mini 14B पर 180 tok/s (Ollama के 56 tok/s की तुलना में 3.2x), Qwen3.5-9B पर 108 tok/s (Ollama के 41 tok/s की तुलना में 2.6x)
cached state में TTFT 0.08 सेकंड (Kimi-Linear-48B के आधार पर), और अधिकांश मॉडलों में 0.1~0.3 सेकंड का स्तर
17 tool-calling parsers built-in और model name आधारित auto-detection — 4bit quantized model अगर टूटे हुए tool calls को text के रूप में output करे तब भी उन्हें अपने-आप structured format में restore करता है
16GB MacBook Air (Qwen3.5-4B, 160 tok/s) से 256GB Mac Studio Ultra (DeepSeek V4 Flash 158B, 31 tok/s, 1M context) तक RAM के अनुसार optimal model mapping प्रदान करता है
- 16GB MacBook Air/Pro: Qwen3.5-4B 4bit → 2.4GB RAM उपयोग, 160 tok/s, chat·coding·tool calling संभव
- 24GB MacBook Pro: Qwen3.5-9B 4bit → 5.1GB, 108 tok/s, general-purpose model
- 32GB Mac Mini/Studio: Qwen3.5-27B 4bit (15.3GB, 39 tok/s), Nemotron-Nano 30B 4bit (18GB, 141 tok/s, 100% tool calling), Qwen3.6-35B-A3B 4bit (20GB, 95 tok/s, 256 MoE experts, 262K context)
- 48~64GB: Qwen3.5-35B-A3B 8bit → 37GB, 83 tok/s, smartness + speed का सबसे अच्छा संतुलन
- 96GB+: Qwen3.5-122B mxfp4 → 65GB, 57 tok/s, frontier-grade intelligence
- 128GB+: DeepSeek V4 Flash 158B-A13B 2-bit DQ → 91GB, 56 tok/s, day-0 frontier MoE
- 192~256GB: Qwen3.5-122B 8bit (130GB, 44 tok/s) या DeepSeek V4 Flash 8-bit (136GB, 31 tok/s, 1M context)
- 4bit मेमोरी बचाने के लिए (ज़्यादातर मामलों में अनुशंसित), 8bit उच्च-गुणवत्ता inference के लिए, mxfp4 उच्च-गुणवत्ता 4bit format है
chain-of-thought मॉडलों की reasoning process को अलग reasoning_content field में विभाजित करने वाला reasoning separation फीचर - Qwen3, DeepSeek-R1, MiniMax, GPT-OSS formats समर्थित
standard transformers के लिए KV cache trimming और Qwen3.5 hybrid architecture के लिए DeltaNet state snapshots (~0.1ms restore) के जरिए multi-turn conversation में TTFT 2~5x बेहतर, और यह बिना किसी अलग flag के हमेशा enabled रहता है
बड़े context requests में, जहाँ local prefill धीमा हो, वहाँ GPT-5, Claude जैसे cloud LLMs पर अपने-आप switch करने वाला smart cloud routing समर्थित
OpenAI API का drop-in replacement — Cursor, Claude Code, Aider, LangChain, PydanticAI, smolagents, Hermes Agent, Open WebUI जैसे OpenAI-compatible apps को localhost:8000/v1 से तुरंत जोड़ा जा सकता है
Vision (Gemma 4, Qwen-VL), Audio (TTS/STT), Embeddings, Gradio Chat UI, schema-constrained JSON generation जैसी multimodal और optional extensions समर्थित
TurboQuant V-cache (86% memory reduction), KV cache quantization, prefill chunking, tool logits bias जैसी विभिन्न optimization techniques built-in
model + agent harness compatibility को मापने वाला MHI (Model-Harness Index) उपलब्ध — Qwopus 27B ने MHI 92 के साथ सबसे अधिक स्कोर किया
Speculative Decode (1.5~2.3x), EAGLE-3 (3~6.5x), ReDrafter (1.4~1.5x) जैसी अतिरिक्त acceleration techniques roadmap में शामिल
Apache 2.0 license

3 टिप्पणियां

parkindani 3 시간 전

omlx की तुलना में परफ़ॉर्मेंस कैसी होगी, यह जानने की उत्सुकता है।

xguru 6 시간 전

मैं व्यक्तिगत रूप से antirez/ds4 से deepseek4 चला कर देख रहा हूं, और स्पीड ds4 की थोड़ी ज़्यादा तेज़ लगती है।

ds4 128gb-विशेष है, इसलिए थोड़ा अजीब-सा लगता है, लेकिन बाकी मॉडलों में यह अच्छा हो सकता है।

हाल ही में HuggingFace के CEO का एक ट्वीट काफ़ी लोकप्रिय हुआ था, जिसमें उन्होंने कहा कि Qwen3.6 27B के साथ विमान में कोडिंग करके देखा तो वह Opus स्तर का लगा। इसे भी 3.6 27B पर चला कर देखना पड़ेगा।
https://x.com/julien_c/status/2047647522173104145

yangeok 7 시간 전

कोरियन में performance कैसी होगी, यह जानने की जिज्ञासा है.. मैं 96GB वाला इस्तेमाल कर रहा हूँ, लेकिन शायद paid LLMs से performance कम ही होगी, है ना..?

gemini cli जितना ही हो जाए तो भी अच्छा लगेगा haha

Rapid-MLX - Apple Silicon के लिए अल्ट्रा-फास्ट लोकल AI इंजन

संबंधित पढ़ाई

3 टिप्पणियां