Qwen3.5 लोकल रनिंग गाइड

(unsloth.ai)

33 पॉइंट द्वारा GN⁺ 2026-03-09 | अभी कोई टिप्पणी नहीं है. | WhatsApp पर शेयर करें

Alibaba की Qwen3.5 मॉडल श्रृंखला 0.8B से 397B तक कई आकारों में उपलब्ध है, और मल्टीमॉडल हाइब्रिड reasoning फीचर तथा 256K context को सपोर्ट करती है
Unsloth सभी Qwen3.5 मॉडलों को Dynamic 2.0 GGUF quantization के साथ उपलब्ध कराता है, और इन्हें llama.cpp या LM Studio के जरिए लोकल में चलाया जा सकता है
Thinking मोड और non-thinking मोड के बीच स्विच किया जा सकता है, और छोटे मॉडल (0.8B~9B) डिफ़ॉल्ट रूप से non-thinking मोड पर सेट हैं
हर मॉडल के लिए आवश्यक RAM/VRAM क्षमता और recommended settings (temperature, top_p आदि) दिए गए हैं, और Mac 22GB वातावरण में भी 27B·35B मॉडल चलाए जा सकते हैं
Unsloth GGUF ने बेहतर quantization algorithm और imatrix data लागू करके प्रदर्शन सुधारा है, लेकिन यह Ollama के साथ compatible नहीं है

Qwen3.5 अवलोकन

Qwen3.5, Alibaba द्वारा जारी की गई नई LLM श्रृंखला है, जिसमें 0.8B·2B·4B·9B (छोटे) से लेकर 27B·35B·122B·397B (बड़े) तक शामिल हैं
- यह मल्टीमॉडल हाइब्रिड reasoning को सपोर्ट करती है और 201 भाषाओं तथा 256K context length को संभाल सकती है
- agent coding, vision, conversation, long-context tasks में यह उच्च प्रदर्शन दिखाती है
35B और 27B मॉडल को 22GB RAM वाले Mac पर भी चलाया जा सकता है
सभी GGUF फ़ाइलें बेहतर quantization algorithm और नए imatrix data का उपयोग करती हैं
- chat, coding, long-context, और tool-calling में बेहतर प्रदर्शन
- MXFP4 layers को कुछ GGUF (Q2_K_XL, Q3_K_XL, Q4_K_XL) से हटाया गया है

हार्डवेयर आवश्यकताएँ

तालिका के अनुसार मॉडल आकार के हिसाब से न्यूनतम memory requirements दी गई हैं
- उदाहरण: 0.8B~2B मॉडल के लिए 3GB, 9B के लिए 5.5GB (3-bit आधार), 35B-A3B के लिए 17GB आवश्यक
- 397B-A17B के लिए 3-bit आधार पर 180GB, और 4-bit आधार पर 214GB आवश्यक
कुल memory (RAM+VRAM) मॉडल फ़ाइल के आकार से अधिक होनी चाहिए ताकि सर्वोत्तम प्रदर्शन मिल सके
- यदि memory कम हो, तो SSD/HDD offloading के साथ चलाया जा सकता है, लेकिन गति कम हो जाएगी
27B सटीकता-प्राथमिकता वाला विकल्प है, जबकि 35B-A3B गति-प्राथमिकता वाला विकल्प है

recommended settings

अधिकतम context window: 262,144 (YaRN के साथ 1M तक बढ़ाया जा सकता है)
presence_penalty: 0.0~2.0 (repetition घटाने के लिए, ज्यादा होने पर प्रदर्शन थोड़ा कम हो सकता है)
output length: 32,768 tokens recommended
Thinking मोड और Non-thinking मोड के अनुसार setting values अलग हैं
- Thinking मोड: सामान्य कार्यों के लिए temperature=1.0, coding के लिए 0.6
- Non-thinking मोड: सामान्य कार्यों के लिए temperature=0.7, reasoning tasks के लिए 1.0
छोटे मॉडल (0.8B~9B) में reasoning डिफ़ॉल्ट रूप से disabled है
- सक्षम करने के लिए --chat-template-kwargs '{"enable_thinking":true}' का उपयोग करें

रनिंग और inference ट्यूटोरियल

सभी मॉडल Dynamic 4-bit MXFP4_MOE GGUF संस्करण में उपलब्ध हैं
llama.cpp का उपयोग करके लोकल inference प्रक्रिया
- GitHub से latest version install करने के बाद, -DGGML_CUDA विकल्प से GPU/CPU चुनें
- Hugging Face से मॉडल डाउनलोड करें (hf download unsloth/Qwen3.5-XXB-GGUF)
- llama-cli या llama-server कमांड से चलाएँ
LM Studio में भी चलाया जा सकता है
- मॉडल खोजने के बाद GGUF डाउनलोड करें, और YAML फ़ाइल के जरिए Thinking toggle सक्रिय करें
- restart के बाद toggle फीचर उपलब्ध होगा

मॉडल-वार रनिंग सारांश

Qwen3.5-35B-A3B: 24GB RAM/Mac पर Dynamic 4-bit के साथ तेज inference संभव
Qwen3.5-27B: 18GB RAM/Mac पर चल सकता है
Qwen3.5-122B-A10B: 70GB RAM/Mac वातावरण में चलता है
Qwen3.5-397B-A17B:
- 3-bit: 192GB RAM, 4-bit: 256GB RAM आवश्यक
- 24GB GPU + 256GB RAM संयोजन पर प्रति सेकंड 25 tokens से अधिक generate करता है
- Gemini 3 Pro, Claude Opus 4.5, GPT-5.2 के समान प्रदर्शन स्तर

inference server और API integration

llama-server के जरिए इसे OpenAI-compatible API के रूप में deploy किया जा सकता है
- openai Python library से लोकल server पर request भेजी जा सकती है
- उदाहरण: "http://127.0.0.1:8001/v1"; endpoint का उपयोग
Tool Calling फीचर सपोर्ट करता है
- Python code execution, terminal commands, math operations आदि के लिए function calling संभव
- unsloth_inference() उदाहरण कोड उपलब्ध है

benchmark परिणाम

Unsloth GGUF benchmark
- Qwen3.5-35B Dynamic quant ने अधिकांश bit ranges में SOTA प्रदर्शन दिखाया
- 150 से अधिक KL Divergence tests, कुल 9TB GGUF data उपयोग
- 99.9% KLD पर Pareto Frontier में सर्वोच्च प्रदर्शन
Qwen3.5-397B-A17B
- Benjamin Marie के third-party test में
  - मूल 81.3%, UD-Q4_K_XL 80.5%, UD-Q3_K_XL 80.7%
  - accuracy में 1 point से कम गिरावट, और लगभग 500GB memory की बचत
- Q3 को memory-saving विकल्प, और Q4 को stability विकल्प के रूप में सुझाया गया है

अन्य फीचर्स

Reasoning enable/disable कमांड उपलब्ध (--chat-template-kwargs)
Claude Code / OpenAI Codex के साथ integration संभव
Tool Calling Guide के जरिए लोकल LLM tool-calling configuration संभव
Ollama compatible नहीं, केवल llama.cpp-आधारित backend सपोर्टेड है

Qwen3.5 लोकल रनिंग गाइड

Qwen3.5 अवलोकन

हार्डवेयर आवश्यकताएँ

recommended settings

रनिंग और inference ट्यूटोरियल

मॉडल-वार रनिंग सारांश

inference server और API integration

benchmark परिणाम

अन्य फीचर्स

संबंधित पढ़ाई

अभी कोई टिप्पणी नहीं है.