Running DeepSeek-R1-671B-Q4_K_M on a Xeon with 1–2 Arc A770 GPUs

(github.com/intel)

2 points by GN⁺ 2025-03-08 | 1 comment

This is the quick start for the IPEX-LLM portable zip/tgz, aimed at users who want to run llama.cpp directly on Intel GPUs; the new package goes as far as running DeepSeek-R1-671B-Q4_K_M on a Xeon with 1–2 Arc A770 cards
Target environments are both Windows and Linux; it describes how to run GGUF models on Intel Core Ultra and 11th–14th gen Core processors and on Intel Arc A-Series/B-Series GPUs
The basic flow is to place a GGUF model in a local directory and then run llama-cli with options such as -ngl 99, -c 2500, -n 2048, and --temp 0
FlashMoE, available on Linux only, is a CLI optimized for running the MoE GGUF models of the DeepSeek V3/R1 series; DeepSeek V3/R1 requires 380GB of CPU memory, 1–8 Arc A770 cards, and 500GB of disk
On systems with a mix of Intel GPUs, all GPUs are used by default, so with an iGPU/dGPU combination you can either pick GPUs with ONEAPI_DEVICE_SELECTOR or turn off the check with SYCL_DEVICE_CHECK=0

Running llama.cpp from the portable zip/tgz

The llama.cpp portable zip is an ipex-llm based package that runs llama.cpp directly on Intel GPUs
It builds on the portable zip/tgz flow, which cuts down on manual installation, and the new portable zip covers running DeepSeek-R1-671B-Q4_K_M on a Xeon with 1 or 2 Arc A770 cards
Verified hardware:
- Intel Core Ultra processors
- Intel Core 11th~14th gen processors
- Intel Arc A-Series GPU
- Intel Arc B-Series GPU

Windows quick start

Updating the Intel GPU driver to the latest version is recommended
Download the IPEX-LLM llama.cpp portable zip for Windows from the v2.3.0-nightly release and extract it
In cmd, change to the extracted folder
- cd /d PATH\TO\EXTRACTED\FOLDER
Users with multiple GPUs can apply the GPU selection settings before running

Running a GGUF model

Before running, download or copy a community GGUF model to a local directory
The example model is DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf from bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF
Replace the model path with its actual location, then run llama-cli.exe

llama-cli.exe -m PATH\TO\DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -p "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. User: Question:The product of the ages of three teenagers is 4590. How old is the oldest? a. 18 b. 19 c. 15 d. 17 Assistant: <think>" -n 2048  -t 8 -e -ngl 99 --color -c 2500 --temp 0 -no-cnv

The example output shows one Intel Arc A770 Graphics SYCL device, the KV cache, the SYCL compute buffer, sampler settings, and token generation performance

Linux quick start

Check the GPU driver version and, if needed, install the driver following the Intel client GPU driver installation guide
Download the IPEX-LLM llama.cpp portable tgz for Linux from the v2.3.0-nightly release and extract it
In a terminal, change to the extracted folder
- cd /PATH/TO/EXTRACTED/FOLDER
Do not source oneAPI when using the llama.cpp portable package on Linux

Running a GGUF model

As on Windows, place a community GGUF model such as DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf in a local directory
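One way to fetch the example file from a terminal is sketched below. It assumes the file is served from the repository's main branch through the usual Hugging Face resolve URL pattern; hf_file_url, MODEL_DIR, and DO_DOWNLOAD are hypothetical names introduced for illustration, and the download only runs when DO_DOWNLOAD=1.

```shell
# Hypothetical helper: build the Hugging Face "resolve" URL for one file.
# Assumes the file lives on the repository's main branch.
hf_file_url() {
  repo="$1"; file="$2"
  printf 'https://huggingface.co/%s/resolve/main/%s\n' "$repo" "$file"
}

REPO="bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF"
FILE="DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf"
MODEL_DIR="${MODEL_DIR:-./models}"   # assumed local directory

# Only touch the network when explicitly asked to.
if [ "${DO_DOWNLOAD:-0}" = "1" ]; then
  mkdir -p "$MODEL_DIR"
  curl -L --fail -o "$MODEL_DIR/$FILE" "$(hf_file_url "$REPO" "$FILE")"
fi

# This is the path to pass to ./llama-cli -m
echo "$MODEL_DIR/$FILE"
```

Copying a file you already have into a local directory works just as well; the only requirement from the quick start is that the GGUF file is available locally.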
Replace the model path with its actual location, then run ./llama-cli

./llama-cli -m /PATH/TO/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -p "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. User: Question:The product of the ages of three teenagers is 4590. How old is the oldest? a. 18 b. 19 c. 15 d. 17 Assistant: <think>" -n 2048  -t 8 -e -ngl 99 --color -c 2500 --temp 0 -no-cnv

The example output includes run information such as the SYCL device list, llama_kv_cache_init, llama_init_from_model, the sampler chain, n_ctx = 2528, n_batch = 4096, and n_predict = 2048
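The n_ctx = 2528 in the log is larger than the -c 2500 that was requested. That is consistent with the context size being rounded up to a multiple of 32; the padding value of 32 is an assumption inferred from these two numbers, not something the quick start states. The arithmetic, as a small sketch (pad_up is a hypothetical helper):

```shell
# Round n up to the next multiple of pad using integer arithmetic.
pad_up() {
  n="$1"; pad="$2"
  echo $(( (n + pad - 1) / pad * pad ))
}

# -c 2500 from the command above, with an assumed padding of 32
pad_up 2500 32   # prints 2528
```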

Running DeepSeek V3/R1 with FlashMoE

FlashMoE is a command-line tool built on top of llama.cpp and tuned for running MoE models such as DeepSeek V3/R1
It is currently available on Linux
Beyond the tested MoE GGUF models, other MoE GGUF models are also supported
Requirements and notes
- Requirements for running DeepSeek V3/R1:
  - 380GB of CPU memory
  - 1–8 Arc A770
  - 500GB of disk
- Larger models or other precisions may need more resources
- On a platform with a single Arc A770, reduce the context length to avoid OOM; for example, append -c 1024 to the command
- On a dual-socket platform, enabling SNC (Sub-NUMA Clustering) in the BIOS and prefixing the run command with numactl --interleave=all may improve decoding performance
- Do not source oneAPI when using FlashMoE either
CLI run
- The example model is DeepSeek-R1-Q4_K_M.gguf; pass the path of the first split file
```
./flash-moe -m /PATH/TO/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf --prompt "What's AI?" -no-cnv
```
- The example output shows run information such as the KV buffers of 8 SYCL devices, pipeline parallelism being enabled, graph nodes/splits, n_threads = 48, n_ctx = 4096, and n_batch = 4096
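The two platform notes from the requirements (a smaller context on a single Arc A770, numactl interleaving on dual-socket machines) can be folded into the command above. The sketch below only prints the resulting command line as a dry run; flashmoe_cmd and its three arguments are hypothetical names for illustration, and enabling SNC still has to be done in the BIOS.

```shell
# Hypothetical dry-run helper: print the FlashMoE command for a given platform.
#   $1 = number of Arc A770 cards
#   $2 = number of CPU sockets
#   $3 = path to the first GGUF split file
flashmoe_cmd() {
  gpus="$1"; sockets="$2"; model="$3"
  cmd="./flash-moe -m $model --prompt \"What's AI?\" -no-cnv"
  # Single Arc A770: shrink the context length to avoid OOM.
  if [ "$gpus" -eq 1 ]; then
    cmd="$cmd -c 1024"
  fi
  # Dual socket: interleave memory allocations across NUMA nodes.
  if [ "$sockets" -ge 2 ]; then
    cmd="numactl --interleave=all $cmd"
  fi
  echo "$cmd"
}

flashmoe_cmd 1 2 /PATH/TO/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf
```

Printing the command instead of running it keeps the sketch safe to try on a machine without the model; copy the printed line into the terminal once it looks right.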
Serving run
```
./flash-moe -m /PATH/TO/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf --serve -n 512 -np 2 -c 4096
```
- -n is the number of tokens to predict, -np is the number of sequences decoded in parallel, and -c is the total context size
- Adjust these values to your requirements
- The serving feature is available from the v2.3.0 nightly build
- The example output includes n_slots = 2, n_ctx_slot = 2048 for each slot, model loading, the chat template, and the server waiting at http://127.0.0.1:8080
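The slot figures follow from the flags: the total context given with -c is shared across the -np parallel sequences, so -c 4096 with -np 2 leaves 2048 tokens per slot. A small sketch of that arithmetic in both directions (ctx_per_slot and total_ctx_for are hypothetical helper names):

```shell
# Context available to each slot: total context (-c) divided by slots (-np).
ctx_per_slot() {
  echo $(( $1 / $2 ))
}

# Total -c needed to give each of np slots a desired per-slot context.
total_ctx_for() {
  echo $(( $1 * $2 ))
}

ctx_per_slot 4096 2    # prints 2048, matching n_ctx_slot in the example output
total_ctx_for 4 2048   # prints 8192: the -c needed for 4 slots of 2048 tokens
```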

Multi-GPU selection and SYCL errors

Different SYCL devices detected
- With a mix of different GPUs, a Detected different sycl devices error can occur
- The example is a system where 2 Arc A770 cards and 1 Intel UHD Graphics 770 iGPU are detected together
- When the GPUs are not identical, work is assigned according to device memory; in the example the iGPU takes 2/3 of the compute tasks, which severely degrades performance
- There are two options
  - disable the iGPU to get the best performance
  - turn off the check and use all devices (set on Windows, export on Linux)
```
set SYCL_DEVICE_CHECK=0
export SYCL_DEVICE_CHECK=0
```
Specifying which GPU to use
- If there are several Intel GPUs, llama.cpp runs on all of them by default
- To use only specific GPUs, set the ONEAPI_DEVICE_SELECTOR environment variable before starting the llama.cpp command
- Windows:
```
rem Use only the first GPU
set ONEAPI_DEVICE_SELECTOR=level_zero:0
rem Or use the first and second GPUs
set ONEAPI_DEVICE_SELECTOR="level_zero:0;level_zero:1"
```
- Linux:
```
# Use only the first GPU
export ONEAPI_DEVICE_SELECTOR=level_zero:0
# Or use the first and second GPUs
export ONEAPI_DEVICE_SELECTOR="level_zero:0;level_zero:1"
```
- For details on multi-GPU selection, see multi_gpus_selection.md
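For more than two cards the selector string follows the same pattern, one level_zero:N entry per device joined with semicolons. The sketch below builds it from a list of device indices; build_selector is a hypothetical helper, and which index belongs to which GPU has to be read from the SYCL device list printed at startup.

```shell
# Hypothetical helper: join device indices into an ONEAPI_DEVICE_SELECTOR value.
build_selector() {
  sel=""
  for idx in "$@"; do
    # Prefix with "<previous>;" only when something is already in the list.
    sel="${sel:+$sel;}level_zero:$idx"
  done
  echo "$sel"
}

# Select the first and second GPUs, leaving out any other device (e.g. an iGPU).
export ONEAPI_DEVICE_SELECTOR="$(build_selector 0 1)"
echo "$ONEAPI_DEVICE_SELECTOR"   # prints level_zero:0;level_zero:1
```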

Performance options and signature verification

Immediate command lists
- SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS controls whether immediate command lists are used to submit work to the GPU
- It usually improves performance, but there are exceptions, so testing with the variable both on and off to find the best setting is recommended
- Windows:
```
set SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
- Linux:
```
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
- For more information, see Intel's documentation on Level Zero immediate command lists
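The on/off comparison recommended above can be scripted on Linux. The sketch below runs the same command once with the variable set to 0 and once with 1 and prints the wall-clock time of each run; compare_immediate_cmdlists is a hypothetical helper, and wall-clock time is only a rough proxy, since the token generation figures the tool prints itself are more informative.

```shell
# Hypothetical helper: run a command with immediate command lists off (0)
# and on (1), printing the elapsed wall-clock seconds for each run.
compare_immediate_cmdlists() {
  for v in 0 1; do
    start=$(date +%s)
    # The command's own output is discarded here; drop the redirection
    # to read the performance figures it prints.
    env SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS="$v" "$@" >/dev/null 2>&1 \
      || echo "run with value $v failed" >&2
    end=$(date +%s)
    echo "SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=$v: $(( end - start ))s"
  done
}

# Example (model path is a placeholder):
# compare_immediate_cmdlists ./llama-cli -m /PATH/TO/model.gguf -p "Hello" -n 64 -ngl 99 -no-cnv
```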
Signature verification for portable zip/tgz 2.2.0
- For portable zip/tgz version 2.2.0, the signature can be verified with openssl
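- openssl must be installed on the system before verifying
- The command discards the verified content with -out nul, which works on Windows; on Linux nul is an ordinary file name, so /dev/null serves the same purpose there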
```
openssl cms -verify -in <portable-zip-or-tgz-file-name>.pkcs1.sig -inform DER -content <portable-zip-or-tgz-file-name> -out nul -noverify
```

1 comment

GN⁺ 2025-03-08

Hacker News opinions

VRAM is small in this configuration, so a lot of data will have to move between CPU and GPU memory, and performance is likely not to be great
Still, there is a quantized DeepSeek-R1 under 256GB, and it is not a distilled version: https://unsloth.ai/blog/deepseekr1-dynamic
The gap from full FP8 DSR1 is hard to quantify, but even the ~Q2 quantized model was far more usable than expected
Another notable model is DeepSeek v2.5; it has fewer parameters than V3/R1, but still needs aggressive quantization to run on consumer hardware. Someone recently made one: https://www.reddit.com/r/LocalLLaMA/comments/1irwx6q/deepsee...
There is also a case that DeepSeek v2.5 is better than Llama 3 70B, so for people who want to run local inference it is a model more people should know about
- I tested the Unsloth R1 quantization on dual Xeon Gold 5218 with 384GB DDR4-2666, using only about half of the memory channels, so the setup was not optimal
  With IQ2_XXS / 183GB and 16k context, CPU-only gave 3 tokens/sec for prompt processing and 1.44 tokens/sec for the response; CPU + NVIDIA RTX (70GB VRAM) gave 4.74 tokens/sec for prompt processing and 1.87 tokens/sec for the response
  It would be more useful if Unsloth offered the same kind of quantization for DeepSeek V3 too. It would not need reasoning tokens, so it could be faster overall at the same tokens/sec
- I'm going to give v2.5 a try, and I hope it stays as consistent as v3.5 even at such a small quantization
  I'm using Q2_K_XL and personally find it quite good. Where it falls short of FP8 is creative writing; feed the same story prompt a few times and compare with FP8, and the difference shows
  In coding, the 1.58-bit definitely makes more errors than Q2XXS or Q2_K_XL
- It now gets more than 8 tokens/sec, and there is a demo in this post: https://www.linkedin.com/posts/jasondai_run-671b-deepseek-r1...
https://github.com/intel/ipex-llm/blob/main/docs/mddocs/Quic...
The requirements for more than 8 tokens/sec are 380GB of CPU memory, 1–8 ARC A770 cards, and 500GB of disk
- The demo in Jason Dai's post is also worth a look: https://www.linkedin.com/posts/jasondai_with-the-latest-ipex...
- I'm curious whether a single Intel Arc A770 card is enough to get 8 tokens/sec or more
- I'm curious roughly how much this configuration would cost
  It looks like it would be under $10,000, and I don't think I've seen tokens/sec figures either
I'm curious what exactly the Xeon's role is here. Why can't another x86 processor be used?
- Probably because most non-Xeon motherboards don't have enough memory channels to fit this much memory with commercially available DIMMs
- DDR4 UDIMMs top out at 32GB per module and DDR5 UDIMMs at 64GB per module, and non-Xeon motherboards usually have at most 4 UDIMM slots, so you hit a limit of 128GB/256GB per node
  Server motherboards have up to 16 DIMM slots per socket and support RDIMM/LRDIMM, so they take more modules and higher-capacity modules
  128GB UDIMMs launched at the peak of Covid
- Apart from Epyc, there aren't many motherboards that offer enough total RAM at a reasonable price. For testing/development, a used Dell dual-socket older Xeon server with 512GB RAM can be bought quite cheaply
  A few minutes of searching just now turned up ones under $1,500, before adding a video card or SSD, and even 1024GB RAM configurations under $2,000
  If you want to fit several cards at full-speed PCI-Express 3.0 x16 at least, you also need enough PCIe lanes, which is hard to find on a single-socket Intel workstation motherboard
  As examples, here are some relatively cheap configurations with 512GB RAM. They will draw a lot of power and be noisy, but the same approach applies to other x86-64 hardware such as hp or supermicro. They are usually configured as 16 x 32GB DDR4 DIMMs
  https://www.ebay.com/itm/186991103256?_skw=dell+poweredge+t6...
  https://www.ebay.com/itm/235978320621?_skw=dell+poweredge+r7...
  https://www.ebay.com/itm/115819389940?_skw=dell+poweredge+r7...
It makes you wonder why GPUs don't come with a very large amount of larger but slower RAM. Big models could then fit, and the price would still be affordable
- Who would need it? Probably not gaming, and if it's for AI, then pay up; that is Nvidia's approach for now
  Demand for AI GPUs exceeds supply, and much of that demand is backed by overheated money from subsidies, loans, and investment. A GPU company can take that money
  Unfortunately, VRAM is the perfect criterion for separating light use from use with money behind it. It is much like how SSO became the perfect criterion for separating enterprise from non-enterprise, and an SSO tax gets charged on it
- Building that would reduce the motivation to buy a more expensive GPU
- Building a GPU with more VRAM is certainly possible, but there isn't enough competition to force it. The current approach is far more profitable
- Haven't you seen the news about AMD Halo Strix? It is more than twice as fast as the Nvidia 4090 at AI, and it launched last week
Did DeepSeek learn model naming from OpenAI?
- The convention is odd, but it is fairly standard across the industry, especially for GGUF models. It means 671B parameters quantized to 4 bits
  The K_M term seems more specific to GGUF and denotes the concrete quantization strategy
The article should have a bit more information. I'd like to know why the TPS numbers are all hidden with x, what performance to expect from this configuration, and how it compares with the dual Epyc workstation configuration that got popular recently
- It now gets more than 8TPS on a 2-socket 5th-gen Xeon (EMR)
- If there is a link to the dual Epyc workstation recipe said to be popular recently, I'd like to see it
The tokens/sec values are hidden in the sample output, which makes me think it must actually run pretty well
There seem to be a few options for running LLM and Stable Diffusion inference outside Nvidia: Intel Arc, Apple M series, and now AMD Ryzen AI Max
Running on Nvidia is clearly the most optimal, but reasonably priced high-VRAM Nvidia cards are hard to find, so I keep coming back to non-Nvidia hardware
If you have no interest in training or fine-tuning and only want inference, are these solutions really usable? I'd also like to know whether it is possible on a Linux machine
- If you're serious about it, going with Nvidia is the right call
  This article is basically Intel's reminder that "we made a GPU too"; the budget card is good in itself, but the ecosystem is far behind
  Honestly, this is an area where it is hard to do things properly on a tight budget
If APUs for AI arrive, interest in GPUs could drop quickly
If you can use 512GB or 128GB of RAM with an AMD Halo Strix or Apple M3 Studio APU, why buy an expensive Nvidia 4090
Nvidia kept prices high and performance low for as long as it could, and only now has competition arrived. Intel could also make an APU with lots of RAM
I hope Nvidia is a little nervous
