What does GGUF contain besides weights, and what is still missing?

(nobodywho.ooo)

3 points by GN⁺ 3 hours ago | 1 comment

GGUF is the language-model file format used by llama.cpp; it keeps the metadata needed at runtime in a single file, which simplifies model distribution and loading
Chat templates are Jinja2 scripts that handle the conversation format, tool calling, and the encoding of multimedia messages, but their behavior differs between implementations
GGUF can store special tokens such as the end token as well as recommended sampler settings, and it recently became possible to state the sampler chain order explicitly
Tool calling formats currently differ from model to model, so inference engines need hardcoded handling; generating parsers from a grammar remains one possible direction for improving the standard
Because think_token, projection model bundling, and capability flags are missing, separating thought segments, setting up multimodal inference, and detecting supported features are all still difficult

What GGUF bundles

GGUF is the file format llama.cpp uses for language models
Its main advantage is that it packs the many components needed to run a model into a single file
- In a typical Hugging Face safetensors repository, the required JSON files are scattered across several places
- A typical Ollama model is an OCI image whose layers hold JSON, Go templates, and so on
By keeping this extra information in one file, GGUF makes models easier to handle
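To make the "metadata in one file" point concrete, here is a minimal sketch of reading the key/value section at the start of a GGUF file, assuming the header layout described in the GGUF specification (magic, version, tensor count, key/value count, then typed pairs). The helper names and the tiny in-memory demo header are made up for illustration; a real reader would also handle the tensor info section and alignment.

```python
import io
import struct

# Metadata value type ids as listed in the GGUF specification (little-endian).
SCALARS = {
    0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
    6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d",
}
STRING, ARRAY = 8, 9


def read_scalar(f, fmt):
    return struct.unpack(fmt, f.read(struct.calcsize(fmt)))[0]


def read_string(f):
    length = read_scalar(f, "<Q")  # uint64 byte length, then UTF-8 bytes
    return f.read(length).decode("utf-8")


def read_value(f, type_id):
    if type_id == STRING:
        return read_string(f)
    if type_id == ARRAY:
        item_type = read_scalar(f, "<I")
        count = read_scalar(f, "<Q")
        return [read_value(f, item_type) for _ in range(count)]
    return read_scalar(f, SCALARS[type_id])


def read_gguf_metadata(f):
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    version = read_scalar(f, "<I")
    _tensor_count = read_scalar(f, "<Q")
    kv_count = read_scalar(f, "<Q")
    metadata = {}
    for _ in range(kv_count):
        key = read_string(f)
        type_id = read_scalar(f, "<I")
        metadata[key] = read_value(f, type_id)
    return version, metadata


def demo_header():
    """Hypothetical two-entry header so the sketch runs without a model file."""
    def s(text):
        raw = text.encode("utf-8")
        return struct.pack("<Q", len(raw)) + raw

    out = b"GGUF" + struct.pack("<IQQ", 3, 0, 2)  # version 3, 0 tensors, 2 pairs
    out += s("general.architecture") + struct.pack("<I", STRING) + s("llama")
    out += s("tokenizer.ggml.eos_token_id") + struct.pack("<II", 4, 1)
    return io.BytesIO(out)


version, metadata = read_gguf_metadata(demo_header())
print(version, metadata)
```

With a real model, the same function could be pointed at an open file handle, for example `read_gguf_metadata(open(path, "rb"))`, where `path` is whatever GGUF file is at hand.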

Chat templates

Conversational language models are trained on a specific format of token sequences, and this format mirrors the structure of a conversation
An example of the Gemma4 format:

<|turn>user
Hi there!<turn|>
<|turn>model
Hi there, how can I help you today?<turn|>

An example of the LFM2 format:

<s>
<|im_start|>user Hi there!<|im_end|>
<|im_start|>assistant Hi there, how can I help you today?<|im_end|>
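
As a rough illustration of what a chat template does in the simplest case, the Gemma4-style layout shown earlier can be reproduced with a few lines of plain Python. This is a hand-written stand-in, not the Jinja2 template the model ships with; the function name and the `add_generation_prompt` flag are assumptions made for this sketch.

```python
def render_chat(messages, add_generation_prompt=False):
    """Render messages in the Gemma4-style turn layout shown above."""
    parts = []
    for message in messages:
        parts.append(f"<|turn>{message['role']}\n{message['content']}<turn|>\n")
    if add_generation_prompt:
        # Open a model turn so the next generated tokens form the reply.
        parts.append("<|turn>model\n")
    return "".join(parts)


messages = [
    {"role": "user", "content": "Hi there!"},
    {"role": "model", "content": "Hi there, how can I help you today?"},
]
prompt = render_chat(messages)
print(prompt)
```

Calling `render_chat(messages[:1], add_generation_prompt=True)` produces the user turn followed by an open model turn, which is the shape an engine would feed to the model before generating.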

Real templates are far more complex, because they may also have to encode reasoning blocks, tool descriptions, tool calls and responses, and multimedia messages such as images, audio, and video
This work is done by the chat template, a script written in the Jinja2 template language
- The chat template shipped with Gemma 4 is one example
- In GGUF metadata, the default chat template is stored under the tokenizer.chat_template key
A model can have more than one chat template
- There may be separate templates with and without tool calling support
- Most models ship one large chat template and run the tool-calling logic only when tools are provided
- For some models, a separate tool-specific chat template has to be located
With loops, conditionals, assignments, lists, dictionaries, and more, Jinja2 is close to a full programming language
- A conversational LLM application therefore has to embed an interpreter that runs a program, such as Gemma's roughly 250-line Jinja script, every time a new message is appended
Jinja handling also differs between implementations
- Hugging Face transformers uses Python's existing jinja2 library
- llama.cpp's llama-server and llama-cli use their own Jinja implementation
- llama_chat_apply_template, exposed in the libllama API, is the older approach, in which a handful of chat formats are hardcoded directly in C++
- NobodyWho uses minijinja, a Rust reimplementation by Jinja's original author
- This is different from minja, the minimal Jinja library llama.cpp once used
Performance varies considerably between Jinja implementations
- In local LLM applications, chat template processing is not the performance bottleneck, so this is not a major point of contention

Special tokens

A language model can keep generating the next token for a given token sequence indefinitely, so there has to be a way to stop generation
The usual solution is an end token: when the model outputs it, the inference engine stops generating
The end token is one example of a special token
- Special tokens usually carry more meaning than ordinary tokenized characters
- They should normally not be shown to the user, but they often have a textual representation, so they can be displayed
Some of Gemma4's special tokens:
- 1 / <eos>: end of sequence, which the model outputs to stop generation
- 2 / <bos>: beginning of sequence, which is prepended to the input
- 46 / <|tool_call>: marks the start of a tool call
- 47 / <tool_call|>: marks the end of a tool call
- 105 / <|turn>: marks the start of a conversation turn
- 106 / <turn|>: marks the end of a conversation turn
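
The stopping rule can be sketched as a small generation loop. The special token ids are the Gemma4 ones listed above; the scripted "model", the ids 500 to 502, and the tiny vocabulary are hypothetical stand-ins so that the example runs without any weights.

```python
EOS_ID = 1                               # <eos> from the list above
SPECIAL_IDS = {1, 2, 46, 47, 105, 106}   # ids of the special tokens listed above


def generate(next_token, prompt_ids, max_new_tokens=32):
    """Call next_token repeatedly until it returns the end token."""
    ids = list(prompt_ids)
    new_ids = []
    for _ in range(max_new_tokens):
        token = next_token(ids)
        if token == EOS_ID:
            break  # the engine stops here; <eos> itself is not emitted
        ids.append(token)
        new_ids.append(token)
    return new_ids


def visible_text(ids, vocab):
    """Drop special tokens before showing text to the user."""
    return "".join(vocab[i] for i in ids if i not in SPECIAL_IDS)


# Toy stand-in for a model: replays a fixed script of token ids.
scripted = iter([500, 501, 106, 1, 502])
vocab = {500: "Hi", 501: " there", 106: "<turn|>", 502: "!"}

new_ids = generate(lambda ids: next(scripted), prompt_ids=[2, 105])
print(new_ids, repr(visible_text(new_ids, vocab)))
```

The token after `<eos>` in the script is never requested, and the `<turn|>` marker is generated but filtered out of the user-visible text.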

Sampler settings and order

Language models output a probability distribution over the next token, and the process of picking a token from that distribution is called sampling
The simplest approach is a random draw weighted by the distribution
In practice, applying transformations to the probability distribution before picking a token can give better results
When labs release a new model, they often publish recommended sampler settings with it
Users commonly copy and paste these values from Markdown files and the like to get better responses
To reduce manual copy-pasting, NobodyWho published curated models on its Hugging Face page and bundled the recommended sampler settings in its own format
- This worked, but a model had to be converted by NobodyWho before it was usable
A recent addition to the GGUF format allows the sampler chain to be specified directly inside the model file
- This removed the need for NobodyWho's custom format, which was the desired outcome
The llm-sampling web app gives a quick view of what each sampler stage does
Dragging and dropping individual stages shows that the order of sampling stages can make a large difference to the final distribution
Many sampler-setting formats, including the JSON files in Ollama images and Hugging Face's generation_config.json, have no way to express the order of sampling stages
The GGUF standard can specify the sampling order through the general.sampling.sequence field
Even so, many GGUF models omit this field and rely on the implicit order of llama.cpp's default behavior
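
Why the order matters can be seen with two common stages, temperature and top-p, applied in both orders to the same logits. This is a minimal sketch with made-up logits and settings; the stage functions are simplified textbook versions, not llama.cpp's implementations, and the Python list of stages merely stands in for an explicit order such as the one a metadata field could record.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def apply_temperature(logits, t):
    return [x / t for x in logits]


def apply_top_p(logits, p):
    """Keep the smallest set of most likely tokens whose mass reaches p."""
    probs = softmax(logits)
    order = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return [x if i in kept else float("-inf") for i, x in enumerate(logits)]


def run_chain(logits, stages):
    for stage in stages:
        logits = stage(logits)
    return softmax(logits)


def temperature_stage(logits):
    return apply_temperature(logits, 2.0)


def top_p_stage(logits):
    return apply_top_p(logits, 0.6)


logits = [2.0, 1.0, 0.0]
temp_first = run_chain(logits, [temperature_stage, top_p_stage])
top_p_first = run_chain(logits, [top_p_stage, temperature_stage])
print(temp_first)   # two tokens keep non-zero probability
print(top_p_first)  # only the top token survives

# The final step is the weighted random draw described above.
rng = random.Random(0)
token = rng.choices(range(len(logits)), weights=temp_first)[0]
```

With temperature first, the distribution is flattened before the top-p cut, so a second token stays in play; with top-p first, the cut happens on the sharper distribution and leaves a single candidate.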

What is still missing

A good inference engine wants to offer a unified interface across different language models
Parsing and using the extra information in GGUF metadata can greatly reduce model-specific code paths
Tool calling format
- Almost every inference engine has hardcoded paths for parsing the various tool calling formats
- An example of Qwen3's tool calling format:

<tool_call>{"name": "get_weather", "arguments": {"location": "Copenhagen"}}</tool_call>

An example of Qwen3.5's tool calling format:

<tool_call>
<function=get_weather>
<parameter=city>
Copenhagen
</parameter>
</function>
</tool_call>

An example of Gemma4's tool calling format:

<|tool_call>call:get_weather{city:<|"|>Copenhagen<|"|>}<tool_call|>

As soon as a new model arrives, each inference engine has to implement its own parser for it
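
To show what such hardcoded paths tend to look like, here is a minimal sketch with one parser per format for the Qwen3 and Qwen3.5 examples above, both normalizing to the same dictionary shape. The function names and the `PARSERS` keys are hypothetical, and the regular expressions only cover the exact shapes shown in this article.

```python
import json
import re


def parse_qwen3(text):
    """JSON object wrapped in <tool_call> tags."""
    match = re.search(r"<tool_call>(.*?)</tool_call>", text, re.DOTALL)
    if not match:
        return None
    call = json.loads(match.group(1))
    return {"name": call["name"], "arguments": call["arguments"]}


def parse_qwen35(text):
    """XML-like <function=...> block with <parameter=...> children."""
    match = re.search(r"<function=([^>]+)>(.*?)</function>", text, re.DOTALL)
    if not match:
        return None
    params = re.findall(
        r"<parameter=([^>]+)>\s*(.*?)\s*</parameter>", match.group(2), re.DOTALL
    )
    # Values arrive as plain strings here; typing them needs the tool schema.
    return {"name": match.group(1), "arguments": dict(params)}


# One hardcoded entry per model family: the pattern the article describes.
PARSERS = {"qwen3": parse_qwen3, "qwen3.5": parse_qwen35}

qwen3_output = (
    '<tool_call>{"name": "get_weather", '
    '"arguments": {"location": "Copenhagen"}}</tool_call>'
)
qwen35_output = (
    "<tool_call>\n<function=get_weather>\n<parameter=city>\n"
    "Copenhagen\n</parameter>\n</function>\n</tool_call>"
)
print(PARSERS["qwen3"](qwen3_output))
print(PARSERS["qwen3.5"](qwen35_output))
```

Every new format means another entry in the table, which is the maintenance cost a grammar shipped in the model file would aim to remove.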
If the model file included a grammar from which a parser could be derived, that would be an excellent addition to the GGUF standard
NobodyWho runs an extra step that generates a constraint grammar matching the specific tools provided
- This makes it possible to guarantee type safety for tool calls
- It is especially useful because models smaller than 1B can make mistakes such as sending a float where an integer is expected
Even if a grammar that yields a generic tool-calling parser were available, NobodyWho would still have to implement the function that generates a grammar for the concrete tools provided
A meta-grammar format that can both produce a concrete grammar for a specific tool and yield a parser from it is still an interesting open question
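A tool-specific constraint grammar could look roughly like the following sketch, which emits GBNF-style rules (the notation used by llama.cpp grammars) for the JSON body of a Qwen3-style call. This is not NobodyWho's generator: the `tool_grammar` function, the simplified type rules, and the `days` parameter are all assumptions made for illustration.

```python
import json

# Simplified GBNF-style rules for a few JSON scalar types (assumed for this sketch).
TYPE_RULES = {
    "string": r'string ::= "\"" [^"\\]* "\""',
    "integer": r'integer ::= "-"? [0-9]+',
    "number": r'number ::= "-"? [0-9]+ ("." [0-9]+)?',
}


def literal(text):
    """Quote text as a grammar string literal."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def tool_grammar(name, params):
    """Build a grammar that only admits calls to one tool with typed arguments."""
    pieces = [literal('{"name": ' + json.dumps(name) + ', "arguments": {')]
    for index, (param, type_name) in enumerate(params.items()):
        prefix = ", " if index else ""
        pieces.append(literal(prefix + json.dumps(param) + ": "))
        pieces.append(type_name)  # reference to a type rule, e.g. integer
    pieces.append(literal("}}"))
    rules = ["root ::= " + " ".join(pieces)]
    rules += [TYPE_RULES[t] for t in sorted(set(params.values()))]
    return "\n".join(rules)


grammar = tool_grammar("get_weather", {"city": "string", "days": "integer"})
print(grammar)
```

Because `days` refers to the `integer` rule, which has no fractional part, a constrained decoder following this grammar could not emit `3.0` in that position, which is the kind of type safety described above.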
Think token
- Of the missing pieces, this is the easiest one to add
- Upstream Hugging Face repositories have started to include a think_token field
- think_token is very useful for separating the thought segment of generated output
  - The thought segment should usually be stripped or rendered differently from the main output
- Downstream GGUF conversions usually do not carry this field over
- As a result, GGUF-based inference engines cannot separate the thought stream from the main output without writing code for specific model families
- Adding think_token to the standard GGUF conversion pipeline would solve this problem
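Once the delimiters are known, splitting the thought segment from the main output is straightforward, as this minimal sketch suggests. The `<think>` and `</think>` defaults are placeholder values: the article does not say what think_token contains, and in practice the markers would have to come from model metadata.

```python
def split_thoughts(text, open_tag="<think>", close_tag="</think>"):
    """Return (thoughts, visible_output) for one generated string."""
    thoughts, visible = [], []
    rest = text
    while True:
        start = rest.find(open_tag)
        if start == -1:
            visible.append(rest)
            break
        visible.append(rest[:start])
        body_start = start + len(open_tag)
        end = rest.find(close_tag, body_start)
        if end == -1:
            # Unterminated thought, e.g. generation was cut off mid-thought.
            thoughts.append(rest[body_start:])
            break
        thoughts.append(rest[body_start:end])
        rest = rest[end + len(close_tag):]
    return "".join(thoughts).strip(), "".join(visible).strip()


thought, answer = split_thoughts("<think>User greets me.</think>Hello!")
print(repr(thought), repr(answer))
```

An application could then hide `thought`, or render it in a collapsed or dimmed block, while showing `answer` as the reply.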
Projection models
- Multimodal LLM interaction, in which the LLM understands images and audio natively rather than through a conversion to text, needs an additional model that processes the non-text inputs
- This additional model is called a projection model
- The prevailing approach today is to ship two GGUF files
  - One GGUF for the main language model
  - A second, smaller model for image and audio processing
- This breaks GGUF's single-file convenience
- Being able to bundle the projection model's weights and settings inside a single GGUF file would be a big improvement
- A projection model is often around 1GB
  - So it is also important to avoid that overhead when it is not needed
- Offering both variants, one GGUF with the projection weights and one without, could be a reasonable approach
- That would restore the situation where there is one download URL to manage and one file to cache on disk
Supported capabilities list
- Each model supports different capabilities, and it is not easy to detect actual support by looking at the GGUF file alone
- Some models support image input and some do not
  - The best available method today is to assume image support exists if a projection model is provided
- Some models support native tool calling and some do not
  - The best available method today is a partial string match that looks for the part of the chat template that tries to render the list of tool JSON schemas
  - This is clearly a temporary hack
- Some models output thought blocks and some do not
  - Because thought tags are usually absent from GGUF metadata, it is hard to check whether thought blocks should be expected from a model
- If the GGUF community added capability flags to model files, model-agnostic inference libraries could give more consistent error messages and warnings
  - For example, better guidance could be given when tool calling is attempted on a model without native tool calling support
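The heuristics described in this list can be summarized in a short sketch. The function name, the `"tools"` search string, and the returned keys are assumptions for illustration; only the tokenizer.chat_template key comes from the article.

```python
def guess_capabilities(metadata, has_projection_file):
    """Guess model capabilities from what is available today."""
    template = metadata.get("tokenizer.chat_template", "")
    return {
        # Heuristic: a projection model was supplied, so assume image input works.
        "image_input": has_projection_file,
        # Heuristic: the template mentions tools somewhere (partial string match).
        "tool_calling": "tools" in template,
        # No reliable signal in the metadata, so report "unknown".
        "thinking": None,
    }


metadata = {
    "tokenizer.chat_template": "{% if tools %}{{ tools | tojson }}{% endif %}",
}
capabilities = guess_capabilities(metadata, has_projection_file=False)
print(capabilities)
```

Explicit capability flags in the file would replace all three guesses with a direct lookup, and the `None` for thinking shows where today's heuristics have nothing to go on.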

Conclusion

GGUF packs the extra information needed to run a model correctly into a single file, which keeps the number of model-specific code paths from growing too much
GGUF is an open, extensible format with a strong community behind it
If the standard is strengthened collectively, models can be swapped easily in applications while keeping a good developer experience
GGUF metadata is already useful in many ways, but there is still room for improvements such as a tool-calling grammar, think_token, projection model bundling, and capability flags

1 comment

GN⁺ 3 hours ago

Hacker News opinions

It is a bit of a shame that the projection model ended up split into a separate file; I also wanted it to live inside a single file
I do not know exactly why this happened, but it is quite far from the single-file philosophy I had in mind when designing GGUF
Hopefully someone will take the initiative to merge the two again; this time I think I am a little outside the mainstream of the discussion :-)
- MTP support is being worked on right now, and it seems that during that discussion the idea came up of splitting the MTP model out of the main GGUF, the way Mmproj is, but it was rejected
  I like that decision. So it is not much of a stretch to think the door may be open to including the Mmproj file inside the GGUF as well
  The only problem that comes to mind is which format to keep: BF16, F16, and so on
GGML and GGUF have been very important to the open source machine learning/AI ecosystem
Projects such as llama.cpp, whisper.cpp, and stable-diffusion.cpp generally run well out of the box on a wide range of platforms and hardware backends
- Even though llama.cpp came from Meta, and I really do not like Meta, I have to admit it is the easiest of the lot
  Compile it, drop in a model, and run. You then get a web UI and an API as well
> <|turn>user Hi there!<|turn>model Hi there, how can I help you today
Good grief, they managed to make a format even less readable than XML
- This format was never meant to be read by humans. In practice you rarely need to look at it
  It is designed so that it does not interfere with the actual content, and that content can be any text from the internet
  To do that you need a format that is not used anywhere else
- Agreed. It does not look like an optimal format in terms of memory efficiency either
In my view the biggest gap right now is that there is no way to define a model architecture without hardcoding it into the current build
It does not have to reach 1:1 performance parity with fully supported models
But whether or not there is vendor-validated, proper support from the first day of release decides whether a model feels great or terrible. The recent Gemma and Qwen releases are examples
I am not sure what the solution is, but one could write a DSL for describing the model graph and store it in the GGUF
Another option would be to read the PyTorch modules of the official model release and somehow translate them into GGML operations
- The GGUF spec deliberately left a little room for including a computation graph, in the hope that someone would take it further
  I wanted to put it in the first version, but the priority at the time was to get a minimum viable spec out and implemented
  I would still like to see it, but it needs a champion who understands the current state of the GGML IR very well
- The computation graph could be embedded inside the weights file, as ONNX does
  A common interface taking common parameters could then be exposed, with extra custom parameters placed in extensions, as in Wayland
  That would allow support not only for transformer families such as LLaMa but also for recurrent neural network families such as RWKV, multimodal models, and so on
  I do not know what the actual implementation would look like, but the idea sounds great. My one worry is that if the computation graph is fixed inside the model file, architecture improvements or optimizations that are possible without changing the weights could not be applied to old files without conversion
> The neatest thing about GGUF is that it is a single file. Compared with a typical Hugging Face safetensors repository, where the required JSON files are scattered around [...]
Interestingly, for me AI models have "always" been a single file. That was the standard in the local image generation world
safetensors files can also hold a lot inside them, so GGUF is not strictly required for this
But the text encoders of modern models are themselves multi-GB language models, so nobody keeps a duplicate copy of them in every checkpoint
- Single-file distribution was a deliberate design goal of mine
  Most image models were single-file then, or still are, but LLM safetensors were not, at least at the time, and I wanted to enforce it at the structural level
  I also did not want to impose a JSON reader dependency on runners such as llama.cpp, which the ST approach would have required
  And the bigger problem, if I remember correctly, was that ST could not support GGML's new quantization formats at the time, whereas having our own file format gave a flexibility that was hard to get from ST
- The claim that "in local image generation, AI models were always a single file" does not really hold even in that field
  Actually running the architecture with the weights takes more than a single weight file; it also needs several encoders, decoders, and so on
  Your tool may hide that, but those pieces are still there under the surface
About the slightly odd llama_chat_apply_template, which is exposed in the libllama API and covers a few chat formats hardcoded directly in C++: as someone who tinkers with desktop inference apps using FLTK[0], I would have liked it to use llama.cpp's real Jinja2 template parser
Or at least for there to be another C function that does that job. Correct parsing may require passing in various kinds of data, for example so the template knows whether tool calling is available
For now I am using this stopgap function, but in the end I will probably have to use a Jinja2 interpreter myself or lift the code from llama.cpp and wire it in
Even so, GGUF's all-in-one approach is very convenient. And I agree that the projection model being a separate file feels odd
When I first grabbed a vision-capable model, I downloaded only the GGUF that looked right, but llama.cpp said it could not process the model, and it took me quite a while to realize an extra file was needed
What literally went through my head at the time was: "Wasn't GGUF supposed to be the format that bundles everything?" :-P
[0] https://i.imgur.com/GiTBE1j.png
I have always used the Hugging Face repository style of safetensors plus metadata files
It is not a major inconvenience, but GGUF's more compact format and good support seem like a good thing
Looking at what GGUF does not have yet, I ended up learning more about GGUF itself
The tool-calling format seems very natural, and it could become an important milestone on the way from LLMs to agents
I recently downloaded TheBloke's 7B Mistral to try it out, and I have a 4070
- I like Mistral, but that model is no longer the best
  Gemma 4 e4b is worth a try. It is around the size of Mistral 7B and will run well on a 4070
  The name "E4B" can be a little confusing
- 7B Mistral is quite old by now
  On a 12GB 4070 you can run Qwen 3.5 9B q4km or Qwen 3.6 35B. The second is much smarter, but very slow because of memory offloading
  Try both in LM Studio; what they can do is genuinely surprising
- I have seen it run very fast and well even on a 2070
  I like TheBloke; I wish he were still making models
