Gemma 4 Visual Guide

(newsletter.maartengrootendorst.com)


Gemma 4, released by Google DeepMind, is a family of multimodal LLMs comprising four models (E2B, E4B, 31B, and 26B A4B), all of which accept image input.
All models share a structure that interleaves local (sliding-window) attention layers with global attention layers, and the final layer is always global attention.
The global attention layers combine three efficiency techniques: Grouped Query Attention (GQA), K=V, and p-RoPE, which together save both memory and compute.
The small models (E2B, E4B) use Per-Layer Embeddings (PLE) to keep large embedding tables in flash memory and minimize VRAM use; they also add an audio encoder.
With a ViT-based vision encoder that supports variable aspect ratios and resolutions, and an MoE architecture (26B A4B), Gemma 4 covers a broad range of uses from on-device to large-scale inference.

Structure of the Gemma 4 family

The family has four models and uses both dense and MoE architectures.
- Gemma 4 - E2B: uses Per-Layer Embeddings, 2 billion effective parameters
- Gemma 4 - E4B: uses Per-Layer Embeddings, 4 billion effective parameters
- Gemma 4 - 31B: dense model with 31 billion parameters
- Gemma 4 - 26B A4B: MoE model with 26 billion total parameters, of which only 4 billion are active at inference time
All models are multimodal and can process image inputs of varying size and resolution.
The small models (E2B, E4B) support audio input in addition to images and text.

Gemma 4's shared architecture

Interleaving of attention layers

As in Gemma 3, local (sliding-window) attention layers alternate with global attention layers.
- sliding window attention: attends only to tokens within a fixed range → less computation
- global attention: attends to the entire sequence → can capture the structure of the full context
Sliding window size:
- small models (E2B, E4B): 512 tokens
- large models (26B A4B, 31B): 1024 tokens
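The difference between the two attention types can be sketched as boolean masks. This is a minimal illustration that assumes causal (decoder-style) attention and uses a toy window of 3 rather than 512 or 1024 tokens; the function names and the exact window convention are assumptions, not taken from the Gemma 4 code.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query i may attend to key j.

    Assumes causal attention: a token sees itself and the
    (window - 1) tokens immediately before it.
    """
    return [[0 <= i - j < window for j in range(seq_len)]
            for i in range(seq_len)]


def global_mask(seq_len):
    """Causal global attention: a token sees itself and every earlier token."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


local = sliding_window_mask(seq_len=6, window=3)
full = global_mask(seq_len=6)

# Number of keys the last token attends to under each scheme.
print(sum(local[5]), sum(full[5]))
```

The local mask costs a fixed number of keys per query regardless of sequence length, while the global mask grows with the sequence.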
In Gemma 3 the final layer was local attention in some cases, but in Gemma 4 the final layer is always global attention.
Interleaving ratio:
- E2B: 4:1 pattern (4 local attention layers + 1 global attention layer)
- other models: 5:1 pattern (5 local layers + 1 global layer)
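The interleaving rule can be written out as a small helper. This is a sketch under assumptions: `layer_types` is a hypothetical name, and the exact placement of layers in the released models is not given here. The layer counts used (35 for E2B, 60 for 31B) are the ones quoted elsewhere in this summary.

```python
def layer_types(num_layers, locals_per_global):
    """Return 'local' or 'global' for each layer: N local layers, then 1 global.

    The last layer is forced to 'global' to match the rule described
    above, even when the repeating pattern would end on a local layer.
    """
    period = locals_per_global + 1
    types = ["global" if (i + 1) % period == 0 else "local"
             for i in range(num_layers)]
    types[-1] = "global"
    return types


e2b = layer_types(35, 4)     # 4:1 pattern
dense = layer_types(60, 5)   # 5:1 pattern
print(e2b[:5], e2b.count("global"), dense.count("global"))
```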

Efficiency of global attention

GQA (Grouped Query Attention)

In the global attention layers, 8 query heads share 1 KV head, which greatly reduces KV cache storage.
To offset the quality loss from having fewer KV heads, the Key dimension is doubled.
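As a rough illustration of the 8-to-1 sharing, the mapping from query heads to KV heads is just an integer division. The head count of 16 below is hypothetical and chosen only to keep the output short.

```python
def kv_head_for(query_head, queries_per_kv=8):
    """Index of the KV head that a given query head reads from."""
    return query_head // queries_per_kv


num_query_heads = 16  # hypothetical head count, for illustration only
mapping = [kv_head_for(h) for h in range(num_query_heads)]
num_kv_heads = len(set(mapping))
print(mapping, num_kv_heads)
```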

K=V technique

In the global attention layers, Keys and Values are set equal, which further reduces the KV cache memory requirement.
The technique improves memory efficiency without a large impact on performance.
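The combined effect on the KV cache can be estimated with simple counting. The sketch below uses hypothetical sizes (16 query heads, head dimension 128) and assumes that only the Key dimension is doubled and that, with K=V, a single tensor is cached for both roles. It counts cached numbers per token for one layer, not bytes.

```python
def kv_cache_values_per_token(num_query_heads, head_dim,
                              queries_per_kv=1, key_dim_scale=1,
                              share_k_and_v=False):
    """Count the numbers cached per token for one attention layer."""
    num_kv_heads = num_query_heads // queries_per_kv
    keys = num_kv_heads * head_dim * key_dim_scale
    # With K=V the cached keys double as values, so nothing extra is stored.
    values = 0 if share_k_and_v else num_kv_heads * head_dim
    return keys + values


HEADS, DIM = 16, 128  # hypothetical sizes, not Gemma 4's real configuration
mha = kv_cache_values_per_token(HEADS, DIM)
gqa = kv_cache_values_per_token(HEADS, DIM, queries_per_kv=8)
gqa_wide = kv_cache_values_per_token(HEADS, DIM, queries_per_kv=8, key_dim_scale=2)
gqa_wide_kv = kv_cache_values_per_token(HEADS, DIM, queries_per_kv=8,
                                        key_dim_scale=2, share_k_and_v=True)
print(mha, gqa, gqa_wide, gqa_wide_kv)
```

Under these assumptions, sharing K and V offsets the extra storage introduced by the wider keys.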

p-RoPE

RoPE (rotary positional encoding) is applied to only a subset of dimensions rather than all of them (with p=0.25, only the top 25% of pairs are rotated).
The low-frequency pairs are left to preserve semantic information rather than positional information.
This is particularly effective at reducing the distortion of token-to-token distance that long contexts cause in global attention.
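A minimal sketch of partial rotation follows. It assumes the standard RoPE frequency schedule (pair i uses base^(-2i/d), so the first pairs are the highest-frequency ones), adjacent-element pairing, and a base of 10000; which pairs Gemma 4 actually rotates and which base it uses are assumptions here.

```python
import math


def p_rope(vec, position, p=0.25, base=10000.0):
    """Rotate only the first p fraction of (even, odd) pairs of vec.

    The remaining low-frequency pairs are returned unchanged, so they
    carry no positional signal.
    """
    d = len(vec)
    rotated_pairs = int((d // 2) * p)
    out = list(vec)
    for i in range(rotated_pairs):
        theta = position * base ** (-2 * i / d)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x * math.cos(theta) - y * math.sin(theta)
        out[2 * i + 1] = x * math.sin(theta) + y * math.cos(theta)
    return out


q = [1.0] * 16                 # 8 pairs, so p=0.25 rotates the first 2
q_rot = p_rope(q, position=5)
print(q_rot[:4], q_rot[4:])
```

Because each pair is rotated rather than scaled, the vector's norm is unchanged; only the first four entries differ from the input.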
Summary of the changes applied to the global attention layers:
- the final layer is always global attention
- 1 Key shared per 8 queries
- Key dimension doubled
- Keys = Values
- p-RoPE applied with p=0.25

Vision encoder

Based on the Vision Transformer (ViT), which converts an image into a sequence of patches and produces embeddings from them.
- each patch is 16×16 pixels
The small models (E2B, E4B) have a vision encoder with 150 million parameters, while the other models use one with 550 million parameters.

Variable aspect ratio support

A conventional ViT is fixed to square inputs → changing the aspect ratio causes problems for the positional encoding.
Gemma 4 adds 2D RoPE: each patch embedding is split into two halves, which encode horizontal (w) and vertical (h) position independently.
The input image is adaptively resized to fit a grid of 16×16-pixel patches, and any part that does not fit exactly is padded.
The variable-length patch sequence is then reduced to a fixed number of patch embeddings through pooling based on spatial position.
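The padding step and the (w, h) indices that a 2D positional encoding would consume can be sketched as follows. This omits the adaptive resizing entirely and simply pads up to the next multiple of 16; the helper names are hypothetical.

```python
import math

PATCH = 16  # patch side in pixels, as stated above


def patch_grid(width, height, patch=PATCH):
    """Return (cols, rows, pad_right, pad_bottom) for an image of this size."""
    cols = math.ceil(width / patch)
    rows = math.ceil(height / patch)
    return cols, rows, cols * patch - width, rows * patch - height


def patch_positions(cols, rows):
    """(w, h) index of every patch in row-major order."""
    return [(w, h) for h in range(rows) for w in range(cols)]


cols, rows, pad_r, pad_b = patch_grid(100, 50)   # a non-square example
positions = patch_positions(cols, rows)
print(cols, rows, pad_r, pad_b, len(positions))
```

Keeping w and h as separate indices is what lets a non-square grid be encoded without pretending the image is square.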

Variable resolution support (soft token budget)

Gemma 4 introduces the concept of a soft token budget: a cap on the number of patch embeddings passed to the LLM.
- budgets the user can choose from: 70, 140, 280, 560, 1120 tokens
The higher the budget (e.g. 1120), the more resolution is preserved; with a low budget (e.g. 70) the image is downscaled.
Example: with a budget of 280, the maximum number of patches = 9 × 280 = 2,520 (average pooling is applied over 3×3 blocks).
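The budget arithmetic and the 3×3 average pooling can be checked with a toy example. Scalars stand in for patch embeddings, and the grid sides are assumed to be multiples of 3; both helpers are illustrative names rather than part of any Gemma 4 API.

```python
def max_patches(budget, block=3):
    """Patches allowed before pooling: each token covers block * block patches."""
    return budget * block * block


def avg_pool_grid(grid, block=3):
    """Average-pool a 2D grid over non-overlapping block x block tiles."""
    pooled = []
    for r in range(0, len(grid), block):
        row = []
        for c in range(0, len(grid[0]), block):
            tile = [grid[r + dr][c + dc]
                    for dr in range(block) for dc in range(block)]
            row.append(sum(tile) / len(tile))
        pooled.append(row)
    return pooled


grid = [[float(r * 6 + c) for c in range(6)] for r in range(6)]  # 36 patches
tokens = avg_pool_grid(grid)                                      # 2 x 2 tokens
print(max_patches(280), tokens)
```

Here 36 patches collapse to 4 tokens, the same 9-to-1 ratio that turns 2,520 patches into 280 tokens.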

Linear projection

The vision encoder's output embeddings differ from the LLM's token embeddings in dimension and distribution, so they are projected through a small neural network.
RMSNorm is applied after the projection so that the result matches the scale expected by the Transformer blocks that follow.
The linear projection layer is trained together with Gemma 4 so that the patch embeddings are optimized toward the values the LLM expects.
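A toy version of projection followed by RMSNorm is shown below. It assumes a single bias-free linear layer and a unit gain, and the 2-to-3 dimension change is purely illustrative; the real layer sizes and weights are not given here.

```python
import math


def linear(x, weight):
    """y = W x for a weight matrix stored as a list of rows (no bias)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def rms_norm(x, gain=None, eps=1e-6):
    """Divide by the root-mean-square of x, then apply an optional gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    gain = gain or [1.0] * len(x)
    return [g * v / rms for g, v in zip(gain, x)]


patch_embedding = [3.0, 4.0]                  # toy vision-encoder output
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # toy 2 -> 3 projection
projected = rms_norm(linear(patch_embedding, W))
print(projected)
```

After normalization the mean of the squared entries is approximately 1, whatever scale the encoder produced.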

Gemma 4 - 31B (Dense)

A dense-architecture model with 31 billion parameters, and the closest of the Gemma 4 variants to the baseline structure.
Structurally it resembles Gemma 3's 27B model, but with Gemma 4's shared improvements such as K=V and p-RoPE applied.
The number of layers is reduced from 62 to 60, while each layer is made wider.

Gemma 4 - 26B A4B (Mixture of Experts)

Although it has 26 billion parameters in total, only 4 billion (the active parameters) are used at inference time, so it can run at roughly the speed of a 4B model.
MoE (Mixture of Experts) structure: instead of one large FFNN, the model has many small FFNNs (Experts), only some of which are activated depending on the input.
- of 128 Experts in total, 8 are selected and activated at inference time
- 1 shared Expert is always active: it handles general knowledge and is 3 times the size of the other Experts
The Router produces Expert selection probabilities for each input token and routes accordingly, and the outputs of the selected Experts are weighted by those probabilities.
All parameters are loaded into memory, but actual computation uses only 8 Experts + 1 shared Expert → the remaining 119 stay on standby.
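The routing step can be sketched with scalar "tokens" and trivial experts. This is an illustration under assumptions: the softmax is renormalized over the chosen experts, the shared expert's output is added unweighted, and the experts are toy functions; none of these details are confirmed here for Gemma 4.

```python
import math


def route(logits, top_k=8):
    """Pick the top_k experts and return {expert_index: weight}."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)  # renormalize over chosen experts (assumption)
    return {i: probs[i] / norm for i in chosen}


def moe_layer(x, router_logits, experts, shared_expert, top_k=8):
    """Weighted sum of the selected experts plus the always-on shared expert."""
    out = shared_expert(x)
    for i, w in route(router_logits, top_k).items():
        out += w * experts[i](x)
    return out


# Toy setup: expert i multiplies its input by i; the router prefers high indices.
experts = [lambda x, i=i: i * x for i in range(128)]
shared = lambda x: 3.0 * x
logits = [float(i) for i in range(128)]
weights = route(logits)
y = moe_layer(1.0, logits, experts, shared)
print(sorted(weights), y)
```

Only the 8 selected expert functions and the shared one are ever called for this token, which is why the compute cost tracks the active parameters rather than the total.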

Gemma 4 - E2B & E4B (Dense + Per-Layer Embeddings)

Per-Layer Embeddings (PLE)

A separate embedding lookup table is added for each layer and kept outside the model itself, so that VRAM use stays minimal on small devices.
For E2B: a PLE table of 262,144 tokens × 35 layers × 256 dimensions → stored in flash memory.
When inference starts, the per-layer embeddings of the input tokens are looked up only once → no further lookup is needed at each layer afterwards.
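To put the table size in perspective, and to show what "look up once, use at every layer" might look like, here is a small sketch. The nested-list layout `table[token][layer]` is an assumption standing in for the flash-resident table, and `lookup_per_layer` is a hypothetical helper.

```python
VOCAB, LAYERS, PLE_DIM = 262_144, 35, 256   # E2B figures quoted above

entries = VOCAB * LAYERS * PLE_DIM
print(f"{entries:,} table entries")


def lookup_per_layer(table, token_ids):
    """Fetch every layer's embedding for each input token in a single pass.

    Returns per_layer[layer][position], so each decoder layer can later
    index its own slice without touching the table again.
    """
    num_layers = len(table[0])
    return [[table[t][layer] for t in token_ids] for layer in range(num_layers)]


# Toy table: 4 tokens, 3 layers, 2-dim embeddings.
toy = [[[t + 0.1 * l, t - 0.1 * l] for l in range(3)] for t in range(4)]
per_layer = lookup_per_layer(toy, token_ids=[2, 0, 3])
```

The E2B table works out to about 2.35 billion entries, which is why keeping it out of VRAM matters on small devices.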
Between decoder blocks, a gating function determines the embedding weights, and the result is projected to the original embedding size (E2B: 256→1536, E4B: 256→2560).
The projected embeddings are normalized and then added to the output of the previous decoder block → this lets the model keep referring back to the token's meaning.
The "E" stands for effective parameters, which do not include the PLE.

Audio encoder

Added only to the small models (E2B, E4B), and used for tasks such as automatic speech recognition and translation.
Audio processing has 3 stages:
1. Feature extraction: raw audio → mel-spectrogram (a 2D time × frequency representation)
2. Chunk grouping: mel features are grouped into chunks, forming the starting point of the token sequence
3. Downsampling: 2 layers of 2D convolution shorten the sequence and produce soft tokens
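The effect of the downsampling stage on sequence length can be estimated with the standard convolution output-length formula. The kernel size, stride, and padding below are assumptions chosen for illustration (the summary does not state them), as is the frame count.

```python
def conv_out_len(length, kernel=3, stride=2, padding=1):
    """Standard output-length formula for a convolution along one axis."""
    return (length + 2 * padding - kernel) // stride + 1


def downsampled_len(num_frames, num_layers=2, **conv_kwargs):
    """Sequence length after stacking num_layers strided convolutions."""
    for _ in range(num_layers):
        num_frames = conv_out_len(num_frames, **conv_kwargs)
    return num_frames


frames = 100                          # hypothetical number of mel frames
soft_tokens = downsampled_len(frames)
print(frames, "->", soft_tokens)
```

With stride 2 in both layers, the sequence shrinks by roughly a factor of four along the time axis.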
The audio encoder is a Conformer: a standard Transformer encoder with an added convolution module.
Like the vision encoder's output, the Conformer's output embeddings are mapped into Gemma 4's embedding space through a linear projection.
