microgpt

(karpathy.github.io)

Posted by GN⁺, 2026-03-02

A minimal language model that implements the full GPT training and inference pipeline in a single pure-Python file, within 200 lines
Everything is included: dataset, tokenizer, autograd engine, a GPT-2-like neural network, the Adam optimizer, and the training and inference loops
It trains on a names dataset to generate new names, and demonstrates the core principles of GPT through a hand-written autograd engine and Transformer
Unlike large LLMs, it runs on pure Python with no dependencies, leaving only the algorithmic essence
Understanding microgpt gives an understanding of the basic algorithmic structure of large models such as ChatGPT

microgpt overview

microgpt is a minimal GPT implementation written in 200 lines of Python, with no external library dependencies
- It includes the dataset, tokenizer, autograd, a GPT-2-like architecture, the Adam optimizer, and the training and inference loops
It is an art project by Karpathy, built to reduce an LLM to its bare essentials, and it continues the series that includes micrograd, makemore, and nanogpt
The full code is available as a GitHub Gist, on a web page, and in Google Colab

dataset

A text file of about 32,000 names is used, with one name per line
Each name is treated as a document, and the model learns the patterns in these documents to generate new names
Examples generated after training: kamon, ann, karai, jaire, vialan, and so on

tokenizer

A simple character-level tokenizer that assigns an integer ID to each unique character
27 tokens in total: the letters a–z plus a BOS (beginning of sequence) token
Each document is wrapped as [BOS, e, m, m, a, BOS] for training
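
The tokenizer described above can be sketched in a few lines. This is an illustrative version, not the code from the post: the three-name corpus, the helper names `encode` and `decode`, and the choice to give BOS the next free ID are all assumptions.

```python
# Hypothetical toy corpus standing in for the ~32,000-name file.
docs = ["emma", "olivia", "ava"]

# One integer ID per unique character, sorted so the mapping is deterministic.
chars = sorted(set("".join(docs)))
stoi = {ch: i for i, ch in enumerate(chars)}

# Assumption: the BOS token takes the next free ID after the characters.
BOS = len(chars)
vocab_size = len(chars) + 1


def encode(doc):
    """Wrap a document as [BOS, c1, c2, ..., BOS]."""
    return [BOS] + [stoi[ch] for ch in doc] + [BOS]


def decode(ids):
    """Map IDs back to text, dropping BOS markers."""
    return "".join(chars[i] for i in ids if i != BOS)


tokens = encode("emma")
```

With the full a–z alphabet the same construction gives 26 character IDs plus BOS, which is where the total of 27 tokens comes from.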

automatic differentiation (Autograd)

The Value class tracks scalar values and their gradients, and builds a computation graph
It stores the local gradients of basic operations such as addition, multiplication, power, log, exp, and ReLU
The backward() method performs backpropagation by applying the chain rule
It is the same algorithm as PyTorch's .backward(), implemented directly at the scalar level
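
A minimal sketch of such a scalar autograd node follows. It covers only a subset of the operations listed above (addition, multiplication, log, ReLU), and the attribute names `_children` and `_local_grads` are assumptions made for illustration, not necessarily those used in microgpt.

```python
import math


class Value:
    """Scalar node in a computation graph (illustrative sketch)."""

    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._children = children        # inputs that produced this node
        self._local_grads = local_grads  # d(this)/d(child) for each input

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def log(self):
        return Value(math.log(self.data), (self,), (1.0 / self.data,))

    def relu(self):
        return Value(max(0.0, self.data), (self,),
                     (1.0 if self.data > 0 else 0.0,))

    def backward(self):
        # Topologically sort the graph, then apply the chain rule
        # from the output back to the leaves.
        topo, visited = [], set()

        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build(child)
                topo.append(v)

        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            for child, local in zip(v._children, v._local_grads):
                child.grad += local * v.grad


a, b = Value(2.0), Value(3.0)
c = (a * b + a).relu()  # c = relu(a*b + a)
c.backward()            # dc/da = b + 1, dc/db = a
```

Gradients are accumulated with `+=` because a node such as `a` can feed into the output along more than one path.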

parameter initialization

The model has 4,192 parameters
They make up the embedding tables, attention weights, MLP weights, output projection, and so on
Every parameter is initialized with random values drawn from a Gaussian distribution
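
The parameter count can be checked with a short sketch. The sizes below (embedding width 16, context length 16, one layer, a 4x-wide MLP, no bias terms), the parameter names, and the standard deviation 0.02 are assumptions not stated in this summary; they are chosen because they reproduce the quoted total.

```python
import random

random.seed(0)

# Assumed hyperparameters (not given in the summary).
vocab_size, block_size, n_embd, n_layer = 27, 16, 16, 1


def matrix(rows, cols, std=0.02):
    """rows x cols matrix of Gaussian random values (std is an assumption)."""
    return [[random.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


params = {
    "wte": matrix(vocab_size, n_embd),      # token embeddings:      27 * 16
    "wpe": matrix(block_size, n_embd),      # positional embeddings: 16 * 16
    "lm_head": matrix(vocab_size, n_embd),  # output projection:     27 * 16
}
for i in range(n_layer):
    for name in ("wq", "wk", "wv", "wo"):   # attention: 4 * (16 * 16)
        params[f"layer{i}.attn_{name}"] = matrix(n_embd, n_embd)
    params[f"layer{i}.mlp_fc1"] = matrix(4 * n_embd, n_embd)  # 64 * 16
    params[f"layer{i}.mlp_fc2"] = matrix(n_embd, 4 * n_embd)  # 16 * 64

num_params = sum(len(row) for m in params.values() for row in m)
```

Under these assumptions the total is 432 + 256 + 432 + 1,024 + 2,048 = 4,192.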

model architecture

A simplified GPT-2 architecture that uses RMSNorm, ReLU, and residual connections
Main components:
- Embedding stage: token and positional embeddings are added together
- Multi-head attention: computes the Q, K, V vectors, then uses a KV cache to draw on information from earlier tokens
- MLP block: a 2-layer feedforward network that performs per-position (local) computation
- Output stage: produces logits over the vocabulary (size 27)
The KV cache stays active during training as well, and backpropagation flows through the cache
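
To make the components above concrete, here is a sketch of RMSNorm and of a single attention head that processes one token at a time against a KV cache. It uses plain floats rather than autograd values, omits the learned projection matrices, and the function names are assumptions for illustration.

```python
import math


def rmsnorm(x, eps=1e-5):
    """Scale a vector so that its mean square is approximately 1."""
    ms = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(ms + eps)
    return [v * scale for v in x]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, k, v, keys, values):
    """One attention head for one new token; keys/values act as the KV cache."""
    keys.append(k)
    values.append(v)
    d = len(q)
    # Scaled dot product of the query with every cached key.
    scores = [sum(qi * ki for qi, ki in zip(q, kt)) / math.sqrt(d)
              for kt in keys]
    weights = softmax(scores)
    # Weighted sum of the cached value vectors.
    return [sum(w * vt[j] for w, vt in zip(weights, values))
            for j in range(len(v))]


keys, values = [], []
out1 = attend([1.0, 0.0], [1.0, 0.0], [1.0, 2.0], keys, values)
out2 = attend([0.0, 1.0], [0.0, 1.0], [3.0, 4.0], keys, values)
```

Because the cache only ever holds the current and earlier tokens, each position attends to the past without an explicit causal mask.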

training loop

At each step, one document is picked and tokenized as [BOS, ..., BOS]
The model predicts the probability of the next token, and the cross-entropy loss is computed
After loss.backward() computes the gradients, the Adam optimizer updates the parameters
The learning rate decreases by linear decay
Over 1,000 steps, the loss falls from about 3.3 to 2.37
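
The sketch below illustrates two pieces of this loop. First, the cross-entropy of a uniform guess over 27 tokens is ln 27, about 3.296, which is consistent with the starting loss of about 3.3. Second, it shows an Adam update with a linearly decaying learning rate, applied to a toy one-parameter problem instead of the model. The function name `adam_step` and the hyperparameter values are assumptions (common defaults), not the settings used in microgpt.

```python
import math

# Loss of a model that assigns probability 1/27 to every token.
uniform_loss = -math.log(1.0 / 27)


def adam_step(params, grads, m, v, step, num_steps,
              lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One in-place Adam update with linear learning-rate decay (step is 0-based)."""
    lr_t = lr * (1.0 - step / num_steps)
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1 - beta1) * g      # first-moment estimate
        v[i] = beta2 * v[i] + (1 - beta2) * g * g  # second-moment estimate
        m_hat = m[i] / (1 - beta1 ** (step + 1))   # bias correction
        v_hat = v[i] / (1 - beta2 ** (step + 1))
        params[i] -= lr_t * m_hat / (math.sqrt(v_hat) + eps)


# Toy problem: minimise f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, m, v = [0.0], [0.0], [0.0]
num_steps = 1000
for step in range(num_steps):
    grads = [2.0 * (w[0] - 3.0)]
    adam_step(w, grads, m, v, step, num_steps, lr=0.05)
```

In the real training loop the gradients would come from loss.backward() over all model parameters rather than from a closed-form derivative.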

inference

Once training is complete, a new name is generated starting from the BOS token
At each step, the next token is sampled according to the softmax probabilities
The temperature value controls creativity (lower is more conservative, higher is more diverse)
Example output: kamon, ann, karai, jaire, vialan, karia, yeran, anna, and so on
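
The effect of temperature can be seen in a small sketch. The logits here are made up for a hypothetical 3-token vocabulary, and the helper names are assumptions; in the real model the logits would come from the output stage.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by the temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample(logits, temperature=1.0, rng=random):
    """Draw one token ID in proportion to its softmax probability."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.0]                 # hypothetical logits
cold = softmax(logits, temperature=0.5)  # sharper: favours the top token
hot = softmax(logits, temperature=2.0)   # flatter: more diverse samples

rng = random.Random(0)
token = sample(logits, temperature=0.5, rng=rng)
```

Generation repeats this sampling step, feeding each sampled token back in, until BOS is sampled again to mark the end of the name.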

How to run

Only Python is needed to run it (python train.py)
Training finishes in about 1 minute, and the loss is printed at every step
It can be run in exactly the same way in the Colab notebook

code development stages

The code is built up step by step from train0.py to train5.py
- Bigram → MLP → Autograd → Attention → Multi-head → Adam
Each stage can be viewed in the revisions of build_microgpt.py in the Gist
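
For orientation, the first stage in this progression, a bigram model, can be sketched as a table of next-character counts. The contents of train0.py are not described in this summary, so the code below is a generic count-based bigram on a hypothetical toy corpus, not a reproduction of that file.

```python
import random
from collections import defaultdict

docs = ["emma", "ava", "anna"]  # hypothetical toy corpus
BOS = "<BOS>"

# counts[prev][nxt] = how often character nxt follows character prev.
counts = defaultdict(lambda: defaultdict(int))
for doc in docs:
    seq = [BOS] + list(doc) + [BOS]
    for prev, nxt in zip(seq, seq[1:]):
        counts[prev][nxt] += 1


def generate(rng, max_len=20):
    """Sample characters from the bigram counts until BOS comes up again."""
    out, prev = [], BOS
    for _ in range(max_len):
        followers = counts[prev]
        nxt = rng.choices(list(followers),
                          weights=list(followers.values()), k=1)[0]
        if nxt == BOS:
            break
        out.append(nxt)
        prev = nxt
    return "".join(out)


name = generate(random.Random(0))
```

A bigram model conditions only on the previous character; the later stages replace the count table with learned parameters and a longer context.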

Differences from real LLMs

Data: 32K names in microgpt vs. trillions of tokens in real LLMs
Tokenizer: character level vs. BPE-based subwords
Autograd: scalar-based Python vs. GPU tensor operations
Architecture: 4K parameters vs. hundreds of billions of parameters
Training: one document at a time vs. large-scale batches and mixed-precision training
Optimization: simple Adam vs. carefully tuned hyperparameters and scheduling
Post-training: real LLMs go through SFT and RL stages to reach a ChatGPT-like form
Inference infrastructure: GPU distribution, KV cache management, quantization, speculative decoding, and so on

FAQ summary

The model is a mathematical function that maps input tokens to a probability distribution over the next token
It has no “understanding”; it makes predictions through statistical pattern learning
It implements the same token-prediction loop as ChatGPT, in miniature
“Hallucination” is a natural consequence of probabilistic sampling
It is slow, but it fully reproduces the core algorithm of an LLM
For better results, the number of training steps, the model size, and the dataset can be adjusted
With a different dataset it can learn other patterns, such as city names, Pokémon names, or poems

microgpt is an educational and experimental model that implements all the core algorithms of an LLM in minimal form, and it shows clearly how large language models work.