microgpt - GPT training and inference implemented in 200 lines of pure Python

(karpathy.github.io)

69 points by GN⁺ 2026-02-17

An art project released by karpathy: a single-file, 200-line implementation of the complete GPT algorithm with no external dependencies
It differs from production LLMs only in scale and efficiency; the core is the same, so understanding this code means understanding the algorithmic essence of GPT
It includes the dataset, tokenizer, autograd engine, a GPT-2-like Transformer architecture, the Adam optimizer, and the training and inference loops
It distills earlier projects such as micrograd, makemore, and nanogpt, along with 10 years of work on simplifying LLMs, into a minimal form of GPT that cannot be simplified any further
It trains on a dataset of 32,000 names and generates plausible new names, performing every calculation directly with scalar-level autograd
The training process consists of loss calculation → backpropagation → Adam update, and can run in about 1 minute

microgpt overview

microgpt is a 200-line Python script that fully implements the training and inference process of a GPT model
- It includes the dataset, tokenizer, autograd, model, optimizer, and training loop, all without external libraries
It combines existing projects such as micrograd, makemore, and nanogpt and organizes them into a single file
The implementation is stripped down to the algorithmic core, to the point where it “cannot be simplified any further”
The full code is available as a GitHub Gist, as a web page, and on Google Colab

dataset structure

The fuel of large language models is a stream of text data; production systems use web pages from the internet, while microgpt uses a simple example of 32,000 names, one per line
Each name is treated as a "document", and the model's goal is to learn the statistical patterns in the data and generate similar new documents
After training, the model "hallucinates" plausible new names such as "kamon", "karai", and "vialan"
From ChatGPT's point of view, a conversation with a user is also just an "odd-looking document"; when the document is initialized with a prompt, the model's response is a statistical document completion

tokenizer

Neural networks operate on numbers rather than characters, so text has to be converted into a sequence of integer token IDs and then converted back
Production tokenizers such as tiktoken (used by GPT-4) work on chunks of characters for efficiency, but the simplest tokenizer assigns one integer to each unique character in the dataset
The lowercase letters a-z are sorted and each character's index becomes its ID; the integer value has no meaning in itself, and each token is just a separate discrete symbol
A special BOS (Beginning of Sequence) token is added as a delimiter that signals "a new document starts or ends here"; "emma" is wrapped as [BOS, e, m, m, a, BOS]
The final vocabulary size is 27 (26 lowercase letters + 1 BOS)
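The character-level scheme above can be sketched in a few lines. This is a minimal illustration, not the code from the post: the three-name corpus and the names `stoi`, `itos`, `encode`, and `decode` are assumptions made for the example. With the full names dataset the same construction would give the 27-entry vocabulary described above.

```python
# Toy corpus standing in for the names dataset (hypothetical sample).
docs = ["emma", "olivia", "ava"]

# Every unique character gets an integer ID, assigned in sorted order.
chars = sorted(set("".join(docs)))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

# One extra ID is reserved for the BOS delimiter.
BOS = len(chars)
vocab_size = len(chars) + 1


def encode(doc):
    """Wrap a document in BOS on both sides and map characters to IDs."""
    return [BOS] + [stoi[ch] for ch in doc] + [BOS]


def decode(ids):
    """Map IDs back to characters, dropping the BOS delimiters."""
    return "".join(itos[i] for i in ids if i != BOS)


tokens = encode("emma")
print(tokens)          # token IDs for [BOS, e, m, m, a, BOS]
print(decode(tokens))  # round-trips back to the original string
```

The IDs carry no ordering information for the model; sorting the characters only makes the mapping reproducible.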

automatic differentiation (Autograd)

Training a neural network requires gradients: for each parameter we need to know, "if this value is nudged up slightly, does the loss go up or down, and by how much?"
The computation graph has many inputs (model parameters and input tokens) but converges to a single scalar output, the loss
Backpropagation starts at the output and follows the graph in reverse, relying on the chain rule of calculus to compute the gradient of the loss with respect to every input
This is implemented with the Value class: each Value wraps a single scalar (.data) and tracks how it was computed
- For operations such as addition and multiplication, the new Value remembers its inputs (_children) and the local derivatives of that operation (_local_grads)
- Example: __mul__ records ∂(a·b)/∂a = b and ∂(a·b)/∂b = a
Supported operations: addition, multiplication, power, log, exp, ReLU
The backward() method traverses the graph in reverse topological order and applies the chain rule at each step
- It starts at the loss node with self.grad = 1 (∂L/∂L = 1)
- Local gradients are multiplied along each path and propagated down to the parameters
Gradients are accumulated with += (not assigned): when the graph branches, gradients flow back independently from each branch and must be summed (a consequence of the multivariable chain rule)
Algorithmically this is the same as PyTorch's .backward(), but it works on scalars instead of tensors, which makes it much simpler and much less efficient
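A stripped-down sketch of such a Value class is shown here, supporting only addition and multiplication. The attribute names `.data`, `.grad`, `_children`, and `_local_grads` follow the description above; the constructor signature and everything else are assumptions for illustration, not the code from the post.

```python
class Value:
    """A scalar that remembers how it was computed (assumed minimal layout)."""

    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._children = children        # input Values of the operation
        self._local_grads = local_grads  # d(output)/d(input) for each child

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        # d(a+b)/da = 1, d(a+b)/db = 1
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        # d(a*b)/da = b, d(a*b)/db = a
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Build a topological order: every node appears after its children.
        topo, visited = [], set()

        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build(child)
                topo.append(v)

        build(self)
        self.grad = 1.0  # dL/dL = 1
        # Walk from the output back to the inputs, applying the chain rule.
        for v in reversed(topo):
            for child, local in zip(v._children, v._local_grads):
                child.grad += local * v.grad  # accumulate, do not assign


a = Value(2.0)
b = Value(3.0)
loss = a * b + a  # `a` feeds two branches of the graph
loss.backward()
print(a.grad, b.grad)  # dL/da = b + 1, dL/db = a
```

Because `a` is used twice, its gradient is the sum of the contributions from both branches, which is exactly why the update uses `+=`.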

parameter initialization

The parameters are the model's knowledge: a large set of floating-point numbers that start out random and are optimized repeatedly during training
They are initialized with small random values drawn from a Gaussian distribution
They are stored as named matrices in a state_dict: the embedding tables, attention weights, MLP weights, and the final output projection
Hyperparameter settings:
- n_embd = 16: embedding dimension
- n_head = 4: number of attention heads
- n_layer = 1: number of layers
- block_size = 16: maximum sequence length
This tiny model has 4,192 parameters (GPT-2 had 1.6 billion, and modern LLMs have hundreds of billions)
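The parameter count can be checked with a short sketch. The hyperparameters and the names wte, wpe, lm_head, and attn_wo come from this summary; the remaining key names, the standard deviation of 0.08, and the assumption that lm_head is a separate matrix from wte are illustrative guesses. Under those assumptions, with no biases and a 4x MLP expansion, the total works out to the figure quoted above.

```python
import random

random.seed(42)

vocab_size = 27  # 26 lowercase letters + BOS
n_embd, n_head, n_layer, block_size = 16, 4, 1, 16


def matrix(rows, cols, std=0.08):
    """rows x cols matrix of small Gaussian values (std is an assumed value)."""
    return [[random.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


state_dict = {
    "wte": matrix(vocab_size, n_embd),      # token embedding table
    "wpe": matrix(block_size, n_embd),      # position embedding table
    "lm_head": matrix(vocab_size, n_embd),  # final output projection
}
for i in range(n_layer):
    # Key names below (other than attn_wo) are hypothetical.
    state_dict[f"layer{i}.attn_wq"] = matrix(n_embd, n_embd)
    state_dict[f"layer{i}.attn_wk"] = matrix(n_embd, n_embd)
    state_dict[f"layer{i}.attn_wv"] = matrix(n_embd, n_embd)
    state_dict[f"layer{i}.attn_wo"] = matrix(n_embd, n_embd)
    state_dict[f"layer{i}.mlp_fc1"] = matrix(4 * n_embd, n_embd)
    state_dict[f"layer{i}.mlp_fc2"] = matrix(n_embd, 4 * n_embd)

num_params = sum(len(row) for mat in state_dict.values() for row in mat)
print(num_params)  # 4192
```

The breakdown is 432 + 256 + 432 for the three tables and 3,072 for the single layer (four 16x16 attention matrices plus two 16x64 MLP matrices).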

architecture

The model architecture is a stateless function: it takes a token, a position, the parameters, and the cached keys/values of previous positions, and returns logits (scores) for the next token
It follows GPT-2 with a few simplifications: RMSNorm (instead of LayerNorm), no biases, and ReLU (instead of GELU)
helper functions
- linear: a matrix-vector multiplication that computes one dot product per row of the weight matrix; this learned linear transformation is the basic building block of neural networks
- softmax: converts raw scores (logits) into a probability distribution in which every value lies in [0, 1] and the values sum to 1; the maximum is subtracted first for numerical stability
- rmsnorm: rescales a vector so that its root mean square is 1, which keeps activations from growing too large or too small as they flow through the network and stabilizes training
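The three helpers can be sketched on plain Python floats as follows. This is an illustration under simplifying assumptions: the version described above operates on Value objects so that gradients can flow, and the `eps` constant in rmsnorm is an assumed value.

```python
import math


def linear(x, w):
    """Matrix-vector product: one dot product per row of the weight matrix."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rmsnorm(x, eps=1e-5):
    """Rescale x so that its root mean square is (approximately) 1."""
    mean_square = sum(v * v for v in x) / len(x)
    scale = (mean_square + eps) ** -0.5
    return [v * scale for v in x]


print(linear([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
print(softmax([1000.0, 1001.0, 1002.0]))  # no overflow despite large logits
print(rmsnorm([3.0, 4.0]))
```

Without the max subtraction, `math.exp(1000.0)` would raise an OverflowError; shifting all logits by a constant leaves the resulting probabilities unchanged.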
model structure
- embedding: the token ID and position ID each look up a row in their own embedding table (wte, wpe), and the two vectors are added so that what the token is and where it sits in the sequence are encoded together
  - Modern LLMs often drop learned positional embeddings in favor of relative positioning techniques such as RoPE
- attention block: the current token is projected into three vectors, Q (query), K (key), and V (value)
  - query: “what am I looking for?”, key: “what do I contain?”, value: “what do I offer if I am selected?”
  - Example: in “emma”, when the second “m” is predicting the next token, it could learn a query like “which vowel appeared recently?”, and the earlier “e” could match that query well and receive a high attention weight
  - Keys and values are appended to the KV cache so that previous positions can be referenced
  - Each attention head computes the dot product between the query and all cached keys (divided by √d_head), applies softmax to get attention weights, and takes the weighted sum of the cached values
  - The outputs of all heads are concatenated and projected through attn_wo
  - The attention block is the only place where the token at position t can “see” the past tokens 0..t-1; attention is the token communication mechanism
- MLP block: a 2-layer feedforward network: expand to 4 times the embedding dimension → apply ReLU → project back down
  - This is where most of the per-position “thinking” happens
  - Unlike attention, it is a computation that is entirely local to time t
  - The Transformer interleaves communication (attention) with computation (MLP)
- residual connections: both the attention and MLP blocks add their output back to their input
  - This lets gradients flow directly through the network and makes deep models trainable
- output: the final hidden state is projected to the vocabulary size by lm_head, giving one logit per token (27 numbers here); a higher logit means that token is more likely to come next
- note on the KV cache: using a KV cache during training is unusual, but microgpt processes only one token at a time, so the cache is built explicitly; the cached keys and values are live Value nodes in the computation graph, and backpropagation flows through them
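The attention step with an explicit KV cache might look roughly like the sketch below. It is a simplified illustration on plain floats: it omits rmsnorm, the MLP block, and autograd, uses tiny assumed sizes (n_embd = 4, n_head = 2), and the names `attend`, `wq`, `wk`, `wv`, and `wo` are hypothetical.

```python
import math
import random

random.seed(0)
n_embd, n_head = 4, 2  # tiny assumed sizes for illustration
head_dim = n_embd // n_head


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


def linear(x, w):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


wq, wk, wv, wo = (rand_matrix(n_embd, n_embd) for _ in range(4))


def attend(x, keys, values):
    """Process one token vector x, appending its key/value to the caches."""
    q = linear(x, wq)
    keys.append(linear(x, wk))
    values.append(linear(x, wv))
    out = []
    for h in range(n_head):
        lo, hi = h * head_dim, (h + 1) * head_dim
        # Scaled dot product between this head's query and every cached key.
        scores = [
            sum(q[j] * k[j] for j in range(lo, hi)) / math.sqrt(head_dim)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the cached values for this head's slice.
        out.extend(
            sum(w * v[j] for w, v in zip(weights, values)) for j in range(lo, hi)
        )
    # Output projection, then the residual connection back onto the input.
    return [xi + yi for xi, yi in zip(x, linear(out, wo))]


keys, values = [], []
for t in range(3):
    x = [random.gauss(0.0, 1.0) for _ in range(n_embd)]
    y = attend(x, keys, values)
print(len(keys), len(values), len(y))
```

Causality comes for free in this formulation: at position t the cache only holds keys and values from positions up to t, so no attention mask is needed.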

training loop

The training loop repeats: (1) pick a document → (2) run the model forward over its tokens → (3) compute the loss → (4) get gradients by backpropagation → (5) update the parameters
tokenization
- At each training step one document is picked and wrapped in BOS on both sides: “emma” → [BOS, e, m, m, a, BOS]
- The model's goal is to predict each next token given the tokens before it
forward pass and loss
- Tokens are fed into the model one at a time, building up the KV cache along the way
- At each position the model outputs 27 logits, which softmax converts into probabilities
- The loss at each position is the negative log probability of the correct next token, −log p(target); this is called the cross-entropy loss
- The loss measures how surprised the model is by the token that actually comes next: it is 0 when the model assigns probability 1.0 and approaches +∞ as the probability approaches 0
- Averaging the per-position losses over the whole document gives a single scalar loss
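The loss computation can be illustrated with a small sketch on plain floats. The logit values are made up for the example; only the vocabulary size of 27 and the formula −log p(target) come from the text above.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log probability assigned to the correct next token."""
    return -math.log(softmax(logits)[target])


vocab_size = 27

# A model that scores every token equally is guessing at random.
uniform = [0.0] * vocab_size
random_guess_loss = cross_entropy(uniform, target=3)

# A model that strongly favors the correct token has a loss close to 0.
confident = [0.0] * vocab_size
confident[5] = 10.0
confident_loss = cross_entropy(confident, target=5)

# The document loss is the mean of the per-position losses.
doc_loss = (random_guess_loss + confident_loss) / 2
print(f"{random_guess_loss:.2f} {confident_loss:.4f} {doc_loss:.2f}")
```

The uniform case gives −log(1/27) = ln 27, roughly 3.3, which is the loss expected from an untrained model over a 27-token vocabulary.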
backward pass
- A single call to loss.backward() runs backpropagation over the entire computation graph
- Afterwards, each parameter's .grad tells how that parameter should change to reduce the loss
Adam optimizer
- Adam is used instead of plain gradient descent (p.data -= lr * p.grad)
- Two moving averages are kept for each parameter:
  - m: the average of recent gradients (momentum)
  - v: the average of recent squared gradients (per-parameter learning rate adaptation)
- m_hat and v_hat are bias-corrected versions of m and v, which are initialized to zero
- The learning rate decays linearly over the course of training
- After the update, each .grad is reset to 0
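The update rule can be sketched on a list of plain floats. The constants beta1 = 0.9, beta2 = 0.999, and eps = 1e-8 are the commonly used Adam defaults and are assumptions here, as are the function name `adam_step` and the toy quadratic objective; the settings in the post may differ.

```python
def adam_step(params, grads, m, v, step, num_steps,
              lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One in-place Adam update with linear learning-rate decay (step is 0-based)."""
    lr_t = lr * (1 - step / num_steps)  # decays linearly toward 0
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1 - beta1) * g      # moving average of gradients
        v[i] = beta2 * v[i] + (1 - beta2) * g * g  # moving average of squared gradients
        # Bias correction compensates for m and v starting at zero.
        m_hat = m[i] / (1 - beta1 ** (step + 1))
        v_hat = v[i] / (1 - beta2 ** (step + 1))
        params[i] -= lr_t * m_hat / (v_hat ** 0.5 + eps)
        grads[i] = 0.0  # reset for the next step


# Toy objective (hypothetical): minimize f(p) = (p - 3)^2, whose gradient is 2 (p - 3).
params, grads = [0.0], [0.0]
m, v = [0.0], [0.0]
num_steps = 500
for step in range(num_steps):
    grads[0] = 2 * (params[0] - 3.0)
    adam_step(params, grads, m, v, step, num_steps, lr=0.05)
print(params[0])  # should end up close to the minimum at 3
```

On the very first step the bias-corrected ratio m_hat / sqrt(v_hat) has magnitude 1, so each parameter moves by about one learning rate regardless of how large its gradient is.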
training results
- Over 1,000 steps the loss falls from about 3.3 (a random guess over 27 tokens: −log(1/27) ≈ 3.3) to about 2.37
- Lower is better and the minimum is 0 (perfect prediction), so there is room to improve, but the model is clearly learning the statistical patterns of names

inference

Once training is done, new names can be sampled from the model: the parameters are frozen, the forward pass runs in a loop, and each generated token is fed back in as the next input
sampling process
- Every sample starts with the BOS token (“start a new name”)
- The model produces 27 logits → converts them to probabilities → randomly samples one token according to those probabilities
- That token is fed back in as the next input, and the process repeats until the model produces BOS again (“done”) or the maximum sequence length is reached
temperature
- The logits are divided by the temperature before softmax
- temperature 1.0: samples directly from the distribution the model learned
- lower temperature (e.g. 0.5): sharpens the distribution, making the model more conservative and more likely to pick its top choices
- temperature approaching 0: always picks the single most probable token (greedy decoding)
- higher temperature: flattens the distribution, making the output more diverse but less coherent
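The sampling loop and the effect of temperature can be sketched together. The `toy_model` function is a hypothetical stand-in for the trained model (three tokens, with ID 2 playing the role of BOS), and the function names are assumptions; only the procedure itself follows the description above.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=1.0, rng=random):
    """Divide the logits by the temperature, then sample from the softmax."""
    probs = softmax([v / temperature for v in logits])
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate(next_logits, bos, max_len, temperature=1.0, rng=random):
    """Feed each sampled token back in until BOS or max_len is reached."""
    out, token = [], bos
    for pos in range(max_len):
        token = sample_token(next_logits(token, pos), temperature, rng)
        if token == bos:
            break
        out.append(token)
    return out


def toy_model(token, pos):
    """Hypothetical stand-in for the trained model: 3 tokens, ID 2 is BOS."""
    return [5.0, 0.0, 0.0] if pos < 3 else [0.0, 0.0, 5.0]


# Lower temperatures concentrate probability on the top-scoring token.
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax([v / t for v in [2.0, 1.0, 0.0]])])

print(generate(toy_model, bos=2, max_len=16, temperature=0.01, rng=random.Random(0)))
```

A temperature of exactly 0 would divide by zero, so greedy decoding is usually implemented as a separate argmax branch rather than as a limit of this formula.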

how to run

Only Python is required (no pip install, no dependencies): python train.py
It takes about 1 minute on a MacBook
The loss is printed at every step and falls from ~3.3 (random) to ~2.37
After training, it generates hallucinated new names such as “kamon”, “ann”, and “karai”
It can also be run in a Google Colab notebook, where you can ask Gemini questions about it
You can try other datasets, train longer by increasing num_steps, or increase the model size for better results

code progression steps

File	What it adds
`train0.py`	bigram count table: no neural network, no gradients
`train1.py`	MLP + manual gradients (numerical & analytical) + SGD
`train2.py`	Autograd (Value class), replacing the manual gradients
`train3.py`	positional embeddings + single-head attention + rmsnorm + residual
`train4.py`	multi-head attention + layer loop: the full GPT architecture
`train5.py`	Adam optimizer; this is `train.py`

All versions and the diffs between each step can be seen in the Revisions of the build_microgpt.py Gist
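To give a feel for the starting point of that progression, a bigram count table with no neural network and no gradients might look like the sketch below. It is not taken from `train0.py`; the three-name corpus, the string BOS marker, and the function name `sample_name` are assumptions for illustration.

```python
import random
from collections import defaultdict

docs = ["emma", "ava", "mia"]  # hypothetical stand-in for the names dataset
BOS = "<BOS>"

# counts[prev][nxt] = how often token `nxt` follows token `prev` in the data.
counts = defaultdict(lambda: defaultdict(int))
for doc in docs:
    seq = [BOS] + list(doc) + [BOS]
    for prev, nxt in zip(seq, seq[1:]):
        counts[prev][nxt] += 1


def sample_name(rng, max_len=16):
    """Sample a name by repeatedly drawing the next token from the counts."""
    out, token = [], BOS
    for _ in range(max_len):
        followers = counts[token]
        token = rng.choices(list(followers), weights=list(followers.values()))[0]
        if token == BOS:
            break
        out.append(token)
    return "".join(out)


print(dict(counts["m"]))  # what has followed "m" in the data
print(sample_name(random.Random(0)))
```

Such a model sees only one token of context, which is the limitation the later steps remove by adding embeddings and attention.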

differences from production LLMs

microgpt contains the complete algorithmic essence of training and running a GPT; what separates it from production LLMs such as ChatGPT does not change the core algorithm but lies in what makes it work at scale
data
- Training on trillions of tokens of internet text (web pages, books, code, and so on) instead of 32K short names
- Data deduplication, quality filtering, and careful mixing across domains
tokenizer
- Subword tokenizers such as BPE (Byte Pair Encoding) instead of single characters
- Character sequences that frequently occur together are merged into a single token; common words like "the" become one token, while rare words are split into pieces
- A vocabulary of ~100K tokens, so each position carries more content and the model is far more efficient
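The merging idea behind BPE can be sketched as follows. This is a toy character-level illustration under simplifying assumptions (production tokenizers operate on bytes, pre-split the text, and learn tens of thousands of merges); the function names and the sample string are hypothetical.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the adjacent pair of tokens that occurs most often."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]


def merge_pair(tokens, pair, new_token):
    """Replace every occurrence of `pair` with the single token `new_token`."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


tokens = list("the theme then")  # start from single characters
for _ in range(2):               # learn two merges
    pair = most_frequent_pair(tokens)
    tokens = merge_pair(tokens, pair, pair[0] + pair[1])
print(tokens)
```

After two merges the frequent sequence "the" has become a single token while the rarer letters around it remain separate, which is the behavior described in the bullets above.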
Autograd
- Tensors (large multi-dimensional numeric arrays) instead of pure-Python scalar Value objects; they run on GPUs/TPUs and can perform billions of floating-point operations per second
- PyTorch handles autograd for tensors, and CUDA kernels such as FlashAttention fuse many operations together
- The math is the same, but many scalars are processed in parallel
architecture
- microgpt: 4,192 parameters; GPT-4-class models: hundreds of billions
- Overall the Transformer neural network is very similar, just much wider (embedding dimension 10,000+) and much deeper (100+ layers)
- Additional types of Lego blocks and changes in their ordering:
  - RoPE (rotary positional embeddings) instead of learned positional embeddings
  - GQA (grouped query attention) to shrink the KV cache
  - gated linear activations instead of ReLU
  - MoE (mixture of experts) layers
- The core structure, attention (communication) alternating with MLP (computation) on top of a residual stream, is well preserved
training
- Huge batches (millions of tokens per step) instead of one document per step, gradient accumulation, mixed precision (float16/bfloat16), and careful hyperparameter tuning
- Training a frontier model keeps thousands of GPUs running for months
optimization
- microgpt: Adam + simple linear learning rate decay
- Optimization at scale is a specialty of its own: reduced precision (bfloat16, fp8) and training across large GPU clusters
- Optimizer settings (learning rate, weight decay, beta parameters, warmup/decay schedules) have to be tuned very carefully; the right values depend on model size, batch size, and dataset composition
- Scaling laws (such as Chinchilla) guide how a fixed compute budget should be split between model size and the number of training tokens
- At scale, getting these details wrong can waste millions of dollars of compute, so teams run extensive small-scale experiments before a full training run
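As a rough illustration of how a scaling law turns a compute budget into a model size and a token count, the sketch below uses two commonly cited rules of thumb: training compute C ≈ 6·N·D FLOPs for N parameters and D tokens, and a Chinchilla-style ratio of roughly 20 training tokens per parameter. Both are approximations assumed for this example, not figures from the post, and the function name is hypothetical.

```python
import math


def compute_optimal_split(flops, tokens_per_param=20.0):
    """Split a FLOP budget into (parameters, tokens) under assumed rules of thumb.

    Assumes C = 6 * N * D and D = r * N, which gives N = sqrt(C / (6 * r)).
    """
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


n, d = compute_optimal_split(1e21)
print(f"{n:.3g} parameters, {d:.3g} tokens")
```

Under these assumptions both the model size and the token count grow with the square root of the compute budget, so a 100x larger budget suggests a model about 10x larger trained on about 10x more tokens.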
post-training
- The base model that comes out of training (the "pretrained" model) is a document completer, not a chatbot
- Turning it into ChatGPT takes two stages:
  - SFT (supervised fine-tuning): swap the documents for curated conversations and keep training; nothing changes algorithmically
  - RL (reinforcement learning): the model generates a response → the response is scored (by a human, a "judge" model, or an algorithm) → the model learns from the feedback
- Fundamentally it is still training on documents, but now the documents are made of tokens produced by the model itself
inference
- Serving a model to millions of users needs an engineering stack of its own: request batching, KV cache management and paging (vLLM and others), speculative decoding for speed, quantization to cut memory (running in int8/int4), and distributing the model across multiple GPUs
- Fundamentally it is still predicting the next token of a sequence, but a great deal of engineering goes into making that fast

FAQ

Does the model "understand" anything?
- That may be a philosophical question, but mechanically there is no magic
- The model is a big mathematical function that maps input tokens to a probability distribution over the next token
- During training the parameters are adjusted so that the correct next token becomes more probable
- Whether that counts as "understanding" is up to you, but the mechanism is contained entirely in these 200 lines
Why does it work?
- The model has thousands of adjustable parameters, and at every step the optimizer nudges them slightly to reduce the loss
- After many steps the parameters settle on values that capture the statistical regularities of the data
- For names: many names start with a consonant, "qu" tends to appear together, three consonants in a row are rare, and so on
- The model does not learn explicit rules; it learns probability distributions that reflect them
How is this related to ChatGPT?
- ChatGPT takes this same core loop (next-token prediction, sampling, repeat) to a much larger scale and adds post-training to make it conversational
- In a chat, the system prompt, the user messages, and the responses are all just tokens in a sequence
- The model completes the document one token at a time, exactly as microgpt completes a name
What is "hallucination"?
- The model generates tokens by sampling from a probability distribution
- It has no concept of truth; it only knows which sequences are statistically plausible given the training data
- microgpt "hallucinating" a name like "karia" is the same phenomenon as ChatGPT confidently stating false facts
- Both are plausible-sounding completions, not reality
Why is it so slow?
- microgpt processes one scalar at a time in pure Python, so a single training step takes several seconds
- On a GPU the same math processes millions of scalars in parallel and runs orders of magnitude faster
Can it be made to generate better names?
- Yes: train longer (increase num_steps), make the model bigger (n_embd, n_layer, n_head), or use a larger dataset
- These are the same knobs that matter at scale
What if the dataset is changed?
- The model learns whatever patterns are present in the data
- Swap in city names, Pokémon names, English words, or a file of short poems and it will learn to generate those instead
- No other changes to the code are needed
