ChatGPT clone implemented in 3000 bytes of C, based on GPT-2 (2023)

(nicholas.carlini.com)

2 points by GN⁺ 2024-12-13 | 1 comment

A GPT-2 inference engine written in only about 3000 bytes of C code handles the entire flow, from loading weights through tokenization and Transformer execution to output conversion
Despite the small code size, it generates GPT-2 Small responses in a few seconds on modern machines, thanks to KV caching, fast matrix multiplication, and optional OMP parallelization
The output quality is "objectively quite bad", and practical limitations remain, such as UTF-8 handling and the memory needed to run larger models
The implementation is divided into matrix operations, neural network layers, the Transformer, Byte Pair Encoding, I/O, and weights and BPE loading, which exposes the full structure of a small inference engine
GPT-2 is an open-source model from 2019 and far weaker than GPT-4, but the core components needed to run modern language models can still be expressed in a small amount of C code

A GPT-2 runner in 3000 bytes of C

This program is a dependency-free GPT-2 implementation that reads the weight matrices and BPE files from the original TensorFlow files
Input is tokenized with a simple Byte Pair Encoding (BPE) encoder, and output is converted back to a string with a BPE decoder
Internally it spans a basic linear algebra package, matrix operations, the Transformer architecture, and inference code
The code is public on GitHub
GPT-2 Small generates a response in roughly a few seconds on a modern machine
- KV caching is implemented
- An efficient matrix multiplication is used
- OMP parallelization can optionally be enabled

Requirements and limitations

This implementation can be used to build a ChatGPT-like interactive program, but the output quality is poor
UTF-8 character handling has some quirks
Running the XL-size model with a long context length can require roughly 100GB of RAM
With ASCII input and GPT-2 Small, it runs almost anywhere

How GPT-2 and Transformers work

ChatGPT is an application that lets a person hold a human-like conversation with a language model, and GPT-4 is presented as the latest model powering ChatGPT
This C program implements ChatGPT-like behavior using GPT-2, a model from 2019
GPT-2 is a machine learning model in the Transformer family
A Transformer takes a fixed-size sequence of words as input and predicts the next word
Repeating this process, with each predicted word fed back in as input, generates a sequence of arbitrary length
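
The repeat-and-append loop described above can be sketched as follows. This is only an illustration: `next_token_fn`, `generate`, and `toy_model` are hypothetical names that do not come from the original program, and the toy model simply stands in for a real Transformer forward pass.

```c
#include <assert.h>

/* Hypothetical stand-in for the model: given the tokens so far, return the
   id of the next token. A real model would run the Transformer here. */
typedef int (*next_token_fn)(const int *tokens, int n);

/* Append up to `steps` tokens to `tokens` (current length n, capacity cap).
   Stops early if the model emits `eos`. Returns the new length. */
static int generate(next_token_fn model, int *tokens, int n, int cap,
                    int steps, int eos) {
    assert(n >= 0 && n <= cap);
    while (steps-- > 0 && n < cap) {
        int t = model(tokens, n);   /* predict from everything so far */
        tokens[n++] = t;            /* feed the prediction back in */
        if (t == eos) break;
    }
    return n;
}

/* Toy model used only for demonstration: predicts "last token + 1". */
static int toy_model(const int *tokens, int n) {
    return n ? tokens[n - 1] + 1 : 0;
}
```

Starting from the single token 5 and asking for three steps, the toy model extends the sequence to 5, 6, 7, 8.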

Matrix operations and macro-based compression

Neural networks are built from matrix operations, so the implementation starts from a minimal Matrix struct
- float* dat
- int rows, cols
The required operations fall broadly into two kinds
- matrix-constant operations
- matrix-matrix operations
C macros cut down the repeated loop structures: many functions are generated by changing only the specific operator
C's #define is close to plain text substitution, so not only ordinary operators but even expressions containing semicolons can be passed as macro arguments to shrink the code
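
A readable sketch of the idea, assuming the two struct fields listed above. The macro names `SCALAR_OP` and `PAIR_OP`, the helper `matrix_new`, and the generated function names are hypothetical; the original code is far more compressed.

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal matrix type with the two fields the summary lists. */
typedef struct {
    float *dat;
    int rows, cols;
} Matrix;

/* Hypothetical helper: allocate a zero-filled rows x cols matrix. */
static Matrix matrix_new(int rows, int cols) {
    Matrix m = { calloc((size_t)rows * cols, sizeof(float)), rows, cols };
    assert(m.dat != NULL);
    return m;
}

/* Matrix-constant operations: one macro stamps out a function per
   expression. `b` is the current element, `k` the constant; `expr` is
   pasted in as plain text by the preprocessor. */
#define SCALAR_OP(name, expr)                          \
    static Matrix name(Matrix a, float k) {            \
        for (int i = 0; i < a.rows * a.cols; i++) {    \
            float b = a.dat[i];                        \
            a.dat[i] = (expr);                         \
        }                                              \
        return a;                                      \
    }

SCALAR_OP(mat_add_const, b + k)
SCALAR_OP(mat_mul_const, b * k)

/* Matrix-matrix (elementwise) operations: only the operator changes. */
#define PAIR_OP(name, op)                              \
    static Matrix name(Matrix a, Matrix c) {           \
        assert(a.rows == c.rows && a.cols == c.cols);  \
        for (int i = 0; i < a.rows * a.cols; i++)      \
            a.dat[i] = a.dat[i] op c.dat[i];           \
        return a;                                      \
    }

PAIR_OP(mat_add, +)
PAIR_OP(mat_mul_elem, *)
```

Each macro invocation costs a single line, which is where the size saving comes from: four functions here share two loop bodies.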

Fast matrix multiplication

Basic matrix multiplication starts as a simple O(n³) implementation with three nested loops
With cache behavior and memory access patterns in mind, the loops are restructured so that the same memory is read and written repeatedly
In the fast implementation, j and k are incremented in steps of 4, with inner k2 and j2 loops
To reuse some results already computed during inference, a way to multiply only part of matrix A by B was added
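
The two loop shapes can be compared in a small sketch. It assumes row-major storage and an ordinary A times B product, which may differ from the layout the original code uses; `matmul_naive` and `matmul_tiled` are hypothetical names.

```c
#include <assert.h>

/* Row-major matrices: a is n x m, b is m x p, out is n x p and must be
   zeroed by the caller. Plain three-loop O(n^3) version. */
static void matmul_naive(const float *a, const float *b, float *out,
                         int n, int m, int p) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < p; j++)
            for (int k = 0; k < m; k++)
                out[i * p + j] += a[i * m + k] * b[k * p + j];
}

/* Same arithmetic, but j and k advance in steps of 4 and the inner k2/j2
   loops work on a small 4x4 block of b and a 4-wide strip of out.
   Assumption of this sketch: m and p are multiples of 4. */
static void matmul_tiled(const float *a, const float *b, float *out,
                         int n, int m, int p) {
    assert(m % 4 == 0 && p % 4 == 0);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < p; j += 4)
            for (int k = 0; k < m; k += 4)
                for (int k2 = 0; k2 < 4; k2++)
                    for (int j2 = 0; j2 < 4; j2++)
                        out[i * p + j + j2] +=
                            a[i * m + k + k2] * b[(k + k2) * p + j + j2];
}
```

Both functions accumulate each output element over k in the same order, so they produce identical results; only the memory access pattern differs.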

Implementing the neural network layers

A few neural network layers are implemented directly to build the Transformer
The GELU activation function is implemented as a macro
For causal attention, there is a function that processes the lower-triangular part of a matrix
- It restricts the attention matrix to looking only at the past, never at future tokens
LayerNorm normalizes the mean and variance of each layer
The Linear function performs a matrix multiplication and then adds the bias, tiled across the rows

The core of the Transformer

The Transformer implementation repeats this flow in every layer
- Passes through LayerNorm and Linear to compute query, key, and value in one go
- Splits qkv by head
- Computes the product of query and key and applies the causal attention processing
- Multiplies the softmax result by the value matrix
- Gathers the results and applies a residual connection
- Passes through GELU and Linear and applies a residual connection again
Finally, after the final LayerNorm, it multiplies the output at the last token position by the embedding weights to compute the next-token candidates

How KV caching works

In Transformer inference, after generating one token there is no need to recompute the entire function to produce the next one
If most of the results computed up to the Nth token are reused, generating the (N+1)th token needs only a little extra work
The implementation makes all allocations sequentially inside the same memory block
Every matrix multiplication is set up to always use the same memory, so previous results can be kept in the next iteration without zero-initializing the memory
In a new iteration only the (N+1)th row is computed
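
A minimal sketch of the "compute only the new row" idea, assuming a row-major product x times w whose output buffer persists between iterations. `matmul_rows` is a hypothetical name; the original achieves the same effect through its shared memory block.

```c
#include <assert.h>

/* out = x * w, but only for rows [first, last) of x. x has m columns,
   w is m x p, out has p columns. Rows of out before `first` are left
   exactly as they were, so earlier results act as the cache. */
static void matmul_rows(const float *x, const float *w, float *out,
                        int first, int last, int m, int p) {
    assert(0 <= first && first <= last);
    for (int i = first; i < last; i++)
        for (int j = 0; j < p; j++) {
            float s = 0;
            for (int k = 0; k < m; k++) s += x[i * m + k] * w[k * p + j];
            out[i * p + j] = s;
        }
}
```

After N tokens have been processed with `matmul_rows(x, w, out, 0, N, m, p)`, the next step only needs `matmul_rows(x, w, out, N, N + 1, m, p)`, and `out` ends up identical to a full recomputation.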

Byte Pair Encoding implementation

A language model needs fixed-size input, so handling an unbounded number of words directly at the word level is difficult
A character-level model has to learn the meaning of every word from scratch, and its effective context size shrinks in proportion to the average word length
Models like GPT-2 use BPE to build tokens out of word pieces
- Common words can become a single token
- Rare words are broken into smaller pieces
- For example, nicholas can be split as nich, o, las
The usual BPE algorithm repeatedly merges adjacent token pairs
To reduce code size, this C implementation uses a recursive approach that can take exponential time instead of the linear-time algorithm
- Finds vocabulary entries that match a prefix of the current word
- Recursively tokenizes the rest of the string
- Picks the best tokenization based on length and vocabulary index
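
The three recursive steps above can be sketched with a toy vocabulary. Everything here is hypothetical: the vocabulary, the names `VOCAB` and `tokenize`, and in particular the scoring rule (fewest tokens wins, ties go to the lower sum of vocabulary indices), which is one plausible reading of "length and vocabulary index" and not necessarily the rule the original code uses.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical toy vocabulary; the index plays the role of the token id. */
static const char *VOCAB[] = { "n", "i", "c", "h", "o", "l", "a", "s",
                               "las", "nich" };
enum { VOCAB_SIZE = sizeof VOCAB / sizeof VOCAB[0], MAX_TOKENS = 32 };

/* Recursively tokenize `word`. Writes token ids to `out` (room for
   MAX_TOKENS) and returns the token count, or -1 if the word cannot be
   tokenized. No memoization, so the running time can be exponential. */
static int tokenize(const char *word, int *out, long *cost) {
    if (*word == '\0') { *cost = 0; return 0; }
    int best_n = -1;
    long best_cost = 0;
    int tmp[MAX_TOKENS];
    for (int id = 0; id < VOCAB_SIZE; id++) {
        size_t len = strlen(VOCAB[id]);
        if (strncmp(word, VOCAB[id], len) != 0) continue; /* not a prefix */
        long c;
        int n = tokenize(word + len, tmp, &c);            /* rest of word */
        if (n < 0 || n + 1 > MAX_TOKENS) continue;
        c += id;
        if (best_n < 0 || n + 1 < best_n ||
            (n + 1 == best_n && c < best_cost)) {
            best_n = n + 1;
            best_cost = c;
            out[0] = id;
            memcpy(out + 1, tmp, (size_t)n * sizeof *tmp);
        }
    }
    *cost = best_cost;
    return best_n;
}
```

With this toy vocabulary, `nicholas` comes out as the three tokens nich, o, las, matching the example in the list above.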

Loading the weights

The neural network weights have to be read from disk, and the file is a flat binary serialization of 32-bit floats
All GPT-2 model sizes use the same architecture and store their weights in the same order, so reading matrices of the right shape in sequence is enough
The layer storage order is not what one would expect
- Layers 0 and 1 are followed by layer 10
- This is because the names are sorted lexicographically
- In string sorting, 10 comes before 2
The implementation uses permutation code to turn this order into the actual layer order
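
The ordering problem is easy to reproduce. The sketch below assumes that layer names differ only in their decimal layer number and that blocks are stored in sorted name order; `layer_order` and `cmp_names` are hypothetical names, and the original code computes its permutation far more tersely.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum { MAX_LAYERS = 64 };

/* qsort comparator for fixed-width name slots, compared as C strings. */
static int cmp_names(const void *a, const void *b) {
    return strcmp((const char *)a, (const char *)b);
}

/* Fills stored_to_layer[i] with the real layer number of the i-th layer
   block on disk, assuming blocks are stored sorted by decimal name. */
static void layer_order(int n_layers, int *stored_to_layer) {
    assert(n_layers > 0 && n_layers <= MAX_LAYERS);
    char names[MAX_LAYERS][8];
    for (int i = 0; i < n_layers; i++)
        snprintf(names[i], sizeof names[i], "%d", i);
    qsort(names, (size_t)n_layers, sizeof names[0], cmp_names);
    for (int i = 0; i < n_layers; i++)
        stored_to_layer[i] = atoi(names[i]);
}
```

For the 12 layers of GPT-2 Small this yields the stored order 0, 1, 10, 11, 2, 3, 4, 5, 6, 7, 8, 9, so the third block read from disk belongs to layer 10.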

Loading the BPE vocabulary

To run BPE, the vocabulary file first has to be read from disk
The original file is formatted to be read from Python and is not easy to parse with a small amount of C code
The file is a list of BPE merges, not a list of words
- For example, instead of storing the token Hello directly, it stores that H and ello should be merged
The file uses an encoding that resembles UTF-8 but is not exactly UTF-8
- Printable ASCII characters are stored as-is
- Non-printable characters in the 0~31 range are encoded as 188 + the character value
- For example, a space is encoded as the token Ġ
On disk, Ġ is 0xc4 0xa0 in UTF-8, so converting it back to a space needs separate handling
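
A sketch of that separate handling, built only on the Ġ example: the bytes 0xc4 0xa0 decode to code point U+0120 (288), which is 256 + 32, the byte value of a space. The sketch therefore assumes that code points 256 to 288 stand for raw bytes 0 to 32 and rejects anything else. It does not rely on the offset quoted in the list above. `decode_token` is a hypothetical name, not the original routine.

```c
#include <assert.h>
#include <stddef.h>

/* Decode one NUL-terminated token from the vocabulary file into raw bytes.
   Handles plain ASCII and two-byte UTF-8 sequences whose code point is in
   256..288 (assumed to stand for raw bytes 0..32). Returns the number of
   bytes written, or -1 for input outside that scope or if out is full. */
static int decode_token(const unsigned char *in, unsigned char *out,
                        size_t cap) {
    size_t n = 0;
    while (*in) {
        unsigned cp;
        if (*in < 0x80) {
            cp = *in++;                               /* ASCII, stored as-is */
        } else if ((*in & 0xE0) == 0xC0 && (in[1] & 0xC0) == 0x80) {
            cp = ((unsigned)(*in & 0x1F) << 6) | (in[1] & 0x3F);
            in += 2;
            if (cp < 256 || cp > 288) return -1;      /* outside this sketch */
            cp -= 256;                                /* e.g. 288 -> 32 */
        } else {
            return -1;
        }
        if (n >= cap) return -1;
        out[n++] = (unsigned char)cp;
    }
    return (int)n;
}
```

Feeding it the bytes 0xc4 0xa0 followed by `Hi` yields a space followed by `Hi`.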

What the small code shows

Decades of progress in machine learning can be compressed into a few thousand bytes of code
Apart from the actual model weights, almost nothing needed to run a modern neural network is missing
The implementation was written mostly for fun, but it is an example showing that neural networks can in fact run on simple components

1 comment

GN⁺ 2024-12-13

Hacker News opinions

I haven't run the code myself, but its small size is impressive
Early ELIZA programs were bigger than this; when you think about it, over the past 4 years it has become possible to cram something like this down to the byte level
If anyone knows where the magic is hidden, please explain. Is it the GELU function, or the model downloaded by the Bash script?
- Most of the magic is in the 475MB model file downloaded by the Bash script
- I ran it and it wasn't very impressive
  Asked Who are you? it answers I am Alice., and asked about the computer or its capabilities it repeats I am a computer model trained by OpenAI. How can I help you?
  Asked to explain addition it explains multiplication, and it just echoes 2+2 or Sum 2+2 back verbatim
I remember trying GPT-2 when it first came out
A friend and I exported our chat logs, fine-tuned GPT-2 on them, and had it imitate our conversations; it was very funny and at times frighteningly accurate
I wonder what caused the big leap from GPT-2 to GPT-3. A bigger model, more data, or both; I don't know
I know RLHF made a big difference, but even the base GPT-3 model was quite useful through text completion alone, given enough examples
I don't know exactly, but some of my favorite fairy tales were written by GPT-2
https://deepdreams.stavros.io/episodes/the-princess-the-fair...
- A genuinely good, really fun story, and good for falling asleep to as well
  I wonder whether it was made with the same GPT-2 as on this page
- Impressive and strange, yet about 90% coherent, which gives it a peculiar, uncanny atmosphere
The part saying "it was mostly made for fun, but it is a good example of how simple neural networks can really be" is interesting
Shh, don't tell anyone. Artificial intelligence is black magic used to make money
Has GPT-2 been instruction-tuned so that it can be used for real chat?
If not, calling this a ChatGPT clone seems like quite a stretch
- The article already says so: if you don't care about output quality, you can build something like ChatGPT; objectively the output is pretty terrible, but it works
  In practice it is nearly unusable, and apart from borrowing the name it has little to do with ChatGPT. Still, it is a program that compiles and runs
  Seeing reactions praise the performance of a project that even its author admits doesn't work properly makes it seem that, in the end, grabbing attention with a buzzword is the main point
The line "Are you watching, languages with decent macros? Lisp isn't always better than C!" is acceptable this time, because it is a joke that punches up
If you didn't spot the code link, it is tucked away in the article: https://github.com/carlini/c-chat-gpt-2
I have seen better than this from classic AI chatbots
https://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas...
With a little fixing up, Splotch compiles fine even on modern Unix-like systems
I wonder whether anyone has run it locally to see what kind of output this GPT-2 produces
- It seems to repeat the same output almost all the time
  Still, it is quite interesting, and I'd like to poke around inside and tweak it myself. I've wanted to play with GPT-2 locally for a long time
- Reading it, I figured that with the same temperature and seed, a GPT-2 model loaded the usual way and the model loaded in this program should give exactly the same output
  I couldn't directly confirm the temperature and seed in the code, and I was mainly trying to see why it was obfuscated
  Even de-obfuscated, the code won't be very long; if it comes to about 10,000 characters, just seeing it on screen would be quite impressive
These days you can quickly implement your own ChatGPT with gptscript
https://github.com/gptscript-ai/gptscript
GELU really is like magic:
UNARY(GELU, b / 2 * (1 + tanh(.7978845 * (b + .044715 * b * b * b))))
- This is only a practical approximation of GELU's actual mathematical definition
  The definition is GELU(x) := x * Φ(x), where Φ(x) is the cumulative distribution function of the Gaussian distribution
- Its form is reminiscent of the fast inverse square root
