Visualizing Attention, a Transformer's Heart [video]

(3blue1brown.com)

1 point by GN⁺ 2024-04-15 | 1 comment

Attention in a Transformer is a mechanism that updates token embeddings according to context, so that the same word can shift to a vector with a different meaning depending on the words around it.
A single attention head produces query, key, and value vectors from every token, then uses key-query dot products followed by softmax to compute relevance weights between words, i.e. the attention pattern.
GPT-style autoregressive models apply masking so that later tokens cannot influence earlier ones; the size of the attention pattern equals the square of the context length, which makes expanding the context window costly.
In the GPT-3 example, the key and query matrices each have 1,572,864 parameters, and factoring the value map into a low-rank transformation gives about 6.3 million parameters per head.
A Transformer that repeats many attention heads and blocks learns a variety of ways to update context, and a large part of its success rests on parallelization: GPUs can run a huge number of computations quickly.

The role of attention in a Transformer

A Transformer uses input text to predict the next token; the input is first split by tokenization into words or word pieces.
Each token is converted to a high-dimensional vector, its embedding.
- Directions in this embedding space can correspond to meaning.
- For example, moving in a particular direction can take the embedding of a masculine noun to the embedding of the corresponding feminine noun.
The goal of attention is to progressively adjust the initial embeddings so that they carry richer contextual meaning in addition to the information about the individual word.

Why the same word changes with context

The word mole means something different in “American shrew mole”, “One mole of carbon dioxide”, and “Take a biopsy of the mole”.
At the first embedding stage, the vector for mole comes from what is effectively a lookup table that does not look at context, so it is identical in all three cases.
In the next stage, the attention block, surrounding embeddings can pass information into the mole embedding and update its value.
A well-trained model associates the different meanings of mole with different directions in embedding space, and computes from context what to add to the generic embedding.
As with “Eiffel tower” versus “miniature Eiffel tower”, a word's embedding can be updated by information from distant tokens, not only from adjacent words.
Only the last vector is used to predict the next word, so for a long input the embedding of the last word must to some extent encode all the context relevant to the prediction.

Calculation flow of a single attention head

The core explanation proceeds in terms of a single head of attention.
In the example sentence “A fluffy blue creature roamed the verdant forest.”, suppose the adjectives update the initial embeddings of their corresponding nouns.
- The example is meant to show what an attention head could do.
- A real head's behavior results from tuning many parameters to minimize a cost function, so it is hard to interpret.
The initial embedding encodes the token's position as well as the word itself, and is written \vec{E}.
The goal is to produce, from the existing embeddings, new embeddings \vec{E}' that reflect context.
Query
- In the first step, every token embedding is multiplied by the query matrix W_Q to produce a query vector \vec{Q}.
- You can think of a noun as asking a question such as “Are there any adjectives in front of me?”
- The entries of W_Q are learned model parameters, and what any actual head does with them is hard to interpret.
- For the sake of the example, imagine that it maps noun embeddings to a direction that encodes “looking for adjectives in preceding positions”.
Key
- At the same time, every embedding is multiplied by the key matrix W_K to produce a key vector \vec{K}.
- A key can be thought of as a potential answer to a query; it lives in the same smaller-dimensional space as the queries.
- How well a key and a query align is measured with a dot product.
- The larger the dot product, the more strongly the two vectors align.
- If the keys for fluffy and blue match the query for creature well, the result is a large positive value.
- Computing the dot product for every key-query pair gives a grid of scores showing how relevant each word is to updating the meaning of every other word.
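The query and key steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the sizes are toy values, and `E`, `W_Q`, and `W_K` are random stand-ins for what a trained model would have learned. Embeddings are stored as columns so that the grid matches the layout described here (keys along the rows, queries along the columns).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions); GPT-3 uses 12,288 and 128 instead of 8 and 4.
d_model, d_k, n_tokens = 8, 4, 5

# Each column of E is one token embedding (random stand-in).
E = rng.normal(size=(d_model, n_tokens))

# Random stand-ins for the learned query and key matrices.
W_Q = rng.normal(size=(d_k, d_model))
W_K = rng.normal(size=(d_k, d_model))

Q = W_Q @ E  # one query vector per column, shape (d_k, n_tokens)
K = W_K @ E  # one key vector per column, shape (d_k, n_tokens)

# scores[i, j] is the dot product of token i's key with token j's query.
scores = K.T @ Q

print(scores.shape)  # (5, 5)
```

A large positive `scores[i, j]` means token `i` is judged relevant to updating token `j`.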

Attention pattern and softmax

The dot-product scores can take any value from -\infty to \infty, so softmax is applied to each column to normalize the values to between 0 and 1, with each column summing to 1.
The normalized grid is called the attention pattern.
- Each column can be read as weights saying how relevant the word on the left is to updating the word on top.
The original Transformer paper writes this in a more compact notation.
- Q and K are the full arrays of query and key vectors.
- K^TQ denotes the grid of all possible key-query dot products.
- In the paper's notation the queries and keys are stored as rows and the product is written QK^T, so its grid is the transpose of the diagram described here.
For numerical stability, the scores are divided by the square root of the key-query space dimension, \sqrt{d_k}.
Softmax is written around the whole expression, but it is meant to be applied column by column.
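A minimal sketch of the scaled, column-wise softmax described above, assuming the same column layout (one query or key vector per column). The helper names `column_softmax` and `attention_pattern` are made up for this illustration, and the inputs are random.

```python
import numpy as np

def column_softmax(scores):
    """Apply softmax independently to each column of a 2-D array."""
    # Subtracting the column maximum does not change the result
    # but keeps exp() from overflowing.
    shifted = scores - scores.max(axis=0, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)

def attention_pattern(Q, K):
    """Q and K have shape (d_k, n_tokens), one vector per column."""
    d_k = Q.shape[0]
    scores = K.T @ Q / np.sqrt(d_k)  # scale by sqrt(d_k) before softmax
    return column_softmax(scores)

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 5))  # random stand-in queries
K = rng.normal(size=(4, 5))  # random stand-in keys

A = attention_pattern(Q, K)
print(A.sum(axis=0))  # every column sums to 1, up to rounding
```

With the paper's row layout the same computation would be a row-wise softmax of QK^T / \sqrt{d_k}, which is the transpose of `A`.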

Masking and context-size constraints

During training the model does not predict only one next token for a given text; it simultaneously predicts the possible next token after every partial sequence.
- A single text example thus acts as many training examples, which improves efficiency.
In the GPT example, if later tokens could influence earlier ones, the correct answer for the next token could leak, so masking is used.
- Before softmax, the values at those positions are set to negative infinity.
- After softmax those positions become 0, and each column stays normalized.
Masking is not applied in every use of attention, but in the GPT example it is always used to keep later tokens from affecting earlier ones.
The size of the attention pattern equals the square of the context size.
- Context size can therefore be a significant limitation for large language models.
- Variants that make attention more scalable for larger context windows exist, but only the basic form is covered here.
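The masking step can be sketched as follows, again assuming keys along the rows and queries along the columns. An entry `(i, j)` with `i > j` would let a later token influence an earlier one, so those entries are set to negative infinity before softmax. The function name `causal_attention_pattern` is made up for this illustration.

```python
import numpy as np

def causal_attention_pattern(scores):
    """Mask a square (n, n) score grid, then apply column-wise softmax."""
    n = scores.shape[0]
    # True strictly below the diagonal: key position i comes after query position j.
    later = np.tril(np.ones((n, n), dtype=bool), k=-1)
    masked = np.where(later, -np.inf, scores)
    # The diagonal entry of every column is unmasked, so each column maximum is finite.
    shifted = masked - masked.max(axis=0, keepdims=True)
    exp = np.exp(shifted)  # exp(-inf) is exactly 0
    return exp / exp.sum(axis=0, keepdims=True)

# With all-zero scores, column j spreads its weight evenly over tokens 0..j.
A = causal_attention_pattern(np.zeros((4, 4)))
print(A)
```

The grid has n × n entries, so doubling the context length quadruples the size of the attention pattern.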

How values actually update the embeddings

The attention pattern gives the weights for which word updates which; the next step is to produce the actual change to each embedding.
Every embedding is multiplied by the value matrix W_V to produce a value vector.
- A value vector lives in the same high-dimensional space as the embeddings.
- It represents the specific change a relevant word should add when it adjusts the meaning of another word.
In each column, the value vectors are multiplied by the corresponding weights of the attention pattern and summed, giving the change \Delta \vec{E}.
Adding this change to the original embedding gives the new, context-aware embedding \vec{E}'.
- In the example, creature absorbs the information from fluffy and blue and ends up with a meaning close to “fluffy blue creature”.
Applying the same process to every column yields refined embeddings for the whole token sequence as the output of the attention block.
A single attention head is parameterized by three kinds of learned matrices: the key matrix, the query matrix, and the value matrix.
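The value step can be sketched like this. It is a toy illustration, not a real model: `W_V` is a random stand-in for the learned value matrix, and `A` is a random grid normalized so that each column sums to 1, standing in for an attention pattern computed as described earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_tokens = 8, 5  # toy sizes (assumptions)

E = rng.normal(size=(d_model, n_tokens))    # one embedding per column
W_V = rng.normal(size=(d_model, d_model))   # stand-in for the value matrix

# Stand-in attention pattern: non-negative, each column sums to 1.
A = rng.random(size=(n_tokens, n_tokens))
A /= A.sum(axis=0, keepdims=True)

V = W_V @ E        # one value vector per column
delta_E = V @ A    # column j is the weighted sum of value vectors for token j
E_new = E + delta_E

print(E_new.shape)  # (8, 5)
```

Column `j` of `delta_E` equals the sum over `i` of `A[i, j] * V[:, i]`, which is the weighted sum described above.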

Parameter count based on GPT-3

In the GPT-3 example, the key and query matrices each have 12,288 columns, matching the embedding dimension, and 128 rows, matching the key-query space dimension.
- Each matrix has 12,288 × 128 = 1,572,864 parameters.
If the value matrix were a 12,288×12,288 square matrix, it would add 150,994,944 parameters, far more than the key and query matrices.
In practice it is more efficient to factor the value map into two smaller matrices, which keeps its parameter count in line with the key and query matrices.
- The first matrix maps the large embedding space down to a smaller space, for example a 128-dimensional one.
- The second matrix maps from the smaller space back up to the embedding space.
- In linear-algebra terms, this constrains the overall value map to be a low-rank transformation.
This explanation calls the two matrices Value_\downarrow and Value_\uparrow, but these are not conventional names.
Adding up all four matrices, one attention head has about 6.3 million parameters.
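The arithmetic behind these counts is easy to check. The snippet below only multiplies out the shapes quoted above; the variable names are chosen for this illustration.

```python
d_model = 12_288  # embedding dimension quoted for GPT-3
d_k = 128         # key-query space dimension

key_params = d_k * d_model
query_params = d_k * d_model

# Hypothetical full-rank value matrix, for comparison only.
full_value_params = d_model * d_model

# Low-rank factorization: down to d_k dimensions, then back up to d_model.
value_down_params = d_k * d_model
value_up_params = d_model * d_k

head_params = key_params + query_params + value_down_params + value_up_params

print(f"{key_params:,}")         # 1,572,864
print(f"{full_value_params:,}")  # 150,994,944
print(f"{head_params:,}")        # 6,291,456
```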

Self-attention and cross-attention

The structure described so far is more precisely called a self-attention head.
Cross-attention heads appear in models that process two distinct sets of data.
- In a translation model, for example, the keys can come from one language and the queries from the other.
- The attention pattern can then show which words of one language correspond to which words of the other.
What distinguishes cross-attention from self-attention is that the key and query maps operate on different datasets.
In settings like translation there is typically no notion of later tokens affecting earlier ones, so masking is usually not applied.
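A rough sketch of the difference, under the assumption of toy sizes and random stand-in matrices: in cross-attention the keys are computed from one sequence and the queries from another, so the attention pattern need not be square, and no mask is applied. The names `E_src` and `E_tgt` are made up for this illustration.

```python
import numpy as np

def column_softmax(scores):
    shifted = scores - scores.max(axis=0, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
d_model, d_k = 8, 4   # toy sizes (assumptions)
n_src, n_tgt = 6, 3   # e.g. 6 source-language tokens, 3 target-language tokens

E_src = rng.normal(size=(d_model, n_src))  # embeddings that supply the keys
E_tgt = rng.normal(size=(d_model, n_tgt))  # embeddings that supply the queries

W_K = rng.normal(size=(d_k, d_model))
W_Q = rng.normal(size=(d_k, d_model))

K = W_K @ E_src
Q = W_Q @ E_tgt

# One column per query token, one row per key token; no causal mask.
A = column_softmax(K.T @ Q / np.sqrt(d_k))
print(A.shape)  # (6, 3)
```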

Multi-headed attention and repeated blocks

A real attention block consists of multi-headed attention, which runs many heads in parallel.
GPT-3 uses 96 attention heads inside each block.
- 96 distinct key and query matrices produce 96 distinct attention patterns.
- Each head uses its own value matrices to produce a sequence of value vectors.
- At each token position, the changes \Delta \vec{E} proposed by all heads are summed and added to the original embedding.
Running many heads in parallel gives the model the capacity to learn many distinct ways in which context changes meaning.
Based on GPT-3, one multi-headed attention block with 96 heads has about 600 million parameters.
In the paper and in actual implementations, the matrices corresponding to each head's Value_\uparrow are bundled into one large output matrix associated with the whole multi-headed attention block.
- When people refer to the value matrix of a specific head, they usually mean only the first projection step, called Value_\downarrow here.
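The way several heads combine can be sketched as below. This follows the per-head description in this summary, with a separate down and up value matrix for each head, and not the bundled output matrix used in real implementations. Sizes are toy values, all matrices are random stand-ins, and masking is left out to keep the sketch short.

```python
import numpy as np

def column_softmax(scores):
    shifted = scores - scores.max(axis=0, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)

def head_delta(E, W_Q, W_K, W_V_down, W_V_up):
    """Change proposed by one head for every token (columns of E)."""
    d_k = W_Q.shape[0]
    Q, K = W_Q @ E, W_K @ E
    A = column_softmax(K.T @ Q / np.sqrt(d_k))
    V = W_V_up @ (W_V_down @ E)  # low-rank value map
    return V @ A

rng = np.random.default_rng(0)
d_model, d_k, n_tokens, n_heads = 8, 2, 5, 3  # toy sizes (assumptions)

E = rng.normal(size=(d_model, n_tokens))
heads = [
    (
        rng.normal(size=(d_k, d_model)),  # W_Q
        rng.normal(size=(d_k, d_model)),  # W_K
        rng.normal(size=(d_k, d_model)),  # W_V_down
        rng.normal(size=(d_model, d_k)),  # W_V_up
    )
    for _ in range(n_heads)
]

deltas = [head_delta(E, *weights) for weights in heads]
E_new = E + sum(deltas)  # sum the proposed changes, then add to the original

print(E_new.shape)  # (8, 5)
```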

How meaning accumulates in a deep Transformer

Data in a Transformer does not pass through just one attention block; it passes through many attention blocks and multi-layer perceptrons.
Even after a word's embedding has absorbed some of its context, it keeps getting chances to be influenced by its surroundings, which are themselves more refined.
The deeper the network, the more meaning each embedding absorbs from the others, and the more capacity it has to encode higher-level abstract features such as sentiment, tone, or whether the text is a poem.
GPT-3 has 96 layers, and its key, query, and value parameters total just under 58 billion.
That is about a third of the network's total parameters; most of the rest come from the blocks in between the attention blocks.
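These totals follow from the per-head shapes quoted earlier. One value below is an assumption: the summary never states GPT-3's total parameter count, so the commonly cited figure of about 175 billion is used to check the "about a third" claim.

```python
d_model, d_k = 12_288, 128  # shapes quoted for GPT-3
n_heads, n_layers = 96, 96

params_per_head = 4 * d_k * d_model           # key, query, value down, value up
params_per_block = n_heads * params_per_head  # one multi-headed attention block
attention_params = n_layers * params_per_block

# Assumption: commonly cited GPT-3 size, not stated in this summary.
total_params_assumed = 175_000_000_000

print(f"{params_per_block:,}")  # 603,979,776
print(f"{attention_params:,}")  # 57,982,058,496
print(round(attention_params / total_params_assumed, 2))  # 0.33
```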
Much of the success of attention lies not in any one specific behavior but in being highly parallelizable, so that GPUs can run a huge amount of computation in a short time.
A lesson of deep learning is that scale alone can bring large qualitative improvements in model performance, which gives parallelizable architectures that allow scaling a big advantage.

1 comment

GN⁺ 2024-04-15

Hacker News comments

As someone who has worked on quantum chemistry and some machine learning, I found the similarity between Transformer models and quantum mechanics quite apparent while watching this video.
In quantum mechanics the state of an entire physical system is encoded as a very high-dimensional normalized vector, i.e. a ray in Hilbert space, and its change over time is handled, roughly, by the time-translation operator, which can be viewed as the unitary matrix U = exp(-iHt).
The video says next-token prediction is determined solely by computing the next context-aware embedding vector from the last context-aware embedding vector, which looks like the result of applying a linear state function to a high-dimensional vector.
It feels as if the Hamiltonian of the whole system were built offline from the training data, then a particular subsystem (the context window) were reparameterized into a basis suited to that Hamiltonian, one step of time translation were applied, and the result were mapped back to the original basis.
Of course, to someone doing research in a particular field every problem can look like a nail for that field's hammer, so I'm curious whether others see this similarity or whether it is too much of a stretch.
- I don't think the comparison holds up. Even if you ignore all the earlier nonlinear steps, what remains is just a linear dynamical system; it lacks the defining features of quantum mechanics, such as its complex-valued nature or unitarity.
- This seems to just describe a state machine. Isn't encoding state as a vector and advancing steps with a matrix closer to an implementation detail?
- I've been thinking about this idea recently. If time is not continuous, could the time evolution of the universe be modeled by recursively applying an operator to the universe's quantum state?
  If one application of the operator advanced the state of the universe by one Planck time, I wonder whether we could observe any difference between such a universe and one with continuous time.
- I once had a maths PhD intern who said that high-dimensional linear algebra was an extremely advanced field even by 1900s standards, and that there is plenty of room to discover new things with it in computer science.
  Only now did I notice the connection to what was happening in physics at that time.
- So does this ultimately mean that the most sophisticated computer models we've built are starting to approach the algorithm that defines our universe? In other words, is the simulation idea coming up again?
I found CodeEmporium's YouTube video easy to follow: https://www.youtube.com/watch?v=Nw_PJdmydZY
Transformers are hard to explain by analogy, and frankly there isn't a good explanation of why they work, so it may be better to show the mechanism and leave the interpretation to the viewer.
Also, it is simpler to explain the dot product as the projection of vectors onto each other.
- The explanation is that a neural network is a statistical fitting algorithm that learns the conditional probability distribution P(next_word|previous_words). The weights are a model of that distribution, and the LLM is largely a hardware innovation that let GPUs compute this at scale over terabyte-scale data.
  “mat” follows “the cat sat on the ...” because it is the most frequent word there in the dataset, and the neural network is a model of such frequencies.
  The reason it appears to know “London in UK” but not “London in France” is likewise that “UK” occurs far more often in the dataset.
  The algorithm itself does nothing particularly interesting beyond arranging the computation to suit the hardware. The value comes from the conditional probability structure in the data, and that structure is the result of people arranging words usefully to convey information to one another.
- From a computer scientist's perspective, the differentiable hash table interpretation fit well. The AIAYN paper also hints in that direction by using the names query/key/value, but it never says “hash table” explicitly. Perhaps that was introduced in some other paper.
- My personal understanding of attention is that a Transformer's output is a sequence of new token vectors, and each output token vector incorporates context information from the input token vectors around it.
  I know this explanation is incomplete, but it's better than nothing.
There is a convincing visualization showing how an LLM works while processing a simple request: https://bbycroft.net/llm
It complements 3blue1brown's detailed explanation well.
- Visualized like this, GPT-3's scale feels unbelievably large. I can't even properly imagine what GPT-4 would look like here.
Great video. It shows well why the Q*K matrix multiplication is the bottleneck. If the sequence (the context window) has length S, a matrix of size SxS holding the results for all queries against all keys must be stored in memory.
One new-ish idea for improving this bottleneck is Ring Attention, and this article explains it well: https://learnandburn.ai/p/how-to-build-a-10m-token-context
I edited that article.
- With Flash Attention you never have to materialize the (S, S) matrix. The formula is softmax(Q @ K^T / sqrt(d)) @ V, so the final output can be built up tile by tile.
  In Unsloth, Flash Attention makes memory usage grow linearly rather than quadratically, fine-tuning is 2x faster, VRAM usage drops by 80%, and inference is also 2x faster. The amount of compute is still O(N^2), though.
  For long context, Unsloth's latest release can hold 4x longer context than HF+FA2 with +1.9% overhead, making up to 228K context possible on an H100.
- The video also lists Ring Attention and several other techniques, but says they are outside its scope: https://youtu.be/eMlx5fFNoYc?t=784
The previous lesson, “But what is a GPT?”, is also really good: https://www.3blue1brown.com/lessons/gpt
This video made me realize that the attention mechanism is less a specific function than a kind of meta-function.
If I understood correctly, attention plus learned weights lets a transformer learn a more or less arbitrary function, and that function includes a matching mechanism such as the scaled dot product.
- Right. The power of attention lies in exploring the function space and finding the best function within the constraints.
  That's why I think linear attention can never come close to the capability of standard attention: the quadratic term that explores every input-output pair is its essential feature.
The video was easy to digest, and the animation contributed a lot. The way things expanded, contracted, and unfolded in time with the narration was very well done.
- That is definitely something he does better than most. He even has his own custom animation library for math animations: https://github.com/3b1b/manim
I work in a closely related area, and this video immediately went into our team's onboarding docs.
It also matters that much of the visualization code is available on GitHub: https://github.com/3b1b/videos/tree/master/_2024/transformers
- Interesting. I'm curious what else is in that onboarding doc.
I finally get it. I don't know why other videos make this so confusing.
- The topic itself is confusing, and 3b1b is just that good.
- In my experience, with very rare exceptions such as Feynman, researchers are often the worst at explaining their own work clearly to others.
  I'm starting to think that teaching skill and research skill are generally separate, almost mutually exclusive skills.
- I want to make better educational videos and content, so I'm curious: in what ways did the other videos fall short compared with 3b1b?
- Grant has a talent for explaining complicated things very clearly. There's a reason his channel is popular.
- I don't know whether that was a rhetorical question, but it's an interesting one. I think most people find transformers confusing for at least three reasons.
  First, the standard terminology isn't good. “Attention” is only barely intuitive, “self-attention” is worse, and don't get me started on “key” and “value”.
  Second, the main papers (Attention is All You Need, the BERT paper, and so on) were not well written. I'm not belittling their achievements, but an influential paper with a huge breakthrough can still be weak on explanation, and in my view these were.
  Third, these structures were generally discovered by trying this, trying that, and keeping whatever fit well. There was no prior introspective process that predicted the structure would work well and was then verified by experiment; it was empirical from start to finish.
  So we don't fully understand why it works so well, and all the explanations are largely post-hoc rationalizations. Some recent work even suggests that with enough tuning other structures can work almost as well. It's hard to explain something we don't fully understand.
I'd like to know whether there is any reference material on how the current architecture evolved. I want to see the progression from a very simple core idea to the famous “all you need” paper.
Otherwise many of the mechanisms seem to appear out of nowhere: lots of computation, little intuition.
Jeremy Howard said on Twitter that he has seen many versions of this idea many times, which suggests it was a natural idea. Examples of how the idea emerged elsewhere would help build intuition.
- Roughly, the progression is this. The early seq-2-seq approach used LSTMs: one encoded the input sequence and another decoded the output sequence. That encoding variable-length sentences into a fixed-size vector and then decoding into a second sequence, usually of a different length, worked at all is surprising in itself.
  This RNN/LSTM approach was limited by the fixed-size representation, and it was also hard to decide which part of the input sequence to use when generating a particular part of the output. Bahdanau et al. solved this by adding an attention mechanism to the encoder-decoder RNN, letting the model look at all previous RNN states rather than just the final state.
  RNNs were inefficient to train, so Jakob Uszkoreit wanted a way to make better use of large-scale parallel hardware, and he noted that language is hierarchical as well as sequential. He proposed a layered structure in which each layer processes the tokens of a subsequence in parallel and, keeping Bahdanau-style attention, predicts the next layer using self-attention in which the tokens refer to one another.
  The early implementation worked but was no better than other approaches of the time, such as convolution. Noam Shazeer later developed the idea into a structure that worked much better, and after experiments removing unnecessary components it became the original transformer, as far as I know. I don't know exactly who came up with the key-based form of attention in the final structure.
  The original transformer in the Attention is All You Need paper had a separate encoder and decoder, like the older RNN-based approaches, and it was also used in early models such as Google's BERT. But this wasn't necessary for language models, so OpenAI's GPT used only the decoder part, and now almost everyone takes that approach. In a decoder-only transformer the input sentence enters at the bottom layer, is transformed step by step as it passes through each layer, and comes out at the top. An end token is appended to the input sequence, and it is that token which gets transformed into the next token of the output sequence, i.e. the last token.
- Karpathy summarized the history of the transformer architecture well in a Stanford lecture: https://youtu.be/XfpMkf4rD6E?si=MDICNzZ_Mq9uzRo9&t=618
