[Translation] A Visual Guide to Vision Transformers

(discuss.pytorch.kr)

13 points by ninebow 2024-04-22 | 1 comment

ℹ️ This is a translation of A Visual Guide to Vision Transformers, the visual introduction to the Vision Transformer (ViT) written by data scientist and software engineer Dennis Turp. It was prepared with the author's permission after seeing the visual guide that xguru introduced.
The Vision Transformer (ViT) is a model that applies the Transformer to computer vision (CV) and performs very well in areas such as object detection and image classification. In particular, it is widely used as a visual encoder that extracts features from images.
Because the explanations in the original article are brief, a few notes have been added where the material might be hard to follow.

A Visual Guide to Vision Transformers (ViT)

This is a visual guide to Vision Transformers (ViTs), a class of deep learning models that have achieved state-of-the-art (SotA) performance on image classification tasks. Vision Transformers apply the transformer architecture, originally designed for natural language processing (NLP), to image data. This guide walks through the key components of Vision Transformers in a scroll-story format, using visualizations and simple explanations to show how these models work and how data flows through them. (Translator's note: the scroll-based presentation is hard to reproduce here, so it has been replaced with image captures. It is best read alongside the original article.)

0. Let's start with the data

Like ordinary convolutional neural networks (CNNs), vision transformers are trained in a supervised manner. This means the model is trained on a dataset of images and their corresponding labels.

1. Focus on one data point

To get a better understanding of what happens inside a vision transformer, let's focus on a single data point (a batch size of 1). And let's ask the question: how is this data point prepared so that it can be consumed by a transformer?

2. Forget the label for the moment

The label will become more relevant later. For now, the only thing we are left with is a single image.

3. Create patches of the image

To prepare the image for use inside the transformer, we divide it into equally sized patches of size p x p.
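The patching step can be sketched with plain NumPy reshapes. This is only an illustration under assumed toy sizes (a 6 x 6 RGB image and p = 3); the variable names are hypothetical and not taken from the original article or its notebook.

```python
import numpy as np

# Hypothetical toy sizes: a 6x6 RGB image split into 3x3 patches.
H, W, c, p = 6, 6, 3, 3
image = np.arange(H * W * c, dtype=np.float32).reshape(H, W, c)

# Split height and width into (block index, offset inside block),
# then move the two block axes to the front.
blocks = image.reshape(H // p, p, W // p, p, c).transpose(0, 2, 1, 3, 4)

# Collapse the block grid into one axis: n patches, each p x p x c.
patches = blocks.reshape(-1, p, p, c)
n = patches.shape[0]
print(patches.shape)  # (4, 3, 3, 3)
```

Patches come out in row-major order, so `patches[1]` is the top-right 3 x 3 block of the image.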

4. Flattening the image patches

The patches are now flattened into vectors of dimension p' = p² · c, where p is the side length of a patch and c is the number of channels. (Translator's note: for an RGB image, for example, the number of channels is 3.)
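As a small sketch of the flattening step, assuming the patches are already stored as an array of shape (n, p, p, c) with hypothetical toy sizes:

```python
import numpy as np

n, p, c = 4, 3, 3  # hypothetical: 4 patches of 3x3 pixels, RGB
rng = np.random.default_rng(0)
patches = rng.random((n, p, p, c))

# Each patch becomes one row with p' = p^2 * c entries.
flat = patches.reshape(n, p * p * c)
print(flat.shape)  # (4, 27)
```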

5. Creating patch embeddings

These image patch vectors are now encoded using a linear transformation. The resulting patch embedding vector has a fixed size d.

6. Embedding all patches

Now that we have embedded all of our image patches into vectors of fixed size, we are left with an array of size n x d, where n is the number of image patches and d is the size of a patch embedding.
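The linear embedding of all patches amounts to a single matrix multiplication. The sketch below assumes toy sizes and a bias term; the names `W_embed` and `b_embed` are hypothetical, and in a real model both would be learned.

```python
import numpy as np

n, p_flat, d = 4, 27, 8  # hypothetical sizes: n patches, p' = 27, embedding size d = 8
rng = np.random.default_rng(0)
flat = rng.random((n, p_flat))  # stand-in for the flattened patches

# Learnable parameters of the linear transformation (random here).
W_embed = rng.normal(scale=0.02, size=(p_flat, d))
b_embed = np.zeros(d)

# One matrix product embeds every patch at once: (n, p') @ (p', d) -> (n, d).
embeddings = flat @ W_embed + b_embed
print(embeddings.shape)  # (4, 8)
```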

7. Appending a classification token

To train the model effectively, we extend the array of patch embeddings with one additional vector called the classification token (CLS token). This vector is a learnable parameter of the network and is randomly initialized. Note: there is only one CLS token, and the same vector is appended for every data point. (Translator's note: at this point the n patch embeddings plus the CLS token give n+1 embeddings of size d each, so the array has size (n+1) x d.)
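To show that one shared vector is attached to every data point, the sketch below uses a batch of 2. Placing the token at index 0 is an assumption that follows common implementations; the article only says the array is extended by one vector.

```python
import numpy as np

batch, n, d = 2, 4, 8  # hypothetical sizes
rng = np.random.default_rng(0)
patch_emb = rng.normal(size=(batch, n, d))

# A single learnable vector, randomly initialized and shared by all images.
cls_token = rng.normal(size=(1, 1, d))

# Repeat the same token for every item in the batch, then concatenate it
# along the token axis (position 0 is an assumption).
cls = np.broadcast_to(cls_token, (batch, 1, d))
tokens = np.concatenate([cls, patch_emb], axis=1)
print(tokens.shape)  # (2, 5, 8)
```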

8. Add positional embedding vectors

So far our patch embeddings carry no positional information. We remedy that by adding a learnable, randomly initialized positional embedding vector to each patch embedding. We also add such a positional embedding vector to the classification token. (Translator's note: in a Transformer the positional encoding values are added to the embeddings, so the vector size does not change.)

9. Transformer input

After the positional embedding vectors have been added, we are left with an array of size (n+1) x d. This is the input to the transformer, which is explained in greater detail in the next steps.
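A minimal sketch of the addition, with hypothetical toy sizes and random stand-ins for the learned values:

```python
import numpy as np

n, d = 4, 8  # hypothetical sizes
rng = np.random.default_rng(0)
tokens = rng.normal(size=(n + 1, d))  # CLS token + n patch embeddings

# One learnable positional vector per token position, randomly initialized.
pos_emb = rng.normal(scale=0.02, size=(n + 1, d))

# Element-wise addition: the shape stays (n+1) x d.
x = tokens + pos_emb
print(x.shape)  # (5, 8)
```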

10.1. Transformer: QKV creation

Each transformer input embedding vector is linearly mapped to one larger vector. Each of these new vectors is then split into three equally sized parts: the query vector Q, the key vector K, and the value vector V. We end up with (n+1) of each of these vectors.
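The QKV step for a single head might look as follows. The per-head size `d_head` and the weight name `W_qkv` are assumptions made for this sketch.

```python
import numpy as np

n, d, d_head = 4, 8, 4  # hypothetical sizes
rng = np.random.default_rng(0)
x = rng.normal(size=(n + 1, d))  # transformer input

# One linear map produces a larger vector of size 3 * d_head per token.
W_qkv = rng.normal(scale=0.02, size=(d, 3 * d_head))
qkv = x @ W_qkv  # shape (n+1, 3 * d_head)

# Split each large vector into three equally sized parts.
Q, K, V = np.split(qkv, 3, axis=-1)
print(Q.shape, K.shape, V.shape)  # (5, 4) (5, 4) (5, 4)
```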

10.2. Transformer: Attention score calculation

To calculate the attention scores A, we take the dot product of every query vector in Q with every key vector in K. This gives an (n+1) x (n+1) matrix of scores.

10.3. Transformer: Attention score matrix

Now that we have the attention score matrix A, we apply a softmax function to every row so that each row sums to 1.
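Steps 10.2 and 10.3 can be sketched together. Dividing the scores by the square root of the head size is standard scaled dot-product attention; the article does not mention it, so treat it here as an added assumption.

```python
import numpy as np

def softmax_rows(a):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n, d_head = 4, 4  # hypothetical sizes
rng = np.random.default_rng(0)
Q = rng.normal(size=(n + 1, d_head))
K = rng.normal(size=(n + 1, d_head))

# Dot product of every query with every key: (n+1) x (n+1) scores.
# The 1/sqrt(d_head) scaling is an assumption (standard practice).
scores = Q @ K.T / np.sqrt(d_head)

# After the softmax, every row is a set of weights that sums to 1.
A = softmax_rows(scores)
print(A.shape)  # (5, 5)
```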

10.4. Transformer: Aggregated contextual information calculation

To calculate the aggregated contextual information for the first patch embedding vector, we focus on the first row of the attention matrix and use its entries as weights for the value vectors V. The resulting weighted sum is the aggregated contextual information vector for the first image patch embedding.

10.5. Transformer: Aggregated contextual information for every patch

We now repeat this process for every row of the attention score matrix, which yields n+1 aggregated contextual information vectors: one for each of the n patches plus one for the classification token. This step concludes our first attention head.
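The weighted sum for the first row, and the same operation for all rows at once, can be sketched like this. The attention matrix is a random row-normalized stand-in, not the output of a real model.

```python
import numpy as np

n, d_head = 4, 4  # hypothetical sizes
rng = np.random.default_rng(0)

# Stand-in attention matrix whose rows sum to 1.
A = rng.random((n + 1, n + 1))
A /= A.sum(axis=1, keepdims=True)
V = rng.normal(size=(n + 1, d_head))

# Step 10.4: the first row of A weights the value vectors.
first = sum(A[0, j] * V[j] for j in range(n + 1))

# Step 10.5: one matrix product repeats this for every row.
context = A @ V
print(context.shape)  # (5, 4)
```

The matrix product `A @ V` is just the row-by-row weighted sum written compactly, which is why `context[0]` matches `first`.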

10.6. Transformer: Multi-head attention

Because we are dealing with multi-head attention, we repeat the entire process from steps 10.1 to 10.5 with a different QKV mapping. For this explanatory setup we assume 2 heads, but a ViT typically has many more. In the end this results in multiple aggregated contextual information vectors.

10.7. Transformer: Last attention layer step

These heads are stacked together and mapped to vectors of size d, the same size as our patch embeddings.
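Putting steps 10.1 to 10.7 together for 2 heads could look like the sketch below. The helper `attention_head`, the weight names, and the score scaling are assumptions for illustration; this is not the code from the article's notebook.

```python
import numpy as np

def softmax_rows(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_head(x, W_qkv):
    """One head: QKV creation, scaled scores, softmax, weighted sum of V."""
    Q, K, V = np.split(x @ W_qkv, 3, axis=-1)
    A = softmax_rows(Q @ K.T / np.sqrt(Q.shape[-1]))
    return A @ V

n, d, heads, d_head = 4, 8, 2, 4  # hypothetical sizes
rng = np.random.default_rng(0)
x = rng.normal(size=(n + 1, d))

# Each head has its own QKV mapping.
W_qkv_per_head = [rng.normal(scale=0.1, size=(d, 3 * d_head)) for _ in range(heads)]
# The output projection maps the stacked heads back to size d.
W_out = rng.normal(scale=0.1, size=(heads * d_head, d))

stacked = np.concatenate([attention_head(x, W) for W in W_qkv_per_head], axis=-1)
attn_out = stacked @ W_out
print(stacked.shape, attn_out.shape)  # (5, 8) (5, 8)
```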

10.8. Transformer: Attention layer result

The previous step concludes the attention layer. We are left with the same number of embeddings, of exactly the same size, as we used as input.

10.9. Transformer: Residual connections

Transformers make heavy use of residual connections, which simply means adding the input of a layer to that layer's output. This is what we do now with the attention layer.

10.10. Transformer: Residual connection result

Because vectors of the same size d are added together, the addition results in vectors of that same size.
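A residual connection is a one-line operation. In this sketch `stand_in_layer` is a hypothetical placeholder for the attention layer; any function that preserves the shape would do.

```python
import numpy as np

def with_residual(layer, x):
    """Apply `layer` and add its input back onto its output."""
    return x + layer(x)

n, d = 4, 8  # hypothetical sizes
rng = np.random.default_rng(0)
x = rng.normal(size=(n + 1, d))
W = rng.normal(scale=0.1, size=(d, d))

def stand_in_layer(t):
    # Placeholder for the attention layer: maps (n+1, d) -> (n+1, d).
    return t @ W

y = with_residual(stand_in_layer, x)
print(y.shape)  # (5, 8)
```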

10.11. Transformer: Feed-forward network

These outputs are now fed through a feed-forward neural network with non-linear activation functions.
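A two-layer feed-forward network applied to every token independently might be sketched as follows. The hidden size (4 times d) and the ReLU activation are assumptions; the article does not specify them, and ViT implementations commonly use GELU instead.

```python
import numpy as np

n, d, d_hidden = 4, 8, 32  # hypothetical sizes
rng = np.random.default_rng(0)
x = rng.normal(size=(n + 1, d))

# Learnable parameters (random here).
W1 = rng.normal(scale=0.1, size=(d, d_hidden))
b1 = np.zeros(d_hidden)
W2 = rng.normal(scale=0.1, size=(d_hidden, d))
b2 = np.zeros(d)

# Expand, apply the non-linearity (ReLU as a stand-in), project back to d.
hidden = np.maximum(x @ W1 + b1, 0.0)
ffn_out = hidden @ W2 + b2
print(ffn_out.shape)  # (5, 8)
```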

10.12. Transformer: Final result

After the feed-forward network there is another residual connection, which we skip here for brevity. With that, the transformer layer is complete. In the end, the transformer produces outputs of the same size as its input.

11. Repeat transformers

The entire transformer calculation, steps 10.1 to 10.12, is repeated several times, for example 6 times.
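Stacking the layer 6 times can be sketched as a simple loop in which each layer has its own parameters. Everything here (function names, sizes, ReLU, score scaling) is an assumption for illustration, and the sketch covers only the steps described in this guide.

```python
import numpy as np

def softmax_rows(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def init_layer(rng, d, heads, d_head, d_hidden, scale=0.1):
    """Random stand-ins for the learnable parameters of one layer."""
    return {
        "W_qkv": rng.normal(scale=scale, size=(heads, d, 3 * d_head)),
        "W_out": rng.normal(scale=scale, size=(heads * d_head, d)),
        "W1": rng.normal(scale=scale, size=(d, d_hidden)),
        "W2": rng.normal(scale=scale, size=(d_hidden, d)),
    }

def transformer_layer(x, params):
    # Steps 10.1-10.7: multi-head attention.
    head_outputs = []
    for W_qkv in params["W_qkv"]:
        Q, K, V = np.split(x @ W_qkv, 3, axis=-1)
        A = softmax_rows(Q @ K.T / np.sqrt(Q.shape[-1]))
        head_outputs.append(A @ V)
    # Steps 10.9-10.10: residual connection around attention.
    x = x + np.concatenate(head_outputs, axis=-1) @ params["W_out"]
    # Steps 10.11-10.12: feed-forward network (ReLU stand-in) plus residual.
    hidden = np.maximum(x @ params["W1"], 0.0)
    return x + hidden @ params["W2"]

rng = np.random.default_rng(0)
n, d, depth = 4, 8, 6  # hypothetical sizes; depth = 6 as in the example
layers = [init_layer(rng, d, heads=2, d_head=4, d_hidden=32) for _ in range(depth)]

x = rng.normal(size=(n + 1, d))
out = x
for params in layers:
    out = transformer_layer(out, params)
print(out.shape)  # (5, 8)
```

Because every layer returns an array of the same shape as its input, the layers can be chained any number of times.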

12. Identify the classification token output

The next step is to identify the classification token output. This vector will be used in the final step of our Vision Transformer journey.

13. Final step: Predicting classification probabilities

In the final step, we pass this classification output token through another fully connected neural network to predict the classification probabilities of our input image.
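Steps 12 and 13 can be sketched with a single linear layer followed by a softmax. That the CLS token sits at index 0, that the head is a single layer, and that there are 3 classes are all assumptions of this sketch.

```python
import numpy as np

n, d, num_classes = 4, 8, 3  # hypothetical sizes
rng = np.random.default_rng(0)
out = rng.normal(size=(n + 1, d))  # stand-in for the final transformer output

# Step 12: pick the output that belongs to the classification token
# (index 0 is an assumption about where the token was placed).
cls_out = out[0]

# Step 13: a fully connected layer maps it to one score per class.
W_head = rng.normal(scale=0.1, size=(d, num_classes))
b_head = np.zeros(num_classes)
logits = cls_out @ W_head + b_head

# Softmax turns the scores into class probabilities.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs.shape)  # (3,)
```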

14. Training of the Vision Transformer

We train the Vision Transformer using a standard cross-entropy loss function, which compares the predicted class probabilities with the true class labels. The model is trained using backpropagation and gradient descent, updating the model parameters to minimize the loss function.
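The loss and the update rule can be illustrated on the classification head alone, where the gradient is easy to write by hand. This is a deliberately reduced sketch: a real training loop would backpropagate through the whole model, typically with an autodiff framework such as PyTorch. The sizes, learning rate, and label are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, num_classes = 8, 3  # hypothetical sizes
cls_out = rng.normal(size=d)  # stand-in for the CLS token output
true_label = 2  # hypothetical ground-truth class

W_head = rng.normal(scale=0.1, size=(d, num_classes))
b_head = np.zeros(num_classes)
lr = 0.05  # learning rate (assumption)

losses = []
for step in range(20):
    probs = softmax(cls_out @ W_head + b_head)
    # Cross-entropy loss: negative log-probability of the true class.
    losses.append(-np.log(probs[true_label]))
    # Gradient of the loss with respect to the logits: probs - one_hot(label).
    grad_logits = probs.copy()
    grad_logits[true_label] -= 1.0
    # Gradient descent step on the head parameters only.
    W_head -= lr * np.outer(cls_out, grad_logits)
    b_head -= lr * grad_logits

print(losses[0] > losses[-1])  # True
```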

Conclusion

In this visual guide, we have walked through the key components of Vision Transformers, from data preparation to the training of the model. We hope this guide has helped you understand how Vision Transformers work and how they can be used to classify images.

A small Colab Notebook has also been prepared to help you understand the Vision Transformer even better. Be sure to look for the 'Blogpost' comments. The code is taken from the great VIT PyTorch implementation by @lucidrains, so be sure to check out his work as well.

If you have any questions or feedback, please feel free to reach out to me. Thank you for reading! (Author's GitHub, X (Twitter), Threads, LinkedIn)

Acknowledgements

VIT PyTorch implementation by @lucidrains
All images have been taken from Wikipedia and are licensed under the Creative Commons Attribution-Share Alike 4.0 International license (CC BY-SA 4.0).


1 comment

gcback 2024-04-22

Thank you for the effort you put into this useful material.