Understanding Automatic Differentiation in 30 Lines of Python

(vmartin.fr)

3 points by GN⁺ 2023-08-27 | 1 comment

The article implements automatic differentiation, the mechanism at the core of neural network training, as a scalar-only Tensor class, and shows how value computation and derivative computation connect on the same computation graph
With ordinary Python variables, z = x + y keeps only the resulting value and the relationship is lost, so a Tensor has to store both a value and its operation history
Children(a, b, op) and recursive forward() calls build a binary-tree computation graph, and because addition and multiplication are redefined, the expression can be re-evaluated even when values are filled in later
grad(deriv_to) treats the derivative of a tensor with respect to itself as 1 and with respect to any other scalar as 0, then recursively applies the differentiation rules of the basic operations to build a new computation graph
The implementation handles only scalars and can be slow; array operations, pruning of multiplications by zero, constant-node handling, and a cache to reduce repeated calculations remain as possible improvements

The relationship disappears with ordinary Python variables

After computing x = 3, y = 5, z = x + y, z holds only the resulting value 8
Even if x or y changes later, z cannot track which variables it was built from
Because no relationship between the variables survives, it is hard to automatically compute a derivative with respect to a specific variable
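
The point above can be seen directly in a few lines of plain Python. This is only an illustration of the problem, not code from the article.

```python
x = 3
y = 5
z = x + y   # z is just the int 8; no link back to x or y survives

x = 4       # rebinding x later has no effect on z
print(z)    # 8
```

Since z is an ordinary integer, there is nothing to traverse when asking how z depends on x.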

Preserving operation history with `Tensor`

The new Tensor type stores a value (value) and redefines operators so that a calculation between Tensor objects returns a new Tensor
The initial implementation redefines only __add__, so Tensor(3) + Tensor(5) can produce T:8
At this stage z still does not preserve the operation history that says it is the result of x + y
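
A minimal sketch of what this first stage might look like. The `__repr__` format is an assumption chosen to reproduce the T:8 output mentioned above; the original code may differ in detail.

```python
class Tensor:
    """Scalar wrapper; at this stage it stores only a value."""

    def __init__(self, value=None):
        self.value = value

    def __repr__(self):
        # Assumed formatting, chosen to match the "T:8" output in the text.
        return f"T:{self.value}"

    def __add__(self, other):
        # Returns a fresh Tensor, but the fact that it came from x + y is lost.
        return Tensor(self.value + other.value)


x = Tensor(3)
y = Tensor(5)
z = x + y
print(z)  # T:8
```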

Computation graph and `forward()`

To preserve operation history, Children = namedtuple('Children', ['a', 'b', 'op']) is introduced
- a: left input tensor
- b: right input tensor
- op: the actual operation, such as np.add or np.multiply
Each Tensor can hold children in addition to its numeric value, and these form a computation graph shaped like a binary tree
forward() visits the child nodes recursively and calculates the actual values
- With x = Tensor(3) and y = Tensor(5), z1 = x + y followed by z2 = z1 * y gives T:40
- Building the graph first with x = Tensor(None), y = Tensor(None), then setting x.value = 3, y.value = 5 and calling z2.forward() also yields T:40
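
The graph-building step described above could be sketched as follows. To keep the example free of third-party dependencies, it uses `operator.add` and `operator.mul` from the standard library in place of the `np.add` and `np.multiply` the article mentions; the rest of the structure is an assumption based on the description.

```python
import operator
from collections import namedtuple

Children = namedtuple("Children", ["a", "b", "op"])


class Tensor:
    def __init__(self, value=None, children=None):
        self.value = value
        self.children = children

    def __repr__(self):
        return f"T:{self.value}"

    def forward(self):
        # Leaf tensors have no children; their value is whatever was assigned.
        if self.children is None:
            return self
        a = self.children.a.forward()
        b = self.children.b.forward()
        # Only compute when both inputs are known.
        if a.value is not None and b.value is not None:
            self.value = self.children.op(a.value, b.value)
        return self

    def __add__(self, other):
        return Tensor(children=Children(self, other, operator.add)).forward()

    def __mul__(self, other):
        return Tensor(children=Children(self, other, operator.mul)).forward()


x = Tensor(None)
y = Tensor(None)
z2 = (x + y) * y     # builds the graph; no values are available yet

x.value = 3
y.value = 5
print(z2.forward())  # T:40
```

Because the graph keeps references to x and y, assigning new values and calling forward() again re-evaluates the same expression.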

Building automatic differentiation as a computation graph

Automatic differentiation is implemented by attaching a differentiation rule to every basic operation that Tensor supports
grad(self, deriv_to) traverses the computation graph recursively and decomposes a complex function into a combination of simple functions
The basic rules are as follows
- Differentiating a tensor with respect to itself gives Tensor(1)
- Differentiating a scalar with no children with respect to any other tensor gives Tensor(0)
- addition: (a + b)' = a' + b'
- multiplication: (ab)' = a'b + ab'
Differentiating z2 = (x + y) * y with respect to y gives a result g that is not a plain value but a new computation graph representing the partial derivative
- As an expression, g = ∂z2/∂y = x + 2*y
- With x = 3, y = 5, the value of g is 13
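
The four rules above can be sketched as a recursive `grad` method. This is an illustrative reconstruction under the same assumptions as before (standard-library `operator` functions instead of NumPy), not necessarily the article's exact code.

```python
import operator
from collections import namedtuple

Children = namedtuple("Children", ["a", "b", "op"])


class Tensor:
    def __init__(self, value=None, children=None):
        self.value = value
        self.children = children

    def forward(self):
        if self.children is None:
            return self
        a = self.children.a.forward()
        b = self.children.b.forward()
        if a.value is not None and b.value is not None:
            self.value = self.children.op(a.value, b.value)
        return self

    def __add__(self, other):
        return Tensor(children=Children(self, other, operator.add)).forward()

    def __mul__(self, other):
        return Tensor(children=Children(self, other, operator.mul)).forward()

    def grad(self, deriv_to):
        if self is deriv_to:
            return Tensor(1)                 # d(self)/d(self) = 1
        if self.children is None:
            return Tensor(0)                 # unrelated leaf scalar
        a, b, op = self.children
        if op is operator.add:               # (a + b)' = a' + b'
            return a.grad(deriv_to) + b.grad(deriv_to)
        if op is operator.mul:               # (ab)' = a'b + ab'
            return a.grad(deriv_to) * b + a * b.grad(deriv_to)
        raise NotImplementedError(op)


x = Tensor(3)
y = Tensor(5)
z2 = (x + y) * y
g = z2.grad(y)      # g is itself a graph for x + 2*y
print(g.value)      # 13
```

Since g is a graph rather than a number, changing x.value or y.value and calling g.forward() re-evaluates the derivative at the new point.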

Extending to subtraction, division, and the exponential function

To handle more complex expressions, subtraction, division, the exponential function, and negation are added to Tensor
grad() has a differentiation rule matching each operation
- subtraction: (a - b)' = a' - b'
- division: (a/b)' = (a'b - ab') / b²
- exponential function: exp(a)' = a' * exp(a)
forward() is also changed to handle operations that need only one term
- Example: exp(a) does not need a second term b
- -x is handled in the form 0 - x

Example expression and Sympy verification

The following expression was written with Tensor, and its partial derivatives with respect to x and y were calculated

z = (12 - (x * e^y)) / (45 + x * y * e^(-x))

In code it is expressed like this

x = Tensor(3)
y = Tensor(5)
z = (Tensor(12) - (x * y.exp())) / (Tensor(45) + x * y * (-x).exp())

The calculated partial derivative values are as follows
- z.grad(x) → T:-3.34729777301069
- z.grad(y) → T:-9.70176956641438
Calculating the same expression with Sympy's diff() and evalf() gives identical results
- At xs = 3, ys = 5 the derivative with respect to x is -3.34729777301069
- The derivative with respect to y is -9.70176956641438
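
As an additional, dependency-free sanity check (a different technique from the Sympy comparison the article uses), the same partial derivatives can be approximated with central finite differences. The helper name `central_diff` and the step size are choices made for this sketch.

```python
import math


def z(x, y):
    """The example expression, written with plain floats."""
    return (12 - x * math.exp(y)) / (45 + x * y * math.exp(-x))


def central_diff(f, x, y, wrt, h=1e-6):
    """Approximate a partial derivative of f(x, y) by a central difference."""
    if wrt == "x":
        return (f(x + h, y) - f(x - h, y)) / (2 * h)
    return (f(x, y + h) - f(x, y - h)) / (2 * h)


dz_dx = central_diff(z, 3.0, 5.0, "x")
dz_dy = central_diff(z, 3.0, 5.0, "y")
print(dz_dx, dz_dy)  # should agree with the values above to several digits
```

Unlike the Tensor and Sympy results, these numbers are approximations, so they should match only up to the truncation and rounding error of the finite-difference step.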

Limitations of the simple implementation and optimization points

This implementation is close to the simplest possible automatic differentiation system, and it can also be quite slow
The current class handles only scalars
- To become a more useful library, it would need operations on arrays of arbitrary size
Looking at the computation graph, several optimizations are possible
- In a multiplication node, if one of the children is 0, there is no need to search any deeper
- If a node and its children do not depend on the differentiation target tensor x, the node can be treated as a constant and traversal can stop there
- If the same operation repeats, a cache can avoid performing the same calculation several times
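
The first two ideas (skipping zero products and treating independent subgraphs as constants) could be sketched roughly as follows. This is a hypothetical illustration, not part of the article: values are computed eagerly to keep the class short, and `depends_on` is a helper name introduced here.

```python
import operator
from collections import namedtuple

Children = namedtuple("Children", ["a", "b", "op"])


class Tensor:
    """Simplified scalar tensor: values are computed eagerly, no forward()."""

    def __init__(self, value, children=None):
        self.value = value
        self.children = children

    def __add__(self, other):
        return Tensor(self.value + other.value, Children(self, other, operator.add))

    def __mul__(self, other):
        return Tensor(self.value * other.value, Children(self, other, operator.mul))


def depends_on(node, target):
    """True if target appears anywhere in the subgraph rooted at node."""
    if node is target:
        return True
    if node.children is None:
        return False
    return depends_on(node.children.a, target) or depends_on(node.children.b, target)


def grad(node, target):
    if node is target:
        return Tensor(1)
    if not depends_on(node, target):
        return Tensor(0)          # constant with respect to target: stop here
    a, b, op = node.children
    if op is operator.add:
        return grad(a, target) + grad(b, target)
    if op is operator.mul:
        # Skip any product term whose derivative factor is known to be zero.
        terms = []
        if depends_on(a, target):
            terms.append(grad(a, target) * b)
        if depends_on(b, target):
            terms.append(a * grad(b, target))
        result = terms[0]         # at least one term exists on this path
        for term in terms[1:]:
            result = result + term
        return result
    raise NotImplementedError(op)


x = Tensor(3)
y = Tensor(5)
z2 = (x + y) * y
print(grad(z2, y).value, grad(z2, x).value)  # 13 5
```

The repeated `depends_on` walks are themselves redundant work, which is where the caching idea from the last bullet would come in.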

1 comment

GN⁺ 2023-08-27

Hacker News opinions

I love small, elegant code demos like this, because they help you understand concepts by getting your hands dirty
Sasha Rush's GPU puzzles and tensor puzzles are similar examples
https://github.com/srush/GPU-Puzzles
https://github.com/srush/Tensor-Puzzles
- Then https://jaykmody.com/blog/gpt-from-scratch/ might be fun too
  The original code is here: https://github.com/jaymody/picoGPT/blob/main/gpt2.py
- There is also Andrej Karpathy's micrograd: https://github.com/karpathy/micrograd
If you believe this alone gives you a full understanding of automatic differentiation, you are fooling yourself
When the graph is a tree, everything is very simple, as in this article. But when the graph is a more general directed acyclic graph, such as x = 5; y = 2x; z = xy, the implementation can still be very simple, yet it is not easy to see why that implementation is correct. If you think this is “just the ordinary chain rule”, you are still fooling yourself
One of the earliest explanations came from Paul Werbos, who called the required rule the chain rule for ordered derivatives and proved it from the ordinary chain rule by induction. Even so, it does not follow immediately and obviously from the ordinary chain rule. If anyone believes otherwise, I hope they prove me wrong; I would be delighted
- Then where else should one read? The people who built frameworks like autograd, PyTorch, and mxnet must have learned this in detail somewhere, and I would like to know that source. As far as I know, mxnet came out of academia, probably CMU
- Honestly, I don't quite understand what people want from this kind of discussion, and perhaps the reason is that the implied abstraction, ordered derivatives, is not ideal
  Apply the ordinary chain rule along the edges of the computational graph, that is, the directed acyclic graph, and you get the correct value at every step. You need just one extra rule: “if a variable is used several times in the calculation, that is, if several edges leave the same node (or several edges arrive in the reverse direction), the separately computed gradients must be added together.” That too seems fairly basic and intuitive to me
  For example, if you substitute z for both x and y in f(x, y), then d/dz f(z, z) = f_x(z, z) + f_y(z, z), where the subscript denotes a partial derivative. To me this approach is mathematically simpler than merging the two into something “beyond the chain rule”, and it also seems closer to what actual implementations do, especially PyTorch, which I know best
- The chain rule is defined for partial derivatives, so technically this can still be regarded as just the chain rule
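
The "add the gradients when a node is used more than once" rule from this thread can be sketched with a small reverse-mode pass over the DAG example x = 5; y = 2x; z = xy. The `Var` class and `backward` function are hypothetical names for this illustration, loosely in the style of micrograd; they are not the article's code and not PyTorch's implementation.

```python
class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # tuples of (input Var, local partial derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def backward(out):
    """Propagate d(out)/d(node) to every node, visiting each node once."""
    order, seen = [], set()

    def visit(v):
        if id(v) in seen:
            return
        seen.add(id(v))
        for parent, _ in v.parents:
            visit(parent)
        order.append(v)          # inputs are appended before their consumers

    visit(out)
    out.grad = 1.0
    for v in reversed(order):    # consumers first, then their inputs
        for parent, local in v.parents:
            parent.grad += local * v.grad   # contributions along each edge are summed


x = Var(5.0)
y = Var(2.0) * x     # y = 2x
z = x * y            # x is used twice: directly and through y
backward(z)
print(x.grad)        # 20.0, matching d(2x²)/dx = 4x at x = 5
```

Here x receives one contribution from the edge x → z and another through y, and the `+=` is what adds them together.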
Automatic differentiation feels like magic
Many computer scientists have been captivated by it and have written articles introducing the technique from a broad perspective. Mine is one of them, and it also includes a “poor man's variant” that uses complex numbers without operator overloading
https://pizzaseminar.speicherleck.de/automatic-differentiati...
- When I was doing machine learning in 1994~1995, I didn't know about automatic differentiation, and the professor who built the objective function also derived the analytical derivatives by hand. I only learned about it a few years ago, and was surprised to think that in the late 90s I had learned enough Mathematica to produce analytical derivatives myself
- It seems to go back to the 2003 complex-step derivative approximation by J. Martins, P. Sturdza, and J. Alonso. That paper is worth reading
  [0]: https://doi.org/10.1145/838250.838251
- It really does feel like magic. If there is introductory material on backpropagation written in the same way, I would like to know about it
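
The complex-step idea mentioned in this thread can be sketched in a few lines: for a real-analytic f, f'(x) ≈ Im(f(x + ih)) / h for a very small h. The function name and the test function below are choices made for this example, and the trick only works when f is built from operations that accept complex arguments.

```python
import cmath
import math


def complex_step_derivative(f, x, h=1e-20):
    """Approximate f'(x) as Im(f(x + ih)) / h; no subtraction, so no cancellation."""
    return f(complex(x, h)).imag / h


def f(t):
    return t * cmath.exp(t)      # f'(t) = (1 + t) * exp(t)


d = complex_step_derivative(f, 1.0)
print(d, 2 * math.e)             # the two values should agree closely
```

Because no difference of nearly equal numbers is taken, h can be made extremely small, which is what distinguishes this from an ordinary finite difference.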
Here is my own 26-line Python automatic differentiation implementation: https://gist.github.com/sradc/d9d66e3898ffe3a02e0b6b266629b0...
- Being short is nice, but my brain seems to work much better with proper whitespace. I should practice reading other styles like this a bit too
This is very similar to a technique used in knowledge-based engineering systems, where it is called dependency tracking. Combined with node or tensor caching it can reduce computation, which is especially useful for large parametric 3D models
When a value is fetched, the binary/dependency tree is walked recursively to see which variables have changed, and only what is needed is recomputed. With custom Python objects and attributes that define __set__ and __get__ methods, this can be made to look like a built-in feature of the object-oriented model
x = Tensor(3)
y = Tensor(5)
z = x + y
print(x, y) # 3, 5
print(z) # 8
x.value = 4 # nothing is recomputed when the value is set
print(z) # 9, because changed dependencies are recomputed at the moment the value is fetched
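
One possible way to get the behaviour in the snippet above is sketched here. It uses a Python property plus a "dirty" flag pushed to dependents on assignment, which is a slightly different mechanism from the walk-on-fetch check the comment describes; the attribute names are assumptions for this example.

```python
import operator


class Tensor:
    def __init__(self, value=None, inputs=(), op=None):
        self._value = value
        self.inputs = inputs
        self.op = op
        self.dependents = []
        self.dirty = op is not None      # derived tensors start uncomputed
        for t in inputs:
            t.dependents.append(self)

    @property
    def value(self):
        # Recompute lazily, and only if something upstream has changed.
        if self.dirty:
            self._value = self.op(*(t.value for t in self.inputs))
            self.dirty = False
        return self._value

    @value.setter
    def value(self, new_value):
        # Setting a value computes nothing; it only marks dependents stale.
        self._value = new_value
        self._invalidate_dependents()

    def _invalidate_dependents(self):
        for d in self.dependents:
            if not d.dirty:
                d.dirty = True
                d._invalidate_dependents()

    def __add__(self, other):
        return Tensor(inputs=(self, other), op=operator.add)

    def __repr__(self):
        return str(self.value)


x = Tensor(3)
y = Tensor(5)
z = x + y
print(z)        # 8
x.value = 4     # marks z as stale, computes nothing
print(z)        # 9
```

A cached value is reused on every fetch until one of its inputs is assigned again.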
Andrej Karpathy has an interesting video on building an autograd engine, and it gives a lot of insight
https://youtu.be/VMj-3S1tku0?si=wuKhELwOwoYbzpt7
repository:
https://github.com/karpathy/micrograd
The variant of automatic differentiation I know of does not build an operation graph. Instead it calculates the relevant values on the fly
- You are probably thinking of forward mode automatic differentiation. It is more useful when the function's output dimension is relatively large, and it differs from reverse mode automatic differentiation, which is more useful when the output dimension is relatively small
  Both work, but depending on the situation one is more efficient. In cases like “neural network training”, a single loss output is usually optimized over many targets, so reverse mode is generally used
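
The graph-free, on-the-fly variant discussed here is commonly implemented with dual numbers. The sketch below is a generic forward-mode illustration with a hypothetical `Dual` class, reusing the article's z2 = (x + y) * y example; it is not code from the article.

```python
class Dual:
    """A value together with its derivative with respect to one chosen input."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        # Product rule applied immediately; no graph is stored.
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)


# Seed deriv=1.0 on the variable we differentiate with respect to (here y).
x = Dual(3.0, 0.0)
y = Dual(5.0, 1.0)
z2 = (x + y) * y
print(z2.value, z2.deriv)  # 40.0 13.0
```

Each forward pass yields the derivative with respect to a single seeded input, so obtaining the derivative with respect to x as well requires a second pass with the seed moved to x.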
It would be better to call automatic differentiation simply the numerical chain rule, or at least to explain it that way. That is literally all it is, plus a few tricks for avoiding explicit computation of the Jacobian matrix for certain operations, so this is much clearer
- The “autodiff” explained here, and the one most used in backpropagation implementations, is reverse mode automatic differentiation, but there is also forward mode, as well as strategies between these two extremes. In the end it all comes down to the chain rule, but at the algorithm level the choice of approach is far from trivial
  In fact, if you told someone to use the chain rule to propagate gradients through a computational graph, most people would intuitively default to forward mode. So would I
  https://en.wikipedia.org/wiki/Automatic_differentiation#Beyo...
  From this viewpoint, it seems useful to have this term for a particular way of accumulating gradients while traversing the expression given by the chain rule
- That is technically wrong. A numerical chain rule uses the finite difference method, and errors accumulate over the course of the calculation
  See the “Difference from other methods” section: https://en.m.wikipedia.org/wiki/Automatic_differentiation
  As the neighboring comment says, the key point is that the implementation really matters and is worth reading. Calling automatic differentiation a family of ways to implement the chain rule is fine, but calling it “just” the numerical chain rule is wrong
- It may be more accurate, but I wouldn't call it clearer
Automatic differentiation is just the Cartesian lens of the Jacobian matrix and the total derivative in the category of smooth functions, so what's the problem? https://www.youtube.com/watch?v=ne99laPUxN4
I'd like to know why the class is named Tensor. Is there a way to think of the expression or its derivative as a tensor? Or is it because a scalar is also a tensor, and the class can be extended to support other tensor types?
- I could be wrong, but mathematically I think a 2D object is called a matrix and an object of 3D or higher is called a tensor
  The automatic differentiation algorithm described works on arbitrary high-dimensional objects, so calling such objects tensors makes sense
