Diffusion models implemented from scratch: a new theoretical perspective

(chenyang.co)

2 points by GN⁺ 2024-03-12 | 1 comment

Diffusion models are used well beyond image generation, in audio, video, 3D, protein design, and robot path planning, wherever a problem requires sampling from a multimodal distribution; this tutorial connects training and sampling through the lens of optimization
Training adds noise to the data to form (x_\sigma=x_0+\sigma\epsilon), and the neural network (\epsilon_\theta(x,\sigma)) is trained to estimate the noise direction by minimizing mean squared error
The trained denoiser can be understood as an approximate projection onto the data set (\mathcal{K}), and the ideal denoiser is tied to the gradient of a (\sigma)-smoothed squared distance function
DDIM sampling can be viewed as approximate gradient descent on (f(x)=\frac{1}{2}\mathrm{dist}_{\mathcal{K}}(x)^2), and the (\sigma_t) schedule determines the number of iterations and hence the denoiser evaluation cost
Combining a gradient-estimation update with noise addition places DDIM, DDPM, and the authors' improved sampler in a single framework parameterized by gam and mu, which the tutorial then carries over to toy-model and latent diffusion examples

Diffusion models from an optimization perspective

Diffusion models are strong at generating samples from multimodal distributions, and beyond text-to-image tools such as Stable Diffusion they are applied to audio, video, and 3D generation, protein design, and robot path planning
The tutorial's theoretical foundation is the optimization interpretation given in an ICML 2024 paper and a related paper
The implementation mainly follows smalldiffusion, and the code in the article is simplified for teaching purposes compared with the original library

Training: noise direction prediction

A diffusion model learns the data set (\mathcal{K}) from training examples, and the goal is to generate samples from that set
- For images, (\mathcal{K} \subset \mathbb{R}^{c\times h \times w}) is the set of pixel values corresponding to real images
- The same framework applies to audio, video, robot trajectories, and even discrete domains such as text
The training procedure can be viewed as three steps
- Sample (x_0 \sim \mathcal{K}), (\sigma), and (\epsilon \sim N(0,I))
- Form the noisy data (x_\sigma=x_0+\sigma\epsilon)
- Minimize the squared loss so that (\epsilon_\theta(x_\sigma,\sigma)) predicts (\epsilon)
In the code, training_loop obtains sigma and eps from generate_train_sample for each batch x0, and optimizes the MSE between the output of model(x0 + sigma * eps, sigma) and eps
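The three steps above can be sketched for a single data point in plain Python. This is only an illustration under stated assumptions: make_train_sample and training_loss are hypothetical names that stand in for the role of generate_train_sample and the loss inside training_loop, and the model is any callable taking (x_sigma, sigma); the real code works on batched tensors.

```python
import random

def make_train_sample(x0, sigmas, rng):
    """Step 1: pick a noise level and a Gaussian noise vector (hypothetical helper)."""
    sigma = rng.choice(sigmas)                    # sigma drawn from a discrete schedule
    eps = [rng.gauss(0.0, 1.0) for _ in x0]       # eps ~ N(0, I)
    return sigma, eps

def noisy_input(x0, sigma, eps):
    """Step 2: x_sigma = x0 + sigma * eps."""
    return [xi + sigma * ei for xi, ei in zip(x0, eps)]

def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def training_loss(model, x0, sigmas, rng):
    """Step 3: squared loss between the predicted and the true noise."""
    sigma, eps = make_train_sample(x0, sigmas, rng)
    x_sigma = noisy_input(x0, sigma, eps)
    return mse(model(x_sigma, sigma), eps)

# Usage: a model that always predicts zero noise has loss equal to mean(eps**2).
rng = random.Random(0)
zero_model = lambda x, sigma: [0.0 for _ in x]
loss = training_loss(zero_model, [1.0, -2.0], [0.1, 1.0, 10.0], rng)
```

A denoiser that could recover eps exactly would drive this loss to zero; in practice the network only sees x_sigma and sigma, so it can do no better than the conditional mean of eps.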
Rather than sampling (\sigma) uniformly from a continuous interval, it is drawn from a (\sigma) schedule discretized into (N) values
- The Schedule class wraps the list of possible sigmas and samples values per batch during training
- The main example uses ScheduleLogLinear(N, sigma_min=0.02, sigma_max=10)
- ScheduleDDPM is the schedule for pixel-space diffusion models, and ScheduleLDM is the one for latent diffusion models such as Stable Diffusion
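As a rough sketch of what a log-linear schedule looks like, the noise levels can be spaced evenly in log(sigma). The function names below are hypothetical and only mirror the idea behind ScheduleLogLinear; the library's actual class may differ in ordering and details.

```python
import math
import random

def log_linear_sigmas(n, sigma_min=0.02, sigma_max=10.0):
    """Return n noise levels evenly spaced in log(sigma), in increasing order."""
    lo, hi = math.log(sigma_min), math.log(sigma_max)
    return [math.exp(lo + (hi - lo) * i / (n - 1)) for i in range(n)]

def sample_sigmas_for_batch(sigmas, batch_size, rng):
    """Draw one noise level per training example, uniformly over the schedule."""
    return rng.choices(sigmas, k=batch_size)

sigmas = log_linear_sigmas(5, sigma_min=0.01, sigma_max=100.0)
batch_sigmas = sample_sigmas_for_batch(sigmas, 8, random.Random(0))
```

With these arguments consecutive levels differ by a constant factor of 10, which is what "log-linear" means here.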

Swissroll toy example

The toy dataset is the spiral point set used in Sohl-Dickstein et al. 2015, one of the earliest diffusion papers, with (\mathcal{K}\subset\mathbb{R}^2)
For this simple dataset the denoiser is implemented as an MLP
- The input concatenates (x\in\mathbb{R}^2) with a 2-dimensional embedding of (\sigma)
- The output is a prediction of the noise (\epsilon\in\mathbb{R}^2)
- Many diffusion models use a sinusoidal positional embedding for (\sigma), but a plain 2-dimensional embedding works well in this example
The example training setup uses ScheduleLogLinear(N=200, sigma_min=0.005, sigma_max=10) and epochs=15000
The trained denoiser can be visualized as a vector field by plotting (x-\sigma\epsilon_\theta(x,\sigma))
- When (\sigma) is large, the denoiser tends to predict the mean of the data
- When (\sigma) is small and the input (x) is close to the data, it predicts an actual data point
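To make the MLP input described above concrete, here is one possible 2-dimensional embedding of sigma concatenated with a 2-D point. The specific embedding (sine and cosine of half the log of sigma) is an assumption chosen for illustration and is not confirmed by this summary; sigma_embedding and denoiser_input are hypothetical names.

```python
import math

def sigma_embedding(sigma):
    """Map a positive noise level to a 2-D vector (assumed form, for illustration)."""
    half_log = math.log(sigma) / 2.0
    return [math.sin(half_log), math.cos(half_log)]

def denoiser_input(x, sigma):
    """Concatenate the 2-D point x with the 2-D sigma embedding: a 4-D MLP input."""
    return list(x) + sigma_embedding(sigma)

features = denoiser_input([0.3, -1.2], sigma=1.0)
```

Working in log(sigma) keeps the embedding sensitive across a schedule that spans several orders of magnitude, which is the motivation for this kind of choice.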

Understanding denoising as projection

For the data set (\mathcal{K}), the distance function is defined as (\mathrm{dist}_{\mathcal{K}}(x)=\min\{\|x-x_0\|:x_0\in\mathcal{K}\})
The projection (\mathrm{proj}_{\mathcal{K}}(x)) of (x) is the set of points in (\mathcal{K}) that attain this distance
If (\mathcal{K}) is a closed set, (x\notin\mathcal{K}), and the projection is unique, then the gradient of half the squared distance is (\nabla\frac{1}{2}\mathrm{dist}_{\mathcal{K}}(x)^2=x-\mathrm{proj}_{\mathcal{K}}(x))
Because the distance function (\mathrm{dist}_{\mathcal{K}}) is not differentiable everywhere, a (\sigma)-smoothed squared distance function is introduced by replacing min with softmin
The gradient of the smoothed squared distance function points from a weighted average of the points of (\mathcal{K}) toward (x), with weights determined by (x), so stepping against it moves (x) toward that weighted average
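For a finite data set, the exact projection and its smoothed counterpart can be sketched in a few lines. The softmin temperature of 2 * sigma**2 is an assumption (it matches Gaussian noise of standard deviation sigma), and all function names are hypothetical.

```python
import math

def sq_dist(x, p):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, p))

def project(x, K):
    """Exact projection onto a finite set K: the nearest data point."""
    return min(K, key=lambda p: sq_dist(x, p))

def softmin_weights(x, K, sigma):
    """Softmax of -||x - p||^2 / (2 sigma^2) over the points p of K."""
    logits = [-sq_dist(x, p) / (2.0 * sigma ** 2) for p in K]
    m = max(logits)                               # subtract the max for stability
    w = [math.exp(l - m) for l in logits]
    total = sum(w)
    return [wi / total for wi in w]

def smoothed_projection(x, K, sigma):
    """Weighted average of the data points, with weights that depend on x."""
    w = softmin_weights(x, K, sigma)
    return [sum(wi * p[d] for wi, p in zip(w, K)) for d in range(len(x))]

K = [[0.0, 0.0], [4.0, 0.0]]
x = [1.0, 0.5]
near = smoothed_projection(x, K, sigma=0.01)      # close to the nearest point
far = smoothed_projection(x, K, sigma=100.0)      # close to the data mean
```

As sigma shrinks the weights concentrate on the nearest point and the smoothed projection approaches the exact one; as sigma grows the weights become uniform and the result approaches the mean of the data.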

Ideal denoiser and relative error model

The ideal denoiser (\epsilon^*) is the denoiser that exactly minimizes the training loss at a given (\sigma)
If the data follow a discrete uniform distribution over a finite set (\mathcal{K}), the ideal denoiser can be written in closed form
- The weight of each data point depends on the distance between (x_\sigma) and that point
- For small datasets it can be computed directly with IdealDenoiser
On the toy data, the ideal denoiser's prediction moves toward the data mean when (\sigma) is large and toward the nearest data point when (\sigma) is small
The main theorem establishes the relation (\frac{1}{2}\nabla_x \mathrm{dist}^2_{\mathcal{K}}(x,\sigma)=\sigma\epsilon^*(x,\sigma)) for all (\sigma>0) and (x\in\mathbb{R}^n)
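The identity in the theorem can be checked numerically on a tiny data set. This is a sketch under assumptions: the closed form below is the standard posterior-mean denoiser for a uniform distribution over a finite set, and half_smoothed_sq_dist is normalized so that it plays the role of (\frac{1}{2}\mathrm{dist}^2_{\mathcal{K}}(x,\sigma)) up to an additive constant, which does not affect the gradient. The names are hypothetical and are not the library's IdealDenoiser.

```python
import math

def sq_dist(x, p):
    return sum((a - b) ** 2 for a, b in zip(x, p))

def logits(x, K, sigma):
    return [-sq_dist(x, p) / (2.0 * sigma ** 2) for p in K]

def ideal_denoiser(x, K, sigma):
    """eps*(x, sigma) = sum_i w_i (x - p_i) / sigma, with softmax weights w_i."""
    ls = logits(x, K, sigma)
    m = max(ls)
    w = [math.exp(l - m) for l in ls]
    total = sum(w)
    return [sum(wi * (x[d] - p[d]) for wi, p in zip(w, K)) / (total * sigma)
            for d in range(len(x))]

def half_smoothed_sq_dist(x, K, sigma):
    """-sigma^2 * log(sum_i exp(-||x - p_i||^2 / (2 sigma^2))), via log-sum-exp."""
    ls = logits(x, K, sigma)
    m = max(ls)
    return -sigma ** 2 * (m + math.log(sum(math.exp(l - m) for l in ls)))

def numerical_grad(f, x, h=1e-5):
    """Central finite-difference gradient of a scalar function f at x."""
    grad = []
    for d in range(len(x)):
        xp, xm = list(x), list(x)
        xp[d] += h
        xm[d] -= h
        grad.append((f(xp) - f(xm)) / (2.0 * h))
    return grad

K = [[0.0, 0.0], [4.0, 0.0], [0.0, 3.0]]
x, sigma = [1.0, 2.0], 1.5
lhs = numerical_grad(lambda y: half_smoothed_sq_dist(y, K, sigma), x)
rhs = [sigma * e for e in ideal_denoiser(x, K, sigma)]
```

The two vectors lhs and rhs agree up to finite-difference error, and x minus rhs is exactly the softmax-weighted average of the data points.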
The relative error model assumes that (x-\sigma\epsilon_\theta(x,\sigma)) is a good approximation of (\mathrm{proj}_{\mathcal{K}}(x))
- It applies when (\sqrt{n}\sigma) estimates (\mathrm{dist}_{\mathcal{K}}(x)) to within a constant factor
- The error is assumed to be bounded by (\eta\,\mathrm{dist}_{\mathcal{K}}(x))
- At low noise, under the manifold hypothesis, most of the added noise is orthogonal to the data manifold, so denoising approximates projection
- At high noise, if (\sigma) is larger than the diameter of (\mathcal{K}), even a denoiser that predicts a weighted mean of the data has small relative error
CIFAR-10 is small enough for the ideal denoiser to be computed, and in experiments the relative error between the exact projection and the ideal denoiser output stays small along sampling trajectories

Sampling: iterative denoising and DDIM

Given a trained denoiser, (x_0) is estimated from the noisy (x_t) and the noise level (\sigma_t) as (\hat{x}_0^t=x_t-\sigma_t\epsilon_\theta(x_t,\sigma_t))
At the start, (\sigma_T) is chosen large relative to the diameter of (\mathcal{K}), and each coordinate of (x_T) is sampled independently from a zero-mean Gaussian with standard deviation (\sigma_T), so that (x_T) lies far from (\mathcal{K})
At high noise a single denoiser call can have small relative error yet large absolute error, because the ideal denoiser's prediction is close to the data mean
Sampling therefore calls the denoiser repeatedly along the (\sigma_t) schedule to produce the sequence (x_T,\ldots,x_0)
The update (x_{t-1}=x_t-(\sigma_t-\sigma_{t-1})\epsilon_\theta(x_t,\sigma_t)) is equivalent, after a change of coordinates, to the deterministic DDIM sampling algorithm
- The proof of equivalence with DDIM is in Appendix A of the paper
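The update above translates into a short loop. The sketch below is a minimal pure-Python version, not the library's sampler: ddim_sample and single_point_denoiser are hypothetical names, and the toy denoiser is the ideal denoiser for a data set containing one point p, for which eps*(x, sigma) = (x - p) / sigma.

```python
def ddim_sample(denoiser, x_T, sigmas):
    """Deterministic sampling; sigmas is decreasing: sigma_T, ..., sigma_0."""
    x = list(x_T)
    for sig, sig_prev in zip(sigmas[:-1], sigmas[1:]):
        eps = denoiser(x, sig)                    # predicted noise direction
        x = [xi - (sig - sig_prev) * ei for xi, ei in zip(x, eps)]
    return x

def single_point_denoiser(p):
    """Ideal denoiser when the data set is the single point p (toy stand-in)."""
    return lambda x, sigma: [(xi - pi) / sigma for xi, pi in zip(x, p)]

p = [1.0, -1.0]
sigmas = [10.0, 1.0, 0.1, 0.01]
x_T = [11.0, -11.0]
x_0 = ddim_sample(single_point_denoiser(p), x_T, sigmas)
```

For this toy denoiser each step scales the offset from p by sigma_prev / sig, so after the loop the offset has shrunk by the overall ratio sigma_0 / sigma_T, here from 10 to 0.01 per coordinate.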

DDIM as distance minimization

DDIM is interpreted as approximate gradient descent on (f(x)=\frac{1}{2}\mathrm{dist}_{\mathcal{K}}(x)^2)
- The step size is (1-\sigma_{t-1}/\sigma_t)
- (\nabla f(x_t)) is estimated by (\sigma_t\epsilon_\theta(x_t,\sigma_t)), so that step size times gradient reproduces the DDIM update ((\sigma_t-\sigma_{t-1})\epsilon_\theta(x_t,\sigma_t))
The (\sigma_t) schedule determines the number and size of the gradient steps taken during sampling
- With too few steps, (\mathrm{dist}_{\mathcal{K}}(x_t)) may fail to decrease and the iteration may not converge
- With too many small steps, the number of denoiser evaluations grows and with it the computational cost
An admissible schedule is one for which (\sqrt{n}\sigma_t) matches (\mathrm{dist}_{\mathcal{K}}(x_t)) to within a constant factor at every iteration
- A geometrically decreasing (log-linear) (\sigma_t) sequence is an admissible schedule
According to the theorem, if (\nabla\mathrm{dist}_{\mathcal{K}}(x)) exists at the DDIM iterates (x_t) and (\mathrm{dist}_{\mathcal{K}}(x_T)=\sqrt{n}\sigma_T), then the (x_t) are generated by gradient descent on the squared distance function and (\mathrm{dist}_{\mathcal{K}}(x_t)/\sqrt{n}\approx\sigma_t) continues to hold
In the toy example a 20-step DDIM sampler is built by sub-sampling the original log-linear schedule; most samples land close to the original data, though there is room for improvement
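One consequence of using a geometric schedule is that the gradient-descent step size (1-\sigma_{t-1}/\sigma_t) is the same at every step. The sketch below illustrates this with a hypothetical evenly spaced sub-sampling; the library's own sub-sampling rule may pick indices differently, and 201 levels are used here only so that the spacing divides evenly.

```python
import math

def log_linear_sigmas(n, sigma_min=0.005, sigma_max=10.0):
    """n noise levels, geometrically decreasing from sigma_max to sigma_min."""
    lo, hi = math.log(sigma_min), math.log(sigma_max)
    return [math.exp(hi - (hi - lo) * i / (n - 1)) for i in range(n)]

def subsample(sigmas, steps):
    """Keep steps + 1 levels at evenly spaced positions (hypothetical helper)."""
    last = len(sigmas) - 1
    return [sigmas[round(i * last / steps)] for i in range(steps + 1)]

def ddim_step_sizes(sigmas):
    """Step size 1 - sigma_{t-1} / sigma_t for each consecutive pair."""
    return [1.0 - s_prev / s for s, s_prev in zip(sigmas[:-1], sigmas[1:])]

coarse = subsample(log_linear_sigmas(201), steps=20)
step_sizes = ddim_step_sizes(coarse)
```

Fewer sampling steps over the same sigma range mean a larger constant step size, which is the trade-off between convergence and denoiser evaluations described above.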

Improved sampler based on gradient estimation

Using the property that (\nabla\mathrm{dist}_{\mathcal{K}}(x)) stays constant between (x) and (\mathrm{proj}_{\mathcal{K}}(x)), the sampler uses an update that combines the current estimate with the previous one
The update (\bar{\epsilon}_t=\gamma\epsilon_\theta(x_t,\sigma_t)+(1-\gamma)\epsilon_\theta(x_{t+1},\sigma_{t+1})) corrects the previous step's error using the current estimate
On toy-model samples this method converges faster than DDIM and the samples land closer to the original data
Relative to DDIM, this sampler can be understood as adding momentum; the trajectory may overshoot, but it can also converge faster
Empirically, adding noise during generation improves sampling quality
- To keep the original (\sigma_t) schedule, the sampler first denoises to a smaller (\sigma_{t'}) and then adds back noise (w_t\sim N(0,I))
- With (\mu=\frac{1}{2}) the DDPM sampler is recovered exactly
The full update (x_{t-1}=x_t-(\sigma_t-\sigma_{t'})\bar{\epsilon}_t+\eta w_t) generalizes the three samplers
- DDIM: gam=1, mu=0
- DDPM: gam=1, mu=0.5
- gradient estimation sampler: gam=2, mu=0
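The full update can be sketched as one loop with gam and mu as parameters. The summary does not spell out how (\sigma_{t'}) and (\eta) depend on (\mu); the code assumes (\sigma_{t-1}=\sigma_t^{\mu}\sigma_{t'}^{1-\mu}) and (\eta=\sqrt{\sigma_{t-1}^2-\sigma_{t'}^2}), which is my understanding of this parameterization and should be checked against the original post. The function name sample is hypothetical.

```python
import math
import random

def sample(denoiser, x_T, sigmas, gam=1.0, mu=0.0, rng=None):
    """Generalized sampler; sigmas is decreasing and mu must be below 1."""
    rng = rng or random.Random(0)
    x = list(x_T)
    eps_prev = None
    for sig, sig_prev in zip(sigmas[:-1], sigmas[1:]):
        eps = denoiser(x, sig)
        if eps_prev is None:
            eps_bar = eps                         # no previous estimate on step one
        else:
            eps_bar = [gam * e + (1.0 - gam) * ep for e, ep in zip(eps, eps_prev)]
        eps_prev = eps
        # Assumed relation: sig_prev == sig**mu * sig_p**(1 - mu)
        sig_p = (sig_prev / sig ** mu) ** (1.0 / (1.0 - mu))
        # Assumed noise scale that restores the level sig_prev after denoising
        eta = math.sqrt(max(sig_prev ** 2 - sig_p ** 2, 0.0))
        x = [xi - (sig - sig_p) * e + eta * rng.gauss(0.0, 1.0)
             for xi, e in zip(x, eps_bar)]
    return x
```

With gam=1 and mu=0 the added noise vanishes and the loop reduces to the deterministic DDIM update; gam=2 with mu=0 extrapolates from the previous estimate, and mu=0.5 denoises past the target level and then adds noise back.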

Larger models and reference material

The training code above can be used not only on toy data but also to train image diffusion models from scratch
The FashionMNIST example trains on the FashionMNIST dataset and ranks second by FID on the Papers with Code leaderboard
The sampling code can also be used unchanged with pre-trained latent diffusion models
- The example uses ScheduleLDM(1000) and ModelLatentDiffusion('stabilityai/stable-diffusion-2-1-base')
- The text condition is set to An astronaut riding a horse, and the latent is decoded after sampling with 50 (\sigma) steps
The effect of the (\gamma) momentum term on high-resolution text-to-image generation is shown with a comparison visualization
Further material worth reading
- What are diffusion models: an introduction to diffusion models from the discrete-time perspective of reversing a Markov process
- Generative modeling by estimating gradients of the data distribution: an introduction to diffusion models from the continuous-time perspective of reversing a stochastic differential equation
- The annotated diffusion model: a detailed walkthrough of a PyTorch diffusion model implementation

1 comment

GN⁺ 2024-03-12

Hacker News comments

I'm the author. While trying to understand diffusion models, I realized the code and math could be simplified considerably, which is why I wrote this blog post and the diffusion library.
Happy to answer any questions
- As a researcher, I dislike many of the blog posts on diffusion models, but this one was genuinely good. It gets straight to the point while still showing the tricky parts where people often get stuck, and it doesn't wander or sprawl.
  I especially liked the discussion of trajectories, because it motivates why people struggle with topics like schedulers. It isn't as complete as Song's or Lilian's posts, but it is far more approachable, so I'll recommend it to others.
  For reference, a friend of mine previously wrote a minimal diffusion implementation that is a bit more “complete” from the DDPM perspective, which I found useful: https://github.com/VSehwag/minimal-diffusion/
- In the last example image, the momentum term seems to hurt the house digital painting. The door is missing in the gamma = 2.0 image, so I'd like to know more about the details of that example to build intuition for the gradient information the DDIM sampler uses.
  Having experimented a little with the sampling procedure in Stable Diffusion, I'd also have liked to see a comparison of convergence time and number of steps against DDIM. I'm curious whether there is any relationship between momentum, convergence, and error. For example, a comparison showing whether 16 steps of the momentum sampler are roughly equivalent to 20 steps of DDIM ± an error term would be nice
- get_sigma_embeds(batches, sigma) doesn't seem to use its first input. Was the intent to broadcast sigma to shape (batches, 1)?
- I'm curious whether some of these concepts come from physics principles. Is it something like how neural networks are said to be inspired by biological neural networks, or is there some insight into this perspective?
Another good post is also titled Diffusion Models From Scratch: https://www.tonyduan.com/diffusion/index.html
It covers the mathematical details in far more depth, and comes with a very readable minimal implementation in under 500 lines
Having code is great. Diffusion papers are notorious for having a lot of equations (https://twitter.com/cto_junior/status/1766518604395155830), but for the rest of us code is far easier to read and probably more precise too. In my opinion every theoretical paper should come with reference implementation code.
It would be nice to extend this to the diffusion transformer version that powers Sora and other video generation models. Combining this post with https://jaykmody.com/blog/gpt-from-scratch/ could make an introductory post on “diffusion transformers from scratch”
- Diffusion papers are indeed notorious for equations, but honestly the diffusion researchers I know often react the same way. Many people write the same equations over and over, and those equations mostly serve as a recap.
  On the other hand, if you really want to go deep, I'd recommend reading the work of Kingma, Gao, Ricky Tian Qi Chen, and Max Welling's students (Tomczak is a postdoc, Hoogeboom, etc.), and the hidden major contributor Aapo Hyvärinen. An example of Kingma & Gao's relatively light work, which also relates to the SD3 paper, is here: https://arxiv.org/abs/2303.00848
  Unfortunately, there is heavy reliance on knowing and understanding prior research, which reduces accessibility, but it's hard to call that a very meaningful criticism. After all, this is research, not educational material for the general public
- You just replace the U-net with a transformer encoder. Drop the embedding and project image patches into vectors of size n_embd, and the diffusion process itself can stay the same
The post is good, but it seems to miss the important property that a diffusion model models the score function (the derivative of the log probability)[1], and that diffusion sampling resembles Langevin dynamics[2]. To me these perspectives explain well why training is easier than for GANs: the modeling objective is easier.
[1] https://yang-song.net/blog/2021/score/
[2] https://lilianweng.github.io/posts/2021-07-11-diffusion-mode...
- That's right. Those blog posts give an interpretation of diffusion models that differs from the “projection onto data” perspective in the main text. They can be seen as different interpretations of the same training objective and sampling process.
  In our perspective, diffusion models are easy to train because the training objective predicts the gradient of the smoothed distance function rather than the gradient of the exact distance function. Sampling from a diffusion model amounts to taking many approximate gradient steps.
  To understand diffusion models more deeply, I'd recommend reading all of these blog posts and learning the different interpretations
Very interesting. It immediately reminded me of Iterative alpha-(de)Blending[1]. That work also tries to build a conceptually simpler diffusion model, and concludes that it can be formalized as an approximate iterative projection process.
That said, this post's approach probably enables more interesting experiments, such as denoiser error analysis.
[1] https://arxiv.org/pdf/2305.03486.pdf
The theoretical explanation is good. It seems to be independent of the dataset, but I'm curious about some concrete aspects of real image generation.
For example, why do image generators find piano keys hard to draw? It seems that producing something like the alternating pattern of two and three black keys would require a better representation of mid-range distance constraints
- It's the same as the fingers problem. Count, size, angle, position and so on all have to be right every time, and if even one is wrong people catch it very quickly. That's different from things like tree branches, where people don't easily notice even if a branching position is “wrong”
Is part of the idea of diffusion that the training data gets augmented on a massive scale? That is, randomly diffused images can be compared against their original non-diffused images. Is that roughly it?
All machine learning models are convolutions. Mark my words
- It looks like you've posted this a few times; could you explain it in a bit more detail? For example, I find it hard to see reinforcement learning as a convolution
