Diffusion Forcing: Next-Token Prediction Meets Full-Sequence Diffusion

(boyuan.space)

1 point by GN⁺ 2024-07-06 | 1 comment

Diffusion Forcing is a sequence generation method that learns a separate diffusion noise level for each token, so that at sampling time it can be used both as a next-token model and as a full-sequence diffusion model.
By treating diffusion noise as masking, past tokens can be kept clean while only future tokens are left noisy, or different noise levels can be placed across the whole sequence.
In DMLab and Minecraft video prediction, teacher forcing diverged easily and causal full-sequence diffusion lost consistency, while Diffusion Forcing produced more stable predictions.
For decision-making and planning, tokens are defined as [a_t, o_{t+1}], so an action and the observation that follows it are modeled together, and the near future and distant future can be given different noise levels.
Rollouts longer than the training length are also possible: 2000+ frames in DMLab after training on 36 frames, and 2000+ frames in Minecraft after training on 72 frames, both generated without a sliding window.

Core structure of Diffusion Forcing

The name Diffusion Forcing comes from teacher forcing and diffusion models.
The goal is to combine the strengths of next-token autoregressive models and full-sequence diffusion models in a single training paradigm.
- Strength of next-token models: variable-length generation
- Strength of full-sequence diffusion models: sequence-level guidance that steers sampling toward a desired trajectory
Once trained, the same model can be run in different ways at sampling time.
- Flexible, compositional generation, like a next-token model
- Guidance applied over the whole sequence, like a full-sequence diffusion model

Token-wise noise and “noise as masking”

Diffusion Forcing trains a sequence diffusion model, but gives each token its own noise level.
Diffusion noise can be viewed as masking of varying strength.
- Full-sequence diffusion: all frames are denoised together at the same noise level
- Next-token prediction: past tokens are kept at noise level 0 and the next frame is denoised one at a time
Changing where noise is placed in the sequence at sampling time produces several kinds of behavior.
- Stabilization of autoregressive rollouts
- Guidance over long horizons
- Planning that accounts for causal uncertainty
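The "noise as masking" view above can be illustrated with a small sketch. It is not the authors' implementation: the cosine-style `alpha_bar` schedule, the scalar tokens, and the names `noise_tokens`, `K`, and the three `*_levels` lists are all assumptions made for illustration. It only shows how one forward-noising routine covers independent per-token levels, a uniform level (full-sequence diffusion), and clean past with a fully noisy next token (next-token prediction).

```python
import math
import random

def alpha_bar(k, num_levels):
    """Assumed cosine-style schedule: signal fraction is 1.0 at k=0 (clean)
    and approaches 0.0 at k=num_levels (pure noise)."""
    return math.cos(0.5 * math.pi * k / num_levels) ** 2

def noise_tokens(tokens, levels, num_levels, rng):
    """Noise each scalar token at its own level k_t (hypothetical forward process)."""
    noisy = []
    for x, k in zip(tokens, levels):
        a = alpha_bar(k, num_levels)
        eps = rng.gauss(0.0, 1.0)
        noisy.append(math.sqrt(a) * x + math.sqrt(1.0 - a) * eps)
    return noisy

K = 10                      # number of noise levels (assumption)
rng = random.Random(0)
tokens = [1.0, 2.0, 3.0, 4.0]

# Training-style placement: an independent noise level per token.
train_levels = [rng.randint(0, K) for _ in tokens]
# Full-sequence diffusion: every token shares one noise level.
full_seq_levels = [7] * len(tokens)
# Next-token prediction: past tokens clean, next token fully noisy.
next_token_levels = [0, 0, 0, K]

noisy_next = noise_tokens(tokens, next_token_levels, K, rng)
```

With `next_token_levels`, the first three tokens pass through unchanged and the last one is essentially pure noise, which is the "fully masked" end of the fractional-masking range.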

Theoretical properties

Diffusion Forcing is proven to optimize a variational lower bound on the likelihoods of all partial sequences of tokens sampled from the true joint distribution.
This property shows that the training objective is tied not only to empirical performance but also to likelihood over the whole space of partial sequences.
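For orientation, a per-token noise-prediction objective of the kind this bound refers to can be written as follows. The notation is an assumption and is not taken from this summary: $k_t$ is the noise level drawn independently for token $t$, $\bar\alpha_k$ is a noise schedule, and $\epsilon_\theta$ is the denoising network conditioned on the noisy history.

```latex
\mathcal{L}(\theta)
  = \mathbb{E}_{x_{1:T},\, k_{1:T},\, \epsilon_{1:T}}
    \sum_{t=1}^{T}
    \left\| \epsilon_t
      - \epsilon_\theta\!\left(x_t^{k_t},\, k_t,\, x_{<t}^{k_{<t}}\right)
    \right\|^2,
\qquad
x_t^{k_t} = \sqrt{\bar\alpha_{k_t}}\, x_t + \sqrt{1 - \bar\alpha_{k_t}}\, \epsilon_t
```

Because every combination of levels $k_{1:T}$ is seen during training, each choice of which tokens are clean and which are noisy corresponds to one partial sequence in the statement above.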

Video prediction results

The results use videos synthesized directly by the model, generated without a VAE or superresolution.
The results are stated to be sampled without cherry-picking.
On the DMLab dataset, the difference between the three methods is clear.
- Teacher forcing diverges easily
- The causal full-sequence diffusion model shows serious consistency issues
- Diffusion Forcing achieves stable and consistent video prediction
The same pattern appears on the Minecraft dataset.
- Teacher forcing diverges easily
- The causal full-sequence diffusion model has serious consistency issues
- Diffusion Forcing generates stable and consistent predictions

Video rollouts longer than the training length

Diffusion Forcing can roll out videos much longer than the maximum sequence length it was trained on.
The rollout is done without a sliding window.
- During the RNN rollout, the latent z is never reset to the initial latent z0
- The stabilization effect shows up in Diffusion Forcing
DMLab results:
- Trained on 36 frames
- Rollouts of 2000+ frames are possible
- Done without a sliding window
- The original dataset resolution is 64x64
- mp4 compression of the long video reduced visual quality, so PNG visualizations are also provided to show the original generation quality
Minecraft results:
- Trained on 72 frames
- Rollouts of 2000+ frames are possible without divergence
- Done without a sliding window
- The original dataset resolution is 128x128
- In some scenarios the agent stays stuck in front of a two-block-high dirt or stone block until it changes direction; this is treated as an intrinsic problem of how the dataset was collected

Diffusion Planning

As in prior work such as Diffuser, a diffusion sequence can be used as a planner by applying test-time guidance.
Diffusion Forcing models the causal relationship explicitly by defining each token as [a_t, o_{t+1}].
- It holds a belief about which action to take
- It also holds a belief about the observation that follows that action
- When a new observation arrives after the action, the belief can be updated by posterior estimation
The diffusion planning process video visualizes the Diffusion Forcing planning process as a decision-making framework.
To model causal uncertainty about the future, the near future can be kept at a low noise level and the distant future at a high noise level.
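One way to realize "low noise for the near future, high noise for the distant future" is a staggered sampling schedule in which later tokens start denoising later. The sketch below is an illustration under assumptions, not the schedule from the paper or its code: `pyramid_schedule`, `horizon`, and `num_levels` are hypothetical names, and the one-step delay per token is an arbitrary choice.

```python
def pyramid_schedule(horizon, num_levels):
    """Build a (steps x horizon) table of noise levels.

    Row m is the sampling step and column t the future token. Token t
    begins denoising t steps after token 0, so within any row the noise
    level never decreases as t grows.
    """
    steps = num_levels + horizon
    return [
        [min(num_levels, max(0, num_levels - m + t)) for t in range(horizon)]
        for m in range(steps)
    ]

schedule = pyramid_schedule(horizon=3, num_levels=2)
for row in schedule:
    print(row)
# [2, 2, 2]
# [1, 2, 2]
# [0, 1, 2]
# [0, 0, 1]
# [0, 0, 0]
```

Each row would be passed to the denoiser as the per-token noise levels for that sampling step, so the nearest token is committed first while distant tokens stay uncertain for longer.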

Long-horizon imitation learning

Many real-world tasks are not Markovian and need long-horizon memory to complete.
In the real robot task, a robot arm swaps the slots of two fruits using a third slot.
- The fruits start in random slots
- A single observation does not reveal the initial fruit placement, so the next step cannot be decided from it alone
Feedback control was done by taking the planning experiment, removing guidance, and jointly diffusing the action-observation sequence.
The videos shown contain several consecutive successes before a failure.
- The robot can perform the task even when the fruit positions are randomized by the previous run
To be robust to unseen distractions at test time, the model can be prompted to treat incoming observations as noisy observations.
- As an example, the distraction used was a shopping bag thrown randomly into the field of view

2025 update: Scaling Up Diffusion Forcing

In the 2025 update, the state-of-the-art Wan2.1-T2V-1.3B was finetuned for only 20k steps on 49 frames.
It then generated stably up to 217 frames with a 5x rollout.
Details are in the follow-up work, History-Guided Video Diffusion.
Example videos include sunset waves, a monkey on a rock, a dog getting ready to sleep, an aerial view of a tropical beach, a surfing scene, and a cycling scene on an uphill road.

Future research directions

Conditioning
- Replacement-based conditioning is often used when scaling to long sequences
- Jonathan Ho's “Video Diffusion Models” discusses why this approach is incorrect
- Diffusion Forcing offers a more natural form of conditioning, treating context tokens as clean and future tokens as noisy, but this part was not explored in detail
Noise as masking
- The approach achieves fractional masking of a token rather than binary masking
- It is general enough to be combined with self-supervised learning methods such as MAE
- Adding noise has an interesting interpretation in the frequency domain
Compositionality
- The paper shows that compositionality can be achieved by controlling the history length
- With noise as masking, the model can decide for itself when to ignore unnecessary history and condition only on a shorter horizon
Non-causal version
- This paper uses causal Diffusion Forcing because causality matters in decision-making
- The noise-as-masking idea can also be applied to non-causal models
- By masking the entries a prediction should not see with pure Gaussian noise, a non-causal version can be trained and then made causal at sampling time
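The last point can be sketched as a small helper that overrides noise levels. This is only an illustration of the idea as described here, not code from the paper: `mask_future_with_noise`, `current_index`, and `num_levels` are hypothetical names, and "pure Gaussian noise" is represented by assigning the maximum noise level.

```python
def mask_future_with_noise(levels, current_index, num_levels):
    """Return per-token noise levels in which every position after
    current_index is forced to the maximum level (pure noise), so a
    non-causal model gets no usable information from those entries."""
    return [
        k if i <= current_index else num_levels
        for i, k in enumerate(levels)
    ]

K = 10                      # maximum noise level (assumption)
levels = [0, 0, 3, 5, 2]    # arbitrary per-token levels
masked = mask_future_with_noise(levels, current_index=2, num_levels=K)
print(masked)  # [0, 0, 3, 10, 10]
```

Applying the override at sampling time means the model attends to all positions architecturally, but positions past `current_index` carry only noise, which is the sense in which the behavior becomes causal.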
Alternative Guidance
- In the proposed decision-making framework, guidance is applied to observations to keep the setting closer to Diffuser
- A version that applies guidance to a learned reward was also proposed, but it is not explored in the paper
Noise scheme
- Token-wise independent noise levels were designed for generality, but they are not optimal for every task
- If the data is highly locally correlated along the time axis, this scheme can retain too much redundancy
- That can affect the overall signal-to-noise ratio
Next few token prediction
- Next few token prediction was used only in the planning experiments; the video experiments are still next-token style
- It did not work very well in the RNN version, but it works very well in the transformer version of the code
- In a causal model, next few token prediction can cause inconsistency when “few” is very large
- This phenomenon is less pronounced in a non-causal model
Latent & DiT version
- A 3D U-Net version of Diffusion Forcing was published after release
- Diffusion Forcing can also be applied to a causal or non-causal DiT
- The stabilization scheme is a more natural fit in a latent space with a VAE
- Pixel corruption is not necessarily Gaussian, but corruption of VAE latents may be closer to Gaussian

Citation information

@article{chen2025diffusion,
  title={Diffusion forcing: Next-token prediction meets full-sequence diffusion},
  author={Chen, Boyuan and Mart{\'\i} Mons{\'o}, Diego and Du, Yilun and Simchowitz, Max and Tedrake, Russ and Sitzmann, Vincent},
  journal={Advances in Neural Information Processing Systems},
  volume={37},
  pages={24081--24125},
  year={2025}
}

1 comment

GN⁺ 2024-07-06

Hacker News opinions

A few ideas here stand out. First, it combines sequence masking, the core learning idea behind LLMs, with a diffusion model, and tracks an ‘uncertainty’ level for each pixel.
This ‘uncertainty’ level is treated like the ‘noise’ level of a diffusion model, and the model removes noise under the control of some embedding.
That lets some parts of an image be decided before others, so it can be used for things like maze solving. The paper even shows control of a robot arm moving fruit, which is quite surprising.
If anything, the title seems to undersell the idea. The masking level is a real value, so this is a way of doing partial masking, and I find that a deep and interesting idea.
That said, the paper leaves a lot uncovered, so I am very curious about the codebase. How exactly the maze tracking task and the video extension task are configured, how the robot arm was connected to this model, and how the desired task was specified are all unclear. The architecture itself seems to need several papers or detailed explanations.
- This looks like a very elegant way to handle uncertainty modeling in planning and exploration
  Forcing the agent to reflect on its current situation rather than take it for granted, even as tasks vary in length, is powerful. So even when unexpected difficulties come up, it can react and generalize better along the path
  My guess is that all tasks are treated as variable horizon, with the current state kept as the result of previous actions. It would be good to see the code too
- Isn't the linked codebase enough? I'd like to understand what is missing here
  https://github.com/buoyancy99/diffusion-forcing
I'm curious whether there is any research or tooling for applying a diffusion-like technique to existing text-generation LLMs, something that could run on small models like GPT / Phi 3 / Qwen without new pretraining or with just a little fine-tuning.
I know about things like Tree of Thoughts with Monte Carlo tree search, and that is somewhat similar, but there the goal learned from reward is usually different, so I'm more interested in an approach closer to token-level generation.
Is this possible?
I work in this very field, and this work is presented in a far too obscure way.
What problem is it trying to solve? Is it proposing a new generative model?
- I don't have the theoretical background, and I couldn't really follow the videos either. “Teacher Forcing” does look bad, but I can't tell whether the rest is good or bad. What is the baseline, after all?
Is Russ doing diffusion now? This should be quite applicable in robotics.
- Diffusion policy really has started being used in robotics recently. See https://diffusion-policy.cs.columbia.edu/ and related research
Am I missing something about training time? Doesn't adding token-wise noise slow training down a lot? Great paper regardless.
Great work. I'm curious whether it could be applied back to LLMs as a discrete diffusion model that uses partial masking.
Very cool, but why is it called diffusion forcing?
- It is in the second paragraph:
  It says the name “Diffusion Forcing” comes from “teacher forcing” and “diffusion models”
