Packing Input Frame Context in Next-Frame Prediction Models for Video Generation

(lllyasviel.github.io)

2 points by GN⁺ 2025-04-21 | 1 comment

FramePack is a next-frame-prediction approach that lets a 13B video diffusion model generate long videos even on a laptop GPU with 6GB of memory
It does not handle input frames at a uniform length; it uses a different patchifying kernel for each frame, so that the important frames closest to the prediction target get more GPU resources
For HunyuanVideo, a 480p frame drops from about 1536 tokens with a (1, 2, 2) kernel to 192 tokens with a (2, 4, 4) kernel, and streaming compute complexity is O(1)
FramePack Scheduling controls frame importance and compression ratio, and for image-to-video a schedule that treats the starting frames as equally important is also possible
To reduce the drifting caused by accumulated errors in long video generation, bidirectional sampling that breaks causality is used, and inverted anti-drifting sampling is well suited to image-to-video

FramePack's input frame context packing

FramePack is a method for Next-Frame or Next-Frame-Section Prediction models that takes multiple input frames and generates new frames through diffusion
The target performance and usage conditions are as follows
- Generate thousands of frames at 30 fps with a 13B model on a laptop GPU with 6GB of memory
- Fine-tune a 13B video model with batch size 64 on a single 8xA100/H100 node
- On a personal RTX 4090, generate at 2.5 seconds/frame before optimization and 1.5 seconds/frame with teacache
- No timestep distillation
The key idea is that, rather than simply concatenating the input frame images, each frame is given its own context length in the logical GPU memory layout
Each frame's context length is controlled by a different patchifying kernel
- In HunyuanVideo, a 480p frame is about 1536 tokens with a (1, 2, 2) patchifying kernel
- Switching to a (2, 4, 4) patchifying kernel gives 192 tokens per frame
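The two token counts above are consistent with a simple rule: the token count shrinks by the ratio of the kernel volumes. The sketch below illustrates that arithmetic only. The function name `tokens_per_frame` is hypothetical, and the baseline of 1536 tokens is the approximate figure quoted above, not a value computed from the model.

```python
from math import prod

# Approximate baseline quoted in the summary: a 480p HunyuanVideo frame
# with a (1, 2, 2) patchifying kernel is about 1536 tokens.
BASE_KERNEL = (1, 2, 2)
BASE_TOKENS = 1536


def tokens_per_frame(kernel, base_kernel=BASE_KERNEL, base_tokens=BASE_TOKENS):
    """Estimate tokens per frame for a (time, height, width) kernel.

    Assumption: tokens scale inversely with the kernel volume
    (time * height * width) relative to the baseline kernel.
    """
    return base_tokens * prod(base_kernel) // prod(kernel)


print(tokens_per_frame((1, 2, 2)))  # 1536
print(tokens_per_frame((2, 4, 4)))  # 192
```

The (2, 4, 4) kernel has 8 times the volume of (1, 2, 2), which matches the 8x drop from 1536 to 192.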
More important frames, such as those closest to the next-frame prediction target, are given a longer context
Streaming compute complexity is O(1), not O(n log n) or O(n)
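One way to see why the total context can stay bounded regardless of video length is a geometric compression schedule. The sketch below assumes, purely for illustration, that each frame one step further from the prediction target gets half the tokens of the previous one; the summary does not state the exact ratio, and real kernels come in discrete sizes. The name `total_context` and its parameters are hypothetical.

```python
def total_context(num_frames, full_tokens=1536, ratio=2):
    """Sum per-frame context lengths under geometric compression.

    Frame 0 is the one closest to the prediction target and keeps
    full_tokens; frame i is compressed to full_tokens / ratio**i.
    This is an illustrative assumption, not the paper's exact schedule.
    """
    return sum(full_tokens / ratio**i for i in range(num_frames))


# The geometric series is bounded by full_tokens * ratio / (ratio - 1),
# so adding more history frames never pushes the total past that limit.
bound = 1536 * 2 / (2 - 1)
for n in (1, 4, 16, 64):
    print(n, total_context(n), total_context(n) <= bound)
```

Under this assumption the total never exceeds twice the context of a single full-resolution frame, which is the sense in which the per-step cost does not grow with the number of history frames.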

Scheduling and drift prevention

FramePack Scheduling supports cases where frame importance does not follow a simple pattern, where the compression ratio is changed, or where user-specified frames are treated as more important
In image-to-video the first frame matters, so a schedule that makes the starting frames equally important can be used
All schedules are O(1), and the paper includes an evaluation of the different schedules
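As a rough illustration of what such a schedule might look like, the sketch below builds a per-frame list of compression factors. Everything here is a hypothetical stand-in: the function `compression_schedule`, the `keep_first` parameter, and the doubling ratio are assumptions made for the example, not the schedules evaluated in the paper.

```python
def compression_schedule(num_frames, ratio=2, keep_first=0):
    """Return one compression factor per history frame, oldest first.

    A factor of 1 means full context. By default the factor grows
    geometrically with distance from the prediction target. The first
    keep_first frames of the video are exempted and kept at factor 1,
    as one might want for image-to-video.
    """
    factors = []
    for idx in range(num_frames):
        distance = num_frames - 1 - idx  # 0 for the frame next to the target
        if idx < keep_first:
            factors.append(1)
        else:
            factors.append(ratio ** distance)
    return factors


print(compression_schedule(5))                # [16, 8, 4, 2, 1]
print(compression_schedule(5, keep_first=1))  # [1, 8, 4, 2, 1]
```

Because `keep_first` is a fixed constant, protecting the starting frames adds only a constant amount of context, so the bounded-cost property is preserved in this sketch.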
Drifting, where quality degrades as the video gets longer, is a common problem in Next-Frame Prediction models
- If a long video is built by repeatedly feeding the last generated frame back in as input, it deteriorates quickly after 5 to 6 iterations and can be severely degraded after about 10
- This problem is also called error accumulation or exposure bias
The paper also includes experiments on existing methods such as history noise augmentation, special cfg guidance, and rolling diffusion timesteps
To address drifting at its root, causality has to be broken and sampling made bidirectional
- Only vanilla sampling is causal
- Anti-drifting sampling and inverted anti-drifting sampling are bidirectional
- Inverted anti-drifting sampling treats the first frame as the approximation target in every inference, which makes it well suited to image-to-video
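The difference between causal and inverted sampling can be pictured as the order in which video sections are generated. The sketch below is a simplified illustration under the assumption that inverted anti-drifting sampling generates sections from the end of the video back toward the known first frame. The function `section_order` and its mode names are hypothetical, and plain anti-drifting sampling is left out because the summary does not describe its ordering.

```python
def section_order(num_sections, mode="vanilla"):
    """Return the order in which section indices are generated.

    Section 0 is the start of the video (next to the user's input image).
    """
    if mode == "vanilla":
        # Causal: each section only sees earlier, already generated ones,
        # so errors accumulate as generation moves forward in time.
        return list(range(num_sections))
    if mode == "inverted_anti_drifting":
        # Assumed ordering: start from the last section and work backward,
        # so every step is anchored toward the known first frame.
        return list(range(num_sections - 1, -1, -1))
    raise ValueError(f"unknown mode: {mode}")


print(section_order(4, "vanilla"))                 # [0, 1, 2, 3]
print(section_order(4, "inverted_anti_drifting"))  # [3, 2, 1, 0]
```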

Demo conditions and reference material

The demo results were computed on an RTX 3060 6GB laptop with the 13B HY variant
- image-to-5-seconds: 30 fps, 150 frames
- image-to-60-seconds: 30 fps, 1800 frames
- Videos were compressed with h264crf18 to fit in the GitHub repository
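The frame counts above follow directly from duration times frame rate. The sketch below reproduces that arithmetic and, as a rough guide only, combines it with the per-frame speeds quoted earlier for an RTX 4090. Those speeds do not apply to the RTX 3060 laptop used for the demos, whose timing the summary does not give. The helper names are hypothetical.

```python
def frame_count(seconds, fps=30):
    """Number of frames in a clip of the given length."""
    return seconds * fps


def estimated_minutes(frames, seconds_per_frame):
    """Wall-clock estimate assuming a constant per-frame generation speed."""
    return frames * seconds_per_frame / 60


print(frame_count(5))    # 150
print(frame_count(60))   # 1800

# RTX 4090 figures from the summary: 2.5 s/frame unoptimized, 1.5 s/frame with teacache.
print(estimated_minutes(frame_count(60), 2.5))  # 75.0
print(estimated_minutes(frame_count(60), 1.5))  # 45.0
```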
Related resources are available: Paper, Code, and the FramePack-P1 Preview

1 comment

GN⁺ 2025-04-21

Hacker News opinions

This person is a genius. Some people may not know it, but they also made ControlNet
This matters a lot because it is the first usable video generation model that runs on consumer hardware, and hopefully ControlNet pose support will arrive soon
- They also made IC-Light. It is surprising that they are still contributing to open source
  Big companies must have made some huge offers; truly exceptionally talented
- I haven't really tried video generation because I'm too impatient, but isn't Wan also pretty decent on ordinary hardware?
It's funny that it wants to make people dance no matter what. Even a person seated for an interview starts dancing while sitting
- The prompt probably includes dancing. Changing the prompt would probably produce other actions, but that might not be as fun
- This looks like the effect of the large public TikTok training dataset that many video researchers use
- Interesting observation
  In still images people always look for eyes, and in video they always look for dancing
The examples are quite impressive, yet the resources used to make them are actually close to trivial. It looks like inference could run even on previous-generation consumer hardware
I'd also like to see inference throughput numbers on a 5090 at some point
Can this be done in the spatial direction too? For example, I'd like to know whether an image could be generated from top to bottom instead of all at once
Can this be used for video interpolation rather than extrapolation?
- The "inverted anti-drifting" described in the paper is basically like extrapolating far ahead first and then interpolating backward
Amazing. Could it get faster with more resources such as RAM? I'd also like to know whether the speed can be pushed further on an H100 or H200
It seems the actions it can do are pretty much limited to dancing
- There is plenty of movement besides dancing. Only one or two examples don't involve footwork dancing, but the movement isn't limited to the legs
- It takes a text prompt along with the image input, so it is more likely that dancing was put into the examples' prompts
