What Do Generative Models Know? Do They Really Know Things?

(intrinsic-lora.github.io)

1 point by GN⁺ 2024-02-25 | 1 comment

GAN, autoregressive, and diffusion models that render realistic scenes convincingly may implicitly contain scene intrinsics such as depth, normals, albedo, and shading
The proposed method uses LoRA, which is only loosely tied to model architecture, to recover intrinsic representations while reusing the existing image-generation decoder unchanged
A lightweight LoRA is added to the attention layers in VQGAN and Stable Diffusion and to the affine layers in StyleGAN, so intrinsic images are obtained without a separate task-specific decoding head
For Stable Diffusion at rank 2, the added trainable parameters amount to only 0.04% of the full model weights, and intrinsic images can be generated from as few as 250 labelled images
Control experiments showed that the higher a generative model's quality, the more accurate the recovered scene intrinsics tend to be, although extractability varies by model and domain

Research question and LoRA approach

The starting point is a question: if a generative model imitates real scenes well, might its internal representations also contain scene intrinsics?
The study sets out to examine four things
- What kind of intrinsic knowledge GAN, autoregressive, and diffusion models encode
- Whether a general framework can recover intrinsic representations independently of architecture or model type
- How few trainable parameters and how little labelled data are needed
- Whether generative model quality and the accuracy of the recovered intrinsics are directly related
The core of the method is Low-Rank Adaptation (LoRA)
- In VQGAN and Stable Diffusion, LoRA was applied to the attention layers
- In StyleGAN, LoRA was applied to the affine layers
- No separate task-specific decoding head or layer was added; the same decoder head used for image generation was reused
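The LoRA idea described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the paper's implementation: the class name `LoRALinear`, the `alpha` scaling, the initialization, and the toy weight matrix are all assumptions made for the example. It shows a frozen weight `W` with a trainable low-rank update `B @ A` added on top, and that the adapted layer initially behaves exactly like the original one.

```python
import random

def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """

    def __init__(self, weight, rank=2, alpha=2.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen: never updated during adaptation
        # A starts small and random, B starts at zero, so the update is zero at first.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def num_trainable(self):
        """Only A and B would be trained; W is excluded."""
        return sum(len(r) for r in self.A) + sum(len(r) for r in self.B)

W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # toy 2x3 "pretrained" weight
layer = LoRALinear(W, rank=2)
x = [1.0, 2.0, 3.0]
print(layer.forward(x))       # [7.0, -1.0], identical to W @ x because B is zero
print(layer.num_trainable())  # 10 = rank * (d_in + d_out)
```

Because the base weight is untouched and the output path is unchanged, whatever the adapted model produces still comes out of the original decoder, which is the property the summary emphasizes.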

Recovery results and differences across models

A small LoRA is enough to recover depth, normals, albedo, and shading from several generative models
With a rank-2 LoRA on Stable Diffusion, the trainable parameters come to just 0.04% of the full model weights
Intrinsic images can be generated through the LoRA module even with only 250 labelled images
Control experiments confirmed a positive correlation between model quality and the accuracy of the recovered intrinsics
Intrinsic extraction results differ by model and domain (listed as model / type / domain)
- VQGAN / Autoregressive / FFHQ: normals and depth medium quality, albedo and shading high quality
- StyleGAN-v2 / GAN / FFHQ: normals, albedo, and shading high quality, depth medium quality
- StyleGAN-v2 / GAN / LSUN Bed: normals, depth, albedo, and shading all high quality
- StyleGAN-XL / GAN / FFHQ: normals, albedo, and shading high quality, depth medium quality
- StyleGAN-XL / GAN / ImageNet: normals, depth, albedo, and shading could not be extracted
- Stable Diffusion-UNet / Diffusion / Open: normals, depth, albedo, and shading all high quality
- Stable Diffusion / Diffusion / Open: normals, depth, albedo, and shading all high quality
The intrinsic maps produced by the method built on Stable Diffusion 2.1 are compared against pseudo ground truth for surface normals, depth, albedo, and shading

1 comment

GN⁺ 2024-02-25

Hacker News comments

One reason expectations for Sora were so high is that some of its videos gave the impression that a simulation of the physical world was running inside, as if the video had been shot by a camera placed in that 3D scene.
Intuitively it felt like far more was going on behind the scenes than stitching separate video fragments together, and this paper looks like evidence of exactly that.
Still-image generators show the same thing: the model effectively learns to render a 3D scene and take a photo of it. Nobody set out to build a 3D engine; piles of images were thrown at linear algebra and optimized, and a world simulator came out. That is astonishing.
- Humans live in a 3D world, and their training data is a continuous binocular visual stream that sees the same scene from many angles. Sora, by contrast, has learned the world as if by watching TV, so it may need to play a lot more video games to learn an implicit representation and rendering of 3D scenes.
- It is surprising that people still seriously think this is just stitching video fragments together.
- "Nobody tried to build a 3D engine; images were thrown at linear algebra and optimized, and a world simulator came out" sounds like something a personified evolution could say about the mind.
- Even among the videos hand-picked by the company that made it, there was a scene where a cat grew a fifth leg that then quickly disappeared. The question is how such phenomena fit into this optimistic narrative.
- A neural network is not linear algebra. Assuming ReLU is the activation used most of the time these days, the core of a neural network is a half-linear structure, and that half-linearity is what gives it its power.
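The "half-linear" remark in the last reply can be checked on a toy example. The sketch below uses a hypothetical two-layer network without biases (the weights are chosen for illustration only) to show that a ReLU network is not additive, so it is not a linear map, yet it still scales linearly for positive factors, which is the piecewise-linear behaviour the commenter is pointing at.

```python
def relu(values):
    """Element-wise ReLU: negative entries become zero."""
    return [max(0.0, v) for v in values]

def tiny_net(x, W1, W2):
    """Two-layer network without biases: W2 @ relu(W1 @ x)."""
    hidden = relu([sum(w * xi for w, xi in zip(row, x)) for row in W1])
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]

# Illustrative weights: this particular network computes |x[0] - x[1]|.
W1 = [[1.0, -1.0], [-1.0, 1.0]]
W2 = [[1.0, 1.0]]

a, b = [1.0, 0.0], [0.0, 1.0]
f_a = tiny_net(a, W1, W2)                       # [1.0]
f_b = tiny_net(b, W1, W2)                       # [1.0]
f_sum = tiny_net([1.0, 1.0], W1, W2)            # [0.0], not f_a + f_b
f_scaled = tiny_net([3.0, 0.0], W1, W2)         # [3.0], equals 3 * f_a
f_negated = tiny_net([-1.0, 0.0], W1, W2)       # [1.0], not -f_a

print(f_a, f_b, f_sum, f_scaled, f_negated)
```

Additivity fails (`f(a + b)` is 0 while `f(a) + f(b)` is 2) and negation fails, but positive scaling holds, so the network is linear only within each region where the same set of ReLUs is active.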
The name comes from the fictional game show in Bojack Horseman, Hollywoo Stars and Celebrities: What Do They Know? Do They Know Things?? Let's Find Out!
https://bojackhorseman.fandom.com/wiki/Hollywoo_Stars_and_Ce...!
- I really love that show, enough that I have its sticker on my laptop. If you haven't seen Bojack Horseman, it is funny and sincere at the same time, with a strong existential feel; if it matches your taste it is well worth watching.
  As a complete animation package I find it far better than Futurama. It has a lot of relatable depth and hits hard, but it keeps itself light enough that you are still in a decent mood afterwards.
  Now that I have started working on the filmtech side, the Hollywoo sticker fits even better.
- I upvoted this article on the title alone.
- I quote this particular game show title quite often, but not many people get it, so sadly I just come across as a weird person.
- It is also funny that the show keeps referring to it as HSaCWDTKDTKTLFO. Spelling out a long acronym letter by letter as if it were a short one is probably my favorite recurring gag in the show.
- Looks like I have found my people. I have watched this show about 6 times.
This reminded me of the time I tried to extract G-buffers in a Unity High Definition Rendering Pipeline test project: https://www.youtube.com/watch?v=Fwtc694qNUM
That said, I am not sure whether this paper really proves anything. They are training a huge UNet LoRA model here, but it is not clear whether they are "extracting" something from the existing model or building a new model that produces channels like those that come out of a deferred rendering pipeline.
Deferred rendering, which combines normals, albedo, and depth, is just one of many techniques for producing 3D scenes, and it was not used in video games until the Shrek game on the Xbox in the early 2000s (https://sites.google.com/site/richgel99/the-early-history-of...)
The really impressive thing would be a LoRA model that can extract the "camera" rotation/translation matrix from an image generation model. That would be much stronger evidence and would also be quite useful.
- The supplementary material includes an experiment that trains the LoRA on a randomly initialized UNet. In that case surface normals can barely be extracted compared with using the pre-trained Stable Diffusion UNet, which shows fairly clearly that the features already present inside the model matter for performance.
- I am no expert, but the part saying "the newly trained parameters are less than 0.6% of the generative model's total parameters" probably answers that question.
  0.6% sounds like a small number, but I am curious whether the right thing is being measured. The model has not necessarily encoded exactly the representation we are extracting, but if it has encoded something that can be mapped to normals, albedo, and depth cheaply (relative to model size) and stably, that alone seems very meaningful.
  It does not matter which basis vectors are used; I only need to know how to map them into my representation.
I skimmed the paper, but several parts were hard to follow. Not being familiar with image generation AI, I would like to know exactly what the sentence that looks like the core claim means: "I-LoRA modulates key feature maps to extract intrinsic scene properties such as normals, depth, albedo, and shading, using the models' existing decoders without additional layers, revealing their deep understanding of scene intrinsics".
What does it mean to "modulate key feature maps to extract intrinsic scene properties", and how were these scene-property images generated without additional decoding layers?
- Suppose you have a neural network with 1 billion parameters. You add about 5 million parameters here and there, then keep training only the new parameters the LoRA way, leaving the base network untouched. The result is a modulated network that predicts scene properties.
  The interesting part is that so few extra parameters are needed, which suggests the original network was already quite close to that point.
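The arithmetic in that reply is easy to reproduce. The sketch below counts the extra parameters a rank-r adapter adds to a set of weight matrices and expresses them as a share of the base model. The helper names and the layer shapes (64 square 1024x1024 projections) are hypothetical and are not taken from any of the models in the paper; only the 1 billion and 5 million figures come from the comment.

```python
def lora_param_count(shapes, rank):
    """Extra parameters when each (d_out, d_in) matrix gets a rank-`rank` adapter.

    A rank-r update B @ A has B of shape (d_out, r) and A of shape (r, d_in),
    so it adds r * (d_out + d_in) parameters per adapted matrix.
    """
    return sum(rank * (d_out + d_in) for d_out, d_in in shapes)

def share_of_base(extra_params, base_params):
    """Added parameters as a fraction of the frozen base model's parameters."""
    return extra_params / base_params

# Hypothetical example: 64 attention projections of size 1024 x 1024, rank 2.
hypothetical_shapes = [(1024, 1024)] * 64
extra = lora_param_count(hypothetical_shapes, rank=2)
print(extra)  # 262144

# The commenter's numbers: about 5 million added to a 1-billion-parameter network.
print(f"{share_of_base(5_000_000, 1_000_000_000):.2%}")  # 0.50%
```

A full 1024x1024 matrix has 1,048,576 parameters, while its rank-2 adapter has 4,096, which is why the added share stays well below one percent even when many layers are adapted.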
I don't know why Toyota or Adobe are funding research with a name like this, but I really love it. It would be nice if a bit of mischief came back to science
More practically, reading the description that "a model-agnostic approach optimized with a small number of labeled images adapts to many generative architectures such as diffusion models, GANs, and autoregressive models" makes me wonder whether this is purely a visual-spatial tool
Are the examples visual only by coincidence, or is there no way to extend this to text models? This is the first time I have seen this kind of approach to interpretability, and it is very impressive
- There is also research on editing the factual information in language models. https://rome.baulab.info/
- Is it really that hard to see why Toyota or Adobe would fund computer vision research?
- This is the Bojack Horseman reference we didn't know we needed
Quite startling. These models are not just doing magic in an undecodable billion-dimensional hyperplane; they are actually learning representations that humans can interpret
- From the point of view of an old 3D graphics engineer, albedo being in there is both expected and genuinely impressive
  The core components of physically based rendering are position, surface normal, incoming light, and at least one surface material property such as albedo or reflectivity/roughness. Position can be derived from the image XY coordinates and depth
  It is quite expected that the AI models depth, and surface normals can be thought of as something like a local convolution of depth. But modeling albedo separately from incoming light is remarkable. I wonder whether reflectivity is hidden in there somewhere too
- Despite plenty of evidence that generative models have a fairly complex internal world model, some people still insist that they are just "stochastic parrots" that "don't really understand anything"; that is surprising
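The remark above that surface normals are "something like a local convolution of depth" can be made concrete. The sketch below estimates a normal at each pixel from finite differences of a depth grid. It is only an illustration of the commenter's point, not the paper's method, and it assumes an orthographic camera, unit pixel spacing, a grid of at least 2x2 pixels, and the sign convention that a flat surface facing the camera has normal (0, 0, 1). The function name is hypothetical.

```python
import math

def normals_from_depth(depth):
    """Estimate unit surface normals from a depth grid (list of rows).

    Assumes an orthographic view, unit pixel spacing, and at least 2x2 pixels.
    Uses central differences inside the grid and one-sided ones at the border.
    """
    h, w = len(depth), len(depth[0])
    normals = []
    for y in range(h):
        row = []
        for x in range(w):
            # Clamp neighbour indices so border pixels use a one-sided difference.
            x0, x1 = max(x - 1, 0), min(x + 1, w - 1)
            y0, y1 = max(y - 1, 0), min(y + 1, h - 1)
            dzdx = (depth[y][x1] - depth[y][x0]) / (x1 - x0)
            dzdy = (depth[y1][x] - depth[y0][x]) / (y1 - y0)
            n = (-dzdx, -dzdy, 1.0)  # unnormalized normal of the surface z = depth(x, y)
            length = math.sqrt(sum(c * c for c in n))
            row.append(tuple(c / length for c in n))
        normals.append(row)
    return normals

flat = [[5.0] * 3 for _ in range(3)]                     # constant depth
ramp = [[float(x) for x in range(3)] for _ in range(3)]  # depth grows with x

print(normals_from_depth(flat)[1][1])  # points straight at the camera
print(normals_from_depth(ramp)[1][1])  # tilted 45 degrees about the y axis
```

Each normal depends only on a small neighbourhood of depth values, which is why it is plausible for a model that represents depth to also represent normals; albedo has no such shortcut, which is the commenter's reason for finding it more impressive.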
This is good news for VR, or spatial computing. If the model understands the physical world as well as the paper suggests, generating two projections of one scene does not seem like a very hard ask. I am really curious about what comes next
If this can predict albedo and lighting from real images, I hope someone builds relightable Gaussian splatting scenes. Dynamic lighting would greatly increase the usefulness of 3D scans made from photos, but I have not yet seen a result in this area that could be called "good"
- Can real images definitely be used? If so, extracting depth maps from real images seems like the most useful application
I am not trying to be skeptical, but I wonder how we know that image generation companies have not put things like normal maps into their datasets to reinforce this
I understand that this paper covers verifiable open-source models, but could something like this be the secret sauce of more advanced models?
- That would require training on paired normal-map images and original images. As far as I know that is not a common training technique, and this capability seems to show up across several open models
It would be interesting to test whether generative models perceive better than humans, using the optical illusions that fool humans. For example, I would like to know whether they judge depth correctly in a situation like the Ponzo illusion
