RenderFormer: Neural rendering of triangle meshes with global illumination

(microsoft.github.io)

4 points by GN⁺ 2025-06-02 | 1 comment

RenderFormer is a neural rendering pipeline that generates images directly from triangle-mesh scenes; notably, it handles global illumination without any scene-specific training
Rendering is framed not as a physics simulation but as a sequence-to-sequence transformation that turns tokens describing triangles and their reflectance properties into tokens for small pixel patches
The pipeline is split into a view-independent stage and a view-dependent stage; both use the Transformer architecture and are trained with minimal prior constraints
The view-independent stage models light transport between triangles, and the view-dependent stage converts ray-bundle tokens into pixel values
The public examples cover lighting, materials, geometric complexity, animation, and physics simulation, all rendered without rasterization or ray tracing

RenderFormer's rendering structure

RenderFormer is a neural rendering pipeline that renders images directly from a triangle-based scene representation
It includes full global illumination effects, yet requires no scene-specific training or fine-tuning
The rendering process is built as a sequence-to-sequence transformation
- The input is a sequence of triangle tokens that carry reflectance properties
- The output is a sequence of tokens, each representing a small patch of pixels
A two-stage pipeline separates view-independent light transport computation from actual pixel generation
- View-independent stage: models light transport between triangles
- View-dependent stage: converts ray-bundle tokens into pixel values, guided by the triangle sequence from the view-independent stage
Both stages are based on the Transformer architecture and are trained with minimal prior constraints
The rendering process uses neither rasterization nor ray tracing
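
As a rough illustration of the two-stage token flow described above, the following Python sketch only does the bookkeeping of sequence lengths. The 512x512 resolution and the 8x8 patch size are assumptions chosen for the example, not values stated in this summary, and the function names are hypothetical; the 4,096-triangle figure is the scene limit quoted later in the comments.

```python
# Hypothetical bookkeeping sketch: resolution and patch size are illustrative
# assumptions, not values taken from the RenderFormer paper.

def ray_bundle_token_count(width: int, height: int, patch: int) -> int:
    """Number of output tokens if each token covers one patch x patch pixel block."""
    if width % patch or height % patch:
        raise ValueError("image size must be a multiple of the patch size")
    return (width // patch) * (height // patch)


def describe_pipeline(num_triangles: int, width: int, height: int, patch: int) -> dict:
    """Summarize the sequence lengths seen by the two stages."""
    out_tokens = ray_bundle_token_count(width, height, patch)
    return {
        # Stage 1 (view-independent): one token per triangle.
        "stage1_tokens": num_triangles,
        # Stage 2 (view-dependent): one ray-bundle token per pixel patch.
        "stage2_tokens": out_tokens,
        # Each stage 2 token is decoded into one patch of pixels.
        "pixels_per_token": patch * patch,
        "total_pixels": out_tokens * patch * patch,
    }


summary = describe_pipeline(num_triangles=4096, width=512, height=512, patch=8)
print(summary)
```

Under these assumed sizes both stages happen to work on sequences of the same length, which is a coincidence of the chosen numbers rather than a property of the method.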

Public results and reference material

The rendering gallery shows a range of lighting conditions, materials, and geometric complexity, all without scene-specific training or fine-tuning
- Cornell Box, Stanford Bunny in Cornell Box, Lucy Statue, Utah Teapot
- Composed Scene, Constant Width Bodies, Crystals, Fox in the Wild
- Horse and Heart, RenderFormer Logo, Interior Room, Shader Ball, Tree, Veach MIS
Reference images are provided for detailed comparison
Uncompressed videos and reference videos are available as additional video material
Teaser scenes
- These show object rotation, lighting changes, and material adjustments
- Cornell Box Roughness Adjustment
- Bunny Roughness Adjustment
- Tree Light Change
- Tree Object Rotation
- Fancy Scene Rotation
- Composed Scene View Change
Animation and simulation
- Animation rendering examples include Cascade Cube Animation, Animated Crab, Gyroscope Motion, Animated Character, Marching Cubes Animation, and Robot Animation
- Physics-based simulation examples include Bowling Ball Physics Simulation, Rotating Box Dynamics, and Constant Width Body Simulation
- The paper will appear in ACM SIGGRAPH 2025 Conference Papers, and the title in the BibTeX entry is “RenderFormer: Transformer-based Neural Rendering of Triangle Meshes with Global Illumination”

1 comment

GN⁺ 2025-06-02

Hacker News comments

The most impressive thing here may be the speed: on the same scene RenderFormer took 0.0760 seconds, while Blender Cycles took 3.97 seconds (12.05 seconds at high settings), and it still kept a structural similarity index of 0.9526 (on a 0~1 scale, where 1 means identical images). See Tables 2 and 1 of the paper
With an on-device Transformer model, this could give 3D designers a higher-quality instant render preview in a web or native app
The measurement above was made on an A100 with an unoptimized PyTorch version of the model. A typical user's GPU will be much weaker, but a 3D designer's GPU may still be enough to show a large speedup over traditional rendering. If the system is web-based, it could also connect to an A100 on the backend and stream the image to the browser
The limitation is that it is not fully accurate as scene complexity grows, for example with complex shadow shapes (and probably with things like particles or hair). So final renders will most likely still be done the traditional way, to avoid the ugly visual artifacts seen in many AI-generated images and videos these days. Still, if it is "good enough" and the speed gain is large, big animation studios that render feature-film-length previews for use in music or story review may have a reason to adopt it
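
To make the quoted timings concrete, here is a small Python calculation of the implied speedup and frame rate. It uses only the three numbers cited in the comment above; the variable names are illustrative, and it says nothing about whether the comparison itself is fair, which the replies below dispute.

```python
# Timings quoted in the comment above (seconds per frame, same scene).
renderformer_s = 0.0760
cycles_s = 3.97
cycles_high_s = 12.05

# Speedup is the ratio of the slower time to the faster time.
speedup = cycles_s / renderformer_s
speedup_high = cycles_high_s / renderformer_s

# Frames per second implied by the RenderFormer timing alone.
frames_per_second = 1.0 / renderformer_s

print(f"vs Cycles:               {speedup:.1f}x")
print(f"vs Cycles (high setting): {speedup_high:.1f}x")
print(f"RenderFormer frame rate:  {frames_per_second:.1f} fps")
```

That works out to roughly 52x and 159x, at about 13 frames per second on the A100 used for the measurement.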
- I don't think the authors were trying to mislead on purpose, but on a GPU of that class Blender Cycles can render every scene in this paper much faster than 4 seconds per frame
  The scenes look like very low-complexity, extremely simple technical demos, and Blender appears to have been set to 4,000 iterations per pixel, which makes no sense. Blender gets very close to the final output after only a few hundred cycles, and the remaining 3,800 or so cycles would just burn GPU cycles without improving anything
  It looks as if Blender's initialization step was mistakenly included in the total render time while the Transformer's initialization was not. I'd like to see the render time of the second frame on each system, and I expect Blender would do much better. The paper's results are interesting in themselves, but the Blender settings and the measurement method need more nuance
- For the scenes shown, even 76ms is close to an eternity. Yes, it could get much faster down the road, but there is still a long way to go before calling it better than traditional rendering
- Comparing timings against the reference render feels quite dishonest
  In ray tracing, error falls in inverse proportion to the square root of the sample count. Using a very high sample count for the reference image is normal for quality comparison, but real offline renderers use sample counts one to two orders of magnitude lower than this paper does
  In graphics papers it is common to provide a reference image with a very high sample count for quality comparison, but not to compare timings against that same reference image. If the result is approximate, it would be fairer to compare it with other approximate rendering algorithms. Modern real-time path tracers and denoisers can render far more complex scenes than these in under 16ms, even on consumer GPUs
  The key phrase is "far more complex scenes". With a Transformer, scaling becomes quadratic in both triangle count and output pixel count. I'm not fully up to date with the latest ML research, so this may have improved, but I doubt it will beat the theoretical scaling of a typical path tracer, O(log n_triangles) and O(n_pixels). In practice, scaling in pixel count is close to sublinear because of the high coherence between neighboring pixels
- The paper has this line: "the execution time complexity of the attention layers grows quadratically with the token count, and here the triangle count equals the token count. For this reason the total number of triangles in a scene is limited to 4,096"
- RenderFormer at 0.0760 seconds versus Blender Cycles at 3.97 seconds on the same scene is quite surprising
  I skimmed the paper but couldn't find details on how this was configured. I'm curious whether Cycles used the CPU or CUDA kernels on the A100. And if this was a single-frame render, a non-negligible part of the 3.97 seconds would have gone to renderer startup. Per-frame time would drop when rendering a sequence
  The per-triangle complexity scaling mentioned in the sibling comment is painful too
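
The square-root rule raised in the replies above can be checked with a toy Monte Carlo experiment in Python. This is a minimal sketch, not a renderer: it estimates the mean of uniform random numbers as a stand-in for a pixel estimate, and the sample counts (100 and 1,600), trial count, and seed are arbitrary assumptions made for the example.

```python
import math
import random


def rms_error(num_samples: int, trials: int, rng: random.Random) -> float:
    """RMS error of a Monte Carlo estimate of the mean of U(0, 1)."""
    true_mean = 0.5
    squared_error_sum = 0.0
    for _ in range(trials):
        estimate = sum(rng.random() for _ in range(num_samples)) / num_samples
        squared_error_sum += (estimate - true_mean) ** 2
    return math.sqrt(squared_error_sum / trials)


rng = random.Random(0)  # fixed seed so the run is repeatable
low_count, high_count, trials = 100, 1600, 300

error_low = rms_error(low_count, trials, rng)
error_high = rms_error(high_count, trials, rng)

# Theory: error ~ 1 / sqrt(N), so 16x the samples should cut the error by ~4x.
predicted_ratio = math.sqrt(high_count / low_count)
measured_ratio = error_low / error_high

print(f"predicted error ratio: {predicted_ratio:.2f}")
print(f"measured error ratio:  {measured_ratio:.2f}")
```

By the same rule, going from a few hundred samples to 4,000 multiplies the cost by about ten while reducing noise by only about a factor of three, which is the point the replies make about the reference timing.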
Deep learning is also being used very successfully to denoise images rendered with global illumination [1]
In that approach, a traditional ray tracing algorithm quickly computes a rough global illumination solution for the scene, and a neural network removes the noise from the output
[1] https://www.openimagedenoise.org
- The demo output images look oddly over-smoothed, like an AI upscale. It seems as if, in trying to enlarge the image beyond what the input data supports, edges survive but texture is lost
  Edit: the denoising looks better at 100% zoom than at 125% DPI zoom, and the fern at the bottom is easier to recognize too
With graphics papers you should always think about what is not being shown
Here there are almost no polygons, the resolution is low, there are no textures, no motion blur, no depth of field, and the animations have some artifacts
It is interesting research, but to put it in perspective, a modern GPU is being used to produce images like those made 30 years ago with 1/1,000,000 of the computation
I found it odd that none of the examples show anything behind the camera
I don't know whether that is a limitation of the approach or an oversight in building the examples, but when it comes to reflection and lighting, what is behind the camera matters a lot
I don't know much about this, so I'm asking: are these scenes being rendered to match the way they would be expected to render? If so, I don't see why you would use this instead of a more direct method, because it doesn't seem faster than the direct method
- Probably because it is Cool Research™. Cost grows quadratically with triangle count, so it isn't practical. That is why only 4096 were used in every scene
- Maybe it has some great advantages that are hard to foresee right now
  For example, if the scene is a bundle of input weights, what comes out if you add noise to it? Could that give interesting outputs that aren't possible the normal way?
  Would interpolating between two different scene representations be interesting? Questions like these can be asked
- According to other comments this method is fast. Global illumination can be very slow with the direct method
Wow, so this brings the GPU full circle: from rendering to compute, and back to rendering
It looks fine, but it is blurry. It would have been good to see a render time comparison between the neural renderer and a classical renderer
In the animations, especially Animated Crab and Robot Animation, AI-art artifacts that swirl unnaturally around the model are quite visible when the object and camera move
- The paper does discuss timing. It compares against Blender Cycles (path tracing), and at least for scenes with fewer than 4,000 triangles the neural network approach is much faster. But it probably won't scale very well. It says attention execution time is quadratic in the number of triangles
  https://renderformer.github.io/pdfs/renderformer-paper.pdf
  It is worth asking whether it would be practical to use the neural network approach only for indirect lighting, with simplified geometry. That is, use an ordinary rasterizer and add global illumination on top of it
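
The scaling concern that recurs in this thread can be illustrated with a short Python comparison. It is an idealized sketch under stated assumptions: self-attention is modeled as one interaction per token pair, and the path tracer's per-ray cost is modeled as the depth of a perfectly balanced binary tree over the triangles. Both are simplifications for illustration, not measurements of either system, and the function names are hypothetical.

```python
import math


def attention_pairs(num_tokens: int) -> int:
    """Token-pair interactions in full self-attention: quadratic in token count."""
    return num_tokens * num_tokens


def balanced_tree_depth(num_triangles: int) -> int:
    """Depth of an idealized balanced binary tree over the triangles (log2 growth)."""
    return math.ceil(math.log2(num_triangles))


for n in (1024, 4096, 16384):
    print(
        f"{n:>6} triangles: "
        f"{attention_pairs(n):>12,} attention pairs, "
        f"tree depth {balanced_tree_depth(n)}"
    )

# Doubling the triangle count quadruples the attention work
# but adds only one level to the idealized tree.
growth_attention = attention_pairs(8192) / attention_pairs(4096)
growth_tree = balanced_tree_depth(8192) - balanced_tree_depth(4096)
print(growth_attention, growth_tree)
```

At the 4,096-triangle limit mentioned above, full attention already involves about 16.8 million token pairs per layer.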
I have a friend who works on physically based renderers in the film industry and has also done research in the area. I always enjoy hearing him talk about and explain how things work in that industry
I wonder which companies are hiring this kind of talent these days. Are AI companies also hiring rendering engineers to build training environments?
If anyone wants to hire an experienced research/industry rendering engineer, I can put you in touch. My friend isn't on social media but is looking for opportunities
- Contact me on Gmail at my username
This is really great research. I genuinely love examples like this of applying Transformers to domains other than text
It seems to work well whenever the input is sequential and its input tokens are related to one another. I hope to see more research in this area
What other interesting domains besides text might be a particularly good fit for Transformers?
Training a Transformer to turn a set of triangles, given as a scene description, into a 2D pixel array, so that the result looks like the pixels a global illumination renderer would produce for the same scene, is a great and interesting idea
Given the research of the past 5 years, the fact that it works is not surprising in itself, but it still feels like a fairly profound result. The Transformer architecture really is very versatile
Overall it is extremely fast, quite close to the Blender render output, and appears to be a model of roughly 1 billion parameters. I don't know whether it is fp16 or fp32, but the file is 2GB, so there is nothing to dislike. I'd like to see more "realistic" scene demos, but if you want you can download it and run it yourself on a Mac
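
The fp16-or-fp32 question in the last comment can be approached with back-of-the-envelope arithmetic. The Python sketch below takes the commenter's two approximate figures (about 1 billion parameters, a 2GB file) at face value and assumes the file is almost entirely weights with 1GB = 10^9 bytes; those assumptions are not confirmed by the source, so the result is a plausibility check rather than a fact about the released checkpoint.

```python
# Approximate figures from the comment above; both are the commenter's estimates.
parameter_count = 1_000_000_000   # "roughly 1 billion parameters"
file_size_bytes = 2_000_000_000   # "the file is 2GB" (assuming 1GB = 10**9 bytes)

# Assumption: the file holds essentially nothing but the weights.
bytes_per_parameter = file_size_bytes / parameter_count

# Storage size of common floating-point formats, in bytes per value.
format_sizes = {"fp16": 2, "fp32": 4}

# Pick the format whose size is closest to the observed bytes per parameter.
closest_format = min(
    format_sizes, key=lambda name: abs(format_sizes[name] - bytes_per_parameter)
)

print(f"{bytes_per_parameter:.1f} bytes per parameter, closest to {closest_format}")
```

Under those assumptions the numbers line up with 16-bit weights; an fp32 checkpoint of the same parameter count would be closer to 4GB.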
