Magma - a foundation model for multimodal AI agents

(microsoft.github.io)

3 points by GN⁺ 2025-02-21 | 1 comment

Magma is the first foundation model that can understand multimodal inputs and ground them in their environment, handling complex interactions in both the virtual and the real world
It is not limited to image and video understanding: it generates goal-oriented visual plans and actions, and so can carry out a range of AI agent tasks
It achieves state-of-the-art performance on several multimodal tasks, including UI navigation, robot manipulation, and image and video understanding (especially spatial understanding and reasoning)
Scalable pretraining approach: it learns from unlabeled video data together with existing agent data, which gives it strong generalization and makes it suitable for real-world applications
The code, model, and UI navigation demo are scheduled for release at MSR Forum (2025.02.25)

Magma's goals

Language and spatio-temporal intelligence:
- Understand images and videos accurately, and on that basis turn goals into action plans and their execution
Working in digital and physical environments:
- Can perform both web navigation (UI manipulation) and robot manipulation
- An AI that moves freely between digital and physical environments, as humans do
To this end, Magma was trained with a new training dataset that combines unlabeled video data with existing agent data, and with a pretraining framework that learns text, images, and actions jointly

How Magma is pretrained

Magma is trained using two main approaches.
1️⃣ Large-scale heterogeneous training data
- Training combines existing multimodal data, UI navigation data, and robot manipulation data with a large amount of unlabeled video data.
- Camera motion is removed so that the actual action data can be extracted, which lets the model learn long-horizon action prediction and planning.
2️⃣ Unified pretraining objectives
- Text and actions are inherently different, and bridging them effectively is a challenge
- New training techniques, Set-of-Mark and Trace-of-Mark, are adopted to build a strong alignment between text, images, and actions
  - Set-of-Mark (SoM): enables effective action grounding in images by predicting numbered marks for clickable buttons or robot arms in UI screenshots, robot manipulation footage, and human videos.
  - Trace-of-Mark (ToM): provides supervision for robot manipulation and human actions, so that the model understands temporal video dynamics and predicts future states before acting.
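
The two labelling ideas above can be illustrated with a small sketch. This is not Magma's actual data pipeline, which the summary does not describe. The function names `assign_marks` and `remove_camera_motion` are hypothetical, and camera motion is approximated as a per-frame translation estimated from the median displacement of all tracked points, a simple stand-in for whatever method the authors use.

```python
from statistics import median


def assign_marks(boxes):
    """Set-of-Mark style labelling (illustrative only).

    boxes: list of (x0, y0, x1, y1) candidate regions, e.g. clickable buttons.
    Returns {mark_id: (center_x, center_y)} with marks numbered from 1 in
    reading order (top to bottom, then left to right).
    """
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))
    return {
        i: ((b[0] + b[2]) / 2, (b[1] + b[3]) / 2)
        for i, b in enumerate(ordered, start=1)
    }


def remove_camera_motion(tracks):
    """Build motion-compensated traces for Trace-of-Mark style supervision.

    tracks: {mark_id: [(x, y), ...]} with one point per frame, equal lengths.
    Assumption: the camera only translates, and most tracked points belong to
    the static background, so the median displacement approximates the camera.
    """
    if not tracks:
        return {}
    ids = list(tracks)
    n_frames = len(tracks[ids[0]])
    traces = {i: [tracks[i][0]] for i in ids}
    cam_x = cam_y = 0.0  # accumulated camera shift since the first frame
    for t in range(1, n_frames):
        cam_x += median(tracks[i][t][0] - tracks[i][t - 1][0] for i in ids)
        cam_y += median(tracks[i][t][1] - tracks[i][t - 1][1] for i in ids)
        for i in ids:
            x, y = tracks[i][t]
            traces[i].append((x - cam_x, y - cam_y))
    return traces
```

With two background points that only follow the camera and one object that also moves on its own, the compensated background traces stay fixed while the object's trace keeps only its own motion. A full pipeline would need a richer camera model than pure translation.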

Using the model

Direct use (works without fine-tuning)

Magma is designed for research use and can be used in the following ways.

Image/video-based text generation: generates descriptions and answers from the input image and text.
Visual planning: predicts the future action path needed to achieve a goal, such as moving an object.
Agent capabilities:
- UI navigation: predicts UI actions, for example "click the search button"
- Robot manipulation: predicts 7 degrees of freedom (7 DoF) robot manipulation actions
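
To make the 7 DoF output concrete, here is a minimal sketch of how such a prediction might be held in code. The summary only states the dimensionality, so the field layout is an assumption: three translation deltas, three rotation deltas, and one gripper value, a convention common in robot learning setups. The class name `ArmAction` is hypothetical and not part of any Magma API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArmAction:
    """One 7 DoF manipulation step (assumed layout, not confirmed by the source)."""

    dx: float      # end-effector translation deltas
    dy: float
    dz: float
    droll: float   # end-effector rotation deltas
    dpitch: float
    dyaw: float
    gripper: float  # assumed convention: 0.0 = closed, 1.0 = open

    @classmethod
    def from_vector(cls, values):
        """Build an action from a flat sequence of exactly 7 numbers."""
        values = [float(v) for v in values]
        if len(values) != 7:
            raise ValueError(f"expected 7 values, got {len(values)}")
        return cls(*values)


# Example: move slightly forward and down, rotate a little, open the gripper.
action = ArmAction.from_vector([0.01, 0.0, -0.02, 0.0, 0.0, 0.1, 1])
```

Validating the length at the boundary keeps a malformed model output from silently turning into a robot command.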

Downstream tasks (with fine-tuning)

Magma can be further trained for specific tasks.

Image captioning and QA: training with existing multimodal large language model (LLM) approaches can strengthen spatial understanding and reasoning.
Video captioning and QA: temporal understanding and reasoning over video data can be strengthened.
UI navigation: optimizing for web and mobile UI navigation tasks can yield high performance.
Robot manipulation: with additional training for robot control, it outperforms existing robot manipulation models such as OpenVLA.

Bias, risks, limitations

The model is not designed for every downstream task.
Before applying it to a specific use case, evaluate and adjust it for accuracy, safety, and fairness.
Applicable laws and regulations must be followed, especially in high-risk scenarios.

1 comment

GN⁺ 2025-02-21

Hacker News opinions

Thank you for your interest in the Magma project. We will release the inference, training, evaluation, and data preprocessing code in stages, and this will be complete by next Tuesday
The pace of progress in multimodal agents is impressive. OpenVLA was released in June 2024 and was state of the art at the time. Eight months later, the success rate on tasks like "Pick Place Hotdog Sausage" has risen from 2/10 to 6/10
Industrial robots are efficient because they do not imitate human behavior, so it is hard to see in what sense teaching robots human behavior is useful. Home robots will need efficient tools, which means new machines different from the washing machines, ovens, and dishwashers in use today
The multimodal capabilities, especially next action prediction, are impressive. I am watching to see whether this feature will be open sourced on GitHub. I would also like to know why it was named Magma
This is a really interesting model and I am eager to try it. But what I want is a multimodal agent model that can produce embeddings for a humanoid control model like Meta motivo. Meta motivo is a toy model trained on the SMPL skeleton, and it has no fingers, so its functionality is limited. More advanced models such as SMPL-X could have been used, but the lack of open-ended motion data with accurate finger movements makes it hard to train a robust manipulation model
Most existing motion datasets come from academic motion capture setups and do not focus on manipulation tasks. I believe progress in 3D HPE from 2D video will close this gap. With access to thousands of hours of video, we could build a large-scale motion dataset covering diverse real-world interactions
This would make possible the two components needed to train an agent model that produces embeddings readable by control models that accurately model hand and finger joint movements. Given the rapid progress of SoTA 3D HPE from 2D video and the sheer volume of online video, I expect we will see humanoid robots with good manipulation skills in the near future
In the mug-cleaning video, the person appears to be pretending to wash the cup, as if they did not want to get their hands wet. I wonder when a model will be able to pick up on such subtleties
I wonder why multimodal models do not generate images flexibly. They seem to hand image generation off to another model. They probably do not really know what is in the image they produced, although they can edit it
Multimodal agents are notorious for failing at long-horizon tasks. I would like to know how Magma performs on them
I wonder whether any multimodal model has been trained for reasoning
I would like to know whether there is any research on incremental training. It could be an alternative to RAG for robots