A collection of ML papers

(discuss.pytorch.kr)

Posted by ninebow on 2025-09-10 | 11 points

[2025/09/01 ~ 07] AI/ML papers worth a look this week

PyTorchKR🔥🇰🇷 🤔💭

1️⃣ Limits and control of large language models: Several of this week's papers examine the limits of large language models (LLMs) and how to control them. In particular, "On the Fundamental Impossibility of Hallucination Control in Large Language Models" presents a theoretical impossibility result: an LLM cannot simultaneously achieve truthful knowledge representation, semantic information conservation, complete revelation of relevant knowledge, and knowledge-constrained optimality. The paper also stresses that hallucination and creativity are mathematically identical, which gives a foundation for managing these behaviors in AI systems.

2️⃣ Efficient training and optimization techniques: Papers such as "Fantastic Pretraining Optimizers and Where to Find Them" and "Communication Efficient LLM Pre-training with SparseLoCo" study optimization techniques for making LLM training more efficient. SparseLoCo in particular uses sparsification and quantization to improve communication efficiency and reports gains in both performance and communication cost.

3️⃣ Better collaboration and memory in multi-agent systems: "Anemoi: A Semi-Centralized Multi-agent Systems Based on Agent-to-Agent Communication MCP server from Coral Protocol" and "Memp: Exploring Agent Procedural Memory" propose approaches for improving how agents collaborate and how they use procedural memory. Anemoi improves performance through direct collaboration between agents, and Memp gives agents a learnable procedural memory that can be updated and improved continuously.

On the Fundamental Impossibility of Hallucination Control in Large Language Models

Paper introduction

Hallucination in large language models (LLMs) seriously affects the reliability and accuracy of AI systems, and this paper sets out to show mathematically that the problem cannot be fully eliminated. The authors model inference as an auction of ideas, in which many components use their own partial knowledge to compete in shaping a response. The result is established through three independent mathematical domains: mechanism design theory, the theory of proper scoring rules, and direct analysis of the transformer architecture. In particular, the authors give a way to quantify the creation of overconfident or intuitive responses, which they identify as the signature of both hallucination and creativity.

The paper also introduces a semantic information measure and an emergence operator to model bounded reasoning. It shows that bounded reasoning generates accessible information, whereas idealized unconstrained reasoning strictly preserves semantic content. On this basis the authors argue that hallucination and imagination are mathematically identical phenomena, both arising from departures from truthfulness, semantic information conservation, revelation of relevant knowledge, and knowledge-constrained optimality. The work offers a theoretical foundation that could influence how AI systems are designed and evaluated, and it raises new questions at the intersection of information theory and AI.

Abstract

This paper establishes a fundamental impossibility theorem: no LLM capable of non-trivial knowledge aggregation can simultaneously achieve truthful knowledge representation, semantic information conservation, complete revelation of relevant knowledge, and knowledge-constrained optimality. The impossibility stems not from any engineering limitation but from the mathematical structure of information aggregation. The authors establish the result by describing inference as an auction of ideas, in which distributed components compete, using their partial knowledge, to shape the response. The proof spans three independent mathematical domains: mechanism design theory (Green-Laffont), the theory of proper scoring rules (Savage), and direct architectural analysis of transformers (Log-Sum-Exp convexity). In particular, the authors show how to quantify the creation of overconfident or intuitive responses, the signature of both hallucination and creativity, or imagination.

To support this analysis, they introduce the complementary concepts of a semantic information measure and an emergence operator for modeling bounded reasoning in a general setting. They prove that bounded reasoning generates accessible information and yields valuable insight and inspiration, whereas idealized unconstrained reasoning strictly preserves semantic content. By showing that hallucination and imagination are mathematically identical phenomena, grounded in departures from truthfulness, semantic information conservation, revelation of relevant knowledge, and knowledge-constrained optimality, the authors offer a principled foundation for managing these behaviors in advanced AI systems. Finally, they present some speculative ideas to motivate evaluation and refinement of the proposed theory.
> This paper establishes a fundamental impossibility theorem: no LLM capable of performing non-trivial knowledge aggregation can simultaneously achieve truthful knowledge representation, semantic information conservation, complete revelation of relevant knowledge, and knowledge-constrained optimality. The impossibility is not an engineering limitation but arises from the mathematical structure of information aggregation itself. We establish this result by describing the inference process as an auction of ideas, where distributed components compete exploiting their partial knowledge to shape responses. The proof spans three independent mathematical domains: mechanism design theory (Green-Laffont), the theory of proper scoring rules (Savage), and direct architectural analysis of transformers (Log-Sum-Exp convexity). In particular, we show how to quantify the creation of overconfident or intuitive responses-the signature of both hallucination and creativity, or imagination. To support this analysis, we introduce the complementary concepts of the semantic information measure and the emergence operator to model bounded reasoning in a general setting. We prove that while bounded reasoning generates accessible information, providing valuable insights and inspirations, the idealized unconstrained reasoning strictly preserves semantic content. By demonstrating that hallucination and imagination are mathematically identical phenomena-grounded in departures from truthfulness, semantic information conservation, revelation of relevant knowledge, and knowledge-constrained optimality-we offer a principled foundation for managing these behaviors in advanced AI systems. Finally, we present some speculative ideas to inspire evaluation and refinements of the proposed theory.
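
The abstract names Log-Sum-Exp convexity as one ingredient of the proof. The sketch below checks only that standard fact numerically: for logits coming from several components, the weighted mean of their Log-Sum-Exp values is never smaller than the Log-Sum-Exp of the weighted-mean logits. The two "components", their logits, and the function name `jensen_gap` are assumptions made for illustration; how the paper turns such a gap into a measure of overconfidence is not reproduced here.

```python
import numpy as np

def logsumexp(z):
    """Numerically stable log(sum(exp(z)))."""
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))

def jensen_gap(logit_sets, weights):
    """Weighted mean of LSE values minus LSE of the weighted-mean logits.

    Non-negative for any inputs because Log-Sum-Exp is convex.
    """
    logit_sets = np.asarray(logit_sets, dtype=float)
    weights = np.asarray(weights, dtype=float)
    mean_logits = weights @ logit_sets
    mean_of_lse = float(sum(w * logsumexp(z) for w, z in zip(weights, logit_sets)))
    return mean_of_lse - float(logsumexp(mean_logits))

# Two hypothetical components that disagree about three candidate tokens.
component_logits = [[2.0, 0.0, -1.0], [-1.0, 0.5, 2.0]]
gap = jensen_gap(component_logits, [0.5, 0.5])

# Components that agree exactly leave no gap.
no_gap = jensen_gap([[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]], [0.5, 0.5])
```

The gap is strictly positive when the components disagree and vanishes when they agree, which is the behavior convexity predicts.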

Paper link

https://arxiv.org/abs/2506.06382

Fantastic Pretraining Optimizers and Where to Find Them

Paper introduction

Pretraining optimizers play a central role in training large language models, and AdamW has long been the standard choice. Recent work has claimed 1.4x to 2x speedups for alternative optimizers, but this study shows that such claims may be overstated. The authors point to two main problems behind them: hyperparameter tuning is often unequal across optimizers, and the evaluation setup can be limited or misleading. To address both, they systematically compare 10 deep learning optimizers across several model scales and data-to-model ratios.

The core methodology splits hyperparameter tuning into three stages. In the first, each optimizer's hyperparameters are tuned finely to obtain its best performance. In the second, to keep the tuning cost manageable, only the hyperparameters that actually need re-tuning are selected. In the third, scaling laws are applied to estimate optimal hyperparameter values as a function of model size and data budget. This gives a fair and reproducible comparison between optimizers, and the results indicate that matrix-based optimizers consistently outperform scalar-based ones.
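
A per-optimizer sweep of the kind described for the first stage can be sketched as coordinate descent over a grid: change one hyperparameter at a time, keep the change if the final loss improves, and stop when nothing helps. The use of coordinate descent, the grid values, and `toy_loss` are assumptions for illustration; in a real study each `loss_fn` call would be a complete training run evaluated at the end of training.

```python
def coordinate_descent_sweep(loss_fn, grids, start, max_rounds=10):
    """Tune one hyperparameter at a time, holding the others fixed, until no change helps."""
    best = dict(start)
    best_loss = loss_fn(best)
    for _ in range(max_rounds):
        improved = False
        for name, candidates in grids.items():
            for value in candidates:
                trial = {**best, name: value}
                trial_loss = loss_fn(trial)
                if trial_loss < best_loss:
                    best, best_loss, improved = trial, trial_loss, True
        if not improved:
            break
    return best, best_loss

def toy_loss(hp):
    """Toy stand-in for the final validation loss of a full training run."""
    return (
        (hp["lr"] - 3e-3) ** 2 * 1e4
        + (hp["weight_decay"] - 0.1) ** 2
        + (hp["warmup"] - 2000) ** 2 * 1e-8
    )

grids = {
    "lr": [1e-3, 2e-3, 3e-3, 4e-3],
    "weight_decay": [0.0, 0.1, 0.2],
    "warmup": [500, 1000, 2000, 4000],
}
start = {"lr": 1e-3, "weight_decay": 0.0, "warmup": 500}
best, best_loss = coordinate_descent_sweep(toy_loss, grids, start)
```

Running the same sweep separately for each optimizer is what makes the comparison fair: the best values found for one optimizer are not assumed to transfer to another.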

The study underlines the importance of hyperparameter tuning and of evaluating across a range of model scales and data-to-model ratios, and it shows that hyperparameters that are optimal for one optimizer can be suboptimal for another. These findings should help set standards for how future optimizers are designed and evaluated.

Abstract

AdamW has long been the dominant optimizer in language model pretraining, even though many alternative optimizers have claimed speedups of 1.4x to 2x. We argue that two methodological shortcomings have obscured fair comparison and hindered practical adoption: (i) unequal hyperparameter tuning and (ii) limited or misleading evaluation setups. To address both, we conduct a systematic study of ten deep learning optimizers across four model scales (0.1B-1.2B parameters) and data-to-model ratios (1-8x the Chinchilla optimum). We find that fair and informative comparison requires rigorous hyperparameter tuning and evaluation across a range of model scales and data-to-model ratios, carried out at the end of training. First, hyperparameters that are optimal for one optimizer can be suboptimal for another, so blind hyperparameter transfer is unfair. Second, the actual speedup of many proposed optimizers over a well-tuned baseline is lower than claimed, and it shrinks with model size to only 1.1x for a 1.2B parameter model. Third, comparing intermediate checkpoints before the target training budget is reached can be misleading, because learning rate decay can flip the ranking of two optimizers during training. Our investigation finds that all of the fastest optimizers, such as Muon and Soap, use matrices as preconditioners, that is, they multiply gradients by matrices rather than by entry-wise scalars. However, the speedup of matrix-based optimizers is inversely proportional to model scale, falling from 1.4x over AdamW for a 0.1B parameter model to only 1.1x for a 1.2B parameter model.
> AdamW has long been the dominant optimizer in language model pretraining, despite numerous claims that alternative optimizers offer 1.4 to 2x speedup. We posit that two methodological shortcomings have obscured fair comparisons and hindered practical adoption: (i) unequal hyperparameter tuning and (ii) limited or misleading evaluation setups. To address these two issues, we conduct a systematic study of ten deep learning optimizers across four model scales (0.1B-1.2B parameters) and data-to-model ratios (1-8x the Chinchilla optimum). We find that fair and informative comparisons require rigorous hyperparameter tuning and evaluations across a range of model scales and data-to-model ratios, performed at the end of training. First, optimal hyperparameters for one optimizer may be suboptimal for another, making blind hyperparameter transfer unfair. Second, the actual speedup of many proposed optimizers over well-tuned baselines is lower than claimed and decreases with model size to only 1.1x for 1.2B parameter models. Thirdly, comparing intermediate checkpoints before reaching the target training budgets can be misleading, as rankings between two optimizers can flip during training due to learning rate decay. Through our thorough investigation, we find that all the fastest optimizers such as Muon and Soap, use matrices as preconditioners -- multiplying gradients with matrices rather than entry-wise scalars. However, the speedup of matrix-based optimizers is inversely proportional to model scale, decreasing from 1.4x over AdamW for 0.1B parameter models to merely 1.1x for 1.2B parameter models.
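
The distinction the abstract draws between entry-wise scalars and matrix preconditioners can be illustrated on a single weight matrix. In the sketch below, the entry-wise direction divides each gradient entry by its own scalar, as Adam-style methods do. The matrix direction replaces the gradient by `U @ V.T` from its SVD, which for a full-rank gradient equals multiplying it by a matrix, the inverse square root of `grad @ grad.T`. This is a stand-in and not any optimizer's actual implementation: Muon is commonly described as approximating this orthogonalization with a Newton-Schulz iteration applied to momentum, and both the exact SVD and the omission of momentum here are simplifying assumptions.

```python
import numpy as np

def entrywise_update(grad, second_moment, eps=1e-8):
    """Adam-style step direction: every entry is scaled by its own scalar."""
    return grad / (np.sqrt(second_moment) + eps)

def orthogonalized_update(grad):
    """Matrix-style step direction: U V^T from the SVD of the gradient.

    For a full-rank gradient this equals (grad grad^T)^(-1/2) @ grad,
    i.e. the gradient multiplied by a matrix preconditioner.
    """
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt

rng = np.random.default_rng(0)
grad = rng.normal(size=(4, 3))

# With a one-step second-moment estimate, the entry-wise direction reduces to sign(grad).
scalar_step = entrywise_update(grad, second_moment=grad ** 2)
# The matrix direction has all singular values equal to 1.
matrix_step = orthogonalized_update(grad)
```

The entry-wise direction treats the 12 entries independently, while the matrix direction depends on how rows and columns of the gradient relate to one another.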

Paper link

https://arxiv.org/abs/2509.02046

Further reading

https://wandb.ai/marin-community/optimizer-scaling

Anemoi: A Semi-Centralized Multi-agent Systems Based on Agent-to-Agent Communication MCP server from Coral Protocol

Paper introduction

Anemoi is a semi-centralized multi-agent system (MAS) built on Coral Protocol's A2A (Agent-to-Agent) communication model, which enables efficient task coordination through direct collaboration between agents. In a conventional centralized MAS, a planning agent coordinates several worker agents in one direction only. This increases dependence on the planner's capability, and the limited communication between agents leads to information loss and redundancy. Anemoi is designed to address these problems with a structure in which every agent can monitor progress in real time, identify bottlenecks, and suggest improvements.

At its core, Anemoi uses the A2A communication MCP (Model Context Protocol) server from Coral Protocol, which supports a seamless flow of information between agents. The system connects a planner agent with several domain-specialized worker agents: the planner supplies an initial plan, and the workers then coordinate directly with one another. This reduces dependence on the centralized planner, allows adaptive plan updates, and keeps execution cost-efficient by minimizing repeated context transfer.

Anemoi was evaluated on the GAIA benchmark, where it reached 52.73% accuracy with a small LLM (GPT-4.1-mini) as the planner. The paper reports this as +9.09 percentage points over OWL, the strongest open-source baseline, which scored 43.63% under the same setup. These results suggest that Anemoi's semi-centralized A2A communication model can improve the performance of multi-agent systems.

By improving direct collaboration and information flow between agents, this work opens up new possibilities for multi-agent systems and may contribute to the development of future generalist AI systems. The Anemoi implementation is public on GitHub, so researchers can build applications on top of it.

Abstract

Recent progress in generalist multi-agent systems (MAS) has mostly followed a combination of context engineering and a centralized paradigm, in which one planner agent coordinates several worker agents through one-way prompt passing. This works well with a strong planner model, but the design has two important limitations: (1) heavy dependence on the planner's capability, so performance drops when a smaller LLM drives the planner; (2) limited communication between agents, so collaboration depends on expensive prompt concatenation and context injection, which causes redundancy and information loss. To address these challenges we propose Anemoi, a semi-centralized MAS based on the Agent-to-Agent (A2A) communication MCP server from Coral Protocol. Unlike conventional designs, Anemoi enables structured, direct inter-agent collaboration, so that all agents can monitor progress, assess results, identify bottlenecks, and suggest improvements in real time. This paradigm reduces dependence on a single planner, supports adaptive plan updates, and minimizes redundant context passing, which makes execution more scalable and cost-efficient. On the GAIA benchmark, Anemoi reached 52.73% accuracy with a small LLM (GPT-4.1-mini) as the planner, beating the strongest open-source baseline OWL (43.63%) by +9.09% under the same LLM settings. Our implementation is publicly available at https://github.com/Coral-Protocol/Anemoi.
> Recent advances in generalist multi-agent systems (MAS) have largely followed a context-engineering plus centralized paradigm, where a planner agent coordinates multiple worker agents through unidirectional prompt passing. While effective under strong planner models, this design suffers from two critical limitations: (1) strong dependency on the planner's capability, which leads to degraded performance when a smaller LLM powers the planner; and (2) limited inter-agent communication, where collaboration relies on costly prompt concatenation and context injection, introducing redundancy and information loss. To address these challenges, we propose Anemoi, a semi-centralized MAS built on the Agent-to-Agent (A2A) communication MCP server from Coral Protocol. Unlike traditional designs, Anemoi enables structured and direct inter-agent collaboration, allowing all agents to monitor progress, assess results, identify bottlenecks, and propose refinements in real time. This paradigm reduces reliance on a single planner, supports adaptive plan updates, and minimizes redundant context passing, resulting in more scalable and cost-efficient execution. Evaluated on the GAIA benchmark, Anemoi achieved 52.73% accuracy with a small LLM (GPT-4.1-mini) as the planner, surpassing the strongest open-source baseline OWL (43.63%) by +9.09% under identical LLM settings. Our implementation is publicly available at https://github.com/Coral-Protocol/Anemoi.
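
The difference between one-way prompt passing and direct inter-agent communication can be sketched with a shared message thread that every agent reads and writes. Everything below is hypothetical: the `Thread` and `Agent` classes, the agent names, and the messages are made up for illustration and do not use or imitate the Coral Protocol MCP server API.

```python
from dataclasses import dataclass, field

@dataclass
class Thread:
    """Shared message log that every participating agent can read and append to."""
    messages: list = field(default_factory=list)

    def post(self, sender, text):
        self.messages.append((sender, text))

    def read(self, since=0):
        return self.messages[since:]

@dataclass
class Agent:
    name: str
    cursor: int = 0  # index of the first message this agent has not read yet

    def catch_up(self, thread):
        """Return all messages posted since this agent last read the thread."""
        new = thread.read(self.cursor)
        self.cursor += len(new)
        return new

thread = Thread()
planner, web, coder = Agent("planner"), Agent("web_worker"), Agent("code_worker")

# The planner only supplies the initial plan.
thread.post(planner.name, "plan: 1) find the dataset URL 2) compute the mean")
thread.post(web.name, "step 1 done: URL found, but the file is CSV, not JSON")

# The code worker reads the web worker's note directly, with no planner relay.
seen_by_coder = coder.catch_up(thread)
thread.post(coder.name, "suggest: adjust step 2 to parse CSV")

# The planner can still monitor everything and revise the plan if needed.
seen_by_planner = planner.catch_up(thread)
```

Because each agent keeps only a cursor into the shared thread, nothing has to be concatenated into a fresh prompt and passed through the planner for workers to react to one another.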

Paper link

https://arxiv.org/abs/2508.17068

Communication Efficient LLM Pre-training with SparseLoCo

Paper introduction

Improving communication efficiency in the pre-training of large language models (LLMs) is an important research topic. Recent distributed training algorithms have drawn attention because they help train LLMs in bandwidth-constrained environments, such as across data centers or over the internet. Existing methods, however, still have to transmit the model's full gradients, which creates a communication bottleneck and can degrade performance. SparseLoCo is a communication-efficient training algorithm proposed to address this: it uses Top-k sparsification and 2-bit quantization to reach extreme compression ratios while also improving performance.

SparseLoCo's key idea is that outer momentum can be approximated locally by error feedback combined with aggressive sparsification. This improves model performance while also cutting communication cost. The experiments show empirically that SparseLoCo delivers meaningful gains in both performance and communication cost across a variety of communication-constrained settings. In particular, with 1-3% sparsity and 2-bit quantization it substantially reduces communication cost relative to conventional DDP (Distributed Data Parallel) training while maintaining or improving performance.

The work presents a new way to improve communication efficiency in LLM pre-training and leaves room for further experiments and optimizations. SparseLoCo may contribute to more efficient training of large models and suggests a new direction for LLM research and development.

Abstract

Communication-efficient distributed training algorithms have attracted considerable attention recently because of their benefits for training large language models (LLMs) in bandwidth-constrained settings, such as between data centers and over the internet. These methods reduce how often communication happens, but they usually still have to communicate a full copy of the model's gradients, which creates a communication bottleneck even on cross-datacenter links. They can also cause a slight drop in performance compared with a naive AdamW DDP baseline. Quantization and error feedback are often used to shrink the pseudo-gradient, but in LLM pre-training existing approaches have not been able to exploit sparsification as well and have achieved only limited quantization. In this work we introduce SparseLoCo, a communication-efficient training algorithm for LLMs that makes effective use of Top-k sparsification and quantization to reach extreme compression ratios of up to 1-3% sparsity and 2-bit quantization while outperforming full-precision DiLoCo. Our key observations are that outer momentum can be approximated locally by error feedback combined with aggressive sparsity, and that sparse aggregation can actually improve model performance. We show empirically that SparseLoCo provides significant benefits in both performance and communication cost across a wide range of communication-constrained LLM training settings.
> Communication-efficient distributed training algorithms have received considerable interest recently due to their benefits for training Large Language Models (LLMs) in bandwidth-constrained settings, such as across data centers and over the internet. Despite reducing communication frequency, these methods still typically require communicating a full copy of the model's gradients-resulting in a communication bottleneck even for cross-datacenter links. Furthermore, they can slightly degrade performance compared to a naive AdamW DDP baseline. While quantization and error feedback are often applied to reduce the pseudo-gradient's size, in the context of LLM pre-training, existing approaches have been unable to additionally leverage sparsification and have obtained limited quantization. In this work, we introduce SparseLoCo, a communication-efficient training algorithm for LLMs that effectively leverages Top-k sparsification and quantization to reach extreme compression ratios of up to 1-3% sparsity and 2-bit quantization while outperforming full-precision DiLoCo. Our key observations are that outer momentum can be locally approximated by an error feedback combined with aggressive sparsity and that sparse aggregation can actually improve model performance. We empirically demonstrate in a range of communication-constrained LLM training settings that SparseLoCo provides significant benefits in both performance and communication cost.
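
The three ingredients named in the abstract, Top-k sparsification, low-bit quantization, and error feedback, can be combined as in the sketch below. It is a minimal illustration under stated assumptions and not the paper's implementation: the 2-bit scheme (sign times one of two magnitudes), the flat 1000-element vector, and the function names are all made up here, and the outer optimizer and aggregation across workers are left out.

```python
import numpy as np

def top_k_sparsify(vec, k):
    """Keep the k largest-magnitude entries of vec and zero the rest."""
    out = np.zeros_like(vec)
    idx = np.argpartition(np.abs(vec), -k)[-k:]
    out[idx] = vec[idx]
    return out

def quantize_2bit(vec):
    """Map each nonzero entry to one of 4 values: sign times the nearer of two magnitudes."""
    out = np.zeros_like(vec)
    nz = vec != 0
    if not nz.any():
        return out
    mags = np.abs(vec[nz])
    lo, hi = mags.min(), mags.max()
    levels = np.where(mags - lo < hi - mags, lo, hi)
    out[nz] = np.sign(vec[nz]) * levels
    return out

def compress_with_error_feedback(pseudo_grad, error, k):
    """Add the carried-over error, send a sparse quantized message, keep the remainder locally."""
    accumulated = pseudo_grad + error
    message = quantize_2bit(top_k_sparsify(accumulated, k))
    new_error = accumulated - message
    return message, new_error

rng = np.random.default_rng(0)
pseudo_grad = rng.normal(size=1000)
error = np.zeros(1000)

# k=20 keeps 2% of the entries, inside the 1-3% range quoted in the abstract.
message, error = compress_with_error_feedback(pseudo_grad, error, k=20)
```

Whatever is dropped or rounded away stays in `error` and is added back before the next round, so information is delayed rather than lost. The abstract's observation is that this local buffer can play the role of outer momentum.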

Paper link

https://arxiv.org/abs/2508.15706

Adaptive LLM Routing under Budget Constraints

Paper introduction

Progress in large language models (LLMs) has transformed natural language processing, but the high cost of these models and the need to answer very different kinds of queries well remain a challenge. This study recasts LLM routing as a contextual bandit problem and proposes a new algorithm, Preference-prior Informed LinUCB for Adaptive Routing (PILOT), for choosing the best LLM under a budget constraint. Existing supervised learning approaches need large labeled datasets. To get past that limitation, the study develops a method that adjusts LLM selection dynamically from user feedback.

PILOT has two main stages. In the first, offline human preference data is used to build a shared embedding space that reflects the affinity between queries and LLMs. The relationship between a query and an LLM is learned by minimizing a triplet loss. In the second stage, online bandit feedback is incorporated: an LLM is chosen for each query, and the observed reward is used to keep improving performance. The approach allows flexible, budget-aware resource allocation and can adapt to varied user needs.
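
The online stage can be sketched with standard disjoint LinUCB, where each candidate LLM is an arm with its own ridge-regression estimate and an exploration bonus. The text does not say how PILOT injects the offline preference prior, so the choice below, initializing each arm so that its starting estimate equals a given prior vector, is an assumption, as are the class name, the two 2-dimensional "LLM embeddings", and the rewards.

```python
import numpy as np

class LinUCBRouter:
    """Disjoint LinUCB: one ridge-regression model per arm (here, per candidate LLM)."""

    def __init__(self, prior_thetas, alpha=1.0, prior_strength=1.0):
        # prior_thetas: (n_arms, dim) array, e.g. LLM embeddings learned offline (assumed given).
        self.alpha = alpha
        n_arms, dim = prior_thetas.shape
        self.A = np.stack([prior_strength * np.eye(dim) for _ in range(n_arms)])
        # With b = A @ theta_prior, the initial estimate A^-1 b equals the prior.
        self.b = np.stack([prior_strength * t for t in prior_thetas])

    def scores(self, x):
        """Upper confidence bound per arm: estimated reward plus exploration bonus."""
        out = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            out.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return np.array(out)

    def select(self, x):
        return int(np.argmax(self.scores(x)))

    def update(self, arm, x, reward):
        """Fold the observed reward for the chosen arm into its statistics."""
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

prior = np.array([[1.0, 0.0], [0.0, 1.0]])  # two hypothetical LLMs in a 2-d embedding space
router = LinUCBRouter(prior, alpha=0.5)
query = np.array([0.9, 0.1])

first_choice = router.select(query)          # the prior favors arm 0 for this query
for _ in range(20):                          # arm 0 keeps returning zero reward
    router.update(0, query, reward=0.0)
later_choice = router.select(query)          # feedback has overridden the prior
```

Only the arm that was actually chosen is updated, which is the bandit setting: the router never needs to run every LLM on every query.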

The main contributions are a formalization of LLM routing under budget constraints and the PILOT algorithm for solving it. Experiments show that PILOT outperforms existing bandit baselines on a range of datasets while remaining cost-efficient. These findings matter for practical deployment of LLMs, and they point to future work on adapting better to different user needs and extending the method to more datasets.

Abstract

Large language models (LLMs) have transformed natural language processing, but their differing capabilities and costs create challenges in practical applications. LLM routing addresses this by dynamically choosing the most suitable LLM for each query or task. Earlier approaches treated it as a supervised learning problem and assumed complete knowledge of the optimal query-LLM pairings. In real-world scenarios such comprehensive mappings are not available, and user queries keep changing. We therefore propose to study LLM routing as a contextual bandit problem, which, unlike supervised routing, enables adaptive decision-making from bandit feedback without requiring exhaustive inference on every LLM for every query. To solve this problem we develop a shared embedding space for queries and LLMs, in which query and LLM embeddings are aligned to reflect their affinity. The space is first learned from offline human preference data and then refined through online bandit feedback. We realize this idea through Preference-prior Informed Linucb fOr adaptive rouTing (PILOT), a new extension of LinUCB. To handle different user budgets for model routing, we introduce an online cost policy modeled as a multi-choice knapsack problem, which ensures resource-efficient routing.
> Large Language Models (LLMs) have revolutionized natural language processing, but their varying capabilities and costs pose challenges in practical applications. LLM routing addresses this by dynamically selecting the most suitable LLM for each query/task. Previous approaches treat this as a supervised learning problem, assuming complete knowledge of optimal query-LLM pairings. However, real-world scenarios lack such comprehensive mappings and face evolving user queries. We thus propose to study LLM routing as a contextual bandit problem, enabling adaptive decision-making using bandit feedback without requiring exhaustive inference across all LLMs for all queries (in contrast to supervised routing). To address this problem, we develop a shared embedding space for queries and LLMs, where query and LLM embeddings are aligned to reflect their affinity. This space is initially learned from offline human preference data and refined through online bandit feedback. We instantiate this idea through Preference-prior Informed Linucb fOr adaptive rouTing (PILOT), a novel extension of LinUCB. To handle diverse user budgets for model routing, we introduce an online cost policy modeled as a multi-choice knapsack problem, ensuring resource-efficient routing.

Paper link

https://arxiv.org/abs/2508.21141

Reusing Computation in Text-to-Image Diffusion for Efficient Generation of Image Sets

Paper introduction

Text-to-image diffusion models are very effective at generating high-quality images, but their high computational cost is becoming a major challenge. Prior work has focused mainly on making the generation of an individual image more efficient, whereas this paper proposes a new approach that reduces redundancy across correlated prompts. The method exploits the coarse-to-fine nature of diffusion models, in which the early denoising steps capture structure shared by similar prompts.

The approach is training-free: it clusters prompts by semantic similarity and shares computation across each cluster during the early diffusion steps. Experiments show that, for models conditioned on image embeddings, the method can cut computational cost by up to 50% while maintaining or improving image quality. The authors also use UnClip's text-to-image prior to optimize how diffusion steps are allocated, which improves efficiency further.

The method integrates easily with existing text-to-image generation pipelines and scales to large prompt sets, which can help reduce the environmental and financial burden of generation. The work gives useful insight into the generative dynamics of diffusion models and may serve as a basis for future work on sustainable optimization strategies.

Abstract

Text-to-image diffusion models make high-quality image generation possible, but their computational cost is very high. Where earlier work focused on optimizing per-inference efficiency, we explore an orthogonal approach: reducing redundancy across correlated prompts. Our method takes advantage of the coarse-to-fine nature of diffusion models, in which the early denoising steps capture structures shared by similar prompts. We propose a training-free approach that clusters prompts by semantic similarity and shares computation in the early diffusion steps. Experiments show that, for models trained conditioned on image embeddings, our approach substantially reduces computational cost while improving image quality. Using UnClip's text-to-image prior, we improve diffusion step allocation for greater efficiency. Our method integrates easily with existing pipelines, scales with prompt sets, and reduces the environmental and financial burden of large-scale text-to-image generation. Project page: https://ddecatur.github.io/hierarchical-diffusion/
> Text-to-image diffusion models enable high-quality image generation but are computationally expensive. While prior work optimizes per-inference efficiency, we explore an orthogonal approach: reducing redundancy across correlated prompts. Our method leverages the coarse-to-fine nature of diffusion models, where early denoising steps capture shared structures among similar prompts. We propose a training-free approach that clusters prompts based on semantic similarity and shares computation in early diffusion steps. Experiments show that for models trained conditioned on image embeddings, our approach significantly reduces compute cost while improving image quality. By leveraging UnClip's text-to-image prior, we enhance diffusion step allocation for greater efficiency. Our method seamlessly integrates with existing pipelines, scales with prompt sets, and reduces the environmental and financial burden of large-scale text-to-image generation. Project page: https://ddecatur.github.io/hierarchical-diffusion/
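
The compute saving from sharing early denoising steps can be made concrete with a toy pipeline that counts denoiser calls. The sketch assumes a greedy cosine-similarity clustering, a shared trajectory conditioned on the cluster's mean embedding, and a trivial `toy_denoise_step`. None of these is taken from the paper, which does not spell out its clustering or conditioning details in the text above.

```python
import numpy as np

def cluster_prompts(embeddings, threshold=0.9):
    """Greedy clustering: join the first cluster whose centroid has cosine similarity >= threshold."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    clusters = []
    for i, e in enumerate(unit):
        for members in clusters:
            centroid = unit[members].mean(axis=0)
            if e @ centroid / np.linalg.norm(centroid) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def generate_set(embeddings, denoise_step, total_steps, shared_steps, seed=0):
    """Run the first shared_steps once per cluster, then branch into per-prompt trajectories."""
    rng = np.random.default_rng(seed)
    calls = 0
    images = {}
    for members in cluster_prompts(embeddings):
        centroid = embeddings[members].mean(axis=0)
        x = rng.normal(size=embeddings.shape[1])   # shared initial noise
        for t in range(shared_steps):              # coarse structure, computed once per cluster
            x = denoise_step(x, centroid, t)
            calls += 1
        for i in members:                          # fine detail, computed per prompt
            xi = x.copy()
            for t in range(shared_steps, total_steps):
                xi = denoise_step(xi, embeddings[i], t)
                calls += 1
            images[i] = xi
    return images, calls

def toy_denoise_step(x, cond, t):
    """Stand-in for one denoiser call: move the sample part of the way toward the condition."""
    return x + 0.2 * (cond - x)

# Prompts 0 and 1 are near-duplicates; prompt 2 is unrelated.
embeddings = np.array([[1.0, 0.0, 0.0], [0.98, 0.2, 0.0], [0.0, 0.0, 1.0]])
images, calls = generate_set(embeddings, toy_denoise_step, total_steps=10, shared_steps=4)
```

With two clusters, 4 shared steps, and 6 per-prompt steps, the run makes 2 × 4 + 3 × 6 = 26 denoiser calls instead of the 30 needed for three independent 10-step runs. The saving grows with cluster size and with the number of shared steps.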

Paper link

https://arxiv.org/abs/2508.21032

Further reading

https://ddecatur.github.io/hierarchical-diffusion/

Attention is a smoothed cubic spline

Paper introduction

Despite its importance, the attention module in the transformer architecture is still largely uncharted territory. This paper interprets the attention module as a smoothed cubic spline and so offers new insight from the viewpoint of classical approximation theory. The authors show that with a ReLU activation, attention, masked attention, and encoder-decoder attention can all be expressed as cubic splines. This matters because every component of a transformer is built by composing attention modules and feed-forward neural networks.

The paper also argues that, if the Pierce-Birkhoff conjecture is assumed, every spline can be expressed as a ReLU-activated encoder. This clarifies the mathematical nature of the attention module and deepens the structural understanding of transformers in terms of cubic splines. Furthermore, replacing ReLU with a smooth activation yields a smooth $C^\infty$ version, and choosing SoftMax as that activation recovers the original transformer.

By giving the attention mechanism a mathematical interpretation, this work deepens our understanding of existing machine learning models and explains the essence of the transformer architecture in terms of splines, a well-studied mathematical object. The abstract itself states only theoretical results and reports no experiments. The findings may inform future work on attention mechanisms and serve as foundational material for researchers in the area.

Paper abstract

> We highlight a perhaps important but hitherto unobserved insight: The attention module in a transformer is a smoothed cubic spline. Viewed in this manner, this mysterious but critical component of a transformer becomes a natural development of an old notion deeply entrenched in classical approximation theory. More precisely, we show that with ReLU-activation, attention, masked attention, encoder-decoder attention are all cubic splines. As every component in a transformer is constructed out of compositions of various attention modules (= cubic splines) and feed forward neural networks (= linear splines), all its components -- encoder, decoder, and encoder-decoder blocks; multilayered encoders and decoders; the transformer itself -- are cubic or higher-order splines. If we assume the Pierce-Birkhoff conjecture, then the converse also holds, i.e., every spline is a ReLU-activated encoder. Since a spline is generally just $C^2$, one way to obtain a smoothed $C^\infty$-version is by replacing ReLU with a smooth activation; and if this activation is chosen to be SoftMax, we recover the original transformer as proposed by Vaswani et al. This insight sheds light on the nature of the transformer by casting it entirely in terms of splines, one of the best known and thoroughly understood objects in applied mathematics.
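A quick way to see why ReLU attention is piecewise cubic: the scores $QK^\top$ are quadratic in the input $X$, the ReLU only selects which polynomial piece applies, and multiplying by $V$ adds one more degree. The sketch below checks one consequence of that, namely that the map is positively homogeneous of degree 3. It is an illustration under assumptions, not the paper's construction: the function names and the weight matrices are made up, and the usual $1/\sqrt{d}$ scaling is omitted.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def relu_attention(x, wq, wk, wv):
    """Attention with the SoftMax replaced by an elementwise ReLU: ReLU(Q K^T) V."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scores = matmul(q, transpose(k))                       # quadratic in x
    gated = [[max(0.0, s) for s in row] for row in scores]  # picks a polynomial piece
    return matmul(gated, v)                                 # cubic in x on each piece


# Arbitrary illustrative input and weights (hypothetical values).
x = [[1.0, -2.0], [0.5, 3.0]]
wq = [[1.0, 0.5], [0.0, 1.0]]
wk = [[1.0, 0.0], [-1.0, 2.0]]
wv = [[2.0, 1.0], [0.0, -1.0]]

base = relu_attention(x, wq, wk, wv)
scaled = relu_attention([[2.0 * e for e in row] for row in x], wq, wk, wv)

print(base)    # [[18.0, 27.0], [18.25, -45.625]]
print(scaled)  # every entry is 2**3 = 8 times the corresponding entry of base
```

Doubling the input multiplies the output by eight, which is what a degree-3 piecewise polynomial without lower-order terms must do. Swapping the ReLU for a row-wise SoftMax breaks this exact polynomial structure and gives the smooth version that the abstract identifies with the original transformer.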

Paper link

https://arxiv.org/abs/2408.09624

$Mem^p$: Exploring Agent Procedural Memory

Paper introduction

LLM-based agents perform well across a wide range of tasks, but their procedural memory is typically hand-designed or entangled in static parameters, which makes it brittle. This paper proposes $Mem^p$, a method for giving agents a learnable, updatable, lifelong procedural memory. $Mem^p$ distills past agent trajectories into fine-grained step-by-step instructions and higher-level scripts, and studies strategies for building (Build), retrieving (Retrieval), and updating (Update) procedural memory.

At the core of $Mem^p$ is a dynamic regimen that continuously updates, corrects, and deprecates procedural memory, so that the agent's memory repository evolves with new experience. In empirical evaluation on TravelPlanner and ALFWorld, agents' success rate and efficiency improved steadily as the memory was refined. Notably, procedural memory built with a stronger model retains its value: transferring it to a weaker model still yields substantial performance gains.

Retrieval is what lets the agent find the most similar past experience when it faces a new task. It measures similarity with a vector embedding model and retrieves the best-matching memory. The update mechanism, in turn, is designed so that entries can be dynamically added, removed, and revised as the number of completed tasks grows. Together, these mechanisms are intended to strengthen the agent's ability to learn and to carry out tasks across different environments.
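The Build, Retrieval, and Update cycle described above can be sketched as a small class. This is a minimal illustration under assumptions, not the $Mem^p$ implementation: the `ProceduralMemory` class, its method signatures, and the `max_failures` rule are hypothetical, and a toy bag-of-words vector stands in for the vector embedding model.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a real system would call an embedding model here."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ProceduralMemory:
    def __init__(self):
        self.entries = []  # each entry: task text, its vector, a script, a failure count

    def build(self, task, script):
        """Build: store a script distilled from a past trajectory."""
        self.entries.append(
            {"task": task, "vector": embed(task), "script": script, "failures": 0}
        )

    def retrieve(self, task):
        """Retrieval: return the stored entry most similar to the new task."""
        if not self.entries:
            return None
        query = embed(task)
        return max(self.entries, key=lambda e: cosine(query, e["vector"]))

    def update(self, entry, success, revised_script=None, max_failures=2):
        """Update: revise an entry after a failure, drop it after repeated failures."""
        if success:
            entry["failures"] = 0
            return
        entry["failures"] += 1
        if revised_script is not None:
            entry["script"] = revised_script
        if entry["failures"] >= max_failures:
            self.entries = [e for e in self.entries if e is not entry]


memory = ProceduralMemory()
memory.build("heat an egg and put it on the table",
             ["find egg", "heat egg in microwave", "place egg on table"])
memory.build("plan a three day trip to Seattle",
             ["pick flights", "book hotel", "list attractions"])

hit = memory.retrieve("heat a potato and put it on the counter")
print(hit["task"])  # heat an egg and put it on the table

memory.update(hit, success=False, revised_script=["find item", "heat item", "place item"])
memory.update(hit, success=False)
print(len(memory.entries))  # 1
```

The second failed reuse pushes the entry over the hypothetical `max_failures` limit, so it is discarded. The paper compares several concrete strategies for each of the three stages; the rules here only show where such strategies would plug in.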

By continually improving an agent's procedural memory, $Mem^p$ offers useful guidance for the design of future agent systems and underscores the importance of learning-based procedural memory. These results are expected to contribute to improving agent performance.

Paper abstract

> Large Language Models (LLMs) based agents excel at diverse tasks, yet they suffer from brittle procedural memory that is manually engineered or entangled in static parameters. In this work, we investigate strategies to endow agents with a learnable, updatable, and lifelong procedural memory. We propose $Mem^p$ that distills past agent trajectories into both fine-grained, step-by-step instructions and higher-level, script-like abstractions, and explore the impact of different strategies for Build, Retrieval, and Update of procedural memory. Coupled with a dynamic regimen that continuously updates, corrects, and deprecates its contents, this repository evolves in lockstep with new experience. Empirical evaluation on TravelPlanner and ALFWorld shows that as the memory repository is refined, agents achieve steadily higher success rates and greater efficiency on analogous tasks. Moreover, procedural memory built from a stronger model retains its value: migrating the procedural memory to a weaker model yields substantial performance gains.

Paper link

https://arxiv.org/abs/2508.06433

AlphaGo Moment for Model Architecture Discovery

Paper introduction

ASI-Arch is a system, presented as Artificial Superintelligence for AI research (ASI4AI), that autonomously discovers novel architectures in neural architecture search. Moving beyond traditional neural architecture search (NAS), which is confined to human-defined search spaces, it shifts the paradigm from automated optimization to automated innovation: it independently hypothesizes new architectural concepts, implements them, trains them, and validates them. Across 1,773 experiments over 20,000 GPU hours, it discovered 106 state-of-the-art linear attention architectures that surpass human-designed baselines and exhibit new design principles. It also presents an empirical scaling law for scientific discovery itself, which the authors take as evidence that research progress can scale with computational resources rather than being bounded by human cognitive limits.

Paper abstract

> While AI systems demonstrate exponentially improving capabilities, the pace of AI research itself remains linearly bounded by human cognitive capacity, creating an increasingly severe development bottleneck. We present ASI-Arch, the first demonstration of Artificial Superintelligence for AI research (ASI4AI) in the critical domain of neural architecture discovery--a fully autonomous system that shatters this fundamental constraint by enabling AI to conduct its own architectural innovation. Moving beyond traditional Neural Architecture Search (NAS), which is fundamentally limited to exploring human-defined spaces, we introduce a paradigm shift from automated optimization to automated innovation. ASI-Arch can conduct end-to-end scientific research in the domain of architecture discovery, autonomously hypothesizing novel architectural concepts, implementing them as executable code, training and empirically validating their performance through rigorous experimentation and past experience. ASI-Arch conducted 1,773 autonomous experiments over 20,000 GPU hours, culminating in the discovery of 106 innovative, state-of-the-art (SOTA) linear attention architectures. Like AlphaGo's Move 37 that revealed unexpected strategic insights invisible to human players, our AI-discovered architectures demonstrate emergent design principles that systematically surpass human-designed baselines and illuminate previously unknown pathways for architectural innovation. Crucially, we establish the first empirical scaling law for scientific discovery itself--demonstrating that architectural breakthroughs can be scaled computationally, transforming research progress from a human-limited to a computation-scalable process. We provide comprehensive analysis of the emergent design patterns and autonomous research capabilities that enabled these breakthroughs, establishing a blueprint for self-accelerating AI systems.

Paper link

https://arxiv.org/abs/2507.18074

Unsupervised Elicitation of Language Models

Paper introduction

Adapting a pretrained language model to a specific task currently requires human supervision, but for models with superhuman capabilities, high-quality human supervision can be difficult or impossible to obtain. To address this, the paper proposes Internal Coherence Maximization (ICM), an unsupervised algorithm that fine-tunes a model on labels it generates itself, without external supervision. On GSM8k-verification, TruthfulQA, and Alpaca reward modeling, ICM matches training on golden supervision and outperforms training on crowdsourced human supervision. On tasks where the model's capabilities are strongly superhuman, it elicits those capabilities better than training on human labels. The authors also used the method to train an unsupervised reward model and, through reinforcement learning, a Claude 3.5 Haiku-based assistant, both of which outperform their human-supervised counterparts.

Paper abstract

> To steer pretrained language models for downstream tasks, today's post-training paradigm relies on humans to specify desired behaviors. However, for models with superhuman capabilities, it is difficult or impossible to get high-quality human supervision. To address this challenge, we introduce a new unsupervised algorithm, Internal Coherence Maximization (ICM), to fine-tune pretrained language models on their own generated labels, without external supervision. On GSM8k-verification, TruthfulQA, and Alpaca reward modeling tasks, our method matches the performance of training on golden supervision and outperforms training on crowdsourced human supervision. On tasks where LMs' capabilities are strongly superhuman, our method can elicit those capabilities significantly better than training on human labels. Finally, we show that our method can improve the training of frontier LMs: we use our method to train an unsupervised reward model and use reinforcement learning to train a Claude 3.5 Haiku-based assistant. Both the reward model and the assistant outperform their human-supervised counterparts.

Paper link

https://arxiv.org/abs/2506.10139

This post is based on summaries generated by a GPT model, so parts of it may differ from the content or intent of the original papers. If a topic interests you, please read the original paper as well! If you notice anything awkward or incorrect while reading, please let us know in the comments. 🤗