[2026/06/01 ~ 07] AI/ML papers worth reading this week

PyTorchKR🔥🇰🇷 🤔💭

The 10 papers selected this week cluster around three main directions: state management in large language model (LLM) based agents, inference efficiency, and safety and verifiability in real-world environments. The trends range from structural changes that maximize agent efficiency, through fundamental redesigns of the transformer architecture, to ensuring robust behavior in dynamic real-world environments.

:one: Agent workflow innovation: externalizing state and internalizing reasoning logic

Two opposite but complementary approaches stood out this week, both aimed at the cost and context bottlenecks that agents hit on complex, long-horizon tasks. Harness-1 and AdaCoM move the burden of state or context management off the agent and onto the external environment or a separate management model, which makes long tasks more stable. Latent Agents and the workflow-internalization work (Subterranean Agents) go the other way: they propose post-training approaches that compile an external orchestrator, or the complex process of multi-agent communication, directly into the weights of a single model. The model can then carry out the debate or procedural reasoning itself without depending on prompts or external coordination, pointing toward sharply lower inference cost and token usage while keeping performance at or near frontier-model level.

:two: Redesigning the base architecture: attention fusion and parameter optimization

Foundational research that tackles the transformer's basic computational inefficiency and reduces memory use is another strong trend. The SISA paper (Forget Attention) uses 'score-level fusion', injecting the sequential importance signal of a state space model (SSM) directly into the attention score computation, to get both global retrieval and sequential prioritization. The QKV-variant study (Do Transformers Need Three Projections?) questions the convention that query, key, and value must always be separate projections, and shows empirically that a scheme sharing the key and value projection (Q-K=V) can substantially reduce the KV cache with minimal performance loss. These architecture-level improvements are not only about performance; they also make practical deployment on memory-limited edge devices and in on-device AI settings much more feasible.

:three: Real-time adaptation and system-level robustness in dynamic environments

Also notable is work that goes beyond producing correct answers, actively handling changing conditions and threats and letting the system evolve itself. MOSS extends earlier self-evolution approaches, which were limited to prompt modification, to source-code-level rewriting so that an agent system can repair its own structural defects. FuzzingBrain V2 uses multiple agents to detect and fix real software vulnerabilities in a 100% reproducible way. AdvGame frames language model safety alignment as a real-time non-cooperative game between an attacker and a defender to strengthen dynamic defense, while Plan, Watch, Recover presents a proactive assistant model that can intervene and coach in real time when a user deviates from the prescribed procedure. Together these show AI moving out of controlled lab settings and establishing itself as dependable, proactive systems amid unpredictable real-world errors and security threats.

Key takeaways by paper

Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses: A reinforcement-learning search agent that places the memory burden on the harness rather than the policy. It recorded an average curated recall of 0.730 across 8 benchmarks, with especially strong transfer performance.
Forget Attention: Importance-Aware Attention Is All You Need: Proposes SISA, which injects the importance signal of a state space model (SSM) directly into attention scores. It can be implemented with a single SDPA call and brings large gains in both retrieval and long-range dependency recovery.
Do Transformers Need Three Projections? Systematic Study of QKV Variants: A systematic analysis of how far QKV projections can be shared. Q-K=V cut the KV cache substantially while nearly preserving performance, and combining it with GQA/MQA saved even more memory.
Compiling Agentic Workflows into LLM Weights: Focuses on compiling the task procedure directly into model weights instead of relying on external orchestration. It reaches near-frontier quality while reducing repeated calls and long context consumption.
Learning Agent-Compatible Context Management for Long-Horizon Tasks: Proposes AdaCoM, in which an external LLM dynamically edits the context for a frozen agent. On long-horizon web search and research tasks it trims unnecessary past information while preserving task constraints.
Latent Agents: A Post-Training Procedure for Internalized Multi-Agent Debate: A post-training method that distills multi-agent debate into a single LLM. With up to 93% fewer tokens, it matched or exceeded explicit debate.
MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems: An agent system that self-evolves at the source-code level rather than the prompt level. It rewrites code structure based on evidence from real failures, and deploys after verification in a way that supports rollback.
Safety Alignment of LMs via Non-cooperative Games: Reframes safety alignment as a non-cooperative game in which an attacker LM and a defender LM adapt to each other. Through preference-based reinforcement learning it advances the Pareto frontier of safety and usefulness together.
Plan, Watch, Recover: A Benchmark and Architectures for Proactive Procedural Assistance: A proactive multimodal assistance system that learns when to intervene once a user strays from a procedure and how to bring them back. Real recovery-coaching performance is evaluated with EgoProactive and Pro²Bench.
FuzzingBrain V2: A Multi-Agent LLM System for Automated Vulnerability Discovery and Reproduction: A multi-agent LLM security system that automates vulnerability discovery and reproduction. Combining OSS-Fuzz-based verification, precise vulnerability localization, and hierarchical fuzzing, it achieved a high detection rate and discovered real vulnerabilities.

Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses

Paper introduction

Search agents are usually trained as policies over a growing transcript, where the model must decide how to search while also remembering what it has seen, which evidence is useful, which constraints are still open, and which claims have actually been verified. The authors argue that this setup puts too much state-management burden inside the policy, forcing reinforcement learning to optimize both meaningful search decisions and recoverable bookkeeping that the environment could handle more reliably.

To address this, they propose Harness-1, a 20B search agent trained with reinforcement learning inside a state-externalizing harness. The harness manages environment-side working memory: a candidate pool, a curated set with importance tags, compact evidence links, verification records, compressed and deduplicated observations, and budget-aware context rendering.

The policy, meanwhile, handles the semantic decisions: what to search, which documents to keep or discard, what to verify, and when to stop. Across 8 retrieval benchmarks covering web, finance, patents, and multi-hop question answering, Harness-1 achieved an average curated recall of 0.730, 11.4 points above the next strongest open search subagent. The gains were most pronounced on transfer benchmarks outside the training domains, suggesting that reinforcement learning over explicit search state can produce retrieval behavior that generalizes better.
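To make the division of labor concrete, here is a minimal sketch of what environment-side working memory could look like. Everything in it (the `SearchHarness` class, its method names, the character budget, the truncation used as "compression") is a hypothetical illustration and not the interface of the released Harness-1 code. The policy would only call `keep`, `discard`, and `verify`; deduplication, storage, and rendering stay with the harness.

```python
from dataclasses import dataclass, field


@dataclass
class SearchHarness:
    """Hypothetical environment-side working memory; names are illustrative."""
    budget_chars: int = 200
    candidates: dict = field(default_factory=dict)  # doc_id -> compressed text
    curated: dict = field(default_factory=dict)     # doc_id -> importance tag
    verified: dict = field(default_factory=dict)    # claim -> supporting doc_id

    def observe(self, doc_id: str, text: str) -> bool:
        """Store a search result once; return False for a duplicate."""
        if doc_id in self.candidates:
            return False
        self.candidates[doc_id] = text[:80]  # truncation stands in for compression
        return True

    def keep(self, doc_id: str, importance: str) -> None:
        """Policy decision: promote a candidate into the curated set."""
        if doc_id not in self.candidates:
            raise KeyError(doc_id)
        self.curated[doc_id] = importance

    def discard(self, doc_id: str) -> None:
        """Policy decision: drop a document from the curated set."""
        self.curated.pop(doc_id, None)

    def verify(self, claim: str, doc_id: str) -> None:
        """Policy decision: record which document supports a claim."""
        self.verified[claim] = doc_id

    def render(self) -> str:
        """Budget-aware context: 'high' items first, stop when the budget is hit."""
        order = sorted(self.curated, key=lambda d: self.curated[d] != "high")
        lines, used = [], 0
        for doc_id in order:
            line = f"[{self.curated[doc_id]}] {doc_id}: {self.candidates[doc_id]}"
            if used + len(line) > self.budget_chars:
                break
            lines.append(line)
            used += len(line)
        return "\n".join(lines)


harness = SearchHarness(budget_chars=20)
first_seen = harness.observe("d1", "alpha")
seen_again = harness.observe("d1", "alpha")  # duplicate is rejected
harness.observe("d2", "beta")
harness.keep("d1", "low")
harness.keep("d2", "high")
harness.verify("revenue figure", "d2")
context = harness.render()
```

With a 20-character budget only the high-importance document survives rendering, which is the kind of bookkeeping the paper argues the policy should not have to learn.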

Abstract

Search agents are often trained as policies over growing transcripts: the model must decide how to search while also remembering what it has seen, which evidence is useful, which constraints remain open, and which claims have actually been checked. We argue that this formulation puts too much routine state management inside the policy: reinforcement learning is forced to optimize both semantic search decisions and recoverable bookkeeping that the environment can maintain more reliably. We introduce Harness-1, a 20B search agent (retrieval subagent) trained with reinforcement learning inside a stateful search harness. The harness maintains environment-side working memory, including a candidate pool, an importance-tagged curated set, compact evidence links, verification records, compressed and deduplicated observations, and budget-aware context rendering. The policy retains the semantic decisions: what to search, which documents to keep or discard, what to verify, and when to stop. Across eight retrieval benchmarks spanning web, finance, patents, and multi-hop QA, Harness-1 achieves 0.730 average curated recall, outperforming the next strongest open search subagent by +11.4 points and remaining competitive with much larger frontier-model searchers. Its gains are especially strong on held-out transfer benchmarks, suggesting that reinforcement learning over explicit search state can produce retrieval behaviors that generalize beyond the training domains. Our code is available at https://github.com/pat-jj/harness-1.

Paper link

https://arxiv.org/abs/2606.02373

Further reading

https://github.com/pat-jj/harness-1

https://huggingface.co/pat-jj/harness-1

Forget Attention: Importance-Aware Attention Is All You Need

Paper introduction

In hybrid language modeling that combines Transformers with state space models (SSMs), the central challenge is how to keep two abilities alive at once: exploring information globally, and recognizing what matters in a sequence. Transformers can look anywhere but are limited at prioritizing; SSMs can accumulate important signals but struggle to go back and consult past information in detail. In that sense the two complement each other. Existing hybrids, however, mostly just place the two mechanisms side by side at the block level or head level, so the SSM's importance signal is not directly reflected at the moment attention scores are computed. Starting from this observation, the authors propose SSM-Informed Softmax Attention (SISA), a new form of integration in which the sequential importance signal from the SSM is injected into the score itself rather than into the attention output. The key idea is to add, alongside the standard inner-product term that captures content similarity, an inner-product term over importance vectors derived from the SSM, so that the relationship between tokens reflects not only content match but also "what matters right now".

Notably, the method needs no additional recurrent state or custom kernel: it can be implemented with a single Scaled Dot-Product Attention (SDPA) call on augmented queries and keys. In other words, SISA uses the SSM's sequential information mathematically, but from an implementation standpoint it is designed to fit the standard Transformer operation flow, which also preserves compatibility with FlashAttention-family optimizations. The SSM channel computes decay and rotation components from the input to build the importance signal, and that signal is shaped to act at the score level of attention, which directly improves retrieval. The experiments bear this out: at 152M parameters and 5B tokens, SISA scored 17.3% on LAMBADA-greedy, ahead of both the standard Transformer and Mamba-3, and it reached 100% on NIAH (Needle-in-a-Haystack) by training step 1K, a very fast retrieval convergence.
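The "single SDPA call" claim rests on a simple identity: if importance vectors are concatenated onto the queries and keys, one inner product over the augmented vectors equals the content term plus the importance term. The sketch below checks that identity in plain Python. The importance vectors `q_imp` and `k_imp` are made-up numbers; how SISA actually derives them from the SSM's decay and rotation components is not reproduced here. The scale is also passed explicitly, which is an assumption of this sketch, since a stock kernel would normally scale by the augmented width.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(score_fn, values, n):
    """Causal attention: row i mixes values[0..i] with softmax(score_fn(i, j))."""
    out = []
    for i in range(n):
        weights = softmax([score_fn(i, j) for j in range(i + 1)])
        out.append([dot(weights, [values[j][d] for j in range(i + 1)])
                    for d in range(len(values[0]))])
    return out


def sdpa(queries, keys, values, scale):
    """Stock scaled dot-product attention: the score is a single inner product."""
    return attend(lambda i, j: scale * dot(queries[i], keys[j]), values, len(queries))


q = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
q_imp = [[0.2], [0.4], [0.6]]  # placeholder importance vectors (hypothetical)
k_imp = [[1.0], [0.1], [2.0]]
scale = 1.0 / math.sqrt(2)

# Reference: content term and importance term added explicitly in the score.
fused = attend(
    lambda i, j: scale * (dot(q[i], k[j]) + dot(q_imp[i], k_imp[j])), v, len(q)
)

# Same result from one stock SDPA call on concatenated (augmented) vectors.
q_aug = [a + b for a, b in zip(q, q_imp)]
k_aug = [a + b for a, b in zip(k, k_imp)]
single_call = sdpa(q_aug, k_aug, v, scale)

max_gap = max(abs(a - b) for ra, rb in zip(fused, single_call) for a, b in zip(ra, rb))
```

Because the fusion happens inside the score, nothing about the attention routine itself changes, which is why no custom kernel or extra recurrent state is needed at this step.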

At the 369M scale SISA does not lead on every metric, but it keeps consistently strong performance on the retrieval tasks that matter without giving up stock SDPA execution, which is what makes it practically significant. The authors present this as a third design axis beyond block-level and head-level fusion, namely score-level fusion, and as a valid option for hybrid language models. Ultimately, the paper's contribution is not merely combining two model families but bringing the SSM's importance signal to the center of attention score construction, unifying global retrieval and sequential prioritization in a single operation. The approach shows how hybrid architectures can evolve in a more refined way for language modeling tasks where long-range dependency recovery and tracking of key information matter.

Abstract

Combining attention's global retrieval with the sequential importance signal of state space models (SSMs) is the open challenge of hybrid language modeling. Transformers see everywhere but cannot prioritize; SSMs know what matters but cannot revisit. Existing hybrids -- Jamba (block level) and Hymba (head level) -- place the two in separate compartments, so neither informs the other during the attention computation itself. We propose SISA (SSM-Informed Softmax Attention), which adds an SSM-derived importance term directly inside the attention score and realizes the full operation as a single SDPA call on augmented query/key vectors -- no recurrent state, no custom kernel. At 152M / 5B tokens, SISA reaches LAMBADA-greedy 17.3% (vs. Transformer 13.9 and Mamba-3 15.5) and attains NIAH 100% from step 1K, 7x faster than Transformer's retrieval convergence; at 369M, Mamba-3 leads LAMBADA while SISA preserves perfect NIAH and stock-SDPA execution. SISA thus defines a third design axis for SSM-attention hybrids -- score-level fusion -- beyond the block-level and head-level paradigms that have dominated the field.

Paper link

https://arxiv.org/abs/2606.02332

Do Transformers Need Three Projections? Systematic Study of QKV Variants

Paper introduction

The core of the Transformer's performance is QKV (query-key-value) attention, yet how independently each projection is actually needed has not been systematically examined. This study targets that gap, analyzing how weight tying inside attention affects representational power and inference efficiency under three projection-sharing constraints: Q-K=V, Q=K-V, and Q=K=V. In particular, it notes that the last two variants tend to make the attention map symmetric, and therefore also tests designs that add two-dimensional positional encodings to restore directional behavior. The discussion thus extends beyond parameter reduction to the question of how sharing changes the structure of the representational space. The value of the approach is that it does not merely ask whether projection sharing hurts performance; it distinguishes the conditions under which quality is preserved from those under which attention's directionality and selectivity degrade.
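The symmetry point can be checked directly: when queries and keys come from the same projection, the pre-softmax score matrix is a Gram matrix and therefore symmetric, so token i scores token j exactly as j scores i. The small sketch below uses arbitrary toy inputs and weights (all values are illustrative assumptions) and looks only at raw scores, before any masking, positional encoding, or softmax.

```python
def matmul(a, b):
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def scores(x, w_q, w_k):
    """Pre-softmax attention scores (X W_q)(X W_k)^T, without scaling or masking."""
    return matmul(matmul(x, w_q), transpose(matmul(x, w_k)))


def is_symmetric(m, tol=1e-12):
    return all(abs(m[i][j] - m[j][i]) <= tol
               for i in range(len(m)) for j in range(len(m)))


x = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]  # three toy token embeddings
w_a = [[1.0, 0.5], [-1.0, 2.0]]             # toy projection weights
w_b = [[0.0, 1.0], [2.0, 1.0]]

separate = scores(x, w_a, w_b)  # distinct query and key projections
tied = scores(x, w_a, w_a)      # Q = K: one shared projection

separate_symmetric = is_symmetric(separate)
tied_symmetric = is_symmetric(tied)
```

With separate projections the score from token 0 to token 1 differs from the reverse; with the tied projection the two are identical, which is the loss of directionality that the 2D positional encodings are meant to compensate for.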

The experiments span synthetic tasks, vision, and language modeling, to check that the effect of projection sharing is not confined to one data domain. The synthetic tasks use manipulation problems such as reversal, sorting, substitution, swapping, and copying to see how well a model learns structural relationships. The vision experiments use MNIST, CIFAR, TinyImageNet, and anomaly detection to test generalization in settings where spatial position information matters. For language modeling, 300M- and 1.2B-parameter models were trained on 10B tokens to see whether the same trend holds at larger scale. The result: Q-K=V performs roughly on par with, and sometimes better than, the original QKV Transformer, and in language modeling it cuts the key-value (KV) cache by 50% with only a 3.1% perplexity degradation.

More importantly, this reduction combines complementarily with grouped query attention (GQA) and multi-query attention (MQA). Using Q-K=V with GQA-4 reduces the KV cache by 87.5%, and combining it with MQA reduces it by 96.9%, a real benefit for on-device inference. From these results the authors argue that keys and values can in fact share a similar representational space, and that because attention operates in a low-rank regime, a fully separate QKV split is not always necessary. Q=K-V, by contrast, ties the query and key too tightly, which weakens attention's directionality and makes it less favorable in both performance and stability.
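The three reduction figures follow from simple counting. The sketch below assumes a 16-head baseline that caches one key tensor and one value tensor per head, and reads "GQA-4" as 4 key/value heads. Both are assumptions, chosen because they reproduce the reported 50%, 87.5%, and 96.9%. The function name is illustrative.

```python
def kv_cache_reduction(num_heads: int, kv_heads: int, share_kv: bool) -> float:
    """Fraction of the baseline KV cache that is removed.

    Baseline: every head caches one key tensor and one value tensor.
    share_kv=True models Q-K=V, where one cached tensor serves as both.
    """
    baseline = 2 * num_heads
    tensors_per_kv_head = 1 if share_kv else 2
    return 1 - (tensors_per_kv_head * kv_heads) / baseline


HEADS = 16  # assumed head count, not stated in the summary

shared_only = kv_cache_reduction(HEADS, kv_heads=16, share_kv=True)  # 0.5
shared_gqa4 = kv_cache_reduction(HEADS, kv_heads=4, share_kv=True)   # 0.875
shared_mqa = kv_cache_reduction(HEADS, kv_heads=1, share_kv=True)    # 0.96875
```

Projection sharing halves what each KV head stores and head sharing reduces how many KV heads exist, so the two savings multiply rather than overlap: 1/2 × 1/4 leaves 12.5% of the cache, and 1/2 × 1/16 leaves about 3.1%.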

Overall, the study invites us to treat the Transformer's QKV structure not as a given standard but as a design space open to reconsideration, and it provides concrete empirical criteria for which projections can be shared and which roles should stay separate. Because memory use can be cut substantially while performance is largely preserved, the result can be read as an important design guideline for efficient deployment in constrained environments such as edge devices.

Abstract

Transformers have become the standard solution for a variety of AI tasks, with the query, key, value (QKV) attention formulation playing a central role. However, the individual contribution of these three projections, and the effect of removing some of them, is still not well understood. We systematically evaluate three projection sharing constraints: a) Q-K=V (shared key-value), b) Q=K-V (shared query-key), and c) Q=K=V (single projection). The last two variants produce symmetric attention maps; to address this, we also study asymmetric attention via 2D positional encodings. Through experiments spanning synthetic tasks, vision (MNIST, CIFAR, TinyImageNet, anomaly), and language modeling (300M and 1.2B parameter models on 10B tokens), we find that our transformers perform on par with, and sometimes better than, the QKV transformer. In language modeling, Q-K=V projection sharing achieves a 50% KV cache reduction with only 3.1% perplexity degradation. Importantly, projection sharing is complementary to head sharing (GQA/MQA): combining Q-K=V with GQA-4 yields an 87.5% cache reduction, while Q-K=V + MQA reaches 96.9%, enabling practical on-device inference. We show that Q-K=V preserves quality because keys and values can live in similar representational spaces and attention operates in a low-rank regime, whereas Q=K-V breaks attention directionality. Our results systematically characterize projection sharing as an under-explored form of weight tying in attention, one that gives direct, measurable inference memory benefits, especially for edge deployment. The code is publicly available at https://github.com/Brainchip-Inc/Do-Transformers-Need-3-Projections.

Paper link

https://arxiv.org/abs/2606.04032

Compiling Agentic Workflows into LLM Weights: Near-Frontier Quality at Two Orders of Magnitude Less Cost

Paper introduction

The rapid spread of agent orchestration frameworks shows that controlling complex tasks with an external orchestrator on top of a large language model (LLM) has become close to a standard approach, but this paper questions whether that structure is always best for procedural tasks. An external orchestrator injects instructions and routing decisions on every turn, which makes control and debugging easier, but it has costs: it keeps consuming the context window, a frontier model must be called for every conversation, and the procedure itself can be exposed to third-party providers. In response, the authors propose compiling the task procedure directly into the weights of a small fine-tuned model rather than keeping it in the prompt, producing an agent that has internalized the procedure and needs no separate orchestration at runtime. This removes the need to inject the procedure repeatedly from outside, so cost drops sharply, long context is not consumed, and sensitive workflows are not exposed to external services. The authors call such agents, in which the procedure operates hidden inside the model, subterranean agents, and distinguish them clearly from existing orchestration-centric designs.

The methodology goes beyond presenting a concept: it focuses on testing, in realistic task settings, the three perceived barriers that make developers hesitate to adopt this approach. First, performance: can small models deliver frontier-level quality? Second, knowledge internalization: can frequently changing information such as product-specific knowledge be absorbed into the weights? Third, scale: does the approach extend to large, complex workflows full of branching and hubs? To answer these, the researchers chose three domains of different character, travel booking, Zoom support, and insurance claims, so that the compiled approach could be compared under differing demands on procedural depth and domain knowledge. Travel booking tests the stability of state transitions and step-by-step decision-making through a standard 14-node procedural flow. Zoom support highlights that a workflow of the same size can still require product-specific policies and feature knowledge. Insurance claims, with a more complex structure of 55 nodes and 6 decision hubs, serves as a realistic stress test that requires conditional branching and policy calculation at the same time.
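To give the "nodes" and "decision hubs" vocabulary something concrete, here is a toy workflow graph. The graph, the node names, and the reading of a decision hub as any node with more than one outgoing edge are all assumptions made for illustration; the summary does not say how the authors define hubs or how they produce fine-tuning data. The path enumeration simply shows how quickly distinct procedure traces multiply once branching is introduced.

```python
def decision_hubs(graph):
    """Nodes with more than one outgoing edge (assumed meaning of 'hub')."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)


def enumerate_paths(graph, start):
    """Every complete path from start to a terminal node (acyclic graphs only)."""
    paths = []
    stack = [[start]]
    while stack:
        path = stack.pop()
        successors = graph[path[-1]]
        if not successors:
            paths.append(path)
        for nxt in reversed(successors):
            stack.append(path + [nxt])
    return paths


# Hypothetical miniature claim flow, far smaller than the paper's 55-node case.
claim_flow = {
    "intake": ["verify_policy"],
    "verify_policy": ["reject", "assess_damage"],
    "assess_damage": ["compute_payout"],
    "compute_payout": ["explain_payout"],
    "explain_payout": [],
    "reject": [],
}

hubs = decision_hubs(claim_flow)
paths = enumerate_paths(claim_flow, "intake")
```

An external orchestrator would walk this graph at runtime and inject the next instruction on each turn; the compiled approach instead expects the fine-tuned model to follow such paths without being told which node it is on.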

The implications of the results are clear. A small model with the procedure compiled into its weights can maintain near-frontier quality while cutting cost by roughly two orders of magnitude, which forces a rethink of the usual trade-off between performance and efficiency. The insurance claims case in particular shows that the model does not stop at generating answers but can consistently carry out procedural reasoning that spans verification, branching, reimbursement calculation, and payout guidance. These results suggest that when a task is repeatable and its structure relatively stable, a compiled approach that learns the procedure itself may be more suitable than going through external orchestration every time. Limitations remain: retraining may be needed when the procedure changes, and the approach may be weaker than prompt-based methods in quick revision and interpretability. Even so, the work widens the range of options in agent design. It challenges the prevailing assumption that agent workflows must always be assembled externally, and shows empirically that moving the procedure inside the model can be a fully viable option in practice.

Abstract

Agent orchestration frameworks have proliferated, collectively exceeding 290,000 GitHub stars across LangGraph, CrewAI, Google ADK, OpenAI Agents SDK, Semantic Kernel, Strands, and LlamaIndex. All follow the same pattern: an external orchestrator above the LLM, injecting instructions and routing decisions every turn. Recent work has shown this architecture is dominated for procedural tasks by simply providing the procedure in a frontier model's system prompt [Dennis et al., 2026a], at the cost of consuming the context window, requiring a frontier model for every conversation, and exposing proprietary procedures to third-party providers. Compiling the procedure into the weights of a small fine-tuned model -- creating a subterranean agent -- should resolve all of these concerns, and prior work (SimpleTOD, FireAct, SynTOD, WorkflowLLM, Agent Lumos) has shown the technique works. Yet developer adoption has overwhelmingly favored orchestration. We identify three perceived barriers and address each empirically across travel booking (14 nodes), Zoom support (14 nodes, product-specific knowledge), and insurance claims (55 nodes, 6 decision hubs).

Paper link

https://arxiv.org/abs/2605.22502

Further reading

https://discuss.pytorch.kr/t/llm-subterranean-agent/10501

Learning Agent-Compatible Context Management for Long-Horizon Tasks

Paper introduction

When LLM-based agents carry out long-horizon tasks such as web search or deep research, where steps are many and intermediate decisions pile up, one of the biggest obstacles is that as the conversation grows, useful signals get mixed with stale, unnecessary information and reasoning starts to wobble. Existing context management methods often train the agent's internal policy jointly or rely on fixed strategies such as summarization, but these are hard to apply to closed-source agents and do not adequately reflect the fact that different agents may need different kinds of management. Adaptive Context Management (AdaCoM) addresses this by keeping the agent frozen while a separate, external LLM learns to edit the context dynamically. The point is not just to compress a long conversation but to learn flexible edit actions: deleting, rewriting, and merging at the message level so that the constraints and progress needed for the current task are preserved and old noise is removed. The design matters because it reframes context management from static preprocessing into a policy learning problem that directly improves the agent's success rate.
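A message-level edit step might look like the sketch below. The edit schema (`op`, `index`, `indices`, `text`) and the function name are hypothetical; AdaCoM's actual structured output format is not given in this summary. In the real system the edits would come from the learned context manager rather than being written by hand.

```python
def apply_edits(messages, edits):
    """Apply delete / rewrite / merge edits, indexed against the original list."""
    replacement = {}  # original index -> new text, or None to drop the message
    for edit in edits:
        if edit["op"] == "delete":
            replacement[edit["index"]] = None
        elif edit["op"] == "rewrite":
            replacement[edit["index"]] = edit["text"]
        elif edit["op"] == "merge":
            first, *rest = edit["indices"]
            replacement[first] = edit["text"]  # merged text takes the first slot
            for index in rest:
                replacement[index] = None
        else:
            raise ValueError(f"unknown op: {edit['op']}")

    edited = []
    for index, message in enumerate(messages):
        text = replacement.get(index, message)  # untouched messages pass through
        if text is not None:
            edited.append(text)
    return edited


history = [
    "task: find 2023 revenue and cite two sources",
    "tool call: search('company revenue')",
    "result: page A, unrelated press release",
    "result: page B reports 4.2B",
    "result: page C reports 4.2B",
]
edits = [
    {"op": "rewrite", "index": 1, "text": "searched for company revenue"},
    {"op": "delete", "index": 2},
    {"op": "merge", "indices": [3, 4], "text": "pages B and C both report 4.2B"},
]
compact = apply_edits(history, edits)
```

The task message with its constraint ("cite two sources") is left alone, while the irrelevant result is dropped and the two agreeing results are merged, which mirrors the stated goal of preserving constraints and progress while removing noise.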

AdaCoM सबसे पहले supervised fine-tuning (SFT) से शुरू होता है, ताकि context manager structured output format का अभ्यस्त हो सके, और फिर वास्तविक task performance को reward बनाकर Group Relative Policy Optimization (GRPO) के जरिए policy को और परिष्कृत करता है। इस प्रक्रिया में manager मौजूदा context को prompt में बदलकर input के रूप में लेता है, और Markov decision process (MDP) के दृष्टिकोण से हर चरण पर यह चुनता है कि किन messages को बनाए रखना है या संशोधित करना है। साथ ही, केवल अंतिम उत्तर को देखने के बजाय, context length overflow, दोहराए जाने वाले tool calls, format errors, और मध्यवर्ती task signals को दर्शाने वाले process reward भी तैयार किए जाते हैं, ताकि long-horizon tasks में महत्वपूर्ण स्थानीय edit quality भी सीखी जा सके। इसके माध्यम से AdaCoM एक साधारण summarizer नहीं, बल्कि ऐसा adaptive editing policy बनकर काम करता है जो agent को स्थिर रूप से सोच की निरंतरता बनाए रखने में मदद देता है।
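The message-level edit actions described above (delete, rewrite, merge) can be illustrated with a small sketch. This is not AdaCoM's actual interface: the `Message` and `EditAction` types, the action names, and `apply_edits` are all hypothetical, and in the paper the actions would come from a trained external LLM rather than being written by hand.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Message:
    role: str
    content: str


@dataclass
class EditAction:
    # kind is one of "keep", "delete", "rewrite", "merge" (hypothetical names).
    kind: str
    index: int                          # message the action targets
    text: Optional[str] = None          # new content for "rewrite" / "merge"
    merge_until: Optional[int] = None   # inclusive end index for "merge"


def apply_edits(context: List[Message], actions: List[EditAction]) -> List[Message]:
    """Apply message-level edits and return a new context list."""
    by_index = {a.index: a for a in actions}
    edited: List[Message] = []
    i = 0
    while i < len(context):
        action = by_index.get(i)
        msg = context[i]
        if action is None or action.kind == "keep":
            edited.append(msg)          # untouched messages pass through
            i += 1
        elif action.kind == "delete":
            i += 1                      # drop stale content
        elif action.kind == "rewrite":
            edited.append(Message(msg.role, action.text or ""))
            i += 1
        elif action.kind == "merge":
            end = action.merge_until if action.merge_until is not None else i
            edited.append(Message(msg.role, action.text or ""))
            i = end + 1                 # skip every message folded into the merge
        else:
            raise ValueError(f"unknown edit kind: {action.kind}")
    return edited


context = [
    Message("user", "Find the 2019 revenue of company X; answer in USD."),
    Message("tool", "search results page 1 (stale)"),
    Message("tool", "search results page 2"),
    Message("tool", "search results page 3"),
    Message("assistant", "Pages 2 and 3 agree on the figure."),
]
actions = [
    EditAction("delete", 1),
    EditAction("merge", 2, text="Pages 2-3 report the same 2019 figure.", merge_until=3),
]
new_context = apply_edits(context, actions)
```

In this toy run the user's constraint ("answer in USD") and the agent's latest progress survive, while three tool messages shrink to one. That is the kind of outcome the task-level reward is meant to encourage.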

In experiments, applying AdaCoM to a range of agents on web search and deep research benchmarks improved performance. In particular, agents whose ReAct (Reasoning and Acting) baseline performance was high benefited more from higher-fidelity context preservation, whereas for relatively weaker agents it was more effective to stay within a stable reasoning regime through more aggressive compression. The authors interpret this as a fidelity-reliability trade-off and show that context management should vary with the agent's capability level. Transfer experiments further showed that AdaCoM's strategies transfer better between agents with similar capability characteristics, which suggests that a reusable external context manager may be a more practical direction than a single universal summarization rule. Overall, the work is a meaningful methodological advance because it does not attribute failure on long-horizon tasks to the agent's reasoning ability alone, and instead treats the context management that supports that reasoning as a learnable core component.

Abstract

LLM agents increasingly face long-horizon tasks such as web search and deep research in real-world applications, where accumulated context can cause long-context degradation and reasoning failures. Prior work mitigates this through context management with agent-side context control or fixed strategies such as summarization, which require training the agent itself for adaptation - making it impractical for closed-source agents and ignoring that different agents may require different strategies. We introduce Adaptive Context Management (AdaCoM), which trains an external LLM to manage the context of a frozen agent through flexible modification actions and end-to-end reinforcement learning. Across diverse agents on web search and deep research benchmarks, AdaCoM substantially improves performance by preserving task constraints and progress while pruning stale content. The learned strategies reveal a Fidelity-Reliability Trade-off: agents with higher vanilla ReAct performance benefit from higher-fidelity context preservation, whereas lower-performing agents require more aggressive compression to stay within a reliable reasoning regime. Transfer experiments show that AdaCoM generalizes most effectively across agents with similar capability (measured by vanilla ReAct performance), suggesting a practical path toward reusable context managers for agent systems.

Paper link

https://arxiv.org/abs/2605.30785

Latent Agents: A Post-Training Procedure for Internalized Multi-Agent Debate

Paper introduction

Multi-agent debate is a powerful way to improve the reasoning performance of large language models (LLMs), but it has a major limitation: several agents must exchange a long debate history, which greatly increases computation cost. To address this inefficiency, Latent Agents proposes a post-training procedure that distills externally conducted multi-agent debate into a single language model. The key idea goes beyond compressing only the outcome of a debate: the model first learns the structure of the debate itself and then internalizes it through reinforcement learning (RL). To this end, the authors first prepared debate data consisting of 3 agents and 2 rounds, and added structure tags to arithmetic-problem debate records in which a final consensus had been reached, producing a consistent format. In the subsequent supervised fine-tuning (SFT) stage, the model was trained on the full debate traces as-is so that it could imitate how a debate unfolds and how consensus forms.

The reinforcement learning stage that follows moves past format imitation to actually internalize the debate. Here Group Relative Policy Optimization (GRPO) is used to compare several candidate outputs, and a length clipping reward is added so that the correct answer appears relatively early. At the same time, the format reward that helps maintain structure tags such as <|Agent 1|>, <|Round 1|>, and <|endofdebate|> is gradually reduced, so that the model can reach a conclusion from its internal representations alone without depending on a long external debate. This dynamic reward scheduling and length reduction play an important role in shrinking the computational footprint of the debate while preserving the reasoning benefits that come from interaction among agents. The experimental results show that the proposed model matched or exceeded explicit multi-agent debate on GSM8K, MMLU-Pro, and Big-Bench Hard (BBH) while using up to 93% fewer tokens, a substantial gain in reasoning efficiency. Notably, in some settings SFT alone already outperformed the existing debate method, and adding RL strengthened both the accuracy and the token-reduction effects, which makes the effectiveness of the internalization procedure clear.
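To make the reward shaping above concrete, here is a minimal sketch of one way a decaying format reward, a length clip, and a group-relative advantage could fit together. The linear decay schedule, the token budget, and the way the terms are summed are assumptions chosen for illustration; the paper's actual reward definitions may differ.

```python
import re
from statistics import mean, pstdev
from typing import List

# Matches the structure tags quoted in the text, e.g. <|Agent 1|> or <|endofdebate|>.
TAG_PATTERN = re.compile(r"<\|(Agent \d+|Round \d+|endofdebate)\|>")


def format_weight(step: int, total_steps: int, start: float = 1.0) -> float:
    """Linearly decay the format-reward weight from `start` to 0 (assumed schedule)."""
    return start * max(0.0, 1.0 - step / total_steps)


def reward(output: str, correct: bool, n_tokens: int, step: int,
           total_steps: int, max_tokens: int = 512) -> float:
    """Accuracy reward with a length clip, plus a format bonus that fades over training."""
    has_tags = 1.0 if TAG_PATTERN.search(output) else 0.0
    # Length clip (assumed form): a correct answer past the token budget earns nothing.
    accuracy = 1.0 if (correct and n_tokens <= max_tokens) else 0.0
    return accuracy + format_weight(step, total_steps) * has_tags


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """GRPO-style advantage: standardize rewards within one sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0.0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]
```

Early in training (`step` near 0) a tagged debate transcript earns the format bonus. By the final step the bonus is zero, so a short, correct, untagged answer scores as well as a tagged one, which pushes the policy toward shorter outputs.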

Another important contribution of this work is a mechanistic analysis of how internalized debate changes the model's representation space. Through activation steering experiments, the authors show that agent-specific subspaces form inside the internalized model and that there are interpretable directions corresponding to different agent perspectives. This suggests that the benefits of multi-agent debate do not come merely from averaging output text, but are tied to different reasoning perspectives being structurally separated in latent space and then recombined. Beyond this, experiments that internalize a malicious agent and then suppress it through negative steering show that harmful behavior in the distilled model can be more localized and easier to control. As a result, Latent Agents not only presents a cost-efficient way to compress multi-agent reasoning, but also sheds light on the structure and controllability of internalized reasoning.
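Negative steering can be sketched in a few lines. The example below uses a difference-of-means direction, a common way to obtain a steering vector that is assumed here and not confirmed as the paper's method. The helper names are hypothetical, and plain Python lists stand in for real hidden-state tensors.

```python
import math
from typing import List, Sequence


def mean_vector(rows: Sequence[Sequence[float]]) -> List[float]:
    """Column-wise mean of a batch of activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_direction(with_agent: Sequence[Sequence[float]],
                       without_agent: Sequence[Sequence[float]]) -> List[float]:
    """Unit vector from mean 'agent absent' activations toward mean 'agent present' ones."""
    diff = [a - b for a, b in zip(mean_vector(with_agent), mean_vector(without_agent))]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("the two activation sets have identical means")
    return [d / norm for d in diff]


def steer(hidden: Sequence[float], direction: Sequence[float], alpha: float) -> List[float]:
    """Shift a hidden state along the direction; alpha < 0 is negative steering."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


direction = steering_direction([[2.0, 0.0], [4.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
suppressed = steer([5.0, 1.0], direction, alpha=-2.0)
```

Only the component along the agent's direction changes; the orthogonal component is untouched. This matches the intuition that a well-localized behavior can be suppressed with little collateral change.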

Abstract

Multi-agent debate has been shown to improve reasoning in large language models (LLMs). However, it is compute-intensive, requiring generation of long transcripts before answering questions. To address this inefficiency, we develop a framework that distills multi-agent debate into a single LLM through a two-stage fine-tuning pipeline combining debate structure learning with internalization via dynamic reward scheduling and length clipping. Across multiple models and benchmarks, our internalized models match or exceed explicit multi-agent debate performance using up to 93% fewer tokens. We then investigate the mechanistic basis of this capability through activation steering, finding that internalization creates agent-specific subspaces: interpretable directions in activation space corresponding to different agent perspectives. We further demonstrate a practical application: by instilling malicious agents into the LLM through internalized debate, then applying negative steering to suppress them, we show that distillation makes harmful behaviors easier to localize and control with smaller reductions in general performance compared to steering base models. Our findings offer a new perspective for understanding multi-agent capabilities in distilled models and provide practical guidelines for controlling internalized reasoning behaviors. Code available at https://github.com/johnsk95/latent_agents

Paper link

https://arxiv.org/abs/2604.24881

MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems

Paper introduction

Autonomous agent systems that keep learning after deployment and reduce recurring failures have long been an important goal. In practice, however, most systems are limited to settings and prompts that can be changed through text, and cannot address structural defects at their root. MOSS, proposed to move past these limits, adopts source-level adaptation as the medium of self-evolution and is designed so that it can rewrite the agent's core execution structure itself. The authors point out that the elements that determine actual behavior, such as routing, hook order, state invariants, and dispatch, live inside the code, so changing only skill files or prompt configuration inevitably leaves some failures out of reach. Source code, by contrast, is Turing-complete, is a superset of text-based artifacts, and works deterministically without depending on the model's instruction following; it is therefore presented as a far more general and stable adaptation mechanism.

The core of the MOSS methodology is that it takes automatically collected production-failure evidence as its starting point and, on that basis, runs a multi-stage evolution pipeline deterministically. Code modification is delegated to an external coding agent CLI (command-line interface), but MOSS directly controls the stage order and the final verdict, which keeps the responsibilities of generation and verification separate. Candidate versions produced this way are verified by replaying the failure batches on ephemeral trial workers; this is significant because it is not just static analysis but an evaluation based on reproducing the actual failure situations. Only candidates that pass verification are promoted, subject to user consent, through an in-place container swap, and if they subsequently fail the health probe conditions they are rolled back automatically, which also secures operational safety.
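The verify, promote, and roll back sequence described above can be sketched as a single deterministic function. Everything here is illustrative and not MOSS's real API: `evolve_once`, `Outcome`, and the injected callables are hypothetical stand-ins for the external coding agent, the trial workers, the consent prompt, the container swap, and the health probe.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Outcome:
    promoted: bool
    reason: str


def evolve_once(
    failures: List[str],
    generate_candidate: Callable[[List[str]], str],
    replay_passes: Callable[[str, str], bool],
    user_consents: Callable[[str], bool],
    swap_to: Callable[[str], None],
    healthy: Callable[[], bool],
    current_image: str,
) -> Outcome:
    """Run one evolution cycle with a fixed stage order."""
    # Stage 1: code generation is delegated; the pipeline only receives an image name.
    candidate = generate_candidate(failures)
    # Stage 2: verification replays every recorded failure against the candidate.
    if not all(replay_passes(candidate, failure) for failure in failures):
        return Outcome(False, "replay failed")
    # Stage 3: promotion is gated on explicit user consent.
    if not user_consents(candidate):
        return Outcome(False, "user declined")
    swap_to(candidate)
    # Stage 4: a failed health probe triggers an automatic rollback.
    if not healthy():
        swap_to(current_image)
        return Outcome(False, "rolled back after failed health probe")
    return Outcome(True, "promoted")


deployed: List[str] = []
outcome = evolve_once(
    failures=["timeout in router", "hook ran twice"],
    generate_candidate=lambda fs: "agent:v2",
    replay_passes=lambda image, failure: True,
    user_consents=lambda image: True,
    swap_to=deployed.append,
    healthy=lambda: False,  # simulate a failed health probe
    current_image="agent:v1",
)
```

Because the generator is just one injected callable, the pipeline itself decides the stage order and the verdict. This is the separation of generation from verification that the paragraph above describes.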

This approach differs from existing self-evolving agents, which mainly attempted improvement only in areas expressible as text, such as prompts, memory schemas, and workflow graphs; MOSS instead makes the entire system, including the actual execution harness, the target of evolution. MOSS can therefore be understood not as a model that merely generates better responses, but as an adaptation platform that directly repairs structural defects in a running agent system. In particular, by combining a deterministic pipeline with a verify-promote-rollback process, it offers a far more robust self-improvement path than text-centric methods, which are more susceptible to long-context drift. The design makes clear that if autonomous agents are to evolve safely in real service environments, learning capability alone is not enough: systems-engineering mechanisms covering deployment, verification, and rollback are needed alongside it.

In experiments on OpenClaw, MOSS raised the average grader score across four tasks from 0.25 to 0.61 in a single evolution cycle, without any human intervention. This result shows that source-level rewriting is not only more general in theory but can also bring meaningful performance improvements to real production agent systems. Ultimately, the paper extends the scope of self-evolving agents from text adjustment to code-level reconfiguration and presents a new possibility for autonomous systems to fix their own recurring failures.

Abstract

Autonomous agentic systems are largely static after deployment: they do not learn from user interactions, and recurring failures persist until the next human-driven update ships a fix. Self-evolving agents have emerged in response, but all confine evolution to text-mutable artifacts -- skill files, prompt configurations, memory schemas, workflow graphs -- and leave the agent harness untouched. Since routing, hook ordering, state invariants, and dispatch live in code rather than in any text artifact, an entire class of structural failure is physically unreachable from the text layer. We argue that source-level adaptation is a fundamentally more general medium: it is Turing-complete, a strict superset of every text-mutable scope, takes effect deterministically rather than through base-model compliance, and does not erode under long-context drift. We present MOSS, a system that performs self-rewriting at the source level on production agentic substrates. Each evolution is anchored to an automatically curated batch of production-failure evidence and proceeds through a deterministic multi-stage pipeline; code modification is delegated to a pluggable external coding-agent CLI while MOSS retains stage ordering and verdicts. Candidates are verified by replaying the batch against the candidate image in ephemeral trial workers, then promoted via user-consent-gated, in-place container swap with health-probe-gated rollback. On OpenClaw, MOSS lifts a four-task mean grader score from 0.25 to 0.61 in a single cycle without human intervention.

Paper link

https://arxiv.org/abs/2605.22794

Safety Alignment of LMs via Non-cooperative Games

Paper introduction

Safety alignment of language models (LMs) has become a central challenge in recent AI alignment research, because it requires resilience to malicious inputs while preserving usefulness. Whereas existing approaches have mostly been limited to sequential fine-tuning, generating adversarial prompts and then defending against them, this paper recasts safety alignment as a non-zero-sum game in which an Attacker LM and a Defender LM adapt to each other's strategies in real time. The two models learn jointly through online reinforcement learning (RL): the attacker discovers more sophisticated red-teaming strategies, and the defender evolves to be more robust against those attacks. This mutual adaptation is not one-time training on a static dataset; the competition between the models is repeated and keeps pushing the performance boundary, which clearly sets it apart from existing methods. In particular, the authors design the reward signal not as a point-wise score but as a preference-based signal obtained from pairwise comparisons, in order to provide more stable supervision and reduce vulnerability to reward hacking.

At the center of the methodology is a training procedure called AdvGame, which aims to push the Pareto frontier between safety and utility further outward. Specifically, the attacker and defender are updated in alternation, each reflecting the other's latest policy, so the defender is trained against genuinely stronger attacks and the attacker learns a general vulnerability-detection capability rather than being confined to the weaknesses of one particular model. The mathematical derivation in the appendix shows how this game-theoretic optimization problem is turned into a trainable form. The optimal distribution of the attacker policy is expressed as an exponential reweighting of the reference policy, and is then rearranged into a form that compares two candidates so that the normalization constant cancels. In this process, attacker training becomes a problem of matching a relative preference order rather than regressing an absolute score, which leads naturally to an objective in the Direct Preference Optimization (DPO) family. In other words, the full trajectory formed by the attacker's prompt and the defender's response is the object of comparison, which yields a richer training signal grounded in actual interaction.
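The chain of steps in that derivation (exponential reweighting, cancelling the normalization constant, arriving at a DPO-style objective) follows the standard DPO argument, which can be written out as below. The notation is assumed for illustration and is not taken from the paper's appendix: s is the seed or task given to the attacker, x an attacker prompt, r the reward, and β the strength of the KL regularization toward the reference policy.

```latex
% Optimal KL-regularized policy as an exponential reweighting of the reference policy,
% and the same identity solved for the reward.
\pi^{*}(x \mid s) = \frac{1}{Z(s)}\, \pi_{\mathrm{ref}}(x \mid s)\, \exp\!\left(\frac{r(s, x)}{\beta}\right)
\;\Longrightarrow\;
r(s, x) = \beta \log \frac{\pi^{*}(x \mid s)}{\pi_{\mathrm{ref}}(x \mid s)} + \beta \log Z(s)

% Bradley-Terry preference between two candidates: the term \beta \log Z(s) cancels.
P(x_1 \succ x_2 \mid s) = \sigma\big(r(s, x_1) - r(s, x_2)\big)
= \sigma\!\left(\beta \log \frac{\pi^{*}(x_1 \mid s)}{\pi_{\mathrm{ref}}(x_1 \mid s)}
              - \beta \log \frac{\pi^{*}(x_2 \mid s)}{\pi_{\mathrm{ref}}(x_2 \mid s)}\right)

% Resulting DPO-style loss over preferred (x_w) and dispreferred (x_l) candidates.
\mathcal{L}(\theta) = -\,\mathbb{E}_{(s, x_w, x_l)}\left[\log \sigma\!\left(
    \beta \log \frac{\pi_{\theta}(x_w \mid s)}{\pi_{\mathrm{ref}}(x_w \mid s)}
  - \beta \log \frac{\pi_{\theta}(x_l \mid s)}{\pi_{\mathrm{ref}}(x_l \mid s)}\right)\right]
```

The practical consequence is that Z(s), which would require summing over every possible prompt, never has to be computed: only the log-probability ratios of the two compared candidates enter the loss.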

In addition, the paper connects the preference probability to the Bradley-Terry model and introduces the notion of a marginalized preference, which aggregates the interactions between attacker and defender in logit space. This makes it possible to average out the noise of individual responses while learning a preference structure that reflects the joint effect of prompt and response rather than the prompt alone. As a result, the attacker is updated on a dynamic distribution that the current defender policy keeps changing; it therefore converges not to an attack specialized for a fixed target but to a red-teaming capability that generalizes across many models. As the abstract emphasizes, this joint optimization yields not only a more useful and more attack-resistant Defender LM, but also a strong general-purpose Attacker LM that can be used in real deployment environments. Ultimately, the work broadens safety alignment from a mere defense technique into a learning problem that systematically exploits competition and adaptation between models, and offers a new methodological direction for raising both the safety and the utility of language models.

Abstract

Ensuring the safety of language models (LMs) while maintaining their usefulness remains a critical challenge in AI alignment. Current approaches rely on sequential adversarial training: generating adversarial prompts and fine-tuning LMs to defend against them. We introduce a different paradigm: framing safety alignment as a non-zero-sum game between an Attacker LM and a Defender LM trained jointly via online reinforcement learning. Each LM continuously adapts to the other's evolving strategies, driving iterative improvement. Our method uses a preference-based reward signal derived from pairwise comparisons instead of point-wise scores, providing more robust supervision and potentially reducing reward hacking. Our RL recipe, AdvGame, shifts the Pareto frontier of safety and utility, yielding a Defender LM that is simultaneously more helpful and more resilient to adversarial attacks. In addition, the resulting Attacker LM converges into a strong, general-purpose red-teaming agent that can be directly deployed to probe arbitrary target models. Code is available at github.com/facebookresearch/advgame.

Paper link

https://arxiv.org/abs/2512.20806

Read more

https://github.com/facebookresearch/advgame

Plan, Watch, Recover: A Benchmark and Architectures for Proactive Procedural Assistance

Paper introduction

In real procedural tasks users do not always follow the prescribed order exactly, so assistance systems must go beyond predicting the next step and also be able to decide when to intervene and how to guide. The proposed approach, motivated by this problem, centers on proactive procedural assistance: it interprets the current situation from the user's first-person visual information, dialogue history, and query context, and detects in real time whether the user has entered an out-of-plan (OOP) state. A key point of the work is that it handles whether to intervene and what the intervention should say separately, because the optimization goals of the timing decision and of coaching generation differ. When the user leaves the normal procedure, the system should not wait silently but give brief, accurate recovery instructions at the right moment, and to do so it has to track procedural state and visual cues together.

To support these goals, the authors first built EgoProactive, a large-scale wearable first-person dataset that provides both explicit out-of-plan annotations and recovery steps. The dataset is especially important because it makes the detours and errors that occur in real environments learnable, and fills the gaps of existing resources that assumed only linear step progression. In addition, five existing benchmarks, Ego4D, EPIC-KITCHENS, EgoExo4D, HoloAssist, and HowTo100M, were reorganized under a unified proactive guidance framework to create Pro²Bench, so that intervention timing and recovery coaching abilities can be compared consistently across domains. This matters because it moves procedural understanding beyond next-step prediction toward measuring actual interaction quality.

At the model level, the work proposes a decoupled planner-interaction architecture that separates the part handling the plan from the part handling interaction, so that procedural state tracking and response generation are each optimized for their own role instead of being loosely lumped together. Alongside this, plan-anchored clip selection is applied so that, rather than processing the whole video indiscriminately, priority goes to the visual segments directly related to the current step and the recovery decision. This reduces unnecessary noise in long first-person videos while capturing out-of-plan signals and the clues needed for recovery more clearly. In other words, the architecture aligns both "what to say" and "what to look at" around the plan.
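One way to picture plan-anchored clip selection is as a relevance filter keyed on the current plan step. The sketch below assumes each clip and the current step already have embedding vectors and ranks clips by cosine similarity. Both the embedding-similarity criterion and the function names are assumptions for illustration, not the paper's stated mechanism.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_clips(clip_embeddings: Sequence[Sequence[float]],
                 step_embedding: Sequence[float], k: int) -> List[int]:
    """Return indices of the k clips most similar to the current plan step, in time order."""
    ranked = sorted(
        range(len(clip_embeddings)),
        key=lambda i: cosine(clip_embeddings[i], step_embedding),
        reverse=True,
    )
    return sorted(ranked[:k])  # restore temporal order for the downstream model


clips = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
current_step = [1.0, 0.0]
selected = select_clips(clips, current_step, k=2)
```

Returning the selected indices in temporal order keeps the sequence of events intact for whatever model consumes the clips next.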

Furthermore, the post-training recipe shows that the method is not a special treatment built for one particular model but a general procedure that transfers to different backbones. The method's portability was verified through cross-backbone replication on Llama 4 and Qwen-3.6-VL, which suggests that it could readily be extended to more powerful multimodal models in the future. In the experimental results, the trained Llama-4 system showed higher objective intervention quality across six datasets than strong baselines such as Claude Opus 4.6, Gemini 3.1 Pro, GPT 5.2, and Qwen3 VL 235B. In particular, under the oracle plan condition, where plan quality is controlled, recovery guidance performance rose sharply, which clearly supports the validity of a structure that keeps plan tracking and intervention generation separate. Overall, the work redefines multimodal assistants for users carrying out procedural tasks not as step-prediction systems but as real-time intervention coaches, and is notable for presenting more realistic data, an architecture, and learning strategies together.

Abstract

We envision a proactive multi-modal assistant system which gives users real-time step-by-step guidance on a procedural task, autonomously deciding when to interrupt, and how to coach. However, progress is limited by the absence of large-scale, cross-domain benchmarks that reflect realistic conditions, particularly the common case in which users deviate from the expected step sequence. We address this gap with four contributions: (1) we release EgoProactive, a large-scale wearable-egocentric dataset for proactive procedural assistance with explicit Out-of-Plan (OOP) annotations and recovery steps; (2) we augment five established benchmarks (Ego4D, EPIC-KITCHENS, EgoExo4D, HoloAssist, HowTo100M) into Pro²Bench under a unified proactive-guidance schema; (3) we propose a decoupled planner-interaction architecture specialized for procedural state, visual cues, and recovery injection; (4) we introduce a post-training recipe that transfers across model families, validated by cross-backbone replication on Llama 4 and Qwen-3.6-VL. In extensive experiments, our trained Llama-4 system substantially improves objective intervention quality over strong proprietary (Claude Opus 4.6, Gemini 3.1 Pro, GPT 5.2) and open-weight (Qwen3 VL 235B) baselines across all six datasets. Oracle-plan experiments further show that, when plan quality is controlled, the trained duplex model produces high-quality guidance and large gains on Out-of-Plan recovery.

Paper link

https://arxiv.org/abs/2606.04970

Further reading

https://huggingface.co/datasets/facebook/wearable-ai

FuzzingBrain V2: A Multi-Agent LLM System for Automated Vulnerability Discovery and Reproduction

Paper introduction

The security threat posed by software vulnerabilities keeps growing: roughly 50,000 CVEs (Common Vulnerabilities and Exposures) were reported in 2025 alone. Large language models (LLMs) have opened new possibilities for automated vulnerability detection, but existing LLM-based approaches still struggle with several fundamental problems. Specifically, LLM-generated vulnerability reports have a high false-positive rate and lack a reproducible verification mechanism; vulnerability localization relies on suboptimal granularities such as function level or line level; and vulnerabilities with complex cross-function dependencies and multi-layered trigger conditions are hard to handle effectively. FuzzingBrain V2, the system presented in this study, is a multi-agent LLM system designed to address these challenges systematically. It uses Google's OSS-Fuzz framework as its verification backend, which ensures 100% reproducibility for every reported vulnerability. The system also introduces Suspicious Point, a new abstraction that incorporates control-flow information and enables precise vulnerability localization at a point between function level and line level, and it improves function coverage under resource constraints through logic-based hierarchical function analysis combined with a two-layer fuzzing strategy. In addition, it strengthens complex vulnerability reasoning with Model Context Protocol (MCP) based static and dynamic analysis tools and careful context engineering. On the C/C++ dataset of the AIxCC 2025 final competition, FuzzingBrain V2 achieved a 90% detection rate (36 of 40 vulnerabilities). In real production environments it found a total of 41 previously unknown vulnerabilities across 12 open-source projects, of which 26 were confirmed, 23 were fixed, and 2 were assigned CVE identifiers. These results show that a multi-agent approach combining semantic analysis capabilities with execution-based detection is not merely an academic achievement but can directly improve the security of real production software.

Abstract

Software vulnerabilities pose serious security threats, with approximately 50,000 CVEs reported in 2025. While Large Language Models (LLMs) show promise for automated vulnerability detection, three key challenges remain. First, LLM-generated vulnerability reports have high false-positive rates and lack reproducible verification. Second, existing LLM-based approaches use suboptimal granularities for vulnerability localization: function-level analysis overlooks bugs when the context becomes too broad, while line-level analysis lacks sufficient context. Third, existing approaches struggle to reason about vulnerabilities that involve complex cross-function dependencies and triggering conditions. We present FuzzingBrain V2, a multi-agent system that addresses these gaps through four key contributions: (1) fully automated vulnerability analysis built on Google's OSS-Fuzz, which ensures that all reported vulnerabilities are fuzzer-reproducible; (2) Suspicious Point, a new control-flow-based abstraction for precise vulnerability localization at the optimal granularity; (3) logic-driven hierarchical function analysis with dual-layer fuzzing, which improves function coverage under resource constraints; (4) MCP-based static and dynamic analysis tools, with context engineering that strengthens complex vulnerability reasoning. On the AIxCC 2025 Final Competition C/C++ dataset, FuzzingBrain V2 achieved a 90% detection rate (36 of 40 vulnerabilities). In real-world deployment, FuzzingBrain V2 discovered 29 zero-day vulnerabilities across 12 open-source projects, which maintainers confirmed and fixed, and 2 of them were assigned CVE IDs.
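The first contribution, reporting only what a fuzzer can reproduce, can be sketched as a simple filter. The Python below is a hedged illustration under stated assumptions, not FuzzingBrain V2's code and not the OSS-Fuzz API: `SuspiciousPoint`, its fields, `verified_reports`, and the `try_reproduce` callable are hypothetical names, and the reproducer is a stand-in for whatever harness actually runs the target.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class SuspiciousPoint:
    """Hypothetical record for a candidate between function and line granularity."""
    function: str
    control_flow_note: str  # e.g. the branch condition guarding the risky operation


def verified_reports(
    candidates: Iterable[SuspiciousPoint],
    try_reproduce: Callable[[SuspiciousPoint], Optional[bytes]],
) -> List[Tuple[SuspiciousPoint, bytes]]:
    """Keep only candidates for which the reproducer returns a crashing input."""
    reports = []
    for point in candidates:
        crashing_input = try_reproduce(point)  # None means "could not reproduce"
        if crashing_input is not None:
            reports.append((point, crashing_input))
    return reports


def detection_rate(found: int, total: int) -> float:
    """Fraction of known vulnerabilities that were detected."""
    return found / total
```

Because unreproduced candidates are dropped before reporting, every emitted report carries a concrete crashing input, which is the property the abstract describes as fuzzer-reproducible. With the reported figures, `detection_rate(36, 40)` evaluates to 0.9.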

Paper link

https://arxiv.org/abs/2605.21779

