Long-Running Agents: What Changes When Agents Run for Days

(addyo.substack.com)

5 points by GN⁺ 2 hours ago | 1 comment

AI agents are no longer confined to a single chat session. They can now run autonomously for days or weeks, move across multiple context windows and sandboxes, recover from failures, and resume from the point of interruption. This is a new paradigm
Existing agents run into the structural limits of a single session: the context window running out, overconfidence in self-evaluation, and reverting fixes that were made earlier
Major companies such as Anthropic, Google, and Cursor are converging on an architecture that separates the model loop, the execution sandbox, and the session log
The core challenges for long-running agents are persistent state management, self-verification, and context compression, and the article presents five design patterns for addressing them
The real productivity difference comes less from the model and more from the state, session, and structured handoff layer around it, which makes that layer the main area to invest in

Three meanings of "long-running"

Long-horizon reasoning: the ability to plan and execute across many dependent steps, which is mainly a question of model quality. According to METR's time horizon metric, the length of tasks that frontier models can complete at 50% reliability has been doubling roughly every 7 months since 2019. If the trend holds, day-scale tasks could be feasible by 2028 and year-scale tasks by 2034
Long-running execution: a setup in which the agent process runs for hours to days and may call the model thousands of times. This is mainly a question of harness design
Persistent agency: going beyond a single task, the agent keeps its identity, accumulates memory, and learns user preferences. Google's Memory Bank is a representative example
In real production agents these three are often intertwined, but each has its own engineering problems and solutions
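The doubling claim in the METR item can be written as a simple exponential. This is only a restatement of the "doubles roughly every 7 months" figure; $T_0$ and $t_0$ are placeholder symbols for a reference horizon and reference date, which this summary does not give.

```latex
T(t) = T_0 \cdot 2^{(t - t_0)/7}, \qquad t \text{ measured in months}
```

Under that assumption, one year of progress multiplies the task horizon by $2^{12/7} \approx 3.3$.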

Why long-running agents matter

An agent that runs for 10 minutes is limited to Q&A or small bug fixes, but one that runs for 10 hours can build an entire feature, carry out a migration that has been pending for six quarters, or do research at the level of a junior analyst
Anthropic's Claude Sonnet announcement shared examples, based on internal testing, of more than 30 hours of autonomous coding, including one run that produced a Slack-style app of 11,000 lines
In Project Vend, a Claude instance ran a real office vending business for a month, handling inventory management, pricing, and communication with vendors. The first phase surfaced meaningful failures, and the second phase improved considerably
- The point was not profitability but to see which consistency problems emerge when an agent maintains its identity over weeks rather than turns

The three walls every long-running agent hits

Limited context: even a 1M-token window eventually runs out, and context rot (a gradual decline in model performance) sets in before the window is full. A 24-hour run does not fit comfortably into any context window on the current roadmap
No persistent state: a new session starts from a blank slate. Anthropic describes it as "an engineer arriving at a shift change with no idea what happened on the previous shift"
No self-verification: when a model evaluates its own work, it shows a consistent positive bias. Asked "is it done?", it says "yes" more often than reality warrants, and without a separate verification signal it can submit a result with full confidence at 30% completion

Ralph loop: a simple long-running agent implementation for practitioners

The Ralph loop (the Ralph Wiggum technique), popularized by Geoffrey Huntley and Ryan Carson, is a long-running agent pattern for practitioners whose reference implementation is just a single bash script
Sequence: pick an unfinished task (prd.json) → build a prompt from the task, context, and notes → call the agent → run the tests → append the result to progress.txt → update the task list → repeat
Core principle: the agent itself is amnesiac, but the filesystem remembers. prd.json acts as the plan, progress.txt as the lab notes, and AGENTS.md as a rolling rulebook
Ryan Carson's Compound Product chains several loops: an analysis loop (reading daily reports) → a planning loop (writing the PRD) → an execution loop (writing code). It is an open-source counterpart to the planner-generator-evaluator triple that Anthropic developed independently
An overnight long-running agent can be built from nothing more than bash scripts and JSON files. Google and Anthropic have productized this pattern, making it recoverable, safe, and observable
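The sequence described above can be sketched in a few lines. This is a Python illustration rather than the bash reference implementation, and several details are assumptions made for the example: the prd.json schema (a list of objects with `id`, `title`, and `done`), the `call_agent` and `run_tests` callables, and the iteration cap.

```python
import json
from pathlib import Path
from typing import Callable


def ralph_loop(
    workdir: Path,
    call_agent: Callable[[str], str],  # hypothetical: sends a prompt, returns the agent's report
    run_tests: Callable[[], bool],     # hypothetical: True when the test suite passes
    max_iterations: int = 50,
) -> int:
    """Work through prd.json until no task is pending; return how many were completed."""
    prd_path = workdir / "prd.json"
    progress_path = workdir / "progress.txt"
    completed = 0

    for _ in range(max_iterations):
        # The plan is re-read from disk every iteration: the files are the memory.
        tasks = json.loads(prd_path.read_text())
        pending = [t for t in tasks if not t["done"]]
        if not pending:
            break
        task = pending[0]

        # Build the prompt from the task plus the accumulated notes.
        notes = progress_path.read_text() if progress_path.exists() else ""
        report = call_agent(f"Task: {task['title']}\n\nNotes so far:\n{notes}")

        # The tests, not the agent's report, decide whether the task is done.
        passed = run_tests()
        with progress_path.open("a") as f:
            f.write(f"[{task['id']}] {'PASS' if passed else 'FAIL'}: {report}\n")

        if passed:
            task["done"] = True
            prd_path.write_text(json.dumps(tasks, indent=2))
            completed += 1

    return completed
```

A failed task stays pending, so the next iteration retries it with the failure already recorded in progress.txt.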

Anthropic: from the harness to Brain/Hands/Session separation

First approach (harness structure): a two-agent harness for autonomous full-stack development. The initializer agent sets up the initial project environment, expands the prompt into feature-list.json, and writes a boot script (init.sh). The coding agent wakes up repeatedly to work on one feature at a time, run the tests, write claude-progress.txt, and commit
- Test ratchet rule: "deleting or modifying tests is not allowed." This blocks the common failure in which the agent removes failing tests to make the suite look green
- In the more detailed InfoQ write-up, this evolves into a planner, generator, evaluator triple. Generation and evaluation are kept separate because the model grades its own work too generously
Second approach (Brain/Hands/Session separation): the architecture of Claude Managed Agents (released in early April 2026)
- Brain: the model and the harness loop
- Hands: a sandboxed, ephemeral execution environment where tools actually run
- Session: an append-only event log of every thought, tool call, and observation
Anthropic's central framing: "every component of the harness encodes assumptions about what the model cannot do on its own." Coupling the components tightly means the whole system has to change whenever those assumptions go stale, whereas separating them lets the harness be stateless and the sandbox be treated as disposable, like cattle
A new container can call wake(sessionId) and reconstruct state from the log. This cut time-to-first-token by about 60% at p50 and by more than 90% at p95, because reasoning can begin before the sandbox is ready
The session event log is the most underrated part. It is what makes long-running agents recoverable. Without it, a container failure becomes a session failure
Scientific computing stack: CLAUDE.md (a living plan that the agent edits as it learns), CHANGELOG.md (portable lab notes), tmux + SLURM + git (the execution and coordination layer), and a Ralph loop (re-check whenever completion is claimed)
- Representative example: over several days, Claude Opus built a Boltzmann solver that came within 1% error of the reference CLASS implementation. This compresses what would be months to years of researcher work
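To make the session idea above concrete, here is a minimal sketch of an append-only event log and a `wake` function that rebuilds state by replaying it. It is not Anthropic's implementation: the JSON Lines file layout, the event kinds (`tool_call`, `observation`), and the shape of the reconstructed state are all assumptions chosen for illustration.

```python
import json
from pathlib import Path


class SessionLog:
    """Append-only event log, one JSON object per line (assumed layout)."""

    def __init__(self, root: Path, session_id: str) -> None:
        self.path = root / f"{session_id}.jsonl"

    def append(self, kind: str, payload: dict) -> None:
        # Events are only ever appended; existing lines are never rewritten.
        with self.path.open("a") as f:
            f.write(json.dumps({"kind": kind, "payload": payload}) + "\n")

    def events(self) -> list[dict]:
        if not self.path.exists():
            return []
        return [json.loads(line) for line in self.path.read_text().splitlines() if line]


def wake(root: Path, session_id: str) -> dict:
    """Rebuild working state in a fresh process by replaying the log from the start."""
    state = {"completed_steps": [], "pending_tool_call": None}
    for event in SessionLog(root, session_id).events():
        if event["kind"] == "tool_call":
            # A call with no observation yet is still in flight.
            state["pending_tool_call"] = event["payload"]
        elif event["kind"] == "observation" and state["pending_tool_call"] is not None:
            name = state["pending_tool_call"].get("name", "unknown")
            state["completed_steps"].append(name)
            state["pending_tool_call"] = None
    return state
```

Because the state lives entirely in the log, any replacement container that can read the log can pick up the session, including a tool call that was in flight when the previous container died.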

Cursor: the Planner, Worker, Judge structure

Cursor went through three design iterations while scaling long-running autonomous coding
- First (flat coordination): peer agents wrote to shared files under locks → this became a bottleneck, agents turned risk-averse, and churn set in (going around in circles without committing)
- Second (optimistic concurrency control): the bottleneck eased, but the coordination problem remained unsolved
- Third (current production): Planner (explores the codebase, creates tasks, and recursively spawns sub-planners), Worker (focused execution, working independently with no coordination between workers), Judge (decides whether the iteration is complete or needs a restart)
Key finding: "a surprisingly large share of system behavior depends on the prompts rather than on the harness or the model"
Model-role matching is also part of the design surface: GPT models turned out better than Opus at long autonomous work, while Opus tends to stop early and take shortcuts. Same task, different role, different model
Composer 2 (a proprietary frontier coding model) and background cloud agents: long tasks now run on Anysphere's cloud infrastructure rather than locally. An 8-hour refactor or a codebase-wide migration can continue with the laptop closed
- If a task started locally looks likely to take more than 30 minutes, it shifts to the cloud, and you can reconnect later from mobile
- Each agent runs in an isolated git worktree and merges through a PR
The final structure resembles Anthropic's: role separation, session persistence, a judge alongside the workers, and git-based coordination in cloud sandboxes
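The third iteration above can be reduced to a small control loop. The sketch below is not Cursor's code: `planner`, `worker`, and `judge` are hypothetical callables standing in for separately prompted model roles, and the restart limit is an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Result:
    task: str
    output: str


def run_iteration(
    goal: str,
    planner: Callable[[str], list[str]],         # hypothetical: goal -> list of tasks
    worker: Callable[[str], str],                # hypothetical: task -> output
    judge: Callable[[str, list[Result]], bool],  # hypothetical: accept or restart
    max_rounds: int = 3,
) -> list[Result]:
    """Plan, execute each task independently, then let a separate judge decide."""
    for _ in range(max_rounds):
        tasks = planner(goal)
        # Workers never see each other's output, so there is no cross-worker coordination.
        results = [Result(task, worker(task)) for task in tasks]
        # The judge is a different role from the worker, so nothing is self-graded.
        if judge(goal, results):
            return results
    raise RuntimeError(f"judge rejected all {max_rounds} rounds for goal: {goal}")
```

Keeping the three roles as separate parameters also makes the model-role matching mentioned above a configuration choice: each callable can be backed by a different model.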

Google: long-running agents on Agent Platform

At Cloud Next '26, Vertex AI was consolidated into the Gemini Enterprise Agent Platform, making long-running agents a formal, SLA-backed product
Agent Runtime: supports "autonomous execution for multiple days", sub-second cold starts, and on-demand sandbox provisioning. Example use case: a week-long sales prospecting sequence
Agent Sessions: persist conversation and event history. Custom session IDs can be mapped to CRM or database records so that agent state is stored alongside business state
Agent Memory Bank: a long-term memory layer that, as of Next '26, is in GA (general availability). It curates memories from sessions, scopes them to user IDs, and exposes a search API. In the Payhawk case, a Memory Bank-based agent cut expense submission time by more than 50%
Agent Sandbox (hardened code execution), Agent-to-Agent Orchestration, Agent Registry, Agent Identity, Agent Gateway, Agent Observability, Agent Simulation, and others cover nearly every need of production operation. The encrypted IDs and audit logs that enterprises require are included
Architecturally, it productizes Anthropic's brain/hands/session separation at platform scale, bundled with ADK (a code-first development kit) and Agent Studio (a visual tool). What had to be built by hand three years ago is now a matter of choosing "which version of brain/hands/session separation to rent"

Five patterns for production long-running agents

Checkpoint-and-resume: the most common multi-day failure is context loss. If an error hits on document 201 after 200 documents have been processed and there is no checkpoint, everything has to start over. Treat the agent like a long-running server process: save intermediate state to disk, checkpoint every N tasks, and make failure recovery possible. Choosing the right checkpoint granularity is the key decision
Delegated approval (human-in-the-loop): older implementations handled this by serializing state to JSON → firing a webhook → waiting for a response, but that leaves state stale and notifications get missed. In a long-running runtime, the agent can pause with its reasoning chain, working memory, tool history, and pending actions intact. Compute usage is zero during human review, and the agent can resume with sub-second latency. Google's Mission Control serves as the inbox for this
Memory-layered context: an agent that runs for 7 days needs more than session state. It needs Memory Bank (long-term curated memory) plus Memory Profiles (low-latency lookup). The production failure mode is memory drift: the agent learns procedural shortcuts from unstructured interactions and then applies them far too broadly. Memory therefore has to be governed like a microservice, using Agent Identity (read/write permissions), Agent Registry (agent version tracking), and Agent Gateway (policy enforcement)
Ambient processing: agents that react to events on Pub/Sub streams or BigQuery tables without talking to a human, for example content moderation, anomaly detection, and inbox classification. If policy is defined in the Gateway instead of hardcoded in the agent, policy changes can be applied to hundreds of agents without a redeploy
Fleet orchestration: real systems are not a single agent. A coordinator delegates subtasks to several specialists (Lead Researcher Agent, Scoring Agent, Outreach Agent). Each specialist has its own Identity (for example, the Outreach Agent cannot access the financial data used for scoring), its own policy, and its own Registry entry. ADK handles this declaratively as graph-based workflows
These patterns can be combined. In a compliance system, for example: checkpointing for document processing + delegated approval for review gates + memory layering for cross-session knowledge + fleet orchestration for specialist coordination
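The checkpoint-and-resume pattern, using the 200-document scenario above, can be sketched as follows. The checkpoint file format, the `handle` callable, and the default interval are assumptions for the example, and results are assumed to be JSON-serializable.

```python
import json
import os
from pathlib import Path
from typing import Any, Callable


def process_with_checkpoints(
    items: list,
    handle: Callable[[Any], Any],  # hypothetical per-item work; must return JSON-serializable data
    checkpoint_path: Path,
    every: int = 10,               # checkpoint granularity: save after every N items
) -> list:
    """Process items in order, resuming from the last checkpoint if one exists."""
    state = {"next_index": 0, "results": []}
    if checkpoint_path.exists():
        state = json.loads(checkpoint_path.read_text())

    def save() -> None:
        # Write to a temp file and rename, so a crash never leaves a half-written checkpoint.
        tmp = checkpoint_path.with_suffix(".tmp")
        tmp.write_text(json.dumps(state))
        os.replace(tmp, checkpoint_path)

    for i in range(state["next_index"], len(items)):
        state["results"].append(handle(items[i]))
        state["next_index"] = i + 1
        if state["next_index"] % every == 0:
            save()

    save()
    return state["results"]
```

The `every` parameter is the granularity trade-off: a smaller value means more writes, a larger value means more items are redone after a crash. With `every=10`, a failure on item 201 costs at most a handful of repeated items rather than all 200.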

How to actually build this

For developers who want long-running coding tasks in their own repo: use Claude Code, Antigravity, Cursor, Codex, and similar tools. Maintain AGENTS.md like a pilot's checklist (keep it short, and include only items that came out of real failures). Add typecheck and lint hooks, write a planning file before starting, and re-check with a Ralph loop whenever the agent claims completion. Run multi-hour or overnight work in a worktree so that it continues even with the laptop closed, and commit at meaningful units of work. For most people this is the highest-leverage path
For building a hosted agent product: do not build the runtime yourself; pick a managed option. There are currently three practical options: Google Agent Platform (Agent Engine + Memory Bank + Sessions), Claude Managed Agents, or self-hosting on top of ADK, the Claude Agent SDK, or the Codex SDK. The managed options provide brain/hands/session separation, observability, identity, and audit trails by default. Self-hosting offers control and the ability to use specialized models
For autonomous, operational work (monitoring, research, operations): this needs Memory Bank-style persistence. A stack of ADK + Memory Bank + Cloud Run + Cloud Scheduler is the cleanest setup for "run the agent every N hours, accumulate state, alert on a threshold"

Whichever path you take, these core practices are essential

Write completion criteria before the agent starts: this is the single highest-leverage thing in long-running execution. Put explicit, testable completion conditions in an external file so that the agent cannot quietly redefine "complete" along the way
Separate the evaluator from the generator: self-grading is a major failure mode. A planner/worker/judge pipeline or a generator/evaluator pair is not just a stylistic choice but a real architectural pattern. Even when the same model is used, keep the roles and prompts separate
Invest in session logs, not prompts: an append-only event log makes the agent recoverable, debuggable, and auditable. If you cannot reconstruct the last 24 hours of agent activity from persistent storage, all you have is a long-running shell script that makes LLM calls
Treat compression and context resets as first-class citizens: Anthropic found that summary-based compression is not enough for very long tasks; the harness has to tear the session down completely and rebuild it from structured handoff files. This is essentially the same as onboarding a new engineer
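The first two practices above can be combined in one small evaluator: completion criteria live in an external file as commands that must exit with status 0, and a separate function (not the generating agent) decides whether the work is done. The file format and function names are assumptions for illustration.

```python
import json
import subprocess
from pathlib import Path


def evaluate(criteria_path: Path) -> dict[str, bool]:
    """Run every criterion command; a criterion passes when its command exits with 0.

    Assumed file format: a JSON list of {"name": str, "command": [str, ...]}.
    """
    criteria = json.loads(criteria_path.read_text())
    verdicts = {}
    for criterion in criteria:
        proc = subprocess.run(criterion["command"], capture_output=True)
        verdicts[criterion["name"]] = proc.returncode == 0
    return verdicts


def is_done(agent_claims_done: bool, criteria_path: Path) -> bool:
    """The agent's claim only triggers the check; the external criteria decide."""
    if not agent_claims_done:
        return False
    verdicts = evaluate(criteria_path)
    # An empty criteria file never counts as done.
    return bool(verdicts) and all(verdicts.values())
```

Because the criteria file is written before the run and the agent has no reason to edit it, "complete" keeps the meaning it had at the start. In practice the commands would be things like a test runner or a type checker.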

Practical limits today

Cost: a 24-hour run on frontier models is expensive. Without budgets, circuit breakers, and hard caps on tool spending, a week's API budget can be gone in half a day
Security: long-running agents that hold API keys, cloud access, and permission to execute shell commands have a far larger attack surface than a chat session. This is why the brain/hands separation pattern matters: the sandbox where model-generated code runs should have no access to credentials
Alignment drift: an agent can drift as it crosses multiple context windows. The original goal gets summarized, then resummarized, and fidelity erodes each time. Hooks and judges exist to guard against this, and this drift is the most common reason that "the agent starts doing work it was never asked to do"
Verification: auditing 24 hours of autonomous activity is a real cost in human time. Observability and structured outputs (PRs, commits, briefings, test runs) make it tractable
The human role: defining a task precisely enough that an agent can work on it for a full day is often harder than doing the work yourself. The growing value is therefore not in writing code but in writing specs that hold up on contact with autonomous executors
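The hard cap mentioned under cost can be as simple as a guard that every model or tool call must pass through before it spends anything. This is a minimal sketch: the class names are hypothetical, and the per-call cost is assumed to come from whatever usage or pricing estimate the harness already has.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a charge would push spending past the hard cap."""


@dataclass
class SpendGuard:
    hard_cap_usd: float
    spent_usd: float = 0.0

    def charge(self, cost_usd: float) -> None:
        # Refuse before spending, so the cap is never crossed.
        if self.spent_usd + cost_usd > self.hard_cap_usd:
            raise BudgetExceeded(
                f"charge of {cost_usd:.2f} would exceed cap {self.hard_cap_usd:.2f} "
                f"(already spent {self.spent_usd:.2f})"
            )
        self.spent_usd += cost_usd
```

A harness would call `guard.charge(estimated_cost)` before each model call and treat `BudgetExceeded` as a signal to checkpoint and stop, not to retry.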

Where this is heading

Google, Anthropic, and Cursor are converging on the same structure: separation of model loop, execution sandbox, and session log; separation of planning, generation, and evaluation; built-in compression, hooks, and context resets; and memory as a managed service
The differences are mostly surface-level: Google Agent Platform is the enterprise stack (identity and audit trails built in), Claude Managed Agents are "a hosted version of Anthropic's harness", and Cursor background agents are "long-running coding pulled out of the IDE and into the cloud"
The hard problems of the next year will lie less in the individual layers than in the coordination above them: running multiple long-running agents on a shared codebase, agents that read their own traces and patch their own harness, and harnesses that assemble tools and context just-in-time (JIT) for each task
Models are still the core, but most of the gap between a chat window and an agent that can run overnight lies in state, session, and structured handoff, and that is the most important area to learn right now

1 comment

jjpark78 1 hour ago

I started using hermes to solve this problem, and I think it's not bad, haha