लंबे समय तक चलने वाले एप्लिकेशन डेवलपमेंट के लिए harness design

(anthropic.com)

62 पॉइंट द्वारा GN⁺ 2026-03-26 | अभी कोई टिप्पणी नहीं है. | WhatsApp पर शेयर करें

Anthropic ने frontend design quality में सुधार और long-running autonomous coding की दो समस्याओं को एक साथ हल करने के लिए GAN से प्रेरित multi-agent संरचना विकसित की
generator और evaluator को अलग करने वाली संरचना ने subjective design quality को ठोस मानदंडों पर score करना संभव बनाया, जिससे agent के self-evaluation bias की समस्या हल हुई
planner-generator-evaluator की 3-agent architecture के साथ multi-hour autonomous coding sessions में full-stack application पूरा किया गया, जिसमें sprint contract negotiation और Playwright-आधारित QA शामिल थे
Opus 4.5 से Opus 4.6 पर जाने के बाद sprint breakdown के बिना भी 2 घंटे से अधिक consistent coding संभव हुई, जिससे harness की complexity घटाते हुए भी performance बरकरार रही
मॉडल performance बेहतर होने पर भी दिलचस्प harness combinations का क्षेत्र कम नहीं होता, बल्कि खिसकता है, और AI engineer का मुख्य काम नए combinations खोजते रहना है

साधारण implementation अपनी सीमा तक क्यों पहुँच जाता है

पिछले प्रयोगों में initialization agent product spec को task list में तोड़ता था, और coding agent एक-एक feature implement करने के बाद artifacts के ज़रिए sessions के बीच context पास करता था
- developer community में भी "Ralph Wiggum" approach की तरह hooks या scripts के जरिए agent को लगातार repeat loop में बनाए रखने वाले समान तरीके सामने आए
complex tasks में समय के साथ agent के track से भटकने की समस्या बनी रही, और दो common failure modes देखे गए
पहला failure mode: context window भरते ही model consistency खोने लगता था, और कुछ models में "context anxiety" दिखी, जहाँ वे अपनी context limit तक पहुँचने का अनुमान लगते ही काम जल्दी समेटने की कोशिश करते थे
- context reset (पूरी context window खाली करके, पिछले agent state और अगले steps वाले structured handoff के साथ नया agent शुरू करना) इन दोनों समस्याओं का समाधान था
- यह बातचीत के पुराने हिस्सों का सार लेकर उसी agent को आगे बढ़ाने वाले compaction से अलग तरीका है; compaction continuity बनाए रखता है, लेकिन clean slate नहीं देता, इसलिए context anxiety बनी रह सकती है
- Claude Sonnet 4.5 में context anxiety इतनी प्रबल थी कि केवल compaction से long-running task performance सुनिश्चित नहीं हो सकती थी, इसलिए context reset harness design का मुख्य तत्व बन गया
दूसरा failure mode: self-evaluation की समस्या, जहाँ agent अपने ही बनाए output का मूल्यांकन करते समय quality स्पष्ट रूप से साधारण होने पर भी आत्मविश्वास से उसकी तारीफ करता था
- यह खासकर design जैसे subjective tasks में गंभीर था, क्योंकि यहाँ verifiable software tests जैसी binary checks नहीं होतीं
- work agent और evaluation agent को अलग करना बहुत शक्तिशाली leverage निकला, और स्वतंत्र evaluator को skeptical तरीके से tune करना, generator को self-critical बनाने की तुलना में कहीं आसान था

Frontend design: subjective quality को score करने लायक बनाना

बिना intervention के Claude आमतौर पर तकनीकी रूप से काम करने वाले लेकिन दृश्य रूप से साधारण safe और predictable layouts बनाता था
harness design को दो मुख्य insights ने दिशा दी
- aesthetics को पूरी तरह score नहीं किया जा सकता, लेकिन design principles और preferences को encode करने वाले scoring rubric से सुधार किया जा सकता है — "क्या यह design सुंदर है?" की तुलना में "क्या यह अच्छे design principles का पालन करता है?" अधिक consistent scoring देता है
- frontend generation और scoring को अलग करके feedback loop बनाया जा सकता है, जो generator को अधिक मजबूत output की ओर धकेलता है
generator और evaluator दोनों को दिए गए 4 scoring criteria:
- Design quality: क्या colors, typography, layout, images आदि मिलकर कोई स्पष्ट mood और identity वाला coherent whole बनाते हैं
- Originality: क्या custom decisions के प्रमाण हैं, या यह template layout, library defaults, या AI-generated pattern है — जैसे purple gradient पर white card जैसा AI-generated संकेत हो तो fail
- Craft: typography hierarchy, spacing consistency, color harmony, contrast ratio जैसी technical execution — creativity नहीं, capability की जाँच
- Functionality: aesthetics से अलग usability — क्या user समझ सकता है कि interface क्या करता है और मुख्य actions कहाँ हैं
Design quality और Originality को Craft और Functionality से अधिक weight दिया गया — क्योंकि Claude को Craft और Functionality में तो अच्छे scores मिल जाते थे, लेकिन design और originality में output साधारण रहता था
- rubric में बहुत सामान्य "AI slop" patterns को स्पष्ट रूप से penalize किया गया ताकि model को aesthetic risk लेने के लिए प्रेरित किया जा सके
orchestration को Claude Agent SDK से बनाया गया, जहाँ generator HTML/CSS/JS frontend बनाता था, और evaluator Playwright MCP के जरिए live page के साथ सीधे interact करके screenshots लेता, implementation को बारीकी से देखता, फिर score और detailed critique लिखता
- प्रति generation 5 से 15 iterations होती थीं, और हर iteration में evaluator की critique पर प्रतिक्रिया देकर generator अधिक अलग दिशा में बढ़ता था
- पूरा run अधिकतम 4 घंटे तक चल सकता था
- generator को हर evaluation के बाद strategic decision लेने के लिए कहा गया: अगर score trend अच्छा हो तो मौजूदा दिशा को refine करो, और अगर approach काम न करे तो पूरी तरह अलग aesthetics पर switch करो
rubric की wording का generator पर अप्रत्याशित असर पड़ा — "सबसे अच्छे designs museum-quality होते हैं" जैसी पंक्तियों ने खास visual convergence को प्रेरित किया, यानी rubric से जुड़ी prompting सीधे output के character को shape करती थी
scores आमतौर पर iterations के साथ सुधरे, लेकिन हमेशा linear नहीं रहे, और कई बार final iteration से अधिक मध्य की iteration पसंद आई
- implementation complexity iterations के साथ बढ़ने की प्रवृत्ति दिखी, क्योंकि evaluator feedback के जवाब में अधिक ambitious solutions अपनाए गए
- पहली iteration से ही unprompted baseline की तुलना में साफ़ तौर पर बेहतर परिणाम मिले; rubric और उससे जुड़ी भाषा ने evaluator feedback से पहले ही model को सामान्य defaults से बाहर धकेला
Dutch museum website case: 9वीं iteration तक एक साफ़ dark-theme landing page बनाया गया, लेकिन 10वीं iteration में approach को पूरी तरह छोड़कर CSS perspective से render किया गया 3D room, checkerboard floor, दीवारों पर स्वतंत्र रूप से टंगी artworks, और दरवाज़ों के जरिए galleries के बीच navigation वाले spatial experience में फिर से कल्पित किया गया — यह single-pass generation में न दिखने वाली creative jump थी

Full-stack coding तक विस्तार

Architecture

पिछले long-running harness में initialization agent, per-feature coding agents, और session-to-session context reset के जरिए consistent multi-session coding संभव हुई
- Sonnet 4.5 की context anxiety के कारण context reset महत्वपूर्ण था, लेकिन Opus 4.5 में यह behavior काफी हद तक हट गया, इसलिए context reset के बिना एक continuous session में पूरी build की गई
- Claude Agent SDK का automatic compaction बढ़ते context को संभालता था
3-agent system की संरचना:
- Planner: 1 से 4 वाक्यों वाले छोटे prompt को पूरे product spec में expand करता था — prompting इस तरह की गई कि वह detailed technical implementation की बजाय product context और high-level technical design पर ध्यान दे, क्योंकि पहले से तकनीकी details तय करने पर गलतियाँ downstream तक फैल सकती थीं
  - उसे product spec में AI features को weave करने के मौके खोजने के लिए भी कहा गया
- Generator: sprint के हिसाब से spec से एक-एक feature उठाकर React/Vite/FastAPI/SQLite (बाद में PostgreSQL) stack में implement करता, हर sprint के अंत में self-evaluation करके QA को handoff देता, और git से version control करता
- Evaluator: Playwright MCP से चल रहे application को वास्तविक user की तरह click-through करके UI features, API endpoints, और database state test करता — product depth, functionality, visual design, और code quality के आधार पर score करता, और किसी criterion में hard threshold से नीचे होने पर sprint fail हो जाता
हर sprint से पहले generator और evaluator sprint contract negotiate करते थे — यानी code लिखने से पहले उस chunk के लिए "done" की परिभाषा पर सहमति
- क्योंकि product spec जानबूझकर high-level रखा गया था, यह step user stories और testable implementation के बीच की खाई भरता था
- communication file-based था — एक agent file लिखता और दूसरा उसे पढ़कर जवाब देता

Harness run results: retro game maker

"level editor, sprite editor, entity behavior, और playable test mode सहित 2D retro game maker बनाओ" वाले prompt पर test किया गया
solo agent: 20 मिनट / $9 बनाम full harness: 6 घंटे / $200 — harness 20 गुना से अधिक महंगा था, लेकिन output quality का अंतर तुरंत साफ़ दिखा
solo run result: शुरुआती screen उम्मीद के मुताबिक थी, लेकिन click करते ही समस्याएँ खुलीं — layout में space की बर्बादी, workflow rigid, sprite और entity पहले बनाने चाहिए लेकिन UI इसका मार्गदर्शन नहीं करता, और सबसे अहम असल game काम नहीं करता था (entity screen पर दिखती थी लेकिन input पर प्रतिक्रिया नहीं देती थी, entity definition और game runtime के बीच wiring टूटी हुई थी)
harness run result: planner ने एक-वाक्य prompt को 10 sprints में फैले 16 feature specs में expand किया — जिसमें sprite animation system, behavior templates, sound effects और music, AI-assisted sprite generator और level designer, तथा shareable links के जरिए game export शामिल थे
- frontend design skills को पढ़कर app की visual design language भी spec का हिस्सा बनाई गई
- canvas पूरे viewport का उपयोग करता था, panel sizes उचित थे, और spec की design direction का पालन करने वाली consistent visual identity थी
- sprite editor अधिक समृद्ध और feature-complete था, tool palette अधिक साफ़, color picker बेहतर, और zoom controls अधिक usable थे
- AI integration के जरिए prompting से game के अलग-अलग हिस्से generate करके workflow तेज किया गया
play mode में मुख्य अंतर: solo run में game काम नहीं करता था, जबकि harness run में वास्तव में entity को move करके game खेला जा सकता था — physics engine में कुछ roughness थी (platform और character overlap), और AI level composition की सीमाएँ थीं (ऐसी बड़ी दीवारें जिन्हें jump नहीं किया जा सकता), लेकिन core functionality काम कर रही थी
evaluator ने implementation को spec के अनुरूप बनाए रखा — केवल Sprint 3 में ही level editor के लिए 27 criteria वाला granular contract था
- मिली समस्याओं के उदाहरण: rectangle fill tool drag के start/end points पर ही tiles रख रहा था, entity delete key handler में conditional error था, और FastAPI route order issue के कारण reorder को integer की तरह parse करके 422 error लौट रही थी
evaluator tuning में काफी मेहनत लगी — default state में Claude एक कमजोर QA agent था, जो सही issues ढूँढ लेने के बाद भी खुद को समझाकर उन्हें "कोई बड़ी बात नहीं" कहकर approve कर देता था, और superficial testing के कारण subtle bugs छूट जाते थे
- evaluator logs पढ़कर जहाँ judgement diverge होती थी ऐसे cases खोजे गए, और QA prompt को बार-बार update करने वाले development loop के कई चक्रों के बाद ही उचित scoring मिल सकी

Harness की iterative improvement

शुरुआती परिणाम उत्साहजनक थे, लेकिन क्योंकि harness बड़ा, धीमा और महंगा था, इसलिए performance घटाए बिना इसे सरल बनाना अगला कदम था
- harness का हर component उन धारणाओं को encode करता है कि model अपने दम पर क्या नहीं कर सकता, और ऐसी धारणाओं को stress test करना ज़रूरी है — क्योंकि model में सुधार के साथ वे जल्दी पुरानी हो सकती हैं
- सिद्धांत था: "सबसे सरल संभव समाधान खोजो, और केवल ज़रूरत पड़ने पर ही complexity बढ़ाओ"
radical simplification की कोशिशें original performance को reproduce नहीं कर सकीं, और यह समझना मुश्किल था कि असल में कौन-से हिस्से load संभाल रहे हैं, इसलिए एक बार में एक component हटाने वाला systematic approach अपनाया गया
Opus 4.6 release ने harness complexity घटाने की अतिरिक्त प्रेरणा दी — यह अधिक सोच-समझकर plan करता था, agentic tasks पर ज़्यादा देर तक टिका रहता था, बड़े codebase में अधिक stable था, अपने mistakes पकड़ने वाली code review/debugging skills बेहतर थीं, और long-context retrieval भी काफ़ी बेहतर थी

Sprint structure हटाना

sprint structure को पूरी तरह हटा दिया गया — Opus 4.6 की बेहतर क्षमता के कारण model बिना decomposition के भी काम को लगातार संभाल सकता था
planner और evaluator को रखा गया — planner के बिना generator में scope की कमी थी; वह raw prompt से बिना spec के build शुरू कर देता था और कम features वाला application बनाता था
evaluator को per-sprint scoring से हटाकर run के अंत में single-pass पर ले जाया गया
- अगर task model की अकेली क्षमता के दायरे में हो तो evaluator अनावश्यक overhead बन जाता है, लेकिन model capability की सीमा पर मौजूद tasks में यह अब भी ठोस सुधार देता है
- इसलिए evaluator कोई स्थायी yes/no चीज़ नहीं है; जब task उस सीमा से आगे जाए जहाँ current model अकेले reliably काम कर सकता है, तब उसकी cost वाजिब होती है
AI feature builds को बेहतर करने के लिए prompting भी जोड़ी गई — generator को app के अंदर की functionality के लिए tools के जरिए चलने वाले सही agents बनाना सिखाने में काफी iteration लगी, क्योंकि यह ज्ञान नया था और Claude के training data में इसकी coverage सीमित थी

Updated harness results: browser DAW

"Web Audio API का उपयोग करके browser में एक fully functional DAW बनाओ" वाले prompt पर test किया गया
कुल लगभग 4 घंटे, $124.70 खर्च हुए
- planner 4.7 मिनट/$0.46, build round 1 2 घंटे 7 मिनट/$71.08, QA round 1 8.8 मिनट/$3.24, build round 2 1 घंटा 2 मिनट/$36.89, QA round 2 6.8 मिनट/$3.09, build round 3 10.9 मिनट/$5.88, QA round 3 9.6 मिनट/$4.06
builder sprint decomposition के बिना 2 घंटे से अधिक लगातार चला
QA agent की पहली feedback: design fidelity, AI agent, और backend अच्छे थे, लेकिन functional completeness मुख्य failure point थी — timeline पर clips drag/move नहीं हो रहे थे, instrument UI panels (synth knobs, drum pads) नहीं थे, और visual effects editor (EQ curves, compressor meters) नहीं था
QA की दूसरी feedback: audio recording अभी stub थी, clip resize और split implement नहीं थे, और effects visualization graphics की बजाय numeric sliders थे
final app पेशेवर music production program के स्तर तक नहीं पहुँचा, और agent की song composition क्षमता में सुधार की ज़रूरत थी, क्योंकि Claude वास्तव में आवाज़ सुन नहीं सकता, इसलिए QA feedback loop musical taste के मामले में कम प्रभावी रही
- फिर भी इसमें working arrangement view, mixer, transport सहित functional music production software के core elements थे
- केवल prompting से short song snippets बनाए जा सकते थे — agent tempo और key set करता, melody रखता, drum track बनाता, mixer levels adjust करता, और reverb जोड़ता

आगे की दिशा

जैसे-जैसे models बेहतर होंगे, उम्मीद है कि वे अधिक लंबे और अधिक complex tasks कर पाएँगे; कुछ मामलों में model के आसपास की scaffold की अहमियत घट सकती है, इसलिए अगला model आने तक इंतज़ार करने पर कुछ समस्याएँ अपने आप हल हो सकती हैं
दूसरी ओर, model जितना बेहतर होता है, baseline से आगे के complex tasks हासिल करने के लिए harness विकसित करने की संभावनाएँ भी उतनी ही बढ़ती हैं
मुख्य सीख:
- जिस model पर build करना है उसके साथ प्रयोग करना, वास्तविक समस्याओं में traces पढ़ना, और performance को इच्छित outcomes के अनुसार tune करना हमेशा अच्छी practice है
- अधिक complex tasks में task को तोड़ना और हर पहलू पर specialized agents लागू करना extra headroom दे सकता है
- नया model आने पर harness की दोबारा समीक्षा करना अच्छी practice है, ताकि ऐसे हिस्से हटाए जा सकें जो अब performance का भार नहीं उठा रहे, और ऐसे नए हिस्से जोड़े जा सकें जो पहले असंभव रही बड़ी capabilities को संभव बनाते हों
models बेहतर होने पर भी दिलचस्प harness combinations का क्षेत्र कम नहीं होता, बल्कि स्थान बदलता है, और AI engineers का काम अगले नए combination को खोजते रहना है

लंबे समय तक चलने वाले एप्लिकेशन डेवलपमेंट के लिए harness design

साधारण implementation अपनी सीमा तक क्यों पहुँच जाता है

Frontend design: subjective quality को score करने लायक बनाना

Full-stack coding तक विस्तार

Architecture

Harness run results: retro game maker

Harness की iterative improvement

Sprint structure हटाना

Updated harness results: browser DAW

आगे की दिशा

संबंधित पढ़ाई

अभी कोई टिप्पणी नहीं है.