AI in SRE: How Google Is Designing the Future of Reliable Operations

epdlemflaj · 2026-06-02T11:08:38+09:00

(sre.google)

AI coding assistants are speeding up code generation and deployment (the stated goal is up to a 4x productivity gain), so traditional SRE practices that rely on humans reviewing every change no longer scale. This article describes how Google has redesigned SRE for the AI era.
Rather than simply automating existing tasks with AI, Google is building a new foundation for reliability: an autonomous mitigation agent (AI Operator), execution guardrails (Actus), and a continuous evaluation pipeline built on human operations memory (IRM Analyzer).
Because an AI mistake in production is very costly, it is controlled through a "Safety Trifecta" of transparency, real-time risk assessment, and progressive authorization.
Autonomy is divided into levels from L0 (manual) to L4 (fully autonomous), and moving up a level requires demonstrating a statistically significant success rate on golden data.
The SRE role is shifting from "operator to architect": instead of reviewing code line by line, humans move up the abstraction ladder to define design, intent, policy, and the safety boundaries of autonomous agents.

Why SRE needs to change now

Core principles such as SLOs, error budgets, and toil reduction remain the standard, but deterministic automation alone cannot handle the complexity of "planetary scale" services and multi-tenant workloads.
AI-assisted development is accelerating the pace of change, and observability gaps are now filling up with petabyte-scale unstructured data.
AI is being integrated not as just another tool but as a transformative layer that spans the entire service lifecycle.

Controlling AI in production (AI-Ops governance)

Incorrect AI behavior in production can turn into an immediate, widespread outage, and its blast radius is larger and spreads faster than that of a human error.
Key challenges: evolving human expertise (operator → architect), explainability and trust, data integrity and bias mitigation, handling model drift, defending against security vectors (adversarial attacks, data poisoning, prompt injection), and preventing unintended cascading failures.
Safety Trifecta
- transparency: the agent logs its "Chain of Thought", including the signals it used, its hypotheses, the reasons for its choices, and its confidence
- real-time risk assessment: the risk level of every action is evaluated from context such as ongoing deployments, error budget, active incidents, and time zone
- progressive authorization: instead of granting full authority from the start, permissions are expanded gradually according to autonomy level
architectural guardrails: default-deny access, least privilege, agent-specific rate limits and circuit breakers, mandatory dry-run support, zero trust, and safe-by-default actuation
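
The real-time risk assessment step can be pictured with a small sketch. It is not Google's implementation: the field names, weights, and thresholds below are all hypothetical and only show how the four context signals named above (ongoing deployment, error budget, active incidents, time zone) might be combined into a risk level.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class ActionContext:
    # Every field and threshold here is an illustrative assumption.
    deployment_in_progress: bool
    error_budget_remaining: float  # fraction of the budget left, 0.0 to 1.0
    active_incidents: int
    local_hour: int  # 0-23, in the service owners' time zone


def assess_risk(ctx: ActionContext) -> Risk:
    """Turn the context of a proposed action into a coarse risk level."""
    score = 0
    if ctx.deployment_in_progress:
        score += 2  # overlapping changes are harder to attribute
    if ctx.error_budget_remaining < 0.2:
        score += 2  # little room left for a failed mitigation
    if ctx.active_incidents > 0:
        score += 1
    if ctx.local_hour < 7 or ctx.local_hour >= 19:
        score += 1  # fewer humans are around to intervene
    if score >= 4:
        return Risk.HIGH
    if score >= 2:
        return Risk.MEDIUM
    return Risk.LOW
```

A caller would evaluate this before every action and, for anything above LOW, fall back to a lower autonomy level or ask a human.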

SRE AI autonomy levels (L0 to L4)

Automation maturity is defined by five capabilities: monitoring, investigation, approval, actuation, and self-direction.
- L0 manual: only monitoring is automated; humans do everything else
- L1 assisted: automation extends through investigation (AI proposes incident hypotheses); approval and execution stay with humans
- L2 partially autonomous: automation can extend through execution, but explicit human approval is required
- L3 highly autonomous: autonomy through approval and actuation in well-defined scenarios; humans are only notified
- L4 fully autonomous: plans and executes the full chain of diagnosis, mitigation, and resolution on its own, adjusts its strategy in real time based on results, and handles the entire incident lifecycle through closure
Raising the level is not a simple switch but a structured journey built on establishing trust and safety controls.
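
The level definitions above can be written down as a capability table, and the promotion requirement (a statistically significant success rate on golden data) as a gate. The summary does not say which statistical test is used; the sketch below assumes a one-sided exact binomial test, and the baseline `p0` and significance level `alpha` are hypothetical defaults.

```python
from enum import IntEnum
from math import comb


class Level(IntEnum):
    L0 = 0
    L1 = 1
    L2 = 2
    L3 = 3
    L4 = 4


# Capabilities owned by the automation (not the human) at each level.
AUTOMATED = {
    Level.L0: {"monitoring"},
    Level.L1: {"monitoring", "investigation"},
    Level.L2: {"monitoring", "investigation", "actuation"},
    Level.L3: {"monitoring", "investigation", "approval", "actuation"},
    Level.L4: {"monitoring", "investigation", "approval", "actuation",
               "self_direction"},
}


def needs_human_approval(level: Level) -> bool:
    """Approval stays with a human until the level automates it."""
    return "approval" not in AUTOMATED[level]


def binomial_tail(successes: int, trials: int, p0: float) -> float:
    """P(X >= successes) for X ~ Binomial(trials, p0)."""
    return sum(
        comb(trials, k) * p0**k * (1 - p0) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def may_promote(successes: int, trials: int,
                p0: float = 0.9, alpha: float = 0.05) -> bool:
    """True if the golden-data success rate is significantly above p0."""
    return binomial_tail(successes, trials, p0) < alpha
```

With these assumed defaults, 10 successes out of 10 golden cases is not enough evidence to promote, while 100 out of 100 is, which illustrates why a small evaluation set cannot justify a higher level.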

Evaluation data and human operations memory

Human Trajectory: scattered logs such as chat, incident notes, and CLI history are parsed with NLP and reconstructed into a chronological event sequence (IRM-Analyzer)
data quality tiers: Bronze (automatic labeler heuristics) / Silver (program-generated, calibrated against the gold standard) / Gold (verified by human experts)
Gold data is built by manually reviewing a diverse set of incidents chosen through stratified sampling, which lets true precision and observed precision be measured separately
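
The summary does not define the two precision figures. The sketch below assumes one plausible reading: observed precision is what the automatic labels report over all incidents, and true precision is estimated from the human-reviewed gold sample, weighted by the size of each stratum. The `Stratum` type and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stratum:
    name: str
    population: int    # incidents in this stratum overall
    auto_correct: int  # judged correct by automatic labels, whole stratum
    gold_sampled: int  # incidents manually reviewed by experts
    gold_correct: int  # of the reviewed ones, confirmed correct


def observed_precision(strata: list[Stratum]) -> float:
    """Precision as reported by the automatic (non-gold) labels."""
    total = sum(s.population for s in strata)
    return sum(s.auto_correct for s in strata) / total


def estimated_true_precision(strata: list[Stratum]) -> float:
    """Stratified estimate: per-stratum gold precision, population-weighted."""
    total = sum(s.population for s in strata)
    return sum(
        (s.population / total) * (s.gold_correct / s.gold_sampled)
        for s in strata
    )
```

Comparing the two numbers shows how far the automatic labels drift from expert judgment, which is the kind of gap the Silver tier's calibration against gold is meant to close.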
Nightly Evals + LLM-as-a-Judge: recent real incidents are evaluated automatically every day; an LLM evaluator assesses qualitative reasoning, while the final mitigation output is graded with strict deterministic scoring (for example, "correct" only when the binary and version match exactly)
Golden data collection is integrated naturally into the incident mitigation workflow, so SREs continuously supply high-quality labels simply by accepting, modifying, or rejecting suggestions
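
The deterministic half of the nightly evaluation is simple enough to sketch. Assuming a mitigation is identified by a binary name and a version (the example given above), a prediction counts as correct only on an exact match of both; the `Mitigation` type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mitigation:
    binary: str
    version: str


def is_correct(predicted: Mitigation, golden: Mitigation) -> bool:
    """Strict scoring: no partial credit and no fuzzy matching."""
    return (predicted.binary == golden.binary
            and predicted.version == golden.version)


def nightly_pass_rate(pairs: list[tuple[Mitigation, Mitigation]]) -> float:
    """Fraction of (predicted, golden) pairs that match exactly."""
    if not pairs:
        return 0.0
    return sum(is_correct(p, g) for p, g in pairs) / len(pairs)
```

Keeping this check free of any model judgment means a regression in the pass rate cannot be explained away by a lenient evaluator.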

AI across the SRE lifecycle

Detectr (detection): a Gemini-based multi-stage pipeline that runs user feedback from social media, customer support, forums, and similar sources through filter → cluster → denoise → report, acting as a backstop for new kinds of failures that metric-based monitoring misses (deployed in Cloud, Ads, YouTube, and Search; hundreds of hours of impact reduced in total)
AI Alert (alert enrichment): within about 2 minutes, before the alert reaches a human, it queries monitoring, logs, change logs, and the dependency graph in massively parallel fashion to add context, and it reports only verifiable facts with source links rather than guesses (read-only)
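
The shape of Detectr's filter → cluster → denoise → report flow can be shown with a toy pipeline. The real system uses Gemini at these stages; the sketch below swaps in plain keyword matching purely to make the stage boundaries concrete, and the word lists and threshold are hypothetical.

```python
import re
from collections import defaultdict

# Stand-ins for model-based classification; purely illustrative.
OUTAGE_WORDS = {"down", "error", "outage", "failing", "broken"}
PRODUCTS = {"youtube", "search", "ads", "cloud"}


def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))


def filter_feedback(posts: list[str]) -> list[str]:
    """Keep only posts that look like failure reports."""
    return [p for p in posts if words(p) & OUTAGE_WORDS]


def cluster(posts: list[str]) -> dict[str, list[str]]:
    """Group posts by the product they mention."""
    groups: dict[str, list[str]] = defaultdict(list)
    for post in posts:
        for product in sorted(words(post) & PRODUCTS):
            groups[product].append(post)
    return dict(groups)


def denoise(groups: dict[str, list[str]], min_reports: int = 2):
    """Drop clusters too small to be a credible signal."""
    return {k: v for k, v in groups.items() if len(v) >= min_reports}


def report(groups: dict[str, list[str]]) -> list[str]:
    return [f"{product}: {len(posts)} user reports"
            for product, posts in sorted(groups.items())]


def run_pipeline(posts: list[str]) -> list[str]:
    return report(denoise(cluster(filter_feedback(posts))))
```

Each stage narrows the input for the next one, so the expensive steps only see feedback that already looks relevant.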

L1: human-led mitigation

Incident Hypothesis: LLM+RAG combines monitoring anomalies, playbooks, logs, and similar past cases to suggest the single most likely cause and the steps to verify it → an A/B test confirmed a 10% reduction in MTTM (mean time to mitigate)
Investigation Dashboard (InvD): instantly builds a "single pane of glass" for each incident, with a 4-step capability of anomaly detection → signal correlation → investigation value judgment → root cause identification, and more than 100 domain-specific "troubleshooters" running in parallel → ML-based anomaly detection alone raised the discovery rate by 195%, and MTTM fell by about 44%
Gemini-based CLI (Antigravity CLI): performs L1 investigation tasks through the Production Agent (MCP), such as bug filing, owner assignment, postmortem export, real-time monitoring queries, log analysis, and safe traffic drain (extensible through a skill library)
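
The idea of many domain-specific troubleshooters running in parallel, as described for InvD above, can be sketched with a thread pool. The two checks, the incident fields, and the confidence values are invented for illustration and say nothing about how InvD's troubleshooters actually work.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional

Finding = tuple[float, str]  # (confidence, summary)
Troubleshooter = Callable[[dict], Optional[Finding]]


def check_recent_rollout(incident: dict) -> Optional[Finding]:
    # Hypothetical check: did a rollout finish just before the alert?
    if incident.get("minutes_since_rollout", 10**9) < 30:
        return (0.8, "rollout finished shortly before the alert")
    return None


def check_quota(incident: dict) -> Optional[Finding]:
    # Hypothetical check: is the service close to its quota?
    if incident.get("quota_used", 0.0) > 0.95:
        return (0.6, "quota nearly exhausted")
    return None


def run_troubleshooters(incident: dict,
                        troubleshooters: list[Troubleshooter]) -> list[Finding]:
    """Run every check concurrently and rank findings by confidence."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(lambda check: check(incident), troubleshooters))
    findings = [r for r in results if r is not None]
    return sorted(findings, reverse=True)
```

Because each check is independent and read-only, adding a new troubleshooter does not change the latency of the others, which is what makes the fan-out approach practical at more than 100 checks.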

L3: autonomous mitigation

Supporting 4x development speed at constant cost requires going beyond recommendations to direct actuation, but under progressive authorization this starts at L2 (suggest and wait for approval) and moves to L3/L4 only after validation
AI Operator: a first-response agent for production alerts that performs RCA through parallel investigation and then selects a mitigation by dynamically using enrichers, skills, and few-shot examples; it shows its CoT in a central UI, escalates to a human immediately when stuck and hands over its investigation history; all execution traces are stored in Spanner, where LLM-as-a-Judge automatically critiques them and files bugs, forming a self-improvement loop
Actus (mitigation safety verification/actuation agent): a unified control plane that separates the AI's reasoning engine from its execution engine. It provides standardized tool registration and planning, pre-execution safety checks such as dry-run and justification verification, automatic L3 → L2 downgrade when risk is detected, and an emergency "red button" that immediately halts all ongoing actions and revokes L3 permissions at once
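
The control-plane behavior described for Actus (registered tools, dry-run and justification checks, automatic downgrade, red button) can be modeled in a few lines. This is a sketch under stated assumptions, not Actus itself: `Tool`, `ControlPlane`, the return strings, and the rule that a failed dry run triggers the downgrade are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Tool:
    name: str
    dry_run: Callable[[dict], bool]  # True if the action looks safe
    execute: Callable[[dict], str]


@dataclass
class ControlPlane:
    level: int = 3        # 3 = act and notify, 2 = wait for approval
    halted: bool = False
    tools: dict = field(default_factory=dict)

    def register(self, tool: Tool) -> None:
        self.tools[tool.name] = tool

    def red_button(self) -> None:
        """Stop accepting actions and revoke L3 permissions."""
        self.halted = True
        self.level = 2

    def request(self, name: str, args: dict, justification: str,
                approved: bool = False) -> str:
        if self.halted:
            return "rejected: halted"
        tool = self.tools.get(name)
        if tool is None:
            return "rejected: unknown tool"  # deny by default
        if not justification.strip():
            return "rejected: missing justification"
        if not tool.dry_run(args):
            self.level = 2  # risk detected: downgrade L3 to L2
        if self.level < 3 and not approved:
            return "pending: human approval required"
        return tool.execute(args)
```

The reasoning engine only ever calls `request`; it never holds a reference to `execute`, which is the separation between reasoning and execution that the description above attributes to Actus.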

Technology underpinning AI-Ops

high-quality production data and metadata (telemetry, topology, past incidents, playbooks, SLOs, and so on)
RAG platform, domain-specific fine-tuning, AI-friendly tool interfaces (MCP, Production Agent server)
strong agent identity management to distinguish agents from humans (audit and non-repudiation)
agent-to-agent communication protocol (A2A), so that specialized agents can work together like microservices

The future of SRE: extending supervision into the agentic SDLC

With AI planning, writing, reviewing, and submitting code, the number of change lists (CLs) is heading toward a 4x to 10x increase. Line-by-line review has limits, and the result is reviewer fatigue and approval as a formality
Human oversight "shifts left" and moves up the abstraction ladder to focus on reviewing design, intent, and policy
An Independent Harness is made mandatory: the AI that generates code is kept strictly separate from the AI that tests and reviews it, to prevent cross-bias
Adaptive progressive rollout and machine-speed continuous production verification remove the traditional soak-time and canary bottlenecks
"Intervening Pull Request Problem": a simple rollback can also revert bug fixes and security patches that landed in between → solved with dynamic configuration, feature flags, and AI-assisted fix-forward (automatic generation and deployment of a targeted patch)
Conclusion: the SRE role is moving from operating systems to designing boundaries within which autonomous agents can innovate safely
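
The "Intervening Pull Request Problem" mentioned above is easy to make concrete. Assuming a linear history in which every change is tagged with a kind (a hypothetical simplification), the sketch lists what a rollback to just before the bad change would also undo, and prefers fix-forward whenever that list is not empty.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    id: str
    kind: str  # "feature", "bugfix", or "security"


def collateral_of_rollback(history: list[Change], bad_id: str) -> list[Change]:
    """Fixes and patches that landed after the bad change.

    Rolling back to the state before `bad_id` would revert these too.
    """
    ids = [c.id for c in history]
    start = ids.index(bad_id)
    return [c for c in history[start + 1:]
            if c.kind in {"bugfix", "security"}]


def choose_remediation(history: list[Change], bad_id: str) -> str:
    """Prefer fix-forward when a rollback would undo later fixes."""
    if collateral_of_rollback(history, bad_id):
        return "fix-forward"
    return "rollback"
```

As CL volume grows, the window between a bad change landing and being detected contains more intervening changes, so the rollback branch is taken less and less often.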