Kimi K2.6 जारी - ओपन सोर्स कोडिंग की प्रगति

(kimi.com)

5 पॉइंट द्वारा GN⁺ 9 일 전 | अभी कोई टिप्पणी नहीं है. | WhatsApp पर शेयर करें

लॉन्ग-होराइज़न कोडिंग और एजेंटिक कार्यों में प्रदर्शन बढ़ाने वाला मॉडल, कई भाषाओं और फ्रंटएंड·devops·परफॉर्मेंस ऑप्टिमाइज़ेशन समेत पूरे क्षेत्र में generalization performance को मजबूत करता है
जटिल इंजीनियरिंग कार्यों को persistent coding के रूप में संभालते हुए, हज़ारों tool calls और 12 घंटे से अधिक लगातार रन के बाद Zig inference optimization और exchange-core के पूर्ण पुनर्गठन में throughput में बड़ा सुधार दर्ज किया
साधारण prompts को पूर्ण फ्रंटएंड इंटरफेस में बदलता है और image·video generation tools का भी उपयोग करता है, साथ ही authentication और database कार्यों सहित सरल full-stack workflows को सपोर्ट करता है
Agent Swarm आर्किटेक्चर को 300 sub-agents और 4,000 coordination steps तक स्केल कर search·research·document writing·file generation कार्यों को parallel में चलाता है, और PDF·slides·spreadsheets·Word documents के format और style को reusable skills में बदलता है
proactive agents और Claw Groups तक दायरा बढ़ाकर लंबे समय तक autonomous operation, multi-agent collaboration, और task reallocation करता है, तथा benchmarks और enterprise beta tests में coding·tool calling·long-run reliability में सुधार की पुष्टि हुई

लॉन्ग-होराइज़न कोडिंग

लॉन्ग-होराइज़न कोडिंग कार्यों में प्रदर्शन सुधार की पुष्टि, Rust·Go·Python जैसी कई भाषाओं और फ्रंटएंड·devops·परफॉर्मेंस ऑप्टिमाइज़ेशन जैसे कई कार्यों में generalization performance को मजबूत किया गया
- आंतरिक coding benchmark Kimi Code Bench में जटिल end-to-end कार्यों के पूरे दायरे में Kimi K2.5 की तुलना में बड़ा सुधार दर्ज किया गया
जटिल इंजीनियरिंग कार्यों में persistent coding का निष्पादन
- Mac लोकल वातावरण में Qwen3.5-0.8B मॉडल का डाउनलोड और deployment सफल
- अपेक्षाकृत niche भाषा Zig में model inference को implement और optimize कर, out-of-distribution generalization performance साबित की
- 4,000 से अधिक tool calls, 12 घंटे से अधिक लगातार रन, और 14 iterations के बाद throughput को लगभग 15 tokens/sec से बढ़ाकर लगभग 193 tokens/sec तक पहुंचाया
- अंतिम गति LM Studio की तुलना में लगभग 20% तेज़
8 साल पुराने ओपन सोर्स वित्तीय matching engine exchange-core का पूर्ण पुनर्गठन किया गया
- 13 घंटे के रन के दौरान 12 optimization strategies दोहराईं, और 1,000 से अधिक tool calls के साथ 4,000 से अधिक lines of code को सटीक रूप से संशोधित किया
- CPU और memory allocation के flame graph विश्लेषण से छिपे bottlenecks की पहचान की
- core thread topology को 4ME+2RE से 2ME+1RE में पुनर्गठित किया
- पहले से परफॉर्मेंस सीमा के करीब पहुंच चुके engine में median throughput 185% बढ़ा (0.43→1.24 MT/s), और performance throughput 133% बढ़ा (1.23→2.86 MT/s)
beta test की enterprise evaluations में भी लॉन्ग-होराइज़न coding reliability और tool calling quality पर कई सकारात्मक प्रतिक्रियाएँ दर्ज हुईं
- Baseten ने leading closed models के समान coding task performance, third-party frameworks की समझ पर आधारित मजबूत tool calling quality, और जटिल व लंबे इंजीनियरिंग कार्यों के लिए उपयुक्तता का उल्लेख किया
- Blackbox ने long-horizon·agentic coding workflows में ओपन सोर्स मॉडल के लिए नया मानक, जटिल multi-step task handling, उच्च code quality, लंबी sessions की stability, और non-obvious bugs पकड़ने की क्षमता का उल्लेख किया
- CodeBuddy ने K2.5 की तुलना में code generation accuracy में 12% वृद्धि, long-context stability में 18% सुधार, और tool calling success rate 96.60% दर्ज की
- Factory ने अपने benchmark के साथ side-by-side evaluation में 15% improvement की रिपोर्ट दी
- Fireworks ने long-horizon reliability और instruction following ability को सबसे बड़ा सुधार बिंदु बताया
- Hermes Agent ने tool calling और agent loops की घनिष्ठता, coding improvements, और creative scope के विस्तार का उल्लेख किया
- Kilo ने कम लागत पर SOTA-स्तरीय प्रदर्शन और पूरे codebase में long-context कार्यों की ताकत का उल्लेख किया
- Ollama ने coding और agent tools के लिए उपयुक्तता, लंबी multi-step sessions की stability, और मौजूदा integrations के साथ immediate compatibility का उल्लेख किया
- OpenCode ने task decomposition और tool calling की stability, iteration overhead में कमी, और end-to-end experience की reliability का उल्लेख किया
- Qoder ने tool calling और model call frequency में वृद्धि, task execution के दौरान अधिक proactiveness, और user interruptions व latency में कमी का उल्लेख किया
- Vercel ने Next.js benchmark में 50% से अधिक सुधार, platform पर top-tier performance, और cost efficiency के आधार पर agentic coding व frontend generation के लिए उपयुक्तता का उल्लेख किया

कोडिंग-केंद्रित डिज़ाइन

मजबूत coding capability के आधार पर साधारण prompts को पूर्ण फ्रंटएंड इंटरफेस में बदला जा सकता है
- aesthetic hero section, interactive elements, scroll-triggered effects और समृद्ध animations सहित structured layouts का निर्माण
image·video generation tools के उपयोग की क्षमता के आधार पर visually consistent assets के निर्माण का समर्थन
- इससे उच्च गुणवत्ता और अधिक आकर्षक hero section बनाने में योगदान मिलता है
static frontend से आगे बढ़कर सरल full-stack workflows तक विस्तार
- authentication, user interaction, और database कार्य शामिल
- transaction records या session management जैसे lightweight use cases का समर्थन
आंतरिक Kimi Design Bench का निर्माण
- Visual Input Tasks, Landing Page Construction, Full-Stack Application Development, General Creative Programming इन चार श्रेणियों से बना
- Google AI Studio की तुलना में कई श्रेणियों में promising results और अच्छा performance दर्ज किया गया
K2.6 Agent के उदाहरण outputs प्रदान किए गए
- एक prompt और पहले से configured harness·tools का उपयोग कर परिणाम तैयार किए गए
- aesthetics के लिहाज़ से समृद्ध interaction वाले सुंदर frontend designs शामिल
- functionality के लिहाज़ से built-in database और authentication शामिल
- tool usage के लिहाज़ से image·video generation tools का उपयोग कर polished websites शामिल

उन्नत Agent Swarm

केवल vertical scaling नहीं, बल्कि horizontal scaling पर केंद्रित आर्किटेक्चर अपनाया गया
- Agent Swarm कार्यों को dynamic तरीके से heterogeneous subtasks में तोड़ता है, और स्वयं बनाए गए domain-specific agents उन्हें parallel में execute करते हैं
K2.5 Agent Swarm research preview के आधार पर, Kimi K2.6 Agent Swarm में अनुभव में गुणात्मक छलांग प्रस्तुत की गई
- broad search और deep research का संयोजन
- large-scale document analysis और long-form writing का संयोजन
- कई formats में content generation को parallel में चलाना
- एक ही autonomous run के भीतर documents·websites·slides·spreadsheets को कवर करने वाले end-to-end outputs प्रदान करना
आर्किटेक्चर की horizontal scaling capacity बढ़ाई गई
- 300 sub-agents एक साथ 4,000 coordination steps चलाते हैं
- K2.5 के 100 sub-agents और 1,500 steps की तुलना में बड़ा विस्तार
- बड़े पैमाने के parallelization से end-to-end latency कम हुई, output quality बेहतर हुई, और Agent Swarm की operational सीमा विस्तृत हुई
PDF·spreadsheets·slides·Word documents जैसी उच्च-गुणवत्ता वाली files को Skills में बदला जा सकता है
- document की structure और style characteristics को capture और preserve किया जाता है
- बाद के कार्यों में वही quality और format फिर से तैयार किया जा सकता है
कई example tasks प्रस्तुत किए गए
- 100 global semiconductor assets पर 5 quant strategies को design और execute किया गया, McKinsey-style PPT को reusable skill में बदला गया, और detailed modeling spreadsheet तथा complete executive presentation materials प्रदान किए गए
- समृद्ध visual data वाले उच्च-गुणवत्ता के astrophysics paper को reusable academic skill में बदला गया, reasoning flow और visualization methods निकाले गए, और 40-पेज·7,000-शब्द का research paper, 20,000 से अधिक items वाला structured dataset, तथा 14 astronomy-grade charts तैयार किए गए
- अपलोड किए गए resume के आधार पर 100 sub-agents बनाकर California में संबंधित 100 jobs match की गईं, और structured opportunity dataset तथा 100 customized resumes प्रदान किए गए
- Google Maps पर Los Angeles में official website न रखने वाली 30 retail stores की पहचान की गई, और हर store के लिए conversion-focused landing page तैयार किया गया

proactive agents

OpenClaw और Hermes जैसे autonomous·proactive agents में मजबूत प्रदर्शन दर्ज किया गया
- कई applications में 24x7 continuous operation वाले उपयोग प्रकारों का समर्थन
साधारण chat-based interaction से अलग workflows को सपोर्ट करता है
- scheduling, code execution, और cross-platform task orchestration को persistent background agents के रूप में चलाने की आवश्यकता होती है
RL infrastructure team ने K2.6-based agent का उपयोग कर 5 दिनों तक autonomous operation चलाया
- monitoring, incident response, और system operations संभाले
- persistent context बनाए रखना, multi-threaded tasks संभालना, और alert generation से resolution तक पूरे lifecycle का निष्पादन साबित किया
- sensitive information हटाने के बाद के task logs के अस्तित्व का उल्लेख किया गया
वास्तविक वातावरण में reliability improvements को मापा गया
- अधिक सटीक API interpretation
- अधिक स्थिर long-running execution performance
- लंबी research tasks के दौरान बेहतर safety awareness
आंतरिक evaluation suite Claw Bench से प्रदर्शन सुधार को quantify किया गया
- Coding Tasks, IM Ecosystem Integration, Information Research & Analysis, Scheduled Task Management, Memory Utilization इन पाँच क्षेत्रों को शामिल किया गया
- सभी metrics पर Kimi K2.5 की तुलना में task completion rate और tool calling accuracy में बड़ा सुधार हुआ
- खासकर उन workflows में मजबूत सुधार दर्ज हुआ जिनमें मानव निगरानी के बिना लगातार autonomous operation की आवश्यकता होती है

Bring Your Own Agents

मजबूत orchestration capability के आधार पर proactive agents को Claw Groups तक विस्तारित किया गया
- Agent Swarm आर्किटेक्चर के एक नए implementation form के रूप में research preview प्रदान किया गया
खुले और heterogeneous ecosystem को अपनाया गया
- कई agents और मनुष्य वास्तविक collaborators के रूप में साथ काम करते हैं
- उपयोगकर्ता किसी भी device से, किसी भी model पर चल रहे agent को onboard कर सकते हैं
- हर agent के पास अपना toolset, skill, और persistent memory context होता है
- local laptops, mobile devices, cloud instances जैसे विभिन्न environments के agents एक shared operating space में स्वाभाविक रूप से integrate होते हैं
केंद्र में Kimi K2.6 adaptive coordinator की भूमिका निभाता है
- हर agent की skill profile और available tools के आधार पर tasks को dynamic तरीके से assign करता है
- उपयुक्त capabilities के अनुसार tasks को optimize करता है
- agent failure या stagnation होने पर उसे detect कर task reallocation या subtask regeneration करता है
- शुरुआत से validation और completion तक outputs के पूरे lifecycle को सक्रिय रूप से manage करता है
Claw Groups के अपने usage cases भी शामिल हैं
- human-agent workflows को वास्तविक रूप से refine करने के लिए internally agent marketing team का उपयोग किया गया
- Demo Makers, Benchmark Makers, Social Media Agents, Video Makers जैसे specialized agents साथ मिलकर काम करते हैं
- end-to-end content production और launch campaign संचालन करते हैं
- K2.6 intermediate results को share करने और ideas को consistent finished outputs में बदलने का coordination करता है
मानव और AI के रिश्ते को question answering या simple task assignment से आगे बढ़ाकर वास्तविक collaborative partnership तक विस्तारित किया गया
- सहयोगी सिस्टम के भीतर "my agent", "your agent", "our team" की सीमाएँ स्वाभाविक रूप से धुंधली हो जाने वाले भविष्य की दिशा प्रस्तुत की गई

benchmark तालिका

Agentic क्षेत्र के प्रमुख आँकड़े
- HLE-Full w/ tools 54.0, GPT-5.4 52.1, Claude Opus 4.6 53.0, Gemini 3.1 Pro 51.4, Kimi K2.5 50.2
- BrowseComp 83.2, BrowseComp(agent swarm) 86.3, Kimi K2.5 क्रमशः 74.9, 78.4
- DeepSearchQA f1-score 92.5, accuracy 83.0
- WideSearch item-f1 80.8
- Toolathlon 50.0, Kimi K2.5 27.8
- MCPMark 55.9
- Claw Eval pass^3 62.3, pass@3 80.9
- APEX-Agents 27.9
- OSWorld-Verified 73.1
Coding क्षेत्र के प्रमुख आँकड़े
- Terminal-Bench 2.0 (Terminus-2) 66.7
- SWE-Bench Pro 58.6
- SWE-Bench Multilingual 76.7
- SWE-Bench Verified 80.2
- SciCode 52.2
- OJBench (python) 60.6
- LiveCodeBench (v6) 89.6
Reasoning & Knowledge क्षेत्र के प्रमुख आँकड़े
- HLE-Full 34.7
- AIME 2026 96.4
- HMMT 2026 (Feb) 92.7
- IMO-AnswerBench 86.0
- GPQA-Diamond 90.5
Vision क्षेत्र के प्रमुख आँकड़े
- MMMU-Pro 79.4, MMMU-Pro w/ python 80.1
- CharXiv (RQ) 80.4, CharXiv (RQ) w/ python 86.7
- MathVision 87.4, MathVision w/ python 93.2
- BabyVision 39.8, BabyVision w/ python 68.5
- V* w/ python 96.9
आधिकारिक Kimi-K2.6 benchmark results को reproduce करने के लिए official API के उपयोग की सिफारिश की गई
- third-party providers चुनने के लिए Kimi Vendor Verifier (KVV) का संदर्भ भी दिया गया

फुटनोट्स

सामान्य test details
- Kimi K2.6 और Kimi K2.5 के परिणाम thinking mode enabled, Claude Opus 4.6 के max effort, GPT-5.4 के xhigh reasoning effort, और Gemini 3.1 Pro के high thinking level शर्तों पर रिपोर्ट किए गए
- अलग से उल्लेख न होने पर Kimi K2.6 के experiments temperature 1.0, top-p 1.0, और 262,144 tokens context length पर किए गए
- जिन benchmarks के public scores उपलब्ध नहीं थे, उन्हें Kimi K2.6 जैसी ही शर्तों पर दोबारा evaluate किया गया और asterisk(*) से चिह्नित किया गया
- जिन परिणामों पर asterisk नहीं है, वे official reports से उद्धृत हैं
reasoning benchmarks
- GPT-5.4 और Claude 4.6 के IMO-AnswerBench scores z.ai blog से लिए गए
- Humanity's Last Exam (HLE) और अन्य reasoning tasks का evaluation अधिकतम 98,304 tokens generation length पर किया गया
- default reported value HLE full set है
- text-only subset में Kimi K2.6 ने tools के बिना 36.4% accuracy, और tools के साथ 55.5% accuracy दर्ज की
tool-augmented और agentic tasks
- HLE with tools, BrowseComp, DeepSearchQA, WideSearch में search, code-interpreter, web-browsing tools लगाए गए
- HLE-Full with tools के लिए अधिकतम generation length 262,144 tokens, और per-step limit 49,152 tokens थी
- context window threshold पार होने पर केवल सबसे हाल की tool-related message rounds को बनाए रखने वाली simple context management strategy का उपयोग किया गया
- BrowseComp scores Kimi K2.5 और DeepSeek-V3.2 के समान discard-all strategy context management से प्राप्त किए गए
- DeepSearchQA में Kimi K2.6 test पर context management लागू नहीं किया गया, और supported context length से अधिक tasks को सीधे failure के रूप में गिना गया
- Claude Opus 4.6, GPT-5.4, और Gemini 3.1 Pro के DeepSearchQA scores Claude Opus 4.7 System Card से उद्धृत हैं
- WideSearch के परिणाम hide tool result context management setting के साथ रिपोर्ट किए गए
- test system prompt Kimi K2.5 technical report के समान था
- Claw Eval version 1.1, max-tokens-per-step 16384 पर चलाया गया
- APEX-Agents में public 480 tasks में से 452 tasks का evaluation किया गया
  - Artificial Analysis के समान Investment Banking Worlds 244, 246 को बाहर रखा गया
  - बाहर रखने का कारण external runtime dependency था
coding tasks
- Terminal-Bench 2.0 score base agent framework Terminus-2 और दिए गए JSON parser का उपयोग कर preserve thinking mode में प्राप्त किया गया
- SWE-Bench series evaluations (Verified, Multilingual, Pro सहित) के लिए SWE-agent पर आधारित संशोधित in-house evaluation framework का उपयोग किया गया
- इस framework का tool configuration bash tool, createfile tool, insert tool, view tool, strreplace tool, submit tool के न्यूनतम सेट पर आधारित था
- coding tasks के सभी reported scores 10 स्वतंत्र runs के average हैं
vision benchmarks
- max-tokens 98,304, 3 runs का average(avg@3) लागू किया गया
- Python tool enabled setting में max-tokens-per-step 65,536, max-steps 50 के साथ multi-step reasoning चलाई गई
- MMMU-Pro official protocol का पालन करता है, input order बनाए रखता है, और images को आगे रखता है

Kimi K2.6 जारी - ओपन सोर्स कोडिंग की प्रगति

लॉन्ग-होराइज़न कोडिंग

कोडिंग-केंद्रित डिज़ाइन

उन्नत Agent Swarm

proactive agents

Bring Your Own Agents

benchmark तालिका

फुटनोट्स

सामान्य test details

reasoning benchmarks

tool-augmented और agentic tasks

coding tasks

vision benchmarks

संबंधित पढ़ाई

अभी कोई टिप्पणी नहीं है.