Mercury 2: diffusion-आधारित अल्ट्रा-फास्ट inference LLM

(inceptionlabs.ai)

7 पॉइंट द्वारा GN⁺ 2026-02-26 | अभी कोई टिप्पणी नहीं है. | WhatsApp पर शेयर करें

diffusion model-आधारित parallel generation तरीके का उपयोग कर पारंपरिक sequential decoding LLM की speed limitations को पार करने वाला language model
एक साथ कई tokens को generate और revise करने वाली parallel refinement संरचना के साथ, 5x से अधिक तेज़ response speed हासिल
1,009 tokens/second processing speed, 128K context, JSON output, tool use capabilities आदि के साथ real-time applications के लिए optimized
coding assistance, agent loops, voice interfaces, search·RAG pipelines जैसे latency-sensitive environments में efficiency साबित
OpenAI API के साथ पूरी तरह compatible, मौजूदा infrastructure में बदलाव किए बिना तुरंत integrate किया जा सकता है

Mercury 2 का अवलोकन

Mercury 2 दुनिया का सबसे तेज़ inference language model है
- इसका लक्ष्य production AI environments में instant responsiveness प्रदान करना है
मौजूदा LLMs की bottleneck autoregressive sequential decoding (one token at a time) संरचना है
- इसके कारण iterative loop-आधारित AI workflows में latency जमा होती जाती है

Mercury 2 sequential decoding की जगह parallel refinement तरीका अपनाता है
- यह कई tokens को एक साथ generate करता है और कुछ ही steps में converge करता है
- यह “typewriter” नहीं बल्कि “editor” की तरह पूरे draft को बार-बार revise करता है
नतीजतन 5x से अधिक तेज़ generation speed और एक नया speed curve हासिल होता है
diffusion-आधारित inference latency और cost को कम रखते हुए high-quality reasoning संभव बनाता है

speed: NVIDIA Blackwell GPU पर 1,009 tokens/second
pricing: input के प्रति 1 million tokens पर $0.25, output के प्रति 1 million tokens पर $0.75
quality: प्रमुख speed-optimized models के साथ प्रतिस्पर्धी स्तर
features: tunable reasoning, 128K context, tool use, JSON schema-aligned output
latency optimization: p95 latency, high-concurrency environments में consistent responsiveness, stable throughput बरकरार
NVIDIA के एक प्रतिनिधि ने कहा कि Mercury 2 ने NVIDIA AI infrastructure के साथ मिलकर 1,000 tokens/second से अधिक हासिल किया

autocomplete, refactoring, code agents जैसे developer loops में instant responses प्रदान करता है
Zed के cofounder Max Brunsfeld ने “suggestions की speed जो सोच का हिस्सा लगे” पर ज़ोर दिया

multi-step reasoning calls की ज़रूरत वाले agent workflows में call latency कम करता है
Viant ने Mercury 2 का उपयोग कर real-time campaign optimization और autonomous advertising systems को मजबूत किया
Wispr Flow real-time conversation और transcript refinement में Mercury 2 की speed का मूल्यांकन कर रहा है
Skyvern ने कहा, “GPT-5.2 से कम-से-कम दो गुना तेज़”

voice interfaces की latency limits सबसे कठोर होती हैं
Happyverse AI ने Mercury 2 के साथ natural real-time conversational avatars बनाए
OpenCall ने low latency और high quality के साथ अधिक responsive voice agents बनाने की संभावना बताई

multi-search, re-ranking, summary process की cumulative latency घटाकर real-time inference संभव बनाता है
SearchBlox ने Mercury 2 के साथ सहयोग में real-time search AI लागू किया,
और customer support, risk, e-commerce जैसे विभिन्न क्षेत्रों में seconds-level intelligence प्रदान की

Mercury 2 तुरंत उपलब्ध है और OpenAI API के साथ पूरी तरह compatible है
मौजूदा systems में code changes के बिना integrate किया जा सकता है
enterprise evaluation के दौरान workload fit, performance validation, evaluation design support प्रदान किया जाता है
आधिकारिक वाक्य: “Mercury 2 is live. Welcome to diffusion.”