Pure C, CPU-only inference implementation for the Mistral Voxtral Realtime 4B speech recognition model

(github.com/antirez)

13 points by GN⁺ 2026-02-12

A C-only inference pipeline for the Mistral Voxtral Realtime 4B model that runs standalone, with no external dependencies
Supports Metal GPU acceleration (MPS) and BLAS backends (OpenBLAS/Accelerate), and its streaming API takes real-time audio input and emits tokens as they are produced
Memory-mapped BF16 weights, a sliding-window encoder, and a rolling KV cache keep memory usage bounded even for long audio input
Accepts audio from the microphone, a stdin pipe, or ffmpeg conversion, and offers an alternative-token display and a latency control option (-I)
Released under the MIT license; it runs about 2.5 times faster than real time on an Apple M3 Max, which makes a lightweight local speech recognition setup practical

Voxtral.c overview

A pure C inference engine for Mistral AI's Voxtral Realtime 4B model, with no dependencies beyond the C standard library
- The MPS backend gives the fastest inference, while the BLAS backend (OpenBLAS/Accelerate) covers CPU-only environments
- Full local inference works without a Python runtime, CUDA, or vLLM
A simple Python reference implementation is also provided in python_simple_implementation.py
- It needs only PyTorch, safetensors, soundfile, and soxr

Key features

Zero dependencies: runs from C alone with no external libraries (the optional BLAS backend links OpenBLAS on Linux)
Metal GPU acceleration: enabled automatically on Apple Silicon, with fused GPU operations and batched attention
Streaming output: generated tokens are written to stdout immediately
Streaming C API: audio can be fed incrementally and token strings retrieved in real time
Memory-mapped weights: the safetensors file is loaded directly via mmap and is usable immediately
Microphone input (macOS): includes automatic silence detection
Chunked encoder: processes audio in overlapping chunks to keep memory usage bounded
Rolling KV cache: automatically compacts the cache with an 8192-position sliding window, so audio of unlimited length can be processed
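
The rolling KV cache idea can be illustrated with a toy buffer. This is a minimal sketch, not voxtral.c's code: the `rolling_cache` type and its functions are hypothetical names, and each position holds a single float rather than real key/value tensors. When the buffer fills, the newest `window - 1` entries are moved to the front and the rest are dropped, so storage stays fixed however long the input runs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy cache: one float per position instead of K/V tensors. */
typedef struct {
    float *data;      /* caller-owned storage with `capacity` slots */
    size_t capacity;  /* physical slots available */
    size_t window;    /* how many recent positions must stay visible */
    size_t len;       /* slots currently in use */
    size_t base_pos;  /* absolute position stored in data[0] */
} rolling_cache;

void rolling_cache_init(rolling_cache *c, float *buf,
                        size_t capacity, size_t window) {
    assert(window > 0 && window <= capacity);
    c->data = buf;
    c->capacity = capacity;
    c->window = window;
    c->len = 0;
    c->base_pos = 0;
}

/* Append one entry. If the buffer is full, first compact it by keeping
 * only the newest window-1 entries, so that after the append exactly
 * `window` positions are held. */
void rolling_cache_push(rolling_cache *c, float value) {
    if (c->len == c->capacity) {
        size_t keep = c->window - 1;
        size_t drop = c->len - keep;
        memmove(c->data, c->data + drop, keep * sizeof(float));
        c->len = keep;
        c->base_pos += drop;  /* remember how far the window has slid */
    }
    c->data[c->len++] = value;
}
```

Making `capacity` larger than `window` means compaction runs only occasionally rather than on every append.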

Usage

Basic commands
- ./voxtral -d voxtral-model -i audio.wav : transcribe a file
- ./voxtral -d voxtral-model --from-mic : real-time transcription from the microphone (macOS)
- Other audio formats can be fed in through an ffmpeg pipe
Alternative token display
- The --alt <cutoff> option also shows candidates that sound similar
- The higher the cutoff, the more candidates are shown
Latency control (-I option)
- Sets the interval between encoder calls, in seconds
- Lower values (e.g. 0.5 s) = lower latency but more GPU load; higher values (e.g. 5 s) = more efficient processing
- The default is 2.0 s; 1.0 to 2.0 s is recommended for real-time streaming
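
The trade-off behind -I can be made concrete by counting encoder invocations. This is a rough sketch that assumes the encoder runs once per interval of buffered audio, plus once for any remainder; `encoder_calls` is a hypothetical helper, not part of voxtral.c.

```c
#include <assert.h>

/* Number of encoder invocations needed to cover `audio_ms` of audio
 * when the encoder is called once every `interval_ms` (rounded up so
 * a trailing partial interval still gets processed). */
unsigned encoder_calls(unsigned audio_ms, unsigned interval_ms) {
    assert(interval_ms > 0);
    return (audio_ms + interval_ms - 1) / interval_ms;
}
```

Under that assumption, one minute of audio takes 120 calls at 0.5 s, 30 calls at the 2.0 s default, and 12 calls at 5 s, while the wait before new text can appear grows roughly with the interval.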

C API structure

A streaming API built around vox_stream_t
- feed() : submit audio
- get() : retrieve tokens
- finish() : process the remaining audio
- flush() : force processing of buffered audio
vox_stream_set_alt() sets the number of alternative tokens
vox_transcribe() handles single-file batch processing
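
The feed / get / finish pattern described above can be sketched with a stand-in stream. Everything here is hypothetical: `toy_stream` and its functions are illustrative names, not the vox_stream_t API, and a "token" is just a counter emitted for every `TOY_CHUNK` samples instead of the output of a real encoder and decoder.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_CHUNK 4        /* samples consumed per emitted token */
#define TOY_MAX_TOKENS 64  /* capacity of the toy token queue */

typedef struct {
    size_t pending;              /* samples buffered but not yet processed */
    int tokens[TOY_MAX_TOKENS];  /* queued token ids */
    size_t head, tail;           /* read and write indexes into tokens */
    int next_id;                 /* id assigned to the next token */
} toy_stream;

static void toy_emit(toy_stream *s) {
    assert(s->tail < TOY_MAX_TOKENS);
    s->tokens[s->tail++] = s->next_id++;
}

void toy_stream_init(toy_stream *s) {
    s->pending = 0;
    s->head = s->tail = 0;
    s->next_id = 0;
}

/* feed: accept n samples and emit one token per complete chunk. */
void toy_stream_feed(toy_stream *s, const float *samples, size_t n) {
    (void)samples;  /* a real engine would run the model on these */
    s->pending += n;
    while (s->pending >= TOY_CHUNK) {
        s->pending -= TOY_CHUNK;
        toy_emit(s);
    }
}

/* get: copy up to max queued tokens into out, return how many. */
size_t toy_stream_get(toy_stream *s, int *out, size_t max) {
    size_t n = 0;
    while (n < max && s->head < s->tail) out[n++] = s->tokens[s->head++];
    return n;
}

/* finish: turn any leftover partial chunk into a final token. */
void toy_stream_finish(toy_stream *s) {
    if (s->pending > 0) {
        s->pending = 0;
        toy_emit(s);
    }
}
```

A caller would alternate feed and get while audio arrives, then call finish once and drain the queue with a last get.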

Model download and configuration

About 8.9GB of model weights must be downloaded from HuggingFace
- consolidated.safetensors (BF16 weights)
- tekken.json (tokenizer vocabulary)
- params.json (model configuration)
The model is licensed under Apache-2.0, the code under MIT
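
Because the weights are stored as BF16, any code that reads them needs a way to widen them. A BF16 value is the upper 16 bits of an IEEE-754 float32, so widening is a 16-bit shift. The sketch below shows that conversion in portable C; the function names are hypothetical, and whether voxtral.c converts weights this way is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Widen one BF16 value to float32 by placing its 16 bits in the
 * upper half of a 32-bit word and zero-filling the low mantissa bits. */
float bf16_to_f32(uint16_t h) {
    assert(sizeof(float) == sizeof(uint32_t));
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);  /* bit copy without aliasing issues */
    return f;
}

/* Convert a row of n BF16 values, e.g. one read from a mapped file. */
void bf16_row_to_f32(const uint16_t *src, float *dst, size_t n) {
    for (size_t i = 0; i < n; i++) dst[i] = bf16_to_f32(src[i]);
}
```

For instance, the bit pattern 0x3F80 widens to 1.0f and 0xC000 to -2.0f.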

Performance benchmarks

Measured on an Apple M3 Max (40-core GPU, 128GB RAM)
- MPS backend: encoder 284ms, decoder 23.5ms/step
- BLAS backend: encoder about 8 seconds, decoder 335ms/step
On 60 seconds of audio the average is 31.6ms/step, about 2.5 times faster than real time
The decoder runs the whole per-token computation in a single Metal command buffer call
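
The "2.5 times faster than real time" figure follows from the per-step cost if each decoder step covers a fixed slice of audio. The sketch below assumes 80 ms of audio per step; that value is not stated in this summary and is used here because it reproduces the quoted ratio. The helper names are hypothetical.

```c
#include <assert.h>

/* Ratio of audio time covered per step to compute time spent per step.
 * Values above 1.0 mean faster than real time. */
double realtime_speedup(double audio_ms_per_step, double compute_ms_per_step) {
    assert(compute_ms_per_step > 0.0);
    return audio_ms_per_step / compute_ms_per_step;
}

/* Total compute time, in seconds, to process `audio_seconds` of audio. */
double total_compute_seconds(double audio_seconds,
                             double audio_ms_per_step,
                             double compute_ms_per_step) {
    assert(audio_ms_per_step > 0.0);
    double steps = audio_seconds * 1000.0 / audio_ms_per_step;
    return steps * compute_ms_per_step / 1000.0;
}
```

With 80 ms per step and 31.6 ms of compute per step, the ratio is about 2.53, and 60 seconds of audio (750 steps) would take about 23.7 seconds to process.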

Model architecture

A streaming speech-to-text model with 4 billion parameters (4B) in total
- Audio encoder: 32-layer causal transformer, dimension 1280, 32 heads, window 750
- Adapter: Linear(5120→3072) → GELU → Linear(3072→3072)
- LLM decoder: 26-layer transformer (based on Ministral-3), dimension 3072, GQA (32 heads / 8 KV heads)
Tekken tokenizer, vocabulary size 131,072
Supported languages: English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian, Arabic, Russian, Chinese, Japanese, Korean
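
The decoder's GQA layout (32 query heads sharing 8 KV heads) means each KV head serves a group of 4 query heads. A minimal sketch of that mapping follows, using the common convention that consecutive query heads share a KV head; the function name is hypothetical, and whether voxtral.c orders heads this way is an assumption.

```c
#include <assert.h>

/* Map a query head index to the KV head it reads from under
 * grouped-query attention with consecutive grouping. */
int kv_head_for_query_head(int q_head, int n_heads, int n_kv_heads) {
    assert(n_kv_heads > 0 && n_heads % n_kv_heads == 0);
    assert(q_head >= 0 && q_head < n_heads);
    int group_size = n_heads / n_kv_heads;  /* 32 / 8 = 4 in this model */
    return q_head / group_size;
}
```

With 32 and 8 heads, query heads 0 to 3 read KV head 0, heads 4 to 7 read KV head 1, and head 31 reads KV head 7. The KV cache therefore stores 8 heads per layer rather than 32.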

Memory requirements

Model weights: 8.9GB (mmap, paged in on demand)
GPU cache: about 8.4GB (after BF16→F16 conversion)
KV cache: up to 1.8GB (capped by the sliding window)
Work buffers: about 200MB
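
The KV cache ceiling can be estimated from the architecture numbers. The sketch below multiplies keys and values across layers, KV heads, head dimension, and window positions. The layer count (26), KV head count (8), and window (8192) come from this summary; the head dimension of 128 and the element width are assumptions not stated here, and `kv_cache_bytes` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed to hold keys and values for every layer and KV head
 * across `positions` cached positions. */
uint64_t kv_cache_bytes(uint64_t layers, uint64_t kv_heads,
                        uint64_t head_dim, uint64_t positions,
                        uint64_t bytes_per_elem) {
    assert(bytes_per_elem > 0);
    /* factor 2 = one tensor for keys plus one for values */
    return 2u * layers * kv_heads * head_dim * positions * bytes_per_elem;
}
```

Under those assumptions, 4-byte elements give 1,744,830,464 bytes (about 1.74GB), which is close to the quoted maximum, while 2-byte elements would give half that.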

Build and platforms

macOS Apple Silicon: make mps (fastest)
macOS Intel / Linux (OpenBLAS): make blas
Ubuntu/Debian: sudo apt install libopenblas-dev
Fedora: sudo dnf install openblas-devel

License

Code: MIT
Model: Apache-2.0
As open source, anyone may modify and redistribute it
