voyage-multimodal-3: an all-in-one embedding model for text, images, and screenshots

(blog.voyageai.com)

4 points by GN⁺ 2024-11-18 | 1 comment

voyage-multimodal-3, released by Voyage AI, is a RAG and semantic search model that lets a single embedding model search knowledge bases mixing text and images
Its key feature is that data where layout information matters, such as PDFs, slides, tables, figures, and document screenshots, can be vectorized without document parsing
To reduce the mixed-modality search limitations of CLIP-family models, it processes text and visual information in a single Transformer encoder and preserves the contextual relationships within mixed-modality input
On 20 multimodal search datasets it showed on average 19.63% higher search accuracy than the next best-performing multimodal embedding model, and it also outperformed OpenAI v3 large on 34 text search datasets
As the screenshot ratio rose, the quality of CLIP-based models declined, while voyage-multimodal-3 showed little performance degradation across all ranges, which makes it practical for search pipelines built on screen captures

What use cases `voyage-multimodal-3` targets

voyage-multimodal-3 is Voyage AI's first multimodal embedding model, aimed at RAG and semantic search over knowledge bases rich in visual material and text
Its inputs are text and content-heavy images; the main examples are
- Text screenshots
- Figures and tables
- PDF screenshots
- Slide decks
- Other document images
The generated vectors reflect not only text semantics but also visual features such as font size, text position, and margins
Heuristic-based parsing can run into accuracy issues on documents with complex layouts or interleaved figures and photos, so the model converts the original screen directly into a search vector
Functionality examples can be seen in the sample notebook

An embedding approach different from the CLIP family

Existing multimodal embedding models such as Amazon Titan Multimodal G1, Google Vertex AI multimodal, and Cohere multimodal v3 use an architecture based on OpenAI CLIP
The CLIP-family architecture processes different modalities with independent networks
- Images are vectorized through a vision tower
- Text is vectorized through a text tower
- With this architecture it is difficult to process inputs that mix text and images together
voyage-multimodal-3 vectorizes both modalities directly inside a single Transformer encoder
- Text and visual features are handled as parts of a unified representation rather than as separate components
- This resembles applying the latest vision-language model architecture to vectorization rather than generation
As a result, for mixed text and images, document screenshots, complex PDFs, and annotated images, the contextual relationships between visual and textual information can be captured together in the vectors

The difference seen in search over mixed-in screenshots

In CLIP-like models, the modality gap can reduce performance in mixed-modality search
In the example, the closest vector to the text snippet “I address you, members of the Seventy-Seventh Congress…” was another piece of text rather than the corresponding screenshot
This phenomenon creates a search bias in which a text vector ends up closer to an unrelated item of the same modality than to a related image
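The bias described above can be illustrated with a toy example. The three vectors below are invented for illustration and do not come from any real model: the first two axes stand for content and the third stands for modality, which is an assumption made only to show how a modality offset can dominate cosine similarity.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-d embeddings: axes 0 and 1 encode content,
# axis 2 encodes modality (positive = text, negative = image).
query_text = [1.0, 0.0, 1.0]      # text query
related_image = [1.0, 0.0, -1.0]  # screenshot with the same content
unrelated_text = [0.0, 1.0, 1.0]  # different content, same modality

sim_related = cosine(query_text, related_image)
sim_unrelated = cosine(query_text, unrelated_text)

# The unrelated text outranks the related screenshot purely because
# it shares the modality component with the query.
print(f"related image:  {sim_related:.2f}")
print(f"unrelated text: {sim_unrelated:.2f}")
```

In this toy setup the shared modality axis alone is enough to put the unrelated text ahead, which is the kind of same-modality skew the example in the post describes.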
Voyage AI set up a quantitative experiment using the PyTorch documentation
- Two document sets with the same content were built separately, one as plain text strings and one as screenshots
- A mixed-modality dataset was built by combining a subset of the text-based documents with screenshots of the remaining documents
- The screenshot ratio was varied from 0% to 100%
- Each model retrieved the top 10 results by cosine similarity, and evaluation used NDCG@10
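The dataset construction in the steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the function name `build_mixed_corpus`, the seeded random choice of which documents become screenshots, and the placeholder file names are all hypothetical, since the post does not say how the subset was selected.

```python
import random

def build_mixed_corpus(text_docs, screenshot_docs, screenshot_ratio, seed=0):
    """Return one item per document: the screenshot version for a
    `screenshot_ratio` fraction of documents, the text version for the rest.

    text_docs[i] and screenshot_docs[i] must describe the same document.
    """
    if len(text_docs) != len(screenshot_docs):
        raise ValueError("both lists must describe the same documents")
    if not 0.0 <= screenshot_ratio <= 1.0:
        raise ValueError("screenshot_ratio must be between 0 and 1")

    n = len(text_docs)
    n_screenshots = round(n * screenshot_ratio)
    # Assumption: documents shown as screenshots are picked at random.
    as_screenshot = set(random.Random(seed).sample(range(n), n_screenshots))

    return [
        ("screenshot", screenshot_docs[i]) if i in as_screenshot
        else ("text", text_docs[i])
        for i in range(n)
    ]

# Placeholder data standing in for the two renderings of each document.
texts = [f"doc {i} as plain text" for i in range(10)]
shots = [f"doc_{i}.png" for i in range(10)]

for ratio in (0.0, 0.3, 1.0):
    corpus = build_mixed_corpus(texts, shots, ratio)
    n_shots = sum(kind == "screenshot" for kind, _ in corpus)
    print(f"ratio={ratio:.1f}: {n_shots} screenshots, {len(corpus) - n_shots} texts")
```

Every document appears exactly once in each corpus, so only the modality mix changes between ratios while the content stays fixed.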
For CLIP-based models, search quality fell as the screenshot ratio rose up to 90%, and performance remained low even when all text was converted to images
voyage-multimodal-3 showed the highest performance at every ratio, with almost no overall performance degradation
This result shows both the ability to capture the semantic information inside screenshots in the vector and the robustness of processing all input modalities with the same backbone
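For readers unfamiliar with the metric used in this experiment, here is a minimal sketch of cosine-similarity retrieval followed by NDCG@10. It assumes the common linear-gain form of DCG; the post does not say which variant was used, and the document ids and vectors are made up.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec, doc_vecs, k=10):
    """Return up to k document ids ranked by cosine similarity, best first."""
    ranked = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
    return ranked[:k]

def dcg(gains):
    """Discounted cumulative gain with linear gain and log2 position discount."""
    return sum(g / math.log2(position + 2) for position, g in enumerate(gains))

def ndcg_at_k(ranked_ids, relevance, k=10):
    """relevance maps document id -> graded relevance; missing ids count as 0."""
    gains = [relevance.get(d, 0) for d in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Made-up 2-d vectors for three documents and one query.
doc_vecs = {"a": [1.0, 0.0], "b": [0.8, 0.6], "c": [0.0, 1.0]}
query = [1.0, 0.0]

ranking = top_k(query, doc_vecs, k=10)
score = ndcg_at_k(ranking, {"b": 1}, k=10)  # only "b" is relevant
print(ranking, round(score, 4))
```

Because the single relevant document lands in second place, the score is discounted by the log of its position instead of dropping to zero, which is why NDCG is more informative than a plain hit-or-miss measure here.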

Evaluation datasets and comparison targets

Multimodal evaluation covered 3 tasks and 20 datasets in total
- Table/figure search: charxiv, mmtab-test, ChartQA, Chartve, FintabnetQA, PlotQA
- Document screenshot search: Energy, Healthcare Industry, Artificial Intelligence, Government Report, InfoVQA, DocVQA, ArxivQA, TabFQuad, TAT-DQA, and Shift Project from the ViDoRe benchmark
- Text-photo search: meme-cap, mm-imdb, winoground, docci
Standard text search evaluation covered 34 datasets across 6 domains: law, finance, conversation, code, web, and technology
In all datasets the queries are text, and the documents can be figures, photos, text, document screenshots, or combinations of these
The comparison models for the multimodal tasks are
- OpenAI CLIP large (clip-vit-large-patch14-336)
- Amazon Titan Multimodal Embeddings G1 (amazon.titan-embed-image-v1)
- Cohere multimodal v3 (embed-multimodal-v3.0)
- SigLIP So400M (siglip-so400m-patch14-384)
- ColQwen2 v0.1 (colqwen-v0.1)
For standard text search, the comparison was with OpenAI v3 large (text-embedding-3-large), Cohere multimodal/English v3, and voyage-3
For pure text, Cohere multimodal v3 uses Cohere English v3 (embed-english-v3.0) as its text tower, so to reduce confusion only the label “Cohere multimodal v3” is used in the charts

Search accuracy results

Across the 20 multimodal search datasets, voyage-multimodal-3 recorded on average 19.63% higher search accuracy than the next best-performing multimodal embedding model
In table/figure search it outperformed OpenAI CLIP large, Amazon Titan Multimodal G1, Cohere multimodal v3, SigLIP So400M, and ColQwen2 v0.1 by 41.44%, 45.00%, 43.37%, 20.66%, and 6.14%, respectively
In document screenshot search it outperformed the same comparison models by 26.54%, 37.68%, 25.84%, 35.62%, and 0.98%, respectively
In text-photo search it outperformed the same comparison models by 6.55%, 5.16%, 5.86%, 3.42%, and 10.34%, respectively
In standard text search it performed 5.13% better than OpenAI v3 large and 13.70% better than Cohere multimodal/English v3
Its accuracy on pure text document search was 0.05% higher than voyage-3, so the two models are at essentially the same level
The full evaluation results are public in a spreadsheet

Getting started and available resources

voyage-multimodal-3 is available as of its release day
The first 200 million tokens are free
Getting-started resources are provided in the sample notebook and the docs
Users interested in fine-tuning embedding models can reach out at contact@voyageai.com
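As a rough idea of what a first call might look like, here is a minimal sketch using the `voyageai` Python client's `multimodal_embed` method. It is not taken from the post: the wrapper name `embed_documents` is hypothetical, the client is assumed to be installed separately, and an API key is assumed to be available in the `VOYAGE_API_KEY` environment variable. The sample notebook and docs remain the authoritative reference.

```python
MODEL_NAME = "voyage-multimodal-3"

def embed_documents(documents):
    """Embed mixed text/image documents and return one vector per document.

    Each document is a list that interleaves strings and PIL.Image.Image
    objects, for example ["Quarterly results", <image of a slide>].
    """
    # Imported lazily so this sketch can be loaded without the
    # third-party package installed.
    import voyageai

    client = voyageai.Client()  # assumes VOYAGE_API_KEY is set
    result = client.multimodal_embed(
        inputs=documents,
        model=MODEL_NAME,
        input_type="document",  # "query" would be used for search queries
    )
    return result.embeddings
```

Calling it requires network access and a valid key. A call such as `embed_documents([["Caption text", Image.open("slide.png")]])`, where `slide.png` is a placeholder file name, would return a list holding one vector for that document.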

1 comment

GN⁺ 2024-11-18

Hacker News opinions

The main observation is simple and intuitive: all CLIP-family models perform poorly in mixed-modality search because of the modality gap
For example, the closest vector to the sentence “I address you, members of the Seventy-Seventh Congress…” turns out to be another piece of text, not the corresponding screenshot. So in the embedding space a text vector is closer to unrelated text than to the related image, and search results skew toward the same modality
- This quote is important, but on its own it does not make clear whether they claim to solve this problem. It seems the new model, voyage-multimodal-3, identifies concepts that are linked across modalities
  If there were a latent space that could cluster the same idea whether it is expressed visually or written in text, that would be quite impressive. However, I think this benchmark views multimodal embeddings quite narrowly. It is convenient for the embedding of an image of text to be close to the embedding of that text, but it is hard to say that this extends to the relevance between different representations, such as the word “rabbit” and a photo of a rabbit. If the goal is as narrow as indexing document images, other techniques can work quite well too. This seems like a good opportunity for a new benchmark dataset for multimodal concept representation that goes beyond the text medium
- This problem may be addressed by multimodal mixup, which keeps a large latent-space gap from forming between the two modalities: https://arxiv.org/abs/2203.03897
If you are interested in this area, our project can also be counted among the candidates; it uses ColPali transparently under the hood
https://github.com/tjmlabs/ColiVara
The main benchmark here is the Vidore leaderboard, and I would like to see where VoyageAI stands compared to more open, open-source implementations
I feel like I'm missing something. An LLM that is “natively multimodal” should include multimodal embeddings in some form, right?
For example, Google's Gemini blog post explains that in older multimodal models the components for different modalities were trained separately and joined afterwards, whereas Gemini was pretrained on multiple modalities from the start and then fine-tuned on additional multimodal data. So their claim is that it natively understands and reasons about every kind of input from the beginning
- LLMs like Gemini, and causal language models more broadly, are trained with next-token prediction, so a vector obtained by pooling the output token embeddings is not as useful for RAG or semantic search as a vector from a real embedding model
  It is important to understand the distinction here: token embeddings and the vectors/embeddings output by an embedding model are related but different concepts. There are many token embeddings, one per token, which become contextualized as they pass through the transformer, whereas an embedding model outputs a single vector for one input data item such as a long text, a photo, or a document screenshot
- LLM embeddings hold a superposed representation of many concepts, which lets the next token be predicted, but they do not outperform embedding models pretrained with contrastive learning
- If the other answers were not clear, you can think of an “embedding” here as something like “a list produced by some layer of my AI model”
  Strictly speaking it is a somewhat more specific concept, but this is fine in this context. LLMs, including multimodal LLMs, have embeddings too, but those are embeddings trained through text generation, not embeddings trained to find similar documents
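The distinction drawn in the replies above, many per-token vectors versus one vector per input, can be made concrete with a small sketch. Mean pooling is used here only as one common way to collapse token vectors; the numbers are made up, and nothing in the thread says this is how any particular model produces its embeddings.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into a single vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]

# Hypothetical contextualized token embeddings for a 3-token input.
token_vectors = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]

pooled = mean_pool(token_vectors)

print(len(token_vectors), "token vectors ->", 1, "pooled vector:", pooled)
```

The pooled vector has the right shape for similarity search, but as the replies point out, having the right shape says nothing about whether it was trained to place similar documents close together.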
Looks quite impressive. I'm interested in hearing critical perspectives on the evaluation presented
I would also like to know how it does on non-English text. Is it correct to understand that, like other commercial models, this one is API-only?
- Correct, Voyage models are API-only
  I had written something about multilingual support, but it was wrong, so I removed it. For reference, Voyage also has separate law, code, and finance models. See [1]
  The results are really interesting, by the way
  [1]: https://docs.voyageai.com/docs/embeddings
It is disappointing that the model is commercial, proprietary, and API-only
- Is it sad that employees have to be paid a salary?
If this is an API-only model, I'll pass. Congratulations all the same
- I agree with both points. Obviously there can be clear reasons for focusing only on an API beyond charging people money, but the very fact that no other option is offered would be enough for me personally not to consider it
Looks quite interesting. I've been working on AnyModal, a framework for integrating multiple data types such as images and audio into LLMs: https://github.com/ritabratamaiti/AnyModal
voyage-multimodal-3 looks quite promising for multimodal LLM development, but I don't know whether that is the intended use case
In the traditional Python API, the Voyage engine tokenizes blocks of text and outputs strings. This model appears to do the same thing by vectorizing images into the space
Words like you or apple become a single token, while more complex terms like pikachu may be split up as pik-a-chu
[1]: https://docs.voyageai.com/docs/tokenization
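The splitting behaviour mentioned in the comment above can be mimicked with a toy subword tokenizer. This is only an illustration: the greedy longest-match rule and the tiny vocabulary are assumptions made here, and real tokenizers, including whichever one Voyage uses, are trained on data and may split words differently.

```python
def greedy_tokenize(word, vocab):
    """Split `word` into the longest vocabulary pieces, scanning left to right.

    Characters not covered by the vocabulary fall back to single-character tokens.
    """
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            tokens.append(word[start])
            start += 1
    return tokens

# Hypothetical vocabulary chosen to reproduce the example in the comment.
TOY_VOCAB = {"you", "apple", "pik", "a", "chu"}

for word in ("you", "apple", "pikachu"):
    print(word, "->", greedy_tokenize(word, TOY_VOCAB))
```

Common words that exist whole in the vocabulary stay as one token, while a rarer word is assembled from several smaller pieces, which is also why token counts can differ noticeably from word counts.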
It is an interesting way of looking at multimodal embeddings. As the input shifts gradually from one modality to the other, they measure how performance changes with the ratio
https://i0.wp.com/blog.voyageai.com/wp-content/uploads/2024/...
In the Colab, dot product values of 0.428 and 0.498 were measured and described as “the similarity values are quite high”. I doubt whether these are really high values
Could a system be designed that confidently labels data with a 0.4 threshold?
- The raw similarity score matters too, but what usually matters more is the score relative to other documents
  In the notebook example those values were the highest in relative terms. I understand why this could be unclear or confusing, and I'll fix it
- The raw output value is generally not important by itself. What matters is its position in the output distribution
- A cosine similarity of 0.4 is not like a sigmoid threshold of 0.4
  On real data, as opposed to near-identical duplicates, a cosine similarity of 0.4 is a pretty decent value
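The point made in these replies, that a score's rank and its margin over other candidates matter more than its raw value, can be sketched as follows. The helper name `best_with_margin` is hypothetical, and apart from 0.498, which is the value quoted in the thread, the scores are invented for illustration.

```python
def best_with_margin(scores):
    """Return the top-scoring id and its margin over the runner-up.

    `scores` maps candidate id -> similarity to the query.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_id, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else float("-inf")
    return best_id, best_score - runner_up

# 0.498 is the value quoted above; the other two scores are made up.
scores = {"doc_a": 0.498, "doc_b": 0.21, "doc_c": 0.17}

best_id, margin = best_with_margin(scores)
print(best_id, round(margin, 3))
```

A score just under 0.5 looks unremarkable as an absolute number, yet it clearly separates the top candidate when the remaining candidates sit far below it.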
