What embeddings are and why they matter

(simonwillison.net)

5 points by GN⁺ 2023-10-25 | 1 comment

Embeddings turn content such as text, images and code into a fixed-length array of floating-point numbers, so that items close in meaning can be found by computing distances
Within the vector space produced by a single model, cosine similarity can compare related documents, similar images and code snippets without knowing what each individual number means
In one example that stored 472 TIL posts as 1,536-dimensional vectors from OpenAI text-embedding-ada-002, the query to find related posts took about 400ms, and embedding all 402,500 tokens cost about $0.04
README search, code search, image search, clustering and RAG can be built just by combining local models with small tools; the examples use LLM, llm-sentence-transformers, Symbex, CLIP and E5-large-v2
Embedding-based semantic search does not depend on exact word matches, which makes it a key way to insert relevant excerpts into an LLM prompt in RAG use cases such as Q&A over internal documents

Basic concept of embeddings

An embedding is a way of turning a piece of content into an array of floating-point numbers
- The array is always the same length, however long the content is
- The length is determined by the embedding model in use; it might be 300, 1,000 or 1,536 numbers, for example
The array can be thought of as coordinates in a multi-dimensional space
- Its position in that space represents the meaning the embedding model has extracted from the content
- It can reflect characteristics of the content such as color, shape or concept
Even without fully understanding what each individual number means, the positional relationships can be used for useful work such as finding nearby items
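
As a rough illustration of how "closeness" between two such arrays can be measured, here is a minimal pure-Python sketch of cosine similarity. The three-number vectors are invented for the example; real embeddings have hundreds or thousands of dimensions and come from a model.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length (same model)")
    dot_product = sum(x * y for x, y in zip(a, b))
    magnitude_a = math.sqrt(sum(x * x for x in a))
    magnitude_b = math.sqrt(sum(y * y for y in b))
    return dot_product / (magnitude_a * magnitude_b)


# Hypothetical 3-dimensional "embeddings", invented for illustration only.
dog = [0.9, 0.1, 0.2]
puppy = [0.8, 0.2, 0.3]
invoice = [0.1, 0.9, 0.1]

# The two dog-like vectors point in nearly the same direction.
print(cosine_similarity(dog, puppy) > cosine_similarity(dog, invoice))  # prints True
```

A score near 1.0 means the vectors point the same way; a score near 0 means they are unrelated along the dimensions the model learned.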

Understanding vector space with Word2Vec

Google Research's Efficient Estimation of Word Representations in Vector Space, published on 16 January 2013, is the Word2Vec paper
Word2Vec is an early embedding model that turns a single word into an array of 300 numbers
turbomaze.github.io/word2vecjson is a demo for exploring 10,000 words and the 300-number array for each of them
- Words close to “france” include french, belgium, paris, germany, italy and spain
Vector arithmetic exposes relationships too
- Adding “paris” to the “germany” vector and subtracting “france” gives a vector that is closest to “berlin”
- This shows that the model has captured nationality and geographic relationships in the vector space
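
The arithmetic can be sketched with toy data. The four-number vectors below are hand-made so that the relationship holds exactly; they are not Word2Vec output, and Euclidean distance is used here only to keep the sketch short.

```python
import math

# Hypothetical vectors: [capital-ness, France, Germany, Spain].
VECTORS = {
    "france": [0.1, 1.0, 0.0, 0.0],
    "paris": [0.9, 1.0, 0.0, 0.0],
    "germany": [0.1, 0.0, 1.0, 0.0],
    "berlin": [0.9, 0.0, 1.0, 0.0],
    "spain": [0.1, 0.0, 0.0, 1.0],
    "madrid": [0.9, 0.0, 0.0, 1.0],
}


def analogy(plus_a, plus_b, minus):
    """Return the word nearest to plus_a + plus_b - minus, excluding the inputs."""
    target = [
        a + b - c
        for a, b, c in zip(VECTORS[plus_a], VECTORS[plus_b], VECTORS[minus])
    ]
    candidates = (w for w in VECTORS if w not in {plus_a, plus_b, minus})
    return min(candidates, key=lambda w: math.dist(VECTORS[w], target))


print(analogy("germany", "paris", "france"))  # prints "berlin" for these toy vectors
```

With real 300-dimensional vectors the result is only approximately equal to the target word's vector, which is why the nearest neighbour is reported rather than an exact match.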
Word2Vec was trained on 1.6 billion words of content, and today's embedding models are trained on far larger datasets and capture much richer relationships

Getting embeddings with the LLM tool

LLM is a command-line tool and Python library for working with large language models
- It can be installed with pip install llm or brew install llm
- By default it works with the OpenAI API
Installing plugins adds new language models or embedding models
The llm-sentence-transformers plugin wraps the SentenceTransformers library
- The all-MiniLM-L6-v2 model can be downloaded from Hugging Face and run locally
- The llm embed command outputs the embedding of a sentence as a JSON array of numbers
An embedding on its own is just an array of numbers; it becomes useful only once it is stored and compared
llm embed-multi can embed many content items at once and store them as a collection in a SQLite table
- The example command finds every README.md file under the home directory and stores them in a readmes collection
- The --store option also saves the original text in the SQLite table
- The run stored 16,796 README.md files and took about 30 minutes on a local computer
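
To make the "store it in SQLite" step concrete, here is a small sketch using only the Python standard library. The table layout, the float32 blob encoding and the helper names are assumptions chosen for illustration, not the schema the LLM tool actually uses.

```python
import sqlite3
import struct


def encode(vector):
    """Pack a list of floats into a compact little-endian float32 blob."""
    return struct.pack("<" + "f" * len(vector), *vector)


def decode(blob):
    """Unpack a float32 blob back into a list of Python floats."""
    return list(struct.unpack("<" + "f" * (len(blob) // 4), blob))


# Hypothetical schema: one row per item, with the original text kept alongside.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE embeddings (id TEXT PRIMARY KEY, content TEXT, embedding BLOB)"
)


def store(db, item_id, content, vector):
    db.execute(
        "INSERT OR REPLACE INTO embeddings (id, content, embedding) VALUES (?, ?, ?)",
        (item_id, content, encode(vector)),
    )


# A made-up vector stands in for real model output.
store(db, "project-a/README.md", "# Project A", [0.25, -0.5, 0.75])
row = db.execute(
    "SELECT content, embedding FROM embeddings WHERE id = ?",
    ("project-a/README.md",),
).fetchone()
print(row[0], decode(row[1]))
```

Storing float32 values means a 1,536-dimensional embedding takes 6,144 bytes per row, which keeps collections of this size small enough for a single SQLite file.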

Semantic search and “vibes-based search”

The llm similar command searches a stored embedding collection for items similar to an input sentence
Searching the readmes collection for the phrase sqlite backup tools surfaces the READMEs of SQLite backup and related projects such as sqlite-diffable, sqlite-dump, sqlite-generate, sqlite-history and sqlite-utils
A result document does not need to contain the word “backups” itself
- If the content is similar in meaning to the query, it can appear in the results
This is semantic search, which the original article calls vibes-based search
Exact text matching alone does not always find what the user is looking for, so this is useful for many kinds of content search engines
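
A similarity lookup of this kind can be sketched as a brute-force scan. In the sketch below every stored vector is normalized once, so that cosine similarity reduces to a dot product at query time; the item names and three-number vectors are invented, and this is not how llm similar is implemented internally.

```python
import heapq
import math


def normalize(vector):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector]


def build_index(items):
    """items maps an id to its raw embedding; returns id -> unit vector."""
    return {item_id: normalize(vector) for item_id, vector in items.items()}


def similar(index, query_vector, n=3):
    """Return the n best (score, id) pairs, highest score first."""
    query = normalize(query_vector)
    scored = (
        (sum(q * v for q, v in zip(query, vector)), item_id)
        for item_id, vector in index.items()
    )
    return heapq.nlargest(n, scored)


# Hypothetical embeddings; a real collection would come from an embedding model.
index = build_index(
    {
        "sqlite-dump": [0.9, 0.1, 0.0],
        "sqlite-history": [0.8, 0.3, 0.1],
        "image-resizer": [0.0, 0.2, 0.9],
    }
)
results = similar(index, [1.0, 0.2, 0.0], n=2)
print([item_id for _score, item_id in results])
```

Nothing in the scoring looks at the words themselves, which is why a document can rank highly without containing the query terms.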

Code embeddings: Symbex and Datasette

Symbex is a tool for exploring the symbols in a Python codebase
- It was built to quickly find Python functions and classes and pass them to an LLM
- It later proved useful for embedding every function in a codebase to build a code search engine
Symbex can output the symbols it finds as JSON or CSV, and that format can be used as input to llm embed-multi
The example that embeds all functions and class methods of the Datasette project uses the gte-tiny model
- gte-tiny is a 60MB file
- symbex '*' '*:*' --nl outputs the functions and class methods in the current directory as newline-delimited JSON
- llm embed-multi ... --format nl takes that output directly as input and creates the embeddings
After that, Datasette and the datasette-llm-embed plugin can run semantic code search through SQL
SQLite acts as the integration point that connects several tools
- Functions are extracted from the code
- They are passed through an embedding model
- The results are written to SQLite
- They are searched with SQL
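
The first step of that pipeline, pulling functions and methods out of source code as newline-delimited JSON, can be approximated with Python's standard ast module. This is an illustrative sketch, not Symbex's implementation; the "id" and "code" keys and the example.py filename are assumptions.

```python
import ast
import json


def iter_symbols(source, filename="example.py"):
    """Yield (id, source text) for top-level functions and class methods."""
    function_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    tree = ast.parse(source)
    for node in tree.body:
        if isinstance(node, function_types):
            yield f"{filename}:{node.name}", ast.get_source_segment(source, node)
        elif isinstance(node, ast.ClassDef):
            for child in node.body:
                if isinstance(child, function_types):
                    symbol_id = f"{filename}:{node.name}.{child.name}"
                    yield symbol_id, ast.get_source_segment(source, child)


SOURCE = '''
def add(a, b):
    return a + b


class Greeter:
    def greet(self, name):
        return f"Hello {name}"
'''

# One JSON object per line, ready to be fed to an embedding step.
lines = [
    json.dumps({"id": symbol_id, "code": code})
    for symbol_id, code in iter_symbols(SOURCE)
]
print("\n".join(lines))
```

Each output line carries a stable id plus the text to embed, which is all a later step needs in order to write one row per function into SQLite.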

Embedding text and images in the same space with CLIP

CLIP is a model released by OpenAI in January 2021 that can embed both text and images
What makes it special is that text and images are placed in the same vector space
- The embedding of the string “dog” and the embedding of a photo of a dog sit close to each other in that space
- Images related to a piece of text, or text related to an image, can be searched for
The in-browser CLIP demo is built as an Observable notebook and runs the CLIP model inside the browser
- The page loads 158MB of resources
- The CLIP text model is 64.6MB and the image model is 87.6MB
Example similarity scores between a photo of a beach and several text strings are given
- beach: 26.946%
- city: 19.839%
- sunshine: 24.146%
- california beach: 27.427%
Asking for the similarity between a photo and a single word is not the point in itself; the real value lies in building a search interface on top of it

Faucet Finder: CLIP-based image search

Faucet Finder is a custom search tool built for finding photos of bathroom faucets
Drew Breunig collected 20,000 faucet photos from faucet suppliers and calculated CLIP embeddings for them
- The implementation used LLM and the llm-clip plugin
- It was deployed with Datasette
The tool can find other faucets that are visually similar to a given faucet
- If you like an expensive faucet, you can look for a similar, cheaper alternative
Drew's demo used precomputed embeddings, so showing similar results did not require running the CLIP model on a server
Later a server-side CLIP model was deployed to Fly.io, and an Observable notebook demo was built that combines an API for embedding a text string with an API over the faucet embedding table
- A query such as “gold purple” can search the faucet images by meaning

Clustering and 2D visualization

Embeddings can be used not only for related-content recommendations and semantic search but also for clustering
llm-cluster is a plugin that implements clustering using sklearn.cluster from scikit-learn
Using the GitHub issues API and paginate-json, an llm-issues collection can be built from the issue titles of the simonw/llm repository and grouped into 10 clusters
Adding the --summary option to llm cluster llm-issues 10 sends the text of each cluster to an LLM to generate a descriptive name for it
- Example names include “Log Management and Interactive Prompt Tracking” and “Continuing Conversation Mechanism and Management”
High-dimensional spaces are hard to visualize, so principal component analysis (PCA) can be used to reduce the number of dimensions
- Matt Webb created OpenAI embeddings of BBC In Our Time podcast episode descriptions and produced a 2D visualization with PCA
- Even after reducing 1,536 dimensions to 2, episodes about historical wars still appeared close to one another, as did episodes about modern scientific discoveries

Classifying sentences by average position

Embeddings can also be used for classification
- First, calculate the average position of each group of embeddings that has been categorized in a particular way
- Then a new piece of content is assigned the category whose average position its embedding is closest to
Amelia Wattenberger's Getting creative with embeddings is an example that scores sentences by how concrete or abstract they are
Samples of concrete sentences and abstract sentences are prepared, and the average position of each group is calculated
A new sentence is scored by which of those two average positions it is closer to
The score can then be turned into a color that loosely indicates how abstract or concrete the sentence is
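
The averaging-and-scoring idea can be sketched in a few lines. The two-number vectors are invented stand-ins for sentence embeddings, and the particular scoring formula (a ratio of Euclidean distances) is an assumption made for this sketch, not necessarily the one used in the original example.

```python
import math


def centroid(vectors):
    """Average position of a group of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def concreteness(vector, concrete_center, abstract_center):
    """Score from 0.0 (at the abstract centroid) to 1.0 (at the concrete one).

    Assumes the two centroids are distinct, otherwise the ratio is undefined.
    """
    to_concrete = math.dist(vector, concrete_center)
    to_abstract = math.dist(vector, abstract_center)
    return to_abstract / (to_concrete + to_abstract)


# Hypothetical embeddings of hand-labelled sample sentences.
concrete_samples = [[1.0, 0.0], [0.8, 0.2]]
abstract_samples = [[0.0, 1.0], [0.2, 0.8]]

concrete_center = centroid(concrete_samples)
abstract_center = centroid(abstract_samples)

print(concreteness([0.7, 0.3], concrete_center, abstract_center))
```

Because the score is continuous rather than a hard label, it maps naturally onto a color gradient.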

RAG: Q&A over personal and internal documents

People who use ChatGPT often want to know how to get it to answer questions based on their personal notes or their company's internal documents
The answer need not be expensive custom model training; it can be a combination of an off-the-shelf LLM and retrieval-augmented generation (RAG)
The basic RAG process is simple
- The user asks a question
- Content that looks relevant to the question is retrieved from the personal documents
- The relevant excerpts and the original question are placed in the prompt, keeping within the LLM's size limit
- The LLM answers based on the extra content it was given
A typical size limit is around 3,000 to 6,000 words
The hard part of RAG is choosing the best excerpts to put in the prompt
- Embedding-based semantic search is well suited to gathering the content most likely to be relevant
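
The prompt-assembly step can be sketched as follows, assuming the passages have already been ranked by a similarity search. The word count is used as a crude stand-in for the model's real token limit, and the prompt wording and function name are hypothetical.

```python
def build_prompt(question, ranked_passages, max_words=3000):
    """Pack the most relevant passages into a prompt without exceeding a budget.

    ranked_passages must be ordered most relevant first.
    """
    chosen = []
    used = len(question.split())
    for passage in ranked_passages:
        words = len(passage.split())
        if used + words > max_words:
            break  # stop at the first passage that would overflow the budget
        chosen.append(passage)
        used += words
    context = "\n\n".join(chosen)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Placeholder passages; in practice these come from the semantic search step.
passages = [
    "shot-scraper is a tool for taking screenshots.",
    "Datasette is a tool for exploring data.",
    "An unrelated long passage " * 50,
]
prompt = build_prompt("What is shot-scraper?", passages, max_words=20)
print(prompt)
```

Stopping at the first overflow keeps the highest-ranked material; other strategies, such as skipping an oversized passage and trying the next one, are equally reasonable.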

Building offline blog Q&A with E5-large-v2

The example of RAG over blog content used E5-large-v2
Questions and answers are phrased differently, so a question will not necessarily land closest in meaning to the document that contains its answer
E5-large-v2 supports two kinds of content
- A factual sentence is embedded as a passage
- A question is embedded as a query
- This is similar to how CLIP places images and text in the same space
The blog's 19,000 paragraphs are embedded as passages, and the question is embedded as a query so that paragraphs close to the answer can be found
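
The E5 model family is documented as distinguishing the two kinds of input through literal "passage: " and "query: " prefixes on the text. The sketch below shows only that prefixing step; the embed argument and the fake_embed stand-in are hypothetical and would be replaced by a call to a real embedding model.

```python
def embed_passages(paragraphs, embed):
    """Embed stored content, marking each item as a passage."""
    return [embed("passage: " + text) for text in paragraphs]


def embed_question(question, embed):
    """Embed a user question, marking it as a query."""
    return embed("query: " + question)


# Stand-in embedder that records its inputs and returns a placeholder vector.
seen = []


def fake_embed(text):
    seen.append(text)
    return [float(len(text))]  # not a real embedding


embed_passages(["shot-scraper takes screenshots."], fake_embed)
embed_question("What is shot-scraper?", fake_embed)
print(seen)
```

Both kinds of vector end up in the same space, so the question can be compared against the stored passages with an ordinary similarity search.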
In the example, RAG was implemented as a Bash one-liner
- llm similar found the related paragraphs
- jq extracted the content
- The question and paragraphs were passed to a Llama 2 Chat 7B model running on a local laptop
For the question What is shot-scraper?, the generated answer said that shot-scraper is a Python utility that wraps Playwright and automates web page screenshots and JavaScript-based scraping, with a command-line interface and a YAML-based configuration flow
The generated answer was not an exact sentence match for any existing content on the blog

Options you can tune in practice

LangChain is a framework for building features on top of LLMs, and RAG is one of its main capabilities
- The same feature could be built on LangChain, but understanding LangChain takes a significant investment
- Here a combination of small tools is preferred over a single framework that tries to solve everything
Cosine similarity is the default distance function
- Other distance functions have not been tried yet
- RAG has many adjustable factors, including the distance function, the embedding model, the prompting strategy and the LLM
The examples involved at most about 20,000 embeddings, a scale at which brute-force cosine similarity over the whole set still returns results in reasonable time
For much larger data, such as 1 billion objects, vector databases or extensions to existing databases become options
- sqlite-vss for SQLite
- pgvector for PostgreSQL
- Facebook's FAISS has also been tried, and a Datasette plugin that uses it, datasette-faiss, is available
The trends to look forward to are multimodal models and smaller models
- Facebook's ImageBind learns joint embeddings across six modalities: images, text, audio, depth, thermal and IMU data
- Small models such as the 60MB gte-tiny make it more feasible to run on constrained devices or in the browser

Further reading

What are embeddings? by Vicki Boykis
Text Embeddings Visually Explained by Meor Amer for Cohere
The Tensorflow Embedding Projector: an interactive tool for exploring embedding spaces
Learn to Love Working with Vector Embeddings: a collection of vector embedding tutorials from Pinecone

1 comment

GN⁺ 2023-10-25

Hacker News opinions

After posting this article I found a few more resources that help with understanding embeddings at a lower level
My article was deliberately very high-level, focusing mainly on applications
Cohere's Text Embeddings Visually Explained: https://txt.cohere.com/text-embeddings/
The Tensorflow Embedding Projector tool: https://projector.tensorflow.org/
Vicki Boykis's What are embeddings? is also worth a look: https://vickiboykis.com/what_are_embeddings/
I'm going to add these to “further reading” at the bottom of the page
- I tried almost the same idea a while ago: https://blog.scottlogic.com/2022/02/23/word-embedding-recomm...
  I used embeddings to increase engagement with related posts, and personally I think embeddings are an undervalued but powerful tool
  They can be used to navigate between documents or excerpts based on similarity or, conversely, to find unique content, and since there is no need to worry about hallucination they are quite “safe”
- It's good that this is written in a way that is approachable even for people with little experience of AI, machine learning or LLMs
  How embeddings are created could be interesting too, for example by cutting off the classification layer after training, or an approach like EfficientNet
- I'd like to know whether there is a resource on the history of embeddings and their use in computer science and LLMs
  They are becoming a core foundation of machine learning
In computer vision and visual SLAM algorithms, embeddings have become the de facto standard method for place recognition, and it is very similar to what this article describes
It is called “bag-of-words place recognition” and is used in almost every open-source library these days
The core idea is to “embed” each image into a vector of its top N features by running it through a feature extraction/descriptor pipeline
As the camera moves, a database of images called keyframes is built, with the images stored as vectors of far fewer dimensions
Later, the database is queried with every image, and the best match in the vector database is found using methods such as cosine similarity
When a match is found, stereo constraints between the query image and the matching image can be computed to update the map
The original paper is [1] and the best-known implementation is https://github.com/dorian3d/DBoW2
[1]: https://www.google.com/search?client=firefox-b-d&q=Bags+of+B...
An excellent introductory reference
I once built an iOS notes app myself, and adding embeddings to the existing full-text search was 1) surprisingly easy and 2) far more powerful than I initially expected
I knew that searching for “dog” would also return notes containing “canine”, but it took trying it myself to realize that a search like “a pet I would like” returns many animal-related notes with positive sentiment
That was my first big “aha” moment
At the time, Supabase's DocsGPT PR was useful as example code: https://github.com/supabase/supabase/pull/12056
- The phrasing “added to the existing full-text search” is subtly important. Embeddings provide semantic search that complements traditional search algorithms
  Many applications rely heavily on names or proper nouns, often with little context
  If a pet dog is referred to only by name, with no description, a given embedding model will not pick that up
  Proper nouns such as people, places and street names can be very important for anchoring personalized or domain-specific search, but general-purpose language models do not know them
  I'd like to hear about specific ways of dealing with this problem
- I'm building something similar for Logseq notes
  The biggest open question right now is how much text should go into a single embedding
  I'm weighing whether to embed each sentence separately or the whole set of sentence blocks on a page of the notes app together
- I'd like to know whether you call an off-device API to generate the embeddings, and whether the search runs on the device
The classic example of word embeddings is the famous King - Man + Woman = Queen
It works well in vector space, but when projected to 2D it does not make as much sense visually
In my experience that was true of PCA, MDS and t-SNE alike: https://bhugueney.gitlab.io/test-notebooks-org-publish/jupyt...
It is a JupyterLite notebook that computes word embeddings in the browser, so it's best not to run it on a smartphone
I'd like to know whether anyone knows a good way to visualize this classic word embedding example
- If I've understood correctly, you could visualize it in 2D by placing “king” at the origin and using “king”-“man” as the X-axis and “king”-“woman” as the Y-axis
  If you really want orthogonality you can use Gram-Schmidt
  In 3D you could add “king”-“queen” as a Z-axis, and the orthogonalized version would be closer to the model's notion of distance
  In 2D you cannot show how far “king”-“man”+“woman” is from “queen”, but the other distances can be represented exactly
  In 3D it is probably possible to give the exact distance
  “queen” is usually picked because it is the word whose embedding is closest to X="king"-"man"+"woman"
  The 2D chart could also show the next few closest words, each annotated with its orthogonal distance from the 2D plane
  Then “queen” should be the word with the smallest sum of its squared distance from X and its squared orthogonal distance from the plane, so it can be verified by eye to some extent
- UMAP would be worth trying
- I was looking for a mathematician joke about visualizing high dimensions and asked ChatGPT, which made up a Richard Feynman-style joke that cannot be found on Google
  It went something like: “You can't visualize 4D… at least I can't, because I only have three branes”, a play on branes and brains
  ChatGPT later admitted it had made it up and apologized
  It then offered quotes from John von Neumann, H. G. Wells and Ian Stewart, and finally came up with “To visualize 4D, visualize 3D and then say ‘n+1’”, which was the closest to the joke I remembered but less funny
  So I asked it to generate hallucinated quotes about visualizing high-dimensional space in the style of Deepak Chopra, and it poured out quite plausible fake quotes mixing expressions such as septillion-dimensional embeddings, Hilbert space, the Poincaré conjecture, the Heisenberg uncertainty principle and Shannon entropy
A common mistake in practical trigonometry is computing a square root when it is not needed
In the example code, the ** 0.5 in magnitude_a = sum(x * x for x in a) ** 0.5 and magnitude_b = sum(x * x for x in b) ** 0.5 is not needed
If you only need to compare cosines, you can compare the squared values and avoid the expensive root calculation
Similarly, elliptic curve cryptography defers expensive operations such as inversion for as long as possible, or avoids computing the normalized value at all when only two points need to be compared
- The code is written to be easy to understand
  Otherwise it would be replaced with low-level SIMD code
dot_product = sum(x * y for x, y in zip(a, b)): I was surprised to see this and wondered why a vectorized numpy operation wasn't used
It made sense once I read the part about “having ChatGPT write several versions of the cosine similarity code”
- There are two reasons
  First, when explaining things to people I find that numpy syntax actually gets in the way
  Second, numpy is not the lightest dependency
  I use it when I need the performance, but I don't want to make it the default choice
If you want to browse LLM-embedding-related things among Show HN posts, ProductHunt startups, YC companies and Github repositories, you can find them quickly in the LLM-Embeddings-Based Search Engine MVP I just launched
https://payperrun.com/%3E/search?displayParams={%22q%22:%22L...
- Okay
  I expected the search results to update immediately as I clicked the filter buttons, and didn't realize I would have to run the search again
  I can see why you did it that way
- My Show HN post is here: https://news.ycombinator.com/item?id=38011802
This is the most interesting thing I've read about “AI” in the past few months
Whenever I saw embedding models in model lists I wondered what they were, and why everyone was talking about vector DBs
I can already think of a way to apply this right away in a long-running side project
If every document has an embedding, useful clustering of user data could become practical
I'd really like to know whether anyone has used embeddings for anything other than approximate nearest neighbor and clustering
Possibilities that come to mind are projection onto an arbitrary axis, indexing and sorting, for example along axes such as “hot-cold”, “happy-sad”, “SF-realism” or “literary-commercial”
Besides SVM-style classification in embedding space, word2vec-style inference (woman-man+king=queen) or pulling out a layer of an LLM, there must also be ways to train embeddings directly
I know contrastive learning is used, but other approaches also seem worth exploring, such as training embeddings together with a function neural network, or setting up functional equations and computing a mean squared error loss
The heavy focus on semantic search is surprising, and surely there are more interesting applications
- The examples given all seem like relatively common tasks, so I'm a little confused
  The first and third are essentially the same
  In computer vision you may want to change an image semantically, for example adding glasses to a photo, and the tasks shown in Google ads are examples of this
  Such tasks are done in latent space
  This is especially clear in normalizing flows, because they transform the space into a Gaussian
  Diffusion models do something similar approximately; they are not invertible, although the process can be reversed
  You project the image, sentence or data you want to manipulate, manipulate it in the Gaussian space, and then bring it back to the target space
  That said, embedding is such an overloaded term that people may be talking past each other
  You may be thinking only of the first block that turns discrete integer tokens into continuous floating-point values
  But that embedding is trained too, so even if it ends up acting like a lookup table it is still a neural network process
  Using an SVM in this space is also done
  I think of it as similar to latent space, but a little more abstract
  At the very least an embedding should be injective. Mathematically that's how it is, but…
- SVM-style classification in embedding space is a very basic technique in industry NLP and machine learning
  Training embeddings directly is literally what the original embedding model, Word2Vec, does
- I also built a word2vec embedding space from PubMed abstracts
  Chemistry and biochemistry names turned up many variants (hyphenated spellings, unhyphenated spellings, spellings with spaces) as well as abbreviations
  A dictionary of technical terminology could probably have been built from it
  I don't know how close it could have got to definitions, but even with the limits of vectors alone it is a starting point
  It is quite likely that others have built dictionaries this way too
- Cross-lingual embeddings, which build separate embedding spaces for two languages and align the spaces with a seed dictionary, have actual or potential applications in multilingual search and machine translation
- They can also be used for data deduplication
I've experimented with embeddings and built a few production use cases; it's a great tool that enables a lot of cool applications
But when building for a specific domain, the limits of off-the-shelf embedding models become apparent
Off-the-shelf models have many dimensions, and some of them may matter for my application's classification, content similarity, clustering and so on while others do not
In other words, two vectors can look close because they are close in dimensions I don't care about
I hope better tools and literature for fine-tuning embedding models will appear
- Fine-tuning an entire language model to solve this problem is like using a sledgehammer to drive a nail
  Tools for this have existed for a long time; for example, after labeling a little data you can train an SVM for classification on top of the embedding space
- sentence-transformers has pretty good tooling for this
