Show HN: Wordllama – Things you can do with LLM token embeddings

(github.com/dleemiller)

1 point by GN⁺ 2024-09-16 | 1 comment

WordLlama is a fast, lightweight NLP toolkit that reuses LLM token embeddings for fuzzy deduplication, similarity calculation, ranking, clustering, and semantic text splitting
Inference relies mainly on token lookup and average pooling, and the project emphasizes a lightweight pipeline that can run on NumPy alone, along with CPU optimizations
The default model is 256-dimensional and 16MB; Matryoshka representations allow the dimensions to be truncated, and binary embeddings with Hamming similarity support even faster computation
In the MTEB table, WL64 through WL1024 score higher than GloVe 300d and Komninos on several metrics, while generally scoring lower than all-MiniLM-L6-v2
After pip install wordllama, the model is loaded with WordLlama.load(), and .key(query) returns a callable that can be passed to standard library functions such as sorted, min, and max

What WordLlama does

WordLlama is a lightweight toolkit for NLP utility tasks such as fuzzy deduplication, similarity calculation, ranking, clustering, and semantic text splitting
It extracts token embedding codebooks from recent LLMs such as LLaMA 2 and LLaMA 3 70B and trains compact word representations in the style of GloVe, Word2Vec, and FastText
It has few dependencies at inference time and is optimized for CPU hardware, which makes it suitable for deployment in resource-constrained environments
Because it is fast and small, it can serve utility use cases such as exploratory analysis, LLM output evaluators, and preparation steps in multi-hop or agentic workflows
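
The inference path described above (token lookup followed by average pooling, then cosine similarity) can be illustrated with a minimal sketch. It is not WordLlama's code: the CODEBOOK values, the whitespace tokenizer, and the helper names embed and cosine are hypothetical, and only the standard library is used so the example runs anywhere.

```python
import math

# Hypothetical toy codebook: token -> 4-dimensional embedding.
# A real codebook would come from a trained model, not hand-written values.
CODEBOOK = {
    "machine": [0.7, 0.3, 0.0, 0.3],
    "learning": [0.8, 0.1, 0.1, 0.4],
    "neural": [0.9, 0.1, 0.0, 0.2],
    "networks": [0.8, 0.2, 0.1, 0.1],
    "cooking": [0.1, 0.8, 0.9, 0.1],
    "pasta": [0.0, 0.9, 0.8, 0.0],
}
DIM = 4


def embed(text):
    """Look up each known token and average the vectors (mean pooling)."""
    vectors = [CODEBOOK[tok] for tok in text.lower().split() if tok in CODEBOOK]
    if not vectors:
        return [0.0] * DIM  # no known tokens: fall back to a zero vector
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


query = embed("machine learning")
sim_related = cosine(query, embed("neural networks"))
sim_unrelated = cosine(query, embed("cooking pasta"))
print(sim_related > sim_unrelated)  # True for this toy codebook
```

Because no attention or other context computation is involved, the cost per text is one table lookup per token plus an average, which is why this kind of pipeline stays cheap on CPU.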

Installation and basic usage

Installation is done with pip

pip install wordllama

The default 256-dimensional model is loaded with WordLlama.load()

from wordllama import WordLlama

wl = WordLlama.load()

.key(query) returns a Callable[[str], float], so candidate strings can be sorted by their similarity to the query, or the best match can be selected

query = "Machine learning methods"
candidates = [
    "Foundations of neural science",
    "Introduction to neural networks",
    "Cooking delicious pasta at home",
    "Introduction to philosophy: logic",
]

sim_key = wl.key(query)

sorted_candidates = sorted(candidates, key=sim_key, reverse=True)
best_candidate = max(candidates, key=sim_key)

In the example result, "Introduction to neural networks" is the top candidate with a score of 0.3414

Key features

Embedding generation: generates text embeddings quickly using simple token lookup and average pooling
Similarity calculation: computes the cosine similarity between two texts
Document ranking: ranks candidate documents by their similarity to a query
Fuzzy deduplication: removes duplicate texts based on a similarity threshold
Clustering: groups documents with KMeans
Filtering: keeps only documents whose similarity to the query is above a threshold
Top-K search: returns the K documents most similar to the query
Semantic text splitting: splits text into semantically coherent chunks
Binary embeddings: supports even faster computation with Hamming similarity
Matryoshka representations: truncates embedding dimensions as needed to trade off model size against performance
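
To make the binary embeddings item concrete, here is a minimal sketch of sign-based binarization and Hamming similarity. It is an illustration only: the helper names binarize and hamming_similarity, the sign threshold at zero, the 8-dimensional vectors, and the definition of similarity as the fraction of matching bits are all assumptions, not WordLlama's implementation.

```python
def binarize(vector):
    """Pack the sign of each dimension into the bits of a Python int."""
    bits = 0
    for value in vector:
        bits = (bits << 1) | (1 if value > 0 else 0)
    return bits


def hamming_similarity(a, b, n_bits):
    """Fraction of bit positions where the two codes agree (1.0 = identical)."""
    differing = bin(a ^ b).count("1")  # XOR marks the bits that differ
    return 1.0 - differing / n_bits


# Hypothetical 8-dimensional float embeddings.
doc_a = [0.4, -0.2, 0.9, -0.7, 0.1, 0.3, -0.5, 0.8]
doc_b = [0.5, -0.1, 0.7, -0.9, -0.2, 0.4, -0.3, 0.6]
doc_c = [-0.6, 0.3, -0.8, 0.5, -0.4, -0.1, 0.7, -0.9]

code_a, code_b, code_c = (binarize(v) for v in (doc_a, doc_b, doc_c))
print(hamming_similarity(code_a, code_b, 8))  # 0.875 (one bit differs)
print(hamming_similarity(code_a, code_c, 8))  # 0.0 (every bit differs)
```

The comparison reduces to an XOR and a bit count, which is why binary codes are cheaper to compare than float vectors; the price is that each dimension keeps only its sign.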

Model structure and performance

WordLlama trains a small, context-less model inside a general-purpose embedding framework
The default model is 256-dimensional and 16MB in size
The README's MTEB table compares WL64, WL128, WL256, WL512, and WL1024 against GloVe 300d, Komninos, and all-MiniLM-L6-v2
- WL256 scored Clustering 33.25, Reranking 52.03, Classification 58.21, Pair Classification 78.22, STS 67.91, CQA DupStack 24.12, SummEval 30.99
- GloVe 300d scored 27.73, 43.29, 57.29, 70.92, 61.85, 15.47, 28.87 on the same items, in the same order
- all-MiniLM-L6-v2 scored Clustering 42.35, Reranking 58.04, Classification 63.05, Pair Classification 82.37, STS 78.90, CQA DupStack 41.32, SummEval 30.81
l2_supercat is a LLaMA 2 vocabulary model
- It was trained on codebooks from several models, such as LLaMA 2 70B and phi 3 medium, after removing the additional special tokens and concatenating the codebooks
- Codebooks from multiple models that use the LLaMA 2 tokenizer can be concatenated and trained together
- It performs similarly to training on the LLaMA 3 70B codebook while being 4x smaller, since the vocabulary is 32k rather than 128k
l3_supercat is available as a LLaMA 3-based model
Additional results are available under Results in the repository

Semantic text splitting

.split() splits long text into semantic chunks

long_text = "Your very long text goes here... " * 100
chunks = wl.split(long_text, target_size=1536)

print(list(map(len, chunks)))

# Output: [1055, 1055, 1187]

target_size is both the target size and the maximum size
The splitting process tries to preserve text order, sentence structure, and, where possible, paragraph structure
WordLlama embeddings are used to find more natural split indices
Output chunk sizes vary, but stay at or below target_size
The recommended target size is 512 to 2048 characters, and the default is 1536
If larger chunks are needed, the recommended approach is to group several semantic chunks into batches after splitting
Details are in the technical overview
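
The recommendation to group chunks after splitting can be sketched as a greedy packing step. This is an assumption about one reasonable way to do it, not a WordLlama function: group_chunks, max_size, and sep are hypothetical names, and the chunks list below stands in for the output of a splitter.

```python
def group_chunks(chunks, max_size, sep="\n\n"):
    """Greedily merge consecutive chunks while the merged text fits max_size.

    Order is preserved. A single chunk longer than max_size is kept as its
    own group rather than being cut.
    """
    groups, current = [], ""
    for chunk in chunks:
        candidate = chunk if not current else current + sep + chunk
        if current and len(candidate) > max_size:
            groups.append(current)  # close the current group
            current = chunk         # start a new group with this chunk
        else:
            current = candidate
    if current:
        groups.append(current)
    return groups


# Stand-in for splitter output: four chunks of known lengths.
chunks = ["a" * 1000, "b" * 1200, "c" * 900, "d" * 1500]
batches = group_chunks(chunks, max_size=4000)
print([len(batch) for batch in batches])  # [3104, 1500]
```

Grouping after the split keeps the semantic boundaries found by the splitter, since chunks are only ever joined at their existing edges.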

Model2Vec and direct inference

The 2025-01-04 update added support for Model2Vec static embeddings
A Model2Vec model can be loaded with WordLlama.load_m2v()

configs = WordLlama.list_configs()

wl = WordLlama.load_m2v("potion_base_8m")  # 256-dim model
wl = WordLlama.load_m2v("m2v_multilingual")  # multilingual model

Model2Vec is a different way of building static embeddings that uses PCA
The Model2Vec team has built multilingual and glove-based models, which are said to score well on word similarity tasks
They can be found under minishlab on Hugging Face
Instead of going through the loader, WordLlamaInference can be used directly by passing it a static embedding array of shape (n_vocab, dim) and a tokenizer

from wordllama import WordLlamaInference
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained(...)
wl = WordLlamaInference(np_embeddings_ar, tokenizer)

Training and embedding extraction

For binary embedding models, the improvement margin was more pronounced at higher dimensions, and 512 or 1024 dimensions are recommended for binary embeddings
The L2 Supercat model was trained for 12 hours on a single A100 GPU with batch size 512
Extracting token embeddings from LLaMA models requires accepting the user agreement and logging in with the Hugging Face CLI

from wordllama.extract.extract_safetensors import extract_safetensors

extract_safetensors("llama3_70B", "path/to/saved/model-0001-of-00XX.safetensors")

The embeddings are usually in the first safetensors file, but not always
- There may be a manifest
- You may have to inspect the files yourself to find them
Training uses the scripts in the repository, and a configuration file has to be added by copying or modifying an existing one

pip install wordllama[train]
python train.py train --config your_new_config
python train.py save --config your_new_config --checkpoint ... --outdir /path/to/weights/

The save step saves one model per Matryoshka dimension
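
Conceptually, a Matryoshka-trained embedding can be shortened by keeping only its leading dimensions. The sketch below illustrates that idea under stated assumptions: truncate is a hypothetical helper, the 8-dimensional vector is made up, and rescaling to unit length after truncation is a common convention rather than something the source confirms for WordLlama.

```python
import math


def truncate(embedding, dim):
    """Keep the first `dim` values and rescale the result to unit length."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head


# Hypothetical 8-dimensional embedding, shortened to 4 dimensions.
full = [0.5, -0.5, 0.5, -0.5, 0.1, 0.1, -0.1, 0.1]
small = truncate(full, 4)
print(small)  # [0.5, -0.5, 0.5, -0.5]
```

Truncation like this is only meaningful when training encouraged the leading dimensions to be useful on their own, which is what Matryoshka representation learning is for; slicing an ordinary embedding the same way gives no such guarantee.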

Updates, roadmap, and license

The 2025-02-01 update added callable support for use with standard library functions such as sorted, min, and max
The 2024-10-04 update added the semantic splitting inference algorithm
The roadmap includes adding a DSPy evaluator and an example notebook for a Retrieval-Augmented Generation (RAG) pipeline
Community projects include a Gradio Demo HF Space and CPU-ish RAG
The project is released under the MIT License

1 comment

GN⁺ 2024-09-16

Hacker News opinions

I really like how small it is. It already has some advantages even over the smallest SBERT model.
Technically, though, it looks like a fairly old approach, and I understand that this is a trade-off against performance. Still, I'm curious whether it could offer switching between similarity types such as semantic similarity, natural language inference (NLI), and noun abstraction.
For example, when grouping newspaper articles into a category like “extreme environmental event”, we would want “Freezing” and “Burning” to come out as very similar. That is the behavior of MTEB/Sentence-Similarity or classical Word2Vec/GloVe. In a chemistry article, however, the two should come out nearly opposite, and sometimes one wants natural language inference embeddings to reveal a causal relationship between two things.
The latter two embedding types are relatively recent methods from after 2019, so I think they offer more technical opportunity. The older MTEB/semantic similarity family has been sufficient for many use cases since 2014, and it improved considerably in 2019 with mini-lm-v2 and similar models.
All three embedding types above are also possible with SBERT, but the dimensions and the models are large, so loading several models, one per type, puts a strain on resources. Generative embedding models, E5, and natural language inference models are large and often need around 6GB.
- Good idea. I'll run some experiments to check feasibility.
  I'd like to see how it performs when trained on a single similarity type. I'm not sure whether there is another way to handle this without computing context. It may require switching models, but that in itself is not a big problem.
- This is a 17MB model, and in benchmarks it clearly scores below MiniLM v2, that is, SBERT. I run V3 from a 23MB model in ONNX on almost every platform.
  I don't mean to put it down; things like this need to be understood in context. The context here is that, as you dig into LLMs, you find that LLMs have embeddings too, and from that point of view it feels more natural to tinker with those embeddings and take one step forward than to re-survey the current state of the whole embedding field.
- If “ChatGPT embeddings” means the OpenAI embedding models, then “burning” and “freezing” are not opposite at all. Running text-embedding-3-large at 1024 dimensions gives a cosine similarity of about 0.46. A completely opposite embedding would have a similarity of -1.
  The belief that words with opposite meanings have opposite embeddings is a common misconception. In practice, words with opposite meanings share many properties. “burning” and “freezing” are both related to temperature and physics, are English words, can be used as a verb, noun, or adjective, and are spelled correctly. All of these features are included in the embedding.
Embeddings carry a great deal of semantic information, depending on the training data and objective function, and can be used independently in many useful tasks.
I previously used the text encoder embeddings of a CLIP model to augment prompts so that they would match the corresponding images better. For example, if a prompt contained “building”, I looked up nearest neighbors such as “concrete” and “underground” in the embedding matrix and either replaced the word or appended them after it. In limited experiments, recall improved on most queries.
- Right. These in-domain contextual relationships can be trained into an embedding model.
  https://www.marqo.ai/blog/generalized-contrastive-learning-f...
- Really great idea. It seems possible in this implementation too, so I'll think about it more.
  Looking at the size of the token embeddings in wordllama could also help identify which tokens are important to augment. That said, training on data curated for this task could work considerably better.
I'm curious whether there are plans for languages other than English. This could be a perfect tool for French.
- Absolutely possible. A training corpus would have to be built, but I don't know well what material is available in French.
  I've done a bit of training with Mistral family models, so for a French corpus I would probably try that direction first.
  If you open an issue, I'll work on it when I have time.
For a use case with a large corpus, say more than 10,000 sentences with each sentence treated as a document, clustering TF-IDF sparse matrix vectors with k-means can give similar results.
That said, this tool seems to have plenty of utilities, such as binary quantization, for speeding up the k-means part. I'm planning to benchmark it in the coming weeks.
A few years ago I built a collection of language games that use similar functions: https://github.com/Hellisotherpeople/Language-games
- Interesting. It looks like it uses pymagnitude.
  https://github.com/plasticityai/magnitude
Has anyone thought of using embeddings to solve Little Alchemy?
- It looks like someone has recreated https://neal.fun/infinite-craft/.
Looks good. I'm curious what advantages it has over the mini-lm model. mini-lm looks better on most MTEB tasks, so is this better in things like inference speed?
- Mini-lm is the better embedding model. This model performs no attention calculation and does not use a deep learning framework after training, so it cannot get the contextual advantages of a transformer model.
  It was also not built to target state-of-the-art performance. It is a fairly constrained model, designed to reduce dependencies, size, and hardware requirements and to increase speed.
  Even viewed as a word embedding model, it is quite lightweight. Such models usually have very large vocabularies and are several GB in size.
- It seems the difference is model size. This one is lighter and faster. mini-lm is 80MB, and the smallest model here is 16MB.
Seems very useful for game development.
It nicely shows how much semantic content the tokens themselves carry.
Can this be built as a PostgreSQL extension?
