Building a web search engine from scratch in just 2 months with 3 billion neural embeddings

(blog.wilsonl.in)

By GN⁺, 2025-08-13

Against the backdrop of declining search engine quality and rapid progress in Transformer-based embedding models, the post describes building a web search engine on 3 billion embeddings over 2 months
Real-time natural language search was implemented with a 200-GPU cluster, a large-scale distributed crawler, and high-performance infrastructure and algorithms such as RocksDB and HNSW
The goal was intent-centered query answering rather than keyword matching, using NLP/ML techniques such as normalization, chunking, and statement chaining for document parsing and context preservation
It introduces large-scale distributed systems design and bottleneck/cost optimization at every layer: pipeline, storage, service mesh, vector index, and more
The result is a personal search engine that is heavily distributed, highly accurate, and very low in latency

Overview and motivation

The author decided to build a search engine from scratch in response to the recent decline in search quality, the growth of SEO spam and irrelevant content, and the improved natural language understanding of Transformer-based embedding models
The limits of existing search engines stem from a lack of human-level question understanding and from simple keyword-based matching
The goal is intent-centered ranking that consistently surfaces high-quality content and searches evenly into the long tail
Building a web search engine spans computer science, linguistics, ontology, NLP, ML, distributed systems, and performance engineering
The project was a challenge to build an entirely new search engine in 2 months, working completely alone, with no existing infrastructure or prior experience

Overall system architecture

A 200-GPU cluster generated 3 billion SBERT-based text embedding vectors
Hundreds of concurrent crawlers collected 50,000 pages per second, building an index of 280 million entries in total
RocksDB and HNSW were sharded across 200 cores, 4 TB of RAM, and 82 TB of SSD for storage and indexing
End-to-end query latency was kept at around 500 ms
The architecture and data flow are divided into crawler, pipeline, storage, embedding vector index, service mesh, and frontend/backend

Experiments and improvements in embedding-based search

Neural Embedding Playground

Experiments confirmed that search built on embedding models such as SBERT understands queries more naturally and more accurately than traditional keyword-centered search
By understanding the intent of the input query at the context and sentence level, it could retrieve answers that were genuinely more relevant

Traditional search vs. neural search examples

Traditional search: noisier, more random results, centered on keyword matches
Embedding search: understands the context and intent of the question and returns the exact key sentence or concept-centered results
Meaning-based retrieval becomes possible for queries with complex concept combinations, implicit or compound questions, and quality signals

Web page parsing and normalization

The main goal of normalization was to extract only the semantic text elements from HTML and remove noise such as layout and control elements
Following standards and references such as WHATWG and MDN, the structure of elements such as p, table, pre, blockquote, ul, ol, and dl was preserved
Chrome elements such as menus, navigation, comments, and interface widgets were removed entirely
Site-specific rules (for example, for en.wikipedia.org) were applied to fix over-extraction and under-extraction
Semantic structured data (meta, OpenGraph, schema.org, and so on) also made knowledge graph construction and ranking improvements possible
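
The extraction step described above can be sketched as follows. This is a minimal illustration, not the author's parser: the tag sets `CHROME_TAGS`, `BLOCK_TAGS`, and `VOID_TAGS` are assumptions chosen for the example, the input is assumed to be well-formed HTML, and text outside the listed block elements is simply dropped.

```python
from html.parser import HTMLParser

# Assumed tag sets for illustration only; the real rules are not given in this summary.
CHROME_TAGS = {"nav", "header", "footer", "aside", "form", "script", "style"}
BLOCK_TAGS = {"p", "pre", "blockquote", "li", "dt", "dd", "td", "th",
              "h1", "h2", "h3", "h4", "h5", "h6"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class SemanticTextExtractor(HTMLParser):
    """Collects (tag, text) pairs for semantic blocks and skips page chrome."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0    # > 0 while inside a chrome element
        self.block_tag = None  # tag of the block currently being read
        self.current = []      # text pieces of that block
        self.blocks = []       # extracted (tag, text) pairs in document order

    def _flush(self):
        # Collapse whitespace and store the block if it has any text.
        text = " ".join("".join(self.current).split())
        if text and self.block_tag:
            self.blocks.append((self.block_tag, text))
        self.current = []
        self.block_tag = None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never get an end tag, so do not count them
        if self.skip_depth or tag in CHROME_TAGS:
            self.skip_depth += 1
            return
        if tag in BLOCK_TAGS:
            self._flush()
            self.block_tag = tag

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
            return
        if tag == self.block_tag:
            self._flush()

    def handle_data(self, data):
        if not self.skip_depth and self.block_tag:
            self.current.append(data)


def extract_blocks(html):
    parser = SemanticTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Keeping the tag name next to each text block leaves room for later stages to treat headings, table cells, and paragraphs differently.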

Chunking and context preservation

Sentence-level chunking

To work around the limits of the embedding model, pages were chunked by sentence instead of being embedded whole
During chunking, the spaCy sentencizer handled natural sentence boundaries correctly across many cases, including grammar, abbreviations, URLs, and informal expressions

Context preservation and linking

Dependencies between sentences, headings, paragraphs, tables, and similar structure were detected, and that context information was bundled into the text that gets embedded
For example, so that each row of a table does not lose its meaning, the headings and clauses above it were chained together and included with it
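
A minimal sketch of the heading-chaining idea, assuming the page has already been reduced to `(tag, text)` pairs in document order. The function name `attach_heading_context` and the `" > "` separator are hypothetical choices for this example, not details from the original post.

```python
def attach_heading_context(blocks):
    """Prefix every non-heading block with the chain of headings above it.

    `blocks` is a list of (tag, text) pairs in document order, where headings
    use the tags h1..h6. Returns one context-enriched string per content block.
    """
    path = {}  # heading level -> heading text currently in scope
    chunks = []
    for tag, text in blocks:
        is_heading = len(tag) == 2 and tag[0] == "h" and tag[1].isdigit()
        if is_heading:
            level = int(tag[1])
            # A new heading replaces any heading at the same or a deeper level.
            path = {lvl: t for lvl, t in path.items() if lvl < level}
            path[level] = text
        else:
            context = [path[lvl] for lvl in sorted(path)]
            chunks.append(" > ".join(context + [text]))
    return chunks
```

With this, a table cell under "Rust" and "Borrowing" is embedded as `Rust > Borrowing > <cell text>` rather than as an isolated fragment.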

Statement chaining

A DistilBERT classifier analyzed each sentence together with the sentence before it, to determine context dependency and automate chain extraction
At embedding time, all of the preceding sentences a statement depends on were included with it, which improved context retention
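
The chaining step can be sketched like this. The function `depends_on_previous` is only a crude keyword heuristic standing in for the DistilBERT classifier, and both function names are hypothetical; the point is the backward walk that collects every sentence a statement depends on.

```python
# Assumed list of openers that usually signal dependence on the prior sentence.
DEPENDENT_OPENERS = ("it ", "this ", "that ", "they ", "these ", "however", "therefore")


def depends_on_previous(prev, cur):
    """Stand-in for a trained classifier: does `cur` need `prev` for context?"""
    return cur.lower().startswith(DEPENDENT_OPENERS)


def chain_statements(sentences, depends=depends_on_previous):
    """Return, for each sentence, the text to embed: its dependency chain plus itself."""
    chains = []
    for i, sentence in enumerate(sentences):
        chain = [sentence]
        j = i
        # Walk backwards while each sentence still depends on the one before it.
        while j > 0 and depends(sentences[j - 1], sentences[j]):
            chain.insert(0, sentences[j - 1])
            j -= 1
        chains.append(" ".join(chain))
    return chains
```

A real classifier can be swapped in through the `depends` parameter without changing the chaining logic.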

Results from using the prototype

Experiments with a variety of real queries in a sandbox environment confirmed context-aware question answering that was considerably more accurate than traditional methods
Even with keyword mismatches, omissions, metaphors, or compound questions, the app recognized intent and matched the right context sentences, effectively surfacing hidden knowledge and relationships

Large-scale web crawler (Node.js-based)

Many stability and efficiency concerns were addressed, including work stealing for distributing work, per-domain concurrency and traffic control, and DNS/URL/header validation
The crawler implemented Promise-based asynchronous I/O, DDoS-prevention mechanisms, resource management (memory, delay, backoff), and noise domain detection
Filtering of duplicate and abnormal URLs was strengthened through URL normalization, restrictions on protocol and port/userinfo, and canonicalization
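
A minimal sketch of such a URL filter using only the Python standard library. The specific policy (HTTP and HTTPS only, default ports only, no userinfo, fragment dropped, IPv6 literals rejected) is an assumption made for this example; the summary does not spell out the crawler's exact rules.

```python
from urllib.parse import urlsplit, urlunsplit

ALLOWED_SCHEMES = {"http", "https"}
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(raw):
    """Return a canonical form of `raw`, or None if the URL should be rejected."""
    try:
        parts = urlsplit(raw.strip())
        port = parts.port  # raises ValueError for a non-numeric or out-of-range port
    except ValueError:
        return None
    scheme = parts.scheme.lower()
    host = parts.hostname  # already lowercased by urllib; None if missing
    if scheme not in ALLOWED_SCHEMES or not host:
        return None
    if ":" in host:
        return None  # simplification: skip IPv6 literals
    if parts.username is not None or parts.password is not None:
        return None  # reject userinfo such as user:pw@host
    if port is not None and port != DEFAULT_PORTS[scheme]:
        return None  # assumed policy: only default ports are crawled
    path = parts.path or "/"
    # The fragment is dropped because it never reaches the server.
    return urlunsplit((scheme, host, path, parts.query, ""))
```

Two URLs that normalize to the same string can then be treated as duplicates before they enter the crawl queue.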

Pipeline (distributed task queue)

The state of every page was managed in PostgreSQL; in the early stage this relied on direct polling and transactions
When scaling problems and queue/lock bottlenecks appeared in the large distributed environment (thousands of crawlers), queue state was moved to a Rust-based in-memory coordinator
Task structure: a combination of indexing techniques, including a hashmap-based index, binary heap, domain groups, random polling, and swap_remove
Each task takes roughly 100 B of memory, so a 128 GB server can hold up to 1B tasks
Later, an open source RocksDB-based queue was also developed as an alternative to SQS, supporting up to 300,000 ops/second on a single node
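
The memory figure above follows from simple arithmetic: 10^9 tasks at about 100 B each is about 100 GB, which fits in 128 GB. The combination of a hashmap index with swap_remove and random polling can be sketched as below. This is an illustration of the general technique in Python, not the author's Rust coordinator; the class name `RandomPollQueue` is hypothetical, and the binary heap and domain grouping are left out.

```python
import random


class RandomPollQueue:
    """Set-like queue with O(1) push, O(1) removal by id, and O(1) random poll."""

    def __init__(self, rng=None):
        self.items = []   # dense array of task ids
        self.index = {}   # task id -> position in self.items
        self.rng = rng or random.Random()

    def __len__(self):
        return len(self.items)

    def push(self, task_id):
        if task_id in self.index:
            return False  # already queued
        self.index[task_id] = len(self.items)
        self.items.append(task_id)
        return True

    def _swap_remove(self, pos):
        # Move the last element into the hole so the array stays dense.
        last = self.items.pop()
        if pos < len(self.items):
            removed = self.items[pos]
            self.items[pos] = last
            self.index[last] = pos
        else:
            removed = last
        del self.index[removed]
        return removed

    def poll_random(self):
        if not self.items:
            return None
        return self._swap_remove(self.rng.randrange(len(self.items)))

    def remove(self, task_id):
        pos = self.index.get(task_id)
        if pos is None:
            return False
        self._swap_remove(pos)
        return True
```

Swap-remove trades ordering for constant-time deletion, which is acceptable here because tasks are polled at random anyway.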

Storage design (Oracle → PostgreSQL → RocksDB)

Storage started on Oracle Cloud (low-cost egress and storage) and then moved to PostgreSQL (TOAST), but write scalability and performance limits emerged
PostgreSQL characteristics such as MVCC, write amplification, and WAL created a bottleneck under large-scale parallel INSERTs, so the data was finally migrated to the key-value store RocksDB
RocksDB's separate blob storage (BlobDB), SST files, multithreading, and hash indexing were used to get the most out of NVMe SSDs
The store was scaled to 64 RocksDB shards, with xxHash(key)-based routing to each shard and Serde + MessagePack serialization
In the end it handled up to 200,000 ops/second from thousands of clients (crawler/parser/vectorizer), with metadata and blobs stored separately and compressed
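
Hash-based shard routing of the kind described above can be sketched as follows. Python's standard library has no xxHash, so `hashlib.blake2b` is swapped in here purely as a stand-in; the function name `shard_for` and the 8-byte digest size are assumptions for the example.

```python
import hashlib

NUM_SHARDS = 64  # shard count taken from the summary above


def shard_for(key: bytes, num_shards: int = NUM_SHARDS) -> int:
    """Map a key to a shard number in [0, num_shards).

    blake2b is a stand-in for xxHash; any fast, well-distributed hash works,
    as long as every client uses the same one.
    """
    digest = hashlib.blake2b(key, digest_size=8).digest()
    return int.from_bytes(digest, "little") % num_shards
```

Because the mapping depends only on the key, any client can compute the target shard locally without consulting a central directory. The trade-off is that changing the shard count remaps most keys.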

Service mesh and networking

As the infrastructure scaled, an mTLS + HTTP/2-based design was adopted for automatic discovery of service instances and secure communication between them
Certificates issued from a root CA were installed on every node, MessagePack serialization was used directly, and internal DNS, CoreDNS, and a custom client SDK were developed
VPNs (ZeroTier, Tailscale) had been tried earlier, but network, performance, and operational problems led to choosing plain HTTP + mTLS instead
System services were managed with systemd + cgroup + journald, giving centralized control, a lightweight setup, and standardization

Large-scale GPU embedding generation pipeline

The OpenAI API was used at first, but cost drove a shift to high-performance GPU environments such as Runpod
Every stage of the pipeline was decoupled to run asynchronously, GPU efficiency stayed above 90%, and 250 GPUs produced 100,000 embeddings per second
A Rust pipeline fed Python inference over named pipes for IPC, and structured backpressure tuned resource usage automatically
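
The backpressure idea can be illustrated with bounded queues between asynchronous stages. This sketch uses Python's `asyncio` in a single process rather than the Rust pipeline and named pipes described above, and `fake_embed` is a placeholder that returns text lengths instead of real vectors; all names here are hypothetical.

```python
import asyncio


def fake_embed(batch):
    """Placeholder for model inference: pairs each text with a dummy score."""
    return [(text, float(len(text))) for text in batch]


async def producer(queue, items):
    for item in items:
        # put() suspends while the queue is full, which is the backpressure:
        # a slow downstream stage automatically slows the producer down.
        await queue.put(item)
    await queue.put(None)  # sentinel: no more input


async def embed_stage(inq, outq, batch_size):
    batch = []
    while True:
        item = await inq.get()
        if item is None:
            break
        batch.append(item)
        if len(batch) == batch_size:
            await outq.put(fake_embed(batch))
            batch = []
    if batch:  # flush the final partial batch
        await outq.put(fake_embed(batch))
    await outq.put(None)


async def run_pipeline(texts, queue_size=4, batch_size=3):
    q1 = asyncio.Queue(maxsize=queue_size)
    q2 = asyncio.Queue(maxsize=queue_size)
    results = []

    async def sink():
        while (batch := await q2.get()) is not None:
            results.extend(batch)

    await asyncio.gather(producer(q1, texts), embed_stage(q1, q2, batch_size), sink())
    return results
```

Running `asyncio.run(run_pipeline(["chunk 0", "chunk 1"]))` pushes the texts through both stages. The `maxsize` values bound how much work can pile up between stages, so memory stays flat even when one stage is slower than the others.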

Vector indexing (HNSW/sharding)

The HNSW algorithm was used for in-memory vector search, achieving very low latency through ANN (Approximate Nearest Neighbor) search
When RAM ran out, the index was sharded evenly across 64 nodes, and each shard was searched in parallel as a separate HNSW index
HNSW inherently needs a great deal of RAM and has limits on live updates, so the index was eventually migrated to CoreNN, a disk-based open source vector DB
CoreNN enables high-accuracy retrieval over 3B embeddings even on a single node with 128 GB of RAM
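
The sharded search pattern (query every shard for its top k, then merge) can be sketched as below. Each shard here is a plain dictionary searched by brute-force cosine similarity, which stands in for a per-shard HNSW index, and the shards are queried sequentially rather than in parallel; the function names are hypothetical.

```python
import heapq
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_shard(shard, query, k):
    """Top-k (score, doc_id) pairs in one shard; brute force stands in for HNSW."""
    scored = ((cosine(vec, query), doc_id) for doc_id, vec in shard.items())
    return heapq.nlargest(k, scored)


def search_all(shards, query, k):
    """Ask every shard for its local top k, then merge into a global top k."""
    partials = [search_shard(shard, query, k) for shard in shards]
    return heapq.nlargest(k, (hit for part in partials for hit in part))
```

Each shard must return a full k results, because in the worst case all of the global top k live on a single shard.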

Search engine UX and latency optimization

Instant response matters most in search engine UX (no loading indicators, traditional SSR)
Cloudflare Argo and similar services brought traffic close to edge PoPs, and HTTP/3 was adopted to minimize transmission latency
At the app server level, all data was prepared ahead of time, individual API round trips were reduced, and minified, compressed pages were served immediately

This summary shows how a large-scale web search engine can be built end to end in just 2 months using modern natural language processing and ML techniques, and which key design and optimization concerns arise at the system, algorithm, and infrastructure levels.