[2009 WSDM] बड़े पैमाने के सूचना खोज सिस्टम बनाने की चुनौतियाँ

lemonmint · 2025-04-23T12:18:29+09:00

सूचना खोज सिस्टम के मुख्य आयाम और trade-offs सिस्टम डिज़ाइन करते समय निम्न तत्वों के बीच engineering trade-offs को संतुलित रूप से ध्यान में रखना चाहिए। इंडेक्स किए गए दस्तावेज़ों की संख्या प्रति सेकंड प्रोसेस होने वाली queries की संख्या (QPS) इंडेक्स की ताजगी/अपडेट गति query latency प्रत्येक दस्तावेज़ के लिए संग्रहीत जानकारी की मात्रा score calculation/सर्च एल्गोरिदम की जटिलता/लागत engineering की कठिनाई इन parameters के गुणनफल के अनुपात में बढ़ने की प्रवृत्ति रखती है। ये सभी तत्व कुल performance और price-performance ($ के प्रति performance) को प्रभावित करते हैं। सिस्टम स्केल में बदलाव (1999 vs 2009) पिछले 10 वर्षों में सिस्टम स्केल और performance requirements में नाटकीय बदलाव आया है। # दस्तावेज़: ~7 करोड़ → कई अरब (~100 गुना वृद्धि) दैनिक प्रोसेस की गई queries की संख्या: (~1000 गुना वृद्धि) प्रति दस्तावेज़ इंडेक्स जानकारी: (~3 गुना वृद्धि) अपडेट विलंब समय: कई महीने → कुछ मिनट (~10000 गुना कमी) औसत query विलंब समय:

(static.googleusercontent.com)

5 पॉइंट द्वारा lemonmint 2025-04-23 | अभी कोई टिप्पणी नहीं है. | WhatsApp पर शेयर करें

सूचना खोज सिस्टम के मुख्य आयाम और trade-offs

सिस्टम डिज़ाइन करते समय निम्न तत्वों के बीच engineering trade-offs को संतुलित रूप से ध्यान में रखना चाहिए।
- इंडेक्स किए गए दस्तावेज़ों की संख्या
- प्रति सेकंड प्रोसेस होने वाली queries की संख्या (QPS)
- इंडेक्स की ताजगी/अपडेट गति
- query latency
- प्रत्येक दस्तावेज़ के लिए संग्रहीत जानकारी की मात्रा
- score calculation/सर्च एल्गोरिदम की जटिलता/लागत
engineering की कठिनाई इन parameters के गुणनफल के अनुपात में बढ़ने की प्रवृत्ति रखती है।
ये सभी तत्व कुल performance और price-performance ($ के प्रति performance) को प्रभावित करते हैं।

सिस्टम स्केल में बदलाव (1999 vs 2009)

पिछले 10 वर्षों में सिस्टम स्केल और performance requirements में नाटकीय बदलाव आया है।
- # दस्तावेज़: ~7 करोड़ → कई अरब (~100 गुना वृद्धि)
- दैनिक प्रोसेस की गई queries की संख्या: (~1000 गुना वृद्धि)
- प्रति दस्तावेज़ इंडेक्स जानकारी: (~3 गुना वृद्धि)
- अपडेट विलंब समय: कई महीने → कुछ मिनट (~10000 गुना कमी)
- औसत query विलंब समय: <1 सेकंड → <0.2 सेकंड (~5 गुना कमी)
- सिस्टम संसाधन: अधिक मशीनें * अधिक तेज मशीनें (~1000 गुना वृद्धि)
ये parameters लगातार, और कभी-कभी कई orders of magnitude तक बदलते रहते हैं, इसलिए सिस्टम डिज़ाइन को लगातार evolve होना पड़ता है।
- किसी विशेष समय (X) पर उपयुक्त डिज़ाइन 10x, 100x स्केल पर बहुत खराब डिज़ाइन साबित हो सकता है। (इसलिए लगभग 10x growth को ध्यान में रखकर डिज़ाइन करें, लेकिन 100x होने से पहले rewrite की योजना बनानी चाहिए)
- पिछले 10 वर्षों में 7 बड़े overhaul हुए।

प्रारंभिक सिस्टम आर्किटेक्चर (1997-1999)

research project चरण (1997):
- frontend web server query लेकर index servers और document servers को distributed processing requests भेजता था
- index server: document ID (docid) आधारित sharding
- document server: document ID (docid) आधारित sharding
मूल सिद्धांत:
- दस्तावेज़ों को integer ID (docids) दी जाती थी (महत्वपूर्ण/उच्च-गुणवत्ता वाले दस्तावेज़ों को छोटे ID दिए जाते थे)
- index server: (query) → sorted (score, docid, ...) सूची लौटाता है। docid आधारित sharding, capacity के लिए replication। लागत O(#queries * index में #documents)।
- document server: (docid, query) → (title, snippet) बनाता है। docid → disk पर document full text mapping। docid आधारित sharding। लागत O(#queries)।
serving system (1999):
- research project संरचना में cache server और ad system integration जोड़ा गया
- index/document server shard replicas संचालित किए गए
- caching: index results और document snippets दोनों cache किए गए। cache hit rate 30-60%। performance सुधार और query latency घटाने में बड़ा योगदान। (हालाँकि index update/cache flush के समय बड़े latency spikes हो सकते हैं, इस पर ध्यान देना आवश्यक है)

इंडेक्स partitioning रणनीति

दस्तावेज़ आधारित (By doc): प्रत्येक shard के पास कुछ दस्तावेज़ों का index होता है (Google की पसंद)
- फायदे: shard-स्तर पर स्वतंत्र query processing, अतिरिक्त per-document जानकारी बनाए रखना आसान, network traffic (request/response) कम
- नुकसान: सभी shards को query process करनी पड़ती है, K-word query के लिए N shards पर O(K*N) disk seeks की आवश्यकता
शब्द आधारित (By word): प्रत्येक shard के पास सभी दस्तावेज़ों के लिए कुछ शब्दों का index होता है
- फायदे: K-word query अधिकतम K shards ही process करते हैं, O(K) disk seeks
- नुकसान: कहीं अधिक network bandwidth की आवश्यकता (matched documents की per-word जानकारी एक जगह इकट्ठी करनी पड़ती है), per-document जानकारी बनाए रखना कठिन

प्रारंभिक crawling और indexing (1998-1999)

crawling:
- सरल batch crawling system: seed URL सूची → pages crawl करना → links निकालकर queue में जोड़ना → पर्याप्त pages इकट्ठा होने पर रुकना
- विचारणीय बिंदु: किसी विशेष site पर overload न हो, uncrawled pages की priority तय करना (जैसे PageRank), uncrawled URL queue का कुशल प्रबंधन, machine failures को संभालना
indexing:
- सरल batch indexing system: साधारण Unix tools पर आधारित
- समस्याएँ: checkpointing न होने से machine failure पर बहुत परेशानी, source data checksum न होने से hardware bit errors की समस्या (प्रारंभिक machines में ECC/parity न होने से स्थिति और खराब, "mostly sorted" समस्या), "adversarial memory and programming" का अनुभव
- समाधान: छोटे record checksums संग्रहीत करना और corrupt records को skip/resync कर सकने वाला file abstraction विकसित करना

इंडेक्स update विधि (1998-1999)

आवृत्ति: लगभग महीने में एक बार
प्रक्रिया:
1. traffic कम होने वाले समय तक प्रतीक्षा
2. कुछ replicas को offline करना
3. नया index offline replicas पर copy करना
4. updated index की ओर संकेत करने वाला नया frontend शुरू करना और कुछ traffic serve करना
index server disk उपयोग रणनीति:
- disk का outer part अधिक bandwidth देता है
1. (पुराने index को serve करते हुए) नया index disk के inner half में copy करना
2. नए index का उपयोग करने हेतु restart करना
3. पुराना index delete करना
4. नए index को तेज outer half में फिर से copy करना
5. inner half में copied पहले नए index को delete करना
6. खाली हुए inner space का उपयोग performance-improving data structures बनाने में करना (उदा. Pair cache - अक्सर साथ आने वाले query term pairs की posting list intersections को पहले से compute करना)

growth के अनुसार scaling (1999-2001)

'99, '00, '01 में index size और traffic में तेज वृद्धि (~5 करोड़ → 100 करोड़+ pages, traffic में हर महीने 20% growth + Yahoo partnership के कारण रातोंरात traffic 2x होना आदि)
index server performance अत्यंत महत्वपूर्ण हो गई: लगातार machines जोड़ना + हर महीने 10-30% software-based performance improvements की आवश्यकता
scaling तरीका: अधिक shards और replicas जोड़ना
निहितार्थ:
- shard response time disk seeks की संख्या और पढ़े जाने वाले data की मात्रा पर निर्भर करती है
- performance सुधार के क्षेत्र: बेहतर disk scheduling, बेहतर index encoding

इंडेक्स encoding तकनीकों का विकास

प्रारंभिक encoding ('97): बहुत सरल byte-aligned format (WORD → [docid+nhits:32b, hit:16b, hit:16b...])। hit = position + attributes (font size, title आदि)। बड़े posting lists के लिए skip table जोड़ा गया। decoding सस्ती थी, लेकिन compression कम होने से अधिक disk bandwidth चाहिए थी।
विभिन्न encoding techniques:
- bit level: Unary, Gamma, Rice_k (Golomb code का special case), Huffman-Int
- byte-aligned: varint (प्रति byte 7 bits + continuation bit)
block-based index format: space और CPU उपयोग दोनों कम हुए (~30% आकार में कमी), decoding speed में सुधार। variable-length blocks का उपयोग। header (delta, length आदि) + document ID deltas (Rice_k) + hit counts (Gamma) + hit attributes (run-length Huffman) + hit positions (Huffman-Int)।
नया index format (2004 के बाद): single flat position space का उपयोग। auxiliary data structures से document boundaries track की जाती हैं। posting list = delta-encoded positions की सूची। compactness और बहुत तेज decoding speed दोनों महत्वपूर्ण हैं।

in-memory index system (2001 की शुरुआत)

पृष्ठभूमि: sharding scale-up + replicas में वृद्धि → कुल memory इतनी हो गई कि पूरे index की copies memory में रखी जा सकें
आर्किटेक्चर: frontend → load balancer (bal) → shard (प्रत्येक shard पर कई in-memory index server replicas)
फायदे: throughput में भारी वृद्धि, latency में बड़ी कमी (विशेषकर जटिल queries जो पहले GB-स्तर disk I/O मांगती थीं - जैसे "circle of life")
समस्याएँ:
- variance: दर्जनों नहीं बल्कि हज़ारों machines के उपयोग से predictability घट गई (उदा. random cron jobs समस्या पैदा करते थे)
- availability: प्रत्येक document index data के replicas 1 या कम थे → "query of death" (सभी backends का एक साथ down होना), machine failure के समय index data availability की समस्या (विशेषकर महत्वपूर्ण documents के लिए replication ज़रूरी)

बाद के सिस्टम डिज़ाइन और तकनीकें (2004 के बाद)

hardware: बड़े पैमाने के data centers, custom-designed racks, PC-class motherboards, low-cost storage/networking hardware, Linux + in-house software
serving design (2004 edition): hierarchical structure
- Root server → Parent servers → Leaf servers (GFS से File Loader के माध्यम से Repository Shard load)
- cache server integration

Group Varint encoding

विचार: Varint decoding की अक्षमता (बहुत सारी branch/shift/mask operations) को हल करना। 4 integer values को group करके 5~17 bytes में encode करना।
तरीका:
- प्रत्येक value की byte length (1~4 bytes) दिखाने वाले 2-bit tags के 4 समूह बनाकर 1-byte prefix byte तैयार करना।
- prefix byte के बाद actual data bytes को store करना।
decoding: prefix byte पढ़कर उसे index की तरह उपयोग करते हुए precomputed 256-entry table lookup → offsets और mask जानकारी प्राप्त → 4 values को एक साथ decode करना।
performance: पुराने तरीकों से बहुत तेज (Group Varint: ~400M/sec, 7-bit Varint: ~180M/sec, 30-bit Varint w/ 2-bit len: ~240M/sec)

Universal Search (2007)

केवल web search results ही नहीं, बल्कि विभिन्न प्रकार की जानकारी (images, local info, news, video, blogs, books आदि) को एकीकृत करके दिखाने वाला system।
Super root node कई specialized search systems (vertical search engines) को query भेजकर results aggregate करता है।

low-latency crawling और indexing की चुनौतियाँ

कुछ मिनटों में updates को reflect करना बहुत कठिन काम है।
crawling heuristics: कौन-से pages crawl किए जाएँ?
crawling system: pages को तेज़ी से crawl करना चाहिए।
indexing system: PageRank, anchor text जैसी global data पर निर्भरता → online approximation की आवश्यकता।
serving system: query processing के दौरान updates स्वीकार करने के लिए तैयार होना चाहिए (इसके लिए batch update systems से बहुत अलग data structures चाहिए)।

experiments का महत्व और infrastructure

experimentation की आसानी: अत्यंत महत्वपूर्ण (तेज़ experiment cycle → अधिक experiments → अधिक improvements)।
experiments के प्रकार:
- आसान experiments: existing data weights adjust करना आदि।
- कठिन experiments: ऐसा नया data चाहिए जो production index में नहीं है। (नया data generate/integrate करना और experiments में उपयोग को आसान बनाना चाहिए)
मुख्य infrastructure:
- GFS: large-scale distributed file system
- MapReduce: large-scale data processing jobs को आसानी से लिखना/चलाना। (production index data generation तेज़ करना, temporary experiments जल्दी चलाना)
- BigTable: semi-structured storage system। (per-document जानकारी तक online/efficient access, कई processes द्वारा asynchronous document information updates संभव - hourly → minute-level updates के लिए महत्वपूर्ण)

experiment cycle और rollout

idea बनाना और offline experiments:
- MapReduce, BigTable आदि से data generate करना।
- offline experiments चलाना और प्रभाव की जाँच (human-evaluated query sets, random query sets आदि)।
- इस चरण में prototype की latency/throughput महत्वपूर्ण नहीं है।
- experiment परिणामों के आधार पर iterative improvement।
live experiments:
- यदि offline experiment results अच्छे हों, तो real user traffic के एक छोटे हिस्से (tiny sliver) पर live experiments चलाना (आमतौर पर random sample, कभी-कभी specific query class पर)।
- इस चरण में throughput से अधिक latency महत्वपूर्ण है! experiment framework को production latency के क़रीब काम करना चाहिए।
performance tuning और rollout:
- यदि live experiment results अच्छे हों, तो full load पर चलाने योग्य बनाने के लिए performance tuning/reimplementation (उदा. runtime computation की जगह data precompute करना, "काफी अच्छा" होने पर सस्ता approximation इस्तेमाल करना)।
- rollout process का महत्व: quality vs cost trade-off पर लगातार विचार, तेज rollout और कम latency/site stability के बीच संतुलन (search quality team और performance/stability team के बीच अच्छा सहयोग आवश्यक)।

भविष्य की दिशा और चुनौतियाँ

cross-language information retrieval (Cross-Language IR): दुनिया के सभी documents को सभी भाषाओं में translate करना → index size में भारी वृद्धि, computation cost अधिक। (translation quality में निरंतर सुधार, बड़े और जटिल language models संभालने के लिए बड़े-scale systems की आवश्यकता)
information retrieval systems में access control lists (ACLs): private/semi-private/widely shared/public documents का मिश्रित वातावरण। (बहुत अलग आकार के ACLs को कुशलता से संभालने वाली systems की ज़रूरत - 10 लोगों से साझा document और दुनिया भर से साझा document के लिए optimal solution अलग होगा, sharing patterns समय के साथ बदल सकते हैं)
efficient IR systems का automatic construction: अभी अलग-अलग उद्देश्यों के लिए कई systems उपयोग में हैं (ultra-fast updates के लिए, ultra-large documents के लिए आदि)। (क्या ऐसा एकल system विकसित किया जा सकता है जो parameter inputs के आधार पर requirements के अनुरूप efficient search system अपने-आप बना सके?)
semi-structured data से information extraction: स्पष्ट semantic labels वाले data बहुत कम हैं। tables, forms जैसे semi-structured data प्रचुर मात्रा में हैं। (unstructured/semi-structured sources से structured information extraction को बेहतर बनाने वाले algorithms/techniques की आवश्यकता - noise बहुत है लेकिन redundancy का उपयोग संभव है, कई sources की जानकारी को relate/combine/aggregate करने की क्षमता चाहिए)

निष्कर्ष

बड़े पैमाने के information retrieval systems का design और निर्माण चुनौतीपूर्ण और रोचक काम है।
नई search technologies अक्सर नए system design की मांग करती हैं।