How Discord Indexes Trillions of Messages

(discord.com)

28 points by GN⁺ 2025-05-05 | 4 comments

Discord redesigned its entire search architecture on Kubernetes to overcome the limits of its existing Elasticsearch-based search infrastructure, substantially improving message indexing performance and stability
The old Redis queue risked losing messages; replacing it with PubSub ensured reliable message delivery, and messages are now grouped by destination cluster and index so they can be processed more efficiently
Adopting a "cell" architecture spread the load across many smaller Elasticsearch clusters, which resolved node overload and the inability to roll out upgrades
Personal DM messages and server (guild) messages are indexed in separate cells, which laid the groundwork for the new full DM search feature
For very large communities (BFGs), dedicated cells and multi-shard indexes made it possible to scale beyond Lucene's maximum message count limit

Limits of the existing infrastructure

The Redis-based message queue became a bottleneck whenever an Elasticsearch node failed, and messages could be lost
In large clusters (200+ nodes), a single node failure could push the overall indexing failure rate as high as 40%
Indexes that reached Lucene's MAX_DOCS limit (roughly 2 billion messages) brought indexing to a complete halt
Because of the legacy setup, even the log4shell patch could only be applied after taking the whole system offline
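A rough model helps show why one failed node can hurt so many indexing requests. The sketch below assumes a bulk batch of 50 messages in which each message is routed to a uniformly and independently chosen node. Both the batch size and the routing model are illustrative assumptions, not figures from this summary.

```python
def batch_failure_probability(nodes: int, batch_size: int, failed_nodes: int = 1) -> float:
    """Probability that at least one message in a bulk batch targets a failed node.

    Assumption: every message lands on a node picked uniformly and independently.
    """
    healthy_fraction = (nodes - failed_nodes) / nodes
    return 1.0 - healthy_fraction ** batch_size


if __name__ == "__main__":
    # Node counts and batch size are hypothetical inputs for illustration.
    for nodes in (100, 200):
        p = batch_failure_probability(nodes, batch_size=50)
        print(f"{nodes} nodes, batch of 50: {p:.1%} of batches touch the failed node")
```

Under these assumptions, roughly 39% of batches touch the failed node in a 100-node cluster and roughly 22% in a 200-node cluster, so the exact rate depends heavily on batch size and on how many nodes a batch fans out to.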

Solution strategy

Rebuilding on Kubernetes

Elasticsearch cluster operations were automated using the Elastic Kubernetes Operator (ECK, Elastic Cloud on Kubernetes)
Rolling restarts and OS and software upgrades can now be carried out safely

Distributing clusters with a "cell" architecture

Instead of the earlier single large cluster, several smaller clusters together form a cell
The number of indexes per cell is capped, and each shard is kept under 50GB and 200 million messages
Indexing and query performance improved, and the overhead of maintaining cluster state dropped
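The shard targets above translate into simple capacity arithmetic. The following is a minimal sketch, assuming the 50GB and 200-million-message targets from the text; the function name and the average message size are hypothetical inputs chosen for illustration.

```python
import math

# Lucene's hard cap on documents in one index (one Elasticsearch shard):
# Integer.MAX_VALUE - 128.
LUCENE_MAX_DOCS = 2_147_483_519

# Per-shard targets taken from the summary.
TARGET_DOCS_PER_SHARD = 200_000_000
TARGET_BYTES_PER_SHARD = 50 * 10**9  # 50GB, decimal gigabytes assumed


def primary_shards_needed(message_count: int, avg_bytes_per_message: int) -> int:
    """Smallest shard count that keeps every shard under both targets."""
    by_docs = math.ceil(message_count / TARGET_DOCS_PER_SHARD)
    by_size = math.ceil(message_count * avg_bytes_per_message / TARGET_BYTES_PER_SHARD)
    return max(1, by_docs, by_size)


def exceeds_single_shard(message_count: int) -> bool:
    """True when the messages cannot fit in one Lucene index at all."""
    return message_count > LUCENE_MAX_DOCS


if __name__ == "__main__":
    # Hypothetical guild: 3 billion messages at about 200 bytes each.
    print(primary_shards_needed(3_000_000_000, 200))
    print(exceeds_single_shard(3_000_000_000))
```

Taking the larger of the two counts means whichever target is tighter, document count or on-disk size, decides the shard count.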

PubSub-based message queue

Moving from Redis to PubSub made it possible to keep the queue without losing messages
PubSub use is also being extended to other features, such as task scheduling
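The difference between a lossy queue and an acknowledged one can be shown with a toy in-memory model. This is only a sketch of the general pattern: the summary does not describe how the old Redis consumer or the PubSub consumer is actually written, and every name below is hypothetical.

```python
from collections import deque


def consume_pop_first(queue, index_fn):
    """Lossy pattern: the message leaves the queue before indexing succeeds."""
    indexed = []
    while queue:
        msg = queue.popleft()
        try:
            index_fn(msg)
            indexed.append(msg)
        except ConnectionError:
            pass  # the message is already gone from the queue, so it is lost
    return indexed


def consume_ack_after_success(queue, index_fn, max_attempts=10):
    """At-least-once pattern: a message is removed (acked) only after success."""
    indexed = []
    attempts = 0
    while queue and attempts < max_attempts:
        attempts += 1
        msg = queue[0]  # peek without removing
        try:
            index_fn(msg)
        except ConnectionError:
            continue  # no ack: the message stays queued and is retried
        queue.popleft()  # ack
        indexed.append(msg)
    return indexed


def make_flaky_indexer(fail_on, failures=1):
    """Hypothetical indexer that fails `failures` times on one message."""
    remaining = {"n": failures}

    def index_fn(msg):
        if msg == fail_on and remaining["n"] > 0:
            remaining["n"] -= 1
            raise ConnectionError("node unavailable")

    return index_fn
```

With a three-message queue where the indexer fails once on the second message, the pop-first consumer indexes only two messages, while the ack-after-success consumer retries and indexes all three.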

Cluster-based batch indexing

Messages received from PubSub are grouped by target cluster and index, then processed in parallel as separate tasks
The message routing pipeline is implemented with Rust tokio tasks and channels
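The routing idea can be sketched as one worker task per destination, each fed by its own channel. The original uses Rust with tokio; the version below swaps in Python's asyncio tasks and queues purely for illustration. The message fields, the batch size, and the `route` function are assumptions, and "sending" a batch is simulated by recording it.

```python
import asyncio
from collections import defaultdict


async def route(messages, batch_size=2):
    """Fan messages out to one worker task per (cluster, index) destination."""
    queues = {}
    workers = []
    results = defaultdict(list)  # destination -> list of batches "sent"

    async def worker(dest, queue):
        batch = []
        while True:
            msg = await queue.get()
            if msg is None:  # sentinel: stop reading, flush what is left
                break
            batch.append(msg["id"])
            if len(batch) >= batch_size:
                results[dest].append(batch)  # stand-in for a bulk index call
                batch = []
        if batch:
            results[dest].append(batch)

    for msg in messages:
        dest = (msg["cluster"], msg["index"])
        if dest not in queues:
            # First message for this destination: create its channel and task.
            queues[dest] = asyncio.Queue()
            workers.append(asyncio.create_task(worker(dest, queues[dest])))
        await queues[dest].put(msg)

    for queue in queues.values():
        await queue.put(None)
    await asyncio.gather(*workers)
    return dict(results)


if __name__ == "__main__":
    sample = [
        {"id": 1, "cluster": "c1", "index": "guild-a"},
        {"id": 2, "cluster": "c1", "index": "guild-a"},
        {"id": 3, "cluster": "c2", "index": "guild-b"},
        {"id": 4, "cluster": "c1", "index": "guild-a"},
    ]
    print(asyncio.run(route(sample)))
```

Because each destination has its own queue and worker, a slow or failing cluster only delays its own batches instead of blocking messages bound for other clusters.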

Search feature improvements

User-based DM search

Previously DMs were indexed per channel, so searching across all of a user's DMs was inefficient
DM messages are now dual-indexed into user-specific indexes, so all of a user's DMs can be searched at once
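One way to read the dual-indexing described above is that each DM is written once per participant, so a search only has to consult a single per-user index. The toy in-memory model below illustrates that reading; the class, field names, and substring matching are all assumptions and not Discord's actual schema.

```python
from collections import defaultdict


class UserDmIndex:
    """Toy per-user DM index: every message is stored under each participant."""

    def __init__(self):
        self._by_user = defaultdict(list)

    def index_message(self, message):
        # Write one copy of the message per participant.
        for user_id in message["participant_ids"]:
            self._by_user[user_id].append(message)

    def search(self, user_id, term):
        """Return IDs of this user's DMs, across all channels, containing term."""
        term = term.lower()
        return [
            m["id"]
            for m in self._by_user.get(user_id, [])
            if term in m["content"].lower()
        ]


if __name__ == "__main__":
    idx = UserDmIndex()
    idx.index_message({"id": 1, "channel_id": 10, "participant_ids": [1, 2], "content": "Lunch tomorrow?"})
    idx.index_message({"id": 2, "channel_id": 11, "participant_ids": [1, 3], "content": "lunch was great"})
    print(idx.search(1, "lunch"))
```

The trade-off is extra storage and write volume in exchange for a search that touches one index instead of one per DM channel.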

Support for BFGs (Big Freaking Guilds)

Multi-shard indexes were adopted for communities that exceed Lucene's message count limit
BFGs are handled in dedicated Elasticsearch cells using a multi-primary-shard layout
Messages were dual-indexed into both the old and the new indexes, and query traffic was then gradually shifted to the new ones
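The dual-write and gradual cutover described above can be sketched as follows. The summary does not say how query traffic was shifted, so the hash-based percentage rollout here is an assumption, as are the index naming scheme and function names.

```python
import zlib


def old_index(guild_id: int) -> str:
    return f"guild-{guild_id}-v1"  # hypothetical name for the legacy index


def new_index(guild_id: int) -> str:
    return f"guild-{guild_id}-v2"  # hypothetical name for the multi-shard index


def write_targets(guild_id: int) -> list:
    """During migration every message is written to both indexes."""
    return [old_index(guild_id), new_index(guild_id)]


def use_new_index(guild_id: int, rollout_percent: int) -> bool:
    """Stable bucketing: a guild keeps the same bucket as the rollout grows."""
    bucket = zlib.crc32(str(guild_id).encode()) % 100
    return bucket < rollout_percent


def query_target(guild_id: int, rollout_percent: int) -> str:
    """Pick which index serves reads for this guild at the current rollout."""
    if use_new_index(guild_id, rollout_percent):
        return new_index(guild_id)
    return old_index(guild_id)
```

Because writes go to both indexes throughout, lowering `rollout_percent` sends reads back to the old index without any data loss.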

Results

Trillions of messages indexed, with indexing throughput doubled compared to before
Query response time: average 500ms → 100ms, p99 1s → under 500ms
More than 40 clusters and thousands of indexes are running in production
Cluster upgrades and rolling restarts are fully automated and cause no service interruption

4 comments

mhj5730 2025-05-08

Doing all that while the service keeps running... that truly deserves respect.

ethanhur 2025-05-08

Discord engineering always feels inspiring. I'm envious.

jujumilk3 2025-05-07

I wondered what pubsub was, and it turns out to be an IaaS offered by GCP.

https://cloud.google.com/pubsub?hl=en

mssmss 2025-05-05

Quite impressive. Including the willingness to turn everything upside down to solve the problem.