The architecture stack of a one-person tech startup

xguru · 2021-04-12T10:49:45+09:00


(anthonynsimon.com)

50 points by xguru 2021-04-12 | 11 comments

A walkthrough of the architecture a solo developer uses to run a SaaS at a slow pace and without stress
The infrastructure is built to run several projects side by side
The explanation is based on PanelBear, a SaaS the author built recently
→ Started with SQLite + Django on the smallest VPS
→ After 6 months of iteration: a Django monolith on EKS + Postgres + ClickHouse (analytics) + Redis (caching) + Celery (scheduled jobs)
→ Most things are automated: autoscaling, ingress, TLS certificates, failover, logging, monitoring, and so on
→ The same setup is reused across several projects, which lowers cost and makes starting experiments very easy
→ Infrastructure management takes almost no time (0 to 2 hours a month)
→ Most of the time goes to feature development, customer support, and growing the business
Kubernetes on AWS is used, but it is not required. The author has experience with it and is therefore confident of running it reliably
→ Learned over several years of using it with a patient team that could work through mistakes together
→ "Kubernetes makes simple things complex, but it can also make complex things simple"
Automated DNS, SSL, and load balancing
→ All traffic goes from the CloudFlare proxy to an AWS L4 NLB (Network Load Balancer)
→ When a request arrives, the LB forwards it to one of the nodes in the k8s cluster
→ These nodes sit in private subnets spread across several AZs (Availability Zones)
→ ingress-nginx (an Nginx cluster) decides which service inside k8s the request should go to
→ Before forwarding traffic, nginx applies rate limiting and traffic shaping rules
→ In PanelBear, the app container is Django served by Uvicorn
→ A few config files across Terraform/K8s are shared by most projects
→ Deploying a new project takes about 20 lines of ingress configuration
Automated rollouts and rollbacks
→ Every push to master runs a CI pipeline on GitHub Actions
→ Codebase checks, plus end-to-end tests against a complete environment built with Docker Compose
→ When the checks pass, a new Docker image is built and pushed to ECR (AWS's Docker registry)
→ The flux component ( https://fluxcd.io/ ) in the k8s cluster automatically syncs the cluster's images
→ Flux automatically performs an incremental rollout
Horizontal autoscaling
→ Autoscaling based on CPU/memory usage
→ If a node in the cluster has too many Pods, more servers are created automatically to add cluster capacity and reduce load. When the workload drops, the cluster scales back down
→ In PanelBear, the API Pod replicas adjust automatically between 2 and 8
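The replica adjustment above can be sketched with the proportional rule that the Kubernetes Horizontal Pod Autoscaler documentation describes: desired = ceil(current × currentMetric / targetMetric), clamped to the configured bounds. The bounds 2 and 8 come from the text. The function name and the 70% CPU target used in the usage note are assumptions for illustration, and the real autoscaler also applies a tolerance band and stabilization windows that are left out here.

```python
import math


def desired_replicas(current: int, current_util: float, target_util: float,
                     min_replicas: int = 2, max_replicas: int = 8) -> int:
    """Scale the replica count in proportion to load, then clamp to the bounds."""
    if current < 1 or target_util <= 0:
        raise ValueError("current must be >= 1 and target_util must be > 0")
    raw = math.ceil(current * current_util / target_util)
    return max(min_replicas, min(max_replicas, raw))
```

With an assumed 70% CPU target, `desired_replicas(2, 140, 70)` asks for twice the Pods, while a nearly idle deployment still keeps the minimum of 2.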
Static asset caching with a CDN
→ CloudFlare is set on the DNS, so it handles all requests and also provides DDoS protection
→ Whitenoise ( https://github.com/evansd/whitenoise ) serves the static files, so there is no need to upload files to Nginx/Cloudfront/S3
→ NextJS is used for a few static websites such as the PanelBear landing page
Application data caching
→ Python's in-memory LRU caching is used in some places
→ Most endpoints use the in-cluster Redis
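As a rough illustration of the in-memory LRU caching mentioned above, here is a minimal sketch using the standard library's `functools.lru_cache`. The function `plan_limits`, its plan names, and its numbers are hypothetical, not taken from PanelBear.

```python
from functools import lru_cache
from typing import Tuple

CALLS = {"count": 0}  # counts how often the uncached body actually runs


@lru_cache(maxsize=256)
def plan_limits(plan: str) -> Tuple[int, int]:
    """Hypothetical slow lookup: (sites, monthly events) allowed per plan."""
    CALLS["count"] += 1
    table = {"free": (1, 5_000), "pro": (10, 500_000)}
    return table.get(plan, (0, 0))
```

Calling `plan_limits("free")` twice runs the body once; `plan_limits.cache_info()` reports the hit. A cache like this lives inside one process, so each Pod keeps its own copy, which is one reason to put shared data in Redis instead.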
Per-endpoint rate limiting
→ nginx-ingress provides a global rate limit, but sometimes different limits are needed per endpoint/method
→ With the Django Ratelimit library, limits can be declared on each Django view
→ With Redis as the backend, the clients calling each endpoint are tracked (by a hash based on the client key, not the IP)
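The idea of counting requests per endpoint and per hashed client key can be sketched as a fixed-window counter. This is not the Django Ratelimit implementation: the class name is hypothetical, and a plain dictionary stands in for Redis so the example runs on its own.

```python
import hashlib
import time
from typing import Callable, Dict, Tuple


class FixedWindowLimiter:
    """Count requests per (endpoint, client) inside fixed time windows."""

    def __init__(self, limit: int, window_seconds: int,
                 clock: Callable[[], float] = time.time) -> None:
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        # Stand-in for Redis; a real backend would expire old keys with a TTL.
        self.counts: Dict[Tuple[str, str, int], int] = {}

    def allow(self, endpoint: str, client_key: str) -> bool:
        # Store a hash of the client key rather than the raw key.
        client = hashlib.sha256(client_key.encode()).hexdigest()
        bucket = int(self.clock() // self.window)
        key = (endpoint, client, bucket)
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key] <= self.limit
```

A limiter built as `FixedWindowLimiter(limit=2, window_seconds=60)` lets the first two calls from one client through and rejects the third until the next window starts. Other clients and other endpoints are counted separately.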
App administration
→ Django's admin panel provides data viewing and editing out of the box
→ Added features such as blocking access for suspicious accounts, sending announcement emails, and processing account deletion requests (soft delete first, then permanent delete within 72 hours)
Running scheduled jobs
→ A SaaS runs many kinds of scheduled jobs: daily reports for customers, usage stats calculated every 15 minutes, metrics emails sent to staff, and so on
→ A few Celery workers and a Celery beat scheduler run inside the cluster. Redis is used as the task queue
→ HealthChecks.io is used to get alerts by SMS/Slack/email when scheduled jobs do not run properly
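The post does not say how the jobs report to HealthChecks.io, so the following is only a sketch of one common pattern: wrap each job so that it pings the check's URL on start, on success, and on failure. It uses the `hc-ping.com/<uuid>`, `/start`, and `/fail` endpoints of the Healthchecks.io pinging API. The decorator name `monitored` and the check ID are hypothetical.

```python
import functools
import urllib.request
from typing import Callable

PING_BASE = "https://hc-ping.com"


def http_ping(url: str) -> None:
    """Best-effort GET: a monitoring outage should never break the job itself."""
    try:
        urllib.request.urlopen(url, timeout=10).close()
    except OSError:
        pass


def monitored(check_id: str, ping: Callable[[str], None] = http_ping):
    """Wrap a job so it signals start, success, and failure."""
    def decorator(job):
        @functools.wraps(job)
        def wrapper(*args, **kwargs):
            ping(f"{PING_BASE}/{check_id}/start")
            try:
                result = job(*args, **kwargs)
            except Exception:
                ping(f"{PING_BASE}/{check_id}/fail")
                raise
            ping(f"{PING_BASE}/{check_id}")
            return result
        return wrapper
    return decorator
```

A Celery task body decorated with `@monitored("your-check-uuid")` would then raise an alert both when it fails and when it silently stops running, because the success ping never arrives.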
App configuration
→ All settings are handled through environment variables. It is an old approach, but it is portable and well supported
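A minimal sketch of environment-driven settings, using only the standard library. The environment variable names and defaults here are assumptions for illustration; the post does not list the actual ones.

```python
import os
from typing import Mapping


def env_bool(env: Mapping[str, str], name: str, default: bool = False) -> bool:
    """Interpret common truthy strings; fall back to the default when unset."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Build a settings dict from environment variables."""
    return {
        "DEBUG": env_bool(env, "DEBUG"),
        "SECRET_KEY": env["SECRET_KEY"],  # required: raises KeyError if missing
        "REDIS_URL": env.get("REDIS_URL", "redis://localhost:6379/0"),
        "ALLOWED_HOSTS": [h for h in env.get("ALLOWED_HOSTS", "").split(",") if h],
    }
```

Passing the mapping in as a parameter keeps the function easy to test, and failing loudly on a missing required value surfaces a misconfigured deployment at startup rather than at the first request.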
Keeping secrets safe
→ Uses kubeseal. It encrypts Secrets with asymmetric encryption. Only a cluster with access to the decryption key can decrypt them
→ AWS KMS encryption keys are used to protect the cluster's Secrets
Relational data: Postgres
→ For experiments, a vanilla Postgres container runs inside the cluster, and a K8s CronJob backs it up to S3 daily
→ As a project grows, the database is migrated from inside the cluster to RDS, where AWS handles things like encrypted backups and security updates
→ For stronger security, the database on AWS is accessible only from the private network
Columnar data: ClickHouse
→ ClickHouse is used to store PanelBear's analytics data efficiently and query it in real time
→ It is an excellent columnar database: very fast, and with the right structure it compresses well (less storage = more revenue)
→ A ClickHouse instance is self-hosted inside the K8s cluster
→ A CronJob takes regular backups of the columnar data to S3
→ For disaster recovery, there are a few scripts to manually back up data to S3 and restore it from there
DNS-based service discovery
→ K8s automatically manages DNS records inside the cluster and routes traffic to the right service
→ DNS records sync automatically so that connections keep going to healthy pods even during autoscaling
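From the application's side, this kind of service discovery is just a hostname lookup. The sketch below builds the conventional in-cluster name `<service>.<namespace>.svc.<cluster-domain>` (where `cluster.local` is the usual default domain) and resolves a host with the standard library. The service and namespace names in the usage note are hypothetical.

```python
import socket
from typing import List


def service_fqdn(service: str, namespace: str = "default",
                 cluster_domain: str = "cluster.local") -> str:
    """Build the in-cluster DNS name of a Kubernetes Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def resolve(host: str, port: int) -> List[str]:
    """Return the distinct IP addresses a hostname currently resolves to."""
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})
```

Inside the cluster, an app would connect to something like `service_fqdn("redis", "panelbear")` and never hard-code Pod IPs, so replicas can come and go without configuration changes.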
Version-controlled infrastructure
→ Docker, Terraform, and K8s manifests are managed in a single repository (infra mono-repo)
→ Infrastructure can be created and torn down with simple commands, and version control provides reproducibility
Terraform for cloud resources
→ Most cloud resources are managed with Terraform
→ This makes it possible to document and track infrastructure resources and settings
K8s manifests for app deployment
→ The K8s manifests are written in YAML files in the infra mono-repo
→ They are split into two folders: cluster and apps
→ cluster holds the settings for cluster-wide services such as nginx-ingress, encrypted secrets, and Prometheus scrapers
→ apps keeps each project's configuration in its own namespace
Subscriptions and payments
→ All payments are processed through Stripe Checkout
→ Payment information never has to be handled directly, which makes it easier to focus on the product
→ Create a customer session, redirect to Stripe's page, and once the result arrives via webhook, the job is done
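Because the payment result arrives through a webhook, the receiving endpoint has to confirm that the request really came from Stripe. The sketch below follows Stripe's documented signature scheme: the `Stripe-Signature` header carries `t=<timestamp>` and one or more `v1=<signature>` values, and the signature is an HMAC-SHA256 of `"<timestamp>.<payload>"` keyed with the endpoint secret. The function name and the 300-second tolerance are assumptions; the post does not show the author's handler, and in a real application Stripe's official library provides a helper for this check.

```python
import hashlib
import hmac
import time
from typing import Dict, List, Optional


def verify_stripe_signature(payload: bytes, header: str, secret: str,
                            tolerance: int = 300,
                            now: Optional[float] = None) -> bool:
    """Return True if the header holds a fresh, valid v1 signature for payload."""
    fields: Dict[str, List[str]] = {}
    for item in header.split(","):
        key, _, value = item.strip().partition("=")
        fields.setdefault(key, []).append(value)

    timestamps = fields.get("t", [])
    candidates = fields.get("v1", [])
    if not timestamps or not timestamps[0].isdigit() or not candidates:
        return False

    # Reject old events to limit replay attacks.
    current = time.time() if now is None else now
    if abs(current - int(timestamps[0])) > tolerance:
        return False

    signed = timestamps[0].encode() + b"." + payload
    expected = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
    # Constant-time comparison against every v1 signature in the header.
    return any(hmac.compare_digest(expected.encode(), c.encode())
               for c in candidates)
```

The check must run on the raw request body, before any JSON parsing, since re-serializing the payload would change the bytes that were signed.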
Logging
→ Logs are simply written to stdout with no logging agent, and k8s collects and rotates them automatically
→ They could be shipped to Elasticsearch/Kibana through FluentBit or similar, but this has not been done yet, to keep things simple
→ The CLI tool stern is used to inspect logs
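Writing logs to stdout needs nothing beyond Python's standard `logging` module. A minimal sketch follows; the format string and function name are assumptions, not the author's configuration.

```python
import logging
import sys


def configure_logging(level: str = "INFO") -> logging.Logger:
    """Send all log records to stdout so the container runtime can collect them."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    )
    root = logging.getLogger()
    root.handlers = [handler]  # replace any previously installed handlers
    root.setLevel(level)
    return root
```

After calling `configure_logging()` once at startup, every `logging.getLogger(__name__).info(...)` call ends up in the Pod's log stream, where tools such as `kubectl logs` or stern can read it.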
Monitoring and alerts
→ Prometheus / Grafana were self-hosted at first, but when the cluster had a problem the alerting system went down with it, which was inconvenient
→ So the author switched to New Relic
→ All services have a Prometheus integration that automatically collects metrics and can forward them to Datadog, New Relic, Grafana Cloud, and others. As a result, migrating to New Relic only required using the Prometheus Docker image they provide
Error tracking
→ Application errors are collected with Sentry
→ All alerts (downtime, cron job failures, security alerts, performance regressions, application exceptions) are centralized in one place through the Slack #alerts channel
Profiling and other goodies
→ Tools such as cProfile or snakeviz are used when deeper analysis is needed
→ Django Debug Toolbar is used on the local machine
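For the deeper analysis mentioned above, a small helper around the standard library's `cProfile` and `pstats` is often enough. The helper name `profile_call` is hypothetical.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top: int = 5, **kwargs):
    """Run func under cProfile; return (result, text report of the top entries)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()
```

To explore the same data visually, the profile can be saved with `profiler.dump_stats("out.prof")` and opened with `snakeviz out.prof`.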

11 comments

wellsbabo 2024-08-13

Thank you

admin2 2021-04-13

Are the features of Sentry and New Relic very different?

I thought the two did similar jobs, but I haven't used either of them yet.

kbumsik 2021-04-13

Oh, our company is also considering adopting k8s, and this is a pretty good article even if you're not a one-person startup.

fortune 2021-04-12

Thanks for the good article. It was inspiring.

khris 2021-04-12

It doesn't have to be only for solo-founder startups; this is a good article

yshrust 2021-04-12

There's a small typo,,

Error tracking

→ Using Sentry, application errors collect

=> I think this should be "are collected"

xguru 2021-04-12

Thanks. I've fixed it~!

xguru 2021-04-12

I hope that here too, as in India, more solo developers and small teams making money from their own services come forward.

I hope GeekNews grows into a place where such services can make a name for themselves and get healthy feedback.

A 6-month retrospective on running a one-person SaaS startup https://hi.news.hada.io/topic?id=2415
Running a software startup with minimal effort https://hi.news.hada.io/topic?id=1534
2021 independent SaaS status report [63p slides] https://hi.news.hada.io/topic?id=3728
The story of failing to build a 1 trillion won company https://hi.news.hada.io/topic?id=2
I sell onions on the Internet https://hi.news.hada.io/topic?id=3
What a startup CEO who came from Samsung learned after losing 1.2 billion won https://hi.news.hada.io/topic?id=3015
Running a startup for $6 a year https://hi.news.hada.io/topic?id=1621

wellsbabo 2024-08-13

Thank you

reedids 2021-04-12

I agree. Thank you :)

e1q88 2021-04-12

👍
