What it actually takes to run code on a €200 million supercomputer

(towardsdatascience.com)

2 points by GN⁺ | 2 hours ago

MareNostrum V, at the Universitat Politècnica de Catalunya in Barcelona, is one of the world's top 15 supercomputers: a €200 million public research infrastructure that distributes computation across 8,000 nodes.
It is not a single high-performance computer but a distributed system of thousands of independent computers, connected by an InfiniBand NDR200 fabric in a fat-tree topology so that any node can communicate with any other at the same minimal latency.
Jobs are submitted through the SLURM workload manager: batch scripts that state the resource request, time limit, and project budget are placed in the scheduler's queue.
It runs as an air-gapped environment with external internet access blocked, so the required libraries and datasets must be staged in advance, and extracting results is often more of a bottleneck than the computation itself.
It is a free public resource for researchers: Spanish institutions can obtain access through RES, and researchers across Europe through the regular calls of the EuroHPC Joint Undertaking.

Architecture: the network is the computer

The biggest misconception about HPC is that you are renting one ultra-powerful computer; in reality, you submit work that is distributed across thousands of independent computers.
In distributed computing, GPUs can sit idle while waiting for data transfers; to avoid this, MareNostrum V builds its InfiniBand NDR200 fabric as a fat-tree topology.
- In ordinary networks, bandwidth becomes a bottleneck when many computers share the same switch
- In a fat-tree topology, link bandwidth increases toward the top of the network hierarchy, which ensures non-blocking bandwidth
- Any of the 8,000 nodes can communicate with any other node at the same minimal latency

Structure of the compute partitions

General Purpose Partition (GPP): designed for highly parallel CPU workloads; 6,408 nodes with 112 Intel Sapphire Rapids cores each, for a total peak performance of 45.9 PFlops.
Accelerated Partition (ACC): designed for specialized workloads such as AI training and molecular dynamics; 1,120 nodes with 4 NVIDIA H100 SXM GPUs each, for a peak performance of 260 PFlops.
- Assuming a retail price of about $25,000 per H100, the 4,480 GPUs alone cost more than $110 million
Login Nodes: the first entry point after SSH access, used for lightweight tasks such as transferring files, compiling code, and submitting job scripts; they are not meant for computation.
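
The GPU cost estimate for the Accelerated Partition can be checked with a few lines of arithmetic. This is only a sketch: the node and GPU counts come from the text above, and the per-GPU price is the article's rough retail assumption, not a procurement figure.

```python
# Figures from the text; the H100 price is a rough retail assumption.
ACC_NODES = 1_120
GPUS_PER_NODE = 4
H100_PRICE_USD = 25_000

GPP_NODES = 6_408
CORES_PER_NODE = 112

total_gpus = ACC_NODES * GPUS_PER_NODE
gpu_cost_usd = total_gpus * H100_PRICE_USD
total_cpu_cores = GPP_NODES * CORES_PER_NODE

print(f"{total_gpus:,} GPUs, about ${gpu_cost_usd / 1e6:.0f} million")
print(f"{total_cpu_cores:,} CPU cores in the General Purpose Partition")
```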

Quantum infrastructure

Spain's first quantum computer is integrated into MareNostrum 5 at both the physical and logical levels.
- It includes a digital gate-based quantum system and MareNostrum-Ona, a quantum annealer based on superconducting qubits
Quantum Processing Units (QPUs) do not replace the classical supercomputer; they act as specialized accelerators.
Optimization problems and quantum chemistry simulations that even H100 GPUs struggle to handle efficiently can be offloaded to the quantum hardware, making the system a large-scale hybrid classical-quantum computing powerhouse.

Air gap, quotas, and the reality of HPC operations

Airgap: SSH access from outside is possible, but compute nodes have no external internet access.
- pip install, wget, and connections to external HuggingFace repositories do not work
- Everything a script needs must be downloaded and compiled in advance and staged in a storage directory
- Administrators provide most libraries and software through a module system
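
Because nothing can be fetched at run time, it can help to verify the staging directory before submitting a job. The following is a minimal sketch of such a pre-flight check; the function name `missing_assets`, the `staging` directory, and the file names are hypothetical and chosen only for illustration.

```python
from pathlib import Path


def missing_assets(staging_dir, required):
    """Return the required relative paths that are absent from staging_dir."""
    root = Path(staging_dir)
    return [rel for rel in required if not (root / rel).exists()]


if __name__ == "__main__":
    # Hypothetical staging directory and file names, for illustration only.
    missing = missing_assets("staging", ["wheels/numpy.whl", "data/meshes.tar"])
    if missing:
        print("Stage these before submitting:", ", ".join(missing))
    else:
        print("All assets staged.")
```

Running a check like this on the login node is cheap, whereas discovering a missing file after hours in the queue wastes both time and project budget.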
Data movement: data goes in and out through the login nodes using scp or rsync.
- Because the actual computation can be very fast, getting the finished results back to a local machine sometimes becomes the bottleneck
Limits and quotas: each project receives a fixed CPU-time budget, and there are hard limits on how many jobs a single user may have running or waiting at once.
- Every job must declare a strict wall-time limit
- If the job runs even 1 second past the requested time, the scheduler terminates the process immediately
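
Since overrunning the wall-time limit kills the job, one common habit is to pad a runtime estimate before requesting it. The sketch below formats a duration as the HH:MM:SS string used in the examples here; the helper names and the 25% default margin are assumptions for illustration, not a site recommendation.

```python
def slurm_time(seconds):
    """Format a duration in seconds as an HH:MM:SS time string."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def padded_time_limit(estimate_s, margin=0.25):
    """Add a safety margin to a runtime estimate and round up to a whole minute."""
    padded = estimate_s * (1 + margin)
    whole_minutes = -(-padded // 60)  # ceiling division
    return slurm_time(whole_minutes * 60)


# A 20-minute estimate with the default 25% margin.
print(padded_time_limit(20 * 60))
```

Requesting far more time than needed has its own cost, because schedulers generally find it easier to fit short jobs into gaps in the queue, so the margin is a trade-off rather than a free safety net.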
Logging: once a job is submitted there is no live terminal output; all stdout and stderr is redirected automatically to log files (for example sim_12345.out and sim_12345.err).
- After the job completes or crashes, results are verified and debugging is done from these text files
- squeue shows the status of submitted jobs, and tail -f follows a log file as it grows

SLURM workload manager

Once a research allocation is approved and you log in over SSH, you see a perfectly ordinary Linux terminal prompt.
Because thousands of researchers share the system, running a heavy script directly from the terminal can bring down the login node and earn you a warning email from the system administrators.
SLURM (Simple Linux Utility for Resource Management): open source job scheduling software. You specify the required hardware, software environment, and code to execute in a bash script; the job then enters the queue, runs as soon as hardware becomes available, and releases its nodes when it finishes.
Key #SBATCH directives:
- --nodes: how many physical machines are needed
- --ntasks: total number of MPI processes (tasks) to launch; SLURM distributes them across the nodes
- --time: strict wall-clock time limit; the job is terminated as soon as it is exceeded
- --account: the project ID that CPU time is charged to
- --qos: Quality of Service or a specific queue (for example a debug queue, which gives faster access but allows only short runtimes)

Practical example: OpenFOAM sweep orchestration

The goal was to build an ML surrogate model to predict aerodynamic downforce, which required running 50 high-fidelity CFD (Computational Fluid Dynamics) simulations on 50 different 3D meshes.
Example SLURM job script for a single OpenFOAM CFD case on the General Purpose Partition:
- Resources were defined with settings such as --nodes=1, --ntasks=6, and --time=00:30:00
- The environment was loaded with module load OpenFOAM/11-foss-2023a
- surfaceFeatureExtract, blockMesh, decomposePar, snappyHexMesh, potentialFoam, simpleFoam, and reconstructPar were run in sequence with srun --mpi=pmix
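
The settings listed above can be assembled into a batch script. The sketch below renders one as a string, which is convenient when generating one script per case. The directive values, module name, and tool order come from the text. Everything else is an assumption: the `account` and `qos` values are placeholders, the `sim_%j` log names follow the log-file naming mentioned earlier, and the split between single-task steps and steps launched with OpenFOAM's `-parallel` flag is a guess at the usual workflow rather than the author's actual script.

```python
def render_openfoam_job(account, qos, nodes=1, ntasks=6, time="00:30:00",
                        module="OpenFOAM/11-foss-2023a"):
    """Render a SLURM batch script for one OpenFOAM case as a string."""
    # (tool, runs_in_parallel): which steps are MPI-parallel is an assumption.
    steps = [
        ("surfaceFeatureExtract", False),
        ("blockMesh", False),
        ("decomposePar", False),
        ("snappyHexMesh", True),
        ("potentialFoam", True),
        ("simpleFoam", True),
        ("reconstructPar", False),
    ]
    lines = [
        "#!/bin/bash",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks={ntasks}",
        f"#SBATCH --time={time}",
        f"#SBATCH --account={account}",
        f"#SBATCH --qos={qos}",
        "#SBATCH --output=sim_%j.out",  # %j expands to the job ID
        "#SBATCH --error=sim_%j.err",
        "",
        f"module load {module}",
        "",
    ]
    for tool, parallel in steps:
        if parallel:
            lines.append(f"srun --mpi=pmix {tool} -parallel")
        else:
            lines.append(f"srun --mpi=pmix --ntasks=1 {tool}")
    return "\n".join(lines) + "\n"


# "my_project" and "debug" are placeholder values, not real identifiers.
print(render_openfoam_job(account="my_project", qos="debug"))
```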
Instead of submitting 50 jobs by hand, SLURM dependencies were used to chain each job after the previous one.
- With sbatch --dependency=afterany:$PREV_JOB_ID, all 50 jobs were registered in the queue within a few seconds
- By the next morning, the 50 aerodynamic evaluations had been processed and logged, ready for conversion to tensors for ML training
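
The chaining pattern described above can be sketched as a small driver. With `afterany`, each job starts once the previous one has ended, whatever its exit status, so one failed case does not block the rest of the sweep. The sketch assumes `sbatch --parsable` to capture the job ID; the function names and the `case_NN.sh` script names are hypothetical, and the submitter is injectable so the logic can be tried without a cluster.

```python
import subprocess


def sbatch_submit(argv):
    """Run sbatch and return the job ID; --parsable prints 'jobid[;cluster]'."""
    out = subprocess.run(argv, check=True, capture_output=True, text=True).stdout
    return out.strip().split(";")[0]


def submit_chain(scripts, submit=sbatch_submit):
    """Submit scripts one by one, making each wait for the previous job."""
    job_ids = []
    prev = None
    for script in scripts:
        argv = ["sbatch", "--parsable"]
        if prev is not None:
            argv.append(f"--dependency=afterany:{prev}")
        argv.append(script)
        prev = submit(argv)
        job_ids.append(prev)
    return job_ids


if __name__ == "__main__":
    # Dry run with a fake submitter so the sketch works without a cluster.
    calls = []

    def fake_submit(argv):
        calls.append(argv)
        return str(1000 + len(calls))

    ids = submit_chain([f"case_{i:02d}.sh" for i in range(50)], submit=fake_submit)
    print(len(ids), "jobs queued; second command:", " ".join(calls[1]))
```

On a real login node, calling `submit_chain(scripts)` without the `submit` argument would invoke `sbatch` itself.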

The limit of parallelization: Amdahl's Law

Even though each node has 112 cores, the CFD simulation requested only 6 tasks, and Amdahl's Law explains why.
Every program has a serial fraction that cannot be parallelized, and the theoretical speedup is strictly bounded by that serial part.
- Formula: S = 1 / ((1−p) + p/N), where S is the overall speedup, p is the parallelizable fraction, and N is the number of processor cores
- If just 5% of the code is serial, the maximum theoretical speedup is only 20x, even when using every core of MareNostrum V
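
To make the formula concrete, the sketch below evaluates it for a 5% serial fraction (p = 0.95) at a few core counts taken from the text: the 6 tasks requested, one full 112-core node, and all 6,408 × 112 cores of the General Purpose Partition. The function name `amdahl_speedup` is simply an illustrative choice.

```python
def amdahl_speedup(p, n):
    """Theoretical speedup S = 1 / ((1 - p) + p / n) for parallel fraction p on n cores."""
    return 1.0 / ((1.0 - p) + p / n)


p = 0.95  # 5% of the work is serial
for n in (6, 112, 6_408 * 112):
    print(f"N = {n:>7,}: speedup = {amdahl_speedup(p, n):.2f}x")

# As N grows without bound, the speedup approaches 1 / (1 - p).
print(f"upper bound: {1.0 / (1.0 - p):.0f}x")
```

With these inputs, 6 cores already give 4.8x and a full node about 17x, so multiplying the hardware by thousands adds less than 3x more.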
Splitting tasks across too many cores increases communication overhead on the InfiniBand network.
- If sending boundary conditions between cores takes longer than the actual computation, adding hardware can actually reduce performance
In simulations of small systems (problem size N=100), runtime increases beyond 16 threads; only large systems (N=10k+) make the hardware fully productive.
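
Amdahl's Law alone never predicts a slowdown, so the effect described above needs a communication term. The toy model below adds one that grows linearly with the core count. Both the linear form and the coefficients are assumptions picked purely for illustration (the defaults are chosen so the minimum lands at 16 cores); they are not fitted to any measurement from the article.

```python
def runtime(n, t_serial=4.0, t_parallel=256.0, t_comm=1.0):
    """Toy runtime: serial part + parallel part split over n cores + communication cost."""
    return t_serial + t_parallel / n + t_comm * n


def best_core_count(max_n, **coeffs):
    """Core count in 1..max_n that minimises the toy runtime."""
    return min(range(1, max_n + 1), key=lambda n: runtime(n, **coeffs))


small = best_core_count(112)                      # small problem
large = best_core_count(112, t_parallel=25_600.0)  # 100x more parallel work
print("small problem, best core count:", small)
print("large problem, best core count:", large)
```

In this model the small problem is fastest well below a full node, while the larger one keeps improving up to all 112 cores, which mirrors the qualitative behaviour described above.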
Writing code for a supercomputer is really an exercise in managing the compute-to-communication ratio.

How to get access

Despite the enormous hardware cost, access to MareNostrum V is free for researchers, because compute time is treated as a publicly funded scientific resource.
Researchers affiliated with Spanish institutions can apply through the Spanish Supercomputing Network (RES).
Researchers across Europe can apply through the regular access calls of the EuroHPC Joint Undertaking.
- The Development Access track is designed for code porting and ML model benchmarking projects, which makes it relatively accessible to data scientists as well
