An Online Book on ML Engineering

(github.com/stas00)

4 points by GN⁺ 2024-01-25 | 1 comment

Machine Learning Engineering Open Book is a public resource of methodologies, tools, and step-by-step commands for successfully training, fine-tuning, and running inference on LLM, VLM, and RAG models.
The target readers are LLM/VLM training engineers and operators, and it includes a large number of scripts and copy-and-run commands so that problems can be solved quickly.
The material draws on know-how accumulated from training BLOOM-176B in 2022, IDEFICS-80B in 2023, and RAG models at Contextual.AI in 2024.
Its scope spans cloud selection, accelerators, storage, networking, orchestration, training, inference, debugging, testing, and resources, and PDF and EPUB e-books are also available.
It is a public knowledge repository that lets communities for whom operating large ML compute clusters directly is out of reach learn, secondhand, the operational knowledge that comes from real large-scale training.

Purpose and audience of the book

Machine Learning Engineering Open Book is a public knowledge collection for training, fine-tuning, and running inference on large language models and multimodal models.
It reads much like technical documentation, and it includes scripts and copy-and-run commands so that LLM/VLM training engineers and operators can apply them quickly.
The repository's content started as personal notes meant for quickly finding again solutions that had already been verified to work, and it has been shared with the wider ML community.

Experience-based scope

Much of the know-how was accumulated from real large-scale model training:
- Training the open source BLOOM-176B in 2022
- Training the multimodal model IDEFICS-80B in 2023
- RAG model training at Contextual.AI in 2024
The focus is on letting the community learn secondhand about areas where direct experience is hard to get because renting large ML compute clusters is expensive.

Topics covered

Insights
- AI Battlefield Engineering
- How to choose a cloud provider
Hardware
- Compute: accelerators, CPU, CPU memory
- Storage: local, distributed, and shared file systems
- Network: intra-node and inter-node networking
Orchestration
- Orchestration systems that manage containers and resources
- SLURM: Simple Linux Utility for Resource Management
Training / Inference
- Guides on model training
- Insights on model inference
Development
- Debugging and troubleshooting, covering both easy and hard problems
- The Art of Debugging Open Book, with related recipes and methodologies
- Tips and tools that help with writing tests
Miscellaneous
- LLM/VLM chronology resources

Comparison tables and tools for quick lookup

The comparison table of high-performance accelerators covers theoretical TFLOPS along with accelerator memory size and speed.
The network comparison table covers the theoretical speed of inter-node and intra-node networking.
Frequently used tools are available as separate shortcuts:
- all_reduce_bench.py: a tool for benchmarking network throughput more easily than with nccl-tests
- torch-distributed-gpu-test.py: a tool for quickly testing connectivity between nodes
- mamf-finder.py: a tool for finding the TFLOPS that can actually be achieved on an accelerator
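As a rough illustration of the numbers such benchmarks report, the sketch below converts a measured time into all-reduce bus bandwidth and into achieved matmul TFLOPS. It is not taken from the book's scripts: the function names are hypothetical, the bus-bandwidth correction factor follows the convention documented by nccl-tests, and the timings are made-up inputs rather than measurements.

```python
def allreduce_bus_bandwidth_gbs(payload_bytes: int, seconds: float, world_size: int) -> float:
    """Bus bandwidth in GB/s for one all-reduce (hypothetical helper).

    Assumption: uses the nccl-tests convention, where algorithm bandwidth
    (bytes / time) is scaled by 2 * (n - 1) / n for all-reduce.
    """
    algorithm_bw = payload_bytes / seconds / 1e9
    return algorithm_bw * 2 * (world_size - 1) / world_size


def matmul_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved TFLOPS for an (m x k) @ (k x n) matmul (hypothetical helper).

    A dense matmul performs 2 * m * n * k floating point operations
    (one multiply and one add per inner-product term).
    """
    return 2 * m * n * k / seconds / 1e12


if __name__ == "__main__":
    # Illustrative inputs only: 4 GB payload, 0.1 s, 8 ranks.
    print(f"busbw: {allreduce_bus_bandwidth_gbs(4_000_000_000, 0.1, 8):.1f} GB/s")
    # Illustrative inputs only: 1000^3 matmul finishing in 1 ms.
    print(f"matmul: {matmul_tflops(1000, 1000, 1000, 0.001):.1f} TFLOPS")
```

Comparing figures like these against the theoretical values in the comparison tables is one way to judge how much of the hardware's peak a workload actually reaches.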
Frequently used guides are also available as separate shortcuts:
- Debugging solutions that can be applied quickly when PyTorch applications hang or break
- A cheatsheet and tricks for SLURM users
- How to create small models, datasets, and tokenizers
- A collection of public LLM/VLM training logbooks

Distribution formats and participation

The e-book is available on the Hugging Face Hub:
- PDF
- EPUB
The e-book will be rebuilt roughly every few weeks, and instructions for building the latest e-book yourself are also provided.
ML engineering topics can be discussed in the repository's community discussions.
Bug reports, typo fixes, and improvement suggestions are accepted via Issues or PRs.
The content license is Attribution-ShareAlike 4.0 International.
The citation information lists Machine Learning Engineering Open Book, the years 2023-2026, and the GitHub repository URL.

1 comment

GN⁺ 2024-01-25

Hacker News comments

I debug LLM training setups daily as part of my research support work, and I wish notes like these had existed when I started.
- As a game developer I am trying to get into machine learning/deep learning, and the biggest challenge has been finding a problem that has real value but is not so hard that I cannot do it while learning. I think I have found one and would like opinions.
  Motion capture data collection for games and film animation currently uses two kinds of systems: inertial and optical. Inertial is easier and cheaper but has more capture errors and inaccuracies, so it needs manual correction; optical is more accurate and needs less cleanup, but the hardware and space costs are higher.
  The idea is to wear an inertial motion capture suit while also recording an optical session, then train a machine learning model to correct the inertial data automatically. In theory, passing inertial recordings through the model could yield optical-level precision.
  I would like to know whether this is worth taking on as a first project, how best to approach it, and whether there are existing projects to use as references.
I help applied scientists with model training and deployment, and I would like to know how to get exposure to lower-level engineering work such as optimization and performance.
My company has an ML infra team, but its goal is building tools around the platform, not focusing on running workloads optimally.
- In my view, optimization is not possible without profiling. Getting familiar with the tools for understanding model performance could be the first step.
  Example: https://pytorch.org/tutorials/recipes/recipes/profiler_recip...
- Brendan Gregg's system performance and profiling resources are a good starting point. A large share of ML performance issues ultimately comes down to understanding what is actually happening in Linux perf, or in high-performance computing scheduling systems such as SLURM.
  https://www.brendangregg.com/linuxperf.html
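To make the "profile first" advice in the replies above concrete, here is a minimal sketch that uses Python's standard-library cProfile as a stand-in for the PyTorch profiler the reply links to. The functions `slow_step`, `fast_step`, and `train_iteration` are hypothetical placeholders for real training code.

```python
import cProfile
import io
import pstats


def slow_step(n: int) -> int:
    # Deliberately expensive placeholder for a hot spot.
    return sum(i * i for i in range(n))


def fast_step(n: int) -> int:
    # Cheap placeholder that should barely show up in the profile.
    return n + 1


def train_iteration() -> int:
    # Hypothetical "iteration" mixing one slow and one fast call.
    return slow_step(200_000) + fast_step(10)


def profile(fn) -> str:
    """Run fn under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    fn()
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(5)
    return buffer.getvalue()


report = profile(train_iteration)

if __name__ == "__main__":
    print(report)
```

The report ranks functions by cumulative time, which is usually enough to see where an iteration spends its time before deciding what to optimize.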
I especially liked the Unsolicited Advice part of the AI Battlefield section. Everything is moving very fast, and it addresses in a very real way the emotional burden of feeling like you are always drowning amid the relentless, aggressively fast progress of AI development.
https://github.com/stas00/ml-engineering/blob/master/insight...
How widely used is Slurm?
- Slurm is practically everywhere in the high-performance computing (HPC) community. On the HPC side, the only comparable competitors I see are the SGE [1] and Torque/PBS [2] resource schedulers.
  I do not know the exact numbers, but my guess is that the overwhelming majority of the Top 500 supercomputers [3] run Slurm. As others have said, academic research computing centers mostly use Slurm as well, and Slurm is dominant at the US DoE national labs too.
  And a fun fact, possibly just a legend: the name “Simple Linux Utility for Resource Management (SLURM)” is a backronym built from Slurm, the drink in Futurama [4].
  [1] https://en.wikipedia.org/wiki/Oracle_Grid_Engine
  [2] https://github.com/adaptivecomputing/torque
  [3] https://www.top500.org/
  [4] https://futurama.fandom.com/wiki/Slurm
- According to Wikipedia, “Slurm is the workload manager on about 60% of the TOP500 supercomputers.” I have used it as the job manager frontend on most compute clusters for roughly the past 10 years.
- The Llama 2 models were also trained on Slurm.
- Related to this, I would like to know whether anyone has successfully migrated from Slurm to Kubernetes on physical clusters that train large models across many GPUs.
- It is used on most high-performance computing clusters. Those still on Torque would be the exception.
I clicked on the reproducibility item at random, but I am still wondering how reproducibility is achieved in distributed training. Doesn't deterministic synchronization slow things down? Still, I have heard that training is reproducible at least at some large companies.
- You probably want to make training updates as commutative as possible. Then the order in which updates are applied does not matter.
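One reading of this reply is that floating point accumulation is sensitive to the order in which contributions are added, so a reduction that follows arrival order can differ from run to run. The sketch below illustrates that with three toy "gradient" values keyed by a hypothetical worker rank, and shows two ways to make the result order-independent. It is an illustration of the arithmetic only, not a description of how any particular framework or company achieves reproducibility.

```python
import math

# Toy gradient contributions keyed by a hypothetical worker rank.
contributions = {0: 0.1, 1: 0.2, 2: 0.3}


def reduce_in_arrival_order(arrival_order):
    """Add contributions in whatever order they arrive (order-sensitive)."""
    total = 0.0
    for rank in arrival_order:
        total += contributions[rank]
    return total


def reduce_in_rank_order(arrival_order):
    """Always add in ascending rank order, regardless of arrival order."""
    total = 0.0
    for rank in sorted(arrival_order):
        total += contributions[rank]
    return total


def reduce_exact(arrival_order):
    """Use math.fsum, which returns the correctly rounded sum for any order."""
    return math.fsum(contributions[rank] for rank in arrival_order)


run_a = [0, 1, 2]  # one possible arrival order
run_b = [2, 1, 0]  # another possible arrival order

if __name__ == "__main__":
    print(reduce_in_arrival_order(run_a), reduce_in_arrival_order(run_b))
    print(reduce_in_rank_order(run_a), reduce_in_rank_order(run_b))
    print(reduce_exact(run_a), reduce_exact(run_b))
```

Fixing the reduction order costs some scheduling freedom, which is the slowdown the question worries about; an order-independent accumulation avoids that constraint but does more work per element.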
How can someone gain experience with these things without having a job in the field?
- Read resources like the submitted book, and try small projects on your own.
  It is not very different from learning programming without having a programming job.
  Of course, that does not mean either is easy; it takes considerable dedication.
- If the goal is to get a job, you need to keep your expectations realistic.
  Compared with fields like web development, the hiring market on this side is very small, and projects demand experts with very deep knowledge. It is not the kind of work where ChatGPT or Stack Overflow will help much.
- Do side projects, or join someone else's side project. The most important thing is to connect with the community and learn the technical language needed to talk with them.
  This community is relatively small, and you need several things to get started: some machine learning, strong coding ability, knowledge of how modern accelerators work, and the ability to read and understand papers in this direction.
- In my experience, side projects are the best way. Do not just learn the technology; pick a feasible project that uses the new technology you want to learn, and dig deep into it.
  Picking something “feasible” is often tricky, so do not be afraid to re-evaluate after a few weeks and adjust expectations if needed.
  The important thing is to keep going.
- You can take the fast.ai course. With a bit of effort and creativity, even if it takes more than 2 weeks, you can fine-tune a model and get state-of-the-art level results.
I would like to experiment with this, but I do not have a proper GPU. I would like to know how people actually run it.
Which Twitter accounts are good to follow to keep up with the latest information?
Is there a PDF somewhere? I can see build instructions, but not the actual file.
- The PDF is ready now: https://github.com/stas00/ml-engineering#pdf-version
- It will be ready in a few weeks. The build workflow is ready, but the stylesheet and the reorganization of the chapter structure still need finishing.
