Bend: a high-level language that runs on GPUs (using HVM2)

(github.com/HigherOrderCO)

1 point by GN⁺ 2024-05-18 | 1 comment

Bend is a high-level parallel programming language that aims to combine the expressiveness of Python and Haskell with CUDA-style massively parallel execution; it runs on the HVM2 runtime
It supports higher-order functions with closures, fast object allocation, unrestricted recursion, and continuations, yet runs on parallel hardware such as GPUs without explicit parallel annotations like thread creation, locks, mutexes, or atomics
The current design goal is performance that scales with core count, with support for 10,000+ concurrent threads; single-core performance may be low in the current version, and code generation and optimization are still being improved
There are three execution modes: bend run-rs, bend run-c, and bend run-cu. Parallelizable code can run in parallel on the C interpreter or the CUDA interpreter just by changing the run command
Windows support is still in progress, so WSL2 is the alternative for now, and GPU execution currently supports only NVIDIA GPUs

The programming model Bend targets

Bend is a programming language that keeps the feel of a high-level language while running on massively parallel hardware
It offers the features of expressive languages such as Python and Haskell:
- fast object allocation
- higher-order functions with closures
- unrestricted recursion
- continuations
Like CUDA, it runs on massively parallel hardware such as GPUs, and it aims for near-linear speedup with core count
None of the following has to be written by hand to get parallel execution:
- thread creation
- locks
- mutexes
- atomics
The runtime is HVM2

Current limitations and caveats

Bend focuses on scaling performance with core count and is designed to support 10,000+ concurrent threads
Single-core performance may be low in the current version
Performance is expected to improve as code generation and optimization techniques mature
Windows support is still in progress; WSL2 can be used as an alternative
GPU execution currently supports only NVIDIA GPUs

Installation and execution

Rust must be installed on both Linux and Mac
The C version of Bend uses GCC, and the README recommends GCC 12.x or lower
To use the CUDA runtime, CUDA Toolkit 12.x for Linux must be installed
Install HVM2 with cargo install hvm, and install Bend with cargo install bend-lang
The commands for running a Bend program are split by runner:
- bend run <file.bend>: uses the C interpreter by default, parallel execution
- bend run-rs <file.bend>: uses the Rust interpreter, sequential execution
- bend run-c <file.bend>: uses the C interpreter, parallel execution
- bend run-cu <file.bend>: uses the CUDA interpreter, massively parallel execution
gen-c and gen-cu compile a program into a standalone C or CUDA file
The code generator is still at an early stage and is not as mature as compilers such as GCC or GHC
The -s flag reports the number of reductions, the execution time, and interactions per second

Sequential sum and parallel sum examples

The README's summation example compares two ways of writing code that adds the numbers from start to target
The sequential version adds the current start to the result of Sum(start + 1, target)
- Each step depends on the result of the previous summation
- The next step cannot begin until the current one finishes, so this version cannot be parallelized
- The example calls Sum(1, 1_000_000) and carries a comment that this can overflow the maximum value of Bend's numbers
The parallelizable version splits the range in half and computes the left and right sums recursively
- Computing (3 + 4) does not depend on computing (1 + 2)
- The two computations can happen at the same time, so parallel execution is possible
In Bend, if code can run in parallel, changing the run command alone is enough to run it in parallel
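The difference between the two shapes can be sketched in Python. This is only an illustration of the dependency structure described above, not the README's Bend code: the function names and the inclusive range are assumptions, and Python itself will not run the two halves in parallel.

```python
def sum_sequential(start: int, target: int) -> int:
    """Adds `start` to the sum of the rest of the range (inclusive).

    Every addition waits for the recursive call below it, so the
    additions form one long chain with nothing to run side by side.
    """
    if start == target:
        return start
    return start + sum_sequential(start + 1, target)


def sum_halves(start: int, target: int) -> int:
    """Splits the inclusive range in half and sums each half recursively.

    The two recursive calls share no data, which is the property a
    parallel runtime can exploit.
    """
    if start == target:
        return start
    half = (start + target) // 2
    left = sum_halves(start, half)        # does not depend on `right`
    right = sum_halves(half + 1, target)  # does not depend on `left`
    return left + right


if __name__ == "__main__":
    # A small range keeps the sequential version within Python's
    # default recursion limit.
    print(sum_sequential(1, 100))
    print(sum_halves(1, 100))
```

Both functions return the same total; only the second exposes independent subproblems, and its recursion depth grows with the logarithm of the range instead of its length.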

Bitonic sorter performance example

As a speed example, the README presents a bitonic sorter implemented with immutable tree rotations
This is not the kind of algorithm one would readily expect to run fast on a GPU, but because it uses a divide-and-conquer approach, Bend runs it across many threads
No explicit thread creation or lock management is needed
The benchmark results are:
- bend run-rs: CPU, Apple M3 Max, 12.15 seconds
- bend run-c: CPU, Apple M3 Max, 0.96 seconds
- bend run-cu: GPU, NVIDIA RTX 4090, 0.21 seconds
Other algorithms can be found in the examples folder
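For readers unfamiliar with the algorithm, here is a minimal Python sketch of a bitonic sort. It swaps in plain lists for the README's immutable tree rotations, so it shows the divide-and-conquer structure only; the function names are hypothetical and the input length is assumed to be a power of two.

```python
def bitonic_merge(values, ascending):
    """Sorts a bitonic sequence by comparing each element with its partner
    half a length away, then merging the two halves recursively."""
    n = len(values)
    if n <= 1:
        return list(values)
    half = n // 2
    low, high = [], []
    for a, b in zip(values[:half], values[half:]):
        # Swap so that `low` receives the smaller element when ascending,
        # and the larger element when descending.
        if (a > b) == ascending:
            a, b = b, a
        low.append(a)
        high.append(b)
    # The two recursive merges are independent of each other.
    return bitonic_merge(low, ascending) + bitonic_merge(high, ascending)


def bitonic_sort(values, ascending=True):
    """Sorts a list whose length is a power of two."""
    n = len(values)
    if n <= 1:
        return list(values)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    half = n // 2
    # An ascending half followed by a descending half is a bitonic sequence.
    first = bitonic_sort(values[:half], True)
    second = bitonic_sort(values[half:], False)
    return bitonic_merge(first + second, ascending)


if __name__ == "__main__":
    print(bitonic_sort([7, 3, 0, 5, 6, 1, 4, 2]))
```

Every recursive pair of calls works on disjoint data, which is why the algorithm splits naturally across many threads even though its comparisons are fixed in advance.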

References

The technology underlying Bend is described in the HVM2 paper
Official documentation is in progress; a deeper explanation is in GUIDE.md
The feature list is in FEATURES.md
Bend is developed by HigherOrderCO

1 comment

GN⁺ 2024-05-18

Hacker News opinions

I ported the sum example to pure Python: single-threaded, it took 4.478 seconds on pypy3 and 1 minute 42.148 seconds on Python 3.12
By contrast, the single-threaded version of Bend has been running on my laptop for 42 minutes, using 6GB of memory, and still has not finished. The environment is a 12th Gen Intel(R) Core(TM) i7-1270P on Ubuntu 24.04
If it is this slow on such a simple example, it is hard to expect much from it on complex work; I also wonder whether it has been tested or developed on anything other than Mac/aarch64. I will rerun it later with the -s argument
- Running for 42 minutes is probably a bug. We have not tested much outside the M3 Max so far, and we know it is 2x slower on non-Apple CPUs, which we plan to fix
  The big disadvantage for Bend in the sum example is that it allocates 2 IC nodes for every numeric operation, which Python does not. We will soon be able to avoid this, as HVM1 did, but it is not implemented in HVM2 yet
  Most of the work on Bend went into getting the parallel evaluator right, and running closures and unrestricted recursion on the GPU was extremely hard. That part is only now finished, so almost no effort has gone into micro-optimization, and HVM2's code generation is still quite poor
  Comparing on cases like the Bitonic Sort example, where both sides allocate the same amount, gives a fairer view of real performance. HVM1 was about 3x slower than GHC on a single core, and I believe HVM2 can reach that level soon
  I understand that "it is bad now but will get better" can sound a little disappointing. Still, the foundation is ready, micro-optimization is the easiest part, and I am confident performance will rise considerably from here
- I have no stake in this debate, but recursion tests how efficiently the compiler/interpreter builds and tears down the call stack more than it tests computational performance
  This language targets compute-heavy GPU applications and is at an early stage. Recursion is not its target application, and it is hard to consider this a relevant benchmark
- A thread means different things on a GPU and on a CPU; on a GPU it is closer to a SIMD lane
  It is somewhat like how ISPC can compile code so that 32 function calls execute together per CPU thread. For example, with 16-bit data on AVX512, 32 cores × 2 SMT threads per core × 32 compiler-generated executions = 2048 executions can run together
- Python is very weak at recursion, which is one of the reasons it is poorly suited to functional programming; so this cannot be a fair benchmark
  A Pythonic implementation would probably use loops and mutable state
- I do not understand why the +0 is needed. Is it not an operation that does nothing?
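The ISPC arithmetic quoted in the replies above can be checked with a few lines of Python. This is a sketch under one assumption that the comment does not state outright: that the figure of 32 executions per thread comes from dividing a 512-bit AVX512 register into 16-bit elements. The function and parameter names are illustrative.

```python
def simultaneous_executions(cores: int, smt_threads_per_core: int,
                            register_bits: int, element_bits: int) -> int:
    """Counts how many SIMD lanes are active across the whole CPU."""
    lanes_per_thread = register_bits // element_bits  # 512 // 16 = 32
    return cores * smt_threads_per_core * lanes_per_thread


if __name__ == "__main__":
    # 32 cores, 2 SMT threads per core, AVX512 registers, 16-bit data.
    print(simultaneous_executions(32, 2, 512, 16))  # prints 2048
```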
There are a lot of negative reactions in this thread, but I want to give the author kudos just for getting this far
Among similar projects I only know of something like Futhark, but its Haskell-style syntax can feel quite difficult to ordinary developers used to C/C++/Python/JS/Java and so on
The biggest drawback is that, unlike Futhark, it targets only CUDA or multicore. Futhark can target OpenCL, CUDA, ISPC, HIP, single-core CPU, and multicore CPU. The performance issues others have raised look entirely solvable to me
- ILGPU is also worth a look. It has been around for a long time and is quite good, but sadly it is not well known
  A short example: https://github.com/m4rs-mt/ILGPU/blob/master/Samples/SimpleM...
  It also supports advanced features such as inline PTX assembly: https://github.com/m4rs-mt/ILGPU/blob/master/Samples/InlineP...
- Chapel is used a lot in high-performance computing
  NVIDIA has also sponsored Haskell, .NET, Java, and Julia variants on CUDA, there is a Python JIT as well, and a collaboration with the Mojo side is under way
- ParaSail is another language heading in this direction: https://github.com/parasail-lang/parasail
  It was created by Tucker Taft, who has worked as an Ada designer since 1995, and some of ParaSail's parallel capabilities made it into Ada 2022
OP brings some of the coolest projects to have appeared on HN recently, yet sadly, even though this is clearly an early version, all it seems to get is long criticism
- HN is closer to a community that wants to post something new or original. When someone wants to offer praise, they often upvote an existing comment instead of writing yet another "cool" comment
  With criticism, on the other hand, there are limited ways for something to be right and many ways for it to be wrong, so criticism can take endless forms. That is why only a few positive comments appear and most read as criticism or "it should have had this too". This is less any particular person's fault than the nature of today's technologist culture
- If this were my project, I would be quite thankful for people's criticism. That is how growth happens
  If people hid the brutal truth behind nothing but applause, the world would fall apart
- It has received 905 upvotes, which means it has had plenty of positive reaction too
  Criticism is also a sign that people are engaging with the ideas and approach out of interest, so it is often a positive signal
- Not criticizing new and ambitious projects is a good social norm. Such attempts should be encouraged, not discouraged
  But criticizing projects that make misleading, weakly supported, or false claims is also a good social norm, because it reduces such claims
- The coolest things are often the hardest to understand
  What is hard to understand often feels threatening, and criticism is a common reaction to a threat, as well as the way of responding that requires the least understanding
The homepage is really well made. It makes immediately clear what this does.
People who work with "combinators" usually like to use a lot of intimidating jargon, but OP actually shows the simple idea behind the tool. It is the opposite of the academic style that shows every last detail without saying what is actually going on. Good. There should be more of this.
In theory this is great and I understand the value proposition, but honestly I do not think it will become a truly relevant tool.
These are notes from my first impression and a skim of the paper. I know this is very early software.
Bend looks like a very limited DSL. There is no FFI, no way to interact with raw buffers, and the 24-bit floating-point format is odd too.
There is a reason IC is not mainstream. Performance is likely to stay very poor, and graph traversal does not fit hardware well.
The premise of optimal reduction is valid, but in the end you still have to write kernels that can be parallelized. That means there must be no data dependencies, and the use of recursion has to be kept in mind too.
There is no serious example directly comparing Bend/HVM code with an equivalent OMP/CUDA program. It is hard to evaluate how much implementation complexity drops and what the performance is like.
In real-world high-performance parallel computing, tree-like structures are rare and arrays are king. The reason is the physical nature of how memory works at the hardware level. What works best on a mutable contiguous memory buffer is loops. If HVM implements this, I will take a look.
For now it looks like a half-finished language that is almost completely isolated from external data, very slow, and layered with a huge abstraction over the hardware. It also fails to take advantage of capabilities such as multi-level caches, tensor cores, SIMD, and atomic operations.
Sorry if this came across as harsh; I still find the technical implementation and theoretical background very interesting. I am just not yet convinced of its usefulness in the real world.
- Thanks for the feedback. Let me correct a few things: we do use multi-level caches, and using them correctly can give 5x more performance.
  FFI is already implemented but not yet public. The reason is that we want to release it together with graphics rendering, which we think will be pretty cool.
  Haskell/GHC also use graphs and trees, yet nobody would say they are not practical. It is true that arrays are king, but many modern algorithms that do not fit arrays well, such as compilers, type checkers, and solvers, are implemented in Haskell.
  The main reason IC is not fast is that nobody has done proper low-level optimization work on it. All existing implementations were extremely inefficient, and my own work so far has gone into getting it to run correctly on the GPU.
  As for your point that there are no loops yet, the solution is simply to add loops. If you think there is some fundamental limit here, you will be surprised.
  HVM2 has finally become a scalable and correct algorithm, and now it is time to optimize the actual low-level performance.
- On point 5 (tree-like structures being rare): trees differ from the usual computer-science-style implementation, but they are quite widely used.
  Fast Multipole and Barnes-Hut algorithms use Morton order or H-index order to reduce O(n²) pairwise operations to O(n) and O(n log n) respectively. Barnes-Hut is more common in astrophysics, and Fast Multipole shows up more in molecular dynamics in chemistry.
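As a concrete illustration of the Morton order mentioned in the reply above, here is a minimal Python sketch of a 2D Morton (Z-order) key. It covers only the ordering step, not Barnes-Hut or Fast Multipole themselves, and the function name and 16-bit coordinate width are assumptions made for the example.

```python
def morton_encode_2d(x: int, y: int, bits: int = 16) -> int:
    """Interleaves the bits of x and y into a single Z-order key.

    Bit i of x lands at position 2*i and bit i of y at position 2*i + 1,
    so points that are close in 2D tend to get nearby keys.
    """
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


if __name__ == "__main__":
    points = [(1, 1), (0, 1), (1, 0), (0, 0)]
    # Sorting by the key lays the points out along a Z-shaped curve.
    print(sorted(points, key=lambda p: morton_encode_2d(*p)))
```

Sorting particles by such a key stores an implicit tree in a flat array, which is one way tree-based methods coexist with array-friendly memory layouts.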
Ten years ago I took 15-210, CMU's parallel algorithms course. It explained that Moore's law is reaching its limit, so parallelism would become the future of computing; I was convinced and wanted to experiment.
But there were not many options for general-purpose parallel programming. The SML used in class was not parallel either, and there was a section at the end that used extensions and CUDA, but as I remember it was limited.
Later, Rust let me experiment a bit with multithreading, and Shadertoy let me do creative work with shaders. But a general-purpose parallel language on the GPU is something I am very excited to try for myself.
- These days 210 really is parallel. With MaPLe (https://github.com/MPLLang/mpl) you can run 210-style code and get performance competitive with C/C++.
  If you liked 210, you may also like https://futhark-lang.org/. It is an ML-family language, compiles to GPU, and performs well.
- The trend of machines going multicore was one reason I decided to learn Elixir.
The idea is very cool, but unless I have missed something, it looks very slow.
I wrote a simple loop in C++ that adds from 0 to 2³⁰; without optimization it took 1.7 seconds on a single thread on my laptop, which is around Bend's performance on an RTX 4090. With -O3 the loop was vectorized and ran in under 80ms.
- Bend does not have tail call optimization yet. It is allocating a stack 1 billion deep, while C is just running a loop.
  Compared against a C program that actually allocates, Bend could be faster even with just a few threads.
  Bend's code generation is still poor, but these are low-hanging fruit. Most of the work has gone into making the very difficult parallel evaluator correct.
  I know this sounds like "trust me", but once we start on procedure compilation, loop generation, and so on, single-thread performance will get much better. We just have not done it yet.
  Honestly, I probably should have waited a bit longer before posting this.
- It would be better to check with objdump whether the loop was really vectorized or whether the compiler optimized it away entirely.
  That loop overflows a signed integer, which is undefined behavior in C++. The compiler is legally allowed to produce any result.
  To avoid this, declare sum as unsigned. Unsigned integer overflow is well-defined, and optimization still happens, but at least correctness is guaranteed.
- With clang, compiling at -O3 removes the loop entirely: https://godbolt.org/z/M1rMY6qM9
  This is probably not a fair comparison.
- The main point is probably that Bend is far more high-level than C++.
  Of course, I may be the one missing the point.
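The signed-overflow point raised in the replies above can be made concrete with a small Python sketch. Python integers never overflow, so the wraparound of an unsigned 32-bit accumulator is emulated with a mask. The 32-bit width and the exclusive upper bound are assumptions, since the original C++ loop is not shown, and the function name is hypothetical.

```python
MASK32 = 0xFFFFFFFF          # keeps the low 32 bits, like a uint32_t
INT32_MAX = 2**31 - 1        # largest value a signed 32-bit int can hold


def wrapping_sum_u32(limit: int) -> int:
    """Sums 0 .. limit-1 with modulo-2**32 wraparound on every addition."""
    total = 0
    for i in range(limit):
        total = (total + i) & MASK32
    return total


if __name__ == "__main__":
    # The exact sum of 0 .. 2**30 - 1, computed in closed form.
    exact = 2**30 * (2**30 - 1) // 2
    print(exact > INT32_MAX)  # prints True: a signed 32-bit sum would overflow
    # A smaller limit keeps the emulated loop fast while still wrapping.
    print(wrapping_sum_u32(100_000) == (100_000 * 99_999 // 2) % 2**32)  # prints True
```

With unsigned arithmetic the result is always the exact sum reduced modulo 2³², so every conforming compiler must agree on it, which is the correctness guarantee the reply refers to.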
I want to congratulate the author. This is really excellent work.
Getting automatic parallelization right is not easy at all, and you should be proud of it. I look forward to seeing how the project evolves.
I do not understand why the reaction is so negative. It felt like an angry mob of bots seizing on flaws in the README to twist the context and intent of the post
Arguing for hours without spending even 2 minutes reading properly is ignorance and cruelty. OP has brought all of this this far as a one-person project, so I hope they keep going
I was curious whether HVM2 compiles interaction nets to, say, SPIR-V, or whether it is an interpreter running on the GPU like the original HVM
Previously I compiled interaction nets to C by reducing programs as far as possible without reducing the inputs, and treated it as whole-program optimization. Targeting a shader language did not seem too difficult either
Looking at the repository, it says it provides a low-level IR language for specifying HVM2 nets and a compiler to C/CUDA: https://github.com/HigherOrderCO/HVM
But on a second look, the HVM2 CUDA runtime seems to be an interpreter that traverses the graph in memory and applies reductions: https://github.com/HigherOrderCO/HVM/blob/5de3e7ed8f1fcee6f2...
What I meant was traversing the interaction nets to recover terms close to lambda calculus expressions, and lowering them to C in small pieces to keep runtime overhead minimal
The honest motivation is that it is hard for Bend to beat hand-written GPU kernels in areas like ML workloads. In theory HVM could act as the glue that connects compute kernels and parallelizes the order of execution, but that needs a good FFI
Translating interaction nets across an FFI boundary is hard, but by placing FFI compute kernel nodes inside the interaction network and compiling the nets to C, a reasonable FFI could be recovered without translation overhead
Another option is to implement HVM in hardware, which I am trying out a bit on a spare FPGA
- It is both an interpreter that runs on the GPU and a compiler to native C and CUDA
  It does not target SPIR-V directly, but that is a goal
  The C compiler gives the expected speed-up, 3 to 4x and soon more, but in the CUDA runtime we did not get a big speed-up over the non-compiled version
  The cause seems to be warp branching. With uncompiled procedures, all function calls can be merged into one "general-purpose" interpreter-style function expander, and warp threads can reduce without branching. We will research this area in more depth
