Tiny GPU: a minimal GPU implemented in Verilog

(github.com/adam-maj)

2 points by GN⁺ 2024-04-27 | 1 comment

tiny-gpu is a minimal Verilog-based GPU implementation for learning from the ground up how GPUs work at the hardware level; it focuses on the principles shared by GPGPUs and ML accelerators rather than on graphics-specific hardware
The implementation includes fewer than 15 documented Verilog files, architecture and ISA documentation, matrix addition and multiplication kernels, and support for kernel simulation and execution tracing
The GPU runs a single kernel at a time: program memory and data memory are loaded, thread_count is set, and the kernel is started by driving the start signal high
For simplicity each core processes one block at a time, and every thread has its own ALU, LSU, PC, and register file, but the design assumes that all threads converge to the same PC after every instruction
Most capabilities of modern GPUs, such as multi-level caches, shared memory, memory coalescing, pipelining, warp scheduling, branch divergence, and barriers, are left out in favor of a structure built for learning

The problem tiny-gpu sets out to solve

For CPUs there is plenty of material to learn from, all the way from architecture down to control signals, but because the GPU market is so competitive, the low-level technical details of modern GPUs remain mostly proprietary
There is a lot of material on GPU programming, but very little on how a GPU works at the hardware level
The open-source GPU implementations Miaow and VeriGPU aim to be feature-complete and functional, so their structure is complex
tiny-gpu strips away many of the complexities of a production-grade graphics card and focuses on the core elements common to modern hardware accelerators
- The important components of a GPU architecture
- How the SIMD programming model is implemented in hardware
- How a GPU copes with limited memory bandwidth

Overall architecture

tiny-gpu is designed to run only one kernel at a time
A kernel is launched as follows
- Load the kernel code into global program memory
- Load the required data into data memory
- Specify the number of threads to launch in the device control register
- Start kernel execution by setting the start signal high
The GPU is made up of these units
- Device control register
- Dispatcher
- A variable number of compute cores
- Memory controllers for data memory and program memory
- Cache

Kernel execution and thread distribution

The device control register stores kernel execution metadata; in tiny-gpu it holds only thread_count, the total number of threads to launch
When a kernel starts, the dispatcher distributes threads across the compute cores
- It groups threads that can execute in parallel into blocks
- It sends blocks to available cores for processing
- Once all blocks have been processed, it reports that kernel execution is complete
The simplified core processes one block at a time
Every thread has a dedicated ALU, LSU, PC, and register file
Managing the execution of thread instructions on these resources is one of the hard problems in a GPU
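
The dispatch steps above can be sketched in Python. This is a rough model, not the repository's Verilog: the function name dispatch, the threads_per_block parameter, and the round-robin core assignment are assumptions made for illustration, since the text only says that blocks are sent to available cores.

```python
from math import ceil

def dispatch(thread_count, threads_per_block, num_cores):
    """Group threads into blocks and hand the blocks to cores (hypothetical model)."""
    num_blocks = ceil(thread_count / threads_per_block)
    schedule = {core: [] for core in range(num_cores)}
    for block_idx in range(num_blocks):
        first_thread = block_idx * threads_per_block
        # The last block may be only partially filled.
        size = min(threads_per_block, thread_count - first_thread)
        # Assumption: round-robin stands in for "the next available core".
        schedule[block_idx % num_cores].append((block_idx, size))
    return schedule

# 8 threads in blocks of 4 on 2 cores: one block per core.
schedule = dispatch(thread_count=8, threads_per_block=4, num_cores=2)
```

Each entry maps a core to the list of (block index, thread count) pairs it would process one after another, which mirrors the "one block at a time per core" rule.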

Memory structure and controllers

The GPU is built to interface with external global memory, and for simplicity data memory and program memory are kept separate
Data memory specification
- 8-bit addressing
- 256 rows in total
- 8-bit data
- Each row stores a value below 256
Program memory specification
- 8-bit addressing
- 256 rows in total
- 16-bit data
- Every instruction is 16 bits wide, as defined by the ISA
The memory controllers track memory requests coming from the cores, throttle them to match the actual external memory bandwidth, and route each response back to the resource that issued it
Each memory controller has a fixed number of channels, determined by the global memory bandwidth
The cache is a work-in-progress feature; it keeps data fetched from external memory in on-device SRAM so that later requests for it are served faster and memory bandwidth is freed for new data
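
How a fixed number of channels throttles requests can be illustrated with a small cycle-level model in Python. Everything here is a sketch under assumptions: the function serve_requests, the fixed latency per request, and the first-come-first-served queue are illustrative and not taken from the repository.

```python
from collections import deque

def serve_requests(addresses, num_channels, latency):
    """Return [(address, completion_cycle)] for queued requests (hypothetical model)."""
    pending = deque(addresses)
    channels = [None] * num_channels  # each slot: (address, cycles_left) or None
    done = []
    cycle = 0
    while pending or any(ch is not None for ch in channels):
        cycle += 1
        for i in range(num_channels):
            if channels[i] is None and pending:
                # A free channel picks up the next queued request.
                channels[i] = (pending.popleft(), latency)
            if channels[i] is not None:
                addr, left = channels[i]
                left -= 1
                if left == 0:
                    done.append((addr, cycle))  # response goes back to the requester
                    channels[i] = None
                else:
                    channels[i] = (addr, left)
    return done

# Four requests, two channels, three cycles each: the second pair has to wait.
result = serve_requests([0, 1, 2, 3], num_channels=2, latency=3)
```

With two channels the last two requests complete at cycle 6 instead of cycle 3, which is the bandwidth limit the controller is there to enforce.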

Inside a core

Each core has a single scheduler that manages thread execution
The tiny-gpu scheduler runs one block's instructions to completion before picking up a new block, and executes the instructions of all threads in lockstep
A more advanced scheduler could raise resource utilization with pipelining and warp scheduling
The scheduler's main constraint is the latency of loading data from and storing data to global memory
- Most instructions can execute synchronously
- Load-store operations such as LDR and STR are asynchronous, so instruction execution has to be organized around their long wait times
The fetcher asynchronously fetches the instruction at the current program counter from program memory
The decoder decodes the fetched instruction into control signals for thread execution
Each thread's register file holds the data it is computing on, which is what makes the SIMD pattern possible
- The read-only registers hold %blockIdx, %blockDim, and %threadIdx
- Using its local thread ID, the same kernel can run on different data in each thread
Each thread's ALU handles the arithmetic instructions ADD, SUB, MUL, and DIV
For CMP, the ALU outputs whether the difference of two registers is negative, zero, or positive, and the result is stored in the NZP register of the PC unit
Each thread's LSU accesses global data memory; it handles LDR and STR along with the asynchronous memory wait times
Each thread's PC determines the next instruction to execute
- By default it increments by 1 after every instruction
- BRnzp branches to a specific program memory row when the NZP register, set by the preceding CMP, matches the condition
- Loops and conditionals are implemented this way
For simplicity, tiny-gpu assumes that all threads converge to the same PC after every instruction
In a real GPU, individual threads can branch to different PCs, and a group of threads that was being processed together then splits into several execution flows; this is called branch divergence
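
The interplay of CMP, the NZP register, and BRnzp can be sketched in Python. The helper names cmp_nzp and next_pc, the tuple encoding of the flags, and the three-row toy program are assumptions for illustration; values are plain Python integers rather than fixed-width hardware registers.

```python
def cmp_nzp(rs, rt):
    """Return (negative, zero, positive) flags for rs - rt."""
    diff = rs - rt
    return (diff < 0, diff == 0, diff > 0)

def next_pc(pc, nzp, branch_mask=None, target=None):
    """Default is pc + 1; with a branch mask, jump if any selected flag is set."""
    if branch_mask is not None and any(m and f for m, f in zip(branch_mask, nzp)):
        return target
    return pc + 1

# Toy program (hypothetical): row 0: i += 1 / row 1: CMP i, 3 / row 2: BRn 0 / row 3: end
i, pc, nzp = 0, 0, (False, False, False)
while pc != 3:
    if pc == 0:
        i += 1
        pc = next_pc(pc, nzp)
    elif pc == 1:
        nzp = cmp_nzp(i, 3)
        pc = next_pc(pc, nzp)
    elif pc == 2:
        # Branch back to row 0 only while the last comparison was negative (i < 3).
        pc = next_pc(pc, nzp, branch_mask=(True, False, False), target=0)
```

The loop exits once the comparison yields zero instead of negative, which is how a counted loop falls out of nothing more than CMP plus a conditional branch.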

ISA

tiny-gpu implements an 11-instruction ISA for running simple proof-of-concept kernels such as matrix addition and matrix multiplication
Supported instructions (ten are listed here; the eleventh is NOP)
- BRnzp: jump to another program memory row when the NZP condition matches
- CMP: compare two register values and store the result in the NZP register
- ADD, SUB, MUL, DIV: basic arithmetic operations for tensor math
- LDR: load data from global memory
- STR: store data to global memory
- CONST: load a constant value into a register
- RET: signal that the current thread has finished executing
Each register is addressed with 4 bits, so there are 16 registers in total
- R0 through R12 are 13 free read-write registers
- The last 3 are read-only special registers that supply %blockIdx, %blockDim, and %threadIdx, which SIMD relies on
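
The register layout described above can be modeled in a few lines of Python. This is a sketch with assumptions: the class name RegisterFile, the exact index order of the three special registers (13, 14, 15), and the 8-bit value width are illustrative choices that the text does not state.

```python
class RegisterFile:
    """16 registers: R0-R12 writable, the last three read-only (hypothetical model)."""
    NUM_REGS = 16
    BLOCK_IDX, BLOCK_DIM, THREAD_IDX = 13, 14, 15  # assumed index order

    def __init__(self, block_idx, block_dim, thread_idx):
        self._regs = [0] * self.NUM_REGS
        self._regs[self.BLOCK_IDX] = block_idx
        self._regs[self.BLOCK_DIM] = block_dim
        self._regs[self.THREAD_IDX] = thread_idx

    def read(self, index):
        return self._regs[index]

    def write(self, index, value):
        if not 0 <= index <= 12:
            raise ValueError(f"R{index} is not a writable register")
        self._regs[index] = value & 0xFF  # assumption: 8-bit register values

# Thread 2 of block 1 with 4 threads per block computes its global index into R0.
rf = RegisterFile(block_idx=1, block_dim=4, thread_idx=2)
rf.write(0, rf.read(RegisterFile.BLOCK_IDX) * rf.read(RegisterFile.BLOCK_DIM)
            + rf.read(RegisterFile.THREAD_IDX))
global_index = rf.read(0)
```

Because every thread gets its own copy with different special-register values, the same instruction stream ends up addressing different data in each thread.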

Execution flow

Each core steps through the following control flow to execute an instruction
- FETCH: fetch the next instruction at the current PC
- DECODE: decode the instruction into control signals
- REQUEST: request data from global memory when LDR or STR needs it
- WAIT: wait for the global memory response when needed
- EXECUTE: perform the computation on the data
- UPDATE: update the register file and the NZP register
This control flow is designed for simplicity and ease of understanding
A real implementation could shorten processing time by collapsing some steps, or use pipelining to overlap the execution of multiple instructions on a core's resources
Each thread follows the same execution path while computing on the data in its dedicated register file
The thread datapath resembles that of a CPU; the difference is that %blockIdx, %blockDim, and %threadIdx sit in read-only registers, which is what enables SIMD
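
The six stages can be written out as a small state sequence in Python. This is only a sketch: the names Stage and stages_for are hypothetical, and it assumes that every instruction passes through all six stages, spending one cycle in WAIT unless it is an LDR or STR, in which case WAIT lasts for an assumed memory latency.

```python
from enum import Enum, auto

class Stage(Enum):
    FETCH = auto()
    DECODE = auto()
    REQUEST = auto()
    WAIT = auto()
    EXECUTE = auto()
    UPDATE = auto()

def stages_for(opcode, memory_latency=3):
    """List the stage occupied on each cycle for one instruction (hypothetical model)."""
    # Assumption: only LDR and STR stall in WAIT for the full memory latency.
    wait_cycles = memory_latency if opcode in ("LDR", "STR") else 1
    return ([Stage.FETCH, Stage.DECODE, Stage.REQUEST]
            + [Stage.WAIT] * wait_cycles
            + [Stage.EXECUTE, Stage.UPDATE])

add_cycles = len(stages_for("ADD"))
ldr_cycles = len(stages_for("LDR", memory_latency=3))
```

Comparing the two cycle counts shows why memory instructions dominate execution time in a design without pipelining or warp scheduling.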

Example kernels

Matrix addition and matrix multiplication kernels were written as a proof of concept for the ISA
The test files in the repository can fully simulate these kernels on the GPU and produce the data memory state and a full execution trace
Matrix addition
- matadd.asm adds two 1 x 8 matrices
- The 8 element-wise additions are performed in separate threads
- It demonstrates SIMD programming using the %blockIdx, %blockDim, and %threadIdx registers
- It uses the LDR and STR instructions and so exercises asynchronous memory handling
Matrix multiplication
- matmul.asm multiplies two 2 x 2 matrices
- Each output element is computed as the dot product of the corresponding row and column
- It demonstrates branching inside a thread using CMP and BRnzp
- All branches converge again, so it works on the current tiny-gpu implementation
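
What each thread of the matrix addition kernel does can be mirrored in Python. This is a sketch, not a transcription of matadd.asm: the function name matadd_thread, the memory layout (A at rows 0-7, B at rows 8-15, C at rows 16-23), and the split into 2 blocks of 4 threads are assumptions for illustration.

```python
def matadd_thread(memory, block_idx, block_dim, thread_idx,
                  base_a=0, base_b=8, base_c=16):
    """One thread's work: C[i] = A[i] + B[i] for its own index i (hypothetical layout)."""
    i = block_idx * block_dim + thread_idx   # global thread index from the special registers
    a = memory[base_a + i]                   # LDR
    b = memory[base_b + i]                   # LDR
    memory[base_c + i] = (a + b) & 0xFF      # ADD, then STR into 8-bit data memory

memory = [0] * 256          # 256 rows of 8-bit data
memory[0:8] = range(8)      # matrix A
memory[8:16] = range(8)     # matrix B

# Threads run one after another here; on the GPU they run in parallel.
for block_idx in range(2):
    for thread_idx in range(4):
        matadd_thread(memory, block_idx, 4, thread_idx)

result = memory[16:24]
```

Every thread runs the same code and differs only in the index it derives from its block and thread IDs, which is the SIMD pattern the kernel is meant to demonstrate.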

Simulation

Simulating the kernels requires iverilog and cocotb
Setup steps
- Install the Verilog compiler and cocotb with brew install icarus-verilog and pip3 install cocotb
- Download and extract the latest version of sv2v and add the binary to $PATH
- Run mkdir build in the repository root
Kernel simulations are run with make test_matadd and make test_matmul
The results are written as log files under test/logs
- Initial data memory state
- Full execution trace of the kernel
- Final data memory state
Each log file shows the input matrices at the top and the result matrix in the final data memory at the bottom
The execution trace covers the execution state of every thread in every core on each cycle
- Current instruction
- PC
- Register values
- Status information

Advanced GPU features deliberately left out

For simplicity, tiny-gpu excludes most of the performance and functionality improvements found in modern GPUs
Multi-level cache and shared memory
- Modern GPUs use several cache layers to reduce global memory accesses
- tiny-gpu implements only a single cache layer, which holds recent data between the requesting resource and the memory controller
- Multi-level caches shorten load times by keeping frequently used data closer to where it is used
- GPUs also use shared memory so that threads in the same block can exchange intermediate results
Memory coalescing
- Threads executing in parallel often access contiguous addresses, such as adjacent elements of a matrix
- Memory coalescing analyzes the queued memory requests and merges adjacent ones into a single transaction
- The goal is to cut the time spent on addressing and to service the requests together
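
The merging step can be sketched in Python. tiny-gpu does not implement coalescing, so this only illustrates the general idea: the function coalesce and the max_burst limit on transaction size are assumptions, not part of any real memory controller interface.

```python
def coalesce(addresses, max_burst=4):
    """Merge runs of consecutive addresses into (start, length) transactions."""
    transactions = []
    for addr in sorted(set(addresses)):
        if transactions:
            start, length = transactions[-1]
            # Extend the current transaction if this address continues the run.
            if addr == start + length and length < max_burst:
                transactions[-1] = (start, length + 1)
                continue
        transactions.append((addr, 1))
    return transactions

# Seven per-thread requests collapse into three memory transactions.
merged = coalesce([0, 1, 2, 3, 8, 9, 20])
```

Here seven individual requests become three transactions, so the controller spends far fewer channel slots on the same data.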
Pipelining
- A tiny-gpu core starts the next instruction only after the current instruction has fully executed for a group of threads
- Modern GPUs stream the execution of several sequential instructions while still guaranteeing that dependent instructions execute in order
- This raises resource utilization so that core resources do not sit idle, for example while waiting on an asynchronous memory request
Warp scheduling
- A block is split into warps, batches of threads that can execute together
- While one warp is waiting, the core executes instructions from another warp, so a single core processes multiple warps concurrently
- This is similar to pipelining, but it deals with instructions from different threads
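
The latency-hiding effect of warp scheduling can be shown with a toy cycle model in Python. tiny-gpu does not implement this, and the model is an assumption-heavy sketch: the function run, the single issue slot per cycle, the fixed-priority warp selection, and the MEM/ALU instruction labels are all illustrative.

```python
def run(warps, mem_latency=4):
    """Return the cycle at which all warps have finished (hypothetical model).

    warps: list of instruction lists, each instruction being "ALU" or "MEM".
    One instruction issues per cycle; after a MEM the warp stalls mem_latency cycles.
    """
    pc = [0] * len(warps)
    ready_at = [0] * len(warps)
    cycle = 0
    while any(pc[w] < len(warps[w]) for w in range(len(warps))):
        for w in range(len(warps)):
            if pc[w] < len(warps[w]) and ready_at[w] <= cycle:
                op = warps[w][pc[w]]
                pc[w] += 1
                ready_at[w] = cycle + 1 + (mem_latency if op == "MEM" else 0)
                break  # only one issue slot per cycle
        cycle += 1
    return max(ready_at, default=0)

one_warp = run([["MEM", "ALU"]])
two_warps = run([["MEM", "ALU"], ["MEM", "ALU"]])
```

In this model a second warp adds one cycle rather than doubling the total, because its memory wait overlaps with the first warp's.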
Branch divergence
- tiny-gpu assumes that all threads in a single batch are at the same PC after every instruction
- In practice, different threads can branch to different lines depending on their data
- Threads with different PCs split into separate execution flows, and the point where they converge again also has to be managed
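
One common way to handle divergence is to run each side of a branch in turn under an active mask and then reconverge. The Python sketch below illustrates that idea only; tiny-gpu does not implement it, and the function run_divergent and the example branch (negate negative values, double the rest) are hypothetical.

```python
def run_divergent(values):
    """Run 'if v < 0: v = -v else: v = v * 2' across threads using active masks.

    Returns the per-thread results and the number of serialized passes needed.
    """
    taken = [v < 0 for v in values]
    not_taken = [not t for t in taken]
    out = list(values)
    passes = 0
    for mask, op in ((taken, lambda v: -v), (not_taken, lambda v: v * 2)):
        if any(mask):  # a path with no active threads is skipped entirely
            passes += 1
            # Only threads whose mask bit is set execute this path.
            out = [op(v) if active else v for v, active in zip(out, mask)]
    # All threads reconverge here and continue in lockstep.
    return out, passes

diverged = run_divergent([-1, 2, -3, 4])
uniform = run_divergent([1, 2])
```

When the threads disagree, both paths are executed one after the other, so divergence costs two passes where uniform data needs one.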
Synchronization and barriers
- Modern GPUs can set barriers that make groups of threads in the same block wait until all of them have reached a specific point
- This is useful when threads need to exchange shared data, because it guarantees that the data has been fully processed

Next steps

Planned improvements
- Add a simple instruction cache
- Build an adapter to use the GPU with Tiny Tapeout 7
- Add basic branch divergence
- Add basic memory coalescing
- Add basic pipelining
- Optimize control flow and register usage to improve cycle time
- Write a basic graphics kernel or add simple graphics hardware to demonstrate graphics capability
Anyone who wants to improve the repository can contribute through a PR

1 comment

GN⁺ 2024-04-27

Hacker News opinions

The GPU market is so competitive that most low-level technical details of modern architectures are kept closed
As an exception, Intel has published quite a lot of GPU technical documentation: https://kiwitree.net/~lina/intel-gfx-docs/prm/
The i810/815 manuals can also be found online, and apart from an odd gap before the 965, where the 855/910/915/945 are missing, the documentation has been fairly continuous
- AMD also publishes a lot of documentation: https://www.amd.com/en/developer/browse-by-resource-type/documentation.html
  It even includes instruction set architecture documentation for current and older products, but it reads more like documentation written for implementers than a high-level explanation for interested hobbyists
- Intel's Linux driver is also of good quality and is in mainline
  I wish every company took this approach
- It is from 2018, but somewhat related: The Thirty Million Line Problem - Casey Muratori
A really great project, and it is good to see hardware projects like this being developed in the open
That said, I think it is closer to a SIMD coprocessor
To be called a GPU, I think it should have display output in at least some form
I know Nvidia and others have recently been selling server-only variants of their graphics architectures as GPUs, which has made the term quite loose, but the graphics part of a GPU design is still a large share of its complexity
- If it processes graphics, I think it can count as a GPU even without an output
  A GPU with no output is still useful
  My workplace has about 75 workstations with mid-range Quadros; the cards only have mini-DisplayPort and the company only bought HDMI cables, so everything is plugged into the integrated graphics
  The cards still accelerate software and process graphics, they just do not drive a screen
Nice. I strongly support work on open core GPUs
Here is another example: https://github.com/jbush001/NyuziProcessor
- It would be good to have a minimal CUDA implementation for one of these open core processors
  What volume would be needed to produce such a processor economically at TSMC or another foundry?
A truly excellent project
I want to get into FPGAs, but honestly it is hard even to guess where to start, and the whole field feels quite intimidating
My end goal is to build an accelerator card for LLMs; even though I picked that goal completely arbitrarily, it seems to overlap a lot with this project, and the difference is probably just the memory offloading part needed to load large models
- You need to change your frame of thinking
  Getting started with FPGAs has to be broken into several sub-skills, and expectations need adjusting too
  Nobody expects a software engineer to, from day one, build a whole computer from first principles, write an instruction set architecture, understand machine code, turn it into assembly, and then also develop a programming language just to build an application in Python
  Starting at the top and working down the stack is the right way
  If you abstract away the complexity and focus on building a system from existing IP, FPGA design is fairly easy
  I usually recommend something like MATLAB, because on a DevKit with a reference design you can build an initial application with HDL Coder
  Otherwise you face the huge burden of learning digital computing architecture, Verilog, timing, transceivers/I/O, pin planning, Quartus/Vivado, simulation/verification, embedded systems, and so on
  In short, start from system-level design, learn to take plug-and-play IP and connect it at the top level, and try dropping that module into an existing reference design
  After that, peel back the layers gradually to expose the complexity underneath
- I am in the same position, and this is my plan
  1. Read Harris and Harris, Digital Design and Computer Architecture (2022), Elsevier: https://doi.org/10.1016/c2019-0-00213-0
  2. Follow the authors' RVFpga course and build a real RISC-V CPU on an FPGA: https://www.youtube.com/watch?v=ePv3xD3ZmnY
- I recommend this path
  1. Clone the educational repository https://github.com/yuri-panchul/basics-graphics-music. It is a collection of simple exercises for people learning Verilog from scratch, written by Yuri Panchul, who has worked on GPU development at Imagination
  2. Get one of the dozens of supported FPGA boards, plus accessories such as switches and LEDs
  3. Install Yosys and the related tools
  4. Start with lab01 DeMorgan and do as many of the repository's exercises as you can
    You can work through the exercises while reading Harris&Harris
    Once you have finished the exercises and the book, it is time to start your own project
    By the way, there are also weekly meetups at HackerMojo, and you can join over Zoom even if you are not in the Valley
- I do not know what stage you are at, but these resources helped me understand digital logic and CPU/GPU architecture better
  1. https://learn.saylor.org/course/CS301
  2. https://www.coursera.org/learn/comparch

https://hdlbits.01xz.net/wiki/Main_Page

If you want to accelerate LLMs, you first need to understand the architecture
Start there
Hardware is actually both the easy part and, from a manufacturing standpoint, the hard part
Is there a reason for mixing non-blocking and blocking assignment operators in a sequential always block here?
- It looks like a local variable
- You can do that if you are not too obsessive about simulation and synthesis results matching
I did something similar in VHDL a long time ago
There was a site called opencores with many open source HDL projects
I wonder whether there are any good options these days among large, distributed, HPC-class HDL simulators
Using modern GPUs for RTL-level simulation seems reasonable
- Not "was", it still is: https://opencores.org/projects?language=VHDL
  Or is that not the same site but a similar one?
Does the ALU really implement the DIV instruction at the hardware level like that?
Is it normal for something like modern CUDA cores to have an actual division instruction, or is it usually emulated in software?
A real hardware division circuit takes up a lot of area, so I did not expect one in a GPU ALU
In Verilog it is very easy to write a single line like DIV: begin alu_out_reg <= rs / rt; end, but that one line eats a large piece of silicon
If you only ever simulate the Verilog, you may never notice
- This is just someone's project for learning Verilog
  The project stops at simulation, and building actual hardware would take far more work than this
Then again, this is a "GPU" without graphics capability
Personally I think something like this should be called by another name
- The first question is why CPUs and GPUs are built separately at all
  The gap between them is narrowing and each side is adding the other's capabilities, but there is still a substantial difference
  In my view this is related to Amdahl's law [0]
  In that sense a CPU can be called a latency-optimized processor and a GPU a throughput-optimized processor
  More specifically [1], a CPU can be described as a processor for long, deep data dependencies and a GPU as a processor for wide, flat data dependencies
  [0]: https://en.wikipedia.org/wiki/Amdahl%27s_law
  [1]: https://en.wikipedia.org/wiki/Data_dependency
- You could call it a TPU, a tensor processing unit
  A tensor is just an n-dimensional array
  Software or firmware layered on top could make it behave like a GPU
- I have kept thinking about starting a project to build a 'display adapter', but before even starting I got stuck because I could not work out the communication protocol between UEFI's GOP driver and the display adapter
  I tried piecing it together from the EDK2 source, but it was not clear how much of it is QEMU-specific
- You could call it an MPU, a matrix processing unit
- The term that seems to be sticking is AIA, AI accelerator
tiny-gpu's assumption that all threads "converge" to the same program counter after every instruction is a very naive simplification
In a real GPU, individual threads can branch to different PCs, and a group of threads that was being processed together breaks into separate executions, which is called branch divergence
It would have been better to try GPU programming before building a GPU in silicon
On top of that, calling this SIMD does not seem quite right either
This is the same person who earlier wired together other people's circuits to blink an LED and said he had built a CPU
- To begin with, is this not the same as calling __syncthreads() on every execution?
