New exponential function makes SiLU and SoftMax 2x faster at full accuracy

(github.com/ggerganov)

1 point by GN⁺ 2024-05-16 | 1 comment

llama.cpp PR #7154 rewrites GGML's CPU SiLU and SoftMax computation on top of llamafile's vectorized expf() implementation; it was merged into master on May 17, 2024
GGML previously used a short[65536] lookup table for speed; the new implementation aims for more accurate results while keeping the worst-case rounding error at 2 ULP on aarch64 and SSE2+
In the SOFT_MAX CPU performance test, SSE2+FMA was 1.5x faster, AVX2+FMA 1.9x, and AVX512 2.1x; on an AMD Ryzen 9 5950X and an M2 Ultra the result was also confirmed to be about 1.5x faster than master
The changes add ggml_v_expf() and ggml_v_silu(), extract duplicated code into ggml_vec_soft_max_f32(), remove the GGML_SILU_FP16-related functions, and adjust the SiLU path that is conditional on SSE2 or ARM NEON
After the merge, non-deterministic results were reproduced when running the server with more than one slot; the cause was later narrowed down to -ffinite-math-only, which led to a build-level constraint requiring -fno-finite-math-only

Goal of the PR and merge status

PR #7154 is titled "ggml : rewrite silu and softmax for cpu" and rewrites the SiLU and SoftMax computation in llama.cpp's GGML CPU path
The change began as an effort to upstream llamafile's vectorized expf() function
The PR was merged into ggml-org:master on May 17, 2024, and the merge commit is shown as 934266c
The author stated that the new approach can compute SoftMax and SiLU more accurately than the short[65536] lookup table that GGML previously used for speed
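
To make the accuracy trade-off of a 16-bit table concrete, here is a minimal Python sketch of an exponential backed by a 65536-entry table indexed by the bit pattern of a half-precision input. It is an illustration under assumptions, not GGML's code: the names `to_f16`, `f16_bits`, `build_exp_table`, and `exp_lookup` are hypothetical, and the layout (one FP16 result per FP16 input pattern) is assumed from the short[65536] size and the GGML_SILU_FP16 naming.

```python
import math
import struct

def to_f16(x):
    """Round a Python float to half precision; values too large become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)

def f16_bits(x):
    """Bit pattern (0..65535) of x after rounding to half precision."""
    return struct.unpack("<H", struct.pack("<e", to_f16(x)))[0]

def build_exp_table():
    """One exp() result per possible 16-bit input pattern (assumed layout)."""
    table = []
    for bits in range(1 << 16):
        x = struct.unpack("<e", struct.pack("<H", bits))[0]
        try:
            y = math.exp(x)
        except OverflowError:
            y = math.inf
        table.append(to_f16(y))  # the stored result is also only half precision
    return table

EXP_TABLE = build_exp_table()

def exp_lookup(x):
    """exp(x) via the table: the input is rounded to FP16 before the lookup."""
    return EXP_TABLE[f16_bits(x)]

x = 5.3
rel_err = abs(exp_lookup(x) - math.exp(x)) / math.exp(x)
print(f"table exp({x}) = {exp_lookup(x)}, relative error = {rel_err:.2e}")
```

Both the input and the stored result are rounded to about 11 significant bits in this sketch, so the relative error is orders of magnitude larger than a float32 computation that stays within a few ULP.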

Accuracy and supported platforms

The new expf()-based path supports aarch64 and SSE2+, with a reported worst-case rounding error of 2 ULP
According to the initial description, AVX2 and AVX512 implementations had also been written, but their gain over SSE2+FMA was not large enough to justify the added code complexity, so they were left out
The AVX2 and AVX512 code was later included on the strength of benchmark results
A separate test output showed "4294967296 numbers tested successfully" (4294967296 is 2^32, one input per 32-bit pattern), along with a comparison of exp and the llamafile implementation for several input values
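
As a rough illustration of what an error bound in ULP means, the sketch below measures the float32 ULP distance between a simple range-reduced exponential and Python's `math.exp`. Everything here is an assumption made for illustration: `exp_sketch` uses a plain degree-7 Taylor polynomial evaluated in double precision, which is not the polynomial or the arithmetic used in the PR, and it checks a random sample rather than all 2^32 inputs.

```python
import math
import random
import struct

LN2 = math.log(2.0)

def f32(x):
    """Round a Python float (double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def f32_bits(x):
    """Bit pattern of x as float32; ordered like the values for positive floats."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def exp_sketch(x):
    """exp(x) = 2**n * exp(r) with n = round(x / ln 2) and |r| <= ln(2) / 2."""
    n = round(x / LN2)
    r = x - n * LN2
    # Degree-7 Taylor polynomial for exp(r), evaluated with Horner's rule.
    p = 1.0
    for k in range(7, 0, -1):
        p = 1.0 + p * r / k
    return f32(math.ldexp(p, n))

def ulp_distance(a, b):
    """Number of float32 steps between two positive finite values."""
    return abs(f32_bits(a) - f32_bits(b))

random.seed(0)
samples = [f32(random.uniform(-80.0, 80.0)) for _ in range(20000)]
max_ulp = max(ulp_distance(exp_sketch(x), math.exp(x)) for x in samples)
print("max error vs. rounded math.exp:", max_ulp, "ULP")
```

Because the sketch computes in double precision and rounds once at the end, it has an easier job than a vectorized float32 kernel; the point is only how the ULP comparison is set up.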

Scope of the code changes

The main changes, as summarized by a reviewer:
- removing a commented-out #define
- extracting 5 duplicated lines into ggml_vec_soft_max_f32()
- removing several functions related to GGML_SILU_FP16
- adding ggml_v_expf()
- adding ggml_v_silu()
- adjusting the preprocessor statements so that ggml_vec_silu_f32() uses a different function depending on the SSE2 or __ARM_NEON flag
GitHub metadata shows 1 changed file
The PR carried the labels "refactoring" and "Review Complexity : High"; the description of the latter notes that deep knowledge of LLMs or GPUs may be required
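
For readers unfamiliar with the two operations, the following scalar Python sketch states their usual mathematical definitions. The function names `silu` and `soft_max` are illustrative and are not GGML's API; the code only shows where the exponential appears, which is the step a vectorized expf() would accelerate.

```python
import math

def silu(x):
    """SiLU: x * sigmoid(x), written so that exp() never overflows."""
    if x >= 0.0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)  # x < 0, so e is in (0, 1) or underflows to 0
    return x * e / (1.0 + e)

def soft_max(xs):
    """Softmax with max subtraction, so every exp() argument is <= 0."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = soft_max([1.0, 2.0, 3.0])
print(probs, silu(1.0))
```

Both functions call exp() once per element, so for long rows the cost of the exponential dominates the arithmetic around it.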

Benchmarks and performance results

ggerganov confirmed that SOFT_MAX is about 1.5x faster than master on an AMD Ryzen 9 5950X and an M2 Ultra
The test command used was:

make -j tests && ./tests/test-backend-ops -o SOFT_MAX -b CPU perf

The author later reported that, with the same command, the speedup grows as follows:
- SSE2+FMA: 1.5x
- AVX2+FMA: 1.9x
- AVX512: 2.1x
A separate development script reported these timings:
- run_expf(): 2.98601 ns
- run_llamafile_expf_sse2(): 1.35154 ns
- run_llamafile_expf_avx2(): 1.16659 ns
- run_llamafile_expf_avx512(): 1.18844 ns
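
The timings listed above can be turned into relative speedups with a few lines of arithmetic. This sketch assumes that run_expf() is the baseline the other three variants are compared against; the source lists the numbers without saying so explicitly.

```python
# Timings in nanoseconds, copied from the development-script output above.
timings_ns = {
    "run_expf()": 2.98601,
    "run_llamafile_expf_sse2()": 1.35154,
    "run_llamafile_expf_avx2()": 1.16659,
    "run_llamafile_expf_avx512()": 1.18844,
}

baseline = timings_ns["run_expf()"]  # assumed baseline
speedups = {name: baseline / ns for name, ns in timings_ns.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
```

These ratios describe the exponential in isolation, so they need not match the 1.5x to 2.1x figures measured for the whole SOFT_MAX operation.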
The llama.cpp server benchmark in GitHub Actions recorded 543 iterations on Standard_NC4as_T4_v3 with the phi-2 q4_0 configuration
- concurrent users: 8
- duration: 10 minutes
- HTTP request average: 8626.19ms
- p95: 21696.44ms
- Prompt processing average: 94.59 tk/s
- Token generation average: 33.43 tk/s

AVX512 optimization discussion

chriselrod suggested using vscalefps on AVX512
- vscalefps computes zmm0 = zmm1 * 2^{zmm2}
- It was said to handle overflow and underflow correctly, which can remove the need for checks and blends
A Julia implementation example and its assembly loop were shared; assuming the test is correct, the maximum error was below 1 ULP, at x=47.483456f
It was explained that the vscalefps approach uses no lookup table, whereas the Float64/double implementation uses a 16-element lookup table via vpermi2pd
A link to a C++ implementation was shared later
- ExpAVX512
- The source is in include/ExpAVX512.hpp
- The README includes benchmarks, though it was noted that no comparative benchmark against other implementations had been run
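
The role of the scaling step can be sketched for a single scalar lane in Python. This is an emulation under assumptions, not the instruction itself: `scalef` and `exp_with_scalef` are hypothetical names, the arithmetic is in double precision rather than float32, the floor on the exponent reflects my reading of the instruction's definition, and `math.exp(r)` stands in for the polynomial a real kernel would use on the reduced argument.

```python
import math

LN2 = math.log(2.0)

def scalef(a, b):
    """Emulate a * 2**floor(b) for finite b, saturating to +/-inf on overflow."""
    try:
        return math.ldexp(a, math.floor(b))
    except OverflowError:
        return math.copysign(math.inf, a)

def exp_with_scalef(x):
    """exp(x) for finite x as scalef(exp(r), n), with x = n * ln 2 + r."""
    n = round(x / LN2)
    r = x - n * LN2
    # Placeholder for the polynomial approximation of exp(r), |r| <= ln(2) / 2.
    p = math.exp(r)
    # No explicit range checks: the scaling step yields inf or 0 by itself.
    return scalef(p, n)

print(exp_with_scalef(1.0), exp_with_scalef(1000.0), exp_with_scalef(-1000.0))
```

The design point is that very large and very small results come out of the scaling step directly, so the caller does not need a separate compare-and-blend for the out-of-range cases.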

Non-determinism issue after the merge

After the merge, a reproducible case was reported in which the server returned non-deterministic results when using more than one slot
The minimal reproduction steps were:

make clean && make server
./server -m models/opt/llama_2-7b-q4_0.gguf --parallel 2 --threads 1

The request, run from a second shell, was:

curl --request POST --url http://localhost:8080/completion --header "Content-Type: application/json" --data '{"prompt": "", "n_predict":10, "n_probs": 2, "temperature": -1}' | python3 -m json.tool

The reported behavior was that the token probabilities of the last token cycled between two values on each curl call, and between four possible values when 4 slots were used

`-ffinite-math-only` and build constraints

Later related commits reference the findings that narrowed the cause of the issue down to -ffinite-math-only
It was noted that, in the issue, SiLU was probably returning NaN or some other garbage value instead of flushing small values to 0
The fix checks whether -fno-finite-math-only is in effect and enforces that the code is not compiled in finite-math mode
The error message states that some GGML routines require non-finite math arithmetic and advises passing -fno-finite-math-only to the compiler
Users later shared experiences that -Ofast or -ffast-math can break the build because they imply -ffinite-math-only
- One report says -Ofast was usable up to GCC 13.2, but results turned to garbage starting with GCC 14
- In some tests, -fmath-errno was reported to be needed in addition to -fno-finite-math-only
- Follow-up commits were referenced in several repositories that resolved the ggml compile error by removing -ffast-math or by passing -fno-finite-math-only explicitly
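
One way to see why softmax-style routines can depend on non-finite values is masking: hidden positions are commonly represented as negative infinity so that the exponential maps them to exactly 0. The Python sketch below shows only that arithmetic. It is not GGML code, `masked_soft_max` is a hypothetical name, and it does not reproduce what a compiler does under -ffinite-math-only, where code may be optimized on the assumption that infinities and NaNs never occur.

```python
import math

def masked_soft_max(logits, mask):
    """Softmax where mask[i] == False hides position i by setting it to -inf.

    Assumes at least one position is kept.
    """
    masked = [x if keep else -math.inf for x, keep in zip(logits, mask)]
    m = max(masked)
    exps = [math.exp(x - m) for x in masked]  # exp(-inf) is exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]

probs = masked_soft_max([2.0, 1.0, 3.0], [True, True, False])
print(probs)
```

If the infinity were mishandled, the hidden position would contribute an arbitrary value to the sum instead of 0, which is consistent with the garbage results described in the reports above.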

1 comment

GN⁺ 2024-05-16

Hacker News opinions

About 20 years ago, while programming a Hughes radar signal processor, I had to compute e^x for 0 < x < 1
That processor had multiplication, so I built 4 e^x tables, each with 256 possible values, one for each of the 4 separate 8-bit blocks of the 32-bit word, and multiplied the lookups together to get the final value
It was about 5 times faster than the previous best e^x routine; outdated as it is today, for a while it was a fun machine that processed radar signals faster than nominally much faster processors
- In case that was hard to follow, the idea is roughly e^x = e^(a+b+c+d), where a/b/c/d are the individual bytes of x; that becomes e^a * e^b * e^c * e^d, with a lookup table built for each of e^a, e^b, and so on
  Strictly speaking, a is something like (high byte << 24), so the e^a table is the mapping a => e^(a<<24), and the other bytes are handled the same way
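
The four-table idea described in this thread can be sketched in Python as follows. The number format is an assumption: x is treated as a 32-bit fixed-point fraction X / 2^32, which the comment does not state, and the names `TABLES` and `exp_from_tables` are hypothetical.

```python
import math

# One 256-entry table per byte of the 32-bit fixed-point value X = x * 2**32.
# TABLES[k][b] = exp((b << (8 * k)) / 2**32), with k = 0 for the lowest byte.
TABLES = [
    [math.exp((b << (8 * k)) / 2**32) for b in range(256)]
    for k in range(4)
]

def exp_from_tables(x):
    """e**x for 0 <= x < 1 as the product of four table lookups."""
    fixed = int(x * 2**32)  # truncate to 32 fractional bits
    result = 1.0
    for k in range(4):
        byte = (fixed >> (8 * k)) & 0xFF
        result *= TABLES[k][byte]
    return result

print(exp_from_tables(0.3), math.exp(0.3))
```

The total table size is 4 * 256 entries instead of one entry per possible input, at the cost of the multiplications that combine the partial results.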
I wonder how much these silu and softmax improvements affect overall LLM inference speed
Correct me if I'm wrong, but most of the time goes into matrix multiplication, so I'd expect the effect of this change to be small
- Yes, most floating-point operations go into matrix multiplication, but softmax uses a disproportionate share of memory bandwidth, so it usually takes far more time than the operation count alone suggests
A bit off-topic, but while skimming I thought, "this looks like a pretty crazy optimization; it's complex, and a lot of people have already looked at this code", then I saw the contributor and thought, "of course, it's jart; crazy-good solutions like this always come from jart"
- It mostly looks scary because that is just what C/C++ intrinsics syntax is like
  Like many things on that side, the pain is largely self-inflicted
  As far as I know, there are C++ libraries that enable C#-style SIMD and hardware intrinsic syntax, but the drawback is that it becomes harder to look up the mnemonics directly in the instruction set docs
  I don't mean to diminish the work done here; my point is only that it could have been more approachable to a wider audience. That said, I'm not about to propose rewriting the inference backend in C#, which everyone here would consider absurd
- "adapted from arm limited optimized routine", so in the end it stands on the shoulders of giants
- I don't think this kind of thing is taught in asymptotic analysis classes
  I'm reminded of a professor's famous line: "the constant that everyone ignores is the one that can eat you alive in engineering"
This replaces a short[65536] lookup table, but wasn't that a rather slow choice in the first place?
A lookup table the size of the entire L1 cache: does it somehow roughly fit, probabilistically, and work unexpectedly well?
- The lookup table works unexpectedly well because the workload itself is extremely cache-unfriendly
  Blowing away the L1 cache doesn't matter much, and the data evicted by keeping the LUT was very unlikely to be reused anyway
  Machine learning loads are typically streaming loads that read the whole dataset linearly on every iteration
- On why you probably shouldn't use a lookup table, this article https://specbranch.com/posts/lookup-tables/ explains in general when one is appropriate
  In my limited experience, you can do quite a lot of computation on the fly before a lookup becomes faster
This is in llama.cpp, and it is for the CPU
- It was originally developed for llamafile and has been included in the last two releases: https://github.com/Mozilla-Ocho/llamafile/releases/tag/0.8.2
  It is now being upstreamed to the llama.cpp project
  There are still a few other performance improvements available only in llamafile, such as Kawrakow's work that made K quants considerably faster
Possibly a slightly different question, but does anyone know how things like ggml compare with runtimes such as tensorflow lite and onnxruntime?
- I maintain ONNX and llama.cpp Flutter libraries on all 6 True Platforms, so I know this fairly well
  In short, llama.cpp is the right choice for LLMs, and its core dependency GGML can also run whisper
  Use ONNX for everything else
  TF is like the Apple of the machine learning world: great if you are fully locked into the Google ML ecosystem, but practically dead outside it. An absurdly large share of HF models, about 94%, is PyTorch
  For a direct inference performance comparison, the candidates are ONNX's Whisper and GGML; someone ran my llama.cpp library with Whisper and reported no meaningful performance difference
- It matters a great deal exactly which hardware is being discussed
Is gguf/llama.cpp currently the more performant solution for non-batched inference on CUDA devices, or is exllamav2+flashattention still ahead?
- On 2x 4090 the difference is negligible
  There are more important differences, such as the 4-bit KV cache
LUTs can be vectorized too
https://www.intel.com/content/www/us/en/docs/intrinsics-guid...
I also wrote earlier about what is possible with LUTs https://darkcephas.blogspot.com/2018/10/validating-utf8-stri...
- True, but even implementing exp directly takes only about 10 to 20 FMAs, depending on the desired accuracy
  It is hard for a gather or permutation to compete with pure computation
Similarly, there is also a faster tanh https://github.com/microsoft/onnxruntime/pull/20612
- Great work
  But what is the goal? Making that GeLU approximation faster?
  Going back to erff() would probably be much faster
Does this also help in the GPU partial offloading use case for gguf?
Does the CPU side get faster as well?
