CPU को सचमुच चिढ़ा देने वाले डेटा access patterns

(blog.weineng.me)

1 पॉइंट द्वारा GN⁺ 3 시간 전 | 1 टिप्पणियां | WhatsApp पर शेयर करें

एक जैसे integer summation loop में भी सिर्फ access order बदलने से execution time बहुत बदल जाता है, और प्रयोग दिखाता है कि random access से 30% से अधिक धीमा permutation बनाया जा सकता है
Linear access 133 million cycles के साथ तेज़ है, जबकि random access में CPU के लिए अगली location का अनुमान लगाना कठिन होता है, इसलिए यह 1.57 billion cycles तक धीमा हो जाता है
Cache line और page boundary का उपयोग करके accesses को दूर-दूर करने पर prefetcher और cache reuse कमजोर हो जाते हैं, और set-associative cache की विशेषता के कारण L1d की प्रभावी उपयोग क्षमता घटकर लगभग 768B रह जाती है
Page stride को 8 रखने पर PTE cache line locality भी टूट जाती है और 2.06 billion cycles रिकॉर्ड होते हैं, जिससे यह simple random shuffle से भी खराब pattern बन जाता है
DRAM bank/row conflicts को निशाना बनाने वाला approximate pattern 2.08 billion cycles के साथ थोड़ा और धीमा था, लेकिन physical address की DRAM mapping platform-dependent होने के कारण इसे पूरी तरह control करना कठिन है

प्रयोग की शर्तें और baseline

लक्ष्य data array के integers को sum करने वाले fixed accumulator function में सिर्फ positions permutation बदलकर सबसे धीमा execution time बनाना है
positions generation time को छोड़कर, केवल accumulation function के execution time को rdtsc cycle count से मापा गया
Data size 2^26 uint32_t integers है
- 65,536 pages का उपयोग
- हर page 4,096 bytes का है
- हर page में 1,024 integers होते हैं
huge pages disabled हैं
Measurement machine Intel Core Ultra 7 268V आधारित x86_64 system है
- 8 CPUs
- L1d total 320KiB, L1i total 512KiB
- L2 total 14MiB
- L3 12MiB
पूरा code GitHub repository में है, और example g++ -std=c++2a -O3 slowest.cc && taskset -c 3 sudo ./a.out से run किया गया
Core loop एक simple form है जो positions[i] द्वारा point किए गए data[pos] को क्रम से जोड़ता है

Linear access और random access का अंतर

सबसे simple linear pattern में positions[i] = i होता है और array को शुरुआत से क्रमवार access किया जाता है
Linear access में 132,752,394 cycles लगे, और CPU sequential access के लिए strongly optimized होने के कारण यह सबसे तेज़ patterns में है
fisher_yates_shuffle से positions को random permutation बनाने पर CPU के लिए अगला access होने वाला data predict करना कठिन हो जाता है
Random access में 1,572,108,618 cycles लगे, जो linear access से 10 गुना से भी अधिक धीमा है
प्रयोग इसके आगे यह जांचता है कि क्या random से भी खराब permutation जानबूझकर बनाया जा सकता है

Cache line और page unit में access को दूर करना

पहला worsening pattern continuous accesses को हमेशा cache line के अंतर से रखने के लिए positions arrange करता है
यह pattern 64-byte cache line में 4-byte integer में से केवल एक का उपयोग करता है और फिर अगली cache line पर चला जाता है
- उसी cache line पर वापस आने तक reusable data cache से बाहर निकल चुका होने की संभावना अधिक होती है
Cache-line spaced access में 718,804,156 cycles लगे, जो linear scan से 4 गुना से अधिक धीमा है
फिर भी इस case में hardware prefetcher data के लिए simple streaming pattern detect करके future cache lines पहले से ला सकता है
कई Intel hardware data prefetchers 4KiB page boundary के पार prefetch नहीं करते
अगला pattern access interval को cache line नहीं बल्कि पूरे page तक फैला देता है
- हर page 4,096 bytes का है
- page_index * elements_per_page + element_index form में पहले हर page के same offset को access किया जाता है
Page-spaced access 1,411,153,154 cycles के साथ काफी धीमा हो गया

Set-associative cache और reuse distance

प्रयोग वाली machine का per-core L1d cache 48KB, 12-way, 64-set structure वाला है
L1d में 64 sets होने के कारण address A और A + 4096 bytes same L1d set में map होते हैं
- 4,096 bytes 64 sets * 64-byte cache line है
Page-level stride हर inner loop को पूरे 64 sets में समान रूप से फैलाने के बजाय लगातार same set को hit कराता है
एक set में केवल 12 ways होते हैं, इसलिए 12 से अधिक active cache lines compete करें तो CPU को lines बार-बार evict करके reload करनी पड़ती हैं
L1d की total capacity 48KB है, लेकिन इस access pattern में उपयोगी रूप से इस्तेमाल होने वाली L1d capacity 12 ways * 64B यानी लगभग 768B है
separated_by_a_page pattern 65,536 cache lines access करने के बाद उसी cache line पर लौटता है, इसलिए cache line reuse distance 65,536 है
Page के अंदर cache line तक अलग करने वाला separated_by_a_page_and_cacheline pattern reuse distance को 65536 pages * 4096 page size / 64 cacheline size तक बढ़ाता है
हालांकि result 1,408,519,172 cycles था, जो page-level access जैसा ही लगभग है
प्रयोग core 3 पर run हुआ, और उस core में 2.5MB L2 और 48KB L1 है
- 65,536 pages को एक बार traverse करने पर 4MB data access होता है
- यह उस core की private L1/L2 capacity से बड़ा है
- अगली बार चाहिए होने वाली cache line के private cache में बचे होने की संभावना कम है
Private cache reuse की उम्मीद तभी की जा सकती है जब cache line reuse distance लगभग ((2560+48)*1024/64) यानी करीब 40,000 से कम हो

Page stride, PTE और DRAM तक को परेशान करना

अगला variant continuous pages नहीं, बल्कि N pages के interval से access करने वाला separated_by_stride_pages_and_cacheline pattern है
कई strides मापने पर page stride 8 पर सबसे खराब result आया और यह random access से धीमा था
-DSTRIDE=8 से standalone run करने पर 2,058,425,640 cycles लगे
stride 8 पर peak आने के संभावित कारणों में से एक address translation है
- Virtual address access को MMU physical address में translate करता है
- Translation में page table entry(PTE) का उपयोग होता है
- PTE 8 bytes का होता है, और एक cache line में 8 PTEs आते हैं
- ऐसा आंका गया कि 8-page stride में data cache line के साथ-साथ page mapping के लिए अलग cache line भी हर बार लानी पड़ती है
आखिरी चरण DRAM controller behavior तक को बाधित करने की कोशिश है
DRAM channel, rank, chip, bank, row, column से बना होता है
- bank के अंदर कई rows होते हैं
- किसी खास row को access करने के लिए उस row को activate करके row buffer में copy किया जाता है
- same bank में अलग row access करने के लिए पुराने row को precharge से close करके नया row activate करना पड़ता है
same bank में rows को बारी-बारी access करने पर row-buffer conflict होता है, जिससे DRAM controller response धीमा हो जाता है
इसके विपरीत, पहले से open row access करने पर row-buffer hit होता है, और कई banks को एक साथ इस्तेमाल करने पर memory controller काम overlap कर सकता है
slow pattern बनाने की strategy इस प्रकार है
- Virtual page number को page table से physical frame number(PFN) में translate करना
- page offset preserve करके physical address बनाना
- physical address को DRAM channel, rank, bank group, bank, row, column के रूप में interpret करना
- same bank के अलग-अलग rows को बार-बार access करके लगभग हर request पर precharge और activation force करना
लेकिन physical address से DRAM channel, rank, bank, row तक की mapping documented नहीं है और platform-dependent है
- CPU, memory type, BIOS/firmware settings, channel/rank configuration, address hashing के अनुसार बदल सकती है
- DRAMA या Sudoku जैसे tools इस्तेमाल किए जा सकते हैं, लेकिन इस experiment machine पर इन्हें चला नहीं सके
DRAMA paper और local experiments के आधार पर 4 bank groups, प्रति group 4 banks, और DRAM_ROW_SHIFT = 18 का उपयोग करके approximation किया गया
DRAM bank/row conflicts तक consider करने वाला pattern 2,082,308,014 cycles के साथ stride 8 से consistently थोड़ा धीमा था, लेकिन अंतर बड़ा नहीं है
Perfect row conflict न बना पाने के पीछे कुछ constraints हैं
- bank group hash, bank hash, row shift estimation सटीक न हो सकती है
- 8-page stride लगभग 32KB interval access है, इसलिए same DRAM row में होना कठिन है
- Intel की bank hashing के कारण असल में कई banks एक साथ इस्तेमाल हो जाते हैं
Full execution example ये results दिखाता है
- linear: 132,752,394
- fisher_yates_shuffle: 1,572,108,618
- separated_by_a_cacheline: 718,804,156
- separated_by_a_page: 1,411,153,154
- separated_by_a_page_and_cacheline: 1,408,519,172
- stride=8 separated_by_stride_pages_and_cacheline: 2,058,425,640
- separated_by_stride_bank_conflicts_and_cacheline: 2,082,308,014
Cache, prefetcher, cache line reuse, PTE access, DRAM bank/row characteristics को क्रम से बिगाड़ने पर random access से 33% धीमा summation pattern बनाया जा सकता है
Power-saving mode switching की तरह accumulation को और धीमा बनाने के तरीके भी हैं, लेकिन यह सिर्फ access pattern बदलने वाले experiment से अलग है

1 टिप्पणियां

GN⁺ 3 시간 전

Lobste.rs की राय

यह सोचकर मज़े से पढ़ा कि अंदरूनी Windows 11 development research कुछ ऐसी दिखती होगी /s
https://github.com/ob/cache बनाते समय मैंने यह सीखा था
- README के दो वाक्यों को कैसे समझना चाहिए, यह जानना चाहता/चाहती हूँ
  एक में कहा गया है कि Computer Architecture: A Quantitative Approach के पेज 476 पर exercise 5.2 में latency के आंकड़े generate करने की तकनीक पहली बार देखी थी, और दूसरे में कहा गया है कि idea Rafael Héctor Saavedra‑Barrera की PhD thesis से आया था
  मैं confirm करना चाहता/चाहती हूँ कि क्या ये अलग-अलग बातें कह रहे हैं, contradiction है, या वही बात है और Saavedra-Barrera को CA:AQA में cite किया गया है
  यह भी स्पष्ट नहीं है कि Claude program को README लिखने से exclude किया गया था या नहीं, और वह repository contributor के रूप में भी दिखता है, इसलिए पहले यह verify करना चाहता/चाहती हूँ कि reference असली है या नहीं
अगर आप custom cache hierarchy access के साथ experiment करना चाहते हैं, तो मेरा बनाया simulator भी recommend करता/करती हूँ
https://github.com/TheCloudlet/Stratum
Index calculate करने के overhead और actual access cost को कैसे अलग किया गया, यह जानना चाहता/चाहती हूँ
- अगर आपका सवाल यह है कि positions बनाने का time साथ में count किए बिना सिर्फ accumulator cycles कैसे measure किए, तो हमने पहले reset और arrange_positions run किए, फिर एक macro इस्तेमाल किया जिसमें rdtsc_start() और rdtsc_end() के बीच सिर्फ accumulator(data, positions) रखा गया
  उसके बाद result print किया और sum == linear_scan_sum verify किया, और do_not_optimize(sum) से optimization के कारण हटाए जाने से रोका
Professors ने जो data access patterns सिखाए हैं, उनसे भी करें तो पहले SafeNumber.java class बनानी पड़ेगी और add(SafeNumber b) member की जरूरत होगी
अगले semester में शायद हम इसे Redis के पीछे रखकर web-scale performance देने वाली architecture सीखेंगे
Claude की मदद से benchmark को Java में port करके देखा, और Java SafeNumber[] linear access में C++ uint32_t[] से लगभग 3.6x धीमा था, और Fisher-Yates shuffle में C++ linear की तुलना में 52.2x धीमा था
absolute time में, 6.7 करोड़ elements के लिए C++ linear 10,258,584ns, Java linear 36,740,667ns, C++ shuffle 264,856,042ns, Java shuffle 535,724,208ns था, और यह देखकर impression हुआ कि उम्मीद से कम, “सिर्फ 4x” के आसपास ही है
- Java का multiple बहुत अच्छा तो नहीं है, लेकिन Project Valhalla आने पर शायद यह और करीब हो सके

CPU को सचमुच चिढ़ा देने वाले डेटा access patterns

प्रयोग की शर्तें और baseline

Linear access और random access का अंतर

Cache line और page unit में access को दूर करना

Set-associative cache और reuse distance

Page stride, PTE और DRAM तक को परेशान करना

संबंधित पढ़ाई

1 टिप्पणियां

Lobste.rs की राय