Properly Testing Concurrent Data Structures

(matklad.github.io)

2 points by GN⁺ 2024-07-07 | 1 comment

Using a broken Rust concurrent counter as the example, the article shows how problems that ordinary multithreaded load tests miss can be exposed by controlling execution order in a way that is reproducible and minimizable
A test-only AtomicU32 wrapper inserts pause() calls; a managed thread stops before and after each atomic operation and then proceeds in the order the test chooses
A naive test with 100 threads, each incrementing 100 times, can fail with a value such as 9598 instead of the expected 10000, but the failure is timing-dependent and therefore hard to reproduce, debug, and shrink
An arbtest-based property test reproduces the same interleaving from the same seed and minimizes the failing case to 0: increment, 1: increment, 0: unpause, 1: unpause
Extending the same structure with exhaustigen makes it possible to enumerate every interleaving of up to 5 increments; after the fetch_add fix, all 81133 interleavings pass

A concurrent counter that is not atomic

The example uses Rust's AtomicU32, but increment() performs a load followed by store(value + 1), so the increment as a whole is not atomic
The Counter structure is simple
- value: AtomicU32
- increment() reads value with SeqCst, adds 1 to what it read, and stores the result
- get() reads the current value with SeqCst
Two threads can read the same value and both store the same incremented result, so one update is lost
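A minimal sketch of the counter described above, assuming the field and method names given in the summary. The helper lost_update_by_hand is hypothetical: it replays, on a single thread, the order in which both increments load before either one stores.

```rust
use std::sync::atomic::{AtomicU32, Ordering::SeqCst};

#[derive(Default)]
pub struct Counter {
    value: AtomicU32,
}

impl Counter {
    /// Broken on purpose: the load and the store are two separate atomic steps.
    pub fn increment(&self) {
        let value = self.value.load(SeqCst);
        self.value.store(value + 1, SeqCst);
    }

    pub fn get(&self) -> u32 {
        self.value.load(SeqCst)
    }
}

/// Hypothetical helper: performs the steps of two increments in the order
/// load A, load B, store A, store B, and returns the final value.
pub fn lost_update_by_hand() -> u32 {
    let counter = Counter::default();
    let seen_by_a = counter.value.load(SeqCst); // thread A reads 0
    let seen_by_b = counter.value.load(SeqCst); // thread B also reads 0
    counter.value.store(seen_by_a + 1, SeqCst); // A writes 1
    counter.value.store(seen_by_b + 1, SeqCst); // B writes 1, overwriting A's update
    counter.get()
}
```

Two increments were performed, yet the helper returns 1. Each individual operation is atomic; the gap between them is where the update is lost.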

Why ordinary thread tests are not enough

The simplest way to check the counter is to have several threads increment the same counter repeatedly and verify the value at the end
- thread_count = 100
- increment_count = 100
- the expected value is 10000
The example run fails with left: 9598, right: 10000
This approach depends heavily on scheduling timing
- reproducing the same failure deterministically is hard
- debugging is hard
- with a smaller thread count or increment count the test may pass by luck, so the failing case is hard to minimize
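The naive test can be sketched roughly as follows. The function name stress is an assumption made for illustration, and the value it returns for many threads varies from run to run, which is exactly the weakness listed above.

```rust
use std::sync::atomic::{AtomicU32, Ordering::SeqCst};
use std::thread;

#[derive(Default)]
struct Counter {
    value: AtomicU32,
}

impl Counter {
    // Same broken load-then-store increment as in the article's example.
    fn increment(&self) {
        let value = self.value.load(SeqCst);
        self.value.store(value + 1, SeqCst);
    }

    fn get(&self) -> u32 {
        self.value.load(SeqCst)
    }
}

/// Runs `thread_count` threads that each increment `increment_count` times
/// and returns the final counter value.
pub fn stress(thread_count: u32, increment_count: u32) -> u32 {
    let counter = Counter::default();
    // Scoped threads may borrow `counter`; the scope joins them all on exit.
    thread::scope(|scope| {
        for _ in 0..thread_count {
            scope.spawn(|| {
                for _ in 0..increment_count {
                    counter.increment();
                }
            });
        }
    });
    counter.get()
}
```

A test would assert that stress(100, 100) equals 10000. With the broken counter the result can be anything up to 10000 depending on scheduling, while stress(1, 100) always returns 100, which illustrates why shrinking the parameters tends to hide the bug.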

Handling interleavings with property-based testing

Property-based testing (PBT) is a good match for state machine testing
- generating arbitrary inputs is easy
- the property can be stated as: the result of a concurrent execution must match a sequential execution model
- it also fits the need to minimize a failing input
The difficulty is that real OS threads are hard to advance one step at a time at a chosen moment
The solution is a structure in which each iteration picks a random thread and advances it by one step
- it must be possible to schedule another thread between one thread's load and store
- for this, the article builds a managed thread API that controls threads directly

A test-only AtomicU32 and pause insertion

In test builds, the code uses its own managed_thread::AtomicU32 instead of std::sync::atomic::AtomicU32
- #[cfg(test)] use managed_thread::AtomicU32
- #[cfg(not(test))] use std::sync::atomic::AtomicU32
The wrapper AtomicU32 calls pause() before and after load() and store()
- load: pause() → actual load → pause()
- store: pause() → actual store → pause()
These insertion points let the test stop and resume threads around atomic operations and thereby control the execution order
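A sketch of such a wrapper, under one loud assumption: pause() here is only a stand-in that counts how often it is called on the current thread, so the example can be run on its own. In the article, pause() blocks the managed thread until the test resumes it. The names PAUSES and pauses_on_this_thread are hypothetical.

```rust
use std::cell::Cell;
use std::sync::atomic::{self, Ordering};

thread_local! {
    // Stand-in bookkeeping (assumption): counts pause() calls on this thread.
    static PAUSES: Cell<u32> = Cell::new(0);
}

/// Stand-in for the real pause point, which would block until unpaused.
fn pause() {
    PAUSES.with(|p| p.set(p.get() + 1));
}

/// Hypothetical helper that exposes the stand-in counter.
pub fn pauses_on_this_thread() -> u32 {
    PAUSES.with(|p| p.get())
}

/// Drop-in replacement offering the same `load`/`store` calls as the std type.
#[derive(Default)]
pub struct AtomicU32 {
    inner: atomic::AtomicU32,
}

impl AtomicU32 {
    pub fn load(&self, ordering: Ordering) -> u32 {
        pause();
        let result = self.inner.load(ordering);
        pause();
        result
    }

    pub fn store(&self, value: u32, ordering: Ordering) {
        pause();
        self.inner.store(value, ordering);
        pause();
    }
}
```

Because the wrapper keeps the method names and signatures of the std type, the Counter code does not change; only the import differs between test and non-test builds.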

The shape of the managed thread API

The test creates two managed threads inside std::thread::scope
- using scoped threads allows stack-local data to be borrowed
- a reference to the counter is passed as the thread's state, as in spawn(scope, &counter)
A managed thread does not run a dedicated main function from the start; it executes closures that the control thread sends via submit()
- t.submit(|c| c.increment())
- the thread executes the closure on its state T
The test loop performs a random action for each thread for as long as entropy remains
- if the thread is paused, unpause() it
- if it is not paused, run increment() via submit()
- the sequential model counter_model is incremented the same number of times
Finally, all threads are join()ed and counter_model is compared with the actual counter.get()

pause and unpause implementation

pause() uses thread_local! to find the current managed thread's context without changing the API of the Counter under test
- the context is shared as Arc<SharedContext>
- SharedContext holds a Mutex<State> and a Condvar
The state is one of Ready, Running, or Paused
- Ready: waiting for the next closure
- Running: the managed thread is executing
- Paused: stopped at a pause() point
When a managed thread reaches pause(), it changes the state from Running to Paused and notifies the control thread through the condition variable
unpause() changes the state from Paused to Running, wakes the managed thread, and then waits until the state is no longer Running
- this prevents the control thread and the managed thread from running at the same time
- ensuring that only one of the two runs at any moment reduces non-determinism
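The handshake can be sketched as below. This is a simplified illustration, not the article's code: each managed thread runs a single closure passed to spawn() rather than receiving closures through submit(), plain std::thread::spawn with an Arc is used instead of scoped threads, and a Done state replaces Ready. ManagedThread, wait_while_running, and replay_minimal_trace are assumed names.

```rust
use std::cell::RefCell;
use std::sync::atomic::{AtomicU32, Ordering::SeqCst};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum State { Running, Paused, Done }

struct SharedContext { state: Mutex<State>, cv: Condvar }

thread_local! {
    // Only managed threads install a context; on any other thread pause() is a no-op.
    static CONTEXT: RefCell<Option<Arc<SharedContext>>> = RefCell::new(None);
}

/// Parks the calling managed thread until the control thread unpauses it.
pub fn pause() {
    if let Some(ctx) = CONTEXT.with(|c| c.borrow().clone()) {
        let mut state = ctx.state.lock().unwrap();
        *state = State::Paused;
        ctx.cv.notify_all(); // wake the control thread waiting in wait_while_running()
        let _resumed = ctx.cv.wait_while(state, |s| *s == State::Paused).unwrap();
    }
}

pub struct ManagedThread { ctx: Arc<SharedContext>, handle: thread::JoinHandle<()> }

impl ManagedThread {
    /// Starts `f` and returns once it has reached its first pause point or finished.
    pub fn spawn(f: impl FnOnce() + Send + 'static) -> ManagedThread {
        let ctx = Arc::new(SharedContext { state: Mutex::new(State::Running), cv: Condvar::new() });
        let theirs = Arc::clone(&ctx);
        let handle = thread::spawn(move || {
            CONTEXT.with(|c| *c.borrow_mut() = Some(Arc::clone(&theirs)));
            f();
            *theirs.state.lock().unwrap() = State::Done;
            theirs.cv.notify_all();
        });
        let managed = ManagedThread { ctx, handle };
        managed.wait_while_running();
        managed
    }

    fn wait_while_running(&self) -> State {
        let state = self.ctx.state.lock().unwrap();
        *self.ctx.cv.wait_while(state, |s| *s == State::Running).unwrap()
    }

    /// Resumes a paused thread and blocks until it pauses again or finishes,
    /// so the control thread and the managed thread never run at the same time.
    pub fn unpause(&self) -> State {
        {
            let mut state = self.ctx.state.lock().unwrap();
            if *state == State::Paused {
                *state = State::Running;
                self.ctx.cv.notify_all();
            }
        }
        self.wait_while_running()
    }

    /// Unpauses until the closure has finished, then joins the OS thread.
    pub fn join(self) {
        while self.unpause() != State::Done {}
        self.handle.join().unwrap();
    }
}

#[derive(Default)]
struct Counter { value: AtomicU32 }

impl Counter {
    // Broken increment with pause points before and after the load and the store.
    fn increment(&self) {
        pause();
        let value = self.value.load(SeqCst);
        pause();
        pause();
        self.value.store(value + 1, SeqCst);
        pause();
    }
}

/// Replays "0: increment, 1: increment, 0: unpause, 1: unpause", then joins both.
pub fn replay_minimal_trace() -> u32 {
    let counter = Arc::new(Counter::default());
    let (c0, c1) = (Arc::clone(&counter), Arc::clone(&counter));
    let t0 = ManagedThread::spawn(move || c0.increment()); // stops before its load
    let t1 = ManagedThread::spawn(move || c1.increment()); // stops before its load
    t0.unpause(); // thread 0 loads 0
    t1.unpause(); // thread 1 also loads 0
    t0.join(); // thread 0 stores 1
    t1.join(); // thread 1 stores 1 again
    counter.value.load(SeqCst)
}
```

Since only one thread is ever running, the replay yields the same lost update on every run. The sketch does not handle a panicking closure: in that case the state never becomes Done and the control thread would wait forever.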

Failure reproduction and minimization

An arbtest run finds a failure in the broken counter
- in the example failure the model value is 4 and the actual value is 3
- the failing seed is 0x4fd7ddff00000020
Supplying the same seed yields the same interleaving again, which makes the failure easy to reproduce
With .minimize(), the failing case is reduced to a shorter execution
- the final minimal case has seed 0x9c2a13a600000001
- the minimal trace has four steps
  - 0: increment
  - 1: increment
  - 0: unpause
  - 1: unpause
In this minimal case the expected value is 2 but the actual value is 1, which exposes the flaw in the load/store-based increment

Extending to enumeration of all interleavings

The same structure can be driven by enumeration instead of random interleavings
Using exhaustigen, the article writes a test that explores every interleaving of up to 5 increments
- the test avoids no-op iterations: each step either unpauses a thread or submits an increment
Against the broken implementation, the test finds the same bug
- the example failure is left: 2, right: 1
After Counter::increment() is fixed to use fetch_add(1, SeqCst), the test passes
- the AtomicU32 wrapper also gains pause() calls before and after fetch_add()
- the run prints all 81133 interleavings are fine!
- run time: real 8.65s, CPU 8.16s, RSS 63.91mb
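The idea of exhaustive enumeration can be illustrated on a pure model, without real threads and without exhaustigen. In this sketch (all names are assumptions) a broken increment is two steps, load then store, and a fixed increment is one atomic step; the explorer walks every order in which two threads can take their steps. Its counts belong to this toy model only and are not comparable to the 81133 figure above.

```rust
#[derive(Clone, Copy)]
pub enum Kind {
    LoadStore, // broken: an increment is a load step followed by a store step
    FetchAdd,  // fixed: an increment is a single atomic step
}

#[derive(Clone, Copy)]
struct Thread {
    remaining: u32,      // increments not yet started
    loaded: Option<u32>, // value held between the load step and the store step
}

/// Explores every interleaving of two threads and returns
/// (interleavings explored, interleavings whose final value is wrong).
pub fn explore(kind: Kind, increments_per_thread: u32) -> (u64, u64) {
    let start = Thread { remaining: increments_per_thread, loaded: None };
    let mut totals = (0, 0);
    visit(kind, 0, [start; 2], 2 * increments_per_thread, &mut totals);
    totals
}

fn visit(kind: Kind, value: u32, threads: [Thread; 2], expected: u32, totals: &mut (u64, u64)) {
    let mut stepped = false;
    for i in 0..threads.len() {
        // Work on copies so sibling branches start from the same state.
        let mut next = threads;
        let mut next_value = value;
        let t = &mut next[i];
        match (t.loaded, kind) {
            // Store step: write back the stale value plus one.
            (Some(seen), _) => {
                next_value = seen + 1;
                t.loaded = None;
            }
            // This thread has nothing left to do.
            (None, _) if t.remaining == 0 => continue,
            // Load step: remember the current value.
            (None, Kind::LoadStore) => {
                t.remaining -= 1;
                t.loaded = Some(value);
            }
            // Atomic read-modify-write in one step.
            (None, Kind::FetchAdd) => {
                t.remaining -= 1;
                next_value = value + 1;
            }
        }
        stepped = true;
        visit(kind, next_value, next, expected, totals);
    }
    if !stepped {
        // Both threads are finished: one complete interleaving.
        totals.0 += 1;
        if value != expected {
            totals.1 += 1;
        }
    }
}
```

With one increment per thread, the load/store model has 6 interleavings, 4 of which lose an update; the single-step model has none that fail.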

Extending to weak memory models and model checking

The AtomicU32 in the current toy implementation delegates to the real atomic
One extension idea is to keep, for each atomic, the set of values written to it, and on a read return an arbitrary value that is consistent with a weak memory model
Interleaving exploration can also be made smarter than random
- a model checking approach can verify that all meaningfully different interleavings have been considered
- as in the Generate All The Things approach, all interleavings within a small scope can be enumerated

Why minimization works without shrinking

arbtest looks like a familiar PRNG interface, but it uses a finite PRNG
- if you keep asking for random values, at some point it returns Err(OutOfEntropy)
- this is why ? and while !rng.is_empty() appear in the test code
Once the test runs out of entropy it finishes early, so reducing the available entropy also shortens the test execution
Conceptually, the internal implementation is close to &mut &[u8]
- each request for a random number shrinks the byte slice
- the shorter the initial slice, the simpler the test
Because of this, failing cases can get smaller without any hand-written shrinking logic
The example source code is in properly-concurrent
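A toy illustration of the finite-entropy idea, not arbtest's actual API: Entropy, below, and run are hypothetical names. The byte slice decides which thread of a two-thread load/store model steps next, so the same bytes always give the same interleaving and a shorter slice gives a shorter run.

```rust
#[derive(Debug, PartialEq)]
pub struct OutOfEntropy;

/// Toy finite entropy source: a byte slice that shrinks as values are drawn.
pub struct Entropy<'a> {
    bytes: &'a [u8],
}

impl<'a> Entropy<'a> {
    pub fn new(bytes: &'a [u8]) -> Self {
        Entropy { bytes }
    }

    pub fn is_empty(&self) -> bool {
        self.bytes.is_empty()
    }

    /// Draws a value in `0..n` (n must be non-zero), consuming one byte.
    pub fn below(&mut self, n: u8) -> Result<u8, OutOfEntropy> {
        let (&first, rest) = self.bytes.split_first().ok_or(OutOfEntropy)?;
        self.bytes = rest;
        Ok(first % n)
    }
}

/// Steps a two-thread load/store model until the entropy runs out and
/// returns (model value, actual value) after finishing in-flight increments.
pub fn run(bytes: &[u8]) -> Result<(u32, u32), OutOfEntropy> {
    let mut rng = Entropy::new(bytes);
    let (mut model, mut actual) = (0u32, 0u32);
    // For each thread: the value it loaded, if it is between load and store.
    let mut loaded: [Option<u32>; 2] = [None, None];
    while !rng.is_empty() {
        let i = rng.below(2)? as usize;
        match loaded[i].take() {
            // "unpause": the thread finishes its store with the stale value.
            Some(seen) => actual = seen + 1,
            // "increment": the thread performs its load; the model counts it.
            None => {
                loaded[i] = Some(actual);
                model += 1;
            }
        }
    }
    // "join": complete any increment that is still between load and store.
    for seen in loaded.iter().flatten() {
        actual = *seen + 1;
    }
    Ok((model, actual))
}
```

The bytes [0, 1, 0, 1] correspond to the minimal trace 0: increment, 1: increment, 0: unpause, 1: unpause and give model 2 against actual 1. Minimization then amounts to searching for a shorter byte slice that still fails.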

1 comment

GN⁺ 2024-07-07

Hacker News comments

I am building a library called Temper in Rust with a similar approach: https://github.com/reitzensteinm/temper/tree/main
However, modeling the strange implications that follow from Rust's full memory model takes you much further: you need a ledger that tracks which writes each thread has seen. Depending on atomic memory orderings, read/write fences, and so on, guarantees can arise of the form: if write X is visible, then write Y must be visible too
I think this is the largest collection of C++/Rust memory model test cases; I gathered nearly everything I could find in books, the C++ standard, Stack Overflow, blogs, and so on. For example, the file for Mara Bos's Rust Atomics and Locks is here: https://github.com/reitzensteinm/temper/blob/main/memlog/tes...
Loom, mentioned in the article, is a similar but far more polished library that lets you thoroughly test higher-level components such as a mutex or a queue: https://github.com/tokio-rs/loom However, it does not model the memory model itself as precisely as Temper does, and I have been thinking about porting the test cases to Loom
I was inspired by Will Wilson's FoundationDB testing presentation; he is now at Antithesis, building a hypervisor-based solution that does this kind of testing on arbitrary Docker containers: https://www.youtube.com/watch?v=4fFDFbi3toc, https://antithesis.com/
I strongly believe this field will become very large over the next 10 years. WebAssembly is a sweet spot: complete enough to compile arbitrary software, yet simple enough that building something like Antithesis does not turn into a 5-year project for an elite team that has already shipped a database
I implemented shared-memory atomic snapshots in Rust and took automated testing as seriously as I could: https://github.com/kaymanb/todc/tree/main/todc-mem
I started with Loom, mentioned in the article, but later switched to shuttle: https://github.com/tokio-rs/loom, https://github.com/awslabs/shuttle
Instead of exploring exhaustively like Loom, shuttle takes a randomized approach, but its scheduler still gives probabilistic guarantees on bug detection. In use, shuttle turned out faster and scaled to more complex test scenarios
As with the article's approach, if a particular schedule causes a test failure, the random seed can be saved. Being able to reproduce a failing test quickly is very important, and it makes it possible to write explicit test cases for bugs that were previously caught and fixed: https://github.com/kaymanb/todc/blob/0e2874a70ec8beed8fae773...
For Kotlin/Java, JetBrains' Lincheck is a good library for this kind of work: https://github.com/JetBrains/lincheck
I especially like that it is declarative, and the way it outputs linearizability results
I would like to know whether there is a library like Loom for C++. I want to test some lock-free data structures
- There is. Personally I find Relacy Race Detector the easiest: https://github.com/dvyukov/relacy, https://www.1024cores.net/home/relacy-race-detector
  It is a fairly old tool and easy to work with. It was written by concurrency expert Dmitry Vyukov
- Folly has DeterministicSchedule, which also wraps atomic operations and is used to test the core synchronization primitives. I do not think it is as sophisticated as Loom, though
  https://github.com/facebook/folly/blob/main/folly/test/Deter...
- https://plv.mpi-sws.org/genmc/
If I understand correctly, this approach has limitations with respect to weak forward progress guarantees
The calculation in the article is not very trivial, but consider a cmpxchg loop that, on real hardware with a real scheduler, has an extremely low chance of stalling on any particular CPU. If the number of CPUs is n, the worst-case probability of making progress is 1/n, but under this testing method it becomes 1/t^p. Here t is the number of tasks, which can be far larger than the number of CPUs, and p is the number of pauses inside the loop body, which easily reaches 3 or more. That is enough to make an algorithm that works in practice look broken
Conversely, if you want to catch weak forward progress as a bug and therefore demand strong forward progress, this method still does not seem to provide a useful tool
Even so, it is certainly useful for many concurrency problems
- 1/t^p does not look right; I would take it to be just 1/t. After all, once t units of time have passed, some task must have made progress, and if there are t tasks, the probability that the task that progressed is mine is 1/t
  The main confusion is probably that stalling does not necessarily mean losing the CAS
Regarding the passage “To be honest, there is a bit of background knowledge here. It does not seem possible to avoid creating real threads without doing extremely cursed things in inline assembly. If something calls the pause() function and we want it to stay stopped until the next instruction arrives, that has to happen in a thread whose stack is separate from the test's stack”: I wonder whether some kind of async runtime could be used instead
This looks like achieving cooperative multitasking by instrumenting atomic operations. Maybe I need more coffee, but doing it without threads seems simpler
- Using async would be convenient, but another requirement is not to change the externally visible API of the software under test. async is “contagious”, so a sync API has to be backed by a sync implementation
One drawback of this approach is that the code under test itself has to be modified to suit the test code
It seems the same thing could be done by starting two threads and single-stepping them with ptrace, interleaving instruction execution “randomly”. Something like rr's chaos mode
But some instructions may not be atomic, so even if this is possible without emulation, it would need a way to single-step in units of “atomic microcode”
- Sounds like the Antithesis hypervisor
Using Loom seems to require conditional compilation; that is fine when testing a single library, but it is quite invasive
#[cfg(loom)]
pub(crate) use loom::sync::atomic::AtomicUsize;
#[cfg(not(loom))]
pub(crate) use std::sync::atomic::AtomicUsize;
I wonder whether there is a language that lets you make better use of its own scheduler
- In C# this happens almost automatically: https://github.com/microsoft/coyote/
If you really want to be thorough, you could run the test under ptrace and single-step the threads to produce different interleavings at the instruction level. I would like to know whether anyone has actually seen this done
Is there an alternative for black-box testing where, unlike here, the code cannot be instrumented?
- I have used this approach for async signal handler tests, but the number of combinations there is far more favorable. If the main thread runs n instructions, you only need n runs, each executing between 0 and n instructions before injecting the signal; the signal handler then runs to completion, and the main thread runs to completion as well. Total time is O(n^2)
  But with t threads, each running n instructions, where they can interrupt one another at every boundary, this approach becomes hard at realistic values of n. It seems the space would have to be reduced, for example by picking out and simulating only the operations with interesting behavior
Looks pretty great, I should give it a try. It will not catch every kind of error, though. Every call to pause() creates synchronization between threads; won't that hide some data race issues? In Rust this may not be a problem
