Ripgrep: a search tool faster than grep, ag, git grep, and others (2016)

(blog.burntsushi.net)

5 points by GN⁺ 2023-12-01 | 1 comment

ripgrep (rg) is a Rust-based command-line search tool that combines the code-search usability of The Silver Searcher with the raw performance of GNU grep, and ships binaries for Linux, Mac, and Windows.
Across 25 benchmarks, covering both single large files and large directory trees, no tool was clearly ahead of ripgrep in performance or correctness, and the cost of Unicode support stayed low.
It broadens the practical reach of code search tools with .gitignore handling, skipping hidden and binary files by default, file type filters, optional PCRE2 support, search across several encodings and compressed files, and preprocessor filters.
The differences between the Linux kernel repository and OpenSubtitles2016 experiments come mainly from literal optimizations, Teddy SIMD multi-pattern search, Aho-Corasick, the UTF-8 decoding approach, line counting, and the cost of .gitignore processing.
Memory maps can be slower when searching many small files in parallel but can help on a single large file, so ripgrep chooses between intermediate-buffer search and memory-mapped search depending on the situation.

What ripgrep set out to do

ripgrep is a command-line search tool that aims to combine the usability of code search tools with the performance of grep-style tools.
The comparison covers GNU grep, git grep, The Silver Searcher (ag), Universal Code Grep (ucg), The Platinum Searcher (pt), and sift.
The benchmarks set out to support three main claims:
- no tool is clearly better than ripgrep for both single-file and large-directory search
- correct Unicode support does not require a large performance cost
- when searching many files at once, memory maps can be slower rather than faster
The author wrote both ripgrep and the regex engine underneath it, and notes that the benchmarks may be biased because he chose them.

Features and default behavior

ripgrep's executable is named rg.
By default it searches the current directory recursively, respects .gitignore, and skips hidden files and binary files.
.rgignore is also supported, and .rgignore patterns take precedence over .gitignore.
-u, -uu, and -uuu progressively widen the search: ignoring ignore files, including hidden files, and including binary files.
- rg -uuu is similar to grep -a -r
File type filters are supported.
- rg -tpy foo: search only Python files
- rg -Tjs foo: search everything except JavaScript files
- --type-add adds new file type rules
Many grep features are available as well.
- context output
- multiple pattern search
- color highlighting
- full Unicode support
The default regex engine does not support look-around or backreferences, but selecting the PCRE2 engine with -P makes them available.
Some UTF-16 auto-detection is supported, along with specifying an encoding via -E/--encoding.
- This covers UTF-16, latin-1, GBK, EUC-JP, Shift_JIS, and others
With -z/--search-zip, compressed files such as gzip, xz, lzma, bzip2, and lz4 can be searched.
Arbitrary preprocessor filters are supported too, for things like PDF text extraction, additional decompression, decryption, and automatic encoding detection.

Reasons not to use it

If portability and being available everywhere are the top priority, the standards-compliant and widely installed grep is a better fit.
If you depend on a specific feature or bug of another tool, ripgrep may not be suitable.
In some performance edge cases, other tools may do better.
If it cannot be installed or the platform is not supported, it cannot be used at all.

How grep-style tools work

Search tools go through roughly three stages:
- collecting the files to search
- performing the actual search
- printing the results
grep-style tools need to search large files well, so regex engine performance matters.
ack-style tools need to handle recursive directory traversal and the application of ignore rules such as .gitignore quickly.
ripgrep tries to combine both approaches:
- a fast regex engine
- parallel search
- filtering of search targets
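
The three stages above can be sketched as three small functions. This is a deliberately naive illustration in Python, not how ripgrep is implemented: the function names are made up for this example, the search is single-threaded and line-by-line, and no ignore rules are applied.

```python
import os
import re
from typing import Iterator, List, Tuple


def collect_files(root: str) -> Iterator[str]:
    """Stage 1: walk the tree and yield candidate file paths in a stable order."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # makes the traversal order deterministic
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)


def search_file(path: str, pattern: "re.Pattern[bytes]") -> List[Tuple[int, bytes]]:
    """Stage 2: return (line_number, line) pairs for the lines that match."""
    hits = []
    with open(path, "rb") as f:
        for lineno, line in enumerate(f, start=1):
            if pattern.search(line):
                hits.append((lineno, line.rstrip(b"\n")))
    return hits


def format_results(path: str, hits: List[Tuple[int, bytes]]) -> List[str]:
    """Stage 3: render results as path:line:text."""
    return [f"{path}:{n}:{text.decode('utf-8', 'replace')}" for n, text in hits]


def grep_tree(root: str, regex: str) -> List[str]:
    """Run all three stages over every file under `root`."""
    pattern = re.compile(regex.encode("utf-8"))
    output: List[str] = []
    for path in collect_files(root):
        output.extend(format_results(path, search_file(path, pattern)))
    return output
```

Keeping the stages separate makes it easier to see where each optimization discussed later applies: traversal and filtering in stage 1, the regex engine and literal tricks in stage 2, and output buffering in stage 3.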

File collection and ignore processing

For ack-style tools, quickly deciding which files under the current directory to search is important.
Directory traversal performance is affected by the number of unnecessary stat calls.
ripgrep uses a recursive directory iterator that aims for a minimal number of system calls.
.gitignore processing has a cost:
- an ignore file has to be looked for in every directory
- ignore patterns have to be compiled
- the patterns have to be applied to every candidate path
The Linux kernel repository had 4,640 directories and 178 .gitignore files.
ripgrep tries to support .gitignore semantics more completely, and gives precedence to the most recently defined matching pattern.
ucg uses whitelist-based glob rules instead of .gitignore, which can be faster, but files with unknown extensions can be missed.
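
The "most recently defined matching pattern wins" rule can be illustrated with a short Python sketch. It is a simplification under stated assumptions: patterns are plain shell globs matched with the standard library's fnmatchcase, a leading ! marks a negation, and directory-specific rules, anchoring, and nested ignore files are all left out. The function name is_ignored is hypothetical.

```python
from fnmatch import fnmatchcase
from typing import List


def is_ignored(path: str, patterns: List[str]) -> bool:
    """Return True if the last pattern that matches `path` is not a negation."""
    ignored = False
    for pat in patterns:
        negated = pat.startswith("!")
        glob = pat[1:] if negated else pat
        if fnmatchcase(path, glob):
            # A later match overrides whatever an earlier pattern decided.
            ignored = not negated
    return ignored
```

With the patterns ["*.o", "!keep.o"], keep.o is searched because the negation comes last; reversing the order makes *.o win and keep.o is ignored again.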

Differences between regex engines

Regex engines fall roughly into two categories:
- backtracking-based: feature-rich, but can slow down to exponential time on some inputs
- finite automata-based: features may be limited, but they guarantee linear time relative to the length of the searched text
The engine behind each tool:
- GNU grep, git grep: their own finite automata-based engine
- ripgrep: Rust regex library, finite automata-based
- ag, ucg: PCRE-based backtracking
- pt, sift: Go regex library, finite automata-based
Because they use PCRE, ag and ucg can be more exposed to worst-case backtracking behavior.
For example, the pattern (a*)*c can cause trouble in the PCRE-based tools, while the other tools in the benchmark handle it without issue.

Literal optimization and SIMD

For simple string searches, literal search optimization can matter even more than the regex engine.
Boyer-Moore is a classic substring search algorithm, and it can use routines such as memchr to find candidate positions quickly.
memchr implementations often check 16 bytes at a time using SIMD instructions and can reach several GB/s of throughput.
The Rust regex library aggressively extracts prefix and suffix literals from patterns such as:
- foo|bar
- (a|b)c
- [ab]foo[yz]
- (foo)?bar
- (foo)*bar
- (foo){3,6}
If the whole regex reduces to a single literal or an alternation of literals, the core regex engine does not need to run at all.
ripgrep also exploits the line-oriented nature of its output to extract inner literals.
- Example: for \w+foo\d+, it finds foo first and verifies only the candidate lines with the regex
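
The inner-literal idea can be sketched in a few lines of Python. This is an illustration only: the caller passes the required literal by hand (ripgrep extracts it from the pattern itself), str.find stands in for a fast literal searcher, and the function name is hypothetical. It assumes every match of the regex must contain the literal and that matches never span lines.

```python
import re
from typing import List


def search_with_inner_literal(text: str, regex: str, literal: str) -> List[str]:
    """Return matching lines, running the regex only on lines containing `literal`."""
    pattern = re.compile(regex)
    matches: List[str] = []
    pos = 0
    while True:
        hit = text.find(literal, pos)  # cheap scan for the required literal
        if hit == -1:
            break
        # Expand the hit to the boundaries of the line that contains it.
        start = text.rfind("\n", 0, hit) + 1
        end = text.find("\n", hit)
        if end == -1:
            end = len(text)
        line = text[start:end]
        if pattern.search(line):  # full regex runs on candidate lines only
            matches.append(line)
        pos = end + 1  # continue with the next line
    return matches
```

For example, search_with_inner_literal(text, r"\w+foo\d+", "foo") never runs the regex on lines that lack foo, which is where the saving comes from when matches are rare.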
For multiple-literal search, GNU grep uses a Commentz-Walter-like algorithm, while Rust regex uses Aho-Corasick or the Teddy SIMD algorithm.
Teddy is a SIMD-based multi-pattern search algorithm that comes from Intel Hyperscan, and it is one of the key optimizations that lets ripgrep pull ahead of GNU grep.

Search strategy: avoiding line-by-line search

A simple implementation reads the file line by line and applies the pattern to each line, but in most searches matches are rare, so this is inefficient.
Search tools usually search a large byte buffer at once, using one of these approaches:
- memory mapping the file
- reading the whole file into memory
- incremental search with a fixed-size intermediate buffer
ripgrep, GNU grep, and git grep support incremental search, so it can be applied to both files and streams.
Incremental search is hard to implement:
- computing line numbers
- the buffer ending in the middle of a line
- handling long lines
- handling inverted matches
- handling context output around matches
ripgrep accepts the implementation complexity and uses incremental search, and in the benchmarks it is faster than memory maps when searching many small files.
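
Two of the difficulties listed above, line counting and a buffer that ends mid-line, can be shown with a small Python sketch. It is a simplified illustration, not ripgrep's searcher: it looks for a plain byte string that contains no newline, ignores inverted matches and context, and lets an over-long line grow the carry buffer without limit. The name search_stream and the default buffer size are assumptions of this example.

```python
import io
from typing import BinaryIO, Iterator, Tuple


def search_stream(stream: BinaryIO, needle: bytes,
                  buf_size: int = 64 * 1024) -> Iterator[Tuple[int, bytes]]:
    """Yield (line_number, line) for lines containing `needle`, reading in chunks."""
    carry = b""   # incomplete trailing line left over from the previous chunk
    lineno = 0    # number of complete lines consumed so far
    while True:
        chunk = stream.read(buf_size)
        if not chunk:
            break
        buf = carry + chunk
        cut = buf.rfind(b"\n")
        if cut == -1:
            carry = buf  # no full line yet (a long line): keep accumulating
            continue
        complete, carry = buf[:cut + 1], buf[cut + 1:]
        if needle in complete:
            # Only split into lines when the block is known to contain a hit.
            for line in complete.split(b"\n")[:-1]:
                lineno += 1
                if needle in line:
                    yield lineno, line
        else:
            lineno += complete.count(b"\n")  # cheap line counting, no splitting
    if carry:  # final line without a trailing newline
        lineno += 1
        if needle in carry:
            yield lineno, carry
```

The block-level check before splitting is the point: when matches are rare, most buffers are scanned once and only their newlines are counted.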

Output and parallelism

In a parallel search, if every thread writes output immediately, results from different files can be interleaved.
All of the parallel code search tools write search results to an intermediate in-memory buffer and serialize only the output stage.
This lets the search threads do the actual searching in parallel.
The drawback is that memory usage can grow very large, for example with a 2GB file in which every line matches.
For stdin or a single-file search, ripgrep writes directly to stdout without an intermediate buffer.
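
The buffering scheme described above can be sketched with Python's standard thread pool. This is an illustration of the structure only: the "files" are in-memory strings, results are emitted in input order for simplicity (a real tool might emit each buffer as soon as it is ready), and CPython threads will not show the speedup a native implementation gets. All names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple


def search_one(item: Tuple[str, str], needle: str) -> List[str]:
    """Search one in-memory 'file' and buffer its formatted output lines."""
    name, text = item
    return [f"{name}:{i}:{line}"
            for i, line in enumerate(text.split("\n"), start=1)
            if needle in line]


def parallel_search(files: Dict[str, str], needle: str, workers: int = 4) -> List[str]:
    """Search files concurrently; only the final output step is serialized."""
    output: List[str] = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        buffers = pool.map(lambda item: search_one(item, needle), files.items())
        for buffered in buffers:
            # Each file's lines are appended as one block, so they never interleave.
            output.extend(buffered)
    return output
```

Because each worker returns a complete per-file buffer, a file whose every line matches is held in memory in full before it is printed, which is the memory cost mentioned above.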

Benchmark methodology

The benchmarks are split according to the problems end users face:
- searching a large code repository
- searching a single large file
The search patterns lean toward simple literals, alternations, and light regular expressions.
Each tool has different default behavior, so for a fair comparison an effort was made to align conditions such as line numbers, Unicode, .gitignore, and whitelists.
The versions used in the benchmarks:
- ripgrep v0.1.2
- GNU grep v2.25
- git grep v2.7.4
- ag commit cda635, PCRE 8.38
- ucg commit 487bfb, PCRE 10.21 JIT
- pt commit 509368
- sift commit 2d175c
ack was much slower than the other tools at the time, so it was not included.
The benchmark runner is benchsuite, which requires Python 3.5 or later and is included in the ripgrep repository.
Each command was run 3 times as warm-up before measurement so that the corpus was loaded into the OS page cache.
Each command was then measured 10 times, and the mean and standard deviation were recorded.
The execution environment was an Amazon EC2 c3.2xlarge with Ubuntu 16.04, a Xeon E5-2680 at 2.8GHz, 16GB of memory, and an 80GB SSD.
The configuration log, summary results, and raw CSV were published as well.
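
The warm-up and measurement procedure described above (3 warm-up runs, then 10 timed runs, reporting mean and standard deviation) can be sketched as follows. This is not the benchsuite script itself; the function name and signature are assumptions made for this illustration.

```python
import statistics
import time
from typing import Callable, Tuple


def benchmark(run: Callable[[], object], warmups: int = 3,
              samples: int = 10) -> Tuple[float, float]:
    """Return (mean, standard deviation) in seconds over `samples` timed runs."""
    for _ in range(warmups):
        run()  # untimed: lets the OS page cache fill with the corpus
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        run()
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings), statistics.stdev(timings)
```

To time a real command, run could wrap a subprocess call, for example subprocess.run(["rg", "PM_RESUME"], stdout=subprocess.DEVNULL), assuming rg is installed and the working directory holds the corpus.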

Linux kernel code search results

The code search benchmarks were run on a built Linux kernel repository at commit d0acc7.
A built kernel repository was used because build artifacts can remain in the repository and affect both the relevance of search results and performance.
In linux_literal_default, searching for the simple literal PM_RESUME shows how the tools' default behaviors differ:
- rg respects .gitignore and skips hidden and binary files
- ag and pt behave similarly, but they count lines
- ucg does not read .gitignore and searches based on a whitelist
- sift searches almost everything by default
- git grep benefits from getting its set of files to search from the git index
Respecting .gitignore makes results more relevant, but it can have a performance cost.
In linux_literal, rg (whitelist) performed about the same as ucg, and rg (ignore) was on par with git grep.
rg (ignore) (mmap) and ag (ignore) (mmap) were slowed down by using memory maps, and under the same conditions rg (ignore) was much faster.
The memory-mapped versions were slower on a local machine as well, but the gap was smaller than on EC2.

Unicode and case-insensitive search

In linux_literal_casei, pt handled -i as Go regexp's (?i), which made it considerably slower.
sift lowercases both the pattern and the block being searched, so it slowed down less, but this optimization handles only ASCII case and is not correct for Unicode case handling.
ripgrep turns a case-insensitive search into literal combinations where possible and uses Teddy to find candidate positions quickly.
In linux_unicode_word, the search \wAh checks whether a Unicode-aware \w picks up results such as µAh.
Unicode could be toggled only in rg and git grep; ag, pt, sift, and ucg use an ASCII-only \w.
Turning on Unicode support in git grep came at a large performance cost, while ripgrep showed almost no performance drop.
ripgrep builds UTF-8 decoding into its finite state machine, so it matches directly on UTF-8 byte strings without a separate decoding step.
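
The difference between a Unicode-aware and an ASCII-only \w can be reproduced with Python's re module. This only demonstrates the matching semantics: Python's engine is a backtracking matcher that works on decoded strings, so it says nothing about ripgrep's byte-level automaton. The sample text is made up, and the micro sign is written as the escape \u00b5 to keep the code point unambiguous.

```python
import re

text = "battery: 3000 \u00b5Ah"  # "\u00b5" is the micro sign

# For str patterns, \w is Unicode-aware by default and accepts the micro sign.
unicode_hit = re.search(r"\wAh", text)

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so the micro sign no longer counts.
ascii_hit = re.search(r"\wAh", text, re.ASCII)

# In UTF-8 the micro sign is two bytes, which a byte-level automaton must recognize.
encoded = "\u00b5".encode("utf-8")
```

Here unicode_hit matches the three-character string ending in Ah, ascii_hit is None, and encoded is the two-byte sequence C2 B5.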

Differences by regex complexity

For regexes with a literal suffix, such as [A-Z]+_RESUME, rg and ucg use _RESUME to find candidates quickly.
For a literal alternation such as ERR_SYS|PME_TURN_OFF|LINK_REQ_RST|CFG_BME_EVT, ripgrep uses Teddy and may not run the core regex engine at all.
For a case-insensitive alternation, ripgrep likewise builds prefixes out of the case combinations, finds candidates with Teddy, and verifies only those candidates with the full regex.
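
Building case-combination prefixes can be sketched in Python as below. The function name and the default prefix length of 3 are assumptions for this illustration, not ripgrep's actual limits, and only simple one-to-one case mappings are handled well (characters whose uppercase form is longer than one character are out of scope).

```python
from itertools import product
from typing import List


def case_prefixes(literal: str, prefix_len: int = 3) -> List[str]:
    """Return every upper/lower-case spelling of the first `prefix_len` characters."""
    # Characters without case (such as "_") contribute a single form.
    choices = [sorted({ch.lower(), ch.upper()}) for ch in literal[:prefix_len]]
    return ["".join(combo) for combo in product(*choices)]
```

The number of variants doubles with each cased character, which is why only a short prefix is expanded: case_prefixes("PME_TURN_OFF") yields 8 literals, and each candidate position they produce still has to be confirmed by the full case-insensitive regex.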
In the \p{Greek} search, only Rust regex and Go regex supported that Unicode property, and rg was much faster than pt and sift.
In the case-insensitive \p{Greek} search, sift failed to report matches and pt did not handle Unicode case correctly.
Patterns with no literals, such as \w{5}\s+..., expose regex engine performance directly:
- rg stayed fast even with Unicode support
- git grep pays a large performance cost for Unicode support
- a Unicode DFA has to handle far larger sets of NFA states than an ASCII DFA; for example, there were about 250 NFA states for ASCII versus about 77,000 for Unicode

Searching a single large file

The single-file benchmarks used samples from OpenSubtitles2016.
- the English sample was about 1GB
- the Russian sample was about 1.6GB
In this area, regex engine performance and literal optimization matter more.
In subtitles_literal, rg was the fastest for both the Sherlock Holmes and the Шерлок Холмс searches.
ripgrep tries to pick rare bytes from the literal and feed them to memchr.
- a standard Boyer-Moore implementation usually uses the last byte for its candidate search
- rg picks a rarer byte so that the SIMD-optimized loop can skip farther
In Russian patterns, many characters start with \xD0 or \xD1 in UTF-8, so searching for the first byte can be ineffective.
rg uses a precomputed 256-byte frequency table to prefer rarer bytes over \xD0 and \xD1.
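
The rare-byte heuristic can be sketched in Python. This is an illustration under assumptions: the rank table is derived from a sample corpus at run time rather than precomputed, bytes.find stands in for memchr, and the function names are hypothetical.

```python
from collections import Counter
from typing import List


def build_rank_table(sample: bytes) -> List[int]:
    """Rank each byte value 0..255; a lower rank means rarer in `sample`."""
    counts = Counter(sample)
    order = sorted(range(256), key=lambda b: (counts[b], b))
    rank = [0] * 256
    for position, byte in enumerate(order):
        rank[byte] = position
    return rank


def find_with_rare_byte(haystack: bytes, needle: bytes, rank: List[int]) -> int:
    """Return the first index of `needle` in `haystack`, or -1 if absent."""
    if not needle:
        return 0
    # Choose the position in the needle whose byte the table considers rarest.
    offset = min(range(len(needle)), key=lambda i: rank[needle[i]])
    rare = needle[offset:offset + 1]
    pos = haystack.find(rare, offset)  # single-byte scan, standing in for memchr
    while pos != -1:
        start = pos - offset
        if haystack.startswith(needle, start):  # verify the full needle here
            return start
        pos = haystack.find(rare, pos + 1)
    return -1
```

With a table built from Russian text, the lead bytes \xD0 and \xD1 rank as very common, so the scan keys on a continuation byte instead and stops far less often.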
For a single large file the memory map has to be created only once, so rg's memory-mapped search was about 25% faster than rg (no mmap).

Unicode and alternation in a single file

In subtitles_literal_casei, rg is fast while still handling Unicode case-insensitive search correctly.
GNU grep pays a large performance cost for Unicode case-insensitive search.
In the Russian case-insensitive search, grep (ASCII) appeared to effectively ignore -i, and ag reported 0 matches.
In subtitles_alternate, an alternation of several character names, rg was the fastest in both English and Russian.
For the English alternation, rg was about an order of magnitude faster than GNU grep.
In subtitles_alternate_casei, rg slowed down considerably compared with the previous test, but in English it still stayed ahead of the other tools.
In this case there are too many literal candidates for Teddy, so rg switches from Teddy to Aho-Corasick.
ripgrep uses a transition table-based "advanced" Aho-Corasick that performs one transition per byte of input.

Inner literals and patterns with no literals

Patterns such as \w+\s+Holmes\s+\w+ were built to defeat prefix and suffix literal optimizations, but the inner literal Holmes could still be used.
ripgrep and GNU grep both perform inner literal optimization.
ripgrep uses Rust regex's regex-syntax to extract literals from the pattern AST.
For the Russian version, \w+\s+Холмс\s+\w+, only the tools that support Unicode correctly could produce meaningful results.
For the long \w{5}\s+... pattern with no literals at all, rg was among the fastest tools in English, and the Unicode-enabled version of GNU grep was dropped because it took more than 90 seconds in English and more than 4 minutes in Russian.
By building UTF-8 decoding into the DFA, ripgrep achieves this performance while keeping Unicode support.

Additional benchmarks

everything is an unrealistic test that matches every line in the Linux repository with .*
- rg reported 22,065,361 lines in 1.081 seconds
- ag and pt did not report all lines, so they appear to have a match limit
nothing is a test that applies an inverted match to .* so that no line is reported
- rg took 0.302 seconds and git grep took 0.905 seconds
- pt and ucg do not support inverted search
context prints 2 lines of context around Sherlock Holmes in the English subtitle corpus
- rg at 0.612 seconds and sift at 0.717 seconds were roughly equal
- ucg does not support this feature
huge searches for Sherlock Holmes in the full 9.3GB English subtitle file
- rg took 1.786 seconds, GNU grep 5.119 seconds, and sift 3.047 seconds
- ucg reported only 1,543 lines under the line-counting condition, which is an incorrect result, and it is suspected of having trouble searching files larger than 2GB

Conclusion

In the Linux kernel repository searches, ripgrep did not win every benchmark, but it was hard to say that any other tool was clearly better in performance and correctness.
git grep could be a few milliseconds ahead in some simple cases, but when patterns became complex or Unicode was needed, ripgrep often pulled far ahead.
The following contribute to ripgrep's code search performance:
- fast directory traversal that aims for a minimal number of stat calls
- using RegexSet for .gitignore glob matching
- distributing work through a Chase-Lev work-stealing queue
- the decision not to use memory maps when searching many small files
- a fast regular expression engine
In single-file search, ripgrep was the fastest in all of the major benchmarks, in some cases by a wide margin.
Single-file performance is driven by rare byte-based memchr, Teddy SIMD, Aho-Corasick, and a DFA with UTF-8 decoding built in.
In the benchmarks that needed Unicode features, only rg, GNU grep, and git grep showed meaningful support, and GNU grep and git grep generally paid a large performance cost for it.
On Linux x86_64, memory maps hurt when searching many small files in parallel and helped when searching a single large file, and a VM environment may add a further penalty.

1 comment

GN⁺ 2023-12-01

Hacker News opinions

It really is fast, and I keep recommending pairing it with fzf.
I search with ripgrep first, then run a fuzzy search over the resulting file+text matches, and use this as a PowerShell function that shows context with bat.
In projects that mix many repositories, it narrows the scope very quickly in the "I know it's somewhere, but I don't know the exact location or name" situation.
This approach comes from https://github.com/junegunn/fzf/blob/master/ADVANCED.md, and even if you don't adopt all of it, it is worth a look for ideas.
- Going a step further, I'd recommend integrating ripgrep-all (rga) with fzf
  That lets you fuzzy search not only text files but also many other file formats such as PDF and zip
  More details are at https://github.com/phiresky/ripgrep-all/wiki/fzf-Integration
- I wrote a bash version of this too
  It picks from the rg results with fzf, parses the chosen file and line number, and opens it with $EDITOR +"${linenumber}" "$file"
- Vim without fzf+rg feels almost broken
  Like grinding coffee by hand instead of using an electric grinder
- With fzf you can pick many files to add in Git while skipping some
  Put fza = "!git ls-files -m -o --exclude-standard | fzf -m --print0 | xargs -0 git add" under [alias] in gitconfig, and git fza shows the list of modified or not-yet-added files, where you toggle items with space as you go
  This alias and fzf+fd make parts of the workflow considerably faster
  There is also a guide that collects what to put in your zsh settings on macOS: https://gist.github.com/aclarknexient/0ffcb98aa262c585c49d4b...
- I use ripgrep in much the same way
  In a codebase with hundreds of repositories, I use it as the starting point for narrowing down to a file or project, and then dig further from there
In Emacs I use ripgrep together with the project.el and dumb-jump packages.
It may not be the most popular setup, but the overall experience is quite satisfying.
Install dumb-jump with package-install and just set (add-hook 'xref-backend-functions #'dumb-jump-xref-activate).
In a Python project, looking up an identifier's definition with M-. or C-u M-. makes dumb-jump run an rg command tailored to the current project and file type and show the results in an Xref buffer.
It supports ag as well, and falls back to grep if neither ag nor rg is available, but as you would expect that can be slow when searching an entire home directory.
- In Emacs, ripgrep is easy enough to use with just the built-in project.el
  No external package is needed, and to use it in place of a slow grep in large directories, set (setq xref-search-program 'ripgrep)
  A project search such as C-x p g foo RET then runs in the current project as rg -i --null -nH --no-heading --no-messages -g '!*/' -e foo
  Results appear in an Xref buffer, so keys such as n, p, RET, and C-o make it convenient to move to the next or previous match, jump to the source, and show it in a split window
- Speaking as ripgrep's author: I haven't run that regex myself, but it looks like the --pcre2 flag can be dropped
  The second and third \b assertions can probably be removed too, while the first may be necessary
- Deadgrep uses ripgrep and also has evil-collection bindings, so it is comfortable to use: https://github.com/Wilfred/deadgrep
- That approach is nice too, but when I need to search across several projects at once, or only in a subfolder of a project, I still use rg.el
  I used to use rgrep in those situations
Interestingly, VS Code search now also runs on ripgrep through a Node.js wrapper.
https://www.npmjs.com/package/@vscode/ripgrep
- This is great if you are in an environment where you can request or install VS Code but cannot install ripgrep
  The rg binary can be found inside the VS Code install path. At least in my Windows office environment that was possible
- I always wondered why search is so fast in VS Code even though it is an Electron app, and now I understand why
- This is not a new feature; VS Code has had it for 7 years
I have been using ripgrep for about 2 years and it has become an indispensable tool.
The main reason I switched from grep was ease of use.
By default it respects .gitignore rules and skips hidden files/directories and binary files, so rg search_term directory is much nicer than the equivalent grep command, and the speed improvement is a bonus.
When a match is very long and clutters the terminal, I often use the -M option, as in -M 1000.
- -M really is great
  It is especially convenient for ignoring results from minified files you don't want to see, and the -g option, as in -g *.cs, is handy for searching only files with a particular extension
  Being a standalone portable binary is useful too; when working on a new machine, drop in the executable and alias grep to rg, and rg runs even when you type grep out of habit
This may well still be true in 2023, but the problem is that parallelized grep alternatives such as ripgrep or ag are so much faster than old grep that the small speed differences among them are hard to treat as a distinguishing factor.
I use ag inside Emacs on a 900,000-line codebase, and on a 16-core Ryzen Threadripper 2950X it finishes practically instantly.
I don't feel a need to turn "under 1 second" into "a bit more under 1 second".
Speed is not the main feature of the new grep-type tools; they should be evaluated and compared in other ways.
- In my view, speed definitely was the main feature in 2016
  ag has a fairly large performance cliff, and that shows in the blog post too
  Workloads differ from person to person, though, so in some cases the performance difference may not matter
  900,000 lines is not very large, and for a simple query most grep-type tools that aren't naive handle it very quickly
  On other comparison criteria, ag is more or less on life support, and it looks like it was about to be removed from Debian before someone rescued it: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=999962
  The blog post compares Unicode support as well, and ag has practically none. That won't matter to everyone, but it is enough to serve as a non-performance comparison criterion
- In my experience all of these tools are badly I/O-bound
  Search takes about as long as loading the files from disk, and differences beyond that are unlikely to be meaningful
  If the files are in cache, navigating the file system and typing the command dominate over search time, so the performance difference is unlikely to be meaningful there either
The title should have (2016) in it.
This is really a release announcement, not new information.
- Related discussions:
  "Ripgrep – A new command line search tool" https://news.ycombinator.com/item?id=12564442 (740 points | Sept 23, 2016 | 209 comments), which also includes discussion about speed
  "Ripgrep is faster (2016)" https://news.ycombinator.com/item?id=17941319 (98 points | Sept 8, 2018 | 40 comments)
It is not faster than qgrep.
The two work quite differently: qgrep is re2-based, but its speed comes from having an index.
In large file repositories, using qgrep and an index seems more sensible than scanning every file every time, and I'm surprised people forget about the qgrep option.
Also, if you need multi-line matching on UTF-8, ripgrep has to fall back to a separate PCRE2 library, so I don't think it stays that fast.
- As ripgrep's author: it is true that qgrep uses indexing, so it has an advantage over non-indexing tools
  But the index has to be set up and maintained, so the UX is no longer as simple as "just run the search"
  The reason people don't use qgrep is the same reason they don't use ripgrep, saying "grep is fast enough for me"
  With small search targets, the speed difference between ripgrep and grep, or between qgrep and ripgrep, often can't be felt
  If ripgrep finishes a Linux kernel search within 100ms, whether it is worth the trouble of switching to an indexing tool for ordinary interactive use depends on the situation, but generally it won't be
  I have thought about adding indexing to ripgrep: https://github.com/BurntSushi/ripgrep/issues/1497
  And multi-line search does not need PCRE2. The default regex engine has Unicode support too, and multi-line search is still supported when building without PCRE2
Since switching from ripgrep to ugrep, I haven't needed to look back.
The speed is similar, but it has fuzzy matching, a TUI that is useful for code review, and it can search inside PDFs and compressed files.
Being able to optionally use Google search syntax is convenient too.
https://ugrep.com
- I'm a big ripgrep fan, but recently I had to look at ugrep for a feature ripgrep lacks: searching inside zip archives
  You can search without unzipping to disk
  I work with a compressed corpus of millions of small text files, and it is nice not to have to extract the whole thing onto the file system anymore. Some file systems struggle at that scale
  I'm grateful for both tools, and thanks to their respective authors
- If I start using Google search syntax in grep, I'm afraid most of the results will try to sell me something
- While casually searching for "ugrep vs ripgrep" articles, I came across posts suggesting that the authors of ugrep and ripgrep had argued on Reddit for several years
  For example https://www.reddit.com/r/programming/comments/120wqvr/ripgre...
  These are just open source tools, but it felt a bit odd
- I wonder whether the TUI is better than passing the results to fzf
  For me, fzf's configurability and flexibility are hard to beat
- Thanks for pointing this out
  The killer feature looks like compatibility with existing grep command-line options
  Not having to learn an entirely new set of options is quite nice
It is surprising that grep itself hasn't been replaced or improved.
This topic is starting to feel a little old, too.
- There could be several explanations
  Things like inertia, compatibility, resistance to change, and the innovator's dilemma. I don't mean that negatively; all of these apply to me too
  On compatibility, see the FAQ: https://github.com/BurntSushi/ripgrep/blob/master/FAQ.md#pos...
- It is a bit like asking why you don't replace the 40-year-old chair you are sitting on right now with a Razer UltraSeat XR3000-A
  It is comfortable, it fits well into the working environment around it, and there is no particular reason to replace it and set everything up again
  The analogy only goes so far: a Razer-like chair is already sitting nearby, with clothes piled on it
- Someone who designed Unix made certain system functions both core OS functions and tools used by humans, and the result, decades later, is the odd situation of "a program named xyz must exist, and it must take this argument and behave in exactly this way"
- Plenty of alternative tools like ripgrep can already be used
  If the idea is to replace the grep command itself with another utility, the amount that would break looks far larger than the value gained
  People who want a faster grep can use another tool, and people who use the existing grep can keep doing so, which is already close to an ideal state
- grep is a general-purpose tool for finding text in all kinds of files, and it is deeply embedded in the UNIX standard
  Some programmers use it for source code search, but others use it for text search unrelated to source code or in scripts, and expect it never to crash
  ripgrep, by contrast, is a specialized and opinionated tool designed mainly for searching source code repositories
  There is not much room left to make general-purpose text search faster. Using mmap() carries a risk of crashing on truncated files, reducing the expressiveness of regular expressions could make it faster, and you could drop all locale and character set support and hardcode only UTF-8/UTF-16, but you shouldn't do that
Looking in Portage, there also seems to be a version that handles other documents such as PDF and doc.
https://github.com/phiresky/ripgrep-all
