CLI text processing with GNU awk

(learnbyexample.github.io)

4 points by GN⁺ 2023-08-29 | 1 comment

GNU awk iterates over stdin and files automatically, and handles CLI text tasks by combining filtering, substitution, and field processing into one-line commands
/regexp/ and !/regexp/ are shorthand syntax for testing the current input line; when the condition is true, $0 is printed by default
sub replaces only the first match, while gsub replaces all matches; if the target is omitted, the substitution applies to the current input line $0
Whitespace-based field splitting together with $N, NF, and $NF makes awk a good fit for field-based tasks such as selecting specific columns or processing them conditionally
By chaining cond{action} pieces and adding BEGIN{} and END{}, you can go beyond simple filters and build small programs that use strings, numbers, and associative arrays

How `awk` executes

awk can filter text like grep and sed, and it also provides programming features for more complex processing
Input can come from stdin or from files, and by default awk iterates line by line, applying conditions and actions
The entire input line is accessed through the special variable $0
- The precise term is input record, but this chapter explains things in terms of lines
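
As a minimal sketch of this loop, the command below feeds two made-up lines through stdin and prints each one with a prefix. The input text and the "line: " prefix are illustrative assumptions, not taken from the chapter.

```shell
# awk reads stdin line by line; inside the action block, $0 holds the current line.
printf 'first\nsecond\n' | awk '{print "line: " $0}'
# line: first
# line: second
```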

Filtering and default output

A regular expression is written in the /.../ form; to check whether a string matches a regex, use the string ~ /regexp/ syntax
The opposite condition is expressed as string !~ /regexp/
If the string to test is omitted, $0 becomes the target
If there is only a condition and no action, $0 is printed automatically when the condition is true
- awk '/regexp/' is shorthand for awk '$0 ~ /regexp/{print $0}'
- awk '!/regexp/' is shorthand for awk '$0 !~ /regexp/{print $0}'
The examples illustrate the basic filtering flow by printing only the lines that contain at, and by printing only the lines that do not contain e
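
The filtering flow described above could look roughly like this. The four input words are assumed sample data chosen so that both filters select something.

```shell
# Assumed sample input: four short words, one per line.
# Condition only, no action: matching lines are printed as-is.
printf 'gate\napple\nwhat\nkite\n' | awk '/at/'
# gate
# what

# Negated condition: print the lines that do not contain "e".
printf 'gate\napple\nwhat\nkite\n' | awk '!/e/'
# what

# Longhand form of the first command; the output is the same.
printf 'gate\napple\nwhat\nkite\n' | awk '$0 ~ /at/{print $0}'
# gate
# what
```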

Using `1` as an idiomatic way to print

In a conditional expression, a non-zero number and a non-empty string are treated as true
awk '1' is an idiomatic expression that gives an always-true condition for every input line
Since a true condition with no action prints $0, awk '1' prints the whole input unchanged
The result is the same as awk '{print $0}' or a plain cat
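
A short sketch of the idiom, using assumed two-line input. The contrasting awk '0' case is added here only for illustration: 0 is false, so nothing is printed.

```shell
# 1 is a non-zero number, so the condition is true for every line
# and the default action prints $0.
printf 'apple\nbanana\n' | awk '1'
# apple
# banana

# 0 is false, so this prints nothing.
printf 'apple\nbanana\n' | awk '0'
```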

Substitution: `sub` and `gsub`

awk provides the sub and gsub functions for search and replace
sub(/:/, "-") replaces only the first : with - in each input line
gsub(/:/, "-") replaces every : with - in each input line
Both functions take the regular expression to match as the first argument and the replacement string as the second
If no input string is given separately, the default target is $0
When a substitution succeeds, the target is modified in place
The 1 after the substitution block is interpreted as a condition outside the block, so the modified $0 is then printed
- awk '{sub(/:/, "-")} 1' gives the same result as awk '{sub(/:/, "-"); print $0}'
- A bare print also prints $0 by default
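
The difference between the two functions can be sketched with assumed colon-separated input. The last command passes a field as the optional third argument, which restricts the substitution to that field.

```shell
# sub: only the first ":" on each line is replaced.
printf '1:2:3\na:b:c\n' | awk '{sub(/:/, "-")} 1'
# 1-2:3
# a-b:c

# gsub: every ":" on each line is replaced.
printf '1:2:3\na:b:c\n' | awk '{gsub(/:/, "-")} 1'
# 1-2-3
# a-b-c

# Explicit target: only the second field is modified.
printf '1:2 3:4\n' | awk '{gsub(/:/, "-", $2)} 1'
# 1:2 3-4
```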

Choosing between `grep`, `sed`, and `awk`

For simple line filtering, grep, which is optimized for this use case, can be faster than sed or awk
For substitution tasks, sed can be faster than awk
The tools' features do not always map 1:1
- Implementing grep -o in sed or awk takes more steps
- Of the three, only grep offers recursive search
A related discussion is available at unix.stackexchange: When to use grep, sed, awk, perl, etc
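
To illustrate the "more steps" point about grep -o, here is one possible awk equivalent. This is a sketch, not the chapter's own solution: it relies on the standard match() function with RSTART and RLENGTH, which this summary does not cover, and the input line and pattern are made up.

```shell
# grep -o prints each match on its own line.
printf 'cat and bat\n' | grep -o '[a-z]at'
# cat
# bat

# awk needs an explicit loop: match() sets RSTART and RLENGTH,
# and the remaining text is searched again after each match.
printf 'cat and bat\n' | awk '{
    s = $0
    while (match(s, /[a-z]at/)) {
        print substr(s, RSTART, RLENGTH)
        s = substr(s, RSTART + RLENGTH)
    }
}'
# cat
# bat
```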

Field-based processing

awk is used especially often for field-based processing
By default it splits each input line on whitespace, and each field is accessed as $N
- $2 is the second field
- $NF is the last field
- NF is the total number of fields in the current input line
The table.txt example shows column-based processing compactly
- Printing the second field of every line
- Printing only the lines whose last field is negative
- Changing b to B in the first field only
The example file is in the example_files directory
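
The three operations listed above might be written as follows. The file contents are an assumption made for this sketch (three whitespace-separated rows ending in a number); the real table.txt lives in the example_files directory.

```shell
# Assumed contents for table.txt.
printf '%s\n' 'brown bread mat hair 42' \
              'blue cake mug shirt -7' \
              'yellow banana window shoes 3.14' > table.txt

# Second field of every line.
awk '{print $2}' table.txt
# bread
# cake
# banana

# Lines whose last field is negative.
awk '$NF<0' table.txt
# blue cake mug shirt -7

# Replace b with B in the first field only.
awk '{gsub(/b/, "B", $1)} 1' table.txt
# Brown bread mat hair 42
# Blue cake mug shirt -7
# yellow banana window shoes 3.14
```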

One-line command structure

A typical awk one-line command has the following form

awk 'cond1{action1} cond2{action2} ... condN{actionN}'

If there is no condition, the action always executes
If there is no action, $0 is printed when the condition is true
Multiple statements inside a block can be separated with a semicolon ;
By chaining several blocks, most one-line commands can express condition-specific actions without an explicit if
- awk '$NF<0' table.txt prints only the lines whose last field is negative
BEGIN{} executes before any input is read
END{} executes after all input has been processed
Details on operators and conditional expressions are in gawk manual: Operators and gawk manual: Truth Values and Conditions
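
A small sketch that combines these pieces in one command. The three input numbers, the messages, and the counter name n are illustrative assumptions.

```shell
printf '42\n-7\n3.14\n' | awk '
    BEGIN   { print "start" }                # runs before any input is read
    $1 < 0  { print "negative:", $1; n++ }   # two statements separated by ;
    $1 >= 0                                  # condition only: prints $0
    END     { print "negatives seen:", n }   # runs after all input
'
# start
# 42
# negative: -7
# 3.14
# negatives seen: 1
```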

Strings and numbers

awk decides the type of a value from context, which keeps CLI solutions short
String literals are written inside double quotes
Numbers can be integers, floating-point values, or in scientific notation
BEGIN{} can also be used to run an awk program without any external input
Variables can store numbers and strings
- Example: a=5; b=2.5; print a+b
- Strings placed next to each other are concatenated
An uninitialized variable behaves as an empty string in string context and as 0 in numeric context
A string used in a numeric expression is coerced to a number
- If the string does not start with a valid number, it is treated as 0
- Leading whitespace is ignored
When a number is concatenated with a string, the number is converted to a string
Details are in gawk manual: Constant Expressions and gawk manual: How awk Converts Between Strings and Numbers
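
The conversion rules above can be tried directly in BEGIN blocks, with no input file. Apart from a=5; b=2.5, the variable names and literal values below are assumptions picked for illustration.

```shell
# Numeric variables.
awk 'BEGIN{a=5; b=2.5; print a+b}'
# 7.5

# Adjacent strings are concatenated.
awk 'BEGIN{s1="con"; s2="cat"; print s1 s2}'
# concat

# Uninitialized variables: empty string in string context, 0 in numeric context.
awk 'BEGIN{print "[" x "]", y+0}'
# [] 0

# String to number: leading whitespace is skipped; a non-numeric start gives 0.
awk 'BEGIN{print "  42abc"+1, "abc"+1}'
# 43 1

# Number to string on concatenation.
awk 'BEGIN{n=7; print n " items"}'
# 7 items

# Scientific notation.
awk 'BEGIN{print 1e3 + 0.5}'
# 1000.5
```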

Arrays

awk arrays are associative and work as key-value pairs
A key can be a number or a string, but numeric keys are converted to strings internally
Multidimensional arrays can also be used
The examples store and access values with string keys such as student["id"] and student["name"]
Whether a key exists is checked with the form "id" in student
Details are in gawk manual: Arrays
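
A minimal sketch of these array features. The stored values (101, "Joe"), the "age" key, and the grid and n arrays are hypothetical and chosen only to show each behavior.

```shell
awk 'BEGIN{
    student["id"] = 101              # illustrative values
    student["name"] = "Joe"
    print student["name"]

    # Membership tests do not create the key.
    if ("id" in student) print "id: present"
    if (!("age" in student)) print "age: absent"

    # Multidimensional form: the subscripts are joined into one string key.
    grid[1, 2] = "x"
    if ((1, 2) in grid) print "grid[1,2] =", grid[1, 2]

    # Numeric key 1 and string key "1" refer to the same element.
    n[1] = "one"
    print n["1"]
}'
# Joe
# id: present
# age: absent
# grid[1,2] = x
# one
```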

Practice and next steps

This chapter gives a brief introduction to basic awk shorthand syntax, filtering, substitution, field processing, type conversion, and arrays
The next chapter covers regular expressions, and the features introduced here keep appearing in later examples
You can get used to the syntax by changing field numbers, or by experimenting with unusual inputs such as negative and floating-point field numbers
For interactive practice, a TUI app can be installed from the AwkExercises repository; usage is described in app_guide.md
All practice questions are collected in Exercises.md, and the solutions are in Exercise_solutions.md

1 comment

GN⁺ 2023-08-29

Hacker News opinions

I like awk and use it fairly often; one of its main uses for me is as a stateful sed
For example, matching a line only when it comes after a specific earlier line, which is handy for building throwaway linters
I recently built a check that finds migration files containing a CREATE INDEX without CONCURRENTLY, which can cause problems on very large tables; SQL statements can span multiple lines, so simple matching was hard
awk can track state across lines, such as "I am inside a create statement" or "I am creating an index", so the hastily assembled script has been running fine for about a year
- I will have to learn awk someday, but until then I have learned sed's deeper state-based features
  To print the current line only when the previous line matches a particular pattern, you can do something like sed -ne 'x' -e '/PREV/ {x; /CURR/ p; x}'
  Example: echo -e "PREV\nCURR\nCURR\nCURR\nPREV\nRED" | sed -ne 'x' -e '/PREV/ {x; /CURR/ p; x}' prints only CURR
  This uses sed's hold buffer; -n suppresses the default output, and p then prints only the lines that are needed
  x swaps the current line and the hold buffer, and inside the /PREV/ { ... } block it swaps again and prints only if the current line contains CURR
  The last x swaps back again, which matters for overlapping matches
  Of course an awk script would probably be much simpler and more direct, but it can be done with sed as well
  The time spent learning this might have been better spent on awk, but the tutorial at https://www.grymoire.com/Unix/Sed.html is so good that almost everything I know about sed comes from it
- It would be great if you could share an example of tracking SQL state with awk
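
The commenter's actual script is not shown in the thread. As a rough sketch of the idea only, the first command below reproduces the PREV/CURR behavior of the sed example by remembering the previous line, and the second accumulates a multi-line statement until a semicolon. The SQL text, the variable names prev, stmt, and s, and the message wording are all hypothetical.

```shell
# Print a CURR line only when the previous line matched PREV.
printf 'PREV\nCURR\nCURR\nCURR\nPREV\nRED\n' |
    awk 'prev ~ /PREV/ && /CURR/; {prev = $0}'
# CURR

# Hypothetical linter: flag CREATE INDEX statements that lack CONCURRENTLY,
# even when the statement spans several lines.
printf '%s\n' 'CREATE INDEX' '  idx_a' '  ON big (a);' \
              'CREATE INDEX CONCURRENTLY' '  idx_b' '  ON big (b);' |
    awk '
        { stmt = stmt " " $0 }     # accumulate the current statement
        /;/ {                      # a semicolon closes the statement
            s = toupper(stmt)
            if (s ~ /CREATE +(UNIQUE +)?INDEX/ && s !~ /CONCURRENTLY/)
                print "line " NR ": CREATE INDEX without CONCURRENTLY"
            stmt = ""
        }
    '
# line 3: CREATE INDEX without CONCURRENTLY
```

The sketch assumes one statement ends per line containing a semicolon; comments, string literals, and several statements on one line would need more state.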
If anyone is reading this thread, they may also find "Ask HN: Share a shell script you like" from 2 weeks ago interesting
It only got 78 comments, so it was not as popular as hoped, but it is fine as a reference: https://news.ycombinator.com/item?id=37112991
- A similar discussion took place 5 months ago as well: https://news.ycombinator.com/item?id=35122780 (332 points, 328 comments)
  There is also a post from last year: https://news.ycombinator.com/item?id=32467957 (374 points, 294 comments)
Alongside Lisp and the like, I have kept a mild interest in Awk, and in 2022 I made cppawk: https://www.kylheku.com/cgit/cppawk/about/
cppawk adds preprocessing features to Awk
It has loop macros that support multiple clauses; clauses can be combined for parallel iteration or Cartesian-product iteration, and user extensions are possible
A new clause can be defined by writing 5 simple macros
It may be useful if you use Awk; the documentation is spread across several man pages, and there are unit tests that run on gawk and mawk
Maybe this shows an old sysadmin mindset, but I do not really understand what advantage awk has over writing the same thing directly in Perl
I have seen many horrible shell scripts written by junior system administrators, and each time I thought "the text processing part would be much cleaner in Perl"
- If Perl is always available in your environment, that is a fair point
  For me, awk is one of the few languages I can pick up again in 10 minutes even after leaving it for months
  It has an intuitive side, and it fits naturally with common command-line tools
- Comparing Awk and Perl for scripting, I would prefer Perl or Python
  However, this article is about small one-line commands for throwaway work, and in those cases sed/awk are better than Perl
  If you already know Perl, it is fine to keep using Perl instead of learning separate tools
- Awk's strengths are the read loop you get for free, field splitting, and the pattern/condition matching model
  The language itself is just "okay", but it is adequate for the job
  Perl can obviously do all of it too, but you have to write the boilerplate that awk gives you for free
  For me, the advantages of the Perl language are not big enough to give up awk, and I use awk more for data processing and one-off file manipulation than for "scripting", so I do not miss Perl's depth
  When I need deeper features, I go elsewhere
- There is no particular technical advantage; Perl's features are deliberately close to a superset of awk
  Today's generation did not learn Perl, so it seems to be rediscovering the idea of stream processing through awk
  awk was a brilliant idea for the late 1970s and a genuine innovation worth copying
  Perl later copied it and took it further, but was then forgotten; so watching awk get rediscovered feels a bit embarrassing
- Not being snarky, but I would like to know why Perl is better than Python for scripting
One of the lesser-known things about gawk is that it ships with generally useful extensions
You can access functionality such as readdir(), ord(), chr(), gettimeofday(), and sleep()
https://www.gnu.org/software/gawk/manual/html_node/Extension-Samples.html
awk one-liner commands really are very powerful
The hard question is whether investing in more complex awk programming is actually worthwhile
If a processing task needs complex logic, awk does provide it, but in the terse and obscure style of the early computing era
On the other hand, choosing a modern alternative has its own cost; a tool like pandas in particular is not always intuitive and can have performance issues
- For me the big issue is libraries
  Even a personal file of common functions does not seem well supported, and there does not appear to be a proper way to import third-party libraries
  As soon as I need helper functions and the code starts spanning several lines, I usually move to Python
  Then the program becomes 2 to 3 times larger and I sigh
  Ruby is great as an awk alternative, but if coworkers do not know Ruby, it is hard to expect them to maintain it
- Before ChatGPT, I did not like obscure awk/sed one-liners
  These days, pasting such hard-to-read commands into an AI gets a fairly good step-by-step explanation, so they feel fine now
  Still, I prefer a few extra lines of Python because of the possibility of unit testing, although for quick data munging tasks unit tests are sometimes not needed in the first place
I have released a new version of the e-book "CLI text processing with GNU awk"
It teaches the GNU awk command step by step from beginner to advanced level through hundreds of examples and exercises, and covers in depth field processing, filtering, handling multiple files, solutions that depend on multiple records, comparing records/fields between files, finding duplicates while preserving input order, regular expressions, and more
PDF/EPUB: https://learnbyexample.gumroad.com/l/gnu_awk (free until 31 August 2023)
Web version: https://learnbyexample.github.io/learn_gnuawk/
Markdown source and example files: https://github.com/learnbyexample/learn_gnuawk
Interactive TUI app for practice: https://github.com/learnbyexample/TUI-apps/blob/main/AwkExercises
As bundles, Magical one-liners, a collection of grep, sed, awk, perl, and ruby one-liners, is $5: https://learnbyexample.gumroad.com/l/oneliners/new_awk_release, and the complete bundle of all 13 e-books is $12: https://learnbyexample.gumroad.com/l/all-books/new_awk_release
Any feedback would be very helpful: typos, code mistakes, which parts you liked or did not like
Previous discussions are at https://news.ycombinator.com/item?id=15549318 and https://news.ycombinator.com/item?id=22758217
- I would like to know how many people actually pay for material like Magical one-liners, and whether you have ever tried a "choose what to pay after viewing" model
  It may well be useful enough to be worth $5 or $15, but it might not differ from what is already saved in my personal one-liner file, so I hesitate to pay just to find out
- It is clear that a lot of work went into this, and I especially liked the TUI application
  I would like to know whether the web version will remain free after 31 August
I once golfed a heart drawing in AWK: https://gist.github.com/auselen/906a53b47a7d616b080dbef85eb8f776
99.9% of awk usage is splitting lines while ignoring runs of whitespace
Example: echo "key: value" | awk '{print $1}'
It would be nice if there were a simpler alternative
- https://github.com/c-blake/bu/blob/main/doc/cols.md can be considered
  It is written in Nim, but that is probably not a big barrier, and the other tools under bu/ may be of interest too
- It is also possible with cut: echo "key: value" | cut -wf 2
  Whether it is really simpler is debatable, though
  On checking, I found that GNU cut has no -w, so this is BSD-only
  In the end it is better to just use awk
- https://github.com/sstadick/hck and https://github.com/theryangeary/choose are also worth a look
  Both are cut/awk alternatives and support regex-based splitting, but as far as I recall they do not strip leading/trailing whitespace
  My own https://github.com/learnbyexample/regexp-cut is a cut-like tool that uses awk to provide regex-based splitting, negative indexes, and so on, and thanks to awk's default behavior it also handles leading/trailing whitespace
- You probably meant echo "key: value" | awk '{print $2}'
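
The whitespace handling discussed in this thread can be checked with a quick sketch. The second input line is made up to include leading, trailing, and repeated whitespace as well as a tab.

```shell
# $2 is the value; $1 would be "key:".
echo "key: value" | awk '{print $2}'
# value

# Default splitting ignores leading/trailing whitespace and treats
# runs of spaces and tabs as a single separator.
printf '   a    b\tc   \n' | awk '{print NF ": " $1 "," $2 "," $NF}'
# 3: a,b,c
```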
I once ported a diff2html script from bash to awk, and for obvious reasons it was much faster
Compared with the bash script, the awk part was also much easier to read, and in a single night I was able to learn the language, debug, understand the bug, and fix it
I do not know whether this is the idiomatic awk way, but I found it a really nice language
https://github.com/berry-thawson/diff2html/blob/master/diff2html.sh
