Tantivy, a full-text search engine library inspired by Apache Lucene

(github.com/quickwit-oss)

1 point by GN⁺ 2024-05-28 | 1 comment

Tantivy is a fast full-text search engine library written in Rust; it is not a ready-to-run search server but a crate meant to be used as a building block when building a search engine
Its design is heavily inspired by Apache Lucene, and if you are looking for a server alternative to Elasticsearch or Apache Solr, the project recommends Quickwit, a distributed search engine built on Tantivy
Features include full-text search, BM25 scoring, a natural query language, phrase search, incremental indexing, multi-threaded indexing, an mmap directory, SIMD integer compression, facet search, JSON fields, and an aggregation Collector
It runs on stable Rust and supports Linux, macOS, and Windows; startup time is under 10 ms, which the project says makes it suitable for command-line tools
Distributed search is out of scope for Tantivy; to modify a document you delete the old one and index it again, and new documents become searchable only after a commit, an IndexReader reload, and acquiring a new Searcher

Tantivy's positioning and design

Tantivy is a fast full-text search engine library written in Rust
Unlike Elasticsearch or Apache Solr, it is not a search engine server that you run directly; it is a crate that can be used to build such a search engine
In terms of design it is closer to Apache Lucene, and it is heavily inspired by Lucene's design
If you are looking for an alternative to Elasticsearch or Apache Solr, the project recommends Quickwit, a distributed search engine built on Tantivy

Performance and benchmarks

Tantivy provides a benchmark that breaks down performance by query type and collection type
Benchmark results can vary with the nature of the queries and the load
Benchmark details are available in the search-benchmark-game repository
According to the FAQ, Tantivy is on average about 2x faster than Lucene in the search latency benchmark

Search and indexing features

Search features
- Full-text search
- BM25 scoring, the same as Lucene
- Natural query language support: (michael AND jackson) OR "king of pop"
- Phrase search support: "michael jackson"
- Range queries
- Facet search
- JSON fields
- Aggregation Collector: histogram, range buckets, average, and stats metrics
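To make the BM25 scoring item above more concrete, here is a minimal std-only sketch of the Okapi BM25 formula for a single query term. It is an illustration under stated assumptions, not Tantivy's implementation: the function names are hypothetical, and `K1 = 1.2` and `B = 0.75` are simply the conventional defaults.

```rust
/// Conventional BM25 parameters (assumed defaults, not read from Tantivy).
const K1: f64 = 1.2;
const B: f64 = 0.75;

/// Inverse document frequency in the always-positive form used by Lucene-style BM25.
/// `doc_count` is the number of documents, `doc_freq` the number containing the term.
fn idf(doc_count: u64, doc_freq: u64) -> f64 {
    let n = doc_count as f64;
    let df = doc_freq as f64;
    (1.0 + (n - df + 0.5) / (df + 0.5)).ln()
}

/// BM25 contribution of one query term to one document's score.
fn bm25_term_score(
    term_freq: u32,
    doc_len: u32,
    avg_doc_len: f64,
    doc_count: u64,
    doc_freq: u64,
) -> f64 {
    let tf = term_freq as f64;
    // Length normalization: longer-than-average documents are penalized.
    let norm = K1 * (1.0 - B + B * doc_len as f64 / avg_doc_len);
    // Term frequency saturates: each extra occurrence adds less than the last.
    idf(doc_count, doc_freq) * tf * (K1 + 1.0) / (tf + norm)
}
```

A document's score for a multi-term query is the sum of these per-term contributions. Rare terms and shorter documents score higher, and repeated occurrences give diminishing returns.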
Indexing features
- Incremental indexing
- Multi-threaded indexing
- Indexing English Wikipedia reportedly takes under 3 minutes on a desktop
- Indexing settings are configurable through optional term frequency and position indexing
- LogMergePolicy with deletes
- Searcher Warmer API
Storage and fields
- mmap directory
- Single-valued and multivalued fast fields for u64, i64, and f64
- &[u8] fast fields
- Text, i64, u64, f64, date, ip, bool, and hierarchical facet fields
- Document store compression: LZ4, Zstd, or None

Tokenizers and language support

The tokenizer is configurable, and stemming is available for 17 Latin languages
Third-party tokenizers are also supported
- Chinese: tantivy-jieba, cang-jie
- Japanese: lindera, Vaporetto, tantivy-tokenizer-tiny-segmenter
- Korean: lindera and lindera-ko-dic-builder
To implement a tokenizer for Tantivy, depend on the tantivy-tokenizer-api crate
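As a rough picture of what a tokenizer produces, the std-only sketch below splits text on non-alphanumeric characters, lowercases each piece, and records byte offsets and an ordinal position. The `Token` struct and `tokenize` function are hypothetical names for illustration; they are not the traits defined in tantivy-tokenizer-api.

```rust
/// One token, with the byte range it came from and its ordinal position.
#[derive(Debug, PartialEq)]
struct Token {
    text: String,
    offset_from: usize,
    offset_to: usize,
    position: usize,
}

/// Splits on non-alphanumeric characters and lowercases each token.
fn tokenize(input: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut start: Option<usize> = None;
    // A trailing sentinel space flushes a token that ends at end of input.
    let sentinel = std::iter::once((input.len(), ' '));
    for (idx, ch) in input.char_indices().chain(sentinel) {
        if ch.is_alphanumeric() {
            if start.is_none() {
                start = Some(idx);
            }
        } else if let Some(from) = start.take() {
            let position = tokens.len();
            tokens.push(Token {
                text: input[from..idx].to_lowercase(),
                offset_from: from,
                offset_to: idx,
                position,
            });
        }
    }
    tokens
}
```

Positions are what make phrase search possible, since a phrase matches only when its terms appear at consecutive positions. Splitting on separators does not work for text written without spaces, which is one reason dedicated Chinese, Japanese, and Korean tokenizers such as those listed above exist.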

Runtime environment and getting started

Tantivy runs on stable Rust
Supported operating systems are Linux, macOS, and Windows
Startup time is under 10 ms, which makes it suitable for command-line tools
Getting-started resources
- Tantivy's simple search example
- tantivy-cli and its tutorial: an actual command-line interface that makes it easy to create a search engine, index documents, and search through the CLI or a small server with a REST API
- Reference doc for the last released version
To build and test locally, run these commands

git clone https://github.com/quickwit-oss/tantivy.git
cd tantivy
cargo test

Out-of-scope features and the data change model

Distributed search is out of scope for Tantivy
If you need distributed search, the project recommends looking at Quickwit
Tantivy's data is immutable
To modify a document, you delete the existing document and index it again
Documents that have been added become searchable only after commit is called on the IndexWriter
An existing IndexReader has to be reloaded to reflect the changes
The changes are visible only to a newly acquired Searcher
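The visibility rules above can be pictured with a small std-only toy model. This is not Tantivy's API: `ToyWriter`, `ToyReader`, and `ToySearcher` are hypothetical stand-ins for IndexWriter, IndexReader, and Searcher, and the toy reader reloads directly from the writer for simplicity. The sketch only shows the semantics: changes are buffered until commit, a reader sees them after a reload, and a searcher stays pinned to the snapshot it was acquired from.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

/// An immutable, shareable view of the committed documents.
type Snapshot = Arc<BTreeMap<u64, String>>;

/// Toy writer: buffers adds and deletes until `commit` publishes them.
struct ToyWriter {
    committed: Snapshot,
    pending: BTreeMap<u64, Option<String>>, // None marks a delete
}

impl ToyWriter {
    fn new() -> Self {
        ToyWriter { committed: Arc::new(BTreeMap::new()), pending: BTreeMap::new() }
    }
    fn add(&mut self, id: u64, body: &str) {
        self.pending.insert(id, Some(body.to_string()));
    }
    fn delete(&mut self, id: u64) {
        self.pending.insert(id, None);
    }
    /// There is no in-place update: a change is a delete followed by an add.
    fn replace(&mut self, id: u64, body: &str) {
        self.delete(id);
        self.add(id, body);
    }
    /// Builds a new snapshot; the previous one is never mutated.
    fn commit(&mut self) {
        let mut next = (*self.committed).clone();
        for (id, change) in std::mem::take(&mut self.pending) {
            match change {
                Some(body) => { next.insert(id, body); }
                None => { next.remove(&id); }
            }
        }
        self.committed = Arc::new(next);
    }
}

/// Toy reader: remembers the snapshot it saw at the last reload.
struct ToyReader {
    current: Snapshot,
}

impl ToyReader {
    fn open(writer: &ToyWriter) -> Self {
        ToyReader { current: Arc::clone(&writer.committed) }
    }
    fn reload(&mut self, writer: &ToyWriter) {
        self.current = Arc::clone(&writer.committed);
    }
    fn searcher(&self) -> ToySearcher {
        ToySearcher { snapshot: Arc::clone(&self.current) }
    }
}

/// Toy searcher: pinned to one snapshot for its whole lifetime.
struct ToySearcher {
    snapshot: Snapshot,
}

impl ToySearcher {
    fn get(&self, id: u64) -> Option<&str> {
        self.snapshot.get(&id).map(|s| s.as_str())
    }
}
```

In this model a searcher obtained before a commit keeps returning the old document even after the reader is reloaded, which matches the point that changes show up only in a newly acquired Searcher.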

Bindings and use cases

Bindings for use from other languages
- Python: tantivy-py
- Ruby: tantiny
- Other bindings can be found on GitHub, but they may be less well maintained
Examples of Tantivy in use
- seshat: a Matrix message database/indexer
- tantiny: minimal full-text search for Ruby
- lnx: an adaptive, typo-tolerant search engine with a REST API
- Bichon: a lightweight, high-performance Rust email archiver with a WebUI
Companies listed as using Tantivy include Etsy, ParadeDB, Nuclia, Humanfirst.ai, and Element.io

1 comment

GN⁺ 2024-05-28

Opinions from Hacker News

The people who built this library are genuinely impressive. Last year I rebuilt https://progscrape.com [1] on top of it while replacing an old Python2 AppEngine codebase that had been sitting around for a long time; it is an excellent library and extremely fast
It indexes 1 million stories in a few seconds on a Raspberry Pi
I run the full-text search service on a Pi at home. Peak load is only a few rps, and the CPU rarely climbs above a few percent. I load-tested search on the Pi to about 100 rps and it held up. It was a very useful library that I could plug in almost directly, the team responded to bug reports very quickly, and there were very few bugs to begin with
To see how responsive search is on such a small device, try clicking the labels on each story. Queries run almost instantly, and this one hits search shards covering up to 10 years * 12 months: https://progscrape.com/?search=javascript
For a modern project I would recommend looking at this instead of Lucene. It scales this well even on a small ARM64 box, so the experience on larger servers is probably far better
[1] https://github.com/progscrape/progscrape
- It really is a very good library. I am using it in an incremental email backup CLI tool, still a work in progress, for email providers that use JMAP
  I wanted users to be able to search their backups, and since I was using Rust, Tantivy seemed like a perfect fit. Indexing a single email is so fast that I did not even need to move it to a separate thread, and searching across thousands of emails does not seem to be a problem either
  If you need search in a Rust application, Tantivy is worth a look
- Small bug report: https://progscrape.com/?search=grep shows Error: PersistError(UnexpectedError("Storage fetch panicked"))
- A few days ago I used meilisearch for a quick proof of concept; after seeing this repository I will have to take another look at Tantivy
  Basically all I need is full-text search
I recently came across Tantivy inside ParadeDB. ParadeDB is a Postgres extension that aims to replace Elastic
https://github.com/paradedb/paradedb/blob/dev/pg_search/Carg...
I learned about it from listening to “Extending Postgres for High Performance Analytics (with Philippe Noël)”
https://www.youtube.com/watch?v=NbOAEJrsbaM
It is also at the core of the Quickwit project, which handles logs, traces, and soon metrics as well
https://github.com/quickwit-oss/quickwit
I used Quickwit and ClickHouse together in a personal project with multilingual search, and it worked surprisingly well. I finally found a workable combination for Chinese, Japanese, and Korean
https://quickwit.io/docs/guides/add-full-text-search-to-your...
PostgreSQL's to_tsvector never quite fit my use case
SELECT * FROM dump WHERE to_tsvector('english'::regconfig, hh_fullname) @@ to_tsquery('english'::regconfig, 'query');
I hope it does well. I will probably upvote any post that has Tantivy as a keyword automatically
- Handling URL/REST-based indexing and search queries entirely inside SQL is a great design pattern. The same approach is possible with a Postgres FDW
I recently deployed Quickwit, which is Tantivy-based and built by the same team, to production and indexed billions of objects; I am very satisfied with it. Indexing speed is excellent and query latency is competitive
Most importantly, the separation of compute and storage delivered a lot of value. Being able to stand up a new search service over billions of objects sitting in object storage, and even run complex aggregations, without paying for long-running high-performance servers enabled new use cases that would normally be quite expensive
Once a use case justifies high-performance servers, Quickwit also offers an option to cache data on each server to improve performance
A big bonus is that the team is very quick and helpful on Discord
As another resource, there is the Go-based trigram search index used in etsy/hound[0]. It is based on Russ Cox's article and code “Regular Expression Matching with a Trigram Index”[1]
[0] https://github.com/hound-search/hound
[1] http://swtch.com/~rsc/regexp/regexp4.html
Use cases for Lucene alternatives also differ depending on what you need
One thing to note is that adding or removing fields is still not possible: https://github.com/quickwit-oss/tantivy/issues/470
The only way to add a field is to reindex all the data into a different search index
- As a workaround, you can use JSON fields. See the documentation: https://github.com/quickwit-oss/tantivy/blob/main/doc/src/js...
I found Tantivy while looking for an alternative to Meilisearch, which sends telemetry data by default. It looks less like a search engine and more like a search engine builder, but the configuration seems fairly simple [0]
[0]: https://github.com/quickwit-oss/tantivy-cli
- Quickwit also sends telemetry by default: https://quickwit.io/docs/telemetry
- I am interested, but when using it as a Rust library I would prefer to work only with Rust types rather than JSON configuration
  Meilisearch's Java SDK was also nice. There was no need for the CLI or manual configuration; you just pointed it at a database entity and the whole table could be indexed
  It would be nice if Tantivy had something similar
- It can be turned off easily by adding a single command-line argument; rejecting a usable interactive search for that reason seems like a minor objection
Tantivy is also used to provide full-text search in an interesting vector database product called LanceDB: https://lancedb.github.io/lancedb/fts/
Last time I checked, this was only possible through the Python bindings, but as far as I know they are trying to implement Rust bindings natively to support other platforms
A few years ago Elasticsearch was such a resource-hungry monster that I started a personal project out of sheer frustration. That was the case even though my personal computer had more resources than many well-funded startups give their products
I chose Tantivy for two reasons. First, I wanted to build everything in Rust, and second, Tantivy itself. Performance is 10/10, the documentation is top-tier, and the experience of using the library is very good
Unfortunately, the project's scope was too large to handle alone in my spare time, so I dropped it, but Tantivy is still truly excellent
I have been keeping an eye on Tantivy for quite a while. The founders' dedication and the performance Tantivy has achieved recently are impressive
A big round of applause for the whole team. I am confident they will reach their goal
As someone who has used Lucene and Solr a lot, my biggest wish is upgrade support. In general, Lucene, Solr, and ES indexes cannot be upgraded to a new version. It is possible in some cases, but I am leaving that aside for simplicity
In large projects reindexing is very expensive, and sometimes close to impossible
In some cases an upgrade is likely to be truly impossible, for example when the indexing algorithm for a data type has changed in a lossy indexed field. But in many cases all the information is still there, so it would be great if such indexes could be identified and upgraded
