A search engine in 80 lines of Python

(alexmolas.com)

6 points by GN⁺ 2024-02-08 | 1 comment

microsearch is a toy implementation built to understand first-hand how a search engine works internally. The core search engine class is under 80 lines, but the project grows once the crawler, API, and HTML templates are included
Against the backdrop of small websites and blogs being hard to find in the big search engines, the search data was built by collecting posts from 642 RSS feeds
Asynchronous crawling based on asyncio cut collection time from 20 minutes to 20 seconds, and the cleaned body text is stored as Parquet data
Search runs on an inverted index that maps words to per-URL occurrence counts, and results are ranked with content-based BM25 rather than link-based PageRank
A FastAPI UI provides a search box and a results page, but features such as query operators, n-gram indexing, query/document expansion, and indexing during crawling are not there yet

Goal and scope of microsearch

microsearch is a Python search engine implementation published in a GitHub repository
Its aim is not to build a production search engine but to provide a usable toy example that shows how a search engine works on the inside
The search target is closer to the small websites and blogs that are not easily discovered amid Google's SEO competition
The core search engine implementation is under 80 lines, but the whole project is larger once supporting code such as the data crawler, API, and HTML templates is counted
The implementation grew out of the author's effort to understand search engine behavior more deeply while working with Solr and Lucene

RSS-based crawler

To build searchable data, blog RSS feeds were crawled
A total of 642 RSS feeds were used
- About 100 feeds are blogs on ML, data science, mathematics, and similar topics that the author personally reads
- The remaining roughly 500 feeds were taken from surprisetalk's blogs.hn project
The crawling flow extracts post URLs from each RSS feed, downloads the post HTML, and then cleans the body text
HTML cleanup uses BeautifulSoup to remove script and style elements, then tidies line breaks and spaces to convert the page to text
Asynchronous crawling with aiohttp and asyncio brought execution time down from 20 minutes to 20 seconds
The result was assembled into a DataFrame of URLs and cleaned body text and then saved to output.parquet
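
The speedup from asynchronous crawling can be illustrated with a minimal sketch. It assumes a stand-in fetch coroutine that sleeps instead of making a real aiohttp request; the names fetch, crawl_sequential, and crawl_concurrent are hypothetical and not taken from microsearch:

```python
import asyncio


async def fetch(url: str, delay: float = 0.05) -> str:
    """Hypothetical stand-in for an HTTP GET; a real crawler would use aiohttp."""
    await asyncio.sleep(delay)  # simulates waiting on the network
    return f"<html><body>post at {url}</body></html>"


async def crawl_sequential(urls: list[str]) -> list[str]:
    # Each request waits for the previous one: total time is roughly len(urls) * delay.
    return [await fetch(url) for url in urls]


async def crawl_concurrent(urls: list[str]) -> list[str]:
    # All requests wait at the same time: total time is roughly one delay.
    return await asyncio.gather(*(fetch(url) for url in urls))
```

Calling asyncio.run(crawl_concurrent(urls)) returns the pages in the same order as urls. With real network requests the gain depends on server latency and on any rate limits.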

Inverted index structure

The search engine's first core data structure is the inverted index
An inverted index maps keywords to documents, so it can quickly tell which documents contain a specific word
The implementation uses a defaultdict of type dict[str, dict[str, int]]
- the outer key is the word
- the inner key is the URL
- the inner value is the number of times the word appears in the document at that URL
The SearchEngine class holds two internal dictionaries
- _index: stores per-word URL occurrence counts
- _documents: stores the original content per URL
index(url, content) normalizes the content, splits it on spaces, and increments each word's occurrence count for that URL
bulk_index() takes lists of URLs and content and indexes many documents at once
get_urls(keyword) normalizes the keyword and returns the URLs containing that word along with their occurrence counts
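
The structure described above can be sketched roughly as follows. This is a reconstruction from the summary, not the author's code: the method names follow the text, but the exact signatures (for example bulk_index taking a list of (url, content) pairs) and the normalize_string helper are assumptions:

```python
import string
from collections import defaultdict


def normalize_string(text: str) -> str:
    # Assumed helper: punctuation -> spaces, collapse whitespace, lowercase.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return " ".join(text.translate(table).split()).lower()


class SearchEngine:
    def __init__(self) -> None:
        # word -> {url -> occurrence count}
        self._index: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
        # url -> original content
        self._documents: dict[str, str] = {}

    def index(self, url: str, content: str) -> None:
        self._documents[url] = content
        for word in normalize_string(content).split():
            self._index[word][url] += 1

    def bulk_index(self, documents: list[tuple[str, str]]) -> None:
        # Assumed input shape: a list of (url, content) pairs.
        for url, content in documents:
            self.index(url, content)

    def get_urls(self, keyword: str) -> dict[str, int]:
        keyword = normalize_string(keyword)
        # .get avoids creating an empty entry for unknown words.
        return dict(self._index.get(keyword, {}))
```

Looking up a word is then a single dictionary access, which is why the inverted index stays fast as the number of documents grows.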

String normalization and basic search

String normalization replaces punctuation with spaces, collapses duplicate spaces, and converts to lowercase
To reduce case differences, Foo and foo are treated as the same keyword
If two example documents are indexed, a search for foo returns both documents
- Foo: Hello, World! My name is Foo!
- Bar: Hello, World! My name is Bar, I'm not Foo!
At this stage the engine only knows whether a document contains the search term and how many times, so a separate ranking step is needed to decide the result order
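
A minimal sketch of the normalization step, assuming it is built on the standard library's string.punctuation; the function name normalize_string is an assumption for illustration:

```python
import string

# Translation table that maps every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def normalize_string(text: str) -> str:
    without_punct = text.translate(_PUNCT_TO_SPACE)
    # split() with no arguments drops runs of whitespace, so joining with a
    # single space collapses duplicates.
    return " ".join(without_punct.split()).lower()
```

With this helper, the first example document becomes hello world my name is foo. Note that the apostrophe in I'm is also treated as punctuation, so it is indexed as the two tokens i and m.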

BM25 ranker

BM25 is used to sort search results
PageRank ranks documents based on links, whereas BM25 computes a score based on document content
The default BM25 parameters in SearchEngine are k1=1.5 and b=0.75
The class provides the properties needed for the ranking calculation
- posts: the list of indexed URLs
- number_of_documents: the total number of documents
- avdl: the average document length
idf(kw) computes the inverse document frequency of a specific keyword
- N is the total number of documents
- n_kw is the number of documents containing the keyword
- the formula log((N - n_kw + 0.5) / (n_kw + 0.5) + 1) is used
bm25(kw) computes the BM25 score for every URL containing the keyword
search(query) normalizes the query, splits it into words, then sums each word's BM25 scores per URL and returns the result
In the example, searching for foo alone gives the Foo document a higher score than Bar, while searching for foo bar gives the Bar document the higher score
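
The ranking described above can be sketched as standalone functions. This is an illustration under stated assumptions, not the author's class: it uses the standard Okapi BM25 term formula with the idf given in the text, measures document length in words (the original may measure it differently), and the names tokenize, idf, bm25, and search are chosen here for illustration:

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumed tokenizer: lowercase runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def idf(kw: str, term_counts: dict[str, Counter]) -> float:
    n_docs = len(term_counts)
    n_kw = sum(1 for counts in term_counts.values() if kw in counts)
    return math.log((n_docs - n_kw + 0.5) / (n_kw + 0.5) + 1)


def bm25(kw: str, term_counts: dict[str, Counter],
         k1: float = 1.5, b: float = 0.75) -> dict[str, float]:
    # Average document length, in words (an assumption of this sketch).
    avdl = sum(sum(c.values()) for c in term_counts.values()) / len(term_counts)
    idf_score = idf(kw, term_counts)
    scores = {}
    for url, counts in term_counts.items():
        freq = counts[kw]
        if freq == 0:
            continue
        length = sum(counts.values())
        denominator = freq + k1 * (1 - b + b * length / avdl)
        scores[url] = idf_score * freq * (k1 + 1) / denominator
    return scores


def search(query: str, documents: dict[str, str]) -> dict[str, float]:
    term_counts = {url: Counter(tokenize(text)) for url, text in documents.items()}
    totals: dict[str, float] = {}
    for kw in tokenize(query):
        for url, score in bm25(kw, term_counts).items():
            totals[url] = totals.get(url, 0.0) + score
    return totals
```

Here k1 controls how quickly repeated occurrences of a word stop adding to the score, and b controls how strongly longer documents are penalized. On the two example documents, the shorter Foo document wins for foo, while the rarer word bar tips foo bar toward the Bar document.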

FastAPI interface

The search engine is exposed as a small FastAPI app
The app creates a SearchEngine instance and, at startup, reads URLs and content from the Parquet data and indexes them with bulk_index()
There are three main routes
- /: renders the search page and passes in the list of indexed posts
- /results/{query}: runs the query and shows the top 5 URLs on the results page
- /about: renders the about page
Results are sorted by score in descending order and only the top-N URLs are kept
The UI and UX leave plenty of room for improvement, but search is fast and the results are not bad
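
The top-N selection used by the results route can be sketched in a few lines. The helper name top_urls is hypothetical, and the FastAPI wiring is left out so the example stays dependency-free:

```python
def top_urls(scores: dict[str, float], n: int = 5) -> list[str]:
    """Return the n highest-scoring URLs, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [url for url, _ in ranked[:n]]
```

A route handler would pass the dictionary returned by search(query) to a helper like this before rendering the results template.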

Missing features and limitations

The implementation lacks many features expected of real search engines
There are no query operators
- For example, a search that excludes a specific word, as in Google's how to build a search engine -solr, is not supported
There is no n-gram indexing
- There is no way to find only documents where two words appear in a specific order, as in "search engine"
There is no query or document expansion
- Searching for engine does not automatically match documents containing engines
Crawling and indexing are separate
- They could be integrated by indexing each document as soon as it is fetched, and that process could be made asynchronous as well
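
As one illustration of what a missing feature might look like, here is a rough sketch of an exclusion operator on top of an inverted index shaped like the one described earlier. It is not part of microsearch; parse_query and search_with_exclusions are hypothetical names, and the leading minus has to be parsed before any normalization that strips punctuation:

```python
def parse_query(query: str) -> tuple[list[str], list[str]]:
    """Split a query into included terms and terms prefixed with '-'."""
    include, exclude = [], []
    for token in query.lower().split():
        if token.startswith("-") and len(token) > 1:
            exclude.append(token[1:])
        else:
            include.append(token)
    return include, exclude


def search_with_exclusions(query: str, index: dict[str, dict[str, int]]) -> set[str]:
    include, exclude = parse_query(query)
    matches: set[str] = set()
    for term in include:
        matches.update(index.get(term, {}))  # URLs containing an included term
    banned: set[str] = set()
    for term in exclude:
        banned.update(index.get(term, {}))   # URLs containing an excluded term
    return matches - banned
```

The surviving URLs could then be scored with BM25 as before, using only the included terms.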

Next steps

The project gave the author more intuition for how Solr works internally
It also confirmed that asynchronous code makes a big difference for IO-heavy tasks
The next step is to add semantic search to the search engine
The author has already experimented with embedding models and ANN (approximate nearest neighbor search), and adding that feature to microsearch is the next task

1 comment

GN⁺ 2024-02-08

Hacker News opinions

This is really cool. For local testing, I am building a fairly fast BM25 search engine in Pandas: https://github.com/softwaredoug/searcharray
The reason for using Pandas is that BM25 alone is not enough, and I want to easily compute and combine other factors such as recency and popularity with pandas/numpy
Phrase search is the hard part, by the way. Phrase matching has a lot of edge cases, and things like slop have to be taken into account too. Position info also has to be compressed into as little memory as possible: https://github.com/softwaredoug/searcharray/blob/main/searcharray/utils/roaringish.py
- I handled phrase matching in a toy project: https://github.com/vasilionjea/lofi-dx/blob/main/test/search/inverted-search.test.ts#L140
  I think I tested it fairly thoroughly, but feedback would be welcome. I delta-encoded the position info and then encoded it in base36
- I'm curious whether adding sentiment analysis helped phrase processing or hurt it instead. Phrases are hard to handle, and I wonder what could be done to improve performance
- I'm curious how you found this post and commented so quickly. Do you use some search monitoring tool that scans the front page for keywords of interest, or was it just a coincidence
That's right. Most of the hard parts of search lie in handling data scale. The logic itself can be surprisingly simple, or easy to build
Of course it can also be made endlessly complex, but this project does a good job of stripping out the non-essential parts. If you treat it not as a problem of scaling up the search engine but as one of making the data physically smaller or raising the signal-to-noise ratio, you can get quite far
src/microsearch/engine.py contains code like SearchEngine.__init__(self, k1: float = 1.5, b: float = 0.75), but I have no idea what k1 or b are, and there isn't a single comment in the entire file
Are comments out of fashion these days? _documents looks like its key is a URL and its value is that URL's content, but I could be wrong. This could have been a good resource for learning to build a search engine and extending it, but the lack of documentation makes the code quality a bit disappointing
- That part is explained in the article, and the article itself serves as the code's documentation. The BM25 link leads to the mathematical background, and a bit more searching on BM25 parameters turns up related articles on how to choose them
- To make the article title attention-grabbing, the line count had to be kept as low as possible ;)
  Joking aside, I agree that docs and code are generally better kept together. It's just that this is an educational project, so the code and docs were kept separate and the decision was made to document the code in the blog post
- I'm on mobile so I can't look in detail, but k1 and b are standard weight values used in TF-IDF or BM25, and here they are the BM25 ones
  Comments would be useful, but for people familiar with this problem these names are also instantly recognizable
- k1 and b are the tuning parameters of the BM25 ranking function. They are not new names invented by the original author; almost every implementation and textbook uses these same variable names
  For anyone who knows the information retrieval field, naming them k1 and b is precisely what makes the code understandable: https://en.wikipedia.org/wiki/Okapi_BM25
- Whenever I see a pattern like a: float, I'm reminded of Rich Hickey's talk about how "we don't need types, we need proper names"
  I really dislike the trend, which seems to have come from Go, of using unexplained one-letter variable names and abusing the type system as a naming aid. Names can convey semantic information about what a program does, so they should be used properly
I don't understand the point of bragging about lines of code when, with external dependencies in use, it isn't the total \r\n count
There is no SI unit for measuring a codebase, but I think cognitive load should be measured somehow
- It's not an official standard, but our team sometimes cites https://grugbrain.dev and says "this code is not grug" or "this code is quite grug"
- The 80-line search engine itself does not use external dependencies. It only imports collections, math, and string, all of which are in the standard library
  To be more accurate, though, it should perhaps be called a "search engine engine". The crawler and interface are not included in those 80 lines but are needed in some form, and the given implementation adds considerably to the lines and libraries. Still, those libraries are not related to the search engine itself. If you start counting general dependencies like pandas or fastapi, you might have to count the millions of lines of the operating system, the network card firmware, and hardware complexity as well
- Is there any reason not to celebrate that the industry has reached the point where this can be built in 80 lines?
- It means something here. If it were "a search engine in 4000 lines of Python" most people would just move on, but 80 lines is small enough to be worth a look
- The old-school way is cyclomatic complexity
I like it. A recommendation engine in under 20 lines is also possible to go with the search engine. If you keep session logs of clicked URLs, you can look at a sliding window after the current URL in each session and build a recommendation list that gives nearby links more weight
Sorting the recommendation results and keeping only the top N gives a list of recommended URLs for a specific URL. With a little tweaking, typed search terms and clicked URLs can be mixed in the logs to produce spelling suggestions as well
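
The commenter's idea could look roughly like this. It is a sketch under assumptions the comment does not spell out: sessions are lists of clicked URLs in order, the window covers the next three clicks, and a click at distance d contributes a weight of 1/d. The function names are hypothetical:

```python
from collections import defaultdict


def build_recommendations(sessions: list[list[str]], window: int = 3):
    """Map each URL to weighted counts of the URLs clicked shortly after it."""
    weights = defaultdict(lambda: defaultdict(float))
    for session in sessions:
        for i, current in enumerate(session):
            following = session[i + 1 : i + 1 + window]
            for distance, nxt in enumerate(following, start=1):
                if nxt != current:
                    # Closer clicks count more: 1, 1/2, 1/3, ...
                    weights[current][nxt] += 1.0 / distance
    return weights


def recommend(url: str, weights, n: int = 5) -> list[str]:
    ranked = sorted(weights.get(url, {}).items(),
                    key=lambda item: item[1], reverse=True)
    return [candidate for candidate, _ in ranked[:n]]
```

The 1/d weighting is only one possible choice; any decreasing function of distance would express the same "nearby links count more" idea.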
Very cool and educational. Just don't deploy it :-)
I once needed something similar, but at a somewhat larger scale, with tens of thousands of documents, and the answer was sqlite as always. Structurally it is the same as what is shown here, except that someone else had written the inverted index persistence layer
- I use SQLite FTS almost everywhere, and it has never let me down
- Really, it even has the very same formula. This comment gave me a kind of "thrill of understanding"
If you search Google with quotes, as in "search engine", it shows only results where both words appear in that order
That is true in at least some cases, but sadly not always. What advanced users want is "grep for the web", not "something that tells you what Google wants to show"
- I can say with confidence that very few people truly want "grep for the web". A web grep is clearly worse even than a search engine that does very modest query expansion
  It is true that Google takes too much liberty when interpreting queries, but there are many processing steps that any search engine is better off doing than not doing. The problem with Google Search today is that it is hard to understand why those results are showing up, and heavy reliance on embeddings for string comparison seems to be the cause. It is frustrating when "cat food" matches "dog restaurant": they may be semantically close in embedding space, but that does not line up with human reasoning
Calling this 80 lines of code while using external libraries such as feedparser and bs4 does not seem fair to me
- If it were built on top of elasticsearch I would agree, but if the actual search engine part is implemented in those 80 lines, it seems fair to me. The imported libraries are things you are right not to implement yourself
  Sometimes "build your own search engine" posts are really guides to installing searxng or yacy, but this is not one of those
- If the dependency is very common and mainstream, it seems fine to me
Nice. Adding fuzzy search would probably not be very hard either. For example, so that searching for "hackrnew" matches "hackernews": a way to find results whose prefix edit distance is less than or equal to some threshold
The basic idea is to keep another inverted index, but with keys that are the n-grams (usually 3-grams) of the words in the document collection, and postings that are the words or word IDs containing that n-gram. The lemma that PED(x, y) <= delta implies |N(x) ∩ N(y)| >= |N(x)| - n ∙ delta can be used. Compute the n-grams of the input x, fetch the postings for each n-gram, and merge duplicates to get the number of n-grams shared with each candidate word y. Compute the actual PED only if that number is large enough to meet the criterion, and skip it if it is smaller; this cuts the expensive computation dramatically
Then query the existing index with the resulting word list. I used this same approach earlier when building a client-side JS fuzzy search engine at https://dont.watch/. If you look at the JS code, the inverted index and the compressed n-gram index are passed directly as JS files. The actual search engine is about 300 lines of JS with no external dependencies, and it has only very basic heuristics for improving search results
- How much does the index size grow with that approach?
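
The n-gram filtering idea from this thread can be sketched as follows. This is an illustration, not the commenter's JS code: it assumes words are left-padded with n-1 "$" characters before n-grams are taken, treats N(x) as a set, and uses hypothetical function names:

```python
from collections import Counter, defaultdict


def ngrams(word: str, n: int = 3) -> set[str]:
    padded = "$" * (n - 1) + word  # left padding keeps prefix information
    return {padded[i : i + n] for i in range(len(word))}


def build_ngram_index(vocabulary: list[str], n: int = 3) -> dict[str, set[str]]:
    index = defaultdict(set)  # n-gram -> words containing it
    for word in vocabulary:
        for gram in ngrams(word, n):
            index[gram].add(word)
    return index


def prefix_edit_distance(x: str, y: str) -> int:
    """Smallest edit distance between x and any prefix of y."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        cur = [i]
        for j, cy in enumerate(y, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cx != cy)))
        prev = cur
    return min(prev)  # best match over all prefixes of y


def fuzzy_lookup(query: str, vocabulary: list[str], index: dict[str, set[str]],
                 n: int = 3, delta: int = 1) -> list[str]:
    grams = ngrams(query, n)
    threshold = len(grams) - n * delta
    if threshold <= 0:
        candidates = set(vocabulary)  # the lemma cannot prune anything here
    else:
        shared = Counter()
        for gram in grams:
            for word in index.get(gram, ()):
                shared[word] += 1
        candidates = {w for w, count in shared.items() if count >= threshold}
    # Only the surviving candidates pay for the expensive PED computation.
    return sorted(w for w in candidates if prefix_edit_distance(query, w) <= delta)
```

The words returned by fuzzy_lookup would then be looked up in the main inverted index as ordinary keywords.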
