Fuzzy duplicate detection using Jaccard similarity and MinHash

(blog.nelhage.com)

1 point by GN⁺ 2024-07-06 | 1 comment

In large document collections, web crawling can fetch the same page several times or pick up slightly modified copies, so Jaccard similarity and MinHash become a practical way to find “nearly identical” documents.
Jaccard similarity converts each document into a feature set and computes intersection size / union size; pairs above a threshold are treated as fuzzy duplicates, although this relation is generally not transitive.
Comparing every pair of documents costs O(n²) in the corpus size, so MinHash summarizes each document as a fixed-size signature that gives a probabilistic approximation of similarity.
With k hash functions, similarity can be estimated as the fraction of positions at which two document signatures hold the same value, and conditions such as min-wise independence matter when choosing the hash functions.
Using the whole signature, or part of it, as a grouping key adjusts the probability that similar documents land in the same bucket, and the choice of n-gram or tokenization method determines detection sensitivity and cost.

The difficulty of fuzzy duplicate detection

The goal is to find documents in a large collection that are not exactly identical but nearly so.
- When the web is crawled over a period of time, the same page can be fetched several times with slightly different metadata.
- There can also be many slightly modified versions of a page.
The basic approach is to define a similarity function S(A, B) between two documents and to treat pairs whose value exceeds a threshold S_crit as fuzzy duplicates.
“Near equality” is generally not a transitive relationship.
- A and B, and B and C, can each be similar above the threshold.
- At the same time, A and C can fall below the threshold.
- This is why large-scale fuzzy duplicate detection is harder than exact duplicate detection.

Definition of Jaccard similarity

The Jaccard index expresses the similarity of two finite sets as intersection size / union size.

\[ J(A, B) = \frac{|A \cap B|}{|A \cup B|} \]
If two sets are similar, most of their elements are shared, so the union is only slightly larger than either set and the intersection only slightly smaller.
If two sets are completely disjoint, the intersection size is 0, so the Jaccard similarity is 0.
If two sets are identical, the intersection and the union are both that same set, so the Jaccard similarity is 1.
Real documents come in forms such as Unicode strings, so each document first has to be converted into a feature set.
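
As a minimal sketch of the definition above, the function below computes the Jaccard index of two Python sets. The function name, the convention that two empty sets count as identical, and the word-level example features are all assumptions made for illustration.

```python
def jaccard(a: set, b: set) -> float:
    """Return |a ∩ b| / |a ∪ b|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


# Hypothetical feature sets: the words of two short documents.
doc_a = {"the", "quick", "brown", "fox"}
doc_b = {"the", "quick", "red", "fox"}

# 3 shared features out of 5 distinct features overall.
print(jaccard(doc_a, doc_b))  # 0.6
```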

The scalability problem of comparing all pairs

Once documents are converted into feature sets, the definition of finding pairs with high Jaccard similarity is simple in itself.
However, comparing every pair of documents makes the cost grow as O(n²) in the corpus size.
Exact duplicate detection avoids this cost by hashing documents and grouping those that fall into the same hash bucket.
Fuzzy duplicate detection needs a similar workaround, which in this field is called a locality-sensitive hash.
A technique suited to this purpose exists for Jaccard similarity, and its core is MinHash.

Approximating Jaccard similarity with MinHash

Instead of comparing whole sets every time, MinHash approximates Jaccard similarity from a small signature computed in advance for each document.
The basic idea is sampling: pick an element uniformly at random from the union and check whether it is also in the intersection. The chance that it is equals |A ∩ B| / |A ∪ B|, which is exactly J(A, B).
In practice a good hash function H(x) stands in for a random permutation of the features, and for each set the feature with the smallest hash value is stored.

\[ a_{min} \leftarrow \min_{x \in A} H(x) \]

\[ b_{min} \leftarrow \min_{x \in B} H(x) \]
Because min is associative, each document's minimum hash value can be precomputed independently.
Ignoring hash collisions, the two minima are equal exactly when the smallest-hashing element of A ∪ B lies in A ∩ B, so the probability that they match equals the Jaccard similarity of the two sets.
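
One way to check the claim that the two minima match with probability J(A, B) is to simulate it. The sketch below uses explicit random orderings of the union rather than a hash function; the function name, the example sets, and the trial count are illustrative assumptions.

```python
import random


def min_rank_match_rate(a: set, b: set, trials: int = 20000, seed: int = 0) -> float:
    """Fraction of random orderings in which a and b have the same first element."""
    rng = random.Random(seed)
    universe = sorted(a | b)
    matches = 0
    for _ in range(trials):
        rng.shuffle(universe)  # a fresh random permutation of the union
        rank = {x: i for i, x in enumerate(universe)}
        # Each set's "minimum" is its element that comes first in this ordering.
        if min(a, key=rank.__getitem__) == min(b, key=rank.__getitem__):
            matches += 1
    return matches / trials


a = set(range(0, 8))   # {0, ..., 7}
b = set(range(2, 10))  # {2, ..., 9}
exact = len(a & b) / len(a | b)  # 6 shared out of 10 distinct
estimate = min_rank_match_rate(a, b)
print(f"exact J = {exact:.3f}, simulated match rate = {estimate:.3f}")
```

The simulated rate should land close to the exact value, with the gap shrinking as the number of trials grows.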

Multiple hash functions and the signature vector

With a single hash function, the only estimate available for two documents is a boolean “same/different”.
With k different hash functions, each document can be summarized as a vector of k MinHash values.

\[ A_{sig} = (\min_{x \in A} H_1(x), \min_{x \in A} H_2(x), \dots, \min_{x \in A} H_k(x)) \]
Jaccard similarity is approximated by the fraction of positions at which the two signatures hold the same value.

\[ J(A, B) \approx \frac{1}{k} \sum_{i=1}^{k} \mathbf{1}\left[ A_{sig}[i] = B_{sig}[i] \right] \]
Choosing the hash function family is a subtle matter.
- The goal is to approximate a fully random permutation of the feature space.
- A real hash function family can express only a tiny fraction of all possible permutations.
- Unwanted correlations must be avoided; the relevant property is called min-wise independence.
- This problem has been studied quite thoroughly, and efficient solutions exist in the literature.
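
The signature and the estimator above can be sketched as follows. This is an illustration only: deriving the k hash functions from salted BLAKE2b is an assumption made for convenience, not a family chosen for min-wise independence, and the function names and k = 128 are hypothetical.

```python
import hashlib
from typing import Iterable, List, Sequence


def minhash_signature(features: Iterable[str], k: int = 128) -> List[int]:
    """Return k minimum hash values, one per salted hash function H_i."""
    items = [f.encode("utf-8") for f in features]
    if not items:
        raise ValueError("features must be non-empty")
    signature = []
    for i in range(k):
        salt = i.to_bytes(8, "little")  # a distinct salt gives a distinct H_i
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(item, digest_size=8, salt=salt).digest(), "big"
                )
                for item in items
            )
        )
    return signature


def estimate_jaccard(sig_a: Sequence[int], sig_b: Sequence[int]) -> float:
    """Fraction of signature positions at which the two values agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


set_a = {f"feature-{i}" for i in range(100)}
set_b = {f"feature-{i}" for i in range(20, 120)}  # true J = 80 / 120
print(estimate_jaccard(minhash_signature(set_a), minhash_signature(set_b)))
```

The estimate is noisy: its standard error is roughly sqrt(J(1 - J) / k), so a larger k buys accuracy at the cost of more hashing and a longer signature.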

Finding candidate pairs across the whole corpus

Reducing each document to a fingerprint of k hash values makes it possible to approximate Jaccard similarity efficiently.
The remaining problem is how to find highly similar documents across the whole corpus without looking at every pair.
The strategy is to group documents by some key and compare only within a group.
The grouping key should put similar documents together with high probability and keep dissimilar documents apart as far as possible.
Using the whole MinHash signature as the key
- The simplest approach is to use all k MinHash values together as a single grouping key.
- Two documents are treated as fuzzy duplicates only if all MinHash values match.
- The GPT-3 paper used fuzzy duplicate removal in its dataset preparation pipeline; from the quoted wording, this is understood to mean Spark's MinHashLSH implementation with 10 hashes.
- The advantages of this approach are simplicity and efficiency.
  - Grouping by a single high-cardinality byte string is easy to scale horizontally.
  - It resembles a basic primitive of data processing tools, such as the “shuffle” between map and reduce in MapReduce.
- If two documents have Jaccard similarity J(A, B) and all k values must match, the collision probability for a single pair is J(A, B)^k.
  - With k = 10, documents with a similarity of about 0.6 or less almost never collide (0.6^10 ≈ 0.006).
  - The match probability only becomes substantial near a similarity of 0.95 (0.95^10 ≈ 0.60).
  - For finding very close sibling documents, this can be enough.
- This J^k calculation applies to a single pair of documents.
  - When there are many very similar documents, the pairwise probabilities are not independent.
  - In practice, very similar documents usually end up in two or three buckets or fewer, and almost all duplicates are found.
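
To get a feel for how sharply J^k cuts off, the single-pair collision probability can be tabulated directly. The function name and the chosen similarity values are illustrative; k = 10 follows the example in the text.

```python
def full_signature_collision(j: float, k: int = 10) -> float:
    """Probability that all k MinHash values match for one pair with similarity j."""
    return j ** k


for j in (0.5, 0.6, 0.8, 0.9, 0.95, 0.99):
    print(f"J = {j:.2f}  P(collision) = {full_signature_collision(j):.3f}")
```

With k = 10 this gives about 0.006 at J = 0.6, about 0.35 at J = 0.9, and about 0.60 at J = 0.95, which is why the whole-signature key mostly catches near-identical pairs.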

Looser duplicate detection

If the aim is to find documents with similarity above 0.8 or 0.7 as well as those close to 1, using the whole signature as the key can be too strict.
Using only some of the k MinHash values as the grouping key raises the chance of a collision at lower similarity.
- For example, after grouping by the first 4 MinHash values, the actual similarity can be estimated within each bucket from the full set of MinHash values.
There is a limit to how far the hash count can be reduced.
- For a single key built from r ≥ 1 hashes the collision probability is J^r, which never exceeds J.
- If r gets too small, false collisions can become far too frequent.
Instead, several keys can be built for each document so that it is placed in several buckets.
- For example, compute k = 20 hashes and place the document in b = 4 buckets, with each key built from r = 5 hashes.
The probability that two documents collide in at least one bucket is:

\[ p = 1 - (1 - J^r)^b \]
In the example with 4 groups of 5 hashes each, the point at which the collision probability reaches 50% shifts to around J = 0.7.
When r and b are both greater than 1, the resulting curve is generally S-shaped, which gives room to tune between sensitivity, recall, and performance cost.
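
The banding scheme can be sketched as below: one function evaluates p = 1 - (1 - J^r)^b, and another splits a signature into b keys of r values each. The function names and the use of SHA-1 to turn each band into a compact key are assumptions for illustration, not part of the original description.

```python
import hashlib
from typing import List, Sequence


def band_collision_probability(j: float, r: int, b: int) -> float:
    """Probability that two documents share at least one of b keys of r hashes each."""
    return 1.0 - (1.0 - j ** r) ** b


def band_keys(signature: Sequence[int], r: int, b: int) -> List[str]:
    """Split a signature of length r * b into b bucket keys."""
    if len(signature) != r * b:
        raise ValueError("signature length must equal r * b")
    keys = []
    for band in range(b):
        chunk = tuple(signature[band * r:(band + 1) * r])
        digest = hashlib.sha1(repr(chunk).encode("utf-8")).hexdigest()
        # The band index is part of the key so different bands never collide.
        keys.append(f"{band}:{digest}")
    return keys


for j in (0.5, 0.6, 0.7, 0.8, 0.9):
    print(f"J = {j:.1f}  p = {band_collision_probability(j, r=5, b=4):.3f}")
```

Documents are then grouped by each key in turn, and any pair that shares a bucket becomes a candidate whose similarity can be checked against the full signatures. For r = 5 and b = 4 the formula gives p ≈ 0.52 at J = 0.7.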

Relationship to HyperLogLog

The core trick of MinHash resembles sketch algorithms such as HyperLogLog.
HyperLogLog hashes every element of a stream and stores a running maximum of the number of leading zeros in the hash value.
Both techniques map input elements to a uniform distribution with a hash function, then compute a running extreme to estimate distributional properties from a constant-size summary.
If the bit order is reversed, HyperLogLog can be seen as computing a running minimum of log2(H(x)), while MinHash uses the minimum of H(x) itself.
The two structures are dual in a sense.
- Combining two HyperLogLog structures gives an estimate of the size of the union of two sets.
- Comparing two MinHash structures gives an estimate of the relative size of the intersection of two sets.
Combining the two structures yields a sketch that can handle questions about intersections and unions of arbitrary sets.
- This idea was known by 2013, and related literature and follow-up research exist.

Ways to represent documents as sets

To use Jaccard and MinHash, string documents first have to be converted into feature sets.
Whichever method is used, the document can be normalized during preprocessing.
- conversion to a standard Unicode normalization form
- case folding
- collapsing runs of whitespace
- similar transformations
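
A minimal normalization pass covering the steps listed above might look like this. The choice of NFKC is an assumption (the text only asks for a standard normalization form), and trimming leading and trailing whitespace is an extra step added here as one of the “similar transformations”.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Normalize a document before feature extraction."""
    text = unicodedata.normalize("NFKC", text)  # assumption: NFKC as the standard form
    text = text.casefold()                      # case folding, stronger than lower()
    text = re.sub(r"\s+", " ", text)            # collapse runs of whitespace
    return text.strip()


print(normalize("  Hello\t\nWORLD  "))  # hello world
```
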
n-grams or shingles
- A document can be represented as the set of all n-grams that occur in it.
- The large-scale text processing literature also uses the term “shingle”, but here it plays the same role as an n-gram.
- Choosing the value of n involves a trade-off.
  - A small value compares documents more coarsely.
    - For example, most English text can look quite similar from a bigram point of view.
  - A large value produces more distinguishing features and a larger set.
    - A value that is too large can reduce sensitivity, but performance problems are likely to appear first.
  - According to Mining of Massive Datasets §3.2.2, values of n between 5 and 9 appear to be a common choice in many applications.
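
The effect of n can be explored with a small character-level shingling function. This is a sketch: the function names, the handling of texts shorter than n, and the two sample sentences are all illustrative assumptions.

```python
def char_ngrams(text: str, n: int = 5) -> set:
    """Return the set of all character n-grams (shingles) in text."""
    if len(text) < n:
        # Texts shorter than n contribute themselves as a single feature.
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


first = "the cat sat on the mat"
second = "the dog sat on the log"
for n in (2, 5, 9):
    similarity = jaccard(char_ngrams(first, n), char_ngrams(second, n))
    print(f"n = {n}: {similarity:.2f}")
```

Running it on a pair of your own documents with a few values of n shows how quickly the measured similarity drops as the shingles get longer.
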
Splitting into words or tokens
- The input can also be split into “words” or “tokens”, which are then used as features.
- The GPT-3 paper excerpt mentions Spark's standard tokenizer, which appears to refer to pyspark.ml.feature.Tokenizer; it lowercases the input and splits it on whitespace.
- A more sophisticated NLTK tokenizer can also be used.
- A hybrid approach that takes n-grams of tokens after tokenization is possible too.
- Individual tokens carry more entropy than bytes or characters, so smaller values of n are used in this case.
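
The hybrid approach can be sketched with a plain-Python tokenizer followed by token n-grams. The tokenizer below only approximates the lowercase-and-split-on-whitespace behaviour described above; it is not Spark's implementation, and the function names and the default n = 2 are assumptions.

```python
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on runs of whitespace (an approximation, not Spark's Tokenizer)."""
    return text.lower().split()


def token_ngrams(text: str, n: int = 2) -> set:
    """Return the set of n-grams over tokens, each represented as a tuple."""
    tokens = tokenize(text)
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


print(sorted(token_ngrams("The quick brown Fox")))
```

Tuples are hashable, so the resulting set can be passed to any Jaccard or MinHash routine once each tuple is converted to a string or bytes.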

1 comment

GN⁺ 2024-07-06

Hacker News opinions

It is often overlooked that set-based metrics such as Jaccard similarity (Tanimoto coefficient) or the F1 score (Dice coefficient) can be used on fuzzy sets in the same way.
You only need to pick a suitable T-Norm / T-Conorm pair to express intersection and union for fuzzy sets, and there are countless kinds of these.
If anything, being able to choose the pair that matches the semantics you want is an advantage.
I worked on this topic in medical image segmentation validation, where the segmentation output and the ground truth are probabilistic/fuzzy rather than binary masks: https://link.springer.com/chapter/10.1007/978-3-319-46723-8_..., https://ora.ox.ac.uk/objects/uuid:dc352697-c804-4257-8aec-08...
The usual practice is to threshold at 0.5 to get a binary set and then use the binary variants of Jaccard/Dice, but this probably costs the validation operator about two decimal digits of precision.
In other words, an algorithm is presented as 0.001 better than the state of the art while the fact that the validation operator's error range is 0.1 is ignored.
A client wrote their own Python implementation of this technique to deduplicate citizen records in a large French government database, and it worked well.
Today I would probably tell them to use datasketch: https://pypi.org/project/datasketch/
Looking around, new tools on this topic keep appearing. For example, https://pypi.org/project/rensa/ is a more specialised and faster version of datasketch's MinHash, written in Rust with a little Python on top.
- For deduplicating people, the Fellegi-Sunter model is also a strong approach. Splink is a free Python library implementation of it for large datasets, and it looks like parts of the two approaches could be combined.
  Full disclosure: I am its lead author.
  I have also written an interactive tutorial explaining how it works: https://github.com/moj-analytical-services/splink, https://www.robinlinacre.com/intro_to_probabilistic_linkage/
- There is also gaoya. I wrote it; it is written in Rust and also provides Python bindings.
  datasketch is great, but its performance was not sufficient for my use case, and gaoya is used in operational large-scale clustering systems: https://github.com/serega/gaoya
What a coincidence. I just implemented a MinHash system that someone might find interesting.
The problem is finding pseudo-inverses of many suitable submatrices of a large square matrix.
Using matrix identities such as Woodbury and Banachiewicz, the inverse of a “nearby” submatrix can be updated to compute the new inverse cheaply.
Store the inverses computed so far keyed by their row/column indices, and for each new submatrix find a nearby existing inverse to use as the starting point for the update.
I solved this with MinHash, applying min-value hashing to the indices to raise the probability that nearby matrices get the same hash.
My implementation uses a multi-resolution hash so that search selectivity can be adjusted as the number of precomputed inverses grows.
To add some background the article leaves out: as far as I knew, this technique was created in Google's early days to deduplicate the crawl set.
It is also interesting that building an LLM and building a plain web text index are surprisingly similar.
You can read about it in detail in Jeffrey Ullman's free book “Mining of Massive Datasets”, which explains many of the great and impressive techniques used at the time to index the whole internet.
The relevant material can be found for free by searching for “chapter 3 pdf mmds ullman”.
edit: it turns out I was wrong; according to Wikipedia it was invented at DEC for AltaVista: https://en.wikipedia.org/wiki/MinHash
Still, Ullman's book has a good explanation, and it also describes how it was used at Google.
When I tried to understand MinHash and its variants it did not quite click, so I am building an online visualization tool: https://websla.sh/tools/minhash
It is not finished yet and I also want to show things like the Jaccard similarity calculation, but you can already enter several strings and see for yourself what a “minhash” actually is.
Using hashing or small neural networks together with a vector search engine and Tanimoto/Jaccard is a very common strategy for deduplicating large datasets.
It can be more sensible than using MapReduce jobs with linear complexity.
Google has a nice project that uses the 500k-parameter RETSim model and the USearch engine: https://github.com/google/unisim
I currently have a similar problem in PostgreSQL. There are 600000 feed_items and the schema is (feed_item_id uuid, author varchar, content text, guid varchar, link varchar, title varchar, summary text, feed_id integer).
In particular, the content and summary columns of some news items are very similar but not identical.
Given two such news items, I want to reduce them to one; is there a good way to do that?
- I implemented a MinHash-like system in BigQuery and was able to calculate cosine similarity between all Stack Overflow items in reasonable time.
  The rough process is as follows.
  1. Concatenate all text fields and split them into arrays of n-grams, for example units of 2~n characters.
  2. Declare global arrays A and B of length n and fill them with random 32~64-bit integers.
  3. Hash each n-gram to a 32~64-bit integer; then, for each position i, multiply that hash by A[i], take the remainder modulo B[i], and keep the minimum over all n-grams.
    The goal is to get, for each row, a “minhashed” integer array as long as the arrays from step 2. If the global array length is declared as 64, each row's MinHash array also has length 64.
  4. Bucketize the hash array by summing N consecutive MinHash values with a window function, for example 4 consecutive values.
    If this is done correctly, flatten the array and keep it as the “source row”; joining the dataset with itself on each bucketized MinHash value then adds a “target row” column.
    Group by the source/target columns and count occurrences to estimate how similar two rows are.
    Basically, the more buckets two items share, the more similar they are; you decide the point from which to calculate the actual pairwise Jaccard or cosine similarity.
- Text embeddings with cosine similarity may be useful here: https://simonwillison.net/2023/Oct/23/embeddings/
- Using MinHash avoids the full O(N^2) distance matrix, but with only 600000 items it may also be feasible to brute-force the whole matrix for simplicity.
  The question is how much of a time budget you have.
- If you assume the two items cover very similar keywords, Jaccard distance will fit well.
  If you assume the two items share very similar text, Levenshtein distance is worth trying.
- Have an LLM build an inverted index for the items, but force it to keep the cardinality low.
  Then you can use Jaccard similarity.
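
The BigQuery recipe in the replies above can be mirrored in plain Python to see what each step does. This is a sketch under stated assumptions, not the commenter's SQL: CRC32 as the n-gram hash, non-overlapping windows of 4, a bucket key that includes the window position, and all function names are choices made here for illustration.

```python
import random
import zlib
from collections import defaultdict
from itertools import combinations


def char_ngrams(text, n=3):
    # Step 1: split the text into character n-grams.
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def minhash_row(text, A, B, n=3):
    # Step 3: for each position i, keep min over n-grams of (hash * A[i]) % B[i].
    hashes = [zlib.crc32(g.encode("utf-8")) for g in char_ngrams(text, n)]
    return [min((h * a) % b for h in hashes) for a, b in zip(A, B)]


def bucketize(signature, width=4):
    # Step 4: sum consecutive values (assumption: non-overlapping windows).
    return [(i, sum(signature[i:i + width])) for i in range(0, len(signature), width)]


def shared_bucket_counts(rows, A, B, n=3, width=4):
    # Self-join on bucket key, then count shared buckets per (source, target) pair.
    index = defaultdict(set)
    for row_id, text in rows.items():
        for key in bucketize(minhash_row(text, A, B, n), width):
            index[key].add(row_id)
    counts = defaultdict(int)
    for ids in index.values():
        for pair in combinations(sorted(ids), 2):
            counts[pair] += 1
    return dict(counts)


rng = random.Random(0)
A = [rng.randrange(1, 2**32) for _ in range(64)]  # step 2: global random arrays
B = [rng.randrange(1, 2**32) for _ in range(64)]
rows = {
    1: "Breaking news: markets rally",
    2: "Breaking news: markets rally",
    3: "Weather update for Tuesday",
}
counts = shared_bucket_counts(rows, A, B)
print(counts.get((1, 2), 0), counts.get((1, 3), 0))
```

Identical rows share all 16 buckets (64 values in windows of 4), and pairs with a high shared-bucket count are the ones worth passing to an exact pairwise comparison.
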
I enjoyed the article. Our team at NVIDIA recently released a GPU-accelerated version of the fuzzy deduplication algorithm described, and I think this community will be interested in it too.
The repository is here: https://github.com/NVIDIA/NeMo-Curator/
The docs for the fuzzy deduplication script are here: https://docs.nvidia.com/nemo-framework/user-guide/latest/dat...
There is also a Python example: https://github.com/NVIDIA/NeMo-Curator/blob/main/examples/fu...
I would love to hear feedback.
I cannot understand this kind of technique by reading about it in an article, but it clicks immediately once I feed my own data into a working code example a few times and watch what happens inside.
I first learned this technique from Douglas Eck: https://research.google/people/douglas-eck/
It was used at Google for song clustering, and I remember him talking about hashing and random vectors.
I was confused at the time, because it seemed that an optimization with less randomness would work better.
- The main intuition, at least for me, is that if you break an object into a pile of very small pieces and create n ways of sorting those piles, then for similar objects the same piece comes out on top in many of the sortings.
  Add banding and a little probability, and Jaccard similarity can be approximated on huge datasets cheaply and in a way that is very easy to parallelize.
Viewed as a document clustering or dataset deduplication technique, how would a “throw machine learning at the problem” approach compare with this simpler discrete algorithmic approach in quality and performance?
For example, generating document vector embeddings with a pre-trained LLM encoder, putting those vectors into a vector DB, and then clustering with k-means.
- An LLM is only one of many ways to generate embeddings.
  To run k-means you would still have to choose a distance function such as Jaccard, and k-means is probably not ideal for near-duplicates.
  MinHash can be used as preprocessing for k-means to speed it up.
  I do not think a vector DB would help much.
  If you have tens of millions of documents it could be used to speed up MinHash sketch lookups, but overall it is probably a heavier option than needed.
- I have seen this kind of approach work better than LSH.
  Each time a document is embedded, an approximate nearest neighbor search is run before adding it, so it is O(N) like MinHash.
  Vector indexes such as HNSW and PQ give a better performance/quality trade-off than SimHash LSH, the cosine-distance equivalent of MinHash.
  Quality depends on how you define a near-duplicate and on which embedding model you use.
  Modern models work well, and with labeled data they can be improved further by fine-tuning.
  The main downside is the extra cost of embedding every document, which weighs heavily for long documents in particular.
  However, smaller models, better optimization, and faster hardware have brought this cost down very quickly.
