Saving Gmail to SQLite

(github.com/marcboeker)

2 points by GN⁺ 2025-05-11 | 1 comment

Gmail to SQLite is a Python application that syncs Gmail messages into a local SQLite database so they can be analyzed and archived
By default it runs an incremental sync that downloads only new messages; the full sync option downloads all messages and also detects messages that have been deleted
Messages are imported in parallel across multiple threads, with automatic retries using exponential backoff for errors and graceful CTRL+C handling for shutdown
Running it requires Python 3.8 or later, a Google Cloud project with the Gmail API enabled, and an OAuth 2.0 credentials.json file
The stored data includes sender, recipients, labels, body, size, read status, outgoing status, and deleted status, so Gmail usage patterns can be analyzed directly with SQL

A local sync tool for Gmail messages

Gmail to SQLite is a Python application that stores Gmail messages in a local SQLite database
Its goal is to make Gmail data available for analysis and archiving
Type hints are used throughout the codebase, which allows static type checking

Sync behavior and reliability

The default sync is incremental, so only new messages are downloaded
With the --full-sync option, all messages are synced and messages deleted from Gmail are also detected
Messages are imported in parallel across multiple threads, which improves performance
Error handling includes automatic retries with exponential backoff
Pressing CTRL+C starts a graceful shutdown
- No new tasks are accepted
- Running tasks are allowed to finish
- Progress for completed tasks is saved
- The process exits normally
Pressing CTRL+C a second time exits immediately
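The retry and shutdown behavior described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: `retry_with_backoff`, `sync_all`, the `fetch` callable, and the `stop` event are hypothetical names, and the delays and attempt count are assumed values.

```python
import random
import threading
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on any exception with exponentially growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each time (1x, 2x, 4x, ...), plus a little jitter.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    raise AssertionError("not reached")


def sync_all(
    message_ids: Iterable[str],
    fetch: Callable[[str], T],
    stop: threading.Event,
    workers: int = 4,
) -> List[T]:
    """Fetch messages in parallel; once stop is set, submit nothing new."""
    results: List[T] = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = []
        for message_id in message_ids:
            if stop.is_set():
                break  # graceful shutdown: stop accepting new tasks
            futures.append(
                pool.submit(retry_with_backoff, lambda m=message_id: fetch(m))
            )
        # Already submitted tasks are allowed to finish.
        for future in as_completed(futures):
            results.append(future.result())
    return results
```

In a real command-line tool, a handler registered with `signal.signal(signal.SIGINT, ...)` could set `stop` on the first CTRL+C and exit immediately on the second.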

Installation and prerequisites

The runtime environment requires Python 3.8 or later
A Google Cloud project with the Gmail API enabled is required
The OAuth 2.0 credentials file credentials.json must be present in the project root
To install, clone the repository and then run uv sync to install the dependencies
To set up Gmail API authentication, create or select a project in the Google Cloud Console, enable the Gmail API, then create OAuth 2.0 credentials for a Desktop application and save them as credentials.json

Command usage

The default incremental sync is run like this

python main.py sync --data-dir ./data

# or: uv run main.py sync --data-dir ./data

Use --full-sync for a full sync with deleted message detection

python main.py sync --data-dir ./data --full-sync

To sync only one specific message, use sync-message with --message-id

python main.py sync-message --data-dir ./data --message-id MESSAGE_ID

To only detect and mark deleted messages, use sync-deleted-messages

python main.py sync-deleted-messages --data-dir ./data

The number of worker threads can be set with --workers; the default is the number of CPU cores

python main.py sync --data-dir ./data --workers 8

The command line arguments are as follows
- command: required, one of sync, sync-message, or sync-deleted-messages
- --data-dir: required, the directory where the SQLite database will be stored
- --full-sync: optional, forces a full sync
- --message-id: required for sync-message, the ID of the specific message to sync
- --workers: optional, the number of worker threads
- --help: shows help for the commands and options
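The argument list above maps naturally onto `argparse`. The sketch below is a hypothetical reconstruction based only on the options listed here, not the project's actual parser; `build_parser` and `parse_args` are illustrative names.

```python
import argparse
import os
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the options listed above (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(
        prog="main.py", description="Sync Gmail messages to SQLite"
    )
    parser.add_argument(
        "command", choices=["sync", "sync-message", "sync-deleted-messages"]
    )
    parser.add_argument(
        "--data-dir", required=True, help="directory for the SQLite database"
    )
    parser.add_argument("--full-sync", action="store_true", help="force a full sync")
    parser.add_argument("--message-id", help="ID of the message to sync")
    # Default taken from the description above: the number of CPU cores.
    parser.add_argument("--workers", type=int, default=os.cpu_count())
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # --message-id is only mandatory for the sync-message command.
    if args.command == "sync-message" and not args.message_id:
        parser.error("--message-id is required for sync-message")
    return args
```

For example, `parse_args(["sync", "--data-dir", "./data", "--workers", "8"])` yields a namespace with `command="sync"`, `data_dir="./data"`, and `workers=8`.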

SQLite schema and analysis examples

The messages table in the generated SQLite database contains the fields needed for analyzing Gmail messages
- message_id: unique Gmail message ID
- thread_id: Gmail thread ID
- sender: sender information as JSON, with name and email
- recipients: recipients as JSON, grouped by type (to, cc, bcc)
- labels: array of Gmail labels
- subject: message subject
- body: plain text message body
- size: message size in bytes
- timestamp: message timestamp
- is_read: read status
- is_outgoing: whether the message was sent by the user
- is_deleted: whether the message has been deleted from Gmail
- last_indexed: timestamp of the last sync
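As a rough picture of what such a table could look like, here is a sketch that creates it with Python's built-in `sqlite3` module. The column types, the JSON shapes, and the helper names `open_db` and `insert_message` are assumptions made for illustration; the project's real schema may differ.

```python
import json
import sqlite3

# Column names follow the field list above; the types are assumed.
SCHEMA = """
CREATE TABLE IF NOT EXISTS messages (
    message_id   TEXT PRIMARY KEY,
    thread_id    TEXT,
    sender       TEXT,     -- JSON object, e.g. {"name": ..., "email": ...}
    recipients   TEXT,     -- JSON object keyed by "to", "cc", "bcc"
    labels       TEXT,     -- JSON array of label names
    subject      TEXT,
    body         TEXT,
    size         INTEGER,  -- bytes
    timestamp    TEXT,     -- assumed to be ISO-8601 text
    is_read      INTEGER,  -- 0 or 1
    is_outgoing  INTEGER,  -- 0 or 1
    is_deleted   INTEGER,  -- 0 or 1
    last_indexed TEXT
)
"""

INSERT_SQL = """
INSERT OR REPLACE INTO messages (
    message_id, thread_id, sender, recipients, labels, subject, body,
    size, timestamp, is_read, is_outgoing, is_deleted, last_indexed
) VALUES (
    :message_id, :thread_id, :sender, :recipients, :labels, :subject, :body,
    :size, :timestamp, :is_read, :is_outgoing, :is_deleted, :last_indexed
)
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def insert_message(conn: sqlite3.Connection, message: dict) -> None:
    """Store one message; the JSON-typed fields are serialized to text."""
    row = dict(message)
    for key in ("sender", "recipients", "labels"):
        row[key] = json.dumps(row[key])
    conn.execute(INSERT_SQL, row)
```

Storing the JSON fields as text is what allows the queries in this section to reach into them with SQLite's JSON functions.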
Email counts can be aggregated by sender

SELECT sender->>'$.email', COUNT(*) AS count
FROM messages
GROUP BY sender->>'$.email'
ORDER BY count DESC

Aggregating unread emails by sender shows which senders send the most unneeded email

SELECT sender->>'$.email', COUNT(*) AS count
FROM messages
WHERE is_read = 0
GROUP BY sender->>'$.email'
ORDER BY count DESC

Using strftime, email counts can be aggregated by year, month, day, weekday, or hour

SELECT strftime('%Y', timestamp) AS period, COUNT(*) AS count
FROM messages
GROUP BY period
ORDER BY count DESC
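The query above groups by year only. The other periods mentioned differ just in the strftime format string, which the following sketch parameterizes. It assumes the timestamp column holds text in a format SQLite's date functions understand (for example ISO-8601); `PERIOD_FORMATS` and `count_by_period` are illustrative names.

```python
import sqlite3
from typing import List, Tuple

# strftime format for each period mentioned above.
PERIOD_FORMATS = {
    "year": "%Y",
    "month": "%Y-%m",
    "day": "%Y-%m-%d",
    "weekday": "%w",  # 0 = Sunday ... 6 = Saturday
    "hour": "%H",
}


def count_by_period(conn: sqlite3.Connection, period: str) -> List[Tuple[str, int]]:
    """Count messages per period, most frequent first."""
    fmt = PERIOD_FORMATS[period]
    return conn.execute(
        "SELECT strftime(?, timestamp) AS period, COUNT(*) AS count "
        "FROM messages GROUP BY period ORDER BY count DESC, period",
        (fmt,),
    ).fetchall()
```

Passing the format as a bound parameter keeps a single query string for all five groupings.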

Newsletters can be grouped by sender by searching for mails whose body contains newsletter or unsubscribe

SELECT sender->>'$.email', COUNT(*) AS count
FROM messages
WHERE body LIKE '%newsletter%' OR body LIKE '%unsubscribe%'
GROUP BY sender->>'$.email'
ORDER BY count DESC

Total mail size per sender, and the senders of the largest emails, can be viewed in MB

SELECT sender->>'$.email', sum(size)/1024/1024 AS size
FROM messages
GROUP BY sender->>'$.email'
ORDER BY size DESC
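One detail worth noting: if size is stored as an integer, `sum(size)/1024/1024` uses integer division in SQLite, so any sender below 1 MB shows up as 0. The sketch below divides by a floating-point constant instead to keep fractional megabytes. It uses `json_extract`, which is equivalent to the `->>` operator here but also works on SQLite versions older than 3.38; `size_by_sender_mb` is an illustrative name.

```python
import sqlite3
from typing import List, Tuple


def size_by_sender_mb(conn: sqlite3.Connection) -> List[Tuple[str, float]]:
    """Total size per sender in MB, rounded to two decimals, largest first."""
    return conn.execute(
        "SELECT json_extract(sender, '$.email') AS email, "
        "ROUND(SUM(size) / 1048576.0, 2) AS size_mb "  # 1048576 = 1024 * 1024
        "FROM messages GROUP BY email ORDER BY size_mb DESC"
    ).fetchall()
```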

The number of mails sent to yourself can be calculated by combining a condition on the recipients JSON with one on the sender email

SELECT count(*)
FROM messages
WHERE EXISTS (
  SELECT 1
  FROM json_each(messages.recipients->'$.to')
  WHERE json_extract(value, '$.email') = 'foo@example.com'
)
AND sender->>'$.email' = 'foo@example.com'

For received mail, total size per sender can be viewed in descending order

SELECT sender->>'$.email', sum(size)/1024/1024 as total_size
FROM messages
WHERE is_outgoing=false
GROUP BY sender->>'$.email'
ORDER BY total_size DESC

Deleted messages are queried with the condition is_deleted=1

SELECT message_id, subject, timestamp
FROM messages
WHERE is_deleted=1
ORDER BY timestamp DESC

1 comment

GN⁺ 2025-05-11

Hacker News comments

I'm curious why certain headers were pulled out separately in the schema. recipients, subject, and sender could be kept as JSON fields, but everything, including the rest of the message headers, could also go into a single field named headers
If the reason is performance, headers could be stored as a single JSON blob with the needed fields turned into generated columns. For example, subject could be defined as json_extract("headers", '$.Subject') and then indexed
That model was quite powerful, because users could use ALTER TABLE to add whatever indexed generated columns their queries needed. For DKIM status as well, "Dkim-Signature" could be extracted into a column, indexed, and then used in a GROUP BY
- In fact you don't even need generated columns; SQLite supports expression indexes. For example, after CREATE INDEX subjectidx ON messages(json_extract(headers, '$.Subject')), the index is used wherever that expression is referenced
  After creating such an index, I found it more useful to create a VIEW that uses the expression than to ALTER the main table to add a generated column
- Adding an index for a one-off query feels like a bad habit
  In general it seems better to extract the columns that will be used consistently. That applies even more to a stable target like email headers; a headers column can make schema changes a bit easier, but it trades write-time pain for read-time pain and leaves room for silent failures
- I often use the same pattern in PostgreSQL when scaling a system. I start by building the table around the fields I know I'll need and put the remaining metadata in a JSON column
  About two months in, once the fields that are actually needed become clear, I populate them from the JSON and either have the API keep them up to date or create a view. This helped a lot in avoiding the growing pains of "just put everything in MongoDB" or "just keep it on the file system", and it didn't cost much either
- The dkim column is defined as NOT NULL; I'm curious what happens if an email message has no Dkim-Signature header
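The ideas in this thread (a single headers JSON column, an expression index, a view, and an indexed generated column) can be tried out with a few lines of Python. This is a sketch of the commenters' suggestions and not the project's schema: the table layout, `messages_view`, `dkimidx`, and `build_headers_db` are assumptions, and generated columns require SQLite 3.31 or later. The dkim column is left nullable here, since `json_extract` returns NULL when the header is absent.

```python
import sqlite3


def build_headers_db() -> sqlite3.Connection:
    """Create a table with one JSON headers column plus derived indexes."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE messages (message_id TEXT PRIMARY KEY, headers TEXT)")
    # Expression index: used wherever this exact expression appears in a query.
    conn.execute(
        "CREATE INDEX subjectidx ON messages(json_extract(headers, '$.Subject'))"
    )
    # A view exposes the expression under a friendly column name.
    conn.execute(
        "CREATE VIEW messages_view AS "
        "SELECT message_id, json_extract(headers, '$.Subject') AS subject "
        "FROM messages"
    )
    # Generated column added after the fact with ALTER TABLE (must be VIRTUAL).
    conn.execute(
        "ALTER TABLE messages ADD COLUMN dkim TEXT GENERATED ALWAYS AS "
        "(json_extract(headers, '$.\"Dkim-Signature\"')) VIRTUAL"
    )
    conn.execute("CREATE INDEX dkimidx ON messages(dkim)")
    return conn


def query_plan(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> str:
    """Return the EXPLAIN QUERY PLAN detail text for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " ".join(str(row[-1]) for row in rows)
```

Calling `query_plan` on a query that filters by `json_extract(headers, '$.Subject')` should mention `subjectidx`, which is a quick way to confirm the index is actually being used.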
A few years ago I built a visualization tool for large email collections such as Gmail: https://github.com/terhechte/postsack
- Pretty cool. It's like a disk usage visualization tool, but it seems to focus on total mail volume rather than disk usage
  I'm curious whether there's a size option too. I'd like to see which sender is using the most of my storage. Also, the website's SSL certificate has expired
- Looks interesting. The gmvault link in the README is now dead; I wonder if this is the right one: https://github.com/gaubert/gmvault
- Looks interesting. I once did something similar myself with qdirstat, but the email had to be laid out in a particular way, such as date folders, and it was hard to re-slice by another criterion afterward
  On the other hand, qdirstat cache files are easy to generate, so they can be used to visualize many kinds of file-like targets
You can no longer log in even with an app-specific password and have to create an OAuth client and go through the OAuth flow, which is really unfortunate. Even though it's my own email, Google has taken away the open standard I used to access it
- The amount of spam arriving at my free Gmail address, and the amount of spam coming from Gmail servers to non-Gmail accounts, is gradually pushing me toward de-Googling
  In particular, I keep hearing more often that my freelance email lands in spam on recipients' systems. I don't know how to break the Google ecosystem habit, though
- I'm curious why you consider app-specific passwords an open standard but not OAuth
- With an app password you get full IMAP access, so it's not clear what you mean
I recently tried to integrate Gmail into my app https://github.com/rumca-js/Django-link-archive, but it took far too long, and I decided that supporting Gmail wasn't worth it
Gmail to SQLite describes credentials setup in 6 steps, but that wasn't my experience. After the 6 steps, Google said the app wasn't published and had to be published; then said I couldn't keep it as an internal app because I'm not a Workspace user; and after I switched to an external app, said it couldn't be used before verification
The verification process asked for a domain, an address, other details, justification for the scopes, and even a video explaining how the app is used, and I was told that verifying the submitted data would take time. The whole setup is a maze, and making users jump through Google's hurdles is too much
- The process Google puts people through just to get an API key is completely absurd. Does anyone know why it's this bad
- Just use plain old IMAP with an app password. There's no need to jump through Google's hoops
I'd like to know what the best open source Gmail backup software is right now. Has anyone built a setup that also preserves attachments?
- There's https://github.com/GAM-team/got-your-back. It's open source, and because it can resume, backups and restores do eventually complete
  For reference, there's also https://www.mailstore.com/en/products/mailstore-home/. It isn't open source, but it has an indexed GUI that's good for searching mail locally; resume only works for backups, so large restores usually fail
- This may not be exactly the answer you want, but Google has a service called Takeout that lets you request and download a backup of your data from all Google services, including Gmail
  I have a reminder set to run it every few months and update my local backup. As far as I remember, it arrives as a gzip-compressed mbox file
- If you use an IMAP client and set it to offline/download mode, you can download all the data and store it locally. In Evolution I think it's called "offline mode", but the name may differ in Thunderbird or other clients
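For the Takeout route mentioned in this thread, Python's standard `mailbox` module can read an mbox file directly. The sketch below loads a few headers into SQLite; it assumes the file has already been decompressed, and the `takeout_messages` table and `import_mbox` function are illustrative names, not part of Gmail to SQLite. Bodies and attachments are left out for brevity.

```python
import mailbox
import sqlite3
from typing import Optional


def _text(value: object) -> Optional[str]:
    """Header values may be None or Header objects; normalize to str."""
    return None if value is None else str(value)


def import_mbox(mbox_path: str, conn: sqlite3.Connection) -> int:
    """Copy basic headers from an mbox file into SQLite; return messages read."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS takeout_messages ("
        "message_id TEXT PRIMARY KEY, sender TEXT, subject TEXT, date TEXT)"
    )
    box = mailbox.mbox(mbox_path, create=False)
    count = 0
    try:
        for msg in box:
            # INSERT OR IGNORE skips messages whose Message-ID is already stored.
            conn.execute(
                "INSERT OR IGNORE INTO takeout_messages VALUES (?, ?, ?, ?)",
                (
                    _text(msg["Message-ID"]),
                    _text(msg["From"]),
                    _text(msg["Subject"]),
                    _text(msg["Date"]),
                ),
            )
            count += 1
    finally:
        box.close()
    conn.commit()
    return count
```

Because duplicates are ignored, re-running the import against a newer Takeout export only adds messages that were not seen before.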
I think this should be called something like "IMAP to SQLite" rather than "Gmail to SQLite". I don't see why it's tied to one specific email provider
- Because it really is Gmail-specific. It uses OAuth and probably API access
  IMAP is much harder and considerably slower, and it's also subject to Google's bandwidth limits
- I tried for years to back up a Gmail account over IMAP, including with Gmail-specific tools, and never succeeded once. Even the best sync tool ran for a month and then got stuck at a specific mail it couldn't fetch
  I don't know whether it timed out because that mail was in very cold storage. So I can see how an approach using Google's proprietary API might work better
  These days Google Takeout includes mbox, works properly, and is quite fast, but it doesn't give continuous updates. In the end I moved to another mail provider, Infomaniak, and was thankful I'd already been using my own mail domain
It would be nice to be able to enable full-text search as well
- For something run by a search company, Gmail's full-text search seems surprisingly bad
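Full-text search over the local database is something SQLite can provide through its FTS5 extension. The following is a sketch of how that might be layered on top of a messages table with message_id, subject, and body columns; it is not a feature of Gmail to SQLite, `messages_fts`, `build_search_index`, and `search` are illustrative names, and it assumes the SQLite build used by Python has FTS5 compiled in.

```python
import sqlite3
from typing import List


def build_search_index(conn: sqlite3.Connection) -> None:
    """(Re)build a separate FTS5 index from the messages table."""
    conn.execute("DROP TABLE IF EXISTS messages_fts")
    # message_id is stored for lookups but not tokenized.
    conn.execute(
        "CREATE VIRTUAL TABLE messages_fts "
        "USING fts5(message_id UNINDEXED, subject, body)"
    )
    conn.execute(
        "INSERT INTO messages_fts (message_id, subject, body) "
        "SELECT message_id, subject, body FROM messages"
    )


def search(conn: sqlite3.Connection, query: str, limit: int = 20) -> List[str]:
    """Return matching message IDs, best match first."""
    rows = conn.execute(
        "SELECT message_id FROM messages_fts "
        "WHERE messages_fts MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return [row[0] for row in rows]
```

Rebuilding the index from scratch is the simplest option for a sketch; keeping it in step with incremental syncs would need triggers or an extra update step.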
I built the same thing yesterday. The reason was that I wanted to list recipient emails by domain. The code is messy, but here it is: https://github.com/hugoferreira/gmail-sqlite-db
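With the messages table described earlier, listing recipients by domain could be done with `json_each`. This sketch assumes the recipients column has the shape `{"to": [{"name": ..., "email": ...}]}` used by the self-sent query above, and counts only "to" recipients; `recipients_by_domain` is an illustrative name and is unrelated to the linked repository.

```python
import sqlite3
from typing import List, Tuple


def recipients_by_domain(conn: sqlite3.Connection) -> List[Tuple[str, int]]:
    """Count "to" recipients per email domain, most frequent first."""
    return conn.execute(
        """
        SELECT lower(substr(addr, instr(addr, '@') + 1)) AS domain,
               COUNT(*) AS count
        FROM (
            -- One row per recipient in the "to" array of each message.
            SELECT json_extract(r.value, '$.email') AS addr
            FROM messages, json_each(messages.recipients, '$.to') AS r
        )
        WHERE addr LIKE '%@%'
        GROUP BY domain
        ORDER BY count DESC, domain
        """
    ).fetchall()
```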
This reminds me a bit of Archiveopteryx, a PostgreSQL-based IMAP server: https://github.com/aox/aox
I always liked AOX's schema, but never really managed to put it to proper use. My main use case wasn't a daily-driver IMAP server but mail analysis and search
- It also reminds me of Manitou-Mail. It's a robust, dedicated PostgreSQL-based mail client that can be used as a daily driver, and it's quite solid: https://www.manitou-mail.org/
I'd like to know what the bandwidth cost would be here. As someone with a Gmail account over 40GB, I want to know whether transferring it with this tool would incur charges
That's easy to work around. Google Takeout is probably free, so you could download first and parse the file. Even so, this tool seems faster for getting started right away
