ggsql - A grammar of graphics for SQL

(opensource.posit.co)


A visualization tool built on SQL syntax that joins data querying and graph composition into a single flow through clauses such as VISUALIZE, DRAW, PLACE, SCALE, and LABEL
Columns are mapped to visual properties and layers are composed, so scatter plots, bar charts, histograms, boxplots, and annotation elements can all be built within the same structure
SQL query results are passed directly as visualization input, and some layers fetch only aggregates by running a single SQL query, which is an advantage when processing large datasets
It is designed to be a small, focused executable that can be used without an R or Python runtime, which also makes it well suited to integration with code-based reporting tools and AI analysis assistants
The current version is an alpha release, and the planned extensions include a high-performance writer, themes, interactivity, a language server, a formatter, and spatial data support

Introducing ggsql

ggsql is an implementation of the grammar of graphics based on SQL syntax, adding structured visualization capabilities to SQL
- It can be used in Quarto, Jupyter notebooks, Positron, VS Code, and similar environments
It is designed so that SQL users can write visualization code in a way that feels familiar to them
- SQL's declarative, composable clause structure is carried over to the visualization grammar
The article explains ggsql's core syntax alongside its motivation and usage examples

Basic examples

First plot
- A scatter plot can be made from the built-in penguins dataset
  - VISUALIZE bill_len AS x, bill_dep AS y FROM ggsql:penguins
  - DRAW point
- VISUALIZE maps data columns to visual properties, and DRAW point creates a point layer using that default mapping
- Adding just species AS color distinguishes the points by color according to category
  - VISUALIZE bill_len AS x, bill_dep AS y, species AS color FROM ggsql:penguins
  - DRAW point
- Adding DRAW smooth places a regression-line layer on top of the point layer
  - The color mapping by species still applies, so a separate line is drawn for each species
- Rather than offering predefined plot types, the approach is to combine modular components
- Keeping the same structure, the visualization can be changed into a completely different form
  - VISUALIZE island AS x, species AS color FROM ggsql:penguins
  - DRAW bar
Full example
- The upper part is an ordinary SQL query, and everything from VISUALIZE onward is separated out as the visualization query
  - The example uses the DuckDB backend
- The SQL part uses ROW_NUMBER() and QUALIFY to keep only the most recent mission for each name in astronauts.parquet
- Two sets are then combined
  - year_of_selection - year_of_birth is computed as age and given the category Age at selection
  - year_of_mission - year_of_birth is computed as age and given the category Age at mission
  - The two results are joined with UNION ALL
- In the visualization query, DRAW histogram is used after the mappings age AS x, category AS fill
  - SETTING binwidth => 1, position => 'identity' is supplied
- PLACE rule adds annotations at the precomputed mean positions
  - x => (34, 44), linetype => 'dotted'
- PLACE text adds text annotations
  - x => (34, 44, 60)
  - y => (66, 49, 20)
  - label contains Mean age at selection = 34, Mean age at mission = 44, and John Glenn was 77 on his last mission - the oldest person to travel in space!
  - hjust => 'left', vjust => 'top', offset => (10, 0) are supplied
- SCALE fill TO accent converts the values of the fill mapping to the accent color palette
- The LABEL clause controls the title, subtitle, x-axis label, and legend label
  - title How old are astronauts on their most recent mission?
  - subtitle Age of astronauts when they were selected and when they were sent on their mission
  - x-axis Age of astronaut (years)
  - fill => null
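
The SQL half of the full example can be approximated as follows. This is only a sketch under stated assumptions: it uses Python's built-in sqlite3 module instead of DuckDB, the three sample rows are invented, and the `name` column is an assumed identifier for "each name". SQLite has no QUALIFY clause, so the ROW_NUMBER() filter is applied in a separate CTE.

```python
import sqlite3

# Invented stand-in for astronauts.parquet; the `name` column is an assumption.
rows = [
    # (name, year_of_birth, year_of_selection, year_of_mission)
    ("A", 1930, 1960, 1962),
    ("A", 1930, 1960, 1970),
    ("B", 1950, 1985, 1990),
]

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE astronauts (name TEXT, year_of_birth INT, "
    "year_of_selection INT, year_of_mission INT)"
)
con.executemany("INSERT INTO astronauts VALUES (?, ?, ?, ?)", rows)

QUERY = """
WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY name ORDER BY year_of_mission DESC
           ) AS rn
    FROM astronauts
),
latest AS (
    -- SQLite has no QUALIFY, so the most-recent-mission filter is a separate step.
    SELECT * FROM ranked WHERE rn = 1
)
SELECT year_of_selection - year_of_birth AS age, 'Age at selection' AS category
FROM latest
UNION ALL
SELECT year_of_mission - year_of_birth AS age, 'Age at mission' AS category
FROM latest
ORDER BY category, age
"""

result = con.execute(QUERY).fetchall()
print(result)
```

The result has one `age` column and one `category` column, which is the shape the age AS x, category AS fill mapping expects.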

Visualization query structure

The part before VISUALIZE is pure SQL, and instead of being returned as a table, its result is passed directly as visualization input
Tables or CTEs created in the SQL part can be referenced in the visualization query
If the data is already in a form suitable for visualization, the SQL part can be omitted
- VISUALIZE year_of_selection AS x, year_of_mission AS y FROM 'astronauts.parquet'
VISUALIZE or VISUALISE marks the end of the SQL query and the start of the visualization query
A mapping links columns to visual properties, known as aesthetics
- In the example, age is tied to the x-axis position and category to the fill color
DRAW adds layers to the visualization
- There are simple types such as point, and types such as histogram that require a binning aggregation to be computed
- Layers are rendered in the order in which they are defined
PLACE is a sibling clause of DRAW, dedicated to annotations, that uses literal values instead of table data
- The example visualization is made of three layers: a histogram layer, a rule annotation layer, and a text annotation layer
- A layer does not always correspond to just one graphical object; it can render many individual objects of the same type
SCALE is the clause that converts data values into values appropriate to the aesthetic
- It is not limited to turning string categories into actual colors; it can also set continuous transformations, define break points, and select scale types such as ordinal or binned
LABEL adds and changes text labels such as the title, subtitle, axis titles, and legend titles
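
To make the "binning aggregation" behind a histogram layer concrete, here is a minimal sketch of fixed-width binning. The helper name `bin_counts`, the sample ages, and the choice to key each bin by its left edge are all illustrative assumptions; the source does not describe how ggsql aligns its bins.

```python
import math
from collections import Counter


def bin_counts(values, binwidth):
    """Count values per fixed-width bin, keyed by the bin's left edge."""
    counts = Counter(math.floor(v / binwidth) * binwidth for v in values)
    return sorted(counts.items())


ages = [31, 34, 34, 36, 44, 45]  # invented sample values
histogram = bin_counts(ages, binwidth=5)
print(histogram)
```

A point layer needs no such step because each row maps to one mark, whereas a histogram layer only needs the (bin, count) pairs.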

Taking a step back

The example above shows a lot of syntax, but it also gathers the important parts of the core grammar in one place
Many visualization queries can be far simpler than this
A boxplot of birth year by gender, using astronauts.parquet, is given as an example
- VISUALIZE sex AS x, year_of_birth AS y FROM 'astronauts.parquet'
- DRAW boxplot
The code may be longer than in other plotting systems, but it is more structured, composable, and self-descriptive
These qualities make it easier for users to understand the behavior of any kind of plot, and they also benefit future LLM coding assistants
The same relationship can easily be turned into a jittered scatter plot
- DRAW point
- SETTING position => 'jitter'
Setting the jitter to follow the distribution of the data gives behavior similar to a violin plot
- SETTING position => 'jitter', distribution => 'density'
This syntactic structure and composability ease the iterative process of exploratory analysis and visualization design

Why ggsql

Five reasons for building ggsql are given
- Supporting data analysts and data scientists who work primarily with SQL
- The close fit between SQL and the grammar of graphics
- Building a powerful code-based visualization tool that does not need a full programming language such as R or Python
- The strong SQL-handling ability of LLMs and the potential for new visualization interfaces
- The intention to apply 18 years of ggplot2 development experience to a new foundation
Hello, SQL user
- R and Python received most of the attention during the data science revolution, but SQL kept its role as the dependable foundation of data work
- Many data workers use only SQL, or mainly SQL
- According to the article, the visualization options currently available to them are mostly suboptimal
  - Exporting the data and using R or Python
  - Using GUI-based BI tools, which offer little support for reproducibility
  - Using in-query visualization tools, which the article does not consider powerful or ergonomic enough
- The ggsql syntax is built so that SQL users can understand it immediately
  - It takes advantage of their expectations of composable, declarative clauses
- ggsql aims not only to improve the visualization workflow but also to bring SQL users into the Quarto-based ecosystem of code-based reporting and sharing
declarative data transformation, declarative visualization
- SQL is a domain-specific language for working with relational data stored in one or more tables
- SQL syntax is based on the concepts of relational algebra and offers a structured way of thinking about data manipulation
- SQL semantics define a set of modular operations that are declarative rather than functional
- The grammar of graphics is a theoretical framework that breaks data visualization concepts down into modular components
- Tools such as ggplot2 turn these concepts into a practical implementation
- The grammar of graphics likewise defines a set of modular operations that are more declarative than functional
- The two systems approach their respective domains in very similar ways, and together they can form a natural and powerful end-to-end pipeline from raw data to final visualization
No runtime, no problem
- ggplot2 requires R to be installed, and plotnine requires Python
- By contrast, a visualization tool built as a single focused executable has clear advantages
  - Embedding a small executable in another tool is easier than bundling R or Python or requiring them to be installed
  - The smaller scope makes it easier to confine malicious or accidental code execution to a sandbox
- These properties make ggsql more attractive for integration with AI analysis assistants and with code-based reporting tools that run code in varied environments
- Leaving an interpreted language behind brings some constraints, but the gains are also large
- The most important advantage is that, thanks to the strict structure, the backend can run the entire data pipeline for each layer as a single SQL query
  - For example, for a bar plot over 10 billion transaction rows, only the count for each bar is fetched from the data warehouse, not all 10 billion rows
  - The same principle applies to more complex layer types such as boxplots and density plots
- This differs from most visualization tools, which first materialize all the data and only then compute and plot
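
The difference between materializing all rows and pushing the aggregation into the database can be illustrated with a small sketch. It is not ggsql's implementation: the `transactions` table, its `region` column, and the 1,000 invented rows are assumptions, and Python's built-in sqlite3 stands in for a data warehouse.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE transactions (id INTEGER PRIMARY KEY, region TEXT)")

# Invented sample data: 1,000 rows spread over three regions.
regions = ["north", "south", "west"]
con.executemany(
    "INSERT INTO transactions (region) VALUES (?)",
    [(regions[i % 3],) for i in range(1000)],
)

# Materialize-first: every row is fetched before any counting happens.
all_rows = con.execute("SELECT region FROM transactions").fetchall()

# Pushdown: the database aggregates and returns one row per bar.
bar_rows = con.execute(
    "SELECT region, COUNT(*) AS n FROM transactions "
    "GROUP BY region ORDER BY region"
).fetchall()

print(len(all_rows), "rows fetched without pushdown")
print(len(bar_rows), "rows fetched with pushdown:", bar_rows)
```

With pushdown, the amount of data transferred depends on the number of bars rather than on the number of underlying rows.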
SELECT pod_door FROM bay WHERE closed
- LLMs have proven very effective at translating natural language into SQL
- The article expects the same level of ability to carry over to ggsql
- In querychat, natural-language visual data exploration through ggsql is already possible
- ggsql is a safer and more lightweight runtime than R or Python, which gives more confidence when deploying coding agents in production environments
We are now wise beyond our years
- Eighteen years of developing and maintaining ggplot2 amount to eighteen years of accumulated thinking about data visualization grammar, usability, and design
- Not all of that knowledge can be folded back into ggplot2
  - Past decisions and user expectations have to be respected, and any changes must come very slowly
- ggsql is a blank slate
  - A project built new from scratch
  - A system designed for an environment with no established expectations about visualization tools
- The article says that this starting point brought freedom and energy to the development process, and that this also shows in the user experience

The future

The current version is an alpha release and is not yet complete
A non-exhaustive list of planned additions is given
- A new high-performance writer written from scratch in Rust
- Theme infrastructure
- Interactivity
- An end-to-end deployment flow from Posit Workbench or Positron to Connect
- A complete ggsql language server and code formatter
- Spatial data support
What this means for ggplot2 development
- The article notes that ggplot2 users may view the ggsql announcement with both excitement and concern
- Is Posit leaving ggplot2 behind to focus on ggsql? The answer is no
- ggplot2 is already very mature and stable, but it will continue to be supported and expanded
- The article hopes that the ggsql development process will also help bring new features to ggplot2

Learn more

To try ggsql right away, see the installation guide and tutorial in the Getting started section of the ggsql website
The documentation pages list all of ggsql's supported capabilities
The article also says that feedback on the user experience is welcome
