Vortex - A High-Performance Columnar File Format

xguru · 2024-10-17T11:01:27+09:00

(github.com/spiraldb)

"The LLVM of columnar file formats"
A columnar file format that comes with a toolkit for working with compressed Apache Arrow arrays in memory, on disk, and over the network
An ambitious successor to Apache Parquet that supports 100-200x faster random access reads and 2-10x faster scans, while keeping roughly the same compression ratio and write throughput as Parquet with zstd
- Also supports very wide tables (tens of thousands of columns) and decompression on the GPU
Vortex is designed to play the same role for columnar file formats that Apache DataFusion plays for query engines
- That is, it aims to be highly extensible, extremely fast, and batteries-included

[!Caution] > Still under active development

Key features:
- Logical Types - a schema definition that makes no assumptions about physical layout
- Zero-Copy to Arrow - canonicalized Vortex arrays can be converted to Apache Arrow arrays with zero copy
- Extensible Encodings - a pluggable set of physical layouts. Beyond the Arrow-compatible encodings, modern encodings (FastLanes, ALP, FSST, etc.) are provided as extensions
- Cascading Compression - data can be compressed recursively with multiple nested encodings
- Pluggable Compression Strategies - the built-in Compressor is based on BtrBlocks, but other strategies can be plugged in easily
- Compute - basic compute kernels that operate on encoded data (e.g. for filter pushdown)
- Statistics - each array carries summary statistics that are optionally computed at read time. These can be used by compute kernels and by the compressor
- Serialization - zero-copy serialization of arrays for both IPC and the file format
- Columnar File Format (in progress) - a modern file format that uses the Vortex serde library to store compressed array data. It is optimized for random access reads and very fast scans, and aims to be a successor to Apache Parquet

Overview: Logical vs Physical

One of Vortex's core design principles is a strict separation of logical concerns from physical concerns
- For example, a Vortex array is defined by a logical data type (the type of its scalar elements) and a physical encoding (the type of the array itself)
Built-in encodings are designed primarily to model the Apache Arrow in-memory format. There are also built-in encodings (sparse, chunked) that serve as useful building blocks for other encodings. Extension encodings primarily model compressed in-memory arrays, such as run-length encoding or dictionary encoding
vortex-serde is designed to handle the low-level physical details of Vortex arrays. Choosing which encodings to use, or how to chunk the data logically, is left to the Compressor implementation
(In development) A unique property of the Vortex file format is that it encodes the physical layout of the data in the file's footer. This makes the file format effectively self-describing, so it can evolve without breaking compatibility with the file format spec
To support forward compatibility, the format is designed so that a WASM decoder can optionally be embedded in the file. This should help avoid the rapid ossification that has plagued other columnar file formats
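
To make the footer idea concrete, here is a minimal sketch of a self-describing file: column buffers first, then a footer that records where each buffer lives, then the footer length. This is not the actual Vortex layout. The JSON footer, the 4-byte length trailer, and the names `write_file` and `read_column` are all assumptions made for illustration.

```python
import json
import struct

def write_file(columns: dict[str, bytes]) -> bytes:
    """Concatenate column buffers, then append a footer that describes them."""
    body = bytearray()
    layout = {}
    for name, buf in columns.items():
        layout[name] = {"offset": len(body), "length": len(buf)}
        body += buf
    footer = json.dumps(layout).encode("utf-8")
    # Hypothetical trailer: the last 4 bytes hold the footer length (little-endian).
    return bytes(body) + footer + struct.pack("<I", len(footer))

def read_column(blob: bytes, name: str) -> bytes:
    """Find one column through the footer without scanning the other columns."""
    (footer_len,) = struct.unpack("<I", blob[-4:])
    layout = json.loads(blob[-4 - footer_len:-4])
    entry = layout[name]
    return blob[entry["offset"]:entry["offset"] + entry["length"]]

blob = write_file({"id": b"\x01\x02\x03", "name": b"abc"})
print(read_column(blob, "name"))
```

Because the reader learns the layout from the file itself, a writer could reorder or add buffers without the reader changing, which is the property the paragraph above describes.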

Components

Logical Types

The Vortex type system is still evolving. The current logical types are:
- Null
- Bool
- Integer(8, 16, 32, 64)
- Float(16, b16, 32, 64)
- Binary
- UTF8
- Struct
- List (partially implemented)
- Date/Time/DateTime/Duration (implemented as extension types)
- TODO: Decimal, FixedList, Tensor, Union

Canonical/Flat Encodings

Vortex ships with "Flat" encodings by default, designed to be zero-copy with Apache Arrow. These are the canonical representations of each logical data type. Canonical encodings currently supported:
- Null
- Bool
- Primitive (Integer, Float)
- Struct
- VarBin (Binary, UTF8)
- VarBinView (Binary, UTF8)
- Extension
- More encodings will be added

Compressed Encodings

Vortex includes a set of highly data-parallel, vectorized encodings. Each of these corresponds to a compressed in-memory array implementation, so that decompression can be deferred. The following encodings are currently available:
- Adaptive Lossless Floating Point (ALP)
- BitPacked (FastLanes)
- Constant
- Chunked
- Delta (FastLanes)
- Dictionary
- Fast Static Symbol Table (FSST)
- Frame-of-Reference
- Run-end Encoding
- RoaringUInt
- RoaringBool
- Sparse
- ZigZag
- More encodings will be added
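
As an illustration of how an encoded array can be used without decompressing it, the sketch below implements run-end encoding in plain Python, with random access done by binary search over the run ends. It is a toy model of the idea, not Vortex's implementation, and the function names are made up for this example.

```python
from bisect import bisect_right

def run_end_encode(values):
    """Return (run_ends, run_values); run_ends[i] is the exclusive end index of run i."""
    run_ends, run_values = [], []
    for i, v in enumerate(values):
        if run_values and run_values[-1] == v:
            run_ends[-1] = i + 1          # extend the current run
        else:
            run_values.append(v)          # start a new run
            run_ends.append(i + 1)
    return run_ends, run_values

def get(run_ends, run_values, index):
    """Random access on the encoded form: binary-search for the run holding index."""
    length = run_ends[-1] if run_ends else 0
    if not 0 <= index < length:
        raise IndexError(index)
    return run_values[bisect_right(run_ends, index)]

ends, vals = run_end_encode(["a", "a", "a", "b", "c", "c"])
print(ends, vals, get(ends, vals, 3))
```

Lookup cost grows with the logarithm of the number of runs rather than with the array length, so a single element can be read without expanding the array.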

Compression

Vortex's default compression strategy is based on the BtrBlocks paper
- Roughly, a sample of at least ~1% of the data is taken for each chunk
- Compression is then attempted (recursively) with a set of lightweight encodings
- The best-performing combination of encodings is selected and used to encode the whole chunk
- This may sound expensive, but given a chunk's basic statistics, many encodings can be pruned cheaply, which keeps the search space from growing too large
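
The sample-then-pick step above can be sketched as follows. This is a simplified, non-recursive illustration: the byte-cost formulas are rough assumptions for 64-bit integers, the sampling of contiguous 64-element slices is one plausible way to preserve run structure, and every name here is hypothetical rather than taken from Vortex or BtrBlocks.

```python
import math
import random

def plain_size(v):
    return 8 * len(v)                       # 8 bytes per 64-bit value

def rle_size(v):
    runs = sum(1 for i in range(len(v)) if i == 0 or v[i] != v[i - 1])
    return 12 * runs                        # assumed: 8-byte value + 4-byte length per run

def dict_size(v):
    distinct = len(set(v))
    code_bytes = 1 if distinct <= 256 else 2 if distinct <= 65536 else 4
    return 8 * distinct + code_bytes * len(v)

def for_bitpacked_size(v):
    bits = (max(v) - min(v)).bit_length()   # bits needed after subtracting the minimum
    return 8 + math.ceil(bits * len(v) / 8)

ESTIMATORS = {
    "plain": plain_size,
    "rle": rle_size,
    "dict": dict_size,
    "for_bitpacked": for_bitpacked_size,
}

def sample(values, fraction=0.01, run_len=64, seed=0):
    """Take contiguous slices totalling at least `fraction` of the chunk."""
    n = len(values)
    target = max(run_len, math.ceil(n * fraction))
    if n <= target:
        return list(values)
    rng = random.Random(seed)
    out = []
    while len(out) < target:
        start = rng.randrange(0, n - run_len + 1)
        out.extend(values[start:start + run_len])
    return out

def choose_encoding(values):
    """Estimate each encoding's size on the sample and return the cheapest name."""
    s = sample(values)
    if not s:
        return "plain"
    return min(ESTIMATORS, key=lambda name: ESTIMATORS[name](s))

long_runs = [i // 1000 for i in range(100_000)]
narrow_range = [1000 + (i * 7919) % 100 for i in range(100_000)]
print(choose_encoding(long_runs), choose_encoding(narrow_range))
```

A fuller version would re-run the same selection on each encoding's output (for example on dictionary codes) and would use chunk statistics to skip estimators that cannot win.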

Compute

Vortex lets each encoding specialize the implementation of a compute function so that decompression is avoided wherever possible. For example, when filtering a dictionary-encoded UTF8 array, it is cheaper to filter the dictionary first
Vortex implements only the basic compute operations that efficient scans and pushdown may require; it does not try to be a full compute engine
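
The dictionary example can be sketched in a few lines. The predicate runs once per distinct value, and the rows are then selected by looking up their integer codes. The function name and data are illustrative only and are not part of any Vortex API.

```python
def filter_dict_encoded(dictionary, codes, predicate):
    """Return matching row indices, evaluating the predicate once per distinct value."""
    keep = [predicate(value) for value in dictionary]   # one call per dictionary entry
    return [row for row, code in enumerate(codes) if keep[code]]

dictionary = ["apple", "banana", "cherry"]
codes = [0, 1, 1, 2, 0, 1]                              # six rows, three distinct strings
rows = filter_dict_encoded(dictionary, codes, lambda s: s.startswith("b"))
print(rows)
```

With six rows and three dictionary entries the string predicate is called three times instead of six, and the saving grows with the ratio of rows to distinct values.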

Statistics

Vortex arrays carry lazily computed summary statistics
Unlike other array libraries, these statistics can be populated from disk formats such as Parquet and preserved all the way into the compute engine
Statistics are available to compute kernels and to the compressor
Current statistics:
- BitWidthFreq
- TrailingZeroFreq
- IsConstant
- IsSorted
- IsStrictSorted
- Max
- Min
- RunCount
- TrueCount
- NullCount
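
A minimal sketch of lazily computed statistics, assuming a simple cache keyed by statistic name. The class `StatsArray` and the lowercase statistic names are hypothetical. The point is only that a statistic is computed on first request, or not at all if it was already supplied (for example from file metadata).

```python
def _run_count(v):
    return sum(1 for i in range(len(v)) if i == 0 or v[i] != v[i - 1])

def _is_sorted(v):
    return all(a <= b for a, b in zip(v, v[1:]))

COMPUTE = {
    "min": min,
    "max": max,
    "is_sorted": _is_sorted,
    "run_count": _run_count,
}

class StatsArray:
    def __init__(self, values, stats=None):
        self.values = list(values)
        # Statistics may arrive pre-populated; anything missing is computed on demand.
        self._stats = dict(stats or {})

    def stat(self, name):
        if name not in self._stats:
            self._stats[name] = COMPUTE[name](self.values)   # computed once, then cached
        return self._stats[name]

a = StatsArray([1, 2, 2, 5])
print(a.stat("is_sorted"), a.stat("run_count"))

# A statistic supplied up front is returned as-is, with no pass over the data.
b = StatsArray([3, 1, 2], stats={"min": 1})
print(b.stat("min"))
```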

Serialization / Deserialization (Serde)

Goals of the vortex-serde implementation:
- Support scans (column projection + row filtering) with zero copy and zero heap allocation
- Support random access in constant or near-constant time
- Forward statistical information, such as sortedness, to consumers
- Provide an IPC format for sending arrays between processes
- Provide an extensible, best-in-class file format for storing columnar data on disk or in object storage

Integration with Apache Arrow

Apache Arrow is the de facto standard for interoperating on columnar array data. Naturally, Vortex is designed to be as compatible with Apache Arrow as possible
Every Arrow array can be converted to a Vortex array with zero copy, and a Vortex array built from an Arrow array can be converted back to Arrow, again with zero copy
Note that Vortex and Arrow are different, but their goals are complementary
Vortex differs from Arrow in its explicit separation of logical types from physical encodings. This lets Vortex model more complex arrays while still exposing a logical interface
- For example, Vortex can model a UTF8 ChunkedArray whose first chunk is run-length encoded and whose second chunk is dictionary-encoded. In Arrow, RunLengthArray and DictionaryArray are separate, incompatible types, so they cannot be combined this way
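
The chunked example can be modeled with a toy sketch in which each chunk has its own physical encoding but the array exposes one logical sequence of strings. The classes below are illustrative stand-ins, not Vortex or Arrow types.

```python
class RunLengthChunk:
    """Physical encoding: a list of (value, run_length) pairs."""
    def __init__(self, runs):
        self.runs = runs

    def __len__(self):
        return sum(n for _, n in self.runs)

    def __iter__(self):
        for value, n in self.runs:
            for _ in range(n):
                yield value

class DictChunk:
    """Physical encoding: distinct values plus one integer code per row."""
    def __init__(self, dictionary, codes):
        self.dictionary = dictionary
        self.codes = codes

    def __len__(self):
        return len(self.codes)

    def __iter__(self):
        return (self.dictionary[c] for c in self.codes)

class ChunkedArray:
    """Logical view: one UTF8 sequence, whatever encoding each chunk uses."""
    logical_type = "utf8"

    def __init__(self, chunks):
        self.chunks = chunks

    def __len__(self):
        return sum(len(c) for c in self.chunks)

    def __iter__(self):
        for chunk in self.chunks:
            yield from chunk

arr = ChunkedArray([
    RunLengthChunk([("x", 3), ("y", 1)]),
    DictChunk(["p", "q"], [1, 0, 1]),
])
print(len(arr), list(arr))
```

A consumer iterating `arr` sees only strings, while the choice of encoding stays a per-chunk decision.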