19] Top ML Papers of the Week

(discuss.pytorch.kr)

5 points by ninebow, 2025-01-21

This post is an automatic translation of an article based on the ML papers that DAIR.AI publishes every week.
The most prominent trend among this week's selected papers is that research on large language models (LLMs) and multimodal AI is very active. For example, papers such as "Self-Adaptive LLMs", "Foundations of LLMs", "Enhancing RAG", and "VideoRAG" cover topics related to LLMs and multimodal learning. In addition, papers such as "Imagine while Reasoning in Space" and "OmniThink" explore multimodal approaches that attempt complex problem solving using different types of data.
This trend shows that language models are growing in importance in the current AI research community, and that efforts to combine different types of data for a more comprehensive understanding are accelerating. LLMs are leading state-of-the-art developments in natural language processing (NLP), and an integrated approach that uses multimodal data is considered necessary to advance these techniques further. In particular, multimodal AI plays an important role in solving more complex problems by combining image-based understanding with natural-language understanding.
In conclusion, this week's papers show that AI research is increasingly centered on large language models and multimodal learning. This suggests that AI is moving beyond text processing alone and, through integration with visual information, toward more intelligent and complex problem solving. Research of this kind is therefore expected to have a major impact on the future development of AI technology.

$\text{Transformer}^2$: Self-adaptive LLMs

Paper introduction

Introduces $\text{Transformer}^2$, a novel self-adaptation framework that adapts LLMs for unseen tasks in real-time by selectively adjusting singular components of their weight matrices; it's built with two key phases: 1) a dispatch system that analyzes and identifies the properties of the incoming task, and 2) a step that combines "expert" vectors (trained via reinforcement learning) to create task-specific behaviors; claims to be more efficient than LoRA with fewer parameters and can work across different LLM architectures.

Paper abstract

Self-adaptive large language models (LLMs) aim to solve the challenges posed by traditional fine-tuning methods, which are often computationally intensive and static in their ability to handle diverse tasks. We introduce $\text{Transformer}^2$, a novel self-adaptation framework that adapts LLMs for unseen tasks in real-time by selectively adjusting only the singular components of their weight matrices. During inference, $\text{Transformer}^2$ employs a two-pass mechanism: first, a dispatch system identifies the task properties, and then task-specific "expert" vectors, trained using reinforcement learning, are dynamically mixed to obtain targeted behavior for the incoming prompt. Our method outperforms ubiquitous approaches such as LoRA, with fewer parameters and greater efficiency. $\text{Transformer}^2$ demonstrates versatility across different LLM architectures and modalities, including vision-language tasks. $\text{Transformer}^2$ represents a significant leap forward, offering a scalable, efficient solution for enhancing the adaptability and task-specific performance of LLMs, paving the way for truly dynamic, self-organizing AI systems.
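To make "adjusting only the singular components" concrete, here is a minimal sketch of one plausible reading: a weight matrix is kept in factored form $W = U \,\mathrm{diag}(\sigma)\, V^\top$, a mixed expert vector $z$ rescales the singular values, and the adapted matrix is rebuilt. The helper names, the toy factorization, and the mixing weights are all assumptions for illustration; the abstract does not specify how the dispatch step produces the weights.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def diag(values):
    """Build a square diagonal matrix from a list of values."""
    n = len(values)
    return [[values[i] if i == j else 0.0 for j in range(n)] for i in range(n)]


def mix_experts(experts, weights):
    """Blend expert vectors with the given mixing weights."""
    return [sum(w * e[i] for w, e in zip(weights, experts))
            for i in range(len(experts[0]))]


def adapt_weight(u, sigma, vt, z):
    """Rescale each singular value by z, then rebuild U diag(sigma * z) V^T."""
    scaled = [s * zi for s, zi in zip(sigma, z)]
    return matmul(matmul(u, diag(scaled)), vt)


# Toy factorization (hypothetical values): identity singular vectors.
U = [[1.0, 0.0], [0.0, 1.0]]
SIGMA = [3.0, 2.0]
VT = [[1.0, 0.0], [0.0, 1.0]]

# Two hypothetical expert vectors, one scale factor per singular value.
EXPERTS = [[1.0, 0.5], [2.0, 1.0]]

# Hypothetical mixing weights, standing in for the dispatch step's output.
z = mix_experts(EXPERTS, [0.5, 0.5])
W_ADAPTED = adapt_weight(U, SIGMA, VT, z)
```

Only the vector `z` changes per task in this sketch, so the number of task-specific values equals the number of singular values rather than the size of the matrix.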

Paper link

https://arxiv.org/abs/2501.06252

MiniMax-01: Scaling Foundation Models with Lightning Attention

Paper introduction

Introduces a new model series that integrates Mixture-of-Experts; introduces a model with 32 experts and 456B parameters, of which 45.9B are activated for each token; claims to match the performance of state-of-the-art models such as GPT-4o and Claude-3.5-Sonnet while offering a 20-32x longer context window; it can handle context windows of up to 4 million tokens; it integrates linear attention with optimized hardware utilization, which enhances the efficiency and scalability of the LLM; there is also a vision model, MiniMax-VL-01, built through continued training on 512 billion vision-language tokens.

Paper abstract

We introduce MiniMax-01 series, including MiniMax-Text-01 and MiniMax-VL-01, which are comparable to top-tier models while offering superior capabilities in processing longer contexts. The core lies in lightning attention and its efficient scaling. To maximize computational capacity, we integrate it with Mixture of Experts (MoE), creating a model with 32 experts and 456 billion total parameters, of which 45.9 billion are activated for each token. We develop an optimized parallel strategy and highly efficient computation-communication overlap techniques for MoE and lightning attention. This approach enables us to conduct efficient training and inference on models with hundreds of billions of parameters across contexts spanning millions of tokens. The context window of MiniMax-Text-01 can reach up to 1 million tokens during training and extrapolate to 4 million tokens during inference at an affordable cost. Our vision-language model, MiniMax-VL-01, is built through continued training with 512 billion vision-language tokens. Experiments on both standard and in-house benchmarks show that our models match the performance of state-of-the-art models like GPT-4o and Claude-3.5-Sonnet while offering a 20-32 times longer context window. We publicly release MiniMax-01 at https://github.com/MiniMax-AI.
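The headline numbers can be sanity-checked with simple arithmetic. The sketch below computes the share of parameters active per token and, assuming the 20-32x figure is measured against the 4-million-token inference window (an assumption, since the abstract does not say which window the multiplier refers to), the baseline context lengths that the multiplier would imply.

```python
TOTAL_PARAMS_B = 456.0         # total parameters, in billions (from the abstract)
ACTIVE_PARAMS_B = 45.9         # parameters activated per token, in billions
INFERENCE_CONTEXT = 4_000_000  # extrapolated inference window, in tokens


def active_fraction(active, total):
    """Fraction of the model's parameters used for a single token."""
    return active / total


def implied_baseline_context(context, multiplier):
    """Baseline window size implied by a 'multiplier times longer' claim."""
    return context // multiplier


active_pct = round(active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B) * 100, 2)
baselines = {m: implied_baseline_context(INFERENCE_CONTEXT, m) for m in (20, 32)}

print(f"active per token: {active_pct}%")
print(f"implied baseline windows: {baselines}")
```

Roughly one tenth of the parameters are active for each token, which is why a sparse Mixture-of-Experts model of this size can be cheaper to run than a dense model with the same total parameter count.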

Paper link

https://arxiv.org/abs/2501.08313

VideoRAG: Retrieval-Augmented Generation over Video Corpus

Paper introduction

A framework that enhances RAG by using video content as an external knowledge source; unlike existing RAG approaches that primarily focus on text or images, VideoRAG dynamically retrieves relevant videos based on queries and incorporates both their visual and textual elements into the generation process; the framework uses Large Video Language Models (LVLMs) to process video content directly, which captures temporal dynamics, spatial details, and multimodal cues that static modalities often fail to convey; for videos that lack textual descriptions, the authors propose generating transcripts with automatic speech recognition so that both visual and textual modalities can be used.

Paper abstract

Retrieval-Augmented Generation (RAG) is a powerful strategy to address the issue of generating factually incorrect outputs in foundation models by retrieving external knowledge relevant to queries and incorporating it into their generation process. However, existing RAG approaches have primarily focused on textual information, with some recent advancements beginning to consider images, and they largely overlook videos, a rich source of multimodal knowledge capable of representing events, processes, and contextual details more effectively than any other modality. While a few recent studies explore the integration of videos in the response generation process, they either predefine query-associated videos without retrieving them according to queries, or convert videos into textual descriptions without harnessing their multimodal richness. To tackle these, we introduce VideoRAG, a novel framework that not only dynamically retrieves relevant videos based on their relevance to queries but also utilizes both visual and textual information of videos in the output generation. Further, to operationalize this, our method revolves around the recent advance of Large Video Language Models (LVLMs), which enable the direct processing of video content to represent it for retrieval and seamless integration of the retrieved videos jointly with queries. We experimentally validate the effectiveness of VideoRAG, showcasing that it is superior to relevant baselines.
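The retrieval step described above can be sketched as ranking videos by how similar the query is to each video's textual and visual representations. Everything concrete here is an assumption for illustration: the two-dimensional toy embeddings stand in for LVLM features, and the weighted blend controlled by `alpha` is one simple way to use both modalities, not necessarily the scoring used in the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vec, corpus, alpha=0.5, top_k=1):
    """Rank videos by a blend of text similarity and visual similarity."""
    scored = []
    for video in corpus:
        score = (alpha * cosine(query_vec, video["text"])
                 + (1 - alpha) * cosine(query_vec, video["visual"]))
        scored.append((score, video["id"]))
    scored.sort(reverse=True)  # highest blended score first
    return [video_id for _, video_id in scored[:top_k]]


# Hypothetical corpus: each video has a transcript embedding and a frame embedding.
CORPUS = [
    {"id": "cooking", "text": [1.0, 0.0], "visual": [1.0, 0.0]},
    {"id": "mixed", "text": [0.0, 1.0], "visual": [1.0, 0.0]},
    {"id": "repair", "text": [0.0, 1.0], "visual": [0.0, 1.0]},
]

top_videos = retrieve([1.0, 0.0], CORPUS, alpha=0.5, top_k=2)
```

The "mixed" video ranks second because only its visual embedding matches the query, which is the kind of signal a text-only retriever would miss.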

Paper link

https://arxiv.org/abs/2501.05874

Further reading

https://x.com/omarsar0/status/1878827350315659421

Titans: Learning to Memorize at Test Time

Paper introduction

Introduces a neural long-term memory module to memorize historical context and help attention to attend to the current context while utilizing long past information; the neural memory module acts as a long-term, more persistent memory than just using attention alone (considered more short-term); Titan, which is based on neural memory, shows good results in language modeling, common-sense reasoning, genomics, and time series tasks.

Paper abstract

Over more than a decade there has been an extensive research effort on how to effectively utilize recurrent models and attention. While recurrent models aim to compress the data into a fixed-size memory (called hidden state), attention allows attending to the entire context window, capturing the direct dependencies of all tokens. This more accurate modeling of dependencies, however, comes with a quadratic cost, limiting the model to a fixed-length context. We present a new neural long-term memory module that learns to memorize historical context and helps attention to attend to the current context while utilizing long past information. We show that this neural memory has the advantage of fast parallelizable training while maintaining fast inference. From a memory perspective, we argue that attention, due to its limited context but accurate dependency modeling, performs as a short-term memory, while neural memory, due to its ability to memorize the data, acts as a long-term, more persistent memory. Based on these two modules, we introduce a new family of architectures, called Titans, and present three variants to address how one can effectively incorporate memory into this architecture. Our experimental results on language modeling, common-sense reasoning, genomics, and time series tasks show that Titans are more effective than Transformers and recent modern linear recurrent models. They can further scale effectively to context windows larger than 2M tokens with higher accuracy in needle-in-haystack tasks compared to baselines.
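As a rough intuition for "learning to memorize at test time", the toy below keeps a fixed-size matrix as memory and writes each key/value pair into it with one gradient step on the squared reconstruction error. This is a generic associative-memory sketch, not the Titans architecture: the abstract does not give the update rule, so the class name, the linear memory, and the learning rate are all assumptions.

```python
class LinearMemory:
    """Fixed-size memory M, updated at inference time by gradient steps."""

    def __init__(self, dim):
        self.dim = dim
        self.m = [[0.0] * dim for _ in range(dim)]

    def read(self, key):
        """Recall the value associated with a key: M @ key."""
        return [sum(self.m[i][j] * key[j] for j in range(self.dim))
                for i in range(self.dim)]

    def write(self, key, value, lr=1.0):
        """One gradient step on 0.5 * ||M @ key - value||^2 with respect to M."""
        error = [p - v for p, v in zip(self.read(key), value)]
        for i in range(self.dim):
            for j in range(self.dim):
                self.m[i][j] -= lr * error[i] * key[j]


memory = LinearMemory(dim=2)
memory.write([1.0, 0.0], [2.0, 3.0])  # store the first association
memory.write([0.0, 1.0], [5.0, 7.0])  # store a second one under an orthogonal key

recalled_first = memory.read([1.0, 0.0])
recalled_second = memory.read([0.0, 1.0])
```

The memory stays the same size no matter how many pairs are written, which is the contrast the abstract draws with attention, whose cost grows quadratically with context length.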

Paper link

https://arxiv.org/abs/2501.00663

Foundations of Large Language Models

Paper introduction

A new survey on the foundations of LLMs, covering areas such as pre-training, prompting, and alignment methods.

Paper abstract

This is a book about large language models. As indicated by the title, it primarily focuses on foundational concepts rather than comprehensive coverage of all cutting-edge technologies. The book is structured into four main chapters, each exploring a key area: pre-training, generative models, prompting techniques, and alignment methods. It is intended for college students, professionals, and practitioners in natural language processing and related fields, and can serve as a reference for anyone interested in large language models.

Paper link

https://arxiv.org/abs/2501.09223

OmniThink: Expanding Knowledge Boundaries in Machine Writing through Thinking

Paper introduction

A new framework that emulates a human-like process of iterative expansion and reflection; it is built to simulate the cognitive behavior of learners as they deepen their knowledge; compared to RAG and role-playing, OmniThink can expand knowledge boundaries through continuous reflection and exploration; this makes it ideal for use cases that require long-form generation.

Paper abstract

Machine writing with large language models often relies on retrieval-augmented generation. However, these approaches remain confined within the boundaries of the model's predefined scope, limiting the generation of content with rich information. Specifically, vanilla-retrieved information tends to lack depth and utility and suffers from redundancy, which negatively impacts the quality of generated articles, leading to shallow, repetitive, and unoriginal outputs. To address these issues, we propose OmniThink, a machine writing framework that emulates the human-like process of iterative expansion and reflection. The core idea behind OmniThink is to simulate the cognitive behavior of learners as they progressively deepen their knowledge of the topics. Experimental results demonstrate that OmniThink improves the knowledge density of generated articles without compromising metrics such as coherence and depth. Human evaluations and expert feedback further highlight the potential of OmniThink to address real-world challenges in the generation of long-form articles.

Paper link

https://arxiv.org/abs/2501.09751

Further reading

https://x.com/omarsar0/status/1880275861401923619

Enhancing Retrieval-Augmented Generation: A Study of Best Practices

Paper introduction

Systematically explores the factors and methods that improve RAG systems such as retrieval strategies, query expansion, contrastive in-context learning, prompt design, and chunking.

Paper abstract

Retrieval-Augmented Generation (RAG) systems have recently shown remarkable advancements by integrating retrieval mechanisms into language models, enhancing their ability to produce more accurate and contextually relevant responses. However, the influence of various components and configurations within RAG systems remains underexplored. A comprehensive understanding of these elements is essential for tailoring RAG systems to complex retrieval tasks and ensuring optimal performance across diverse applications. In this paper, we develop several advanced RAG system designs that incorporate query expansion, various novel retrieval strategies, and a novel Contrastive In-Context Learning RAG. Our study systematically investigates key factors, including language model size, prompt design, document chunk size, knowledge base size, retrieval stride, query expansion techniques, Contrastive In-Context Learning knowledge bases, multilingual knowledge bases, and Focus Mode retrieving relevant context at sentence-level. Through extensive experimentation, we provide a detailed analysis of how these factors influence response quality. Our findings offer actionable insights for developing RAG systems, striking a balance between contextual richness and retrieval-generation efficiency, thereby paving the way for more adaptable and high-performing RAG frameworks in diverse real-world scenarios. Our code and implementation details are publicly available.
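Document chunk size is one of the factors the study varies. As a small illustration of what that knob controls, the sketch below splits a document into fixed-size word chunks; the function name, the whitespace tokenization, and the sample sentence are assumptions for illustration and are not taken from the paper's code.

```python
def chunk_words(text, chunk_size):
    """Split text into consecutive chunks of at most chunk_size words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]


# Hypothetical nine-word document.
DOC = "retrieval augmented generation adds retrieved context to the prompt"

large_chunks = chunk_words(DOC, 4)
small_chunks = chunk_words(DOC, 2)

print(len(large_chunks), large_chunks)
print(len(small_chunks), small_chunks)
```

Smaller chunks give the retriever more, narrower candidates, while larger chunks carry more context per hit; that trade-off between contextual richness and efficiency is what the abstract refers to.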

Paper link

https://arxiv.org/abs/2501.07391

Further reading

https://x.com/omarsar0/status/1879178916021318029

AutoCBT: An Autonomous Multi-agent Framework for Cognitive Behavioral Therapy in Psychological Counseling

Paper introduction

Proposes a multi-agent framework, AutoCBT, for Cognitive Behavioral Therapy; the work proposes a general multi-agent framework that generates high-quality responses for single-turn psychological consultation scenarios; it uses a combination of dynamic routing, memory, and supervisory mechanisms to enhance the autonomous ability of each agent; experimental results show that AutoCBT can provide higher-quality automated psychological counseling services; AutoCBT improves dialogue quality compared to other purely prompt-based counseling frameworks.

Paper abstract

Traditional in-person psychological counseling remains primarily niche, often chosen by individuals with psychological issues, while online automated counseling offers a potential solution for those hesitant to seek help due to feelings of shame. Cognitive Behavioral Therapy (CBT) is an essential and widely used approach in psychological counseling. The advent of large language models (LLMs) and agent technology enables automatic CBT diagnosis and treatment. However, current LLM-based CBT systems either use agents with a fixed structure, which limits their self-optimization capabilities, or provide hollow, unhelpful suggestions due to redundant response patterns. In this work, we utilize Quora-like and YiXinLi single-round consultation models to build a general agent framework that generates high-quality responses for single-turn psychological consultation scenarios. We use a bilingual dataset to evaluate the quality of single-response consultations generated by each framework. Then, we incorporate dynamic routing and supervisory mechanisms inspired by real psychological counseling to construct a CBT-oriented autonomous multi-agent framework, demonstrating its general applicability. Experimental results indicate that AutoCBT can provide higher-quality automated psychological counseling services.

Paper link

https://arxiv.org/abs/2501.09426

Further reading

https://x.com/omarsar0/status/1880283025595867631

स्पेस में reasoning करते हुए कल्पना करें: विचारों का visualization: multimodal visualization / Imagine while Reasoning in Space: Multimodal Visualization-of-Thought

पेपर परिचय

Introduces MVoT (Multimodal Visualization-of-Thought), a new reasoning framework that enables AI models to "think" in both text and images; MVoT enhances traditional Chain-of-Thought prompting by allowing models to generate visual representations of their reasoning steps alongside text explanations; the framework is implemented in Chameleon-7B, a multimodal language model, and introduces a "token discrepancy loss" to improve the quality of generated visualizations; MVoT significantly outperforms traditional approaches, especially in complex scenarios, achieving over 90% accuracy on maze and printer installation tasks.

Paper abstract

Chain-of-Thought (CoT) prompting has proven highly effective for enhancing complex reasoning in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). Yet it struggles in complex spatial reasoning tasks. Nonetheless, human cognition extends beyond language alone, enabling the remarkable capability to think in both words and images. Inspired by this mechanism, we propose a new reasoning paradigm, Multimodal Visualization-of-Thought (MVoT). It enables visual thinking in MLLMs by generating image visualizations of their reasoning traces. To ensure high-quality visualization, we introduce token discrepancy loss into autoregressive MLLMs. This innovation significantly improves both visual coherence and fidelity. We validate this approach through several dynamic spatial reasoning tasks. Experimental results show that MVoT delivers competitive performance across tasks. Moreover, it exhibits robust and reliable improvements in the most challenging scenarios where CoT fails. Ultimately, MVoT establishes new possibilities for complex reasoning tasks where visual thinking can effectively complement verbal reasoning.
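The digest only names the token discrepancy loss, so the following is a minimal sketch of one plausible reading rather than the paper's exact formulation. It assumes the loss is the expected embedding distance between the predicted visual token and the ground-truth token, so that probability mass on visually distant tokens is penalized more than mass on near misses. The function names, the toy codebook, and the use of mean squared error are all assumptions made for illustration.

```python
from math import exp


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mse(a, b):
    """Mean squared error between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def token_discrepancy_loss(logits, target_id, codebook):
    """Expected embedding distance to the target token (assumed form).

    `codebook` is a hypothetical list of visual-token embeddings;
    `target_id` indexes the ground-truth visual token.
    """
    probs = softmax(logits)
    target = codebook[target_id]
    distances = [mse(target, emb) for emb in codebook]
    return sum(p * d for p, d in zip(probs, distances))


# Toy codebook: tokens 0 and 1 are close in embedding space, token 2 is far.
codebook = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0]]
near_miss = token_discrepancy_loss([0.0, 5.0, 0.0], 0, codebook)
far_miss = token_discrepancy_loss([0.0, 0.0, 5.0], 0, codebook)
```

In this toy setup a confident prediction of the nearby token 1 is penalized far less than a confident prediction of the distant token 2, which is the behavior a plain cross-entropy loss would not distinguish.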

Paper link

https://arxiv.org/abs/2501.07542

ChemAgent: Self-updating Library in Large Language Models Improves Chemical Reasoning

Paper introduction

Presents a new framework designed to improve the performance of LLMs on chemical reasoning through a dynamic, self-updating library; the library is developed by decomposing chemical tasks into sub-tasks and compiling them into a structured collection that can be referenced for future queries; when the system is given a new problem, it retrieves and refines relevant information from the library to enable more effective task decomposition; the library is dynamically updated as new sub-tasks and solutions are encountered and validated; experiments on SciBench show that ChemAgent achieves performance gains of up to 46% (GPT-4), significantly outperforming existing methods.

Paper abstract

Chemical reasoning usually involves complex, multi-step processes that demand precise calculations, where even minor errors can lead to cascading failures. Furthermore, large language models (LLMs) encounter difficulties handling domain-specific formulas, executing reasoning steps accurately, and integrating code effectively when tackling chemical reasoning tasks. To address these challenges, we present ChemAgent, a novel framework designed to improve the performance of LLMs through a dynamic, self-updating library. This library is developed by decomposing chemical tasks into sub-tasks and compiling these sub-tasks into a structured collection that can be referenced for future queries. Then, when presented with a new problem, ChemAgent retrieves and refines pertinent information from the library, which we call memory, facilitating effective task decomposition and the generation of solutions. Our method designs three types of memory and a library-enhanced reasoning component, enabling LLMs to improve over time through experience. Experimental results on four chemical reasoning datasets from SciBench demonstrate that ChemAgent achieves performance gains of up to 46% (GPT-4), significantly outperforming existing methods. Our findings suggest substantial potential for future applications, including tasks such as drug discovery and materials science. Our code can be found at https://github.com/gersteinlab/chemagent
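The store, retrieve, and update cycle described in the abstract can be illustrated with a small sketch. This is not ChemAgent's implementation: the class names, the word-overlap (Jaccard) retrieval score, and the `validated` flag are hypothetical stand-ins, and the paper's three memory types and LLM-driven refinement step are not modeled here.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    subtask: str   # description of a solved sub-task
    solution: str  # the validated solution for it


@dataclass
class SubtaskLibrary:
    """Hypothetical self-updating library of solved sub-tasks."""
    entries: list = field(default_factory=list)

    @staticmethod
    def _tokens(text):
        return set(text.lower().split())

    def retrieve(self, query, k=1):
        """Return up to k entries ranked by word overlap with the query."""
        q = self._tokens(query)

        def score(entry):
            t = self._tokens(entry.subtask)
            union = q | t
            return len(q & t) / len(union) if union else 0.0

        ranked = sorted(self.entries, key=score, reverse=True)
        return [e for e in ranked[:k] if score(e) > 0]

    def update(self, subtask, solution, validated):
        """Store a new sub-task only after its solution is validated."""
        if validated:
            self.entries.append(MemoryEntry(subtask, solution))


library = SubtaskLibrary()
library.update("convert grams to moles", "moles = mass / molar_mass", validated=True)
library.update("compute ideal gas pressure", "P = n * R * T / V", validated=True)
library.update("guess the answer", "42", validated=False)  # rejected, not stored

hits = library.retrieve("convert 18 grams of water to moles")
```

Here the query about water reuses the stored grams-to-moles sub-task, while the unvalidated entry never enters the library. A real system would likely replace the word-overlap score with embedding similarity and let the LLM adapt the retrieved solution to the new problem.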

Paper link

https://arxiv.org/abs/2501.06590

Original article

https://nlp.elvissaravia.com/p/top-ml-papers-of-the-week-adb

This article was summarized with the help of a GPT model, so it may contain some errors. Please also refer to the original article linked above. If you come across anything awkward or incorrect while reading, please let us know in the comments! 🤗

