22] Top ML Papers of the Week

(discuss.pytorch.kr)

5 points by ninebow 2024-09-23 | 3 comments

This post is an automated translation of the ML papers digest that DAIR.AI publishes every week.
Looking at this week's selected papers, a few trends stand out. First, a large share of the research focuses on Large Language Models (LLMs). Papers such as 'Training LLMs to Self-Correct via RL', 'Qwen2.5 Coder', and 'A Comprehensive Evaluation of Quantized Instruction-Tuned LLMs' discuss how to improve LLM performance and how to apply LLMs. This shows that LLMs are currently one of the central topics in AI research.
Second, many papers study how AI reasons. Papers such as 'Diagram of Thought (DoT)', 'Iteration of Thought', and 'To CoT or not to CoT?' examine AI reasoning methods and the reasoning process in depth, reflecting efforts to improve the accuracy and efficiency of AI systems.
There are several possible reasons for these trends. First, Large Language Models attract strong interest from both industry and academia because of their wide range of possible applications and their high performance. In particular, techniques for self-correction and for improving model performance are being actively studied. Research on how AI reasons, in turn, is tied to the long-term goal of developing AI with human-like reasoning ability, which is considered essential for automating more complex and intelligent tasks.
In short, this week's main trends are performance improvements for Large Language Models and research on the AI reasoning process. Together they give a good picture of where current AI research is heading.

Moshi

Paper introduction

Introduces a speech-text foundation model and full-duplex spoken dialogue framework. The paper presents several components of the system: Helium, a 7B-parameter text LLM; Mimi, a semantic-acoustic neural audio codec with state-of-the-art audio quality; and a hierarchical multi-stream architecture that can generate arbitrary conversation in a speech-to-speech manner.

Paper summary (Abstract)

We introduce Moshi, a speech-text foundation model and full-duplex spoken dialogue framework. Current systems for spoken dialogue rely on pipelines of independent components, namely voice activity detection, speech recognition, textual dialogue and text-to-speech. Such frameworks cannot emulate the experience of real conversations. First, their complexity induces a latency of several seconds between interactions. Second, text being the intermediate modality for dialogue, non-linguistic information that modifies meaning, such as emotion or non-speech sounds, is lost in the interaction. Finally, they rely on a segmentation into speaker turns, which does not take into account overlapping speech, interruptions and interjections. Moshi solves these independent issues altogether by casting spoken dialogue as speech-to-speech generation. Starting from a text language model backbone, Moshi generates speech as tokens from the residual quantizer of a neural audio codec, while modeling separately its own speech and that of the user into parallel streams. This allows for the removal of explicit speaker turns, and the modeling of arbitrary conversational dynamics. We moreover extend the hierarchical semantic-to-acoustic token generation of previous work to first predict time-aligned text tokens as a prefix to audio tokens. Not only does this “Inner Monologue” method significantly improve the linguistic quality of generated speech, but we also illustrate how it can provide streaming speech recognition and text-to-speech. Our resulting model is the first real-time full-duplex spoken large language model, with a theoretical latency of 160ms, 200ms in practice, and is available at github.com/kyutai-labs/moshi.
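
To make the stream layout in the Moshi abstract concrete, here is a toy sketch of how a single time step might be arranged when a time-aligned text token is predicted as a prefix to the audio tokens of two parallel streams. The `Frame` class, the `flatten_frame` helper, the token values, and the codebook count of 4 are all illustrative assumptions, not names or values taken from the Moshi codebase.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    """One time step of a toy two-stream dialogue (hypothetical layout)."""
    text_token: int        # time-aligned text token ("Inner Monologue")
    own_audio: List[int]   # codec tokens for the model's own speech
    user_audio: List[int]  # codec tokens for the user's speech


def flatten_frame(frame: Frame) -> List[int]:
    """Order tokens so the text token acts as a prefix to the audio tokens."""
    return [frame.text_token, *frame.own_audio, *frame.user_audio]


# Assumed toy value: 4 residual-quantizer codebooks per audio stream.
NUM_CODEBOOKS = 4
frame = Frame(
    text_token=17,
    own_audio=[101, 102, 103, 104],
    user_audio=[201, 202, 203, 204],
)
tokens = flatten_frame(frame)
print(tokens)
```

Because both speakers' streams are present at every step in this layout, nothing in it marks an explicit speaker turn, which is the property the abstract highlights.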

Paper link

https://kyutai.org/Moshi.pdf

Training Language Models to Self-Correct via Reinforcement Learning

Paper introduction

Develops a multi-turn online reinforcement learning approach to improve an LLM's ability to self-correct, based entirely on self-generated data. SFT is shown to be ineffective at learning self-correction and suffers from a distribution mismatch between the training data and the model's responses. The paper proposes a two-stage approach that first optimizes correction behavior and then uses a reward bonus to amplify self-correction during training. Applied to Gemini 1.0 Pro and 1.5 Flash models, it achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% on the MATH and HumanEval benchmarks, respectively.

Paper summary (Abstract)

Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Existing approaches for training self-correction either require multiple models or rely on a more capable model or other forms of supervision. To this end, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are insufficient for instilling self-correction behavior. In particular, we observe that training via SFT either suffers from a distribution mismatch between the training data and the model's own responses or implicitly prefers only a certain mode of correction behavior that is often not effective at test time. SCoRe addresses these challenges by training under the model's own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction strategy that is effective at test time, as opposed to simply fitting high-reward responses for a given prompt. This regularization prescribes running a first phase of RL on a base model to generate a policy initialization that is less susceptible to collapse and then using a reward bonus to amplify self-correction during training. When applied to Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on the MATH and HumanEval benchmarks.
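
The reward bonus mentioned in the SCoRe abstract can be illustrated with a small sketch. The abstract does not give the exact shaping, so the function below is an assumption: it adds a bonus proportional to the change in correctness between the first and second attempt. The name `shaped_reward` and the coefficient `alpha` are hypothetical.

```python
def shaped_reward(first_correct: bool, second_correct: bool, alpha: float = 2.0) -> float:
    """Toy two-attempt reward: final correctness plus a progress bonus.

    Assumption: the bonus is alpha * (r2 - r1), so fixing a wrong answer
    is rewarded and breaking a right answer is penalized.
    """
    r1 = 1.0 if first_correct else 0.0
    r2 = 1.0 if second_correct else 0.0
    return r2 + alpha * (r2 - r1)


for first, second in [(False, True), (True, True), (True, False), (False, False)]:
    print(first, second, shaped_reward(first, second))
```

Under this assumed shaping, a policy that simply repeats a correct first answer earns less than one that genuinely repairs a wrong one, which is the kind of pressure toward real self-correction that the abstract describes.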

Paper link

https://arxiv.org/abs/2409.12917

Further reading

https://x.com/omarsar0/status/1837228446839361984

Qwen2.5-Coder Technical Report

Paper introduction

A series of models that includes 1.5B and 7B parameter versions. It is built on the Qwen2.5 architecture and is further pretrained on 5.5 trillion tokens. It achieves state-of-the-art performance on more than 10 benchmarks and shows strong capabilities in code generation, completion, reasoning, and repair.

Paper summary (Abstract)

In this report, we introduce the Qwen2.5-Coder series, a significant upgrade from its predecessor, CodeQwen1.5. This series includes two models: Qwen2.5-Coder-1.5B and Qwen2.5-Coder-7B. As a code-specific model, Qwen2.5-Coder is built upon the Qwen2.5 architecture and is further pretrained on a vast corpus of over 5.5 trillion tokens. Through meticulous data cleaning, scalable synthetic data generation, and balanced data mixing, Qwen2.5-Coder demonstrates impressive code generation capabilities while retaining general versatility. The model has been evaluated on a wide range of code-related tasks, achieving state-of-the-art (SOTA) performance across more than 10 benchmarks, including code generation, completion, reasoning, and repair, consistently outperforming larger models of the same model size. We believe that the release of the Qwen2.5-Coder series will not only push the boundaries of research in code intelligence but also, through its permissive licensing, encourage broader adoption by developers in real-world applications.

Paper link

https://arxiv.org/abs/2409.12186

On the Diagram of Thought

Paper introduction

Enhances the reasoning capabilities of LLMs through mathematical rigor. DoT models iterative reasoning in an LLM as the construction of a directed acyclic graph, and integrates propositions, critiques, refinement, and verification into a unified DAG structure. This allows DoT to capture complex logical deduction beyond linear or tree-based approaches.

Paper summary (Abstract)

We introduce Diagram of Thought (DoT), a framework that models iterative reasoning in large language models (LLMs) as the construction of a directed acyclic graph (DAG) within a single model. Unlike traditional approaches that represent reasoning as linear chains or trees, DoT organizes propositions, critiques, refinements, and verifications into a cohesive DAG structure, allowing the model to explore complex reasoning pathways while maintaining logical consistency. Each node in the diagram corresponds to a proposition that has been proposed, critiqued, refined, or verified, enabling the LLM to iteratively improve its reasoning through natural language feedback. By leveraging auto-regressive next-token prediction with role-specific tokens, DoT facilitates seamless transitions between proposing ideas and critically evaluating them, providing richer feedback than binary signals. Furthermore, we formalize the DoT framework using Topos Theory, providing a mathematical foundation that ensures logical consistency and soundness in the reasoning process. This approach enhances both the training and inference processes within a single LLM, eliminating the need for multiple models or external control mechanisms. DoT offers a conceptual framework for designing next-generation reasoning-specialized models, emphasizing training efficiency, robust reasoning capabilities, and theoretical grounding. The code is available at https://github.com/diagram-of-thought/diagram-of-thought.
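
As a rough illustration of the DAG structure the DoT abstract describes, the sketch below stores reasoning steps as nodes with one of four roles and only allows edges to nodes that already exist, which keeps the graph acyclic by construction. The `ThoughtDAG` class and its method names are hypothetical and are not taken from the DoT repository; in the paper the graph is built by a single LLM through role-specific tokens, not by an external data structure like this one.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence, Set

ROLES = {"propose", "critique", "refine", "verify"}


@dataclass
class ThoughtDAG:
    """Toy container for role-tagged reasoning steps (illustrative only)."""
    roles: Dict[str, str] = field(default_factory=dict)
    parents: Dict[str, List[str]] = field(default_factory=dict)

    def add_node(self, node_id: str, role: str, parents: Sequence[str] = ()) -> None:
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        if node_id in self.roles:
            raise ValueError(f"duplicate node: {node_id}")
        # A new node may only depend on existing nodes, so no cycle can form.
        missing = [p for p in parents if p not in self.roles]
        if missing:
            raise ValueError(f"unknown parents: {missing}")
        self.roles[node_id] = role
        self.parents[node_id] = list(parents)

    def ancestors(self, node_id: str) -> Set[str]:
        """Return every earlier step that the given step depends on."""
        seen: Set[str] = set()
        stack = list(self.parents[node_id])
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.parents[current])
        return seen


dag = ThoughtDAG()
dag.add_node("p1", "propose")
dag.add_node("c1", "critique", parents=["p1"])
dag.add_node("r1", "refine", parents=["p1", "c1"])
dag.add_node("v1", "verify", parents=["r1"])
print(sorted(dag.ancestors("v1")))
```

The refinement node here has two parents (the proposition and its critique), which is the kind of dependency a linear chain or a tree cannot express directly.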

Paper link

https://arxiv.org/abs/2409.10038

Further reading

https://github.com/diagram-of-thought/diagram-of-thought

https://x.com/omarsar0/status/1835882277563179512

Agents in Software Engineering: Survey, Landscape, and Vision

Paper introduction

Provides a comprehensive overview of frameworks of LLM-based agents in software engineering.

Paper summary (Abstract)

In recent years, Large Language Models (LLMs) have achieved remarkable success and have been widely used in various downstream tasks, especially in the tasks of the software engineering (SE) field. We find that many studies combining LLMs with SE have employed the concept of agents either explicitly or implicitly. However, there is a lack of an in-depth survey to sort out the development context of existing works, analyze how existing works combine the LLM-based agent technologies to optimize various tasks, and clarify the framework of LLM-based agents in SE. In this paper, we conduct the first survey of the studies on combining LLM-based agents with SE and present a framework of LLM-based agents in SE which includes three key modules: perception, memory, and action. We also summarize the current challenges in combining the two fields and propose future opportunities in response to existing challenges. We maintain a GitHub repository of the related papers at: https://github.com/DeepSoftwareAnalytics/Awesome-Agent4SE.
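
The three modules named in the survey abstract (perception, memory, action) can be pictured with a minimal skeleton. This is only a sketch under assumptions: the `SEAgent` class, its method names, and the stub policy are hypothetical, and a real agent would call an LLM and developer tools where the stub returns a fixed string.

```python
from typing import Callable, List


class SEAgent:
    """Toy agent wiring perception, memory, and action together."""

    def __init__(self, policy: Callable[[str, List[str]], str]) -> None:
        self.memory: List[str] = []  # memory: record of past steps
        self._policy = policy        # action: chooses what to do next

    def perceive(self, raw_input: str) -> str:
        """Perception: normalize the raw input into an observation."""
        return raw_input.strip()

    def step(self, raw_input: str) -> str:
        observation = self.perceive(raw_input)
        action = self._policy(observation, self.memory)
        self.memory.append(f"{observation} -> {action}")
        return action


def stub_policy(observation: str, memory: List[str]) -> str:
    """Stand-in for an LLM call; the step count shows memory being used."""
    return f"run tests (step {len(memory) + 1})"


agent = SEAgent(stub_policy)
print(agent.step("  fix failing build "))
print(agent.step("tests still red"))
```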

Paper link

https://arxiv.org/abs/2409.09030

Further reading

https://github.com/DeepSoftwareAnalytics/Awesome-Agent4SE

https://x.com/omarsar0/status/1835705359723319702

To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning

Paper introduction

Investigates what kinds of tasks benefit the most from chain-of-thought (CoT) prompting. After a meta-analysis of 100+ papers and several evaluations, it finds that CoT produces strong performance benefits primarily on tasks involving math and logic. It also finds that most of the CoT gain comes from improving symbolic execution, but that a symbolic solver outperforms it.

Paper summary (Abstract)

Chain-of-thought (CoT) via prompting is the de facto method for eliciting reasoning capabilities from large language models (LLMs). But for what kinds of tasks is this extra "thinking" really helpful? To analyze this, we conducted a quantitative meta-analysis covering over 100 papers using CoT and ran our own evaluations of 20 datasets across 14 models. Our results show that CoT gives strong performance benefits primarily on tasks involving math or logic, with much smaller gains on other types of tasks. On MMLU, directly generating the answer without CoT leads to almost identical accuracy as CoT unless the question or model's response contains an equals sign, indicating symbolic operations and reasoning. Following this finding, we analyze the behavior of CoT on these problems by separating planning and execution and comparing against tool-augmented LLMs. Much of CoT's gain comes from improving symbolic execution, but it underperforms relative to using a symbolic solver. Our results indicate that CoT can be applied selectively, maintaining performance while saving inference costs. Furthermore, they suggest a need to move beyond prompt-based CoT to new paradigms that better leverage intermediate computation across the whole range of LLM applications.
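
The suggestion that CoT can be applied selectively could look roughly like the sketch below. It is a deliberately crude heuristic inspired by the equals-sign observation in the abstract: the abstract refers to the question or the model's response, while this sketch checks only the question. The function names and prompt wording are hypothetical, and the paper does not prescribe this routing rule.

```python
def should_use_cot(question: str) -> bool:
    """Crude router: treat an equals sign as a hint of symbolic reasoning."""
    return "=" in question


def build_prompt(question: str) -> str:
    """Add a step-by-step instruction only when the router asks for it."""
    if should_use_cot(question):
        return f"{question}\nLet's think step by step."
    return f"{question}\nAnswer directly."


print(build_prompt("If 3x + 2 = 11, what is x?"))
print(build_prompt("Which country is Kyoto in?"))
```

Skipping the step-by-step instruction on the second kind of question is where the inference-cost saving mentioned in the abstract would come from, since the model generates fewer intermediate tokens.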

Paper link

https://arxiv.org/abs/2409.12183

A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B

Paper introduction

Evaluates the performance of instruction-tuned LLMs across various quantization methods on models ranging from 7B to 405B. The key findings are: 1) quantizing a larger LLM to a similar size as a smaller FP16 LLM generally performs better across most benchmarks; 2) performance varies significantly with quantization method, model size, and bit-width, with weight-only methods often yielding better results in larger models; and 3) task difficulty does not significantly impact accuracy degradation due to quantization.

Paper summary (Abstract)

Prior research works have evaluated quantized LLMs using limited metrics such as perplexity or a few basic knowledge tasks and old datasets. Additionally, recent large-scale models such as Llama 3.1 with up to 405B have not been thoroughly examined. This paper evaluates the performance of instruction-tuned LLMs across various quantization methods (GPTQ, AWQ, SmoothQuant, and FP8) on models ranging from 7B to 405B. Using 13 benchmarks, we assess performance across six task types: commonsense Q&A, knowledge and language understanding, instruction following, hallucination detection, mathematics, and dialogue. Our key findings reveal that (1) quantizing a larger LLM to a similar size as a smaller FP16 LLM generally performs better across most benchmarks, except for hallucination detection and instruction following; (2) performance varies significantly with different quantization methods, model size, and bit-width, with weight-only methods often yielding better results in larger models; (3) task difficulty does not significantly impact accuracy degradation due to quantization; and (4) the MT-Bench evaluation method has limited discriminatory power among recent high-performing LLMs.
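
Finding (1) compares a quantized larger model with a smaller FP16 model of similar size. The arithmetic behind "similar size" can be sketched as follows. The helper name and the 13B versus 7B pairing are illustrative assumptions rather than the paper's exact configurations, and the estimate covers weights only, ignoring quantization scales, activations, and the KV cache.

```python
def weight_memory_gb(num_params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return num_params_billion * bits_per_weight / 8


small_fp16 = weight_memory_gb(7, 16)  # hypothetical 7B model in FP16
large_int4 = weight_memory_gb(13, 4)  # hypothetical 13B model at 4 bits
print(f"7B FP16: {small_fp16:.1f} GB, 13B 4-bit: {large_int4:.1f} GB")
```

Under these assumptions the 4-bit 13B model needs less weight memory than the FP16 7B model, which is why such pairs are a natural basis for comparison.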

Paper link

https://arxiv.org/abs/2409.11055

Iteration of Thought: Leveraging Inner Dialogue for Autonomous Large Language Model Reasoning

Paper introduction

Proposes the Iteration of Thought (IoT) framework to enhance LLM responses and reasoning capabilities with adaptive reasoning paths. It uses an inner dialogue agent, acting as a guide, to dynamically adjust reasoning paths, which allows adaptive cross-path exploration and improves response accuracy. It differs from CoT and ToT (both rigid processes) in that its prompt generation is a dynamic process that allows it to adapt.

Paper summary (Abstract)

Iterative human engagement is a common and effective means of leveraging the advanced language processing power of large language models (LLMs). Using well-structured prompts in a conversational manner, human users can effectively influence an LLM to develop more thoughtful and accurate responses. Motivated by this insight, we propose the Iteration of Thought (IoT) framework for enhancing LLM responses by generating "thought"-provoking prompts with respect to an input query and the current iteration of an LLM's response. Unlike static or semi-static approaches, such as Chain of Thought (CoT) or Tree of Thoughts (ToT), IoT adapts its reasoning path dynamically, based on evolving context, and without generating alternate explorative thoughts which are ultimately discarded. The three components of the IoT framework are (1) an Inner Dialogue Agent (IDA) responsible for generating instructive, context-specific prompts; (2) an LLM Agent (LLMA) that processes these prompts to refine its responses; and (3) an iterative prompting loop that implements a conversation between the former two components. We introduce two variants of our framework: Autonomous Iteration of Thought (AIoT), where an LLM decides when to stop iterating, and Guided Iteration of Thought (GIoT), which always forces a fixed number of iterations. We investigate the performance of IoT across various datasets, spanning complex reasoning tasks from the GPQA dataset, explorative problem-solving in Game of 24, puzzle solving in Mini Crosswords, and multi-hop question answering from the HotpotQA dataset. Our results show that IoT represents a viable paradigm for autonomous response refinement in LLMs, showcasing significant improvements over CoT and thereby enabling more adaptive and efficient reasoning systems that minimize human intervention.
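
The three components listed in the abstract (IDA, LLMA, and the prompting loop) and the two variants can be sketched as a single loop. Everything here is an assumption made for illustration: the function signature, the `should_stop` callback standing in for the LLM's own stopping decision in AIoT, and the toy agents that replace real model calls.

```python
from typing import Callable, Optional


def iteration_of_thought(
    query: str,
    ida: Callable[[str, str], str],   # Inner Dialogue Agent: (query, response) -> prompt
    llma: Callable[[str, str], str],  # LLM Agent: (query, prompt) -> response
    max_iters: int = 3,
    should_stop: Optional[Callable[[str, str], bool]] = None,
) -> str:
    """Toy IoT loop.

    With should_stop=None the loop always runs max_iters times (GIoT-like).
    With a should_stop callback it may end early (AIoT-like).
    """
    response = llma(query, "")  # initial answer without guidance
    for _ in range(max_iters):
        if should_stop is not None and should_stop(query, response):
            break
        prompt = ida(query, response)
        response = llma(query, prompt)
    return response


def toy_ida(query: str, response: str) -> str:
    """Stand-in guide: hands the current draft back for refinement."""
    return response


def toy_llma(query: str, prompt: str) -> str:
    """Stand-in model: each '!' marks one refinement pass."""
    return prompt + "!" if prompt else "draft"


guided = iteration_of_thought("q", toy_ida, toy_llma, max_iters=3)
autonomous = iteration_of_thought(
    "q", toy_ida, toy_llma, max_iters=5,
    should_stop=lambda q, r: r.endswith("!!"),
)
print(guided, autonomous)
```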

Paper link

https://arxiv.org/abs/2409.12618

Schrodinger's Memory: Large Language Models

Paper introduction

Uses the Universal Approximation Theorem (UAT) to explain the memory mechanism of LLMs. It also proposes a new approach to evaluating LLM performance by comparing the memory capacities of different models. The Transformer architecture functions as a dynamic fitting UAT model with a strong ability to adaptively fit inputs, which enables LLMs to recall entire content based on minimal input information.
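For readers unfamiliar with the theorem, the classical single-hidden-layer form of the Universal Approximation Theorem (Cybenko's formulation) is given below. The notation is an assumption for illustration and may differ from the paper's own.

```latex
% Classical UAT: finite sums of this form are dense in C(I_n),
% the continuous functions on the n-dimensional unit cube,
% for any continuous sigmoidal activation \sigma.
G(\mathbf{x}) = \sum_{j=1}^{N} \alpha_j \, \sigma\!\left(\mathbf{W}_j^{\top}\mathbf{x} + \theta_j\right),
\qquad \alpha_j, \theta_j \in \mathbb{R},\ \mathbf{W}_j \in \mathbb{R}^{n}
```

In this classical form the parameters $\alpha_j$, $\mathbf{W}_j$, and $\theta_j$ are fixed once fitted. One possible reading of "dynamic fitting" in the summary above is that, in a Transformer, the effective parameters depend on the input; that reading is an interpretation, not a statement from the paper.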

Paper abstract

Memory is the foundation of all human activities; without memory, it would be nearly impossible for people to perform any task in daily life. With the development of Large Language Models (LLMs), their language capabilities are becoming increasingly comparable to those of humans. But do LLMs have memory? Based on current performance, LLMs do appear to exhibit memory. So, what is the underlying mechanism of this memory? Previous research has lacked a deep exploration of LLMs' memory capabilities and the underlying theory. In this paper, we use the Universal Approximation Theorem (UAT) to explain the memory mechanism in LLMs. We also conduct experiments to verify the memory capabilities of various LLMs, proposing a new method to assess their abilities based on these memory capabilities. We argue that LLM memory operates like Schrödinger's memory, meaning that it only becomes observable when a specific memory is queried. We can only determine if the model retains a memory based on its output in response to the query; otherwise, it remains indeterminate. Finally, we expand on this concept by comparing the memory capabilities of the human brain and LLMs, highlighting the similarities and differences in their operational mechanisms.
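The idea that a memory is only observable when queried can be illustrated with a small probing routine: give the model a minimal cue, then compare its output with the expected content. This is a sketch under stated assumptions, not the paper's evaluation protocol. The `model` callable, the cue/target corpus, and the use of `difflib.SequenceMatcher` with a 0.9 threshold are all choices made for this example.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict


def recall_score(
    model: Callable[[str], str],
    corpus: Dict[str, str],
    threshold: float = 0.9,
) -> float:
    """Return the fraction of items the model reproduces from the cue alone.

    corpus maps a minimal cue (e.g. a title) to the full target text.
    An item counts as recalled when the similarity between the model
    output and the target is at least `threshold` (an assumed criterion).
    """
    if not corpus:
        return 0.0
    hits = 0
    for cue, target in corpus.items():
        output = model(cue)  # the memory is only "observed" through this query
        if SequenceMatcher(None, output, target).ratio() >= threshold:
            hits += 1
    return hits / len(corpus)
```

Running the same corpus against several models gives one comparable number per model, which is the spirit of comparing models by memory ability.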

Paper link

https://arxiv.org/abs/2409.10482

Further reading

https://x.com/omarsar0/status/1835882330323554321

Jailbreaking Large Language Models with Symbolic Mathematics

Paper introduction

Uses GPT-4o to generate mathematically encoded prompts that serve as an effective jailbreaking technique. It shows an average attack success rate of 73.6% across 13 state-of-the-art LLMs, which highlights the inability of existing safety training mechanisms to generalize to mathematically encoded inputs.

Paper abstract

Recent advancements in AI safety have led to increased efforts in training and red-teaming large language models (LLMs) to mitigate unsafe content generation. However, these safety mechanisms may not be comprehensive, leaving potential vulnerabilities unexplored. This paper introduces MathPrompt, a novel jailbreaking technique that exploits LLMs' advanced capabilities in symbolic mathematics to bypass their safety mechanisms. By encoding harmful natural language prompts into mathematical problems, we demonstrate a critical vulnerability in current AI safety measures. Our experiments across 13 state-of-the-art LLMs reveal an average attack success rate of 73.6%, highlighting the inability of existing safety training mechanisms to generalize to mathematically encoded inputs. Analysis of embedding vectors shows a substantial semantic shift between original and encoded prompts, helping explain the attack's success. This work emphasizes the importance of a holistic approach to AI safety, calling for expanded red-teaming efforts to develop robust safeguards across all potential input types and their associated risks.
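The "semantic shift" between original and encoded prompts is typically quantified by comparing their embedding vectors. The sketch below shows cosine similarity as one such measure. It is an illustration only: the abstract does not state which metric or embedding model the authors used, and the vectors in the usage example are made-up placeholders.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Hypothetical placeholder embeddings, not real model outputs.
original_embedding = [1.0, 0.0, 0.0]
encoded_embedding = [0.0, 1.0, 0.0]
shift = 1.0 - cosine_similarity(original_embedding, encoded_embedding)
```

A larger `shift` means the encoded prompt sits further from the original in embedding space, which is consistent with the abstract's explanation of why safety training keyed to the original phrasing may not trigger.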

This article was prepared with the help of a GPT model, so it may contain some errors. Please also refer to the original article linked below! If you notice anything awkward or incorrect while reading, please let us know in the comments. 🤗

⚠️Advertisement⚠️: Did you find this article, compiled by the 🔥PyTorch Korea User Group🇰🇷, useful? Join as a member and we will send you the top articles by email💌! (The default is Weekly, but it can also be changed to Daily.)

3 comments

savvykang 2024-09-23

The title says June, but the linked post is from September. Could this have been caused by autocomplete?

ninebow 2024-09-23

Oh, you're right;;; thanks for pointing it out. T_T
The title should have been '[2024/09/16 ~ 09/22] Top ML Papers of the Week', but I made a mistake while using the template. If xguru sees this, please change it. 🙇‍♂️

ninebow 2024-09-23

Thank you!!
