[2024/06/17 ~ 06/23] Top ML Papers of the Week

This post is an automatic translation of an article based on the ML papers that DAIR.AI publishes every week.
Two broad trends stand out in this week's selection. First, most of the papers focus on natural language processing (NLP). In particular, language models (LMs) that handle long contexts, information retrieval, and ways to make question answering (QA) systems more efficient are emerging as the main areas of interest. For example, 'Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?' explores what long-context language models can do, while 'PlanRAG' and 'From RAGs to rich parameters' present new approaches to improving information retrieval and question answering systems.
Another notable trend is the effort to reduce memorization in language models, or to improve performance through a self-refine process. 'Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs' and the Monte Carlo Tree Self-refine paper are worth noting from this angle. Reducing memorization matters because it keeps language models from merely repeating their training data, so that they can learn more generalized knowledge and give creative responses. This is one of the keys to maximizing the practical usefulness of language models.
These trends are probably driven by several factors. First, the importance of natural language processing within AI keeps growing, and the technology behind it is advancing quickly. Second, because the volume of information has become so large, there is a growing need for techniques that process it effectively and deliver useful information to users. Finally, recent language models keep getting more complex and more powerful, yet the problems they run into still call for new approaches. To meet these needs, researchers keep looking beyond existing frameworks for new ideas and methodologies.

Claude 3.5 Sonnet

Paper introduction

A new model that achieves state-of-the-art performance on several common benchmarks such as MMLU and HumanEval. It outperforms Claude 3 Opus and GPT-4o on several benchmarks, with the exception of math word problem-solving tasks. It also achieves strong performance on vision tasks, which helps power several new features such as image-text transcription and generation of artifacts.

Paper link

https://www.anthropic.com/news/claude-3-5-sonnet

DeepSeek-Coder-V2

Paper introduction

Competes with closed-source models on code and math generation tasks, achieving 90.2% on HumanEval and 75.7% on MATH. According to the authors' report, these results are higher than those of GPT-4-Turbo-0409. The release includes a 16B and a 236B parameter model, each with a 128K context length.

Paper abstract

We present DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT4-Turbo in code-specific tasks. Specifically, DeepSeek-Coder-V2 is further pre-trained from an intermediate checkpoint of DeepSeek-V2 with additional 6 trillion tokens. Through this continued pre-training, DeepSeek-Coder-V2 substantially enhances the coding and mathematical reasoning capabilities of DeepSeek-V2, while maintaining comparable performance in general language tasks. Compared to DeepSeek-Coder-33B, DeepSeek-Coder-V2 demonstrates significant advancements in various aspects of code-related tasks, as well as reasoning and general capabilities. Additionally, DeepSeek-Coder-V2 expands its support for programming languages from 86 to 338, while extending the context length from 16K to 128K. In standard benchmark evaluations, DeepSeek-Coder-V2 achieves superior performance compared to closed-source models such as GPT4-Turbo, Claude 3 Opus, and Gemini 1.5 Pro in coding and math benchmarks.

Paper link

https://github.com/deepseek-ai/DeepSeek-Coder-V2/blob/main/paper.pdf

TextGrad: Automatic "Differentiation" via Text

Paper introduction

A new framework for automatic differentiation through backpropagation on textual feedback provided by an LLM. The feedback improves individual components, and the natural language suggestions help optimize the computation graph. It works from an objective function alone, without tuning prompts or components. The authors claim best scores on LeetCode-Hard and SoTA performance on GPQA when it is combined with GPT-4o.

Paper abstract

AI is undergoing a paradigm shift, with breakthroughs achieved by systems orchestrating multiple large language models (LLMs) and other complex components. As a result, developing principled and automated optimization methods for compound AI systems is one of the most important new challenges. Neural networks faced a similar challenge in their early days until backpropagation and automatic differentiation transformed the field by making optimization turn-key. Inspired by this, we introduce TextGrad, a powerful framework performing automatic "differentiation" via text. TextGrad backpropagates textual feedback provided by LLMs to improve individual components of a compound AI system. In our framework, LLMs provide rich, general, natural language suggestions to optimize variables in computation graphs, ranging from code snippets to molecular structures. TextGrad follows PyTorch's syntax and abstraction and is flexible and easy-to-use. It works out-of-the-box for a variety of tasks, where the users only provide the objective function without tuning components or prompts of the framework. We showcase TextGrad's effectiveness and generality across a diverse range of applications, from question answering and molecule optimization to radiotherapy treatment planning. Without modifying the framework, TextGrad improves the zero-shot accuracy of GPT-4o in Google-Proof Question Answering from 51% to 55%, yields 20% relative performance gain in optimizing LeetCode-Hard coding problem solutions, improves prompts for reasoning, designs new druglike small molecules with desirable in silico binding, and designs radiation oncology treatment plans with high specificity. TextGrad lays a foundation to accelerate the development of the next-generation of AI systems.
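The backward/step loop described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `TextVariable`, `backward`, `step`, and the two rule-based stubs are hypothetical names that stand in for LLM calls, and none of this is the TextGrad library's actual API. It only shows the control flow: collect textual feedback on a variable, then rewrite the variable using that feedback.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TextVariable:
    """A piece of text in the graph (hypothetical name, not the TextGrad API)."""
    value: str
    requires_grad: bool = True
    feedback: List[str] = field(default_factory=list)  # accumulated textual "gradients"


def backward(variable: TextVariable, critic: Callable[[str], str]) -> None:
    """Ask the critic for feedback on the variable and store it."""
    if variable.requires_grad:
        note = critic(variable.value)
        if note:
            variable.feedback.append(note)


def step(variable: TextVariable, editor: Callable[[str, List[str]], str]) -> None:
    """Rewrite the variable using the stored feedback, then clear the feedback."""
    if variable.feedback:
        variable.value = editor(variable.value, variable.feedback)
        variable.feedback.clear()


# Rule-based stand-ins for the LLM calls, so the sketch runs offline.
def stub_critic(text: str) -> str:
    return "" if "step by step" in text else "Ask the model to reason step by step."


def stub_editor(text: str, feedback: List[str]) -> str:
    if any("step by step" in note for note in feedback):
        return text + " Think step by step."
    return text


prompt = TextVariable("Answer the question.")
for _ in range(3):  # a few optimization iterations
    backward(prompt, stub_critic)
    step(prompt, stub_editor)

print(prompt.value)
```

In a real system both stubs would be LLM calls, and the feedback would flow backward through several connected variables rather than a single prompt.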

Paper link

https://arxiv.org/abs/2406.07496v1

Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?

Paper introduction

Conducts a deep performance analysis of long-context LLMs on in-context retrieval and reasoning. The authors first present a benchmark of real-world tasks requiring 1M-token context. They report that long-context LLMs can rival state-of-the-art retrieval and RAG systems without any explicit training on the tasks, and suggest that compositional reasoning (required in SQL-like tasks) is still challenging for these LLMs. They also call for continued research on advanced prompting strategies, having observed significant performance gains when applying them to long-context problems.

Paper abstract

Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases. Leveraging LCLMs' ability to natively ingest and process entire corpora of information offers numerous advantages. It enhances user-friendliness by eliminating the need for specialized knowledge of tools, provides robust end-to-end modeling that minimizes cascading errors in complex pipelines, and allows for the application of sophisticated prompting techniques across the entire system. To assess this paradigm shift, we introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning. Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks. However, LCLMs still face challenges in areas like compositional reasoning that are required in SQL-like tasks. Notably, prompting strategies significantly influence performance, emphasizing the need for continued research as context lengths grow. Overall, LOFT provides a rigorous testing ground for LCLMs, showcasing their potential to supplant existing paradigms and tackle novel tasks as model capabilities scale.
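The idea of having the model ingest a whole corpus instead of calling a retriever can be sketched as plain prompt construction. The prompt layout, the `[ID: ...]` tag format, and the function name `build_corpus_in_context_prompt` are assumptions made for this illustration and are not taken from the LOFT code. The point is only that every document goes into the context, tagged with an identifier the model can cite.

```python
from typing import Dict


def build_corpus_in_context_prompt(corpus: Dict[str, str], question: str) -> str:
    """Place every document, tagged with its ID, ahead of the question."""
    lines = ["You are given a corpus of documents. Answer using only this corpus."]
    for doc_id, text in corpus.items():
        lines.append(f"[ID: {doc_id}] {text}")
    lines.append(f"Question: {question}")
    lines.append("Reply with the ID of the most relevant document.")
    return "\n".join(lines)


corpus = {
    "d1": "LOFT is a benchmark for long-context language models.",
    "d2": "PlanRAG plans before it retrieves.",
}
prompt = build_corpus_in_context_prompt(corpus, "Which document mentions a benchmark?")
print(prompt)
```

With a real corpus the prompt would run to hundreds of thousands of tokens, which is why the abstract stresses that prompting strategy matters more as context length grows.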

Paper link

https://arxiv.org/abs/2406.13121

Further reading

https://github.com/google-deepmind/loft

https://x.com/omarsar0/status/1804184820806766875

PlanRAG: A Plan-then-Retrieval Augmented Generation for Generative Large Language Models as Decision Makers

Paper introduction

Enhances decision making with a new RAG technique called iterative plan-then-RAG (PlanRAG). It involves two steps: 1) an LM generates a plan for decision making by examining the data schema and the question, and 2) the retriever generates the queries for data analysis. A final step checks whether a new plan is needed for further analysis, and either iterates on the previous steps or makes a decision on the data. PlanRAG is found to be more effective than iterative RAG on the proposed Decision QA tasks.

Paper abstract

In this paper, we study the use of LLMs as a solution for decision making that requires complex data analysis. We define Decision QA as the task of answering the best decision, $d_{best}$, for a decision-making question $Q$, business rules $R$ and a database $D$. Since there is no benchmark that can examine Decision QA, we propose a Decision QA benchmark, DQA. It has two scenarios, Locating and Building, constructed from two video games (Europa Universalis IV and Victoria 3) that have almost the same goal as Decision QA. To address Decision QA effectively, we also propose a new RAG technique called iterative plan-then-retrieval augmented generation (PlanRAG). Our PlanRAG-based LM generates the plan for decision making as the first step, and the retriever generates the queries for data analysis as the second step. The proposed method outperforms the state-of-the-art iterative RAG method by 15.8% in the Locating scenario and by 7.4% in the Building scenario. Our code and benchmark are available at https://github.com/myeon9h/PlanRAG.
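The plan, retrieve, and re-plan cycle described above can be sketched as a small loop. This is a minimal illustration under stated assumptions: `plan_then_retrieve`, `stub_plan`, and `stub_decide` are hypothetical names, the "database" is a flat dictionary, and rule-based stubs replace the LM. It is not the authors' implementation, which is in their repository.

```python
from typing import Callable, Dict, List

Database = Dict[str, float]


def plan_then_retrieve(
    question: str,
    database: Database,
    make_plan: Callable[[str, List[str], Database], List[str]],
    decide: Callable[[Database], str],
    max_rounds: int = 3,
) -> str:
    """Plan, retrieve, then re-plan until no further analysis is needed."""
    schema = sorted(database)          # the planner sees field names, not values
    evidence: Database = {}
    for _ in range(max_rounds):
        plan = make_plan(question, schema, evidence)   # step 1: plan (or re-plan)
        if not plan:                                   # nothing left to analyze
            break
        for key in plan:                               # step 2: run the queries
            evidence[key] = database[key]
    return decide(evidence)


# Rule-based stand-ins for the LM planner and the final decision step.
def stub_plan(question: str, schema: List[str], evidence: Database) -> List[str]:
    return [key for key in schema if key.startswith("profit_") and key not in evidence]


def stub_decide(evidence: Database) -> str:
    return max(evidence, key=evidence.get)


db = {"profit_city_a": 12.0, "profit_city_b": 19.5, "population_city_a": 3.0}
best = plan_then_retrieve("Where should we build?", db, stub_plan, stub_decide)
print(best)
```

The second call to the planner returns an empty plan because all profit fields are already in the evidence, which is the toy equivalent of deciding that no further analysis is required.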

Paper link

https://arxiv.org/abs/2406.12430

Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs

Paper introduction

Introduces a modification to the next-token prediction objective, called goldfish loss, that helps reduce verbatim generation of memorized training data. It uses a simple technique: a pseudorandom subset of training tokens is excluded at training time. The authors show that goldfish loss resists memorization and keeps the model useful, although the model may need to train for longer to learn as effectively from the training data.

Paper abstract

Large language models can memorize and repeat their training data, causing privacy and copyright risks. To mitigate memorization, we introduce a subtle modification to the next-token training objective that we call the goldfish loss. During training, a randomly sampled subset of tokens is excluded from the loss computation. These dropped tokens are not memorized by the model, which prevents verbatim reproduction of a complete chain of tokens from the training set. We run extensive experiments training billion-scale Llama-2 models, both pre-trained and trained from scratch, and demonstrate significant reductions in extractable memorization with little to no impact on downstream benchmarks.
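Excluding a subset of tokens from the loss can be sketched in a few lines of plain Python. This is an illustrative sketch, not the paper's code: the function names, the drop rate `k`, the hash window `context`, and the choice to derive the mask from a hash of the preceding tokens are all assumptions made here to get a deterministic pseudorandom subset.

```python
import hashlib
import math
from typing import List, Sequence


def goldfish_mask(tokens: Sequence[int], k: int = 4, context: int = 3) -> List[bool]:
    """True = token counts toward the loss; roughly 1 in k tokens is dropped.

    The decision depends only on a hash of the preceding `context` tokens, so a
    repeated passage is masked the same way every time (assumption of this sketch).
    """
    mask = []
    for i in range(len(tokens)):
        if i < context:
            mask.append(True)  # not enough history to hash yet: keep the token
            continue
        window = ",".join(str(t) for t in tokens[i - context:i])
        digest = hashlib.sha256(window.encode()).digest()
        mask.append(int.from_bytes(digest[:8], "big") % k != 0)
    return mask


def goldfish_loss(token_probs: Sequence[float], mask: Sequence[bool]) -> float:
    """Mean negative log-likelihood over the kept tokens only."""
    kept = [-math.log(p) for p, keep in zip(token_probs, mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0


tokens = [5, 9, 2, 7, 7, 1, 3, 8, 4, 6, 2, 7]
probs = [0.5] * len(tokens)  # pretend the model assigns p = 0.5 to each target token
mask = goldfish_mask(tokens)
print(mask, goldfish_loss(probs, mask))
```

Because dropped positions never contribute a gradient, the model gets no training signal for reproducing them, which is why a full memorized sequence is hard to regenerate verbatim.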

Paper link

https://arxiv.org/abs/2406.10209

Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B

Paper introduction

Reports GPT-4-level mathematical olympiad solutions using an approach that integrates LLMs with Monte Carlo Tree Search. The approach focuses on improving the system's mathematical reasoning performance through capabilities such as systematic exploration, self-refinement, and self-evaluation.

Paper abstract

This paper introduces the MCT Self-Refine (MCTSr) algorithm, an innovative integration of Large Language Models (LLMs) with Monte Carlo Tree Search (MCTS), designed to enhance performance in complex mathematical reasoning tasks. Addressing the challenges of accuracy and reliability in LLMs, particularly in strategic and mathematical reasoning, MCTSr leverages systematic exploration and heuristic self-refine mechanisms to improve decision-making frameworks within LLMs. The algorithm constructs a Monte Carlo search tree through iterative processes of Selection, self-refine, self-evaluation, and Backpropagation, utilizing an improved Upper Confidence Bound (UCB) formula to optimize the exploration-exploitation balance. Extensive experiments demonstrate MCTSr's efficacy in solving Olympiad-level mathematical problems, significantly improving success rates across multiple datasets, including GSM8K, GSM Hard, MATH, and Olympiad-level benchmarks, including Math Odyssey, AIME, and OlympiadBench. The study advances the application of LLMs in complex reasoning tasks and sets a foundation for future AI integration, enhancing decision-making accuracy and reliability in LLM-driven applications.
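The exploration-exploitation balance that the Selection step relies on can be illustrated with the textbook UCB1 score. Note the substitution: this is the standard UCB1 formula, not the paper's improved variant, which the abstract does not spell out. The exploration constant `c = 1.4` and the example statistics are arbitrary values chosen for the illustration.

```python
import math
from typing import Dict


def ucb1(mean_reward: float, visits: int, parent_visits: int, c: float = 1.4) -> float:
    """Textbook UCB1 score: exploit high reward, explore rarely visited nodes."""
    if visits == 0:
        return math.inf  # unvisited children are tried first
    return mean_reward + c * math.sqrt(math.log(parent_visits) / visits)


def select_child(stats: Dict[str, Dict[str, float]], parent_visits: int) -> str:
    """Pick the candidate answer with the highest UCB1 score."""
    return max(
        stats,
        key=lambda name: ucb1(stats[name]["mean"], int(stats[name]["visits"]), parent_visits),
    )


children = {
    "answer_a": {"mean": 0.70, "visits": 10},
    "answer_b": {"mean": 0.60, "visits": 2},
}
print(select_child(children, parent_visits=12))
```

Here the less-visited candidate wins despite its lower mean reward, because the exploration bonus shrinks as a node accumulates visits.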

Paper link

https://arxiv.org/abs/2406.07394v2

From RAGs to rich parameters: Probing how language models utilize external knowledge over parametric information for factual queries

Paper introduction

Investigates more closely how LLMs utilize external knowledge over parametric information for factual queries. It finds that in a RAG pipeline, LLMs take a "shortcut" and display a strong bias towards utilizing only the context information to answer the question, while relying minimally on their parametric memory.

Paper abstract

Retrieval Augmented Generation (RAG) enriches the ability of language models to reason using external context to augment responses for a given user prompt. This approach has risen in popularity due to its practical use in various applications of language models in search, question answering, and chat-bots. However, the exact nature of how this approach works isn't clearly understood. In this paper, we mechanistically examine the RAG pipeline to highlight that language models take a shortcut and have a strong bias towards utilizing only the context information to answer the question, while relying minimally on their parametric memory. We probe this mechanistic behavior in language models with: (i) Causal Mediation Analysis to show that the parametric memory is minimally utilized when answering a question and (ii) Attention Contributions and Knockouts to show that the last token residual stream does not get enriched from the subject token in the question, but gets enriched from other informative tokens in the context. We find this pronounced shortcut behaviour true across both LLaMa and Phi family of models.

Paper link

https://arxiv.org/abs/2406.12824

Open-Sora

Paper introduction

An open-source video generation model that can generate 16-second 720p videos. It is a 1.1B parameter model trained on more than 30M data samples, and it now supports image-to-video. The release presents an enhanced diffusion model and a video compression network for spatial and temporal compression, which increase the controllability of generations and reduce training costs.

Paper link

Open-Sora 1.2 Report

Further reading

https://discuss.pytorch.kr/t/open-sora-feat-hpc-ai/3794

https://x.com/omarsar0/status/1803176105010171957

Tree Search for Language Model Agents

Paper introduction

Proposes an inference-time tree search algorithm that lets LM agents perform exploration and multi-step reasoning. It is tested on interactive web environments and applied to GPT-4o, where it significantly improves performance. The authors also demonstrate that performance scales with increased test-time compute.

Paper abstract

Autonomous agents powered by language models (LMs) have demonstrated promise in their ability to perform decision-making tasks such as web automation. However, a fundamental challenge remains: LMs, primarily optimized for natural language understanding and generation, struggle with multi-step reasoning, planning, and using environmental feedback when attempting to solve realistic computer tasks. Towards addressing this, we propose an inference-time search algorithm for LM agents to explicitly perform exploration and multi-step planning in interactive web environments. Our approach is a form of best-first tree search that operates within the actual environment space, and is complementary with most existing state-of-the-art agents. It is the first tree search algorithm for LM agents that shows effectiveness on realistic web tasks. On the challenging VisualWebArena benchmark, applying our search algorithm on top of a GPT-4o agent yields a 39.7% relative increase in success rate compared to the same baseline without search, setting a state-of-the-art success rate of 26.4%. On WebArena, search also yields a 28.0% relative improvement over a baseline agent, setting a competitive success rate of 19.2%. Our experiments highlight the effectiveness of search for web agents, and we demonstrate that performance scales with increased test-time compute. We conduct a thorough analysis of our results to highlight improvements from search, limitations, and promising directions for future work.
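Best-first tree search, the family of algorithm named in the abstract, can be sketched with a priority queue. This is a generic illustration under stated assumptions, not the authors' agent: the toy "environment" is a string of clicks, `toy_value` is a hand-written stand-in for an LM-based value function, and `budget` (the number of expansions) plays the role of test-time compute.

```python
import heapq
from typing import Callable, Iterable, List, Tuple


def best_first_search(
    start: str,
    expand: Callable[[str], Iterable[str]],
    value: Callable[[str], float],
    budget: int = 10,
) -> Tuple[str, float]:
    """Expand the most promising state first; return the best state seen."""
    frontier: List[Tuple[float, str]] = [(-value(start), start)]  # max-heap via negation
    best_state, best_score = start, value(start)
    seen = {start}
    for _ in range(budget):  # budget = number of expansions allowed
        if not frontier:
            break
        neg_score, state = heapq.heappop(frontier)
        if -neg_score > best_score:
            best_state, best_score = state, -neg_score
        for child in expand(state):
            if child not in seen:
                seen.add(child)
                heapq.heappush(frontier, (-value(child), child))
    return best_state, best_score


# Toy environment: states are strings of clicks; the goal is the sequence "abc".
def toy_expand(state: str) -> List[str]:
    return [state + ch for ch in "abc"] if len(state) < 3 else []


def toy_value(state: str) -> float:
    return sum(1.0 for got, want in zip(state, "abc") if got == want) / 3


print(best_first_search("", toy_expand, toy_value, budget=10))
```

Running the same search with a smaller `budget` stops at a partial sequence, which mirrors the abstract's observation that performance improves as more compute is spent at inference time.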

This post was prepared with the help of a GPT model, so it may contain errors; please also consult the original article. If you notice anything awkward or incorrect while reading, please let us know in the comments! 🤗
