[2024/03/18 ~ 03/24] Top ML Papers of the Week

(discuss.pytorch.kr)

Posted by ninebow on 2024-03-27

This post is an automated translation of the weekly ML papers roundup published by DAIR.AI.

Research built on large language models (LLMs) is strongly represented among this week's selected papers. Titles such as 'Tool Use in LLMs', 'Step-by-Step Comparisons Make LLMs Better Reasoners', 'LLM4Decompile', 'Agent-FLAN', 'LLMs Leak Proprietary Information', and 'Retrieval-Augmented Fine-Tuning' show that these papers cover a range of LLM application areas, ways to improve performance, and even security issues.
This trend can be read as the result of growing attention to LLMs across AI over the past few years and of efforts to explore how far they can be applied in different research areas. In particular, alongside methodologies for handling existing tasks more efficiently, work on improving reasoning in tool use and problem solving, on new application areas such as software reverse engineering, and on model stability and security is broadening what LLMs can do. Such research also deepens our understanding of how LLMs can be used in real environments and of the problems that may come with that.
In addition, papers such as 'Evolutionary Model Merge' and 'DROID' propose research on model integration and the development process, reflecting growing interest in methods for continually improving and optimizing model performance. This points to an important direction not only for LLMs but for the development and integration of AI technologies more broadly, and it is likely to remain a key topic in future research. Overall, this week's papers offer useful insight into current trends and future directions in LLM research.

Grok-1

Paper introduction

a mixture-of-experts model with 314B parameters which includes the open release of the base model weights and network architecture; the MoE model activates 25% of the weights for a given token and its pretraining cutoff date is October 2023.
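
To make the "25% of the weights for a given token" figure concrete, here is a minimal sketch of top-k expert routing in plain Python. It assumes 8 experts with 2 active per token, which is one configuration consistent with 25%; the expert functions and router scores are made up for illustration, and this is not xAI's implementation.

```python
import math

NUM_EXPERTS = 8  # assumed expert count (illustrative)
TOP_K = 2        # assumed number of experts active per token


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    # Renormalize the router scores over the chosen experts only.
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_forward(x, experts, router_logits, k):
    """Weighted sum of the outputs of the selected experts."""
    routes = top_k_route(router_logits, k)
    return sum(w * experts[i](x) for i, w in routes), routes


# Toy experts: each one just scales its input by a different factor.
experts = [lambda x, s=s: s * x for s in range(1, NUM_EXPERTS + 1)]
# Hypothetical router scores for a single token.
router_logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]

output, routes = moe_forward(1.0, experts, router_logits, TOP_K)
active_fraction = TOP_K / NUM_EXPERTS
print(routes, active_fraction)
```

Only the two selected experts are evaluated for this token, so the expert parameters touched are 2/8 of the total. In a real model, shared layers such as attention are always active, so the exact fraction of weights used depends on the architecture.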

Paper link

https://x.ai/blog/grok-os

Evolutionary Optimization of Model Merging Recipes

Paper introduction

an approach for automating foundation model development by using evolution to combine open-source models; it facilitates cross-domain merging, where a Japanese Math LLM achieved state-of-the-art performance on Japanese LLM benchmarks, even surpassing models with significantly more parameters, despite not being explicitly trained for these tasks.

Paper abstract

We present a novel application of evolutionary algorithms to automate the creation of powerful foundation models. While model merging has emerged as a promising approach for LLM development due to its cost-effectiveness, it currently relies on human intuition and domain knowledge, limiting its potential. Here, we propose an evolutionary approach that overcomes this limitation by automatically discovering effective combinations of diverse open-source models, harnessing their collective intelligence without requiring extensive additional training data or compute. Our approach operates in both parameter space and data flow space, allowing for optimization beyond just the weights of the individual models. This approach even facilitates cross-domain merging, generating models like a Japanese LLM with Math reasoning capabilities. Surprisingly, our Japanese Math LLM achieved state-of-the-art performance on a variety of established Japanese LLM benchmarks, even surpassing models with significantly more parameters, despite not being explicitly trained for such tasks. Furthermore, a culturally-aware Japanese VLM generated through our approach demonstrates its effectiveness in describing Japanese culture-specific content, outperforming previous Japanese VLMs. This work not only contributes new state-of-the-art models back to the open-source community, but also introduces a new paradigm for automated model composition, paving the way for exploring alternative, efficient approaches to foundation model development.
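
The parameter-space half of this idea can be sketched with a toy example: evolve per-layer mixing coefficients between two "models" so that the merged result scores well on a fitness function. Everything below is an illustrative assumption: the models are three-number lists, the fitness function stands in for a benchmark score, and the search is a simple mutate-and-keep-the-best loop rather than the optimizer used in the paper.

```python
import random


def merge(model_a, model_b, alphas):
    """Per-layer linear interpolation: alpha * a + (1 - alpha) * b."""
    return [al * a + (1 - al) * b for a, b, al in zip(model_a, model_b, alphas)]


def fitness(model, target):
    """Toy stand-in for a benchmark score (higher is better)."""
    return -sum((m - t) ** 2 for m, t in zip(model, target))


def evolve(model_a, model_b, target, generations=200, children=8,
           sigma=0.1, seed=0):
    """Greedy evolutionary search over the mixing coefficients."""
    rng = random.Random(seed)
    best = [0.5] * len(model_a)  # start from a plain average
    best_score = fitness(merge(model_a, model_b, best), target)
    for _ in range(generations):
        for _ in range(children):
            # Mutate the current best recipe and clamp it to [0, 1].
            cand = [min(1.0, max(0.0, a + rng.gauss(0, sigma))) for a in best]
            score = fitness(merge(model_a, model_b, cand), target)
            if score > best_score:
                best, best_score = cand, score
    return best, best_score


# Hypothetical "weights" of two source models and a target behaviour.
model_a = [1.0, 0.0, 2.0]
model_b = [0.0, 1.0, 0.0]
target = [0.8, 0.3, 1.0]

start_score = fitness(merge(model_a, model_b, [0.5] * 3), target)
best_alphas, best_score = evolve(model_a, model_b, target)
print(best_alphas, best_score)
```

No gradient steps or extra training data are involved: the search only evaluates merged candidates, which is the cost advantage the abstract points to.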

Paper link

https://arxiv.org/abs/2403.13187

TacticAI: an AI assistant for football tactics

Paper introduction

an AI-powered assistant for football tactics developed and evaluated in collaboration with domain experts from Liverpool FC; the system offers coaches a way to sample and explore alternative player setups for a corner kick routine and select the tactic with the highest predicted likelihood of success; TacticAI's model suggestions are favored over existing tactics 90% of the time and it offers an effective corner kick retrieval system.

Paper abstract

Identifying key patterns of tactics implemented by rival teams, and developing effective responses, lies at the heart of modern football. However, doing so algorithmically remains an open research challenge. To address this unmet need, we propose TacticAI, an AI football tactics assistant developed and evaluated in close collaboration with domain experts from Liverpool FC. We focus on analysing corner kicks, as they offer coaches the most direct opportunities for interventions and improvements. TacticAI incorporates both a predictive and a generative component, allowing the coaches to effectively sample and explore alternative player setups for each corner kick routine and to select those with the highest predicted likelihood of success. We validate TacticAI on a number of relevant benchmark tasks: predicting receivers and shot attempts and recommending player position adjustments. The utility of TacticAI is validated by a qualitative study conducted with football domain experts at Liverpool FC. We show that TacticAI’s model suggestions are not only indistinguishable from real tactics, but also favoured over existing tactics 90% of the time, and that TacticAI offers an effective corner kick retrieval system. TacticAI achieves these results despite the limited availability of gold-standard data, achieving data efficiency through geometric deep learning.

Paper link

https://www.nature.com/articles/s41467-024-45965-x

Tool Use in LLMs

Paper introduction

provides an overview of tool use in LLMs, including a formal definition of the tool-use paradigm, scenarios where LLMs leverage tool usage, and the tasks for which this approach works well; it also provides an analysis of complex tool usage and summarizes testbeds and evaluation metrics across LM tooling works.

Paper abstract

Language models (LMs) are powerful yet mostly for text generation tasks. Tools have substantially enhanced their performance for tasks that require complex skills. However, many works adopt the term “tool” in different ways, raising the question: What is a tool anyway? Subsequently, where and how do tools help LMs? In this survey, we provide a unified definition of tools as external programs used by LMs, and perform a systematic review of LM tooling scenarios and approaches. Grounded on this review, we empirically study the efficiency of various tooling methods by measuring their required compute and performance gains on various benchmarks, and highlight some challenges and potential future research in the field.
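
The definition of a tool as an external program used by an LM can be illustrated with a small dispatcher. This is a sketch under stated assumptions: the `name(x, y)` call syntax, the `TOOLS` registry, and the example output string are all hypothetical and are not taken from the survey.

```python
import re

# Hypothetical registry of external programs the LM is allowed to call.
TOOLS = {
    "add": lambda a, b: a + b,
    "multiply": lambda a, b: a * b,
}

# Matches calls of the form name(number, number) inside generated text.
CALL_PATTERN = re.compile(
    r"(\w+)\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)"
)


def run_tool_calls(lm_output):
    """Replace every recognised tool call in the text with its result."""
    def execute(match):
        name = match.group(1)
        a, b = float(match.group(2)), float(match.group(3))
        if name not in TOOLS:
            return match.group(0)  # leave unknown calls untouched
        return format(TOOLS[name](a, b), "g")
    return CALL_PATTERN.sub(execute, lm_output)


# Pretend this string came from a language model.
result = run_tool_calls("The total is multiply(12, 3.5) dollars.")
print(result)
```

The model only has to emit the call; the arithmetic is done by the external program, which is where tools help on tasks that plain text generation handles poorly.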

Paper link

https://zorazrw.github.io/files/WhatAreToolsAnyway.pdf

Further reading

https://x.com/omarsar0/status/1770497515898433896

RankPrompt: Step-by-Step Comparisons Make Language Models Better Reasoners

Paper introduction

proposes RankPrompt, a prompting method to enable LLMs to self-rank their responses without additional resources; this self-ranking approach ranks candidates through a systematic, step-by-step comparative evaluation; it seems to work well as it leverages the capabilities of LLMs to generate chains of comparisons as demonstrations; RankPrompt significantly enhances the reasoning performance of ChatGPT and GPT-4 on many arithmetic and commonsense reasoning tasks.

Paper abstract

Large Language Models (LLMs) have achieved impressive performance across various reasoning tasks. However, even state-of-the-art LLMs such as ChatGPT are prone to logical errors during their reasoning processes. Existing solutions, such as deploying task-specific verifiers or voting over multiple reasoning paths, either require extensive human annotations or fail in scenarios with inconsistent responses. To address these challenges, we introduce RankPrompt, a new prompting method that enables LLMs to self-rank their responses without additional resources. RankPrompt breaks down the ranking problem into a series of comparisons among diverse responses, leveraging the inherent capabilities of LLMs to generate chains of comparison as contextual exemplars. Our experiments across 11 arithmetic and commonsense reasoning tasks show that RankPrompt significantly enhances the reasoning performance of ChatGPT and GPT-4, with improvements of up to 13%. Moreover, RankPrompt excels in LLM-based automatic evaluations for open-ended tasks, aligning with human judgments 74% of the time in the AlpacaEval dataset. It also exhibits robustness to variations in response order and consistency. Collectively, our results validate RankPrompt as an effective method for eliciting high-quality feedback from language models.
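
A rough sketch of what a self-ranking prompt could look like is shown below. The prompt wording, the `Best candidate:` verdict format, and both helper functions are hypothetical illustrations; the paper's actual prompts and comparison exemplars are not reproduced here.

```python
import re


def build_rank_prompt(question, candidates):
    """Assemble a prompt asking the model to compare candidates step by step."""
    lines = [f"Question: {question}", ""]
    for i, cand in enumerate(candidates, start=1):
        lines.append(f"Candidate {i}: {cand}")
    lines += [
        "",
        "Compare the candidates step by step, checking each reasoning step.",
        "Finish with a line of the form 'Best candidate: <number>'.",
    ]
    return "\n".join(lines)


def parse_best(reply, num_candidates):
    """Extract the chosen candidate index from the model's reply, if any."""
    match = re.search(r"Best candidate:\s*(\d+)", reply)
    if match is None:
        return None
    index = int(match.group(1))
    return index if 1 <= index <= num_candidates else None


candidates = ["3 + 4 = 8, so the answer is 8.", "3 + 4 = 7, so the answer is 7."]
prompt = build_rank_prompt("What is 3 + 4?", candidates)
print(prompt)

# A made-up model reply, used only to show the parsing step.
choice = parse_best("Candidate 1 miscalculates the sum.\nBest candidate: 2",
                    len(candidates))
```

The same model that produced the candidates receives this prompt, which is why the method needs no separate verifier or human annotation.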

Paper link

https://arxiv.org/abs/2403.12373

Further reading

https://x.com/omarsar0/status/1770492690129359135

LLM4Decompile: Decompiling Binary Code with Large Language Models

Paper introduction

a family of open-access decompilation LLMs ranging from 1B to 33B parameters; these models are trained on 4 billion tokens of C source code and corresponding assembly code; the authors also introduce Decompile-Eval, a dataset for assessing re-compilability and re-executability for decompilation and evaluating from the perspective of program semantics; LLM4Decompile demonstrates the capability to decompile 21% of the assembly code, achieving a 50% improvement over GPT-4.

Paper abstract

Decompilation aims to restore compiled code to human-readable source code, but struggles with details like names and structure. Large language models (LLMs) show promise for programming tasks, motivating their application to decompilation. However, there does not exist any open-source LLM for decompilation. Moreover, existing decompilation evaluation systems mainly consider token-level accuracy and largely ignore code executability, which is the most important feature of any program. Therefore, we release the first open-access decompilation LLMs ranging from 1B to 33B pre-trained on 4 billion tokens of C source code and the corresponding assembly code. The open-source LLMs can serve as baselines for further development in the field. To ensure practical program evaluation, we introduce Decompile-Eval, the first dataset that considers re-compilability and re-executability for decompilation. The benchmark emphasizes the importance of evaluating the decompilation model from the perspective of program semantics. Experiments indicate that our LLM4Decompile has demonstrated the capability to accurately decompile 21% of the assembly code, which achieves a 50% improvement over GPT-4. Our code, dataset, and models are released at https://github.com/albertan017/LLM4Decompile
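
The two metrics named above can be sketched as simple rates over per-sample outcomes. The `Sample` record and the example data are hypothetical. The 14% baseline in the last lines is an assumed value chosen only to show how a relative improvement of 50% would be computed; the source does not state GPT-4's score.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    recompiled: bool    # decompiled source compiled without errors
    tests_passed: bool  # recompiled program passed its test assertions


def decompile_rates(samples):
    """Return (re-compilability, re-executability) as fractions of all samples."""
    n = len(samples)
    recompilable = sum(s.recompiled for s in samples)
    # A program only counts as re-executable if it also compiled.
    reexecutable = sum(s.recompiled and s.tests_passed for s in samples)
    return recompilable / n, reexecutable / n


def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, e.g. 0.5 means 50% better."""
    return (new - baseline) / baseline


# Made-up outcomes for four decompiled functions.
samples = [
    Sample(recompiled=True, tests_passed=True),
    Sample(recompiled=True, tests_passed=False),
    Sample(recompiled=False, tests_passed=False),
    Sample(recompiled=True, tests_passed=True),
]
recompile_rate, reexec_rate = decompile_rates(samples)

# Assumed baseline of 14%, for illustration only.
gain = relative_improvement(0.21, 0.14)
print(recompile_rate, reexec_rate, gain)
```

Re-executability is the stricter of the two rates: code can compile and still behave differently from the original, which is why the benchmark stresses program semantics over token-level accuracy.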

Paper link

https://arxiv.org/abs/2403.05286v1

Agent-FLAN: Designing Data and Methods of Effective Agent Tuning for Large Language Models

Paper introduction

designs data and methods to effectively fine-tune language models for agents, referred to as Agent-FLAN; this enables Llama2-7B to outperform prior best works by 3.5% across various agent evaluation datasets; Agent-FLAN greatly alleviates hallucination issues and consistently improves the agent capability of LLMs when scaling model sizes, while generally improving the LLM.

Paper abstract

Open-sourced Large Language Models (LLMs) have achieved great success in various NLP tasks, however, they are still far inferior to API-based models when acting as agents. How to integrate agent ability into general LLMs becomes a crucial and urgent problem. This paper first delivers three key observations: (1) the current agent training corpus is entangled with both formats following and agent reasoning, which significantly shifts from the distribution of its pre-training data; (2) LLMs exhibit different learning speeds on the capabilities required by agent tasks; and (3) current approaches have side-effects when improving agent abilities by introducing hallucinations. Based on the above findings, we propose Agent-FLAN to effectively Fine-tune LANguage models for Agents. Through careful decomposition and redesign of the training corpus, Agent-FLAN enables Llama2-7B to outperform prior best works by 3.5% across various agent evaluation datasets. With comprehensively constructed negative samples, Agent-FLAN greatly alleviates the hallucination issues based on our established evaluation benchmark. Besides, it consistently improves the agent capability of LLMs when scaling model sizes while slightly enhancing the general capability of LLMs. The code will be available at https://github.com/InternLM/Agent-FLAN.

Paper link

https://arxiv.org/abs/2403.12881v1

Logits of API-Protected LLMs Leak Proprietary Information

Paper introduction

shows that it is possible to learn a large amount of non-public information about an API-protected LLM using the logits; with a relatively small number of API queries, the approach estimates the embedding size of OpenAI's gpt-3.5-turbo to be about 4,096; the paper also proposes guardrails against the attacks used.

Paper abstract

The commercialization of large language models (LLMs) has led to the common practice of high-level API-only access to proprietary models. In this work, we show that even with a conservative assumption about the model architecture, it is possible to learn a surprisingly large amount of non-public information about an API-protected LLM from a relatively small number of API queries (e.g., costing under $1,000 for OpenAI's gpt-3.5-turbo). Our findings are centered on one key observation: most modern LLMs suffer from a softmax bottleneck, which restricts the model outputs to a linear subspace of the full output space. We show that this lends itself to a model image or a model signature which unlocks several capabilities with affordable cost: efficiently discovering the LLM's hidden size, obtaining full-vocabulary outputs, detecting and disambiguating different model updates, identifying the source LLM given a single full LLM output, and even estimating the output layer parameters. Our empirical investigations show the effectiveness of our methods, which allow us to estimate the embedding size of OpenAI's gpt-3.5-turbo to be about 4,096. Lastly, we discuss ways that LLM providers can guard against these attacks, as well as how these capabilities can be viewed as a feature (rather than a bug) by allowing for greater transparency and accountability.
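
The softmax bottleneck can be demonstrated with a toy model: if logits are computed as an output matrix (vocabulary size by hidden size) times a hidden state, then any collection of logit vectors has rank at most the hidden size. The sketch below uses exact rational arithmetic and made-up matrices with a vocabulary of 12 and a hidden size of 3. It illustrates the principle only and is not the procedure the paper runs against real API outputs.

```python
from fractions import Fraction


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def matrix_rank(matrix):
    """Exact rank via Gauss-Jordan elimination over the rationals."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    rank = 0
    for col in range(len(rows[0])):
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col] != 0),
                     None)
        if pivot is None:
            continue  # no pivot in this column
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        pivot_row = rows[rank]
        for r in range(len(rows)):
            if r != rank and rows[r][col] != 0:
                factor = rows[r][col] / pivot_row[col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], pivot_row)]
        rank += 1
    return rank


VOCAB, HIDDEN, QUERIES = 12, 3, 8  # toy sizes, chosen for illustration

# Output layer W (VOCAB x HIDDEN) and hidden states H (HIDDEN x QUERIES),
# built deterministically so both have full rank.
W = [[i ** p for p in range(HIDDEN)] for i in range(1, VOCAB + 1)]
H = [[j ** p for j in range(1, QUERIES + 1)] for p in range(HIDDEN)]

logits = matmul(W, H)  # one 12-dimensional logit vector per query (column)
estimated_hidden = matrix_rank(logits)
print(estimated_hidden)
```

Although each logit vector has 12 entries, the 8 collected vectors span only 3 dimensions, so an observer who sees the outputs alone can recover the hidden size. With real models the outputs are noisy floating-point values, so a numerical method would be needed in place of exact elimination.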

Paper link

https://arxiv.org/abs/2403.09539

Further reading

https://x.com/DimitrisPapail/status/1768654579254579385

DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

Paper introduction

an open-source, large-scale robot manipulation dataset to train and build more capable and robust robotic manipulation policies; it contains 76K demonstration trajectories, collected across 564 scenes and 86 tasks; training with DROID leads to higher-performing policies and better generalization.

Paper abstract

The creation of large, diverse, high-quality robot manipulation datasets is an important stepping stone on the path toward more capable and robust robotic manipulation policies. However, creating such datasets is challenging: collecting robot manipulation data in diverse environments poses logistical and safety challenges and requires substantial investments in hardware and human labour. As a result, even the most general robot manipulation policies today are mostly trained on data collected in a small number of environments with limited scene and task diversity. In this work, we introduce DROID (Distributed Robot Interaction Dataset), a diverse robot manipulation dataset with 76k demonstration trajectories or 350 hours of interaction data, collected across 564 scenes and 84 tasks by 50 data collectors in North America, Asia, and Europe over the course of 12 months. We demonstrate that training with DROID leads to policies with higher performance and improved generalization ability. We open source the full dataset, policy learning code, and a detailed guide for reproducing our robot hardware setup.

Paper link

https://arxiv.org/abs/2403.12945

Further reading

https://x.com/chelseabfinn/status/1770311755140575413

RAFT: Adapting Language Model to Domain Specific RAG

Paper introduction

combines the benefits of RAG and fine-tuning to improve a model's ability to answer questions in "open-book" in-domain settings; combining it with RAFT's CoT-style response helps to improve reasoning.

Paper abstract

Pretraining Large Language Models (LLMs) on large corpora of textual data is now a standard paradigm. When using these LLMs for many downstream applications, it is common to additionally bake in new knowledge (e.g., time-critical news, or private domain knowledge) into the pretrained model either through RAG-based-prompting, or fine-tuning. However, the optimal methodology for the model to gain such new knowledge remains an open question. In this paper, we present Retrieval Augmented FineTuning (RAFT), a training recipe that improves the model's ability to answer questions in a "open-book" in-domain settings. In RAFT, given a question, and a set of retrieved documents, we train the model to ignore those documents that don't help in answering the question, which we call, distractor documents. RAFT accomplishes this by citing verbatim the right sequence from the relevant document that would help answer the question. This coupled with RAFT's chain-of-thought-style response helps improve the model's ability to reason. In domain-specific RAG, RAFT consistently improves the model's performance across PubMed, HotpotQA, and Gorilla datasets, presenting a post-training recipe to improve pre-trained LLMs to in-domain RAG. RAFT's code and demo are open-sourced at github.com/ShishirPatil/gorilla.
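To make the abstract's description more concrete, here is a minimal sketch of how a single RAFT-style training example might be assembled: the relevant document is mixed with distractor documents, and the target response quotes the supporting passage verbatim before giving the answer. This is an illustration under stated assumptions, not the authors' implementation. The function name `build_raft_example`, the `RaftExample` type, the `[doc N]` labels, and the `Reasoning: ... Answer: ...` template are all hypothetical; the actual data format is in the repository linked above.

```python
import random
from dataclasses import dataclass


@dataclass
class RaftExample:
    prompt: str  # retrieved documents (relevant + distractors) plus the question
    target: str  # chain-of-thought-style response that quotes the relevant document


def build_raft_example(question, relevant_doc, supporting_quote, answer,
                       distractor_pool, num_distractors=2, seed=0):
    """Assemble one illustrative training example (hypothetical format)."""
    # The cited passage must appear word for word in the relevant document.
    if supporting_quote not in relevant_doc:
        raise ValueError("supporting_quote must appear verbatim in relevant_doc")

    rng = random.Random(seed)
    # Sample distractors: documents that do not help answer the question.
    distractors = rng.sample(distractor_pool, num_distractors)

    # Shuffle so the model cannot rely on the relevant document's position.
    documents = distractors + [relevant_doc]
    rng.shuffle(documents)

    context = "\n".join(f"[doc {i}] {doc}" for i, doc in enumerate(documents, 1))
    prompt = f"{context}\n\nQuestion: {question}"
    # Hypothetical response template: quote first, then the final answer.
    target = f'Reasoning: the context states "{supporting_quote}". Answer: {answer}'
    return RaftExample(prompt=prompt, target=target)


# Usage with made-up documents.
pool = [
    "Paris is the capital of France.",
    "The Nile flows through Egypt.",
    "Mount Fuji is in Japan.",
]
doc = "Canberra is the capital of Australia, not Sydney."
example = build_raft_example(
    question="What is the capital of Australia?",
    relevant_doc=doc,
    supporting_quote="Canberra is the capital of Australia",
    answer="Canberra",
    distractor_pool=pool,
)
print(example.prompt)
print(example.target)
```

Pairs like this would then be fed to an ordinary supervised fine-tuning loop. The verbatim-quote check is the part that mirrors the abstract most directly, since the target has to cite the relevant document rather than paraphrase it.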

Paper link

https://arxiv.org/abs/2403.10131

Original article

https://nlp.elvissaravia.com/p/top-ml-papers-of-the-week-01b

This article was compiled with the help of a GPT model, so it may contain errors; please read it alongside the original article. If you come across anything awkward or incorrect while reading, please let us know in the comments.

