[2023/10/16 ~ 10/22] Top ML Papers of the Week

(discuss.pytorch.kr)

Posted by ninebow, 2023-10-23

Overview

This post is a machine translation of an article on the ML papers that DAIR.AI publishes each week.
The papers selected this week follow two broad trends. The first is open-domain dialogue systems; the second is research aimed at enabling AI to generate explanations or solve problems on its own.
Open-domain dialogue systems let an AI system converse naturally with users, and the theme appears in papers such as "OpenAgents", "LLMs for Software Engineering", and "Eliciting Human Preferences with LLMs". These papers study how an AI system can learn and improve through dialogue with users.
Research on AI generating explanations or solving problems by itself appears in papers such as "A Study of LLM-Generated Self-Explanations", "Self-RAG", and "Retrieval-Augmentation for Long-form Question Answering". The main goal of these papers is to make the AI's problem-solving or explanation-generation process understandable and transparent to users. The trend seems natural, since research that makes AI a more transparent and broadly useful technology keeps growing in importance.

Llemma: An Open Language Model For Mathematics

Paper introduction

An LLM for mathematics built by continued pretraining of Code Llama on the Proof-Pile-2 dataset; the dataset comprises scientific papers, web data containing mathematics, and mathematical code; Llemma outperforms open base models and the unreleased Minerva on the MATH benchmark; the model is released along with the dataset and code to replicate the experiments.

Paper abstract

We present Llemma, a large language model for mathematics. We continue pretraining Code Llama on the Proof-Pile-2, a mixture of scientific papers, web data containing mathematics, and mathematical code, yielding Llemma. On the MATH benchmark Llemma outperforms all known open base models, as well as the unreleased Minerva model suite on an equi-parameter basis. Moreover, Llemma is capable of tool use and formal theorem proving without any further finetuning. We openly release all artifacts, including 7 billion and 34 billion parameter models, the Proof-Pile-2, and code to replicate our experiments.

Paper link

https://arxiv.org/abs/2310.10631

Large Language Models for Software Engineering: Survey and Open Problems

Paper introduction

A comprehensive survey of LLMs for software engineering, including open research and technical challenges.

Paper abstract

This paper provides a survey of the emerging area of Large Language Models (LLMs) for Software Engineering (SE). It also sets out open research challenges for the application of LLMs to technical problems faced by software engineers. LLMs' emergent properties bring novelty and creativity with applications right across the spectrum of Software Engineering activities including coding, design, requirements, repair, refactoring, performance improvement, documentation and analytics. However, these very same emergent properties also pose significant technical challenges; we need techniques that can reliably weed out incorrect solutions, such as hallucinations. Our survey reveals the pivotal role that hybrid techniques (traditional SE plus LLMs) have to play in the development and deployment of reliable, efficient and effective LLM-based SE.
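The hybrid idea in the abstract (pairing an LLM with traditional SE checks to weed out incorrect solutions) can be illustrated with a test-based filter. This is only a sketch under stated assumptions, not a method taken from the survey: the candidate strings stand in for LLM outputs, and `CANDIDATES`, `TESTS`, `passes_tests`, and `filter_candidates` are hypothetical names.

```python
from typing import List, Tuple

# Hypothetical stand-ins for LLM-generated candidate implementations of `add`.
CANDIDATES = [
    "def add(a, b):\n    return a - b",  # plausible-looking but wrong
    "def add(a, b):\n    return a + b",  # correct
    "def add(a, b):\n    return a +",    # does not even parse
]

# Traditional SE side of the hybrid: a small unit-test suite as (args, expected).
TESTS: List[Tuple[Tuple[int, int], int]] = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]


def passes_tests(source: str, tests) -> bool:
    """Execute one candidate in its own namespace and run every test on it."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # NOTE: only acceptable for trusted or sandboxed code
        func = namespace["add"]
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        # Syntax errors and runtime failures both count as rejection.
        return False


def filter_candidates(candidates, tests) -> List[str]:
    """Keep only the candidates that pass the whole test suite."""
    return [c for c in candidates if passes_tests(c, tests)]


accepted = filter_candidates(CANDIDATES, TESTS)
```

Passing tests does not prove correctness, so a filter like this narrows the candidate set rather than certifying it.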

Paper link

https://arxiv.org/abs/2310.03533

Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

Paper introduction

Presents a new retrieval-augmented framework that enhances an LM's quality and factuality through retrieval and self-reflection; trains an LM that adaptively retrieves passages on demand, and generates and reflects on the passages and its own generations using special reflection tokens; it significantly outperforms SOTA LLMs (ChatGPT and retrieval-augmented Llama2-chat) on open-domain QA, reasoning, and fact verification tasks, including factuality improvements.

Paper abstract

Despite their remarkable capabilities, large language models (LLMs) often produce responses containing factual inaccuracies due to their sole reliance on the parametric knowledge they encapsulate. Retrieval-Augmented Generation (RAG), an ad hoc approach that augments LMs with retrieval of relevant knowledge, decreases such issues. However, indiscriminately retrieving and incorporating a fixed number of retrieved passages, regardless of whether retrieval is necessary, or passages are relevant, diminishes LM versatility or can lead to unhelpful response generation. We introduce a new framework called Self-Reflective Retrieval-Augmented Generation (Self-RAG) that enhances an LM's quality and factuality through retrieval and self-reflection. Our framework trains a single arbitrary LM that adaptively retrieves passages on-demand, and generates and reflects on retrieved passages and its own generations using special tokens, called reflection tokens. Generating reflection tokens makes the LM controllable during the inference phase, enabling it to tailor its behavior to diverse task requirements. Experiments show that Self-RAG (7B and 13B parameters) significantly outperforms state-of-the-art LLMs and retrieval-augmented models on a diverse set of tasks. Specifically, Self-RAG outperforms ChatGPT and retrieval-augmented Llama2-chat on Open-domain QA, reasoning and fact verification tasks, and it shows significant gains in improving factuality and citation accuracy for long-form generations relative to these models.
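The control flow described in the abstract (retrieve only when needed, then generate and critique per passage) can be sketched as follows. This is a toy illustration, not the authors' implementation: in Self-RAG a single trained LM emits the reflection tokens, whereas here every model call is replaced by a hand-written stub, and all names (`predict_retrieve`, `generate_with_critique`, the `relevance` and `support` scores, the tiny `CORPUS`) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List

# Toy corpus; a real system would query a retriever over a large index.
CORPUS = [
    "Llemma is a language model for mathematics.",
    "Self-RAG trains a language model to emit reflection tokens.",
    "AutoMix routes queries between small and large language models.",
]


def words(text: str) -> set:
    """Lowercased word set with basic punctuation stripped."""
    return {w.strip(".,?!").lower() for w in text.split()}


def predict_retrieve(query: str) -> bool:
    """Stub for the 'should I retrieve?' decision.
    Assumption: 'what/who/which' questions trigger retrieval."""
    return bool(words(query) & {"what", "who", "which"})


def retrieve(query: str, k: int = 2) -> List[str]:
    """Rank passages by word overlap with the query (stand-in retriever)."""
    q = words(query)
    return sorted(CORPUS, key=lambda p: len(q & words(p)), reverse=True)[:k]


@dataclass
class Candidate:
    text: str
    relevance: float  # stub critique: is the passage relevant to the query?
    support: float    # stub critique: is the output supported by the passage?


def generate_with_critique(query: str, passage: str) -> Candidate:
    """Stub generation: 'answer' by quoting the passage, then score it."""
    q, p = words(query), words(passage)
    relevance = len(q & p) / max(len(q), 1)
    return Candidate(text=passage, relevance=relevance, support=1.0)


def self_rag_answer(query: str) -> str:
    """Retrieve on demand, generate one candidate per passage, keep the best."""
    if not predict_retrieve(query):
        return "(answer from parametric knowledge only)"
    candidates = [generate_with_critique(query, p) for p in retrieve(query)]
    best = max(candidates, key=lambda c: c.relevance + c.support)
    return best.text
```

Changing how the critique scores are weighted in `self_rag_answer` is the toy analogue of tailoring behavior at inference time, which the abstract attributes to reflection tokens.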

Paper link

https://arxiv.org/abs/2310.11511

Understanding Retrieval Augmentation for Long-Form Question Answering

Paper introduction

Explores retrieval-augmented language models on long-form question answering; finds that retrieval is an important component but evidence documents should be carefully added to the LLM; finds that attribution errors happen more frequently when retrieved documents lack sufficient information/evidence for answering the question.

Paper abstract

We present a study of retrieval-augmented language models (LMs) on long-form question answering. We analyze how retrieval augmentation impacts different LMs, by comparing answers generated from models while using the same evidence documents, and how differing quality of retrieval document set impacts the answers generated from the same LM. We study various attributes of generated answers (e.g., fluency, length, variance) with an emphasis on the attribution of generated long-form answers to in-context evidence documents. We collect human annotations of answer attribution and evaluate methods for automatically judging attribution. Our study provides new insights on how retrieval augmentation impacts long, knowledge-rich text generation of LMs. We further identify attribution patterns for long text generation and analyze the main culprits of attribution errors. Together, our analysis reveals how retrieval augmentation impacts long knowledge-rich text generation and provides directions for future work.
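To make "attribution of generated long-form answers to in-context evidence documents" concrete, here is a minimal sentence-level check. It is a sketch under a strong assumption: lexical overlap is used as a crude stand-in for an automatic attribution judge, and it is not one of the methods evaluated in the paper. The function names and the 0.7 threshold are hypothetical.

```python
import re
from typing import List, Tuple


def tokens(text: str) -> set:
    """Lowercased alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_supported(sentence: str, documents: List[str], threshold: float = 0.7) -> bool:
    """Treat a sentence as attributed if enough of its tokens occur in one document.
    Assumption: token overlap is only a rough proxy for real entailment."""
    sent = tokens(sentence)
    if not sent:
        return True
    return any(len(sent & tokens(doc)) / len(sent) >= threshold for doc in documents)


def attribution_report(answer: str, documents: List[str]) -> List[Tuple[str, bool]]:
    """Return (sentence, supported) pairs for each sentence of a long-form answer."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    return [(s, is_supported(s, documents)) for s in sentences]


docs = ["The Eiffel Tower is in Paris and was completed in 1889."]
answer = "The Eiffel Tower is in Paris. It was designed by aliens."
report = attribution_report(answer, docs)
```

In this toy case the first sentence is marked as supported and the second is not, which mirrors the kind of per-sentence judgment the human annotations in the study provide.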

Paper link

https://arxiv.org/abs/2310.12150

GenBench

Paper introduction

Presents a framework for characterizing and understanding generalization research in NLP; involves a meta-analysis of 543 papers and a set of tools to explore and better understand generalization studies.

Paper link

https://nature.com/articles/s42256-023-00729-y/…

Can Large Language Models Explain Themselves? A Study of LLM-Generated Self-Explanations

Paper introduction

Assesses an LLM's capability to self-generate feature attribution explanations; self-explanation is useful to improve performance and truthfulness in LLMs; this capability can be used together with chain-of-thought prompting.

Paper abstract

Large language models (LLMs) such as ChatGPT have demonstrated superior performance on a variety of natural language processing (NLP) tasks including sentiment analysis, mathematical reasoning and summarization. Furthermore, since these models are instruction-tuned on human conversations to produce "helpful" responses, they can and often will produce explanations along with the response, which we call self-explanations. For example, when analyzing the sentiment of a movie review, the model may output not only the positivity of the sentiment, but also an explanation (e.g., by listing the sentiment-laden words such as "fantastic" and "memorable" in the review). How good are these automatically generated self-explanations? In this paper, we investigate this question on the task of sentiment analysis and for feature attribution explanation, one of the most commonly studied settings in the interpretability literature (for pre-ChatGPT models). Specifically, we study different ways to elicit the self-explanations, evaluate their faithfulness on a set of evaluation metrics, and compare them to traditional explanation methods such as occlusion or LIME saliency maps. Through an extensive set of experiments, we find that ChatGPT's self-explanations perform on par with traditional ones, but are quite different from them according to various agreement metrics, meanwhile being much cheaper to produce (as they are generated along with the prediction). In addition, we identified several interesting characteristics of them, which prompt us to rethink many current model interpretability practices in the era of ChatGPT(-like) LLMs.
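The comparison in the abstract, between a model's self-explanation and a traditional method such as occlusion, can be sketched on a toy example. Everything here is an assumption for illustration: a lexicon-based scorer stands in for the model being explained, the self-explanation is hard-coded rather than elicited from an LLM, and the top-k overlap is just one simple agreement measure, not necessarily one used in the paper.

```python
from typing import Callable, Dict, List

# Toy lexicon classifier standing in for the model being explained (assumption).
LEXICON = {"fantastic": 2.0, "memorable": 1.5, "boring": -2.0, "not": -0.5}


def sentiment_score(words: List[str]) -> float:
    """Sum of per-word sentiment weights; unknown words contribute 0."""
    return sum(LEXICON.get(w.lower(), 0.0) for w in words)


def occlusion_saliency(words: List[str],
                       score_fn: Callable[[List[str]], float]) -> Dict[str, float]:
    """Saliency of a word = drop in score when that word is removed.
    Assumes each word occurs once, since the result is keyed by word."""
    base = score_fn(words)
    return {w: base - score_fn(words[:i] + words[i + 1:]) for i, w in enumerate(words)}


def top_k(saliency: Dict[str, float], k: int) -> set:
    """The k words with the largest absolute saliency."""
    return set(sorted(saliency, key=lambda w: abs(saliency[w]), reverse=True)[:k])


review = "A fantastic and memorable film".split()
saliency = occlusion_saliency(review, sentiment_score)

# Hypothetical self-explanation: words the LLM itself lists as important.
self_explanation = {"fantastic", "film"}

# Agreement = fraction of the occlusion top-2 that the self-explanation also names.
agreement = len(top_k(saliency, 2) & self_explanation) / 2
```

Occlusion needs one extra model call per word, while a self-explanation arrives with the prediction, which is the cost difference the abstract points out.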

Paper link

https://arxiv.org/abs/2310.11207

OpenAgents: An Open Platform for Language Agents in the Wild

Paper introduction

An open platform for using and hosting language agents in the wild; includes three agents: a data agent for data analysis, a plugins agent with 200+ daily API tools, and a web agent for autonomous web browsing.

Paper abstract

Language agents show the potential to use natural language for varied and complex tasks in diverse environments, particularly when built on large language models (LLMs). Current language agent frameworks aim to make it easier to build proof-of-concept language agents, but they neglect non-expert users' access to agents and pay little attention to application-level design. We present OpenAgents, an open platform for using and hosting language agents in the wild of everyday life. OpenAgents includes three agents: (1) Data Agent for data analysis with Python/SQL and data tools; (2) Plugins Agent with 200+ daily API tools; (3) Web Agent for autonomous web browsing. OpenAgents lets general users interact with agent functionalities through a web user interface optimized for swift responses and common failures, while offering developers and researchers a seamless deployment experience on local setups, providing a foundation for crafting innovative language agents and facilitating real-world evaluations. We elucidate the challenges and opportunities, aiming to set a foundation for future research and development of real-world language agents.

Paper link

https://arxiv.org/abs/2310.10634v1

Read more

https://x.com/ChengZhoujun/status/1714343204148113860

Eliciting Human Preferences with Language Models

Paper introduction

Uses language models to guide the task specification process; proposes a learning framework that helps models elicit and infer intended behavior through free-form, language-based interaction with users; shows that by generating open-ended questions, the system elicits responses that are more informative than user-written prompts.

Paper abstract

Language models (LMs) can be directed to perform target tasks by using labeled examples or natural language prompts. But selecting examples or writing prompts can be challenging, especially in tasks that involve unusual edge cases, demand precise articulation of nebulous preferences, or require an accurate mental model of LM behavior. We propose to use LMs themselves to guide the task specification process. In this paper, we introduce Generative Active Task Elicitation (GATE): a learning framework in which models elicit and infer intended behavior through free-form, language-based interaction with users. We study GATE in three domains: email validation, content recommendation, and moral reasoning. In preregistered experiments, we show that LMs prompted to perform GATE (e.g., by generating open-ended questions or synthesizing informative edge cases) elicit responses that are often more informative than user-written prompts or labels. Users report that interactive task elicitation requires less effort than prompting or example labeling and surfaces novel considerations not initially anticipated by users. Our findings suggest that LM-driven elicitation can be a powerful tool for aligning models to complex human preferences and values.
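The interaction pattern behind GATE (the model asks, the user answers, and later decisions are conditioned on the transcript) can be sketched for the content recommendation domain. This is a toy illustration, not the paper's code: the question generator and the decision step are hard-coded stubs where the framework would prompt an LM, and `ask_question`, `elicit`, `predict`, and `simulated_user` are hypothetical names.

```python
from typing import Callable, List, Tuple

Transcript = List[Tuple[str, str]]  # (question, answer) pairs


def ask_question(transcript: Transcript) -> str:
    """Stub for the LM's open-ended question generator (fixed questions here)."""
    questions = [
        "What topics do you enjoy reading about?",
        "Are there topics you want to avoid?",
    ]
    return questions[len(transcript) % len(questions)]


def elicit(user: Callable[[str], str], turns: int = 2) -> Transcript:
    """Run the interactive elicitation loop for a fixed number of turns."""
    transcript: Transcript = []
    for _ in range(turns):
        question = ask_question(transcript)
        transcript.append((question, user(question)))
    return transcript


def predict(transcript: Transcript, item: str) -> bool:
    """Stub for the decision conditioned on the transcript: recommend an item
    if it mentions a liked topic and no avoided topic."""
    likes = transcript[0][1].lower().split(", ")
    avoids = transcript[1][1].lower().split(", ")
    text = item.lower()
    return any(t in text for t in likes) and not any(t in text for t in avoids)


def simulated_user(question: str) -> str:
    """Hypothetical user with fixed preferences, used only for the demo."""
    return "politics" if "avoid" in question else "robotics, mathematics"


transcript = elicit(simulated_user)
```

With this transcript, `predict(transcript, "New results in robotics planning")` recommends the item, while an item that also mentions politics is rejected.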

Paper link

https://arxiv.org/abs/2310.11589

Read more

https://x.com/AlexTamkin/status/1715040019520569395

AutoMix: Automatically Mixing Language Models

Paper introduction

An approach to route queries to LLMs based on the correctness of smaller language models (done via few-shot self-verification); a meta-verifier is introduced to check the verifier's output (typically a smaller model) and route the query to a larger language model if needed. Experiments using Llama2-13/70B on five context-grounded reasoning datasets demonstrate that AutoMix surpasses established baselines, improving the incremental benefit per cost by up to 89%.

Paper abstract

Large language models (LLMs) are now available in various sizes and configurations from cloud API providers. While this diversity offers a broad spectrum of choices, effectively leveraging the options to optimize computational cost and performance remains challenging. In this work, we present AutoMix, an approach that strategically routes queries to larger LMs, based on the approximate correctness of outputs from a smaller LM. Central to AutoMix is a few-shot self-verification mechanism, which estimates the reliability of its own outputs without requiring training. Given that verifications can be noisy, we employ a meta verifier in AutoMix to refine the accuracy of these assessments. Our experiments using LLAMA2-13/70B, on five context-grounded reasoning datasets demonstrate that AutoMix surpasses established baselines, improving the incremental benefit per cost by up to 89%. Our code and data are available at https://github.com/automix-llm/automix.
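The routing idea can be sketched in a few lines. This is a simplified illustration rather than the released AutoMix code: the models and the verifier are stubs, a fixed confidence threshold stands in for the meta verifier, and the `ibc` function reflects one natural reading of "incremental benefit per cost" (performance gained over the small model divided by the extra cost). All numbers are hypothetical.

```python
from typing import Callable, List, Tuple


def verifier_confidence(votes: List[bool]) -> float:
    """Fraction of sampled self-verifications that judged the answer correct."""
    return sum(votes) / len(votes)


def route(query: str,
          small_lm: Callable[[str], str],
          large_lm: Callable[[str], str],
          verify: Callable[[str, str], List[bool]],
          threshold: float = 0.5) -> Tuple[str, str]:
    """Answer with the small model; escalate when verification confidence is low.
    Assumption: a fixed threshold stands in for the meta verifier."""
    answer = small_lm(query)
    if verifier_confidence(verify(query, answer)) >= threshold:
        return answer, "small"
    return large_lm(query), "large"


def ibc(perf: float, cost: float, perf_small: float, cost_small: float) -> float:
    """Incremental benefit per cost relative to always using the small model."""
    return (perf - perf_small) / (cost - cost_small)


# Stub models and verifier, for illustration only.
def small_lm(query: str) -> str:
    return "4" if query == "2+2" else "unsure"


def large_lm(query: str) -> str:
    return "large-model answer"


def verify(query: str, answer: str) -> List[bool]:
    """Pretend five self-verification samples were drawn."""
    return [answer != "unsure"] * 5


# Hypothetical (performance, cost) figures: small only, large only, mixed routing.
ibc_base = ibc(0.70, 50.0, 0.50, 1.0)  # always escalate to the large model
ibc_mix = ibc(0.62, 20.0, 0.50, 1.0)   # escalate only when verification fails
improvement_pct = (ibc_mix - ibc_base) / ibc_base * 100
```

A router is worthwhile under this measure when `ibc_mix` exceeds `ibc_base`, that is, when each unit of extra cost buys more performance than sending every query to the large model.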

Paper link

https://arxiv.org/abs/2310.12963

Read more

https://x.com/omarsar0/status/1715385477627334718

Video Language Planning

Paper introduction

Enables synthesizing complex long-horizon video plans across robotics domains; the proposed algorithm involves a tree search procedure that trains vision-language models to serve as policies and value functions, and text-to-video models as dynamics models.

Paper abstract

We are interested in enabling visual planning for complex long-horizon tasks in the space of generated videos and language, leveraging recent advances in large generative models pretrained on Internet-scale data. To this end, we present video language planning (VLP), an algorithm that consists of a tree search procedure, where we train (i) vision-language models to serve as both policies and value functions, and (ii) text-to-video models as dynamics models. VLP takes as input a long-horizon task instruction and current image observation, and outputs a long video plan that provides detailed multimodal (video and language) specifications that describe how to complete the final task. VLP scales with increasing computation budget where more computation time results in improved video plans, and is able to synthesize long-horizon video plans across different robotics domains: from multi-object rearrangement, to multi-camera bi-arm dexterous manipulation. Generated video plans can be translated into real robot actions via goal-conditioned policies, conditioned on each intermediate frame of the generated video. Experiments show that VLP substantially improves long-horizon task success rates compared to prior methods on both simulated and real robots (across 3 hardware platforms).
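The division of labor in VLP's tree search (a policy proposes sub-goals, a dynamics model predicts their outcome, a value function prunes) can be sketched on a toy problem. This is an illustrative beam-style search under heavy assumptions: states are integers instead of video frames, and `policy`, `dynamics`, and `value` are hand-written stubs where the paper uses trained vision-language and text-to-video models. `GOAL`, `horizon`, and `beam` are hypothetical parameters.

```python
from typing import List, Tuple

GOAL = 7  # toy task: reach state 7 starting from state 0


def policy(state: int) -> List[str]:
    """Stub for the vision-language policy: propose next sub-goals as text."""
    return ["step +1", "step +2", "step +3"]


def dynamics(state: int, action: str) -> int:
    """Stub for the text-to-video dynamics model: predict the resulting state."""
    return state + int(action.split("+")[1])


def value(state: int) -> float:
    """Stub for the value function: higher means closer to the goal."""
    return -abs(GOAL - state)


def plan(start: int, horizon: int = 5, beam: int = 2) -> Tuple[List[str], int]:
    """Expand each branch with policy + dynamics, keep the best `beam` by value."""
    frontier: List[Tuple[List[str], int]] = [([], start)]
    for _ in range(horizon):
        expanded = [
            (actions + [a], dynamics(state, a))
            for actions, state in frontier
            for a in policy(state)
        ]
        frontier = sorted(expanded, key=lambda node: value(node[1]), reverse=True)[:beam]
        if frontier[0][1] == GOAL:
            break  # best branch already reaches the goal
    return frontier[0]


actions, final_state = plan(0)
```

Raising `beam` or `horizon` spends more computation on the search, which is the toy counterpart of the abstract's remark that VLP scales with the computation budget.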

Paper link

https://arxiv.org/abs/2310.10625

Original article

https://nlp.elvissaravia.com/p/top-ml-papers-of-the-week-ff8

[2023/10/16 ~ 10/22] Top ML Papers of the Week
