08] Top ML Papers of the Week

(discuss.pytorch.kr)

ninebow | 2023-10-09

Overview

This article is an automated translation of a post based on the ML papers that DAIR.AI publishes every week.
Looking at this week's papers, several studies concern large language models (LLMs) that handle long context. In particular, 'LLMs Represent Space and Time', 'Retrieval meets Long Context LLMs', 'StreamingLLM', 'The Dawn of LMMs', and 'Training LLMs with Pause Tokens' shed light on different aspects of LLMs.
This trend reflects the steadily growing importance of language models in machine learning and deep learning. Trained on large-scale language data, LLMs improve performance across many language tasks such as sentence generation, machine translation, and spelling correction. Processing long contexts, however, remains difficult, and a range of approaches is being proposed to address it.
In addition, 'Neural Developmental Programs', 'Recursively Self-Improving Code Generation', and 'Retrieval-Augmented Dual Instruction Tuning' explore topics such as self-learning, code generation, and instruction tuning. They suggest that new methodologies keep emerging in AI, and that such work could play an important role in improving the self-learning capability and adaptability of AI systems.
In short, this week's papers point to new research directions in long-context processing for language models and in self-learning and code generation.

Language Models Represent Space and Time

Paper introduction

Finds that LLMs learn linear representations of space and time across multiple scales; the representations are robust to prompt variations and unified across different entity types; argues that LLMs acquire fundamental structured knowledge such as space and time, and that they learn not merely superficial statistics but literal world models. #llm #llama2

Paper abstract

The capabilities of large language models (LLMs) have sparked debate over whether such systems just learn an enormous collection of superficial statistics or a coherent model of the data generating process -- a world model. We find evidence for the latter by analyzing the learned representations of three spatial datasets (world, US, NYC places) and three temporal datasets (historical figures, artworks, news headlines) in the Llama-2 family of models. We discover that LLMs learn linear representations of space and time across multiple scales. These representations are robust to prompting variations and unified across different entity types (e.g. cities and landmarks). In addition, we identify individual "space neurons" and "time neurons" that reliably encode spatial and temporal coordinates. Our analysis demonstrates that modern LLMs acquire structured knowledge about fundamental dimensions such as space and time, supporting the view that they learn not merely superficial statistics, but literal world models.
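
To make "linear representation" concrete, here is a minimal sketch of a linear probe: a ridge regression that reads a coordinate back out of activation vectors. It is not the authors' code. The activations are synthetic stand-ins (a hypothetical "latitude direction" plus noise), and the function names, dimensions, and ridge strength are all assumptions chosen for illustration.

```python
import random

def fit_linear_probe(X, y, ridge=1e-3):
    """Ridge regression via the normal equations: w = (X^T X + ridge*I)^-1 X^T y."""
    d = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) + (ridge if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(row[i] * t for row, t in zip(X, y)) for i in range(d)]
    # Solve A w = b with Gauss-Jordan elimination and partial pivoting.
    M = [A[i] + [b[i]] for i in range(d)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(d):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * p for x, p in zip(M[r], M[col])]
    return [M[i][d] / M[i][i] for i in range(d)]

def r_squared(X, y, w):
    """Fraction of variance in y explained by the linear read-out X @ w."""
    preds = [sum(wi * xi for wi, xi in zip(w, row)) for row in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - p) ** 2 for t, p in zip(y, preds))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot

random.seed(0)
d = 6  # assumed activation width; real models use thousands of dimensions
direction = [random.gauss(0, 1) for _ in range(d)]  # hypothetical latitude direction
latitudes = [random.uniform(-60, 60) for _ in range(200)]
# Synthetic activations: latitude written along one direction, plus noise and a bias feature.
X = [[lat / 60 * u + random.gauss(0, 0.05) for u in direction] + [1.0]
     for lat in latitudes]

w = fit_linear_probe(X[:150], latitudes[:150])    # train on 150 entities
score = r_squared(X[150:], latitudes[150:], w)    # evaluate on 50 held-out entities
print(f"held-out R^2: {score:.3f}")
```

A high held-out R^2 on real activations would be the kind of evidence the abstract describes. On this synthetic data it only confirms that the probe recovers a direction that was planted by construction.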

Paper link

https://arxiv.org/abs/2310.02207

Read more

https://x.com/wesg52/status/1709551516577902782

Retrieval meets Long Context Large Language Models

Paper introduction

Compares retrieval augmentation and long context windows on downstream tasks to investigate whether the two can be combined to get the best of both worlds; an LLM with a 4K context window using simple RAG can achieve performance comparable to a fine-tuned LLM with a 16K context; retrieval can significantly improve LLM performance regardless of the extended context window size; a retrieval-augmented LLaMA2-70B with a 32K context window outperforms GPT-3.5-turbo-16k on seven long context tasks including question answering and query-based summarization. #llama #llama2-7b-32k-context #llama2-long #100k-context-window #streamingllm

Paper abstract

Extending the context window of large language models (LLMs) is getting popular recently, while the solution of augmenting LLMs with retrieval has existed for years. The natural questions are: i) Retrieval-augmentation versus long context window, which one is better for downstream tasks? ii) Can both methods be combined to get the best of both worlds? In this work, we answer these questions by studying both solutions using two state-of-the-art pretrained LLMs, i.e., a proprietary 43B GPT and LLaMA2-70B. Perhaps surprisingly, we find that LLM with 4K context window using simple retrieval-augmentation at generation can achieve comparable performance to finetuned LLM with 16K context window via positional interpolation on long context tasks, while taking much less computation. More importantly, we demonstrate that retrieval can significantly improve the performance of LLMs regardless of their extended context window sizes. Our best model, retrieval-augmented LLaMA2-70B with 32K context window, outperforms GPT-3.5-turbo-16k and Davinci003 in terms of average score on seven long context tasks including question answering and query-based summarization. It also outperforms its non-retrieval LLaMA2-70B-32k baseline by a margin, while being much faster at generation. Our study provides general insights on the choice of retrieval-augmentation versus long context extension of LLM for practitioners.
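
As a rough illustration of "simple retrieval-augmentation at generation", the sketch below ranks text chunks against a query and packs the best ones into a prompt under a fixed budget. It is not the paper's pipeline. The word-overlap scorer stands in for a real retriever, and `relevance`, `build_prompt`, the word-count budget, and the prompt template are hypothetical.

```python
def relevance(query, chunk):
    """Toy relevance score: number of distinct query words that occur in the chunk."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def build_prompt(query, chunks, budget_words=40, top_k=2):
    """Place the top_k most relevant chunks before the question, within a word budget."""
    ranked = sorted(chunks, key=lambda c: relevance(query, c), reverse=True)
    selected, used = [], 0
    for chunk in ranked[:top_k]:
        n = len(chunk.split())
        if used + n > budget_words:  # stop once the context budget would be exceeded
            break
        selected.append(chunk)
        used += n
    context = "\n".join(selected)
    return f"{context}\n\nQuestion: {query}\nAnswer:"

chunks = [
    "Bananas are rich in potassium.",
    "The Eiffel Tower is located in Paris.",
    "Mount Fuji is the highest mountain in Japan.",
]
prompt = build_prompt("Where is the Eiffel Tower located?", chunks)
print(prompt)
```

The point of the design is that only a few short chunks reach the model, which is why a 4K window can be enough. A real system would count tokens with the model's tokenizer and use a learned retriever rather than word overlap.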

Paper link

https://arxiv.org/abs/2310.03025

Efficient Streaming Language Models with Attention Sinks

Paper introduction

A framework that enables efficient streaming LLMs with attention sinks, a phenomenon in which keeping the KV states of the initial tokens largely recovers the performance of window attention; the attention sink emerges because of strong attention scores towards the initial tokens; the approach enables LLMs trained with finite-length attention windows to generalize to infinite sequence length without any additional fine-tuning. #streamingllm

Paper abstract

Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly, during the decoding stage, caching previous tokens' Key and Value states (KV) consumes extensive memory. Secondly, popular LLMs cannot generalize to longer texts than the training sequence length. Window attention, where only the most recent KVs are cached, is a natural approach -- but we show that it fails when the text length surpasses the cache size. We observe an interesting phenomenon, namely attention sink, that keeping the KV of initial tokens will largely recover the performance of window attention. In this paper, we first demonstrate that the emergence of attention sink is due to the strong attention scores towards initial tokens as a "sink" even if they are not semantically important. Based on the above analysis, we introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite length attention window to generalize to infinite sequence lengths without any fine-tuning. We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more. In addition, we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment. In streaming settings, StreamingLLM outperforms the sliding window recomputation baseline by up to 22.2x speedup. Code and datasets are provided at https://github.com/mit-han-lab/streaming-llm.
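
The cache policy described above (keep the KV of the initial tokens plus a recent window) can be sketched as a small eviction rule. This is an illustrative toy, not the code from the linked repository. `SinkKVCache`, `n_sink`, and `window` are hypothetical names, plain integers stand in for per-token key/value tensors, and the sketch ignores how positions are handled inside the cache.

```python
class SinkKVCache:
    """Keeps the first n_sink entries (attention sinks) plus the most recent `window` entries."""

    def __init__(self, n_sink=4, window=8):
        self.n_sink = n_sink
        self.window = window
        self.entries = []  # stands in for per-token (key, value) pairs

    def append(self, kv):
        self.entries.append(kv)
        overflow = len(self.entries) - (self.n_sink + self.window)
        if overflow > 0:
            # Evict the oldest entries that are not sinks; the sinks are never dropped.
            del self.entries[self.n_sink:self.n_sink + overflow]

cache = SinkKVCache(n_sink=2, window=3)
for token_index in range(10):
    cache.append(token_index)
print(cache.entries)  # [0, 1, 7, 8, 9]
```

Plain window attention corresponds to `n_sink=0`, which is the setting the abstract reports as failing once the text outgrows the cache. Memory stays bounded at `n_sink + window` entries however long the stream runs.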

Paper link

https://arxiv.org/abs/2309.17453

Read more

https://x.com/Guangxuan_Xiao/status/1708943505731801325

https://discuss.pytorch.kr/t/…

Towards Self-Assembling Artificial Neural Networks through Neural Developmental Programs

Paper introduction

Proposes neural networks that self-assemble through a developmental process mirroring properties of embryonic development in biological organisms (referred to as neural developmental programs); shows the feasibility of the approach on continuous control problems and growing topologies.

Paper abstract

Biological nervous systems are created in a fundamentally different way than current artificial neural networks. Despite its impressive results in a variety of different domains, deep learning often requires considerable engineering effort to design high-performing neural architectures. By contrast, biological nervous systems are grown through a dynamic self-organizing process. In this paper, we take initial steps toward neural networks that grow through a developmental process that mirrors key properties of embryonic development in biological organisms. The growth process is guided by another neural network, which we call a Neural Developmental Program (NDP) and which operates through local communication alone. We investigate the role of neural growth on different machine learning benchmarks and different optimization methods (evolutionary training, online RL, offline RL, and supervised learning). Additionally, we highlight future research directions and opportunities enabled by having self-organization driving the growth of neural networks.

Paper link

https://arxiv.org/abs/2307.08197

The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)

Paper introduction

A comprehensive analysis of GPT-4V to deepen the understanding of large multimodal models (LMMs); it focuses on probing GPT-4V across various application scenarios and provides examples ranging from code capabilities with vision to retrieval-augmented LMMs. #multimodal #gpt-4v

Paper abstract

Large multimodal models (LMMs) extend large language models (LLMs) with multi-sensory skills, such as visual understanding, to achieve stronger generic intelligence. In this paper, we analyze the latest model, GPT-4V(ision), to deepen the understanding of LMMs. The analysis focuses on the intriguing tasks that GPT-4V can perform, containing test samples to probe the quality and genericity of GPT-4V's capabilities, its supported inputs and working modes, and the effective ways to prompt the model. In our approach to exploring GPT-4V, we curate and organize a collection of carefully designed qualitative samples spanning a variety of domains and tasks. Observations from these samples demonstrate that GPT-4V's unprecedented ability in processing arbitrarily interleaved multimodal inputs and the genericity of its capabilities together make GPT-4V a powerful multimodal generalist system. Furthermore, GPT-4V's unique capability of understanding visual markers drawn on input images can give rise to new human-computer interaction methods such as visual referring prompting. We conclude the report with in-depth discussions on the emerging application scenarios and the future research directions for GPT-4V-based systems. We hope that this preliminary exploration will inspire future research on the next-generation multimodal task formulation, new ways to exploit and enhance LMMs to solve real-world problems, and gaining better understanding of multimodal foundation models.

Paper link

https://arxiv.org/abs/2309.17421

Think before you speak: Training Language Models With Pause Tokens

Paper introduction

Performs training and inference on LLMs with a learnable <pause> token, which delays the model's answer generation and yields performance gains on general understanding tasks such as commonsense QA and math word problem solving; experiments show that this is beneficial only if the delay is introduced in both pretraining and downstream fine-tuning. #pause-for-thought

Paper abstract

Language models generate responses by producing a series of tokens in immediate succession: the $(K+1)^{th}$ token is an outcome of manipulating $K$ hidden vectors per layer, one vector per preceding token. What if instead we were to let the model manipulate say, $K+10$ hidden vectors, before it outputs the $(K+1)^{th}$ token? We operationalize this idea by performing training and inference on language models with a (learnable) pause token, a sequence of which is appended to the input prefix. We then delay extracting the model's outputs until the last pause token is seen, thereby allowing the model to process extra computation before committing to an answer. We empirically evaluate pause-training on decoder-only models of 1B and 130M parameters with causal pretraining on C4, and on downstream tasks covering reasoning, question-answering, general understanding and fact recall. Our main finding is that inference-time delays show gains when the model is both pre-trained and finetuned with delays. For the 1B model, we witness gains on 8 of 9 tasks, most prominently, a gain of 18% EM score on the QA task of SQuAD, 8% on CommonSenseQA and 1% accuracy on the reasoning task of GSM8k. Our work raises a range of conceptual and practical future research questions on making delayed next-token prediction a widely applicable new paradigm.
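
The inference-time mechanics described above (append a run of pause tokens to the prefix, then read outputs only after the last one) can be sketched as follows. This is a toy under stated assumptions, not the authors' implementation. `step_fn` is a hypothetical next-token function standing in for a real model, and `toy_step`, `add_pauses`, and the `<eos>` marker are made up for the example.

```python
PAUSE = "<pause>"

def add_pauses(prefix_tokens, num_pauses=10):
    """Append num_pauses pause tokens to the input prefix."""
    return list(prefix_tokens) + [PAUSE] * num_pauses

def generate_after_pauses(step_fn, prefix_tokens, num_pauses=10, max_new=5):
    """Ask for next tokens only once the whole padded prefix is in place."""
    tokens = add_pauses(prefix_tokens, num_pauses)
    answer = []
    for _ in range(max_new):
        nxt = step_fn(tokens)  # first call happens after the last pause token
        if nxt == "<eos>":
            break
        answer.append(nxt)
        tokens.append(nxt)
    return answer

def toy_step(tokens):
    # Hypothetical stand-in for a language model's next-token prediction.
    return "<eos>" if tokens[-1] == "42" else "42"

padded = add_pauses(["2", "+", "40", "="], num_pauses=3)
answer = generate_after_pauses(toy_step, ["2", "+", "40", "="], num_pauses=3)
print(padded)
print(answer)
```

In a real transformer the extra positions give each layer additional hidden vectors to work with before the answer starts. According to the abstract, this helps only when the model was also pre-trained and fine-tuned with such delays, so adding pauses to an off-the-shelf model should not be expected to help.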

Paper link

https://arxiv.org/abs/2310.02226

Read more

https://x.com/omarsar0/status/1709573238123122959

Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation

Paper introduction

Proposes using a language-model-infused scaffolding program to recursively improve itself; a seed improver first improves an input program and returns the best solution, and is then tasked with improving itself; shows that GPT-4 models can write code that calls itself to improve itself. #self-training-survey-paper

Paper abstract

Several recent advances in AI systems (e.g., Tree-of-Thoughts and Program-Aided Language Models) solve problems by providing a "scaffolding" program that structures multiple calls to language models to generate better outputs. A scaffolding program is written in a programming language such as Python. In this work, we use a language-model-infused scaffolding program to improve itself. We start with a seed "improver" that improves an input program according to a given utility function by querying a language model several times and returning the best solution. We then run this seed improver to improve itself. Across a small set of downstream tasks, the resulting improved improver generates programs with significantly better performance than the seed improver. Afterward, we analyze the variety of self-improvement strategies proposed by the language model, including beam search, genetic algorithms, and simulated annealing. Since the language models themselves are not altered, this is not full recursive self-improvement. Nonetheless, our proof-of-concept experiments show that a modern language model, GPT-4, is capable of writing code that can call itself to improve itself. We critically consider concerns around the development of self-improving technologies and evaluate the frequency with which the generated code bypasses a sandbox.
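
The seed improver described above (query a language model several times, score the candidates with a utility function, return the best) might look roughly like this. It is a hedged sketch rather than the paper's code. `toy_language_model` returns canned rewrites instead of calling a real model, and the utility, the prompt text, and all names are hypothetical.

```python
def seed_improver(program, utility, language_model, n_queries=4):
    """Query the model n_queries times and return the highest-utility candidate."""
    candidates = [program]  # keep the original so the result is never worse than the input
    for attempt in range(n_queries):
        candidates.append(language_model(f"Improve this program:\n{program}", attempt))
    return max(candidates, key=utility)

def toy_language_model(prompt, attempt):
    # Hypothetical stub: canned rewrites of the expression "x + x".
    rewrites = ["x*2", "x + x + 0", "x+x", "(x + x)"]
    return rewrites[attempt % len(rewrites)]

def utility(program):
    # Toy utility: must evaluate to 42 when x = 21; among correct programs, shorter is better.
    correct = eval(program, {"x": 21}) == 42
    return (1 if correct else 0, -len(program))

best = seed_improver("x + x", utility, toy_language_model)
print(best)  # x*2
```

The self-improvement step in the abstract corresponds to passing the improver's own source code as `program`, with a utility that measures how well the resulting improver performs on downstream tasks. Note that `eval` here runs arbitrary code, which is exactly the kind of risk behind the abstract's remark on sandboxing.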

Paper link

https://arxiv.org/abs/2310.02304

RA-DIT: Retrieval-Augmented Dual Instruction Tuning

Paper introduction

Proposes a lightweight fine-tuning method to retrofit LLMs with retrieval capabilities; it involves a 2-step approach: 1) update a pretrained LM to make better use of retrieved information, and 2) update the retriever to return more relevant results, as preferred by the LM. Results show that, when fine-tuning over tasks that require both knowledge utilization and contextual awareness, each stage yields additional gains; a 65B model achieves state-of-the-art results on a range of knowledge-intensive zero- and few-shot learning benchmarks, outperforming existing retrieval-augmented language approaches by up to +8.9% in zero-shot and +1.4% in 5-shot. #rag #instruct-tuning

Paper abstract

Retrieval-augmented language models (RALMs) improve performance by accessing long-tail and up-to-date knowledge from external data stores, but are challenging to build. Existing approaches require either expensive retrieval-specific modifications to LM pre-training or use post-hoc integration of the data store that leads to suboptimal performance. We introduce Retrieval-Augmented Dual Instruction Tuning (RA-DIT), a lightweight fine-tuning methodology that provides a third option by retrofitting any LLM with retrieval capabilities. Our approach operates in two distinct fine-tuning steps: (1) one updates a pre-trained LM to better use retrieved information, while (2) the other updates the retriever to return more relevant results, as preferred by the LM. By fine-tuning over tasks that require both knowledge utilization and contextual awareness, we demonstrate that each stage yields significant performance improvements, and using both leads to additional gains. Our best model, RA-DIT 65B, achieves state-of-the-art performance across a range of knowledge-intensive zero- and few-shot learning benchmarks, significantly outperforming existing in-context RALM approaches by up to +8.9% in 0-shot setting and +1.4% in 5-shot setting on average.

Paper link

https://arxiv.org/abs/2310.01352

Read more

https://x.com/omarsar0/status/1709204756013490494

Kosmos-G: Generating Images in Context with Multimodal Large Language Models

Paper introduction

A model that performs high-fidelity zero-shot image generation from generalized vision-language input that spans multiple images. It extends zero-shot subject-driven image generation to multi-entity scenarios and allows CLIP to be replaced, unlocking new applications with other U-Net techniques such as ControlNet and LoRA. #multimodal

Paper abstract

Recent advancements in text-to-image (T2I) and vision-language-to-image (VL2I) generation have made significant strides. However, generation from generalized vision-language inputs, especially inputs involving multiple images, remains under-explored. This paper presents Kosmos-G, a model that leverages the advanced perception capabilities of Multimodal Large Language Models (MLLMs) to tackle this challenge. Our approach aligns the output space of the MLLM with CLIP using the textual modality as an anchor and performs compositional instruction tuning on curated data. Kosmos-G demonstrates a unique capability of zero-shot multi-entity subject-driven generation. Notably, the score distillation instruction tuning requires no modifications to the image decoder. This allows for a seamless substitution of CLIP and effortless integration with a myriad of U-Net techniques ranging from fine-grained controls to personalized image decoder variants. We posit Kosmos-G as an initial attempt towards the goal of "image as a foreign language in image generation."
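As a rough illustration of what "aligning the output space of the MLLM with CLIP using the textual modality as an anchor" could look like, the sketch below scores how far projected MLLM outputs are from CLIP text embeddings of the same captions. The abstract does not name the loss or the architecture, so the linear projection, the mean-squared-error objective, and every name here are assumptions for illustration only.

```python
def linear_project(vector, weight):
    """Apply a hypothetical learned linear map; each row of `weight`
    produces one output dimension."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]

def mean_squared_error(a, b):
    """Average squared difference between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def alignment_loss(mllm_text_outputs, clip_text_embeddings, weight):
    """Assumed objective: for each caption, compare the projected MLLM
    output with the CLIP text embedding of that same caption, so that
    text acts as the shared anchor between the two spaces."""
    losses = [
        mean_squared_error(linear_project(m, weight), c)
        for m, c in zip(mllm_text_outputs, clip_text_embeddings)
    ]
    return sum(losses) / len(losses)

identity = [[1.0, 0.0], [0.0, 1.0]]
clip_embeddings = [[0.2, 0.8], [0.9, 0.1]]  # toy stand-ins for CLIP outputs
aligned = alignment_loss(clip_embeddings, clip_embeddings, identity)
misaligned = alignment_loss([[0.8, 0.2], [0.1, 0.9]], clip_embeddings, identity)
```

Because only the MLLM side is trained toward CLIP's embedding space in such a setup, an image decoder that already consumes CLIP embeddings would not need to change, which matches the abstract's remark that no modifications to the image decoder are required.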

Paper link

https://arxiv.org/abs/2310.02992

Read more

https://x.com/omarsar0/status/1709934741158510625

Large Language Models as Analogical Reasoners

Paper introduction

A new prompting approach that automatically guides the reasoning process of LLMs. It differs from chain-of-thought in that it does not require labeled exemplars of the reasoning process. The approach is inspired by analogical reasoning and prompts LMs to self-generate relevant exemplars or knowledge in the context. #llm-reasoning #chain-of-thought

Paper abstract

Chain-of-thought (CoT) prompting for language models demonstrates impressive performance across reasoning tasks, but typically needs labeled exemplars of the reasoning process. In this work, we introduce a new prompting approach, Analogical Prompting, designed to automatically guide the reasoning process of large language models. Inspired by analogical reasoning, a cognitive process in which humans draw on relevant past experiences to tackle new problems, our approach prompts language models to self-generate relevant exemplars or knowledge in the context before proceeding to solve the given problem. This method presents several advantages: it obviates the need for labeling or retrieving exemplars, offering generality and convenience; it can also tailor the generated exemplars and knowledge to each problem, offering adaptability. Experimental results show that our approach outperforms 0-shot CoT and manual few-shot CoT in a variety of reasoning tasks, including math problem solving in GSM8K and MATH, code generation in Codeforces, and other reasoning tasks in BIG-Bench.
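A minimal sketch of the idea follows: a single prompt that asks the model to recall related problems itself and only then solve the one at hand. The wording of the template, the function name `build_analogical_prompt`, and the `num_exemplars` parameter are illustrative assumptions, not the paper's exact prompt. No model call is shown, since any particular LLM client would be a further assumption.

```python
def build_analogical_prompt(problem: str, num_exemplars: int = 3) -> str:
    """Build one prompt that asks the model to self-generate exemplars
    and then solve the given problem. No labeled exemplars are supplied."""
    if num_exemplars < 1:
        raise ValueError("num_exemplars must be at least 1")
    return (
        f"# Problem\n{problem}\n\n"
        "# Instructions\n"
        f"1. Recall {num_exemplars} relevant and distinct problems. "
        "For each one, describe the problem and explain its solution.\n"
        "2. Then solve the initial problem step by step.\n"
    )

prompt = build_analogical_prompt(
    "A train travels 120 km in 1.5 hours. What is its average speed?"
)
print(prompt)
```

Unlike few-shot CoT, the prompt contains no worked examples: the exemplars are produced by the model at inference time, which is why they can be tailored to each problem.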

Paper link

https://arxiv.org/abs/2310.01714

Original article

https://nlp.elvissaravia.com/p/top-ml-papers-of-the-week-9d9
