Mistral-finetune - Mistral मॉडल को फाइन-ट्यून करना

(github.com/mistralai)

1 पॉइंट द्वारा GN⁺ 2024-05-27 | 1 टिप्पणियां | WhatsApp पर शेयर करें

mistral-finetune Mistral मॉडल को memory-efficient और अच्छी performance के साथ fine-tune करने के लिए एक lightweight codebase है, और मौजूदा repository archived है इसलिए अब सक्रिय रूप से maintain नहीं की जाती
training तरीका LoRA पर आधारित है, जिसमें ज्यादातर weights को freeze किया जाता है और low-rank matrix perturbation के रूप में अतिरिक्त weights के सिर्फ 1~2% को train किया जाता है
अधिकतम efficiency के लिए A100 या H100 GPU इस्तेमाल करने की सलाह है, और code multi-GPU single-node training के लिए optimized है, हालांकि 7B जैसे छोटे मॉडल single GPU पर भी चल सकते हैं
supported models में 7B, Mixtral 8x7B, Mixtral 8x22B, Mistral-Nemo 12B, Mistral Large v2 123B Instruct शामिल हैं, और Mistral-Nemo तथा Large v2 में क्रमशः sequence length और learning rate से जुड़ी constraints हैं
data को jsonl format और strict schema का पालन करना चाहिए, और training से पहले utils.validate_data से format validation और training time estimation करना महत्वपूर्ण है

project की स्थिति और उद्देश्य

mistral-finetune repository Archived स्थिति में है और अब सक्रिय रूप से maintain नहीं की जाती
अगर community demand हो या माना जाए कि यह fine-tuning ecosystem में value जोड़ सकती है, तो भविष्य में कोई नई library या बड़ा update आ सकता है
लक्ष्य Mistral models को fine-tune करने के लिए सरल और guided entry point देना है
यह codebase खासकर data format को लेकर काफी opinionated है, और कई model architectures या hardware types को cover करने वाला general-purpose tool बनने का लक्ष्य नहीं रखता
ज्यादा general approach के लिए torchtune जैसे projects देखे जा सकते हैं

fine-tuning तरीका और hardware recommendations

mistral-finetune LoRA पर आधारित है
- model के ज्यादातर weights fixed रहते हैं
- low-rank matrix perturbation के रूप में अतिरिक्त weights के सिर्फ 1~2% को train करता है
अधिकतम efficiency के लिए A100 या H100 GPU इस्तेमाल करने की सलाह है
code multi-GPU single-node training environment के लिए optimized है
7B जैसे छोटे models के लिए single GPU भी पर्याप्त है

हाल के compatible model updates

13 अगस्त 2024 से Mistral Large v2 mistral-finetune के साथ compatible है
- 123B Instruct checkpoint download करके model_id_or_path को उस checkpoint directory पर set करना होगा
- model size बड़ा होने के कारण fine-tuning के लिए काफी ज्यादा memory चाहिए
- फिलहाल seq_len को 8192 या कम set करना होगा
- दूसरे models की तुलना में lower learning rate की सलाह है, और ज्यादातर मामलों में lr=1e-6 अच्छी तरह काम करेगा, ऐसा बताया गया है
19 जुलाई 2024 से Mistral Nemo mistral-finetune के साथ compatible है
- 12B Base या Instruct model download करके model_id_or_path को checkpoint directory पर set करना होगा
- Tekkenizer support करने वाला mistral-common version चाहिए, और pip install --upgrade mistral-common से >=1.3.1 version install करना होगा
- बड़ी vocabulary size के कारण CE loss की peak memory requirement बढ़ती है, इसलिए अभी ज्यादा memory चाहिए
- फिलहाल seq_len को 16384 या कम set करना होगा
- 7B v3 जैसे hyperparameters इस्तेमाल करने की सलाह है

installation और model download

शुरुआत की प्रक्रिया repository clone और dependencies install करने से बनती है
- git clone https://github.com/mistralai/mistral-finetune.git
- pip install -r requirements.txt
official Mistral models की fine-tuning की सलाह दी जाती है, और README निम्न model download links और checksums देता है
- 7B Base: 0663b293810d7571dad25dae2f2a5806
- 7B Instruct v3: 80b71fcb6416085bcb4efad86dfb4d52
- 8x7B Base: Hugging Face link
- 8x7B Instruct: 8e2d3930145dc43d3084396f49d38a3f
- 8x22 Instruct: 471a02a6902706a2f1e44a693813855b
- 8x22B Base: a2fa75117174f87d1197e3a4eb50371a
- 12B Instruct (Mistral-Nemo): 296fbdf911cb88e6f0be74cd04827fe7
- 12 Base (Mistral-Nemo): c5d079ac4b55fc1ae35f51f0a3c0eb83
- 123B Instruct (Large v2): fc602155f9e39151fba81fcaab2fa7c4
8x7B Base V1 और 8x7B Instruct V1 को fine-tuning से पहले v3 tokenizer इस्तेमाल करना और vocabulary size को 32768 तक expand करना होगा
downloaded model folder path को training YAML के model_id_or_path में absolute path के रूप में specify करना होगा

data format requirements

सभी data files jsonl format में होनी चाहिए
pretraining data में plain text को "text" key में store किया जाता है
Instruction data में conversations की list "messages" key में store होती है
- हर item में "content" और "role" keys शामिल होती हैं
- "role" "user", "assistant", "system" में से एक होता है
- loss केवल तब calculate होता है जब "role" == "assistant" हो
- assistant message में "weight": 0 specify करके उस message की training exclude की जा सकती है
function calling data भी conversations की list को "messages" key में store करता है
- हर item में "role" और "content" या "tool_calls" key शामिल होती है
- "role" "user", "assistant", "system", "tool" में से एक होता है
- loss केवल तब calculate होता है जब "role" == "assistant" हो
- "tool_calls" के "id" और "tool_call_id" बिल्कुल 9-character लंबी random strings होनी चाहिए
- README इन्हें data preparation script में automatically generate करने का तरीका recommend करता है

data validation और example workflow

training शुरू करने से पहले utils.validate_data से data format validate करना और training time estimate करना चाहिए
Instruction example Ultachat_200k के एक हिस्से का इस्तेमाल करता है
- Pandas से parquet data load करता है
- training 95%, evaluation 5% में split करता है
- jsonl में save करता है
- example/7B.yaml के data.instruct_data और data.eval_instruct_data में paths specify करता है
validation process में कुछ conversations के user role पर खत्म होने की समस्या मिल सकती है
- क्योंकि सिर्फ assistant messages train होते हैं, आखिरी user message unnecessary processing target बन जाता है
- utils.reformat_data.py से data fix किया जा सकता है
fix के बाद दोबारा validate करने पर data token count, training token count, epoch count, max_steps, estimated time जैसी summary output होती है
README example में max_steps=500 dataset को करीब 5 बार iterate करता है, और 8xH100 cluster पर करीब 30 मिनट लगने वाली setting के रूप में max_steps=300 recommend करता है

function calling fine-tuning example

function calling example Glaive function calling dataset का इस्तेमाल करता है
data को Pandas से load करके, training 95% और evaluation 5% में बाँटने के बाद jsonl में save किया जाता है
original dataset required function calling format का पालन नहीं करता, इसलिए reformatting चाहिए
- "from" को "user" में बदलना होगा
- unnecessary "\n" characters हटाने होंगे
utils.reformat_data_glaive.py इस्तेमाल करने पर ज्यादातर samples सही format में बन सकते हैं
हर तरह के dataset पर काम करने वाली reformat script लिखना असंभव है, इसलिए required format का पालन न करने वाले datasets के लिए अलग reformat script की जरूरत हो सकती है
utils.validate_data --create_corrected इस्तेमाल करने पर बची हुई errors हटाकर .corrected dataset generate किया जा सकता है

training execution और result examples

data validation के बाद training शुरू की जा सकती है
तेज training के लिए max_steps को 300 पर set करने वाली configuration recommend की जाती है
run_dir को experiment folder के रूप में set करना चाहिए, और वैकल्पिक रूप से wandb.project specify करके Weights & Biases logging इस्तेमाल की जा सकती है
training execution torchrun का इस्तेमाल करता है, और --nproc-per-node को available GPU count पर set करना होगा
UltraChat training 8xH100 node पर करीब 30 मिनट लेती है, और resulting weights MT Bench score करीब 6.3 दे सकते हैं
Glaive training 8xH100 node पर करीब 1 घंटा लेती है, और बताया गया है कि resulting weights function calling में अच्छी तरह काम करते हैं

training configuration के मुख्य items

model_id_or_path: training शुरू करने के लिए pretrained model या local model directory path
run_dir: checkpoints और metrics store करने की directory
seq_len: training sequence length, और samples efficiency के लिए seq_len length के हिसाब से pack किए जाते हैं
batch_size: प्रति GPU training examples की संख्या
- total effective token batch size num_gpus x batch_size x seq_len है
max_steps: कुल training iterations की संख्या
- training के दौरान देखे जाने वाले total tokens max_steps x num_gpus x batch_size x seq_len हैं
optim.lr: optimizer initial learning rate
optim.weight_decay: weight decay, और README 0.1 बनाए रखने की सलाह देता है
optim.pct_start: PyTorch OneCycleLR के warm-up phase का ratio
lora.rank: LoRA adapter size, और 64 या कम recommend किया जाता है
seed: initialization और data shuffling/sampling की reproducibility के लिए random seed
data.instruct_data: instruction training data path
- single jsonl file, jsonl directory, या weights के साथ multiple data sources specify किए जा सकते हैं
data.data: optional additional pretraining data path
data.eval_instruct_data: optional evaluation instruction data path
eval_freq, no_eval, ckpt_freq: evaluation, intermediate evaluation, checkpoint saving frequency को control करते हैं
save_adapters: तय करता है कि केवल LoRA checkpoints save होंगे या LoRA को base model में merge करके full model के रूप में save किया जाएगा
- save_adapters=False को single process में full model save करने के लिए पर्याप्त CPU और GPU memory चाहिए, और आम तौर पर सिर्फ 7B models में संभव है

inference और Weights & Biases

trained model inference के लिए mistral-inference इस्तेमाल करने की सलाह है
pip install mistral_inference से install किया जा सकता है
mistral-chat run करते समय --lora_path में saved lora.safetensors path specify करके LoRA weights इस्तेमाल किए जा सकते हैं
Weights and Biases support शामिल है, जिससे training metrics और experiments monitor किए जा सकते हैं
- pip install wandb से install करें
- API key को WANDB_API_KEY environment variable के रूप में provide करने की सलाह है
- security कारणों से API key YAML configuration से नहीं पढ़ी जाती
- training loss, evaluation loss, learning rate आदि wandb project dashboard में record और visualize होते हैं
detailed usage के लिए Weights and Biases documentation देख सकते हैं

model expansion और FAQ

सिर्फ v3 tokenizer के साथ compatible Mistral models को fine-tune किया जा सकता है
compatible models की vocabulary size 32768 होनी चाहिए, 32000 नहीं
32000 vocabulary size वाले पुराने models को utils.extend_model_vocab से 32768 तक expand किया जा सकता है
MoE models की fine-tuning में performance variance ज्यादा दिखता है
- अलग-अलग seed के साथ वही MoE fine-tuning कई बार run करके सबसे अच्छा performance वाला result चुनने का तरीका सुझाया गया है
- dense models में ऐसा high variance observe नहीं हुआ
training में इस्तेमाल हुए token count को utils.validate_data.py में YAML training file input देकर check किया जा सकता है
CUDA out-of-memory error आने पर प्रति GPU batch size कम किया जा सकता है
- batch size seq_len x batch_size है
- batch_size को 1 पर set करके seq_len घटाने का तरीका सुझाया गया है
library Apache 2.0 License के तहत उपलब्ध है
इस library या models का इस्तेमाल तीसरे पक्ष के intellectual property rights सहित अधिकारों का उल्लंघन करने, उनके दुरुपयोग करने या नियमों का उल्लंघन करने के तरीके से नहीं किया जाना चाहिए

1 टिप्पणियां

GN⁺ 2024-05-27

Hacker News की राय

मॉडल इतनी तेज़ी से आगे बढ़ रहे हैं, तो क्या fine-tuning की अब भी कोई वैल्यू है? असली उपयोग के उदाहरणों को लेकर उत्सुक हूँ
उदाहरण के लिए, Bloomberg ने पिछले साल वित्तीय डेटा पर GPT-3.5-स्तर का LLM train किया था, लेकिन कुछ ही समय बाद GPT-4-8k ने लगभग सभी financial tasks में उसे पीछे छोड़ दिया
आखिरकार हमारा फोकस high-quality evaluation data और ऐसी architecture पर आ गया जिससे नए मॉडल पर आसानी से switch किया जा सके
- हाँ। हमारे पास गैर-अंग्रेज़ी लोगों का डेटा है, और उसे एक खास health-related research के लिए बनाए गए format में annotate किया गया है
  LLM ने ऐसी annotations कभी नहीं देखी हैं, गैर-अंग्रेज़ी LLM कंपनियों की top priority भी नहीं हैं, और data privacy के कारण हम सिर्फ offline-first models ही इस्तेमाल कर सकते हैं
  ऐसी स्थिति में general-purpose language model को fine-tune करना बहुत सही बैठता है
- अगर किसी खास format का output बड़ी मात्रा में generate करना हो, तो fine-tuning उपयोगी हो सकती है
  formatted messages पर fine-tune कर देने से model अपने-आप वही format generate करता है, इसलिए हर prompt में output format समझाने वाले बहुत सारे tokens बच सकते हैं
- अगर बात internal company data की हो जिसे GPT-4 ने कभी नहीं देखा, तो?
- पारंपरिक natural language processing tasks में LLM, POS tagging या feature tagging जैसी dedicated natural language processing pipelines से काफी पीछे हैं
  हालांकि fine-tuning दोनों के बीच की दूरी काफी हद तक कम कर देती है
  यह एक narrow area है, लेकिन programming के ज़्यादातर हिस्सों में भी ऐसा ही है। अगर मकसद general-purpose LLM को अपने data की तरफ ज़्यादा झुकाना है, तो fine-tuning शायद बहुत relevant न हो
  लेकिन अगर आप बहुत specific लेकिन अस्पष्ट समस्या हल करना चाहते हैं, और LLM उसका सिर्फ कुछ हिस्सा ही हल कर पा रहा है, तो fine-tuning सबसे अच्छा विकल्प हो सकती है
- function calling भी एक वजह हो सकती है
  अगर app में tools के साथ interact करने वाले बहुत सारे custom functions हैं, तो context tokens खर्च करने के बजाय fine-tuning पसंद की जा सकती है
इसे करने के लिए किस GPU की ज़रूरत होगी? मेरे पास 3060 Ti laptop version, i9, RAM 16GB है
AWS या GCP quota नहीं है और Paperspace के बारे में सुना है, लेकिन जिस client project पर काम कर रहा हूँ उसमें Mistral models के कुछ हिस्से इस्तेमाल करने की योजना है, इसलिए Mistral fine-tuning जल्दी शुरू करना चाहता हूँ
- अगर budget पूरी तरह 0 नहीं है, तो मैं gaming desktop लेने की जोरदार सलाह दूँगा
  gaming GPU 300W heat बिना समस्या के निकाल सकता है, लेकिन laptop GPU ऐसा करे तो पिघल जाएगा और शायद लगभग 100W तक सीमित रहेगा
  heat dissipation सीधे speed के अनुपात में है
  ऊपर से desktop में बाद में तेज़ GPU upgrade करना या कई GPUs इस्तेमाल करना भी संभव है
  लेकिन खासकर multi-GPU setup शोर करता है और इतना heat निकालता है कि एक कमरा जल्दी गर्म हो जाए
  अगर अगले कुछ वर्षों में GPU को full load पर चलाने का समय 10% से ज़्यादा नहीं होगा, तो cloud शायद सस्ता पड़ेगा
- यह site देख सकते हैं: https://www.hardware-corner.net/llm-database/Mistral/
  इसमें model-wise hardware requirementsまとめ किए गए हैं, और VRAM व system memory चुनकर available models filter किए जा सकते हैं
- Hetzner पर 184 euro/month वाला GPU server इस्तेमाल कर सकते हैं
  हमारी company ने वहाँ के RTX4000 पर Mistral और Llama 3 fine-tune किए हैं
  RAM सिर्फ 20GB है, इसलिए थोड़ा limiting है, लेकिन बड़े input token counts के लिए quantization level घटाने का तरीका मददगार रहा
  अब वे hourly rental भी offer करते हैं
- openpipe आज़माना अच्छा रहेगा
  फिलहाल company में इस्तेमाल कर रहे हैं और काफी अच्छे results मिले हैं
आम LLM use cases में हर category के लिए कौन सा tool de facto standard बनेगा, यह बहुत दिलचस्प है
ecosystem इतना fragmented है कि लगता है ज़्यादातर tools के बारे में सुना ही नहीं है
कुछ दिन पहले Microsoft का Olive देखा, और वह मेरे लिए बिल्कुल नया tool था
अब जब बहुत सारे open source LLM पहले ही “कामचलाऊ/usable” स्तर पर पहुँच चुके हैं, तो उनके आसपास development को आसान बनाना अहम है
खासकर users और developers—दोनों भूमिकाओं में मौजूद लोगों को private data, यानी model की pretraining में शामिल नहीं रहे data, का उपयोग कर पाना चाहिए
repository में लिखा है कि यह बड़े models के लिए optimized है और A100/H100 चाहिए, लेकिन फिर भी मुझे लगता है कि यह बड़े models की तुलना में छोटे models के लिए ज़्यादा मददगार हो सकता है
“बना दो तो लोग आएँगे” को “tools दो तो लोग बनाएँगे” तक बढ़ाया जा सकता है
- “tools दो तो लोग बनाएँगे” तभी सच होता है जब उस technology को सीखने का incentive भविष्य के लाभ की उम्मीद दिलाए
weights वाला हिस्सा दिलचस्प है
HuggingFace का SFTTrainer चाहें तो सिर्फ completion वाले हिस्से पर training करने देता है, लेकिन इंसानों को यह natural लगे, फिर भी LLMs के लिए आम तौर पर पूरे input को predict करना सीखना बेहतर होता है
इस तरीके से दोनों के फायदे मिल सकते हैं
क्या इसे इस तरह optimize किया जा सकता है कि 3090 या 4090 की दो cards पर बड़े variant models train हो सकें?
- काफी मेहनत लगेगी, लेकिन संभव लगता है
  कुछ विकल्पों को cover करने वाला शुरुआती point यहाँ है: https://huggingface.co/blog/trl-peft
अपने WhatsApp chat model को कैसे train कर सकता हूँ?
- यह और स्पष्ट होना चाहिए कि आपका मतलब क्या है
  क्या आप अपने WhatsApp messages पर model train करना चाहते हैं? उद्देश्य क्या है? यह इस पर निर्भर करेगा कि आप उसे अपनी तरह लिखवाना चाहते हैं या RAG-based Q&A करना चाहते हैं

Mistral-finetune - Mistral मॉडल को फाइन-ट्यून करना

project की स्थिति और उद्देश्य

fine-tuning तरीका और hardware recommendations

हाल के compatible model updates

installation और model download

data format requirements

data validation और example workflow

function calling fine-tuning example

training execution और result examples

training configuration के मुख्य items

inference और Weights & Biases

model expansion और FAQ

संबंधित पढ़ाई

1 टिप्पणियां

Hacker News की राय