Show HN: Llama fine-tuning that is 80% faster, uses 50% less memory, and has 0% accuracy loss

(github.com/unslothai)

2 points by GN⁺ 2023-12-03 | 1 comment

Unsloth provides Unsloth Studio and the code-based Unsloth Core for running and training models locally, and handles text, audio, embedding, and vision models on Windows, Linux, WSL, and macOS
The training features support fine-tuning, RL, and pretraining for more than 500 models; the headline performance targets are up to 2x faster training, up to 70% less VRAM, and no accuracy loss
The inference features cover searching, downloading, and running GGUF, LoRA adapter, and safetensors models, plus model export, tool calling, web search, code execution, and local API inference endpoints
Unsloth Studio binds to localhost by default; --secure uses a Cloudflare HTTPS tunnel, while -H 0.0.0.0 can expose the raw port externally, so API key protection and --disable-tools matter
Licensing is a dual Apache 2.0 and AGPL-3.0 structure: the Core package is Apache 2.0, while some optional components such as the Studio UI are AGPL-3.0

What Unsloth provides

Unsloth Studio (Beta) is a web UI for running and training models locally
- Works on Windows, Linux, WSL, and macOS
- Supports text, audio, embedding, and vision models
Unsloth Core is the code-based version, and its requirements differ from Studio's
Initial installation commands are given per OS
- macOS, Linux, WSL: curl -fsSL https://unsloth.ai/install.sh | sh
- Windows: irm https://unsloth.ai/install.ps1 | iex

Inference features

Supports model search, download, and run; target formats include GGUF, LoRA adapters, and safetensors
Models can be saved or exported to GGUF, 16-bit safetensors, and other formats
Tool calling supports self-healing tool calling and web search
Code execution lets the LLM test code in Claude artifacts and a sandbox environment
Through the API inference endpoint, a local LLM can be deployed and run with tools such as Claude Code and Codex
Can connect to API providers such as OpenAI and Anthropic, or to servers such as vLLM and Ollama
Can chat with images, audio, PDF, code, DOCX, and other files
It states that bugs affecting model accuracy were fixed through direct collaboration with the teams behind gpt-oss, Qwen3, Llama 4, Mistral, Gemma 1-3, and Phi-4

Training features and performance

Unsloth supports training and RL for more than 500 models
- Up to 2x faster training
- Up to 70% less VRAM
- No accuracy loss
Uses custom Triton and mathematical kernels
- A collaboration with PyTorch on FP8 reinforcement learning is linked
- A collaboration with Hugging Face on faster MoE is linked
Data Recipes automatically builds datasets from PDF, CSV, DOCX, and similar files, and lets you edit data in a visual node workflow
For reinforcement learning, up to 80% less VRAM usage is claimed with GRPO, FP8, and so on
Supported training methods include full fine-tuning, RL, pretraining, and 4-bit, 16-bit, and FP8 training
The observability feature monitors training status in real time and supports loss, GPU usage, and graph customization
Multi-GPU training is supported, with major improvements coming soon
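
The headline mixes "Nx faster" and "N% less" figures, which are easy to misread. The sketch below only does the arithmetic for the figures quoted above; the 40 GB baseline is a hypothetical number chosen for illustration, not a measurement from the project.

```python
def time_saved_fraction(speedup: float) -> float:
    """Fraction of wall-clock time saved when a job runs `speedup` times faster."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


def vram_after_reduction(baseline_gb: float, reduction: float) -> float:
    """VRAM still needed after removing `reduction` (0..1) of a baseline footprint."""
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be between 0 and 1")
    return baseline_gb * (1.0 - reduction)


if __name__ == "__main__":
    baseline_gb = 40.0  # hypothetical baseline footprint, for illustration only
    print(f"2x faster  -> {time_saved_fraction(2.0):.0%} less training time")
    print(f"70% less VRAM -> {vram_after_reduction(baseline_gb, 0.70):.1f} GB of {baseline_gb:.0f} GB")
    print(f"80% less VRAM -> {vram_after_reduction(baseline_gb, 0.80):.1f} GB of {baseline_gb:.0f} GB")
```

Read this way, "up to 2x faster" means at most half the training time, and "up to 70% less VRAM" means a run needs at least 30% of the baseline footprint.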

Installation and runtime requirements

Unsloth Studio works on Windows, Linux, WSL, and macOS
- CPU: currently supports Chat and Data Recipes
- NVIDIA: training is supported on RTX 30/40/50, Blackwell, DGX Spark, Station, and others
- macOS: training, MLX, and GGUF inference are all supported
- AMD: Chat and Data are supported; use Unsloth Core for training, with Studio support coming soon
- Multi-GPU: available now, with a major upgrade planned
The Studio run command is unsloth studio -p 8888
A Docker image is available as the unsloth/unsloth container
Unsloth Core installation examples are based on uv and Python 3.13
- Linux, WSL: run uv venv unsloth_env --python 3.13, then uv pip install unsloth --torch-backend=auto
- Windows: install Python 3.13 and astral-sh.uv, then install the same way
- On Windows, pip install unsloth works only if PyTorch is already installed
For AMD and Intel GPU installation, follow the AMD Guide and Intel Guide respectively

Remote access and security requirements

By default, unsloth studio binds to 127.0.0.1 and is reachable only from the current machine
--secure exposes Studio only through a free Cloudflare HTTPS link
- Studio itself stays on localhost
- If the tunnel fails to start, it fails closed and does not expose the raw port
-H 0.0.0.0 binds the raw port to all network interfaces
- It becomes reachable from anywhere on the network, so use it only on trusted networks
Server-side tools such as web search and Python and terminal code execution run with the user's permissions and are enabled by default
Anyone with access to the server and an API key can run code on that machine, so keep API keys private and use --disable-tools when exposing Studio
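
The difference between binding to 127.0.0.1 and to 0.0.0.0 can be seen with a plain TCP socket. This is a generic standard-library sketch, not Unsloth Studio's code; bind_listener and is_loopback_only are hypothetical helper names, and port 0 simply asks the OS for any free port.

```python
import ipaddress
import socket


def bind_listener(host: str, port: int = 0) -> socket.socket:
    """Open a listening TCP socket on `host`; port 0 lets the OS pick a free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen(1)
    return sock


def is_loopback_only(sock: socket.socket) -> bool:
    """True when the socket is bound to a loopback address such as 127.0.0.1."""
    bound_host = sock.getsockname()[0]
    return ipaddress.ip_address(bound_host).is_loopback


if __name__ == "__main__":
    for host in ("127.0.0.1", "0.0.0.0"):
        listener = bind_listener(host)
        scope = "this machine only" if is_loopback_only(listener) else "every network interface"
        print(f"{host}: reachable from {scope}")
        listener.close()
```

A listener on 0.0.0.0 accepts connections arriving on any interface of the machine, which is why the text above pairs it with trusted networks, private API keys, and --disable-tools.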

Free notebooks and supported model examples

The free Unsloth Studio notebook lets you run and train models in the web UI
The listed notebook examples also show per-model performance and memory-saving figures
- Gemma 4 (E2B): 1.5x faster, 50% less memory
- Qwen3.5 (4B): 1.5x faster, 60% less memory
- gpt-oss (20B): 2x faster, 70% less memory
- gpt-oss (20B): GRPO: 2x faster, 80% less memory
- Llama 3.1 (8B) Alpaca: 2x faster, 70% less memory
- Orpheus-TTS (3B): 1.5x faster, 50% less memory
Notebook lists for Kaggle, GRPO, TTS, embedding, and Vision are provided separately
All models are listed in the Unsloth Catalog, and all notebooks in Unsloth notebooks

Recent feature items

Connections: supports connecting to API providers such as OpenAI and Anthropic, or to servers such as vLLM and Ollama
MTP: supports running Qwen3.6 MTP, with automatic configuration of hardware-specific MTP settings
Qwen3.6: Qwen3.6-35B-A3B can be trained and run in Unsloth Studio
Gemma 4: Google's new model can be run and trained directly in Unsloth
MoE LLM: 12x faster training and 35% less VRAM are claimed for DeepSeek, GLM, Qwen, and gpt-oss
Embedding models: supports embedding fine-tuning roughly 1.8x to 3.3x faster
7x longer context RL: a new batching algorithm gives 7x longer context for RL than other setups
500K Context: a 20B model can be trained with more than 500K context on an 80GB GPU
FP8 & Vision RL: FP8 and VLM GRPO can run on consumer GPUs

License and base projects

Unsloth uses a dual Apache 2.0 and AGPL-3.0 license model
- The core Unsloth package stays under Apache 2.0
- AGPL-3.0 applies to some optional components such as the Unsloth Studio UI
The project credits llama.cpp, Hugging Face transformers, TRL, PyTorch, Torch AO, NVIDIA NeMo DataDesigner, and others

1 comment

GN⁺ 2023-12-03

Hacker News opinions

I haven't run the code myself, but I don't understand how this is possible
When you profile QLoRA fine-tuning of Llama-2-70B in PyTorch, most of the execution time goes to the large matrix multiplications in the MLP layers, with attention adding a bit on top
Internally this repo also appears to take the same path as HuggingFace, calling torch.matmul() for the MLP and flash_attn_func() for attention, so how can it be this much faster?
There are certainly some Triton kernels, but I don't see Triton in most of the bottleneck, that is, the MLP or attention
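
The commenter's claim that MLP matrix multiplications dominate can be sanity-checked with a rough FLOP count. The sketch below uses the published Llama-2-70B layer dimensions and counts forward-pass matmul FLOPs per token for one transformer layer; it is an estimate under those assumptions and ignores QLoRA dequantization, the LoRA adapters, normalization, and memory traffic, so it is not a profile.

```python
# Llama-2-70B layer dimensions (per transformer layer).
HIDDEN = 8192          # model width
INTERMEDIATE = 28672   # MLP inner width
N_HEADS = 64           # query heads
N_KV_HEADS = 8         # key/value heads (grouped-query attention)
HEAD_DIM = HIDDEN // N_HEADS


def mlp_flops_per_token() -> int:
    """Gate, up and down projections; 2 FLOPs per multiply-add."""
    return 2 * 3 * HIDDEN * INTERMEDIATE


def attention_flops_per_token(seq_len: int) -> int:
    """Q/K/V/O projections plus QK^T and attention-times-V over `seq_len` keys."""
    kv_dim = N_KV_HEADS * HEAD_DIM
    projections = 2 * (2 * HIDDEN * HIDDEN + 2 * HIDDEN * kv_dim)
    scores = 2 * 2 * seq_len * HIDDEN  # full (non-causal) count, an upper bound
    return projections + scores


def mlp_share(seq_len: int) -> float:
    """Fraction of per-layer matmul FLOPs spent in the MLP."""
    mlp = mlp_flops_per_token()
    return mlp / (mlp + attention_flops_per_token(seq_len))


if __name__ == "__main__":
    for seq_len in (512, 2048, 4096):
        print(f"seq_len={seq_len}: MLP share of matmul FLOPs = {mlp_share(seq_len):.1%}")
```

At a sequence length of 4096 the MLP accounts for roughly three quarters of the counted FLOPs, which is consistent with the profile described in the comment.
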
- They attribute it to an optimized custom autograd, and since autograd is the core component of gradient computation, the claim is plausible
  Simple improvements such as function inlining and memory optimization are also mentioned, and there is good room for optimization in those areas
  It is not clear, though, whether those advantages can stay confined to the closed-source “pro” version
  If this is low-hanging fruit, open-source implementations will probably adopt it soon
- A more detailed explanation is at https://unsloth.ai/introducing
- The biggest claims are locked behind the paid pro version. That looks like a warning sign
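
To make "custom autograd" concrete: instead of letting a framework record every intermediate operation, the backward pass of a layer can be derived by hand. The sketch below does this for a LoRA-style linear layer, y = xW + s(xA)B, in plain Python and checks the result against finite differences. It illustrates the general technique only and is not Unsloth's implementation; all function names are hypothetical.

```python
import random


def matmul(a, b):
    """Plain-Python product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def axpy(a, b, s=1.0):
    """Element-wise a + s * b."""
    return [[x + s * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scaled(m, s):
    return [[s * v for v in row] for row in m]


def lora_forward(x, w, a, b, s):
    """y = x W + s (x A) B, with W frozen and A, B the trainable adapters."""
    return axpy(matmul(x, w), matmul(matmul(x, a), b), s)


def lora_backward(x, w, a, b, s, grad_y):
    """Hand-derived gradients of sum(y * grad_y) with respect to A, B and x."""
    gy_bt = matmul(grad_y, transpose(b))  # grad_y B^T, shared by grad_a and grad_x
    grad_a = scaled(matmul(transpose(x), gy_bt), s)
    grad_b = scaled(matmul(transpose(matmul(x, a)), grad_y), s)
    grad_x = axpy(matmul(grad_y, transpose(w)), matmul(gy_bt, transpose(a)), s)
    return grad_a, grad_b, grad_x


def numeric_grad(fn, m, eps=1e-5):
    """Central-difference gradient of scalar fn() w.r.t. matrix m (restored afterwards)."""
    grad = [[0.0] * len(m[0]) for _ in m]
    for i in range(len(m)):
        for j in range(len(m[0])):
            old = m[i][j]
            m[i][j] = old + eps
            up = fn()
            m[i][j] = old - eps
            down = fn()
            m[i][j] = old
            grad[i][j] = (up - down) / (2 * eps)
    return grad


def max_abs_diff(p, q):
    return max(abs(u - v) for rp, rq in zip(p, q) for u, v in zip(rp, rq))


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def check(seed=0):
    """Largest gap between the hand-derived and finite-difference gradients."""
    rng = random.Random(seed)
    x, w = rand_matrix(3, 4, rng), rand_matrix(4, 5, rng)
    a, b = rand_matrix(4, 2, rng), rand_matrix(2, 5, rng)
    grad_y, s = rand_matrix(3, 5, rng), 0.5

    def loss():
        y = lora_forward(x, w, a, b, s)
        return sum(u * v for ry, rg in zip(y, grad_y) for u, v in zip(ry, rg))

    analytic = lora_backward(x, w, a, b, s, grad_y)
    numeric = [numeric_grad(loss, m) for m in (a, b, x)]
    return max(max_abs_diff(p, q) for p, q in zip(analytic, numeric))


if __name__ == "__main__":
    print(f"max |analytic - numeric| = {check():.2e}")
```

The hand-written backward computes grad_y B^T once and reuses it for two gradients, the kind of saving that a generic autograd graph does not make automatically.
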
Setting the pricing criticism aside for now, it would be better to immediately find a sales rep or solutions engineer who has worked at an early-stage database company and start cold-calling high-end customers with thousands of GPUs
B2B deals of $200k to $300k or more look like the most likely way to sell this
For those interested, we just published a new blog post covering all the optimizations
There are also 59 fully reproducible benchmarks: https://unsloth.ai/blog/mistral-benchmark
The results look promising, so I want to try it myself
A question about the performance benchmark: why do all the results using 2 GPUs with DDP take longer than a single GPU?
Both benchmarks do the same amount of work per training epoch, so this reverse scaling is unexpected
- There are two main reasons
  First, DDP itself has overhead. At every training step, GPU0 and GPU1 have to synchronize by sending gradients to GPU0
  Second, HuggingFace does not seem well optimized for DDP because of inefficient data movement, and we fixed that part. Interestingly, this also sped up the single-GPU case
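
The first reason can be illustrated with a toy cost model: adding GPUs divides the number of steps per epoch, but each step then pays a gradient synchronization cost. All numbers below are hypothetical and are not taken from the benchmark; real DDP also overlaps communication with computation, which this sketch ignores.

```python
import math


def epoch_seconds(n_samples: int, batch_per_gpu: int, step_compute_s: float,
                  n_gpus: int, sync_s: float) -> float:
    """Epoch time when every step costs compute plus (for >1 GPU) a sync penalty."""
    steps = math.ceil(n_samples / (batch_per_gpu * n_gpus))
    overhead = sync_s if n_gpus > 1 else 0.0
    return steps * (step_compute_s + overhead)


def break_even_sync_s(step_compute_s: float, n_gpus: int) -> float:
    """Sync cost per step above which n_gpus is slower than one GPU (ignoring rounding)."""
    return step_compute_s * (n_gpus - 1)


if __name__ == "__main__":
    single = epoch_seconds(1000, 4, 1.0, n_gpus=1, sync_s=0.0)
    cheap_sync = epoch_seconds(1000, 4, 1.0, n_gpus=2, sync_s=0.2)
    costly_sync = epoch_seconds(1000, 4, 1.0, n_gpus=2, sync_s=1.2)
    print(f"1 GPU: {single:.0f} s")
    print(f"2 GPUs, cheap sync: {cheap_sync:.0f} s")
    print(f"2 GPUs, costly sync: {costly_sync:.0f} s")
    print(f"break-even sync cost for 2 GPUs: {break_even_sync_s(1.0, 2):.1f} s per step")
```

In this model two GPUs only help while the per-step synchronization cost stays below the per-step compute time, so a slow interconnect or inefficient data movement is enough to produce the reverse scaling the question describes.
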
A chronology of all these different attempts would be nice. There are so many variants that I lost track long ago
It would be a big job unless you take self-reported metrics at face value
Even then it would always be conditional on hardware and usage scope
To make it really useful, you would need a CI/CD pipeline with different machine configurations and benchmarks, plus a sensible way to communicate the results
Anyone who managed that would become truly indispensable
- I thought exactly the same
  I'm writing a blog post at https://colab.research.google.com/drive/1AOuhMVILE06mD-Go7-R... that shows every change I made step by step, with timing measurements and memory savings
  I'll post it as soon as it's finished, if there is interest
I'd like to know how this compares with PyTorch Labs' SAM and llama2 optimizations
https://github.com/pytorch-labs/segment-anything-fast
https://github.com/pytorch-labs/gpt-fast
- Those are for inference, and our code is for training
  Faster inference is also planned
  I looked at Chillee's GPT Fast, and it really is extremely fast
Somewhat related: I'm wondering whether a P100 or P40 is still worth using
I was about to buy one, but Pascal support seems to be getting dropped by more and more projects
- A P100 might get Flash Attention support through Xformers, but Triton supports Compute Capability 7.0 and above and the P100 is 6.0, so that is a problem
  Technically the code can run, but you would need modifications to remove the Triton changes
Looks very interesting, but I'm confused about why the maximum-speedup version is enterprise-only
It seems more logical to put the performance difference between the Free and Paid plans, and to differentiate Enterprise through factors such as support
- Good point. We've thought about this too and are still tuning the pricing policy, so all suggestions are welcome
  This is all a first for us, so we're learning as we go
They talked about GPUs from 2018 onward, so I'd like to know why it doesn't run on, say, a 1080 Ti
From a rough look at the hardware specs it seems to support CUDA 8 or above, and here it says 7.5
Can someone else explain?
- Sorry about the 1080 Ti, but Triton and Xformers support CUDA Compute Capability 7.0, so until OpenAI and Meta support 6.0 it is hard for us to support it too
  The main reason is that Tensor Cores arrived with Turing, which made matrix multiplication Tensor Core based
- The 1080 Ti has CUDA Compute Capability 6.1
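
The confusion in this thread is between the CUDA toolkit version and a GPU's Compute Capability. A minimal sketch of the check being discussed, assuming the 7.0 minimum stated in the replies above; the lookup table holds NVIDIA's published values for a few cards, and meets_minimum is a hypothetical helper name.

```python
# Published CUDA Compute Capability for a few GPUs mentioned or implied in the thread.
KNOWN_CAPABILITY = {
    "Tesla P100": (6, 0),
    "Tesla P40": (6, 1),
    "GeForce GTX 1080 Ti": (6, 1),
    "Tesla V100": (7, 0),
    "Tesla T4": (7, 5),
}

# Minimum quoted in the thread for Triton; treated here as an assumption.
MIN_CAPABILITY = (7, 0)


def meets_minimum(capability, minimum=MIN_CAPABILITY) -> bool:
    """Compare (major, minor) tuples; Python orders tuples element by element."""
    return tuple(capability) >= tuple(minimum)


if __name__ == "__main__":
    # On a machine with PyTorch and an NVIDIA GPU, torch.cuda.get_device_capability()
    # returns the same kind of (major, minor) tuple for the installed card.
    for name, capability in KNOWN_CAPABILITY.items():
        verdict = "ok" if meets_minimum(capability) else "below minimum"
        print(f"{name}: {capability[0]}.{capability[1]} -> {verdict}")
```

By this check the P100, the P40, and the 1080 Ti all fall below the 7.0 threshold, whatever CUDA toolkit version they can run.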
