Show HN: A study of Llama 3.2 interpretability using Sparse Autoencoders

(github.com/PaulPauls)

1 point by GN⁺ 2024-11-22 | 1 comment

This project attempts to extract interpretable features by decomposing the internal representations of Llama 3.2-3B with a Sparse Autoencoder (SAE), and it publishes the full pipeline and outputs of one complete run, from activation capture through training, interpretation, and verification
The pipeline captures layer-23 residual activations of Llama 3.2-3B from sentence-level OpenWebText data and trains an SAE in PyTorch with 65,536 latents and TopK=64
- The published resources include the sentence-level OpenWebText dataset, 3.2TB of activations for 25 million sentences, the Weights & Biases training logs, and the SAE model trained for 10 epochs
Training ran for about 7 days on 8x Nvidia RTX4090; the final normalized loss was about 0.144, and the auxiliary loss quickly revived dead latents, which initially made up about 40% of all latents
For the interpretability analysis, Claude 3.5 analyzed the top 50 sentences that most strongly activate each latent; feature steering is possible, but results in this first beta version are not consistent

Project goal and scope

This project applies a Sparse Autoencoder (SAE) to Llama 3.2-3B in an attempt to decompose the LLM's internal representations into more interpretable features
Modern LLMs rely on superposition, storing several overlapping features in the same neuron; an SAE projects activations into a much larger, sparse latent space to try to separate those overlapping representations
The goal is to provide a complete pipeline covering these steps
- capturing LLM activations
- building and preprocessing the SAE training data
- SAE training
- analyzing the meaning of the trained features
- experimental verification and feature steering
The current version 0.2 has run the full pipeline once to build an interpretable SAE for Llama 3.2-3B; it is not the final version
The project is positioned as a reproduction of recent SAE-based mechanistic interpretability research from Anthropic, OpenAI, and Google DeepMind

Key features

The pipeline is built end-to-end from activation capture to verification, and is written in pure PyTorch with minimal dependencies
The main features are
- LLM residual activation capture from a sentence-level variant of the OpenWebText dataset
- prebatching and statistics calculation for efficient training
- single-node multi-GPU distributed SAE training
- auxiliary loss to prevent and revive dead latents
- gradient projection to stabilize training
- training, validation, and dead latent monitoring based on Weights & Biases and console logs
- capture of the inputs that most strongly activate each latent, with semantic analysis by a frontier LLM
- a Llama 3.1/3.2 chat and text completion implementation without the external Fairscale dependency
- SAE impact verification and feature steering through text/chat completion and an optional Gradio UI
The author states that all components were designed with scalability, efficiency, and maintainability in mind

Published outputs

OpenWebText Sentence Dataset
- a variant of OpenWebText processed at the sentence level
- preserves all text and the ordering of the original OpenWebText
- sentences are stored individually in parquet format, which supports fast access
- sentence splitting uses the pre-trained "Punkt" tokenizer from NLTK 3.9.1
Captured Llama 3.2-3B Activations
- layer-23 residual activations of Llama 3.2-3B for 25 million sentences
- compressed from the original 4TB to 3.2TB
- split into 100 archives to make downloads manageable
SAE Training Log
- training, validation, and debug metrics logged with Weights & Biases
- 10 epochs, 10,000 logged steps
- includes train/val main loss, auxiliary loss, and dead latent statistics
Trained 65,536 latents SAE Model
- the final SAE model after 10 completed epochs of training
- trained on 6.5 billion activations from Llama 3.2-3B layer 23

Code structure

The project is divided into four main components
Data Capture
- capture_activations.py: captures LLM residual activations
- openwebtext_sentences_dataset.py: custom dataset for sentence-level processing
SAE Training
- sae.py: core SAE model implementation
- sae_preprocessing.py: preprocessing of the SAE training data
- sae_training.py: distributed SAE training implementation
Interpretability
- capture_top_activating_sentences.py: identifies the sentences that maximize feature activation
- interpret_top_sentences_send_batches.py: builds and sends interpretation batches
- interpret_top_sentences_retrieve_batches.py: retrieves interpretation results
- interpret_top_sentences_parse_responses.py: parses and analyzes interpretation results
Verification and Testing
- llama_3_inference.py: core inference implementation
- llama_3_inference_text_completion_test.py: text completion test
- llama_3_inference_chat_completion_test.py: chat completion test
- llama_3_inference_text_completion_gradio.py: Gradio interface for interactive testing

Custom Llama 3.1/3.2 implementation

The research builds on the Llama 3.1/3.2 transformer implementation in llama_3/model_text_only.py
This implementation is based on the reference implementation in the Llama models repository, modified for the project's purposes
- the heavy dependency on Fairscale was removed
- multimodal functionality was removed, because covering image interpretability in the initial release would have added complexity
Arguments were added to the Transformer constructor to capture activation values at a specific layer or to inject a trained SAE
- store_layer_activ
- sae_layer_forward_fn
Most supporting files in the llama_3/ directory are kept unchanged from the original Llama models repository
- 95% of the supporting code is unused, but the chat formatter depends on interconnected imports, so the files were included as they are
The actual inference implementation is in llama_3_inference.py and supports streaming for both chat and text completion
Inference supports batching, temperature, and top-p settings; at temperature 0 it switches automatically to greedy sampling
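The sampling behavior described above can be illustrated with a minimal, dependency-free sketch. The function name `sample_next_token` and its parameters are hypothetical and are not taken from llama_3_inference.py; the sketch only shows the general technique of temperature scaling, top-p (nucleus) filtering, and a greedy fallback at temperature 0.

```python
import math
import random


def sample_next_token(logits, temperature=0.7, top_p=0.9, rng=random):
    """Pick a token index from raw logits (hypothetical helper, not the repo's code)."""
    if temperature == 0:
        # Greedy decoding: return the index of the largest logit.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature-scaled softmax; subtracting the max keeps exp() stable.
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    probs = [value / total for value in exps]

    # Nucleus: smallest set of most likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for index in order:
        nucleus.append(index)
        mass += probs[index]
        if mass >= top_p:
            break

    # random.choices renormalizes the weights inside the nucleus.
    return rng.choices(nucleus, weights=[probs[i] for i in nucleus], k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes the draw reproducible, which is one way a fixed seed like the one used in the text completion example could be honored.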

Data capture and preprocessing

Activation capture used a custom variant dataset built by processing OpenWebText at the sentence level
The capture settings and scale are as follows
- 25 million sentences
- at most 192 tokens per sentence
- 4TB of raw activation values
- 3.2TB after tar.gz compression
- about 700 million activations
- average sentence length of 27.3 tokens
This dataset is roughly an order of magnitude smaller than the approximately 8 billion unique activations used by Anthropic and Google DeepMind
To compensate for the smaller dataset, the SAE was trained for 10 epochs so that the total number of processed activations approaches that of the Anthropic and Google DeepMind experiments
- the difference is that this project's SAE sees each activation 10 times
- scaling up to 32TB was estimated to raise the GCP bucket cost from about $80/month to about $800/month, which exceeds the cost constraints of a non-profit side project
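As a quick consistency check on the figures above, the arithmetic can be written out. This is only a back-of-the-envelope sketch using the numbers quoted in this summary; the variable names are illustrative.

```python
# Figures quoted in the summary above.
sentences = 25_000_000
avg_tokens_per_sentence = 27.3
epochs = 10
reference_unique_activations = 8_000_000_000  # Anthropic / Google DeepMind scale

# One activation vector is captured per token.
unique_activations = sentences * avg_tokens_per_sentence
processed_activations = unique_activations * epochs
scale_gap = reference_unique_activations / unique_activations

print(f"unique activations:    {unique_activations:,.0f}")     # about 682.5 million
print(f"processed (10 epochs): {processed_activations:,.0f}")  # about 6.8 billion
print(f"reference / unique:    {scale_gap:.1f}x")              # about 11.7x
```

The product of 25 million sentences and 27.3 tokens gives roughly 682.5 million activations, which matches the "about 700 million" figure, and ten passes over it land near the "about 7 billion" processed activations reported for training.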
Sentence-level processing was chosen to preserve meaning within natural language units
- a sentence is treated as a unit that holds a complete thought and its concepts
- artificial truncation of context is avoided
- the mixing of meaning across sentence boundaries, i.e. contextual bleed, is reduced
- the same sentence-level activations can then be reused in the later interpretation analysis
Sentences were processed without a BOS token
- the aim was to interpret meaning-based features while avoiding position-specific patterns
The capture point is layer 23 of the 28 layers of Llama 3.2-3B, taken as the residual stream activation after layer normalization
- this is at roughly 5/6 of the model depth and follows the OpenAI implementation
Capture was implemented as NCCL-based single-node multi-GPU inference
- a separate process handles asynchronous disk I/O, which reduces the bottleneck on GPU processing
- the full capture finished in about 12 hours on 4x Nvidia RTX4090
Preprocessing builds prebuilt batches of 1024 activations
- variable sequence lengths and carryover handling could have caused complex bugs or I/O bottlenecks during training, so a separate preprocessing step was chosen
- the mean tensor over all activations was computed with Welford's algorithm
- the computed mean was used as the initial value of the SAE's b_pre bias
- the whole preprocessing pipeline is CPU-parallelized through multiprocessing
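Welford's algorithm, mentioned above, updates a running mean one sample at a time, so the activations never have to be held in memory at once. The following is a minimal single-process sketch on plain Python lists; the function names are hypothetical, and the project's PyTorch and multiprocessing details are not reproduced here.

```python
def welford_update(count, mean, m2, sample):
    """Fold one activation vector into the running statistics (in place)."""
    count += 1
    for j, value in enumerate(sample):
        delta = value - mean[j]
        mean[j] += delta / count
        # Uses the already-updated mean, as Welford's method requires.
        m2[j] += delta * (value - mean[j])
    return count


def streaming_mean(batches, dim):
    """Return (mean, population variance, count) over an iterable of batches."""
    count = 0
    mean = [0.0] * dim
    m2 = [0.0] * dim
    for batch in batches:
        for sample in batch:
            count = welford_update(count, mean, m2, sample)
    variance = [value / count for value in m2] if count else [0.0] * dim
    return mean, variance, count


# Toy usage: three 2-dimensional "activations" split across two batches.
mean, variance, count = streaming_mean([[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]], dim=2)
print(mean, count)
```

In the setting described above, the resulting mean vector would play the role of the initial value for b_pre; the variance is computed here only because it comes for free with the same update.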

SAE design and training method

The SAE is a TopK autoencoder that largely follows OpenAI's design choices
The forward pass is structured as follows
- Encoder: h = TopK(W_enc(x - b_pre) + b_enc)
- Decoder: x^ = W_dec * h (+ h_bias) + b_pre
b_pre is used in both the encoder and the decoder, and is initialized from the mean computed during preprocessing
b_enc is an encoder-only bias and is initialized randomly
Latent sparsity is enforced by the TopK activation function
- only the k largest activations are kept and the rest are set to 0
- no L1 penalty is used, unlike Anthropic's method
The optional h_bias is disabled during training but can be enabled later for feature steering
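The forward pass above can be sketched without any framework to make the data flow explicit. This is a toy illustration on plain Python lists, not the code in sae.py; the function names are hypothetical, and it assumes that h_bias is added to the latent vector before decoding, which is one reading of the "(+ h_bias)" term.

```python
def topk(values, k):
    """Keep the k largest entries and set all others to 0."""
    keep = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    out = [0.0] * len(values)
    for index in keep:
        out[index] = values[index]
    return out


def sae_forward(x, W_enc, b_enc, W_dec, b_pre, k, h_bias=None):
    """TopK autoencoder forward pass.

    W_enc has shape (n_latents, d_model); W_dec has shape (d_model, n_latents).
    """
    centered = [xi - bi for xi, bi in zip(x, b_pre)]
    pre_acts = [sum(w * c for w, c in zip(row, centered)) + b
                for row, b in zip(W_enc, b_enc)]
    h = topk(pre_acts, k)
    if h_bias is not None:
        # Assumption: steering shifts latent activations before decoding.
        h = [hi + bi for hi, bi in zip(h, h_bias)]
    x_hat = [sum(w * hi for w, hi in zip(row, h)) + bp
             for row, bp in zip(W_dec, b_pre)]
    return h, x_hat


# Toy usage with d_model=2, n_latents=3, k=1.
W_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_dec = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
h, x_hat = sae_forward([3.0, 2.0], W_enc, [0.0, 0.0, 0.0], W_dec, [1.0, 1.0], k=1)
print(h, x_hat)
```

With h_bias left as None the sketch matches the training-time form; passing a vector that is zero everywhere except one latent mimics the steering setup in which a single latent's activation is raised.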
float32 was used for numerical precision
- the write-up notes that float32 shares its 1 sign bit and 8 exponent bits with the bfloat16 format that Llama requires, so conversion is fast and accurate
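The shared bit layout can be demonstrated with the standard library alone. bfloat16 has 1 sign bit, 8 exponent bits, and 7 mantissa bits, while float32 has 1, 8, and 23, so a bfloat16 value is simply the upper 16 bits of a float32. The helper names below are illustrative, and the narrowing direction truncates instead of rounding, a simplification that real libraries generally do not make.

```python
import struct


def float32_bits(value):
    """Return the 32-bit IEEE 754 pattern of a Python float stored as float32."""
    return struct.unpack(">I", struct.pack(">f", value))[0]


def bfloat16_bits_to_float32(bits16):
    """Widen bfloat16 to float32 by appending 16 zero mantissa bits (exact)."""
    return struct.unpack(">f", struct.pack(">I", bits16 << 16))[0]


def float32_to_bfloat16_bits(value):
    """Narrow float32 to bfloat16 by dropping the low 16 mantissa bits.

    Simplification: truncates instead of rounding to nearest.
    """
    return float32_bits(value) >> 16


for sample in (1.5, -2.0, 3.140625):
    bits = float32_to_bfloat16_bits(sample)
    print(f"{sample:>10} -> 0x{bits:04X} -> {bfloat16_bits_to_float32(bits)}")
```

Because widening only appends zeros, every bfloat16 value has an exact float32 representation, which is the property the note above relies on.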
The main SAE hyperparameters of this project are
- d_model = 3072
- n_latents = 2**16, i.e. 65,536
- k = 64
- k_aux = 2048
- aux_loss_coeff = 1 / 32
- dead_steps_threshold = 80_000
- batch_size = 1024
- num_epochs = 10
- learning_rate = 5e-5
- train_val_split = 0.95
The chosen latent dimension is about 21 times larger than the residual stream dimension of Llama 3.2 3B, which is 3,072
The loss function combines the main reconstruction loss with an auxiliary loss
- total_loss = main_loss + aux_loss_coeff * aux_loss
- both losses are computed in normalized space
The auxiliary loss is the approach proposed by OpenAI; its role is to prevent and revive dead latents
- it is the MSE between the main reconstruction residual and an auxiliary reconstruction
- the top-k_aux values among latents that have not activated recently are passed back through the decoder to provide a training signal
- the aim is to push inactive latents, which are left out of main training because it uses only the top k latents, to capture the information that was missed
A latent is considered dead if it has not activated for dead_steps_threshold, i.e. 80,000, training steps
- this setting corresponds to roughly 1 epoch
- with an effective batch size of 8192, it means the latent did not activate even once in the reconstruction of the most recent 650 million or so activations
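The dead-latent window quoted above follows directly from the hyperparameters. The sketch below reproduces the arithmetic; the only assumption beyond the quoted figures is that one epoch covers the 95% training split of the roughly 682.5 million captured activations.

```python
dead_steps_threshold = 80_000
per_gpu_batch_size = 1024
num_gpus = 8
effective_batch_size = per_gpu_batch_size * num_gpus  # 8192

# Activations reconstructed during the dead-latent window.
window_activations = dead_steps_threshold * effective_batch_size

# Assumed epoch size: 25M sentences * 27.3 tokens * 0.95 train split.
train_activations = 25_000_000 * 27.3 * 0.95
steps_per_epoch = train_activations / effective_batch_size

print(f"window: {window_activations:,} activations")   # 655,360,000
print(f"steps per epoch: {steps_per_epoch:,.0f}")       # about 79,147
```

Under that assumption one epoch takes a little over 79,000 steps, which is why the 80,000-step threshold is described as roughly one epoch.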
Training was run as single-node multi-GPU distributed training with the NCCL backend
- 8x Nvidia RTX4090
- 10 epochs
- per-GPU batch size of 1024
- effective batch size of 8192
- about 7 billion activations processed
- it took slightly more than 7 days
The AdamW settings were adjusted to account for the rare activation patterns of a sparse autoencoder
- beta_1 = 0.85
- beta_2 = 0.9999
- eps = 6.25e-10
- the learning rate decays from 5e-5 to 1e-5 with cosine annealing
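The cosine annealing schedule between the two learning rates can be written as a small function. This is a sketch of the standard formula and not the project's scheduler; it assumes a single cycle over the whole run with no restarts, and the function name is hypothetical.

```python
import math


def cosine_lr(step, total_steps, lr_max=5e-5, lr_min=1e-5):
    """Learning rate after `step` of `total_steps` under one cosine cycle."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Start, midpoint, and end of a hypothetical 1,000-step run.
for step in (0, 500, 1000):
    print(step, f"{cosine_lr(step, 1000):.2e}")
```

The schedule starts at 5e-5, passes through the midpoint value 3e-5 halfway through, and ends at 1e-5.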
The decoder weights are normalized to unit norm after initialization and after every training step
project_decoder_grads() removes the gradient components that are parallel to the existing dictionary vectors, in order to maintain the unit-norm constraint on the decoder weights
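The projection idea can be shown for a single dictionary vector. For a unit vector d and gradient g, subtracting (g · d) d leaves only the part of g that is orthogonal to d, so a small step along it barely changes the norm. The helpers below are a hypothetical plain-Python sketch and not the body of project_decoder_grads().

```python
import math


def normalize(vector):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(value * value for value in vector))
    return [value / norm for value in vector]


def project_out_parallel(grad, direction):
    """Remove the component of grad parallel to the unit vector direction."""
    dot = sum(g * d for g, d in zip(grad, direction))
    return [g - dot * d for g, d in zip(grad, direction)]


# Toy usage: one decoder column and its gradient.
column = normalize([3.0, 4.0])           # [0.6, 0.8]
projected = project_out_parallel([1.0, 0.0], column)
residual_dot = sum(p * c for p, c in zip(projected, column))
print(projected, residual_dot)
```

In a full decoder the same operation would be applied to every column independently, followed by the renormalization step mentioned above.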

Training results

SAE training ran for about 7 days on 8x Nvidia RTX4090 and showed stable convergence
The final total normalized loss reached about 0.144
Validation loss was computed on a held-out 5% of the training data and showed the same logarithmic decrease as the training loss
After the warm-up of 80,000 training steps, about 40% of latents were identified as dead
The auxiliary loss revived dead latents quickly, and the dead latent ratio dropped sharply
The auxiliary loss was computed only when there were at least k_aux, i.e. 2,048, dead latents
- this condition effectively made about 3% of the 65,536 latents a soft lower bound
- in later stages the auxiliary loss was often 0 because there were too few dead latents
Anthropic and OpenAI reported up to 65% dead latents in some configurations, whereas in this project the combination of a smaller latent size, auxiliary loss, and gradient projection made dead latents decline quickly
The write-up notes that removing the minimum dead latent condition on the auxiliary loss in future experiments could reduce dead latents further

Interpretability analysis

The interpretability analysis draws on Anthropic's scaling monosemanticity method, but analyzes at the sentence level instead of the single-token level
For each latent, the top 50 sentences with the strongest activation were captured
Activation strength was aggregated over all tokens of a sentence in two ways
- mean: finds semantic topics that activate consistently across the whole sentence
- last: uses the last-token representation, which in an autoregressive model has seen the whole sentence
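The two aggregation modes and the top-N selection can be sketched in a few lines. This is an illustrative stand-in for what capture_top_activating_sentences.py is described as doing; the function names and the input format (one list of per-token activations of a single latent per sentence) are assumptions.

```python
import heapq


def aggregate(token_activations, mode):
    """Collapse one latent's per-token activations into a sentence score."""
    if mode == "mean":
        return sum(token_activations) / len(token_activations)
    if mode == "last":
        return token_activations[-1]
    raise ValueError(f"unknown aggregation mode: {mode}")


def top_sentences(per_sentence, mode, n=50):
    """Return the n highest-scoring (score, sentence_index) pairs."""
    scored = ((aggregate(acts, mode), index) for index, acts in per_sentence)
    return heapq.nlargest(n, scored)


# Toy usage: three sentences with three tokens each.
data = [(0, [1.0, 1.0, 1.0]), (1, [0.0, 0.0, 5.0]), (2, [2.0, 2.0, 0.0])]
print(top_sentences(data, "mean", n=2))
print(top_sentences(data, "last", n=2))
```

The toy data shows how the two modes can rank sentences differently: a sentence that activates steadily scores well under mean, while one that spikes only on its final token is favored by last.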
Claude 3.5, specifically claude-3-5-sonnet-20241022, was used for the semantic analysis
The prompt was built to perform these steps on the 50 sentences
- identify key words and phrases
- group thematic elements
- consider possible outliers
- give a final semantic interpretation with a confidence score
The analysis pipeline was implemented in three stages
- sending analysis requests in cost-efficient batches
- retrieving the responses
- parsing and processing the semantic interpretations
Intermediate outputs were preserved for reproducibility and further analysis
- capture_top_sentences/: original sentences, activation aggregation, OpenWebText index
- top_sentences_last_responses/ and top_sentences_mean_responses/: semantic analysis responses before processing
- latent_index_meaning/: mapping from latent index to common_semantic and certainty score
As an example, latent #896 was identified as "references to formal institutional terminology about United Nations agencies, people, operations, and official documents"
- 50 of 50 sentences referenced the UN directly
- the terms included UN, United Nations, Secretary-General, Special Rapporteur, UNDP, UNHCR, OCHA, and UNODC
- the certainty came out as 1.0
Processing 24,828,558 input tokens and 3,920,044 output tokens in Claude 3.5 batch mode cost $66.74
This approach was chosen as an initial method for feature extraction and possible feature steering, and the write-up notes that its simplicity comes at a cost in result quality

Verification and feature steering

The verification infrastructure consists of three scripts for analyzing and verifying the SAE's effect on model behavior
- llama_3_inference_chat_completion_test.py
- llama_3_inference_text_completion_test.py
- llama_3_inference_text_completion_gradio.py
Each implementation supports
- batched inference
- processing each line as a separate batch element
- temperature and top-p settings
- injection of the trained SAE
- feature activation analysis
- feature steering
The semantic meanings and certainty scores in latent_index_meaning/ served as the basis for the feature activation analysis and steering experiments
The four example prompts are
- The delegates gathered at the
- Foreign officials released a statement
- Humanitarian staff coordinated their efforts
- Senior diplomats met to discuss
The text completion example was run with max_new_tokens=128, temperature=0.7, top_p=0.9, seed=42
The feature steering example targets latent #896
- the latent's activation value was raised by 20 through h_bias
- the model's text completion can be guided toward UN-related content
Feature steering in this first beta version is not robust
- even in the example, only the second and third sentences shifted to UN-related content
- the starting sentences were deliberately chosen to be likely to drift toward the UN
- the write-up says it would fail on a sentence start unrelated to the UN, such as For any n, if 2n - 1 is odd
The current interpretability analysis focuses on feature extraction over steering optimization, so steering results are not consistent
Feature steering is included in the first release as an additional demo, and the conclusion is that feature extraction is useful for model understanding in its own right

Directions for further improvement

An experiment is proposed that raises the latent dimension to at least 2^18, i.e. 262,144 features, and lowers k to 32
- the aim is to find more unique features while keeping stronger sparsity
- the added compute would have to be offset by efficiency improvements or techniques such as gradient accumulation
Latent activation tracking is planned to become more systematic
- recording the state of the latent_last_nonzero tensor repeatedly during training would give a deeper view of when a latent activates or goes dead
Support is proposed for tracking co-activation patterns in the sparse latent space to analyze feature interactions
An interpretability analysis method that groups high-activation sentences and n-grams more precisely is left as future work
Beyond feature extraction, an interpretability analysis based on feature steering could also be carried out
The research could be extended to Llama 3.1-8B activations
- since it shares the codebase with Llama 3.2, the main requirements are hyperparameter tuning and considerable compute power
Experiments with a different activation capture point are also proposed
- earlier layers of the model
- attention head output inside the transformer block
- MLP output
The auxiliary loss mechanism could be optimized further
- the current implementation has shown strong performance in preventing dead latents, and the relationship between the minimum dead latent threshold and feature quality could be investigated
Changes to the bias terms of the SAE architecture and to the main loss function are also candidates for future experiments
Docstrings need to be added across the codebase
- inline documentation is in place, but the write-up says there was no time to add proper docstrings for the first release

1 comment

GN⁺ 2024-11-22

Hacker News comments

Mechanistic interpretability addresses a common problem that arises when you ask an LLM, "Why did you answer that way?" The model's self-explanation is often not the real reason but a rhetoric game: it invents a plausible-sounding reason from patterns in its training data in order to persuade
The more powerful the model, the more convincingly it can justify a falsehood after the fact, so it may actually get worse on tests of recognizing its own "untruthfulness." The objective is consistency, not truth
Rhetoric is not reasoning, and the real explainability that an overfit sparse autoencoder claims to offer is closer to the causal flow that the model traverses as "thought" while producing an answer
- Humans do something similar. They often do not know why they thought or did something, and later explain it with a plausible confabulation
- It is as if art/AI imitates life. Human reasoning, too, may amount to deciding quickly first and then using reason to persuade others of that belief
  There has been discussion of reasoning as a tool of social influence, which also helps explain why very articulate people find it hard to accept that they were wrong: they have usually won their arguments with others. X seems to be a representative example
- Much of the work on mechanistic interpretability strikes me as another kind of witchcraft. Without rigorous group representation theory or clear symmetry, as in the integer quantum Hall effect, loading so much onto the word "superposition" as an odd analogy feels forced. I have read the papers in full, and it also looks a bit like an attempt to find a funded postdoc
  Still, there is one thing I consider a great insight and the start of a plausible research program. High-dimensional bounded nearly-orthogonal vector spaces are extremely counterintuitive, and there are existing results for handling them rigorously https://en.m.wikipedia.org/wiki/Johnson%E2%80%93Lindenstraus...
- A model's logic and truthfulness can be tested easily: present a wrong decision to the model as if the model itself had made it, and ask for an explanation
  The model has no memory and cannot tell where text came from, so a "truthful" model should admit the error without being asked. In practice it is more likely to engage in parallel construction to support what it takes to be its own decision
- I am curious how the causality part works. Can it output a graph model?
This is impressive and very well documented work. The loss curves and the evaluation of dead latents stand out in particular
Our team also did research on SAEs, but we trained them to reconstruct dense embeddings of paper abstracts instead of individual tokens https://arxiv.org/abs/2408.00657
We also saw power-law scaling in the lower bound of the loss curve as we varied the sparsity level and the dimension of the SAE latent space, and the auxiliary loss fully mitigated dead latents. A smooth sine-wave pattern also appeared over training iterations, though we do not know whether it was specific to this application on abstract embeddings or a more general phenomenon
- I am especially glad the documentation is appreciated. Writing it was much harder than writing the code. I have downloaded the paper you shared and will read it tomorrow morning
At first glance this looks like positive work for alignment, though I have not looked at the details yet. I do not know whether it would even be possible, but I would be interested to know how much one would have to pay to cover the time, cost, and risk
I recently read an article on the difficulties of SAE evaluation: https://adamkarvonen.github.io/machine_learning/2024/06/11/s...
I would like to know how you handled this problem, and where in the repository to look to understand the approach
- SAE evaluation is very complex, because the question is which SAE best produces the most unique features while staying as sparse as possible, and that is close to the core of research on LLM interpretability via SAEs
  Even supposing the problem of finding several perfect SAE architectures and training them fully were already solved, which SAE is better would still be decided by how well it performs on the metrics of automated interpretability methodologies. OpenAI's methodology in particular scores SAEs on several technical metrics and emphasizes large-scale automated interpretability
  The optimal metrics and methodology are themselves still an open research question, so I could have spent a few more months on experiments, but for this first release I chose a simple approach. Chapter 4, Interpretability Analysis, of the implementation details and results discusses how my methodology differs from OpenAI's https://github.com/PaulPauls/llama3_interpretability_sae#4-i...
  I would also recommend reading the OpenAI paper directly or looking at Anthropic's transformer-circuits.pub https://transformer-circuits.pub/
This work has been taken down and the repository has been archived. There is no explanation of what happened
- I would like to know too. Several forks still exist, such as this one: https://github.com/plastic-labs/llama3_interpretability_sae I am not affiliated with it
This is really great work. Are there plans to integrate it with SAELens?
- Not sure yet. I will consider it, but next week I plan to regroup on direction and next steps
  As another, simpler project, I might also show how to build the full model of the current Llama 3.2 implementation from scratch in pure PyTorch. I like building from the ground up, but while looking for documentation for the Llama 3.2 background section of this SAE project I found that the existing docs are either very superficial or outdated, written for Llama 1/2. Machine learning documentation goes stale very quickly these days
I have a slightly awkward question about mechanistic interpretability. When humans are measured by a metric they start gaming that metric, so could a future AI also game mechanistic interpretability?
To illustrate, suppose tokens are encoded in a 2D matrix with a mapping such as Apple=1a, Pear=1b, Donkey=2a, Horse=2b; if neurons 1, 2, a, and b are all active, it becomes hard to tell whether that means apple+horse or donkey+pear
If a very capable future AI supervised its own training, could it deliberately choose weights that preserve this kind of encoding collision, deceive mechanistic interpretability observers, and in effect think in euphemisms?
- The AI safety scenario is harder than that. Creating this kind of latent problem does not require "a very capable AI supervising its own training"; a malicious AI researcher may be enough
  For example, one could search for a model that is racist but has no interpretability activation patterns identifiable as racism. The work in this Show HN suggests that someone with sufficient funding could, with some difficulty, attempt such adversarial training, and if that produced new results it would be quite interesting
It is really good to see more public SAE work. The engineering effort looks substantial too, and I plan to look at the data loading code tomorrow
My ongoing project on training SAEs on vision models may also interest you: https://github.com/samuelstevens/saev
If you found the Golden Gate Bridge latent and put a Golden Gate Llama 3.2 on HuggingFace, it would probably get more attention and response
A Space link for chatting with it would be even better. And you did not ask, but putting some interesting results or visualizations at the very top of the README would be a great idea
