
Probing LLM Fine-Tuning via Sparse Autoencoders

In this AI Research Roundup episode, Alex discusses the paper 'A Mechanistic Investigation of Supervised Fine Tuning'. The research asks why Supervised Fine-Tuning (SFT) changes LLM behavior so markedly even though the hidden activations of the base and fine-tuned models keep a high cosine similarity. The authors introduce a diagnostic pipeline that uses pretrained Sparse Autoencoders (SAEs) to expose representational shifts that raw activations hide. Their analysis shows that while raw activations look similar, the underlying sparse latents diverge in task-specific and layer-specific ways. The study identifies specific semantic features that are systematically altered during fine-tuning. The researchers also report a distinct layer-wise update profile associated with safety alignment.

Paper URL: https://arxiv.org/pdf/2605.11426

#AI #MachineLearning #DeepLearning #LLM #SparseAutoencoders #FineTuning #Interpretability #SFT
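The contrast between similar raw activations and diverging sparse latents can be illustrated with a toy sketch. This is not the authors' pipeline: the encoder form (ReLU applied after subtracting a decoder bias), the hand-picked weights, and the names `sae_encode` and `compare` are all assumptions made for illustration. A real analysis would use a pretrained SAE and activations taken from the base and fine-tuned models.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    if na == 0.0 or nb == 0.0:
        return 0.0
    return float(np.dot(a, b) / (na * nb))

def sae_encode(x, W_enc, b_enc, b_dec):
    """Toy SAE encoder (assumed form): ReLU((x - b_dec) @ W_enc + b_enc)."""
    return np.maximum((x - b_dec) @ W_enc + b_enc, 0.0)

def compare(h_base, h_sft, W_enc, b_enc, b_dec):
    """Compare two activation vectors in raw space and in SAE latent space."""
    z_base = sae_encode(h_base, W_enc, b_enc, b_dec)
    z_sft = sae_encode(h_sft, W_enc, b_enc, b_dec)
    # Latents that are active in exactly one of the two models.
    changed = np.flatnonzero((z_base > 0) != (z_sft > 0)).tolist()
    return {
        "raw_cosine": cosine(h_base, h_sft),
        "latent_cosine": cosine(z_base, z_sft),
        "changed_latents": changed,
    }

# Hand-built example (hypothetical values): both vectors share one large
# component, which dominates the raw cosine, and differ only in small ones.
h_base = np.array([10.0, 1.0, 0.0])   # stand-in for a base-model activation
h_sft = np.array([10.0, 0.0, 1.0])    # stand-in for a fine-tuned activation

W_enc = np.eye(3)                      # one latent per hidden dimension
b_enc = np.array([0.0, -0.5, -0.5])    # negative bias acts as a threshold
b_dec = np.array([10.0, 0.0, 0.0])     # removes the shared large component

result = compare(h_base, h_sft, W_enc, b_enc, b_dec)
print(result)
```

In this constructed case the raw cosine similarity is about 0.99, while the two latent vectors have no active feature in common, so their cosine is 0. It shows only one plausible way the two views can disagree, not the mechanism the paper reports.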

Resources:
- GitHub: https://github.com/ruhzi/sae-investigation


Tags: Deep Learning, LLM, Machine Learning, Mechanistic Interpretability, Model Alignment, Neural Networks, Podcast, Representation Learning, Research, SAE, SFT, Safety Alignment, Sparse Autoencoders, Supervised Fine-Tuning


Video information

Yesterday, 3:21:44

00:05:05

AI Research Roundup


