TRIBE v2: Meta's Tri-Modal Brain Model, Explained

**TRIBE v2** is an **artificial intelligence foundation model** that predicts human brain activity from **video, audio, and language** inputs. It was trained on more than **1,000 hours of fMRI recordings** from 720 individuals and predicts neural responses in both naturalistic and controlled experiments. Because it outperforms traditional linear models, it supports **in silico neuroscience**: scientists can replicate decades of empirical findings in computer simulation rather than with live subjects. The model recovers specific **functional networks** and shows how the cortex integrates multiple senses, such as vision and speech. The authors position it as a **unifying platform** that can connect fragmented lines of cognitive research and accelerate understanding of the brain's organisation, with **data-driven AI** serving as a digital proxy for exploring human cognition.

**Title:** A foundation model of vision, audition, and language for in-silico neuroscience.
**Authors:** Stéphane d’Ascoli, Jérémy Rapin, Yohann Benchetrit, Teon Brooks, Katelyn Begany, Joséphine Raugel, Hubert Banville, and Jean-Rémi King.
**Institutions:** FAIR at Meta (all authors) and Laboratoire de Neurosciences Cognitives et Computationnelles, Ecole Normale Supérieure - PSL (Joséphine Raugel).

**What problem the paper was trying to solve**
Cognitive neuroscience has traditionally relied on a fragmented, "divide-and-conquer" approach, producing highly specialized models tailored to specific tasks that fail to provide a unified understanding of human cognition. Prior predictive brain models have also been severely limited because they typically assume linear relationships, train separately for individual subjects and tasks, and only process single sensory modalities, which prevents them from capturing how the brain integrates complex, multisensory information.

**What are the paper's key novel ideas?**
The core innovation is **TRIBE v2**, a tri-modal foundation model that jointly processes video, audio, and language to predict high-resolution functional Magnetic Resonance Imaging (fMRI) brain activity. Crucially, the model can **zero-shot generalize to novel tasks and unseen subjects**. It captures subject-averaged brain responses accurately enough to enable **in-silico experimentation**, that is, computationally replicating decades of empirical neuroscience findings without scanning human participants.

**What is the architecture or method they are using?**
The system extracts high-dimensional embeddings from intermediate layers of **frozen, state-of-the-art pretrained AI models**: Llama-3.2-3B for text, Wav2Vec-Bert-2.0 for audio, and Video-JEPA-2-Giant for video. These multimodal time-series features are concatenated and fed into an **8-layer Transformer encoder** that aggregates temporal context, followed by an adaptive pooling layer that matches the sampling frequency of the fMRI data. Finally, a **subject-conditional linear layer** projects the latent representation onto specific brain regions. During training, "modality dropout" and "subject dropout" keep the predictions robust when a modality or the subject identity is missing.
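
The data flow around the Transformer can be illustrated with a minimal, standard-library sketch. It is not the authors' implementation: the function names, the `"<avg>"` fallback key, the dropout probabilities, and the toy feature sizes are all assumptions made for illustration, and the 8-layer Transformer encoder itself is omitted. The sketch only shows one plausible reading of modality dropout, concatenation, adaptive pooling, subject dropout, and a per-subject linear readout.

```python
import random

def modality_dropout(features, p_drop, rng):
    """Zero out whole modalities at random, always keeping at least one.

    features: dict of modality name -> list of per-timestep vectors.
    """
    names = list(features)
    kept = [n for n in names if rng.random() >= p_drop]
    if not kept:  # never drop every modality at once
        kept = [rng.choice(names)]
    return {
        n: feats if n in kept else [[0.0] * len(v) for v in feats]
        for n, feats in features.items()
    }

def concat_modalities(features, order):
    """Concatenate per-timestep vectors across modalities (equal length T)."""
    steps = zip(*(features[n] for n in order))
    return [[x for vec in step for x in vec] for step in steps]

def adaptive_avg_pool(seq, out_len):
    """Average-pool a length-T sequence of vectors down to out_len steps."""
    t = len(seq)
    pooled = []
    for i in range(out_len):
        start = (i * t) // out_len
        end = -(-((i + 1) * t) // out_len)  # ceiling division
        window = seq[start:end]
        pooled.append([sum(col) / len(window) for col in zip(*window)])
    return pooled

def subject_dropout(subject, p_drop, rng, unknown="<avg>"):
    """With probability p_drop, hide the subject identity (assumed scheme)."""
    return unknown if rng.random() < p_drop else subject

def subject_linear(latent, weights, subject, unknown="<avg>"):
    """Apply the subject's weight matrix; fall back to a shared one."""
    w = weights.get(subject, weights[unknown])
    return [[sum(wi * xi for wi, xi in zip(row, x)) for row in w] for x in latent]

# Toy usage: 4 feature timesteps pooled down to 2 "fMRI" timesteps.
rng = random.Random(0)
feats = {
    "text": [[1.0, 2.0]] * 4,
    "audio": [[3.0]] * 4,
    "video": [[5.0, 6.0]] * 4,
}
dropped = modality_dropout(feats, p_drop=0.3, rng=rng)
x = concat_modalities(dropped, order=["text", "audio", "video"])
pooled = adaptive_avg_pool(x, out_len=2)  # Transformer would run before this
weights = {
    "sub-01": [[1, 0, 0, 0, 0], [0, 0, 1, 0, 0]],
    "<avg>": [[0.5] * 5, [0.5] * 5],
}
pred = subject_linear(pooled, weights, subject="sub-99")  # unseen subject
```

In this sketch, an unseen subject such as the hypothetical `"sub-99"` falls back to the shared `"<avg>"` weights, which is the role subject dropout would train for under the assumptions above.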

**Why the paper matters**
The research supports a shift away from isolated brain mapping toward using deep neural networks as unified predictive frameworks for brain function. TRIBE v2 supersedes the current gold standard of linear voxel-wise encoding, delivering multi-fold accuracy improvements and following a log-linear scaling law in which more data reliably improves performance. Furthermore, the model won the Algonauts 2025 brain prediction competition and extracts interpretable features that align closely with known biological brain networks, offering insight into multisensory integration.
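
A log-linear scaling law means the score grows linearly in the logarithm of the amount of training data, so each tenfold increase in data adds roughly the same increment. The sketch below fits such a line by ordinary least squares. The data points are synthetic placeholders chosen for illustration, not results from the paper, and `fit_log_linear` is a hypothetical helper name.

```python
import math

def fit_log_linear(hours, scores):
    """Least-squares fit of score = a + b * log10(hours)."""
    xs = [math.log10(h) for h in hours]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx            # gain per tenfold increase in data
    a = mean_y - b * mean_x  # extrapolated score at 1 hour
    return a, b

# Synthetic placeholder numbers, NOT measurements from the paper.
hours = [10, 100, 1000]
scores = [0.10, 0.15, 0.20]
a, b = fit_log_linear(hours, scores)
print(f"score = {a:.3f} + {b:.3f} * log10(hours)")
```

On real data, the slope `b` would be read as the expected gain from collecting ten times more recordings, provided the trend continues to hold.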

**What are the potential applications**
Because TRIBE v2 can predict group-level brain responses more accurately than some individual human recordings, it can be used as a robust digital platform to **pilot and pre-screen naturalistic neuroimaging studies**. Researchers can use it for rapid in-silico hypothesis testing, such as identifying the specific brain areas involved in language processing or face recognition, which will help augment existing datasets and allow scientists to identify the most critical physical experiments to run in the future.
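
An in-silico localizer of the kind described above can be sketched as a contrast between predicted responses to two stimulus conditions. The `toy_predict` function, its hard-coded response table, the stimulus dictionaries, and the region names are all hypothetical stand-ins: a real study would call the trained model instead, and nothing here reflects the actual TRIBE v2 interface.

```python
def mean_response(predict, stimuli):
    """Average the predicted per-region response over a list of stimuli."""
    preds = [predict(s) for s in stimuli]
    return [sum(col) / len(preds) for col in zip(*preds)]

def contrast(predict, cond_a, cond_b):
    """Per-region difference: mean response to A minus mean response to B."""
    a = mean_response(predict, cond_a)
    b = mean_response(predict, cond_b)
    return [x - y for x, y in zip(a, b)]

def top_regions(contrast_map, names, k):
    """Names of the k regions with the largest A-minus-B contrast."""
    ranked = sorted(zip(contrast_map, names), reverse=True)
    return [name for _, name in ranked[:k]]

# Toy stand-in for a brain-encoding model (hypothetical values).
TOY_RESPONSES = {
    "face": [1.0, 0.2, 0.5],
    "place": [0.1, 0.9, 0.5],
}

def toy_predict(stimulus):
    return TOY_RESPONSES[stimulus["category"]]

regions = ["region_A", "region_B", "region_C"]
faces = [{"category": "face"}] * 3
places = [{"category": "place"}] * 3

face_vs_place = contrast(toy_predict, faces, places)
print(top_regions(face_vs_place, regions, k=1))
```

Regions ranked highest by such a contrast would be candidates to confirm in a physical experiment, which is the pre-screening use the summary describes.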

The description, the research summary (based on a human template), and the video were generated by Google's NotebookLM on 10th May 2026.

Video "TRIBE v2: Meta's Tri-Modal Brain Model, Explained" from the MLSlops channel.
