SAM 3: Segment Anything with Concepts
This video introduces **SAM 3**, a unified artificial intelligence model created by Meta for detecting, segmenting, and tracking objects in images and videos. Building upon its predecessors, this version introduces **Promptable Concept Segmentation**, allowing users to identify all instances of a specific category using text phrases or image examples. The system uses a **dual encoder-decoder architecture** that separates the task of finding an object's location from that of recognizing its identity. To power this technology, the researchers developed a massive **data engine** that combines human effort with AI verifiers to produce millions of high-quality labels. Extensive testing demonstrates that **SAM 3** significantly outperforms existing tools in both accuracy and speed, even when handling complex or rare visual concepts. The authors have released the model code and the new **SA-Co benchmark** to support further advancements in multimodal AI research.
**SAM 3: Segment Anything with Concepts**
**Authors:** Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane Momeni, Rishi Hazra, Shuangrui Ding, Sagar Vaze, Francois Porcher, Feng Li, Siyuan Li, Aishwarya Kamath, Ho Kei Cheng, Piotr Dollár, Nikhila Ravi, Kate Saenko, Pengchuan Zhang, Christoph Feichtenhofer.
**Institutions:** Meta Superintelligence Labs.
**What problem the paper was trying to solve**
Previous models in the SAM series achieved breakthroughs in Promptable Visual Segmentation by using points, boxes, or masks to segment a single object, but they lacked the ability to find and segment all instances of a specific visual concept (e.g., finding all "cats" in a video). The paper aims to solve this gap by introducing the **Promptable Concept Segmentation (PCS)** task, which tackles the intrinsic ambiguity of grounding open-vocabulary concepts across entire images and video frames.
**What are the paper's key novel ideas?**
The core innovation is enabling models to **take short noun phrases, image exemplars, or a combination of both as prompts** to return segmentation masks and unique identities for all matching objects. To make this possible, the authors designed a novel **"presence token"** that decouples the tasks of globally recognizing a concept and locally segmenting it. They also developed a highly scalable **human- and AI-in-the-loop data engine** to automatically verify masks and exhaustively generate training data with challenging hard negatives.
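The prompt types described above (noun phrase, image exemplars, or both) can be pictured as a small data structure. The following is a hedged sketch with hypothetical names (`ConceptPrompt`, `Exemplar`); it illustrates the idea only and is not the released SAM 3 API:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Exemplar:
    """An image exemplar: a bounding box plus a positive/negative label."""
    box: Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels
    positive: bool = True                   # negatives exclude look-alikes

@dataclass
class ConceptPrompt:
    """A concept prompt: a short noun phrase, image exemplars, or both."""
    phrase: Optional[str] = None
    exemplars: List[Exemplar] = field(default_factory=list)

    def is_valid(self) -> bool:
        # At least one prompt modality must be supplied.
        return self.phrase is not None or len(self.exemplars) > 0

# A text-only prompt, and a combined text + exemplar prompt:
p_text = ConceptPrompt(phrase="yellow school bus")
p_both = ConceptPrompt(phrase="cat", exemplars=[Exemplar((10, 20, 80, 90))])
```

The point of the combined form is that exemplars can refine an ambiguous phrase, and negative exemplars can exclude visually similar distractors.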
**What is the architecture or method they are using?**
SAM 3 utilizes a **dual encoder-decoder transformer architecture** consisting of an image-level detector and a memory-based video tracker that share a single Perception Encoder (PE) backbone. The detector is DETR-based and utilizes the presence head to predict if a concept exists in the input before object queries resolve local boundaries. For videos, the tracker inherits SAM 2's memory bank architecture to propagate spatial-temporal masklets across frames, utilizing **temporal disambiguation strategies** like periodic re-prompting and confirmation delays to handle occlusions and crowded scenes.
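One way to picture the presence-head decoupling is as a global gate on per-query scores: each object query scores its own local candidate ("is this box the concept?"), while a single presence probability for the whole input answers "does the concept appear at all?", and the final score is the product. A toy numerical sketch of this gating, not the actual model code:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def detection_scores(query_logits, presence_logit):
    """Decouple recognition from localization: the presence token gives a
    global concept-presence probability that scales every per-query score."""
    presence = sigmoid(presence_logit)
    return [sigmoid(q) * presence for q in query_logits]

# Concept clearly present: confident per-query scores survive the gate.
present = detection_scores([2.0, -1.0, 3.0], presence_logit=4.0)
# Concept absent: even confident queries are suppressed globally.
absent = detection_scores([2.0, -1.0, 3.0], presence_logit=-4.0)
```

Under this framing, queries no longer have to encode "the concept is missing" individually, which is the ambiguity the presence token is designed to remove.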
**Why the paper matters**
SAM 3 sets a new state-of-the-art standard, **doubling the accuracy of existing systems** for promptable concept segmentation in both images and video. Alongside the model, the authors open-sourced the **Segment Anything with Concepts (SA-Co) benchmark**, which provides over 207,000 unique concepts—more than 50 times the scale of existing benchmarks—significantly advancing the tools available for computer vision research.
**What are the potential applications**
The ability to exhaustively find and segment anything is a foundational capability for multimodal AI, directly driving advancements in **robotics, augmented reality, content creation, data annotation, and broader scientific fields**. Additionally, SAM 3 can function as a specialized "vision tool" for Multimodal Large Language Models (MLLMs), creating **visual agentic systems** capable of handling complex language prompts and executing advanced spatial reasoning tasks.
The description, the research summary (based on a human template), and the video were generated by Google's NotebookLM on April 19th, 2026.
Video "SAM 3: Segment Anything with Concepts" from the MLSlops channel.
Video information: published April 20, 2026, 16:55:32. Duration: 00:06:03.