Sapiens2: 4K High-Fidelity Human Vision Models

In this AI Research Roundup episode, Alex discusses the paper "Sapiens2". Sapiens2 is a new family of high-resolution vision transformers designed for precise human-centric tasks such as pose estimation and body-part segmentation. The models range from 0.4 to 5 billion parameters and support native resolutions up to 4K using windowed attention. By combining masked image reconstruction with self-distilled contrastive objectives, the researchers improved both low-level detail capture and high-level semantics. The models were pretrained on a curated dataset of 1 billion high-quality human images and set new state-of-the-art results. Beyond existing tasks, Sapiens2 also introduces pointmap and albedo estimation, with significantly lower error rates.

Paper URL: https://arxiv.org/pdf/2604.21681

#AI #MachineLearning #DeepLearning #ComputerVision #VisionTransformer #HumanCentric #PoseEstimation #ImageSegmentation

Resources:
- GitHub: https://github.com/facebookresearch/sapiens2
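
The description credits windowed attention for making native 4K inputs practical. The rough sketch below shows why restricting attention to local windows matters at that scale, by counting query-key pairs per attention layer. The 4096x4096 input, the patch size of 16, and the 16x16-token non-overlapping windows are illustrative assumptions, not values taken from the paper or the repository, and the paper's actual windowing scheme may differ.

```python
def attention_pairs(height, width, patch=16, window=None):
    """Count query-key pairs for one attention layer (single head).

    All parameters are hypothetical illustration values, not Sapiens2 settings.
    window=None means global attention over every token; otherwise tokens
    attend only inside non-overlapping blocks of window x window tokens.
    """
    rows, cols = height // patch, width // patch
    tokens = rows * cols
    if window is None:
        return tokens * tokens
    # Ceil-divide so partial windows at the border count as full (padded) windows.
    n_windows = ((rows + window - 1) // window) * ((cols + window - 1) // window)
    tokens_per_window = window * window
    return n_windows * tokens_per_window * tokens_per_window


global_pairs = attention_pairs(4096, 4096)
windowed_pairs = attention_pairs(4096, 4096, window=16)
print(f"global:   {global_pairs:,} pairs")
print(f"windowed: {windowed_pairs:,} pairs")
print(f"reduction factor: {global_pairs // windowed_pairs}")
```

Under these assumptions the saving equals the total token count divided by the tokens per window, so it grows with image resolution while the per-window cost stays fixed.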

Video "Sapiens2: 4K High-Fidelity Human Vision Models" from the AI Research Roundup channel

AI ComputerVision DeepLearning FoundationModels HighResolution HumanCentric ImageSegmentation MachineLearning NeuralNetworks Podcast PoseEstimation Research Sapiens2 Transformers VisionTransformer

Video information

14 h 52 min ago

00:04:53

AI Research Roundup
