SmolVLA: A vision-language-action model for affordable and efficient robotics
SmolVLA is introduced as a small, efficient, and community-driven vision-language-action (VLA) model designed for affordable robotics. It addresses a key limitation of existing VLAs, which are typically massive and incur high training and deployment costs. SmolVLA reduces these costs substantially: it can be trained on a single GPU and deployed on consumer-grade GPUs or even CPUs.

The model consists of a compact pretrained vision-language model (VLM) and an action expert. The VLM processes inputs such as language instructions, images, and robot state, producing features that condition the action expert. The action expert, trained with flow matching, predicts chunks of low-level actions. Key architectural choices behind its efficiency include skipping layers in the VLM, using a minimal number of visual tokens, building on smaller pretrained VLMs, and interleaving cross-attention and self-attention layers in the action expert.

**SmolVLA is pretrained entirely on publicly available, community-contributed datasets, using substantially less data than previous state-of-the-art models**. It also features an **asynchronous inference stack that decouples action execution from perception and action prediction**, enabling faster and more responsive control by triggering new chunk predictions while the robot is still executing previously available actions. **Despite its compact size, SmolVLA achieves performance comparable to or surpassing much larger VLA models** on both simulated and real-world robotic benchmarks. The project is presented as an open-source initiative, providing code, pretrained models, and training data to foster accessibility and accelerate progress in the robotics community.
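The flow-matching objective used to train the action expert can be sketched as follows. This is a minimal illustration, not the paper's implementation: the model is trained to predict the velocity field that transports Gaussian noise to action chunks along straight interpolation paths; `model`, the batch shapes, and the zero-velocity stand-in model are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(model, actions, rng):
    """Evaluate a flow-matching objective on a batch of action chunks.

    A point x_t is sampled on the straight path from noise x_0 to data x_1,
    and model(x_t, t) is regressed onto the path's constant velocity x_1 - x_0.
    """
    noise = rng.standard_normal(actions.shape)   # x_0 ~ N(0, I)
    t = rng.uniform(size=(actions.shape[0], 1))  # per-sample time in [0, 1]
    x_t = (1.0 - t) * noise + t * actions        # linear interpolation path
    target_velocity = actions - noise            # d x_t / d t along the path
    pred = model(x_t, t)
    return float(np.mean((pred - target_velocity) ** 2))

# Trivial stand-in "model" that always predicts zero velocity (for illustration).
zero_model = lambda x_t, t: np.zeros_like(x_t)
batch = rng.standard_normal((8, 7))              # 8 chunks of 7-dim actions
loss = flow_matching_loss(zero_model, batch, rng)
```

In practice the model would be the transformer action expert conditioned on VLM features, and the loss would be minimized by gradient descent; the sketch only shows the shape of the objective.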
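The asynchronous inference idea, execution decoupled from prediction, can be sketched with plain Python threads. This is a toy model, not the paper's stack: `predict_chunk` is a hypothetical stand-in for the VLA policy, and real deployments would overlap prediction with robot I/O rather than with a list append.

```python
import threading
import queue

def predict_chunk(observation, chunk_size=4):
    # Hypothetical stand-in for the VLA policy: maps an observation
    # to a chunk of low-level actions.
    return [f"action_{observation}_{i}" for i in range(chunk_size)]

def async_control_loop(num_chunks=3, chunk_size=4):
    """Execute action chunks while the next chunk is predicted concurrently.

    Prediction of chunk t+1 is triggered as soon as execution of chunk t
    begins, so the controller never stalls waiting on the policy.
    """
    executed = []
    next_chunk = queue.Queue(maxsize=1)

    def predictor(obs):
        next_chunk.put(predict_chunk(obs, chunk_size))

    chunk = predict_chunk(0, chunk_size)         # bootstrap synchronously
    for t in range(1, num_chunks + 1):
        worker = threading.Thread(target=predictor, args=(t,))
        worker.start()                           # predict next chunk early
        for action in chunk:                     # execute current chunk
            executed.append(action)
        worker.join()
        chunk = next_chunk.get()                 # swap in the fresh chunk
    return executed

actions = async_control_loop()
```

The single-slot queue mirrors the key design choice: only one prediction is in flight at a time, and fresh chunks replace stale ones instead of accumulating.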
https://arxiv.org/pdf/2506.01844
Video: "SmolVLA: A vision-language-action model for affordable and efficient robotics," from the AI Papers Podcast Daily channel.
Video information: published June 4, 2025; duration 00:21:42.