SmolVLA: A vision-language-action model for affordable and efficient robotics
SmolVLA is introduced as a small, efficient, and community-driven vision-language-action (VLA) model designed for affordable robotics. It addresses a key limitation of existing VLAs, which are typically massive and incur high training and deployment costs. SmolVLA reduces these costs substantially: it can be trained on a single GPU and deployed on consumer-grade GPUs or even CPUs.

The model consists of a compact pretrained vision-language model (VLM) and an action expert. The VLM processes inputs such as language instructions, images, and robot state, producing features that condition the action expert. The action expert, trained with flow matching, predicts chunks of low-level actions. Key architectural choices behind its efficiency include skipping layers in the VLM, using a minimal number of visual tokens, building on smaller pretrained VLMs, and interleaving cross-attention and self-attention layers in the action expert.

**SmolVLA is pretrained entirely on publicly available, community-contributed datasets, using substantially less data than previous state-of-the-art models**. It also features an **asynchronous inference stack that decouples action execution from perception and action prediction**, enabling faster and more responsive control by triggering new chunk predictions while the robot is still executing previously available actions. **Despite its compact size, SmolVLA achieves performance comparable to or surpassing much larger VLA models** on both simulated and real-world robotic benchmarks. The project is presented as an open-source initiative, providing code, pretrained models, and training data to foster accessibility and accelerate progress in the robotics community.
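The flow-matching objective used to train the action expert can be sketched as follows. This is a minimal illustration, not the paper's implementation: the model is trained to predict the velocity field that transports Gaussian noise to action chunks along straight interpolation paths; `model`, the batch shapes, and the zero-velocity stand-in model are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(model, actions, rng):
    """Evaluate a flow-matching objective on a batch of action chunks.

    A point x_t is sampled on the straight path from noise x_0 to data x_1,
    and model(x_t, t) is regressed onto the path's constant velocity x_1 - x_0.
    """
    noise = rng.standard_normal(actions.shape)   # x_0 ~ N(0, I)
    t = rng.uniform(size=(actions.shape[0], 1))  # per-sample time in [0, 1]
    x_t = (1.0 - t) * noise + t * actions        # linear interpolation path
    target_velocity = actions - noise            # d x_t / d t along the path
    pred = model(x_t, t)
    return float(np.mean((pred - target_velocity) ** 2))

# Trivial stand-in "model" that always predicts zero velocity (for illustration).
zero_model = lambda x_t, t: np.zeros_like(x_t)
batch = rng.standard_normal((8, 7))              # 8 chunks of 7-dim actions
loss = flow_matching_loss(zero_model, batch, rng)
```

In practice the model would be the transformer action expert conditioned on VLM features, and the loss would be minimized by gradient descent; the sketch only shows the shape of the objective.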
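The asynchronous inference idea, execution decoupled from prediction, can be sketched with plain Python threads. This is a toy model, not the paper's stack: `predict_chunk` is a hypothetical stand-in for the VLA policy, and real deployments would overlap prediction with robot I/O rather than with a list append.

```python
import threading
import queue

def predict_chunk(observation, chunk_size=4):
    # Hypothetical stand-in for the VLA policy: maps an observation
    # to a chunk of low-level actions.
    return [f"action_{observation}_{i}" for i in range(chunk_size)]

def async_control_loop(num_chunks=3, chunk_size=4):
    """Execute action chunks while the next chunk is predicted concurrently.

    Prediction of chunk t+1 is triggered as soon as execution of chunk t
    begins, so the controller never stalls waiting on the policy.
    """
    executed = []
    next_chunk = queue.Queue(maxsize=1)

    def predictor(obs):
        next_chunk.put(predict_chunk(obs, chunk_size))

    chunk = predict_chunk(0, chunk_size)         # bootstrap synchronously
    for t in range(1, num_chunks + 1):
        worker = threading.Thread(target=predictor, args=(t,))
        worker.start()                           # predict next chunk early
        for action in chunk:                     # execute current chunk
            executed.append(action)
        worker.join()
        chunk = next_chunk.get()                 # swap in the fresh chunk
    return executed

actions = async_control_loop()
```

The single-slot queue mirrors the key design choice: only one prediction is in flight at a time, and fresh chunks replace stale ones instead of accumulating.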
https://arxiv.org/pdf/2506.01844
Video: "SmolVLA: A vision-language-action model for affordable and efficient robotics," from the AI Papers Podcast Daily channel.
Video information: published June 4, 2025; duration 00:21:42.