#308 SigLIP and SigLIP2

SigLIP stands for Sigmoid loss for Language-Image Pre-training. Unlike standard contrastive learning with softmax normalization, the sigmoid loss operates solely on image-text pairs and does not require a global view of the pairwise similarities for normalization. The sigmoid loss allows the batch size to be scaled up further, while also performing better at smaller batch sizes. Combined with Locked-image Tuning, a SigLiT model achieves 84.5% ImageNet zero-shot accuracy after two days of training. Decoupling the batch size from the loss also makes it possible to study the impact of examples vs. pairs and of the negative-to-positive ratio. Pushing the batch size to the extreme, up to one million, shows that the benefits of growing the batch size quickly diminish, with a more reasonable batch size of 32k being sufficient.
SigLIP 2 is a family of new multilingual vision-language encoders that build on the success of the original SigLIP. The original image-text training objective is extended with several prior, independently developed techniques into a unified recipe. This includes captioning-based pretraining, self-supervised losses (self-distillation, masked prediction) and online data curation. With these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities, including zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for VLMs. Furthermore, the new training recipe leads to significant improvements on localization and dense prediction tasks. The authors also train variants which support multiple resolutions and preserve the input’s native aspect ratio. Model checkpoints are available in four sizes: ViT-B (86M), L (303M), So400m (400M), and g (1B).
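
As a rough illustration of the pairwise sigmoid loss described above, here is a minimal pure-Python sketch. It is not the authors' implementation: the function names are hypothetical, and the temperature of 10 and bias of -10 are fixed illustrative values chosen for this example (in the paper, temperature and bias are learnable parameters). Every image-text pair in the batch is scored on its own as a binary match / no-match decision, so no softmax over the whole batch is needed.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def l2_normalize(v):
    # Scale a vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def sigmoid_pairwise_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    # Hypothetical helper: temperature and bias are assumed constants here.
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            similarity = sum(a * b for a, b in zip(imgs[i], txts[j]))
            logit = temperature * similarity + bias
            # Matching pairs (i == j) get label +1, all other pairs get -1.
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * logit)
    # Sum over all pairs, averaged over the number of examples in the batch.
    return -total / n


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
    swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
    print(sigmoid_pairwise_loss(images, aligned_texts))
    print(sigmoid_pairwise_loss(images, swapped_texts))
```

In this toy batch, the correctly aligned texts give a much lower loss than the swapped ones. Because each pair contributes independently, the loss can be computed in chunks, which is what makes very large batch sizes practical.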

In this video, I talk about the following: How do the SigLIP and SigLiT models differ from CLIP and LiT? How is the SigLIP2 model trained? How does the SigLIP2 model perform?

For more details, please look at https://arxiv.org/pdf/2303.15343 and https://arxiv.org/pdf/2502.14786

Tschannen, Michael, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans et al. "SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features." arXiv preprint arXiv:2502.14786 (2025).

Thanks for watching!
LinkedIn: http://aka.ms/manishgupta
HomePage: https://sites.google.com/view/manishg/

Video #308 SigLIP and SigLIP2 from the Data Science Gems channel