Full-Resolution Residual Networks for Semantic Segmentation in Street Scenes
Tobias Pohlen, Alexander Hermans, Markus Mathias, Bastian Leibe
Semantic image segmentation is an essential component of modern autonomous driving systems, as an accurate understanding of the surrounding scene is crucial to navigation and action planning. Current state-of-the-art approaches in semantic image segmentation rely on pre-trained networks that were initially developed for classifying images as a whole. While these networks exhibit outstanding recognition performance (i.e., what is visible?), they lack localization accuracy (i.e., where precisely is something located?). Therefore, additional processing steps have to be performed in order to obtain pixel-accurate segmentation masks at the full image resolution. To alleviate this problem we propose a novel ResNet-like architecture that exhibits strong localization and recognition performance. We combine multi-scale context with pixel-level accuracy by using two processing streams within our network: One stream carries information at the full image resolution, enabling precise adherence to segment boundaries. The other stream undergoes a sequence of pooling operations to obtain robust features for recognition. The two streams are coupled at the full image resolution using residuals. Without additional processing steps and without pre-training, our approach achieves an intersection-over-union score of 71.8% on the Cityscapes dataset.
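The two-stream coupling described in the abstract can be sketched in a few lines. This is a minimal, illustrative NumPy sketch, not the paper's implementation: the real Full-Resolution Residual Units use learned convolutions in place of the identity "processing" here, and the function and variable names (`frru_like_step`, `residual`, `pooled`) are ours, not the authors'.

```python
import numpy as np

def avg_pool(x, k):
    """Average-pool a (H, W) map by factor k (H and W divisible by k)."""
    h, w = x.shape
    return x.reshape(h // k, k, w // k, k).mean(axis=(1, 3))

def upsample(x, k):
    """Nearest-neighbour upsample a (H, W) map by factor k."""
    return x.repeat(k, axis=0).repeat(k, axis=1)

def frru_like_step(residual, pooled, k):
    """
    One coupling step in the spirit of the paper's two-stream design:
    the pooled (recognition) stream receives a downsampled copy of the
    full-resolution residual stream, and sends a full-resolution residual
    back. The real FRRU applies learned convolutions between these steps;
    the identity stands in for them here.
    """
    pooled = pooled + avg_pool(residual, k)    # residual stream -> pooled stream
    residual = residual + upsample(pooled, k)  # pooled stream -> residual stream
    return residual, pooled
```

Stacking such units at successively coarser pooling factors gives the pooled stream a growing receptive field for recognition, while the residual stream never loses full-resolution boundary information.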
Video: "Full-Resolution Residual Networks for Semantic Segmentation in Street Scenes", from the ComputerVisionFoundation Videos channel.
Video information: published 26 July 2017; duration 00:11:28.