Context R-CNN: Long Term Temporal Context for Per-Camera Object Detection (Paper Explained)
Object detection often does not occur in a vacuum. Static cameras, such as wildlife camera traps, collect large amounts of irregularly sampled data over long time frames and often capture repeating or similar events. This model learns to dynamically incorporate other frames taken by the same camera into its object detection pipeline.
OUTLINE:
0:00 - Intro & Overview
1:10 - Problem Formulation
2:10 - Static Camera Data
6:45 - Architecture Overview
10:00 - Short-Term Memory
15:40 - Long-Term Memory
20:10 - Quantitative Results
22:30 - Qualitative Results
30:10 - False Positives
32:50 - Appendix & Conclusion
Paper: https://arxiv.org/abs/1912.03538
My Video On Attention Is All You Need: https://youtu.be/iDulhoQ2pro
Abstract:
In static monitoring cameras, useful contextual information can stretch far beyond the few seconds typical video understanding models might see: subjects may exhibit similar behavior over multiple days, and background objects remain static. Due to power and storage constraints, sampling frequencies are low, often no faster than one frame per second, and sometimes are irregular due to the use of a motion trigger. In order to perform well in this setting, models must be robust to irregular sampling rates. In this paper we propose a method that leverages temporal context from the unlabeled frames of a novel camera to improve performance at that camera. Specifically, we propose an attention-based approach that allows our model, Context R-CNN, to index into a long term memory bank constructed on a per-camera basis and aggregate contextual features from other frames to boost object detection performance on the current frame.
We apply Context R-CNN to two settings: (1) species detection using camera traps, and (2) vehicle detection in traffic cameras, showing in both settings that Context R-CNN leads to performance gains over strong baselines. Moreover, we show that increasing the contextual time horizon leads to improved results. When applied to camera trap data from the Snapshot Serengeti dataset, Context R-CNN with context from up to a month of images outperforms a single-frame baseline by 17.9% mAP, and outperforms S3D (a 3d convolution based baseline) by 11.2% mAP.
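The core mechanism the abstract describes, attending over a per-camera memory bank to aggregate contextual features, can be sketched roughly as follows. This is a minimal illustrative simplification, not the paper's exact architecture: the function name, the additive fusion at the end, and the absence of learned query/key/value projections are all assumptions for clarity.

```python
import numpy as np

def attention_aggregate(query_feats, memory_bank):
    """Sketch of attention over a long-term memory bank.

    query_feats: (n, d) box features from the current frame
    memory_bank: (m, d) features stored from other frames of the same camera
    Returns (n, d) features enriched with attended context.
    """
    d = query_feats.shape[1]
    # Scaled dot-product attention scores of each box against the memory bank.
    scores = query_feats @ memory_bank.T / np.sqrt(d)      # (n, m)
    # Softmax over the memory dimension (numerically stabilized).
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    # Weighted sum of memory features gives the per-box context vector.
    context = weights @ memory_bank                        # (n, d)
    # Fuse context back into the current-frame features (additive here).
    return query_feats + context
```

In the paper, the memory bank is built from unlabeled frames of the same camera spanning up to a month, so the `m` context features can come from days or weeks apart rather than adjacent video frames.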
Authors: Sara Beery, Guanhang Wu, Vivek Rathod, Ronny Votel, Jonathan Huang
Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
Video "Context R-CNN: Long Term Temporal Context for Per-Camera Object Detection (Paper Explained)" from the Yannic Kilcher channel