Systems @Scale 2019 - Disaster Recovery at Facebook Scale
Shruti Padmanabha, Research Scientist, Facebook
Justin Meza, Research Scientist, Facebook
https://code.fb.com/core-data/systems-scale/
Facebook operates dozens of data centers globally, each of which serves thousands of interdependent microservices to provide seamless experiences to billions of users across the family of Facebook products. At this scale, seemingly rare events, from hurricanes looming over a data center to lightning striking a switchboard, have threatened the site’s health. These events cause large-scale machine failures spanning an entire data center or a significant portion of one, which traditional fault-tolerance mechanisms designed for individual machine failures cannot address. Handling these failures requires solutions across the stack: placing hardware and spare capacity across fault domains, shifting traffic smoothly away from affected fault domains, and rearchitecting large-scale distributed systems to be fault domain-aware. In this talk, Shruti and Justin describe the principles Facebook follows for designing reliable software, the tools we built to mitigate and respond to failures, and our continuous testing and validation process.