Systems @Scale 2019 - Disaster Recovery at Facebook Scale

Shruti Padmanabha, Research Scientist, Facebook
Justin Meza, Research Scientist, Facebook
https://code.fb.com/core-data/systems-scale/
Facebook operates dozens of data centers globally, each of which serves thousands of interdependent microservices to provide seamless experiences to billions of users across the family of Facebook products. At this scale, seemingly rare events, from a hurricane looming over a data center to lightning striking a switchboard, have threatened the site’s health. These events cause large-scale machine failures that span an entire data center or a significant portion of one, and they cannot be addressed by traditional fault-tolerance mechanisms designed for individual machine failures. Handling them requires solutions across the stack: placing hardware and spare capacity across fault domains, shifting traffic smoothly away from affected fault domains, and rearchitecting large-scale distributed systems to be fault domain-aware. In this talk, Shruti and Justin describe the principles Facebook follows for designing reliable software, the tools we built to mitigate and respond to failures, and our continuous testing and validation process.

Video "Systems @Scale 2019 - Disaster Recovery at Facebook Scale" from the channel Justin Miller
Video information
August 8, 2019, 19:01:21
00:26:16