
Optimizing inference on CPU in Apache MXNet 2.0

Bartłomiej Gawrych, Deep Learning Software Engineer @ Intel

Deep learning inference is the process of deploying a trained neural network to make predictions on unseen data. It is a common workload on cloud servers, and to provide a good user experience it has to run with high performance, so it is important to use optimized solutions. Optimization also means reduced hardware load and energy cost.
There are at least two types of performance bottlenecks to consider when optimizing a neural network model: heavy compute-bound operations such as convolution or fully-connected layers, and memory-bound operations that perform little computation but still operate on large amounts of data, for example activation functions (e.g. ReLU), elementwise_add, Transpose or Concat. Several methods have been developed to address them. Operator fusion chains operations together and speeds up memory-bound operations by reducing memory I/O. We will discuss the CPU operator fusions available in MXNet 2.0 and present the performance benefits of using them. Another method is quantization, which speeds up compute-bound operations by lowering data precision, simplifying the computation while also reducing the amount of data being processed. We will show an example process of quantizing a model in MXNet 2.0 and evaluate its performance and accuracy.
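As an illustration, the minimal sketch below shows how a Gluon model could ask a backend to fuse supported operator chains. The optimize_for() call on a hybridizable Gluon block and the 'ONEDNN' backend name are assumptions based on the Gluon subgraph API and may differ between MXNet builds (older builds expose the backend as 'MKLDNN').

```python
import mxnet as mx
from mxnet.gluon import nn

# A small Conv -> BatchNorm -> ReLU chain: a typical candidate for operator
# fusion, which removes intermediate memory reads and writes between ops.
net = nn.HybridSequential()
net.add(nn.Conv2D(channels=64, kernel_size=3, padding=1),
        nn.BatchNorm(),
        nn.Activation('relu'))
net.initialize()

x = mx.np.random.uniform(size=(1, 3, 224, 224))

# Ask the oneDNN backend to rewrite the graph and fuse supported chains.
# The backend name ('ONEDNN') is an assumption; older builds use 'MKLDNN'.
net.optimize_for(x, backend='ONEDNN')

out = net(x)
```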
Apache MXNet (incubating) 2.0 introduces changes in the interface. The Gluon API has become the default, superseding the symbolic and model APIs; it unifies the flexibility of imperative programming with the performance benefits of symbolic programming. MXNet 2.0 also supports a NumPy-like interface to give data scientists an easier ramp-up when developing deep learning models. Quantizing a model is a complex task and we wanted to make it as user-friendly as possible, so we adapted the quantization API to be both simple and flexible, usable by experts and beginners alike. We will present how to utilize this flexibility, e.g. by using the calibration API to increase the accuracy of models.
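As a sketch of how post-training quantization with calibration might look from the user's side, the snippet below uses quantize_net from the contrib quantization module. The function name, its parameters (quantized_dtype, calib_data, calib_mode) and the available calibration modes are assumptions about that API and may differ between releases; the model and the random calibration data are placeholders.

```python
import mxnet as mx
from mxnet.gluon import nn
from mxnet.contrib import quantization

# Placeholder float32 network standing in for a real pretrained model.
net = nn.HybridSequential()
net.add(nn.Conv2D(channels=64, kernel_size=3, padding=1),
        nn.Activation('relu'),
        nn.GlobalAvgPool2D(),
        nn.Dense(10))
net.initialize()
net.hybridize()

# A handful of representative batches is enough for calibration: it only
# collects activation ranges used to pick quantization thresholds.
calib_data = mx.gluon.data.DataLoader(
    mx.gluon.data.ArrayDataset(mx.np.random.uniform(size=(128, 3, 32, 32))),
    batch_size=32)

# Hypothetical call: quantize the network to INT8 and calibrate it.
# 'naive' would use plain min/max ranges, 'entropy' KL-divergence thresholds.
qnet = quantization.quantize_net(net,
                                 quantized_dtype='int8',
                                 calib_data=calib_data,
                                 calib_mode='entropy')

out = qnet(mx.np.random.uniform(size=(1, 3, 32, 32)))
```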
All of these features are enabled by the Intel® oneAPI Deep Neural Network Library (oneDNN) CPU backend, optimized for Intel Architecture processors. oneDNN detects the instruction set architecture (ISA) at runtime and uses just-in-time (JIT) code generation to deploy code optimized for the latest ISA supported on the particular platform.
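To see which kernels oneDNN actually dispatches on a given machine, its verbose mode can be enabled; the snippet below is a small sketch. DNNL_VERBOSE is a real oneDNN environment variable, while the MXNet feature-flag name checked here ('MKLDNN' vs. 'ONEDNN') varies between builds and is an assumption.

```python
import os
# Enable oneDNN verbose logging before MXNet creates any primitives; each
# executed primitive is then reported together with the JIT ISA it used.
os.environ['DNNL_VERBOSE'] = '1'

import mxnet as mx
from mxnet.gluon import nn

# Check whether this MXNet build was compiled with the oneDNN backend.
# The feature name differs between builds ('MKLDNN' in older, 'ONEDNN' in newer).
features = mx.runtime.Features()
print('oneDNN enabled:', 'ONEDNN' in features or 'MKLDNN' in features)

# Run a single convolution; with verbose mode on, oneDNN prints the primitive
# it executed and the ISA (e.g. avx2, avx512_core) the kernel was JIT-compiled for.
conv = nn.Conv2D(channels=16, kernel_size=3)
conv.initialize()
conv(mx.np.random.uniform(size=(1, 3, 64, 64)))
```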
There are still plenty of optimization opportunities in different areas. In our presentation, we will also mention plans related to the discussed topics.

Video: Optimizing inference on CPU in Apache MXNet 2.0, from the Apache MXNet channel.
Published: February 8, 2021, 22:00:15 · Duration: 00:12:11