Accelerate Transformer inference with AWS Inferentia

In this video, I show you how to accelerate Transformer inference with AWS Inferentia, a custom chip designed by AWS.

Starting from a BERT model that I fine-tuned on AWS Trainium (https://youtu.be/HweP7OYNiIA), I compile it with the Neuron SDK for Inferentia. Then, using an inf1.6xlarge instance (4 Inferentia chips, 16 NeuronCores), I show you how to use pipeline mode to predict at scale, reaching over 4,000 predictions per second at 3-millisecond latency.
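For readers who want a feel for the compilation step before watching, here is a minimal sketch of what compiling a PyTorch model in pipeline mode can look like. It assumes the torch-neuron package from the Neuron SDK for Inf1, whose `torch.neuron.trace` accepts a `compiler_args` list, and it passes the `--neuroncore-pipeline-cores` compiler option. The helper names `pipeline_compiler_args` and `compile_for_inferentia` are hypothetical and only for illustration; the code in the linked repository may well differ.

```python
from typing import List


def pipeline_compiler_args(num_cores: int) -> List[str]:
    """Build the compiler arguments that request pipeline mode.

    num_cores is the number of NeuronCores the model is split across,
    e.g. 16 for a full inf1.6xlarge (an assumption for this sketch).
    """
    if num_cores < 1:
        raise ValueError("num_cores must be at least 1")
    return ["--neuroncore-pipeline-cores", str(num_cores)]


def compile_for_inferentia(model, example_inputs, num_cores: int = 16):
    """Trace and compile a PyTorch model for Inferentia (hypothetical helper).

    Imports are done lazily so this file can be read and loaded on a
    machine that does not have the Neuron SDK installed.
    """
    import torch
    import torch_neuron  # noqa: F401  (importing it registers torch.neuron)

    return torch.neuron.trace(
        model,
        example_inputs,
        compiler_args=pipeline_compiler_args(num_cores),
    )
```

On an Inf1 instance, you would typically load the fine-tuned BERT model in a trace-friendly form (for Hugging Face models, that usually means tuple outputs rather than dictionaries), tokenize one sample padded to a fixed length, pass the resulting tensors as `example_inputs`, and save the returned module with `.save()` so it can be reloaded with `torch.jit.load` for prediction.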

⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos ⭐️⭐️⭐️
⭐️⭐️⭐️ Want to buy me a coffee? I can always use more :) https://www.buymeacoffee.com/julsimon ⭐️⭐️⭐️

- Amazon EC2 Inf1: https://aws.amazon.com/ec2/instance-types/inf1/
- AWS Neuron SDK documentation: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/index.html
- AWS blog post: https://aws.amazon.com/fr/blogs/machine-learning/achieve-12x-higher-throughput-and-lowest-latency-for-pytorch-natural-language-processing-applications-out-of-the-box-on-aws-inferentia/
- Setup steps and code: https://gitlab.com/juliensimon/huggingface-demos/-/tree/main/inferentia

Interested in hardware acceleration for Transformers? Check out my other videos:
- Training on Habana Gaudi: https://youtu.be/56fpEa1Y1F8
- Training on Graphcore: https://youtu.be/DgcJscPu1Vo
- Predicting with ONNX: https://youtu.be/_AKFDOnrZz8
- Predicting with Intel OpenVINO: https://youtu.be/mfj1QrZWkk8
- Inferentia compilation on SageMaker: https://youtu.be/pokM1r3rgIg

Video: Accelerate Transformer inference with AWS Inferentia, from the Julien Simon channel