Demystifying RDMA Protocols for GPU Data Centers | NVLink, ConnectX, EFA, InfiniBand, GPUDirect

RDMA (Remote Direct Memory Access) is the secret sauce behind fast GPU clusters, which make training billion-parameter LLMs feasible.

But once you go beyond a single vendor stack, the protocols, drivers, and libraries start to feel like a treasure hunt.

In this video, we explore how RDMA protocols really work for GPU-accelerated deep learning, and what it takes to design a generic RDMA library that can run across InfiniBand, RoCEv2, cloud fabrics like AWS EFA, and different NIC / GPU generations.

We’ll break down:
- NVLink vs RDMA (collective or peer-to-peer)
- The pain of p2p RDMA: hidden assumptions baked into common libraries (NCCL, ConnectX, DeepEP)
- Why building a “portable” RDMA abstraction is hard: memory registration, congestion control, reliability, ordering, and NIC quirks across vendors and clouds

The lessons are inspired by engineering write-ups from Perplexity and others on scaling LLMs across thousands of GPUs with custom RDMA kernels and point-to-point data transfer.

🔍 Who is this for?

- ML / DL engineers working on distributed training (NCCL, ConnectX, DeepSpeed, KV cache transfer, custom MoE stacks)
- Infra / platform teams running GPU clusters, AI data centers, or cloud-hosted training environments
- Anyone trying to squeeze more performance out of multi-node GPU training jobs
- Anyone who wants to demystify what RDMA libraries are doing under the hood

Video from the OffNote Labs channel.