OpenURMA: A Clean-Room Open Implementation of the Unified Bus Protocol
Bojie Li
Key claim
UB reduces RDMA latency by 4.37x compared to RoCEv2.
This paper presents OpenURMA, a clean-room open implementation of Huawei's Unified Bus (UB) protocol, and reports lower latency and higher throughput for RDMA operations than RoCEv2. The key result is that UB's load/store path achieves an end-to-end latency of ~500 ns, 4.37 times lower than the matched baseline. If the result carries over to other deployments, it would cut the cost of remote memory access in datacenter environments.
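The two figures in the key claim imply a baseline latency that the summary does not state. The short calculation below works it out, assuming that "4.37 times faster" means the baseline latency divided by the UB latency equals 4.37, and treating "~500 ns" as exactly 500 ns. The variable names are illustrative and the result is a back-of-the-envelope estimate, not a number taken from the paper.

```python
# Figures quoted in the key claim.
ub_latency_ns = 500.0   # "~500 ns" end-to-end on the load/store path (approximate)
speedup = 4.37          # reported ratio versus the matched baseline

# Assumption: speedup = baseline_latency / ub_latency.
implied_baseline_ns = ub_latency_ns * speedup
saved_ns = implied_baseline_ns - ub_latency_ns
fraction_saved = 1.0 - 1.0 / speedup

print(f"implied baseline latency: ~{implied_baseline_ns:.0f} ns")
print(f"latency removed:          ~{saved_ns:.0f} ns ({fraction_saved:.0%})")
```

Under these assumptions the matched baseline sits at roughly 2.2 µs, so the reported speedup corresponds to removing about three quarters of the end-to-end latency.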
In plain English
The authors developed OpenURMA, an open-source implementation of Huawei's Unified Bus (UB) protocol, which speeds up Remote Direct Memory Access (RDMA) operations in data centers. Earlier approaches required every connection to carry a large amount of state. UB instead separates application-specific state from transport state, which leads to much lower latency. The reported end-to-end latency is about 500 nanoseconds, over four times lower than that of the existing RoCEv2 protocol. In practice, this means a data center can complete more remote memory operations in the same amount of time. Builders should care because adopting this technology could make applications that depend on remote memory access faster and more responsive.
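To make the state-separation idea concrete, here is a toy counting model. It is not taken from the paper or from the OpenURMA code: the function names are hypothetical, and it assumes that a connection-oriented design keeps one state record per (local endpoint, remote peer) pair, while a decoupled design keeps one application-side record per local endpoint plus one transport-side record per remote peer. Real state sizes and sharing rules will differ.

```python
def coupled_state_entries(local_endpoints: int, remote_peers: int) -> int:
    """Connection-oriented model: one state record per (endpoint, peer) pair."""
    return local_endpoints * remote_peers


def decoupled_state_entries(local_endpoints: int, remote_peers: int) -> int:
    """Decoupled model: one application-side record per local endpoint
    plus one transport-side record per remote peer."""
    return local_endpoints + remote_peers


if __name__ == "__main__":
    endpoints = 64  # hypothetical number of application endpoints on one host
    for peers in (10, 100, 1000):
        coupled = coupled_state_entries(endpoints, peers)
        decoupled = decoupled_state_entries(endpoints, peers)
        print(f"peers={peers:5d}  coupled={coupled:6d}  decoupled={decoupled:5d}")
```

In this model the coupled count grows with the product of endpoints and peers, while the decoupled count grows with their sum, which is the intuition behind the claim that per-connection state is the part worth removing.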
The paper introduces a significant new abstraction for RDMA that decouples state management, which could change how datacenter networking is approached.
The implementation is validated against a solid baseline and provides detailed performance metrics, supporting the claims made.
Deep reliability assessment
The methodology supports the claim that the OpenURMA implementation achieves lower latency and higher throughput compared to RoCEv2, but the results may be overclaimed if the specific hardware and configurations used are not widely replicable in other environments.
Reproducibility
Yes. The implementation is available at https://github.com/bojieli/OpenURMA
Discussion questions
1. What assumptions about the scalability of the Unified Bus protocol might not hold in real-world applications?
2. How can the findings of this paper influence the design of future RDMA protocols for AI workloads?
3. What experimental conditions would need to change to potentially invalidate the latency and throughput claims made in this paper?
Key figure
Figure 1 illustrates the architectural comparison between RoCEv2 and the Unified Bus protocol, highlighting the differences in state management and data path efficiency.
