It seems the drops in reliability, speed, and scalability with Ethernet are somewhat manageable.
To quote the article: "From our tests, we found that Infiniband was systematically outperforming Ethernet interconnects in terms of speed. When using 16 nodes / 128 GPUs, the difference varied from 3% to 10% in terms of distributed training throughput[1]. The gap was widening as we were adding more nodes: Infiniband was scaling almost linearly, while other interconnects scaled less efficiently."
And they do mention that the research team had to debug unexplained failures on Ethernet that they never saw on Infiniband. That can actually be the expensive part, particularly if the failures are silent and only show up as numerical errors.
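(Not from the article, but to make the "silent numerical errors" point concrete: below is a minimal sketch of the kind of cross-rank consistency check you could bolt onto a PyTorch DDP training loop to surface that class of failure. The function name and the sum-based comparison are illustrative choices, not anything the article describes.)

```python
import torch
import torch.distributed as dist

def check_gradients_consistent(model: torch.nn.Module, rtol: float = 1e-5) -> None:
    """Cross-rank sanity check for silent gradient corruption.

    After DDP's gradient all-reduce, every rank should hold identical
    gradients; a flaky interconnect can violate that without raising
    any error, which is exactly the silent-failure case.
    """
    rank = dist.get_rank()
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        grad = param.grad.detach()

        # Catch outright NaN/Inf corruption first.
        if not torch.isfinite(grad).all():
            raise RuntimeError(f"non-finite gradient in {name} on rank {rank}")

        # Compare a cheap summary (the gradient sum) against rank 0's copy.
        local_sum = grad.sum()
        reference = local_sum.clone()
        dist.broadcast(reference, src=0)
        if not torch.allclose(local_sum, reference, rtol=rtol):
            raise RuntimeError(
                f"gradient mismatch in {name}: rank {rank} has {local_sum.item()}, "
                f"rank 0 has {reference.item()}"
            )
```

You'd typically call something like this right after backward() every few hundred steps while qualifying a new fabric, then turn it off once the hardware is trusted, since it adds one broadcast per parameter.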
A single switch should mitigate some of the throughput issues.
As for failures, this is why I have a full professional support contract with Dell and Advizex. If there are issues with the gear, they will step in to help out.
Especially for the switch, since it is a SPOF (single point of failure), we went with a 4-hour response window.