Data centers and the optimal network architecture are ever-evolving topics that have long fascinated the industry. On one hand, mission-critical applications and data are supported by “Cloud Computing”; on the other, massive investments continue to flow into data center infrastructure.
In the end, it is the technical challenges that grab our attention, and for a few years the networking field was simple and clear: standards for typical designs based on IP and Ethernet had been established. Then virtualization appeared, soon followed by the convergence of storage and data networks in the data center. Network dynamics changed rapidly: traffic flows and profiles became less predictable as new services were integrated into the network. The situation only grew more complex with the virtualization of servers and storage, now commonly dispersed across any number of systems.
These changes demanded a new network design that allows traffic to flow with limited congestion, something we IT evangelists call “Fabric Design.” Virtualization brought the Layer 2 issue back into the data center through features such as vMotion, High Availability, and the Distributed Resource Scheduler (in vSphere terminology). Fibre Channel over Ethernet (FCoE) calls for flat network structures and at the same time demands lossless transport. Finally, latency issues emerge, which have a direct and massive impact on application performance. To keep latency at a minimum, a fabric design is the inevitable choice.
In general, everyone agrees that what is needed are large, active Layer 2 network structures with the fewest possible “hops,” the least latency, higher aggregate bandwidth, and low congestion. The question is: how do we achieve this? This is where opinions and approaches diverge greatly, and proprietary solutions dominate. If one could only go back and do it all again, as in some Hollywood film, we could have warned ourselves of what we know now: these proprietary solutions have actually only driven overall costs higher. This enormous market pressure shows that some of our lessons from the past are quickly being forgotten, reduced to hollow words.
Completely new “xFabrics” are being promoted that deny other manufacturers any reasonable integration with their network products. To make matters worse, the standards organizations IEEE and IETF have maintained competing policies from the start. Reasonable observers should ask themselves why the IETF attempts to standardize Layer 2 topology protocols, an area that has been the domain of the IEEE for the past decade. The IEEE has committed itself to supporting all existing and new IEEE standards (especially the IEEE Data Center Bridging (DCB) protocols, as well as existing management protocols such as Ethernet OAM) over “Shortest Path Bridging,” or SPB. For the foreseeable future, therefore, customers will have to choose proprietary solutions or ask themselves whether it makes sense to adopt new, unapproved, and potentially doomed standards. The reality is that the design and construction of heterogeneous networks will have to wait months, if not years, to become a reality.
Some manufacturers go even further and cling to completely proprietary “single hop” architectures to address these challenges, with customers giving themselves over entirely, adopting an all-new technology within their planning and business teams. The lure of extremely low latency through such solutions appears attractive on the surface, but on closer inspection it is clearly fool’s gold. Many of the claims are based on data from the so-called “cut-through mode” inside the switch components. This is not a practical solution, as overload situations can be caused by the slightest “microburst” or by a difference in speed between interfaces. And let us be realistic: where does latency occur today, if not in the server systems themselves? Does latency really play a role when the network adds only three to seven nanoseconds (ns) of delay? Perhaps it would matter if I were running an HPC cluster; a delay of 10 or 100 ns would be relevant there, but even that can be addressed with intelligent, standards-based network designs. A data center network design based on EoR or MoR (End of Row, Middle of Row) can massively reduce the hop count (often to two hops in a midsize data center) and can potentially give all connected systems non-blocking connections to the integrated high-speed backplane of the EoR/MoR switches.
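The arithmetic behind this argument can be sketched in a few lines. The figures below are illustrative assumptions (a server-stack delay of roughly 20 microseconds, per-hop switch delays in the ranges discussed above), not measurements from any vendor; the point is simply that even a two-hop, standards-based design contributes a vanishing fraction of end-to-end latency once the server side is counted.

```python
def network_latency_ns(hops, per_hop_ns):
    """Total switching delay across a path: delay per switch hop times hop count."""
    return hops * per_hop_ns

# Assumed figures, for illustration only:
SERVER_STACK_NS = 20_000   # ~20 us spent in the OS stack and application on the server side
CUT_THROUGH_NS = 5         # the 3-7 ns claimed for proprietary cut-through switching
STORE_AND_FWD_NS = 100     # ~100 ns for a conventional standards-based switch

for hops, per_hop in [(2, STORE_AND_FWD_NS), (5, CUT_THROUGH_NS)]:
    net = network_latency_ns(hops, per_hop)
    total = net + SERVER_STACK_NS
    share = 100 * net / total
    print(f"{hops} hops at {per_hop} ns each: network {net} ns, "
          f"{share:.2f}% of end-to-end latency")
```

Under these assumptions, even the slower two-hop EoR/MoR path accounts for around one percent of the total, which is why the ns-scale gains of cut-through fabrics rarely translate into application-level improvements outside of HPC.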
Clearly, data center network designs and the technology they are based on are undergoing rapid and sweeping change, but one should not throw the principles of standardization and scalability out the window in the race to solve the pressing issues we face today. We should not turn a blind eye; instead we should give these developments the weight and skepticism they need and deserve.