MLAG and Network Redundancy for a Zero-Downtime Fabric

How MLAG and redundant uplinks eliminate single points of failure in the data center fabric. Design patterns for zero-downtime networking.

Every infrastructure has a quiet question lurking inside it: what happens when something breaks? Not if, but when. A switch will eventually need a firmware upgrade, a power supply will fail, an optic will go dark, someone will trip over a cable. The mark of a well-designed data center network is not that these events never happen — it is that nobody notices when they do.

Network redundancy is the discipline of removing single points of failure so the fabric keeps forwarding through individual component failures. At the access layer, where servers meet the network, the workhorse technology for this is MLAG — Multi-Chassis Link Aggregation. This article walks through how MLAG works, how it combines with a redundant fabric, and what actually survives when a switch goes down.

The single point of failure you cannot afford

Consider the simplest possible server connection: one NIC, one cable, one switch. It works perfectly — until that switch reboots for an upgrade, and every server hanging off it drops offline simultaneously. For a private cloud hosting dozens of virtual machines per host, a single top-of-rack switch failure can take out an entire rack of workloads. That is not a hypothetical; it is a Tuesday maintenance window gone wrong.

The instinct is to add a second switch and a second cable. But two independent switches cannot terminate a single link aggregate, so the naive options all waste the second link. A server is left with an active-backup bond that keeps one link idle, and a downstream switch or bridging host that uplinks to both forms a Layer 2 loop, which Spanning Tree resolves by blocking one of the two links. Either way you end up with redundancy that sits idle until a failure, wasting half your access bandwidth and, where Spanning Tree is involved, inheriting its slow reconvergence. MLAG exists to give you the second switch without that compromise.

What MLAG actually does

MLAG lets two physical switches present themselves to a downstream device as if they were a single logical switch. A server connects one link to each of the two switches and bonds them into a single logical interface (a LAG, or port-channel). From the server’s perspective there is just one bundled connection whose aggregate capacity is the sum of both links, although any single flow is hashed onto one member and so tops out at one link’s speed. From the network’s perspective, both links are active and forwarding at the same time; there is no blocked standby.
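To make the "both links active" point concrete, here is a minimal sketch of how a bundle might spread flows across its members. The CRC32-of-the-5-tuple hash, the link names, and the flow values are all assumptions for illustration; real NICs and switches use their own vendor-specific hash functions and field selections.

```python
import zlib

def pick_member(flow, active_links):
    """Map a flow to one active member link via a deterministic hash."""
    if not active_links:
        raise RuntimeError("no active member links in the bundle")
    key = "|".join(str(field) for field in flow).encode()
    return active_links[zlib.crc32(key) % len(active_links)]

# Hypothetical bundle: one member link to each MLAG peer.
links = ["eth0->switch-a", "eth1->switch-b"]

# Hypothetical flows: (src_ip, dst_ip, protocol, src_port, dst_port).
flows = [("10.0.0.5", "10.0.1.9", "tcp", 40000 + i, 443) for i in range(1000)]

# Count how many flows land on each member link.
per_link = {link: 0 for link in links}
for flow in flows:
    per_link[pick_member(flow, links)] += 1
print(per_link)
```

Because the hash is deterministic, every packet of a given flow takes the same member link. That keeps packets in order, and it is also why one flow cannot exceed a single link's speed even though the bundle as a whole can.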

The two switches coordinate over a dedicated link, usually called a peer link or inter-switch link, plus a keepalive path that lets each peer detect whether the other is alive. They synchronize their MAC address tables and ARP state so that, however traffic happens to land on either switch, it is forwarded correctly. The server neither knows nor cares that two independent boxes are involved — and that is exactly the point.
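The state synchronization described above can be pictured with a toy model. The `MlagPeer` class, the interface name `mlag10`, and the MAC address are hypothetical; real implementations exchange this state over the peer link using vendor-specific protocols, and they sync far more than MAC entries.

```python
class MlagPeer:
    """Toy model of one switch in an MLAG pair."""

    def __init__(self, name):
        self.name = name
        self.peer = None
        self.mac_table = {}  # MAC address -> logical MLAG interface

    def learn(self, mac, mlag_interface, from_peer=False):
        """Record a MAC locally and, if learned first-hand, tell the peer."""
        self.mac_table[mac] = mlag_interface
        if not from_peer and self.peer is not None:
            self.peer.learn(mac, mlag_interface, from_peer=True)

def pair(a, b):
    """Join two switches into an MLAG pair."""
    a.peer, b.peer = b, a

switch_a, switch_b = MlagPeer("switch-a"), MlagPeer("switch-b")
pair(switch_a, switch_b)

# The server's frame happens to hash onto the link to switch-a.
switch_a.learn("52:54:00:aa:bb:01", "mlag10")

# switch-b never saw that frame, yet it can forward to the server out of
# its own local member of the same logical interface.
print(switch_b.mac_table)
```

The point of the sketch is that the entry refers to the logical interface, not to a physical port on one box, so either peer can deliver return traffic locally without crossing the peer link.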

Dual-homed servers in practice

This pattern — a server with one link to each of two top-of-rack switches — is called dual-homing, and it is the foundation of a resilient access layer. The two switches form an MLAG pair; every server in the rack bonds across both. The bond can be configured active-active (using LACP) so both links carry traffic in normal operation, doubling usable bandwidth while the redundancy is, in effect, free. You are not paying for an idle backup; you are using everything you bought, and the spare capacity only becomes spare when something fails.

A failure-scenario walkthrough

Theory is reassuring; failure behavior is what matters. Let us walk through the events that actually happen in a dual-homed, MLAG-protected rack.

A switch reboots for an upgrade

You need to patch the firmware on one of the two top-of-rack switches. As it goes down, every server’s bond detects that one of its two member links is gone and shifts all traffic to the surviving link. When the port loses carrier this typically takes milliseconds, far faster than classic Spanning Tree reconvergence; a switch that hangs with its links still up is caught only when LACP times out, which takes seconds. Throughput per server drops to that of a single link, but connectivity never breaks. The switch reboots, rejoins the MLAG pair, resynchronizes state, and the bonds rebalance. At worst a handful of in-flight packets were dropped and retransmitted; no workload went offline. This is the scenario that justifies the entire design.
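A short sketch of the numbers behind this scenario may help. The LACP timer values are the ones defined in IEEE 802.1AX (a partner is declared lost after three missed LACPDUs, sent every 1 s at the fast rate or every 30 s at the slow rate). The 25 Gb/s link speed and the interface names are assumptions for illustration.

```python
# LACP timer values defined by IEEE 802.1AX, in seconds.
LACP_PERIOD_S = {"fast": 1, "slow": 30}
MISSED_PDUS_BEFORE_TIMEOUT = 3

LINK_SPEED_GBPS = 25  # hypothetical NIC speed

def lacp_timeout_s(rate):
    """Seconds without LACPDUs before a silent partner is declared lost."""
    return LACP_PERIOD_S[rate] * MISSED_PDUS_BEFORE_TIMEOUT

def bond_capacity_gbps(link_up):
    """Aggregate capacity of the bond given each member's up/down state."""
    return LINK_SPEED_GBPS * sum(1 for up in link_up.values() if up)

bond = {"eth0->switch-a": True, "eth1->switch-b": True}
print("normal:", bond_capacity_gbps(bond), "Gb/s")

# switch-a goes down for its upgrade; its member link drops out of the bond.
bond["eth0->switch-a"] = False
print("during upgrade:", bond_capacity_gbps(bond), "Gb/s")

# Worst case if a link stays electrically up but the partner goes silent.
print("silent-failure detection, fast rate:", lacp_timeout_s("fast"), "s")
print("silent-failure detection, slow rate:", lacp_timeout_s("slow"), "s")
```

The gap between 3 s and 90 s is a reasonable argument for running LACP at the fast rate on server-facing bundles, assuming both ends support it.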

A single cable or optic fails

An individual link failing is the gentlest case. The bond simply continues on its remaining member, and an alert fires so an engineer can replace the faulty optic at leisure. The blast radius is one link’s worth of bandwidth on one server, with zero downtime.

The peer link itself fails

The trickier case is losing the link between the two MLAG peers while both switches are still alive. Handled badly, this risks a split-brain where both switches believe they are in charge. This is why the keepalive path matters: it lets the peers distinguish a dead partner from a severed peer link. A well-implemented MLAG uses that signal to keep one peer (the primary) authoritative and disable the MLAG member ports on the other, so every dual-homed server fails over to the primary and no duplicate or looped traffic results. It is the scenario you must test before going live, not after.
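The decision the peers have to make can be written down as a small truth table. This is a simplified sketch of the general idea, not any vendor's algorithm: the role names, the `Action` values, and the `decide` function are hypothetical, and real implementations add timers, reload delays, and recovery rules on top.

```python
from enum import Enum

class Action(Enum):
    FORWARD = "keep forwarding on MLAG ports"
    SUSPEND_MLAG_PORTS = "disable local MLAG member ports"
    FORWARD_ALONE = "peer presumed dead; forward as the sole survivor"

def decide(role, peer_link_up, keepalive_up):
    """Choose an action for one peer from what it can observe."""
    if peer_link_up:
        # State sync is intact. A lost keepalive alone warrants an alert,
        # not a forwarding change.
        return Action.FORWARD
    if keepalive_up:
        # The partner is alive but the peer link is severed: only the
        # primary may keep forwarding, otherwise the pair is split-brain.
        if role == "secondary":
            return Action.SUSPEND_MLAG_PORTS
        return Action.FORWARD
    # Neither path answers, so the partner is most likely down.
    return Action.FORWARD_ALONE

for role in ("primary", "secondary"):
    for peer_link_up in (True, False):
        for keepalive_up in (True, False):
            action = decide(role, peer_link_up, keepalive_up)
            print(role, peer_link_up, keepalive_up, "->", action.name)
```

The case worth rehearsing in a lab is the one where both paths fail while both switches are still alive: the logic above cannot tell that apart from a dead partner, which is why the keepalive should take a physically separate route from the peer link.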

Redundancy all the way up the fabric

MLAG protects the access edge, but resilience has to extend through the whole network. In a spine-leaf fabric, redundancy above the leaf is structural: every leaf connects to multiple spines, and equal-cost multipath (ECMP) routing spreads traffic across all of them. Lose a spine and the routing protocol reconverges across the survivors, typically in well under a second when the failure is detected quickly (by link-down or BFD), with no server losing connectivity. So the leaves give you intra-rack redundancy via MLAG, and the spine layer gives you inter-rack redundancy via ECMP. Together they mean there is no single device whose failure can partition the network.
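The claim that no single device can partition the network is easy to check mechanically for a given design. The sketch below builds a small hypothetical fabric (two spines, two MLAG leaf pairs, two dual-homed servers; all names are invented) and fails each switch in turn to see whether the servers can still reach each other.

```python
from collections import deque

def build_fabric():
    """Return an adjacency map for a small hypothetical fabric."""
    links = [
        # Every leaf uplinks to every spine.
        ("leaf1a", "spine1"), ("leaf1a", "spine2"),
        ("leaf1b", "spine1"), ("leaf1b", "spine2"),
        ("leaf2a", "spine1"), ("leaf2a", "spine2"),
        ("leaf2b", "spine1"), ("leaf2b", "spine2"),
        # MLAG peer links.
        ("leaf1a", "leaf1b"), ("leaf2a", "leaf2b"),
        # Dual-homed servers.
        ("server1", "leaf1a"), ("server1", "leaf1b"),
        ("server2", "leaf2a"), ("server2", "leaf2b"),
    ]
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def reachable(graph, start, failed, servers):
    """Breadth-first search from start, skipping the failed device."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node != start and node in servers:
            continue  # servers are endpoints, never transit hops
        for neighbor in graph[node]:
            if neighbor != failed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

def single_points_of_failure(graph, servers):
    """List switches whose loss cuts at least one server off."""
    spofs = []
    for device in sorted(graph):
        if device in servers:
            continue
        seen = reachable(graph, servers[0], device, servers)
        if not all(server in seen for server in servers):
            spofs.append(device)
    return spofs

fabric = build_fabric()
print(single_points_of_failure(fabric, ["server1", "server2"]))
```

For this topology the list comes back empty. Attaching a third server to `leaf2a` alone and rerunning the check reports `leaf2a` as a single point of failure, which is the access-layer problem dual-homing removes.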

Power and cabling deserve the same discipline. Dual power supplies fed from independent feeds, the two members of an MLAG pair on separate circuits, and diverse cable paths all ensure that a failure upstream of the switch does not undo the redundancy you carefully built into it. Redundancy is only as strong as its least-redundant dependency.

Designing for the maintenance window

The real payoff of this architecture is operational freedom. When you can reboot any single switch without downtime, patching stops being a dreaded, after-hours event and becomes routine. You upgrade one peer, confirm health, then upgrade the other. Capacity planning should account for this: size the fabric so that running on N-1 devices during maintenance still comfortably carries production traffic, because for a slice of every upgrade you are deliberately running degraded. A design that only meets its targets when everything is up is not actually highly available.
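The N-1 sizing rule reduces to simple arithmetic. In the sketch below, the traffic figures, link speeds, and the 80% utilization ceiling are all assumptions chosen for illustration; substitute measured peaks and your own headroom policy.

```python
MAX_UTILIZATION = 0.8  # hypothetical ceiling for acceptable load

def utilization(peak_gbps, devices, per_device_gbps, failed=0):
    """Fraction of remaining capacity consumed at peak load."""
    remaining = devices - failed
    if remaining <= 0:
        raise ValueError("no devices left to carry traffic")
    return peak_gbps / (remaining * per_device_gbps)

# Hypothetical leaf: four 100 Gb/s uplinks, one per spine, 210 Gb/s at peak.
normal = utilization(210, devices=4, per_device_gbps=100)
degraded = utilization(210, devices=4, per_device_gbps=100, failed=1)
print(f"leaf uplinks: {normal:.0%} normal, {degraded:.0%} with one spine down")
print("survives maintenance:", degraded <= MAX_UTILIZATION)

# Hypothetical server: two 25 Gb/s bond members, 30 Gb/s at peak.
server_degraded = utilization(30, devices=2, per_device_gbps=25, failed=1)
print(f"server bond: {server_degraded:.0%} with one switch down")
print("survives maintenance:", server_degraded <= MAX_UTILIZATION)
```

The second case is the trap: a server that routinely pushes more than one link's worth of traffic has redundancy for connectivity but not for capacity, so its performance degrades during every upgrade window.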

Bringing zero downtime within reach

Zero-downtime networking is not magic; it is the cumulative result of removing single points of failure at every layer — dual-homed servers bonded across an MLAG pair, multiple spines tied together with ECMP, redundant power, and the operational discipline to use that redundancy. Done well, the network keeps forwarding through switch reboots, failed optics, and dead power supplies, and the workloads above never feel it.

Designing fabrics with this kind of resilience is a core part of how we build at clouditiv. As an authorized Arista Networks reseller, we deploy leaf-spine networks with MLAG-protected access and redundant spines underneath sovereign, OpenStack-based private clouds, so that planned maintenance and unplanned failures alike stay invisible to the platform. The goal is simple to state and hard to engineer: a network where the answer to what happens when something breaks is, reliably, nothing you will notice.