Asymmetric routing with SR Linux in EVPN VXLAN fabrics
This post dives deeper into the asymmetric routing model on SR Linux. The topology in use is a 3-stage Clos fabric with BGP EVPN and VXLAN: server s1 is single-homed to leaf1, s2 is dual-homed to leaf2 and leaf3, and s3 is single-homed to leaf4. Hosts s1 and s2 are in the same subnet, 172.16.10.0/24, while s3 is in a different subnet, 172.16.20.0/24. Thus, this post demonstrates both Layer 2 extension over a routed fabric and how Layer 3 services are deployed over the same fabric using an asymmetric routing model.
The physical topology is shown below:
The Containerlab file used for this is shown below:
name: srlinux-asymmetric-routing
topology:
  nodes:
    spine1:
      kind: nokia_srlinux
      image: ghcr.io/nokia/srlinux:24.7.1
    spine2:
      kind: nokia_srlinux
      image: ghcr.io/nokia/srlinux:24.7.1
    leaf1:
      kind: nokia_srlinux
      image: ghcr.io/nokia/srlinux:24.7.1
    leaf2:
      kind: nokia_srlinux
      image: ghcr.io/nokia/srlinux:24.7.1
    leaf3:
      kind: nokia_srlinux
      image: ghcr.io/nokia/srlinux:24.7.1
    leaf4:
      kind: nokia_srlinux
      image: ghcr.io/nokia/srlinux:24.7.1
    s1:
      kind: linux
      image: ghcr.io/srl-labs/network-multitool
      exec:
        - ip addr add 172.16.10.1/24 dev eth1
        - ip route add 172.16.20.0/24 via 172.16.10.254
    s2:
      kind: linux
      image: ghcr.io/srl-labs/network-multitool
      exec:
        - ip link add bond0 type bond mode 802.3ad
        - ip link set eth1 down
        - ip link set eth2 down
        - ip link set eth1 master bond0
        - ip link set eth2 master bond0
        - ip addr add 172.16.10.2/24 dev bond0
        - ip link set eth1 up
        - ip link set eth2 up
        - ip link set bond0 up
        - ip route add 172.16.20.0/24 via 172.16.10.254
    s3:
      kind: linux
      image: ghcr.io/srl-labs/network-multitool
      exec:
        - ip addr add 172.16.20.3/24 dev eth1
        - ip route add 172.16.10.0/24 via 172.16.20.254
  links:
    - endpoints: ["leaf1:e1-1", "spine1:e1-1"]
    - endpoints: ["leaf1:e1-2", "spine2:e1-1"]
    - endpoints: ["leaf2:e1-1", "spine1:e1-2"]
    - endpoints: ["leaf2:e1-2", "spine2:e1-2"]
    - endpoints: ["leaf3:e1-1", "spine1:e1-3"]
    - endpoints: ["leaf3:e1-2", "spine2:e1-3"]
    - endpoints: ["leaf4:e1-1", "spine1:e1-4"]
    - endpoints: ["leaf4:e1-2", "spine2:e1-4"]
    - endpoints: ["leaf1:e1-3", "s1:eth1"]
    - endpoints: ["leaf2:e1-3", "s2:eth1"]
- endpoints: ["leaf3:e1-3", "s3:eth2"]
- endpoints: ["leaf4:e1-3", "s3:eth1"]
Note
The login credentials for the server/host image (ghcr.io/srl-labs/network-multitool) are user/multit00l.
The end goal of this post is to ensure that host s1 can communicate with both s2 (same subnet) and s3 (different subnet) using an asymmetric routing model. To that end, the following IPv4 addressing is used (with the IRB addressing following a distributed, anycast model):
Resource | IPv4 scope |
---|---|
Underlay | 198.51.100.0/24 |
system0 interface | 192.0.2.0/24 |
VNI 10010 | 172.16.10.0/24 |
VNI 10020 | 172.16.20.0/24 |
server s1 | 172.16.10.1/24 |
server s2 | 172.16.10.2/24 |
server s3 | 172.16.20.3/24 |
irb0.10 interface | 172.16.10.254/24 |
irb0.20 interface | 172.16.20.254/24 |
Reviewing the asymmetric routing model
When routing between VNIs in a VXLAN fabric, there are two major routing models that can be used - asymmetric and symmetric. Asymmetric routing, which is the focus of this post, uses a bridge-route-bridge model: the ingress leaf bridges the packet into the Layer 2 domain, routes it from one VLAN/VNI to another, and then bridges the packet across the VXLAN fabric to the destination.
Such a design naturally implies that both the source and the destination IRBs (and the corresponding Layer 2 domains and bridge tables) must exist on all leafs hosting servers that need to communicate with each other. While this increases the operational state on the leafs themselves (ARP state and MAC address state is stored everywhere), it does offer configuration and operational simplicity.
Configuration walkthrough
With a basic understanding of the asymmetric routing model, let's start to configure this fabric. This configuration walkthrough includes building out the entire fabric from scratch - only the base configuration, loaded with Containerlab by default, exists on all nodes.
Point-to-point interfaces
The underlay of the fabric includes the physically connected point-to-point interfaces between the leafs and the spines, the IPv4/IPv6 addressing used for these interfaces and a routing protocol, deployed to distribute the loopback (system0) addresses across the fabric, with the simple end goal of achieving reachability between these loopback addresses. The configuration for these point-to-point addresses is shown below from all the nodes.
--{ + running }--[ ]--
A:leaf1# info interface ethernet-1/{1,2}
interface ethernet-1/1 {
admin-state enable
mtu 9100
subinterface 0 {
admin-state enable
ipv4 {
admin-state enable
address 198.51.100.0/31 {
}
}
}
}
interface ethernet-1/2 {
admin-state enable
mtu 9100
subinterface 0 {
admin-state enable
ipv4 {
admin-state enable
address 198.51.100.2/31 {
}
}
}
}
--{ + running }--[ ]--
A:leaf2# info interface ethernet-1/{1,2}
interface ethernet-1/1 {
admin-state enable
mtu 9100
subinterface 0 {
admin-state enable
ipv4 {
admin-state enable
address 198.51.100.4/31 {
}
}
}
}
interface ethernet-1/2 {
admin-state enable
mtu 9100
subinterface 0 {
admin-state enable
ipv4 {
admin-state enable
address 198.51.100.6/31 {
}
}
}
}
--{ + running }--[ ]--
A:leaf3# info interface ethernet-1/{1,2}
interface ethernet-1/1 {
admin-state enable
mtu 9100
subinterface 0 {
admin-state enable
ipv4 {
admin-state enable
address 198.51.100.8/31 {
}
}
}
}
interface ethernet-1/2 {
admin-state enable
mtu 9100
subinterface 0 {
admin-state enable
ipv4 {
admin-state enable
address 198.51.100.10/31 {
}
}
}
}
--{ + running }--[ ]--
A:leaf4# info interface ethernet-1/{1,2}
interface ethernet-1/1 {
admin-state enable
mtu 9100
subinterface 0 {
admin-state enable
ipv4 {
admin-state enable
address 198.51.100.12/31 {
}
}
}
}
interface ethernet-1/2 {
admin-state enable
mtu 9100
subinterface 0 {
admin-state enable
ipv4 {
admin-state enable
address 198.51.100.14/31 {
}
}
}
}
A:spine1# info interface ethernet-1/{1..4}
interface ethernet-1/1 {
admin-state enable
mtu 9100
subinterface 0 {
admin-state enable
ipv4 {
admin-state enable
address 198.51.100.1/31 {
}
}
}
}
interface ethernet-1/2 {
admin-state enable
mtu 9100
subinterface 0 {
admin-state enable
ipv4 {
admin-state enable
address 198.51.100.5/31 {
}
}
}
}
interface ethernet-1/3 {
admin-state enable
mtu 9100
subinterface 0 {
admin-state enable
ipv4 {
admin-state enable
address 198.51.100.9/31 {
}
}
}
}
interface ethernet-1/4 {
admin-state enable
mtu 9100
subinterface 0 {
admin-state enable
ipv4 {
admin-state enable
address 198.51.100.13/31 {
}
}
}
}
A:spine2# info interface ethernet-1/{1..4}
interface ethernet-1/1 {
admin-state enable
mtu 9100
subinterface 0 {
admin-state enable
ipv4 {
admin-state enable
address 198.51.100.3/31 {
}
}
}
}
interface ethernet-1/2 {
admin-state enable
mtu 9100
subinterface 0 {
admin-state enable
ipv4 {
admin-state enable
address 198.51.100.7/31 {
}
}
}
}
interface ethernet-1/3 {
admin-state enable
mtu 9100
subinterface 0 {
admin-state enable
ipv4 {
admin-state enable
address 198.51.100.11/31 {
}
}
}
}
interface ethernet-1/4 {
admin-state enable
mtu 9100
subinterface 0 {
admin-state enable
ipv4 {
admin-state enable
address 198.51.100.15/31 {
}
}
}
}
Tip
Notice that the configuration for multiple interfaces is shown with a single command using ranges. Two styles are demonstrated: one for the leafs and another for the spines. With interface ethernet-1/{1,2}, the comma separation allows the user to enter any set of numbers (contiguous or not), which is subsequently expanded; thus, this expands to interface ethernet-1/1 and interface ethernet-1/2. Alternatively, a contiguous range of numbers can be provided using .., as shown for the spines. In that case, interface ethernet-1/{1..4} implies ethernet-1/1 through ethernet-1/4.
Note
Remember, by default, there is no global routing instance/table in SR Linux. A network-instance of type default must be configured, and these interfaces, including the system0 interface, need to be added to this network instance for point-to-point and loopback connectivity.
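For reference, a minimal sketch of that piece for leaf1 is shown below. The system0 address of 192.0.2.11/32 is an assumption based on the router-id used in the BGP configuration later; the other nodes follow the same pattern.

interface system0 {
    admin-state enable
    subinterface 0 {
        admin-state enable
        ipv4 {
            admin-state enable
            address 192.0.2.11/32 {
            }
        }
    }
}
network-instance default {
    type default
    admin-state enable
    interface ethernet-1/1.0 {
    }
    interface ethernet-1/2.0 {
    }
    interface system0.0 {
    }
}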
Underlay and overlay BGP
For the underlay, eBGP is used to advertise the system0
interface addresses. However, since SR Linux has adapted eBGP behavior specifically for the L2VPN EVPN AFI/SAFI (no modification of next-hop address at every eBGP hop and the default use of system0
interface address as the next-hop when originating a route instead of the Layer 3 interface address over which the peering is formed), we can simply enable this address-family over the same peering (leveraging MP-BGP functionality). BGP is configured under the default network-instance
since this is for the underlay in the global routing table.
The BGP configuration from all nodes is shown below:
--{ + running }--[ ]--
A:leaf1# info network-instance default protocols bgp
network-instance default {
protocols {
bgp {
admin-state enable
autonomous-system 65411
router-id 192.0.2.11
afi-safi evpn {
admin-state enable
}
afi-safi ipv4-unicast {
admin-state enable
multipath {
maximum-paths 2
}
}
group spine {
peer-as 65500
export-policy [
spine-export
]
import-policy [
spine-import
]
afi-safi evpn {
admin-state enable
}
afi-safi ipv4-unicast {
admin-state enable
}
}
neighbor 198.51.100.1 {
peer-group spine
}
neighbor 198.51.100.3 {
peer-group spine
}
}
}
}
--{ + running }--[ ]--
--{ + running }--[ ]--
A:leaf2# info network-instance default protocols bgp
network-instance default {
protocols {
bgp {
admin-state enable
autonomous-system 65412
router-id 192.0.2.12
afi-safi evpn {
admin-state enable
}
afi-safi ipv4-unicast {
admin-state enable
multipath {
maximum-paths 2
}
}
group spine {
peer-as 65500
export-policy [
spine-export
]
import-policy [
spine-import
]
afi-safi evpn {
admin-state enable
}
afi-safi ipv4-unicast {
admin-state enable
}
}
neighbor 198.51.100.5 {
peer-group spine
}
neighbor 198.51.100.7 {
peer-group spine
}
}
}
}
--{ + running }--[ ]--
--{ + running }--[ ]--
A:leaf3# info network-instance default protocols bgp
network-instance default {
protocols {
bgp {
admin-state enable
autonomous-system 65413
router-id 192.0.2.13
afi-safi evpn {
admin-state enable
}
afi-safi ipv4-unicast {
admin-state enable
multipath {
maximum-paths 2
}
}
group spine {
peer-as 65500
export-policy [
spine-export
]
import-policy [
spine-import
]
afi-safi evpn {
admin-state enable
}
afi-safi ipv4-unicast {
admin-state enable
}
}
neighbor 198.51.100.9 {
peer-group spine
}
neighbor 198.51.100.11 {
peer-group spine
}
}
}
}
--{ + running }--[ ]--
--{ + running }--[ ]--
A:leaf4# info network-instance default protocols bgp
network-instance default {
protocols {
bgp {
admin-state enable
autonomous-system 65414
router-id 192.0.2.14
afi-safi evpn {
admin-state enable
}
afi-safi ipv4-unicast {
admin-state enable
multipath {
maximum-paths 2
}
}
group spine {
peer-as 65500
export-policy [
spine-export
]
import-policy [
spine-import
]
afi-safi evpn {
admin-state enable
}
afi-safi ipv4-unicast {
admin-state enable
}
}
neighbor 198.51.100.13 {
peer-group spine
}
neighbor 198.51.100.15 {
peer-group spine
}
}
}
}
--{ + running }--[ ]--
--{ running }--[ ]--
A:spine1# info network-instance default protocols bgp
network-instance default {
protocols {
bgp {
admin-state enable
autonomous-system 65500
router-id 192.0.2.101
afi-safi evpn {
admin-state enable
evpn {
inter-as-vpn true
}
}
afi-safi ipv4-unicast {
admin-state enable
}
group leaf {
export-policy [
leaf-export
]
import-policy [
leaf-import
]
afi-safi evpn {
admin-state enable
}
afi-safi ipv4-unicast {
admin-state enable
}
}
neighbor 198.51.100.0 {
peer-as 65411
peer-group leaf
}
neighbor 198.51.100.4 {
peer-as 65412
peer-group leaf
}
neighbor 198.51.100.8 {
peer-as 65413
peer-group leaf
}
neighbor 198.51.100.12 {
peer-as 65414
peer-group leaf
}
}
}
}
--{ + running }--[ ]--
--{ + running }--[ ]--
A:spine2# info network-instance default protocols bgp
network-instance default {
protocols {
bgp {
admin-state enable
autonomous-system 65500
router-id 192.0.2.102
afi-safi evpn {
admin-state enable
evpn {
inter-as-vpn true
}
}
afi-safi ipv4-unicast {
admin-state enable
}
group leaf {
export-policy [
leaf-export
]
import-policy [
leaf-import
]
afi-safi evpn {
admin-state enable
}
afi-safi ipv4-unicast {
admin-state enable
}
}
neighbor 198.51.100.2 {
peer-as 65411
peer-group leaf
}
neighbor 198.51.100.6 {
peer-as 65412
peer-group leaf
}
neighbor 198.51.100.10 {
peer-as 65413
peer-group leaf
}
neighbor 198.51.100.14 {
peer-as 65414
peer-group leaf
}
}
}
}
--{ + running }--[ ]--
Note
On the spines, the configuration option inter-as-vpn must be set to true under the protocols bgp afi-safi evpn evpn hierarchy. Since the spines are not configured as VTEPs and act as pure IP forwarders in this design, no Layer 2 or Layer 3 VXLAN constructs (and hence no route targets for EVPN route import) are created on them. By default, EVPN routes that do not match a local import route target are rejected and not advertised to the other leafs. The inter-as-vpn configuration option overrides this behavior.
The BGP configuration defines a peer-group called spine
on the leafs and leaf
on the spines to build out common configuration that can be applied across multiple neighbors. These peer-groups enable both the IPv4-unicast and EVPN address-families, using MP-BGP to establish a single peering for both families. In addition to this, export
and import
policies are defined, controlling what routes are exported and imported.
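Before moving on, each peering can be verified as established for both address-families with, for example:

A:leaf1# show network-instance default protocols bgp neighbor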
The following packet capture also confirms the MP-BGP capabilities exchanged with the BGP OPEN messages, where both IPv4 unicast and L2VPN EVPN capabilities are advertised:
Routing policies for the underlay and overlay
The configuration of the routing policies used for export and import of BGP routes is shown below. Since the policies for the leafs are the same across all leafs and the policies for the spines are the same across all spines, the configuration is only shown from two nodes, leaf1 and spine1, using them as references.
--{ + running }--[ ]--
A:leaf1# info routing-policy policy spine-*
routing-policy {
policy spine-export {
default-action {
policy-result reject
}
statement loopback {
match {
protocol local
}
action {
policy-result accept
}
}
statement allow-evpn {
match {
family [
evpn
]
}
action {
policy-result accept
}
}
}
policy spine-import {
default-action {
policy-result reject
}
statement bgp-underlay {
match {
protocol bgp
family [
ipv4-unicast
ipv6-unicast
]
}
action {
policy-result accept
}
}
statement bgp-evpn-overlay {
match {
family [
evpn
]
}
action {
policy-result accept
}
}
}
}
--{ + running }--[ ]--
--{ running }--[ ]--
A:spine1# info routing-policy policy leaf-*
routing-policy {
policy leaf-export {
default-action {
policy-result reject
}
statement loopback {
match {
protocol local
}
action {
policy-result accept
}
}
statement bgp-underlay {
match {
protocol bgp
family [
ipv4-unicast
ipv6-unicast
]
}
action {
policy-result accept
}
}
statement bgp-evpn-overlay {
match {
family [
evpn
]
}
action {
policy-result accept
}
}
}
policy leaf-import {
default-action {
policy-result reject
}
statement bgp-underlay {
match {
protocol bgp
family [
ipv4-unicast
ipv6-unicast
]
}
action {
policy-result accept
}
}
statement bgp-evpn-overlay {
match {
family [
evpn
]
}
action {
policy-result accept
}
}
}
}
--{ running }--[ ]--
Tip
Similar to how ranges can be used to pull configuration state from multiple interfaces, a wildcard * can be used to select multiple routing policies. Here, the wildcard spine-* matches both the spine-import and spine-export policies.
Host connectivity and ESI LAG
With BGP configured, we can start to deploy connectivity to the servers and configure the necessary VXLAN constructs for end-to-end connectivity. The server-facing interfaces are configured as untagged interfaces (a sketch of the single-homed access interfaces on leaf1 and leaf4 is shown at the end of this section). Since server s2 is multi-homed to leaf2 and leaf3, this segment is configured as an ESI LAG. This includes:

- Mapping the physical interface to a LAG interface (lag1, in this case).
- Configuring the LAG interface with the required LACP properties - mode active and a system-mac of 00:00:00:00:23:23. This LAG interface is also configured with a subinterface of type bridged.
- Defining an Ethernet Segment under the system network-instance protocols evpn ethernet-segments hierarchy.
--{ + running }--[ ]--
A:leaf2# info interface ethernet-1/3
interface ethernet-1/3 {
admin-state enable
ethernet {
aggregate-id lag1
}
}
--{ + running }--[ ]--
--{ + running }--[ ]--
A:leaf2# info interface lag1
interface lag1 {
admin-state enable
vlan-tagging false
subinterface 0 {
type bridged
admin-state enable
}
lag {
lag-type lacp
lacp {
lacp-mode ACTIVE
system-id-mac 00:00:00:00:23:23
}
}
}
--{ + running }--[ ]--
--{ + running }--[ ]--
A:leaf2# info system network-instance protocols evpn
system {
network-instance {
protocols {
evpn {
ethernet-segments {
bgp-instance 1 {
ethernet-segment es1 {
admin-state enable
esi 00:00:11:11:11:11:11:11:23:23
multi-homing-mode all-active
interface lag1 {
}
}
}
}
}
}
}
}
--{ + running }--[ ]--
--{ + running }--[ ]--
A:leaf3# info interface ethernet-1/3
interface ethernet-1/3 {
admin-state enable
ethernet {
aggregate-id lag1
}
}
--{ + running }--[ ]--
A:leaf3# info interface lag1
interface lag1 {
admin-state enable
vlan-tagging false
subinterface 0 {
type bridged
admin-state enable
}
lag {
lag-type lacp
lacp {
lacp-mode ACTIVE
system-id-mac 00:00:00:00:23:23
}
}
}
--{ + running }--[ ]--
--{ + running }--[ ]--
A:leaf3# info system network-instance protocols evpn
system {
network-instance {
protocols {
evpn {
ethernet-segments {
bgp-instance 1 {
ethernet-segment es1 {
admin-state enable
esi 00:00:11:11:11:11:11:11:23:23
multi-homing-mode all-active
interface lag1 {
}
}
}
}
}
}
}
}
--{ + running }--[ ]--
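For comparison, the single-homed access interfaces on leaf1 and leaf4 do not use a LAG. A minimal sketch for leaf1 (consistent with the ethernet-1/3.0 subinterface referenced in the MAC VRF configuration later) would look like this:

interface ethernet-1/3 {
    admin-state enable
    vlan-tagging false
    subinterface 0 {
        type bridged
        admin-state enable
    }
}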
VXLAN tunnel interfaces
On each leaf, VXLAN tunnel-interfaces are created next. Up to two logical interfaces are created per leaf, one for VNI 10010 and another for VNI 10020 (since this is asymmetric routing, every VNI must exist on every leaf that needs to route between the respective VNIs). Since the end goal is to have server s1 communicate with s2 and s3, leaf1 and leaf4 are configured with both VNI 10010 and VNI 10020, while leaf2 and leaf3 are configured with VNI 10010 only.
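A minimal sketch of this for leaf1 is shown below (leaf4 is identical, while leaf2 and leaf3 carry only vxlan-interface 1); the vxlan1.1 and vxlan1.2 names match those referenced in the MAC VRF configuration later:

tunnel-interface vxlan1 {
    vxlan-interface 1 {
        type bridged
        ingress {
            vni 10010
        }
    }
    vxlan-interface 2 {
        type bridged
        ingress {
            vni 10020
        }
    }
}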
IRBs on the leafs
IRBs are deployed using an anycast, distributed gateway model, implying that all leafs are configured with the same IP address and MAC address for a specific IRB subinterface. These IRB subinterfaces act as the default gateway for the endpoints. For our topology, we create two subinterfaces, irb0.10 and irb0.20, corresponding to hosts mapped to VNIs 10010 and 10020, respectively. The configuration of these IRB interfaces is shown below:
--{ + running }--[ ]--
A:leaf1# info interface irb0
interface irb0 {
admin-state enable
subinterface 10 {
admin-state enable
ipv4 {
admin-state enable
address 172.16.10.254/24 {
anycast-gw true
}
arp {
learn-unsolicited true
proxy-arp true
host-route {
populate dynamic {
}
populate evpn {
}
}
evpn {
advertise dynamic {
}
}
}
}
anycast-gw {
anycast-gw-mac 00:00:5E:00:53:00
}
}
subinterface 20 {
admin-state enable
ipv4 {
admin-state enable
address 172.16.20.254/24 {
anycast-gw true
}
arp {
learn-unsolicited true
host-route {
populate dynamic {
}
populate evpn {
}
}
evpn {
advertise dynamic {
}
}
}
}
anycast-gw {
anycast-gw-mac 00:00:5E:00:53:00
}
}
}
--{ + running }--[ ]--
--{ + running }--[ ]--
A:leaf2# info interface irb0
interface irb0 {
admin-state enable
subinterface 10 {
admin-state enable
ipv4 {
admin-state enable
address 172.16.10.254/24 {
anycast-gw true
}
arp {
learn-unsolicited true
proxy-arp true
host-route {
populate dynamic {
}
populate evpn {
}
}
evpn {
advertise dynamic {
}
}
}
}
anycast-gw {
anycast-gw-mac 00:00:5E:00:53:00
}
}
}
--{ + running }--[ ]--
--{ + running }--[ ]--
A:leaf3# info interface irb0
interface irb0 {
admin-state enable
subinterface 10 {
admin-state enable
ipv4 {
admin-state enable
address 172.16.10.254/24 {
anycast-gw true
}
arp {
learn-unsolicited true
proxy-arp true
host-route {
populate dynamic {
}
populate evpn {
}
}
evpn {
advertise dynamic {
}
}
}
}
anycast-gw {
anycast-gw-mac 00:00:5E:00:53:00
}
}
}
--{ + running }--[ ]--
--{ + running }--[ ]--
A:leaf4# info interface irb0
interface irb0 {
admin-state enable
subinterface 10 {
admin-state enable
ipv4 {
admin-state enable
address 172.16.10.254/24 {
anycast-gw true
}
arp {
learn-unsolicited true
proxy-arp true
host-route {
populate dynamic {
}
populate evpn {
}
}
evpn {
advertise dynamic {
}
}
}
}
anycast-gw {
anycast-gw-mac 00:00:5E:00:53:00
}
}
subinterface 20 {
admin-state enable
ipv4 {
admin-state enable
address 172.16.20.254/24 {
anycast-gw true
}
arp {
learn-unsolicited true
host-route {
populate dynamic {
}
populate evpn {
}
}
evpn {
advertise dynamic {
}
}
}
}
anycast-gw {
anycast-gw-mac 00:00:5E:00:53:00
}
}
}
--{ + running }--[ ]--
There is a lot going on here, so let's break down some of the configuration options:

- anycast-gw [true|false] - When this is set to true, the IPv4 address is associated with the anycast gateway MAC address, and this MAC address is used to respond to any ARP requests for that IPv4 address. This also allows the same IPv4 address to be configured on other nodes for the same broadcast domain, essentially suppressing duplicate IP detection.
- anycast-gw anycast-gw-mac [mac-address] - The MAC address configured with this option is the anycast gateway MAC address and is associated with the IP address for that subinterface. If this is omitted, the anycast gateway MAC address is auto-derived from the VRRP MAC address group range.
- arp learn-unsolicited [true|false] - This enables the node to learn the IP-to-MAC binding from any ARP packet and not just ARP requests.
- arp host-route populate [dynamic|static|evpn] - This enables the node to insert a host route (/32 for IPv4 and /128 for IPv6) into the routing table from dynamic, static or EVPN-learnt ARP entries.
- arp evpn advertise [dynamic|static] - This enables the node to advertise EVPN Type-2 MAC+IP routes from dynamic or static ARP entries.
MAC VRFs on leafs
Finally, MAC VRFs are created on the leafs to define a broadcast domain and the corresponding bridge table for Layer 2 learning. Since, by default, a MAC VRF corresponds to a single broadcast domain and bridge table, only one Layer 2 VNI can be mapped to it. Thus, on leaf1 and leaf4, two MAC VRFs are created - one for VNI 10010 and another for VNI 10020. Under the MAC VRF, there are several important things to consider:

- The Layer 2 subinterface is bound to the MAC VRF using the interface configuration option.
- The corresponding IRB subinterface is bound to the MAC VRF using the interface configuration option.
- The VXLAN tunnel subinterface is bound to the MAC VRF using the vxlan-interface configuration option.
- BGP EVPN learning is enabled for the MAC VRF under the protocols bgp-evpn hierarchy, and the MAC VRF is bound to an EVI (EVPN Virtual Instance).
- The ecmp configuration option determines how many VTEPs can be considered for load-balancing by the local VTEP (more on this in the validation section).
- Route distinguishers and route targets are configured for the MAC VRF under the protocols bgp-vpn hierarchy.
--{ + running }--[ ]--
A:leaf1# info network-instance macvrf*
network-instance macvrf1 {
type mac-vrf
admin-state enable
interface ethernet-1/3.0 {
}
interface irb0.10 {
}
vxlan-interface vxlan1.1 {
}
protocols {
bgp-evpn {
bgp-instance 1 {
admin-state enable
vxlan-interface vxlan1.1
evi 10
ecmp 2
}
}
bgp-vpn {
bgp-instance 1 {
route-distinguisher {
rd 192.0.2.11:1
}
route-target {
export-rt target:10:10
import-rt target:10:10
}
}
}
}
}
network-instance macvrf2 {
type mac-vrf
admin-state enable
interface irb0.20 {
}
vxlan-interface vxlan1.2 {
}
protocols {
bgp-evpn {
bgp-instance 1 {
admin-state enable
vxlan-interface vxlan1.2
evi 20
ecmp 2
}
}
bgp-vpn {
bgp-instance 1 {
route-distinguisher {
rd 192.0.2.11:2
}
route-target {
export-rt target:20:20
import-rt target:20:20
}
}
}
}
}
--{ + running }--[ ]--
--{ + running }--[ ]--
A:leaf2# info network-instance macvrf1
network-instance macvrf1 {
type mac-vrf
admin-state enable
interface irb0.10 {
}
interface lag1.0 {
}
vxlan-interface vxlan1.1 {
}
protocols {
bgp-evpn {
bgp-instance 1 {
admin-state enable
vxlan-interface vxlan1.1
evi 10
ecmp 2
}
}
bgp-vpn {
bgp-instance 1 {
route-distinguisher {
rd 192.0.2.12:1
}
route-target {
export-rt target:10:10
import-rt target:10:10
}
}
}
}
}
--{ + running }--[ ]--
--{ + running }--[ ]--
A:leaf3# info network-instance macvrf1
network-instance macvrf1 {
type mac-vrf
admin-state enable
interface irb0.10 {
}
interface lag1.0 {
}
vxlan-interface vxlan1.1 {
}
protocols {
bgp-evpn {
bgp-instance 1 {
admin-state enable
vxlan-interface vxlan1.1
evi 10
ecmp 2
}
}
bgp-vpn {
bgp-instance 1 {
route-distinguisher {
rd 192.0.2.13:1
}
route-target {
export-rt target:10:10
import-rt target:10:10
}
}
}
}
}
--{ + running }--[ ]--
--{ + running }--[ ]--
A:leaf4# info network-instance macvrf*
network-instance macvrf1 {
type mac-vrf
admin-state enable
interface irb0.10 {
}
vxlan-interface vxlan1.1 {
}
protocols {
bgp-evpn {
bgp-instance 1 {
admin-state enable
vxlan-interface vxlan1.1
evi 10
ecmp 2
}
}
bgp-vpn {
bgp-instance 1 {
route-distinguisher {
rd 192.0.2.14:1
}
route-target {
export-rt target:10:10
import-rt target:10:10
}
}
}
}
}
network-instance macvrf2 {
type mac-vrf
admin-state enable
interface ethernet-1/3.0 {
}
interface irb0.20 {
}
vxlan-interface vxlan1.2 {
}
protocols {
bgp-evpn {
bgp-instance 1 {
admin-state enable
vxlan-interface vxlan1.2
evi 20
ecmp 2
}
}
bgp-vpn {
bgp-instance 1 {
route-distinguisher {
rd 192.0.2.14:2
}
route-target {
export-rt target:20:20
import-rt target:20:20
}
}
}
}
}
--{ + running }--[ ]--
This completes the configuration walkthrough section of this post. Next, we'll cover the control plane and data plane validation.
Control plane & data plane validation
When the hosts come online, they typically send a GARP to ensure there is no duplicate IP address in their broadcast domain. This enables the locally attached leafs to learn the IP-to-MAC binding and build an entry in the ARP cache table (since the arp learn-unsolicited configuration option is set to true). This binding, in turn, is advertised as an EVPN Type-2 MAC+IP route so that remote leafs learn it as well and eventually insert it into their own ARP caches.
On leaf1, we can confirm that it has learnt the IP-to-MAC bindings for server s1 (locally attached) as well as for s2 and s3 (learnt via EVPN from the remote leafs).
A:leaf1# show arpnd arp-entries interface irb0
+-------------------+-------------------+-----------------+-------------------+-------------------------------------+------------------------------------------------------------------------+
| Interface | Subinterface | Neighbor | Origin | Link layer address | Expiry |
+===================+===================+=================+===================+=====================================+========================================================================+
| irb0 | 10 | 172.16.10.1 | dynamic | AA:C1:AB:CA:A0:83 | 3 hours from now |
| irb0 | 10 | 172.16.10.2 | evpn | AA:C1:AB:11:BE:88 | |
| irb0 | 20 | 172.16.20.3 | evpn | AA:C1:AB:9F:EF:E2 | |
+-------------------+-------------------+-----------------+-------------------+-------------------------------------+------------------------------------------------------------------------+
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Total entries : 3 (0 static, 3 dynamic)
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--{ + candidate shared default }--[ ]--
The ARP entry for server s3 (172.16.20.3) is learnt via the EVPN Type-2 MAC+IP route received from leaf4, as shown below.
--{ + running }--[ ]--
A:leaf1# show network-instance default protocols bgp routes evpn route-type 2 ip-address 172.16.20.3 detail
---------------------------------------------------------------------------------------------------------------------------
Show report for the EVPN routes in network-instance "default"
---------------------------------------------------------------------------------------------------------------------------
Route Distinguisher: 192.0.2.14:2
Tag-ID : 0
MAC address : AA:C1:AB:9F:EF:E2
IP Address : 172.16.20.3
neighbor : 198.51.100.1
Received paths : 1
Path 1: <Best,Valid,Used,>
ESI : 00:00:00:00:00:00:00:00:00:00
Label : 10020
Route source : neighbor 198.51.100.1 (last modified 4d18h49m3s ago)
Route preference : No MED, No LocalPref
Atomic Aggr : false
BGP next-hop : 192.0.2.14
AS Path : i [65500, 65414]
Communities : [target:20:20, bgp-tunnel-encap:VXLAN]
RR Attributes : No Originator-ID, Cluster-List is []
Aggregation : None
Unknown Attr : None
Invalid Reason : None
Tie Break Reason : none
Path 1 was advertised to (Modified Attributes):
[ 198.51.100.3 ]
Route preference : No MED, No LocalPref
Atomic Aggr : false
BGP next-hop : 192.0.2.14
AS Path : i [65411, 65500, 65414]
Communities : [target:20:20, bgp-tunnel-encap:VXLAN]
RR Attributes : No Originator-ID, Cluster-List is []
Aggregation : None
Unknown Attr : None
---------------------------------------------------------------------------------------------------------------------------
Route Distinguisher: 192.0.2.14:2
Tag-ID : 0
MAC address : AA:C1:AB:9F:EF:E2
IP Address : 172.16.20.3
neighbor : 198.51.100.3
Received paths : 1
Path 1: <Valid,>
ESI : 00:00:00:00:00:00:00:00:00:00
Label : 10020
Route source : neighbor 198.51.100.3 (last modified 4d18h49m0s ago)
Route preference : No MED, No LocalPref
Atomic Aggr : false
BGP next-hop : 192.0.2.14
AS Path : i [65500, 65414]
Communities : [target:20:20, bgp-tunnel-encap:VXLAN]
RR Attributes : No Originator-ID, Cluster-List is []
Aggregation : None
Unknown Attr : None
Invalid Reason : None
Tie Break Reason : peer-router-id
---------------------------------------------------------------------------------------------------------------------------
--{ + running }--[ ]--
This is an important step for asymmetric routing. Consider a situation where server s1 wants to communicate with s3. When the IP packet hits leaf1, leaf1 will attempt to resolve the destination IP address via an ARP request, since the destination subnet appears directly attached locally (via the irb0.20 interface), as shown below.
--{ + running }--[ ]--
A:leaf1# show network-instance default route-table ipv4-unicast prefix 172.16.20.0/24
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
IPv4 unicast route table of network instance default
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+---------------------------+-------+------------+----------------------+----------+----------+---------+------------+-----------------+-----------------+-----------------+----------------------+
| Prefix | ID | Route Type | Route Owner | Active | Origin | Metric | Pref | Next-hop (Type) | Next-hop | Backup Next-hop | Backup Next-hop |
| | | | | | Network | | | | Interface | (Type) | Interface |
| | | | | | Instance | | | | | | |
+===========================+=======+============+======================+==========+==========+=========+============+=================+=================+=================+======================+
| 172.16.20.0/24 | 10 | local | net_inst_mgr | True | default | 0 | 0 | 172.16.20.254 | irb0.20 | | |
| | | | | | | | | (direct) | | | |
+---------------------------+-------+------------+----------------------+----------+----------+---------+------------+-----------------+-----------------+-----------------+----------------------+
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--{ + running }--[ ]--
Since this IRB interface exists on leaf4 as well, the ARP reply would be consumed by leaf4 and never reach leaf1, breaking the ARP process. To circumvent this problem, which is inherent to an anycast, distributed IRB model, the EVPN Type-2 MAC+IP routes are used to populate the ARP cache. In addition, this EVPN-learnt ARP entry can optionally be used to inject a host route (/32 for IPv4 and /128 for IPv6) into the routing table using the arp host-route populate evpn configuration option (as discussed earlier). Since this is enabled in our case, we can confirm that the route 172.16.20.3/32 exists in the routing table, inserted by the arp_nd_mgr process:
--{ + running }--[ ]--
A:leaf1# show network-instance default route-table ipv4-unicast prefix 172.16.20.3/32
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
IPv4 unicast route table of network instance default
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+---------------------------+-------+------------+----------------------+----------+----------+---------+------------+-----------------+-----------------+-----------------+----------------------+
| Prefix | ID | Route Type | Route Owner | Active | Origin | Metric | Pref | Next-hop (Type) | Next-hop | Backup Next-hop | Backup Next-hop |
| | | | | | Network | | | | Interface | (Type) | Interface |
| | | | | | Instance | | | | | | |
+===========================+=======+============+======================+==========+==========+=========+============+=================+=================+=================+======================+
| 172.16.20.3/32 | 10 | arp-nd | arp_nd_mgr | True | default | 0 | 1 | 172.16.20.3 | irb0.20 | | |
| | | | | | | | | (direct) | | | |
+---------------------------+-------+------------+----------------------+----------+----------+---------+------------+-----------------+-----------------+-----------------+----------------------+
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--{ + running }--[ ]--
Note
The arp host-route populate evpn configuration option is purely a design choice. Since a routing lookup follows longest-prefix-match logic, the presence of the host routes ensures that a lookup for the destination selects the host route instead of falling back to the subnet route (which relies on ARP resolution), making the forwarding process more efficient. However, this also implies that a host route is created for every EVPN-learnt ARP entry, which can lead to a large routing table and potentially become an issue in large-scale fabrics.
Let's consider two flows to understand the data plane forwarding in such a design - server s1 communicating with s2 (same subnet) and s1 communicating with s3 (different subnet).
Since s1 is in the same subnet as s2, s1 will try to resolve s2's IP address directly via an ARP request. This request is received on leaf1 and punted to the CPU via irb0.10. Since L2 proxy-ARP is not enabled, the arp_nd_mgr process picks up the ARP request and responds with its own anycast gateway MAC address, while suppressing the ARP request from being flooded into the fabric. A packet capture of this ARP reply is shown below.
Once this ARP process completes, server s1 generates an ICMP request (since we are testing communication between hosts using the ping tool). When this IP packet arrives on leaf1, leaf1 does a routing lookup (since the destination MAC address is its own anycast gateway MAC), and this lookup will hit either the 172.16.10.0/24 prefix or the more-specific 172.16.10.2/32 host route (installed from the ARP entry learnt via the EVPN Type-2 MAC+IP route), as shown below. Since this is a directly attached route, it is further resolved into a MAC address via the ARP table, and the packet is then bridged towards the destination. This MAC address points to an Ethernet Segment, which in turn resolves into VTEPs 192.0.2.12 and 192.0.2.13.
A:leaf1# show network-instance default route-table ipv4-unicast route 172.16.10.2
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
IPv4 unicast route table of network instance default
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+------------------------+-------+------------+----------------------+----------+----------+---------+------------+---------------+---------------+---------------+------------------+
| Prefix | ID | Route Type | Route Owner | Active | Origin | Metric | Pref | Next-hop | Next-hop | Backup Next- | Backup Next-hop |
| | | | | | Network | | | (Type) | Interface | hop (Type) | Interface |
| | | | | | Instance | | | | | | |
+========================+=======+============+======================+==========+==========+=========+============+===============+===============+===============+==================+
| 172.16.10.2/32 | 8 | arp-nd | arp_nd_mgr | True | default | 0 | 1 | 172.16.10.2 | irb0.10 | | |
| | | | | | | | | (direct) | | | |
+------------------------+-------+------------+----------------------+----------+----------+---------+------------+---------------+---------------+---------------+------------------+
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--{ + candidate shared default }--[ ]--
A:leaf1# show arpnd arp-entries interface irb0 ipv4-address 172.16.10.2
+------------------+------------------+-----------------+------------------+-----------------------------------+--------------------------------------------------------------------+
| Interface | Subinterface | Neighbor | Origin | Link layer address | Expiry |
+==================+==================+=================+==================+===================================+====================================================================+
| irb0 | 10 | 172.16.10.2 | evpn | AA:C1:AB:11:BE:88 | |
+------------------+------------------+-----------------+------------------+-----------------------------------+--------------------------------------------------------------------+
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Total entries : 1 (0 static, 1 dynamic)
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--{ + candidate shared default }--[ ]--
--{ + candidate shared default }--[ ]--
A:leaf1# show network-instance macvrf1 bridge-table mac-table mac AA:C1:AB:11:BE:88
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Mac-table of network instance macvrf1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Mac : AA:C1:AB:11:BE:88
Destination : vxlan-interface:vxlan1.1 esi:00:00:11:11:11:11:11:11:23:23
Dest Index : 322085950259
Type : evpn
Programming Status : Success
Aging : N/A
Last Update : 2024-10-14T05:37:52.000Z
Duplicate Detect time : N/A
Hold down time remaining: N/A
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--{ + candidate shared default }--[ ]--
A:leaf1# show tunnel-interface vxlan1 vxlan-interface 1 bridge-table unicast-destinations destination | grep -A 7 "Ethernet Segment Destinations"
Ethernet Segment Destinations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+-------------------------------+-------------------+------------------------+-----------------------------+
| ESI | Destination-index | VTEPs | Number MACs (Active/Failed) |
+===============================+===================+========================+=============================+
| 00:00:11:11:11:11:11:11:23:23 | 322085950259 | 192.0.2.12, 192.0.2.13 | 1(1/0) |
+-------------------------------+-------------------+------------------------+-----------------------------+
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--{ + candidate shared default }--[ ]--
A packet capture of the in-flight packet (as leaf1 sends it to spine1) is shown below, which confirms that the ICMP request is VXLAN-encapsulated with a VNI of 10010. It also confirms that, because of the L3 proxy-ARP approach to suppressing ARP in an EVPN VXLAN fabric, the source MAC address in the inner Ethernet header is the anycast gateway MAC address.
The communication between servers s1 and s3 follows a similar pattern - the packet is received in macvrf1 (mapped to VNI 10010), and since the destination MAC address is the anycast gateway MAC address owned by leaf1, it is routed locally into VNI 10020 (since irb0.20 is locally attached) and then bridged across the fabric to the destination, as confirmed below:
--{ + running }--[ ]--
A:leaf1# show network-instance default route-table ipv4-unicast route 172.16.20.3
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
IPv4 unicast route table of network instance default
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+------------------------+-------+------------+----------------------+----------+----------+---------+------------+---------------+---------------+---------------+------------------+
| Prefix | ID | Route Type | Route Owner | Active | Origin | Metric | Pref | Next-hop | Next-hop | Backup Next- | Backup Next-hop |
| | | | | | Network | | | (Type) | Interface | hop (Type) | Interface |
| | | | | | Instance | | | | | | |
+========================+=======+============+======================+==========+==========+=========+============+===============+===============+===============+==================+
| 172.16.20.3/32 | 10 | arp-nd | arp_nd_mgr | True | default | 0 | 1 | 172.16.20.3 | irb0.20 | | |
| | | | | | | | | (direct) | | | |
+------------------------+-------+------------+----------------------+----------+----------+---------+------------+---------------+---------------+---------------+------------------+
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--{ + running }--[ ]--
--{ + running }--[ ]--
A:leaf1# show network-instance * bridge-table mac-table mac AA:C1:AB:9F:EF:E2
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Mac-table of network instance macvrf2
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Mac : AA:C1:AB:9F:EF:E2
Destination : vxlan-interface:vxlan1.2 vtep:192.0.2.14 vni:10020
Dest Index : 322085950242
Type : evpn
Programming Status : Success
Aging : N/A
Last Update : 2024-10-14T01:05:54.000Z
Duplicate Detect time : N/A
Hold down time remaining: N/A
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--{ + running }--[ ]--
Tip
Notice how the previous output used a wildcard for the network-instance name instead of a specific name (show network-instance * bridge-table ...). This is useful since the operator may not always know exactly which MAC VRF is used for forwarding; the wildcard searches across all MAC VRFs to determine where the MAC address is learned.
The following packet capture confirms that the in-flight packet has been routed on the ingress leaf itself (leaf1) and the VNI, in the VXLAN header, is 10020.
Summary
Asymmetric routing uses a bridge-route-bridge model where the packet from the source is bridged into the ingress leaf's L2 domain, routed into the destination VLAN/VNI, and then bridged across the VXLAN fabric to the destination.
Such a model requires both the source and destination IRBs and L2 bridge domains (and L2 VNIs) to exist on all leafs that participate in routing between the VNIs. While this is operationally simpler, it adds state, since every leaf has to maintain all IP-to-MAC bindings (in the ARP table) and all MAC addresses in the bridge table.