eBGP – ECMP in depth!

My client recently made a fairly big change to the edge network in their data centre, including a migration to 4-byte AS numbers. This wasn’t without its challenges. So here is a (long) post about the challenges we faced, and some explanations of some of the more advanced features of BGP, such as local-as no-prepend replace-as and bestpath as-path multipath-relax.

Here is a very simplified version of the topology, post change – everything is fictional. The two ISPs provide a private MPLS.

Problem number 1: dc1-isp2-ce is old. Old enough that it won’t support 4-byte AS numbers. Old enough that a software upgrade won’t fix it, because it’s actually the hardware which won’t support 4 bytes. It’s still under maintenance, and there’s nothing wrong with it for what it does (except the lack of 4-byte AS support of course), so it’s not being replaced.

Solution number 1: easy. Use a local-as on dc1-cs1, so that dc1-isp2-ce can peer with a two-byte AS. If a router doesn’t support 4-byte AS numbers, its 4-byte-capable peer picks up on that and converts all 4-byte AS numbers to “23456” (the reserved AS_TRANS value).

4-Byte AS is a negotiated BGP capability:

dc1-cs1# sh ip bgp neigh 10.1.0.1 | i 4-B
4-Byte AS capability: advertised received

So that’s fine – we just get a bunch of routes where we lost the AS path info, because they just traversed 23456 a bunch of times.
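To make the 23456 substitution concrete, here is a minimal Python sketch of what a 2-byte-only speaker ends up seeing. The function name is made up for illustration, and the example path is just the DC-side AS numbers from this post; it only models the AS_PATH rewrite, not the AS4_PATH attribute that carries the real numbers alongside.

```python
AS_TRANS = 23456        # reserved 2-byte placeholder for 4-byte AS numbers (RFC 6793)
MAX_2BYTE_ASN = 65535   # largest AS number that fits in 2 bytes

def as_path_seen_by_2byte_peer(as_path):
    """Return the AS_PATH as an old, 2-byte-only BGP speaker would see it."""
    return [asn if asn <= MAX_2BYTE_ASN else AS_TRANS for asn in as_path]

# 10000001 is the DC's 4-byte AS in this post; 100 and 101 fit in 2 bytes.
print(as_path_seen_by_2byte_peer([100, 101, 10000001]))  # [100, 101, 23456]
```

Every 4-byte AS collapses to the same value, which is why the path information is effectively lost on the old router.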

This introduces problem number 2: we now have unequal AS path lengths – and we need to use both WAN circuits. Local-AS doesn’t, by default, replace an AS. What it does is add another one. So here are the routes (11.11.11.11 is in the DC, and 99.99.99.99 is in the branch):

dc1-cs1# sh ip bgp 99.99.99.0/24
BGP routing table information for VRF default, address family IPv4 Unicast
BGP routing table entry for 99.99.99.0/24, version 44
Paths: (2 available, best #1)
Flags: (0x08001a) on xmit-list, is in urib, is best urib route, is in HW,

Advertised path-id 1
Path type: external, path is valid, is best path
AS-Path: 101 100 102 , path sourced external to AS
10.0.128.1 (metric 0) from 10.0.128.1 (192.168.0.21)
Origin IGP, MED not set, localpref 100, weight 0

Path type: external, path is valid, not best reason: AS Path
AS-Path: 999 201 200 202 , path sourced external to AS
10.1.0.1 (metric 0) from 10.1.0.1 (192.168.0.1)
Origin IGP, MED not set, localpref 100, weight 0

br1-isp1-ce>sh ip bgp 11.11.11.0
BGP routing table entry for 11.11.11.0/24, version 15
Paths: (1 available, best #1, table default)
Not advertised to any peer
Refresh Epoch 2
100 101 10000001
10.2.128.1 from 10.2.128.1 (192.168.0.17)
Origin IGP, localpref 100, valid, external, best
rx pathid: 0, tx pathid: 0x0

br1-isp2-ce>sh ip bgp 11.11.11.0
BGP routing table entry for 11.11.11.0/24, version 34
Paths: (1 available, best #1, table default)
Not advertised to any peer
Refresh Epoch 3
200 201 999 10000001
10.3.0.1 from 10.3.0.1 (192.168.0.29)
Origin IGP, localpref 100, valid, external, best
rx pathid: 0, tx pathid: 0x0

Obviously this won’t ECMP, as the path lengths differ in both directions when comparing the routes via the two ISPs. BGP adds the local AS rather than replacing the real one for a fairly good reason – if you replace the AS, and the route comes back to you from somewhere else, you’ve just got rid of eBGP’s main loop prevention mechanism. In our topology this wasn’t an issue though, so we could override that behaviour.
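A rough Python sketch of that loop prevention check, to show why replacing the AS is risky. This is a simplification: it only checks the router’s real AS number, platform behaviour around the local-as value may differ, and the leaked paths below are invented for illustration.

```python
def accept_ebgp_update(my_as, as_path):
    """Simplified eBGP loop check: reject a path that already contains our AS."""
    return my_as not in as_path

DC_AS = 10000001  # the DC's real AS from this post

# Hypothetical: the DC's own prefix leaking back in via the other ISP.
print(accept_ebgp_update(DC_AS, [101, 100, 200, 201, 999, DC_AS]))  # False: loop detected
print(accept_ebgp_update(DC_AS, [101, 100, 200, 201, 999]))         # True: real AS was replaced
```

Once the real AS is gone from the path, the second update looks like any other external route and would be accepted.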

This led to some debate and confusion. To override it, there are some keywords you can put after the local-as commands: no-prepend and replace-as. Maybe it was me – but I couldn’t work out exactly what they did from the docs. So here it is.

First, if we just put no-prepend:

router bgp 10000001
  neighbor 10.1.0.1 remote-as 201
    local-as 999 no-prepend

Then we get this at the DC side:

dc1-cs1(config-router-neighbor)# sh ip bgp 99.99.99.0
BGP routing table information for VRF default, address family IPv4 Unicast
BGP routing table entry for 99.99.99.0/24, version 56
Paths: (2 available, best #1)
Flags: (0x08001a) on xmit-list, is in urib, is best urib route, is in HW,

Advertised path-id 1
Path type: external, path is valid, is best path
AS-Path: 101 100 102 , path sourced external to AS
10.0.128.1 (metric 0) from 10.0.128.1 (192.168.0.21)
Origin IGP, MED not set, localpref 100, weight 0

Path type: external, path is valid, not best reason: newer EBGP path
AS-Path: 201 200 202 , path sourced external to AS
10.1.0.1 (metric 0) from 10.1.0.1 (192.168.0.1)
Origin IGP, MED not set, localpref 100, weight 0

Path-id 1 advertised to peers:
10.1.0.1

Brilliant. Now we have three AS hops. But wait:

br1-isp2-ce>sh ip bgp 11.11.11.0
BGP routing table entry for 11.11.11.0/24, version 46
Paths: (1 available, best #1, table default)
Not advertised to any peer
Refresh Epoch 3
200 201 999 10000001
10.3.0.1 from 10.3.0.1 (192.168.0.29)
Origin IGP, localpref 100, valid, external, best
rx pathid: 0, tx pathid: 0x0

We still have 4 hops from the branch back. So we’ll ECMP outbound from the DC, but the branch will still only use one circuit to get back. This is an important point:

no-prepend only applies to incoming updates

So what about replace-as?

Reconfigure again:

router bgp 10000001
  neighbor 10.1.0.1 remote-as 201
    local-as 999 no-prepend replace-as

Now we get exactly the same on the DC end, and the branch looks like this:

br1-isp2-ce>sh ip bgp 11.11.11.0
BGP routing table entry for 11.11.11.0/24, version 58
Paths: (1 available, best #1, table default)
Not advertised to any peer
Refresh Epoch 3
200 201 999
10.3.0.1 from 10.3.0.1 (192.168.0.29)
Origin IGP, localpref 100, valid, external, best
rx pathid: 0, tx pathid: 0x0

Looking at that AS path, it looks like the AS 10000001 has been completely replaced by 999 – hence replace-as. So from the branch now, each edge router has a 3 hop path back to the data centre and it’s good.
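Since this is the bit that confused me, here is a small Python model of the three local-as variants as observed in the outputs above. The function and mode names are my own for illustration; it describes what I saw on these boxes, not a statement of how every platform implements it.

```python
def local_as_paths(mode, real_as, local_as, learned_path):
    """Return (inbound, outbound) AS paths on the local-as peering.

    inbound  = AS path we store for a route learned from the neighbour
    outbound = AS path the neighbour receives for a route we originate
    """
    no_prepend = mode in ("no-prepend", "no-prepend replace-as")
    replace_as = mode == "no-prepend replace-as"
    inbound = list(learned_path) if no_prepend else [local_as, *learned_path]
    outbound = [local_as] if replace_as else [local_as, real_as]
    return inbound, outbound

# 99.99.99.0/24 as learned from dc1-isp2-ce, before any local-as handling.
learned = [201, 200, 202]
for mode in ("default", "no-prepend", "no-prepend replace-as"):
    print(mode, local_as_paths(mode, 10000001, 999, learned))
# default                ([999, 201, 200, 202], [999, 10000001])
# no-prepend             ([201, 200, 202], [999, 10000001])
# no-prepend replace-as  ([201, 200, 202], [999])
```

So no-prepend shortens what the DC sees, and replace-as shortens what the branch sees. You need both to get equal lengths in both directions.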

However – in the DC we still have some problems:

dc1-cs1# sh ip route 99.99.99.99
........
99.99.99.0/24, ubest/mbest: 1/0
*via 10.0.128.1, [20/0], 00:51:00, bgp-10000001, external, tag 101,

Whoops. Only one route made it into the RIB – but we have two in the BGP table. I need to enable maximum-paths (by default, this is 1):

dc1-cs1(config)# router bgp 10000001
dc1-cs1(config-router)# address-family ipv4 unicast
dc1-cs1(config-router-af)# maximum-paths 2

So now we have:

dc1-cs1# sh ip route 99.99.99.99
IP Route Table for VRF "default"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%' in via output denotes VRF 

99.99.99.0/24, ubest/mbest: 1/0
*via 10.0.128.1, [20/0], 00:53:05, bgp-10000001, external, tag 101,

dc1-cs1# sh ip bgp 99.99.99.0
BGP routing table information for VRF default, address family IPv4 Unicast
BGP routing table entry for 99.99.99.0/24, version 80
Paths: (2 available, best #1)
Flags: (0x08001a) on xmit-list, is in urib, is best urib route, is in HW, 
Multipath: eBGP

Advertised path-id 1
Path type: external, path is valid, is best path
AS-Path: 101 100 102 , path sourced external to AS
10.0.128.1 (metric 0) from 10.0.128.1 (192.168.0.21)
Origin IGP, MED not set, localpref 100, weight 0

Path type: external, path is valid, not best reason: newer EBGP path
AS-Path: 201 200 202 , path sourced external to AS
10.1.0.1 (metric 0) from 10.1.0.1 (192.168.0.1)
Origin IGP, MED not set, localpref 100, weight 0

Path-id 1 advertised to peers:
10.1.0.1

Well that’s annoying – we have two paths in the BGP table, but still only one makes it into the RIB.

Here is where I learned something totally new. Let’s look at BGP’s best path algorithm (WLLAOMNI):

  1. Prefer the path with the highest WEIGHT.
  2. Prefer the path with the highest LOCAL_PREF.
  3. Prefer the path that was locally originated via a network or aggregate BGP subcommand or through redistribution from an IGP.
  4. Prefer the path with the shortest AS_PATH.
  5. Prefer the path with the lowest origin type.
  6. Prefer the path with the lowest multi-exit discriminator (MED).
  7. Prefer eBGP over iBGP paths.
  8. Prefer the path with the lowest IGP metric to the BGP next hop.

After step 8, we determine if multiple paths require installation in the routing table for BGP Multipath. If they don’t, then we go on to some tie breakers. Simple, right?
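Steps 1 to 8 can be sketched in Python as a sort key, which makes the “tie after step 8” idea easier to see. This is a simplified model: the Path class and its field names are my own, an unset MED is treated as 0, and MED is compared across all paths (real BGP only compares it between paths from the same neighbouring AS by default).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Path:
    next_hop: str
    as_path: tuple
    weight: int = 0
    local_pref: int = 100
    locally_originated: bool = False
    origin: int = 0          # 0 = IGP, 1 = EGP, 2 = incomplete
    med: int = 0
    is_ebgp: bool = True
    igp_metric: int = 0

def preference_key(p):
    """Smaller tuple = more preferred, in the order of steps 1-8."""
    return (-p.weight, -p.local_pref, not p.locally_originated,
            len(p.as_path), p.origin, p.med, not p.is_ebgp, p.igp_metric)

isp1 = Path("10.0.128.1", (101, 100, 102))
isp2_default = Path("10.1.0.1", (999, 201, 200, 202))   # plain local-as
isp2_no_prepend = Path("10.1.0.1", (201, 200, 202))     # local-as no-prepend

print(preference_key(isp1) < preference_key(isp2_default))      # True: wins on step 4
print(preference_key(isp1) == preference_key(isp2_no_prepend))  # True: tie after step 8
```

Only paths that tie like the last pair are even considered for multipath.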

Right then. Why don’t we have two paths in the RIB? In the output above, the reason given for the second path not being best was that it was a newer route – but that’s a tie breaker, further down the list, after the multipath decision is made. So the paths tie on all eight steps. But for multipath, the AS paths have to be equal, not just equal length. That wasn’t clear to me in the docs. I read a note buried somewhere on someone’s blog (sorry, I can’t credit it because I couldn’t find it again – I read lots of stuff that day!) which pointed me in the right direction. I’ll put that again in bold:

When considering a route for multipath, the AS Path must be equal, not just equal length.

This seems odd to me. I guess it means that, by default, multipath only works when you are dual connected to a single ISP.

Anyway, we can turn this off:

dc1-cs1(config)# router bgp 10000001
dc1-cs1(config-router)# bestpath as-path multipath-relax

And now we have two routes in the RIB:

dc1-cs1# sh ip route 99.99.99.99
....
99.99.99.0/24, ubest/mbest: 2/0
*via 10.0.128.1, [20/0], 01:06:15, bgp-10000001, external, tag 101, 
*via 10.1.0.1, [20/0], 00:01:29, bgp-10000001, external, tag 201,

Note that multipath-relax only relaxes the requirement that the AS paths be identical – they still have to be the same length.
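Putting maximum-paths and multipath-relax together, here is a rough Python sketch of which next hops end up in the RIB. The function name and arguments are hypothetical, and it assumes the paths passed in already tie on steps 1 to 8, with the best path first.

```python
def rib_next_hops(paths, maximum_paths=1, relax=False):
    """paths: list of (next_hop, as_path) tuples, best path first."""
    best_hop, best_path = paths[0]
    chosen = [best_hop]
    for hop, as_path in paths[1:]:
        if len(chosen) >= maximum_paths:
            break
        if relax:
            eligible = len(as_path) == len(best_path)       # same length is enough
        else:
            eligible = tuple(as_path) == tuple(best_path)   # must be identical
        if eligible:
            chosen.append(hop)
    return chosen

paths = [("10.0.128.1", (101, 100, 102)), ("10.1.0.1", (201, 200, 202))]
print(rib_next_hops(paths))                               # ['10.0.128.1']
print(rib_next_hops(paths, maximum_paths=2))              # ['10.0.128.1']
print(rib_next_hops(paths, maximum_paths=2, relax=True))  # ['10.0.128.1', '10.1.0.1']
```

The three calls line up with the three sh ip route outputs in this post: the defaults, maximum-paths 2 on its own, and maximum-paths 2 with multipath-relax.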

Sample VIRL topology here if anyone wants it: BGP ECMP

Thanks for reading!
