The Networking Brain Dump, Part 5: BGP
Part 5 of 6. This is the one that actually connects the internet together. BGP is what ISPs use to exchange routes with each other, what large enterprises use to connect to multiple providers, and what every autonomous system on the planet uses to announce what address space it owns. It also has the most configuration surface area of almost any routing protocol you'll encounter.
What BGP Is
BGP (Border Gateway Protocol) is the only routing protocol that runs over TCP. It uses port 179. EIGRP and OSPF operate directly over IP. BGP relies on TCP for reliable, ordered delivery instead of building that reliability into the routing protocol itself.
BGP is a path-vector protocol. It doesn't select routes based on a numeric metric the way OSPF uses cost or EIGRP uses its composite formula. Instead, it makes decisions based on path attributes: things like the list of autonomous systems a route passes through, the origin of the route, local preference values, and a dozen other attributes that can be tuned with policy. This makes BGP extremely flexible for traffic engineering and extremely easy to misconfigure.
When two BGP routers establish a TCP connection on port 179, they become peers (or neighbors). They exchange their full BGP routing tables once. After that initial exchange, they only send incremental updates as routes change. BGP does not send periodic full-table refreshes the way some other protocols do.
eBGP vs iBGP
BGP has two operating modes depending on whether the peers are in the same autonomous system.
eBGP (external BGP): Peering between routers in different autonomous systems.
- By default, eBGP peers must be directly connected. If they're not, they need reachability to each other's neighbor address (e.g., static route), and you usually need 'ebgp-multihop' or equivalent TTL adjustment.
- TTL on eBGP packets is 1 by default, which is why direct connectivity is the default assumption.
- Each eBGP router adds its own AS number to the AS_PATH attribute before advertising a route to the next AS. This is how BGP detects and prevents loops: if a router sees its own AS in the AS_PATH of an incoming route, it discards that route.
- eBGP has an administrative distance of 20.
- When an eBGP router advertises a route, it sets itself as the next-hop for that route.
iBGP (internal BGP): Peering between routers within the same autonomous system.
- iBGP peers do not need to be directly connected. They typically have an IGP like OSPF running between them to provide reachability for the TCP session.
- Same AS number on both sides.
- iBGP does not change the next-hop attribute when forwarding routes to other iBGP peers. The next-hop remains whatever the originating eBGP router set it to. This matters: an iBGP router that receives a route with a next-hop pointing to an external eBGP router needs to have an IGP route to that eBGP router, or it won't be able to use the BGP route. The fix is the
next-hop-selfcommand, which tells the iBGP router to replace the next-hop with its own address before advertising to iBGP peers. - iBGP has an administrative distance of 200, higher than any IGP. BGP routes are less trusted than internal routes by design.
BGP Messages
Open: Sent after the TCP handshake completes. The two routers use this to agree on:
- AS number
- Hold time (how long to wait between keepalives before declaring the session dead)
- Router ID
Update: The workhorse message. Carries new routes, changed routes, and withdrawn routes. Each Update contains one advertisement per unique path, listing every network reachable via that path. This list of reachable networks is called NLRI (Network Layer Reachability Information). An Update can also carry withdrawn routes, telling the peer to remove those prefixes.
Notification: Sent when an error condition is detected. The session is closed immediately after. Notifications carry an error code and subcode describing what went wrong.
Keepalive: Sent at regular intervals to confirm the session is still alive. If the hold time expires without a keepalive, the session is torn down.
BGP States
BGP goes through a defined state machine when establishing a session. Knowing these states tells you exactly where in the process a session is stuck.
Idle: The starting state. Begins as soon as the neighbor command is configured. The router is waiting to initiate a TCP connection.
Connect: The router has initiated the TCP three-way handshake and is waiting for it to complete.
OpenSent: The TCP session is up. The router has sent an Open message and is waiting for one in return.
OpenConfirm: An Open message has been received. The router is waiting for a Keepalive to confirm the session parameters are acceptable.
Established: The session is fully operational. Both sides are exchanging BGP routes and sending keepalives. This is the only state where useful work is happening.
Active: The router is actively trying to establish the TCP session. If it stays in Active, something is wrong. If a BGP session is stuck in Active, it typically means the peer is unreachable, the neighbor address is misconfigured, or there's a firewall blocking port 179.
BGP Multihoming
Multihoming describes how an organization connects to one or more ISPs. BGP's role and complexity varies depending on the design.
Single-homed: One link to one ISP. There's no redundancy and typically no need for BGP. Static routes to the ISP are sufficient. BGP would add complexity with no benefit.
Dual-homed: Two or more links to the same ISP. Redundancy exists, but only to one provider. BGP is useful here for managing bandwidth across multiple links and for controlling which link carries which traffic.
Single multi-homed: One link to each of two or more ISPs. The organization uses provider-independent addressing, meaning it obtains its own address block from a regional registry like ARIN rather than using addresses assigned by the ISPs. This is usually the clean design because the same prefixes need to be advertised to multiple providers. The organization is responsible for advertising those prefixes to each ISP. One critical rule here: do not advertise routes learned from one ISP to the other. If you do, you become a transit path between two ISPs, and your links become responsible for routing internet traffic that has nothing to do with you.
Dual multi-homed: Two or more links to two or more ISPs. The most resilient configuration. Multiple routers on each side, provider-independent addressing, full control over traffic engineering. Same rule applies: no transit between ISPs.
BGP Route Reflectors
In a standard iBGP deployment, every iBGP router must peer with every other iBGP router directly. This full-mesh requirement becomes unmanageable at scale: 10 routers need 45 peering sessions, 100 routers need 4,950. Route reflectors solve this.
The AS is divided into clusters. Each cluster has one or more route reflectors that peer with all their clients. Route reflectors within the AS peer with each other, and with any non-client iBGP routers.
When a route reflector receives a route from an iBGP peer, it reflects that route to its other iBGP peers rather than dropping it (which is what a normal iBGP router would do, since iBGP-learned routes are not advertised to other iBGP peers in a standard setup). Loop prevention works as follows:
- When a route reflector forwards a route it received from an iBGP peer, it adds that peer's Router ID as the Originator ID attribute, if one isn't already present.
- It also adds its own cluster's identifier as the Cluster ID.
- If a router receives a route with its own Router ID as the Originator ID, it discards the route. If it sees its own Cluster ID in the cluster list, it discards the route. This prevents loops.
The next-hop issue still applies: iBGP route reflectors don't change the next-hop attribute when reflecting routes. The reflected routes still point to the original eBGP next-hop. Every iBGP router receiving those routes needs a path to that next-hop via its IGP, or the routes are unusable. The usual fix is to set next-hop handling intentionally. In Cisco devices, for example, the next-hop-self command on the reflector resolves this.
BGP Confederations
Confederations are another approach to the iBGP full-mesh problem. Instead of using a logical hierarchy of route reflectors, the AS is split into sub-ASes.
Each sub-AS uses a private AS number internally within the confederation and runs iBGP between its own routers. Between sub-ASes, eBGP-like sessions are used, which means routes can be advertised without the full-mesh restriction. As a route travels between sub-ASes, each sub-AS appends its sub-AS number to the AS_PATH attribute.
When the route finally exits the confederation AS to an external peer, the border router strips all the confederation sub-AS information from the AS_PATH and replaces it with the confederation's real, public AS number. External peers see a clean AS_PATH with just the real AS, as if the internal sub-AS structure doesn't exist.
Loop prevention works normally: the confederation AS_PATH information prevents loops within the confederation, and the real AS number in the external AS_PATH prevents loops between autonomous systems.
The next-hop is not changed when routes pass between sub-ASes, which means you still need an IGP running across the entire confederation to ensure reachability for all next-hops, or you need the next-hop-self command configured at the sub-AS boundaries.
Full-mesh peering is still required within each sub-AS. Route reflectors can be added within sub-ASes to reduce that requirement, and confederations and route reflectors can be used together in the same deployment.
Redistributing into BGP
To redistribute OSPF routes into BGP (Cisco-specific command):
router bgp <AS-number>
redistribute ospf <process-id> match [internal] [external {1 | 2}]
The internal keyword includes OSPF intra-area and inter-area routes. The external keyword includes redistributed external routes, with 1 or 2 specifying whether to include E1 or E2 route types. You can include both keywords to redistribute all OSPF route types into BGP.
Part 6 is the last one. It covers IP services (NTP and FHRP), multicast from IGMP through PIM sparse mode and rendezvous point discovery, GRE tunnels, LISP, and MPLS including VRFs, route distinguishers, route targets, and DMVPN.
Part 5 of 6 in the Networking Brain Dump series.