IPSec VPNs: How They Actually Work
These are my notes (cleaned up by an LLM) on IPSec VPN tunnels, the kind that connect branch offices, remote workers, and cloud environments over the public internet. Most resources either stay at a conceptual level ("it's an encrypted tunnel") or jump straight into vendor-specific configuration without explaining why anything works the way it does. This one tries to bridge that gap, covering the protocol mechanics in enough depth that the configuration options actually make sense. The troubleshooting section at the end is the part I've found most useful in practice.
What is a VPN
A VPN (Virtual Private Network) creates a secure communication channel between two points across an untrusted network. Four properties it provides:
- Authentication: Verifies the identity of the devices or users on each end.
- Data integrity: Uses hashing to ensure traffic hasn't been modified in transit.
- Data confidentiality: Encrypts traffic so it's unreadable to anyone intercepting it on the wire.
- Secure reachability: Makes resources at one site accessible from another over an untrusted network, as if they were privately reachable.
Site-to-Site vs Client-to-Site
Site-to-site: Connects two entire networks together. Edge devices (routers, firewalls, or VPN gateways) on each side form an encrypted tunnel between them. Nodes on Site A reach nodes on Site B over routed private connectivity. Authentication is between the edge devices, not individual users. The tunnel can be configured to stay up, or brought up by interesting traffic depending on the platform and design.
Client-to-site: A single user's device connects to a corporate VPN gateway over the internet. Each client has its own tunnel. The client device runs a VPN agent that handles connection setup and maintains the tunnel. Authentication is typically via certificate or username and password. Users connect and disconnect as needed rather than maintaining a permanent tunnel.
IPSec
IPSec (IP Security) is the protocol suite that underpins most enterprise VPNs. It operates at Layer 3 of the OSI model and supports both site-to-site and client-to-site configurations.
IPSec has two operating modes, selected as part of Phase 2 negotiation:
Transport Mode: Encrypts only the payload (Layer 4 and above). The original IP header is left in the clear. Commonly used for host-to-host security, though it can also appear in specific VPN designs like L2TP over IPSec. Packet structure: [Original IP Header | IPSec Header | Encrypted Payload].
Tunnel Mode: Encrypts the entire original IP packet (the original IP header, the transport layer header, and the payload) and encapsulates it in a new outer IP header. The outer header uses the public IPs of the VPN peers. Internal IPs are hidden inside the encrypted payload. This is the mode used for most site-to-site and client-to-site VPNs. Packet structure: [New IP Header | IPSec Header | Encrypted Original Packet].
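As a rough illustration of the two packet structures above, the following Python sketch models a packet as a list of layer names. The function name `encapsulate` and the layer labels are made up for this example, and the ESP trailer and integrity value are left out to keep the layout readable.

```python
def encapsulate(mode, original):
    """Return the on-the-wire layer order for an ESP-protected packet.

    `original` is a list such as ["IP(inner)", "TCP", "payload"].
    Layers wrapped in "enc(...)" are encrypted; everything else is visible.
    """
    ip_header, rest = original[0], original[1:]
    if mode == "transport":
        # Original IP header stays in the clear; Layer 4 and above is encrypted.
        return [ip_header, "ESP"] + [f"enc({layer})" for layer in rest]
    if mode == "tunnel":
        # The whole original packet is encrypted and wrapped in a new outer header.
        return ["IP(outer, peer public IPs)", "ESP"] + [f"enc({layer})" for layer in original]
    raise ValueError(f"unknown mode: {mode}")


packet = ["IP(inner)", "TCP", "payload"]
print(encapsulate("transport", packet))
print(encapsulate("tunnel", packet))
```

In the tunnel-mode output the inner IP header appears only inside `enc(...)`, which is why internal addresses are not visible on the wire.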
IKE: The Negotiation Framework
Before any data traffic can be encrypted and sent, the two VPN peers need to agree on exactly how they'll communicate: which encryption algorithms, which hash functions, how long the keys will be valid. This negotiation is handled by IKE (Internet Key Exchange).
IKE works in two phases. Phase 1 builds a secure control channel between the peers. Phase 2 uses that control channel to negotiate the actual IPSec tunnel that carries data traffic. Both sides must use the same IKE version. Mixing IKEv1 on one side with IKEv2 on the other prevents the tunnel from forming.
IKEv1 Phase 1
Phase 1 works with the public IP addresses of the VPN peers, the addresses routable over the internet. Its output is a secure, authenticated channel the two devices can use to complete Phase 2 negotiation.
HAGLE
Phase 1 negotiates five parameters, remembered as HAGLE. All five values must match on both sides for Phase 1 to succeed.
H — Hashing: The algorithm used to generate integrity check values for the IKE Phase 1 negotiation. MD5 and SHA are common older options. The peers agree on one algorithm to use going forward.
A — Authentication: How the devices prove their identity to each other. Two options: a pre-shared key (PSK), essentially a password configured on both devices that must match, or a digital certificate issued by a trusted CA and pre-installed on both peers.
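G — Group (Diffie-Hellman): The Diffie-Hellman group number specifies the parameters for the DH key exchange that produces the shared secret from which the symmetric keys are derived. Higher group numbers generally mean stronger groups that are more computationally expensive, but the numbering isn't strictly ordered by strength (elliptic-curve groups such as 19 and 20 use much smaller keys than the modular groups for comparable security).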
L — Lifetime: How long the Phase 1 tunnel remains valid before it's torn down and rebuilt. Specified in seconds, kilobytes, or both. The default for most vendors is one day (86400 seconds). When the lifetime expires, the peers renegotiate rather than continuing to use aging keys indefinitely.
E — Encryption: The symmetric encryption algorithm used to protect the IKE Phase 1 control channel: 3DES, AES-128, AES-256, and so on. Phase 2 negotiates the encryption used for the actual IPSec data tunnel.
Phase 1 consists of three sub-phases: negotiating the SA settings (the proposal carries all five HAGLE values), performing the DH exchange (using the agreed group, G), and authenticating (using the agreed method, A). Across the two available Phase 1 modes, these steps produce either 6 or 3 messages total.
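The "all five must match" rule can be sketched as a simple comparison of two proposals. This is a minimal Python illustration, not how any vendor implements it: the field names and the `hagle_mismatches` helper are hypothetical, and it follows these notes in treating lifetime as a strict match even though some platforms are more lenient there.

```python
HAGLE_FIELDS = ("hash", "auth", "dh_group", "lifetime", "encryption")


def hagle_mismatches(local, remote):
    """Return the HAGLE fields whose values differ between two proposals."""
    return [field for field in HAGLE_FIELDS if local.get(field) != remote.get(field)]


site_a = {"hash": "sha256", "auth": "psk", "dh_group": 14,
          "lifetime": 86400, "encryption": "aes256"}
site_b = {"hash": "sha256", "auth": "psk", "dh_group": 5,
          "lifetime": 86400, "encryption": "aes128"}

print(hagle_mismatches(site_a, site_b))  # ['dh_group', 'encryption']
print(hagle_mismatches(site_a, site_a))  # []
```

Writing both sides' settings down in this form before opening a debug session makes it easier to spot which parameter a "no proposal chosen" style error is pointing at.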
Main Mode (6 Messages)
Main Mode is slower but more secure. The authentication values in messages 5 and 6 are encrypted.
- Initiator → Responder: Proposal for the Security Association. Contains the Phase 1 parameters the initiator wants to use, including H, A, G, L, and E.
- Responder → Initiator: Reply with SA parameters. If the responder's configuration matches, it echoes back the accepted values. If nothing matches, the tunnel fails here.
- Initiator → Responder: Diffie-Hellman public value and nonce from the initiator.
- Responder → Initiator: Diffie-Hellman public value and nonce from the responder. After messages 3 and 4, both sides can independently calculate the same shared symmetric key without either one having transmitted the key over the wire.
- Initiator → Responder: Authentication, either a PSK-derived authentication hash or certificate/signature-related material. Encrypted using the symmetric key established in the previous step.
- Responder → Initiator: Authentication confirmation, same general format as message 5. Encrypted.
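The claim about messages 3 and 4, that both sides arrive at the same secret without ever sending it, can be shown with a toy Diffie-Hellman exchange in Python. The modulus and generator below are illustrative assumptions chosen for readability; real IKE uses standardized groups with far larger parameters.

```python
import secrets

# Toy parameters for illustration only, not a real IKE group.
P = 2**127 - 1   # a Mersenne prime, far too small for real use
G = 3


def dh_keypair():
    """Return (private, public) where public = G^private mod P."""
    private = secrets.randbelow(P - 3) + 2   # private value in [2, P-2]
    return private, pow(G, private, P)


def dh_shared(own_private, peer_public):
    """Combine our private value with the peer's public value."""
    return pow(peer_public, own_private, P)


init_priv, init_pub = dh_keypair()   # initiator sends init_pub in message 3
resp_priv, resp_pub = dh_keypair()   # responder sends resp_pub in message 4

# Only the public values crossed the wire, yet both computed secrets match.
assert dh_shared(init_priv, resp_pub) == dh_shared(resp_priv, init_pub)
print("shared secret agreed")
```

An eavesdropper sees `init_pub` and `resp_pub` but neither private value, which is what lets messages 5 and 6 be encrypted with a key that was never transmitted.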
The FQDN + PSK limitation in Main Mode:
VPN peers can be identified by IP address or by FQDN (domain name). If using IP addresses, PSK or certificate authentication both work cleanly. With IKEv1 Main Mode and per-peer PSKs, FQDN or dynamic-peer identification becomes difficult because the correct PSK normally has to be selected before the encrypted identity payload is available. Here's why:
In Main Mode, messages 1 through 4 carry no identity beyond the packets' IP addresses. The peer's identity (e.g., vpn.example.com) is revealed in messages 5 and 6, which are encrypted. With PSK authentication, the PSK itself feeds into the keys that encrypt those messages, so each side needs to know which PSK to apply before it can decrypt them. The PSK is mapped to a peer identity in the VPN configuration. The problem: when message 5 arrives, the responder knows only the initiator's source IP address, and for a peer identified by FQDN, that IP may be dynamic. There's no reliable way to select the PSK before decryption, and decryption requires the PSK. Circular dependency. Per-peer PSK with FQDN or dynamic-peer identification in Main Mode is where this commonly breaks.
Certificates solve this because they carry the peer's identity inside the certificate itself, verifiable without a pre-mapped shared secret.
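The circular dependency can be made concrete with a short Python sketch of the responder's position after message 4. It relies on the IKEv1 rule that, with PSK authentication, the base secret is derived as prf(PSK, Ni | Nr). Everything else is an assumption for illustration: the lookup tables, addresses, and secrets are invented, and HMAC-SHA1 stands in for whatever prf was negotiated.

```python
import hashlib
import hmac

# Hypothetical responder configuration: per-peer PSKs keyed by peer address.
PSK_BY_IP = {"203.0.113.10": b"branch-a-secret"}
# A PSK keyed by FQDN is unusable at this point: the ID payload is still encrypted.
PSK_BY_FQDN = {"vpn.example.com": b"branch-b-secret"}


def skeyid_for_psk(psk, nonce_i, nonce_r):
    """IKEv1 PSK auth: SKEYID = prf(PSK, Ni | Nr). HMAC-SHA1 stands in for the prf."""
    return hmac.new(psk, nonce_i + nonce_r, hashlib.sha1).digest()


def responder_select_psk(source_ip):
    """After message 4 the responder knows only the source IP, so that is the lookup key."""
    return PSK_BY_IP.get(source_ip)


nonce_i, nonce_r = b"\x01" * 16, b"\x02" * 16

# Known static peer: a PSK is found, so keys for messages 5 and 6 can be derived.
psk = responder_select_psk("203.0.113.10")
print(len(skeyid_for_psk(psk, nonce_i, nonce_r)))  # 20

# Peer arriving from an unknown (dynamic) address: no PSK, so no keys,
# so the message that would reveal "vpn.example.com" cannot be decrypted.
print(responder_select_psk("198.51.100.77"))  # None
```

The FQDN table is never consulted, which is the whole problem: the only key available for the lookup is the source address.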
Aggressive Mode (3 Messages)
Aggressive Mode is faster but less secure. Authentication values are transmitted in plaintext.
- Initiator → Responder: Combines the SA proposal (H, L, E values), DH key exchange, and the initiator's identity into a single message.
- Responder → Initiator: Agrees to SA values, sends its DH key exchange, and provides authentication material (a PSK-derived hash or certificate/signature-related material). All in plaintext.
- Initiator → Responder: Authentication (PSK-derived hash or certificate/signature-related material). In plaintext.
The tradeoff is obvious: faster negotiation, but the authentication material is visible on the wire to anyone capturing traffic. An attacker capturing Phase 1 Aggressive Mode traffic can attempt offline dictionary attacks against the PSK.
Because the peer identity is exchanged in message 1 before encryption is established, there's no FQDN + PSK limitation. Either PSK or certificate authentication works with either IP or FQDN identification in Aggressive Mode.
IKEv1 Phase 2
Phase 2 runs inside the encrypted channel established by Phase 1. It negotiates the actual IPSec tunnel that will carry data traffic and operates on internal IP addresses rather than public peer IPs.
Security Association
Once Phase 2 is established and actively carrying traffic, the collection of agreed parameters is called the Security Association (SA). All SA values must match on both sides. The one exception: encryption domain subnets are mirrored (what one side calls its local encryption domain, the other side calls its remote, and vice versa).
Phase 2 negotiates:
- Hashing: Same concept as Phase 1, agree on the integrity algorithm.
- Lifetime: Phase 2 lifetimes are typically shorter than Phase 1. A common default is one hour (3600 seconds), though defaults vary by vendor. The Phase 2 tunnel renegotiates more frequently than Phase 1.
- Encryption: Agree on the symmetric encryption algorithm.
- Encryption domain: Agree on the internal subnets that define which traffic should be encrypted and sent through the tunnel. On Site A's router, the local encryption domain is Site A's subnet and the remote encryption domain is Site B's subnet. On Site B's router, those are reversed.
Quick Mode (3 Messages)
Phase 2 uses Quick Mode, a 3-message exchange running inside the encrypted Phase 1 channel:
- Initiator → Responder: Propose the IPSec SA parameters and traffic selectors (encryption domains). Encrypted.
- Responder → Initiator: Agree to the SA and traffic selectors. Encrypted.
- Initiator → Responder: Confirm. Encrypted.
IKEv2
IKEv2 is the modern replacement for IKEv1. It is more efficient and more resilient, and it addresses several weaknesses in IKEv1.
Built-in improvements:
- DPD/Keepalive: Dead Peer Detection is native in IKEv2. Typically, the tunnel detects an unreachable peer and re-establishes automatically, though re-establishment depends on platform behavior, configuration, and whether traffic or auto-negotiate brings the tunnel back up.
- MOBIKE: If supported and enabled, MOBIKE allows the tunnel to survive address changes (a mobile device switching from Wi-Fi to cellular, or a dynamic IP address changing) without requiring a full re-authentication.
- PFS support: IKEv2 supports PFS for Child SAs, but each Child SA only gets its own additional DH exchange if that behavior is configured and negotiated.
IKEv2 Connection Setup (4 Messages)
IKE_SA_INIT:
- Request (Initiator → Responder): Proposes cryptographic algorithms, sends nonces, includes Diffie-Hellman material, and may include NAT detection information.
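- Reply (Responder → Initiator): Selects one of the proposed algorithm sets and returns its own nonce, Diffie-Hellman material, and NAT detection information.
- Result: The IKE SA is created, a secure control channel with its own DH-derived key. This is roughly equivalent to IKEv1 Phase 1, except that peer authentication is deferred to IKE_AUTH.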
IKE_AUTH (Encrypted):
- Request (Initiator → Responder): Authenticates using PSK or certificate.
- Reply (Responder → Initiator): Authenticates the responder using the configured method.
- Result: The first Child SA is created, the data tunnel that carries actual VPN traffic. Additional DH/PFS behavior depends on what is negotiated and configured. This is equivalent to IKEv1 Phase 2.
Child SAs
A Child SA is the IPSec SA that protects actual data traffic. It's built under the IKE SA: the first one is created during IKE_AUTH as part of the 4-message exchange, and additional ones are created later via CREATE_CHILD_SA exchanges. One IKE SA can support multiple Child SAs (for example, different Child SAs for different subnet pairs).
Each Child SA has its own keying material. If PFS is configured and negotiated, a Child SA can also include an additional DH exchange for fresher, more independent keys. This is PFS at the per-SA level.
Client-to-Site: The VPN Adapter
For client-to-site connections, the VPN client on the user's device does more than establish the tunnel. It creates a virtual network adapter that is assigned VPN-specific network parameters: usually an IP from a VPN address pool, the corporate DNS server, and other network parameters. Traffic destined for the corporate network uses this adapter. From the perspective of corporate systems, the client appears reachable through the VPN.
Traffic is encrypted at the VPN peer interface on the gateway side, not within the corporate LAN. Decryption works the same way: incoming encrypted traffic is decrypted when it arrives on the VPN peer interface (the interface with the public IP). Internal traffic between devices on the same site is never encrypted by the VPN.
IPSec Routing Topologies
Full mesh / Site-to-site: Every edge device has a direct IPSec tunnel to every other edge device. Maximum redundancy (if one tunnel fails, traffic can often be rerouted through another). The cost scales with the number of sites: N sites require N(N-1)/2 tunnels, and each VPN gateway must handle N-1 simultaneous tunnels. Gets expensive quickly.
Hub and spoke / Star: A central hub device acts as the VPN server. All other sites (spokes) connect to the hub. Traffic between two spoke sites flows through the hub. Simpler to manage and less expensive than full mesh. The tradeoff: the hub is a single point of failure. If it goes down, all spoke-to-spoke connectivity is lost.
Hybrid topology: Multiple hub devices connected to each other in full mesh, with spoke sites connecting to one (or more) of the hubs. The hubs provide redundancy between themselves. Spokes benefit from that redundancy without requiring their own full mesh links. A balance between cost and resilience.
Client-to-site: Individual host endpoints connect to a central hub device using VPN client software. Each client has its own tunnel. Scales to large numbers of remote users.
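To put numbers on the full mesh versus hub-and-spoke cost comparison above, here is a small Python calculation. The function names are just for this example, and the hub-and-spoke count assumes a single hub with every other site as a spoke.

```python
def full_mesh_tunnels(sites):
    """Every site pairs with every other site: N(N-1)/2 tunnels."""
    return sites * (sites - 1) // 2


def hub_and_spoke_tunnels(sites):
    """One hub, with each of the remaining sites holding a single tunnel to it."""
    return sites - 1


for n in (4, 10, 50):
    print(f"{n} sites: full mesh = {full_mesh_tunnels(n)}, "
          f"hub and spoke = {hub_and_spoke_tunnels(n)}")
# 4 sites: full mesh = 6, hub and spoke = 3
# 10 sites: full mesh = 45, hub and spoke = 9
# 50 sites: full mesh = 1225, hub and spoke = 49
```

The gap grows quadratically, which is why hybrid designs keep the full mesh limited to a handful of hubs.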
DPD — Dead Peer Detection
DPD allows a VPN device to detect when the peer on the other end of a tunnel has become unreachable. Without it, a device might continue believing a tunnel is active long after the peer has gone offline, failing to attempt re-establishment.
DPD works by having the requester send a probe ("are you there?") to the peer. The peer responds with an acknowledgment. Three configuration modes:
On-demand: Send a probe only when there is outbound traffic to send but nothing has been received from the peer. If inbound traffic is flowing, the peer is clearly reachable, so no probe is needed.
On-idle: Send a probe whenever the tunnel has been idle, with nothing inbound or outbound. More aggressive than on-demand because probes are sent even when no traffic is waiting to go out.
Disabled: The device doesn't send its own probes. It still responds to probes from the peer, acting only as a responder. On some platforms, this setting disables DPD entirely.
DPD configuration must be compatible on both ends. A mismatch (one side sending probes that the other side isn't configured to respond to correctly) can cause unexpected tunnel behavior.
PFS — Perfect Forward Secrecy
Without PFS enabled, Phase 2 keying material is derived from the existing Phase 1 keying material rather than from a fresh Diffie-Hellman exchange. If the relevant Phase 1 keying material is compromised, every Phase 2 key derived from it is exposed as well.
With PFS enabled, an additional Diffie-Hellman key exchange is performed in Phase 2, contributing fresh key material that is not solely derived from Phase 1. Compromising Phase 1 keying material no longer automatically exposes later Phase 2 traffic. If a Phase 2 session key is compromised, it affects only that session, not past sessions.
The cost: DH key exchange is computationally expensive. Performing it twice per tunnel establishment increases CPU load. For most environments with modern hardware, this is negligible. PFS is best practice and should be enabled.
If PFS is disabled for performance reasons, the risk is clear: Phase 1 key material compromise cascades to all Phase 2 sessions derived from it.
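The difference PFS makes can be sketched with a simplified version of the IKEv1 Quick Mode key derivation. Without PFS the keying material is prf(SKEYID_d, protocol | SPI | Ni | Nr); with PFS the fresh DH secret is prepended to the prf input. The values below are placeholders, and HMAC-SHA256 stands in for the negotiated prf, so treat this as a model of the dependency rather than a wire-accurate implementation.

```python
import hashlib
import hmac


def keymat(skeyid_d, spi, nonce_i, nonce_r, dh_secret=b""):
    """Simplified IKEv1 Quick Mode key derivation.

    Without PFS: prf(SKEYID_d, protocol | SPI | Ni | Nr)
    With PFS:    prf(SKEYID_d, g^xy | protocol | SPI | Ni | Nr)
    """
    protocol = b"\x03"  # protocol ID 3 = ESP
    data = dh_secret + protocol + spi + nonce_i + nonce_r
    return hmac.new(skeyid_d, data, hashlib.sha256).digest()


# Placeholder values. Assume the attacker has obtained the Phase 1 secret and
# can therefore also read the SPI and nonces exchanged in Quick Mode.
skeyid_d = b"phase1-derived-secret"
spi, ni, nr = b"\x00\x00\x10\x01", b"nonce-init", b"nonce-resp"

real_no_pfs = keymat(skeyid_d, spi, ni, nr)
real_pfs = keymat(skeyid_d, spi, ni, nr, dh_secret=b"fresh-phase2-dh-secret")

attacker_guess = keymat(skeyid_d, spi, ni, nr)
print(attacker_guess == real_no_pfs)  # True: Phase 1 compromise cascades
print(attacker_guess == real_pfs)     # False: the fresh DH secret is missing
```

The second comparison fails because the Phase 2 DH secret never depended on Phase 1 material, which is the property PFS buys.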
NAT-T — NAT Traversal
IPSec and NAT don't naturally coexist. ESP (Encapsulating Security Payload), the protocol that carries encrypted IPSec data, operates at Layer 3 directly over IP (protocol number 50). It has no port numbers. NAT devices track connections by mapping IP and port information, but ESP gives them nothing to work with. In Tunnel Mode, the original TCP/UDP headers are inside the encrypted ESP payload. NAT can't see them either.
NAT-T solves this by wrapping IPSec packets in a UDP header using port 4500. NAT devices see standard UDP traffic and can track the connection normally.
How NAT-T negotiates in Main Mode:
During messages 3 and 4 of the Phase 1 Main Mode exchange (sent over UDP 500), both peers include NAT detection payloads. These payloads contain hashes that allow each side to determine whether a NAT device exists in the path. If NAT is detected and both sides support NAT-T, the exchange switches from UDP 500 to UDP 4500 starting with message 5.
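The NAT detection hashes can be sketched in Python. Per the NAT-T specification, each hash covers the two IKE cookies plus an IP address and port, so the receiver can recompute it over the address and port it actually observed on the packet. The cookie values, addresses, and helper names below are invented for the example, and SHA-1 stands in for the negotiated hash.

```python
import hashlib
import ipaddress
import struct


def nat_d_hash(cookie_i, cookie_r, ip, port):
    """NAT-D payload value: HASH(CKY-I | CKY-R | IP | Port)."""
    data = cookie_i + cookie_r + ipaddress.ip_address(ip).packed + struct.pack("!H", port)
    return hashlib.sha1(data).digest()


def nat_in_path(cookie_i, cookie_r, claimed_hash, observed_ip, observed_port):
    """True if the sender's claimed address/port differs from what arrived on the wire."""
    return nat_d_hash(cookie_i, cookie_r, observed_ip, observed_port) != claimed_hash


cky_i, cky_r = b"\xaa" * 8, b"\xbb" * 8

# The sender hashes its own view of its source address and port.
claimed = nat_d_hash(cky_i, cky_r, "192.168.1.50", 500)

# No NAT: the packet arrives from the same address and port the sender hashed.
print(nat_in_path(cky_i, cky_r, claimed, "192.168.1.50", 500))    # False

# NAT/PAT in the path: the receiver sees a translated address and port.
print(nat_in_path(cky_i, cky_r, claimed, "203.0.113.10", 62001))  # True
```

Because only hashes are exchanged, neither side has to disclose its private address to learn that a translation happened.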
Why UDP 4500 helps NAT devices:
NAT creates a mapping between the internal address and port and the external address and port. NAT-T wraps IPSec traffic in UDP so the NAT device has ports it can track. The peer may send traffic using UDP 4500, but a PAT device can still translate the external source port. The important part is that the flow now looks like UDP traffic instead of raw ESP.
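A minimal Python sketch of the encapsulation itself: an 8-byte UDP header is placed in front of the ESP packet, giving the NAT device ports to track. The ESP packet here is fake (an SPI, a sequence number, and filler bytes), the translated port is made up, and the checksum is left at zero, which the UDP-encapsulation spec permits for IPv4.

```python
import struct

NAT_T_PORT = 4500


def udp_encapsulate_esp(esp_packet, src_port=NAT_T_PORT, dst_port=NAT_T_PORT):
    """Prefix an ESP packet with a UDP header (source, destination, length, checksum)."""
    length = 8 + len(esp_packet)
    return struct.pack("!HHHH", src_port, dst_port, length, 0) + esp_packet


# Fake ESP packet: 4-byte SPI, 4-byte sequence number, 32 bytes of "ciphertext".
esp = struct.pack("!II", 0x1001, 1) + b"\xee" * 32

wire = udp_encapsulate_esp(esp)
src, dst, length, checksum = struct.unpack("!HHHH", wire[:8])
print(src, dst, length)  # 4500 4500 48

# A PAT device may rewrite the source port; the ESP bytes behind it are untouched.
translated = struct.pack("!H", 61022) + wire[2:]
print(struct.unpack("!H", translated[:2])[0], translated[8:] == esp)  # 61022 True
```

The translation only ever touches the outer UDP header, so the integrity check on the ESP payload is unaffected.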
When NAT-T is not required:
- Both VPN endpoints have public IP addresses with no NAT device between them.
- Enterprise environments with static public IPs on VPN gateways.
- VPN configurations deployed within environments where NAT doesn't exist in the path.
In practice, NAT-T is needed for any VPN endpoint behind a home router, a PAT device, or any environment with NAT in the path. Dynamic IP addressing by itself does not require NAT-T, though it often appears in the same environments. It's safer to enable NAT-T by default than to diagnose NAT-related failures after the fact.
Note: if IKE negotiation works but ESP traffic (protocol 50) is blocked or not passing, forcing NAT-T enables UDP 4500 encapsulation and often resolves this. If UDP 500 itself is blocked, the tunnel may not even begin negotiation unless both peers support starting on UDP 4500.
Encryption Domain
The encryption domain defines which traffic should be encrypted and sent through the IPSec tunnel. It's the combination of source subnet, destination subnet, and protocol/port criteria that identifies traffic belonging to the VPN.
Traffic arriving at the VPN gateway that matches the encryption domain parameters is routed into the tunnel and encrypted. Traffic that doesn't match is forwarded normally.
On a site-to-site VPN between Site A (192.168.1.0/24) and Site B (10.0.0.0/24): Site A's router defines its local encryption domain as 192.168.1.0/24 with a remote domain of 10.0.0.0/24. Site B's router defines its local domain as 10.0.0.0/24 with a remote domain of 192.168.1.0/24. The domains are mirrored; what one side calls local, the other calls remote.
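The Site A / Site B example can be checked mechanically with Python's standard `ipaddress` module. This sketch is deliberately simplified: the helper names are hypothetical, protocol and port criteria are ignored, and it requires an exact mirror even though some platforms can negotiate narrower selectors.

```python
import ipaddress


def matches_selector(src_ip, dst_ip, local_net, remote_net):
    """True if a packet falls inside this side's encryption domain."""
    return (ipaddress.ip_address(src_ip) in ipaddress.ip_network(local_net)
            and ipaddress.ip_address(dst_ip) in ipaddress.ip_network(remote_net))


def selectors_mirrored(side_a, side_b):
    """Each side is (local_net, remote_net); A's local must equal B's remote and vice versa."""
    a_local, a_remote = map(ipaddress.ip_network, side_a)
    b_local, b_remote = map(ipaddress.ip_network, side_b)
    return a_local == b_remote and a_remote == b_local


site_a = ("192.168.1.0/24", "10.0.0.0/24")
site_b = ("10.0.0.0/24", "192.168.1.0/24")

print(selectors_mirrored(site_a, site_b))                     # True
print(matches_selector("192.168.1.20", "10.0.0.5", *site_a))  # True: goes into the tunnel
print(matches_selector("192.168.1.20", "8.8.8.8", *site_a))   # False: forwarded normally

# A common mistake: one side configured with a different prefix length.
print(selectors_mirrored(site_a, ("10.0.0.0/16", "192.168.1.0/24")))  # False
```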
Gateway Redundancy Note
When a site uses gateway redundancy, VPN redundancy should use the redundancy model supported by the VPN platform: HA cluster addresses, floating public IPs, redundant peers, dynamic routing over tunnels, or vendor-supported SD-WAN/IPSec failover. LAN gateway protocols like GLBP provide a virtual default gateway for local clients, but they do not automatically make an IPSec VPN endpoint redundant.
Troubleshooting IPSec
General Approach
Start with reachability before assuming a VPN misconfiguration. Confirm the peer's public IP is reachable. Try pinging an internal IP on the remote side, then try the reverse from the remote side. Check logs on both ends and filter for VPN-related entries.
On FortiGate, entering a real-time log filter and debug is the most direct investigation path:
diagnose vpn ike log-filter dst-addr4 <peer_public_ip>
diagnose debug application ike -1
diagnose debug enable
Look for mismatches in HAGLE parameters, the most common cause of Phase 1 failure. Always run a packet capture simultaneously so debug log timestamps can be correlated with specific packets in Wireshark.
Tunnel Not Establishing
Debug in real time. The error messages during IKE negotiation will tell you exactly which parameter is mismatched. Common culprits: encryption algorithm mismatch, hash mismatch, PSK doesn't match, DH group mismatch.
Tunnel Flapping
Identify the SPI (Security Parameter Index), SA values on both sides, and the destination IP. Compare them for discrepancies.
Check DPD behavior. If the debug log shows messages like "IKE giving up after 3 tries, peer set to lost," the VPN peer isn't responding to DPD probes. Get a packet capture to see whether DPD probes are actually being sent and received. Check SA values on both sides and investigate why responses aren't arriving.
Also check MTU. Fragmented or oversized packets can cause intermittent tunnel failures that look like flapping. Lower MTU and test.
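To see why MTU matters, it helps to estimate how much ESP tunnel mode adds to each packet. The Python sketch below assumes one specific combination: an IPv4 outer header without options (20 bytes), AES-CBC (16-byte IV and block size), and HMAC-SHA1-96 (12-byte integrity value). Other ciphers and hashes change the numbers, so use it as a starting point for testing rather than a definitive value.

```python
def esp_packet_size(inner_len, nat_t=False, block=16, iv=16, icv=12):
    """Size on the wire of an ESP tunnel-mode packet carrying `inner_len` bytes."""
    # Inner packet + pad-length byte + next-header byte, padded up to the cipher block size.
    padded = (inner_len + 2 + block - 1) // block * block
    outer_ip = 20
    udp = 8 if nat_t else 0   # NAT-T adds a UDP header
    esp_header = 8            # SPI + sequence number
    return outer_ip + udp + esp_header + iv + padded + icv


def max_inner_packet(link_mtu=1500, **options):
    """Largest inner packet that still fits in one outer packet of `link_mtu` bytes."""
    size = link_mtu
    while esp_packet_size(size, **options) > link_mtu:
        size -= 1
    return size


print(esp_packet_size(1500))         # 1560: a full-size inner packet no longer fits
print(max_inner_packet())            # 1438
print(max_inner_packet(nat_t=True))  # 1422
```

Under these assumptions, an inner MTU (or the TCP MSS derived from it) above roughly 1438 bytes, or 1422 with NAT-T, forces fragmentation somewhere along the path.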
Check physical cabling and interface stability. A marginal physical link causing intermittent packet loss will manifest as VPN instability.
Tunnel Goes Down and Stays Down
Same initial approach: compare SPI, SA values, and destination IP between both ends. Look for mismatches.
Check PFS configuration carefully. A subtle but common failure mode: the tunnel forms initially, operates for a while, and then goes down when the Phase 2 lifetime expires and rekeying begins. If one side has PFS enabled and the other doesn't, Phase 2 rekey fails.
Here's why it can work initially and fail later: during initial tunnel establishment, one side may not strictly enforce PFS, or the first Child SA is created without a DH exchange. When the Phase 2 SA expires and the side with PFS enabled attempts to rekey, it expects a new DH exchange. The other side tries to reuse the existing Phase 1 key. They can't agree on the rekey process and the tunnel goes down. Matching PFS configuration on both sides is required.
FortiGate-Specific Troubleshooting Workflow
Step 1: Identify the problem tunnel.
Check general VPN reachability and routing. If multiple tunnels exist, run:
get vpn ipsec tunnel summary
Look for tunnels showing selectors(total,up) as 1/0 (one selector configured, zero up). This is your target.
Step 2: Check Phase 1 status.
Run IKE diagnostics:
diagnose vpn ike gateway list
The status field shows either connecting (Phase 1 is down) or established (Phase 1 is up).
If connecting:
- Confirm UDP 500 (or 4500 if NAT-T is enabled) isn't being blocked between the peers.
- Run a traceroute or ping to the peer's public IP and confirm basic reachability.
- If reachability is confirmed, start IKE debug to look for parameter mismatches. Run a pcap on the outgoing interface simultaneously.
- If your traffic is reaching the peer but you see no response, run a pcap on the remote peer to determine whether they're receiving your packets and what they're sending back.
- Check upstream ACLs, ISP filtering, local-in policy, and interface exposure for UDP 500/4500. Regular IPv4 firewall policies matter for traffic through the tunnel after it is established.
If established: Phase 1 is fine. Move to Step 3.
Step 3: Check Phase 2 status.
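Run Phase 2 diagnostics (on FortiGate, diagnose vpn tunnel list) and check the SA value:
- SA = 0: Phase 2 is down. Usually a selector mismatch: the encryption domains don't match between the two sides. It can also mean no traffic has tried to bring the tunnel up yet.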
- SA = 1: Phase 2 is up. Use enc/dec counters to confirm whether traffic is actually flowing.
- SA = 2: Phase 2 is rekeying.
Also check enc/dec counters:
- enc increasing, dec not: Traffic is leaving your side but not reaching or being accepted by the remote side.
- dec increasing, enc not: Traffic is arriving from the remote side but nothing is going out.
- Both 0: Traffic isn't flowing at all. Start with the initiating device and confirm traffic is actually reaching the FortiGate from the local LAN.
Check the link monitor for peer status: alive or dead. Dead may indicate the peer is unreachable or DPD is failing.
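The enc/dec interpretation above is easier to apply if you compare two snapshots of the counters taken a few seconds apart while generating test traffic. Here is a small Python sketch of that logic; the dictionaries, function names, and wording of the results are all hypothetical, and the counter values would have to be copied by hand from the diagnostic output.

```python
def counter_deltas(before, after):
    """Change in enc/dec counters between two snapshots."""
    return after["enc"] - before["enc"], after["dec"] - before["dec"]


def diagnose_counters(enc_delta, dec_delta):
    """Map counter movement to the direction in which traffic is failing."""
    if enc_delta > 0 and dec_delta > 0:
        return "traffic flowing in both directions"
    if enc_delta > 0:
        return "outbound only: traffic leaves but nothing returns from the remote side"
    if dec_delta > 0:
        return "inbound only: traffic arrives but nothing is going out"
    return "no traffic: confirm LAN traffic is reaching the gateway"


before = {"enc": 100, "dec": 40}
after = {"enc": 180, "dec": 40}

print(diagnose_counters(*counter_deltas(before, after)))
# outbound only: traffic leaves but nothing returns from the remote side
```

Looking at deltas rather than absolute values avoids being misled by counters that accumulated while the tunnel was still healthy.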
Step 4: Phase 1 established, Phase 2 down (SA = 0).
- Verify encryption and hash algorithms match on both sides.
- Check PFS: if enabled, confirm the DH group matches on both sides.
- Confirm quick mode selectors (encryption domains) match (local subnet on one side must match the remote subnet on the other, and vice versa).
- Run a pcap on UDP 500/4500 and start the debug.
Step 5: Phase 2 up (SA = 1) but traffic not passing.
Traffic at this stage is carried as ESP protocol 50 when NAT-T is not in use, or UDP 4500 when NAT-T is encapsulating ESP. Check:
- Quick mode selectors: traffic that doesn't match the configured selectors is dropped rather than forwarded through the tunnel. Confirm source and destination subnets are defined correctly and broadly enough to include the traffic you're testing with.
- IPv4 policies: there must be a policy allowing traffic from the local subnet to the remote subnet through the tunnel interface. If source and destination address objects are defined in the policy, they must match the actual traffic.
- Routes: routing table entries must point to the tunnel interface for the remote subnet, not to the WAN interface. Check with:
get router info routing-table all
- If debug shows traffic is being allowed but logged as "No matching IPSec selector, drop," check whether NAT is enabled on the firewall policy for this VPN traffic. NAT rewrites the source IP before it's checked against the Phase 2 selectors. Disable NAT on the outbound VPN policy and retest.
- If everything else checks out and ICMP tests to the far-end device fail, check whether Windows Firewall on the endpoint is blocking ICMP. The VPN may be working correctly while the endpoint is filtering the test traffic.
- Enable a pcap filtering for the peer's IP and run debug simultaneously:
# Without NAT-T:
diagnose sniffer packet <interface> 'host <peer_ip> and proto 50'
# With NAT-T:
diagnose sniffer packet <interface> 'host <peer_ip> and udp port 4500'
diagnose debug application ike -1
diagnose debug enable
Step 6: Tunnel flapping (alternating between SA = 0 and SA = 1).
First determine scope: is this one tunnel or all tunnels?
If all tunnels are flapping: the problem is likely the internet connection, not VPN configuration. Check the WAN link stability, run continuous pings to external IPs, and check for packet loss.
If one tunnel: pull VPN event logs from the GUI. Determine whether the tunnel was stable before and recently started flapping:
- If recently started: investigate any network or device changes made around that time. New equipment, configuration changes, ISP changes. Confirm those changes are correct.
- In either case: check MTU, check physical cabling, collect logs and packet captures.
Dashboard check: Go to Dashboard > IPSec Monitor. The enc/dec values there show cumulative traffic in both directions and help confirm which direction traffic is failing. If both are 0/0, start with the initiating device and confirm traffic is even reaching the FortiGate from the LAN.
The TLS/SSL post covers the other major tunneling protocol, the one your browser uses for every HTTPS connection, and which also underlies SSL VPN deployments.