AWS Transit Gateway: Advanced Patterns and Battle-Tested Lessons

The Reality of Transit Gateway at Scale

Everyone knows Transit Gateway is AWS's answer to the "VPC peering mesh from hell" problem. But after managing TGW across multiple accounts, dealing with cross-account resource sharing drama, and debugging route propagation issues at 2 AM, I've learned that the real complexity isn't in setting it up — it's in operating it at scale.

This isn't your typical "here's how to create a TGW" tutorial. This is about the patterns, pitfalls, and production lessons that only come from actually running this thing in anger.

Quick TGW Refresher

Before we dive into the main content, let's quickly align on what Transit Gateway actually does and why it matters.

What Problem Does TGW Solve?

Remember the old days? You'd have:

  • VPC peering connections everywhere (n*(n-1)/2 connections for n VPCs)
  • Separate VPN connections to each VPC for on-premises connectivity
  • Complex routing tables that nobody understood
  • No transitive routing (A→B→C didn't work)

Transit Gateway replaces this mess with a single hub that:

  • Acts as a cloud router: Connect VPCs, VPNs, Direct Connect, and other TGWs through one central point
  • Enables transitive routing: Traffic can flow through the TGW to reach any connected network
  • Scales horizontally: Bandwidth is allocated per attachment rather than per gateway. AWS's published quotas list up to 100 Gbps per VPC attachment per Availability Zone, while each VPN tunnel tops out at 1.25 Gbps (check the current quotas page before sizing)
  • Supports multiple route tables: Segment traffic based on security requirements
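
To put numbers on the peering-mesh problem, here is a minimal Python sketch of the n*(n-1)/2 formula next to the one-attachment-per-VPC hub model. The function names are made up for illustration and nothing here calls AWS.

```python
def peering_connections(vpc_count: int) -> int:
    """Connections needed for a full VPC peering mesh: n * (n - 1) / 2."""
    return vpc_count * (vpc_count - 1) // 2


def tgw_attachments(vpc_count: int) -> int:
    """With a Transit Gateway hub, each VPC needs one attachment."""
    return vpc_count


if __name__ == "__main__":
    for n in (5, 20, 50):
        print(f"{n} VPCs: {peering_connections(n)} peerings "
              f"vs {tgw_attachments(n)} attachments")
```

The mesh grows quadratically while the hub grows linearly, which is why the pain tends to show up somewhere past a dozen VPCs.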

Core Components at a Glance

    # The basic building blocks you'll be working with
    Components:
      TransitGateway:
        Purpose: 'The central hub'
        Limits: '5 per account per Region (default quota), 5,000 attachments per TGW'

      Attachments:
        Types: ['VPC', 'VPN', 'Direct Connect Gateway', 'Peering', 'Connect']
        KeyPoint: 'Each attachment is a network connection to the TGW'

      RouteTables:
        Purpose: 'Control traffic flow between attachments'
        Association: 'Each attachment associates with exactly one route table'
        Propagation: 'Attachments can propagate routes to multiple tables'

Now that we're on the same page about what TGW is and what it does, let's talk about what happens when you actually run this in production across multiple accounts, regions, and teams — because that's where things get interesting (and occasionally terrifying).

Advanced Architecture Patterns

The Multi-Account Transit Hub Pattern

Forget the basic hub-and-spoke. In real enterprises, you're dealing with multiple AWS accounts, each with their own networking requirements, compliance boundaries, and — let's be honest — political territories.

Here's the pattern that actually works:

    # The centralized network account owns the TGW
    Resources:
      TransitGateway:
        Type: AWS::EC2::TransitGateway
        Properties:
          Description: Central transit hub
          DefaultRouteTableAssociation: disable
          DefaultRouteTablePropagation: disable
          # This is critical for multi-account
          AmazonSideAsn: 64512
          Tags:
            - Key: Name
              Value: tgw-central
            - Key: Environment
              Value: shared
            - Key: CostCenter
              Value: networking # Yes, this matters politically

The key insight: Disable default route table association and propagation. You want explicit control over routing decisions, not AWS making assumptions for you.

Segmented Route Tables for Security Zones

Production networks aren't flat. You've got prod, dev, DMZ, shared services — each with different security postures. Here's how to segment properly:

    Parameters:
      Environment:
        Type: String
        AllowedValues: [production, development, staging]

    Mappings:
      SecurityZones:
        production:
          CidrBlocks: '10.0.0.0/16'
          AllowedDestinations: 'shared_services,dmz_egress'
        development:
          CidrBlocks: '10.1.0.0/16'
          AllowedDestinations: 'shared_services,dmz_egress,development'
        dmz:
          CidrBlocks: '10.2.0.0/16'
          AllowedDestinations: 'dmz_egress'
        shared_services:
          CidrBlocks: '10.3.0.0/16'
          AllowedDestinations: 'production,development'

    Resources:
      # Create route table for each security zone
      ProductionRouteTable:
        Type: AWS::EC2::TransitGatewayRouteTable
        Properties:
          TransitGatewayId: !Ref TransitGateway
          Tags:
            - Key: Name
              Value: tgw-rt-production
            - Key: SecurityZone
              Value: production

      DevelopmentRouteTable:
        Type: AWS::EC2::TransitGatewayRouteTable
        Properties:
          TransitGatewayId: !Ref TransitGateway
          Tags:
            - Key: Name
              Value: tgw-rt-development
            - Key: SecurityZone
              Value: development

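The magic is in the AllowedDestinations mapping. This becomes your network security policy as code. The mapping is only data, though: nothing is enforced until your route and propagation resources actually follow it.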
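
Because the zone policy is plain data, it can be checked mechanically before anything is deployed. The sketch below copies the SecurityZones values into a Python dict and answers two questions: is a given destination allowed from a zone, and are there zone pairs where only one direction is allowed? The helper names are hypothetical, and dmz_egress is treated as an opaque destination label because the template doesn't define it as a zone.

```python
# Mirrors the SecurityZones mapping from the template above.
SECURITY_ZONES = {
    "production": {"cidr": "10.0.0.0/16",
                   "allowed": {"shared_services", "dmz_egress"}},
    "development": {"cidr": "10.1.0.0/16",
                    "allowed": {"shared_services", "dmz_egress", "development"}},
    "dmz": {"cidr": "10.2.0.0/16",
            "allowed": {"dmz_egress"}},
    "shared_services": {"cidr": "10.3.0.0/16",
                        "allowed": {"production", "development"}},
}


def is_allowed(source_zone: str, destination: str) -> bool:
    """True if the policy lets source_zone send traffic to destination."""
    return destination in SECURITY_ZONES[source_zone]["allowed"]


def one_way_pairs() -> list:
    """Zone pairs where A may reach B but B may not reach A."""
    pairs = []
    for src in sorted(SECURITY_ZONES):
        for dst in sorted(SECURITY_ZONES[src]["allowed"]):
            # Only compare against destinations that are zones themselves.
            if dst in SECURITY_ZONES and not is_allowed(dst, src):
                pairs.append((src, dst))
    return pairs


if __name__ == "__main__":
    print(is_allowed("production", "shared_services"))
    print(is_allowed("production", "development"))
    print(one_way_pairs())
```

A check like this fits naturally in CI, failing the build when someone adds a destination without its return direction.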

Cross-Account Resource Sharing: The Hard Parts

RAM Sharing That Actually Works

AWS Resource Access Manager (RAM) is how you share TGW across accounts. But here's what the docs don't tell you:

    Resources:
      TGWResourceShare:
        Type: AWS::RAM::ResourceShare
        Properties:
          Name: !Sub 'tgw-share-${Environment}'
          AllowExternalPrincipals: false # Start with false!
          ResourceArns:
            - !Sub 'arn:aws:ec2:${AWS::Region}:${AWS::AccountId}:transit-gateway/${TransitGateway}'
          # The gotcha: a share does nothing until principals are attached.
          # In CloudFormation they are set on the share itself via Principals.
          Principals:
            - !Ref SpokeAccountId # From Parameters
          Tags:
            - Key: Purpose
              Value: TGW cross-account sharing

Critical lesson: Don't set AllowExternalPrincipals: true unless you absolutely need it. It opens up sharing outside your AWS Organization, which is rarely what you want.

The Attachment Configuration Dance

This is where things get messy. Each spoke account needs to:

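  1. Accept the RAM invitation (often forgotten! Accounts in the same AWS Organization skip this step once RAM sharing with Organizations is enabled)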
  2. Create the attachment
  3. Configure route tables

But here's the killer pattern — use parameters to pass the shared TGW ID:

    # In the spoke account template
    Parameters:
      TransitGatewayId:
        Type: String
        Description: Shared Transit Gateway ID from network account

      NetworkAccountId:
        Type: String
        Description: AWS Account ID that owns the TGW

    Resources:
      VPCAttachment:
        Type: AWS::EC2::TransitGatewayVpcAttachment
        Properties:
          TransitGatewayId: !Ref TransitGatewayId
          VpcId: !Ref VPC
          SubnetIds:
            - !Ref PrivateSubnetA
            - !Ref PrivateSubnetB

          # This is crucial for routing control
          Options:
            DnsSupport: enable
            Ipv6Support: disable
            ApplianceModeSupport: disable

          Tags:
            - Key: Name
              Value: !Sub 'tgw-attach-${Environment}'

Production Gotchas and Solutions

The Route Propagation Mystery

Ever had routes that should propagate but don't? Here's the debugging checklist I've built from painful experience:

    # Check attachment state
    aws ec2 describe-transit-gateway-attachments \
      --filters "Name=transit-gateway-id,Values=tgw-xxx" \
      --query 'TransitGatewayAttachments[].{ID:TransitGatewayAttachmentId,State:State,Association:Association}'

    # Verify route table associations
    aws ec2 get-transit-gateway-route-table-associations \
      --transit-gateway-route-table-id tgw-rtb-xxx

    # The money shot - actual propagations
    aws ec2 get-transit-gateway-route-table-propagations \
      --transit-gateway-route-table-id tgw-rtb-xxx

90% of the time, it's because:

  1. The attachment isn't associated with the right route table
  2. Propagation is enabled but the source doesn't have routes to propagate
  3. Overlapping CIDR blocks (TGW won't propagate a route for a VPC whose CIDR is identical to one already attached, and nothing warns you)
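
The CIDR item on that checklist is easy to catch before attaching anything. A minimal sketch using only Python's standard ipaddress module follows; the attachment names and CIDRs in the example are invented, and in practice you would feed in whatever inventory you already keep.

```python
from ipaddress import ip_network
from itertools import combinations


def find_cidr_conflicts(attachments: dict) -> list:
    """Return (name_a, name_b, kind) for every identical or overlapping pair.

    attachments maps an attachment name to its CIDR string.
    """
    conflicts = []
    for (name_a, cidr_a), (name_b, cidr_b) in combinations(
            sorted(attachments.items()), 2):
        net_a, net_b = ip_network(cidr_a), ip_network(cidr_b)
        if net_a == net_b:
            conflicts.append((name_a, name_b, "identical"))
        elif net_a.overlaps(net_b):
            conflicts.append((name_a, name_b, "overlapping"))
    return conflicts


if __name__ == "__main__":
    example = {
        "prod": "10.0.0.0/16",
        "dev": "10.1.0.0/16",
        "legacy": "10.0.0.0/16",     # same range as prod
        "sandbox": "10.1.128.0/17",  # sits inside dev
    }
    for conflict in find_cidr_conflicts(example):
        print(conflict)
```

Identical ranges are the ones that silently lose their propagated route; partial overlaps still deserve a look because the more specific prefix wins.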

The Asymmetric Routing Trap

This one's subtle. You've got traffic flowing A→B but not B→A. Classic symptoms:

  • SSH works one way
  • Health checks fail randomly
  • Intermittent timeouts

The fix? Always configure routes symmetrically:

    Resources:
      # Don't just add the route one way
      ProdToSharedRoute:
        Type: AWS::EC2::TransitGatewayRoute
        Properties:
          DestinationCidrBlock: !Ref SharedServicesCidr
          TransitGatewayAttachmentId: !Ref SharedServicesAttachment
          TransitGatewayRouteTableId: !Ref ProductionRouteTable

      # You need the return path too!
      SharedToProdRoute:
        Type: AWS::EC2::TransitGatewayRoute
        Properties:
          DestinationCidrBlock: !Ref ProductionCidr
          TransitGatewayAttachmentId: !Ref ProductionAttachment
          TransitGatewayRouteTableId: !Ref SharedServicesRouteTable
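
A missing return path can also be found offline. The sketch below assumes you have exported each zone's CIDR and the destination prefixes in its TGW route table into plain dicts (the shapes and names are illustrative, not an AWS API format) and reports every pair with a forward route but no route back.

```python
from ipaddress import ip_network


def _covers(routes: list, cidr: str) -> bool:
    """True if any route prefix contains the whole target CIDR."""
    target = ip_network(cidr)
    return any(target.subnet_of(ip_network(route)) for route in routes)


def missing_return_routes(zone_cidrs: dict, route_tables: dict) -> list:
    """Return (src, dst) pairs where src can reach dst but dst cannot reply."""
    gaps = []
    for src in sorted(zone_cidrs):
        for dst in sorted(zone_cidrs):
            if src == dst:
                continue
            forward = _covers(route_tables.get(src, []), zone_cidrs[dst])
            back = _covers(route_tables.get(dst, []), zone_cidrs[src])
            if forward and not back:
                gaps.append((src, dst))
    return gaps


if __name__ == "__main__":
    cidrs = {"production": "10.0.0.0/16", "shared_services": "10.3.0.0/16"}
    tables = {"production": ["10.3.0.0/16"], "shared_services": []}
    print(missing_return_routes(cidrs, tables))
```

An empty result means every forward route has a covering route in the opposite table; a default route such as 0.0.0.0/0 counts as covering.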

Advanced Patterns for Scale

Dynamic Route Learning with BGP

If you're connecting to on-premises or using SD-WAN, you'll want BGP. Here's the pattern that actually works in production:

    Resources:
      CustomerGateway:
        Type: AWS::EC2::CustomerGateway
        Properties:
          BgpAsn: 65000 # Your on-prem ASN
          IpAddress: !Ref CustomerGatewayIP
          Type: ipsec.1
          Tags:
            - Key: Name
              Value: cgw-datacenter

      VPNConnection:
        Type: AWS::EC2::VPNConnection
        Properties:
          CustomerGatewayId: !Ref CustomerGateway
          TransitGatewayId: !Ref TransitGateway
          Type: ipsec.1
          StaticRoutesOnly: false # Enable BGP

      # Enable route propagation from VPN
      VPNAttachmentPropagation:
        Type: AWS::EC2::TransitGatewayRouteTablePropagation
        Properties:
          # This property needs the tgw-attach-... ID. !Ref VPNConnection returns
          # the vpn-... ID, so VPNAttachmentId is an assumed parameter that you
          # look up (describe-transit-gateway-attachments) once the VPN exists.
          TransitGatewayAttachmentId: !Ref VPNAttachmentId
          TransitGatewayRouteTableId: !Ref ProductionRouteTable

Pro tip: Use BGP communities to tag routes for different handling:

    ! On your edge router
    ip community-list standard PROD_ROUTES permit 65000:100
    ip community-list standard DEV_ROUTES permit 65000:200
    !
    route-map TO_AWS permit 10
     match community PROD_ROUTES
     set as-path prepend 65000
    ! Routes that match no permit clause hit the route-map's implicit deny

Cost Optimization Patterns

TGW charges add up fast. Here's how to optimize:

  1. Attachment Consolidation: Don't create separate attachments for every little thing
  2. Regional Hub Pattern: Use TGW peering instead of duplicating TGWs
  3. Traffic Engineering: Route internet-bound traffic directly, not through TGW

    Resources:
      # Smart routing to avoid unnecessary TGW charges
      PrivateRouteTable:
        Type: AWS::EC2::RouteTable
        Properties:
          VpcId: !Ref VPC

      InternetRoute:
        Type: AWS::EC2::Route
        Properties:
          RouteTableId: !Ref PrivateRouteTable
          DestinationCidrBlock: 0.0.0.0/0
          NatGatewayId: !Ref NATGateway
          # NOT TransitGatewayId - avoid TGW charges!

Monitoring and Observability

You can't manage what you can't measure. Essential metrics:

    # CloudWatch custom metric for route table size
    import boto3


    def check_route_table_size():
        ec2 = boto3.client('ec2')
        cloudwatch = boto3.client('cloudwatch')

        response = ec2.describe_transit_gateway_route_tables()

        for rt in response['TransitGatewayRouteTables']:
            rt_id = rt['TransitGatewayRouteTableId']
            # Tags arrive as a list of {'Key': ..., 'Value': ...} dicts
            tags = {t['Key']: t['Value'] for t in rt.get('Tags', [])}

            # Filters is a required argument, and one call returns at most
            # 1,000 routes, so the reported count saturates at that value
            routes = ec2.search_transit_gateway_routes(
                TransitGatewayRouteTableId=rt_id,
                Filters=[{'Name': 'state', 'Values': ['active', 'blackhole']}],
                MaxResults=1000,
            )

            cloudwatch.put_metric_data(
                Namespace='TGW/RouteTableSize',
                MetricData=[{
                    'MetricName': 'RouteCount',
                    'Value': len(routes['Routes']),
                    'Dimensions': [
                        {'Name': 'RouteTableId', 'Value': rt_id},
                        {'Name': 'Environment', 'Value': tags.get('Environment', 'unknown')},
                    ],
                }],
            )

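Set alarms when route counts approach the quota. AWS documents a default of 10,000 combined static and dynamic routes across all route tables on a transit gateway, so the budget is shared between tables rather than granted to each one.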
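
What "approaching the limit" means is a judgment call, so here is a small sketch that turns a route count into an alarm level. The limit is passed in rather than hard-coded because the applicable quota depends on your account, and the 70% and 90% thresholds are assumptions chosen for illustration, not AWS recommendations.

```python
def route_utilization(route_count: int, limit: int) -> float:
    """Fraction of the route quota currently in use."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return route_count / limit


def alarm_level(route_count: int, limit: int,
                warn_at: float = 0.7, critical_at: float = 0.9) -> str:
    """Map a route count to 'ok', 'warning', or 'critical'.

    warn_at and critical_at are illustrative defaults; tune them to how
    quickly your route counts grow.
    """
    used = route_utilization(route_count, limit)
    if used >= critical_at:
        return "critical"
    if used >= warn_at:
        return "warning"
    return "ok"


if __name__ == "__main__":
    for count in (5000, 7500, 9500):
        print(count, alarm_level(count, limit=10000))
```

The same two thresholds translate directly into a pair of CloudWatch alarms on the RouteCount metric if you prefer to keep the logic in AWS.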

CI/CD for Network Changes

Network changes are scary. Here's a battle-tested CloudFormation workflow that's saved me from disasters:

Deploying CloudFormation with GitHub Actions

The CloudFormation-Specific Gotchas

After migrating from Terraform to CloudFormation for TGW management, here are the lessons learned:

  1. Change Sets Are Your Friend: Never update directly. Change sets show you exactly what will happen.

  2. The "No Updates" Trap: CloudFormation fails change sets with no changes. Handle this gracefully:

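      # CloudFormation words this as "No updates are to be performed." or
      # "The submitted information didn't contain changes...", so match both
      if [[ "$REASON" == *[Nn]o\ updates* || "$REASON" == *"didn't contain changes"* ]]; then
        echo "No changes needed"
        exit 0
      fi

  3. Wait Conditions Matter: Don't just fire and forget. Running aws cloudformation wait change-set-create-complete before executing has prevented many race conditions.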

  4. S3 Templates for Large Stacks: A template passed inline with --template-body is capped at 51,200 bytes, while a template uploaded to S3 can be up to 1 MB. Always use S3 for production:

      aws s3 cp template.yaml s3://bucket/templates/${{ github.sha }}.yaml
      aws cloudformation create-change-set --template-url https://...

  5. GitHub Environments for Safety: The environment: production line is crucial. It creates a deployment gate:

      deploy:
        environment: production # Gated by the environment's protection rules

    Combined with GitHub's deployment protection rules, this ensures someone with proper access must explicitly click "Deploy" before any changes hit production. No more accidental merges taking down the network at 3 AM.

    Pro tip: Configure your environment with required reviewers who actually understand networking:

      # In your repo settings → Environments → production
      - Required reviewers: @network-team
      - Wait timer: 5 minutes (gives time to catch mistakes)
      - Deployment branches: main only

The key: Always review change sets before execution. A misconfigured route table can take down production faster than you can say "rollback." The manual approval gate has saved us more times than I'd like to admit.
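
If the pipeline logic lives in Python rather than shell, the "no updates" check from the list above can be written as a small pure function. This is a sketch under assumptions: the two marker strings are the reason texts CloudFormation is known to use for empty change sets and no-op updates, but the wording isn't a stable contract, so verify them against what your own pipeline logs. The function name is hypothetical.

```python
# Substrings that indicate "nothing to deploy" (compared in lower case).
NO_CHANGE_MARKERS = (
    "didn't contain changes",
    "no updates are to be performed",
)


def is_empty_change_set(status: str, status_reason) -> bool:
    """True when a FAILED change set failed only because nothing changed."""
    if status != "FAILED":
        return False
    reason = (status_reason or "").lower()
    return any(marker in reason for marker in NO_CHANGE_MARKERS)


if __name__ == "__main__":
    reason = ("The submitted information didn't contain changes. "
              "Submit different information to create a change set.")
    print(is_empty_change_set("FAILED", reason))
    print(is_empty_change_set("FAILED", "Template format error"))
```

Treating only this specific failure as success keeps genuine template errors failing the build.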

The Politics of Network Architecture

Let's be real — the technical challenges are only half the battle. Here's how to navigate the organizational aspects:

Cost Attribution

    Parameters:
      CostCenter:
        Type: String
        Description: Cost center code for billing

      ChargebackCode:
        Type: String
        Description: Chargeback code for internal accounting

    # Tag everything with cost centers
    Resources:
      TransitGateway:
        Type: AWS::EC2::TransitGateway
        Properties:
          Tags:
            - Key: CostCenter
              Value: !Ref CostCenter
            - Key: ChargebackCode
              Value: !Ref ChargebackCode
            - Key: DataTransferTo
              Value: track-me-please

Finance will thank you when they can actually attribute the $10K/month TGW bill.

Access Control Patterns

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "ReadOnlyTGW",
          "Effect": "Allow",
          "Action": [
            "ec2:DescribeTransitGateways",
            "ec2:DescribeTransitGatewayRouteTables",
            "ec2:GetTransitGatewayRouteTableAssociations"
          ],
          "Resource": "*"
        },
        {
          "Sid": "ModifyOnlyOwnAttachments",
          "Effect": "Allow",
          "Action": [
            "ec2:ModifyTransitGatewayVpcAttachment",
            "ec2:DeleteTransitGatewayVpcAttachment"
          ],
          "Resource": "*",
          "Condition": {
            "StringEquals": {
              "ec2:ResourceTag/Owner": "${aws:PrincipalTag/team}"
            }
          }
        }
      ]
    }

Give teams autonomy while maintaining guardrails.
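
To reason about what the ModifyOnlyOwnAttachments statement permits, it can help to model its condition in a few lines. The sketch below is a simplification for intuition only, not an IAM evaluator: it covers just the StringEquals comparison between the attachment's Owner tag and the caller's team tag, and the function name is invented.

```python
def can_modify_attachment(principal_tags: dict, resource_tags: dict) -> bool:
    """Model of the StringEquals condition in ModifyOnlyOwnAttachments.

    The statement applies only when the attachment's Owner tag equals the
    caller's team tag. A missing tag on either side never matches, and the
    comparison is case-sensitive.
    """
    team = principal_tags.get("team")
    owner = resource_tags.get("Owner")
    return team is not None and owner is not None and team == owner


if __name__ == "__main__":
    print(can_modify_attachment({"team": "payments"}, {"Owner": "payments"}))
    print(can_modify_attachment({"team": "payments"}, {"Owner": "platform"}))
    print(can_modify_attachment({"team": "payments"}, {}))
```

The untagged case is the one that bites in practice: an attachment created without an Owner tag can't be modified by any team under this statement.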

The TL;DR Checklist

If you take away nothing else:

  1. Disable default route propagation — Be explicit about routing
  2. Test asymmetric routing — It will happen, be ready
  3. Monitor route table size — You'll hit limits before you expect
  4. Use RAM carefully — Cross-account sharing is powerful but complex
  5. Tag everything — Future you will thank present you
  6. Automate with change sets — But always human-review route changes
  7. Plan for BGP — Even if you don't need it today
  8. Document security zones — Network segmentation as code

Transit Gateway isn't just a network hub — it's the backbone of your multi-account strategy. Treat it with the respect (and automation) it deserves.

Remember: In networking, as in life, it's not about avoiding all problems — it's about failing fast, debugging faster, and documenting everything for the next poor soul who has to maintain this.
