The Reality of Transit Gateway at Scale
Everyone knows Transit Gateway is AWS's answer to the "VPC peering mesh from hell" problem. But after managing TGW across multiple accounts, dealing with cross-account resource sharing drama, and debugging route propagation issues at 2 AM, I've learned that the real complexity isn't in setting it up — it's in operating it at scale.
This isn't your typical "here's how to create a TGW" tutorial. This is about the patterns, pitfalls, and production lessons that only come from actually running this thing in anger.
Quick TGW Refresher
Before we dive into the main content, let's quickly align on what Transit Gateway actually does and why it matters.
What Problem Does TGW Solve?
Remember the old days? You'd have:
- VPC peering connections everywhere (n*(n-1)/2 connections for n VPCs)
- Separate VPN connections to each VPC for on-premises connectivity
- Complex routing tables that nobody understood
- No transitive routing (A→B→C didn't work)
Transit Gateway replaces this mess with a single hub that:
- Acts as a cloud router: Connect VPCs, VPNs, Direct Connect, and other TGWs through one central point
- Enables transitive routing: Traffic can flow through the TGW to reach any connected network
- Scales horizontally: Bandwidth is allocated per attachment (tens of Gbps per VPC attachment per Availability Zone, 1.25 Gbps per VPN tunnel; check the current quotas for exact figures)
- Supports multiple route tables: Segment traffic based on security requirements
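To put numbers on the peering mesh problem described above, here is a small sketch that compares the connection count for a full mesh with the attachment count for a hub. The function names are illustrative, not part of any AWS API.

```python
def peering_connections(vpc_count: int) -> int:
    """Full-mesh VPC peering needs one connection per pair of VPCs."""
    return vpc_count * (vpc_count - 1) // 2


def tgw_attachments(vpc_count: int) -> int:
    """With a Transit Gateway, each VPC needs a single attachment to the hub."""
    return vpc_count


for n in (5, 20, 50):
    print(f"{n} VPCs: {peering_connections(n)} peerings vs {tgw_attachments(n)} attachments")
```

At 50 VPCs that is 1,225 peering connections against 50 attachments, which is why the mesh stops being manageable long before you hit any service quota.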
Core Components at a Glance
```yaml
# The basic building blocks you'll be working with
Components:
  TransitGateway:
    Purpose: 'The central hub'
    Limits: '5 per region, 5000 VPC attachments'

  Attachments:
    Types: ['VPC', 'VPN', 'Direct Connect Gateway', 'Peering', 'Connect']
    KeyPoint: 'Each attachment is a network connection to the TGW'

  RouteTables:
    Purpose: 'Control traffic flow between attachments'
    Association: 'Each attachment associates with exactly one route table'
    Propagation: 'Attachments can propagate routes to multiple tables'
```
Now that we're on the same page about what TGW is and what it does, let's talk about what happens when you actually run this in production across multiple accounts, regions, and teams — because that's where things get interesting (and occasionally terrifying).
Advanced Architecture Patterns
The Multi-Account Transit Hub Pattern
Forget the basic hub-and-spoke. In real enterprises, you're dealing with multiple AWS accounts, each with their own networking requirements, compliance boundaries, and — let's be honest — political territories.
Here's the pattern that actually works:
```yaml
# The centralized network account owns the TGW
Resources:
  TransitGateway:
    Type: AWS::EC2::TransitGateway
    Properties:
      Description: Central transit hub
      DefaultRouteTableAssociation: disable
      DefaultRouteTablePropagation: disable
      # This is critical for multi-account
      AmazonSideAsn: 64512
      Tags:
        - Key: Name
          Value: tgw-central
        - Key: Environment
          Value: shared
        - Key: CostCenter
          Value: networking # Yes, this matters politically
```
The key insight: Disable default route table association and propagation. You want explicit control over routing decisions, not AWS making assumptions for you.
Segmented Route Tables for Security Zones
Production networks aren't flat. You've got prod, dev, DMZ, shared services — each with different security postures. Here's how to segment properly:
```yaml
Parameters:
  Environment:
    Type: String
    AllowedValues: [production, development, staging]

Mappings:
  SecurityZones:
    production:
      CidrBlocks: '10.0.0.0/16'
      AllowedDestinations: 'shared_services,dmz_egress'
    development:
      CidrBlocks: '10.1.0.0/16'
      AllowedDestinations: 'shared_services,dmz_egress,development'
    dmz:
      CidrBlocks: '10.2.0.0/16'
      AllowedDestinations: 'dmz_egress'
    shared_services:
      CidrBlocks: '10.3.0.0/16'
      AllowedDestinations: 'production,development'

Resources:
  # Create route table for each security zone
  ProductionRouteTable:
    Type: AWS::EC2::TransitGatewayRouteTable
    Properties:
      TransitGatewayId: !Ref TransitGateway
      Tags:
        - Key: Name
          Value: tgw-rt-production
        - Key: SecurityZone
          Value: production

  DevelopmentRouteTable:
    Type: AWS::EC2::TransitGatewayRouteTable
    Properties:
      TransitGatewayId: !Ref TransitGateway
      Tags:
        - Key: Name
          Value: tgw-rt-development
        - Key: SecurityZone
          Value: development
```
The magic is in the `AllowedDestinations` mapping. This becomes your network security policy as code.
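One way to make the "policy as code" idea concrete is to load the zone mapping into a script and query it. The sketch below copies the `AllowedDestinations` values from the template above. The helper names `can_reach` and `required_propagations` are hypothetical, and the logic is only a plain string lookup, not something CloudFormation evaluates for you.

```python
from __future__ import annotations

# Zone policy copied from the Mappings block above.
ALLOWED_DESTINATIONS = {
    "production": "shared_services,dmz_egress",
    "development": "shared_services,dmz_egress,development",
    "dmz": "dmz_egress",
    "shared_services": "production,development",
}


def can_reach(source_zone: str, destination: str) -> bool:
    """Return True if the policy lists `destination` for `source_zone`."""
    allowed = ALLOWED_DESTINATIONS.get(source_zone, "")
    return destination in {item.strip() for item in allowed.split(",") if item.strip()}


def required_propagations() -> list[tuple[str, str]]:
    """List every (source_zone, destination) pair the policy permits."""
    pairs = []
    for source, allowed in ALLOWED_DESTINATIONS.items():
        for destination in allowed.split(","):
            pairs.append((source, destination.strip()))
    return sorted(pairs)


print(can_reach("production", "development"))
print(required_propagations())
```

A check like this can run in CI against proposed route changes, so a pull request that lets production reach development fails before it gets anywhere near a change set.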
Cross-Account Resource Sharing: The Hard Parts
RAM Sharing That Actually Works
AWS Resource Access Manager (RAM) is how you share TGW across accounts. But here's what the docs don't tell you:
```yaml
Resources:
  TGWResourceShare:
    Type: AWS::RAM::ResourceShare
    Properties:
      Name: !Sub 'tgw-share-${Environment}'
      AllowExternalPrincipals: false # Start with false!
      ResourceArns:
        - !Sub 'arn:aws:ec2:${AWS::Region}:${AWS::AccountId}:transit-gateway/${TransitGateway}'
      # The gotcha: you need the principal associations. CloudFormation has no
      # separate association resource for RAM, so list principals on the share.
      Principals:
        - !Ref SpokeAccountId # From Parameters
      Tags:
        - Key: Purpose
          Value: TGW cross-account sharing
```
Critical lesson: Don't set `AllowExternalPrincipals: true` unless you absolutely need it. It opens up sharing outside your AWS Organization, which is rarely what you want.
The Attachment Configuration Dance
This is where things get messy. Each spoke account needs to:
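- Accept the RAM invitation (often forgotten! Accounts in the same AWS Organization skip this step once RAM sharing with Organizations is enabled)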
- Create the attachment
- Configure route tables
But here's the killer pattern — use parameters to pass the shared TGW ID:
```yaml
# In the spoke account template
Parameters:
  TransitGatewayId:
    Type: String
    Description: Shared Transit Gateway ID from network account

  NetworkAccountId:
    Type: String
    Description: AWS Account ID that owns the TGW

Resources:
  VPCAttachment:
    Type: AWS::EC2::TransitGatewayVpcAttachment
    Properties:
      TransitGatewayId: !Ref TransitGatewayId
      VpcId: !Ref VPC
      SubnetIds:
        - !Ref PrivateSubnetA
        - !Ref PrivateSubnetB

      # This is crucial for routing control
      Options:
        DnsSupport: enable
        Ipv6Support: disable
        ApplianceModeSupport: disable

      Tags:
        - Key: Name
          Value: !Sub 'tgw-attach-${Environment}'
```
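The first step in the list above, accepting the RAM invitation, is the one that gets forgotten, so it is worth scripting. The sketch below assumes `ram_client` behaves like `boto3.client("ram")` and uses its `get_resource_share_invitations` and `accept_resource_share_invitation` calls. The function name is hypothetical, and the sketch reads only the first page of results.

```python
def accept_pending_invitations(ram_client) -> list:
    """Accept every pending RAM invitation visible to the spoke account.

    `ram_client` is expected to behave like boto3.client("ram").
    Only the first page of results is examined in this sketch.
    """
    accepted = []
    response = ram_client.get_resource_share_invitations()
    for invitation in response.get("resourceShareInvitations", []):
        # Skip invitations that were already accepted, rejected, or expired.
        if invitation.get("status") != "PENDING":
            continue
        arn = invitation["resourceShareInvitationArn"]
        ram_client.accept_resource_share_invitation(resourceShareInvitationArn=arn)
        accepted.append(arn)
    return accepted
```

In a spoke account pipeline you would call it as `accept_pending_invitations(boto3.client("ram"))` before deploying the attachment template. Accepting everything blindly is a deliberate simplification; in practice you would probably filter on the sender account ID first.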
Production Gotchas and Solutions
The Route Propagation Mystery
Ever had routes that should propagate but don't? Here's the debugging checklist I've built from painful experience:
```bash
# Check attachment state
aws ec2 describe-transit-gateway-attachments \
  --filters "Name=transit-gateway-id,Values=tgw-xxx" \
  --query 'TransitGatewayAttachments[].{ID:TransitGatewayAttachmentId,State:State,Association:Association}'

# Verify route table associations
aws ec2 get-transit-gateway-route-table-associations \
  --transit-gateway-route-table-id tgw-rtb-xxx

# The money shot - actual propagations
aws ec2 get-transit-gateway-route-table-propagations \
  --transit-gateway-route-table-id tgw-rtb-xxx
```
90% of the time, it's because:
- The attachment isn't associated with the right route table
- Propagation is enabled but the source doesn't have routes to propagate
- Overlapping CIDR blocks (TGW won't propagate a route for a VPC whose CIDR is identical to one already attached, and it won't warn you)
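The third cause in the list above is cheap to rule out before you attach anything. Here is a minimal sketch using only the standard library `ipaddress` module. The attachment names and the `legacy` CIDR are made-up examples.

```python
import ipaddress
from itertools import combinations


def find_overlapping_cidrs(cidrs_by_attachment: dict) -> list:
    """Return pairs of attachment names whose CIDR blocks overlap."""
    networks = {
        name: ipaddress.ip_network(cidr)
        for name, cidr in cidrs_by_attachment.items()
    }
    overlaps = []
    for (name_a, net_a), (name_b, net_b) in combinations(networks.items(), 2):
        if net_a.overlaps(net_b):
            overlaps.append((name_a, name_b))
    return overlaps


# Hypothetical attachments: "legacy" sits inside the production range.
example = {
    "production": "10.0.0.0/16",
    "development": "10.1.0.0/16",
    "legacy": "10.0.128.0/20",
}
print(find_overlapping_cidrs(example))
```

Running this over the CIDRs of every VPC you plan to attach catches partial overlaps as well as exact duplicates, which are harder to spot by eye.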
The Asymmetric Routing Trap
This one's subtle. You've got traffic flowing A→B but not B→A. Classic symptoms:
- SSH works one way
- Health checks fail randomly
- Intermittent timeouts
The fix? Always configure routes symmetrically:
```yaml
Resources:
  # Don't just add the route one way
  ProdToSharedRoute:
    Type: AWS::EC2::TransitGatewayRoute
    Properties:
      DestinationCidrBlock: !Ref SharedServicesCidr
      TransitGatewayAttachmentId: !Ref SharedServicesAttachment
      TransitGatewayRouteTableId: !Ref ProductionRouteTable

  # You need the return path too!
  SharedToProdRoute:
    Type: AWS::EC2::TransitGatewayRoute
    Properties:
      DestinationCidrBlock: !Ref ProductionCidr
      TransitGatewayAttachmentId: !Ref ProductionAttachment
      TransitGatewayRouteTableId: !Ref SharedServicesRouteTable
```
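Symmetry is easy to check mechanically once routes are expressed as zone pairs. The sketch below assumes you have already reduced your route tables to `(source_zone, destination_zone)` tuples. That reduction step and the function name are assumptions, not something the templates above produce directly.

```python
def missing_return_routes(routes: set) -> list:
    """Given (source_zone, destination_zone) routes, list the absent reverse pairs."""
    return sorted(
        (dst, src)
        for src, dst in routes
        if src != dst and (dst, src) not in routes
    )


# Zone names follow the template above; the development entry is a
# hypothetical one-way route added to show a failure.
configured = {
    ("production", "shared_services"),
    ("shared_services", "production"),
    ("development", "shared_services"),
}
print(missing_return_routes(configured))
```

Here the result points at the missing `shared_services` to `development` route, which is exactly the kind of gap that shows up as "SSH works one way".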
Advanced Patterns for Scale
Dynamic Route Learning with BGP
If you're connecting to on-premises or using SD-WAN, you'll want BGP. Here's the pattern that actually works in production:
```yaml
Resources:
  CustomerGateway:
    Type: AWS::EC2::CustomerGateway
    Properties:
      BgpAsn: 65000 # Your on-prem ASN
      IpAddress: !Ref CustomerGatewayIP
      Type: ipsec.1
      Tags:
        - Key: Name
          Value: cgw-datacenter

  VPNConnection:
    Type: AWS::EC2::VPNConnection
    Properties:
      CustomerGatewayId: !Ref CustomerGateway
      TransitGatewayId: !Ref TransitGateway
      Type: ipsec.1
      StaticRoutesOnly: false # Enable BGP

  # Enable route propagation from VPN
  VPNAttachmentPropagation:
    Type: AWS::EC2::TransitGatewayRouteTablePropagation
    Properties:
      # !Ref VPNConnection returns the VPN connection ID (vpn-...), not the
      # TGW attachment ID (tgw-attach-...). VpnAttachmentId is an assumed
      # parameter holding the attachment ID, looked up after the VPN exists.
      TransitGatewayAttachmentId: !Ref VpnAttachmentId
      TransitGatewayRouteTableId: !Ref ProductionRouteTable
```
Pro tip: Use BGP communities to tag routes for different handling:
```text
# On your edge router
ip community-list standard PROD_ROUTES permit 65000:100
ip community-list standard DEV_ROUTES permit 65000:200

route-map TO_AWS permit 10
 match community PROD_ROUTES
 set as-path prepend 65000
```
Cost Optimization Patterns
TGW charges add up fast. Here's how to optimize:
- Attachment Consolidation: Don't create separate attachments for every little thing
- Regional Hub Pattern: Use TGW peering instead of duplicating TGWs
- Traffic Engineering: Route internet-bound traffic directly, not through TGW
```yaml
Resources:
  # Smart routing to avoid unnecessary TGW charges
  PrivateRouteTable:
    Type: AWS::EC2::RouteTable
    Properties:
      VpcId: !Ref VPC

  InternetRoute:
    Type: AWS::EC2::Route
    Properties:
      RouteTableId: !Ref PrivateRouteTable
      DestinationCidrBlock: 0.0.0.0/0
      NatGatewayId: !Ref NATGateway
      # NOT TransitGatewayId - avoid TGW charges!
```
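To see whether moving traffic off the TGW is worth the effort, a rough estimate helps. The sketch below models the bill as attachment-hours plus per-GB data processing. Both default rates are placeholder assumptions rather than quoted AWS prices, so substitute the figures from the pricing page for your region.

```python
HOURS_PER_MONTH = 730  # common approximation for monthly billing estimates


def monthly_tgw_cost(attachments: int, gb_processed: float,
                     attachment_hourly_rate: float = 0.05,
                     per_gb_rate: float = 0.02) -> float:
    """Estimate monthly cost: attachment-hours plus data processing.

    Both default rates are placeholder assumptions, not quoted prices.
    """
    attachment_cost = attachments * attachment_hourly_rate * HOURS_PER_MONTH
    data_cost = gb_processed * per_gb_rate
    return round(attachment_cost + data_cost, 2)


before = monthly_tgw_cost(attachments=20, gb_processed=50_000)
after = monthly_tgw_cost(attachments=20, gb_processed=20_000)
print(before, after, round(before - after, 2))
```

With these placeholder rates, 20 attachments and 50,000 GB come to 1,730.00 a month, and pulling 30,000 GB of internet-bound traffic off the TGW saves 600.00. NAT gateways carry their own hourly and data processing charges, so compare both paths before declaring victory.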
Monitoring and Observability
You can't manage what you can't measure. Essential metrics:
```python
# CloudWatch custom metric for route table size
import boto3


def check_route_table_size():
    ec2 = boto3.client('ec2')
    cloudwatch = boto3.client('cloudwatch')

    response = ec2.describe_transit_gateway_route_tables()

    for rt in response['TransitGatewayRouteTables']:
        rt_id = rt['TransitGatewayRouteTableId']
        # Tags come back as a list of {'Key': ..., 'Value': ...} dicts
        tags = {tag['Key']: tag['Value'] for tag in rt.get('Tags', [])}

        # Filters is a required argument. A single search returns at most
        # 1000 routes, so treat the count as a lower bound when
        # AdditionalRoutesAvailable is true.
        routes = ec2.search_transit_gateway_routes(
            TransitGatewayRouteTableId=rt_id,
            Filters=[{'Name': 'state', 'Values': ['active', 'blackhole']}],
            MaxResults=1000
        )

        cloudwatch.put_metric_data(
            Namespace='TGW/RouteTableSize',
            MetricData=[{
                'MetricName': 'RouteCount',
                'Value': len(routes['Routes']),
                'Dimensions': [
                    {'Name': 'RouteTableId', 'Value': rt_id},
                    {'Name': 'Environment', 'Value': tags.get('Environment', 'unknown')}
                ]
            }]
        )
```
Set alarms before route counts approach the quota (10,000 routes; check the quotas page for whether that counts per table or across the whole gateway).
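An alarm on that metric might look like the sketch below, which only builds the keyword arguments for CloudWatch's `put_metric_alarm`. The alarm name, the 80% warning fraction, and the function name are assumptions. It also assumes the published `RouteCount` reflects the full table: route searches are capped, so a truncated count would never reach this threshold.

```python
def route_count_alarm(route_table_id: str, environment: str,
                      quota: int = 10_000, warn_fraction: float = 0.8) -> dict:
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": f"tgw-route-count-{route_table_id}",
        "Namespace": "TGW/RouteTableSize",
        "MetricName": "RouteCount",
        # Dimensions must match the published metric exactly,
        # including the Environment dimension.
        "Dimensions": [
            {"Name": "RouteTableId", "Value": route_table_id},
            {"Name": "Environment", "Value": environment},
        ],
        "Statistic": "Maximum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": quota * warn_fraction,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }


print(route_count_alarm("tgw-rtb-xxx", "production")["Threshold"])
```

You would apply it with `boto3.client("cloudwatch").put_metric_alarm(**route_count_alarm("tgw-rtb-xxx", "production"))`. Building the arguments separately keeps the threshold logic testable without touching AWS.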
CI/CD for Network Changes
Network changes are scary. Here's a battle-tested CloudFormation workflow that's saved me from disasters:
Deploying CloudFormation with GitHub Actions
The CloudFormation-Specific Gotchas
After migrating from Terraform to CloudFormation for TGW management, here are the lessons learned:
- Change Sets Are Your Friend: Never update directly. Change sets show you exactly what will happen.
- The "No Updates" Trap: CloudFormation fails change sets with no changes. Handle this gracefully:

```bash
# REASON holds the change set's StatusReason; lowercase it before matching
# because the message text and capitalization can vary
if [[ "${REASON,,}" == *"no updates"* || "${REASON,,}" == *"didn't contain changes"* ]]; then
  echo "No changes needed"
  exit 0
fi
```

- Wait Conditions Matter: Don't just fire and forget. That `wait change-set-create-complete` has prevented many race conditions.
- S3 Templates for Large Stacks: Direct template body has size limits (51,200 bytes inline, versus 1 MB from S3). Always use S3 for production:

```bash
aws s3 cp template.yaml s3://bucket/templates/${{ github.sha }}.yaml
aws cloudformation create-change-set --template-url https://...
```

- GitHub Environments for Safety: The `environment: production` line is crucial. It creates a deployment gate:

```yaml
deploy:
  environment: production # This requires manual approval!
```

Combined with GitHub's deployment protection rules, this ensures someone with proper access must explicitly click "Deploy" before any changes hit production. No more accidental merges taking down the network at 3 AM.
Pro tip: Configure your environment with required reviewers who actually understand networking:
```text
# In your repo settings → Environments → production
- Required reviewers: @network-team
- Wait timer: 5 minutes (gives time to catch mistakes)
- Deployment branches: main only
```
The key: Always review change sets before execution. A misconfigured route table can take down production faster than you can say "rollback." The manual approval gate has saved us more times than I'd like to admit.
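If your pipeline steps are written in Python rather than shell, the "No Updates" trap from the list above can be handled the same way. This is a sketch: the marker strings are assumptions based on messages commonly seen in a change set's `StatusReason`, so verify them against what your own stacks return. The `Status` and `StatusReason` values would come from a `describe_change_set` response.

```python
NO_CHANGE_MARKERS = (
    "no updates are to be performed",
    "didn't contain changes",
)


def is_empty_change_set(status: str, status_reason: str) -> bool:
    """True when a FAILED change set failed only because nothing changed."""
    if status != "FAILED":
        return False
    reason = (status_reason or "").lower()
    return any(marker in reason for marker in NO_CHANGE_MARKERS)


print(is_empty_change_set("FAILED", "No updates are to be performed."))
print(is_empty_change_set("FAILED", "Template format error"))
```

Treating only these specific failures as success matters: any other `FAILED` reason should still stop the pipeline.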
The Politics of Network Architecture
Let's be real — the technical challenges are only half the battle. Here's how to navigate the organizational aspects:
Cost Attribution
```yaml
Parameters:
  CostCenter:
    Type: String
    Description: Cost center code for billing

  ChargebackCode:
    Type: String
    Description: Chargeback code for internal accounting

# Tag everything with cost centers
Resources:
  TransitGateway:
    Type: AWS::EC2::TransitGateway
    Properties:
      Tags:
        - Key: CostCenter
          Value: !Ref CostCenter
        - Key: ChargebackCode
          Value: !Ref ChargebackCode
        - Key: DataTransferTo
          Value: track-me-please
```
Finance will thank you when they can actually attribute the $10K/month TGW bill.
Access Control Patterns
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadOnlyTGW",
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeTransitGateways",
        "ec2:DescribeTransitGatewayRouteTables",
        "ec2:GetTransitGatewayRouteTableAssociations"
      ],
      "Resource": "*"
    },
    {
      "Sid": "ModifyOnlyOwnAttachments",
      "Effect": "Allow",
      "Action": [
        "ec2:ModifyTransitGatewayVpcAttachment",
        "ec2:DeleteTransitGatewayVpcAttachment"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "ec2:ResourceTag/Owner": "${aws:PrincipalTag/team}"
        }
      }
    }
  ]
}
```
Give teams autonomy while maintaining guardrails.
The TL;DR Checklist
If you take away nothing else:
- Disable default route propagation — Be explicit about routing
- Test asymmetric routing — It will happen, be ready
- Monitor route table size — You'll hit limits before you expect
- Use RAM carefully — Cross-account sharing is powerful but complex
- Tag everything — Future you will thank present you
- Automate with change sets — But always human-review route changes
- Plan for BGP — Even if you don't need it today
- Document security zones — Network segmentation as code
Transit Gateway isn't just a network hub — it's the backbone of your multi-account strategy. Treat it with the respect (and automation) it deserves.
Further Reading
- AWS Transit Gateway quotas - Know your limits
- TGW Network Manager - Visualization at scale
- Cross-region peering patterns - When one region isn't enough
Remember: In networking, as in life, it's not about avoiding all problems — it's about failing fast, debugging faster, and documenting everything for the next poor soul who has to maintain this.