| name | migration-compute |
| description | Cross-region migration for compute services including Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Container Service (Amazon ECS), and Amazon Elastic Kubernetes Service (Amazon EKS). Covers AZ failure recovery, AMI/snapshot migration, coldsnap EBS Direct API transfers, AWS MGN agent-based migration, WSFC/SQL FCI cluster recovery, FSx ONTAP iSCSI, container image replication, task definition migration, Kubernetes workload backup/restore, and IRSA re-association. |
Compute Services Migration
Security: Always ensure migrated resources meet or exceed the security configuration of the source resources. Refer to SECURITY.md for security requirements.
EC2 Migration Guide
Key Considerations
- Not all EC2 instance types are available in every Region and AZ. Confirm that your required instance types are supported in the destination Region and AZs, for example with:
aws ec2 describe-instance-type-offerings --location-type availability-zone --region <region>
- Instance store volumes are ephemeral and not preserved in AMI copies — use EBS-backed AMIs for persistent storage
- IP addresses change during migration; update DNS and security group references accordingly
- Reference: EC2 Regions and Availability Zones
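As a minimal sketch of the availability check above, the helper below lists the AZs in a Region that offer a given instance type. The function name offered_azs and the example values are illustrative assumptions; the describe-instance-type-offerings call and its instance-type filter are the standard AWS CLI ones.

```shell
# offered_azs REGION INSTANCE_TYPE
# Prints the AZ names in REGION that offer INSTANCE_TYPE, one per line, sorted.
# (Hypothetical helper name; wraps the standard describe-instance-type-offerings call.)
offered_azs() {
  local region="$1" instance_type="$2"
  aws ec2 describe-instance-type-offerings \
    --region "$region" \
    --location-type availability-zone \
    --filters "Name=instance-type,Values=${instance_type}" \
    --query 'InstanceTypeOfferings[].Location' \
    --output text | tr '\t' '\n' | sed '/^$/d' | sort
}

# Example (hypothetical values):
#   offered_azs us-east-1 r5.2xlarge
```

An empty result suggests the instance type is not offered in that Region, in which case you would pick a comparable type before copying AMIs there.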
What You Cannot Do When the AZ Is Down
- Create AMIs from affected instances
- Take EBS snapshots of affected volumes
- Use AWS Application Migration Service (AWS MGN) to migrate an instance that is not running. (This may still work if you can log on to the instance to install the AWS MGN agent and the instance can reach the AWS MGN endpoints, or if you migrated the instance before the AZ failure and have a snapshot available.)
- Detach EBS volumes (volumes are AZ-bound)
- Access affected instances via RDP, AWS Systems Manager, or Serial Console
- Collect inventory data from affected instances or resources
You can only work with what existed before the failure: AMIs, EBS snapshots, or AWS Backup recovery points.
Recovery Path
Do you have a recent AMI?
YES --> Step 1: Launch from AMI
NO --> Do you have EBS snapshots?
YES --> Step 2: Restore from snapshots
NO --> Is AWS Backup configured?
YES --> Step 3: Restore from AWS Backup
NO --> Step 4: Launch fresh instance (data loss)
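The decision tree above can be sketched as a small function, which may be handy in a runbook script. The function name recovery_step and its yes/no arguments are assumptions made for illustration; how you determine each answer (for example with aws ec2 describe-images --owners self or aws ec2 describe-snapshots --owner-ids self in a healthy Region) is left to your environment.

```shell
# recovery_step HAS_AMI HAS_SNAPSHOT HAS_BACKUP
# Each argument is "yes" or "no". Prints the first applicable recovery step,
# in the same priority order as the decision tree.
recovery_step() {
  local has_ami="$1" has_snapshot="$2" has_backup="$3"
  if [ "$has_ami" = "yes" ]; then
    echo "Step 1: Launch from AMI"
  elif [ "$has_snapshot" = "yes" ]; then
    echo "Step 2: Restore from snapshots"
  elif [ "$has_backup" = "yes" ]; then
    echo "Step 3: Restore from AWS Backup"
  else
    echo "Step 4: Launch fresh instance (data loss)"
  fi
}

# Example: no AMI, but snapshots exist
#   recovery_step no yes no
```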
Moving Between AZs or Regions (Once the Affected Region is Restored)
Failed AZ, moving to new region:
- With an AZ down, you do not have access to the instances and EBS volumes within that AZ.
- Large Snapshots/AMIs (multi-TB) can take hours to copy cross-region.
- You can recover from an EBS snapshot or AMI that was created before the event.
- Snapshot:
Security: Always use --encrypted --kms-key-id for all snapshot copies to maintain encryption at rest.
Encrypted snapshot:
This is the most common cross-region scenario. KMS keys are regional, so you need a key in the destination region and operational KMS and EBS services in the source region.
aws ec2 copy-snapshot \
--source-region us-west-2 \
--source-snapshot-id snap-0123456789abcdef0 \
--region us-east-1 \
--encrypted \
--kms-key-id arn:aws:kms:us-east-1:123456789012:key/abcd1234-5678-90ab-cdef-EXAMPLE \
--description "Re-encrypted with destination region key"
Security: Always use --encrypted --kms-key-id for all AMI copies to maintain encryption at rest.
Encrypted AMI:
Copy an encrypted AMI cross-region (re-encrypt with destination region KMS key)
aws ec2 copy-image \
--source-image-id ami-0123456789abcdef0 \
--source-region us-west-2 \
--name "Re-encrypted AMI in us-east-1" \
--region us-east-1 \
--encrypted \
--kms-key-id arn:aws:kms:us-east-1:123456789012:key/abcd1234-5678-90ab-cdef-EXAMPLE
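Because multi-TB copies can take hours, it may help to block until the copy finishes before launching anything from it. The sketch below wraps the built-in aws ec2 wait snapshot-completed waiter in a retry loop, since a single waiter call gives up after a fixed number of polls. The function name wait_for_snapshot and the default of 12 rounds are assumptions; note that the loop also retries when the waiter fails for reasons other than a timeout.

```shell
# wait_for_snapshot REGION SNAPSHOT_ID [MAX_ROUNDS]
# Re-runs the built-in waiter up to MAX_ROUNDS times (default 12, an arbitrary choice).
# Prints "completed" and returns 0 on success; prints "timed out" and returns 1 otherwise.
wait_for_snapshot() {
  local region="$1" snapshot_id="$2" max_rounds="${3:-12}" round=1
  while [ "$round" -le "$max_rounds" ]; do
    if aws ec2 wait snapshot-completed \
        --region "$region" --snapshot-ids "$snapshot_id"; then
      echo "completed"
      return 0
    fi
    echo "still waiting (round ${round}/${max_rounds})" >&2
    round=$((round + 1))
  done
  echo "timed out"
  return 1
}

# Example (hypothetical snapshot ID returned by copy-snapshot):
#   wait_for_snapshot us-east-1 snap-0fedcba9876543210
```

For AMI copies, the analogous built-in waiter is aws ec2 wait image-available --image-ids <ami-id>.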
Moving between AZs (Once the Affected Region is Restored)
Restore to a new EBS Volume
aws ec2 create-volume \
--snapshot-id snap-0123456789abcdef0 \
--availability-zone us-east-1a \
--volume-type gp3
Then attach it to an instance:
(AZ must match the target instance's AZ.)
aws ec2 attach-volume \
--volume-id vol-NEW_ID \
--instance-id i-0123456789abcdef0 \
--device /dev/xvdf
Alternatively, launch a replacement instance from an AMI into a subnet in a healthy AZ:
aws ec2 run-instances \
--image-id ami-0abcdef1234567890 \
--instance-type t3.micro \
--subnet-id subnet-0abcdef1234567890 \
--security-group-ids sg-0abcdef1234567890 \
--associate-public-ip-address \
--key-name MyKeyPair
- If something goes wrong during AMI or snapshot recovery, see the EC2 Troubleshooting documentation.
For instances that are still accessible within the impacted region
The steps below extract EC2 instances that are still running while all control plane functions are unavailable.
AWS Application Migration Service (AWS MGN)
If EC2 instances are still running in the impacted region while control plane functions are unavailable, you can use the AWS MGN replication agent to migrate their data to a new region. AWS MGN does not distinguish between on-premises and EC2 sources once the agent is installed, so treating your source EC2 instance as "on-prem" is simply the standard agent-based workflow pointed at a different target region. AWS MGN relies on the control plane of the DESTINATION region to operate.
This can be done by following these steps:
Step 1: Set up AWS MGN in the TARGET region. Use the Getting Started area in the new region.
Step 2: Verify that you can connect from the source region to AWS MGN in the target region.
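Consider what connectivity options you have out of the region based on your security groups and route tables.
The agent needs outbound TCP 443 to reach the AWS MGN endpoints, and replication needs TCP 1500 to the replication servers in the target region.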
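A quick way to sanity-check Step 2 from the source instance is sketched below. The helper names mgn_endpoint and can_reach are assumptions, as is the availability of bash and the GNU timeout utility on the source host. Only the TCP 443 path to the regional AWS MGN endpoint is checked here; the TCP 1500 path can only be tested once a replication server exists in the target region.

```shell
# mgn_endpoint REGION
# Prints the regional AWS MGN API hostname.
mgn_endpoint() {
  echo "mgn.${1}.amazonaws.com"
}

# can_reach HOST PORT
# Returns 0 if a TCP connection opens within 5 seconds.
# Assumes bash (for /dev/tcp) and GNU timeout are installed.
can_reach() {
  timeout 5 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Example (hypothetical target region):
#   if can_reach "$(mgn_endpoint us-east-1)" 443; then
#     echo "AWS MGN endpoint reachable"
#   else
#     echo "blocked: check security groups, NACLs, and route tables"
#   fi
```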
Step 3: Download & Install the Replication Agent
Install the agent for the source OS, and find the system listed as a Source Server in the destination region.
Step 4: Configure Launch Settings in MGN
Where possible, configure the same instance type as in the source region.
Step 5: Launch the test instance
The sync takes time to complete. Once complete, launch a test instance to confirm AWS MGN was able to replicate correctly.
Step 6: Cut over the instance.
The steps above extract an EC2 instance and all of its data. Also plan for DNS records and IP addresses, which will change, and for any resources the deployment depends on.
Manually dumping MySQL databases before data sync:
https://dev.mysql.com/doc/refman/8.4/en/mysqldump.html
mysqldump -h <DB-ENDPOINT> -u username -p --ssl-mode=VERIFY_IDENTITY --ssl-ca=/path/to/rds-combined-ca-bundle.pem --all-databases | gzip > /source/data/dbdump.db.gz
Security: Use --ssl-mode=VERIFY_IDENTITY with the RDS CA certificate bundle to verify the server certificate and prevent man-in-the-middle attacks. Download the bundle from https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/UsingWithRDS.SSL.html
Manually copying data between EC2 Linux instances using rsync (if APIs are not working)
Note: rsync must be installed on both the source and the destination.
ssh-keyscan -H <DESTINATION-IP> >> ~/.ssh/known_hosts
# sudo runs ssh as root, so point it at the known_hosts file populated above
sudo rsync -avzP -e "ssh -i /path/to/key.pem -o StrictHostKeyChecking=yes -o UserKnownHostsFile=$HOME/.ssh/known_hosts" \
/source/data/ \
ec2-user@<DESTINATION-IP>:/destination/data/
Security: Verify the SSH host key before first connection. Verify that the key file has restricted permissions (chmod 400). Consider using ssh-agent for key management instead of specifying key paths directly.
/path/to/key.pem is the private key for the destination instance.
Key flags:
-a — archive mode (preserves permissions, timestamps, symlinks)
-v — verbose
-z — compress during transfer
-P — show progress + allow resume of partial transfers
--delete — (optional) remove files on destination that do not exist on source
Manually copying data between EC2 Windows instances using Robocopy
Note: Windows-based file transfers using robocopy may not fully utilize available bandwidth on connections with latency >100ms due to default TCP window size settings. Consider testing transfer methods for your specific environment.
Security: Enable SMB encryption for robocopy transfers: Set-SmbServerConfiguration -EncryptData $true -Force. Verify destination volumes use encrypted EBS volumes.
Prerequisites
- Destination server must have file sharing enabled
- Source server needs network access to destination
- Account needs write permissions on destination
- PSRemoting must be enabled
Enable PSRemoting:
# On Both Servers:
Enable-PSRemoting -Force
# On Source Server:
Set-Item WSMan:\localhost\Client\TrustedHosts -Value "DestinationServerIPorName" -Force
# On Target Server:
Set-Item WSMan:\localhost\Client\TrustedHosts -Value "SourceServerIPorName" -Force
# On Both Servers:
Restart-Service WinRM
PowerShell Remote Method
# From your local machine, establish session to source server
$session = New-PSSession -ComputerName SourceServer -Credential (Get-Credential)
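# Execute robocopy on source server to copy to destination
# Note: reaching \\DestinationServer from inside this remote session is a second authentication hop.
# If robocopy reports access denied, run the same command over RDP instead (see RDP Method).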
Invoke-Command -Session $session -ScriptBlock {
robocopy "C:\SourcePath" "\\DestinationServer\C$\DestPath" /E /Z /R:3 /W:5 /MT:8 /LOG:C:\robocopy.log
}
# Check log
Invoke-Command -Session $session -ScriptBlock { Get-Content C:\robocopy.log -Tail 20 }
# Clean up
Remove-PSSession $session
RDP Method
- Connect via RDP, then run directly:
robocopy "C:\SourcePath" "\\DestinationServer\C$\DestPath" /E /Z /R:3 /W:5 /MT:8 /LOG:C:\robocopy.log
Key Flags
/E - Copy subdirectories including empty
/Z - Restartable mode
/R:3 - 3 retries on failed copies
/W:5 - 5 seconds wait between retries
/MT:8 - 8 threads for faster copy
/LOG - Output log file
Once Control Plane has been restored
Once the control plane is restored, you can still follow the guidance above, and these additional migration options become available:
- Use the AWSSupport-CopyEC2Instance runbook to automate moving an EC2 instance to a new subnet, AZ, or VPC, in the same or a different Region.
- For EC2 instances with encrypted EBS volumes, use the AMI copy + S3 method to avoid sharing KMS keys across Regions.
- Restore from AWS Backup
aws backup list-recovery-points-by-resource \
--resource-arn arn:aws:ec2:us-east-1:ACCOUNT:instance/i-XXXXXXXXXXXX
aws backup start-restore-job \
--recovery-point-arn arn:aws:backup:us-east-1:ACCOUNT:recovery-point:RP-ID \
--iam-role-arn arn:aws:iam::ACCOUNT:role/AWSBackupDefaultServiceRole \
--metadata '{"SubnetId":"subnet-healthy-az","SecurityGroupIds":"sg-XXXXXXXXXXXX","InstanceType":"r5.2xlarge"}'
Monitor the restore:
aws backup describe-restore-job --restore-job-id RESTORE-JOB-ID
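Rather than re-running the describe call by hand, the monitoring step can be sketched as a polling loop. The function name wait_for_restore and the 30-second default interval are assumptions; the Status values checked (COMPLETED, ABORTED, FAILED) are the terminal states reported by describe-restore-job.

```shell
# wait_for_restore RESTORE_JOB_ID [INTERVAL_SECONDS]
# Polls the restore job until it reaches a terminal state.
# Prints the final status; returns 0 on COMPLETED, 1 on ABORTED/FAILED,
# and 2 if the describe call itself fails.
wait_for_restore() {
  local job_id="$1" interval="${2:-30}" status
  while true; do
    status="$(aws backup describe-restore-job \
      --restore-job-id "$job_id" \
      --query 'Status' --output text)" || return 2
    case "$status" in
      COMPLETED) echo "$status"; return 0 ;;
      ABORTED|FAILED) echo "$status"; return 1 ;;
      *) sleep "$interval" ;;   # PENDING or RUNNING: keep polling
    esac
  done
}

# Example:
#   wait_for_restore RESTORE-JOB-ID 60
```

Once the job completes, the CreatedResourceArn field of the same describe-restore-job output identifies the restored instance.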
Post-Migration Checklist
[ ] OS boots cleanly (no dracut/emergency shell)
[ ] Instance responding to SSH
[ ] Filesystems mounted correctly (e.g., /etc/fstab uses UUIDs)
[ ] DNS resolution functional (host/dig www.amazon.com)
[ ] SSM agent online (SSM Session Manager)
[ ] Security groups applied correctly
[ ] Application services running (systemctl status checks)
[ ] Elastic IP reassociated (if applicable)
[ ] Route 53 / DNS records updated
[ ] Backup schedule re-enabled for new instance IDs
[ ] Monitoring and alerting updated for new instance IDs
[ ] Cron jobs / systemd timers verified
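The fstab item in the checklist can be partially automated. The sketch below flags fstab entries that mount by raw device path, which can break after migration because device names may differ on the new instance type. The function name non_uuid_fstab_entries and the set of device prefixes checked (sd, xvd, nvme) are assumptions and may need extending for your images.

```shell
# non_uuid_fstab_entries [FSTAB_PATH]
# Prints fstab lines whose device field is a raw /dev/sd*, /dev/xvd*, or /dev/nvme* path.
# Comment lines are skipped naturally because their first field starts with "#".
non_uuid_fstab_entries() {
  local fstab="${1:-/etc/fstab}"
  awk '$1 ~ /^\/dev\/(sd|xvd|nvme)/' "$fstab"
}

# Example: an empty result means no raw device paths were found
#   non_uuid_fstab_entries /etc/fstab
```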
WSFC / SQL FCI Cluster Recovery
This section applies only to Windows Server Failover Cluster and SQL Server Failover Cluster Instance environments.
If one cluster node survived
The surviving node should own the SQL FCI resources. Verify:
Get-ClusterNode | Format-Table Name, State
Get-ClusterGroup | Format-Table Name, State, OwnerNode
Get-Service MSSQLSERVER | Select-Object Status
Evict the dead node:
Remove-ClusterNode -Name "DEAD-NODE" -Force
Build a replacement node using Steps 1 through 4 of the Recovery Path above. After the instance is running and domain-joined:
# Install clustering features
Install-WindowsFeature -Name Failover-Clustering -IncludeManagementTools
Install-WindowsFeature -Name Multipath-IO
# Connect iSCSI to FSx ONTAP (if using shared storage)
# Run scripts/04-configure-iscsi-fsx.ps1
# Add to cluster
Add-ClusterNode -Name $env:COMPUTERNAME -Cluster "SQLFCI"
# Add to SQL FCI
# Run scripts/06-install-sql-fci.ps1 -Action AddNode
VIP behavior in VPC
Cluster VIPs do not float between subnets in AWS VPC. Each node has its own secondary IP on its ENI. The cluster updates DNS on failover.
After adding a node in a new subnet:
- Add a secondary IP to the node's ENI
- In Failover Cluster Manager, set the cluster IP resource to Static with the secondary IP
- Set Possible Owners to only the node that owns that IP
If both nodes were in the failed AZ
A full cluster rebuild is required. If using Amazon FSx for NetApp ONTAP multi-AZ, your LUNs and SQL data are still intact. Rebuild both nodes, recreate the cluster, reconnect iSCSI, and reinstall SQL FCI.
Amazon FSx for NetApp ONTAP and Failover Cluster Storage Recovery
This section covers the OS-level storage recovery when using FSx for NetApp ONTAP with iSCSI and Windows Server Failover Clustering.
Amazon FSx for NetApp ONTAP Behavior During AZ Failure
Amazon FSx for NetApp ONTAP multi-AZ has an active file server in one AZ and a standby in the other. If the active AZ fails, Amazon FSx automatically fails over to the standby (typically within seconds). The iSCSI endpoint IPs stay the same (floating IPs managed by FSx). Windows iSCSI sessions should reconnect automatically if configured with persistent connections.
# Check Amazon FSx file system status
aws fsx describe-file-systems --file-system-ids fs-XXXXXXXXXXXX \
--query "FileSystems[0].[Lifecycle,FileSystemType,StorageType]" --output table
# Check SVM status and iSCSI endpoints
aws fsx describe-storage-virtual-machines \
--filters "Name=file-system-id,Values=fs-XXXXXXXXXXXX" \
--query "StorageVirtualMachines[*].[Name,Lifecycle,Endpoints.Iscsi.IpAddresses]" --output table
Verifying iSCSI Reconnection After FSx Failover
On each cluster node:
Get-IscsiSession | Format-Table TargetNodeAddress, IsConnected, IsPersistent
Get-IscsiTargetPortal | Format-Table TargetPortalAddress, TargetPortalPortNumber
mpclaim -s -d
Get-Disk | Where-Object BusType -eq iSCSI | Format-Table Number, FriendlyName, OperationalStatus, Size
If iSCSI Sessions Did Not Reconnect
# Remove stale portals and re-add
Get-IscsiTargetPortal | Remove-IscsiTargetPortal -Confirm:$false
New-IscsiTargetPortal -TargetPortalAddress 10.16.36.42 -TargetPortalPortNumber 3260
New-IscsiTargetPortal -TargetPortalAddress 10.16.97.81 -TargetPortalPortNumber 3260
Start-Sleep -Seconds 5
$targets = Get-IscsiTarget
foreach ($target in $targets) {
    # Backticks continue each command across lines; without them PowerShell
    # treats the parameter lines as separate (invalid) statements.
    Connect-IscsiTarget -NodeAddress $target.NodeAddress `
        -TargetPortalAddress 10.16.36.42 -TargetPortalPortNumber 3260 `
        -IsPersistent $true -IsMultipathEnabled $true -ErrorAction SilentlyContinue
    Connect-IscsiTarget -NodeAddress $target.NodeAddress `
        -TargetPortalAddress 10.16.97.81 -TargetPortalPortNumber 3260 `
        -IsPersistent $true -IsMultipathEnabled $true -ErrorAction SilentlyContinue
}
Get-IscsiSession | Format-Table TargetNodeAddress, IsConnected
Cluster Disk Recovery
After iSCSI reconnects, cluster disks may show as Failed or Offline.
# Check cluster disk status
Get-ClusterResource | Where-Object ResourceType -eq "Physical Disk" | Format-Table Name, State, OwnerNode
# Start failed disks
Get-ClusterResource | Where-Object { $_.ResourceType -eq "Physical Disk" -and $_.State -eq "Failed" } | ForEach-Object {
Start-ClusterResource -Name $_.Name -ErrorAction SilentlyContinue
}
# If disks show Offline in Disk Management
Get-Disk | Where-Object { $_.BusType -eq "iSCSI" -and $_.OperationalStatus -eq "Offline" } | Set-Disk -IsOffline $false
Disk Signature Issues After Recovery
After snapshot restore or node rebuild, disk signature collisions can prevent disks from coming online.
# Check for offline disks
Get-Disk | Where-Object OperationalStatus -eq "Offline"
# Force online (verify correct disk first)
Set-Disk -Number X -IsOffline $false
Set-Disk -Number X -IsReadOnly $false
# For cluster disks, use cluster commands
Get-ClusterResource "Cluster Disk 1" | Start-ClusterResource
MPIO Path Verification
# Show all MPIO disks and paths (expect 2 paths per disk)
mpclaim -s -d
# Check load balance policy (should be RR)
Get-MSDSMGlobalDefaultLoadBalancePolicy
# Fix if needed
Set-MSDSMGlobalDefaultLoadBalancePolicy -Policy RR
# Detailed path status
mpclaim -v
SQL Server Recovery After Storage Reconnection
# Check SQL cluster resource
Get-ClusterResource | Where-Object ResourceType -eq "SQL Server" | Format-Table Name, State
# Check dependency chain
Get-ClusterResource "SQL Server" | Get-ClusterResourceDependency
# Start in order: disks, then network name, then SQL
Get-ClusterResource | Where-Object ResourceType -eq "Physical Disk" | Start-ClusterResource
Start-Sleep -Seconds 10
Start-ClusterResource "SQL Network Name (SQLFCI-SQL)"
Start-Sleep -Seconds 5
Start-ClusterResource "SQL Server"
# Verify
Invoke-Sqlcmd -Query "SELECT @@SERVERNAME, @@VERSION" -ServerInstance "SQLFCI-SQL"
Invoke-Sqlcmd -Query "SELECT name, state_desc FROM sys.databases" -ServerInstance "SQLFCI-SQL"
FSx ONTAP REST API Verification
Security: The examples below retrieve credentials from AWS Secrets Manager into an environment variable. Do not hardcode passwords in scripts or pass them as literal strings in commands. The $PASSWORD variable below is populated securely from Secrets Manager. Use proper certificate verification instead of -k. See SECURITY.md.
# Get management IP
aws fsx describe-file-systems --file-system-ids fs-XXXXXXXXXXXX \
--query "FileSystems[0].OntapConfiguration.Endpoints.Management.IpAddresses[0]" \
--output text
# Check LUN status (retrieve password from Secrets Manager in production)
PASSWORD=$(aws secretsmanager get-secret-value --secret-id fsxadmin-password --query SecretString --output text)
curl -u "fsxadmin:$PASSWORD" \
"https://MGMT-IP/api/storage/luns?svm.name=svm-sql&fields=status,name,space"
# Check igroup mappings
curl -u "fsxadmin:$PASSWORD" \
"https://MGMT-IP/api/protocols/san/igroups?svm.name=svm-sql&fields=initiators,lun_maps"
# Check iSCSI service
curl -u "fsxadmin:$PASSWORD" \
"https://MGMT-IP/api/protocols/san/iscsi/services?svm.name=svm-sql"
Coldsnap: EBS Direct API Snapshot Transfers
What coldsnap does:
- Uploads local disk images directly into EBS snapshots using the EBS Direct APIs — no EC2 instance or volume attachment needed.
- Downloads EBS snapshots to local files for out-of-band transfer.
- Useful in automated pipelines, when the control plane is impaired, or when you need to move disk images without launching instances.
- Source: github.com/awslabs/coldsnap (Apache-2.0, Rust)
Installation
cargo install --locked coldsnap
Upload Local Disk Image to EBS Snapshot
coldsnap upload --wait disk.img
coldsnap upload disk.img --kms-key-id arn:aws:kms:<region>:<account>:key/<key-id>
coldsnap upload disk.img --tag "Key=Environment,Value=DR" --tag "Key=Source,Value=migration"
Download EBS Snapshot to Local File
coldsnap download snap-1234 disk.img
Cross-Region Migration with Coldsnap
AWS_DEFAULT_REGION=<source-region> coldsnap download snap-1234 disk.img
AWS_DEFAULT_REGION=<target-region> coldsnap upload --wait disk.img
aws ec2 create-volume --snapshot-id <new-snap-id> --availability-zone <target-az> --region <target-region>
Wait for Snapshot
coldsnap wait snap-1234
Credentials
Coldsnap uses the same credential chain as the AWS CLI (~/.aws/credentials, environment variables, instance profiles). Use --profile <name> to select a specific profile.
Amazon ECS Migration Guide
Key Considerations
- ECS clusters cannot be moved to another region. You must recreate the cluster, task definitions, and services in the target region.
- The ECS data plane is independent from the control plane. Running tasks on healthy nodes continue to operate even when the control plane is unreachable.
- Container images must be available in the target region before services can be created. If ECR in the source region is inaccessible, plan to rebuild images from your CI/CD pipeline or pull from an alternate registry.
- Stateful workloads backed by EFS continue working across AZ failures because EFS is a regional service — only the mount target is AZ-specific. Cross-region EFS data migration is not currently possible.
- Stateful workloads backed by EBS (EC2 launch type only) are AZ-specific and cannot move automatically.
- IAM roles are global, but verify they exist and hold the correct permissions before creating services in the target region.
- Secrets Manager secrets and SSM parameters are regional. Export and recreate them in the target region before registering task definitions.
What You CANNOT Do When the ECS Control Plane Is Down
- Stop, start, or restart tasks
- Update services or change desired count
- Register new task definitions
- Create new services or clusters
- Access Amazon ECR in the affected region
Running tasks on healthy nodes continue to operate. If the ECS control plane is completely inaccessible and you cannot run any API or CLI commands, open an AWS Support case (Critical severity) and provide your cluster names, account ID, and target region.
Recovery Path
Is the ECS control plane reachable and at least one AZ healthy?
YES --> Stream 1: Recover in-place (partial AZ failure)
NO --> Is this a full region outage?
YES --> Stream 2: Migrate to another region
NO --> Open an AWS Support case (Critical severity)
Moving Between AZs (Partial AZ Failure)
Use this path when the ECS control plane is still reachable and at least one AZ is healthy. The goal is to stop scheduling tasks in the failed AZ and reschedule them in the remaining healthy AZs.
Step 1: Identify tasks running in the affected AZ
aws ecs list-tasks \
--region SOURCE-REGION \
--cluster YOUR-CLUSTER-NAME
aws ecs describe-tasks \
--region SOURCE-REGION \
--cluster YOUR-CLUSTER-NAME \
--tasks TASK-ARN-1 TASK-ARN-2 \
--query 'tasks[*].[taskArn,availabilityZone,lastStatus]'
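Step 1 can be narrowed to just the affected AZ with a JMESPath filter. This is a sketch under stated assumptions: the function name tasks_in_az is hypothetical, and it relies on shell word splitting to pass the task ARNs, so it handles at most the 100 ARNs that describe-tasks accepts per call. Larger clusters would need batching.

```shell
# tasks_in_az REGION CLUSTER AZ_NAME
# Prints the ARNs of tasks in CLUSTER that are running in AZ_NAME, one per line.
tasks_in_az() {
  local region="$1" cluster="$2" az="$3" task_arns
  task_arns="$(aws ecs list-tasks --region "$region" --cluster "$cluster" \
    --query 'taskArns' --output text)"
  if [ -z "$task_arns" ] || [ "$task_arns" = "None" ]; then
    return 0   # no tasks in the cluster
  fi
  # $task_arns is intentionally unquoted so each ARN becomes its own argument.
  aws ecs describe-tasks --region "$region" --cluster "$cluster" \
    --tasks $task_arns \
    --query "tasks[?availabilityZone=='${az}'].taskArn" \
    --output text | tr '\t' '\n'
}

# Example (hypothetical values):
#   tasks_in_az me-central-1 YOUR-CLUSTER-NAME me-central-1a
```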
Step 2: Remove the failed AZ from service networking (Fargate launch type)
Update the service to remove the affected AZ's subnet:
aws ecs update-service \
--region SOURCE-REGION \
--cluster YOUR-CLUSTER-NAME \
--service YOUR-SERVICE-NAME \
--network-configuration "awsvpcConfiguration={subnets=[subnet-HEALTHY-AZ1,subnet-HEALTHY-AZ2],securityGroups=[sg-xxxxxxxxx],assignPublicIp=DISABLED}"
Step 3: Exclude the failed AZ via placement constraints (EC2 launch type)
aws ecs update-service \
--region SOURCE-REGION \
--cluster YOUR-CLUSTER-NAME \
--service YOUR-SERVICE-NAME \
--placement-constraints type=memberOf,expression="attribute:ecs.availability-zone != FAILED-AZ-ID"
Replace FAILED-AZ-ID with the actual failed AZ identifier (for example, me-central-1a).
Step 4: Force redeployment to reschedule tasks
aws ecs update-service \
--region SOURCE-REGION \
--cluster YOUR-CLUSTER-NAME \
--service YOUR-SERVICE-NAME \
--force-new-deployment
ECS stops tasks in the failed AZ and reschedules them in the healthy AZs.
Step 5: Verify service stability
aws ecs describe-services \
--region SOURCE-REGION \
--cluster YOUR-CLUSTER-NAME \
--services YOUR-SERVICE-NAME \
--query 'services[*].[serviceName,runningCount,desiredCount,status]'
Notes:
- Verify that healthy AZs have sufficient capacity to absorb rescheduled tasks. Request a limit increase proactively if needed.
- If tasks use EBS volumes (EC2 launch type only), those volumes are AZ-specific and cannot be moved automatically.
- If the desired count cannot be met in the remaining AZs, temporarily reduce it to match available capacity, then scale back up once stable.
Moving to a Different Region (Full Region Loss)
Recreate your ECS infrastructure in the target region. Complete all steps in order. The key dependency chain is: images in ECR, then task definitions, then VPC and supporting infrastructure, then cluster, then services.
Before you start:
- Document your current ECS cluster configuration.
- Identify container images and their registry locations.
- Note all task definitions, services, and their configurations.
- Identify your target recovery region.
- Verify you have the necessary IAM permissions in the target region.
- Document load balancer and networking configurations.
Step 1: Copy container images to the target region ECR
Option A: Pull from source region and push to target region
Create the repository in the target region:
aws ecr create-repository \
--repository-name YOUR-REPO-NAME \
--region TARGET-REGION
Authenticate, pull, retag, and push:
aws ecr get-login-password --region SOURCE-REGION | \
docker login --username AWS --password-stdin ACCOUNT-ID.dkr.ecr.SOURCE-REGION.amazonaws.com
docker pull ACCOUNT-ID.dkr.ecr.SOURCE-REGION.amazonaws.com/YOUR-REPO-NAME:TAG
docker tag ACCOUNT-ID.dkr.ecr.SOURCE-REGION.amazonaws.com/YOUR-REPO-NAME:TAG \
ACCOUNT-ID.dkr.ecr.TARGET-REGION.amazonaws.com/YOUR-REPO-NAME:TAG
aws ecr get-login-password --region TARGET-REGION | \
docker login --username AWS --password-stdin ACCOUNT-ID.dkr.ecr.TARGET-REGION.amazonaws.com
docker push ACCOUNT-ID.dkr.ecr.TARGET-REGION.amazonaws.com/YOUR-REPO-NAME:TAG
Option B: Rebuild from your CI/CD pipeline
Trigger your existing pipeline (CodePipeline, GitHub Actions, Jenkins) targeting the recovery region directly. This is the most reliable option if your pipeline is decoupled from the affected region.
Option C: Pull from an alternate registry
If images were previously pushed to ECR Public, Docker Hub, or another registry:
# From ECR Public
docker pull public.ecr.aws/YOUR-ALIAS/YOUR-REPO:TAG
docker tag public.ecr.aws/YOUR-ALIAS/YOUR-REPO:TAG \
ACCOUNT-ID.dkr.ecr.TARGET-REGION.amazonaws.com/YOUR-REPO:TAG
aws ecr get-login-password --region TARGET-REGION | \
docker login --username AWS --password-stdin \
ACCOUNT-ID.dkr.ecr.TARGET-REGION.amazonaws.com
docker push ACCOUNT-ID.dkr.ecr.TARGET-REGION.amazonaws.com/YOUR-REPO:TAG
Check whether ECR cross-region replication was pre-configured — images may already be available:
aws ecr describe-repositories --region TARGET-REGION
aws ecr list-images --region TARGET-REGION --repository-name YOUR-REPO-NAME
Recommendation going forward: Enable ECR cross-region replication proactively so images are always available in your DR region without manual intervention. See ECR replication documentation.
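To illustrate that recommendation, the sketch below builds the JSON document expected by aws ecr put-replication-configuration for a single destination. The helper name ecr_replication_config is an assumption. Replication applies to images pushed after the configuration is in place, so existing images would still need one of the options above.

```shell
# ecr_replication_config TARGET_REGION ACCOUNT_ID
# Prints a registry replication configuration with one destination rule.
ecr_replication_config() {
  printf '{"rules":[{"destinations":[{"region":"%s","registryId":"%s"}]}]}\n' "$1" "$2"
}

# Example: apply in the source (primary) region once it is healthy
#   aws ecr put-replication-configuration \
#     --region SOURCE-REGION \
#     --replication-configuration "$(ecr_replication_config TARGET-REGION ACCOUNT-ID)"
```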
Step 2: Export task definitions from the source region
aws ecs list-task-definitions --region SOURCE-REGION
aws ecs describe-task-definition \
--region SOURCE-REGION \
--task-definition YOUR-TASK-DEFINITION:REVISION \
--query 'taskDefinition' > task-definition.json
Save these JSON files. You need them to register task definitions in the target region.
See Task Definitions documentation.
Step 3: Document service configurations
aws ecs list-services \
--region SOURCE-REGION \
--cluster YOUR-CLUSTER-NAME
aws ecs describe-services \
--region SOURCE-REGION \
--cluster YOUR-CLUSTER-NAME \
--services YOUR-SERVICE-NAME > service-config.json
For each service, record: service name, task definition and revision, desired count, launch type, network configuration, load balancer configuration, auto scaling settings, and service discovery settings.
Step 4: Migrate secrets and configuration
Copy Secrets Manager secrets:
# Retrieve secret value from source region
aws secretsmanager get-secret-value \
--region SOURCE-REGION \
--secret-id YOUR-SECRET-NAME \
--query 'SecretString' --output text > secret.txt
# Create secret in target region
aws secretsmanager create-secret \
--region TARGET-REGION \
--name YOUR-SECRET-NAME \
--secret-string file://secret.txt
Copy SSM Parameter Store parameters:
# Retrieve parameter from source region
aws ssm get-parameter \
--region SOURCE-REGION \
--name YOUR-PARAMETER-NAME \
--with-decryption \
--query 'Parameter.Value' --output text > parameter.txt
# Create parameter in target region
aws ssm put-parameter \
--region TARGET-REGION \
--name YOUR-PARAMETER-NAME \
--value file://parameter.txt \
--type SecureString
EFS note: For a partial AZ failure, update your service subnet configuration to use subnets in healthy AZs and EFS access resumes automatically via the healthy mount targets. For a full region migration, cross-region EFS data migration is not currently possible.
Step 5: Create VPC and networking in the target region
The target region requires: an Amazon Virtual Private Cloud (Amazon VPC) with public and private subnets across multiple AZs, an Internet Gateway, NAT Gateways for private subnets, route tables, and security groups matching your source configuration.
aws ec2 create-vpc \
--region TARGET-REGION \
--cidr-block 10.0.0.0/16 \
--tag-specifications 'ResourceType=vpc,Tags=[{Key=Name,Value=ecs-recovery-vpc}]'
# Repeat for each AZ
aws ec2 create-subnet \
--region TARGET-REGION \
--vpc-id vpc-xxxxxxxxx \
--cidr-block 10.0.1.0/24 \
--availability-zone TARGET-REGION-AZ
aws ec2 create-security-group \
--region TARGET-REGION \
--group-name ecs-tasks-sg \
--description "Security group for ECS tasks" \
--vpc-id vpc-xxxxxxxxx
Step 6: Verify IAM roles in the target region
IAM roles are global. Verify that your ECS task execution role, task role, service role, and auto scaling role exist and have correct policies. See ECS IAM Roles documentation.
Step 7: Create the ECS cluster in the target region
aws ecs create-cluster \
--region TARGET-REGION \
--cluster-name YOUR-CLUSTER-NAME
See Creating ECS Clusters documentation.
Step 8: Create the Application Load Balancer (if needed)
Create an ALB with at least two AZ subnets, configure HTTP/HTTPS listeners, and create a target group with IP target type for Fargate or instance target type for EC2 launch type. See ALB documentation.
Step 9: Register task definitions in the target region
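Before registering, update the exported task definition JSON:
- Remove the read-only fields that describe-task-definition returns and register-task-definition rejects: taskDefinitionArn, revision, status, requiresAttributes, compatibilities, registeredAt, and registeredBy.
- Change container image URIs to point to the target region ECR repositories.
- Update Secrets Manager ARNs to reference the target region secrets.
- Update CloudWatch log group names (and the awslogs-region log option) or create new log groups in the target region.
- Verify the task execution role ARN and task role ARN are correct.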
aws ecs register-task-definition \
--region TARGET-REGION \
--cli-input-json file://task-definition.json
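Before running the register command, the image URI update can be scripted. The sketch below rewrites only private ECR hostnames from one region to another; the function name retarget_ecr_images is an assumption. It deliberately leaves Secrets Manager ARNs alone, because secrets recreated in the target region get different ARNs and have to be looked up rather than rewritten.

```shell
# retarget_ecr_images SOURCE_REGION TARGET_REGION
# Filter: reads JSON on stdin and rewrites ACCOUNT.dkr.ecr.SOURCE_REGION.amazonaws.com
# hostnames to the TARGET_REGION equivalent on stdout.
retarget_ecr_images() {
  sed "s/\.dkr\.ecr\.${1}\.amazonaws\.com/.dkr.ecr.${2}.amazonaws.com/g"
}

# Example (hypothetical file names):
#   retarget_ecr_images us-west-2 us-east-1 < task-definition.json > task-definition.target.json
```

If jq is available, the output-only fields returned by describe-task-definition could be stripped in the same pipeline, for example with jq 'del(.taskDefinitionArn, .revision, .status, .requiresAttributes, .compatibilities, .registeredAt, .registeredBy)'.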
Step 10: Create ECS services in the target region
aws ecs create-service \
--region TARGET-REGION \
--cluster YOUR-CLUSTER-NAME \
--service-name YOUR-SERVICE-NAME \
--task-definition YOUR-TASK-DEFINITION:REVISION \
--desired-count 2 \
--launch-type FARGATE \
--network-configuration "awsvpcConfiguration={subnets=[subnet-AZ1,subnet-AZ2],securityGroups=[sg-xxxxxxxxx],assignPublicIp=DISABLED}" \
--load-balancers "targetGroupArn=arn:aws:elasticloadbalancing:TARGET-REGION:ACCOUNT-ID:targetgroup/...,containerName=YOUR-CONTAINER,containerPort=8080"
See Creating ECS Services documentation.
Step 11: Configure service auto scaling
aws application-autoscaling register-scalable-target \
--region TARGET-REGION \
--service-namespace ecs \
--scalable-dimension ecs:service:DesiredCount \
--resource-id service/YOUR-CLUSTER-NAME/YOUR-SERVICE-NAME \
--min-capacity 1 \
--max-capacity 10
See Service Auto Scaling documentation.
Step 12: Update DNS and route traffic
In Route 53, update the A or CNAME record to point to the new load balancer DNS name. Set a low TTL (60 seconds) before making the change to speed up propagation.
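A minimal sketch of the DNS change, assuming a CNAME record (a record at the zone apex would need an alias A record instead). The helper name cname_upsert_batch, the record name, and the load balancer DNS name in the example are all hypothetical; the change batch structure is the one accepted by aws route53 change-resource-record-sets.

```shell
# cname_upsert_batch RECORD_NAME TARGET_DNS_NAME [TTL_SECONDS]
# Prints a Route 53 change batch that creates or updates a CNAME record.
# TTL defaults to 60 seconds, matching the low TTL suggested above.
cname_upsert_batch() {
  local name="$1" target="$2" ttl="${3:-60}"
  printf '{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{"Name":"%s","Type":"CNAME","TTL":%s,"ResourceRecords":[{"Value":"%s"}]}}]}\n' \
    "$name" "$ttl" "$target"
}

# Example (hypothetical zone ID and names):
#   aws route53 change-resource-record-sets \
#     --hosted-zone-id ZONE-ID \
#     --change-batch "$(cname_upsert_batch app.example.com my-alb-123456.TARGET-REGION.elb.amazonaws.com)"
```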
Step 13: Verify and test
Run these checks before switching production traffic:
# Check all services show desired count = running count
aws ecs describe-services \
--region TARGET-REGION \
--cluster YOUR-CLUSTER-NAME \
--services YOUR-SERVICE-NAME \
--query 'services[*].[serviceName,runningCount,desiredCount,status]'
# Check load balancer target health
aws elbv2 describe-target-health \
--region TARGET-REGION \
--target-group-arn YOUR-TARGET-GROUP-ARN
Also verify: logs are flowing to CloudWatch, secrets and environment variables are correct, database connectivity is working, and auto scaling triggers are configured.
Alternative: Redeploy using Infrastructure as Code
If you have Terraform:
# Update provider region in your configuration, then run:
terraform plan
terraform apply
If you have CloudFormation:
aws cloudformation create-stack \
--region TARGET-REGION \
--stack-name ecs-recovery-stack \
--template-body file://ecs-stack.yaml \
--capabilities CAPABILITY_IAM
Post-Recovery Checklist
[ ] All services showing desired count = running count
[ ] Load balancer target group health checks passing
[ ] Logs flowing to CloudWatch Log Groups
[ ] CloudWatch alarms configured for service failures
[ ] CloudWatch Logs retention configured
[ ] DNS / Route 53 records updated to new load balancer
[ ] Auto scaling thresholds verified
[ ] Secrets and environment variables confirmed correct
[ ] Database connectivity tested
[ ] AWS Backup or snapshot schedule re-enabled
[ ] Monitoring dashboards updated
[ ] Disaster recovery documentation updated
Amazon EKS Migration Guide
Key Considerations
- Amazon Elastic Kubernetes Service (Amazon EKS) clusters cannot be moved. You must create a new cluster in the target region and restore workloads into it.
- The Kubernetes data plane is independent from the control plane. Running pods on healthy nodes continue to operate even when the EKS API server is unreachable.
- IRSA (IAM Roles for Service Accounts) must be re-associated with the new cluster's OIDC provider before triggering any restore. Skipping this causes silent AccessDenied errors on all AWS API calls from pods.
- EKS add-ons (CSI drivers, CoreDNS, kube-proxy, VPC CNI) must be installed on the new cluster before restoring workloads. Persistent volume mounts fail if CSI drivers are not present at restore time.
- Fargate profiles are cluster-specific. If using EKS Fargate, recreate profiles on the new cluster. Drain and cordon steps do not apply to Fargate nodes.
- Kubernetes Secrets exported with kubectl are only base64-encoded, not encrypted. Handle exported secret files securely and delete them after use.
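- EFS data cannot be copied out of a region that is unavailable. For an AZ failure, use mount targets in healthy AZs. For a full region migration, EFS-backed workloads can be migrated with their persistent data only if EFS replication or an AWS Backup cross-region copy to the target region was configured before the failure.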
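To see why an exported Secret needs the same care as plaintext, the short sketch below encodes and decodes a sample value without any key. The value `example-password` is made up for illustration, and `base64 -d` assumes GNU coreutils or a recent macOS (older macOS releases use `-D`).

```shell
#!/bin/sh
# Encode a sample value the way Kubernetes stores Secret data, then decode it.
# No key is involved at any point, so anyone holding the export can read it.
encoded=$(printf '%s' 'example-password' | base64)
decoded=$(printf '%s' "$encoded" | base64 -d)
echo "encoded: $encoded"
echo "decoded: $decoded"
```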
What You CANNOT Do When the EKS Control Plane Is Down
- Run kubectl commands of any kind
- Schedule new pods or trigger deployments
- Perform rolling updates or rollbacks
- Access the EKS console or API server
Running pods on healthy nodes continue to operate. If the EKS API server is completely inaccessible and cannot be restored, open an AWS Support case (Critical severity) with your cluster ARN, account ID, and target region.
Recovery Path
Is the EKS API server reachable?
  YES --> Are any nodes in the failed AZ?
    YES --> Drain affected nodes, exclude failed AZ from node provisioning
    NO  --> No action needed; workloads are already on healthy nodes
  NO --> Is this a partial AZ failure or full region loss?
    PARTIAL AZ  --> Wait for API access to recover, then drain affected nodes
    FULL REGION --> Migrate to another region (steps below)
Moving Between AZs (Partial AZ Failure)
Step 1: Identify nodes in the failed AZ and cordon them
# List nodes and their AZs
kubectl get nodes --label-columns topology.kubernetes.io/zone
# Prevent new pods from scheduling on affected nodes
kubectl cordon NODE-NAME
# Evict all running pods from affected nodes
kubectl drain NODE-NAME --ignore-daemonsets --delete-emptydir-data
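When many nodes are affected, the per-node commands above can be wrapped in a loop. This is a sketch under the assumption that the zone label is the last column of the `kubectl get nodes` output; the function names `nodes_in_zone` and `drain_zone` are hypothetical.

```shell
#!/bin/sh
# Print the names of nodes whose zone column matches the given zone.
# Expects the output of
#   kubectl get nodes --no-headers --label-columns topology.kubernetes.io/zone
# on stdin, where the zone is the last column.
nodes_in_zone() {
  awk -v zone="$1" '$NF == zone { print $1 }'
}

# Cordon and drain every node in the failed zone, one node at a time.
drain_zone() {
  failed_zone=$1
  kubectl get nodes --no-headers --label-columns topology.kubernetes.io/zone |
    nodes_in_zone "$failed_zone" |
    while read -r node; do
      # stdin is redirected so kubectl cannot consume the node list
      kubectl cordon "$node" < /dev/null
      kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data < /dev/null
    done
}

# Usage (the zone name is a placeholder):
#   drain_zone us-east-1a
```

Draining one node at a time keeps the eviction rate low; pods covered by a PodDisruptionBudget may still block a drain until replacements are ready elsewhere.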
Step 2: Prevent new nodes from being created in the failed AZ
Managed Node Groups: Create a new node group that does not include the failed AZ's subnets, then cordon and drain the old node group to migrate workloads.
aws eks create-nodegroup \
--cluster-name YOUR-CLUSTER \
--nodegroup-name recovery-nodegroup \
--subnets subnet-HEALTHY-AZ1 subnet-HEALTHY-AZ2 \
--instance-types INSTANCE-TYPE \
--ami-type AMI-TYPE \
--scaling-config minSize=MIN-SIZE,maxSize=MAX-SIZE,desiredSize=DESIRED-SIZE \
--region SOURCE-REGION
See Migrating to a new node group.
Karpenter: Patch the NodePool to exclude the failed AZ:
kubectl patch nodepool default --type='json' \
-p='[{"op": "add", "path": "/spec/template/spec/requirements/-",
"value": {"key": "topology.kubernetes.io/zone",
"operator": "NotIn", "values": ["FAILED-AZ-ID"]}}]'
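Replace FAILED-AZ-ID with the name of the failed zone as it appears in the topology.kubernetes.io/zone node label (for example, us-east-1a), not the zone ID (for example, use1-az1).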
Step 3: Handle stateful workloads in the failed AZ
EBS-backed StatefulSets: EBS volumes are AZ-specific. Take a volume snapshot and use the snapshot as the data source for a new PVC in a healthy AZ. See Migrating EKS clusters from gp2 to gp3 EBS volumes for the snapshot-based migration pattern.
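EFS-backed StatefulSets: EFS is regional. Only the mount target is AZ-specific. Make sure a mount target exists in each healthy AZ and that pods are rescheduled onto nodes in those AZs. If a PersistentVolume is pinned to a specific mount target, update it to one in a healthy AZ. No data migration is required.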
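For the EBS case, the snapshot-then-restore pattern could look like the sketch below. It assumes the CSI snapshot controller and a VolumeSnapshotClass are installed, and that the source volume is still reachable; if it is not, only a snapshot taken before the failure can be used. The function name and every argument value are hypothetical.

```shell
#!/bin/sh
# Print a VolumeSnapshot for an existing PVC and a new PVC restored from it.
# Arguments: source PVC name, VolumeSnapshotClass, StorageClass, size.
snapshot_restore_manifests() {
  source_pvc=$1
  snapshot_class=$2
  storage_class=$3
  size=$4
  cat <<EOF
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: ${source_pvc}-snap
spec:
  volumeSnapshotClassName: ${snapshot_class}
  source:
    persistentVolumeClaimName: ${source_pvc}
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${source_pvc}-restored
spec:
  storageClassName: ${storage_class}
  accessModes: ["ReadWriteOnce"]
  dataSource:
    name: ${source_pvc}-snap
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  resources:
    requests:
      storage: ${size}
EOF
}

# Usage (all names are placeholders):
#   snapshot_restore_manifests data-db-0 SNAPSHOT-CLASS STORAGE-CLASS 20Gi |
#     kubectl apply -n YOUR-NAMESPACE -f -
```

With a StorageClass that uses `volumeBindingMode: WaitForFirstConsumer`, the restored volume is created in the AZ where the consuming pod is scheduled, which is how it ends up in a healthy AZ.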
Moving to a Different Region (Full Region Loss)
Step 1: Export workload manifests
Export manifests before attempting any backup or restore. This provides a portable fallback if AWS Backup is unavailable in the affected region.
# Export all workload resources across all namespaces
kubectl get deploy,svc,configmap,ingress,hpa,statefulset,daemonset \
-A -o yaml > all-workloads.yaml
# Export Kubernetes Secrets (filter out system secrets before applying to new cluster)
# WARNING: Secrets are only base64-encoded, not encrypted. You MUST encrypt the output file.
# Verify GPG is configured with a recipient key before proceeding.
kubectl get secrets -A -o yaml | gpg --encrypt -r <key-id> > secrets-backup.yaml.gpg
# Export ConfigMaps
kubectl get configmaps -A -o yaml > configmaps-backup.yaml
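The Secret values inside the encrypted file are still only base64-encoded, not encrypted by Kubernetes. Protect the GPG private key, never write an unencrypted copy to disk, and delete all exported files once the migration is complete.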
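Because an export with `-A` also captures system namespaces, one alternative is to export per namespace and skip the system ones up front. The sketch below assumes that `kube-system`, `kube-public`, and `kube-node-lease` are the only namespaces to exclude; the function names and the output directory are hypothetical.

```shell
#!/bin/sh
# Read namespace names on stdin (one per line) and drop the namespaces that
# Kubernetes creates for its own use.
filter_system_namespaces() {
  grep -v -x -e kube-system -e kube-public -e kube-node-lease
}

# Export workload resources one namespace at a time into OUT_DIR/<namespace>.yaml.
export_user_workloads() {
  out_dir=$1
  mkdir -p "$out_dir"
  kubectl get namespaces -o name | sed 's|^namespace/||' |
    filter_system_namespaces |
    while read -r ns; do
      kubectl get deploy,svc,configmap,ingress,hpa,statefulset,daemonset \
        -n "$ns" -o yaml > "$out_dir/$ns.yaml" < /dev/null
    done
}

# Usage (the directory name is a placeholder):
#   export_user_workloads ./workload-export
```

One file per namespace also makes it easier to restore namespaces selectively on the new cluster.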
Step 2: Back up the cluster using AWS Backup (if accessible)
If AWS Backup is operational in the source region:
aws backup start-backup-job \
--region SOURCE-REGION \
--backup-vault-name YOUR-VAULT-NAME \
--resource-arn arn:aws:eks:SOURCE-REGION:ACCOUNT-ID:cluster/YOUR-CLUSTER \
--iam-role-arn arn:aws:iam::ACCOUNT-ID:role/AWSBackupDefaultServiceRole
See AWS Backup for EKS.
Step 3: Prepare the new cluster in the target region
Complete all three preparation steps before triggering any restore. Restoring without completing them causes IRSA failures and persistent volume mount errors.
3a. Create the new EKS cluster
Use eksctl, the AWS Console, or your existing IaC targeting the new region.
eksctl create cluster \
--name NEW-CLUSTER-NAME \
--region TARGET-REGION \
--version KUBERNETES-VERSION
3b. Associate the OIDC provider and update IRSA trust policies
Security: Customers are responsible for configuring IAM roles for service accounts, managing OIDC provider associations, and verifying trust policies follow least-privilege principles. Review each IRSA role's permissions during migration to remove any unnecessary access.
# Get the new cluster's OIDC issuer URL
aws eks describe-cluster \
--name NEW-CLUSTER-NAME \
--region TARGET-REGION \
--query "cluster.identity.oidc.issuer" --output text
# Associate the OIDC provider
eksctl utils associate-iam-oidc-provider \
--cluster NEW-CLUSTER-NAME \
--region TARGET-REGION --approve
# For each IRSA role: retrieve the trust policy, update the OIDC ARN, then apply
aws iam get-role --role-name YOUR-IRSA-ROLE \
--query "Role.AssumeRolePolicyDocument" --output json > trust-policy.json
# Edit trust-policy.json: replace the old OIDC ARN with the new cluster's OIDC ARN
aws iam update-assume-role-policy \
--role-name YOUR-IRSA-ROLE \
--policy-document file://trust-policy.json
See IAM Roles for Service Accounts.
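The manual edit of trust-policy.json can be replaced with a text substitution. This is a minimal sketch, assuming the old issuer appears in the policy both in the federated principal ARN and in the condition keys, always in the form `oidc.eks.REGION.amazonaws.com/id/ID`; the function name `rewrite_trust_policy` is hypothetical.

```shell
#!/bin/sh
# Rewrite every reference to the old cluster's OIDC issuer in a trust policy.
# Arguments: old issuer URL, new issuer URL (with or without "https://").
# Reads the policy JSON on stdin and writes the rewritten policy to stdout.
rewrite_trust_policy() {
  old_issuer=${1#https://}
  new_issuer=${2#https://}
  # Escape dots so sed matches the old issuer as a literal string.
  old_pattern=$(printf '%s' "$old_issuer" | sed 's/\./\\./g')
  sed "s|${old_pattern}|${new_issuer}|g"
}

# Usage (issuer URLs and the role name are placeholders):
#   rewrite_trust_policy OLD-ISSUER-URL NEW-ISSUER-URL \
#     < trust-policy.json > trust-policy-new.json
#   aws iam update-assume-role-policy \
#     --role-name YOUR-IRSA-ROLE \
#     --policy-document file://trust-policy-new.json
```

If the source region is unreachable, the old issuer can be read from the existing trust policy itself. Comparing the two files before applying the change is a cheap safeguard against an unintended substitution.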
3c. Install required EKS add-ons on the new cluster
SOURCE_CLUSTER=YOUR-SOURCE-CLUSTER
TARGET_CLUSTER=NEW-CLUSTER-NAME
# Install every add-on found on the source cluster onto the target cluster
for ADDON in $(aws eks list-addons \
    --cluster-name "$SOURCE_CLUSTER" \
    --region SOURCE-REGION \
    --query 'addons[]' --output text); do
  echo "Installing $ADDON..."
  aws eks create-addon \
    --cluster-name "$TARGET_CLUSTER" \
    --addon-name "$ADDON" \
    --region TARGET-REGION
done
See Managing EKS add-ons.
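Before restoring workloads, it helps to confirm that nothing was missed. The sketch below compares two add-on lists as returned by `aws eks list-addons ... --output text`; the function name `missing_addons` is hypothetical.

```shell
#!/bin/sh
# Print each add-on named in the first list that is absent from the second.
# Both arguments are whitespace-separated add-on names.
missing_addons() {
  # Normalise tabs and newlines in the second list to single spaces.
  target_list=" $(echo $2) "
  for addon in $1; do
    case "$target_list" in
      *" $addon "*) ;;
      *) echo "$addon" ;;
    esac
  done
}

# Usage (cluster names and regions are placeholders):
#   src=$(aws eks list-addons --cluster-name YOUR-SOURCE-CLUSTER \
#           --region SOURCE-REGION --query 'addons[]' --output text)
#   dst=$(aws eks list-addons --cluster-name NEW-CLUSTER-NAME \
#           --region TARGET-REGION --query 'addons[]' --output text)
#   missing_addons "$src" "$dst"
#
# To block until a given add-on is ready:
#   aws eks wait addon-active --cluster-name NEW-CLUSTER-NAME \
#     --addon-name ADDON-NAME --region TARGET-REGION
```

An empty result means every source add-on has a counterpart on the target cluster; it does not show whether each one has reached the ACTIVE state.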
Step 4: Migrate container images
Pod images reference ECR repositories in the source region. If ECR in the source region is inaccessible, follow the ECS section options A, B, and C under "Copy container images to the target region ECR." The process is identical for EKS.
Step 5: Migrate supporting resources
EBS volumes: Use volume snapshots and follow the EBS snapshot migration steps in the EC2 section of this guide.
Secrets Manager and SSM Parameter Store: Follow ECS Step 4. The process is identical.
S3 buckets: When migrating S3 data to a new region, enable Block Public Access on the target bucket, configure default encryption (SSE-KMS recommended), enforce TLS-only access via bucket policy, and enable versioning. Use aws s3 sync with --sse aws:kms to copy the objects across regions.
VPC resources: Recreate VPC, subnets, security groups, and NAT Gateways in the target region before restoring the cluster.
Fargate profiles: If using EKS Fargate, recreate the Fargate profiles on the new cluster after it is created.
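For the S3 copy mentioned above, a wrapper along these lines keeps the encryption flags from being forgotten. It is a sketch only: the function name `sync_bucket` is hypothetical, and the bucket names, regions, and KMS key are placeholders.

```shell
#!/bin/sh
# Copy all objects from a source bucket to a target bucket in another region,
# encrypting the copies with a KMS key in the target region.
# Arguments: source bucket, target bucket, source region, target region, KMS key ID or alias.
sync_bucket() {
  aws s3 sync "s3://$1" "s3://$2" \
    --source-region "$3" \
    --region "$4" \
    --sse aws:kms \
    --sse-kms-key-id "$5"
}

# Usage (all values are placeholders):
#   sync_bucket SOURCE-BUCKET TARGET-BUCKET SOURCE-REGION TARGET-REGION alias/YOUR-KEY
```

The target bucket and its Block Public Access, policy, and versioning settings need to exist before the copy starts.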
Step 6: Restore workloads
Option A: Restore from AWS Backup
aws backup start-restore-job \
--region TARGET-REGION \
--recovery-point-arn RECOVERY-POINT-ARN \
--iam-role-arn arn:aws:iam::ACCOUNT-ID:role/AWSBackupDefaultServiceRole \
--metadata '{"ClusterName":"NEW-CLUSTER-NAME","Region":"TARGET-REGION"}'
See Restoring an EKS cluster.
Option B: Apply exported manifests
Update the kubeconfig to point to the new cluster, then apply the exported manifests. Review and filter system namespaces and service account secrets before applying.
aws eks update-kubeconfig \
--name NEW-CLUSTER-NAME \
--region TARGET-REGION
# Review the file before applying to exclude kube-system and other system namespaces
kubectl apply -f all-workloads.yaml
Step 7: Update DNS and route traffic
In Route 53, update A or CNAME records to point to the new load balancer or ingress endpoints. Lower the TTL (for example, to 60 seconds) ahead of the change and wait for the previous TTL to expire, so that resolvers pick up the new values quickly.
Post-Recovery Checklist
[ ] kubeconfig updated to new cluster (aws eks update-kubeconfig)
[ ] All nodes in Ready state (kubectl get nodes)
[ ] All pods running and healthy (kubectl get pods -A)
[ ] EKS add-ons installed and active
[ ] Pod AWS API calls working (check logs for AccessDenied or timeout errors)
[ ] ECR images accessible from target region
[ ] Persistent volumes bound and accessible
[ ] Services and Ingress endpoints resolving correctly
[ ] DNS / Route 53 records updated to new load balancer
[ ] Horizontal Pod Autoscaler (HPA) thresholds verified
[ ] AWS Backup plan re-enabled for new cluster
[ ] Monitoring and alerting updated for new cluster endpoints
[ ] Disaster recovery documentation updated
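The node and pod items in this checklist can be checked with two small filters. This is a rough sketch that relies on the default column layout of `kubectl get ... --no-headers`; the function names are hypothetical, and a pod in the Running state can still have containers that are not ready.

```shell
#!/bin/sh
# Print nodes whose STATUS column is anything other than exactly "Ready".
# Expects `kubectl get nodes --no-headers` output on stdin.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 " (" $2 ")" }'
}

# Print pods that are neither Running nor Completed.
# Expects `kubectl get pods -A --no-headers` output on stdin.
unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 " (" $4 ")" }'
}

# Usage:
#   kubectl get nodes --no-headers | not_ready_nodes
#   kubectl get pods -A --no-headers | unhealthy_pods
```

Empty output from both commands is a reasonable first signal before moving on to the application-level checks.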
Security Controls and Measurable Improvements
Encrypted Snapshots/AMIs: Cross-region AWS KMS re-encryption provides data encryption in transit and at rest using region-specific keys. Measurable: all snapshots and AMIs in target region are encrypted; no unencrypted copies exist.
Secrets Manager Migration: Secrets Manager provides automatic rotation, encryption at rest with AWS KMS, and audit logging via CloudTrail. Migrating secrets before service deployment enables applications to avoid falling back to hardcoded credentials. Measurable: all secrets accessible in target region; rotation schedules configured; no plaintext credentials in application configs.
IRSA (IAM Roles for Service Accounts): IRSA establishes pod-level IAM permissions using OIDC federation, replacing node-level IAM roles. This implements least-privilege access by scoping permissions to individual service accounts. Measurable: all IRSA trust policies updated with new cluster OIDC ARN; no pods using node-level IAM roles; AccessDenied errors resolved before cutover.