Name: Migration Compute
Author: aws-samples

Cross-region migration for compute services including Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Container Service (Amazon ECS), and Amazon Elastic Kubernetes Service (Amazon EKS). Covers AZ failure recovery, AMI/snapshot migration, coldsnap EBS Direct API transfers, AWS MGN agent-based migration, WSFC/SQL FCI cluster recovery, FSx ONTAP iSCSI, container image replication, task definition migration, Kubernetes workload backup/restore, and IRSA re-association.

Updated: March 24, 2026, 20:18

Repository: aws-samples/sample-migration-agentic-cli-assistant


Compute Services Migration

Security: Always ensure migrated resources meet or exceed the security configuration of the source resources. Refer to SECURITY.md for security requirements.

EC2 Migration Guide

Key Considerations

Not all EC2 instance types are available in every Region and AZ. Confirm that your required instance types are supported in the destination Region and AZs, for example via aws ec2 describe-instance-type-offerings --location-type availability-zone --region <region>.
Instance store volumes are ephemeral and not preserved in AMI copies — use EBS-backed AMIs for persistent storage
IP addresses change during migration; update DNS and security group references accordingly
Reference: EC2 Regions and Availability Zones
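
The instance type check above can be scripted. The following is a minimal sketch, not part of the original skill: the helper name missing_instance_types and the file offered.txt are hypothetical, and the list of offered types is assumed to have been saved beforehand, one type per line.

```shell
# Hypothetical helper: print each required instance type that is absent
# from the list of types offered in the destination AZ.
# $1: file with one offered instance type per line
# remaining args: instance types the workload requires
missing_instance_types() (
  offered_file=$1
  shift
  for type in "$@"; do
    # -F fixed string, -x whole-line match, -q quiet
    grep -Fxq -- "$type" "$offered_file" || echo "$type"
  done
)

# Assumed way to produce the input file (AZ and Region are placeholders):
#   aws ec2 describe-instance-type-offerings \
#     --location-type availability-zone \
#     --filters Name=location,Values=us-east-1a \
#     --region us-east-1 \
#     --query 'InstanceTypeOfferings[].InstanceType' --output text \
#     | tr '\t' '\n' > offered.txt
#   missing_instance_types offered.txt t3.micro r5.2xlarge
```

Empty output means every required type is offered in that AZ. Any type printed needs a substitute before you launch in the destination.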

What You Cannot Do When the AZ Is Down

Create AMIs from affected instances
Take EBS snapshots of affected volumes
Use AWS Application Migration Service (AWS MGN) to migrate an instance that is not running. (This may still work if you can log on to the instance to install the AWS MGN agent and the instance can reach the AWS MGN endpoints, or if you migrated the instance before the AZ failure and have a snapshot available.)
Detach EBS volumes (volumes are AZ-bound)
Access affected instances via RDP, AWS Systems Manager, or Serial Console
Collect inventory data from affected instances or resources

You can only work with what existed before the failure: AMIs, EBS snapshots, or AWS Backup recovery points.

Recovery Path

Do you have a recent AMI?  
  YES --> Step 1: Launch from AMI  
  NO  --> Do you have EBS snapshots?  
            YES --> Step 2: Restore from snapshots  
            NO  --> Is AWS Backup configured?  
                      YES --> Step 3: Restore from AWS Backup  
                      NO  --> Step 4: Launch fresh instance (data loss)
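
The decision tree above can be encoded as a small shell function, for example to embed in a runbook script. This is an illustrative sketch only: the function name choose_recovery_step and its yes/no arguments are assumptions, not part of the original guide.

```shell
# Hypothetical helper mirroring the recovery-path decision tree.
# Arguments are "yes" or "no": has_ami, has_snapshots, has_backup.
choose_recovery_step() {
  if [ "$1" = "yes" ]; then
    echo "Step 1: Launch from AMI"
  elif [ "$2" = "yes" ]; then
    echo "Step 2: Restore from snapshots"
  elif [ "$3" = "yes" ]; then
    echo "Step 3: Restore from AWS Backup"
  else
    echo "Step 4: Launch fresh instance (data loss)"
  fi
}

# No AMI, but EBS snapshots exist
choose_recovery_step no yes no
```

The checks run in the same order as the tree, so the first available recovery source wins.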

Moving Between AZs or Regions (Once the Affected Region is Restored)

Failed AZ, moving to new region:

With an AZ down, you do not have access to the instances and EBS volumes within that AZ.
Large snapshots and AMIs (multi-TB) can take hours to copy across regions.
You can recover from an EBS snapshot or AMI that was created before the event.
Snapshot:
- Copy Snapshot https://docs.aws.amazon.com/ebs/latest/userguide/ebs-copy-snapshot.html#ebs-snapshot-copy
- CLI: https://docs.aws.amazon.com/ec2/latest/devguide/example_ec2_CopySnapshot_section.html

Security: Always use --encrypted --kms-key-id for all snapshot copies to maintain encryption at rest.

Encrypted Snapshot:

This is the most common cross-region scenario — KMS keys are regional, so you need a key in the destination region and an operational KMS/EBS service in the source region.

aws ec2 copy-snapshot \
--source-region us-west-2 \
--source-snapshot-id snap-0123456789abcdef0 \
--region us-east-1 \
--encrypted \
--kms-key-id arn:aws:kms:us-east-1:123456789012:key/abcd1234-5678-90ab-cdef-EXAMPLE \
--description "Re-encrypted with destination region key"
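
Because KMS keys are regional, it can help to check that the key ARN passed to --kms-key-id belongs to the destination region before starting a long copy. The sketch below assumes the key is given as a full ARN; the helper names kms_key_region and key_matches_region are hypothetical.

```shell
# Extract the Region (4th colon-separated field) from a KMS key ARN.
kms_key_region() {
  printf '%s\n' "$1" | cut -d: -f4
}

# Succeed only when the key ARN belongs to the given destination Region.
key_matches_region() {
  [ "$(kms_key_region "$1")" = "$2" ]
}

key_arn="arn:aws:kms:us-east-1:123456789012:key/abcd1234-5678-90ab-cdef-EXAMPLE"
if key_matches_region "$key_arn" us-east-1; then
  echo "key is in the destination Region"
else
  echo "key is in a different Region; use a key from the destination Region"
fi
```

This check only inspects the ARN text. It does not confirm that the key exists or that the caller is allowed to use it.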

AMI
- Copy EC2 AMI https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/CopyingAMIs.html

Security: Always use --encrypted --kms-key-id for all AMI copies to maintain encryption at rest.

Encrypted AMI:

Copy an encrypted AMI cross-region (re-encrypt with destination region KMS key)

aws ec2 copy-image \
--source-image-id ami-0123456789abcdef0 \
--source-region us-west-2 \
--name "Re-encrypted AMI in us-east-1" \
--region us-east-1 \
--encrypted \
--kms-key-id arn:aws:kms:us-east-1:123456789012:key/abcd1234-5678-90ab-cdef-EXAMPLE

Moving Between AZs (Once the Affected Region Is Restored)

To restore from an EBS snapshot https://docs.aws.amazon.com/prescriptive-guidance/latest/backup-recovery/restore.html

Restore to a new EBS Volume

aws ec2 create-volume \
  --snapshot-id snap-0123456789abcdef0 \
  --availability-zone us-east-1a \
  --volume-type gp3

Then attach it to an instance (the volume's AZ must match the target instance's AZ):

aws ec2 attach-volume \
--volume-id vol-NEW_ID \
--instance-id i-0123456789abcdef0 \
--device /dev/xvdf

To restore from an AMI https://repost.aws/knowledge-center/launch-instance-custom-ami
- CLI: https://docs.aws.amazon.com/ec2/latest/devguide/example_ec2_RunInstances_section.html

aws ec2 run-instances \
  --image-id ami-0abcdef1234567890 \
  --instance-type t3.micro \
  --subnet-id subnet-0abcdef1234567890 \
  --security-group-ids sg-0abcdef1234567890 \
  --associate-public-ip-address \
  --key-name MyKeyPair

EC2 Troubleshooting documentation if something went wrong during AMI/Snapshot EC2 recovery:
- https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-troubleshoot.html

For instances that are still accessible within the impacted region

The steps below let you extract EC2 instances that are still running while control plane functions are unavailable.

AWS Application Migration Service (AWS MGN)

If EC2 instances are still running in the impacted region while control plane functions are unavailable, you can use AWS MGN and its replication agent to migrate data to a new region. AWS MGN doesn't distinguish between on-prem and EC2 sources once the agent is installed, so treating your source EC2 instance as "on-prem" is simply the standard agent-based workflow pointing at a different target region. AWS MGN relies on the control plane of the DESTINATION region to operate.

This can be done by following these steps:

Step 1: Set up AWS MGN in the TARGET region. Use the Getting Started area in the new region.
Step 2: Verify that you can connect from the source region to AWS MGN in the target region. Consider which connectivity options out of the region your security groups and route tables allow. The agent needs TCP 443 to reach AWS MGN, and replication needs TCP 1500.
Step 3: Download and install the replication agent. Install the agent for the source OS, then find the system listed as a Source Server in the destination region.
Step 4: Configure launch settings in AWS MGN. Where possible, configure the same instance shape as in the source region.
Step 5: Launch the test instance. The sync takes time to complete. Once it completes, launch a test instance to confirm that AWS MGN replicated the server correctly.
Step 6: Cut over the instance.
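
The connectivity check in Step 2 can be sketched as below. This is a minimal illustration under assumptions: it uses the standard regional API hostname pattern for AWS MGN, it assumes a netcat build that supports -z and -w is installed on the source instance, and the helper names mgn_endpoint and check_tcp are hypothetical. The replication server address for TCP 1500 is only known once replication servers exist in the target region, so it stays a placeholder.

```shell
# Build the regional AWS MGN API hostname for a given Region.
mgn_endpoint() {
  printf 'mgn.%s.amazonaws.com\n' "$1"
}

# Return 0 if a TCP connection to host:port succeeds within 5 seconds.
# Assumes netcat (nc) with -z and -w support is available.
check_tcp() {
  nc -z -w 5 "$1" "$2"
}

# Hypothetical usage on the source instance:
#   check_tcp "$(mgn_endpoint us-east-1)" 443 && echo "MGN API reachable"
#   check_tcp <replication-server-ip> 1500 && echo "replication port reachable"
echo "Target endpoint: $(mgn_endpoint us-east-1)"
```

A failed check on port 443 usually points at security group egress rules, route tables, or a missing path out of the impacted region.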

The steps above extract an EC2 instance and all of its data. Also plan for DNS records and IP addresses, which will change, and for any resources the deployment depends on.

Manually dumping MySQL databases before data sync:

https://dev.mysql.com/doc/refman/8.4/en/mysqldump.html

mysqldump -u username --ssl-mode=VERIFY_IDENTITY --ssl-ca=/path/to/rds-combined-ca-bundle.pem --all-databases | gzip > /source/data/dbdump.db.gz

Security: Use --ssl-mode=VERIFY_IDENTITY with the RDS CA certificate bundle to verify the server certificate and prevent man-in-the-middle attacks. Download the bundle from https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/UsingWithRDS.SSL.html

Manually copying data between EC2 Linux instances using rsync (if APIs are not working)

Note: rsync must also be installed on the destination.

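# Record the destination host key. ssh-keyscan does not authenticate the host,
# so compare the key fingerprint against a trusted source before relying on it.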
ssh-keyscan -H <DESTINATION-IP> >> ~/.ssh/known_hosts

sudo rsync -avzP -e "ssh -i /path/to/key.pem -o StrictHostKeyChecking=yes" \
    /source/data/ \
    ec2-user@<DESTINATION-IP>:/destination/data/

Security: Verify the SSH host key before first connection. Verify that the key file has restricted permissions (chmod 400). Consider using ssh-agent for key management instead of specifying key paths directly.

/path/to/key.pem (This is the key for the destination instance)

Key flags:

-a — archive mode (preserves permissions, timestamps, symlinks)
-v — verbose
-z — compress during transfer
-P — show progress + allow resume of partial transfers

--delete — (optional) remove files on destination that do not exist on source

Manually copying data between EC2 Windows instances using Robocopy

Note: Windows-based file transfers using robocopy may not fully utilize available bandwidth on connections with latency >100ms due to default TCP window size settings. Consider testing transfer methods for your specific environment.

Security: Enable SMB encryption for robocopy transfers: Set-SmbServerConfiguration -EncryptData $true -Force. Verify destination volumes use encrypted EBS volumes.

Prerequisites

Destination server must have file sharing enabled
Source server needs network access to destination
Account needs write permissions on destination
PSRemoting must be enabled

Enable PSRemoting:

# On Both Servers:
Enable-PSRemoting -Force

# On Source Server:
Set-Item WSMan:\localhost\Client\TrustedHosts -Value "DestinationServerIPorName" -Force

# On Target Server:
Set-Item WSMan:\localhost\Client\TrustedHosts -Value "SourceServerIPorName" -Force

# On Both Servers:
Restart-Service WinRM

PowerShell Remote Method


# From your local machine, establish session to source server
$session = New-PSSession -ComputerName SourceServer -Credential (Get-Credential)

# Execute robocopy on source server to copy to destination
Invoke-Command -Session $session -ScriptBlock {
robocopy "C:\SourcePath" "\\DestinationServer\C$\DestPath" /E /Z /R:3 /W:5 /MT:8 /LOG:C:\robocopy.log
}

# Check log
Invoke-Command -Session $session -ScriptBlock { Get-Content C:\robocopy.log -Tail 20 }

# Clean up
Remove-PSSession $session

RDP Method

Connect via RDP, then run directly:

robocopy "C:\SourcePath" "\\DestinationServer\C$\DestPath" /E /Z /R:3 /W:5 /MT:8 /LOG:C:\robocopy.log

Key Flags

/E - Copy subdirectories including empty
/Z - Restartable mode
/R:3 - 3 retries on failed copies
/W:5 - 5 seconds wait between retries
/MT:8 - 8 threads for faster copy
/LOG - Output log file

Once Control Plane has been restored

Once the control plane is restored, you can still follow the guidance above, but additional migration options become available:

Use the AWSSupport-CopyEC2Instance runbook to automate the process of moving an EC2 instance to a new subnet, AZ, or VPC — in the same or a different Region.
- Do not use AWSSupport-CopyEC2Instance for Active Directory Domain Controller instances
- 📄 AWSSupport-CopyEC2Instance Runbook Reference
For EC2 instances with encrypted EBS volumes, use the AMI copy + S3 method to avoid sharing KMS keys across Regions.
- 📄 Migrate Encrypted EC2 Instances Across Regions Without Sharing KMS Keys

Restore from AWS Backup

aws backup list-recovery-points-by-resource \
  --resource-arn arn:aws:ec2:us-east-1:ACCOUNT:instance/i-XXXXXXXXXXXX

aws backup start-restore-job \
  --recovery-point-arn arn:aws:backup:us-east-1:ACCOUNT:recovery-point:RP-ID \
  --iam-role-arn arn:aws:iam::ACCOUNT:role/AWSBackupDefaultServiceRole \
  --metadata '{"SubnetId":"subnet-healthy-az","SecurityGroupIds":"sg-XXXXXXXXXXXX","InstanceType":"r5.2xlarge"}'

Monitor the restore:

aws backup describe-restore-job --restore-job-id RESTORE-JOB-ID
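
Monitoring can be automated with a small polling loop. The sketch below is an assumption-laden illustration: the function name wait_for_restore is hypothetical, and it expects a command that prints only the job status (for example, the describe-restore-job call above with --query Status --output text added).

```shell
# Poll a status command until it reports a terminal restore-job state.
# $1: seconds between attempts, $2: maximum attempts,
# remaining args: command that prints the current status on stdout.
wait_for_restore() (
  interval=$1
  attempts=$2
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    status=$("$@")
    case "$status" in
      COMPLETED) echo "restore completed"; exit 0 ;;
      FAILED|ABORTED) echo "restore ended with status $status"; exit 1 ;;
    esac
    i=$((i + 1))
    sleep "$interval"
  done
  echo "timed out after $attempts attempts"
  exit 2
)

# Hypothetical usage (polls every 30 seconds, up to 120 times):
#   wait_for_restore 30 120 aws backup describe-restore-job \
#     --restore-job-id RESTORE-JOB-ID --query Status --output text
```

The exit code separates success (0), a failed or aborted job (1), and a timeout (2), so a calling script can branch on the result.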

Post-Migration Checklist

[ ] OS boots cleanly (no dracut/emergency shell)
[ ] Instance responding to SSH
[ ] Filesystems mounted correctly (e.g., /etc/fstab uses UUIDs)
[ ] DNS resolution functional (host/dig www.amazon.com)
[ ] SSM agent online (SSM Session Manager)
[ ] Security groups applied correctly
[ ] Application services running (systemctl status checks)
[ ] Elastic IP reassociated (if applicable)
[ ] Route 53 / DNS records updated
[ ] Backup schedule re-enabled for new instance IDs
[ ] Monitoring and alerting updated for new instance IDs
[ ] Cron jobs / systemd timers verified
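
The fstab item in the checklist can be checked with a short script. This is a rough sketch, not an exhaustive validator: the function name fstab_device_path_entries is hypothetical, and it flags every entry whose device field starts with /dev/, including paths such as /dev/mapper/... that are often stable.

```shell
# List fstab entries whose device field is a /dev/ path rather than
# UUID= or LABEL=. Device names such as /dev/xvdf can change after a
# restore, which can prevent those entries from mounting at boot.
# $1: path to the fstab file to inspect
fstab_device_path_entries() {
  awk '$0 !~ /^[[:space:]]*#/ && $1 ~ /^\/dev\// { print $1, $2 }' "$1"
}

# Hypothetical usage on the migrated instance:
#   fstab_device_path_entries /etc/fstab
```

Empty output suggests all mounts are referenced by UUID or label. Each printed line shows a device path and mount point worth reviewing.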

WSFC / SQL FCI Cluster Recovery

This section applies only to Windows Server Failover Cluster and SQL Server Failover Cluster Instance environments.

If one cluster node survived

The surviving node should own the SQL FCI resources. Verify:

Get-ClusterNode | Format-Table Name, State
Get-ClusterGroup | Format-Table Name, State, OwnerNode
Get-Service MSSQLSERVER | Select-Object Status

Evict the dead node:

Remove-ClusterNode -Name "DEAD-NODE" -Force

Build a replacement node using Steps 1 through 4 of the EC2 Recovery Path above. After the instance is running and domain-joined:

# Install clustering features
Install-WindowsFeature -Name Failover-Clustering -IncludeManagementTools
Install-WindowsFeature -Name Multipath-IO

# Connect iSCSI to FSx ONTAP (if using shared storage)
# Run scripts/04-configure-iscsi-fsx.ps1

# Add to cluster
Add-ClusterNode -Name $env:COMPUTERNAME -Cluster "SQLFCI"

# Add to SQL FCI
# Run scripts/06-install-sql-fci.ps1 -Action AddNode

VIP behavior in VPC

Cluster VIPs do not float between subnets in AWS VPC. Each node has its own secondary IP on its ENI. The cluster updates DNS on failover.

After adding a node in a new subnet:

Add a secondary IP to the node's ENI
In Failover Cluster Manager, set the cluster IP resource to Static with the secondary IP
Set Possible Owners to only the node that owns that IP

If both nodes were in the failed AZ

This requires a full cluster rebuild. If you use Amazon FSx for NetApp ONTAP multi-AZ, your LUNs and SQL data are still intact. Rebuild both nodes, recreate the cluster, reconnect iSCSI, and reinstall SQL FCI.

Amazon FSx for NetApp ONTAP and Failover Cluster Storage Recovery

This section covers the OS-level storage recovery when using FSx for NetApp ONTAP with iSCSI and Windows Server Failover Clustering.

Amazon FSx for NetApp ONTAP Behavior During AZ Failure

Amazon FSx for NetApp ONTAP multi-AZ has an active file server in one AZ and a standby in the other. If the active AZ fails, Amazon FSx automatically fails over to the standby (typically within seconds). The iSCSI endpoint IPs stay the same (floating IPs managed by FSx). Windows iSCSI sessions should reconnect automatically if configured with persistent connections.

# Check Amazon FSx file system status
aws fsx describe-file-systems --file-system-ids fs-XXXXXXXXXXXX \
  --query "FileSystems[0].[Lifecycle,FileSystemType,StorageType]" --output table

# Check SVM status and iSCSI endpoints
aws fsx describe-storage-virtual-machines \
  --filters "Name=file-system-id,Values=fs-XXXXXXXXXXXX" \
  --query "StorageVirtualMachines[*].[Name,Lifecycle,Endpoints.Iscsi.IpAddresses]" --output table

Verifying iSCSI Reconnection After FSx Failover

On each cluster node:

Get-IscsiSession | Format-Table TargetNodeAddress, IsConnected, IsPersistent
Get-IscsiTargetPortal | Format-Table TargetPortalAddress, TargetPortalPortNumber
mpclaim -s -d
Get-Disk | Where-Object BusType -eq iSCSI | Format-Table Number, FriendlyName, OperationalStatus, Size

If iSCSI Sessions Did Not Reconnect

# Remove stale portals and re-add
Get-IscsiTargetPortal | Remove-IscsiTargetPortal -Confirm:$false

New-IscsiTargetPortal -TargetPortalAddress 10.16.36.42 -TargetPortalPortNumber 3260
New-IscsiTargetPortal -TargetPortalAddress 10.16.97.81 -TargetPortalPortNumber 3260

Start-Sleep -Seconds 5

$targets = Get-IscsiTarget
foreach ($target in $targets) {
    # Backticks continue each command onto the next line
    Connect-IscsiTarget -NodeAddress $target.NodeAddress `
        -TargetPortalAddress 10.16.36.42 -TargetPortalPortNumber 3260 `
        -IsPersistent $true -IsMultipathEnabled $true -ErrorAction SilentlyContinue
    Connect-IscsiTarget -NodeAddress $target.NodeAddress `
        -TargetPortalAddress 10.16.97.81 -TargetPortalPortNumber 3260 `
        -IsPersistent $true -IsMultipathEnabled $true -ErrorAction SilentlyContinue
}

Get-IscsiSession | Format-Table TargetNodeAddress, IsConnected

Cluster Disk Recovery

After iSCSI reconnects, cluster disks may show as Failed or Offline.

# Check cluster disk status
Get-ClusterResource | Where-Object ResourceType -eq "Physical Disk" | Format-Table Name, State, OwnerNode

# Start failed disks
Get-ClusterResource | Where-Object { $_.ResourceType -eq "Physical Disk" -and $_.State -eq "Failed" } | ForEach-Object {
    Start-ClusterResource -Name $_.Name -ErrorAction SilentlyContinue
}

# If disks show Offline in Disk Management
Get-Disk | Where-Object { $_.BusType -eq "iSCSI" -and $_.OperationalStatus -eq "Offline" } | Set-Disk -IsOffline $false

Disk Signature Issues After Recovery

After snapshot restore or node rebuild, disk signature collisions can prevent disks from coming online.

# Check for offline disks
Get-Disk | Where-Object OperationalStatus -eq "Offline"

# Force online (verify correct disk first)
Set-Disk -Number X -IsOffline $false
Set-Disk -Number X -IsReadOnly $false

# For cluster disks, use cluster commands
Get-ClusterResource "Cluster Disk 1" | Start-ClusterResource

MPIO Path Verification

# Show all MPIO disks and paths (expect 2 paths per disk)
mpclaim -s -d

# Check load balance policy (should be RR)
Get-MSDSMGlobalDefaultLoadBalancePolicy

# Fix if needed
Set-MSDSMGlobalDefaultLoadBalancePolicy -Policy RR

# Detailed path status
mpclaim -v

SQL Server Recovery After Storage Reconnection

# Check SQL cluster resource
Get-ClusterResource | Where-Object ResourceType -eq "SQL Server" | Format-Table Name, State

# Check dependency chain
Get-ClusterResource "SQL Server" | Get-ClusterResourceDependency

# Start in order: disks, then network name, then SQL
Get-ClusterResource | Where-Object ResourceType -eq "Physical Disk" | Start-ClusterResource
Start-Sleep -Seconds 10
Start-ClusterResource "SQL Network Name (SQLFCI-SQL)"
Start-Sleep -Seconds 5
Start-ClusterResource "SQL Server"

# Verify
Invoke-Sqlcmd -Query "SELECT @@SERVERNAME, @@VERSION" -ServerInstance "SQLFCI-SQL"
Invoke-Sqlcmd -Query "SELECT name, state_desc FROM sys.databases" -ServerInstance "SQLFCI-SQL"

FSx ONTAP REST API Verification

Security: The examples below retrieve credentials from AWS Secrets Manager into an environment variable. Do not hardcode passwords in scripts or pass them as literal strings in commands. The $PASSWORD variable below is populated securely from Secrets Manager. Use proper certificate verification instead of -k. See SECURITY.md.

# Get management IP
aws fsx describe-file-systems --file-system-ids fs-XXXXXXXXXXXX \
  --query "FileSystems[0].OntapConfiguration.Endpoints.Management.IpAddresses[0]" \
  --output text

# Check LUN status (retrieve password from Secrets Manager in production)
PASSWORD=$(aws secretsmanager get-secret-value --secret-id fsxadmin-password --query SecretString --output text)
curl -u "fsxadmin:$PASSWORD" \
  "https://MGMT-IP/api/storage/luns?svm.name=svm-sql&fields=status,name,space"

# Check igroup mappings
curl -u "fsxadmin:$PASSWORD" \
  "https://MGMT-IP/api/protocols/san/igroups?svm.name=svm-sql&fields=initiators,lun_maps"

# Check iSCSI service
curl -u "fsxadmin:$PASSWORD" \
  "https://MGMT-IP/api/protocols/san/iscsi/services?svm.name=svm-sql"

Coldsnap: EBS Direct API Snapshot Transfers

What coldsnap does:

Uploads local disk images directly into EBS snapshots using the EBS Direct APIs — no EC2 instance or volume attachment needed.

Downloads EBS snapshots to local files for out-of-band transfer.

Useful in automated pipelines, when the control plane is impaired, or when you need to move disk images without launching instances.

Source: github.com/awslabs/coldsnap (Apache-2.0, Rust)

Installation

cargo install --locked coldsnap

Upload Local Disk Image to EBS Snapshot

# Upload and wait for snapshot to become available
coldsnap upload --wait disk.img

# Upload with KMS encryption
coldsnap upload disk.img --kms-key-id arn:aws:kms:<region>:<account>:key/<key-id>

# Upload with tags
coldsnap upload disk.img --tag "Key=Environment,Value=DR" --tag "Key=Source,Value=migration"

Download EBS Snapshot to Local File

coldsnap download snap-1234 disk.img

Cross-Region Migration with Coldsnap

# 1. Download snapshot from source region
AWS_DEFAULT_REGION=<source-region> coldsnap download snap-1234 disk.img

# 2. Upload to target region
AWS_DEFAULT_REGION=<target-region> coldsnap upload --wait disk.img

# 3. Create volume from new snapshot in target region
aws ec2 create-volume --snapshot-id <new-snap-id> --availability-zone <target-az> --region <target-region>

Wait for Snapshot

coldsnap wait snap-1234

Credentials

Coldsnap uses the same credential chain as the AWS CLI (~/.aws/credentials, environment variables, instance profiles). Use --profile <name> to select a specific profile.

Amazon ECS Migration Guide

Key Considerations

ECS clusters cannot be moved to another region. You must recreate the cluster, task definitions, and services in the target region.
The ECS data plane is independent from the control plane. Running tasks on healthy nodes continue to operate even when the control plane is unreachable.
Container images must be available in the target region before services can be created. If ECR in the source region is inaccessible, plan to rebuild images from your CI/CD pipeline or pull from an alternate registry.
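Stateful workloads backed by EFS continue working across AZ failures because EFS is a regional service; only the mount target is AZ-specific. Cross-region EFS data migration is not possible while the source region is unavailable unless EFS replication or AWS Backup cross-region copies were configured beforehand.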
Stateful workloads backed by EBS (EC2 launch type only) are AZ-specific and cannot move automatically.
IAM roles are global, but verify they exist and hold the correct permissions before creating services in the target region.
Secrets Manager secrets and SSM parameters are regional. Export and recreate them in the target region before registering task definitions.

What You CANNOT Do When the ECS Control Plane Is Down

Stop, start, or restart tasks
Update services or change desired count
Register new task definitions
Create new services or clusters
Access Amazon ECR in the affected region

Running tasks on healthy nodes continue to operate. If the ECS control plane is completely inaccessible and you cannot run any API or CLI commands, open an AWS Support case (Critical severity) and provide your cluster names, account ID, and target region.

Recovery Path

Is the ECS control plane reachable and at least one AZ healthy?
  YES --> Stream 1: Recover in-place (partial AZ failure)
  NO  --> Is this a full region outage?
            YES --> Stream 2: Migrate to another region
            NO  --> Open an AWS Support case (Critical severity)

Moving Between AZs (Partial AZ Failure)

Use this path when the ECS control plane is still reachable and at least one AZ is healthy. The goal is to stop scheduling tasks in the failed AZ and reschedule them in the remaining healthy AZs.

Step 1: Identify tasks running in the affected AZ

aws ecs list-tasks \
  --region SOURCE-REGION \
  --cluster YOUR-CLUSTER-NAME

aws ecs describe-tasks \
  --region SOURCE-REGION \
  --cluster YOUR-CLUSTER-NAME \
  --tasks TASK-ARN-1 TASK-ARN-2 \
  --query 'tasks[*].[taskArn,availabilityZone,lastStatus]'
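
To narrow the result to the affected AZ, the rows can be filtered with awk. The sketch below assumes the describe-tasks command above is run with --output text, which prints one tab-separated row per task; the function name tasks_in_az is hypothetical.

```shell
# Read tab-separated rows of taskArn, availabilityZone, lastStatus on stdin
# and print the ARNs of tasks located in the given AZ.
# $1: AZ name to match (for example, me-central-1a)
tasks_in_az() {
  awk -F '\t' -v az="$1" '$2 == az { print $1 }'
}

# Hypothetical usage:
#   aws ecs describe-tasks ... \
#     --query 'tasks[*].[taskArn,availabilityZone,lastStatus]' \
#     --output text | tasks_in_az me-central-1a
```

The resulting ARN list shows which tasks need to be rescheduled once the failed AZ is excluded.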

Step 2: Remove the failed AZ from service networking (Fargate launch type)

Update the service to remove the affected AZ's subnet:

aws ecs update-service \
  --region SOURCE-REGION \
  --cluster YOUR-CLUSTER-NAME \
  --service YOUR-SERVICE-NAME \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-HEALTHY-AZ1,subnet-HEALTHY-AZ2],securityGroups=[sg-xxxxxxxxx],assignPublicIp=DISABLED}"

Step 3: Exclude the failed AZ via placement constraints (EC2 launch type)

aws ecs update-service \
  --region SOURCE-REGION \
  --cluster YOUR-CLUSTER-NAME \
  --service YOUR-SERVICE-NAME \
  --placement-constraints type=memberOf,expression="attribute:ecs.availability-zone != FAILED-AZ-ID"

Replace FAILED-AZ-ID with the name of the failed AZ (for example, me-central-1a).

Step 4: Force redeployment to reschedule tasks

aws ecs update-service \
  --region SOURCE-REGION \
  --cluster YOUR-CLUSTER-NAME \
  --service YOUR-SERVICE-NAME \
  --force-new-deployment

ECS stops tasks in the failed AZ and reschedules them in the healthy AZs.

Step 5: Verify service stability

aws ecs describe-services \
  --region SOURCE-REGION \
  --cluster YOUR-CLUSTER-NAME \
  --services YOUR-SERVICE-NAME \
  --query 'services[*].[serviceName,runningCount,desiredCount,status]'
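
The output of this query can be summarized per service. The following sketch assumes the command is run with --output text (tab-separated rows) and treats a service as stable when its running count equals its desired count and its status is ACTIVE. That is a simplification, since a deployment may still be rolling out. The function name report_service_stability is hypothetical.

```shell
# Read tab-separated rows of serviceName, runningCount, desiredCount, status
# and report whether each service has reached its desired count.
report_service_stability() {
  awk -F '\t' '{
    state = ($2 == $3 && $4 == "ACTIVE") ? "stable" : "not stable"
    printf "%s: %s (%s/%s running)\n", $1, state, $2, $3
  }'
}

# Hypothetical usage:
#   aws ecs describe-services ... \
#     --query 'services[*].[serviceName,runningCount,desiredCount,status]' \
#     --output text | report_service_stability
```

For a blocking check, the AWS CLI also provides aws ecs wait services-stable, which returns once the listed services settle.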

Notes:

Verify that healthy AZs have sufficient capacity to absorb rescheduled tasks. Request a limit increase proactively if needed.
If tasks use EBS volumes (EC2 launch type only), those volumes are AZ-specific and cannot be moved automatically.
If the desired count cannot be met in the remaining AZs, temporarily reduce it to match available capacity, then scale back up once stable.

Moving to a Different Region (Full Region Loss)

Recreate your ECS infrastructure in the target region. Complete all steps in order. The key dependency chain is: images in ECR, then task definitions, then VPC and supporting infrastructure, then cluster, then services.

Before you start:

Document your current ECS cluster configuration.
Identify container images and their registry locations.
Note all task definitions, services, and their configurations.
Identify your target recovery region.
Verify you have the necessary IAM permissions in the target region.
Document load balancer and networking configurations.

Step 1: Copy container images to the target region ECR

Option A: Pull from source region and push to target region

Create the repository in the target region:

aws ecr create-repository \
  --repository-name YOUR-REPO-NAME \
  --region TARGET-REGION

Authenticate, pull, retag, and push:

aws ecr get-login-password --region SOURCE-REGION | \
  docker login --username AWS --password-stdin ACCOUNT-ID.dkr.ecr.SOURCE-REGION.amazonaws.com

docker pull ACCOUNT-ID.dkr.ecr.SOURCE-REGION.amazonaws.com/YOUR-REPO-NAME:TAG

docker tag ACCOUNT-ID.dkr.ecr.SOURCE-REGION.amazonaws.com/YOUR-REPO-NAME:TAG \
  ACCOUNT-ID.dkr.ecr.TARGET-REGION.amazonaws.com/YOUR-REPO-NAME:TAG

aws ecr get-login-password --region TARGET-REGION | \
  docker login --username AWS --password-stdin ACCOUNT-ID.dkr.ecr.TARGET-REGION.amazonaws.com

docker push ACCOUNT-ID.dkr.ecr.TARGET-REGION.amazonaws.com/YOUR-REPO-NAME:TAG

Option B: Rebuild from your CI/CD pipeline

Trigger your existing pipeline (CodePipeline, GitHub Actions, Jenkins) targeting the recovery region directly. This is the most reliable option if your pipeline is decoupled from the affected region.

Option C: Pull from an alternate registry

If images were previously pushed to ECR Public, Docker Hub, or another registry:

# From ECR Public
docker pull public.ecr.aws/YOUR-ALIAS/YOUR-REPO:TAG

docker tag public.ecr.aws/YOUR-ALIAS/YOUR-REPO:TAG \
  ACCOUNT-ID.dkr.ecr.TARGET-REGION.amazonaws.com/YOUR-REPO:TAG

aws ecr get-login-password --region TARGET-REGION | \
  docker login --username AWS --password-stdin \
  ACCOUNT-ID.dkr.ecr.TARGET-REGION.amazonaws.com

docker push ACCOUNT-ID.dkr.ecr.TARGET-REGION.amazonaws.com/YOUR-REPO:TAG

Check whether ECR cross-region replication was pre-configured — images may already be available:

aws ecr describe-repositories --region TARGET-REGION
aws ecr list-images --region TARGET-REGION --repository-name YOUR-REPO-NAME

Recommendation going forward: Enable ECR cross-region replication proactively so images are always available in your DR region without manual intervention. See ECR replication documentation.

Step 2: Export task definitions from the source region

aws ecs list-task-definitions --region SOURCE-REGION

aws ecs describe-task-definition \
  --region SOURCE-REGION \
  --task-definition YOUR-TASK-DEFINITION:REVISION \
  --query 'taskDefinition' > task-definition.json

Save these JSON files. You need them to register task definitions in the target region.
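
Exported task definitions still reference ECR image URIs in the source region. A minimal sketch for rewriting them is shown below; the function name retarget_ecr_images is hypothetical, and it only rewrites ECR registry hostnames. Other region-specific values (for example, log configuration or secret ARNs) are not touched and need separate review. The exported JSON also contains read-only fields such as taskDefinitionArn, revision, status, requiresAttributes, compatibilities, registeredAt, and registeredBy, which must be removed before the file can be used to register a new task definition.

```shell
# Rewrite ECR image URIs so they point at the target Region.
# $1: account ID, $2: source Region, $3: target Region.
# Reads JSON on stdin and writes the rewritten JSON to stdout.
retarget_ecr_images() {
  sed "s|$1\.dkr\.ecr\.$2\.amazonaws\.com|$1.dkr.ecr.$3.amazonaws.com|g"
}

# Hypothetical usage:
#   retarget_ecr_images ACCOUNT-ID SOURCE-REGION TARGET-REGION \
#     < task-definition.json > task-definition-target.json
```

Writing to a new file keeps the original export intact for comparison.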

See Task Definitions documentation.

Step 3: Document service configurations

aws ecs list-services \
  --region SOURCE-REGION \
  --cluster YOUR-CLUSTER-NAME

aws ecs describe-services \
  --region SOURCE-REGION \
  --cluster YOUR-CLUSTER-NAME \
  --services YOUR-SERVICE-NAME > service-config.json

For each service, record: service name, task definition and revision, desired count, launch type, network configuration, load balancer configuration, auto scaling settings, and service discovery settings.

Step 4: Migrate secrets and configuration

Copy Secrets Manager secrets:

# Retrieve secret value from source region
aws secretsmanager get-secret-value \
  --region SOURCE-REGION \
  --secret-id YOUR-SECRET-NAME \
  --query 'SecretString' --output text > secret.txt

# Create secret in target region
aws secretsmanager create-secret \
  --region TARGET-REGION \
  --name YOUR-SECRET-NAME \
  --secret-string file://secret.txt

# Remove the plaintext copy once the secret exists in the target region
rm -f secret.txt

Copy SSM Parameter Store parameters:

# Retrieve parameter from source region
aws ssm get-parameter \
  --region SOURCE-REGION \
  --name YOUR-PARAMETER-NAME \
  --with-decryption \
  --query 'Parameter.Value' --output text > parameter.txt

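# Create parameter in target region
# (match the source parameter's type: String, StringList, or SecureString)
aws ssm put-parameter \
  --region TARGET-REGION \
  --name YOUR-PARAMETER-NAME \
  --value file://parameter.txt \
  --type SecureString

# Remove the plaintext copy once the parameter exists in the target region
rm -f parameter.txt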

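EFS note: For a partial AZ failure, update your service subnet configuration to use subnets in healthy AZs; EFS access resumes automatically via the healthy mount targets. For a full region migration, EFS data cannot be copied out of an unavailable region after the fact. It is recoverable in the target region only if it was copied there beforehand, for example with EFS replication or an AWS Backup cross-Region copy.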

Step 5: Create VPC and networking in the target region

The target region requires: an Amazon Virtual Private Cloud (Amazon VPC) with public and private subnets across multiple AZs, an Internet Gateway, NAT Gateways for private subnets, route tables, and security groups matching your source configuration.

aws ec2 create-vpc \
  --region TARGET-REGION \
  --cidr-block 10.0.0.0/16 \
  --tag-specifications 'ResourceType=vpc,Tags=[{Key=Name,Value=ecs-recovery-vpc}]'

# Repeat for each AZ
aws ec2 create-subnet \
  --region TARGET-REGION \
  --vpc-id vpc-xxxxxxxxx \
  --cidr-block 10.0.1.0/24 \
  --availability-zone TARGET-REGION-AZ

aws ec2 create-security-group \
  --region TARGET-REGION \
  --group-name ecs-tasks-sg \
  --description "Security group for ECS tasks" \
  --vpc-id vpc-xxxxxxxxx

Step 6: Verify IAM roles in the target region

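IAM roles are global, so they do not need to be recreated in the target region. Verify that your ECS task execution role, task role, service role, and auto scaling role exist and have correct policies, and that any Region-specific resource ARNs in those policies also cover the target region. See ECS IAM Roles documentation.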

Step 7: Create the ECS cluster in the target region

aws ecs create-cluster \
  --region TARGET-REGION \
  --cluster-name YOUR-CLUSTER-NAME

See Creating ECS Clusters documentation.

Step 8: Create the Application Load Balancer (if needed)

Create an ALB with at least two AZ subnets, configure HTTP/HTTPS listeners, and create a target group with IP target type for Fargate or instance target type for EC2 launch type. See ALB documentation.

Step 9: Register task definitions in the target region

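Before registering, update the exported task definition JSON:
- Remove the read-only fields that describe-task-definition returns (taskDefinitionArn, revision, status, requiresAttributes, compatibilities, registeredAt, registeredBy); register-task-definition rejects them as unknown parameters.
- Change container image URIs to point to the target region ECR repositories.
- Update Secrets Manager ARNs to reference the target region secrets.
- Update CloudWatch log group names or create new log groups in the target region.
- Verify the task execution role ARN and task role ARN are correct.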

aws ecs register-task-definition \
  --region TARGET-REGION \
  --cli-input-json file://task-definition.json
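
The edits to the exported JSON can be scripted. The sketch below is one possible approach, not the guide's own tooling: it drops the read-only fields that `describe-task-definition` returns, rewrites private ECR image URIs from the source region to the target region, and updates the `awslogs-region` log option. The function name `prepare_task_definition` is hypothetical. Secrets Manager ARNs are deliberately left untouched, because a secret created in the target region gets its own ARN suffix and has to be looked up rather than derived.

```python
import copy
import re

# Fields returned by describe-task-definition that register-task-definition rejects.
READ_ONLY_KEYS = {
    "taskDefinitionArn", "revision", "status", "requiresAttributes",
    "compatibilities", "registeredAt", "registeredBy", "deregisteredAt",
}

# Private ECR image URI: ACCOUNT.dkr.ecr.REGION.amazonaws.com/REPO[:TAG]
ECR_IMAGE = re.compile(r"^(\d{12})\.dkr\.ecr\.([a-z0-9-]+)\.amazonaws\.com/(.+)$")


def prepare_task_definition(task_def, source_region, target_region):
    """Return a copy of task_def that is ready to register in target_region."""
    cleaned = {
        key: copy.deepcopy(value)
        for key, value in task_def.items()
        if key not in READ_ONLY_KEYS
    }
    for container in cleaned.get("containerDefinitions", []):
        match = ECR_IMAGE.match(container.get("image", ""))
        if match and match.group(2) == source_region:
            account, _, repo_and_tag = match.groups()
            container["image"] = (
                f"{account}.dkr.ecr.{target_region}.amazonaws.com/{repo_and_tag}"
            )
        options = (container.get("logConfiguration") or {}).get("options") or {}
        if options.get("awslogs-region") == source_region:
            options["awslogs-region"] = target_region
    return cleaned
```

Images hosted outside private ECR (for example on ECR Public) do not match the pattern and are left as they are. Write the returned dict back to task-definition.json before running the register command.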

Step 10: Create ECS services in the target region

aws ecs create-service \
  --region TARGET-REGION \
  --cluster YOUR-CLUSTER-NAME \
  --service-name YOUR-SERVICE-NAME \
  --task-definition YOUR-TASK-DEFINITION:REVISION \
  --desired-count 2 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-AZ1,subnet-AZ2],securityGroups=[sg-xxxxxxxxx],assignPublicIp=DISABLED}" \
  --load-balancers "targetGroupArn=arn:aws:elasticloadbalancing:TARGET-REGION:ACCOUNT-ID:targetgroup/...,containerName=YOUR-CONTAINER,containerPort=8080"

See Creating ECS Services documentation.

Step 11: Configure service auto scaling

aws application-autoscaling register-scalable-target \
  --region TARGET-REGION \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/YOUR-CLUSTER-NAME/YOUR-SERVICE-NAME \
  --min-capacity 1 \
  --max-capacity 10

See Service Auto Scaling documentation.

Step 12: Update DNS and route traffic

In Route 53, update the A or CNAME record to point to the new load balancer DNS name. Set a low TTL (60 seconds) before making the change to speed up propagation.
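
For scripted cutovers, the record update can be expressed as a Route 53 change batch. The sketch below builds an UPSERT for a CNAME record with a 60-second TTL; the function name `build_cname_upsert` and the example host names are hypothetical.

```python
import json


def build_cname_upsert(record_name, load_balancer_dns, ttl=60):
    """Build a Route 53 change batch that points record_name at the new load balancer."""
    return {
        "Comment": "Point service at recovery-region load balancer",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": load_balancer_dns}],
                },
            }
        ],
    }


# Hypothetical names, for illustration only.
batch = build_cname_upsert(
    "app.example.com", "recovery-alb-1234567890.us-east-1.elb.amazonaws.com"
)
print(json.dumps(batch, indent=2))
```

Save the output as change-batch.json and submit it with `aws route53 change-resource-record-sets --hosted-zone-id ZONE-ID --change-batch file://change-batch.json`. A CNAME cannot be placed at the zone apex; for an apex name, an alias A record is needed instead, which uses a different record shape than the one sketched here.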

Step 13: Verify and test

Run these checks before switching production traffic:

# Check all services show desired count = running count
aws ecs describe-services \
  --region TARGET-REGION \
  --cluster YOUR-CLUSTER-NAME \
  --services YOUR-SERVICE-NAME \
  --query 'services[*].[serviceName,runningCount,desiredCount,status]'

# Check load balancer target health
aws elbv2 describe-target-health \
  --region TARGET-REGION \
  --target-group-arn YOUR-TARGET-GROUP-ARN

Also verify: logs are flowing to CloudWatch, secrets and environment variables are correct, database connectivity is working, and auto scaling triggers are configured.
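
The first check can be automated by saving the JSON output of the `describe-services --query` command above and scanning it. This is a minimal sketch, assuming the default JSON output format, where each row is `[serviceName, runningCount, desiredCount, status]`; the function name `services_not_ready` is hypothetical.

```python
def services_not_ready(rows):
    """Return names of services that are not ACTIVE or not at their desired count."""
    not_ready = []
    for name, running, desired, status in rows:
        if status != "ACTIVE" or running != desired:
            not_ready.append(name)
    return not_ready


# Hypothetical rows in the shape produced by the --query expression.
rows = [
    ["web", 2, 2, "ACTIVE"],
    ["worker", 1, 3, "ACTIVE"],
    ["legacy", 0, 0, "DRAINING"],
]
print(services_not_ready(rows))
```

An empty list means every service in the input is ACTIVE with running count equal to desired count.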

Alternative: Redeploy using Infrastructure as Code

If you have Terraform:

# Update provider region in your configuration, then run:
terraform plan
terraform apply

If you have CloudFormation:

aws cloudformation create-stack \
  --region TARGET-REGION \
  --stack-name ecs-recovery-stack \
  --template-body file://ecs-stack.yaml \
  --capabilities CAPABILITY_IAM

Post-Recovery Checklist

[ ] All services showing desired count = running count
[ ] Load balancer target group health checks passing
[ ] Logs flowing to CloudWatch Log Groups
[ ] CloudWatch alarms configured for service failures
[ ] CloudWatch Logs retention configured
[ ] DNS / Route 53 records updated to new load balancer
[ ] Auto scaling thresholds verified
[ ] Secrets and environment variables confirmed correct
[ ] Database connectivity tested
[ ] AWS Backup or snapshot schedule re-enabled
[ ] Monitoring dashboards updated
[ ] Disaster recovery documentation updated

Amazon EKS Migration Guide

Key Considerations

Amazon Elastic Kubernetes Service (Amazon EKS) clusters cannot be moved. You must create a new cluster in the target region and restore workloads into it.
The Kubernetes data plane is independent from the control plane. Running pods on healthy nodes continue to operate even when the EKS API server is unreachable.
IRSA (IAM Roles for Service Accounts) must be re-associated with the new cluster's OIDC provider before triggering any restore. Skipping this causes silent AccessDenied errors on all AWS API calls from pods.
EKS add-ons (CSI drivers, CoreDNS, kube-proxy, VPC CNI) must be installed on the new cluster before restoring workloads. Persistent volume mounts fail if CSI drivers are not present at restore time.
Fargate profiles are cluster-specific. If using EKS Fargate, recreate profiles on the new cluster. Drain and cordon steps do not apply to Fargate nodes.
Kubernetes Secrets exported with kubectl are only base64-encoded, not encrypted. Handle exported secret files securely and delete them after use.
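EFS data cannot be copied out of an unavailable region after the fact. For an AZ failure, update mount targets to healthy AZs. For a full region migration, EFS-backed workloads can be restored with their persistent data only if that data was copied to the target region beforehand, for example with EFS replication or an AWS Backup cross-Region copy.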

What You CANNOT Do When the EKS Control Plane Is Down

Run kubectl commands of any kind
Schedule new pods or trigger deployments
Perform rolling updates or rollbacks
Access the EKS console or API server

Running pods on healthy nodes continue to operate. If the EKS API server is completely inaccessible and cannot be restored, open an AWS Support case (Critical severity) with your cluster ARN, account ID, and target region.

Recovery Path

Is the EKS API server reachable?
  YES --> Are any nodes in the failed AZ?
            YES --> Drain affected nodes, exclude failed AZ from node provisioning
            NO  --> No action needed; workloads are already on healthy nodes
  NO  --> Is this a partial AZ failure or full region loss?
            PARTIAL AZ --> Wait for API access to recover, then drain affected nodes
            FULL REGION --> Migrate to another region (steps below)
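
For runbook automation, the decision tree above can be written as a small function. This is only a sketch of the same branches; the function name `eks_recovery_action` and its boolean inputs are hypothetical, and the returned strings mirror the tree's wording.

```python
def eks_recovery_action(api_reachable, nodes_in_failed_az=False, full_region_loss=False):
    """Map the recovery-path questions to the recommended action."""
    if api_reachable:
        if nodes_in_failed_az:
            return "Drain affected nodes, exclude failed AZ from node provisioning"
        return "No action needed; workloads are already on healthy nodes"
    if full_region_loss:
        return "Migrate to another region"
    return "Wait for API access to recover, then drain affected nodes"


print(eks_recovery_action(api_reachable=True, nodes_in_failed_az=True))
print(eks_recovery_action(api_reachable=False, full_region_loss=True))
```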

Moving Between AZs (Partial AZ Failure)

Step 1: Identify nodes in the failed AZ and cordon them

# List nodes and their AZs
kubectl get nodes --label-columns topology.kubernetes.io/zone

# Prevent new pods from scheduling on affected nodes
kubectl cordon NODE-NAME

# Evict all running pods from affected nodes
kubectl drain NODE-NAME --ignore-daemonsets --delete-emptydir-data

Step 2: Prevent new nodes from being created in the failed AZ

Managed Node Groups: Create a new node group that does not include the failed AZ's subnets, then cordon and drain the old node group to migrate workloads.

# --node-role is required; reuse the node IAM role of the existing node group
aws eks create-nodegroup \
  --cluster-name YOUR-CLUSTER \
  --nodegroup-name recovery-nodegroup \
  --node-role arn:aws:iam::ACCOUNT-ID:role/YOUR-NODE-ROLE \
  --subnets subnet-HEALTHY-AZ1 subnet-HEALTHY-AZ2 \
  --instance-types INSTANCE-TYPE \
  --ami-type AMI-TYPE \
  --scaling-config minSize=MIN-SIZE,maxSize=MAX-SIZE,desiredSize=DESIRED-SIZE \
  --region SOURCE-REGION

See Migrating to a new node group.

Karpenter: Patch the NodePool to exclude the failed AZ:

kubectl patch nodepool default --type='json' \
  -p='[{"op": "add", "path": "/spec/template/spec/requirements/-",
  "value": {"key": "topology.kubernetes.io/zone",
            "operator": "NotIn", "values": ["FAILED-AZ-ID"]}}]'

Replace FAILED-AZ-ID with the failed AZ's zone name (for example us-east-1a), which is the value format used by the topology.kubernetes.io/zone label.

Step 3: Handle stateful workloads in the failed AZ

EBS-backed StatefulSets: EBS volumes are AZ-specific. Use a volume snapshot taken before the failure (snapshots of volumes in a failed AZ cannot be created) as the data source for a new PVC in a healthy AZ. See Migrating EKS clusters from gp2 to gp3 EBS volumes for the snapshot-based migration pattern.

EFS-backed StatefulSets: EFS is regional. Only the mount target is AZ-specific. Update the PVC to reference a mount target in a healthy AZ. No data migration is required.

Moving to a Different Region (Full Region Loss)

Step 1: Export workload manifests

Export manifests before attempting any backup or restore. This provides a portable fallback if AWS Backup is unavailable in the affected region.

# Export all workload resources across all namespaces
kubectl get deploy,svc,configmap,ingress,hpa,statefulset,daemonset \
  -A -o yaml > all-workloads.yaml

# Export Kubernetes Secrets (filter out system secrets before applying to new cluster)
# WARNING: Secrets are only base64-encoded, not encrypted. You MUST encrypt the output file.
# Verify GPG is configured with a recipient key before proceeding.
kubectl get secrets -A -o yaml | gpg --encrypt -r <key-id> > secrets-backup.yaml.gpg

# Export ConfigMaps
kubectl get configmaps -A -o yaml > configmaps-backup.yaml

Kubernetes Secrets are only base64-encoded, not encrypted, so never write them to disk unencrypted. Delete all exported files after use.

Step 2: Back up the cluster using AWS Backup (if accessible)

If AWS Backup is operational in the source region:

aws backup start-backup-job \
  --region SOURCE-REGION \
  --backup-vault-name YOUR-VAULT-NAME \
  --resource-arn arn:aws:eks:SOURCE-REGION:ACCOUNT-ID:cluster/YOUR-CLUSTER \
  --iam-role-arn arn:aws:iam::ACCOUNT-ID:role/AWSBackupDefaultServiceRole

See AWS Backup for EKS.

Step 3: Prepare the new cluster in the target region

Complete all three preparation steps before triggering any restore. Restoring without completing them causes IRSA failures and persistent volume mount errors.

3a. Create the new EKS cluster

Use eksctl, the AWS Console, or your existing IaC targeting the new region.

eksctl create cluster \
  --name NEW-CLUSTER-NAME \
  --region TARGET-REGION \
  --version KUBERNETES-VERSION

3b. Associate the OIDC provider and update IRSA trust policies

Security: Customers are responsible for configuring IAM roles for service accounts, managing OIDC provider associations, and verifying trust policies follow least-privilege principles. Review each IRSA role's permissions during migration to remove any unnecessary access.

# Get the new cluster's OIDC issuer URL
aws eks describe-cluster \
  --name NEW-CLUSTER-NAME \
  --region TARGET-REGION \
  --query "cluster.identity.oidc.issuer" --output text

# Associate the OIDC provider
eksctl utils associate-iam-oidc-provider \
  --cluster NEW-CLUSTER-NAME \
  --region TARGET-REGION --approve

# For each IRSA role: retrieve the trust policy, update the OIDC ARN, then apply
aws iam get-role --role-name YOUR-IRSA-ROLE \
  --query "Role.AssumeRolePolicyDocument" > trust-policy.json
# Edit trust-policy.json: replace the old OIDC ARN with the new cluster's OIDC ARN
aws iam update-assume-role-policy \
  --role-name YOUR-IRSA-ROLE \
  --policy-document file://trust-policy.json

See IAM Roles for Service Accounts.
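
The manual edit of trust-policy.json can be done programmatically when many IRSA roles are involved. The sketch below swaps the old cluster's OIDC provider for the new one everywhere it appears in a trust policy, including the `Federated` principal and the condition keys. The function name `retarget_trust_policy` is hypothetical; the issuer URLs are the values returned by `describe-cluster` for the old and new clusters.

```python
def _replace_everywhere(value, old, new):
    """Recursively replace old with new in all strings and dict keys."""
    if isinstance(value, str):
        return value.replace(old, new)
    if isinstance(value, list):
        return [_replace_everywhere(item, old, new) for item in value]
    if isinstance(value, dict):
        return {
            key.replace(old, new): _replace_everywhere(item, old, new)
            for key, item in value.items()
        }
    return value


def retarget_trust_policy(policy, old_issuer_url, new_issuer_url):
    """Return a copy of policy that trusts the new cluster's OIDC provider."""
    # The provider path is the issuer URL without its scheme.
    old_provider = old_issuer_url.split("://", 1)[-1]
    new_provider = new_issuer_url.split("://", 1)[-1]
    return _replace_everywhere(policy, old_provider, new_provider)
```

This mirrors the replace-in-place instruction above. If the old cluster may come back into service, an alternative is to append a second statement for the new provider instead of replacing the first; that choice is left to the operator.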

3c. Install required EKS add-ons on the new cluster

SOURCE_CLUSTER=YOUR-SOURCE-CLUSTER
TARGET_CLUSTER=NEW-CLUSTER-NAME

for ADDON in $(aws eks list-addons \
    --cluster-name $SOURCE_CLUSTER \
    --region SOURCE-REGION \
    --query 'addons[]' --output text); do
  echo "Installing $ADDON..."
  aws eks create-addon \
    --cluster-name $TARGET_CLUSTER \
    --addon-name $ADDON \
    --region TARGET-REGION
done

See Managing EKS add-ons.

Step 4: Migrate container images

Pod images reference ECR repositories in the source region. If ECR in the source region is inaccessible, follow the ECS section options A, B, and C under "Copy container images to the target region ECR." The process is identical for EKS.

Step 5: Migrate supporting resources

EBS volumes: Use volume snapshots and follow the EBS snapshot migration steps in the EC2 section of this guide.

Secrets Manager and SSM Parameter Store: Follow ECS Step 4. The process is identical.

S3 buckets: When migrating S3 data to a new region, enable Block Public Access on the target bucket, configure default encryption (SSE-KMS recommended), enforce TLS-only access via bucket policy, and enable versioning. Use aws s3 sync with --sse aws:kms for the cross-region copy.
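
The TLS-only requirement is typically expressed as a bucket policy that denies any request where `aws:SecureTransport` is false. The sketch below generates such a policy document; the function name `tls_only_bucket_policy` and the bucket name are hypothetical.

```python
import json


def tls_only_bucket_policy(bucket_name):
    """Build a bucket policy that denies all S3 requests made without TLS."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


# Hypothetical bucket name, for illustration only.
print(json.dumps(tls_only_bucket_policy("my-recovery-bucket"), indent=2))
```

Save the output to a file and attach it with `aws s3api put-bucket-policy --bucket BUCKET-NAME --policy file://policy.json`. If the bucket already has a policy, merge this statement into it rather than overwriting it.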

VPC resources: Recreate VPC, subnets, security groups, and NAT Gateways in the target region before restoring the cluster.

Fargate profiles: If using EKS Fargate, recreate the Fargate profiles on the new cluster after it is created.

Step 6: Restore workloads

Option A: Restore from AWS Backup

aws backup start-restore-job \
  --region TARGET-REGION \
  --recovery-point-arn RECOVERY-POINT-ARN \
  --iam-role-arn arn:aws:iam::ACCOUNT-ID:role/AWSBackupDefaultServiceRole \
  --metadata '{"ClusterName":"NEW-CLUSTER-NAME","Region":"TARGET-REGION"}'

See Restoring an EKS cluster.

Option B: Apply exported manifests

Update the kubeconfig to point to the new cluster, then apply the exported manifests. Review and filter system namespaces and service account secrets before applying.

aws eks update-kubeconfig \
  --name NEW-CLUSTER-NAME \
  --region TARGET-REGION

# Review the file before applying to exclude kube-system and other system namespaces
kubectl apply -f all-workloads.yaml
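
The review step can be partly automated. The sketch below assumes the manifests were exported with `-o json` instead of `-o yaml` (Python's standard library has no YAML parser), which yields a single `List` object with an `items` array. It drops resources in the system namespaces and strips server-populated metadata and status. The function name `filter_for_restore` is hypothetical, and the result still deserves a manual review before `kubectl apply`.

```python
import copy

SYSTEM_NAMESPACES = {"kube-system", "kube-public", "kube-node-lease"}

# Metadata fields populated by the API server of the old cluster.
SERVER_SET_METADATA = (
    "uid", "resourceVersion", "creationTimestamp",
    "generation", "managedFields", "selfLink",
)


def filter_for_restore(manifest_list):
    """Return a List manifest without system-namespace items or server-set fields."""
    kept = []
    for item in manifest_list.get("items", []):
        if item.get("metadata", {}).get("namespace") in SYSTEM_NAMESPACES:
            continue
        cleaned = copy.deepcopy(item)
        for field in SERVER_SET_METADATA:
            cleaned.get("metadata", {}).pop(field, None)
        cleaned.pop("status", None)
        kept.append(cleaned)
    return {"apiVersion": "v1", "kind": "List", "items": kept}
```

`kubectl apply -f` accepts JSON as well as YAML, so the filtered output can be written to a .json file and applied directly.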

Step 7: Update DNS and route traffic

In Route 53, update A or CNAME records to point to the new load balancer or ingress endpoints. Set a low TTL (60 seconds) before making the change to speed up propagation.

Post-Recovery Checklist

[ ] kubeconfig updated to new cluster (aws eks update-kubeconfig)
[ ] All nodes in Ready state (kubectl get nodes)
[ ] All pods running and healthy (kubectl get pods -A)
[ ] EKS add-ons installed and active
[ ] Pod AWS API calls working (check logs for AccessDenied or timeout errors)
[ ] ECR images accessible from target region
[ ] Persistent volumes bound and accessible
[ ] Services and Ingress endpoints resolving correctly
[ ] DNS / Route 53 records updated to new load balancer
[ ] Horizontal Pod Autoscaler (HPA) thresholds verified
[ ] AWS Backup plan re-enabled for new cluster
[ ] Monitoring and alerting updated for new cluster endpoints
[ ] Disaster recovery documentation updated

Security Controls and Measurable Improvements

Encrypted Snapshots/AMIs: Cross-region AWS KMS re-encryption provides data encryption in transit and at rest using region-specific keys. Measurable: all snapshots and AMIs in target region are encrypted; no unencrypted copies exist.

Secrets Manager Migration: Secrets Manager provides automatic rotation, encryption at rest with AWS KMS, and audit logging via CloudTrail. Migrating secrets before service deployment enables applications to avoid falling back to hardcoded credentials. Measurable: all secrets accessible in target region; rotation schedules configured; no plaintext credentials in application configs.

IRSA (IAM Roles for Service Accounts): IRSA establishes pod-level IAM permissions using OIDC federation, replacing node-level IAM roles. This implements least-privilege access by scoping permissions to individual service accounts. Measurable: all IRSA trust policies updated with new cluster OIDC ARN; no pods using node-level IAM roles; AccessDenied errors resolved before cutover.

Compute Services Migration

Security: Always ensure migrated resources meet or exceed the security configuration of the source resources. Refer to SECURITY.md for security requirements.

EC2 Migration Guide

Key Considerations

Not all EC2 instance types are available in every Region and AZ. Confirm that your required instance types are supported in the destination Region and AZs, for example with aws ec2 describe-instance-type-offerings --location-type availability-zone --region <region>
Instance store volumes are ephemeral and not preserved in AMI copies — use EBS-backed AMIs for persistent storage
IP addresses change during migration; update DNS and security group references accordingly
Reference: EC2 Regions and Availability Zones
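
The instance type check in the first item can be scripted against the JSON output of `describe-instance-type-offerings`. This is a minimal sketch, assuming the standard output shape (an `InstanceTypeOfferings` list whose entries carry `InstanceType` and `Location`); the function name `zones_offering_all` and the sample data are hypothetical.

```python
def zones_offering_all(offerings_output, required_types):
    """Return the AZs that offer every instance type in required_types."""
    types_by_zone = {}
    for offering in offerings_output.get("InstanceTypeOfferings", []):
        types_by_zone.setdefault(offering["Location"], set()).add(
            offering["InstanceType"]
        )
    required = set(required_types)
    return sorted(
        zone for zone, types in types_by_zone.items() if required <= types
    )


# Hypothetical sample of the CLI output.
sample_offerings = {
    "InstanceTypeOfferings": [
        {"InstanceType": "r5.2xlarge", "LocationType": "availability-zone", "Location": "us-east-1a"},
        {"InstanceType": "t3.micro", "LocationType": "availability-zone", "Location": "us-east-1a"},
        {"InstanceType": "t3.micro", "LocationType": "availability-zone", "Location": "us-east-1b"},
    ]
}
print(zones_offering_all(sample_offerings, ["r5.2xlarge", "t3.micro"]))
```

An empty result means no single AZ in the destination Region offers the full set, so either the instance types or the target Region needs to change.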

What You Cannot Do When the AZ Is Down

Create AMIs from affected instances
Take EBS snapshots of affected volumes
Use AWS Application Migration Service (AWS MGN) to migrate an instance that is not running. (This may still work if you can log on to the instance to install the AWS MGN agent and the instance can reach the AWS MGN endpoints, or if you replicated the instance before the AZ failure and have a snapshot available.)
Detach EBS volumes (volumes are AZ-bound)
Access affected instances via RDP, AWS Systems Manager, or Serial Console
Collect inventory data from affected instances or resources

You can only work with what existed before the failure: AMIs, EBS snapshots, or AWS Backup recovery points.

Recovery Path

Do you have a recent AMI?  
  YES --> Step 1: Launch from AMI  
  NO  --> Do you have EBS snapshots?  
            YES --> Step 2: Restore from snapshots  
            NO  --> Is AWS Backup configured?  
                      YES --> Step 3: Restore from AWS Backup  
                      NO  --> Step 4: Launch fresh instance (data loss)
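
The same decision tree, written as a function for use in a runbook script. This is only a sketch; the function name `ec2_recovery_step` and its boolean inputs are hypothetical, and the returned strings mirror the tree's wording.

```python
def ec2_recovery_step(has_recent_ami, has_ebs_snapshots, has_aws_backup):
    """Pick the first available recovery source, in order of preference."""
    if has_recent_ami:
        return "Step 1: Launch from AMI"
    if has_ebs_snapshots:
        return "Step 2: Restore from snapshots"
    if has_aws_backup:
        return "Step 3: Restore from AWS Backup"
    return "Step 4: Launch fresh instance (data loss)"


print(ec2_recovery_step(False, True, True))
```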

Moving Between AZs or Regions (Once the Affected Region is Restored)

Failed AZ, moving to a new region:

With an AZ down, you do not have access to the instances and EBS volumes within that AZ.
Large snapshots and AMIs (multi-TB) can take hours to copy cross-region.
You can recover from an EBS snapshot or AMI that was created before the event.
Snapshot:
- Copy Snapshot https://docs.aws.amazon.com/ebs/latest/userguide/ebs-copy-snapshot.html#ebs-snapshot-copy
- CLI: https://docs.aws.amazon.com/ec2/latest/devguide/example_ec2_CopySnapshot_section.html

Security: Always use --encrypted --kms-key-id for all snapshot copies to maintain encryption at rest.

Encrypted snapshot:

This is the most common cross-region scenario. KMS keys are regional, so you need a key in the destination region and operational KMS and EBS services in the source region.

aws ec2 copy-snapshot \
  --source-region us-west-2 \
  --source-snapshot-id snap-0123456789abcdef0 \
  --region us-east-1 \
  --encrypted \
  --kms-key-id arn:aws:kms:us-east-1:123456789012:key/abcd1234-5678-90ab-cdef-EXAMPLE \
  --description "Re-encrypted with destination region key"

AMI
- Copy EC2 AMI https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/CopyingAMIs.html

Security: Always use --encrypted --kms-key-id for all AMI copies to maintain encryption at rest.

Encrypted AMI:

Copy an encrypted AMI cross-region (re-encrypt with destination region KMS key):

aws ec2 copy-image \
  --source-image-id ami-0123456789abcdef0 \
  --source-region us-west-2 \
  --name "Re-encrypted AMI in us-east-1" \
  --region us-east-1 \
  --encrypted \
  --kms-key-id arn:aws:kms:us-east-1:123456789012:key/abcd1234-5678-90ab-cdef-EXAMPLE

Moving Between AZs (Once the Affected Region is Restored)

To restore from an EBS snapshot: https://docs.aws.amazon.com/prescriptive-guidance/latest/backup-recovery/restore.html

Restore to a new EBS Volume

aws ec2 create-volume \
  --snapshot-id snap-0123456789abcdef0 \
  --availability-zone us-east-1a \
  --volume-type gp3

Then attach it to an instance. The volume's AZ must match the target instance's AZ:

aws ec2 attach-volume \
  --volume-id vol-NEW_ID \
  --instance-id i-0123456789abcdef0 \
  --device /dev/xvdf

To restore from an AMI https://repost.aws/knowledge-center/launch-instance-custom-ami
- CLI: https://docs.aws.amazon.com/ec2/latest/devguide/example_ec2_RunInstances_section.html

aws ec2 run-instances \
  --image-id ami-0abcdef1234567890 \
  --instance-type t3.micro \
  --subnet-id subnet-0abcdef1234567890 \
  --security-group-ids sg-0abcdef1234567890 \
  --associate-public-ip-address \
  --key-name MyKeyPair

If something goes wrong during AMI or snapshot recovery, see the EC2 troubleshooting documentation:
- https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-troubleshoot.html

For instances that are still accessible within the impacted region

The steps below allow extraction of EC2 instances that are still running while all control plane functions are unavailable.

AWS Application Migration Service (AWS MGN)

This can be done by following these steps:

The above steps extract an EC2 instance and all of its data. Other considerations are DNS names and IP addresses, which will change, and any dependent resources for the deployment.

Manually dumping MySQL databases before data sync:

https://dev.mysql.com/doc/refman/8.4/en/mysqldump.html

mysqldump -u username --ssl-mode=VERIFY_IDENTITY --ssl-ca=/path/to/rds-combined-ca-bundle.pem --all-databases | gzip > /source/data/dbdump.db.gz

Security: Use --ssl-mode=VERIFY_IDENTITY with the RDS CA certificate bundle to verify the server certificate and prevent man-in-the-middle attacks. Download the bundle from https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/UsingWithRDS.SSL.html

Manually copying data between EC2 Linux instances using rsync (if APIs are not working)

Note: rsync must be installed on both the source and the destination.

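# Fetch the destination host key, then verify its fingerprint out of band before trusting it
ssh-keyscan -H <DESTINATION-IP> >> ~/.ssh/known_hosts

# sudo makes ssh run as root, so point it at the known_hosts file written above
sudo rsync -avzP -e "ssh -i /path/to/key.pem -o StrictHostKeyChecking=yes -o UserKnownHostsFile=$HOME/.ssh/known_hosts" \
    /source/data/ \
    ec2-user@<DESTINATION-IP>:/destination/data/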

Security: Verify the SSH host key before first connection. Verify that the key file has restricted permissions (chmod 400). Consider using ssh-agent for key management instead of specifying key paths directly.

/path/to/key.pem is the private key for the destination instance.

Key flags:

-a — archive mode (preserves permissions, timestamps, symlinks)
-v — verbose
-z — compress during transfer
-P — show progress + allow resume of partial transfers

--delete — (optional) remove files on destination that do not exist on source

Manually copying data between EC2 Windows instances using Robocopy

Note: Windows-based file transfers using robocopy may not fully utilize available bandwidth on connections with latency >100ms due to default TCP window size settings. Consider testing transfer methods for your specific environment.

Security: Enable SMB encryption for robocopy transfers: Set-SmbServerConfiguration -EncryptData $true -Force. Verify destination volumes use encrypted EBS volumes.

Prerequisites

Destination server must have file sharing enabled
Source server needs network access to destination
Account needs write permissions on destination
PSRemoting must be enabled

Enable PSRemoting:

# On Both Servers:
Enable-PSRemoting -Force

# On Source Server:
Set-Item WSMan:\localhost\Client\TrustedHosts -Value "DestinationServerIPorName" -Force

# On Target Server:
Set-Item WSMan:\localhost\Client\TrustedHosts -Value "SourceServerIPorName" -Force

# On Both Servers:
Restart-Service WinRM

PowerShell Remote Method


# From your local machine, establish session to source server
$session = New-PSSession -ComputerName SourceServer -Credential (Get-Credential)

# Execute robocopy on source server to copy to destination
Invoke-Command -Session $session -ScriptBlock {
robocopy "C:\SourcePath" "\\DestinationServer\C$\DestPath" /E /Z /R:3 /W:5 /MT:8 /LOG:C:\robocopy.log
}

# Check log
Invoke-Command -Session $session -ScriptBlock { Get-Content C:\robocopy.log -Tail 20 }

# Clean up
Remove-PSSession $session

RDP Method

Connect via RDP, then run directly:

robocopy "C:\SourcePath" "\\DestinationServer\C$\DestPath" /E /Z /R:3 /W:5 /MT:8 /LOG:C:\robocopy.log

Key Flags

/E - Copy subdirectories including empty
/Z - Restartable mode
/R:3 - 3 retries on failed copies
/W:5 - 5 seconds wait between retries
/MT:8 - 8 threads for faster copy
/LOG - Output log file

Once Control Plane has been restored

Once the control plane is restored, customers can still follow the guidance above, but additional migration options become available:

Use the AWSSupport-CopyEC2Instance runbook to automate the process of moving an EC2 instance to a new subnet, AZ, or VPC, in the same or a different Region.
- Do not use AWSSupport-CopyEC2Instance for Active Directory Domain Controller instances
- See the AWSSupport-CopyEC2Instance Runbook Reference
For EC2 instances with encrypted EBS volumes, use the AMI copy + S3 method to avoid sharing KMS keys across Regions.
- See Migrate Encrypted EC2 Instances Across Regions Without Sharing KMS Keys

Restore from AWS Backup

aws backup list-recovery-points-by-resource \
  --resource-arn arn:aws:ec2:us-east-1:ACCOUNT:instance/i-XXXXXXXXXXXX

aws backup start-restore-job \
  --recovery-point-arn arn:aws:backup:us-east-1:ACCOUNT:recovery-point:RP-ID \
  --iam-role-arn arn:aws:iam::ACCOUNT:role/AWSBackupDefaultServiceRole \
  --metadata '{"SubnetId":"subnet-healthy-az","SecurityGroupIds":"sg-XXXXXXXXXXXX","InstanceType":"r5.2xlarge"}'

Monitor the restore:

aws backup describe-restore-job --restore-job-id RESTORE-JOB-ID

Post-Migration Checklist

[ ] OS boots cleanly (no dracut/emergency shell)
[ ] Instance responding to SSH
[ ] Filesystems mounted correctly (e.g., /etc/fstab uses UUIDs)
[ ] DNS resolution functional (host/dig www.amazon.com)
[ ] SSM agent online (SSM Session Manager)
[ ] Security groups applied correctly
[ ] Application services running (systemctl status checks)
[ ] Elastic IP reassociated (if applicable)
[ ] Route 53 / DNS records updated
[ ] Backup schedule re-enabled for new instance IDs
[ ] Monitoring and alerting updated for new instance IDs
[ ] Cron jobs / systemd timers verified
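
The fstab item in the checklist can be checked with a short script. Device names such as /dev/xvdf or /dev/nvme1n1 can change when an instance is restored onto different hardware, which is why UUID-based entries are preferred. The sketch below flags fstab entries that mount by raw device name; the function name `fstab_entries_by_device_name` is hypothetical, and treating /dev/mapper and /dev/disk/by-* paths as stable is an assumption of this sketch.

```python
# Path prefixes treated as stable identifiers (an assumption of this sketch).
STABLE_PREFIXES = ("/dev/disk/by-", "/dev/mapper/")


def fstab_entries_by_device_name(fstab_text):
    """Return the device fields of fstab entries that use raw /dev names."""
    flagged = []
    for line in fstab_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        device = stripped.split()[0]
        if device.startswith("/dev/") and not device.startswith(STABLE_PREFIXES):
            flagged.append(device)
    return flagged


# Hypothetical fstab content, for illustration only.
example_fstab = """\
# /etc/fstab
UUID=0a1b2c3d-1111-2222-3333-444455556666 /     xfs  defaults        0 0
/dev/xvdf                                 /data ext4 defaults,nofail 0 2
/dev/mapper/vg0-var                       /var  xfs  defaults        0 0
"""
print(fstab_entries_by_device_name(example_fstab))
```

To check a restored instance, pass the contents of its /etc/fstab to the function; any device it returns is a candidate for conversion to a UUID= entry.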

WSFC / SQL FCI Cluster Recovery

This section applies only to Windows Server Failover Cluster and SQL Server Failover Cluster Instance environments.

If one cluster node survived

The surviving node should own the SQL FCI resources. Verify:

Get-ClusterNode | Format-Table Name, State
Get-ClusterGroup | Format-Table Name, State, OwnerNode
Get-Service MSSQLSERVER | Select-Object Status

Evict the dead node:

Remove-ClusterNode -Name "DEAD-NODE" -Force

Build a replacement node using Steps 1 through 4 of the recovery path above. After the instance is running and domain-joined:

# Install clustering features
Install-WindowsFeature -Name Failover-Clustering -IncludeManagementTools
Install-WindowsFeature -Name Multipath-IO

# Connect iSCSI to FSx ONTAP (if using shared storage)
# Run scripts/04-configure-iscsi-fsx.ps1

# Add to cluster
Add-ClusterNode -Name $env:COMPUTERNAME -Cluster "SQLFCI"

# Add to SQL FCI
# Run scripts/06-install-sql-fci.ps1 -Action AddNode

VIP behavior in VPC

Cluster VIPs do not float between subnets in AWS VPC. Each node has its own secondary IP on its ENI. The cluster updates DNS on failover.

After adding a node in a new subnet:

Add a secondary IP to the node's ENI
In Failover Cluster Manager, set the cluster IP resource to Static with the secondary IP
Set Possible Owners to only the node that owns that IP

If both nodes were in the failed AZ

A full cluster rebuild is required. If using Amazon FSx for NetApp ONTAP multi-AZ, your LUNs and SQL data are still intact. Rebuild both nodes, recreate the cluster, reconnect iSCSI, and reinstall SQL FCI.

Amazon FSx for NetApp ONTAP and Failover Cluster Storage Recovery

This section covers the OS-level storage recovery when using FSx for NetApp ONTAP with iSCSI and Windows Server Failover Clustering.

Amazon FSx for NetApp ONTAP Behavior During AZ Failure

# Check Amazon FSx file system status
aws fsx describe-file-systems --file-system-ids fs-XXXXXXXXXXXX \
  --query "FileSystems[0].[Lifecycle,FileSystemType,StorageType]" --output table

# Check SVM status and iSCSI endpoints
aws fsx describe-storage-virtual-machines \
  --filters "Name=file-system-id,Values=fs-XXXXXXXXXXXX" \
  --query "StorageVirtualMachines[*].[Name,Lifecycle,Endpoints.Iscsi.IpAddresses]" --output table

Verifying iSCSI Reconnection After FSx Failover

On each cluster node:

Get-IscsiSession | Format-Table TargetNodeAddress, IsConnected, IsPersistent
Get-IscsiTargetPortal | Format-Table TargetPortalAddress, TargetPortalPortNumber
mpclaim -s -d
Get-Disk | Where-Object BusType -eq iSCSI | Format-Table Number, FriendlyName, OperationalStatus, Size

If iSCSI Sessions Did Not Reconnect

# Remove stale portals and re-add
Get-IscsiTargetPortal | Remove-IscsiTargetPortal -Confirm:$false

New-IscsiTargetPortal -TargetPortalAddress 10.16.36.42 -TargetPortalPortNumber 3260
New-IscsiTargetPortal -TargetPortalAddress 10.16.97.81 -TargetPortalPortNumber 3260

Start-Sleep -Seconds 5

$targets = Get-IscsiTarget
foreach ($target in $targets) {
    # Trailing backticks continue each command onto the next line
    Connect-IscsiTarget -NodeAddress $target.NodeAddress `
        -TargetPortalAddress 10.16.36.42 -TargetPortalPortNumber 3260 `
        -IsPersistent $true -IsMultipathEnabled $true -ErrorAction SilentlyContinue
    Connect-IscsiTarget -NodeAddress $target.NodeAddress `
        -TargetPortalAddress 10.16.97.81 -TargetPortalPortNumber 3260 `
        -IsPersistent $true -IsMultipathEnabled $true -ErrorAction SilentlyContinue
}

Get-IscsiSession | Format-Table TargetNodeAddress, IsConnected

Cluster Disk Recovery

After iSCSI reconnects, cluster disks may show as Failed or Offline.

# Check cluster disk status
Get-ClusterResource | Where-Object ResourceType -eq "Physical Disk" | Format-Table Name, State, OwnerNode

# Start failed disks
Get-ClusterResource | Where-Object { $_.ResourceType -eq "Physical Disk" -and $_.State -eq "Failed" } | ForEach-Object {
    Start-ClusterResource -Name $_.Name -ErrorAction SilentlyContinue
}

# If disks show Offline in Disk Management
Get-Disk | Where-Object { $_.BusType -eq "iSCSI" -and $_.OperationalStatus -eq "Offline" } | Set-Disk -IsOffline $false

Disk Signature Issues After Recovery

After snapshot restore or node rebuild, disk signature collisions can prevent disks from coming online.

# Check for offline disks
Get-Disk | Where-Object OperationalStatus -eq "Offline"

# Force online (verify correct disk first)
Set-Disk -Number X -IsOffline $false
Set-Disk -Number X -IsReadOnly $false

# For cluster disks, use cluster commands
Get-ClusterResource "Cluster Disk 1" | Start-ClusterResource

MPIO Path Verification

# Show all MPIO disks and paths (expect 2 paths per disk)
mpclaim -s -d

# Check load balance policy (should be RR)
Get-MSDSMGlobalDefaultLoadBalancePolicy

# Fix if needed
Set-MSDSMGlobalDefaultLoadBalancePolicy -Policy RR

# Detailed path status
mpclaim -v

SQL Server Recovery After Storage Reconnection

# Check SQL cluster resource
Get-ClusterResource | Where-Object ResourceType -eq "SQL Server" | Format-Table Name, State

# Check dependency chain
Get-ClusterResource "SQL Server" | Get-ClusterResourceDependency

# Start in order: disks, then network name, then SQL
Get-ClusterResource | Where-Object ResourceType -eq "Physical Disk" | Start-ClusterResource
Start-Sleep -Seconds 10
Start-ClusterResource "SQL Network Name (SQLFCI-SQL)"
Start-Sleep -Seconds 5
Start-ClusterResource "SQL Server"

# Verify
Invoke-Sqlcmd -Query "SELECT @@SERVERNAME, @@VERSION" -ServerInstance "SQLFCI-SQL"
Invoke-Sqlcmd -Query "SELECT name, state_desc FROM sys.databases" -ServerInstance "SQLFCI-SQL"

FSx ONTAP REST API Verification

Security: The examples below read the fsxadmin password from AWS Secrets Manager into the shell variable $PASSWORD. Do not hardcode passwords in scripts or pass them as literal strings in commands. Keep TLS certificate verification enabled rather than passing -k to curl. See SECURITY.md.

# Get management IP
aws fsx describe-file-systems --file-system-ids fs-XXXXXXXXXXXX \
  --query "FileSystems[0].OntapConfiguration.Endpoints.Management.IpAddresses[0]" \
  --output text

# Check LUN status (retrieve password from Secrets Manager in production)
PASSWORD=$(aws secretsmanager get-secret-value --secret-id fsxadmin-password --query SecretString --output text)
curl -u "fsxadmin:$PASSWORD" \
  "https://MGMT-IP/api/storage/luns?svm.name=svm-sql&fields=status,name,space"

# Check igroup mappings
curl -u "fsxadmin:$PASSWORD" \
  "https://MGMT-IP/api/protocols/san/igroups?svm.name=svm-sql&fields=initiators,lun_maps"

# Check iSCSI service
curl -u "fsxadmin:$PASSWORD" \
  "https://MGMT-IP/api/protocols/san/iscsi/services?svm.name=svm-sql"

Coldsnap: EBS Direct API Snapshot Transfers

What coldsnap does:

Uploads local disk images directly into EBS snapshots using the EBS Direct APIs — no EC2 instance or volume attachment needed.

Downloads EBS snapshots to local files for out-of-band transfer.

Useful in automated pipelines, when the control plane is impaired, or when you need to move disk images without launching instances.

Source: github.com/awslabs/coldsnap (Apache-2.0, Rust)

Installation

cargo install --locked coldsnap

Upload Local Disk Image to EBS Snapshot

# Upload and wait for snapshot to become available
coldsnap upload --wait disk.img

# Upload with KMS encryption
coldsnap upload disk.img --kms-key-id arn:aws:kms:<region>:<account>:key/<key-id>

# Upload with tags
coldsnap upload disk.img --tag "Key=Environment,Value=DR" --tag "Key=Source,Value=migration"

Download EBS Snapshot to Local File

coldsnap download snap-1234 disk.img

Cross-Region Migration with Coldsnap

# 1. Download snapshot from source region
AWS_DEFAULT_REGION=<source-region> coldsnap download snap-1234 disk.img

# 2. Upload to target region
AWS_DEFAULT_REGION=<target-region> coldsnap upload --wait disk.img

# 3. Create volume from new snapshot in target region
aws ec2 create-volume --snapshot-id <new-snap-id> --availability-zone <target-az> --region <target-region>

Wait for Snapshot

coldsnap wait snap-1234

Credentials

Coldsnap uses the same credential chain as the AWS CLI (~/.aws/credentials, environment variables, instance profiles). Use --profile <name> to select a specific profile.

Amazon ECS Migration Guide

Key Considerations

ECS clusters cannot be moved to another region. You must recreate the cluster, task definitions, and services in the target region.
The ECS data plane is independent from the control plane. Running tasks on healthy nodes continue to operate even when the control plane is unreachable.
Container images must be available in the target region before services can be created. If ECR in the source region is inaccessible, plan to rebuild images from your CI/CD pipeline or pull from an alternate registry.
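Stateful workloads backed by EFS continue working across AZ failures because EFS is a regional service; only the mount target is AZ-specific. Cross-region EFS data migration is not possible during an outage unless replication or backup copies to the target region (for example, EFS replication or AWS Backup cross-region copy) were configured beforehand.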
Stateful workloads backed by EBS (EC2 launch type only) are AZ-specific and cannot move automatically.
IAM roles are global, but verify they exist and hold the correct permissions before creating services in the target region.
Secrets Manager secrets and SSM parameters are regional. Export and recreate them in the target region before registering task definitions.

What You CANNOT Do When the ECS Control Plane Is Down

Stop, start, or restart tasks
Update services or change desired count
Register new task definitions
Create new services or clusters
Access Amazon ECR in the affected region

Recovery Path

Is the ECS control plane reachable and at least one AZ healthy?
  YES --> Stream 1: Recover in-place (partial AZ failure)
  NO  --> Is this a full region outage?
            YES --> Stream 2: Migrate to another region
            NO  --> Open an AWS Support case (Critical severity)
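
The decision tree above can also be written as a small function, for example to drive a runbook script. This is a Python sketch only; the function name, parameters, and return labels are illustrative and not part of any AWS tooling.

```python
def ecs_recovery_path(control_plane_reachable, healthy_az_count, full_region_outage):
    """Return the recovery stream suggested by the decision tree.

    All three inputs are operator observations; nothing here calls AWS.
    """
    if control_plane_reachable and healthy_az_count >= 1:
        return "stream-1-recover-in-place"
    if full_region_outage:
        return "stream-2-migrate-to-another-region"
    return "open-aws-support-case"


print(ecs_recovery_path(True, 2, False))
print(ecs_recovery_path(False, 0, True))
print(ecs_recovery_path(False, 1, False))
```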

Moving Between AZs (Partial AZ Failure)

Use this path when the ECS control plane is still reachable and at least one AZ is healthy. The goal is to stop scheduling tasks in the failed AZ and reschedule them in the remaining healthy AZs.

Step 1: Identify tasks running in the affected AZ

aws ecs list-tasks \
  --region SOURCE-REGION \
  --cluster YOUR-CLUSTER-NAME

aws ecs describe-tasks \
  --region SOURCE-REGION \
  --cluster YOUR-CLUSTER-NAME \
  --tasks TASK-ARN-1 TASK-ARN-2 \
  --query 'tasks[*].[taskArn,availabilityZone,lastStatus]'
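
To narrow the describe-tasks result to the failed AZ, the JSON can be filtered locally. A minimal Python sketch, assuming the full describe-tasks response (without the --query filter) has been parsed into a dict; tasks_in_az is a hypothetical helper name and the sample data is made up.

```python
import json


def tasks_in_az(describe_tasks_output, failed_az):
    """Return the ARNs of tasks whose availabilityZone equals failed_az."""
    return [
        task["taskArn"]
        for task in describe_tasks_output.get("tasks", [])
        if task.get("availabilityZone") == failed_az
    ]


# Illustrative sample shaped like the describe-tasks response.
sample = json.loads("""
{
  "tasks": [
    {"taskArn": "arn:aws:ecs:me-central-1:111122223333:task/demo/aaa",
     "availabilityZone": "me-central-1a", "lastStatus": "RUNNING"},
    {"taskArn": "arn:aws:ecs:me-central-1:111122223333:task/demo/bbb",
     "availabilityZone": "me-central-1b", "lastStatus": "RUNNING"}
  ]
}
""")

for arn in tasks_in_az(sample, "me-central-1a"):
    print(arn)
```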

Step 2: Remove the failed AZ from service networking (Fargate launch type)

Update the service to remove the affected AZ's subnet:

aws ecs update-service \
  --region SOURCE-REGION \
  --cluster YOUR-CLUSTER-NAME \
  --service YOUR-SERVICE-NAME \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-HEALTHY-AZ1,subnet-HEALTHY-AZ2],securityGroups=[sg-xxxxxxxxx],assignPublicIp=DISABLED}"

Step 3: Exclude the failed AZ via placement constraints (EC2 launch type)

aws ecs update-service \
  --region SOURCE-REGION \
  --cluster YOUR-CLUSTER-NAME \
  --service YOUR-SERVICE-NAME \
  --placement-constraints type=memberOf,expression="attribute:ecs.availability-zone != FAILED-AZ-ID"

Replace FAILED-AZ-ID with the name of the failed AZ as reported in the availabilityZone field from Step 1 (for example, me-central-1a).

Step 4: Force redeployment to reschedule tasks

aws ecs update-service \
  --region SOURCE-REGION \
  --cluster YOUR-CLUSTER-NAME \
  --service YOUR-SERVICE-NAME \
  --force-new-deployment

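ECS replaces the service's tasks using the updated configuration: new tasks are placed only in the healthy AZs, and the old tasks, including those in the failed AZ, are stopped.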

Step 5: Verify service stability

aws ecs describe-services \
  --region SOURCE-REGION \
  --cluster YOUR-CLUSTER-NAME \
  --services YOUR-SERVICE-NAME \
  --query 'services[*].[serviceName,runningCount,desiredCount,status]'

Notes:

Verify that healthy AZs have sufficient capacity to absorb rescheduled tasks. Request a limit increase proactively if needed.
If tasks use EBS volumes (EC2 launch type only), those volumes are AZ-specific and cannot be moved automatically.
If the desired count cannot be met in the remaining AZs, temporarily reduce it to match available capacity, then scale back up once stable.

Moving to a Different Region (Full Region Loss)

Before you start:

Document your current ECS cluster configuration.
Identify container images and their registry locations.
Note all task definitions, services, and their configurations.
Identify your target recovery region.
Verify you have the necessary IAM permissions in the target region.
Document load balancer and networking configurations.

Step 1: Copy container images to the target region ECR

Option A: Pull from source region and push to target region

Create the repository in the target region:

aws ecr create-repository \
  --repository-name YOUR-REPO-NAME \
  --region TARGET-REGION

Authenticate, pull, retag, and push:

aws ecr get-login-password --region SOURCE-REGION | \
  docker login --username AWS --password-stdin ACCOUNT-ID.dkr.ecr.SOURCE-REGION.amazonaws.com

docker pull ACCOUNT-ID.dkr.ecr.SOURCE-REGION.amazonaws.com/YOUR-REPO-NAME:TAG

docker tag ACCOUNT-ID.dkr.ecr.SOURCE-REGION.amazonaws.com/YOUR-REPO-NAME:TAG \
  ACCOUNT-ID.dkr.ecr.TARGET-REGION.amazonaws.com/YOUR-REPO-NAME:TAG

aws ecr get-login-password --region TARGET-REGION | \
  docker login --username AWS --password-stdin ACCOUNT-ID.dkr.ecr.TARGET-REGION.amazonaws.com

docker push ACCOUNT-ID.dkr.ecr.TARGET-REGION.amazonaws.com/YOUR-REPO-NAME:TAG
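
When many images need retagging, the target URI can be derived from the source URI instead of being typed by hand. A Python sketch, assuming private ECR URIs of the form ACCOUNT-ID.dkr.ecr.REGION.amazonaws.com/REPO:TAG as used above; retarget_ecr_uri is a hypothetical helper, and URIs that do not match this form (ECR Public, Docker Hub, other partitions) are returned unchanged.

```python
import re

# Matches private ECR image URIs in the standard partition.
ECR_URI = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)"
    r"\.amazonaws\.com/(?P<rest>.+)$"
)


def retarget_ecr_uri(image_uri, target_region):
    """Swap the region in a private ECR URI; leave any other URI untouched."""
    match = ECR_URI.match(image_uri)
    if match is None:
        return image_uri
    return "{}.dkr.ecr.{}.amazonaws.com/{}".format(
        match.group("account"), target_region, match.group("rest")
    )


source = "111122223333.dkr.ecr.me-central-1.amazonaws.com/web:1.4.2"
print("docker tag", source, retarget_ecr_uri(source, "eu-west-1"))
```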

Option B: Rebuild from your CI/CD pipeline

Trigger your existing pipeline (CodePipeline, GitHub Actions, Jenkins) targeting the recovery region directly. This is the most reliable option if your pipeline is decoupled from the affected region.

Option C: Pull from an alternate registry

If images were previously pushed to ECR Public, Docker Hub, or another registry:

# From ECR Public
docker pull public.ecr.aws/YOUR-ALIAS/YOUR-REPO:TAG

docker tag public.ecr.aws/YOUR-ALIAS/YOUR-REPO:TAG \
  ACCOUNT-ID.dkr.ecr.TARGET-REGION.amazonaws.com/YOUR-REPO:TAG

aws ecr get-login-password --region TARGET-REGION | \
  docker login --username AWS --password-stdin \
  ACCOUNT-ID.dkr.ecr.TARGET-REGION.amazonaws.com

docker push ACCOUNT-ID.dkr.ecr.TARGET-REGION.amazonaws.com/YOUR-REPO:TAG

Check whether ECR cross-region replication was pre-configured — images may already be available:

aws ecr describe-repositories --region TARGET-REGION
aws ecr list-images --region TARGET-REGION --repository-name YOUR-REPO-NAME

Recommendation going forward: Enable ECR cross-region replication proactively so images are always available in your DR region without manual intervention. See ECR replication documentation.

Step 2: Export task definitions from the source region

aws ecs list-task-definitions --region SOURCE-REGION

aws ecs describe-task-definition \
  --region SOURCE-REGION \
  --task-definition YOUR-TASK-DEFINITION:REVISION \
  --query 'taskDefinition' > task-definition.json

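Save these JSON files. You need them to register task definitions in the target region. Before registering, remove the read-only fields that describe-task-definition returns (taskDefinitionArn, revision, status, requiresAttributes, compatibilities, registeredAt, registeredBy), because register-task-definition rejects them, and update region-specific values such as ECR image URIs, the awslogs-region log option, and secret ARNs.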

See Task Definitions documentation.
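
The exported file usually needs a cleanup pass before it can be registered elsewhere. The following Python sketch shows one way to do that, under these assumptions: the input is the taskDefinition object saved above, images live in private ECR, and logs use the awslogs driver. The helper name prepare_for_registration is hypothetical.

```python
import copy

# Fields returned by describe-task-definition that cannot be passed back
# to register-task-definition.
READ_ONLY_FIELDS = {
    "taskDefinitionArn", "revision", "status", "requiresAttributes",
    "compatibilities", "registeredAt", "registeredBy", "deregisteredAt",
}


def prepare_for_registration(task_def, source_region, target_region):
    """Return a copy of task_def that is ready to register in target_region."""
    cleaned = {
        key: copy.deepcopy(value)
        for key, value in task_def.items()
        if key not in READ_ONLY_FIELDS
    }
    old_host = ".dkr.ecr.%s." % source_region
    new_host = ".dkr.ecr.%s." % target_region
    for container in cleaned.get("containerDefinitions", []):
        # Point private ECR images at the target-region registry.
        if "image" in container:
            container["image"] = container["image"].replace(old_host, new_host)
        # Send awslogs output to the target region.
        options = (container.get("logConfiguration") or {}).get("options") or {}
        if options.get("awslogs-region") == source_region:
            options["awslogs-region"] = target_region
    return cleaned
```

Write the result back to task-definition.json with json.dump before registering it. Secret and parameter references (valueFrom) are deliberately left alone here: a secret recreated in the target region gets a new ARN suffix, so those entries need a manual review.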

Step 3: Document service configurations

aws ecs list-services \
  --region SOURCE-REGION \
  --cluster YOUR-CLUSTER-NAME

aws ecs describe-services \
  --region SOURCE-REGION \
  --cluster YOUR-CLUSTER-NAME \
  --services YOUR-SERVICE-NAME > service-config.json

Step 4: Migrate secrets and configuration

Copy Secrets Manager secrets:

# Retrieve secret value from source region
aws secretsmanager get-secret-value \
  --region SOURCE-REGION \
  --secret-id YOUR-SECRET-NAME \
  --query 'SecretString' --output text > secret.txt

# Create secret in target region
aws secretsmanager create-secret \
  --region TARGET-REGION \
  --name YOUR-SECRET-NAME \
  --secret-string file://secret.txt

# Delete the plaintext copy once the secret exists in the target region
rm -f secret.txt

Copy SSM Parameter Store parameters:

# Retrieve parameter from source region
aws ssm get-parameter \
  --region SOURCE-REGION \
  --name YOUR-PARAMETER-NAME \
  --with-decryption \
  --query 'Parameter.Value' --output text > parameter.txt

# Create parameter in target region
aws ssm put-parameter \
  --region TARGET-REGION \
  --name YOUR-PARAMETER-NAME \
  --value file://parameter.txt \
  --type SecureString

# Delete the plaintext copy once the parameter exists in the target region
rm -f parameter.txt

Step 5: Create VPC and networking in the target region

aws ec2 create-vpc \
  --region TARGET-REGION \
  --cidr-block 10.0.0.0/16 \
  --tag-specifications 'ResourceType=vpc,Tags=[{Key=Name,Value=ecs-recovery-vpc}]'

# Repeat for each AZ
aws ec2 create-subnet \
  --region TARGET-REGION \
  --vpc-id vpc-xxxxxxxxx \
  --cidr-block 10.0.1.0/24 \
  --availability-zone TARGET-REGION-AZ

aws ec2 create-security-group \
  --region TARGET-REGION \
  --group-name ecs-tasks-sg \
  --description "Security group for ECS tasks" \
  --vpc-id vpc-xxxxxxxxx

Step 6: Verify IAM roles in the target region

IAM roles are global. Verify that your ECS task execution role, task role, service role, and auto scaling role exist and have correct policies. See ECS IAM Roles documentation.

Step 7: Create the ECS cluster in the target region

aws ecs create-cluster \
  --region TARGET-REGION \
  --cluster-name YOUR-CLUSTER-NAME

See Creating ECS Clusters documentation.

Step 8: Create the Application Load Balancer (if needed)

Step 9: Register task definitions in the target region

aws ecs register-task-definition \
  --region TARGET-REGION \
  --cli-input-json file://task-definition.json

Step 10: Create ECS services in the target region

aws ecs create-service \
  --region TARGET-REGION \
  --cluster YOUR-CLUSTER-NAME \
  --service-name YOUR-SERVICE-NAME \
  --task-definition YOUR-TASK-DEFINITION:REVISION \
  --desired-count 2 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-AZ1,subnet-AZ2],securityGroups=[sg-xxxxxxxxx],assignPublicIp=DISABLED}" \
  --load-balancers "targetGroupArn=arn:aws:elasticloadbalancing:TARGET-REGION:ACCOUNT-ID:targetgroup/...,containerName=YOUR-CONTAINER,containerPort=8080"

See Creating ECS Services documentation.

Step 11: Configure service auto scaling

aws application-autoscaling register-scalable-target \
  --region TARGET-REGION \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/YOUR-CLUSTER-NAME/YOUR-SERVICE-NAME \
  --min-capacity 1 \
  --max-capacity 10

See Service Auto Scaling documentation.

Step 12: Update DNS and route traffic

In Route 53, update the A or CNAME record to point to the new load balancer DNS name. Set a low TTL (60 seconds) before making the change to speed up propagation.
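
The record update can be prepared as a Route 53 change batch. The Python sketch below builds one for a CNAME record with the 60-second TTL mentioned above; the function name, record name, and load balancer DNS name are placeholders chosen for illustration.

```python
import json


def cname_change_batch(record_name, lb_dns_name, ttl=60):
    """Build a Route 53 change batch that upserts a CNAME to the new load balancer."""
    return {
        "Comment": "Point record at the recovery load balancer",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "TTL": ttl,
                "ResourceRecords": [{"Value": lb_dns_name}],
            },
        }],
    }


batch = cname_change_batch("app.example.com", "YOUR-NEW-LOAD-BALANCER-DNS-NAME")
print(json.dumps(batch, indent=2))
```

Save the printed JSON to a file and pass it to aws route53 change-resource-record-sets with --hosted-zone-id and --change-batch file://. A CNAME cannot be used at the zone apex; an apex record needs an alias A record instead, which this sketch does not cover.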

Step 13: Verify and test

Run these checks before switching production traffic:

# Check all services show desired count = running count
aws ecs describe-services \
  --region TARGET-REGION \
  --cluster YOUR-CLUSTER-NAME \
  --services YOUR-SERVICE-NAME \
  --query 'services[*].[serviceName,runningCount,desiredCount,status]'

# Check load balancer target health
aws elbv2 describe-target-health \
  --region TARGET-REGION \
  --target-group-arn YOUR-TARGET-GROUP-ARN

Also verify: logs are flowing to CloudWatch, secrets and environment variables are correct, database connectivity is working, and auto scaling triggers are configured.

Alternative: Redeploy using Infrastructure as Code

If you have Terraform:

# Update provider region in your configuration, then run:
terraform plan
terraform apply

If you have CloudFormation:

aws cloudformation create-stack \
  --region TARGET-REGION \
  --stack-name ecs-recovery-stack \
  --template-body file://ecs-stack.yaml \
  --capabilities CAPABILITY_IAM

Post-Recovery Checklist

[ ] All services showing desired count = running count
[ ] Load balancer target group health checks passing
[ ] Logs flowing to CloudWatch Log Groups
[ ] CloudWatch alarms configured for service failures
[ ] CloudWatch Logs retention configured
[ ] DNS / Route 53 records updated to new load balancer
[ ] Auto scaling thresholds verified
[ ] Secrets and environment variables confirmed correct
[ ] Database connectivity tested
[ ] AWS Backup or snapshot schedule re-enabled
[ ] Monitoring dashboards updated
[ ] Disaster recovery documentation updated

Amazon EKS Migration Guide

Key Considerations

Amazon Elastic Kubernetes Service (Amazon EKS) clusters cannot be moved. You must create a new cluster in the target region and restore workloads into it.
The Kubernetes data plane is independent from the control plane. Running pods on healthy nodes continue to operate even when the EKS API server is unreachable.
IRSA (IAM Roles for Service Accounts) must be re-associated with the new cluster's OIDC provider before triggering any restore. Skipping this causes silent AccessDenied errors on all AWS API calls from pods.
EKS add-ons (CSI drivers, CoreDNS, kube-proxy, VPC CNI) must be installed on the new cluster before restoring workloads. Persistent volume mounts fail if CSI drivers are not present at restore time.
Fargate profiles are cluster-specific. If using EKS Fargate, recreate profiles on the new cluster. Drain and cordon steps do not apply to Fargate nodes.
Kubernetes Secrets exported with kubectl are only base64-encoded, not encrypted. Handle exported secret files securely and delete them after use.
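Cross-region EFS data migration is not possible during an outage unless replication or backup copies to the target region (for example, EFS replication or AWS Backup cross-region copy) were configured beforehand. For an AZ failure, make sure mount targets exist in the healthy AZs. For a full region migration without such a copy, EFS-backed workloads cannot be migrated with their persistent data.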

What You CANNOT Do When the EKS Control Plane Is Down

Run kubectl commands of any kind
Schedule new pods or trigger deployments
Perform rolling updates or rollbacks
Access the EKS console or API server

Recovery Path

Is the EKS API server reachable?
  YES --> Are any nodes in the failed AZ?
            YES --> Drain affected nodes, exclude failed AZ from node provisioning
            NO  --> No action needed; workloads are already on healthy nodes
  NO  --> Is this a partial AZ failure or full region loss?
            PARTIAL AZ --> Wait for API access to recover, then drain affected nodes
            FULL REGION --> Migrate to another region (steps below)

Moving Between AZs (Partial AZ Failure)

Step 1: Identify nodes in the failed AZ and cordon them

# List nodes and their AZs
kubectl get nodes --label-columns topology.kubernetes.io/zone

# Prevent new pods from scheduling on affected nodes
kubectl cordon NODE-NAME

# Evict all running pods from affected nodes
kubectl drain NODE-NAME --ignore-daemonsets --delete-emptydir-data
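
With many nodes, the list to cordon and drain can be derived from the node labels rather than read off the table. A Python sketch, assuming the output of kubectl get nodes -o json has been parsed into a dict; nodes_in_zone is a hypothetical helper and the sample data is made up.

```python
ZONE_LABEL = "topology.kubernetes.io/zone"


def nodes_in_zone(node_list, zone):
    """Return the names of nodes whose zone label equals the given zone."""
    return [
        item["metadata"]["name"]
        for item in node_list.get("items", [])
        if item["metadata"].get("labels", {}).get(ZONE_LABEL) == zone
    ]


# Illustrative sample shaped like `kubectl get nodes -o json`.
sample = {
    "kind": "List",
    "items": [
        {"metadata": {"name": "ip-10-0-1-10", "labels": {ZONE_LABEL: "me-central-1a"}}},
        {"metadata": {"name": "ip-10-0-2-20", "labels": {ZONE_LABEL: "me-central-1b"}}},
    ],
}

for name in nodes_in_zone(sample, "me-central-1a"):
    # Print the commands for review instead of running them.
    print("kubectl cordon", name)
    print("kubectl drain", name, "--ignore-daemonsets --delete-emptydir-data")
```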

Step 2: Prevent new nodes from being created in the failed AZ

Managed Node Groups: Create a new node group that does not include the failed AZ's subnets, then cordon and drain the old node group to migrate workloads.

aws eks create-nodegroup \
  --cluster-name YOUR-CLUSTER \
  --nodegroup-name recovery-nodegroup \
  --node-role arn:aws:iam::ACCOUNT-ID:role/YOUR-NODE-ROLE \
  --subnets subnet-HEALTHY-AZ1 subnet-HEALTHY-AZ2 \
  --instance-types INSTANCE-TYPE \
  --ami-type AMI-TYPE \
  --scaling-config minSize=MIN-SIZE,maxSize=MAX-SIZE,desiredSize=DESIRED-SIZE \
  --region SOURCE-REGION

See Migrating to a new node group.

Karpenter: Patch the NodePool to exclude the failed AZ:

kubectl patch nodepool default --type='json' \
  -p='[{"op": "add", "path": "/spec/template/spec/requirements/-",
  "value": {"key": "topology.kubernetes.io/zone",
            "operator": "NotIn", "values": ["FAILED-AZ-ID"]}}]'

Replace FAILED-AZ-ID with the name of the failed AZ as shown in the topology.kubernetes.io/zone column from Step 1.
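
Hand-editing nested JSON inside shell quotes is error-prone, so the patch body can be generated instead. A Python sketch that produces the same JSON Patch document as the command above; exclude_zone_patch is a hypothetical helper name.

```python
import json


def exclude_zone_patch(failed_zone):
    """Return a JSON Patch string that appends a NotIn zone requirement."""
    patch = [{
        "op": "add",
        "path": "/spec/template/spec/requirements/-",
        "value": {
            "key": "topology.kubernetes.io/zone",
            "operator": "NotIn",
            "values": [failed_zone],
        },
    }]
    return json.dumps(patch)


print(exclude_zone_patch("me-central-1a"))
```

The printed string can be passed as the -p argument of kubectl patch nodepool with --type=json.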

Step 3: Handle stateful workloads in the failed AZ

EFS-backed StatefulSets: EFS is regional. Only the mount target is AZ-specific. Confirm that a mount target exists in each healthy AZ; pods rescheduled onto nodes there mount the file system through the local mount target. No data migration is required.

Moving to a Different Region (Full Region Loss)

Step 1: Export workload manifests

Export manifests before attempting any backup or restore. This provides a portable fallback if AWS Backup is unavailable in the affected region.

# Export all workload resources across all namespaces
kubectl get deploy,svc,configmap,ingress,hpa,statefulset,daemonset \
  -A -o yaml > all-workloads.yaml

# Export Kubernetes Secrets (filter out system secrets before applying to new cluster)
# WARNING: Secrets are only base64-encoded, not encrypted. You MUST encrypt the output file.
# Verify GPG is configured with a recipient key before proceeding.
kubectl get secrets -A -o yaml | gpg --encrypt -r <key-id> > secrets-backup.yaml.gpg

# Export ConfigMaps
kubectl get configmaps -A -o yaml > configmaps-backup.yaml

Kubernetes Secrets exported this way are only base64-encoded, not encrypted. Delete the output files after use.
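
Filtering out system secrets before applying them to the new cluster can be scripted. The Python sketch below assumes the secrets were exported with -o json (rather than YAML) and have been decrypted and parsed into a dict; the namespace and type lists are a reasonable starting point, not an exhaustive rule, and portable_secrets is a hypothetical helper name.

```python
SYSTEM_NAMESPACES = {"kube-system", "kube-public", "kube-node-lease"}
SKIPPED_TYPES = {"kubernetes.io/service-account-token"}
# Metadata set by the source cluster's API server; it should not be re-applied.
SERVER_FIELDS = {"uid", "resourceVersion", "creationTimestamp", "managedFields"}


def portable_secrets(secret_list):
    """Return a List object holding only the secrets worth restoring."""
    kept = []
    for item in secret_list.get("items", []):
        meta = item.get("metadata", {})
        if meta.get("namespace") in SYSTEM_NAMESPACES:
            continue
        if item.get("type") in SKIPPED_TYPES:
            continue
        clean_meta = {k: v for k, v in meta.items() if k not in SERVER_FIELDS}
        kept.append(dict(item, metadata=clean_meta))
    return {"apiVersion": "v1", "kind": "List", "items": kept}
```

Keep the filtered output encrypted at rest as well, and delete any decrypted intermediate files after the restore.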

Step 2: Back up the cluster using AWS Backup (if accessible)

If AWS Backup is operational in the source region:

aws backup start-backup-job \
  --region SOURCE-REGION \
  --backup-vault-name YOUR-VAULT-NAME \
  --resource-arn arn:aws:eks:SOURCE-REGION:ACCOUNT-ID:cluster/YOUR-CLUSTER \
  --iam-role-arn arn:aws:iam::ACCOUNT-ID:role/AWSBackupDefaultServiceRole

See AWS Backup for EKS.

Step 3: Prepare the new cluster in the target region

Complete all three preparation steps before triggering any restore. Restoring without completing them causes IRSA failures and persistent volume mount errors.

3a. Create the new EKS cluster

Use eksctl, the AWS Console, or your existing IaC targeting the new region.

eksctl create cluster \
  --name NEW-CLUSTER-NAME \
  --region TARGET-REGION \
  --version KUBERNETES-VERSION

3b. Associate the OIDC provider and update IRSA trust policies

Security: Customers are responsible for configuring IAM roles for service accounts, managing OIDC provider associations, and verifying trust policies follow least-privilege principles. Review each IRSA role's permissions during migration to remove any unnecessary access.

# Get the new cluster's OIDC issuer URL
aws eks describe-cluster \
  --name NEW-CLUSTER-NAME \
  --region TARGET-REGION \
  --query "cluster.identity.oidc.issuer" --output text

# Associate the OIDC provider
eksctl utils associate-iam-oidc-provider \
  --cluster NEW-CLUSTER-NAME \
  --region TARGET-REGION --approve

# For each IRSA role: retrieve the trust policy, update the OIDC ARN, then apply
aws iam get-role --role-name YOUR-IRSA-ROLE \
  --query "Role.AssumeRolePolicyDocument" > trust-policy.json
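# Edit trust-policy.json: replace the old OIDC provider ARN in Principal.Federated and the old issuer prefix in the Condition keys (the :sub and :aud entries) with the new cluster's values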
aws iam update-assume-role-policy \
  --role-name YOUR-IRSA-ROLE \
  --policy-document file://trust-policy.json

See IAM Roles for Service Accounts.
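
The manual edit of trust-policy.json can be scripted when many roles are involved. A Python sketch, assuming the old and new issuer URLs are known (the new one comes from the describe-cluster call above) and that the policy references the issuer only in the provider ARN and the condition keys; retarget_trust_policy and the issuer IDs in the example are hypothetical.

```python
import json


def _issuer_path(issuer_url):
    """Drop the https:// scheme; trust policies reference the issuer without it."""
    prefix = "https://"
    return issuer_url[len(prefix):] if issuer_url.startswith(prefix) else issuer_url


def retarget_trust_policy(policy, old_issuer_url, new_issuer_url):
    """Return a copy of the policy with every old issuer reference replaced."""
    old = _issuer_path(old_issuer_url)
    new = _issuer_path(new_issuer_url)
    if not old:
        raise ValueError("old issuer must not be empty")
    # The issuer appears in Principal.Federated and inside Condition key names,
    # so a replacement on the serialized document covers both places.
    return json.loads(json.dumps(policy).replace(old, new))


old_url = "https://oidc.eks.me-central-1.amazonaws.com/id/EXAMPLEOLD0000"
new_url = "https://oidc.eks.eu-west-1.amazonaws.com/id/EXAMPLENEW1111"
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Federated": "arn:aws:iam::111122223333:oidc-provider/"
                                   "oidc.eks.me-central-1.amazonaws.com/id/EXAMPLEOLD0000"},
        "Action": "sts:AssumeRoleWithWebIdentity",
        "Condition": {"StringEquals": {
            "oidc.eks.me-central-1.amazonaws.com/id/EXAMPLEOLD0000:sub":
                "system:serviceaccount:prod:app",
        }},
    }],
}
print(json.dumps(retarget_trust_policy(policy, old_url, new_url), indent=2))
```

Review the rewritten policy before applying it with update-assume-role-policy, since that call replaces the role's entire trust policy.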

3c. Install required EKS add-ons on the new cluster

SOURCE_CLUSTER=YOUR-SOURCE-CLUSTER
TARGET_CLUSTER=NEW-CLUSTER-NAME

for ADDON in $(aws eks list-addons \
    --cluster-name $SOURCE_CLUSTER \
    --region SOURCE-REGION \
    --query 'addons[]' --output text); do
  echo "Installing $ADDON..."
  aws eks create-addon \
    --cluster-name $TARGET_CLUSTER \
    --addon-name $ADDON \
    --region TARGET-REGION
done

See Managing EKS add-ons.

Step 4: Migrate container images

Follow ECS Step 1 to make the container images available in the target region, then confirm that the image references in the exported manifests point to the target-region registry.

Step 5: Migrate supporting resources

EBS volumes: Use volume snapshots and follow the EBS snapshot migration steps in the EC2 section of this guide.

Secrets Manager and SSM Parameter Store: Follow ECS Step 4. The process is identical.

VPC resources: Recreate VPC, subnets, security groups, and NAT Gateways in the target region before restoring the cluster.

Fargate profiles: If using EKS Fargate, recreate the Fargate profiles on the new cluster after it is created.

Step 6: Restore workloads

Option A: Restore from AWS Backup

aws backup start-restore-job \
  --region TARGET-REGION \
  --recovery-point-arn RECOVERY-POINT-ARN \
  --iam-role-arn arn:aws:iam::ACCOUNT-ID:role/AWSBackupDefaultServiceRole \
  --metadata '{"ClusterName":"NEW-CLUSTER-NAME","Region":"TARGET-REGION"}'

See Restoring an EKS cluster.

Option B: Apply exported manifests

Update the kubeconfig to point to the new cluster, then apply the exported manifests. Review and filter system namespaces and service account secrets before applying.

aws eks update-kubeconfig \
  --name NEW-CLUSTER-NAME \
  --region TARGET-REGION

# Review the file before applying to exclude kube-system and other system namespaces
kubectl apply -f all-workloads.yaml

Step 7: Update DNS and route traffic

In Route 53, update A or CNAME records to point to the new load balancer or ingress endpoints. Set a low TTL (60 seconds) before making the change to speed up propagation.

Post-Recovery Checklist

[ ] kubeconfig updated to new cluster (aws eks update-kubeconfig)
[ ] All nodes in Ready state (kubectl get nodes)
[ ] All pods running and healthy (kubectl get pods -A)
[ ] EKS add-ons installed and active
[ ] Pod AWS API calls working (check logs for AccessDenied or timeout errors)
[ ] ECR images accessible from target region
[ ] Persistent volumes bound and accessible
[ ] Services and Ingress endpoints resolving correctly
[ ] DNS / Route 53 records updated to new load balancer
[ ] Horizontal Pod Autoscaler (HPA) thresholds verified
[ ] AWS Backup plan re-enabled for new cluster
[ ] Monitoring and alerting updated for new cluster endpoints
[ ] Disaster recovery documentation updated
