How To: Recover from Failed Amazon EC2 Instances (and fail they will)
4 Comments Published February 7th, 2011 in aws, technology
One of the things that’s not immediately obvious about Amazon EC2 instances is that they can fail. In fact, Amazon says:
It’s inevitable that EC2 instances will fail, and you need to plan for it. An instance failure isn’t a problem if your application is designed to handle it.
The EC2 forums are littered with posts from users whose EC2 instances have become unresponsive and cannot be stopped or restarted. Instances can get “stuck” in “stopping” mode for 24 hours or more. Amazon generally recommends issuing a forced stop via the client tools’ “ec2-stop-instances --force” command, but in most cases this doesn’t actually seem to work.
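As a rough illustration, the same forced stop can be requested through the API and then polled to see whether the instance ever leaves the “stopping” state. This is only a sketch: it assumes the boto3 library (not something this post uses), and the function name `force_stop` and its polling parameters are made up for the example.

```python
import time


def force_stop(ec2, instance_id, max_polls=20, poll_seconds=15, sleep=time.sleep):
    """Request a forced stop, then poll until the instance reports 'stopped'.

    `ec2` is assumed to be a boto3 EC2 client, e.g. boto3.client("ec2").
    Returns True if the instance stopped, False if it still looks stuck.
    """
    # Force=True is the API counterpart of the --force flag.
    ec2.stop_instances(InstanceIds=[instance_id], Force=True)

    for _ in range(max_polls):
        reply = ec2.describe_instances(InstanceIds=[instance_id])
        state = reply["Reservations"][0]["Instances"][0]["State"]["Name"]
        if state == "stopped":
            return True
        sleep(poll_seconds)
    return False
```

If this returns False, the instance is probably stuck in the way described above, and moving to a backup instance is the way out.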
Luckily, Eric Hammond wrote a post about how to move EC2 instances to new hardware if such a problem were to occur (as it did to me). Eric’s solution relies on the client tools under Linux.
It turns out that it’s possible to replicate these steps directly in the Amazon panel and quickly recover from a failed instance. I recommend everyone follow these steps to prepare for a failure scenario:
- In the “Instances” panel: create a new instance using the same AMI as your production instance. This is your backup instance. “Stop” the instance after it is created. (Amazon does not charge instance hours for a stopped instance, though you still pay for any EBS storage attached to it.)
- In “Volumes”: detach and then delete the drive that was created as part of this new instance.
- Still in “Volumes”: create a snapshot of your production drive.
- Go to the “Snapshots” section of the panel, select your new snapshot and choose “create volume from snapshot.” Be sure to choose the same availability zone as your backup instance, since a volume can only be attached to an instance in its own zone. I’ve seen some caching issues here, so if you don’t see your snapshot when selecting this menu, be sure to refresh.
- Go back to “Volumes” and choose “attach volume” on your new available volume. Choose your stopped backup instance and type in the same device as your original volume (visible under “attachment information” for the volume).
- Go ahead and start your backup instance; it should be an exact copy of your production instance as of the snapshot.
- Sleep better at night.
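For anyone who would rather script this than click through the panel, the second through sixth steps above can be sketched in Python. This is a minimal sketch, not a tested recovery tool: it assumes the boto3 library (not something this post uses), assumes the backup instance already exists and is stopped, and the function name `prepare_backup` and all of its parameters are hypothetical.

```python
def prepare_backup(ec2, prod_volume_id, backup_instance_id,
                   backup_root_volume_id, device, zone):
    """Give a stopped backup instance a copy of the production volume.

    `ec2` is assumed to be a boto3 EC2 client; every ID is a placeholder.
    Returns the ID of the newly created volume.
    """
    # Step 2: detach and delete the volume that came with the backup instance.
    ec2.detach_volume(VolumeId=backup_root_volume_id)
    ec2.get_waiter("volume_available").wait(VolumeIds=[backup_root_volume_id])
    ec2.delete_volume(VolumeId=backup_root_volume_id)

    # Step 3: snapshot the production volume and wait for it to complete.
    snapshot_id = ec2.create_snapshot(
        VolumeId=prod_volume_id, Description="backup of production volume"
    )["SnapshotId"]
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot_id])

    # Step 4: create a volume from the snapshot in the backup instance's zone.
    new_volume_id = ec2.create_volume(
        SnapshotId=snapshot_id, AvailabilityZone=zone
    )["VolumeId"]
    ec2.get_waiter("volume_available").wait(VolumeIds=[new_volume_id])

    # Step 5: attach it under the same device name as the original volume.
    ec2.attach_volume(
        VolumeId=new_volume_id, InstanceId=backup_instance_id, Device=device
    )
    ec2.get_waiter("volume_in_use").wait(VolumeIds=[new_volume_id])

    # Step 6: start the backup instance.
    ec2.start_instances(InstanceIds=[backup_instance_id])
    return new_volume_id
```

With boto3 installed and credentials configured, a call might look like `prepare_backup(boto3.client("ec2"), "vol-PROD", "i-BACKUP", "vol-BACKUPROOT", "/dev/sda1", "us-east-1a")`, where every value is a placeholder to be replaced with your own.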
Boris,
Thank you!
Looking back over previous Forum posts on the AWS EC2 community I saw a ton of other people with instances stuck in the stopping state. It didn’t bode well when the resolution time varied from less than an hour to more than a day!
I came across this post and within fifteen minutes got everything back up and running on a duplicate instance.
I really appreciate you taking the time to post this!
Ben
Thank you! You saved my day!
Thank you for the information.
I was wondering whether simply stopping and starting the instance wouldn’t by itself move it to new hardware?
I’m sure there are benefits to your approach, but I’m wondering how a stop and start would be different.
Thanks a lot
Hi Guarav, the issue is that it’s impossible to stop the instance in this scenario. Instances can get “stuck” in “stopping” mode for 24 hours or more, which is why this technique is needed.
Cheers!
- Boris