One of the things that’s not immediately obvious about Amazon EC2 instances is that they can fail. In fact, Amazon says:

It’s inevitable that EC2 instances will fail, and you need to plan for it. An instance failure isn’t a problem if your application is designed to handle it.

The EC2 forums are littered with posts from users whose EC2 instances have become unresponsive and cannot be stopped or restarted. Instances can get “stuck” in the “stopping” state for 24 hours or more. Amazon generally recommends issuing a forced stop via the client tools’ “ec2-stop-instances --force” command, but this doesn’t seem to work in most cases.
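For readers who script against the API rather than the older client tools, a forced stop can be sketched in Python. This is a minimal sketch, assuming the boto3 library and an `ec2` client created with `boto3.client("ec2")`. The helper names `force_stop` and `wait_until_stopped` and the timeout values are hypothetical, chosen only for illustration.

```python
import time


def force_stop(ec2, instance_id):
    """Request a forced stop and return the state EC2 reports right after."""
    resp = ec2.stop_instances(InstanceIds=[instance_id], Force=True)
    return resp["StoppingInstances"][0]["CurrentState"]["Name"]


def wait_until_stopped(ec2, instance_id, timeout_s=600, poll_s=15, sleep=time.sleep):
    """Poll the instance state; return True once stopped, False on timeout."""
    waited = 0
    while True:
        resp = ec2.describe_instances(InstanceIds=[instance_id])
        state = resp["Reservations"][0]["Instances"][0]["State"]["Name"]
        if state == "stopped":
            return True
        if waited >= timeout_s:
            # Still not stopped: treat the instance as stuck and move on
            # to recovering on a new instance instead of waiting longer.
            return False
        sleep(poll_s)
        waited += poll_s
```

If `wait_until_stopped` returns `False`, the instance is probably in the stuck state described above, and waiting longer may not help.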

Luckily, Eric Hammond wrote a post about how to move EC2 instances to new hardware if such a problem were to occur (as it did to me). Eric’s solution relies on the client tools under Linux.

It turns out that it’s possible to replicate these steps directly in the Amazon panel and quickly recover from a failed instance. I recommend everyone follow these steps to prepare for a failure scenario:

  1. In the “Instances” panel: create a new instance using the same AMI as your production instance. This is your backup instance. “Stop” the instance after it is created. (Amazon does not charge instance hours for stopped instances, although any attached EBS storage is still billed.)
  2. In “Volumes”: detach and then delete the drive that was created as part of this new instance.
  3. Still in “Volumes”: create a snapshot of your production drive.
  4. Go to the “Snapshots” section of the panel, select your new snapshot and choose “create volume from snapshot.” Be sure to choose the same availability zone as your instance. I’ve seen some caching issues here, so if you don’t see your snapshot when selecting this menu, be sure to refresh.
  5. Go back to “Volumes” and choose “attach volume” on your new available volume. Choose your stopped backup instance and type in the same device as your original volume (visible under “attachment information” for the volume).
  6. Go ahead and start your backup instance; it should be an exact copy of your production instance as of the snapshot.
  7. Sleep better at night.
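The console steps above could also be scripted. The following is a rough sketch, assuming the boto3 library, an `ec2` client from `boto3.client("ec2")`, and an EBS-backed production instance whose root volume is the drive to copy. The function name `build_backup` and its helpers are hypothetical. Key pairs, security groups, and subnets are left out for brevity, so a real launch would likely need more parameters.

```python
def _describe(ec2, instance_id):
    """Return the description dict for a single instance."""
    resp = ec2.describe_instances(InstanceIds=[instance_id])
    return resp["Reservations"][0]["Instances"][0]


def _root_volume(instance):
    """Return (device name, volume ID) of the instance's EBS root volume."""
    root = instance["RootDeviceName"]
    for mapping in instance["BlockDeviceMappings"]:
        if mapping["DeviceName"] == root:
            return root, mapping["Ebs"]["VolumeId"]
    raise ValueError("no EBS root volume found")


def _wait(ec2, waiter_name, **kwargs):
    """Block until the named boto3 waiter reports success."""
    ec2.get_waiter(waiter_name).wait(**kwargs)


def build_backup(ec2, prod_id):
    """Follow steps 1 to 6 and return the backup instance ID."""
    prod = _describe(ec2, prod_id)
    device, prod_vol = _root_volume(prod)
    az = prod["Placement"]["AvailabilityZone"]

    # Step 1: launch a backup from the same AMI, then stop it.
    run = ec2.run_instances(
        ImageId=prod["ImageId"],
        InstanceType=prod["InstanceType"],
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": az},
    )
    backup_id = run["Instances"][0]["InstanceId"]
    _wait(ec2, "instance_running", InstanceIds=[backup_id])
    ec2.stop_instances(InstanceIds=[backup_id])
    _wait(ec2, "instance_stopped", InstanceIds=[backup_id])

    # Step 2: detach and delete the drive created with the backup.
    _, old_vol = _root_volume(_describe(ec2, backup_id))
    ec2.detach_volume(VolumeId=old_vol)
    _wait(ec2, "volume_available", VolumeIds=[old_vol])
    ec2.delete_volume(VolumeId=old_vol)

    # Step 3: snapshot the production drive.
    snap_id = ec2.create_snapshot(
        VolumeId=prod_vol, Description="backup of " + prod_id
    )["SnapshotId"]
    _wait(ec2, "snapshot_completed", SnapshotIds=[snap_id])

    # Step 4: create a volume from the snapshot in the same availability zone.
    new_vol = ec2.create_volume(SnapshotId=snap_id, AvailabilityZone=az)["VolumeId"]
    _wait(ec2, "volume_available", VolumeIds=[new_vol])

    # Step 5: attach it to the stopped backup under the original device name.
    ec2.attach_volume(VolumeId=new_vol, InstanceId=backup_id, Device=device)

    # Step 6: start the backup instance.
    ec2.start_instances(InstanceIds=[backup_id])
    return backup_id
```

Calling `build_backup(boto3.client("ec2"), "i-...")` with your own production instance ID would run the whole sequence. Passing the client in as a parameter keeps the function easy to try against a stub before pointing it at a real account.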

8 Responses to “How To: Recover from Failed Amazon EC2 Instances (and fail they will)”  

  1. 1 Ben

    Boris,

    Thank you!

    Looking back over previous Forum posts on the AWS EC2 community I saw a ton of other people with instances stuck in the stopping state. It didn’t bode well when the resolution time varied from less than an hour to more than a day!
    I came across this post and within fifteen minutes got everything back up and running on a duplicate instance.

    I really appreciate you taking the time to post this!

    Ben

  2. 2 vanja

    Thank you! You saved my day!

  3. 3 Gaurav

    Thank you for the information.

    I was wondering whether simply stopping and starting the instance wouldn’t by itself move it to new hardware?

    I’m sure there are benefits of using your approach, but wondering how a stop and start instance would be different.

    Thanks a lot

  4. 4 boris

    Hi Gaurav, the issue is that it’s impossible to stop the instance in this scenario. Instances can get “stuck” in the “stopping” state for 24 hours or more, which is why this technique is needed.

    Cheers!
    - Boris

  5. 5 ernest

    Thanks Boris, a welcome time saver!

  6. 6 Aniruddha J

    But what if AWS CloudFormation gives the error “EC2 instance did not stabilize”? I’m unable to find any solution so far.

  7. 7 Nir Levy

    Thank you Boris for this post, it was really helpful. It took AWS 30 minutes to free my stuck instance, it took me 15 minutes to re-launch a new one based on your post. That’s 15 minutes of uptime in your favor.

  8. 8 sai

    Hi,
    I have rented a server on Amazon AWS. At my client’s request, I was installing Active Directory on that machine. I connected to the system remotely and started the installation, but I unfortunately changed the IP address, subnet mask, and DNS settings on that system (the cloud system I was connected to remotely), and it disconnected immediately. Now I am unable to connect to it remotely. When I log in to the website, it says the server OS is running, but it also shows an “instance failure” message. How can I start my system? If anyone can give me a phone number, I will explain it.

    Please help me.
