Understanding EC2 Status Checks

A student recently asked me what the “2/2 checks passed” message on the EC2 dashboard of the AWS console means. Great question; here’s the answer. Elastic Compute Cloud (EC2) is the virtual server offering in Amazon Web Services (AWS), and each EC2 server is known as an instance. Every EC2 instance goes through two status checks at startup, and the checks are then repeated every minute while the instance runs. These automated checks let engineers quickly determine whether Amazon EC2 has detected any problem that might prevent the instance from running properly. The two checks are known as the system check and the instance check. Each EC2 instance has a “Status Check” tab: select the instance, navigate to its details, then select the status check tab. You can also refer to the image below for an example.
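If you would rather read the same information from the API than from the console, here is a minimal sketch of how the “2/2 checks passed” summary could be derived. It assumes the per-instance dictionary shape returned by boto3’s `describe_instance_status` call (one entry of its "InstanceStatuses" list); the function name `summarize_checks` and the sample instance ID are made up for illustration.

```python
def summarize_checks(status_entry):
    """Build a console-style summary from one "InstanceStatuses" entry."""
    checks = {
        "system": status_entry["SystemStatus"]["Status"],
        "instance": status_entry["InstanceStatus"]["Status"],
    }
    # A check counts as passed only when its status is "ok".
    passed = sum(1 for status in checks.values() if status == "ok")
    return f"{passed}/{len(checks)} checks passed", checks


# Hand-written sample, not real output; the instance ID is hypothetical.
sample = {
    "InstanceId": "i-0123456789abcdef0",
    "SystemStatus": {"Status": "ok"},
    "InstanceStatus": {"Status": "impaired"},
}

summary, checks = summarize_checks(sample)
print(summary)
print(checks)
```

With real data you would pass in each entry from `boto3.client("ec2").describe_instance_status(...)["InstanceStatuses"]` instead of the sample dictionary.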

System Check = Infrastructure Check

The system check is basically an infrastructure check. It detects problems with the underlying AWS systems that host the instance, problems that require AWS involvement to repair – essentially, something isn’t working properly at the AWS data center. Examples include:

  • Loss of network connectivity
  • Loss of system power
  • Software issues on the physical host
  • Hardware issues on the physical host that impact network reachability

FYI – it’s really hard to find an image of an EC2 system check failure. Either AWS infrastructure is fault tolerant or they just do a really good PR job.

Instance Check = Server Configuration Check 

Once AWS has confirmed that its infrastructure is working properly, the next step is to check whether the customer’s EC2 instance is properly configured. The instance check looks at the software (OS) and network configuration of the individual EC2 instance, and it detects problems that require the customer’s involvement to repair. This means that if the instance check fails, customers typically must address the problem themselves – for example, by rebooting the instance or by changing the instance configuration. Note that it’s common for the instance check to fail as a result of an underlying system check failure: the customer’s EC2 instance isn’t working properly because of issues with the AWS infrastructure. Other common causes of instance check failure include:

  • Incorrect networking or startup configuration
  • Exhausted memory
  • Corrupted file system
  • Incompatible kernel

How Should You Respond To Status Check Failures?

EC2 status check failures are inevitable, so it pays to know how to respond to them.

There’s not much you can do if your EC2 instance fails the system check. As I stated above, a system check failure implies that something is not working properly at the AWS data center, and you can’t repair that yourself unless you work for AWS and have access to the back-end infrastructure. AWS normally fixes system check failures in a timely manner; however, if this is a time-sensitive or critical environment, you should take immediate action by contacting AWS support. Here’s sound advice from my friend (and avid blogger) Patricia regarding what you can do if your EC2 instance fails the system check.

While preparing for AWS Certifications – one thing that really stood out to me is that AWS’s proposed solution to a system check failure is to migrate to a new instance by stopping and starting the instance. I saw you mentioned reaching out to AWS support – usually, they will refer you to their documentation in such circumstances such as this: https://aws.amazon.com/premiumsupport/knowledge-center/system-reachability-check/ 

– Patricia Anong
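The stop-and-start approach Patricia mentions can be sketched with boto3. This is a hedged illustration, not AWS’s official procedure: the helper name `stop_and_start` is mine, and it assumes an EBS-backed instance (data on instance store volumes is lost on stop, and a public IP that is not an Elastic IP will usually change). The `ec2` argument is expected to be a boto3 EC2 client.

```python
def stop_and_start(ec2, instance_id):
    """Stop, then start, an instance so it can come back up on other hardware.

    `ec2` is assumed to be a boto3 EC2 client, e.g. boto3.client("ec2").
    A plain reboot is not enough here, since it keeps the instance on the same host.
    """
    ids = [instance_id]

    # Ask EC2 to stop the instance and block until it is fully stopped.
    ec2.stop_instances(InstanceIds=ids)
    ec2.get_waiter("instance_stopped").wait(InstanceIds=ids)

    # Start it again and block until it reports the running state.
    ec2.start_instances(InstanceIds=ids)
    ec2.get_waiter("instance_running").wait(InstanceIds=ids)


# Example usage (needs boto3 and AWS credentials; the ID is hypothetical):
#   import boto3
#   stop_and_start(boto3.client("ec2"), "i-0123456789abcdef0")
```

Once the instance is running again, give the status checks a few minutes to re-run before deciding whether the failure has cleared.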

Unlike system check failures, the onus is on you to resolve instance check failures – unless the instance check failure is the result of a system check failure. Analyzing the instance’s logs is a great start to troubleshooting an instance check failure. Undoing any recent configuration change that may have triggered the failure is also a good idea. As a last resort, if you’re unable to resolve the issue yourself, contact AWS support – and be ready to send all types of logs.
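To tie the two cases together, here is a small triage sketch that maps a pair of check results to the responses described in this post. It assumes the status strings the EC2 API reports ("ok", "impaired", and transitional values such as "initializing"); the function name `recommend_action` and the wording of the advice are my own.

```python
def recommend_action(system_status, instance_status):
    """Suggest a next step based on the two EC2 status check results."""
    # Check the system status first: an infrastructure problem often
    # drags the instance check down with it.
    if system_status == "impaired":
        return ("System check failed: stop and start the instance to move "
                "hosts, or contact AWS support if the environment is critical.")
    if instance_status == "impaired":
        return ("Instance check failed: review the instance logs, undo recent "
                "configuration changes, or reboot the instance.")
    if system_status == "ok" and instance_status == "ok":
        return "No action needed: 2/2 checks passed."
    # Anything else (for example "initializing") is not conclusive yet.
    return "Checks are not conclusive yet: wait a minute and look again."


print(recommend_action("ok", "impaired"))
```

The order of the conditions is the design choice that matters: testing the system check first avoids sending you off to debug your own configuration when the real cause is on the AWS side.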