AWS EC2 Spot – Best Practices

Amazon’s EC2 has several options for running instances. On-demand instances is what would be used by most. Reserved instances are used by those who can do some level of usage prediction. Another option which can be a cost saver is using Spot instances. Amazon claims savings up to 90% off regular EC2 rates using Spot instances.

AWS operates like a utility company as such it has spare capacity at any given time. This spare capacity can be purchased through Spot instances. There’s a catch, though. With a 2 minute warning, Amazon can take back that “spare capacity” so using Spot instances needs to be carefully planned. When used correctly Spot instances can be a real cost-saver.

When to use Spot instances

There is a fairly broad set of use cases for using Spot instances. The general consensus is simply containerized, stateless workloads, but in reality there’s a lot more.

  • Distributed databases – think MongoDB or Cassandra or even Elasticsearch. These are distributed so losing one instance would not affect the data; simply start another one
  • Machine Learning – typically these are running training jobs and losing it would only mean the learning stops until another one is started. ML lends itself well to the Spot instance paradigm
  • CI/CD operations – this is a great one for Spot instances
  • Big Data operations – AWS EMR or Spark are also great use cases for Spot instances
  • Stateful workloads – even though these applications would need IP and data persistence, some (maybe even all) of these may be candidates for Spot instances especially if they are automated properly.

Be prepared for disruption

The primary practice for working in AWS in general, but also working with Spot instances is be prepared. Spot instances will be interrupted at some point when it’s least expected. It is critical to create your workload to handle failure. Take advantage of EC2 instance re-balance recommendations and Spot instance interruption notices.

The EC2 re-balance recommendation will notify of an elevated risk of Spot instance interruption in advance of the “2 minute warning”. Using the Capacity Rebalancing feature in Auto-scaling Groups and Spot fleet will provide the ability to be more proactive. Take a look at Capacity Rebalancing for more detail.

If the workloads are “time flexible” configure the Spot instances to stop or hibernate vs terminated when an interruption occurs. When the spare capacity returns the instance will be restarted.

Use the Spot instance interruption notice and the Capacity rebalance notice to your advantage by using the EventBridge to create rules to gracefully handle an interruption. One such example is outlined next.

Using Spot instances with ELB

In a lot of cases Elastic Load Balancer (ELB) is used. Instances are registered and de-registered to the ELB based on health check status. Problem with Spot instances is the instance do not de-register automatically so there may be some interruption if the situation is not handled properly.

The proper way would be to use the interruption notice as a trigger to de-register the instance from the ELB. By programmatically de-registering the Spot instance prior to termination traffic would not be routed to the instance and no traffic would be lost.

Easiest way is to use a Lambda function to trigger based on a Cloudwatch instance termination notice. The Lambda function simply retrieves the instance ID from the event and de-registers the instance from the ELB. As usual, Amazon Solution Architects showed how to do it on the AWS Compute Blog.

Keep your options open

The Spot capacity pool consists of a set of unused EC2 instances with the same instance type (t3.micro, m4.large, etc) and Availability Zone (us-west-1a). Avoid getting too specific on instance types and what zone they use. For instance, avoid specifically requesting c4.large if running the workload on a m5, c5, or m4 family would work the same. Keep specific needs in mind, vertically scaled workloads need more resources and horizontally scaled workloads would find more availability in older generation types as they are in less demand.

Amazon recommends being flexible across at least 10 instance types and there is never a need to limit Availability Zones. Ensure all AZs are enabled in your VPC for your instance to use.

Price and capacity optimized strategy

Take advantage of Auto Scaling groups as the allocation strategies will enable provisioning capacity automatically. The price-capacity-optimized strategy in Spot Fleet due to how the instance capacity is sourced from pools with optimal capacity. This strategy will reduce the possibility of having the Spot instance reclaimed. Dig into the Auto Scaling User Guide Spot Instances section for more detail. Also take a look at this section which describes when workloads have a high cost of interruption.

Think aggregate capacity

Instead of looking at individual instances, Spot enables a more holistic view across units such as vCPUs, network, memory, or storage. Using Spot Fleet with Auto Scaling Groups allows for a higher level view enabling the concept of “target capacity”. Automating the request for more resources to maintain the target capacity of a workload enables considerable flexibility.

Other options to consider

Amazon has a considerable number of services which can be integrated with Spot instances to manage compute costs. Used effectively these services will allow for more flexibility and automation eliminating the need to manage individual instances or fleets. Take a look at the EC2 Spot Workshops for some ideas and examples.