Managing AWS Best Practices Using Terraform

We take a look at how to express your AWS infrastructure using Terraform scripts by adhering to 5 AWS best practices.

APR 22, 2021 | SUMEET NINAWE

Converting your cloud infrastructure to code is awesome. It gives us the ability to manage infrastructure in a single repository, reuse templates and even apply governance on how cloud resources are utilized at an organizational level.

Using Terraform, we can easily build these templates and reuse them to create or destroy cloud environments. For example, we can reuse a template for spinning up or down EC2 instances in an autoscaling group.

The good part? Since these templates are pre-defined, they can be tested once and reused with confidence. Nobody likes reinventing the wheel and resolving the same set of errors again and again, which only slows teams down.

Terraform Registry is a great resource to refer to while building Terraform scripts for your infrastructure. All the modules, resources, data sources, and providers which are currently supported by Terraform are documented in the registry.

The Registry provides us with an example for every resource supported by Terraform. For example, if we want to create a cluster in AWS ECS, we can simply go to this document and refer to the example which is pretty simple.

resource "aws_ecs_cluster" "foo" { 
  name = "white-hart" 
}

`name` is the only attribute the documentation marks as “required” to create an ECS cluster.

This is not a great approach for production, though. We can’t rely on default values and required attributes alone for production-grade applications. We'll see why in the upcoming sections.
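
As a rough sketch of what going beyond the required attributes might look like, the cluster below also turns on CloudWatch Container Insights and adds tags. The resource name `production` and the tag values are placeholders chosen for illustration; they are not part of the Registry example.

```hcl
# Hypothetical example: an ECS cluster with a few non-default settings.
resource "aws_ecs_cluster" "production" {
  name = "white-hart"

  # Collect cluster-level metrics and logs in CloudWatch Container Insights.
  setting {
    name  = "containerInsights"
    value = "enabled"
  }

  # Placeholder tags for ownership and cost reporting.
  tags = {
    Environment = "production"
    Owner       = "platform-team"
  }
}
```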

There are certain best practices defined by AWS in their Well-Architected Framework which encompass various aspects of hosting workloads in AWS in the best possible way. To take your Terraform skills to the next level, it is important to design cloud infrastructure based on this framework.

Let’s take a look at how to express your AWS infrastructure using Terraform scripts while adhering to AWS best practices. The Well-Architected Framework discussed here rests on 5 pillars, which we'll treat as 5 best practices in this post.

They are:

AWS Best Practice 1: Operational Excellence

AWS Best Practice 2: Performance Efficiency

AWS Best Practice 3: Security

AWS Best Practice 4: Reliability

AWS Best Practice 5: Monitoring and Cost Optimization

We will touch upon each of these pillars in this post. However, each pillar is a philosophy in itself and involves many nuances that are not covered here. The purpose of this post is to show how AWS best practices can be achieved using Terraform.

Interested in learning more about Terraform for AWS? Join our Slack community to connect with DevOps experts and continue the conversation.

AWS Best Practice 1: Operational Excellence

The most important feature of any operational procedure in today’s age is automation, which is the key to any DevOps practice trying to roll out product features more frequently than ever.

Most of the time, operational tasks are defined in runbooks or knowledge articles, where certain procedures are defined to accomplish a given infrastructure-related activity.

One of the objectives of using Terraform is to automate the underlying infrastructure. With the `user_data` argument of an EC2 instance, we can supply a script that performs provisioning tasks at first boot. Terraform provisioners can do similar work, but HashiCorp recommends them only as a last resort.

This helps with Day 1 provisioning, where we install software components as soon as the EC2 instance is up and running. The Terraform script below does exactly that - it passes a user-data script that installs the Nginx web server on the EC2 instances being created.

main.tf

# EC2 instances
resource "aws_instance" "compute_nodes" {
  ami           = var.ami
  instance_type = var.instance_type

  # Runs once at first boot. templatefile() replaces the deprecated
  # template_file data source; install.tpl takes no template variables.
  user_data = templatefile("${path.module}/install.tpl", {})
}

install.tpl

#!/bin/bash
# Install and start Nginx (apt-get implies a Debian/Ubuntu AMI)
apt-get update -y
apt-get install -y nginx
systemctl enable --now nginx



Furthermore, changes to the infrastructure code should be made frequently in small, reversible increments. To achieve this, define a process for how code changes are deployed automatically to a given environment.

Improving AWS infrastructure in small increments of Terraform code allows us to test the failure paths and fix them without losing track of the end goal.
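
One possible foundation for such a process is to pin tool versions and keep state in a shared, locked backend, so that every small change is planned and applied against the same baseline. The sketch below assumes an S3 bucket and a DynamoDB table that already exist; their names, the state key, and the version constraints are placeholders to adjust to whatever you have tested.

```hcl
terraform {
  # Pin the CLI and provider versions so every run uses the same tooling.
  required_version = ">= 1.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # assumption: use the version your code is tested with
    }
  }

  # Remote state in S3 with DynamoDB locking (all names are placeholders).
  backend "s3" {
    bucket         = "example-terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "example-terraform-locks"
    encrypt        = true
  }
}
```

With this in place, a pipeline can run `terraform plan` on every change and `terraform apply` only after review, which keeps each increment small and easy to revert.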

AWS Best Practice 2: Performance Efficiency

There are several ways to benefit from moving your current on-prem datacenters to the cloud:

1. Lift and shift - by mirroring the on-prem infrastructure on the cloud, e.g. hosting the VMs and databases on EC2 instances.

2. Partially cloud-native - by making use of managed services provided by the cloud providers for some of your workloads, e.g. hosting the database on an RDS instance.

3. Fully cloud-native - by making use of container orchestration and managed data services, e.g. hosting workloads on ECS/EKS and using DynamoDB/RDS.

4. Serverless - by making use of Lambda functions, DynamoDB, SNS, SQS, etc.

To gain the maximum advantage from the cloud journey, it is best to use as many cloud-native services as practical. The key is to choose your cloud infrastructure wisely. If not today, there should at least be a plan to transition towards cloud-native services.

Terraform supports most of the cutting-edge services offered by AWS. It is a good idea to segregate your infrastructure as code based on the kind of resource it provisions - storage, compute, network, etc.

Segregating Terraform code helps you manage your infrastructure in a version-controlled manner and it becomes easier to plan and transition from a less cloud-native approach to a more cloud-native approach.
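
One way to picture this segregation is a root configuration that only wires together local modules, one per kind of resource. The module paths, input names, and the `private_subnet_ids` output below are all hypothetical; they stand in for whatever your own modules expose.

```hcl
# Root main.tf - each concern lives in its own (hypothetical) local module.
module "network" {
  source     = "./modules/network"
  cidr_block = "10.0.0.0/16"
}

module "storage" {
  source      = "./modules/storage"
  bucket_name = "example-app-assets"
}

module "compute" {
  source        = "./modules/compute"
  instance_type = "t3.micro"

  # Outputs of one module feed the inputs of another.
  subnet_ids = module.network.private_subnet_ids
}
```

With a layout like this, replacing EC2-based compute with a container or serverless module later touches a single module call rather than the whole codebase.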

The key takeaway here is to keep a constant eye on improving the infrastructure and reflect any insights directly in Terraform code.

AWS Best Practice 3: Security

Security is perhaps one of the most discussed topics in the cloud ecosystem. AWS follows a shared responsibility model - security of the cloud is managed by AWS, whereas security in the cloud is the customer’s responsibility.

Use Terraform to define in-depth security policies for users as well as cloud infrastructure. For user management, AWS recommends the principle of least privilege. Use Terraform scripts to define the authorization model for every user that has access to AWS resources.

Terraform can be used to implement a hierarchical access control system using IAM policies that can be attached to users, groups, roles, and resources. A generic Terraform module can be built that only takes care of managing user access. Accommodating any update in policy or creating a new policy is just a matter of updating or creating an appropriate Terraform IAM resource.
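
A minimal sketch of such an access module might look like the following. It grants a group read-only access to a single bucket and attaches the policy to the group instead of to individual users. The bucket name `example-reports`, the group name, and the policy name are illustrative assumptions.

```hcl
# Policy document granting read-only access to one bucket (placeholder name).
data "aws_iam_policy_document" "read_reports" {
  statement {
    sid       = "ListReportsBucket"
    effect    = "Allow"
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::example-reports"]
  }

  statement {
    sid       = "ReadReportObjects"
    effect    = "Allow"
    actions   = ["s3:GetObject"]
    resources = ["arn:aws:s3:::example-reports/*"]
  }
}

resource "aws_iam_policy" "read_reports" {
  name   = "read-reports"
  policy = data.aws_iam_policy_document.read_reports.json
}

resource "aws_iam_group" "analysts" {
  name = "analysts"
}

# Attach the policy to the group rather than to individual users.
resource "aws_iam_group_policy_attachment" "analysts_read_reports" {
  group      = aws_iam_group.analysts.name
  policy_arn = aws_iam_policy.read_reports.arn
}
```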

Additionally, the security of compute resources is a prime concern. Creating an AWS EC2 instance takes very few lines of Terraform code, but securing that instance calls for creating VPCs and security groups, defining ingress and egress rules, and so on.
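
As an illustration, a security group that admits only inbound HTTPS could be sketched as below. The `vpc_id` variable, the group name, and the open CIDR ranges are assumptions for the example; a real deployment would usually restrict the source ranges further.

```hcl
variable "vpc_id" {
  type        = string
  description = "ID of the VPC that hosts the instances (assumed to exist)."
}

resource "aws_security_group" "web" {
  name        = "web-sg"
  description = "Allow inbound HTTPS only"
  vpc_id      = var.vpc_id

  ingress {
    description = "HTTPS from anywhere"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    description = "All outbound traffic"
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```

The group is then attached to an instance through the `vpc_security_group_ids` argument of `aws_instance`.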

Data is an asset. Securing data in transit as well as at rest thus becomes very important. Consider an example of an S3 bucket. Creating an S3 bucket in AWS using Terraform is as easy as the below code.

resource "aws_s3_bucket" "bucket" {
  bucket = "mybucket"
  acl    = "private"
}

Executing the above code successfully creates an S3 bucket. In fact, neither `bucket` nor `acl` is a “required” attribute as far as the Terraform documentation is concerned. But a bucket created this way is not ready for production.

A better solution would be to secure the bucket with:

1. A bucket policy to define access

2. Versioning to keep track of changes

3. Logging to keep track of access requests

4. Encryption to protect data at rest

There could be many more aspects you would want to consider while securing your data - we are just getting started here. A better version of the above Terraform code looks something like this:

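# Written for AWS provider v3.x. From v4 onwards these inline arguments are
# deprecated in favor of separate resources such as aws_s3_bucket_versioning.
# aws_s3_bucket.log_bucket and aws_kms_key.a are assumed to be defined elsewhere.
resource "aws_s3_bucket" "bucket" {
  bucket = "mybucket"
  acl    = "private"

  # Caution: Principal "*" makes every object publicly readable. Narrow it to
  # specific principals unless the bucket is meant to serve public content.
  policy = <<-POLICY
    {
      "Version": "2012-10-17",
      "Statement": [{
        "Sid": "MainBucketPermissions",
        "Effect": "Allow",
        "Principal": "*",
        "Action": ["s3:GetObject"],
        "Resource": ["arn:aws:s3:::mybucket/*"]
      }]
    }
    POLICY

  # Keep previous versions of every object
  versioning {
    enabled = true
  }

  # Write access logs to a separate bucket
  logging {
    target_bucket = aws_s3_bucket.log_bucket.id
    target_prefix = "log/"
  }

  # Encrypt objects at rest with a customer-managed KMS key
  server_side_encryption_configuration {
    rule {
      apply_server_side_encryption_by_default {
        kms_master_key_id = aws_kms_key.a.arn
        sse_algorithm     = "aws:kms"
      }
    }
  }
}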
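
On AWS provider v4 and later, the same hardening is usually expressed with standalone resources rather than inline blocks. The following is a sketch under that assumption; the bucket name and the resource label `secure` are placeholders, and the public access block is an extra safeguard that is not part of the example above.

```hcl
resource "aws_s3_bucket" "secure" {
  bucket = "mybucket-secure" # placeholder; bucket names are globally unique
}

# Keep previous versions of every object.
resource "aws_s3_bucket_versioning" "secure" {
  bucket = aws_s3_bucket.secure.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Encrypt objects at rest; with no key ID set, the AWS managed S3 key is used.
resource "aws_s3_bucket_server_side_encryption_configuration" "secure" {
  bucket = aws_s3_bucket.secure.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

# Reject public ACLs and public bucket policies.
resource "aws_s3_bucket_public_access_block" "secure" {
  bucket = aws_s3_bucket.secure.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```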

AWS Best Practice 4: Reliability

Production systems are usually designed with a high-availability architecture, which keeps services available when individual components fail. Managing sudden or scheduled traffic surges and failovers becomes easier by leveraging AWS platform features like autoscaling and multi-AZ deployments.

Terraform can manage these scenarios as IaC as well. With the right automation in place, changing the desired state of a cluster or autoscaling group is just a matter of changing an attribute in Terraform code.

As an example, consider the autoscaling group configuration below. It references a launch template and the template version to use. The launch template defines instance-level details such as the AMI.

resource "aws_autoscaling_group" "bar" {
  # A single AZ keeps the example short; list several AZs for high availability.
  availability_zones = ["us-east-1a"]
  desired_capacity   = 3
  max_size           = 5
  min_size           = 1

  launch_template {
    id      = aws_launch_template.foobar.id
    version = "$Latest"
  }
}

Similarly, we can set the min and max size of the autoscaling group. The desired capacity is set to 3, so after applying this configuration the autoscaling group tries to keep 3 healthy instances available to serve traffic, replacing any instance that fails its health checks. This makes the system more reliable.
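
To survive the loss of an Availability Zone, the group can be spread across subnets in different zones. The variant below is a sketch that assumes a list of existing subnet IDs passed in through a hypothetical `private_subnet_ids` variable, and it reuses the launch template referenced in the example above.

```hcl
variable "private_subnet_ids" {
  type        = list(string)
  description = "Subnets in different Availability Zones (assumed to exist)."
}

resource "aws_autoscaling_group" "multi_az" {
  # One subnet per AZ spreads instances across zones.
  vpc_zone_identifier = var.private_subnet_ids

  desired_capacity = 3
  max_size         = 5
  min_size         = 1

  # Replace instances that fail EC2 status checks after a 5-minute warm-up.
  health_check_type         = "EC2"
  health_check_grace_period = 300

  launch_template {
    id      = aws_launch_template.foobar.id # defined elsewhere
    version = "$Latest"
  }
}
```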

The reliability pillar also emphasizes testing failure scenarios. Similar tests can be carried out on the HA infrastructure code itself, especially testing that recent changes to the IaC can be rolled back.

AWS Best Practice 5: Monitoring and Cost Optimization

To gain transparency of your workloads running in AWS, monitoring needs to be implemented to keep track of resource utilization. Monitoring is not a switch that can be turned on or off. Most of the time, customers are in need of tailor-made monitoring solutions.

AWS CloudWatch helps monitor resource utilization and trigger alarms whenever a threshold is breached. It offers great flexibility as it is integrated with most of the AWS services.

To define a monitoring solution on AWS for complex infrastructures using Terraform, it makes sense to dedicate an entire module or set of files for monitoring.

Use Terraform to create CloudWatch log groups, metric filters, and alarms. By segregating the monitoring code in a single place, it becomes easy to manage the thresholds and notification settings. The alarm can also be used to trigger appropriate autoscaling actions.
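
For instance, a CPU alarm that triggers a scale-out action could be sketched as follows. It assumes the autoscaling group `aws_autoscaling_group.bar` from the reliability section; the names, threshold, and periods are illustrative values, not recommendations.

```hcl
# Scale out by one instance when the alarm fires.
resource "aws_autoscaling_policy" "scale_out" {
  name                   = "scale-out"
  autoscaling_group_name = aws_autoscaling_group.bar.name
  adjustment_type        = "ChangeInCapacity"
  scaling_adjustment     = 1
  cooldown               = 300
}

# Fire when average CPU across the group stays at or above 80%
# for two consecutive 2-minute periods.
resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "asg-high-cpu"
  namespace           = "AWS/EC2"
  metric_name         = "CPUUtilization"
  statistic           = "Average"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  threshold           = 80
  period              = 120
  evaluation_periods  = 2

  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.bar.name
  }

  alarm_actions = [aws_autoscaling_policy.scale_out.arn]
}
```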

Cost optimization on AWS relies heavily on tagging, and almost every AWS resource supports tags. Apply tags consistently in Terraform code and keep their values in variable files per environment. Cost reports can then be generated from the information captured in the tags.
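
One way to apply tags consistently is the `default_tags` block of the AWS provider, which adds the same tags to every taggable resource the provider creates. In the sketch below, the variable names, tag keys, and region are assumptions chosen for illustration.

```hcl
variable "environment" {
  type        = string
  description = "Environment name, set per environment in a .tfvars file."
}

variable "cost_center" {
  type        = string
  description = "Hypothetical cost-center code used in cost reports."
}

provider "aws" {
  region = "us-east-1"

  # Applied to every taggable resource created through this provider.
  default_tags {
    tags = {
      Environment = var.environment
      CostCenter  = var.cost_center
      ManagedBy   = "terraform"
    }
  }
}
```

Each environment then supplies its own values, for example through a `prod.tfvars` file passed with `-var-file`. Note that tags need to be activated as cost allocation tags in the AWS Billing console before they show up in cost reports.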

Conclusion

In this post, we discussed a few options to manage AWS best practices using Terraform. As mentioned earlier, this is not a complete list as some of the topics may be very specific per customer. However, I hope it gives you a taste of how to best manage AWS infrastructure using Terraform.

We have a ton of other resources available to learn about Terraform for AWS. You can find them here.



Note: This post was written by Sumeet Ninawe from Let’s Do Tech. Sumeet has multi-cloud management platform experience where he used Terraform for the orchestration of customer deployments across major cloud providers like AWS, MS Azure, and GCP. You can find him on Github, Twitter, and his website.