CloudWatch Alarms Posting to Slack using Terraform

PUBLISHED ON DEC 2, 2019 — AWS, INFRASTRUCTURE, TERRAFORM

I’ve been reaching the limits of some of my resources in AWS recently - namely the remaining free disk space on both Redshift and Elasticsearch.
To warn me before this happens again, I wanted a way to be notified automatically when I’m getting close. The first idea I thought of was to have an email sent to me, but I never read emails because I’m lazy (sorry, cool).
So instead I thought I’d have it notify me via Slack, because I’m usually procrastinating on there (again: cool).
And because I tend to break stuff (read: follow best practices), I like to keep all my infrastructure as code - I’m using Terraform.

As always, I’ve added the finished code to my GitHub which you can find here.
If you check the code on GitHub first you’ll notice that I’ve set this up as a Terraform module, which I haven’t covered in this post. I’ve also provided all the variables with default values so that you can try running a plan straight away.

You’ll need to have Terraform installed for this to work; you can follow the instructions to do so here.
For this post I’m using Terraform v0.12 - this is worth noting because this version introduced first-class expressions, which change how references to other resources are written: interpolation-only strings such as "${var.environment}" are deprecated in favour of bare references like var.environment.

First things first - we need an AWS provider in Terraform. I’m not assuming any roles, so I have a straightforward provider, and my AWS credentials are set as environment variables:

provider "aws" {
    version = "~> 2.0"
    region  = "eu-west-2"
}

The first step of this arduous journey is to define my CloudWatch alarms. I have two, one for Redshift and one for Elasticsearch. I’m going to have them notify me when disk usage reaches 75%:

resource "aws_cloudwatch_metric_alarm" "elasticsearch_disk_space_alarm" {
  alarm_name          = "${var.application}-elasticsearch-${var.environment}-disk-alarm"
  alarm_description   = "Remaining ElasticSearch disk space below threshold"
  comparison_operator = "LessThanOrEqualToThreshold"
  evaluation_periods  = "1"
  period              = 60
  threshold           = var.elasticsearch_volume_size * 250 # 25% of the volume in MB (the size is configured in GB)
  namespace           = "AWS/ES"
  metric_name         = "FreeStorageSpace"
  statistic           = "Minimum"

  dimensions = {
    DomainName = var.elasticsearch_domain_name
    ClientId   = var.account_id
  }

  alarm_actions = [

  ]
  ok_actions = [

  ]
}

resource "aws_cloudwatch_metric_alarm" "redshift_disk_space_alarm" {
  alarm_name          = "${var.application}-redshift-${var.environment}-disk-alarm"
  alarm_description   = "Remaining Redshift disk space below threshold"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = "1"
  period              = 60
  threshold           = 75 # Redshift metric is in percentage already
  namespace           = "AWS/Redshift"
  metric_name         = "PercentageDiskSpaceUsed"
  statistic           = "Maximum"

  dimensions = {
    ClusterIdentifier = var.redshift_cluster_id
  }

  alarm_actions = [

  ]
  ok_actions = [

  ]
}

Now there are a couple of things to note here.
First, the two thresholds are defined differently. Redshift is lovely, and provides a metric showing the percentage of disk space used. So we simply set a threshold of 75(%), and use a comparison_operator of GreaterThanOrEqualToThreshold - so if the disk space used is at or above 75%, the alarm gets triggered.
Beautiful.
Unfortunately, Elasticsearch is a heathen, and only exposes a metric on the amount of free storage space.
Not only that, when creating the domain you work in GB, but the metric is in MB, so we take the volume size you used to configure your Elasticsearch domain and multiply it by 250 (1,000 MB per GB, times 25%) to get an amount equal to roughly 25% of the disk space.
Because we’re now measuring free space rather than used space, you’ll notice our comparison_operator is LessThanOrEqualToThreshold.
Both have dimensions set up for what’s needed to pull in the metric - how to find the Redshift cluster, or Elasticsearch domain.
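To make the Elasticsearch arithmetic concrete, here is a minimal Python sketch of the same calculation. The function name and its parameters are hypothetical and not part of the Terraform code; it assumes 1,000 MB per GB, which is what the multiplier of 250 implies (using 1,024 instead would give 256).

```python
def free_space_threshold_mb(volume_size_gb, free_fraction=0.25, mb_per_gb=1000):
    """Return the free-space alarm threshold in MB for a volume sized in GB.

    Hypothetical helper: mirrors `var.elasticsearch_volume_size * 250`
    when free_fraction is 0.25 and mb_per_gb is 1000.
    """
    return volume_size_gb * mb_per_gb * free_fraction


# A 100 GB volume alarms once free space drops to 25,000 MB or less.
print(free_space_threshold_mb(100))
```

If you would rather be warned at a different level, only the fraction changes: alarming at 80% used means a free_fraction of 0.2, so the Terraform multiplier would become 200.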

The other thing to note is that our alarm_actions and ok_actions are empty - so nothing will happen when the alarm is triggered.
We need to fix that, so let’s add an SNS topic. The idea here is that when the alarm is triggered, a message will be posted to SNS, which will then be picked up by a Lambda, which will post to a Slack webhook. Once the topic exists, both alarm_actions and ok_actions need to contain its ARN (aws_sns_topic.sns_notify_slack_topic.arn).
To add this we need both the SNS topic itself and a subscription - if nothing subscribes to our topic, nothing happens:

resource "aws_sns_topic" "sns_notify_slack_topic" {
  name = "${var.application}-slack-notifications-${var.environment}"
}

resource "aws_sns_topic_subscription" "sns_notify_slack_subscription" {
  topic_arn = aws_sns_topic.sns_notify_slack_topic.arn
  protocol  = "lambda"
  endpoint  = ""
}

Fortunately, as you can see, this is pretty straightforward. I’ve spiced things up a bit by using variables for the environment to deploy to, and the application it’s for. If we’re making something, why not make something reusable?
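However, as you can see, the subscription endpoint is empty - we have no Lambda to actually subscribe to the topic yet. The endpoint needs to be the Lambda’s ARN (aws_lambda_function.sns_notify_slack_lambda.arn), so the next step is to add that function.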

Unfortunately, adding a Lambda isn’t as straightforward as I think it should be. Not only do you need to define the Lambda and its code, you also need a way of packaging that code, the permissions for the code to be executed, and permission for the Lambda to log to CloudWatch (especially useful when you keep screwing up your code, like I did).
Let’s start with the actual code for our Lambda - I’m using Python because, well, I want to. Use whatever you want.

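import json
import os
import urllib.request  # standard library; botocore.vendored.requests no longer works in current Lambda runtimes


def lambda_handler(event, context):
    # The webhook URL and emoji are injected through the Lambda's environment variables
    webhook_url = os.environ['SLACK_WEBHOOK']
    emoji = os.environ['SLACK_EMOJI']

    # CloudWatch delivers the alarm details as a JSON string inside the SNS record
    sns = event['Records'][0]['Sns']
    raw_message = json.loads(sns['Message'])

    slack_data = {
        'text': sns['Subject'],
        'icon_emoji': emoji,
        'attachments': [
            {
                'text': raw_message['AlarmDescription'],
                'title': sns['Subject'],
                'color': '#ff9a17'
            }
        ]
    }

    # Supplying a request body makes urllib send a POST
    request = urllib.request.Request(
        webhook_url, data=json.dumps(slack_data).encode('utf-8'),
        headers={'Content-Type': 'application/json'}
    )
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses, which fails the invocation
    with urllib.request.urlopen(request) as response:
        if response.status != 200:
            raise ValueError(f'Request to slack returned an error {response.status}, the response is:\n{response.read().decode()}')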

In the above code you can see that this is actually a terrible example to use.
We read the SNS message, and we read a Slack webhook and emoji from the environment variables. Then we construct a message to post to Slack, and simply post it. Done.
What this doesn’t do is remotely account for different messages - for example, if the alarm transitions back to an OK state, the message format will look exactly the same. Same emoji, same colour. Not ideal, but it will do for the purposes of this example.
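If you do want different states to look different, one possible approach is to pick the colour from the NewStateValue field that CloudWatch includes in the alarm message. The sketch below is not what the handler above does: build_slack_payload and the STATE_COLOURS mapping are hypothetical names, and the red and green colour values are arbitrary choices.

```python
import json

# Hypothetical colour choices; only '#ff9a17' comes from the handler above.
STATE_COLOURS = {
    'ALARM': '#d62728',
    'OK': '#2ca02c',
    'INSUFFICIENT_DATA': '#ff9a17',
}


def build_slack_payload(sns_record, emoji):
    """Build the Slack message body from the 'Sns' part of a Lambda event record."""
    alarm = json.loads(sns_record['Message'])
    # Fall back to the original orange if the state is missing or unrecognised
    colour = STATE_COLOURS.get(alarm.get('NewStateValue'), '#ff9a17')
    return {
        'text': sns_record['Subject'],
        'icon_emoji': emoji,
        'attachments': [
            {
                'title': sns_record['Subject'],
                'text': alarm['AlarmDescription'],
                'color': colour,
            }
        ],
    }
```

The handler would then call build_slack_payload(event['Records'][0]['Sns'], emoji) instead of building the dictionary inline. The same lookup could just as easily choose a different emoji per state.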
Now this code needs to be packed up in a way that’s acceptable - a zip folder, apparently. We can do this through Terraform using data sources. That way you don’t have to remember to create a new zip every time you make a change:

data "null_data_source" "lambda_file" {
  inputs = {
    filename = "${path.module}/lambda.py"
  }
}

data "null_data_source" "lambda_archive" {
  inputs = {
    filename = "${path.module}/lambda.zip"
  }
}

data "archive_file" "sns_notify_slack_code" {
  type        = "zip"
  source_file = data.null_data_source.lambda_file.outputs.filename
  output_path = data.null_data_source.lambda_archive.outputs.filename
}

The first data source simply grabs a reference to the file, the second is a reference to the zip archive we’ll output. The third data source is the one that will actually create the archive for us, using the previous two values.
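For intuition, here is a rough Python equivalent of what the archive step produces: a zip containing the single source file, plus a base64-encoded SHA-256 digest of the archive, which is the kind of value that output_base64sha256 exposes. This is only an illustration; the helper names are hypothetical, and the bytes it produces will not match Terraform's archive exactly, since zip metadata such as timestamps differs.

```python
import base64
import hashlib
import io
import zipfile


def zip_single_file(name, content):
    """Return the bytes of a zip archive containing one file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, 'w', zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(name, content)
    return buffer.getvalue()


def base64sha256(data):
    """Return the SHA-256 digest of data, base64-encoded as text."""
    return base64.b64encode(hashlib.sha256(data).digest()).decode('ascii')


archive_bytes = zip_single_file('lambda.py', 'def lambda_handler(event, context):\n    pass\n')
print(base64sha256(archive_bytes))
```

Because the digest changes whenever the archive contents change, it gives Terraform a cheap way to tell that the function code needs to be uploaded again.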

To get this working end to end, you’ll need an incoming webhook to the appropriate Slack channel to be configured. You can find out how to do that here.

Now we can finally create our lambda function:

resource "aws_lambda_function" "sns_notify_slack_lambda" {
  filename         = data.archive_file.sns_notify_slack_code.output_path
  function_name    = "${var.application}-notify-slack-${var.environment}"
  role             = ""
  handler          = "lambda.lambda_handler"
  source_code_hash = data.archive_file.sns_notify_slack_code.output_base64sha256
  runtime          = "python3.6"
  timeout          = 30

  environment {
    variables = {
      SLACK_WEBHOOK = var.slack_webhook
      SLACK_EMOJI   = var.slack_emoji
    }
  }
}

Here we can see that the filename and source_code_hash are from the aforementioned data sources. The function_name is just what I want the function to be called in AWS - again using the application and environment variables for reusability.
The handler is the name of the function to execute in the code, which I’ve creatively called lambda_handler. It’s preceded by the name of the file, lambda, because I’ve called my file lambda.py.
My creativity astounds even me.

Everything else is pretty standard, except you’ll notice that role is empty - we need to add a role to allow the Lambda to be executed:

resource "aws_iam_role" "sns_notify_slack_lambda_role" {
  name = "${var.application}-sns-notify-slack-${var.environment}-lambda-role"

  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": "lambda.amazonaws.com"
      },
      "Effect": "Allow",
      "Sid": ""
    }
  ]
}
EOF
}

This creates the IAM execution role that the Lambda service assumes when it runs our function; its ARN (aws_iam_role.sns_notify_slack_lambda_role.arn) is what goes in the empty role argument above.
However, that still isn’t enough. The Lambda now has a role to run as, but SNS isn’t yet allowed to invoke it. You know, security. I guess.
So we need to add a Lambda permission so that it can:

resource "aws_lambda_permission" "sns_notify_slack_permission" {
  statement_id  = "AllowExecutionFromSNS"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.sns_notify_slack_lambda.function_name
  principal     = "sns.amazonaws.com"
  source_arn    = aws_sns_topic.sns_notify_slack_topic.arn
}

Based on the statement_id and action you can see we’re giving the permission to invoke a Lambda function from SNS, the function_name is taken from our previously created Lambda, and the source_arn is taken from our SNS topic.

And that will do it! This will actually work.
Now if you deploy this, and your variables are correct, your CloudWatch alarms will be triggered at the appropriate time, a message will be posted to SNS, picked up by the Lambda, and posted to the Slack webhook.
If you want to ensure it works, you can lower the threshold to force a message, or simply post to SNS yourself to test half of it.
Go on, try it. I’ll wait.
Neat, huh?
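Another option, before deploying anything, is to call the handler locally with a hand-built event. The sketch below builds a heavily trimmed version of the event Lambda receives from SNS, containing only the fields the handler reads plus NewStateValue. The make_sns_event helper and the example subject are hypothetical; a real event carries many more fields.

```python
import json


def make_sns_event(subject, alarm_description, new_state='ALARM'):
    """Build a minimal SNS-style Lambda event for local testing (hypothetical helper)."""
    message = {
        'AlarmDescription': alarm_description,
        'NewStateValue': new_state,
    }
    return {
        'Records': [
            {
                'EventSource': 'aws:sns',
                'Sns': {
                    'Subject': subject,
                    # SNS delivers the alarm details as a JSON string, not a nested object
                    'Message': json.dumps(message),
                },
            }
        ]
    }


event = make_sns_event('ALARM: "example-disk-alarm"', 'Remaining Redshift disk space below threshold')
sns = event['Records'][0]['Sns']
print(sns['Subject'])
print(json.loads(sns['Message'])['AlarmDescription'])
```

To run the real handler against it, set SLACK_WEBHOOK and SLACK_EMOJI in your environment and pass the event to lambda_handler with None as the context. Note that lambda is a reserved word in Python, so the file has to be loaded with importlib.import_module('lambda') rather than a plain import statement, and that this really will post to your Slack channel.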

One last thing I want to add is to enable logging for the Lambda. It currently does not have permission to create CloudWatch logs, so let’s change that. To do this, we need to create a new IAM policy, and attach that to the IAM role we created earlier:

resource "aws_iam_policy" "lambda_logging" {
  name        = "${var.environment}-${var.application}-lambda_logging"
  path        = "/"
  description = "IAM policy for logging from a lambda"

  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*",
      "Effect": "Allow"
    }
  ]
}
EOF
}

resource "aws_iam_role_policy_attachment" "lambda_logs" {
  role       = aws_iam_role.sns_notify_slack_lambda_role.name
  policy_arn = aws_iam_policy.lambda_logging.arn
}

This creates a policy granting permission to create logs in CloudWatch, and attaches it to the role we previously defined.

And fortunately for everyone involved, that’s all!
I’ve skipped setting up the Redshift and Elasticsearch resources in this post for brevity, but you can find them in the code on GitHub if you’re curious.
This will allow you to deploy CloudWatch alarms, and have them post to SNS, which is picked up by Lambda, which posts a message to Slack for us to be, well, alarmed.
If you want to check out the finished code example, you can find it here on my GitHub.
