Monitoring disk utilization is a critical task for maintaining the health of your cloud infrastructure. Whether you’re using AWS EC2, Azure VMs, or Google Cloud Compute Engine instances, an unmonitored disk can lead to critical system failures. This blog provides a detailed guide to automating disk utilization alerts using Python, cron jobs, and Slack. By the end of this guide, you’ll be able to:
- Launch cloud instances using Terraform.
- Set up a Python script to monitor disk utilization.
- Automate alerts using a cron job.
- Send notifications to Slack.
This approach saves time, prevents outages, and ensures proactive cloud management.
Disk utilization monitoring is often overlooked until it becomes a problem, but proactive monitoring helps avoid disruptions in applications and services. Let’s dive into the technical details to ensure your cloud infrastructure is both efficient and reliable.
Prerequisites
Before we start, make sure you have the following:
- A cloud account on AWS, Azure, or GCP.
- Basic familiarity with Python, Terraform, and shell scripting.
- Slack workspace with a webhook URL for sending alerts.
- Terraform installed on your local machine.
- Python 3.x installed with basic knowledge of psutil and requests modules.
- IAM permissions to launch and configure instances in your cloud provider.
Having these prerequisites in place ensures a smooth implementation of the solution described below.
High-Level Architecture
Workflow Overview
- Use Terraform to launch a cloud instance with user data scripts.
- Inject a Python script to monitor disk usage.
- Configure a cron job to run the script every 2 hours.
- Send alerts to Slack when disk utilization exceeds a threshold.
Below is the architecture diagram:

[Terraform] --> [Launch Instance] --> [User Data Scripts] --> [Disk Monitoring Script + Cron] --> [Slack Alerts]
By automating these tasks, you minimize manual intervention and maintain a robust monitoring solution across cloud environments.
Writing the Python Script
Overview of the Python Script
The Python script will:
- Check disk utilization using the psutil library.
- Format a message and send it to Slack using the requests library.
Python Code
Here’s a sample script:
```python
import os

import psutil
import requests


def check_disk_utilization(threshold, slack_webhook_url):
    disk_usage = psutil.disk_usage('/')
    used_percentage = disk_usage.percent
    if used_percentage > threshold:
        message = f":warning: Disk utilization is at {used_percentage}%! Take action now."
        payload = {"text": message}
        response = requests.post(slack_webhook_url, json=payload)
        if response.status_code == 200:
            print("Alert sent to Slack successfully.")
        else:
            print(f"Failed to send alert: {response.text}")


if __name__ == "__main__":
    SLACK_WEBHOOK_URL = os.getenv("SLACK_WEBHOOK_URL")
    THRESHOLD = int(os.getenv("THRESHOLD", 80))  # Default threshold is 80%
    check_disk_utilization(THRESHOLD, SLACK_WEBHOOK_URL)
```
- Key Points:
- The script retrieves disk usage with psutil.disk_usage('/').
- It sends an alert to Slack only if utilization exceeds the defined threshold.
Setting Up Automation with Terraform
Terraform Configuration
The Terraform script will:
- Launch a cloud instance.
- Configure networking resources like VPC, subnets, and security groups.
- Add user data scripts for the Python and cron setup.
Below is the main.tf file:
```hcl
provider "aws" {
  region = "us-east-1"
}

variable "slack_webhook_url" {
  description = "Slack incoming-webhook URL used by the monitoring script"
  type        = string
  sensitive   = true
}

resource "aws_vpc" "main_vpc" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_support   = true
  enable_dns_hostnames = true

  tags = {
    Name = "MainVPC"
  }
}

resource "aws_subnet" "main_subnet" {
  vpc_id                  = aws_vpc.main_vpc.id
  cidr_block              = "10.0.1.0/24"
  map_public_ip_on_launch = true
  availability_zone       = "us-east-1a"
}

resource "aws_security_group" "instance_sg" {
  vpc_id = aws_vpc.main_vpc.id

  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_instance" "disk_monitor" {
  ami           = "ami-0c55b159cbfafe1f0" # Example AMI; use a current one for your region
  instance_type = "t2.micro"
  subnet_id     = aws_subnet.main_subnet.id

  # Inside a VPC, reference security groups by ID, not by name.
  vpc_security_group_ids = [aws_security_group.instance_sg.id]

  # Render the Python template first, then inject it into the shell template.
  user_data = templatefile("script.sh.tpl", {
    python_script = templatefile("script.py.tpl", {
      slack_webhook_url = var.slack_webhook_url
      threshold         = 80
    })
  })

  tags = {
    Name = "DiskMonitorInstance"
  }
}
```
This configuration ensures your instance is secure and connected to a proper network for access and monitoring.
Creating User Data Templates
The Python Script Template (script.py.tpl)
This template dynamically injects environment variables for the Slack webhook and threshold.
```python
import psutil
import requests


def check_disk_utilization(threshold, slack_webhook_url):
    disk_usage = psutil.disk_usage('/')
    used_percentage = disk_usage.percent
    if used_percentage > threshold:
        message = f":warning: Disk utilization is at {used_percentage}%! Take action now."
        payload = {"text": message}
        requests.post(slack_webhook_url, json=payload)


if __name__ == "__main__":
    SLACK_WEBHOOK_URL = "${slack_webhook_url}"
    THRESHOLD = ${threshold}
    check_disk_utilization(THRESHOLD, SLACK_WEBHOOK_URL)
```
The Shell Script Template (script.sh.tpl)
This script installs Python dependencies and sets up the cron job to run the Python script every 2 hours.
```bash
#!/bin/bash
# Update and install dependencies
yum update -y
yum install -y python3
pip3 install psutil requests

# Add Python script
cat <<EOF > /opt/disk_monitor.py
${python_script}
EOF

# Make the script executable
chmod +x /opt/disk_monitor.py

# Add cron job to run the script every 2 hours
(crontab -l 2>/dev/null; echo "0 */2 * * * python3 /opt/disk_monitor.py") | crontab -

# Start cron service
service crond start
```
This ensures that the necessary dependencies are installed, and the monitoring script is executed periodically.
Adding the Python Script to Cron
In the script.sh.tpl, we used:
(crontab -l 2>/dev/null; echo "0 */2 * * * python3 /opt/disk_monitor.py") | crontab -
This runs the Python script every 2 hours without manual intervention, and starting the cron service ensures the scheduled job actually executes.
Testing the Setup
- Run the Terraform script to launch the instance:

```bash
terraform init
terraform apply
```

- SSH into the instance to verify the files and cron setup:

```bash
crontab -l
```

- Simulate high disk usage and confirm the Slack alert.
- Use commands like dd to create large files and monitor the script behavior.
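If you would rather not fill a real disk, the alert condition can also be exercised in isolation by feeding it a fabricated reading. This is a minimal sketch: `Usage` is a stand-in for the named tuple `psutil.disk_usage()` returns, and `should_alert` mirrors the threshold comparison used in the monitoring script.

```python
from collections import namedtuple

# Stand-in for the named tuple that psutil.disk_usage() returns.
Usage = namedtuple("Usage", ["total", "used", "free", "percent"])


def should_alert(usage, threshold):
    # The same condition the monitoring script applies.
    return usage.percent > threshold


nearly_full = Usage(total=100, used=95, free=5, percent=95.0)
print(should_alert(nearly_full, 80))  # 95% exceeds an 80% threshold
print(should_alert(nearly_full, 99))  # 95% is below a 99% threshold
```

This lets you confirm the threshold logic before waiting for a real disk to fill up; the dd approach then verifies the full end-to-end path, including the Slack delivery.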
Enhancing and Scaling the Setup
Once you have verified the basic setup, there are multiple ways to enhance and scale this solution:
Adding Multi-Disk Support
If your cloud instance has multiple disks, modify the Python script to check utilization for all mounted disks. Use psutil.disk_partitions() to iterate through every partition:
```python
for partition in psutil.disk_partitions():
    usage = psutil.disk_usage(partition.mountpoint)
    if usage.percent > threshold:
        message = f":warning: Disk {partition.device} is at {usage.percent}% utilization!"
        payload = {"text": message}
        requests.post(slack_webhook_url, json=payload)
```
Configuring Alerts for Multiple Channels
If you want to send alerts to different Slack channels based on the severity, you can integrate multiple Slack webhook URLs and classify thresholds as Warning, Critical, etc.
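One way to sketch this routing: classify each reading into a severity tier, then pick the matching webhook. The webhook URLs and the 80%/90% tier boundaries below are illustrative placeholders, not values from the original setup; the actual post would use requests.post exactly as in the main script.

```python
def classify(percent, warning=80, critical=90):
    """Map a utilization percentage to a severity tier, or None if healthy."""
    if percent >= critical:
        return "critical"
    if percent >= warning:
        return "warning"
    return None


# Hypothetical webhook URLs, one per Slack channel; replace with your own.
WEBHOOKS = {
    "warning": "https://hooks.slack.com/services/T000/B000/WARN",
    "critical": "https://hooks.slack.com/services/T000/B000/CRIT",
}


def build_alert(percent):
    """Return a (webhook, payload) pair for a reading, or None if no alert is due."""
    severity = classify(percent)
    if severity is None:
        return None
    payload = {"text": f":warning: [{severity.upper()}] Disk utilization is at {percent}%"}
    return WEBHOOKS[severity], payload

# The monitoring script would then send with: requests.post(webhook, json=payload)
```

Keeping the thresholds as parameters makes it easy to tune each tier per environment without touching the routing logic.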
Scaling Across Multiple Instances
To monitor multiple instances in the same cloud environment:
- Use Terraform to deploy the setup across multiple instances.
- Configure a centralized monitoring service like Amazon CloudWatch, Azure Monitor, or Google Cloud's operations suite (formerly Stackdriver) for aggregated alerts.
Adding Logging and Metrics
Integrate logging to keep track of disk usage trends over time. Use Python’s logging module to log utilization percentages to a file or an external monitoring service.
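A minimal sketch with the standard logging module follows; the log path, format, and the `log_utilization` helper are illustrative choices, and in the monitoring script you would call it with the percentage from psutil.disk_usage('/').

```python
import logging

# Log utilization readings to a file so trends can be reviewed later.
logging.basicConfig(
    filename="disk_monitor.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
    force=True,  # reset any previously configured handlers
)


def log_utilization(mountpoint, percent, threshold=80):
    """Record a reading; escalate the log level when the threshold is breached."""
    if percent > threshold:
        logging.warning("Disk %s at %.1f%% (threshold %d%%)", mountpoint, percent, threshold)
    else:
        logging.info("Disk %s at %.1f%%", mountpoint, percent)


log_utilization("/", 42.5)
```

Shipping this log file to your cloud provider's log service then gives you the historical trend data alongside the real-time Slack alerts.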
Conclusion
In this blog, we’ve demonstrated a robust way to automate disk utilization monitoring for cloud instances using Python, cron jobs, and Slack. This setup is cost-effective, highly customizable, and adaptable across major cloud providers like AWS, Azure, and GCP.
With the power of Terraform, you can launch instances with pre-configured monitoring scripts, ensuring consistent and automated deployment. By leveraging Slack for notifications, you maintain real-time awareness of your infrastructure health.
Disk utilization monitoring is just one of many steps toward a proactive cloud management strategy. Implementing this solution not only prevents outages but also promotes efficient resource utilization, helping you save time and operational costs.