Dev.Chan64's Blog


Retrospective on an AWS DevOps Example Project

gpt-4-turbo has translated this article into English.


This article is a retrospective on the structure and design process of the aws-devops-example project.
It aims to share practical examples of automating the deployment of infrastructure and applications, with a focus on CloudFormation and GitHub Actions.


Why Did We Start This Project?

Modern service operations require more than just writing good code.
The project was planned against the following background:


What Experiences Did We Want to Incorporate?

This document is not merely a list of results.
Instead, we aim to organize practical experiences from the following three perspectives:

  1. Structure: How resources were segmented and connected
  2. Automation: What triggers and flows were used for deployment
  3. Failures and Lessons: What worked well and what prompted a reevaluation

The following sections walk through the design goals, the overall architecture, how resources were separated, the deployment flow, and the trial and error along the way.


Design Goals

This project was started with the goal of going beyond a simple “deployment automation example” to achieve
“structured infrastructure design and executable automation”.


1. Infrastructure as Code
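As a hedged illustration of the idea (not the project's actual template), even a single-resource CloudFormation stack captures the "infrastructure as code" goal: the VPC below is fully defined, reviewable, and reproducible from version-controlled YAML. The names and CIDR range are hypothetical.

```yaml
# Hypothetical minimal template (not taken from the project repository)
AWSTemplateFormatVersion: "2010-09-09"
Description: Minimal VPC defined entirely as code

Resources:
  DevVpc:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: 10.0.0.0/16        # hypothetical address range
      EnableDnsSupport: true
      EnableDnsHostnames: true
      Tags:
        - Key: Name
          Value: dev-vpc

Outputs:
  VpcId:
    Value: !Ref DevVpc
    Export:
      Name: dev-vpc-id              # lets other stacks import this VPC
```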


2. Clear Stack Separation and Change Flexibility (Modular & Mutable Stacks)

Example:

This structure provides the following flexibility:

This stack separation does more than divide components; it structures resources around how likely they are to change in a real operational environment.
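One common way to wire separated stacks together is CloudFormation's Export / Fn::ImportValue mechanism: the network stack exports identifiers, and downstream stacks import them without hard-coding IDs. This is a hedged sketch with hypothetical export names; the project may pass values as parameters instead.

```yaml
# Network stack (hypothetical): export subnet IDs for other stacks
Outputs:
  Subnet1Id:
    Value: !Ref Subnet1
    Export:
      Name: dev-subnet-1-id
  Subnet2Id:
    Value: !Ref Subnet2
    Export:
      Name: dev-subnet-2-id

# Load balancer stack (hypothetical): import them by export name
Resources:
  LoadBalancer:
    Type: AWS::ElasticLoadBalancingV2::LoadBalancer
    Properties:
      Type: application
      Subnets:
        - !ImportValue dev-subnet-1-id
        - !ImportValue dev-subnet-2-id
```

Because imports create hard dependencies between stacks, this approach pairs naturally with the "changeability" grouping above: frequently changing stacks import from rarely changing ones, not the other way around.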


3. Declarative Git-based Deployment (GitOps)

Code Commit → GitHub Actions → CloudFormation → AWS Resource Creation
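The flow above can be sketched as a single GitHub Actions job (a hedged example; the workflow, stack, and template paths here are hypothetical, and the project actually splits this across per-stack workflows):

```yaml
name: vpc-stack
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_REGION }}

      - name: Deploy CloudFormation stack
        run: |
          aws cloudformation deploy \
            --stack-name dev-vpc \
            --template-file cloudformation/vpc.yml \
            --no-fail-on-empty-changeset
```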

4. Minimum Structure, Clear Flow

Architecture Overview

This project is configured on AWS following this flow:


Overall Flow Summary (Mermaid)

flowchart TD
ecr --> ecs_taskdef
role --> ecs_taskdef

subnet --> ecs_service
vpc --> subnet --> load_balancer

subgraph alb [ALB Group]
load_balancer --> alb_dns
load_balancer --> tg
load_balancer --> listener
end

alb_dns --> apigw
vpc --> load_balancer
vpc --> securitygroup --> load_balancer
tg --> ecs_service

ecs_cluster --> ecs_service
ecs_taskdef --> ecs_service
vpc --> ecs_service
securitygroup --> ecs_service
ecr --> ecs_service

Detailed Infrastructure Layer Structure (D2)

direction: down

Infra: {
  label: "AWS Infrastructure"

  INET: "Internet"

  Traffic: {
    label: "Traffic Management"
    Route53: "Route 53"
  }

  CDN: {
    label: "CDN"
    CloudFront: "CloudFront"
  }

  API: {
    label: "API Gateway"
    APIGateway: "API Gateway"
  }

  NetworkLayer: {
    label: "Network Layer"

    VPC: {
      label: "VPC Network"

      IGW: "Internet Gateway"
      RouteTable: "Route Table"
      SecurityGroup: "Security Group"

      IGW -> RouteTable
      RouteTable -> SecurityGroup

      Subnets: {
        label: "Subnets"
        Subnet1: "Subnet 1"
        Subnet2: "Subnet 2"
        Subnet3: "Subnet 3"
      }

      LoadBalancers: {
        label: "Load Balancers"

        LB1: {
          label: "Load Balancer 1"
          Listener1: "ELB Listener 1"
          TargetGroup1: "Target Group 1"
          Listener1 -> TargetGroup1
        }

        LB2: {
          label: "Load Balancer 2"
          Listener2: "ELB Listener 2"
          TargetGroup2: "Target Group 2"
          Listener2 -> TargetGroup2
        }

        LB3: {
          label: "Load Balancer 3"
          Listener3: "ELB Listener 3"
          TargetGroup3: "Target Group 3"
          Listener3 -> TargetGroup3
        }
      }

      SecurityGroup -> Subnets
      LoadBalancers.LB1.TargetGroup1 -> Subnets.Subnet1
      LoadBalancers.LB2.TargetGroup2 -> Subnets.Subnet2
      LoadBalancers.LB3.TargetGroup3 -> Subnets.Subnet3
    }
  }

  Application: {
    label: "Application Layer"

    ECS: {
      label: "ECS Cluster"

      SRV1: {
        label: "ECS Service 1"
        Task1: "ECS Task 1"
        Frontend: "Frontend"
        Task1 -> Frontend
      }

      SRV2: {
        label: "ECS Service 2"
        Task2: "ECS Task 2"
        API: "API Service"
        Task2 -> API
      }

      SRV3: {
        label: "ECS Service 3"
        Task3: "ECS Task 3"
        Data: "Data Service"
        Task3 -> Data
      }
    }
  }

  NetworkLayer.VPC.Subnets.Subnet1 -> Application.ECS.SRV1.Task1
  NetworkLayer.VPC.Subnets.Subnet2 -> Application.ECS.SRV2.Task2
  NetworkLayer.VPC.Subnets.Subnet3 -> Application.ECS.SRV3.Task3

  DATA: {
    label: "Data Platform"

    IoT: {
      IoTCore: "IoT Core"
      IoTRules: "IoT Rules"
      IoTCore -> IoTRules
    }

    CW: "CloudWatch"
    IoT.IoTRules -> CW

    Storage: {
      S3: "S3 Bucket"
      DynamoDB
    }
  }

  AUTH: {
    label: "Authentication"
    CognitoUserPool: "Cognito User Pool"
    CognitoIdentityPool: "Cognito Identity Pool"
  }

  Application -> DATA
  Application -> AUTH

  INET -> Traffic.Route53 -> CDN.CloudFront -> API.APIGateway -> NetworkLayer.VPC.LoadBalancers
  INET -> NetworkLayer.VPC.IGW
}

Resource Stack Breakdown

This project is structured into four main stacks, each divided by functionality:

  1. Network Stack: Infrastructure such as VPC, Subnet, SecurityGroup
  2. Application Stack: ECS Cluster, Task Definition, Service, IAM Role
  3. Load Balancer Stack: ALB, Listener, Target Group
  4. API Gateway Stack: Configuration of API Gateway connected to ALB

4.1 Network Stack

Included Templates:

Key Resources:

Design Points:


4.2 Application Stack

Included Templates:

Key Resources:

Design Points:


4.3 Load Balancer Stack

Included Templates:

Key Resources:

Design Points:


4.4 API Gateway Stack

Included Templates:

Key Resources:

Design Points:


This separation provides both deployment unit flexibility and ease of maintenance.
When modifying specific resources, there’s no need to redeploy the entire structure; instead, the relevant stack can be selectively executed in GitHub Actions.
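Selective execution can be approximated with path filters, so that editing only one stack's templates triggers only that stack's workflow. A sketch, assuming a hypothetical directory layout:

```yaml
# alb-stack.yml (hypothetical paths)
on:
  push:
    branches: [main]
    paths:
      - "cloudformation/alb/**"
  workflow_dispatch:
```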


GitHub Actions-based Deployment Automation (CI/CD)

This project is designed so that infrastructure can be deployed without any manual CLI commands,
centered on a fully automated deployment flow built around GitHub Actions.


Workflow Configuration

Each stack is managed with a separate GitHub Actions workflow file, following these rules:

| Stack | Workflow File | Trigger Condition |
| --- | --- | --- |
| VPC | vpc-stack.yml | workflow_dispatch or push |
| ALB | alb-stack.yml | Runs automatically after the VPC stack completes |
| ECS | ecs-stack.yml | Runs automatically after the ALB stack completes |
| API Gateway | apigw-stack.yml | Manual run, or chained after ECS completion |

Trigger Structure (Dependency Connection)

Workflows are set to automatically execute in the following order:

vpc-stack
   ↓ (on success)
alb-stack
   ↓
ecs-stack
   ↓
apigw-stack (optional connection)

This structure allows:


Environment Variables and Secret Management

All AWS credentials and parameters are stored in GitHub repository Secrets and referenced within Actions.

Example:

env:
  AWS_REGION: ${{ secrets.AWS_REGION }}
  AWS_ACCOUNT_ID: ${{ secrets.AWS_ACCOUNT_ID }}

Secrets items:


Manual Execution (workflow_dispatch)

For the development environment, workflow_dispatch is set up to allow manual execution of the entire stack.
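A hedged sketch of such a manual trigger; the input name and environment choices are hypothetical, not taken from the project:

```yaml
on:
  workflow_dispatch:
    inputs:
      environment:
        description: "Target environment"
        type: choice
        options: [dev, staging]
        default: dev
```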

This structure allows for GitOps operations that do not depend on local development environments and control infrastructure solely through version-managed YAML files.


Deployment Flow Summary

This project employs GitHub Actions and CloudFormation to
fully automate the flow from infrastructure setup to application execution.


Overall Flow

Push code to GitHub
       ↓
Execute GitHub Actions workflow
       ↓
Create/Update CloudFormation stack
       ↓
Deploy in order: VPC → ALB → ECS → API Gateway
       ↓
Check ECS Task running status and connect to ALB
       ↓
Finally, access the service via the API Gateway Endpoint

Actual Flow Example

  1. A user pushes code changes to the main branch.
  2. GitHub Actions runs .github/workflows/vpc-stack.yml.
  3. Subsequent workflows like alb-stack.yml, ecs-stack.yml are executed in sequence based on workflow_run conditions.
  4. The ECS Task is registered with the Target Group and connected, with ALB distributing the traffic.
  5. Finally, API Gateway connects to ALB to receive external requests.

Post-Deployment Verification


In Summary

This structure is characterized by:


Trials and Improvements (What Went Wrong & Fixed)

Throughout the project, we ran into trial and error caused not only by code, but also by dependency and timing issues among AWS resources and by permission problems.
This section outlines some of the significant cases.


  1. GitHub Actions Trigger Sequence Issues
    Problem: The workflow_run trigger did not function as expected.
    Cause: Mismatch in previous workflow names or missing conclusion conditions.
    Solution:
on:
  workflow_run:
    workflows: ["vpc-stack"]
    types:
      - completed

The value in workflows must exactly match the name: field of the upstream workflow (not its filename), or the subsequent workflow will never run.
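Because types: [completed] fires on failed upstream runs as well, a guard on the upstream conclusion is also needed. A sketch (the job contents are hypothetical):

```yaml
on:
  workflow_run:
    workflows: ["vpc-stack"]
    types: [completed]

jobs:
  deploy:
    # Skip this run unless the upstream workflow actually succeeded
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    runs-on: ubuntu-latest
    steps:
      - run: echo "vpc-stack succeeded, deploying the next stack"
```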


2. CloudFormation Wait Delays Due to ECS Service and TargetGroup Health Checks

Problem: During CloudFormation stack creation, the ECS service creation step was delayed for several minutes waiting for the TargetGroup’s health check results.

Cause: The ECS service does not reach "CREATE_COMPLETE" until at least one Task passes the TargetGroup health check.

Impact:

Solution:

HealthCheckPath: "/health"
Matcher:
  HttpCode: "200"
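Putting those properties in context, here is a hedged AWS::ElasticLoadBalancingV2::TargetGroup fragment with shortened check intervals; the values and the imported export name are illustrative, not the project's actual settings:

```yaml
TargetGroup:
  Type: AWS::ElasticLoadBalancingV2::TargetGroup
  Properties:
    VpcId: !ImportValue dev-vpc-id        # hypothetical export name
    Port: 80
    Protocol: HTTP
    TargetType: ip                        # required for Fargate tasks
    HealthCheckPath: "/health"
    HealthCheckIntervalSeconds: 15        # check more often than the 30s default
    HealthyThresholdCount: 2              # mark healthy after 2 passing checks
    Matcher:
      HttpCode: "200"
```

Shorter intervals and a lower healthy threshold reduce how long CloudFormation waits before the service stabilizes.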

3. ECS TaskDefinition Not Updated After ECR Deployment (Using latest Tag)

Problem: Even after pushing a new image to ECR, the ECS service continued to run with the previous image.

Cause:
Even though the image field in the ECS TaskDefinition points at the :latest tag,
the repository:latest URI string itself never changes, so CloudFormation sees no difference, creates no new revision, and skips the deployment.

Temporary Solution: Use force-new-deployment

Added the following script to the CI workflow (push-ecr-hello),
configuring forced redeployment of the ECS service after ECR image push:

aws ecs update-service \
  --cluster dev-ecs-cluster \
  --service dev-ecs-service-hello \
  --force-new-deployment \
  --region $AWS_REGION
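In a CI step, this can be followed by a wait so the job fails fast if the new tasks never become stable. A sketch, reusing the cluster and service names from the command above:

```yaml
- name: Force ECS redeploy and wait for stability
  run: |
    aws ecs update-service \
      --cluster dev-ecs-cluster \
      --service dev-ecs-service-hello \
      --force-new-deployment \
      --region "$AWS_REGION"
    # Block until the service reaches a steady state (or time out and fail the job)
    aws ecs wait services-stable \
      --cluster dev-ecs-cluster \
      --services dev-ecs-service-hello \
      --region "$AWS_REGION"
```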

Future Improvement Direction: Immutable Tag Strategy
This approach requires a forced restart on every deployment, and because the change is invisible to CloudFormation,
it still needs improvement:

Example Strategy:

image: ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/dev-hello:${GITHUB_SHA}
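Under this strategy, the CI job builds and pushes one uniquely tagged image per commit. A hedged sketch (the repository name and step wiring are hypothetical; github.sha is GitHub Actions' built-in commit SHA):

```yaml
- name: Build and push an immutably tagged image
  run: |
    IMAGE="${{ secrets.AWS_ACCOUNT_ID }}.dkr.ecr.${{ secrets.AWS_REGION }}.amazonaws.com/dev-hello:${{ github.sha }}"
    docker build -t "$IMAGE" .
    docker push "$IMAGE"
```

Because the image URI now changes on every commit, CloudFormation detects the difference, creates a new TaskDefinition revision, and rolls the service without a forced restart.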

Lesson:
The latest tag is convenient for local testing,
but it makes precise version tracking and change detection difficult in production deployments.
Ideally, the CI/CD pipeline should adopt an immutable tag strategy.


Next Steps

1. Expansion to Multi-Service/Multi-Environment Configuration

2. Integration with Monitoring and Notifications


In Conclusion

We hope this article provides helpful insights for those with similar goals.


Tags: Project