Dev.Chan64's Blog

gpt-4-turbo has translated this article into English.


Enhancing Observability with CloudWatch and Introducing OpenTelemetry and ADOT

When improving observability in a cloud environment, the first question is usually not “what and how much to log.” The more fundamental problem is failing to separate the observability responsibilities of the infrastructure from those of the application, which increases both operational complexity and cost.

This article describes how we reorganized the observability setup of an AWS-based backend, connecting CloudWatch-centered infrastructure monitoring with internal tracing based on OpenTelemetry.

Problem Definition

The initial state presented four major problems:

The critical decision here was not “let’s log more” but rather “let’s divide the observability responsibilities between infrastructure and application.”

Separation of Observability Responsibilities

The principles set for this improvement were simple:

Under these principles, there is no need to recreate infrastructure-level metrics such as request counts or latency in application logs. Conversely, the causes of business failures and internal bottlenecks cannot be determined from infrastructure metrics alone.

Overall Architecture

The restructured observability framework is as follows:

flowchart LR
  Client["Client"] --> APIGW["Amazon API Gateway"]
  APIGW --> ALB["Application Load Balancer"]
  ALB --> ECS["ECS Fargate Service"]

  ECS --> App["Backend Application"]
  ECS --> ADOT["ADOT Collector Sidecar"]

  App --> CWLogs["CloudWatch Logs"]
  App --> OTLP["OTLP Trace Export"]
  OTLP --> ADOT
  ADOT --> XRay["AWS X-Ray"]

  APIGW --> CW["CloudWatch Metrics"]
  ALB --> CW
  ECS --> CW
  CWLogs --> Dashboard["CloudWatch Dashboard / Logs Insights"]
  CW --> Dashboard
  XRay --> Dashboard

The key points are two-fold:

This structure lets us use CloudWatch and X-Ray together within the familiar AWS operational workflow, while keeping application instrumentation on the OpenTelemetry standard.

Enhancing Infrastructure Observability with CloudWatch

The first steps included enhancing AWS infrastructure observability points:

The goal here was to ensure that application logs do not replace the visibility of infrastructure status.

For example, API Gateway and ALB know metrics such as request counts, 4xx/5xx rates, and edge latency more accurately than the application does. Duplicating these metrics in application logs would create a second, less reliable source for the same numbers and blur the operational baseline.
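As a minimal sketch of reading such a number from the infrastructure side rather than from application logs, the function below builds CloudWatch `GetMetricData` queries for an ALB target 5xx rate. The function name, the query IDs, and the load balancer value are illustrative assumptions, not part of the original setup.

```python
from typing import Any, Dict, List


def alb_5xx_rate_queries(load_balancer: str, period_seconds: int = 60) -> List[Dict[str, Any]]:
    """Build GetMetricData queries for the ALB target 5xx rate, in percent.

    `load_balancer` is the ALB dimension value (hypothetical example:
    "app/example-alb/0123456789abcdef").
    """

    def alb_sum(query_id: str, metric_name: str) -> Dict[str, Any]:
        # One raw ALB metric, summed per period; used only as expression input.
        return {
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApplicationELB",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
                },
                "Period": period_seconds,
                "Stat": "Sum",
            },
            "ReturnData": False,
        }

    return [
        alb_sum("requests", "RequestCount"),
        alb_sum("errors", "HTTPCode_Target_5XX_Count"),
        {
            # Metric math over the two queries above; only this series is returned.
            "Id": "error_rate",
            "Expression": "100 * errors / requests",
            "Label": "Target 5xx rate (%)",
            "ReturnData": True,
        },
    ]
```

With boto3 installed, the returned list could be passed as `MetricDataQueries` to the CloudWatch client's `get_metric_data` call together with `StartTime` and `EndTime`.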

Log Policy: Lower Defaults and Expand Only for Investigation

Strengthening observability does not mean logging more. CloudWatch Logs costs are usually driven more by logs of normal requests and repeated successes than by error logs.

Therefore, the log policy was adjusted as follows:

This approach clearly has its benefits:

We also removed logs for repeated successes and logs carrying large payloads, and structured failure logs to include only identifiers and context. Cost optimization depended more on the form and frequency of logs than on log levels.
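The idea of a failure log that carries identifiers and context but no payload can be sketched as follows. The key names in `IDENTIFIER_KEYS`, the `failure_context` helper, and the 200-character cap are hypothetical choices for illustration; the original post does not list the actual fields.

```python
from typing import Any, Dict

# Hypothetical identifier fields; replace with the identifiers your service uses.
IDENTIFIER_KEYS = ("request_id", "user_id", "order_id")

# Assumed cap so one long error message cannot inflate a log line.
MAX_ERROR_MESSAGE_CHARS = 200


def failure_context(payload: Dict[str, Any], error: Exception, dependency: str) -> Dict[str, Any]:
    """Return the fields worth logging for a failure, without the request body."""
    identifiers = {key: payload[key] for key in IDENTIFIER_KEYS if key in payload}
    return {
        "error_type": type(error).__name__,
        "error_message": str(error)[:MAX_ERROR_MESSAGE_CHARS],
        "dependency": dependency,
        **identifiers,
    }
```

The resulting dictionary can be attached to a structured log record; everything else in the payload stays out of CloudWatch Logs.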

Request Correlation and Structured Logging

The next step involved aligning request correlation IDs.

In actual operations, knowing “in which request, under what context, and during which dependency call did a failure occur” matters more than simply knowing “an error occurred.” To achieve this, we maintained:

This setup makes it possible to query all server logs for a single request in CloudWatch Logs Insights, and it simplifies integration when distributed tracing is added later.
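A minimal sketch of this kind of correlation, using only the Python standard library: the request ID is stored in a context variable and every log line is emitted as JSON with that ID attached. The names `request_id_var`, `begin_request`, `JsonFormatter`, and the `context` attribute are assumptions for illustration, not the original implementation.

```python
import contextvars
import json
import logging
import uuid
from typing import Optional

# Holds the correlation ID for the request currently being handled.
request_id_var = contextvars.ContextVar("request_id", default="-")


def begin_request(incoming_id: Optional[str] = None) -> str:
    """Reuse the caller's correlation ID if present, otherwise generate one."""
    request_id = incoming_id or uuid.uuid4().hex
    request_id_var.set(request_id)
    return request_id


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object including the current request ID."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "request_id": request_id_var.get(),
        }
        # Optional context passed via logger.error(..., extra={"context": {...}}).
        context = getattr(record, "context", None)
        if context:
            entry["context"] = context
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry, default=str)
```

Because CloudWatch Logs Insights discovers fields in JSON log lines, a query along the lines of `fields @timestamp, level, message | filter request_id = "req-123" | sort @timestamp asc` should then return every line for one request.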

Why Choose OpenTelemetry

We chose OpenTelemetry as the standard for internal tracing not simply because it’s the latest standard, but because we did not want to tie our observability data to a specific APM product’s SDK.

Choosing OpenTelemetry offers several advantages:

In this phase, we did not attempt to solve logging, metrics, and tracing all at once. Instead, we started with tracing to ensure the actual request flow was visible.

Why Choose ADOT Collector Sidecar

After integrating OpenTelemetry, we considered sending traces to X-Ray directly from the application. In the end, we judged it more appropriate to run the ADOT Collector as a sidecar, for these reasons:

We chose a sidecar for the initial structure for similar reasons. Because the application reaches the Collector through a local endpoint within the same task, the initial setup cost is low, and the Collector can be moved to a separate service later if needed.
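To make the local endpoint concrete, here is a sketch of the environment variables an application container might receive so that its OpenTelemetry SDK exports traces to a Collector in the same task. The variable names are standard OpenTelemetry SDK settings; the helper name, the port (4317, the default OTLP gRPC port), and the sampler choice are assumptions rather than the original configuration.

```python
from typing import Dict


def otel_sidecar_env(service_name: str, collector_port: int = 4317,
                     sample_ratio: float = 1.0) -> Dict[str, str]:
    """Environment for an app container exporting traces to a same-task Collector."""
    if not 0.0 <= sample_ratio <= 1.0:
        raise ValueError("sample_ratio must be between 0 and 1")
    return {
        "OTEL_SERVICE_NAME": service_name,
        # Traces only: metrics and logs stay on CloudWatch in this phase.
        "OTEL_TRACES_EXPORTER": "otlp",
        "OTEL_METRICS_EXPORTER": "none",
        "OTEL_LOGS_EXPORTER": "none",
        "OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
        # Containers in one Fargate task share a network namespace,
        # so the sidecar is reachable on localhost.
        "OTEL_EXPORTER_OTLP_ENDPOINT": f"http://localhost:{collector_port}",
        "OTEL_TRACES_SAMPLER": "parentbased_traceidratio",
        "OTEL_TRACES_SAMPLER_ARG": str(sample_ratio),
    }
```

Moving to a standalone Collector service later would then mostly mean changing `OTEL_EXPORTER_OTLP_ENDPOINT`, leaving the application code untouched.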

First Checks in OpenTelemetry PoC

Before full deployment, we first confirmed “does it really work?”

Questions addressed in the PoC included:

The purpose of the PoC was risk mitigation rather than adding features. Introducing observability tooling often runs into more problems with the runtime environment, package resolution, and deployment structure than with the code itself.

Lessons Learned During Deployment

1. Observability does not end with replacing loggers

Actual operational observability only comes together when the following are aligned:

Changing the logger library is just the beginning.

2. Costs are more sensitive to the form of logs than to log levels

The most effective way to reduce costs is not by reducing error logs, but by minimizing:

To reduce costs while maintaining operability, it’s more effective to aggregate or sample success logs while retaining failure logs.
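One way to sample success logs while retaining failure logs is a standard-library logging filter. This is a sketch under assumptions: the class name is hypothetical, and it treats WARNING and above as "failure" records, a threshold the original post does not specify.

```python
import logging
import random
from typing import Optional


class SuccessSamplingFilter(logging.Filter):
    """Keep every WARNING+ record; keep only a fraction of lower-level records."""

    def __init__(self, sample_rate: float, rng: Optional[random.Random] = None) -> None:
        super().__init__()
        if not 0.0 <= sample_rate <= 1.0:
            raise ValueError("sample_rate must be between 0 and 1")
        self.sample_rate = sample_rate
        self._rng = rng or random.Random()

    def filter(self, record: logging.LogRecord) -> bool:
        # Failure-side records are never dropped.
        if record.levelno >= logging.WARNING:
            return True
        # Lower-level records pass with probability sample_rate.
        return self._rng.random() < self.sample_rate
```

Attaching it with `handler.addFilter(SuccessSamplingFilter(0.05))` would keep roughly 5% of INFO and DEBUG lines on that handler while leaving warnings and errors intact.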

3. Infrastructure and application observability must be separated

AWS handles infrastructure metrics well. Applications handle internal contexts well. Clearly defining this boundary avoids redundant metrics and excessive logging.

4. The success of OpenTelemetry depends on platform readiness

OTel is not just a few lines of application code; it requires a ready platform. To create real operational value, the platform must provide everything from Collector deployment, IAM permissions, and exporter paths to dashboard integration.

Current Stage Outcomes

Through this initiative, we have achieved the following:

It’s not over yet. However, we are now ready to move from “should we log more?” to “which actual request flows are creating bottlenecks?”

Next Steps

The future tasks are relatively clear:

Observability is not a one-time setup; it is a platform capability that keeps being refined through operational experience. This work reset our starting point on CloudWatch and OpenTelemetry.


Tags: Project