AWS Observability vs OpenTelemetry: What I Learned

Why I Explored OpenTelemetry

For the past 9 years, every AWS project I have worked on used CloudWatch and X-Ray. It was automatic: spin up services, and observability came built in. No complaints.

Then came a project with a twist: the application needed to run across multiple clouds. AWS-native observability simply wasn't an option.

That led me to explore alternatives — both paid and open-source. After analyzing several options, we landed on OpenTelemetry. The paid tools were impressive, but we didn't want to trade one vendor lock-in for another.

What I Still Like About CloudWatch/X-Ray

Let me be clear: CloudWatch and X-Ray are excellent tools. Here's where they shine:

Zero setup friction. You can get up and running in no time. Almost no code required — everything works out of the box.

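Native integration. CloudWatch receives metrics from Lambda, API Gateway, DynamoDB, and most other AWS services without any configuration, although some signals (X-Ray tracing or API Gateway access logs, for example) still have to be switched on. For the most part, it just works.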

Perfect for getting started. When you're building an MVP or early-stage product, you don't need a complex observability pipeline. You need to ship. CloudWatch lets you do that.

Where CloudWatch Falls Short

After years of using it, I've hit some consistent pain points:

Customization is hard. The visualization is rigid. Widget limitations and cross-account/cross-region constraints get frustrating as your system grows.

Connecting the dots is painful. Correlating metrics, logs, and traces in a single view requires significant configuration and code. It's possible, but not seamless.
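
To make "correlating" concrete: on any backend, the usual first step is stamping every log line with the ID of the trace it belongs to, so a log search and a trace view can be joined on that ID. Here is a minimal standard-library sketch. The names `current_trace_id`, `TraceIdFilter`, and the `checkout-demo` logger are hypothetical, and in a real service the tracing SDK would supply the trace ID rather than a hand-set context variable.

```python
import contextvars
import io
import logging

# Hypothetical holder for the active trace ID; a tracing SDK would normally manage this.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the active trace ID to every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True  # never drop the record, only enrich it


def build_logger(stream):
    """Return a logger whose lines carry a trace_id field."""
    logger = logging.getLogger("checkout-demo")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()
log = build_logger(stream)
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
log.info("payment authorized")
print(stream.getvalue(), end="")
```

Once every line carries `trace_id=...`, jumping from a slow trace to its logs becomes a simple filter in whichever log store you use.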

These aren't deal-breakers for simple architectures. But when you're running distributed systems across environments, they start to compound.

Setting Up OpenTelemetry

For our stack, we chose:

Prometheus for metrics
Jaeger for traces
OpenSearch for logs
Grafana for visualization

OpenTelemetry has become an industry standard with strong community support and integrations with virtually every observability tool on the market.

What surprised me: The configuration is simple yet powerful. It covers not just the application layer but the underlying system as well. OpenTelemetry exports data to specialized tools (Prometheus, Jaeger, OpenSearch), and Grafana ties it all together with end-to-end request lifecycle visualization.
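
The fan-out described above can be sketched as data. The dict below mirrors the general layout of an OpenTelemetry Collector configuration (receivers, processors, exporters, and one pipeline per signal). It is not our actual config: the endpoints are hypothetical placeholders, the exporter names assume the Collector contrib distribution, and exporter-specific settings are mostly omitted. The helper functions are my own illustration, not part of any OTel tooling.

```python
# Shape of a Collector config expressed as a Python dict (normally written as YAML).
# Endpoints are hypothetical; exporter-specific settings are intentionally left out.
collector_config = {
    "receivers": {"otlp": {"protocols": {"grpc": {}, "http": {}}}},
    "processors": {"batch": {}},
    "exporters": {
        "otlp/jaeger": {"endpoint": "jaeger.example.internal:4317"},
        "prometheusremotewrite": {
            "endpoint": "http://prometheus.example.internal:9090/api/v1/write"
        },
        "opensearch": {},  # settings omitted in this sketch
    },
    "service": {
        "pipelines": {
            "traces": {
                "receivers": ["otlp"],
                "processors": ["batch"],
                "exporters": ["otlp/jaeger"],
            },
            "metrics": {
                "receivers": ["otlp"],
                "processors": ["batch"],
                "exporters": ["prometheusremotewrite"],
            },
            "logs": {
                "receivers": ["otlp"],
                "processors": ["batch"],
                "exporters": ["opensearch"],
            },
        }
    },
}


def undefined_components(config):
    """List (pipeline, section, name) for pipeline references that are never defined."""
    missing = []
    for pipeline, stages in config["service"]["pipelines"].items():
        for section in ("receivers", "processors", "exporters"):
            for name in stages.get(section, []):
                if name not in config.get(section, {}):
                    missing.append((pipeline, section, name))
    return missing


def signal_routes(config):
    """Map each signal (traces, metrics, logs) to the exporters it is sent to."""
    return {
        pipeline: list(stages["exporters"])
        for pipeline, stages in config["service"]["pipelines"].items()
    }


print(signal_routes(collector_config))
print(undefined_components(collector_config))
```

The point of the shape is that applications only ever speak OTLP to the collector; swapping a backend means editing one exporter entry, not the application code.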

Setup time: A few hours to get a working proof-of-concept. We've since automated the entire setup with Ansible, making it repeatable across environments.

To be clear: a few hours gets you a PoC. Production-ready deployment — handling high-cardinality metrics, tuning collectors, configuring retention, setting up alerting — is a multi-week effort. Don't underestimate it.
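
High cardinality deserves a quick illustration, because it is the item on that list that bites first. In a Prometheus-style model, each distinct combination of label values is its own time series, so the worst case for one metric is the product of the distinct values per label. A minimal sketch with hypothetical label counts:

```python
import math


def worst_case_series(label_values):
    """Upper bound on time series for one metric: product of distinct values per label."""
    return math.prod(label_values.values())


# Hypothetical label sets for a single HTTP request counter.
bounded = {"service": 20, "method": 5, "status_code": 8}
with_user_id = {**bounded, "user_id": 10_000}

print(worst_case_series(bounded))       # 800
print(worst_case_series(with_user_id))  # 8000000
```

One unbounded label such as a user ID turns a few hundred series into millions, which is why identifiers like that usually belong on traces or logs rather than on metric labels.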

The Hybrid Approach: Managed OTel on AWS

There's a middle ground worth mentioning: AWS now heavily supports OpenTelemetry.

AWS Distro for OpenTelemetry (ADOT) lets you instrument with vendor-neutral OTel code but route telemetry to Amazon Managed Service for Prometheus (AMP) and Amazon Managed Grafana (AMG).

This gives you:

Vendor-neutral instrumentation (no code lock-in)
Managed infrastructure (no self-hosting headaches)
AWS-native billing and support

For teams who want portability at the application layer but don't want to manage Prometheus/OpenSearch clusters, this is the smart middle path.

We chose full self-hosting because our multi-cloud requirement included non-AWS environments. But if you're AWS-primary with future portability concerns, ADOT + AMP + AMG is worth evaluating.

The Real Comparison

Here's how the two approaches stack up in practice:

Dimension                CloudWatch / X-Ray   OpenTelemetry
Setup time               Almost none          Few hours (PoC) / weeks (production)
Customization            Hard                 Easy
SaaS invoice             $$$                  $
Total Cost of Ownership  $$                   $$ (shifts to compute + engineering)
Multi-cloud support      No                   Yes
Debugging experience     Easy                 Easy
Team learning curve      Easy                 Easy

A note on cost: OpenTelemetry software is free, but self-hosting isn't. Running OpenSearch clusters, Prometheus instances, and EBS volumes for retention can get expensive at scale — not to mention engineering hours for index management, patching, and scaling. OTel lowers your SaaS invoice, but shifts the cost to compute and engineering time. It's a strategic reinvestment, not a simple cost-saving.
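
A rough way to see how retention drives the storage part of that bill: raw storage grows with daily ingest, retention days, and the number of copies kept. The sketch below is back-of-the-envelope arithmetic only. The 50 GB/day workload is hypothetical, and the formula ignores compression, index overhead, and free-space headroom, all of which matter in a real sizing exercise.

```python
def storage_needed_gb(ingest_gb_per_day, retention_days, replicas=1):
    """Raw storage for a retention window: daily ingest * days * copies (primary + replicas)."""
    if ingest_gb_per_day < 0 or retention_days < 0 or replicas < 0:
        raise ValueError("inputs must be non-negative")
    return ingest_gb_per_day * retention_days * (1 + replicas)


# Hypothetical workload: 50 GB of logs per day, one replica of each shard.
for days in (7, 30, 90):
    print(days, storage_needed_gb(50, days, replicas=1))
```

Going from 7 to 90 days of retention multiplies the storage footprint by roughly thirteen, before any engineering time is counted.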

Where OpenTelemetry wins: cloud-agnostic setups without vendor lock-in, with the same monitoring capabilities for on-premises and internal applications. When we needed identical observability for internal applications running on on-prem servers, the OTel stack worked flawlessly.

Where CloudWatch wins: Quick deployment on AWS when you want an efficient, no-code monitoring solution.

The Operational Reality

Running your own observability stack isn't free. Here's what I've learned:

Index management is painful. Managing indices for logs and traces in OpenSearch requires ongoing attention. It's not set-and-forget.
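
Much of that ongoing attention is retention housekeeping. As a sketch of the logic involved, assuming daily indices named with a hypothetical `logs-YYYY.MM.DD` pattern, the function below picks the indices that have aged out. It only computes names and talks to no cluster; OpenSearch's Index State Management feature can automate the same idea with policies.

```python
from datetime import date, datetime, timedelta


def expired_indices(index_names, today, retention_days, prefix="logs-"):
    """Return daily indices (prefix + YYYY.MM.DD) older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in index_names:
        if not name.startswith(prefix):
            continue
        try:
            day = datetime.strptime(name[len(prefix):], "%Y.%m.%d").date()
        except ValueError:
            continue  # skip names that do not follow the daily pattern
        if day < cutoff:
            expired.append(name)
    return sorted(expired)


indices = [
    "logs-2024.01.01",
    "logs-2024.01.20",
    "logs-2024.02.01",
    "traces-2024.01.01",
    "logs-current",
]
print(expired_indices(indices, today=date(2024, 2, 10), retention_days=14))
```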

Reliability requires planning. Early on, Prometheus stopped accepting requests due to high call volume. Once we started batching requests, it stabilized. But it was a reminder: you're now responsible for your monitoring infrastructure.
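
The idea behind batching is simple: buffer individual data points and send them in groups, so the backend sees a fraction of the calls. The sketch below shows only that idea, with a hypothetical `BatchingSender` class and a plain list standing in for the network call. It is not the code we ran; in an OTel pipeline the Collector's `batch` processor fills this role.

```python
class BatchingSender:
    """Buffer items and hand them to `send` in batches instead of one call per item."""

    def __init__(self, send, max_batch_size=100):
        if max_batch_size < 1:
            raise ValueError("max_batch_size must be at least 1")
        self._send = send
        self._max_batch_size = max_batch_size
        self._buffer = []

    def add(self, item):
        """Queue one item and flush automatically when the batch is full."""
        self._buffer.append(item)
        if len(self._buffer) >= self._max_batch_size:
            self.flush()

    def flush(self):
        """Send whatever is buffered; a no-op when the buffer is empty."""
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._send(batch)


calls = []  # stands in for requests made to the backend
sender = BatchingSender(calls.append, max_batch_size=50)
for i in range(120):
    sender.add({"metric": "requests_total", "value": i})
sender.flush()  # push the final partial batch

print(len(calls), [len(batch) for batch in calls])  # 3 [50, 50, 20]
```

Here 120 data points reach the backend as 3 calls instead of 120. A production batcher would also flush on a timer so a quiet service does not hold data indefinitely.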

Monitoring the monitor. We use Grafana alerts to notify us of any downtime in the observability pipeline itself. Yes, you need to monitor your monitoring.
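
The check behind such an alert is usually a staleness test: has each pipeline component reported recently? Here is a minimal sketch of that logic with hypothetical component names and a hypothetical two-minute threshold. In practice the same condition would live in a Grafana alert rule over a heartbeat or "up" metric rather than in application code.

```python
def stale_components(last_seen, now, max_age_seconds=120):
    """Return components whose last heartbeat is older than the allowed age."""
    return sorted(
        name for name, timestamp in last_seen.items()
        if now - timestamp > max_age_seconds
    )


now = 1_700_000_000  # fixed Unix timestamp so the example is reproducible
last_seen = {
    "otel-collector": now - 15,
    "prometheus": now - 30,
    "jaeger": now - 600,   # silent for ten minutes
    "opensearch": now - 45,
}
print(stale_components(last_seen, now))  # ['jaeger']
```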

Cost comparison: OpenTelemetry is cheaper than most paid solutions in terms of licensing, but you have to factor in compute, storage, and engineering time. There are no license-imposed limits on application count, call volume, or data retention; retention depends entirely on your needs and infrastructure budget. Maintenance has its overhead, but so does running any production system.

Team Adaptation

The team was happy. Using the same tooling everywhere meant consistent knowledge across environments. Same dashboards, same queries, same debugging workflows — whether troubleshooting AWS, another cloud, or on-prem.

Skills required: Prometheus and Grafana experience was important for our team. Jaeger and OpenSearch were easier to pick up.

Small teams: It depends entirely on the application's architecture and roadmap. A distributed, multi-cloud application in maintenance mode can actually be managed by a small team if the automation is solid. However, for a 2-3 person team building a fresh AWS-only MVP, the overhead of OTel might be a distraction.

My Decision Framework

When a CTO asks me "CloudWatch or OpenTelemetry?", I ask three questions:

Where will your applications run today? AWS only, or other clouds and on-prem as well?
Will AWS stay your only target? Or is a second environment on the roadmap?
Are you willing to invest in monitoring infrastructure right now?

My rule of thumb:

If you're targeting AWS only and it's a new product, the AWS observability stack gets you up and running in no time.
If you want future portability without self-hosting, consider the hybrid approach (ADOT + AMP + AMG).
If you have a mature product with multiple microservices, multi-cloud requirements, and don't want vendor lock-in, choose full OTel.
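
The rule of thumb can be written down as a small function. This is only a sketch of my reading of the three rules: the function name, parameters, and return strings are made up for illustration, and the question about willingness to invest is left out because the rules do not spell out how it changes the outcome.

```python
AWS_NATIVE = "CloudWatch / X-Ray"
HYBRID = "ADOT + AMP + AMG"
FULL_OTEL = "self-hosted OpenTelemetry stack"


def recommend_stack(multi_cloud_now, wants_future_portability):
    """Map the rule of thumb to a recommendation.

    multi_cloud_now: the product already runs (or must run) outside AWS.
    wants_future_portability: AWS-only today, but portability matters later.
    """
    if multi_cloud_now:
        return FULL_OTEL
    if wants_future_portability:
        return HYBRID
    return AWS_NATIVE


print(recommend_stack(multi_cloud_now=False, wants_future_portability=False))
print(recommend_stack(multi_cloud_now=False, wants_future_portability=True))
print(recommend_stack(multi_cloud_now=True, wants_future_portability=True))
```

Real decisions have more inputs (team size, product maturity, budget), so treat the function as a conversation starter rather than a verdict.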

For my next greenfield project: It depends. For serverless development, the AWS observability stack is still a great fit. But if I'm building a distributed system with multi-cloud support, OpenTelemetry will be my default choice.

The Future of Observability

Most major commercial monitoring tools now accept OpenTelemetry data. That tells you where the industry is heading. The community support is massive and growing.

OpenTelemetry is becoming the standard — not because it's free, but because it solves real problems around portability and vendor independence.

The Unbeatable Value of Traces

If I could only have one observability signal — logs, metrics, or traces — I'd choose traces without hesitation.

Here's why: as systems evolve from simple APIs into distributed orchestration layers (Kubernetes, event-driven pipelines, multi-service workflows), logs lose context rapidly. A log line tells you something happened. A trace tells you why, where, and how long it took across every hop.

For debugging distributed systems, tracing is irreplaceable.
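
What lets a trace follow a request "across every hop" is context propagation: each service passes along the trace ID and mints a new span ID for its own work. The sketch below builds and parses headers in the W3C Trace Context `traceparent` format (`version-traceid-spanid-flags`). The helper names are hypothetical, validation is deliberately simplified, and in practice the OTel SDK handles this for you.

```python
import secrets


def new_traceparent():
    """Start a new trace: random 16-byte trace ID, random 8-byte span ID, sampled flag."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def parse_traceparent(header):
    """Split a traceparent header into its four fields (simplified validation)."""
    parts = header.split("-")
    if len(parts) != 4:
        raise ValueError("expected version-traceid-spanid-flags")
    version, trace_id, span_id, flags = parts
    if len(trace_id) != 32 or len(span_id) != 16:
        raise ValueError("unexpected trace ID or span ID length")
    return {"version": version, "trace_id": trace_id, "span_id": span_id, "flags": flags}


def child_traceparent(parent_header):
    """Header for the next hop: same trace ID, fresh span ID."""
    parent = parse_traceparent(parent_header)
    return f"{parent['version']}-{parent['trace_id']}-{secrets.token_hex(8)}-{parent['flags']}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
print(parse_traceparent(incoming)["trace_id"])
print(parse_traceparent(outgoing)["trace_id"])  # same trace ID on the next hop
```

Because the trace ID survives every hop while span IDs change, the backend can stitch the spans from all services into one timeline, which is exactly the context a standalone log line lacks.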

Final Thoughts

Use CloudWatch/X-Ray when you need to hit the ground running on AWS with zero setup friction. Use OpenTelemetry when you need a mature, cloud-agnostic standard that grows with your multi-cloud or on-prem architecture without vendor lock-in.

One thing most people get wrong about observability: it's not a silver bullet. It gives you insight, but at the end of the day, it's still a developer's responsibility to write performant code.

Any regrets going the OpenTelemetry route? None so far.

What drove your observability strategy? Running OpenTelemetry in production — how are you managing collector infrastructure and reliability? I'd love to hear your experience.

Tags: AWS, OpenTelemetry, Observability, CloudWatch, Distributed Systems

Ashish Suman

// senior_technical_architect · 18y

Building production systems since 2007. I write about what breaks at 3 AM — and the architecture decisions that prevent it.
