
Zero-Noise Monitoring: Designing Meaningful Alerts


Introduction

When managing a complex application, it's easy to fall into the trap of "silent monitoring": setting up alerts that exist but don't actually tell you anything useful about your service's health. The result is alert fatigue and confusion during incidents. Instead, we need to focus on meaningful monitoring: designing monitors that detect only your service's own failures while filtering out upstream noise.

1. If You're Monitoring ALB 5xx in Datadog, Use elb_5xx

Why target_5xx wasn't the right fit in our case

There's nothing inherently wrong with aws.applicationelb.httpcode_target_5xx.
But in our setup, we already have application-level monitors. This led to double alerts for the same incident—which became noise.

If you don't have app-side monitoring, then using target_5xx can still be a valid option.
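If you do take that route, a hedged sketch of the query (the filter and threshold are placeholders, mirroring the ALB-only example below):

sum:aws.applicationelb.httpcode_target_5xx{...}.as_rate() > 1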

On the other hand, aws.applicationelb.httpcode_elb_5xx detects only failures from the ALB itself (TLS, connection limits, capacity issues, misconfiguration, etc.). By separating roles, we significantly reduce alert noise.

| Category | `httpcode_target_5xx` | `httpcode_elb_5xx` |
| --- | --- | --- |
| What it detects | 5xx errors returned by the app | 5xx errors returned by the ALB itself |
| Typical root cause | App bugs, downstream service failures | TLS errors, connection caps, capacity issues, ALB misconfig |
| Relation to app monitoring | May cause double alerts if the app has its own monitors | Detects failures independent of the app |
| Best use case | The app doesn't have monitors, or you want a unified view via the ALB | You want to clearly separate app and infrastructure alerts |
| Noise level | High (if app monitors exist) | Low |

Datadog query example (ALB-only failure detection):

sum:aws.applicationelb.httpcode_elb_5xx{...}.as_rate() > 1
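If you manage monitors with Terraform (the ${local.threshold} interpolation later in this post suggests so), a minimal sketch of this monitor using the Datadog provider might look like the following. The resource name, load balancer filter, tags, and threshold are illustrative assumptions, and the monitor_thresholds block assumes provider v3+:

```hcl
# Sketch only: filter tag, threshold, and names are placeholders.
resource "datadog_monitor" "alb_elb_5xx" {
  name    = "[ALB] elb_5xx rate is elevated"
  type    = "metric alert"
  message = "The ALB itself is returning 5xx errors (TLS, connection limits, capacity, misconfiguration)."

  # Alert only on errors generated by the ALB, not by its targets.
  query = "sum(last_5m):sum:aws.applicationelb.httpcode_elb_5xx{loadbalancer:my-alb}.as_rate() > 1"

  monitor_thresholds {
    critical = 1
  }

  notify_no_data = false
  tags           = ["scope:alb", "owner:platform"]
}
```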

In short: Let apps detect their own failures, and let the ALB monitor only its own issues. This clear role separation reduces noise dramatically and speeds up root cause identification and decision-making during incidents.

2. Turning "Don't Alert on Upstream Failures" into Real Monitor Logic

Frontend apps (especially in CSR, i.e. client-side rendering, mode) and Gateway apps often suffer from alert noise that originates upstream.

2-1. CSR Soft Fallbacks: No Paging if User Experience is Preserved

In a CSR setup, the frontend may return 200 OK even if some upstream APIs fail.
This is by design: the error was gracefully handled, and the UI rendered a placeholder or simply hid that section. The user still had a valid experience.

This means: this is not an error. It's expected behavior. Alerting on it just creates noise.

2-2. Gateway: Consider Transfer Completion as “Success”

Bottom line: Only upstream monitors should page for upstream failures. The Gateway should alert only on its own problems.

The Gateway's job is to forward requests properly. If a request results in a 504 (timeout) due to upstream delay or failure—that's not the Gateway's fault, and shouldn't page.

However, Gateway-side issues like misrouting, middleware errors, or rate limiting should trigger alerts (502/503/500).

I used APM trace metrics, filtering resource_name to exclude internal Gateway endpoints and explicitly excluding http.status_code:504. The result is a monitor that fires only on Gateway-originated 5xx errors.

Datadog query example:

min(last_5m):default_zero(sum:trace.http.request.errors{...,!resource_name:/internal/*,!http.status_code:504}.as_rate()) >= ${local.threshold}
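A note on that query: default_zero() fills gaps in sparse data with 0 so that quiet periods don't evaluate as missing data, and the min(last_5m) time aggregation is the same false-positive guard discussed in the next section.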

3. Reducing Noise in Path-Based Monitoring (Fixing the "Small Denominator Problem")

If you monitor error rate per API path, low-traffic endpoints can easily trigger false alerts.

Example: an endpoint that receives only two requests in five minutes shows a 50% error rate the moment a single request fails, even though nothing is systemically wrong.

So I switched the time aggregation from sum() to min(). With min(last_5m), the monitor triggers only when the error rate stays above the threshold for the entire 5-minute window, so a single spiky data point no longer pages. This greatly reduces false positives for low-traffic endpoints.
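As a hedged before/after sketch (metric, tag filter, and threshold are placeholders; by {resource_name} gives per-path evaluation):

Before: sum(last_5m):sum:trace.http.request.errors{...} by {resource_name}.as_rate() > ${local.threshold}

After: min(last_5m):sum:trace.http.request.errors{...} by {resource_name}.as_rate() > ${local.threshold}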

→ Datadog Docs: Metric Monitors

TL;DR

- For ALB monitors in Datadog, use aws.applicationelb.httpcode_elb_5xx so the ALB alerts only on its own failures; let application-level monitors catch app 5xx and avoid double alerts.
- Don't page the Frontend or Gateway for upstream failures: CSR soft fallbacks that preserve the user experience aren't errors, and Gateway monitors should exclude 504s and internal endpoints.
- For per-path error-rate monitors, switch the time aggregation from sum() to min() so low-traffic endpoints (the "small denominator problem") don't trigger false positives.
