Introduction
When managing a complex application, it's easy to fall into the trap of “silent monitoring”—setting up alerts that don't actually help you understand your service's health. This leads to alert fatigue and confusion during incidents. Instead, we need to focus on meaningful monitoring—designing monitors that detect only your service's own failures, while filtering out upstream noise.
1. If You're Monitoring ALB 5xx in Datadog, Use elb_5xx
Why target_5xx wasn't the right fit in our case
There's nothing inherently wrong with aws.applicationelb.httpcode_target_5xx.
But in our setup we already have application-level monitors, so target_5xx produced duplicate alerts for the same incident, which became noise.
If you don't have app-side monitoring, then using target_5xx can still be a valid option.
On the other hand, aws.applicationelb.httpcode_elb_5xx detects only failures from the ALB itself (TLS, connection limits, capacity issues, misconfiguration, etc.). By separating roles, we significantly reduce alert noise.
| Category | httpcode_target_5xx | httpcode_elb_5xx |
|---|---|---|
| What it detects | 5xx errors returned by the app | 5xx errors returned by the ALB itself |
| Typical root cause | App bugs, downstream service failures | TLS errors, connection caps, capacity issues, ALB misconfig |
| Relation to app monitoring | May cause double alerts if app has its own monitors | Detects failures independent from the app |
| Best use case | If the app doesn't have monitors, or you want a unified view via ALB | If you want to clearly separate app/infrastructure alerts |
| Noise level | High (if app monitors exist) | Low |
Datadog query example (ALB-only failure detection):

```
sum:aws.applicationelb.httpcode_elb_5xx{...}.as_rate() > 1
```
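For reference, when this expression becomes an actual metric monitor it also needs a time-aggregation window, like the Gateway example later in this post. A minimal sketch, with the tag filter left elided as above and the threshold purely illustrative:

```
sum(last_5m):sum:aws.applicationelb.httpcode_elb_5xx{...}.as_rate() > 1
```

And if you're in the no-app-monitors case mentioned above, the same shape with `aws.applicationelb.httpcode_target_5xx` gives you the unified, ALB-level view instead.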
In short: Let apps detect their own failures, and let the ALB monitor only its own issues. This clear role separation reduces noise dramatically and speeds up root cause identification and decision-making during incidents.
2. Turning "Don't Alert on Upstream Failures" into Real Monitor Logic
Frontend apps (especially in CSR mode) and Gateway apps often suffer from alert noise that originates upstream.
2-1. CSR Soft Fallbacks: No Paging if User Experience is Preserved
In a CSR setup, the frontend may return 200 OK even if some upstream APIs fail.
This is by design: the error was gracefully handled, and the UI rendered a placeholder or simply hid that section. The user still had a valid experience.
This means: this is not an error. It's expected behavior. Alerting on it just creates noise.
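One way to turn this into monitor logic, assuming the frontend has a server-side piece (for example a BFF) that APM traces: page only on that service's own error rate, not on upstream API errors observed through it. A rough sketch using the same trace metrics as the Gateway example in 2-2, where `service:frontend` and the threshold are placeholders:

```
min(last_5m):default_zero(sum:trace.http.request.errors{service:frontend}.as_rate()) > 1
```

Because a gracefully handled upstream failure still returns 200, it never shows up in this metric, so the soft-fallback path stays silent by construction.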
2-2. Gateway: Consider Transfer Completion as “Success”
Bottom line: Only upstream monitors should page for upstream failures. The Gateway should alert only on its own problems.
The Gateway's job is to forward requests properly. If a request results in a 504 (timeout) due to upstream delay or failure—that's not the Gateway's fault, and shouldn't page.
However, Gateway-side issues like misrouting, middleware errors, or rate limiting should trigger alerts (502/503/500).
I used APM trace metrics, filtering on resource_name to exclude the Gateway's internal endpoints and explicitly excluding http.status_code:504. The result is a monitor that fires only on Gateway-originated 5xx errors.
Datadog query example:

```
min(last_5m):default_zero(sum:trace.http.request.errors{...,!resource_name:/internal/*,!http.status_code:504}.as_rate()) >= ${local.threshold}
```
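The flip side of the bottom line above: each upstream team should carry its own error monitor, so the 504s the Gateway passes through still page someone, just not the Gateway on-call. A sketch mirroring the query above, where `service:upstream-api` and the threshold are placeholders:

```
min(last_5m):default_zero(sum:trace.http.request.errors{service:upstream-api}.as_rate()) >= 1
```

With both in place, an upstream outage pages its owner while the Gateway monitor stays focused on 500/502/503 produced by the Gateway itself.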
3. Reducing Noise in Path-Based Monitoring (Fixing the "Small Denominator Problem")
If you monitor error rate per API path, low-traffic endpoints can easily trigger false alerts.
Example:
- One request in 5 minutes → one failure = 100% error rate.
- If you aggregate using `sum(last_5m)`, you'll pick up these random blips and end up paging, despite no real impact.
So I switched from `sum()` to `min()` for the time aggregation.
Using `min(last_5m)` makes the monitor more resilient to sudden spikes, reducing false positives for low-traffic endpoints.
→ Datadog Docs: Metric Monitors
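To make the switch concrete, here are the two variants side by side, using the same monitor shape as in section 2. The first is the noisy `sum()` form, the second the `min()` form I switched to; `resource_name:your_endpoint` and the thresholds are placeholders:

```
sum(last_5m):default_zero(sum:trace.http.request.errors{resource_name:your_endpoint}.as_rate()) > 0

min(last_5m):default_zero(sum:trace.http.request.errors{resource_name:your_endpoint}.as_rate()) > 0
```

With `min(last_5m)`, every point in the evaluation window has to sit above the threshold before the monitor fires, so a single failed request on an otherwise quiet path no longer pages. `default_zero()` fills the empty minutes with 0 so they count toward the min instead of being skipped.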
- For high-traffic, high-impact APIs: keep using `sum()` for sensitivity.
- But for sparse endpoints: switching to `min()` drastically reduced noise, allowing us to surface only meaningful errors.
TL;DR
- Use `elb_5xx` to monitor ALB-level issues (not `target_5xx`, which overlaps with app errors)
- Don't alert on upstream (outbound) failures:
  - CSR frontend with graceful fallback = not an error
  - Gateway that forwarded correctly = not at fault for upstream 504s
- For low-traffic endpoints, use `min()` instead of `sum()` in path-based monitors to reduce noise
