Lessons Learned from Datadog Query Mistakes

Introduction

Datadog is a powerful tool for monitoring AWS resources, but it can be tricky to get the queries right. Here are some lessons learned from common mistakes when querying AWS Application Load Balancer (ALB) metrics in Datadog.

1. httpcode_elb_5xx vs httpcode_target_5xx

  • HTTPCode_ELB_5XX captures 5xx errors generated by the load balancer itself (e.g., no healthy targets, timeouts). (datadoghq.com)
  • HTTPCode_Target_5XX captures 5xx errors returned by the backend (instance/pod) after the load balancer has forwarded the request.

If application health is monitored separately, alert only on httpcode_elb_5xx. There is no need to count backend errors twice.

❌ Bad example

sum:aws.applicationelb.httpcode_target_5xx{...}.as_rate()

✅ Good example

sum:aws.applicationelb.httpcode_elb_5xx{...}.as_rate()

2. min() can't be used with .as_count()

❌ Bad example

min(last_5m):sum:trace.servlet.request.errors{...} by {resource_name}.as_count() / sum:trace.servlet.request.hits{...} by {resource_name}.as_count() * 100 > ${each.value.request_errors_threshold}

Terraform created the monitor without errors, but the Datadog query UI showed sum(...): min() was not applied, and the error rate always showed 100%. The mismatch arises because, with .as_count(), Datadog's sum aggregation overrides the min. The fix was:

✅ Good example

min(last_5m):sum:trace.servlet.request.errors{...} by {resource_name}.as_rate() / sum:trace.servlet.request.hits{...} by {resource_name}.as_rate() * 100 > ${each.value.request_errors_threshold}

When a ratio is evaluated under min(), use .as_rate() instead of .as_count() so that the time aggregation is applied as written and the ratio is not misinterpreted.
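To see why the order of aggregation matters for a ratio, here is a minimal Python sketch with made-up per-minute counts. It is not Datadog's evaluation engine: the function names and sample numbers are hypothetical, and it only contrasts "sum each series over the window, then divide" with "divide point by point, then take the minimum".

```python
# Hypothetical per-minute samples over a 5-minute window (not real data).
errors = [0, 0, 5, 0, 0]
hits = [50, 50, 50, 50, 50]


def ratio_of_sums(num, den):
    """Sum each series over the window first, then divide once (percent)."""
    return 100 * sum(num) / sum(den)


def min_of_ratios(num, den):
    """Divide point by point first, then take the minimum over the window (percent)."""
    return min(100 * n / d for n, d in zip(num, den))


summed_view = ratio_of_sums(errors, hits)  # one short spike still contributes
min_view = min_of_ratios(errors, hits)     # 0 unless every minute has errors
```

With these numbers the two evaluations disagree (2% versus 0%), which is why a monitor that silently switches from one to the other behaves differently from what the query text suggests.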

3. Use avg(), not sum(), for latency metrics

❌ Bad example

sum:aws.applicationelb.target_response_time.p95{...}

✅ Good example

avg:aws.applicationelb.target_response_time.p95{...}

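Using sum across multiple instances added the latency values of every series together, so the result grew with the number of series and no longer resembled a request latency. avg keeps the value on the scale of a single request, which is much more readable. (docs.datadoghq.com) Note that an average of per-series p95 values only approximates a true overall p95.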
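The effect is easy to reproduce with a few numbers. The following Python sketch uses hypothetical p95 values (in seconds), one per series, and is only meant to show how the two space aggregators scale; it does not query Datadog.

```python
from statistics import mean

# Hypothetical p95 latencies in seconds, one value per series (not real data).
p95_by_series = [0.20, 0.25, 0.30]

# sum: grows with the number of series, so it is not a latency anymore.
summed = sum(p95_by_series)

# avg: stays on the scale of a single request's latency.
averaged = mean(p95_by_series)

# Adding a fourth series with a similar latency inflates the sum,
# while the average barely moves.
summed_more = sum(p95_by_series + [0.25])
averaged_more = mean(p95_by_series + [0.25])
```

Here the sum is roughly 0.75 s and rises to roughly 1.0 s with one more series, while the average stays near 0.25 s in both cases.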

4. Explicitly set units with number_format

If a widget contains multiple requests or series, Datadog may drop unit inference, leading to blank or confusing units. This is especially true when mixing queries.

Use Terraform's number_format in formula blocks to force units:

formula {
  formula_expression = "default_zero(query1)"
  number_format {
    unit {
      canonical {
        unit_name     = "hit"
        per_unit_name = "second"
      }
    }
  }
}

This ensures consistent unit display in dashboards. The feature is available since Terraform provider v3.56.0. (github.com)

5. 4xx and 5xx errors are included in "hits"

Querying sum:trace.servlet.request.hits{...}.as_rate() includes both successful and error requests.

If there are 100 hits and 10 of them returned 500, the value is still 100. Using hits as the denominator of an error rate is therefore fine, but do not add separate 4xx/5xx counts on top of it when computing total requests. The clean way:

❌ Bad example

alb_total_request_rate = "(sum:aws.applicationelb.request_count{...}.as_rate() + sum:aws.applicationelb.httpcode_elb_3xx{...}.as_rate() + sum:aws.applicationelb.httpcode_elb_4xx{...}.as_rate() + sum:aws.applicationelb.httpcode_elb_5xx{...}.as_rate())"

✅ Good example

alb_total_request_rate = "sum:aws.applicationelb.request_count{...}.as_rate()"

Avoid redundant sums over error code categories.
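A short worked example shows how the redundant sum distorts an error rate. The counts below are hypothetical, and the sketch assumes, as described above, that the total already includes error responses.

```python
# Hypothetical counts for one interval (not real data).
request_count = 100  # assumed to already include every response code
elb_4xx = 5
elb_5xx = 10


def error_rate_percent(errors, total):
    """Share of requests that failed, in percent."""
    return 100 * errors / total


# Correct: the total is used as-is.
correct_total = request_count
correct_rate = error_rate_percent(elb_5xx, correct_total)

# Redundant: error codes are added on top of a total that already
# contains them, which inflates the denominator.
inflated_total = request_count + elb_4xx + elb_5xx
inflated_rate = error_rate_percent(elb_5xx, inflated_total)
```

With these numbers the inflated total is 115 instead of 100, so the 5xx rate is understated (below 9% instead of 10%).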

Final takeaways

  • Use httpcode_elb_5xx to monitor load balancer health; leave target errors to backend monitors.
  • Don't combine min() with .as_count() in ratio queries; use .as_rate() instead.
  • Monitor latency with avg() to avoid skew.
  • Always define units explicitly when mixing query series.
  • Be mindful of how Datadog aggregates error counts in “hits.”

Refer to AWS‑ELB metrics documentation for deeper context and definitions. (docs.datadoghq.com)