Lessons Learned from Datadog Query Mistakes
Introduction
Datadog is a powerful tool for monitoring AWS resources, but it can be tricky to get the queries right. Here are some lessons learned from common mistakes when querying AWS Application Load Balancer (ALB) metrics in Datadog.
1. httpcode_elb_5xx vs httpcode_target_5xx
HTTPCode_ELB_5XX captures 5xx errors generated by the load balancer itself (e.g., no healthy targets, timeouts) (datadoghq.com). HTTPCode_Target_5XX captures backend (instance/pod) errors returned after forwarding.
If application health is monitored separately, only alert on httpcode_elb_5xx; there is no need to include backend errors twice!
❌ Bad example
sum:aws.applicationelb.httpcode_target_5xx{...}.as_rate()
✅ Good example
sum:aws.applicationelb.httpcode_elb_5xx{...}.as_rate()
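As a concrete illustration, here is a minimal Terraform sketch of an ELB-side 5xx monitor, assuming the Datadog provider's datadog_monitor resource; the tag filter, threshold, and notification handle are placeholders:

resource "datadog_monitor" "alb_elb_5xx" {
  name    = "ALB is returning 5xx errors"
  type    = "query alert"
  message = "The load balancer itself is failing requests. @slack-alerts"
  # Alert only on errors generated by the ALB, not by the targets behind it.
  query   = "sum(last_5m):sum:aws.applicationelb.httpcode_elb_5xx{loadbalancer:app/my-alb}.as_rate() > 1"

  monitor_thresholds {
    critical = 1
  }
}

Target-side 5xx can stay with whatever backend monitor already watches the application itself.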
2. min() can't be used with .as_count()
❌ Bad example
min(last_5m):sum:trace.servlet.request.errors{...} by {resource_name}.as_count() / sum:trace.servlet.request.hits{...} by {resource_name}.as_count() * 100 > ${each.value.request_errors_threshold}
Terraform creation passed, but the Datadog query UI showed sum(...): min didn't apply, and the error rate always showed 100%. The min() vs sum() mismatch happens because Datadog's sum aggregation overrides the min. The fix was:
✅ Good example
min(last_5m):sum:trace.servlet.request.errors{...} by {resource_name}.as_rate() / sum:trace.servlet.request.hits{...} by {resource_name}.as_rate() * 100 > ${each.value.request_errors_threshold}
Use .as_rate() instead of .as_count() when dividing metrics, so the ratio is computed on rates and the monitor's min() aggregation is not silently overridden.
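For reference, a minimal sketch of the surrounding for_each wiring that supplies the interpolated threshold; the var.request_error_monitors map, the service tag, and the names are hypothetical:

variable "request_error_monitors" {
  type = map(object({
    request_errors_threshold = number
  }))
}

resource "datadog_monitor" "request_error_rate" {
  for_each = var.request_error_monitors

  name    = "${each.key}: request error rate"
  type    = "query alert"
  message = "Error rate above ${each.value.request_errors_threshold}% on {{resource_name.name}}."
  # .as_rate() on both sides keeps the ratio meaningful under the min() evaluation.
  query   = "min(last_5m):sum:trace.servlet.request.errors{service:${each.key}} by {resource_name}.as_rate() / sum:trace.servlet.request.hits{service:${each.key}} by {resource_name}.as_rate() * 100 > ${each.value.request_errors_threshold}"

  monitor_thresholds {
    critical = each.value.request_errors_threshold
  }
}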
3. Use avg(), not sum(), for latency metrics
❌ Bad example
sum:aws.applicationelb.target_response_time.p95{...}
✅ Good example
avg:aws.applicationelb.target_response_time.p95{...}
Using sum across multiple instances added up raw latency values and skewed the numbers; avg produces a sensible p95 across requests and much more readable per-request latency. (docs.datadoghq.com)
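A hedged sketch of the matching latency monitor; the 1-second threshold and the loadbalancer tag are placeholders:

resource "datadog_monitor" "alb_p95_latency" {
  name    = "ALB p95 target response time is high"
  type    = "query alert"
  message = "p95 latency on ALB targets exceeded 1s."
  # avg across hosts keeps the per-request p95 readable; sum would inflate it.
  query   = "avg(last_5m):avg:aws.applicationelb.target_response_time.p95{loadbalancer:app/my-alb} > 1"

  monitor_thresholds {
    critical = 1
  }
}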
4. Explicitly set units with number_format
If a widget contains multiple requests or series, Datadog may drop unit inference, leading to blank or confusing units. This is especially true when mixing queries.
Use Terraform's number_format in formula blocks to force units:
formula {
  formula_expression = "default_zero(query1)"

  number_format {
    unit {
      canonical {
        unit_name     = "hit"
        per_unit_name = "second"
      }
    }
  }
}
This ensures consistent unit display in dashboards. The number_format block is available since Datadog Terraform provider v3.56.0. (github.com)
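For context, here is a minimal sketch of where that formula block sits inside a dashboard widget, assuming provider >= 3.56.0; the dashboard title, widget layout, and metric query are placeholders:

resource "datadog_dashboard" "alb" {
  title       = "ALB overview"
  layout_type = "ordered"

  widget {
    timeseries_definition {
      title = "ALB request rate"

      request {
        display_type = "line"

        query {
          metric_query {
            name  = "query1"
            query = "sum:aws.applicationelb.request_count{loadbalancer:app/my-alb}.as_rate()"
          }
        }

        formula {
          formula_expression = "default_zero(query1)"

          number_format {
            unit {
              canonical {
                unit_name     = "hit"
                per_unit_name = "second"
              }
            }
          }
        }
      }
    }
  }
}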
5. 4xx and 5xx errors are included in "hits"
Querying sum:trace.servlet.request.hits{...}.as_rate() includes both successful and error requests: if there are 100 hits and 10 of them returned 500, the value is still 100. Using it as the denominator of an error-rate ratio is therefore fine, but avoid adding separate 4xx/5xx counts on top when computing total request volume. The clean way:
❌ Bad example
alb_total_request_rate = "(sum:aws.applicationelb.request_count{...}.as_rate() + sum:aws.applicationelb.httpcode_elb_3xx{...}.as_rate() + sum:aws.applicationelb.httpcode_elb_4xx{...}.as_rate() + sum:aws.applicationelb.httpcode_elb_5xx{...}.as_rate())"
✅ Good example
alb_total_request_rate = "sum:aws.applicationelb.request_count{...}.as_rate()"
Avoid redundant sums over error code categories.
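As a hedged follow-up, a small locals sketch that builds an error ratio on top of that single total; the locals names, tag filter, and 5% threshold are invented for illustration:

locals {
  alb_tags               = "loadbalancer:app/my-alb"
  alb_total_request_rate = "sum:aws.applicationelb.request_count{${local.alb_tags}}.as_rate()"
  alb_5xx_rate           = "sum:aws.applicationelb.httpcode_elb_5xx{${local.alb_tags}}.as_rate()"

  # 5xx responses over all requests; request_count already includes the errors,
  # so nothing needs to be added back in.
  alb_error_rate_query = "min(last_5m):${local.alb_5xx_rate} / ${local.alb_total_request_rate} * 100 > 5"
}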
Final takeaways
- Use httpcode_elb_5xx to monitor load balancer health; leave target errors to backend monitors.
- Never mix min() with sum() in Datadog queries; use .as_rate().
- Monitor latency with avg() to avoid skew.
- Always define units explicitly when mixing query series.
- Be mindful of how Datadog aggregates error counts in “hits.”
Refer to the AWS ELB metrics documentation for deeper context and definitions. (docs.datadoghq.com)