Observing and Optimizing Your GraphQL API

Part 7 of the “Production GraphQL with Netflix DGS” series — Bonus: Operations
GraphQL APIs are invisible to traditional monitoring. Every request hits the same /graphql endpoint, returns HTTP 200 (even with errors in the body), and carries no URL-based context for your dashboards. If you monitor a GraphQL API the same way you monitor REST, you’re flying blind.
This article covers the observability and optimization layer that sits above your DGS backend: federation with a GraphQL router, client identification, operation-level metrics, error classification, schema analytics, and performance optimization techniques that prevent your API from becoming a bottleneck.
Why GraphQL Needs Different Observability
With REST, your monitoring stack can answer basic questions by looking at HTTP metadata:
GET /api/products?page=0&size=20 → 200 → 42ms
POST /api/orders → 201 → 156ms
GET /api/orders/123 → 404 → 3ms
Each endpoint is a distinct URL. You can build dashboards, set alerts, and identify slow endpoints without looking at request bodies.
GraphQL breaks this model:
POST /graphql → 200 → 42ms (Was this a product search? An order? Which fields?)
POST /graphql → 200 → 3200ms (Slow — but what operation? Which resolver?)
POST /graphql → 200 → 12ms (200 OK, but the response body contains 3 errors)
Every request is POST /graphql, every response is 200 OK (because GraphQL returns errors in the body, not as HTTP status codes), and you can’t tell a lightweight dropdown query from a deeply nested analytics query without inspecting the payload.
The fix is operation-level observability: name your operations, measure them individually, track which clients send them, and monitor error rates per operation — not per endpoint.
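As a minimal illustration of the first step — naming operations — here are the two request shapes side by side (illustrative request bodies only, not tied to any particular client library). Only the named form gives monitoring tools something to group by:

```typescript
// Anonymous query: shows up in metrics as an opaque POST /graphql
const anonymous = {
  query: '{ products { id name } }'
}

// Named query: the operation name travels with the request and
// becomes the grouping key for latency, throughput, and error metrics
const named = {
  query: 'query GetProducts { products { id name } }',
  operationName: 'GetProducts'
}
```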
Federation: The Router Layer
As your system grows, you’ll split your GraphQL API across multiple services. A federation router composes their schemas into a single graph and routes incoming queries to the right service.
The Architecture
Client
  └── API Gateway (auth, rate limiting, circuit breaker)
        └── GraphQL Router (schema composition, query planning)
              ├── Products Service (DGS — catalog + inventory)
              ├── Orders Service (DGS — checkout + fulfillment)
              └── Users Service (DGS — accounts + preferences)
The router reads each service’s schema, composes them into a supergraph, and handles query planning — deciding which services need to be called for each incoming query, and in what order.
Supergraph Composition
The supergraph is composed at build time (or on change), not at request time:
Service A schema ─┐
Service B schema ─┼─▶ Composition Tool ─▶ Schema changed?
Service C schema ─┘                          ├─ Yes: deploy supergraph to router
                                             └─ No: skip — no changes
A practical composition script detects changes before updating:
# Pseudocode for a composition pipeline
compose_supergraph() {
  # Fetch current schemas from running services
  rover supergraph compose --config supergraph.yaml > new_supergraph.graphqls

  # Only update if the schema actually changed
  current_hash=$(sha256sum current_supergraph.graphqls | cut -d' ' -f1)
  new_hash=$(sha256sum new_supergraph.graphqls | cut -d' ' -f1)
  if [ "$current_hash" != "$new_hash" ]; then
    deploy_supergraph new_supergraph.graphqls
    log "Supergraph updated"
  else
    log "No schema changes detected, skipping"
  fi
}
This composition-on-change pattern avoids unnecessary router reloads and makes the pipeline idempotent — safe to run on a schedule or on every deployment.
Schema Publishing (Optional)
You can optionally publish your composed schema to a registry (Apollo Studio, GraphQL Hive, or similar) for analytics:
# Publish to a schema registry for field-level analytics
if [ "$PUBLISH_TARGET" = "registry" ]; then
  rover subgraph publish \
    --name core-service \
    --schema service-a.graphqls \
    --routing-url http://service-a:4000/graphql
fi
This enables the registry to track field usage, detect breaking changes, and provide deprecation analytics — capabilities we’ll cover later in this article.
Service Authentication for Composition
The composition tool needs to introspect your services. In production, this shouldn’t use the same auth as regular users. A dedicated service token with restricted permissions is the standard approach:
# Composition tool configuration
services:
  - name: core-service
    url: http://service-a.internal:4000/graphql
    headers:
      Authorization: "Bearer ${SERVICE_INTROSPECTION_TOKEN}"
  - name: health-service
    url: http://service-b.internal:4000/graphql
    headers:
      Authorization: "Bearer ${SERVICE_INTROSPECTION_TOKEN}"
The backend validates this token separately from user JWTs — it grants introspection access but nothing else.
Gateway Resilience
The API gateway sits in front of the router and provides resilience patterns that the GraphQL layer shouldn’t own:
Circuit Breaker
# Tune these values for your traffic patterns
circuit-breaker:
  sliding-window-size: 20
  sliding-window-type: TIME_BASED
  minimum-number-of-calls: 10
  wait-duration-in-open-state: 10s
  failure-rate-threshold: 60
When the failure rate exceeds the threshold within the sliding window, the circuit opens and returns a fast failure instead of letting requests pile up against a failing service.
For GraphQL routes, you might choose not to apply a circuit breaker — because GraphQL handles partial failures gracefully (some fields succeed, some fail, and the response includes both data and errors). The gateway only needs to intervene for total service outages.
Rate Limiting
Rate limiting per endpoint doesn’t make sense for GraphQL (it’s all one endpoint). Instead, rate limit by identity:
# Example values — adjust based on your traffic and abuse patterns
rate-limiting:
  graphql:
    requests-per-minute: ${RATE_LIMIT_GRAPHQL}
  auth-login:
    requests-per-minute: ${RATE_LIMIT_LOGIN}
  auth-register:
    requests-per-hour: ${RATE_LIMIT_REGISTER}
Login and registration endpoints get tight limits to prevent brute-force attacks. The main GraphQL endpoint gets a generous limit that legitimate users won’t hit, but that prevents automated scraping.
The rate limiter uses the authenticated user’s ID when available, falling back to the client IP for unauthenticated requests. This prevents a single user from starving others while allowing legitimate traffic through.
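The key-selection logic can be sketched as a pure function (the name resolveRateLimitKey and the context shape are illustrative, not taken from a specific gateway API):

```typescript
interface RequestContext {
  userId?: string   // present when the JWT was validated
  clientIp: string  // always available
}

// Prefer the authenticated user's ID; fall back to the client IP.
// The prefixes keep the two namespaces from colliding
// (a user named "10.0.0.1" vs. the IP address 10.0.0.1).
function resolveRateLimitKey(ctx: RequestContext): string {
  return ctx.userId ? `user:${ctx.userId}` : `ip:${ctx.clientIp}`
}
```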
GraphQL Error Fallbacks
When the backend is completely unreachable, the gateway returns a properly formatted GraphQL error — not an HTML 503 page:
{
  "errors": [{
    "message": "Service temporarily unavailable. Please try again.",
    "extensions": {
      "code": "SERVICE_UNAVAILABLE"
    }
  }],
  "data": null
}
This is important because GraphQL clients expect a specific response format. An HTML error page breaks JSON parsing and causes cryptic client-side errors.
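A small client-side guard makes the failure mode concrete. This is a sketch (function and type names are illustrative) that turns any non-JSON response — such as an HTML 503 page — into the same GraphQL error shape, so the rest of the client code only ever sees one format:

```typescript
interface GraphQLError { message: string; extensions?: { code?: string } }
interface GraphQLResponse { data: unknown; errors?: GraphQLError[] }

// Parse a raw response body. If it is not valid JSON (e.g. an HTML
// error page leaked through), synthesize a GraphQL-shaped error
// instead of letting JSON.parse crash the client.
function parseGraphQLResponse(rawBody: string): GraphQLResponse {
  try {
    return JSON.parse(rawBody) as GraphQLResponse
  } catch {
    return {
      data: null,
      errors: [{
        message: 'Service temporarily unavailable. Please try again.',
        extensions: { code: 'SERVICE_UNAVAILABLE' }
      }]
    }
  }
}
```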
WebSocket Routing
Subscriptions use WebSocket connections that bypass the router entirely:
# WebSocket connections route directly to the backend
websocket-route:
  path: /graphql
  uri: ws://backend-service/graphql
  filters:
    - SetRequestHeader=Upgrade, websocket
WebSocket connections are long-lived (sessions can run for an hour or more), so they need different timeout and scaling characteristics than regular HTTP requests.
Client Identification
Apollo Studio (and similar tools) can tell you which client sent each request — but only if the client identifies itself.
The Two Required Headers
const graphqlClient = axios.create({
  headers: {
    'apollographql-client-name': 'web-app',
    'apollographql-client-version': '2.4.1'
  }
})
These two headers — apollographql-client-name and apollographql-client-version — power the Clients dashboard:
- Which clients are sending requests (web app, mobile app, admin panel, cron jobs)
- Which version of each client is deployed
- Error rates per client version (did the v2.4.1 deploy break something?)
- Operation breakdown per client (what does the admin panel query that the web app doesn’t?)
Without these headers, your traffic appears as “Unidentified client” and you lose all per-client visibility.
Version Injection at Build Time
Don’t hardcode the version — inject it from your package.json at build time:
// vite.config.ts
import { defineConfig } from 'vite'
import pkg from './package.json'

export default defineConfig({
  define: {
    __APP_VERSION__: JSON.stringify(pkg.version)
  }
})

// In your GraphQL client setup
declare const __APP_VERSION__: string
const clientVersion = typeof __APP_VERSION__ !== 'undefined'
  ? __APP_VERSION__
  : 'unknown'
This ensures the version tracks actual deployments. If you see errors spike for version 2.4.1 but not 2.4.0, you know exactly which deploy to investigate.
Multiple Client Names
In a micro-frontend architecture, each MFE should ideally use the same client name with module context in the operation names:
apollographql-client-name: my-web-app
apollographql-client-version: 2.4.1
All MFEs share the same centralized GraphQL client (and therefore the same headers), so they appear as a single client in the analytics. Individual operations are distinguished by their operation names (e.g., GetProducts, CreateOrder), not by the client name.
Operation-Level Metrics
AOP-Based Measurement
An AOP aspect wraps every @DgsQuery and @DgsMutation with timing and counting:
@Aspect
@Component
public class OperationMetricsAspect {

    @Around("@annotation(com.netflix.graphql.dgs.DgsQuery) || " +
            "@annotation(com.netflix.graphql.dgs.DgsMutation)")
    public Object measureOperation(ProceedingJoinPoint joinPoint) throws Throwable {
        String operationName = joinPoint.getSignature().getName();
        String operationType = isQuery(joinPoint) ? "query" : "mutation";
        long startTime = System.nanoTime();

        Object result = joinPoint.proceed();
        if (result instanceof Mono<?> mono) {
            return mono
                .doOnSuccess(v -> recordMetrics(operationType, operationName, startTime, "success"))
                .doOnError(e -> recordMetrics(operationType, operationName, startTime, "error"));
        }
        recordMetrics(operationType, operationName, startTime, "success");
        return result;
    }
}
This produces three metrics per operation:
| Metric | Type | Purpose |
|---|---|---|
| gql.operation.latency | Timer (histogram) | Latency distribution per operation |
| gql.operation.count | Counter | Throughput per operation |
| gql.operation.errors | Counter | Error rate per operation and error type |
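The recording half of the aspect can be sketched with an in-memory stand-in for a metrics registry (the structure is illustrative — in a DGS service this would go through a real registry such as Micrometer):

```typescript
// In-memory stand-in for a metrics registry: one latency list and two
// counters per operation, keyed by "type.name" (e.g. "query.products").
const latencies = new Map<string, number[]>()
const counts = new Map<string, number>()
const errors = new Map<string, number>()

function recordMetrics(
  type: string, name: string, durationMs: number,
  outcome: 'success' | 'error'
): void {
  const key = `${type}.${name}`
  // gql.operation.latency — latency distribution
  latencies.set(key, [...(latencies.get(key) ?? []), durationMs])
  // gql.operation.count — throughput
  counts.set(key, (counts.get(key) ?? 0) + 1)
  // gql.operation.errors — error rate
  if (outcome === 'error') {
    errors.set(key, (errors.get(key) ?? 0) + 1)
  }
}
```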
SLO-Based Histograms
Rather than tracking just averages and p99, configure SLO boundary histograms that tell you exactly how many requests fall into each latency bucket:
management:
  metrics:
    distribution:
      percentile-histogram:
        gql.operation.latency: true
      slo:
        # Choose boundaries that match your SLAs
        gql.operation.latency: 25ms, 75ms, 150ms, 300ms, 750ms, 1.5s, 3s
This generates histogram buckets at each SLO boundary. Your monitoring dashboard can then show:
gql.operation.latency (products.search):
≤ 75ms: 72% ████████████████████
≤ 150ms: 89% ████████████████████████
≤ 300ms: 96% ██████████████████████████
≤ 750ms: 99% ███████████████████████████
≤ 1.5s: 99.8%
> 3s: 0.1% (these need investigation)
When an operation’s 95th percentile crosses an SLO boundary (say, moving from the 150ms bucket into the 300ms bucket), that’s a leading indicator of a performance problem — even if the average is still fine.
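Computing those cumulative percentages from raw latency samples is a one-liner per boundary. A sketch (function name is illustrative; a real dashboard reads the histogram buckets directly):

```typescript
// For each SLO boundary, report the fraction of samples at or below it.
function sloBuckets(samplesMs: number[], boundariesMs: number[]): Map<number, number> {
  const result = new Map<number, number>()
  for (const boundary of boundariesMs) {
    const within = samplesMs.filter(ms => ms <= boundary).length
    result.set(boundary, within / samplesMs.length)
  }
  return result
}
```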
Slow Query Detection
Log a warning when any operation exceeds a threshold:
private void recordSuccess(String type, String name, long startTime) {
    Duration duration = Duration.ofNanos(System.nanoTime() - startTime);
    if (duration.compareTo(LATENCY_THRESHOLD) > 0) {
        log.warn("GraphQL operation exceeded threshold",
            kv("operation", name),
            kv("type", type),
            kv("durationMs", duration.toMillis()));
    }
}
}With structured logging, you can query these slow operations in your log aggregator:
# Example: searching for slow operations in your log aggregator
message="GraphQL operation exceeded threshold" AND durationMs > 2000
This surfaces operations that need optimization — before users notice.
Error Classification and Monitoring
The Error Flow
Errors in a federated GraphQL system flow through multiple layers:
Resolver exception ─▶ DataFetcherExceptionHandler ─▶ Sanitized response (data + errors array)
  ─▶ Router forwards (HTTP 200)
  ─▶ Gateway passes through (no circuit breaker — GraphQL handles partial failures)
  ─▶ Client receives data + errors
Error Code Distribution
Track which error codes appear most frequently:
| Error Code | What It Means | Monitor? |
|---|---|---|
| BAD_USER_INPUT | Validation failure (expected) | Track volume, not individual errors |
| NOT_FOUND | Resource doesn’t exist (expected) | Track volume, watch for spikes |
| FORBIDDEN | Auth failure (security concern) | Alert on spikes — may indicate attack |
| INTERNAL_SERVER_ERROR | Bug (unexpected) | Alert immediately — these need fixing |
| SERVICE_UNAVAILABLE | Gateway/router fallback | Alert — indicates infrastructure issue |
The key insight: not every error is a bug. BAD_USER_INPUT is normal user behavior. INTERNAL_SERVER_ERROR is a defect. Your alerting should distinguish between them:
// Alert-worthy: unexpected errors
if (exception instanceof RuntimeException && !(exception instanceof DomainException)) {
    log.error("Unexpected error in GraphQL operation",
        kv("path", path),
        kv("errorType", exception.getClass().getSimpleName()),
        exception); // Full stack trace for debugging
}

// Info-level: expected validation failures
if (exception instanceof ValidationException) {
    log.info("Validation error",
        kv("path", path),
        kv("message", exception.getMessage()));
}
Error Rate by Operation
Combine operation names with error codes to find problem areas:
Operation: createOrder → 12% error rate → 80% BAD_USER_INPUT (ok, complex form)
Operation: getProducts → 0.1% error rate → mostly NOT_FOUND (ok, invalid URLs)
Operation: updateStock → 8% error rate → 60% INTERNAL_SERVER_ERROR (needs fixing!)
The updateStock mutation has 8% errors, and most are internal server errors — that’s a bug. The createOrder mutation has 12% errors, but they’re almost all validation failures — that’s just users submitting bad data. Without this breakdown, you’d see “10% error rate on the GraphQL API” and have no idea where to look.
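That triage rule can be encoded directly. A sketch (type names and the 50% threshold are illustrative) that flags an operation only when its errors are dominated by internal server errors:

```typescript
interface OperationErrors {
  name: string
  errorRate: number               // 0..1, errors / total requests
  byCode: Record<string, number>  // error counts per error code
}

// Flag operations whose errors are mostly INTERNAL_SERVER_ERROR:
// those are bugs, regardless of how high the raw error rate is.
function needsInvestigation(op: OperationErrors, internalShareThreshold = 0.5): boolean {
  const total = Object.values(op.byCode).reduce((a, b) => a + b, 0)
  if (total === 0) return false
  const internal = op.byCode['INTERNAL_SERVER_ERROR'] ?? 0
  return internal / total > internalShareThreshold
}
```

Applied to the example above, updateStock (60% internal errors) is flagged while createOrder (all validation errors) is not — despite createOrder having the higher raw error rate.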
Schema Analytics
Tracking Field Usage
A schema registry can track which fields are actually used by clients. This requires two things:
- Operation registration — clients send named operations (not anonymous queries)
- Usage reporting — the router or backend reports which fields each operation touches
With this data, you can answer questions like:
Field: Product.legacyCode
Used by: 0 clients in the last 90 days
Action: Safe to deprecate and remove
Field: Product.reviews
Used by: web-app (v2.3+), mobile-app (v1.8+)
Action: Cannot remove without client migration
Field: Product.internalSKU
Used by: admin-panel only
Action: Consider restricting to admin role
Detecting Unused Fields
Schema analytics surfaces dead fields — fields that are defined in the schema but never requested:
type Product {
  id: ID!
  name: String!
  price: Float!
  legacyCode: String       # ← Last used 6 months ago
  internalNotes: String    # ← Never requested by any client
  migrationStatus: String  # ← Used only by deprecated v1 client
}
Without analytics, these fields accumulate indefinitely. Their data loaders and resolvers still execute when included in a query, and their backing data still needs to be maintained. Field-level usage data lets you:
- Deprecate fields that are no longer used
- Remove deprecated fields after a grace period
- Identify fields that shouldn’t be public (like internalNotes)
Schema Validation in CI
Catch breaking changes before they reach production by comparing schema versions:
# CI pipeline step
graphql-inspector diff \
  schema-deployed.graphqls \
  schema-current.graphqls
This catches:
- Field removals (breaking)
- Type changes (breaking)
- Required argument additions (breaking)
- Deprecations (non-breaking, informational)
Adding this as a CI gate means breaking changes are caught at code review time, not after deployment.
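The classification in that list can be expressed as a simple lookup. This is a sketch — the change-kind identifiers below are illustrative, not graphql-inspector’s actual output format:

```typescript
type ChangeKind =
  | 'FIELD_REMOVED'       // breaking
  | 'TYPE_CHANGED'        // breaking
  | 'REQUIRED_ARG_ADDED'  // breaking
  | 'FIELD_DEPRECATED'    // non-breaking, informational

const BREAKING: ReadonlySet<ChangeKind> =
  new Set(['FIELD_REMOVED', 'TYPE_CHANGED', 'REQUIRED_ARG_ADDED'])

// The CI gate fails if any detected change is breaking;
// deprecations alone pass with a note.
function shouldFailBuild(changes: ChangeKind[]): boolean {
  return changes.some(kind => BREAKING.has(kind))
}
```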
Performance Optimization
Persisted Queries
Every GraphQL request includes the full query string — which can be large. Automatic Persisted Queries (APQ) replace the query string with a hash:
# First request: send the full query
POST /graphql
{
  "query": "query GetProducts($page: Int!) { products(pageNumber: $page) { ... } }",
  "extensions": {
    "persistedQuery": {
      "version": 1,
      "sha256Hash": "abc123..."
    }
  }
}

# Subsequent requests: send only the hash
POST /graphql
{
  "extensions": {
    "persistedQuery": {
      "version": 1,
      "sha256Hash": "abc123..."
    }
  }
}
The router caches the query text by hash. After the first request, clients send only the hash — reducing request payload by 80-90% for complex queries.
This also has a security benefit: in a locked-down production environment, you can reject queries not in the persisted list — effectively creating an operation allowlist.
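The hash is just the SHA-256 hex digest of the exact query text. A sketch of how a client could build the hash-only request body, using Node’s built-in crypto module (the function name is illustrative):

```typescript
import { createHash } from 'crypto'

// Build the hash-only APQ request body for a given query string.
// The hash must be the SHA-256 hex digest of the exact query text —
// any whitespace difference produces a different hash and a cache miss.
function persistedQueryRequest(query: string) {
  const sha256Hash = createHash('sha256').update(query).digest('hex')
  return {
    extensions: {
      persistedQuery: { version: 1, sha256Hash }
    }
  }
}
```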
Query Complexity Budgets
Beyond the depth and complexity limits covered in Part 3, consider assigning explicit costs to expensive fields:
// Custom field weights
fields.put("Product.reviews", 10); // Triggers a DB join
fields.put("Product.recommendations", 20); // Triggers an ML service call
fields.put("Order.timeline", 5); // Aggregates multiple events
A query that fetches products { reviews recommendations } would cost 1 + 10 + 20 = 31 per item. With a page of 20 items, that’s 620 points — potentially over budget. The client either reduces the page size or removes an expensive field.
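That arithmetic can be checked with a small cost function. This is a sketch — real complexity instrumentation walks the parsed query document, and the weights here are the example values from above:

```typescript
// Example field weights: unlisted fields cost nothing extra.
const fieldWeights: Record<string, number> = {
  'Product.reviews': 10,
  'Product.recommendations': 20,
  'Order.timeline': 5
}

// Per-item cost: 1 for the object itself plus the weight of each
// requested field with an explicit price; the total scales with page size.
function queryCost(parentType: string, fields: string[], pageSize: number): number {
  const perItem = 1 + fields.reduce(
    (sum, f) => sum + (fieldWeights[`${parentType}.${f}`] ?? 0), 0)
  return perItem * pageSize
}
```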
Load Testing with GraphQL-Aware Metrics
Standard load testing tools measure HTTP response times but can’t distinguish GraphQL operations. Use a tool like k6 with custom metrics:
// k6 load test with GraphQL-specific metrics
import http from 'k6/http'
import { Trend, Rate } from 'k6/metrics'

const queryDuration = new Trend('graphql_query_duration')
const mutationDuration = new Trend('graphql_mutation_duration')
const errorRate = new Rate('graphql_error_rate')

export default function () {
  const start = Date.now()
  const response = http.post(GRAPHQL_URL, JSON.stringify({
    query: `query GetProducts($page: Int!) {
      products(pageNumber: $page, pageSize: 20) {
        items { id name price }
        totalElements
      }
    }`,
    variables: { page: 0 }
  }), { headers: { 'Content-Type': 'application/json' } })

  const duration = Date.now() - start
  queryDuration.add(duration)

  const body = JSON.parse(response.body)
  errorRate.add(body.errors ? 1 : 0)
}
Progressive Load Profiles
Design test profiles that simulate real traffic patterns:
| Profile | VUs | Duration | Thresholds | Purpose |
|---|---|---|---|---|
| Canary | 1 | 30s | p95 < 500ms, errors < 1% | Pre-deploy smoke test |
| Load | 10→30 | 20min | p95 < 1s, errors < 5% | Capacity validation |
| Stress | 20→300 | 25min | p95 < 5s, errors < 30% | Breaking point discovery |
| Spike | 10→150 | 12min | p95 < 3s, errors < 15% | Burst handling |
The canary profile is especially valuable: run it automatically before each deployment. If it fails, abort the rollout before users are affected.
A realistic workload mix mirrors actual traffic:
80% read operations (queries)
├── Product search: 30%
├── Product detail: 25%
├── Order list: 15%
└── User profile: 10%
20% write operations (mutations)
├── Add to cart: 10%
├── Place order: 5%
└── Update profile: 5%
Distributed Tracing
A GraphQL request can touch multiple services, databases, and caches. Distributed tracing ties them together into a single timeline.
Trace Context Propagation
The client starts the trace by sending correlation headers:
const headers = {
  'Content-Type': 'application/json',
  'X-Trace-Id': generateTraceId(),
  'X-Span-Id': generateSpanId()
}
The gateway, router, and backend all propagate these IDs. In your log aggregator, you can search by trace ID to see the complete journey.
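The two generator functions can be implemented in a few lines with Node’s crypto module. Per the W3C Trace Context format, a trace ID is 16 random bytes and a span ID is 8, both hex-encoded (a sketch — header names and generation strategy are up to you if you aren’t using an instrumentation library):

```typescript
import { randomBytes } from 'crypto'

// W3C Trace Context sizes: trace ID = 16 bytes (32 hex chars),
// span ID = 8 bytes (16 hex chars), lowercase hex.
function generateTraceId(): string {
  return randomBytes(16).toString('hex')
}

function generateSpanId(): string {
  return randomBytes(8).toString('hex')
}
```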
OpenTelemetry automates most of this. With the Java agent, every HTTP call, database query, and cache lookup is automatically instrumented:
# OpenTelemetry configuration
otel:
  traces:
    exporter: otlp
  propagators: tracecontext # W3C standard
  sampling:
    probability: 0.05 # Sample 5% of production traffic (adjust for your volume)
The tracecontext propagator follows the W3C Trace Context standard. If you need backward compatibility with older tracing systems, add additional propagators as needed.
The Observability Checklist
| What to Monitor | How | Alert When |
|---|---|---|
| Operation latency | SLO histogram per operation | p95 crosses SLO boundary |
| Error rate by code | Counter per error code per operation | INTERNAL_SERVER_ERROR > 1% |
| Client identification | Apollo headers on every request | “Unidentified client” > 5% |
| Field usage | Schema registry analytics | Fields unused for 90+ days |
| Schema changes | CI schema diff | Breaking change detected |
| Query complexity | Complexity scoring per request | Budget exceeded > 10% of requests |
| Gateway health | Circuit breaker state | Circuit opens |
| Composition health | Supergraph diff | Composition fails |
| Subscription connections | WebSocket connection count | Spike above baseline |
| Rate limit hits | Counter per identity | Legitimate users hitting limits |
What’s Missing (And What to Add Next)
No observability setup is complete on day one. Here are high-impact additions to consider:
- Schema validation in CI — use graphql-inspector to catch breaking changes before merge.
- Persisted queries — reduce payload size and add an operation allowlist for security.
- Per-field cost tracking — assign weights to expensive resolvers for smarter complexity limiting.
- Client-side error correlation — match frontend errors to backend traces using shared trace IDs.
- Canary deployments with GraphQL metrics — fail fast if a new version degrades operation latency.
These aren’t day-one requirements — but they’re the difference between “we have monitoring” and “we understand our API.”
Cover photo by Luke Chesser on Unsplash.


