
Observing and Optimizing Your GraphQL API

Part 7 of the “Production GraphQL with Netflix DGS” series — Bonus: Operations


GraphQL APIs are invisible to traditional monitoring. Every request hits the same /graphql endpoint, returns HTTP 200 (even with errors in the body), and carries no URL-based context for your dashboards. If you monitor a GraphQL API the same way you monitor REST, you’re flying blind.

This article covers the observability and optimization layer that sits above your DGS backend: federation with a GraphQL router, client identification, operation-level metrics, error classification, schema analytics, and performance optimization techniques that prevent your API from becoming a bottleneck.

Why GraphQL Needs Different Observability

With REST, your monitoring stack can answer basic questions by looking at HTTP metadata:

Text
GET /api/products?page=0&size=20  →  200  →  42ms
POST /api/orders                   →  201  →  156ms
GET /api/orders/123                →  404  →  3ms

Each endpoint is a distinct URL. You can build dashboards, set alerts, and identify slow endpoints without looking at request bodies.

GraphQL breaks this model:

Text
POST /graphql  →  200  →  42ms   (Was this a product search? An order? Which fields?)
POST /graphql  →  200  →  3200ms (Slow — but what operation? Which resolver?)
POST /graphql  →  200  →  12ms   (200 OK, but the response body contains 3 errors)

Every request is POST /graphql, every response is 200 OK (because GraphQL returns errors in the body, not as HTTP status codes), and you can’t tell a lightweight dropdown query from a deeply nested analytics query without inspecting the payload.

The fix is operation-level observability: name your operations, measure them individually, track which clients send them, and monitor error rates per operation — not per endpoint.

Federation: The Router Layer

As your system grows, you’ll split your GraphQL API across multiple services. A federation router composes their schemas into a single graph and routes incoming queries to the right service.

The Architecture

Mermaid
graph TD
    C["Client"] --> GW["API Gateway<br/>Auth, rate limiting,<br/>circuit breaker"]
    GW --> R["GraphQL Router<br/>Schema composition,<br/>query planning"]
    R --> S1["Products Service<br/>DGS — catalog + inventory"]
    R --> S2["Orders Service<br/>DGS — checkout + fulfillment"]
    R --> S3["Users Service<br/>DGS — accounts + preferences"]
    style C fill:#4a9eff,stroke:#2171c7,color:#fff
    style GW fill:#7c4dff,stroke:#5e35b1,color:#fff
    style R fill:#7c4dff,stroke:#5e35b1,color:#fff
    style S1 fill:#00bfa5,stroke:#00897b,color:#fff
    style S2 fill:#00bfa5,stroke:#00897b,color:#fff
    style S3 fill:#00bfa5,stroke:#00897b,color:#fff

The router reads each service’s schema, composes them into a supergraph, and handles query planning — deciding which services need to be called for each incoming query, and in what order.

Supergraph Composition

The supergraph is composed at build time (or on change), not at request time:

Mermaid
graph LR
    S1["Service A<br/>schema"] --> FETCH["Composition Tool"]
    S2["Service B<br/>schema"] --> FETCH
    S3["Service C<br/>schema"] --> FETCH
    FETCH --> DIFF{"Schema<br/>changed?"}
    DIFF -->|Yes| DEPLOY["Deploy supergraph<br/>to router"]
    DIFF -->|No| SKIP["Skip — no changes"]
    style FETCH fill:#7c4dff,stroke:#5e35b1,color:#fff
    style DIFF fill:#ffd54f,stroke:#f9a825,color:#333
    style DEPLOY fill:#00bfa5,stroke:#00897b,color:#fff
    style SKIP fill:#bdbdbd,stroke:#9e9e9e,color:#333

A practical composition script detects changes before updating:

Bash
# Pseudocode for a composition pipeline
compose_supergraph() {
    # Fetch current schemas from running services
    rover supergraph compose --config supergraph.yaml > new_supergraph.graphqls

    # Only update if schema actually changed
    current_hash=$(sha256sum current_supergraph.graphqls | cut -d' ' -f1)
    new_hash=$(sha256sum new_supergraph.graphqls | cut -d' ' -f1)

    if [ "$current_hash" != "$new_hash" ]; then
        deploy_supergraph new_supergraph.graphqls
        # Promote the new schema so the next run compares against it
        mv new_supergraph.graphqls current_supergraph.graphqls
        log "Supergraph updated"
    else
        log "No schema changes detected, skipping"
    fi
}

This composition-on-change pattern avoids unnecessary router reloads and makes the pipeline idempotent — safe to run on a schedule or on every deployment.

Schema Publishing (Optional)

You can optionally publish your composed schema to a registry (Apollo Studio, GraphQL Hive, or similar) for analytics:

Bash
# Publish to a schema registry for field-level analytics
if [ "$PUBLISH_TARGET" = "registry" ]; then
    rover subgraph publish \
        --name core-service \
        --schema service-a.graphqls \
        --routing-url http://service-a:4000/graphql
fi

This enables the registry to track field usage, detect breaking changes, and provide deprecation analytics — capabilities we’ll cover later in this article.

Service Authentication for Composition

The composition tool needs to introspect your services. In production, this shouldn’t use the same auth as regular users. A dedicated service token with restricted permissions is the standard approach:

YAML
# Composition tool configuration
services:
    - name: core-service
      url: http://service-a.internal:4000/graphql
      headers:
          Authorization: "Bearer ${SERVICE_INTROSPECTION_TOKEN}"
    - name: health-service
      url: http://service-b.internal:4000/graphql
      headers:
          Authorization: "Bearer ${SERVICE_INTROSPECTION_TOKEN}"

The backend validates this token separately from user JWTs — it grants introspection access but nothing else.
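A minimal sketch of that backend-side check, assuming a single shared service token and an introspection-only allowlist (all names here are hypothetical — wire this into your real security filter chain):

```java
import java.util.Set;

// Sketch: a service token is valid only for introspection operations.
// Any other operation falls through to normal user-JWT validation.
public class IntrospectionTokenValidator {

    private static final Set<String> ALLOWED_OPERATIONS = Set.of("IntrospectionQuery");

    private final String serviceToken; // loaded from SERVICE_INTROSPECTION_TOKEN

    public IntrospectionTokenValidator(String serviceToken) {
        this.serviceToken = serviceToken;
    }

    /** True only when the bearer token matches AND the operation is introspection. */
    public boolean isAllowed(String bearerToken, String operationName) {
        if (!serviceToken.equals(bearerToken)) {
            return false; // not the service token — handle as a regular user JWT elsewhere
        }
        return ALLOWED_OPERATIONS.contains(operationName);
    }
}
```

Using a constant-time comparison (e.g. `MessageDigest.isEqual`) instead of `String.equals` is worth considering in production to avoid timing side channels.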

Gateway Resilience

The API gateway sits in front of the router and provides resilience patterns that the GraphQL layer shouldn’t own:

Circuit Breaker

YAML
# Tune these values for your traffic patterns
circuit-breaker:
    sliding-window-size: 20
    sliding-window-type: TIME_BASED
    minimum-number-of-calls: 10
    wait-duration-in-open-state: 10s
    failure-rate-threshold: 60

When the failure rate exceeds the threshold within the sliding window, the circuit opens and returns a fast failure instead of letting requests pile up against a failing service.

For GraphQL routes, you might choose not to apply a circuit breaker at all: GraphQL handles partial failures gracefully (some fields succeed, some fail, and the response carries both data and errors), so the gateway only needs to intervene for total service outages.
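If you do keep a breaker on the GraphQL route, one option is to count only total outages as failures, so that routine partial failures never trip it. A sketch, with hypothetical names:

```java
// Sketch: classify a downstream GraphQL response for circuit-breaker purposes.
// HTTP 200 with errors alongside data is normal GraphQL — not a failure.
public class GraphQlFailurePredicate {

    public static boolean countsAsFailure(int httpStatus, boolean bodyHasData) {
        // Only a non-200 status or a fully null "data" payload indicates an outage
        return httpStatus != 200 || !bodyHasData;
    }
}
```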

Rate Limiting

Rate limiting per endpoint doesn’t make sense for GraphQL (it’s all one endpoint). Instead, rate limit by identity:

YAML
# Example values — adjust based on your traffic and abuse patterns
rate-limiting:
    graphql:
        requests-per-minute: ${RATE_LIMIT_GRAPHQL}
    auth-login:
        requests-per-minute: ${RATE_LIMIT_LOGIN}
    auth-register:
        requests-per-hour: ${RATE_LIMIT_REGISTER}

Login and registration endpoints get tight limits to prevent brute-force attacks. The main GraphQL endpoint gets a generous limit that legitimate users won’t hit, but that prevents automated scraping.

The rate limiter uses the authenticated user’s ID when available, falling back to the client IP for unauthenticated requests. This prevents a single user from starving others while allowing legitimate traffic through.
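That key resolution can be sketched as follows (hypothetical names — adapt to your gateway's filter API):

```java
import java.util.Optional;

// Sketch: pick the rate-limiting key — per-user when authenticated,
// per-IP otherwise. Prefixes keep the two key spaces from colliding.
public class RateLimitKeyResolver {

    public String resolveKey(Optional<String> authenticatedUserId, String clientIp) {
        // Per-account limits mean one abusive user behind a shared NAT
        // cannot exhaust the quota for everyone else on that IP.
        return authenticatedUserId
                .map(id -> "user:" + id)
                .orElse("ip:" + clientIp);
    }
}
```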

GraphQL Error Fallbacks

When the backend is completely unreachable, the gateway returns a properly formatted GraphQL error — not an HTML 503 page:

JSON
{
    "errors": [{
        "message": "Service temporarily unavailable. Please try again.",
        "extensions": {
            "code": "SERVICE_UNAVAILABLE"
        }
    }],
    "data": null
}

This is important because GraphQL clients expect a specific response format. An HTML error page breaks JSON parsing and causes cryptic client-side errors.
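The fallback body can be built as a plain map and serialized by whatever JSON library the gateway already uses — a sketch (note the explicit `null` for `data`, which the GraphQL response format requires):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: spec-shaped GraphQL error body for a total backend outage.
public class GraphQlFallback {

    public static Map<String, Object> serviceUnavailableBody() {
        Map<String, Object> body = new LinkedHashMap<>();
        body.put("errors", List.of(Map.of(
                "message", "Service temporarily unavailable. Please try again.",
                "extensions", Map.of("code", "SERVICE_UNAVAILABLE"))));
        body.put("data", null); // Map.of rejects nulls, hence LinkedHashMap
        return body;
    }
}
```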

WebSocket Routing

Subscriptions use WebSocket connections that bypass the router entirely:

YAML
# WebSocket connections route directly to the backend
websocket-route:
    path: /graphql
    uri: ws://backend-service/graphql
    filters:
        - SetRequestHeader=Upgrade, websocket

WebSocket connections are long-lived (sessions can run for an hour or more), so they need different timeout and scaling characteristics than regular HTTP requests.

Client Identification

Apollo Studio (and similar tools) can tell you which client sent each request — but only if the client identifies itself.

The Two Required Headers

TypeScript
import axios from 'axios'

const graphqlClient = axios.create({
    headers: {
        'apollographql-client-name': 'web-app',
        'apollographql-client-version': '2.4.1'
    }
})

These two headers — apollographql-client-name and apollographql-client-version — power the Clients dashboard:

  • Which clients are sending requests (web app, mobile app, admin panel, cron jobs)
  • Which version of each client is deployed
  • Error rates per client version (did the v2.4.1 deploy break something?)
  • Operation breakdown per client (what does the admin panel query that the web app doesn’t?)

Without these headers, your traffic appears as “Unidentified client” and you lose all per-client visibility.

Version Injection at Build Time

Don’t hardcode the version — inject it from your package.json at build time:

TypeScript
// vite.config.ts
import { defineConfig } from 'vite'
import pkg from './package.json'

export default defineConfig({
    define: {
        __APP_VERSION__: JSON.stringify(pkg.version)
    }
})

// In your GraphQL client setup
declare const __APP_VERSION__: string
const clientVersion = typeof __APP_VERSION__ !== 'undefined'
    ? __APP_VERSION__
    : 'unknown'

This ensures the version tracks actual deployments. If you see errors spike for version 2.4.1 but not 2.4.0, you know exactly which deploy to investigate.

Multiple Client Names

In a micro-frontend architecture, each MFE should ideally use the same client name with module context in the operation names:

Text
apollographql-client-name: my-web-app
apollographql-client-version: 2.4.1

All MFEs share the same centralized GraphQL client (and therefore the same headers), so they appear as a single client in the analytics. Individual operations are distinguished by their operation names (e.g., GetProducts, CreateOrder), not by the client name.

Operation-Level Metrics

AOP-Based Measurement

An AOP aspect wraps every @DgsQuery and @DgsMutation with timing and counting:

Java
@Aspect
@Component
public class OperationMetricsAspect {

    @Around("@annotation(com.netflix.graphql.dgs.DgsQuery) || " +
            "@annotation(com.netflix.graphql.dgs.DgsMutation)")
    public Object measureOperation(ProceedingJoinPoint joinPoint) throws Throwable {
        String operationName = joinPoint.getSignature().getName();
        String operationType = isQuery(joinPoint) ? "query" : "mutation";
        long startTime = System.nanoTime();

        Object result;
        try {
            result = joinPoint.proceed();
        } catch (Throwable e) {
            // Synchronous failures would otherwise go unrecorded
            recordMetrics(operationType, operationName, startTime, "error");
            throw e;
        }

        if (result instanceof Mono<?> mono) {
            // Reactive resolvers complete later — record on the signal, not here
            return mono
                .doOnSuccess(v -> recordMetrics(operationType, operationName, startTime, "success"))
                .doOnError(e -> recordMetrics(operationType, operationName, startTime, "error"));
        }

        recordMetrics(operationType, operationName, startTime, "success");
        return result;
    }

    // isQuery(...) and recordMetrics(...) are small helpers that publish the
    // Micrometer meters listed below; omitted for brevity
}

This produces three metrics per operation:

Metric | Type | Purpose
gql.operation.latency | Timer (histogram) | Latency distribution per operation
gql.operation.count | Counter | Throughput per operation
gql.operation.errors | Counter | Error rate per operation and error type

SLO-Based Histograms

Rather than tracking just averages and p99, configure SLO boundary histograms that tell you exactly how many requests fall into each latency bucket:

YAML
management:
    metrics:
        distribution:
            percentile-histogram:
                gql.operation.latency: true
            slo:
                # Choose boundaries that match your SLAs
                gql.operation.latency: 25ms, 75ms, 150ms, 300ms, 750ms, 1.5s, 3s

This generates histogram buckets at each SLO boundary. Your monitoring dashboard can then show:

Text
gql.operation.latency (products.search):
  ≤ 75ms:   72%  ████████████████████
  ≤ 150ms:  89%  ████████████████████████
  ≤ 300ms:  96%  ██████████████████████████
  ≤ 750ms:  99%  ███████████████████████████
  ≤ 1.5s:   99.8%
  > 3s:     0.1%  (these need investigation)

When an operation’s 95th percentile crosses a boundary (say, 200ms to 500ms), that’s a leading indicator of a performance problem — even if the average is still fine.
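The bucket percentages above are just cumulative fractions of requests at or under each boundary. A sketch of that calculation (hypothetical helper — your metrics backend computes this from the histogram for you):

```java
import java.time.Duration;
import java.util.List;

// Sketch: share of sampled latencies at or under an SLO boundary —
// the same number each histogram bucket exposes to the dashboard.
public class SloBuckets {

    public static double fractionWithin(List<Duration> samples, Duration boundary) {
        if (samples.isEmpty()) {
            return 0.0;
        }
        long within = samples.stream()
                .filter(d -> d.compareTo(boundary) <= 0)
                .count();
        return (double) within / samples.size();
    }
}
```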

Slow Query Detection

Log a warning when any operation exceeds a threshold:

Java
private void recordSuccess(String type, String name, long startTime) {
    Duration duration = Duration.ofNanos(System.nanoTime() - startTime);

    if (duration.compareTo(LATENCY_THRESHOLD) > 0) {
        log.warn("GraphQL operation exceeded threshold",
                kv("operation", name),
                kv("type", type),
                kv("durationMs", duration.toMillis()));
    }
}

With structured logging, you can query these slow operations in your log aggregator:

Text
# Example: searching for slow operations in your log aggregator
message="GraphQL operation exceeded threshold" AND durationMs > 2000

This surfaces operations that need optimization — before users notice.

Error Classification and Monitoring

The Error Flow

Errors in a federated GraphQL system flow through multiple layers:

Mermaid
graph TD
    EX["DGS throws exception"] --> H["DataFetcherExceptionHandler"]
    H --> CLS["Classify error code"]
    H --> SAN["Sanitize message"]
    H --> LOG["Log full details server-side"]
    CLS --> R["GraphQL response<br/>data + errors array"]
    SAN --> R
    R --> ROUTER["Router forwards<br/>HTTP 200"]
    ROUTER --> GW["Gateway passes through<br/>No circuit breaker —<br/>GraphQL handles partial failures"]
    GW --> CLIENT["Client receives<br/>data + errors"]
    style EX fill:#ff7043,stroke:#e64a19,color:#fff
    style H fill:#7c4dff,stroke:#5e35b1,color:#fff
    style R fill:#ffd54f,stroke:#f9a825,color:#333
    style CLIENT fill:#4a9eff,stroke:#2171c7,color:#fff

Error Code Distribution

Track which error codes appear most frequently:

Error Code | What It Means | Monitor?
BAD_USER_INPUT | Validation failure (expected) | Track volume, not individual errors
NOT_FOUND | Resource doesn’t exist (expected) | Track volume, watch for spikes
FORBIDDEN | Auth failure (security concern) | Alert on spikes — may indicate attack
INTERNAL_SERVER_ERROR | Bug (unexpected) | Alert immediately — these need fixing
SERVICE_UNAVAILABLE | Gateway/router fallback | Alert — indicates infrastructure issue

The key insight: not every error is a bug. BAD_USER_INPUT is normal user behavior. INTERNAL_SERVER_ERROR is a defect. Your alerting should distinguish between them:

Java
// Alert-worthy: unexpected errors
if (exception instanceof RuntimeException && !(exception instanceof DomainException)) {
    log.error("Unexpected error in GraphQL operation",
            kv("path", path),
            kv("errorType", exception.getClass().getSimpleName()),
            exception);  // Full stack trace for debugging
}

// Info-level: expected validation failures
if (exception instanceof ValidationException) {
    log.info("Validation error",
            kv("path", path),
            kv("message", exception.getMessage()));
}

Error Rate by Operation

Combine operation names with error codes to find problem areas:

Text
Operation: createOrder  →  12% error rate  →  80% BAD_USER_INPUT (ok, complex form)
Operation: getProducts  →  0.1% error rate →  mostly NOT_FOUND (ok, invalid URLs)
Operation: updateStock  →  8% error rate   →  60% INTERNAL_SERVER_ERROR (needs fixing!)

The updateStock mutation has 8% errors, and most are internal server errors — that’s a bug. The createOrder mutation has 12% errors, but they’re almost all validation failures — that’s just users submitting bad data. Without this breakdown, you’d see “10% error rate on the GraphQL API” and have no idea where to look.

Schema Analytics

Tracking Field Usage

A schema registry can track which fields are actually used by clients. This requires two things:

  1. Operation registration — clients send named operations (not anonymous queries)
  2. Usage reporting — the router or backend reports which fields each operation touches

With this data, you can answer questions like:

Text
Field: Product.legacyCode
  Used by: 0 clients in the last 90 days
  Action: Safe to deprecate and remove

Field: Product.reviews
  Used by: web-app (v2.3+), mobile-app (v1.8+)
  Action: Cannot remove without client migration

Field: Product.internalSKU
  Used by: admin-panel only
  Action: Consider restricting to admin role

Detecting Unused Fields

Schema analytics surfaces dead fields — fields that are defined in the schema but never requested:

GraphQL
type Product {
    id: ID!
    name: String!
    price: Float!
    legacyCode: String      # ← Last used 6 months ago
    internalNotes: String    # ← Never requested by any client
    migrationStatus: String  # ← Used only by deprecated v1 client
}

Without analytics, these fields accumulate indefinitely. Their data loaders and resolvers still execute when included in a query, and their backing data still needs to be maintained. Field-level usage data lets you:

  1. Deprecate fields that are no longer used
  2. Remove deprecated fields after a grace period
  3. Identify fields that shouldn’t be public (like internalNotes)

Schema Validation in CI

Catch breaking changes before they reach production by comparing schema versions:

Bash
# CI pipeline step
graphql-inspector diff \
    schema-deployed.graphqls \
    schema-current.graphqls

This catches:

  • Field removals (breaking)
  • Type changes (breaking)
  • Required argument additions (breaking)
  • Deprecations (non-breaking, informational)

Adding this as a CI gate means breaking changes are caught at code review time, not after deployment.
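At its core, the removal check reduces to a set difference between the deployed and proposed schemas — a deliberately simplified sketch (graphql-inspector also classifies type changes, argument changes, and deprecations):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: any field present in the deployed schema but missing from the
// proposed one is a breaking removal.
public class SchemaDiff {

    public static Set<String> removedFields(Set<String> deployed, Set<String> proposed) {
        Set<String> removed = new HashSet<>(deployed);
        removed.removeAll(proposed);
        return removed;
    }
}
```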

Performance Optimization

Persisted Queries

Every GraphQL request includes the full query string — which can be large. Automatic Persisted Queries (APQ) replace the query string with a hash:

Text
# First request: send the full query
POST /graphql
{
    "query": "query GetProducts($page: Int!) { products(pageNumber: $page) { ... } }",
    "extensions": {
        "persistedQuery": {
            "version": 1,
            "sha256Hash": "abc123..."
        }
    }
}

# Subsequent requests: send only the hash
POST /graphql
{
    "extensions": {
        "persistedQuery": {
            "version": 1,
            "sha256Hash": "abc123..."
        }
    }
}

The router caches the query text by hash. After the first request, clients send only the hash — reducing request payload by 80-90% for complex queries.

This also has a security benefit: in a locked-down production environment, you can reject queries not in the persisted list — effectively creating an operation allowlist.
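A sketch of such an allowlist check, assuming operation hashes are registered at build time (class and method names are hypothetical):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.Set;

// Sketch: accept only queries whose SHA-256 hash was registered ahead of time.
public class PersistedQueryAllowlist {

    private final Set<String> allowedHashes;

    public PersistedQueryAllowlist(Set<String> allowedHashes) {
        this.allowedHashes = allowedHashes;
    }

    /** Same hash the APQ protocol sends in extensions.persistedQuery.sha256Hash. */
    public static String sha256(String query) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(query.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is always available", e);
        }
    }

    public boolean isAllowed(String sha256Hash) {
        return allowedHashes.contains(sha256Hash);
    }
}
```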

Query Complexity Budgets

Beyond the depth and complexity limits covered in Part 3, consider assigning explicit costs to expensive fields:

Java
// Custom field weights
fields.put("Product.reviews", 10);      // Triggers a DB join
fields.put("Product.recommendations", 20); // Triggers an ML service call
fields.put("Order.timeline", 5);         // Aggregates multiple events

A query that fetches products { reviews recommendations } would cost 1 + 10 + 20 = 31 per item. With a page of 20 items, that’s 620 points — potentially over budget. The client either reduces the page size or removes an expensive field.
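That arithmetic as a sketch (hypothetical helper — real complexity instrumentation walks the parsed query rather than taking a field list):

```java
import java.util.List;
import java.util.Map;

// Sketch: per-item cost = 1 (base) + sum of requested field weights,
// multiplied by the page size, then checked against a budget.
public class ComplexityBudget {

    private static final Map<String, Integer> FIELD_WEIGHTS = Map.of(
            "Product.reviews", 10,          // DB join
            "Product.recommendations", 20,  // ML service call
            "Order.timeline", 5);           // event aggregation

    public static int cost(List<String> requestedFields, int pageSize) {
        int perItem = 1 + requestedFields.stream()
                .mapToInt(f -> FIELD_WEIGHTS.getOrDefault(f, 0))
                .sum();
        return perItem * pageSize;
    }

    public static boolean withinBudget(List<String> fields, int pageSize, int budget) {
        return cost(fields, pageSize) <= budget;
    }
}
```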

Load Testing with GraphQL-Aware Metrics

Standard load testing tools measure HTTP response times but can’t distinguish GraphQL operations. Use a tool like k6 with custom metrics:

JavaScript
// k6 load test with GraphQL-specific metrics
import http from 'k6/http'
import { Trend, Rate } from 'k6/metrics'

const GRAPHQL_URL = __ENV.GRAPHQL_URL || 'http://localhost:8080/graphql'

const queryDuration = new Trend('graphql_query_duration')
const mutationDuration = new Trend('graphql_mutation_duration')
const errorRate = new Rate('graphql_error_rate')

export default function () {
    const start = Date.now()

    const response = http.post(GRAPHQL_URL, JSON.stringify({
        query: `query GetProducts($page: Int!) {
            products(pageNumber: $page, pageSize: 20) {
                items { id name price }
                totalElements
            }
        }`,
        variables: { page: 0 }
    }), { headers: { 'Content-Type': 'application/json' } })

    const duration = Date.now() - start
    queryDuration.add(duration)

    const body = JSON.parse(response.body)
    errorRate.add(body.errors ? 1 : 0)
}

Progressive Load Profiles

Design test profiles that simulate real traffic patterns:

Profile | VUs | Duration | Thresholds | Purpose
Canary | 1 | 30s | p95 < 500ms, errors < 1% | Pre-deploy smoke test
Load | 10→30 | 20min | p95 < 1s, errors < 5% | Capacity validation
Stress | 20→300 | 25min | p95 < 5s, errors < 30% | Breaking point discovery
Spike | 10→150 | 12min | p95 < 3s, errors < 15% | Burst handling

The canary profile is especially valuable: run it automatically before each deployment. If it fails, abort the rollout before users are affected.

A realistic workload mix mirrors actual traffic:

Text
80% read operations (queries)
├── Product search: 30%
├── Product detail: 25%
├── Order list: 15%
└── User profile: 10%

20% write operations (mutations)
├── Add to cart: 10%
├── Place order: 5%
└── Update profile: 5%

Distributed Tracing

A GraphQL request can touch multiple services, databases, and caches. Distributed tracing ties them together into a single timeline.

Trace Context Propagation

The client starts the trace by sending correlation headers:

TypeScript
const headers = {
    'Content-Type': 'application/json',
    'X-Trace-Id': generateTraceId(),
    'X-Span-Id': generateSpanId()
}

The gateway, router, and backend all propagate these IDs. In your log aggregator, you can search by trace ID to see the complete journey:

Mermaid
sequenceDiagram
    participant Client
    participant GW as Gateway
    participant Router
    participant Backend
    participant DB as Database
    Client->>GW: POST /graphql (traceId=abc123)
    GW->>Router: Forward with traceId
    Router->>Backend: Route to service (operation=GetProducts)
    Backend->>DB: SELECT products (12ms)
    DB-->>Backend: Results
    Backend->>Backend: DataLoader batch (loader=categories, 5 keys)
    Backend-->>Router: GraphQL response (45ms)
    Router-->>GW: Composed response
    GW-->>Client: HTTP 200 (48ms total)
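A minimal sketch of resolving the trace ID at the edge: reuse the incoming header if present, otherwise start a new trace. (Hypothetical helper — in production, instrumentation agents handle this propagation for you.)

```java
import java.util.Map;
import java.util.Optional;
import java.util.UUID;

// Sketch: reuse an incoming X-Trace-Id or mint a fresh one at the edge.
public class TraceContext {

    public static String resolveTraceId(Map<String, String> headers) {
        return Optional.ofNullable(headers.get("X-Trace-Id"))
                .filter(id -> !id.isBlank())
                .orElseGet(TraceContext::generateTraceId);
    }

    /** 32 hex characters, matching the common trace-id shape. */
    public static String generateTraceId() {
        return UUID.randomUUID().toString().replace("-", "");
    }
}
```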

OpenTelemetry automates most of this. With the Java agent, every HTTP call, database query, and cache lookup is automatically instrumented:

YAML
# OpenTelemetry configuration
otel:
    traces:
        exporter: otlp
    propagators: tracecontext   # W3C standard
    sampling:
        probability: 0.05  # Sample 5% of production traffic (adjust for your volume)

The tracecontext propagator follows the W3C Trace Context standard. If you need backward compatibility with older tracing systems, add additional propagators as needed.

The Observability Checklist

What to Monitor | How | Alert When
Operation latency | SLO histogram per operation | p95 crosses SLO boundary
Error rate by code | Counter per error code per operation | INTERNAL_SERVER_ERROR > 1%
Client identification | Apollo headers on every request | “Unidentified client” > 5%
Field usage | Schema registry analytics | Fields unused for 90+ days
Schema changes | CI schema diff | Breaking change detected
Query complexity | Complexity scoring per request | Budget exceeded > 10% of requests
Gateway health | Circuit breaker state | Circuit opens
Composition health | Supergraph diff | Composition fails
Subscription connections | WebSocket connection count | Spike above baseline
Rate limit hits | Counter per identity | Legitimate users hitting limits

What’s Missing (And What to Add Next)

No observability setup is complete on day one. Here are high-impact additions to consider:

  1. Schema validation in CI — use graphql-inspector to catch breaking changes before merge.
  2. Persisted queries — reduce payload size and add an operation allowlist for security.
  3. Per-field cost tracking — assign weights to expensive resolvers for smarter complexity limiting.
  4. Client-side error correlation — match frontend errors to backend traces using shared trace IDs.
  5. Canary deployments with GraphQL metrics — fail fast if a new version degrades operation latency.

These aren’t day-one requirements — but they’re the difference between “we have monitoring” and “we understand our API.”


Cover photo by Luke Chesser on Unsplash.
