Advanced MCP Error Handling: Retry Logic, Circuit Breakers & Dead Letter Queues

When your ChatGPT app is exposed to ChatGPT's roughly 800 million weekly users, error handling isn't just about catching exceptions. It's about building resilient systems that gracefully handle failures, prevent cascade outages, and maintain user trust even when dependencies fail.

OpenAI's Apps SDK approval guidelines emphasize responsiveness: your MCP server must "respond quickly enough to maintain chat rhythm." But what happens when your database connection pool is exhausted? When a third-party API goes down? When rate limits kick in during peak traffic?

The difference between amateur and production-grade MCP servers lies in advanced error handling patterns:

  • Exponential backoff with jitter prevents retry storms that amplify failures
  • Circuit breakers detect unhealthy dependencies and fail fast to protect your infrastructure
  • Dead letter queues (DLQ) isolate poison messages that would otherwise crash your server
  • Graceful degradation provides partial results when full functionality isn't available
  • Error classification distinguishes retryable transient failures from terminal errors

This guide dives deep into production-grade error handling with 7 battle-tested TypeScript implementations used by high-traffic MCP servers. Whether you're building a fitness studio booking app or a real estate search tool, these patterns ensure your ChatGPT app stays responsive under adverse conditions.

Table of Contents

  1. Why Advanced Error Handling Matters
  2. Exponential Backoff with Jitter
  3. Circuit Breaker Pattern
  4. Dead Letter Queue Implementation
  5. Graceful Degradation Strategies
  6. Error Classification System
  7. Retry Policy Manager
  8. Health Checks with Circuit Breakers
  9. Conclusion: Building Resilient MCP Servers

Why Advanced Error Handling Matters

MCP servers face unique challenges compared to traditional APIs:

1. ChatGPT Model May Retry Tool Calls

When ChatGPT detects a tool failure, it may retry the same tool call multiple times. If your MCP server doesn't handle retries intelligently, you create retry storms where:

  • Failed database queries get retried 10x in parallel
  • Rate-limited API calls trigger more rate limit errors
  • Transient network failures amplify into cascade outages

Example: A fitness studio booking app experiences a 2-second database timeout. ChatGPT retries 5 times. Without exponential backoff, all 5 retries hit the database simultaneously, creating a thundering herd that extends the outage.
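
To make the thundering herd concrete, here is a minimal sketch of the anti-pattern (the tool logic and timings are hypothetical): with a fixed retry interval, every client that failed at the same moment retries at the same moment, so the struggling database absorbs the full load again on every tick. The jittered backoff implemented in the next section spreads those retries out instead.

// Anti-pattern: fixed-interval retry. Clients that failed together retry together.
async function naiveRetry<T>(fn: () => Promise<T>, attempts = 5): Promise<T> {
  for (let i = 1; i <= attempts; i++) {
    try {
      return await fn();
    } catch (error) {
      if (i === attempts) throw error;
      await new Promise(resolve => setTimeout(resolve, 1000)); // same 1s delay for everyone
    }
  }
  throw new Error('unreachable');
}

// With 5 parallel retries hitting this path, the database receives synchronized
// bursts every second instead of getting room to recover.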

2. Multi-Tool Dependencies Create Failure Chains

ChatGPT apps often compose multiple tools in a single conversation turn:

  1. searchClasses → Queries database for available yoga classes
  2. getInstructorBio → Fetches instructor details from CRM
  3. bookClass → Creates reservation via payment gateway

If getInstructorBio fails due to a CRM outage, should the entire conversation fail? Or should you provide class search results with a fallback message: "Instructor details temporarily unavailable"?

Circuit breakers prevent cascade failures by detecting unhealthy dependencies and failing fast, while graceful degradation provides partial results that keep the conversation moving.
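
A rough sketch of that trade-off (the tool functions are injected here because their real implementations are app-specific): the required search result is awaited directly, while the CRM lookup is treated as optional enrichment via Promise.allSettled, so a CRM outage degrades the answer instead of failing the turn.

type ClassResult = { id: string; instructorId: string };
type InstructorBio = { name: string; bio: string };

async function searchWithInstructorBios(
  location: string,
  // Injected tool implementations; the signatures are illustrative.
  searchClasses: (location: string) => Promise<ClassResult[]>,
  getInstructorBio: (instructorId: string) => Promise<InstructorBio>
) {
  const classes = await searchClasses(location); // required: let this failure propagate

  // Optional enrichment: tolerate a CRM outage instead of failing the whole turn.
  const bios = await Promise.allSettled(classes.map(c => getInstructorBio(c.instructorId)));

  return classes.map((c, i) => {
    const bio = bios[i];
    return {
      ...c,
      instructor: bio.status === 'fulfilled'
        ? bio.value
        : { note: 'Instructor details temporarily unavailable' },
    };
  });
}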

3. Poison Messages Can Crash Your Server

Some errors are not transient—they're terminal and will fail every time:

  • Malformed input that triggers validation errors
  • Database queries with syntax errors
  • API requests with invalid authentication credentials

Retrying these failures indefinitely wastes resources and can crash your server through memory exhaustion. Dead letter queues (DLQ) isolate poison messages for manual investigation while keeping healthy traffic flowing.

4. Rate Limits Require Intelligent Backoff

Third-party APIs (Google Maps, Stripe, Twilio) enforce rate limits:

  • 429 Too Many Requests: Temporary rate limit (back off and retry)
  • 403 Forbidden: Permanent quota exhaustion (don't retry)

Error classification distinguishes between retryable rate limits and permanent quota failures, preventing infinite retry loops.
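
As a minimal sketch of that distinction (the helper below is illustrative, not part of any SDK), a 429 response is retried after honoring Retry-After, while a 403 is surfaced immediately:

// Returns how long to wait (ms) before retrying, or null for "do not retry".
function retryDelayForStatus(status: number, retryAfterHeader: string | null): number | null {
  if (status === 429) {
    // Temporary rate limit: honor Retry-After when present (seconds form only), else wait 60s.
    const seconds = retryAfterHeader ? parseInt(retryAfterHeader, 10) : NaN;
    return Number.isNaN(seconds) ? 60_000 : seconds * 1000;
  }
  if (status === 403) return null;          // Quota or permission failure: retrying won't help.
  return status >= 500 ? 1_000 : null;      // 5xx: transient; other 4xx: terminal.
}

// Usage with fetch:
//   const res = await fetch(url);
//   const delay = retryDelayForStatus(res.status, res.headers.get('retry-after'));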


Exponential Backoff with Jitter

Exponential backoff is the foundation of intelligent retry logic. Instead of retrying failed requests at fixed intervals (every 1 second), exponential backoff doubles the wait time between retries: 1s → 2s → 4s → 8s.

Adding jitter (random variance) prevents synchronized retry storms when multiple clients fail simultaneously.
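
In code, the full-jitter delay is simply a uniform random sample from an exponentially growing window, capped at a maximum; a minimal sketch using the same defaults as the implementation below:

// Full jitter: pick a random delay in [0, min(cap, base * 2^(attempt - 1))).
// With base = 1000ms and cap = 30000ms, attempt 3 samples from [0, 4000ms).
function fullJitterDelay(attempt: number, baseMs = 1000, capMs = 30_000): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** (attempt - 1));
  return Math.floor(Math.random() * ceiling);
}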

Implementation: Production-Grade Retry with Jitter

// src/lib/retry/exponential-backoff.ts

/**
 * Exponential Backoff Retry with Jitter
 *
 * Prevents retry storms and respects rate limits with intelligent backoff.
 *
 * Features:
 * - Exponential backoff (1s, 2s, 4s, 8s, 16s)
 * - Full jitter to prevent thundering herd
 * - Configurable max attempts and max delay
 * - Error classification (retryable vs terminal)
 * - Detailed retry metrics for monitoring
 */

export interface RetryOptions {
  maxAttempts: number;        // Maximum retry attempts (default: 5)
  initialDelayMs: number;     // Initial delay in milliseconds (default: 1000)
  maxDelayMs: number;         // Maximum delay cap (default: 30000)
  jitterType: 'full' | 'equal' | 'decorrelated';  // Jitter strategy
  retryableErrors: Set<string>;  // Error codes that should trigger retries
}

export interface RetryMetrics {
  attempt: number;
  totalAttempts: number;
  delayMs: number;
  error: Error;
  isRetryable: boolean;
  timestamp: Date;
}

export class ExponentialBackoffRetry {
  private options: RetryOptions;
  private metrics: RetryMetrics[] = [];

  constructor(options: Partial<RetryOptions> = {}) {
    this.options = {
      maxAttempts: options.maxAttempts ?? 5,
      initialDelayMs: options.initialDelayMs ?? 1000,
      maxDelayMs: options.maxDelayMs ?? 30000,
      jitterType: options.jitterType ?? 'full',
      retryableErrors: options.retryableErrors ?? new Set([
        'ECONNREFUSED',   // Connection refused
        'ETIMEDOUT',      // Connection timeout
        'ENOTFOUND',      // DNS lookup failed
        'NETWORK_ERROR',  // Generic network error
        'RATE_LIMITED',   // Temporary rate limit (429)
        'SERVICE_UNAVAILABLE',  // 503 Service Unavailable
      ]),
    };
  }

  /**
   * Execute function with exponential backoff retry logic
   */
  async execute<T>(
    fn: () => Promise<T>,
    context: string = 'operation'
  ): Promise<T> {
    let lastError: Error;

    for (let attempt = 1; attempt <= this.options.maxAttempts; attempt++) {
      try {
        const result = await fn();

        // Success - log metrics if there were retries
        if (attempt > 1) {
          console.log(`✅ ${context} succeeded after ${attempt} attempts`, {
            totalRetries: attempt - 1,
            totalDelayMs: this.metrics.reduce((sum, m) => sum + m.delayMs, 0),
          });
        }

        return result;
      } catch (error) {
        lastError = error as Error;
        const isRetryable = this.isRetryableError(error);
        const isLastAttempt = attempt === this.options.maxAttempts;

        // Record metrics
        const metric: RetryMetrics = {
          attempt,
          totalAttempts: this.options.maxAttempts,
          delayMs: 0,
          error: lastError,
          isRetryable,
          timestamp: new Date(),
        };

        // Terminal error or last attempt - fail immediately
        if (!isRetryable || isLastAttempt) {
          this.metrics.push(metric);
          console.error(`❌ ${context} failed permanently`, {
            attempt,
            isRetryable,
            error: lastError.message,
            metrics: this.metrics,
          });
          throw lastError;
        }

        // Calculate backoff delay with jitter
        const delayMs = this.calculateDelay(attempt);
        metric.delayMs = delayMs;
        this.metrics.push(metric);

        console.warn(`⚠️ ${context} failed, retrying in ${delayMs}ms`, {
          attempt,
          maxAttempts: this.options.maxAttempts,
          error: lastError.message,
        });

        // Wait before retrying
        await this.sleep(delayMs);
      }
    }

    // This should never be reached due to throw in loop, but TypeScript needs it
    throw lastError!;
  }

  /**
   * Calculate exponential backoff delay with jitter
   */
  private calculateDelay(attempt: number): number {
    const exponentialDelay = Math.min(
      this.options.initialDelayMs * Math.pow(2, attempt - 1),
      this.options.maxDelayMs
    );

    switch (this.options.jitterType) {
      case 'full':
        // Full jitter: random between 0 and exponentialDelay
        return Math.floor(Math.random() * exponentialDelay);

      case 'equal':
        // Equal jitter: exponentialDelay/2 + random(0, exponentialDelay/2)
        return Math.floor(exponentialDelay / 2 + Math.random() * (exponentialDelay / 2));

      case 'decorrelated': {
        // Decorrelated jitter: random between initialDelay and 3x the previous delay.
        // Read the most recently recorded delay so the lookup stays correct even when
        // metrics from earlier execute() calls remain in the array.
        const previousDelay =
          this.metrics[this.metrics.length - 1]?.delayMs || this.options.initialDelayMs;
        return Math.floor(
          Math.random() * (Math.min(this.options.maxDelayMs, previousDelay * 3) - this.options.initialDelayMs)
          + this.options.initialDelayMs
        );
      }

      default:
        return exponentialDelay;
    }
  }

  /**
   * Determine if error should trigger retry
   */
  private isRetryableError(error: any): boolean {
    // Check error code
    if (error.code && this.options.retryableErrors.has(error.code)) {
      return true;
    }

    // Check HTTP status codes
    if (error.response?.status) {
      const status = error.response.status;

      // Retryable: 408 Request Timeout, 429 Too Many Requests, 503 Service Unavailable, 504 Gateway Timeout
      if ([408, 429, 503, 504].includes(status)) {
        return true;
      }

      // Terminal: 4xx client errors (except 408, 429)
      if (status >= 400 && status < 500) {
        return false;
      }

      // Retryable: 5xx server errors
      if (status >= 500) {
        return true;
      }
    }

    // Check error message for network-related keywords
    const errorMessage = error.message?.toLowerCase() || '';
    const networkKeywords = ['timeout', 'network', 'econnrefused', 'enotfound', 'socket hang up'];
    if (networkKeywords.some(keyword => errorMessage.includes(keyword))) {
      return true;
    }

    // Default: not retryable
    return false;
  }

  /**
   * Sleep for specified milliseconds
   */
  private sleep(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
  }

  /**
   * Get retry metrics for monitoring
   */
  getMetrics(): RetryMetrics[] {
    return [...this.metrics];
  }

  /**
   * Reset metrics (useful for testing)
   */
  resetMetrics(): void {
    this.metrics = [];
  }
}

Usage Example

// Example: Retry database query with exponential backoff
import { ExponentialBackoffRetry } from './lib/retry/exponential-backoff';

const retry = new ExponentialBackoffRetry({
  maxAttempts: 5,
  initialDelayMs: 1000,
  maxDelayMs: 30000,
  jitterType: 'full',
});

async function searchFitnessClasses(location: string) {
  return retry.execute(
    async () => {
      const response = await fetch(`https://api.example.com/classes?location=${location}`);

      if (!response.ok) {
        const error: any = new Error(`API error: ${response.statusText}`);
        error.response = { status: response.status };
        throw error;
      }

      return response.json();
    },
    'searchFitnessClasses'
  );
}

Circuit Breaker Pattern

Circuit breakers prevent cascade failures by detecting unhealthy dependencies and failing fast instead of wasting resources on doomed requests.

Circuit Breaker States

  1. CLOSED (healthy): Requests pass through normally
  2. OPEN (failed): Requests fail immediately without hitting dependency
  3. HALF_OPEN (testing): Allow limited requests to test if dependency recovered
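
Those transitions are small enough to summarize as data; a minimal sketch (the labels are informal descriptions of the thresholds used in the implementation below):

type BreakerState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

// Informal summary of when the breaker moves between states.
const transitions: Record<BreakerState, { when: string; to: BreakerState }[]> = {
  CLOSED:    [{ when: 'failures reach failureThreshold', to: 'OPEN' }],
  OPEN:      [{ when: 'recovery timeout elapses', to: 'HALF_OPEN' }],
  HALF_OPEN: [
    { when: 'successes reach successThreshold', to: 'CLOSED' },
    { when: 'any failure', to: 'OPEN' },
  ],
};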

Implementation: Production-Grade Circuit Breaker

// src/lib/circuit-breaker/circuit-breaker.ts

/**
 * Circuit Breaker Implementation
 *
 * Prevents cascade failures by detecting unhealthy dependencies.
 *
 * Features:
 * - Three states: CLOSED, OPEN, HALF_OPEN
 * - Configurable failure threshold and timeout
 * - Automatic recovery testing
 * - Detailed metrics and health reporting
 */

export enum CircuitState {
  CLOSED = 'CLOSED',      // Healthy: requests pass through
  OPEN = 'OPEN',          // Failed: fail fast without hitting dependency
  HALF_OPEN = 'HALF_OPEN' // Testing: allow limited requests to test recovery
}

export interface CircuitBreakerOptions {
  failureThreshold: number;      // Failures before opening circuit (default: 5)
  successThreshold: number;      // Successes in HALF_OPEN to close circuit (default: 2)
  timeout: number;               // Time in ms before attempting recovery (default: 60000)
  windowSize: number;            // Rolling window size for failure tracking (default: 10; reserved, not used in this implementation)
}

export interface CircuitBreakerMetrics {
  state: CircuitState;
  failures: number;
  successes: number;
  totalRequests: number;
  lastFailureTime?: Date;
  lastSuccessTime?: Date;
  stateTransitions: { from: CircuitState; to: CircuitState; timestamp: Date }[];
}

export class CircuitBreaker {
  private state: CircuitState = CircuitState.CLOSED;
  private options: CircuitBreakerOptions;
  private failures: number = 0;
  private successes: number = 0;
  private totalRequests: number = 0;
  private lastFailureTime?: Date;
  private lastSuccessTime?: Date;
  private stateTransitions: { from: CircuitState; to: CircuitState; timestamp: Date }[] = [];
  private nextAttemptTime?: Date;

  constructor(
    private name: string,
    options: Partial<CircuitBreakerOptions> = {}
  ) {
    this.options = {
      failureThreshold: options.failureThreshold ?? 5,
      successThreshold: options.successThreshold ?? 2,
      timeout: options.timeout ?? 60000,  // 60 seconds
      windowSize: options.windowSize ?? 10,
    };
  }

  /**
   * Execute function with circuit breaker protection
   */
  async execute<T>(fn: () => Promise<T>): Promise<T> {
    // Check if circuit is OPEN
    if (this.state === CircuitState.OPEN) {
      if (this.shouldAttemptRecovery()) {
        this.transitionTo(CircuitState.HALF_OPEN);
        // Count only successes observed during the recovery test toward closing the circuit
        this.successes = 0;
      } else {
        throw new Error(
          `Circuit breaker [${this.name}] is OPEN. ` +
          `Next attempt at ${this.nextAttemptTime?.toISOString()}`
        );
      }
    }

    this.totalRequests++;

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  /**
   * Handle successful request
   */
  private onSuccess(): void {
    this.successes++;
    this.lastSuccessTime = new Date();

    if (this.state === CircuitState.HALF_OPEN) {
      // Enough successes to close circuit?
      if (this.successes >= this.options.successThreshold) {
        console.log(`✅ Circuit breaker [${this.name}] closing after ${this.successes} successful recoveries`);
        this.transitionTo(CircuitState.CLOSED);
        this.reset();
      }
    } else if (this.state === CircuitState.CLOSED) {
      // Reset failure counter on success
      this.failures = 0;
    }
  }

  /**
   * Handle failed request
   */
  private onFailure(): void {
    this.failures++;
    this.lastFailureTime = new Date();

    if (this.state === CircuitState.HALF_OPEN) {
      // Any failure in HALF_OPEN reopens circuit
      console.warn(`⚠️ Circuit breaker [${this.name}] reopening after failure during recovery test`);
      this.transitionTo(CircuitState.OPEN);
      this.scheduleRecoveryAttempt();
    } else if (this.state === CircuitState.CLOSED) {
      // Exceeded failure threshold?
      if (this.failures >= this.options.failureThreshold) {
        console.error(`❌ Circuit breaker [${this.name}] opening after ${this.failures} failures`);
        this.transitionTo(CircuitState.OPEN);
        this.scheduleRecoveryAttempt();
      }
    }
  }

  /**
   * Transition to new state
   */
  private transitionTo(newState: CircuitState): void {
    const oldState = this.state;
    this.state = newState;
    this.stateTransitions.push({
      from: oldState,
      to: newState,
      timestamp: new Date(),
    });
  }

  /**
   * Check if circuit should attempt recovery
   */
  private shouldAttemptRecovery(): boolean {
    if (!this.nextAttemptTime) {
      return true;
    }
    return new Date() >= this.nextAttemptTime;
  }

  /**
   * Schedule next recovery attempt
   */
  private scheduleRecoveryAttempt(): void {
    this.nextAttemptTime = new Date(Date.now() + this.options.timeout);
    console.log(`🕒 Circuit breaker [${this.name}] will attempt recovery at ${this.nextAttemptTime.toISOString()}`);
  }

  /**
   * Reset circuit breaker state
   */
  private reset(): void {
    this.failures = 0;
    this.successes = 0;
    this.nextAttemptTime = undefined;
  }

  /**
   * Get current metrics
   */
  getMetrics(): CircuitBreakerMetrics {
    return {
      state: this.state,
      failures: this.failures,
      successes: this.successes,
      totalRequests: this.totalRequests,
      lastFailureTime: this.lastFailureTime,
      lastSuccessTime: this.lastSuccessTime,
      stateTransitions: [...this.stateTransitions],
    };
  }

  /**
   * Get current state
   */
  getState(): CircuitState {
    return this.state;
  }

  /**
   * Check if circuit is healthy
   */
  isHealthy(): boolean {
    return this.state === CircuitState.CLOSED;
  }
}

Usage Example

// Example: Protect third-party API calls with circuit breaker
import { CircuitBreaker } from './lib/circuit-breaker/circuit-breaker';

const stripeCircuit = new CircuitBreaker('stripe-api', {
  failureThreshold: 5,     // Open after 5 failures
  successThreshold: 2,     // Close after 2 successful recoveries
  timeout: 60000,          // Retry after 60 seconds
});

async function createStripePayment(amount: number, token: string) {
  return stripeCircuit.execute(async () => {
    const response = await fetch('https://api.stripe.com/v1/charges', {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${process.env.STRIPE_SECRET_KEY}`,
        'Content-Type': 'application/x-www-form-urlencoded',
      },
      body: new URLSearchParams({
        amount: amount.toString(),
        currency: 'usd',
        source: token,
      }),
    });

    if (!response.ok) {
      throw new Error(`Stripe API error: ${response.statusText}`);
    }

    return response.json();
  });
}

Dead Letter Queue Implementation

Dead letter queues (DLQ) isolate poison messages—requests that fail repeatedly and would otherwise crash your server or waste resources.

When to Use DLQ

  • Malformed input: JSON parse errors, schema validation failures
  • Permanent API errors: 401 Unauthorized, 404 Not Found (won't succeed on retry)
  • Exhausted retries: operations that keep hitting resource limits (connection pools, memory) even after backoff

Implementation: In-Memory DLQ with Persistence

// src/lib/dlq/dead-letter-queue.ts

/**
 * Dead Letter Queue Implementation
 *
 * Isolates poison messages that fail repeatedly.
 *
 * Features:
 * - Automatic retry limit enforcement
 * - Persistent storage (filesystem or database)
 * - Manual replay for debugging
 * - Detailed failure metadata
 */

export interface DLQMessage<T> {
  id: string;
  payload: T;
  error: {
    message: string;
    stack?: string;
    code?: string;
  };
  attempts: number;
  firstAttemptTime: Date;
  lastAttemptTime: Date;
  source: string;  // Which tool/operation failed
}

export interface DLQOptions {
  maxRetries: number;        // Max retries before sending to DLQ (default: 3)
  persistPath?: string;      // File path for DLQ persistence
  alertThreshold?: number;   // Alert after N messages in DLQ
}

export class DeadLetterQueue<T = any> {
  private messages: Map<string, DLQMessage<T>> = new Map();
  private options: DLQOptions;

  constructor(options: Partial<DLQOptions> = {}) {
    this.options = {
      maxRetries: options.maxRetries ?? 3,
      persistPath: options.persistPath,
      alertThreshold: options.alertThreshold ?? 10,
    };

    // Load persisted messages on startup
    if (this.options.persistPath) {
      this.loadFromDisk();
    }
  }

  /**
   * Attempt to process message with automatic DLQ on repeated failures
   */
  async process<R>(
    messageId: string,
    payload: T,
    source: string,
    processor: (payload: T) => Promise<R>
  ): Promise<R> {
    try {
      const result = await processor(payload);

      // Success - remove from tracking if it was previously failing
      if (this.messages.has(messageId)) {
        console.log(`✅ Message ${messageId} recovered after previous failures`);
        this.messages.delete(messageId);
        this.persistToDisk();
      }

      return result;
    } catch (error) {
      // Track failure
      const existing = this.messages.get(messageId);
      const now = new Date();

      if (existing) {
        // Increment attempt counter
        existing.attempts++;
        existing.lastAttemptTime = now;
        existing.error = {
          message: (error as Error).message,
          stack: (error as Error).stack,
          code: (error as any).code,
        };

        // Exceeded max retries?
        if (existing.attempts >= this.options.maxRetries) {
          console.error(`❌ Message ${messageId} sent to DLQ after ${existing.attempts} failures`, {
            source,
            error: existing.error.message,
          });

          this.sendToDLQ(existing);
        }
      } else {
        // First failure - track it
        const newMessage: DLQMessage<T> = {
          id: messageId,
          payload,
          error: {
            message: (error as Error).message,
            stack: (error as Error).stack,
            code: (error as any).code,
          },
          attempts: 1,
          firstAttemptTime: now,
          lastAttemptTime: now,
          source,
        };

        this.messages.set(messageId, newMessage);
        console.warn(`⚠️ Message ${messageId} failed (attempt 1/${this.options.maxRetries})`, {
          source,
          error: newMessage.error.message,
        });
      }

      throw error;
    }
  }

  /**
   * Send message to DLQ (persist and alert)
   */
  private sendToDLQ(message: DLQMessage<T>): void {
    // Persist to disk
    this.persistToDisk();

    // Check if we should alert
    if (this.options.alertThreshold && this.messages.size >= this.options.alertThreshold) {
      console.error(`🚨 DLQ threshold exceeded: ${this.messages.size} messages in queue`);
      // TODO: Send alert (email, Slack, PagerDuty)
    }
  }

  /**
   * Manually replay DLQ message (for debugging)
   */
  async replay<R>(
    messageId: string,
    processor: (payload: T) => Promise<R>
  ): Promise<R> {
    const message = this.messages.get(messageId);
    if (!message) {
      throw new Error(`Message ${messageId} not found in DLQ`);
    }

    try {
      const result = await processor(message.payload);

      // Success - remove from DLQ
      this.messages.delete(messageId);
      this.persistToDisk();
      console.log(`✅ DLQ message ${messageId} replayed successfully`);

      return result;
    } catch (error) {
      console.error(`❌ DLQ message ${messageId} replay failed`, {
        error: (error as Error).message,
      });
      throw error;
    }
  }

  /**
   * Get all DLQ messages
   */
  getMessages(): DLQMessage<T>[] {
    return Array.from(this.messages.values());
  }

  /**
   * Get DLQ statistics
   */
  getStats() {
    return {
      totalMessages: this.messages.size,
      oldestMessage: this.getOldestMessage(),
      newestMessage: this.getNewestMessage(),
      messagesBySource: this.groupBySource(),
    };
  }

  private getOldestMessage(): DLQMessage<T> | undefined {
    let oldest: DLQMessage<T> | undefined;
    for (const message of this.messages.values()) {
      if (!oldest || message.firstAttemptTime < oldest.firstAttemptTime) {
        oldest = message;
      }
    }
    return oldest;
  }

  private getNewestMessage(): DLQMessage<T> | undefined {
    let newest: DLQMessage<T> | undefined;
    for (const message of this.messages.values()) {
      if (!newest || message.firstAttemptTime > newest.firstAttemptTime) {
        newest = message;
      }
    }
    return newest;
  }

  private groupBySource(): Record<string, number> {
    const groups: Record<string, number> = {};
    for (const message of this.messages.values()) {
      groups[message.source] = (groups[message.source] || 0) + 1;
    }
    return groups;
  }

  /**
   * Persist DLQ to disk (for durability)
   */
  private persistToDisk(): void {
    if (!this.options.persistPath) return;

    const fs = require('fs');
    const data = JSON.stringify(Array.from(this.messages.entries()), null, 2);
    fs.writeFileSync(this.options.persistPath, data);
  }

  /**
   * Load DLQ from disk
   */
  private loadFromDisk(): void {
    if (!this.options.persistPath) return;

    try {
      const fs = require('fs');
      if (fs.existsSync(this.options.persistPath)) {
        const data = fs.readFileSync(this.options.persistPath, 'utf-8');
        const entries = JSON.parse(data);
        this.messages = new Map(entries);
        console.log(`📥 Loaded ${this.messages.size} messages from DLQ persistence`);
      }
    } catch (error) {
      console.error('Failed to load DLQ from disk:', error);
    }
  }
}

Usage Example

// Example: Use DLQ for database operations
import { DeadLetterQueue } from './lib/dlq/dead-letter-queue';

const dbDLQ = new DeadLetterQueue({
  maxRetries: 3,
  persistPath: './dlq-database.json',
  alertThreshold: 10,
});

async function createBooking(bookingData: any) {
  return dbDLQ.process(
    `booking-${bookingData.id}`,
    bookingData,
    'createBooking',
    async (data) => {
      // Attempt database insert
      const result = await db.bookings.insert(data);
      return result;
    }
  );
}
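
One design note on the example above: process() keys failures by message ID, so retries of the same logical operation must reuse the same ID or they will never accumulate toward maxRetries. When the payload has no natural ID, a content hash works; a minimal sketch using Node's built-in crypto module (the payload shape is illustrative):

import { createHash } from 'node:crypto';

// Derive a stable ID from the operation name plus a hash of the payload.
// Note: JSON.stringify is key-order sensitive, so normalize the payload if key order can vary.
function stableMessageId(source: string, payload: unknown): string {
  const digest = createHash('sha256').update(JSON.stringify(payload)).digest('hex').slice(0, 16);
  return `${source}:${digest}`;
}

// stableMessageId('createBooking', { classId: 42, userId: 'u_1' }) returns the same
// ID on every retry, so repeated failures count toward the DLQ threshold.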

Graceful Degradation Strategies

Graceful degradation provides partial results when full functionality isn't available, maintaining conversation flow even when dependencies fail.

Implementation: Graceful Degradation Middleware

// src/lib/middleware/graceful-degradation.ts

/**
 * Graceful Degradation Middleware
 *
 * Provides fallback responses when full functionality fails.
 *
 * Features:
 * - Partial result composition
 * - Fallback response templates
 * - User-friendly error messages
 * - Maintains conversation flow
 */

export interface FallbackStrategy<T> {
  fallbackData?: T;           // Static fallback data
  fallbackFn?: () => Promise<T>;  // Dynamic fallback function
  partialResults?: boolean;   // Allow partial results?
  errorMessage?: string;      // User-friendly error message
}

export class GracefulDegradation {
  /**
   * Execute with graceful degradation fallback
   */
  static async execute<T>(
    primaryFn: () => Promise<T>,
    strategy: FallbackStrategy<T>,
    context: string = 'operation'
  ): Promise<{ data: T; degraded: boolean; error?: string }> {
    try {
      const data = await primaryFn();
      return { data, degraded: false };
    } catch (error) {
      console.warn(`⚠️ ${context} failed, applying graceful degradation`, {
        error: (error as Error).message,
      });

      // Try fallback function
      if (strategy.fallbackFn) {
        try {
          const fallbackData = await strategy.fallbackFn();
          return {
            data: fallbackData,
            degraded: true,
            error: strategy.errorMessage || 'Using cached or fallback data',
          };
        } catch (fallbackError) {
          console.error(`❌ Fallback function failed for ${context}`, {
            error: (fallbackError as Error).message,
          });
        }
      }

      // Use static fallback data
      if (strategy.fallbackData !== undefined) {
        return {
          data: strategy.fallbackData,
          degraded: true,
          error: strategy.errorMessage || 'Using default data',
        };
      }

      // No fallback available - throw original error
      throw error;
    }
  }

  /**
   * Compose partial results from multiple tool calls
   */
  static async composePartial<T extends Record<string, any>>(
    operations: Record<keyof T, () => Promise<any>>,
    required: (keyof T)[] = []
  ): Promise<{ data: Partial<T>; degraded: boolean; failures: string[] }> {
    const results: Partial<T> = {};
    const failures: string[] = [];

    // Execute all operations in parallel
    const entries = Object.entries(operations) as [keyof T, () => Promise<any>][];
    const promises = entries.map(async ([key, fn]) => {
      try {
        results[key] = await fn();
      } catch (error) {
        failures.push(String(key));
        console.warn(`⚠️ Operation ${String(key)} failed in partial composition`, {
          error: (error as Error).message,
        });
      }
    });

    await Promise.allSettled(promises);

    // Check if required operations succeeded
    const missingRequired = required.filter(key => !(key in results));
    if (missingRequired.length > 0) {
      throw new Error(
        `Required operations failed: ${missingRequired.join(', ')}`
      );
    }

    return {
      data: results,
      degraded: failures.length > 0,
      failures,
    };
  }
}

Usage Example

// Example: Search fitness classes with fallback to cached data
import { GracefulDegradation } from './lib/middleware/graceful-degradation';

async function searchFitnessClasses(location: string) {
  const result = await GracefulDegradation.execute(
    // Primary: Live API call
    async () => {
      const response = await fetch(`https://api.example.com/classes?location=${location}`);
      if (!response.ok) throw new Error('API unavailable');
      return response.json();
    },

    // Fallback: Cached data from Redis
    {
      fallbackFn: async () => {
        const cached = await redis.get(`classes:${location}`);
        return cached ? JSON.parse(cached) : [];
      },
      errorMessage: 'Showing recently available classes (live data temporarily unavailable)',
    },

    'searchFitnessClasses'
  );

  return {
    classes: result.data,
    note: result.degraded ? result.error : undefined,
  };
}

// Example: Compose partial results (operations run independently, so pass in the IDs each one needs)
async function getCompleteBookingInfo(bookingId: string, instructorId: string, classId: string) {
  const result = await GracefulDegradation.composePartial(
    {
      booking: () => db.bookings.findById(bookingId),
      instructor: () => api.getInstructor(instructorId),
      reviews: () => api.getReviews(classId),
    },
    ['booking']  // booking is required, others are optional
  );

  if (result.degraded) {
    console.log(`⚠️ Partial results: ${result.failures.join(', ')} unavailable`);
  }

  return result.data;
}

Error Classification System

Not all errors should trigger retries. Error classification distinguishes between:

  • Transient errors (network timeouts, rate limits) → Retry
  • Terminal errors (400 Bad Request, 401 Unauthorized) → Fail immediately

Implementation: Error Classifier

// src/lib/errors/error-classifier.ts

/**
 * Error Classification System
 *
 * Distinguishes retryable vs terminal errors to prevent retry storms.
 *
 * Features:
 * - HTTP status code classification
 * - Network error detection
 * - Custom error type support
 * - Detailed error metadata
 */

export enum ErrorType {
  TRANSIENT = 'TRANSIENT',      // Temporary failure - retry
  TERMINAL = 'TERMINAL',          // Permanent failure - don't retry
  RATE_LIMIT = 'RATE_LIMIT',      // Rate limit - retry with backoff
  AUTHENTICATION = 'AUTHENTICATION',  // Auth failure - don't retry
  VALIDATION = 'VALIDATION',      // Input validation - don't retry
  UNKNOWN = 'UNKNOWN',            // Unknown error - conservative retry
}

export interface ClassifiedError {
  type: ErrorType;
  isRetryable: boolean;
  statusCode?: number;
  errorCode?: string;
  message: string;
  retryAfter?: number;  // Milliseconds to wait before retry
  metadata?: Record<string, any>;
}

export class ErrorClassifier {
  /**
   * Classify error and determine retry strategy
   */
  static classify(error: any): ClassifiedError {
    // HTTP status code classification
    if (error.response?.status) {
      return this.classifyHttpError(error);
    }

    // Network error classification
    if (error.code) {
      return this.classifyNetworkError(error);
    }

    // Custom error type classification
    if (error.type) {
      return this.classifyCustomError(error);
    }

    // Unknown error - conservative retry
    return {
      type: ErrorType.UNKNOWN,
      isRetryable: true,  // Conservative: allow retry
      message: error.message || 'Unknown error',
    };
  }

  /**
   * Classify HTTP status code errors
   */
  private static classifyHttpError(error: any): ClassifiedError {
    const status = error.response.status;
    const headers = error.response.headers || {};

    // 408 Request Timeout - transient
    if (status === 408) {
      return {
        type: ErrorType.TRANSIENT,
        isRetryable: true,
        statusCode: status,
        message: 'Request timeout',
      };
    }

    // 429 Too Many Requests - rate limit
    if (status === 429) {
      // Support both plain header objects and fetch Headers instances
      const retryAfterRaw = typeof headers.get === 'function'
        ? headers.get('retry-after')
        : headers['retry-after'];
      const retrySeconds = retryAfterRaw ? parseInt(retryAfterRaw, 10) : NaN;
      const retryAfter = Number.isNaN(retrySeconds)
        ? 60000  // Default: 60 seconds (also used when Retry-After is an HTTP date)
        : retrySeconds * 1000;

      return {
        type: ErrorType.RATE_LIMIT,
        isRetryable: true,
        statusCode: status,
        message: 'Rate limit exceeded',
        retryAfter,
      };
    }

    // 401 Unauthorized, 403 Forbidden - authentication
    if (status === 401 || status === 403) {
      return {
        type: ErrorType.AUTHENTICATION,
        isRetryable: false,
        statusCode: status,
        message: 'Authentication failed',
      };
    }

    // 400 Bad Request, 422 Unprocessable Entity - validation
    if (status === 400 || status === 422) {
      return {
        type: ErrorType.VALIDATION,
        isRetryable: false,
        statusCode: status,
        message: 'Validation error',
      };
    }

    // 404 Not Found, 405 Method Not Allowed, 410 Gone - terminal
    if ([404, 405, 410].includes(status)) {
      return {
        type: ErrorType.TERMINAL,
        isRetryable: false,
        statusCode: status,
        message: 'Resource not found or not allowed',
      };
    }

    // 500 Internal Server Error, 502 Bad Gateway, 503 Service Unavailable, 504 Gateway Timeout - transient
    if ([500, 502, 503, 504].includes(status)) {
      return {
        type: ErrorType.TRANSIENT,
        isRetryable: true,
        statusCode: status,
        message: 'Server error (transient)',
      };
    }

    // Default: 4xx terminal, 5xx transient
    return {
      type: status >= 400 && status < 500 ? ErrorType.TERMINAL : ErrorType.TRANSIENT,
      isRetryable: status >= 500,
      statusCode: status,
      message: error.message || `HTTP ${status} error`,
    };
  }

  /**
   * Classify network errors
   */
  private static classifyNetworkError(error: any): ClassifiedError {
    const retryableNetworkErrors = new Set([
      'ECONNREFUSED',
      'ETIMEDOUT',
      'ENOTFOUND',
      'ECONNRESET',
      'EPIPE',
      'EHOSTUNREACH',
      'EAI_AGAIN',
    ]);

    if (retryableNetworkErrors.has(error.code)) {
      return {
        type: ErrorType.TRANSIENT,
        isRetryable: true,
        errorCode: error.code,
        message: `Network error: ${error.code}`,
      };
    }

    return {
      type: ErrorType.TERMINAL,
      isRetryable: false,
      errorCode: error.code,
      message: `Network error: ${error.code}`,
    };
  }

  /**
   * Classify custom error types
   */
  private static classifyCustomError(error: any): ClassifiedError {
    switch (error.type) {
      case 'RATE_LIMIT':
        return {
          type: ErrorType.RATE_LIMIT,
          isRetryable: true,
          message: error.message,
          retryAfter: error.retryAfter || 60000,
        };

      case 'VALIDATION':
        return {
          type: ErrorType.VALIDATION,
          isRetryable: false,
          message: error.message,
        };

      default:
        return {
          type: ErrorType.UNKNOWN,
          isRetryable: true,
          message: error.message,
        };
    }
  }
}

Retry Policy Manager

Combine exponential backoff, circuit breakers, and error classification into a unified Retry Policy Manager.

// src/lib/retry/retry-policy-manager.ts

/**
 * Retry Policy Manager
 *
 * Unified retry system combining exponential backoff, circuit breakers, and error classification.
 */

import { ExponentialBackoffRetry } from './exponential-backoff';
import { CircuitBreaker } from '../circuit-breaker/circuit-breaker';
import { ErrorClassifier, ErrorType } from '../errors/error-classifier';

export interface RetryPolicyConfig {
  circuitBreaker?: {
    enabled: boolean;
    failureThreshold: number;
    timeout: number;
  };
  retry?: {
    maxAttempts: number;
    initialDelayMs: number;
    maxDelayMs: number;
  };
}

export class RetryPolicyManager {
  private circuitBreaker?: CircuitBreaker;
  private retry: ExponentialBackoffRetry;

  constructor(
    private name: string,
    config: RetryPolicyConfig = {}
  ) {
    // Initialize circuit breaker
    if (config.circuitBreaker?.enabled) {
      this.circuitBreaker = new CircuitBreaker(name, {
        failureThreshold: config.circuitBreaker.failureThreshold ?? 5,
        timeout: config.circuitBreaker.timeout ?? 60000,
      });
    }

    // Initialize retry logic
    this.retry = new ExponentialBackoffRetry({
      maxAttempts: config.retry?.maxAttempts ?? 5,
      initialDelayMs: config.retry?.initialDelayMs ?? 1000,
      maxDelayMs: config.retry?.maxDelayMs ?? 30000,
      jitterType: 'full',
    });
  }

  /**
   * Execute with full retry policy (circuit breaker + retry + error classification)
   */
  async execute<T>(fn: () => Promise<T>): Promise<T> {
    // Wrap in circuit breaker if enabled
    const executeFn = this.circuitBreaker
      ? () => this.circuitBreaker!.execute(fn)
      : fn;

    // Apply retry logic with error classification
    return this.retry.execute(async () => {
      try {
        return await executeFn();
      } catch (error) {
        // Classify error
        const classified = ErrorClassifier.classify(error);

        // If not retryable, throw immediately
        if (!classified.isRetryable) {
          const enhancedError: any = new Error(classified.message);
          enhancedError.type = classified.type;
          enhancedError.statusCode = classified.statusCode;
          enhancedError.errorCode = classified.errorCode;
          throw enhancedError;
        }

        // If rate limited, wait before retrying
        if (classified.type === ErrorType.RATE_LIMIT && classified.retryAfter) {
          console.warn(`⚠️ Rate limited, waiting ${classified.retryAfter}ms before retry`);
          await new Promise(resolve => setTimeout(resolve, classified.retryAfter!));
        }

        throw error;
      }
    }, this.name);
  }

  /**
   * Get health status
   */
  getHealth() {
    return {
      name: this.name,
      circuitBreakerState: this.circuitBreaker?.getState(),
      circuitBreakerMetrics: this.circuitBreaker?.getMetrics(),
      retryMetrics: this.retry.getMetrics(),
    };
  }
}

Usage Example

// Example: Unified retry policy for all API calls
import { RetryPolicyManager } from './lib/retry/retry-policy-manager';

const apiRetryPolicy = new RetryPolicyManager('external-api', {
  circuitBreaker: {
    enabled: true,
    failureThreshold: 5,
    timeout: 60000,
  },
  retry: {
    maxAttempts: 5,
    initialDelayMs: 1000,
    maxDelayMs: 30000,
  },
});

async function callExternalAPI(endpoint: string) {
  return apiRetryPolicy.execute(async () => {
    const response = await fetch(`https://api.example.com${endpoint}`);
    if (!response.ok) {
      const error: any = new Error(`API error: ${response.statusText}`);
      error.response = { status: response.status, headers: response.headers };
      throw error;
    }
    return response.json();
  });
}

Health Checks with Circuit Breakers

Implement health check endpoints that report circuit breaker states and retry metrics.

// src/lib/health/health-check.ts

/**
 * Health Check System with Circuit Breaker Integration
 *
 * Reports system health including circuit breaker states.
 */

import { CircuitBreaker, CircuitState } from '../circuit-breaker/circuit-breaker';
import { RetryPolicyManager } from '../retry/retry-policy-manager';

export interface HealthStatus {
  status: 'healthy' | 'degraded' | 'unhealthy';
  timestamp: Date;
  circuitBreakers: {
    name: string;
    state: CircuitState;
    healthy: boolean;
  }[];
  retryPolicies: {
    name: string;
    metrics: any;
  }[];
  uptime: number;
}

export class HealthCheck {
  private static circuitBreakers: Map<string, CircuitBreaker> = new Map();
  private static retryPolicies: Map<string, RetryPolicyManager> = new Map();
  private static startTime: Date = new Date();

  /**
   * Register circuit breaker for health monitoring
   */
  static registerCircuitBreaker(name: string, circuitBreaker: CircuitBreaker): void {
    this.circuitBreakers.set(name, circuitBreaker);
  }

  /**
   * Register retry policy for health monitoring
   */
  static registerRetryPolicy(name: string, retryPolicy: RetryPolicyManager): void {
    this.retryPolicies.set(name, retryPolicy);
  }

  /**
   * Get overall health status
   */
  static getStatus(): HealthStatus {
    const circuitBreakers = Array.from(this.circuitBreakers.entries()).map(([name, cb]) => ({
      name,
      state: cb.getState(),
      healthy: cb.isHealthy(),
    }));

    const retryPolicies = Array.from(this.retryPolicies.entries()).map(([name, rp]) => ({
      name,
      metrics: rp.getHealth(),
    }));

    // Determine overall status: unhealthy if every circuit is open, degraded if only some are
    const openCircuits = circuitBreakers.filter(cb => !cb.healthy).length;
    const status: HealthStatus['status'] =
      openCircuits === 0
        ? 'healthy'
        : openCircuits === circuitBreakers.length
          ? 'unhealthy'
          : 'degraded';

    return {
      status,
      timestamp: new Date(),
      circuitBreakers,
      retryPolicies,
      uptime: Date.now() - this.startTime.getTime(),
    };
  }

  /**
   * Express.js health check endpoint
   * (defined as an arrow property so it can be passed directly to app.get without losing `this`)
   */
  static handler = (req: any, res: any): void => {
    const health = HealthCheck.getStatus();
    const statusCode = health.status === 'healthy' ? 200 : 503;
    res.status(statusCode).json(health);
  };
}

Usage in MCP Server

// Example: MCP server with health check endpoint
import express from 'express';
import { HealthCheck } from './lib/health/health-check';
import { RetryPolicyManager } from './lib/retry/retry-policy-manager';

const app = express();

// Register retry policies
const dbRetryPolicy = new RetryPolicyManager('database', {
  circuitBreaker: { enabled: true, failureThreshold: 5, timeout: 60000 },
  retry: { maxAttempts: 5, initialDelayMs: 1000, maxDelayMs: 30000 },
});
HealthCheck.registerRetryPolicy('database', dbRetryPolicy);

// Health check endpoint
app.get('/health', HealthCheck.handler);

// MCP tool handlers
app.post('/mcp', async (req, res) => {
  try {
    // Use retry policy for database operations
    const result = await dbRetryPolicy.execute(async () => {
      return db.query('SELECT * FROM classes');
    });

    res.json({ result });
  } catch (error) {
    // Retries and circuit breaker exhausted - return a clear error instead of crashing the request
    res.status(500).json({ error: (error as Error).message });
  }
});

app.listen(3000, () => {
  console.log('MCP server with advanced error handling running on port 3000');
});

Conclusion: Building Resilient MCP Servers

Advanced error handling transforms fragile MCP servers into production-grade systems that stay reliable at ChatGPT scale. By implementing these patterns, you ensure:

✅ Retry storms are prevented with exponential backoff and jitter
✅ Cascade failures are isolated with circuit breakers
✅ Poison messages are quarantined in dead letter queues instead of crashing your server
✅ Users get partial results through graceful degradation when full functionality fails
✅ Retry decisions are intelligent thanks to error classification
✅ System health is transparent via health check endpoints

Next Steps

Ready to implement advanced error handling in your MCP server?

  1. Start with exponential backoff retry logic for all external API calls
  2. Add circuit breakers to protect critical dependencies (database, payment gateways)
  3. Implement dead letter queues for operations with complex failure modes
  4. Use graceful degradation to provide partial results when dependencies fail
  5. Deploy health checks to monitor circuit breaker states in production

Build your ChatGPT app with production-grade error handling using MakeAIHQ.com — the no-code platform that generates MCP servers with built-in retry logic, circuit breakers, and graceful degradation. From zero to ChatGPT App Store in 48 hours, no coding required.

About the Author: This guide was created by the MakeAIHQ engineering team based on production experience running MCP servers for thousands of ChatGPT apps. We've battle-tested these patterns at scale to ensure your apps achieve OpenAI approval on first submission.

Last Updated: December 25, 2026