Advanced MCP Error Handling: Retry Logic, Circuit Breakers & Dead Letter Queues
When your app can reach ChatGPT's 800 million weekly users, error handling isn't just about catching exceptions. It's about building resilient systems that gracefully handle failures, prevent cascade outages, and maintain user trust even when dependencies fail.
OpenAI's Apps SDK approval guidelines emphasize performance responsiveness: your MCP server must "respond quickly enough to maintain chat rhythm." But what happens when your database connection pool is exhausted? When a third-party API goes down? When rate limits kick in during peak traffic?
The difference between amateur and production-grade MCP servers lies in advanced error handling patterns:
- Exponential backoff with jitter prevents retry storms that amplify failures
- Circuit breakers detect unhealthy dependencies and fail fast to protect your infrastructure
- Dead letter queues (DLQ) isolate poison messages that would otherwise crash your server
- Graceful degradation provides partial results when full functionality isn't available
- Error classification distinguishes retryable transient failures from terminal errors
This guide dives deep into production-grade error handling with 7 battle-tested TypeScript implementations used by high-traffic MCP servers. Whether you're building a fitness studio booking app or a real estate search tool, these patterns ensure your ChatGPT app stays responsive under adverse conditions.
Table of Contents
- Why Advanced Error Handling Matters
- Exponential Backoff with Jitter
- Circuit Breaker Pattern
- Dead Letter Queue Implementation
- Graceful Degradation Strategies
- Error Classification System
- Retry Policy Manager
- Health Checks with Circuit Breakers
- Conclusion: Building Resilient MCP Servers
Why Advanced Error Handling Matters
MCP servers face unique challenges compared to traditional APIs:
1. ChatGPT Model May Retry Tool Calls
When ChatGPT detects a tool failure, it may retry the same tool call multiple times. If your MCP server doesn't handle retries intelligently, you create retry storms where:
- Failed database queries get retried 10x in parallel
- Rate-limited API calls trigger more rate limit errors
- Transient network failures amplify into cascade outages
Example: A fitness studio booking app experiences a 2-second database timeout. ChatGPT retries 5 times. Without exponential backoff, all 5 retries hit the database simultaneously, creating a thundering herd that extends the outage.
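To make the thundering-herd effect concrete, here is a small illustrative sketch (the helper below is hypothetical, not part of any SDK) comparing five clients that retry on a fixed exponential schedule against five clients using full jitter. Without jitter, every client retries at the same instant; with jitter, the retries spread across the backoff window.
// Example (illustrative): full jitter spreads five simultaneous retries apart
const BASE_MS = 1000;
const CAP_MS = 30000;

function backoffDelay(attempt: number, jitter: boolean): number {
  // Exponential schedule: 1s, 2s, 4s, ... capped at CAP_MS
  const exponential = Math.min(BASE_MS * 2 ** (attempt - 1), CAP_MS);
  // Full jitter picks a random delay in [0, exponential)
  return jitter ? Math.floor(Math.random() * exponential) : exponential;
}

for (let attempt = 1; attempt <= 3; attempt++) {
  const withoutJitter = Array.from({ length: 5 }, () => backoffDelay(attempt, false));
  const withJitter = Array.from({ length: 5 }, () => backoffDelay(attempt, true));
  // withoutJitter: five identical delays, so all clients hit the database together
  // withJitter: five scattered delays, so load arrives spread out
  console.log(`attempt ${attempt}`, { withoutJitter, withJitter });
}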
2. Multi-Tool Dependencies Create Failure Chains
ChatGPT apps often compose multiple tools in a single conversation turn:
- searchClasses → Queries the database for available yoga classes
- getInstructorBio → Fetches instructor details from the CRM
- bookClass → Creates a reservation via the payment gateway
If getInstructorBio fails due to a CRM outage, should the entire conversation fail? Or should you provide class search results with a fallback message: "Instructor details temporarily unavailable"?
Circuit breakers prevent cascade failures by detecting unhealthy dependencies and failing fast, while graceful degradation provides partial results that keep the conversation moving.
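A minimal sketch of what that fallback looks like in a tool handler, assuming hypothetical searchClasses and getInstructorBios helpers (declared below as placeholders for your own data access code):
// Example (illustrative): degrade gracefully when an optional dependency fails
type FitnessClass = { id: string; name: string; instructorId: string };

// Hypothetical data-access helpers; substitute your own implementations
declare function searchClasses(location: string): Promise<FitnessClass[]>;
declare function getInstructorBios(ids: string[]): Promise<Record<string, string>>;

async function handleSearchWithInstructors(location: string) {
  const classes = await searchClasses(location); // required: let this failure propagate

  let instructorBios: Record<string, string> | undefined;
  try {
    instructorBios = await getInstructorBios(classes.map(c => c.instructorId));
  } catch (error) {
    // Optional enrichment failed: keep the conversation moving with partial data
    console.warn('CRM unavailable, omitting instructor bios', (error as Error).message);
  }

  return {
    classes,
    instructorBios,
    note: instructorBios ? undefined : 'Instructor details temporarily unavailable',
  };
}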
3. Poison Messages Can Crash Your Server
Some errors are not transient—they're terminal and will fail every time:
- Malformed input that triggers validation errors
- Database queries with syntax errors
- API requests with invalid authentication credentials
Retrying these failures indefinitely wastes resources and can crash your server through memory exhaustion. Dead letter queues (DLQ) isolate poison messages for manual investigation while keeping healthy traffic flowing.
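As a minimal sketch of the idea (the isTerminal flag and the in-memory parked array below are hypothetical stand-ins; a full DLQ implementation follows later in this guide):
// Example (illustrative): park terminal failures instead of retrying them forever
interface ParkedMessage {
  id: string;
  payload: unknown;
  reason: string;
  failedAt: Date;
}

const parked: ParkedMessage[] = []; // stand-in for a real dead letter queue

async function processOnce(
  id: string,
  payload: unknown,
  handler: (p: unknown) => Promise<void>
): Promise<void> {
  try {
    await handler(payload);
  } catch (error) {
    const err = error as Error & { isTerminal?: boolean };
    if (err.isTerminal) {
      // Validation or auth errors will never succeed: park them for manual review
      parked.push({ id, payload, reason: err.message, failedAt: new Date() });
      return;
    }
    throw error; // transient failures bubble up to the retry layer
  }
}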
4. Rate Limits Require Intelligent Backoff
Third-party APIs (Google Maps, Stripe, Twilio) enforce rate limits:
- 429 Too Many Requests: Temporary rate limit (back off and retry)
- 403 Forbidden: Permanent quota exhaustion (don't retry)
Error classification distinguishes between retryable rate limits and permanent quota failures, preventing infinite retry loops.
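As an illustration, a hedged sketch of a fetch wrapper that treats the two cases differently (it assumes Retry-After arrives as a seconds value, the common case, and makes only a single follow-up attempt):
// Example (illustrative): back off on 429, give up immediately on 403
async function fetchWithRateLimitAwareness(url: string): Promise<Response> {
  const response = await fetch(url);

  if (response.status === 429) {
    // Temporary rate limit: honor Retry-After if present, otherwise wait 60 seconds
    const retryAfterSeconds = Number(response.headers.get('retry-after') ?? '60');
    const waitMs = Number.isFinite(retryAfterSeconds) ? retryAfterSeconds * 1000 : 60000;
    await new Promise(resolve => setTimeout(resolve, waitMs));
    return fetch(url); // real code would loop with exponential backoff
  }

  if (response.status === 403) {
    // Permanent quota or permission failure: retrying will not help
    throw new Error(`Request forbidden (403) for ${url}; not retrying`);
  }

  return response;
}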
Exponential Backoff with Jitter
Exponential backoff is the foundation of intelligent retry logic. Instead of retrying failed requests at fixed intervals (every 1 second), exponential backoff doubles the wait time between retries: 1s → 2s → 4s → 8s.
Adding jitter (random variance) prevents synchronized retry storms when multiple clients fail simultaneously.
Implementation: Production-Grade Retry with Jitter
// src/lib/retry/exponential-backoff.ts
/**
* Exponential Backoff Retry with Jitter
*
* Prevents retry storms and respects rate limits with intelligent backoff.
*
* Features:
* - Exponential backoff (1s, 2s, 4s, 8s, 16s)
* - Full jitter to prevent thundering herd
* - Configurable max attempts and max delay
* - Error classification (retryable vs terminal)
* - Detailed retry metrics for monitoring
*/
export interface RetryOptions {
maxAttempts: number; // Maximum retry attempts (default: 5)
initialDelayMs: number; // Initial delay in milliseconds (default: 1000)
maxDelayMs: number; // Maximum delay cap (default: 30000)
jitterType: 'full' | 'equal' | 'decorrelated'; // Jitter strategy
retryableErrors: Set<string>; // Error codes that should trigger retries
}
export interface RetryMetrics {
attempt: number;
totalAttempts: number;
delayMs: number;
error: Error;
isRetryable: boolean;
timestamp: Date;
}
export class ExponentialBackoffRetry {
private options: RetryOptions;
private metrics: RetryMetrics[] = [];
constructor(options: Partial<RetryOptions> = {}) {
this.options = {
maxAttempts: options.maxAttempts ?? 5,
initialDelayMs: options.initialDelayMs ?? 1000,
maxDelayMs: options.maxDelayMs ?? 30000,
jitterType: options.jitterType ?? 'full',
retryableErrors: options.retryableErrors ?? new Set([
'ECONNREFUSED', // Connection refused
'ETIMEDOUT', // Connection timeout
'ENOTFOUND', // DNS lookup failed
'NETWORK_ERROR', // Generic network error
'RATE_LIMITED', // Temporary rate limit (429)
'SERVICE_UNAVAILABLE', // 503 Service Unavailable
]),
};
}
/**
* Execute function with exponential backoff retry logic
*/
async execute<T>(
fn: () => Promise<T>,
context: string = 'operation'
): Promise<T> {
let lastError: Error;
for (let attempt = 1; attempt <= this.options.maxAttempts; attempt++) {
try {
const result = await fn();
// Success - log metrics if there were retries
if (attempt > 1) {
console.log(`✅ ${context} succeeded after ${attempt} attempts`, {
totalRetries: attempt - 1,
totalDelayMs: this.metrics.reduce((sum, m) => sum + m.delayMs, 0),
});
}
return result;
} catch (error) {
lastError = error as Error;
const isRetryable = this.isRetryableError(error);
const isLastAttempt = attempt === this.options.maxAttempts;
// Record metrics
const metric: RetryMetrics = {
attempt,
totalAttempts: this.options.maxAttempts,
delayMs: 0,
error: lastError,
isRetryable,
timestamp: new Date(),
};
// Terminal error or last attempt - fail immediately
if (!isRetryable || isLastAttempt) {
this.metrics.push(metric);
console.error(`❌ ${context} failed permanently`, {
attempt,
isRetryable,
error: lastError.message,
metrics: this.metrics,
});
throw lastError;
}
// Calculate backoff delay with jitter
const delayMs = this.calculateDelay(attempt);
metric.delayMs = delayMs;
this.metrics.push(metric);
console.warn(`⚠️ ${context} failed, retrying in ${delayMs}ms`, {
attempt,
maxAttempts: this.options.maxAttempts,
error: lastError.message,
});
// Wait before retrying
await this.sleep(delayMs);
}
}
// This should never be reached due to throw in loop, but TypeScript needs it
throw lastError!;
}
/**
* Calculate exponential backoff delay with jitter
*/
private calculateDelay(attempt: number): number {
const exponentialDelay = Math.min(
this.options.initialDelayMs * Math.pow(2, attempt - 1),
this.options.maxDelayMs
);
switch (this.options.jitterType) {
case 'full':
// Full jitter: random between 0 and exponentialDelay
return Math.floor(Math.random() * exponentialDelay);
case 'equal':
// Equal jitter: exponentialDelay/2 + random(0, exponentialDelay/2)
return Math.floor(exponentialDelay / 2 + Math.random() * (exponentialDelay / 2));
case 'decorrelated':
// Decorrelated jitter: random between initialDelay and 3x previous delay
const previousDelay = attempt > 1
? this.metrics[attempt - 2]?.delayMs || this.options.initialDelayMs
: this.options.initialDelayMs;
return Math.floor(
Math.random() * (Math.min(this.options.maxDelayMs, previousDelay * 3) - this.options.initialDelayMs)
+ this.options.initialDelayMs
);
default:
return exponentialDelay;
}
}
/**
* Determine if error should trigger retry
*/
private isRetryableError(error: any): boolean {
// Check error code
if (error.code && this.options.retryableErrors.has(error.code)) {
return true;
}
// Check HTTP status codes
if (error.response?.status) {
const status = error.response.status;
// Retryable: 408 Request Timeout, 429 Too Many Requests, 503 Service Unavailable, 504 Gateway Timeout
if ([408, 429, 503, 504].includes(status)) {
return true;
}
// Terminal: 4xx client errors (except 408, 429)
if (status >= 400 && status < 500) {
return false;
}
// Retryable: 5xx server errors
if (status >= 500) {
return true;
}
}
// Check error message for network-related keywords
const errorMessage = error.message?.toLowerCase() || '';
const networkKeywords = ['timeout', 'network', 'econnrefused', 'enotfound', 'socket hang up'];
if (networkKeywords.some(keyword => errorMessage.includes(keyword))) {
return true;
}
// Default: not retryable
return false;
}
/**
* Sleep for specified milliseconds
*/
private sleep(ms: number): Promise<void> {
return new Promise(resolve => setTimeout(resolve, ms));
}
/**
* Get retry metrics for monitoring
*/
getMetrics(): RetryMetrics[] {
return [...this.metrics];
}
/**
* Reset metrics (useful for testing)
*/
resetMetrics(): void {
this.metrics = [];
}
}
Usage Example
// Example: Retry database query with exponential backoff
import { ExponentialBackoffRetry } from './lib/retry/exponential-backoff';
const retry = new ExponentialBackoffRetry({
maxAttempts: 5,
initialDelayMs: 1000,
maxDelayMs: 30000,
jitterType: 'full',
});
async function searchFitnessClasses(location: string) {
return retry.execute(
async () => {
const response = await fetch(`https://api.example.com/classes?location=${encodeURIComponent(location)}`);
if (!response.ok) {
const error: any = new Error(`API error: ${response.statusText}`);
error.response = { status: response.status };
throw error;
}
return response.json();
},
'searchFitnessClasses'
);
}
Circuit Breaker Pattern
Circuit breakers prevent cascade failures by detecting unhealthy dependencies and failing fast instead of wasting resources on doomed requests.
Circuit Breaker States
- CLOSED (healthy): Requests pass through normally
- OPEN (failed): Requests fail immediately without hitting dependency
- HALF_OPEN (testing): Allow limited requests to test if dependency recovered
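To make those transitions concrete, here is a short walkthrough sketch using the CircuitBreaker class implemented in the next section (the timings and thresholds are arbitrary illustration values):
// Example (illustrative): walking a breaker through CLOSED -> OPEN -> HALF_OPEN
import { CircuitBreaker } from './lib/circuit-breaker/circuit-breaker';

async function demonstrateStates() {
  const breaker = new CircuitBreaker('flaky-api', { failureThreshold: 3, timeout: 5000 });
  const alwaysFails = async () => { throw new Error('dependency down'); };

  // CLOSED -> OPEN: three consecutive failures trip the breaker
  for (let i = 0; i < 3; i++) {
    await breaker.execute(alwaysFails).catch(() => {});
  }
  console.log(breaker.getState()); // OPEN: calls now fail fast for ~5 seconds

  // After the timeout, the next call moves the breaker to HALF_OPEN and runs for real
  await new Promise(resolve => setTimeout(resolve, 5100));
  await breaker.execute(async () => 'recovered').catch(() => {});
  console.log(breaker.getState()); // HALF_OPEN until successThreshold successes close it
}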
Implementation: Production-Grade Circuit Breaker
// src/lib/circuit-breaker/circuit-breaker.ts
/**
* Circuit Breaker Implementation
*
* Prevents cascade failures by detecting unhealthy dependencies.
*
* Features:
* - Three states: CLOSED, OPEN, HALF_OPEN
* - Configurable failure threshold and timeout
* - Automatic recovery testing
* - Detailed metrics and health reporting
*/
export enum CircuitState {
CLOSED = 'CLOSED', // Healthy: requests pass through
OPEN = 'OPEN', // Failed: fail fast without hitting dependency
HALF_OPEN = 'HALF_OPEN' // Testing: allow limited requests to test recovery
}
export interface CircuitBreakerOptions {
failureThreshold: number; // Failures before opening circuit (default: 5)
successThreshold: number; // Successes in HALF_OPEN to close circuit (default: 2)
timeout: number; // Time in ms before attempting recovery (default: 60000)
windowSize: number; // Rolling window for failure tracking (default: 10)
}
export interface CircuitBreakerMetrics {
state: CircuitState;
failures: number;
successes: number;
totalRequests: number;
lastFailureTime?: Date;
lastSuccessTime?: Date;
stateTransitions: { from: CircuitState; to: CircuitState; timestamp: Date }[];
}
export class CircuitBreaker {
private state: CircuitState = CircuitState.CLOSED;
private options: CircuitBreakerOptions;
private failures: number = 0;
private successes: number = 0;
private totalRequests: number = 0;
private lastFailureTime?: Date;
private lastSuccessTime?: Date;
private stateTransitions: { from: CircuitState; to: CircuitState; timestamp: Date }[] = [];
private nextAttemptTime?: Date;
constructor(
private name: string,
options: Partial<CircuitBreakerOptions> = {}
) {
this.options = {
failureThreshold: options.failureThreshold ?? 5,
successThreshold: options.successThreshold ?? 2,
timeout: options.timeout ?? 60000, // 60 seconds
windowSize: options.windowSize ?? 10,
};
}
/**
* Execute function with circuit breaker protection
*/
async execute<T>(fn: () => Promise<T>): Promise<T> {
// Check if circuit is OPEN
if (this.state === CircuitState.OPEN) {
if (this.shouldAttemptRecovery()) {
this.transitionTo(CircuitState.HALF_OPEN);
} else {
throw new Error(
`Circuit breaker [${this.name}] is OPEN. ` +
`Next attempt at ${this.nextAttemptTime?.toISOString()}`
);
}
}
this.totalRequests++;
try {
const result = await fn();
this.onSuccess();
return result;
} catch (error) {
this.onFailure();
throw error;
}
}
/**
* Handle successful request
*/
private onSuccess(): void {
this.successes++;
this.lastSuccessTime = new Date();
if (this.state === CircuitState.HALF_OPEN) {
// Enough successes to close circuit?
if (this.successes >= this.options.successThreshold) {
console.log(`✅ Circuit breaker [${this.name}] closing after ${this.successes} successful recoveries`);
this.transitionTo(CircuitState.CLOSED);
this.reset();
}
} else if (this.state === CircuitState.CLOSED) {
// Reset failure counter on success
this.failures = 0;
}
}
/**
* Handle failed request
*/
private onFailure(): void {
this.failures++;
this.lastFailureTime = new Date();
if (this.state === CircuitState.HALF_OPEN) {
// Any failure in HALF_OPEN reopens circuit
console.warn(`⚠️ Circuit breaker [${this.name}] reopening after failure during recovery test`);
this.transitionTo(CircuitState.OPEN);
this.scheduleRecoveryAttempt();
} else if (this.state === CircuitState.CLOSED) {
// Exceeded failure threshold?
if (this.failures >= this.options.failureThreshold) {
console.error(`❌ Circuit breaker [${this.name}] opening after ${this.failures} failures`);
this.transitionTo(CircuitState.OPEN);
this.scheduleRecoveryAttempt();
}
}
}
/**
* Transition to new state
*/
private transitionTo(newState: CircuitState): void {
const oldState = this.state;
this.state = newState;
this.stateTransitions.push({
from: oldState,
to: newState,
timestamp: new Date(),
});
}
/**
* Check if circuit should attempt recovery
*/
private shouldAttemptRecovery(): boolean {
if (!this.nextAttemptTime) {
return true;
}
return new Date() >= this.nextAttemptTime;
}
/**
* Schedule next recovery attempt
*/
private scheduleRecoveryAttempt(): void {
this.nextAttemptTime = new Date(Date.now() + this.options.timeout);
console.log(`🕒 Circuit breaker [${this.name}] will attempt recovery at ${this.nextAttemptTime.toISOString()}`);
}
/**
* Reset circuit breaker state
*/
private reset(): void {
this.failures = 0;
this.successes = 0;
this.nextAttemptTime = undefined;
}
/**
* Get current metrics
*/
getMetrics(): CircuitBreakerMetrics {
return {
state: this.state,
failures: this.failures,
successes: this.successes,
totalRequests: this.totalRequests,
lastFailureTime: this.lastFailureTime,
lastSuccessTime: this.lastSuccessTime,
stateTransitions: [...this.stateTransitions],
};
}
/**
* Get current state
*/
getState(): CircuitState {
return this.state;
}
/**
* Check if circuit is healthy
*/
isHealthy(): boolean {
return this.state === CircuitState.CLOSED;
}
}
Usage Example
// Example: Protect third-party API calls with circuit breaker
import { CircuitBreaker } from './lib/circuit-breaker/circuit-breaker';
const stripeCircuit = new CircuitBreaker('stripe-api', {
failureThreshold: 5, // Open after 5 failures
successThreshold: 2, // Close after 2 successful recoveries
timeout: 60000, // Retry after 60 seconds
});
async function createStripePayment(amount: number, token: string) {
return stripeCircuit.execute(async () => {
const response = await fetch('https://api.stripe.com/v1/charges', {
method: 'POST',
headers: {
'Authorization': `Bearer ${process.env.STRIPE_SECRET_KEY}`,
'Content-Type': 'application/x-www-form-urlencoded',
},
body: new URLSearchParams({
amount: amount.toString(),
currency: 'usd',
source: token,
}),
});
if (!response.ok) {
throw new Error(`Stripe API error: ${response.statusText}`);
}
return response.json();
});
}
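When the circuit is open, execute() rejects immediately without calling Stripe. A hedged sketch of turning that fast failure into a friendly tool response instead of surfacing a raw error (the response shape here is illustrative):
// Example (illustrative): translate a fast-failing circuit into a graceful tool response
async function bookClassWithPayment(amount: number, token: string) {
  try {
    const charge = await createStripePayment(amount, token);
    return { status: 'confirmed', chargeId: charge.id };
  } catch (error) {
    if (!stripeCircuit.isHealthy()) {
      // Payment provider is known to be down: tell the user instead of retrying
      return {
        status: 'unavailable',
        message: 'Payments are temporarily unavailable. Please try again in a few minutes.',
      };
    }
    throw error; // unexpected failure: let the retry/error layer handle it
  }
}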
Dead Letter Queue Implementation
Dead letter queues (DLQ) isolate poison messages—requests that fail repeatedly and would otherwise crash your server or waste resources.
When to Use DLQ
- Malformed input: JSON parse errors, schema validation failures
- Permanent API errors: 401 Unauthorized, 404 Not Found (won't succeed on retry)
- Resource constraints: Database connection exhaustion, memory limits
Implementation: In-Memory DLQ with Persistence
// src/lib/dlq/dead-letter-queue.ts
/**
* Dead Letter Queue Implementation
*
* Isolates poison messages that fail repeatedly.
*
* Features:
* - Automatic retry limit enforcement
* - Persistent storage (filesystem or database)
* - Manual replay for debugging
* - Detailed failure metadata
*/
import * as fs from 'fs';
export interface DLQMessage<T> {
id: string;
payload: T;
error: {
message: string;
stack?: string;
code?: string;
};
attempts: number;
firstAttemptTime: Date;
lastAttemptTime: Date;
source: string; // Which tool/operation failed
}
export interface DLQOptions {
maxRetries: number; // Max retries before sending to DLQ (default: 3)
persistPath?: string; // File path for DLQ persistence
alertThreshold?: number; // Alert after N messages in DLQ
}
export class DeadLetterQueue<T = any> {
private messages: Map<string, DLQMessage<T>> = new Map();
private options: DLQOptions;
constructor(options: Partial<DLQOptions> = {}) {
this.options = {
maxRetries: options.maxRetries ?? 3,
persistPath: options.persistPath,
alertThreshold: options.alertThreshold ?? 10,
};
// Load persisted messages on startup
if (this.options.persistPath) {
this.loadFromDisk();
}
}
/**
* Attempt to process message with automatic DLQ on repeated failures
*/
async process<R>(
messageId: string,
payload: T,
source: string,
processor: (payload: T) => Promise<R>
): Promise<R> {
try {
const result = await processor(payload);
// Success - remove from tracking if it was previously failing
if (this.messages.has(messageId)) {
console.log(`✅ Message ${messageId} recovered after previous failures`);
this.messages.delete(messageId);
this.persistToDisk();
}
return result;
} catch (error) {
// Track failure
const existing = this.messages.get(messageId);
const now = new Date();
if (existing) {
// Increment attempt counter
existing.attempts++;
existing.lastAttemptTime = now;
existing.error = {
message: (error as Error).message,
stack: (error as Error).stack,
code: (error as any).code,
};
// Exceeded max retries?
if (existing.attempts >= this.options.maxRetries) {
console.error(`❌ Message ${messageId} sent to DLQ after ${existing.attempts} failures`, {
source,
error: existing.error.message,
});
this.sendToDLQ(existing);
}
} else {
// First failure - track it
const newMessage: DLQMessage<T> = {
id: messageId,
payload,
error: {
message: (error as Error).message,
stack: (error as Error).stack,
code: (error as any).code,
},
attempts: 1,
firstAttemptTime: now,
lastAttemptTime: now,
source,
};
this.messages.set(messageId, newMessage);
console.warn(`⚠️ Message ${messageId} failed (attempt 1/${this.options.maxRetries})`, {
source,
error: newMessage.error.message,
});
}
throw error;
}
}
/**
* Send message to DLQ (persist and alert)
*/
private sendToDLQ(message: DLQMessage<T>): void {
// Persist to disk
this.persistToDisk();
// Check if we should alert
if (this.options.alertThreshold && this.messages.size >= this.options.alertThreshold) {
console.error(`🚨 DLQ threshold exceeded: ${this.messages.size} messages in queue`);
// TODO: Send alert (email, Slack, PagerDuty)
}
}
/**
* Manually replay DLQ message (for debugging)
*/
async replay<R>(
messageId: string,
processor: (payload: T) => Promise<R>
): Promise<R> {
const message = this.messages.get(messageId);
if (!message) {
throw new Error(`Message ${messageId} not found in DLQ`);
}
try {
const result = await processor(message.payload);
// Success - remove from DLQ
this.messages.delete(messageId);
this.persistToDisk();
console.log(`✅ DLQ message ${messageId} replayed successfully`);
return result;
} catch (error) {
console.error(`❌ DLQ message ${messageId} replay failed`, {
error: (error as Error).message,
});
throw error;
}
}
/**
* Get all DLQ messages
*/
getMessages(): DLQMessage<T>[] {
return Array.from(this.messages.values());
}
/**
* Get DLQ statistics
*/
getStats() {
return {
totalMessages: this.messages.size,
oldestMessage: this.getOldestMessage(),
newestMessage: this.getNewestMessage(),
messagesBySource: this.groupBySource(),
};
}
private getOldestMessage(): DLQMessage<T> | undefined {
let oldest: DLQMessage<T> | undefined;
for (const message of this.messages.values()) {
if (!oldest || message.firstAttemptTime < oldest.firstAttemptTime) {
oldest = message;
}
}
return oldest;
}
private getNewestMessage(): DLQMessage<T> | undefined {
let newest: DLQMessage<T> | undefined;
for (const message of this.messages.values()) {
if (!newest || message.firstAttemptTime > newest.firstAttemptTime) {
newest = message;
}
}
return newest;
}
private groupBySource(): Record<string, number> {
const groups: Record<string, number> = {};
for (const message of this.messages.values()) {
groups[message.source] = (groups[message.source] || 0) + 1;
}
return groups;
}
/**
* Persist DLQ to disk (for durability)
*/
private persistToDisk(): void {
if (!this.options.persistPath) return;
const data = JSON.stringify(Array.from(this.messages.entries()), null, 2);
fs.writeFileSync(this.options.persistPath, data);
}
/**
* Load DLQ from disk
*/
private loadFromDisk(): void {
if (!this.options.persistPath) return;
try {
if (fs.existsSync(this.options.persistPath)) {
const data = fs.readFileSync(this.options.persistPath, 'utf-8');
const entries = JSON.parse(data);
this.messages = new Map(entries);
console.log(`📥 Loaded ${this.messages.size} messages from DLQ persistence`);
}
} catch (error) {
console.error('Failed to load DLQ from disk:', error);
}
}
}
Usage Example
// Example: Use DLQ for database operations
import { DeadLetterQueue } from './lib/dlq/dead-letter-queue';
const dbDLQ = new DeadLetterQueue({
maxRetries: 3,
persistPath: './dlq-database.json',
alertThreshold: 10,
});
async function createBooking(bookingData: any) {
return dbDLQ.process(
`booking-${bookingData.id}`,
bookingData,
'createBooking',
async (data) => {
// Attempt database insert
const result = await db.bookings.insert(data);
return result;
}
);
}
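Because DLQ messages need manual investigation, it can help to expose them through a small admin surface. A hedged sketch using the dbDLQ instance above (the Express app, authentication, and the db handle are assumed to exist elsewhere in your server):
// Example (illustrative): admin endpoints to inspect and replay DLQ messages
import express from 'express';

const adminRouter = express.Router();

// List DLQ contents and summary statistics
adminRouter.get('/dlq', (_req, res) => {
  res.json({ stats: dbDLQ.getStats(), messages: dbDLQ.getMessages() });
});

// Manually replay a single message after fixing the underlying issue
adminRouter.post('/dlq/:id/replay', async (req, res) => {
  try {
    const result = await dbDLQ.replay(req.params.id, async (data) => db.bookings.insert(data));
    res.json({ replayed: true, result });
  } catch (error) {
    res.status(500).json({ replayed: false, error: (error as Error).message });
  }
});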
Graceful Degradation Strategies
Graceful degradation provides partial results when full functionality isn't available, maintaining conversation flow even when dependencies fail.
Implementation: Graceful Degradation Middleware
// src/lib/middleware/graceful-degradation.ts
/**
* Graceful Degradation Middleware
*
* Provides fallback responses when full functionality fails.
*
* Features:
* - Partial result composition
* - Fallback response templates
* - User-friendly error messages
* - Maintains conversation flow
*/
export interface FallbackStrategy<T> {
fallbackData?: T; // Static fallback data
fallbackFn?: () => Promise<T>; // Dynamic fallback function
partialResults?: boolean; // Allow partial results?
errorMessage?: string; // User-friendly error message
}
export class GracefulDegradation {
/**
* Execute with graceful degradation fallback
*/
static async execute<T>(
primaryFn: () => Promise<T>,
strategy: FallbackStrategy<T>,
context: string = 'operation'
): Promise<{ data: T; degraded: boolean; error?: string }> {
try {
const data = await primaryFn();
return { data, degraded: false };
} catch (error) {
console.warn(`⚠️ ${context} failed, applying graceful degradation`, {
error: (error as Error).message,
});
// Try fallback function
if (strategy.fallbackFn) {
try {
const fallbackData = await strategy.fallbackFn();
return {
data: fallbackData,
degraded: true,
error: strategy.errorMessage || 'Using cached or fallback data',
};
} catch (fallbackError) {
console.error(`❌ Fallback function failed for ${context}`, {
error: (fallbackError as Error).message,
});
}
}
// Use static fallback data
if (strategy.fallbackData !== undefined) {
return {
data: strategy.fallbackData,
degraded: true,
error: strategy.errorMessage || 'Using default data',
};
}
// No fallback available - throw original error
throw error;
}
}
/**
* Compose partial results from multiple tool calls
*/
static async composePartial<T extends Record<string, any>>(
operations: Record<keyof T, () => Promise<any>>,
required: (keyof T)[] = []
): Promise<{ data: Partial<T>; degraded: boolean; failures: string[] }> {
const results: Partial<T> = {};
const failures: string[] = [];
// Execute all operations in parallel
const entries = Object.entries(operations) as [keyof T, () => Promise<any>][];
const promises = entries.map(async ([key, fn]) => {
try {
results[key] = await fn();
} catch (error) {
failures.push(String(key));
console.warn(`⚠️ Operation ${String(key)} failed in partial composition`, {
error: (error as Error).message,
});
}
});
await Promise.allSettled(promises);
// Check if required operations succeeded
const missingRequired = required.filter(key => !(key in results));
if (missingRequired.length > 0) {
throw new Error(
`Required operations failed: ${missingRequired.join(', ')}`
);
}
return {
data: results,
degraded: failures.length > 0,
failures,
};
}
}
Usage Example
// Example: Search fitness classes with fallback to cached data
import { GracefulDegradation } from './lib/middleware/graceful-degradation';
async function searchFitnessClasses(location: string) {
const result = await GracefulDegradation.execute(
// Primary: Live API call
async () => {
const response = await fetch(`https://api.example.com/classes?location=${encodeURIComponent(location)}`);
if (!response.ok) throw new Error('API unavailable');
return response.json();
},
// Fallback: Cached data from Redis
{
fallbackFn: async () => {
const cached = await redis.get(`classes:${location}`);
return cached ? JSON.parse(cached) : [];
},
errorMessage: 'Showing recently available classes (live data temporarily unavailable)',
},
'searchFitnessClasses'
);
return {
classes: result.data,
note: result.degraded ? result.error : undefined,
};
}
// Example: Compose partial results (the booking is fetched first because the
// instructor and review lookups depend on its fields)
async function getCompleteBookingInfo(bookingId: string) {
const booking = await db.bookings.findById(bookingId); // required: let this failure propagate
const result = await GracefulDegradation.composePartial(
{
instructor: () => api.getInstructor(booking.instructorId),
reviews: () => api.getReviews(booking.classId),
},
[] // instructor and reviews are optional enrichments
);
if (result.degraded) {
console.log(`⚠️ Partial results: ${result.failures.join(', ')} unavailable`);
}
return { booking, ...result.data };
}
Error Classification System
Not all errors should trigger retries. Error classification distinguishes between:
- Transient errors (network timeouts, rate limits) → Retry
- Terminal errors (400 Bad Request, 401 Unauthorized) → Fail immediately
Implementation: Error Classifier
// src/lib/errors/error-classifier.ts
/**
* Error Classification System
*
* Distinguishes retryable vs terminal errors to prevent retry storms.
*
* Features:
* - HTTP status code classification
* - Network error detection
* - Custom error type support
* - Detailed error metadata
*/
export enum ErrorType {
TRANSIENT = 'TRANSIENT', // Temporary failure - retry
TERMINAL = 'TERMINAL', // Permanent failure - don't retry
RATE_LIMIT = 'RATE_LIMIT', // Rate limit - retry with backoff
AUTHENTICATION = 'AUTHENTICATION', // Auth failure - don't retry
VALIDATION = 'VALIDATION', // Input validation - don't retry
UNKNOWN = 'UNKNOWN', // Unknown error - conservative retry
}
export interface ClassifiedError {
type: ErrorType;
isRetryable: boolean;
statusCode?: number;
errorCode?: string;
message: string;
retryAfter?: number; // Milliseconds to wait before retry
metadata?: Record<string, any>;
}
export class ErrorClassifier {
/**
* Classify error and determine retry strategy
*/
static classify(error: any): ClassifiedError {
// HTTP status code classification
if (error.response?.status) {
return this.classifyHttpError(error);
}
// Network error classification
if (error.code) {
return this.classifyNetworkError(error);
}
// Custom error type classification
if (error.type) {
return this.classifyCustomError(error);
}
// Unknown error - conservative retry
return {
type: ErrorType.UNKNOWN,
isRetryable: true, // Conservative: allow retry
message: error.message || 'Unknown error',
};
}
/**
* Classify HTTP status code errors
*/
private static classifyHttpError(error: any): ClassifiedError {
const status = error.response.status;
const headers = error.response.headers || {};
// 408 Request Timeout - transient
if (status === 408) {
return {
type: ErrorType.TRANSIENT,
isRetryable: true,
statusCode: status,
message: 'Request timeout',
};
}
// 429 Too Many Requests - rate limit
if (status === 429) {
const retryAfter = headers['retry-after']
? parseInt(headers['retry-after'], 10) * 1000
: 60000; // Default: 60 seconds
return {
type: ErrorType.RATE_LIMIT,
isRetryable: true,
statusCode: status,
message: 'Rate limit exceeded',
retryAfter,
};
}
// 401 Unauthorized, 403 Forbidden - authentication
if (status === 401 || status === 403) {
return {
type: ErrorType.AUTHENTICATION,
isRetryable: false,
statusCode: status,
message: 'Authentication failed',
};
}
// 400 Bad Request, 422 Unprocessable Entity - validation
if (status === 400 || status === 422) {
return {
type: ErrorType.VALIDATION,
isRetryable: false,
statusCode: status,
message: 'Validation error',
};
}
// 404 Not Found, 405 Method Not Allowed, 410 Gone - terminal
if ([404, 405, 410].includes(status)) {
return {
type: ErrorType.TERMINAL,
isRetryable: false,
statusCode: status,
message: 'Resource not found or not allowed',
};
}
// 500 Internal Server Error, 502 Bad Gateway, 503 Service Unavailable, 504 Gateway Timeout - transient
if ([500, 502, 503, 504].includes(status)) {
return {
type: ErrorType.TRANSIENT,
isRetryable: true,
statusCode: status,
message: 'Server error (transient)',
};
}
// Default: 4xx terminal, 5xx transient
return {
type: status >= 400 && status < 500 ? ErrorType.TERMINAL : ErrorType.TRANSIENT,
isRetryable: status >= 500,
statusCode: status,
message: error.message || `HTTP ${status} error`,
};
}
/**
* Classify network errors
*/
private static classifyNetworkError(error: any): ClassifiedError {
const retryableNetworkErrors = new Set([
'ECONNREFUSED',
'ETIMEDOUT',
'ENOTFOUND',
'ECONNRESET',
'EPIPE',
'EHOSTUNREACH',
'EAI_AGAIN',
]);
if (retryableNetworkErrors.has(error.code)) {
return {
type: ErrorType.TRANSIENT,
isRetryable: true,
errorCode: error.code,
message: `Network error: ${error.code}`,
};
}
return {
type: ErrorType.TERMINAL,
isRetryable: false,
errorCode: error.code,
message: `Network error: ${error.code}`,
};
}
/**
* Classify custom error types
*/
private static classifyCustomError(error: any): ClassifiedError {
switch (error.type) {
case 'RATE_LIMIT':
return {
type: ErrorType.RATE_LIMIT,
isRetryable: true,
message: error.message,
retryAfter: error.retryAfter || 60000,
};
case 'VALIDATION':
return {
type: ErrorType.VALIDATION,
isRetryable: false,
message: error.message,
};
default:
return {
type: ErrorType.UNKNOWN,
isRetryable: true,
message: error.message,
};
}
}
}
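Usage Example
A usage sketch: classify a failed request before deciding whether to hand it back to the retry layer (this assumes a runtime where the fetch Headers object is iterable, as in Node 18+ and browsers):
// Example: classify a failed API call before deciding whether to retry
import { ErrorClassifier, ErrorType } from './lib/errors/error-classifier';

async function fetchAvailability(url: string) {
  try {
    const response = await fetch(url);
    if (!response.ok) {
      const error: any = new Error(`API error: ${response.statusText}`);
      error.response = { status: response.status, headers: Object.fromEntries(response.headers) };
      throw error;
    }
    return await response.json();
  } catch (error) {
    const classified = ErrorClassifier.classify(error);
    if (classified.type === ErrorType.RATE_LIMIT) {
      console.warn(`Rate limited; retry after ${classified.retryAfter}ms`);
    } else if (!classified.isRetryable) {
      console.error(`Terminal ${classified.type} error: ${classified.message}`);
    }
    throw error; // let the retry layer act on the classification
  }
}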
Retry Policy Manager
Combine exponential backoff, circuit breakers, and error classification into a unified Retry Policy Manager.
// src/lib/retry/retry-policy-manager.ts
/**
* Retry Policy Manager
*
* Unified retry system combining exponential backoff, circuit breakers, and error classification.
*/
import { ExponentialBackoffRetry } from './exponential-backoff';
import { CircuitBreaker } from '../circuit-breaker/circuit-breaker';
import { ErrorClassifier, ErrorType } from '../errors/error-classifier';
export interface RetryPolicyConfig {
circuitBreaker?: {
enabled: boolean;
failureThreshold: number;
timeout: number;
};
retry?: {
maxAttempts: number;
initialDelayMs: number;
maxDelayMs: number;
};
}
export class RetryPolicyManager {
private circuitBreaker?: CircuitBreaker;
private retry: ExponentialBackoffRetry;
constructor(
private name: string,
config: RetryPolicyConfig = {}
) {
// Initialize circuit breaker
if (config.circuitBreaker?.enabled) {
this.circuitBreaker = new CircuitBreaker(name, {
failureThreshold: config.circuitBreaker.failureThreshold ?? 5,
timeout: config.circuitBreaker.timeout ?? 60000,
});
}
// Initialize retry logic
this.retry = new ExponentialBackoffRetry({
maxAttempts: config.retry?.maxAttempts ?? 5,
initialDelayMs: config.retry?.initialDelayMs ?? 1000,
maxDelayMs: config.retry?.maxDelayMs ?? 30000,
jitterType: 'full',
});
}
/**
* Execute with full retry policy (circuit breaker + retry + error classification)
*/
async execute<T>(fn: () => Promise<T>): Promise<T> {
// Wrap in circuit breaker if enabled
const executeFn = this.circuitBreaker
? () => this.circuitBreaker!.execute(fn)
: fn;
// Apply retry logic with error classification
return this.retry.execute(async () => {
try {
return await executeFn();
} catch (error) {
// Classify error
const classified = ErrorClassifier.classify(error);
// If not retryable, throw immediately
if (!classified.isRetryable) {
const enhancedError: any = new Error(classified.message);
enhancedError.type = classified.type;
enhancedError.statusCode = classified.statusCode;
enhancedError.errorCode = classified.errorCode;
throw enhancedError;
}
// If rate limited, wait before retrying
if (classified.type === ErrorType.RATE_LIMIT && classified.retryAfter) {
console.warn(`⚠️ Rate limited, waiting ${classified.retryAfter}ms before retry`);
await new Promise(resolve => setTimeout(resolve, classified.retryAfter!));
}
throw error;
}
}, this.name);
}
/**
* Get health status
*/
getHealth() {
return {
name: this.name,
circuitBreakerState: this.circuitBreaker?.getState(),
circuitBreakerMetrics: this.circuitBreaker?.getMetrics(),
retryMetrics: this.retry.getMetrics(),
};
}
}
Usage Example
// Example: Unified retry policy for all API calls
import { RetryPolicyManager } from './lib/retry/retry-policy-manager';
const apiRetryPolicy = new RetryPolicyManager('external-api', {
circuitBreaker: {
enabled: true,
failureThreshold: 5,
timeout: 60000,
},
retry: {
maxAttempts: 5,
initialDelayMs: 1000,
maxDelayMs: 30000,
},
});
async function callExternalAPI(endpoint: string) {
return apiRetryPolicy.execute(async () => {
const response = await fetch(`https://api.example.com${endpoint}`);
if (!response.ok) {
const error: any = new Error(`API error: ${response.statusText}`);
error.response = { status: response.status, headers: response.headers };
throw error;
}
return response.json();
});
}
Health Checks with Circuit Breakers
Implement health check endpoints that report circuit breaker states and retry metrics.
// src/lib/health/health-check.ts
/**
* Health Check System with Circuit Breaker Integration
*
* Reports system health including circuit breaker states.
*/
import { CircuitBreaker, CircuitState } from '../circuit-breaker/circuit-breaker';
import { RetryPolicyManager } from '../retry/retry-policy-manager';
export interface HealthStatus {
status: 'healthy' | 'degraded' | 'unhealthy';
timestamp: Date;
circuitBreakers: {
name: string;
state: CircuitState;
healthy: boolean;
}[];
retryPolicies: {
name: string;
metrics: any;
}[];
uptime: number;
}
export class HealthCheck {
private static circuitBreakers: Map<string, CircuitBreaker> = new Map();
private static retryPolicies: Map<string, RetryPolicyManager> = new Map();
private static startTime: Date = new Date();
/**
* Register circuit breaker for health monitoring
*/
static registerCircuitBreaker(name: string, circuitBreaker: CircuitBreaker): void {
this.circuitBreakers.set(name, circuitBreaker);
}
/**
* Register retry policy for health monitoring
*/
static registerRetryPolicy(name: string, retryPolicy: RetryPolicyManager): void {
this.retryPolicies.set(name, retryPolicy);
}
/**
* Get overall health status
*/
static getStatus(): HealthStatus {
const circuitBreakers = Array.from(this.circuitBreakers.entries()).map(([name, cb]) => ({
name,
state: cb.getState(),
healthy: cb.isHealthy(),
}));
const retryPolicies = Array.from(this.retryPolicies.entries()).map(([name, rp]) => ({
name,
metrics: rp.getHealth(),
}));
// Determine overall status: degraded if any circuit is open, unhealthy if all are
const openCircuits = circuitBreakers.filter(cb => !cb.healthy).length;
const status: HealthStatus['status'] =
openCircuits === 0 ? 'healthy'
: openCircuits === circuitBreakers.length && circuitBreakers.length > 0 ? 'unhealthy'
: 'degraded';
return {
status,
timestamp: new Date(),
circuitBreakers,
retryPolicies,
uptime: Date.now() - this.startTime.getTime(),
};
}
/**
* Express.js health check endpoint
*/
static handler(req: any, res: any): void {
// Reference the class directly so this works when passed as an unbound Express callback
const health = HealthCheck.getStatus();
const statusCode = health.status === 'healthy' ? 200 : 503;
res.status(statusCode).json(health);
}
}
Usage in MCP Server
// Example: MCP server with health check endpoint
import express from 'express';
import { HealthCheck } from './lib/health/health-check';
import { RetryPolicyManager } from './lib/retry/retry-policy-manager';
const app = express();
// Register retry policies
const dbRetryPolicy = new RetryPolicyManager('database', {
circuitBreaker: { enabled: true, failureThreshold: 5, timeout: 60000 },
retry: { maxAttempts: 5, initialDelayMs: 1000, maxDelayMs: 30000 },
});
HealthCheck.registerRetryPolicy('database', dbRetryPolicy);
// Health check endpoint
app.get('/health', HealthCheck.handler);
// MCP tool handlers
app.post('/mcp', async (req, res) => {
// Use retry policy for database operations
const result = await dbRetryPolicy.execute(async () => {
return db.query('SELECT * FROM classes');
});
res.json({ result });
});
app.listen(3000, () => {
console.log('MCP server with advanced error handling running on port 3000');
});
Conclusion: Building Resilient MCP Servers
Advanced error handling transforms fragile MCP servers into production-grade systems that stay reliable at ChatGPT scale. By implementing these patterns, you ensure:
✅ Retry storms are prevented with exponential backoff and jitter
✅ Cascade failures are isolated with circuit breakers
✅ Poison messages don't crash your server with dead letter queues
✅ Users get partial results when full functionality fails (graceful degradation)
✅ Retry logic is intelligent with error classification
✅ System health is transparent with health check endpoints
Next Steps
Ready to implement advanced error handling in your MCP server?
- Start with exponential backoff retry logic for all external API calls
- Add circuit breakers to protect critical dependencies (database, payment gateways)
- Implement dead letter queues for operations with complex failure modes
- Use graceful degradation to provide partial results when dependencies fail
- Deploy health checks to monitor circuit breaker states in production
Build your ChatGPT app with production-grade error handling using MakeAIHQ.com — the no-code platform that generates MCP servers with built-in retry logic, circuit breakers, and graceful degradation. From zero to ChatGPT App Store in 48 hours, no coding required.
Related Resources:
- MCP Server Development: Complete Production Guide — Master MCP protocol fundamentals and architecture patterns
- MCP Server Error Recovery Patterns — Foundation-level error recovery strategies
- MCP Server Monitoring & Logging Guide — Track errors, metrics, and system health
- MCP Server Deployment Best Practices — Deploy resilient MCP servers to production
- MCP Server Load Balancing Strategies — Scale beyond single-server bottlenecks
- ChatGPT App Testing & QA Complete Guide — Test error handling in ChatGPT conversations
External Resources:
- AWS SQS Dead Letter Queues — Amazon's DLQ implementation patterns
- Netflix Hystrix Circuit Breaker — Industry-standard circuit breaker design
- Exponential Backoff and Jitter — AWS best practices for retry logic
About the Author: This guide was created by the MakeAIHQ engineering team based on production experience running MCP servers for thousands of ChatGPT apps. We've battle-tested these patterns at scale to ensure your apps achieve OpenAI approval on first submission.
Last Updated: December 25, 2026