MCP Server Rate Limiting for ChatGPT Apps

Production ChatGPT apps built on the Model Context Protocol (MCP) require robust rate limiting to prevent abuse, ensure fair resource allocation, and maintain service quality. Without proper rate limiting, a single misbehaving client or coordinated attack can overwhelm your MCP server, degrading performance for all users. This guide provides production-ready TypeScript implementations of industry-standard rate limiting algorithms, distributed quota management with Redis, and comprehensive monitoring strategies that protect your ChatGPT application infrastructure while maintaining excellent user experience.

Rate limiting isn't just about preventing denial-of-service attacks—it's a critical component of your business model. Whether you're implementing tiered pricing, enforcing API quotas, or protecting expensive AI model calls, rate limiting ensures that resources are distributed fairly according to your business rules. This article covers everything from basic in-memory rate limiting for small deployments to sophisticated distributed systems using Redis for multi-node production environments.

Understanding Rate Limiting Algorithms

Before implementing rate limiting for your MCP server, you need to understand the four primary algorithms and their trade-offs. Each algorithm has distinct characteristics that make it suitable for different use cases.

Token Bucket Algorithm: The most popular approach for API rate limiting, token bucket maintains a "bucket" of tokens that refill at a constant rate. Each request consumes one or more tokens. When the bucket is empty, requests are rejected. This algorithm allows controlled burst traffic—if a user hasn't made requests recently, they can use accumulated tokens for a burst of activity. This is ideal for ChatGPT apps where users might send several messages in quick succession after a period of inactivity.

Leaky Bucket Algorithm: Similar to token bucket but enforces strict rate smoothing. Requests enter a queue (the "bucket") and are processed at a fixed rate. Excess requests overflow and are rejected. Unlike token bucket, leaky bucket doesn't allow bursts—it maintains perfectly constant throughput. This is useful when you need to protect downstream services that can't handle traffic spikes, such as third-party APIs with strict rate limits.
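
Leaky bucket is not implemented later in this guide, so here is a minimal illustrative sketch (class and field names are ours, not taken from any library): a bounded queue that drains at a fixed rate, rejecting anything that would overflow.

/**
 * Illustrative leaky bucket sketch: requests join a bounded queue that
 * drains at a fixed rate; anything beyond capacity is rejected.
 */
class LeakyBucket {
  private queueDepth = 0;           // Requests currently held in the bucket
  private lastDrainTime = Date.now();

  constructor(
    private capacity: number,       // Maximum queued requests
    private drainRatePerSec: number // Requests processed per second
  ) {}

  public tryEnqueue(): boolean {
    this.drain();
    if (this.queueDepth < this.capacity) {
      this.queueDepth += 1;
      return true;                  // Accepted at the smoothed rate
    }
    return false;                   // Bucket overflowed, reject
  }

  private drain(): void {
    const now = Date.now();
    const elapsedSec = (now - this.lastDrainTime) / 1000;
    // Remove requests that would have been processed since the last check
    this.queueDepth = Math.max(0, this.queueDepth - elapsedSec * this.drainRatePerSec);
    this.lastDrainTime = now;
  }
}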

Fixed Window Counter: The simplest algorithm—count requests within fixed time windows (e.g., per minute). At the window boundary, the counter resets. While easy to implement, this has a critical flaw: users can make twice the limit by clustering requests at window boundaries (e.g., 100 requests at 11:59 and 100 at 12:00 gives 200 requests in one minute). This "boundary problem" makes fixed window unsuitable for strict rate limiting.
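
To make the boundary problem concrete, here is a minimal fixed window counter sketch (illustrative only); note how the counter resets at each window boundary regardless of how recently the previous burst occurred.

/**
 * Illustrative fixed window counter sketch. The counter resets whenever a
 * request falls into a new window, which is exactly what permits the
 * "100 requests at 11:59 plus 100 at 12:00" burst described above.
 */
class FixedWindowCounter {
  private currentWindowIndex = -1;
  private count = 0;

  constructor(
    private windowMs: number,    // e.g. 60000 for per-minute limits
    private maxRequests: number
  ) {}

  public allow(): boolean {
    const windowIndex = Math.floor(Date.now() / this.windowMs);

    if (windowIndex !== this.currentWindowIndex) {
      // New window: reset, even if the previous burst ended milliseconds ago
      this.currentWindowIndex = windowIndex;
      this.count = 0;
    }

    if (this.count < this.maxRequests) {
      this.count += 1;
      return true;
    }
    return false;
  }
}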

Sliding Window Counter: Combines the efficiency of fixed window with accuracy that approaches true rate limiting. It estimates the request count at the current moment using a weighted combination of the previous and current windows. For example, 30 seconds into a 60-second window, it counts 50% of the previous window's requests plus 100% of the current window's requests. This provides accurate rate limiting without maintaining per-request timestamps.
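
The weighted estimate can be expressed in a few lines. The helper below is an illustrative sketch of that calculation, separate from the Redis-backed implementation later in this article (which tracks individual request timestamps instead).

/**
 * Illustrative sliding window estimate: weight the previous window by how
 * much of it still overlaps the trailing window ending now.
 */
function slidingWindowEstimate(
  previousWindowCount: number,
  currentWindowCount: number,
  windowMs: number,
  elapsedInCurrentWindowMs: number
): number {
  const previousWeight = (windowMs - elapsedInCurrentWindowMs) / windowMs;
  return previousWindowCount * previousWeight + currentWindowCount;
}

// Example: 60s window, 30s elapsed, 80 requests in the previous window and
// 40 so far in the current one -> 80 * 0.5 + 40 = 80 estimated requests.
const estimatedRequests = slidingWindowEstimate(80, 40, 60000, 30000);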

For ChatGPT MCP servers, token bucket and sliding window are the recommended algorithms. Token bucket works best for user-facing endpoints where burst traffic is expected. Sliding window is ideal for backend-to-backend communication or strict quota enforcement. The code examples in this article focus on these two algorithms with production-ready implementations.

Production-Ready Token Bucket Implementation

Here's a complete TypeScript implementation of the token bucket algorithm with configurable capacity, refill rate, and burst handling. This version is production-ready with comprehensive error handling and TypeScript types.

/**
 * Production-grade Token Bucket Rate Limiter
 *
 * Features:
 * - Configurable capacity and refill rate
 * - Burst traffic support
 * - TypeScript strict mode compatible
 * - Memory-efficient implementation
 * - Single-process deployments only (no cross-instance coordination)
 */

interface TokenBucketConfig {
  capacity: number;        // Maximum tokens in bucket
  refillRate: number;      // Tokens added per second
  initialTokens?: number;  // Starting token count (default: capacity)
}

interface TokenBucketState {
  tokens: number;
  lastRefillTime: number;
}

export class TokenBucket {
  private capacity: number;
  private refillRate: number;
  private tokens: number;
  private lastRefillTime: number;

  constructor(config: TokenBucketConfig) {
    this.capacity = config.capacity;
    this.refillRate = config.refillRate;
    this.tokens = config.initialTokens ?? config.capacity;
    this.lastRefillTime = Date.now();
  }

  /**
   * Attempt to consume tokens from the bucket
   * @param tokensRequired Number of tokens to consume (default: 1)
   * @returns true if tokens were consumed, false if insufficient tokens
   */
  public consume(tokensRequired: number = 1): boolean {
    this.refill();

    if (this.tokens >= tokensRequired) {
      this.tokens -= tokensRequired;
      return true;
    }

    return false;
  }

  /**
   * Refill bucket based on elapsed time since last refill
   */
  private refill(): void {
    const now = Date.now();
    const elapsedMs = now - this.lastRefillTime;
    const elapsedSeconds = elapsedMs / 1000;

    const tokensToAdd = elapsedSeconds * this.refillRate;
    this.tokens = Math.min(this.capacity, this.tokens + tokensToAdd);
    this.lastRefillTime = now;
  }

  /**
   * Get current token count (after refill calculation)
   */
  public getAvailableTokens(): number {
    this.refill();
    return Math.floor(this.tokens);
  }

  /**
   * Calculate time until next token is available
   * @returns milliseconds until next token, or 0 if tokens available
   */
  public getTimeUntilNextToken(): number {
    this.refill();

    if (this.tokens >= 1) {
      return 0;
    }

    const tokensNeeded = 1 - this.tokens;
    const secondsNeeded = tokensNeeded / this.refillRate;
    return Math.ceil(secondsNeeded * 1000);
  }

  /**
   * Get current state for persistence/monitoring
   */
  public getState(): TokenBucketState {
    this.refill();
    return {
      tokens: this.tokens,
      lastRefillTime: this.lastRefillTime
    };
  }

  /**
   * Timestamp of the last refill without triggering a new refill.
   * Used by the limiter's cleanup() to detect inactive buckets.
   */
  public getLastRefillTime(): number {
    return this.lastRefillTime;
  }

  /**
   * Restore state from persistence
   */
  public static fromState(
    config: TokenBucketConfig,
    state: TokenBucketState
  ): TokenBucket {
    const bucket = new TokenBucket(config);
    bucket.tokens = state.tokens;
    bucket.lastRefillTime = state.lastRefillTime;
    return bucket;
  }
}

/**
 * In-memory rate limiter using token bucket per user
 */
export class InMemoryRateLimiter {
  private buckets: Map<string, TokenBucket> = new Map();
  private config: TokenBucketConfig;

  constructor(config: TokenBucketConfig) {
    this.config = config;
  }

  /**
   * Check if request should be allowed
   * @param userId Unique user identifier
   * @param tokensRequired Tokens to consume (default: 1)
   */
  public allowRequest(userId: string, tokensRequired: number = 1): boolean {
    let bucket = this.buckets.get(userId);

    if (!bucket) {
      bucket = new TokenBucket(this.config);
      this.buckets.set(userId, bucket);
    }

    return bucket.consume(tokensRequired);
  }

  /**
   * Get rate limit info for a user
   */
  public getRateLimitInfo(userId: string): {
    available: number;
    capacity: number;
    retryAfterMs: number;
  } {
    let bucket = this.buckets.get(userId);

    if (!bucket) {
      bucket = new TokenBucket(this.config);
      this.buckets.set(userId, bucket);
    }

    const available = bucket.getAvailableTokens();
    const retryAfterMs = bucket.getTimeUntilNextToken();

    return {
      available,
      capacity: this.config.capacity,
      retryAfterMs
    };
  }

  /**
   * Clean up old buckets to prevent memory leaks
   * Call this periodically (e.g., every 5 minutes)
   */
  public cleanup(inactiveThresholdMs: number = 3600000): void {
    const now = Date.now();
    const toDelete: string[] = [];

    for (const [userId, bucket] of this.buckets.entries()) {
      // getState() triggers a refill and would reset the timestamp to "now",
      // so read the last refill time through the non-refilling accessor.
      if (now - bucket.getLastRefillTime() > inactiveThresholdMs) {
        toDelete.push(userId);
      }
    }

    toDelete.forEach(userId => this.buckets.delete(userId));
  }
}

This implementation provides a production-ready token bucket with cleanup to prevent memory leaks in long-running servers. The cleanup() method should be called periodically to remove inactive users. For more advanced implementations, see our guide on MCP Server Performance Optimization.
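
As a minimal usage sketch (the module path, capacity, refill rate, and intervals are illustrative), the limiter can be created once per process and cleaned up on a timer:

// Illustrative wiring (module path and values are examples only)
import { InMemoryRateLimiter } from './token-bucket';

const limiter = new InMemoryRateLimiter({ capacity: 100, refillRate: 10 });

function handleRequest(userId: string): void {
  if (!limiter.allowRequest(userId)) {
    const info = limiter.getRateLimitInfo(userId);
    throw new Error(`Rate limited; retry in ${info.retryAfterMs}ms`);
  }
  // ... process the MCP request
}

// Every 5 minutes, drop buckets that have been inactive for over an hour
setInterval(() => limiter.cleanup(60 * 60 * 1000), 5 * 60 * 1000);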

Redis-Based Distributed Rate Limiting

For multi-instance deployments, in-memory rate limiting won't work—you need a distributed solution. Redis is the industry standard for distributed rate limiting due to its atomic operations and expiration features.

/**
 * Redis-based Sliding Window Rate Limiter
 *
 * Uses Redis sorted sets for accurate sliding window counting
 * Suitable for distributed/multi-instance deployments
 */

import { Redis } from 'ioredis';

interface RateLimitConfig {
  windowMs: number;      // Time window in milliseconds
  maxRequests: number;   // Max requests per window
}

export interface RateLimitResult {
  allowed: boolean;
  limit: number;         // Configured max requests per window
  remaining: number;
  resetAt: Date;
  retryAfterMs?: number;
}

export class RedisRateLimiter {
  private redis: Redis;
  private config: RateLimitConfig;

  constructor(redis: Redis, config: RateLimitConfig) {
    this.redis = redis;
    this.config = config;
  }

  /**
   * Check if request should be allowed using sliding window
   * @param key Rate limit key (e.g., "user:123" or "ip:192.168.1.1")
   */
  public async allowRequest(key: string): Promise<RateLimitResult> {
    const now = Date.now();
    const windowStart = now - this.config.windowMs;
    const redisKey = `ratelimit:${key}`;

    // Multi-command pipeline for atomicity
    const pipeline = this.redis.pipeline();

    // Remove old entries outside the window
    pipeline.zremrangebyscore(redisKey, 0, windowStart);

    // Count requests in current window
    pipeline.zcard(redisKey);

    // Add current request with timestamp as score.
    // Note: the entry is added before the allow/deny decision, so rejected
    // requests still count toward the window and repeated retries stay blocked.
    pipeline.zadd(redisKey, now, `${now}-${Math.random()}`);

    // Set expiration on key
    pipeline.pexpire(redisKey, this.config.windowMs);

    const results = await pipeline.exec();

    // Extract count from pipeline results
    // results[1] is the ZCARD result: [null, count]
    const count = results?.[1]?.[1] as number ?? 0;

    const allowed = count < this.config.maxRequests;
    const remaining = Math.max(0, this.config.maxRequests - count - 1);
    const resetAt = new Date(now + this.config.windowMs);

    const result: RateLimitResult = {
      allowed,
      limit: this.config.maxRequests,
      remaining,
      resetAt
    };

    if (!allowed) {
      // Calculate retry-after based on oldest request in window
      const oldestTimestamp = await this.redis.zrange(redisKey, 0, 0, 'WITHSCORES');
      if (oldestTimestamp.length >= 2) {
        const oldestTime = parseInt(oldestTimestamp[1], 10);
        const retryAfterMs = Math.max(0, oldestTime + this.config.windowMs - now);
        result.retryAfterMs = retryAfterMs;
      }
    }

    return result;
  }

  /**
   * Get current rate limit status without consuming a request
   */
  public async getStatus(key: string): Promise<Omit<RateLimitResult, 'allowed'>> {
    const now = Date.now();
    const windowStart = now - this.config.windowMs;
    const redisKey = `ratelimit:${key}`;

    const count = await this.redis.zcount(redisKey, windowStart, now);
    const remaining = Math.max(0, this.config.maxRequests - count);
    const resetAt = new Date(now + this.config.windowMs);

    return { limit: this.config.maxRequests, remaining, resetAt };
  }

  /**
   * Reset rate limit for a specific key
   * Useful for administrative overrides
   */
  public async reset(key: string): Promise<void> {
    const redisKey = `ratelimit:${key}`;
    await this.redis.del(redisKey);
  }

  /**
   * Increment rate limit for a key by a custom amount
   * Useful for operations that cost more than 1 request
   */
  public async consume(key: string, cost: number): Promise<RateLimitResult> {
    const now = Date.now();
    const windowStart = now - this.config.windowMs;
    const redisKey = `ratelimit:${key}`;

    const pipeline = this.redis.pipeline();
    pipeline.zremrangebyscore(redisKey, 0, windowStart);
    pipeline.zcard(redisKey);

    // Add multiple entries for multi-cost operations
    for (let i = 0; i < cost; i++) {
      pipeline.zadd(redisKey, now, `${now}-${i}-${Math.random()}`);
    }

    pipeline.pexpire(redisKey, this.config.windowMs);

    const results = await pipeline.exec();
    const count = results?.[1]?.[1] as number ?? 0;

    const allowed = count + cost <= this.config.maxRequests;
    const remaining = Math.max(0, this.config.maxRequests - count - cost);
    const resetAt = new Date(now + this.config.windowMs);

    return { allowed, limit: this.config.maxRequests, remaining, resetAt };
  }
}

This Redis implementation uses sorted sets to track request timestamps, providing accurate sliding window rate limiting. For caching strategies that complement rate limiting, see our Redis Caching Patterns for ChatGPT Apps guide.
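
A minimal usage sketch, assuming a local Redis instance and a limit of 100 requests per minute per user (connection details and the key prefix are illustrative):

// Illustrative usage (connection details and key prefix are examples only)
import { Redis } from 'ioredis';
import { RedisRateLimiter } from './redis-rate-limiter';

const redis = new Redis({ host: 'localhost', port: 6379 });
const limiter = new RedisRateLimiter(redis, {
  windowMs: 60000,     // 1 minute
  maxRequests: 100     // 100 requests/minute per key
});

async function handleToolCall(userId: string): Promise<void> {
  const result = await limiter.allowRequest(`user:${userId}`);

  if (!result.allowed) {
    throw new Error(`Rate limit exceeded; retry after ${result.retryAfterMs ?? 0}ms`);
  }

  // ... execute the MCP tool
}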

Express Middleware Integration

Integrate rate limiting seamlessly into your MCP server's Express application with custom middleware that handles HTTP headers and error responses correctly.

/**
 * Express middleware for MCP server rate limiting
 *
 * Returns 429 Too Many Requests (RFC 6585) with Retry-After and X-RateLimit-* headers
 * Wraps the Redis-based sliding window limiter from the previous section
 */

import { Request, Response, NextFunction } from 'express';
import { Redis } from 'ioredis';
import { RedisRateLimiter, RateLimitResult } from './redis-rate-limiter';

interface RateLimitMiddlewareConfig {
  rateLimiter: RedisRateLimiter;
  keyGenerator?: (req: Request) => string;
  skip?: (req: Request) => boolean;
  handler?: (req: Request, res: Response) => void;
}

export function rateLimitMiddleware(
  config: RateLimitMiddlewareConfig
) {
  const {
    rateLimiter,
    keyGenerator = defaultKeyGenerator,
    skip = () => false,
    handler = defaultHandler
  } = config;

  return async (req: Request, res: Response, next: NextFunction) => {
    // Skip rate limiting for certain requests (e.g., health checks)
    if (skip(req)) {
      return next();
    }

    const key = keyGenerator(req);

    try {
      const result = await rateLimiter.allowRequest(key);

      // Set standard rate limit headers
      setRateLimitHeaders(res, result);

      if (!result.allowed) {
        // Request exceeded rate limit
        return handler(req, res);
      }

      // Request allowed, proceed
      next();
    } catch (error) {
      // Fail open: if rate limiter errors, allow request
      console.error('Rate limiter error:', error);
      next();
    }
  };
}

/**
 * Default key generator: uses user ID from auth, falls back to IP
 */
function defaultKeyGenerator(req: Request): string {
  // Assumes auth middleware sets req.user
  const userId = (req as any).user?.id;
  if (userId) {
    return `user:${userId}`;
  }

  // Fall back to IP address
  const ip = req.ip || req.socket.remoteAddress || 'unknown';
  return `ip:${ip}`;
}

/**
 * Set standard HTTP rate limit headers
 * Based on IETF draft: draft-ietf-httpapi-ratelimit-headers
 */
function setRateLimitHeaders(res: Response, result: RateLimitResult): void {
  res.setHeader('X-RateLimit-Limit', result.limit);
  res.setHeader('X-RateLimit-Remaining', result.remaining);
  res.setHeader('X-RateLimit-Reset', result.resetAt.toISOString());

  if (!result.allowed && result.retryAfterMs) {
    const retryAfterSeconds = Math.ceil(result.retryAfterMs / 1000);
    res.setHeader('Retry-After', retryAfterSeconds.toString());
  }
}

/**
 * Default handler for rate limit exceeded
 */
function defaultHandler(req: Request, res: Response): void {
  res.status(429).json({
    error: 'Too Many Requests',
    message: 'Rate limit exceeded. Please try again later.',
    code: 'RATE_LIMIT_EXCEEDED'
  });
}

/**
 * Example: Per-endpoint rate limiting with different limits
 */
export function createPerEndpointRateLimiter(
  redis: Redis
) {
  const strictLimiter = new RedisRateLimiter(redis, {
    windowMs: 60000,    // 1 minute
    maxRequests: 10     // 10 requests/minute
  });

  const normalLimiter = new RedisRateLimiter(redis, {
    windowMs: 60000,
    maxRequests: 60     // 60 requests/minute
  });

  return {
    strict: rateLimitMiddleware({ rateLimiter: strictLimiter }),
    normal: rateLimitMiddleware({ rateLimiter: normalLimiter })
  };
}

Apply this middleware to your MCP endpoints to enforce rate limits automatically. For complete API security practices, see our API Security Best Practices for ChatGPT Apps guide.
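
A minimal wiring sketch, assuming hypothetical route paths, module names, and a REDIS_URL environment variable; expensive tool invocations get the strict limiter while the rest of the MCP surface uses the normal one:

// Illustrative wiring (route paths, module names, and env vars are examples only)
import express from 'express';
import { Redis } from 'ioredis';
import { createPerEndpointRateLimiter } from './rate-limit-middleware';

const app = express();
app.use(express.json());

const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');
const { strict, normal } = createPerEndpointRateLimiter(redis);

// Health checks bypass rate limiting entirely
app.get('/healthz', (_req, res) => {
  res.json({ ok: true });
});

// Expensive MCP tool invocations get the strict limiter
app.post('/mcp/tools/call', strict, (req, res) => {
  // ... dispatch to the MCP tool handler
  res.json({ status: 'ok' });
});

// Everything else under /mcp uses the normal limiter
app.use('/mcp', normal);

app.listen(3000);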

Per-Tool and Per-User Quota Management

Different MCP tools have different computational costs. A database query is cheaper than an AI model call. Implement per-tool quotas to reflect actual resource consumption.

/**
 * Multi-tier quota manager with per-tool rate limiting
 *
 * Supports:
 * - User subscription tiers (free, pro, enterprise)
 * - Per-tool cost multipliers
 * - Monthly quota tracking
 * - Burst allowances
 */

import { Redis } from 'ioredis';

interface ToolConfig {
  name: string;
  costMultiplier: number;  // How many "credits" this tool costs
}

interface TierConfig {
  name: string;
  monthlyQuota: number;     // Total monthly requests
  burstLimit: number;       // Max requests per minute
  toolLimits?: Record<string, number>;  // Per-tool overrides
}

interface QuotaStatus {
  tier: string;
  monthlyUsed: number;
  monthlyRemaining: number;
  burstRemaining: number;
  resetAt: Date;
}

export class QuotaManager {
  private redis: Redis;
  private tiers: Map<string, TierConfig>;
  private tools: Map<string, ToolConfig>;

  constructor(redis: Redis) {
    this.redis = redis;
    this.tiers = new Map();
    this.tools = new Map();

    this.initializeDefaultTiers();
    this.initializeDefaultTools();
  }

  private initializeDefaultTiers(): void {
    this.tiers.set('free', {
      name: 'free',
      monthlyQuota: 1000,
      burstLimit: 10
    });

    this.tiers.set('pro', {
      name: 'pro',
      monthlyQuota: 50000,
      burstLimit: 100
    });

    this.tiers.set('enterprise', {
      name: 'enterprise',
      monthlyQuota: 500000,
      burstLimit: 500
    });
  }

  private initializeDefaultTools(): void {
    this.tools.set('simple_query', { name: 'simple_query', costMultiplier: 1 });
    this.tools.set('ai_generation', { name: 'ai_generation', costMultiplier: 10 });
    this.tools.set('image_analysis', { name: 'image_analysis', costMultiplier: 15 });
    this.tools.set('batch_processing', { name: 'batch_processing', costMultiplier: 25 });
  }

  /**
   * Check if user can make request and consume quota if allowed
   */
  public async consumeQuota(
    userId: string,
    tier: string,
    toolName: string
  ): Promise<{ allowed: boolean; status: QuotaStatus }> {
    const tierConfig = this.tiers.get(tier);
    const toolConfig = this.tools.get(toolName);

    if (!tierConfig || !toolConfig) {
      throw new Error('Invalid tier or tool configuration');
    }

    const cost = toolConfig.costMultiplier;

    // Check monthly quota
    const monthlyStatus = await this.checkMonthlyQuota(userId, tierConfig, cost);
    if (!monthlyStatus.allowed) {
      return { allowed: false, status: monthlyStatus.status };
    }

    // Check burst limit and record remaining burst capacity in the status
    const burstStatus = await this.checkBurstLimit(userId, tierConfig);
    monthlyStatus.status.burstRemaining = burstStatus.remaining;
    if (!burstStatus.allowed) {
      return { allowed: false, status: monthlyStatus.status };
    }

    // Both checks passed, consume quota
    await this.incrementUsage(userId, cost);

    return { allowed: true, status: monthlyStatus.status };
  }

  private async checkMonthlyQuota(
    userId: string,
    tierConfig: TierConfig,
    cost: number
  ): Promise<{ allowed: boolean; status: QuotaStatus }> {
    const monthKey = this.getMonthKey();
    const quotaKey = `quota:monthly:${userId}:${monthKey}`;

    const used = parseInt(await this.redis.get(quotaKey) ?? '0', 10);
    const remaining = tierConfig.monthlyQuota - used;

    const allowed = remaining >= cost;

    const now = new Date();
    const resetAt = new Date(now.getFullYear(), now.getMonth() + 1, 1);

    return {
      allowed,
      status: {
        tier: tierConfig.name,
        monthlyUsed: used,
        monthlyRemaining: Math.max(0, remaining),
        burstRemaining: 0, // Filled in by consumeQuota after the burst check
        resetAt
      }
    };
  }

  private async checkBurstLimit(
    userId: string,
    tierConfig: TierConfig
  ): Promise<{ allowed: boolean; remaining: number }> {
    const burstKey = `quota:burst:${userId}`;
    const now = Date.now();
    const windowStart = now - 60000; // 1 minute sliding window

    // Use sorted set for sliding window counting
    await this.redis.zremrangebyscore(burstKey, 0, windowStart);
    const count = await this.redis.zcard(burstKey);

    return {
      allowed: count < tierConfig.burstLimit,
      remaining: Math.max(0, tierConfig.burstLimit - count)
    };
  }

  private async incrementUsage(userId: string, cost: number): Promise<void> {
    const monthKey = this.getMonthKey();
    const quotaKey = `quota:monthly:${userId}:${monthKey}`;
    const burstKey = `quota:burst:${userId}`;
    const now = Date.now();

    const pipeline = this.redis.pipeline();

    // Increment monthly usage
    pipeline.incrby(quotaKey, cost);
    pipeline.expire(quotaKey, 60 * 60 * 24 * 32); // Expire after 32 days

    // Add to burst window
    pipeline.zadd(burstKey, now, `${now}-${Math.random()}`);
    pipeline.expire(burstKey, 120); // Expire after 2 minutes

    await pipeline.exec();
  }

  /**
   * Get current quota status without consuming
   */
  public async getStatus(userId: string, tier: string): Promise<QuotaStatus> {
    const tierConfig = this.tiers.get(tier);
    if (!tierConfig) {
      throw new Error('Invalid tier configuration');
    }

    const monthKey = this.getMonthKey();
    const quotaKey = `quota:monthly:${userId}:${monthKey}`;
    const burstKey = `quota:burst:${userId}`;

    const used = parseInt(await this.redis.get(quotaKey) ?? '0', 10);
    const burstCount = await this.redis.zcard(burstKey);

    const now = new Date();
    const resetAt = new Date(now.getFullYear(), now.getMonth() + 1, 1);

    return {
      tier: tierConfig.name,
      monthlyUsed: used,
      monthlyRemaining: Math.max(0, tierConfig.monthlyQuota - used),
      burstRemaining: Math.max(0, tierConfig.burstLimit - burstCount),
      resetAt
    };
  }

  private getMonthKey(): string {
    const now = new Date();
    return `${now.getFullYear()}-${String(now.getMonth() + 1).padStart(2, '0')}`;
  }
}

This quota manager supports tiered pricing models with per-tool cost multipliers. For implementing usage-based billing on top of this system, see our Usage-Based Billing Implementation for ChatGPT Apps guide.
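
A minimal usage sketch, assuming the caller already knows the user's subscription tier and the tool name; the module path and handler shape are illustrative:

// Illustrative usage (module path and handler shape are examples only)
import { Redis } from 'ioredis';
import { QuotaManager } from './quota-manager';

const redis = new Redis();
const quotaManager = new QuotaManager(redis);

async function executeTool(userId: string, tier: string, toolName: string) {
  const { allowed, status } = await quotaManager.consumeQuota(userId, tier, toolName);

  if (!allowed) {
    // Surface quota state so the caller can build a 429 response
    return {
      error: 'QUOTA_EXCEEDED',
      monthlyRemaining: status.monthlyRemaining,
      resetAt: status.resetAt
    };
  }

  // ... run the actual MCP tool and return its result
  return { result: 'ok' };
}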

Rate Limit Response Handling and Client Retry Logic

When rate limits are exceeded, your MCP server must communicate this clearly to ChatGPT and implement client-side retry logic with exponential backoff.

/**
 * Client-side retry logic with exponential backoff
 *
 * Handles 429 responses and Retry-After headers correctly
 * Implements jittered exponential backoff
 */

interface RetryConfig {
  maxRetries: number;
  baseDelayMs: number;
  maxDelayMs: number;
  jitter: boolean;
}

interface RetryResult<T> {
  success: boolean;
  data?: T;
  error?: Error;
  attempts: number;
}

export class RateLimitedClient {
  private config: RetryConfig;

  constructor(config: Partial<RetryConfig> = {}) {
    this.config = {
      maxRetries: config.maxRetries ?? 3,
      baseDelayMs: config.baseDelayMs ?? 1000,
      maxDelayMs: config.maxDelayMs ?? 30000,
      jitter: config.jitter ?? true
    };
  }

  /**
   * Make request with automatic retry on rate limit
   */
  public async fetchWithRetry<T>(
    url: string,
    options: RequestInit = {}
  ): Promise<RetryResult<T>> {
    let lastError: Error | undefined;

    for (let attempt = 0; attempt <= this.config.maxRetries; attempt++) {
      try {
        const response = await fetch(url, options);

        if (response.ok) {
          const data = await response.json();
          return { success: true, data, attempts: attempt + 1 };
        }

        if (response.status === 429) {
          // Rate limited - check for Retry-After header
          const retryAfter = this.getRetryAfter(response);

          if (attempt < this.config.maxRetries) {
            await this.delay(retryAfter ?? this.calculateBackoff(attempt));
            continue;
          }

          lastError = new Error('Rate limit exceeded after all retries');
        } else {
          // Non-rate-limit error, don't retry
          lastError = new Error(`HTTP ${response.status}: ${response.statusText}`);
          break;
        }
      } catch (error) {
        lastError = error instanceof Error ? error : new Error(String(error));

        if (attempt < this.config.maxRetries) {
          await this.delay(this.calculateBackoff(attempt));
          continue;
        }
      }
    }

    return {
      success: false,
      error: lastError,
      attempts: this.config.maxRetries + 1
    };
  }

  /**
   * Extract Retry-After header (supports both seconds and HTTP date)
   */
  private getRetryAfter(response: Response): number | null {
    const retryAfter = response.headers.get('Retry-After');
    if (!retryAfter) return null;

    // Try parsing as seconds
    const seconds = parseInt(retryAfter, 10);
    if (!isNaN(seconds)) {
      return seconds * 1000;
    }

    // Try parsing as HTTP date
    const date = new Date(retryAfter);
    if (!isNaN(date.getTime())) {
      return Math.max(0, date.getTime() - Date.now());
    }

    return null;
  }

  /**
   * Calculate exponential backoff with optional jitter
   */
  private calculateBackoff(attempt: number): number {
    const exponentialDelay = Math.min(
      this.config.baseDelayMs * Math.pow(2, attempt),
      this.config.maxDelayMs
    );

    if (!this.config.jitter) {
      return exponentialDelay;
    }

    // Add random jitter (0-50% of delay)
    const jitterMs = Math.random() * exponentialDelay * 0.5;
    return exponentialDelay + jitterMs;
  }

  /**
   * Promise-based delay
   */
  private delay(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}

/**
 * Custom error class for rate limiting
 */
export class RateLimitError extends Error {
  public retryAfterMs: number;
  public remaining: number;
  public resetAt: Date;

  constructor(
    message: string,
    retryAfterMs: number,
    remaining: number,
    resetAt: Date
  ) {
    super(message);
    this.name = 'RateLimitError';
    this.retryAfterMs = retryAfterMs;
    this.remaining = remaining;
    this.resetAt = resetAt;
  }
}

This client implementation respects Retry-After headers and implements jittered exponential backoff to prevent thundering herd problems. For comprehensive DDoS protection strategies, see our DDoS Protection for ChatGPT Apps guide.
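
A minimal usage sketch, assuming a hypothetical MCP tool endpoint, module path, and response shape:

// Illustrative usage (endpoint URL, module path, and response shape are examples only)
import { RateLimitedClient } from './rate-limited-client';

interface ToolResponse {
  result: string;
}

const client = new RateLimitedClient({ maxRetries: 5, baseDelayMs: 500 });

async function callTool(): Promise<void> {
  const outcome = await client.fetchWithRetry<ToolResponse>(
    'https://api.example.com/mcp/tools/call',
    {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ tool: 'simple_query', args: {} })
    }
  );

  if (!outcome.success) {
    console.error(`Failed after ${outcome.attempts} attempts:`, outcome.error);
    return;
  }

  console.log('Tool result:', outcome.data);
}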

Monitoring, Metrics, and Alerting

Production rate limiting requires comprehensive monitoring to detect abuse, tune limits, and troubleshoot issues. Integrate Prometheus metrics for observability.

/**
 * Prometheus metrics for rate limiting monitoring
 *
 * Tracks:
 * - Rate limit hits by user/tier
 * - Request counts by endpoint
 * - Quota consumption patterns
 * - Error rates
 */

import { Counter, Histogram, Gauge } from 'prom-client';

export class RateLimitMetrics {
  private rateLimitHits: Counter;
  private requestDuration: Histogram;
  private quotaUsage: Gauge;
  private activeUsers: Gauge;

  constructor() {
    this.rateLimitHits = new Counter({
      name: 'rate_limit_hits_total',
      help: 'Total number of rate limit hits',
      labelNames: ['tier', 'endpoint', 'limit_type']
    });

    this.requestDuration = new Histogram({
      name: 'rate_limit_check_duration_seconds',
      help: 'Duration of rate limit checks',
      labelNames: ['limiter_type'],
      buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1]
    });

    this.quotaUsage = new Gauge({
      name: 'user_quota_usage_percent',
      help: 'Current quota usage percentage by user',
      labelNames: ['user_id', 'tier']
    });

    this.activeUsers = new Gauge({
      name: 'rate_limited_active_users',
      help: 'Number of users currently being rate limited'
    });
  }

  public recordRateLimitHit(tier: string, endpoint: string, limitType: 'burst' | 'monthly'): void {
    this.rateLimitHits.inc({ tier, endpoint, limit_type: limitType });
  }

  public recordCheckDuration(limiterType: 'redis' | 'memory', durationMs: number): void {
    this.requestDuration.observe({ limiter_type: limiterType }, durationMs / 1000);
  }

  public updateQuotaUsage(userId: string, tier: string, usagePercent: number): void {
    this.quotaUsage.set({ user_id: userId, tier }, usagePercent);
  }

  public setActiveUsers(count: number): void {
    this.activeUsers.set(count);
  }
}
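
A minimal wiring sketch, assuming an Express app that exposes the default prom-client registry on /metrics; the module path, port, and label values are illustrative:

// Illustrative wiring (module path, port, and label values are examples only)
import express from 'express';
import { register } from 'prom-client';
import { RateLimitMetrics } from './rate-limit-metrics';

const metrics = new RateLimitMetrics();
const app = express();

// Example: record a burst-limit rejection for a pro-tier user
metrics.recordRateLimitHit('pro', '/mcp/tools/call', 'burst');

// Expose all registered metrics for Prometheus to scrape
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(9090);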

Set up alerts for unusual patterns:

  • Alert: Rate limit hits exceed 5% of total requests → May indicate limits that are too strict, or bot activity
  • Alert: Single user consuming >80% of monthly quota before the month is 50% complete → Risk of quota exhaustion
  • Alert: Burst limits hit >100 times/hour → Possible abuse or client bug
  • Alert: Rate limiter check duration >50ms (p95) → Performance degradation, consider caching

Conclusion

Production-grade rate limiting protects your MCP server from abuse while ensuring fair resource allocation across all users. The token bucket algorithm provides flexibility for burst traffic, while Redis-based sliding window counters deliver accurate distributed rate limiting for multi-instance deployments. Per-tool quota management ensures that expensive AI operations don't exhaust resources at the same rate as simple queries, and comprehensive monitoring detects abuse patterns before they impact service quality.

Implementing these strategies requires careful tuning based on your specific workload and infrastructure. Start with conservative limits, monitor closely, and adjust based on real usage patterns. Remember that rate limiting is not just a defensive mechanism—it's a core component of your product's pricing model and user experience.

For detailed information on building production ChatGPT apps with MCP servers, see our comprehensive Complete Guide to Building ChatGPT Applications.

Ready to deploy production-ready ChatGPT apps without managing infrastructure? MakeAIHQ provides built-in rate limiting, quota management, and distributed Redis caching out of the box. Our no-code platform handles the complexity of production MCP servers so you can focus on building amazing ChatGPT experiences. Start your free trial today and deploy your first ChatGPT app in under 48 hours—with enterprise-grade rate limiting included.