API Gateway Patterns for ChatGPT Apps: Production Architecture Guide
Building a production-grade ChatGPT application requires more than just connecting to OpenAI's APIs. As your app scales to hundreds or thousands of users, you need robust infrastructure to handle authentication, rate limiting, caching, request transformation, and observability. This is where API gateways become essential.
An API gateway acts as a reverse proxy that sits between your ChatGPT application and backend services, providing a unified entry point for all API traffic. Leading solutions like Kong, AWS API Gateway, and Azure API Management offer enterprise-grade features that solve common challenges in ChatGPT app development.
In this comprehensive guide, you'll learn how to implement production-ready API gateway patterns specifically designed for ChatGPT applications. We'll cover architecture fundamentals, hands-on implementations with Kong and AWS API Gateway, advanced patterns like circuit breakers and response caching, and monitoring strategies using OpenTelemetry.
Whether you're building a customer service chatbot serving 10,000 requests per day or an enterprise knowledge base handling millions of interactions, these patterns will help you build scalable, secure, and maintainable ChatGPT applications. For a broader understanding of ChatGPT app architecture, see our Complete Guide to Building ChatGPT Applications.
Why API Gateways Matter for ChatGPT Apps
ChatGPT applications face unique infrastructure challenges that API gateways are designed to solve:
Rate Limiting: OpenAI enforces strict rate limits on API calls (e.g., 10,000 tokens per minute for GPT-4). Without proper gateway-level rate limiting, your application could exhaust quotas during traffic spikes, causing service disruptions. API gateways implement intelligent rate limiting algorithms (token bucket, leaky bucket, sliding window) to smooth traffic and prevent quota violations.
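To make the token bucket idea concrete, here is a minimal TypeScript sketch of a per-user limiter; the class name, capacity, and refill rate are illustrative defaults, not part of any gateway's API:
// token-bucket.ts — minimal per-user token bucket sketch
class TokenBucket {
  private tokens: number;
  private lastRefill: number;
  constructor(private capacity: number, private refillPerSecond: number) {
    this.tokens = capacity;
    this.lastRefill = Date.now();
  }
  // Returns true if the request may proceed, false if it should be rejected (HTTP 429)
  tryConsume(cost = 1): boolean {
    const now = Date.now();
    const elapsedSeconds = (now - this.lastRefill) / 1000;
    // Refill tokens based on elapsed time, capped at bucket capacity
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefill = now;
    if (this.tokens >= cost) {
      this.tokens -= cost;
      return true;
    }
    return false;
  }
}
// One bucket per user: bursts up to 20 requests, sustained 100 requests/minute
const buckets = new Map<string, TokenBucket>();
export function allowRequest(userId: string): boolean {
  if (!buckets.has(userId)) {
    buckets.set(userId, new TokenBucket(20, 100 / 60));
  }
  return buckets.get(userId)!.tryConsume();
}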
Authentication Aggregation: Modern ChatGPT apps often integrate multiple services—OpenAI APIs, vector databases like Pinecone, analytics platforms, and internal microservices. Managing authentication across these services becomes complex. API gateways centralize authentication, validating tokens once at the gateway layer before routing requests to backend services. This reduces latency and improves security by minimizing credential exposure.
Request/Response Transformation: ChatGPT applications frequently need to transform data formats between frontend clients and backend services. For example, converting user input from a mobile app's JSON format to the OpenAI Chat Completions API format, or transforming streaming responses into server-sent events for real-time UI updates. Gateways handle these transformations declaratively, keeping business logic clean.
Caching: ChatGPT API calls are expensive—both in latency (200-1000ms) and cost (around $0.03 per 1K input tokens for GPT-4). Intelligent caching of common queries can reduce costs by 40-60% while dramatically improving response times. API gateways provide built-in caching layers with TTL management, cache invalidation, and distributed cache support via Redis.
Circuit Breaking & Failover: When OpenAI experiences outages or degraded performance, your ChatGPT app needs graceful degradation strategies. Circuit breakers detect failures and automatically route traffic to fallback services (e.g., cached responses, alternative LLM providers like Anthropic Claude). This improves reliability and user experience during incidents.
For ChatGPT apps specifically targeting the OpenAI App Store, proper gateway architecture ensures you meet OpenAI's security and performance requirements, including OAuth 2.1 PKCE authentication and sub-200ms API response times.
Gateway Architecture Fundamentals
Understanding the reverse proxy pattern is essential to implementing API gateways effectively. Here's how traffic flows through a gateway in a ChatGPT application:
Reverse Proxy Pattern:
- Client sends request: POST https://api.yourapp.com/chat/completions
- Gateway intercepts the request at Layer 7 (HTTP)
- Authentication plugin validates the JWT token
- Rate limiting plugin checks quota (e.g., 100 requests/minute per user)
- Request transformation plugin formats the payload for the OpenAI API
- Gateway forwards to the backend: POST https://api.openai.com/v1/chat/completions
- Backend responds with a streaming ChatGPT response
- Gateway applies response caching (if applicable)
- Gateway returns the response to the client
This pattern provides a single point of control for cross-cutting concerns that would otherwise be duplicated across microservices.
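The same flow can be sketched as an ordered Express middleware chain. This is a conceptual sketch only—the route path, stub JWT check, and listening port are assumptions, not a specific product's API:
// gateway-flow.ts — conceptual sketch of the flow above in Express
import express from 'express';

const app = express();
app.use(express.json());

// 1-2. Gateway receives the request and authenticates it (stub bearer-token check)
app.use((req, res, next) => {
  const auth = req.headers.authorization ?? '';
  if (!auth.startsWith('Bearer ')) {
    return res.status(401).json({ error: 'Missing or invalid token' });
  }
  next(); // 3. a rate-limiting check would sit here as well
});

// 4-6. Transform the payload, forward to OpenAI, return (and optionally cache) the response
app.post('/chat/completions', async (req, res) => {
  const upstream = await fetch('https://api.openai.com/v1/chat/completions', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({ model: 'gpt-4', messages: req.body.messages })
  });
  res.status(upstream.status).json(await upstream.json());
});

app.listen(8080);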
Authentication Aggregation consolidates multiple authentication mechanisms into a unified gateway layer:
Client Request (OAuth 2.0 token)
↓
Gateway validates token (JWT verification)
↓
Gateway enriches request with service credentials:
- OpenAI API key (from Vault)
- Pinecone API key (for vector search)
- Internal service tokens
↓
Backend services receive authenticated requests
This architecture eliminates the need for each backend service to implement OAuth validation independently, reducing attack surface and simplifying credential rotation.
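A minimal sketch of the enrichment step, assuming the client's token has already been validated earlier in the chain and that backend secrets were loaded into the gateway process at startup (e.g., from Vault); the property name serviceCredentials is an illustration, not a standard:
// credential-enrichment.ts — gateway-side credential enrichment sketch
import type { Request, Response, NextFunction } from 'express';

const serviceCredentials = {
  openaiApiKey: process.env.OPENAI_API_KEY ?? '',
  pineconeApiKey: process.env.PINECONE_API_KEY ?? ''
};

export function enrichCredentials(req: Request, _res: Response, next: NextFunction): void {
  // Never forward the end user's OAuth token to backend services
  delete req.headers.authorization;
  // Expose per-service credentials to whichever proxy handler runs next
  (req as Request & { serviceCredentials?: typeof serviceCredentials }).serviceCredentials =
    serviceCredentials;
  next();
}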
Request/Response Transformation handles data format conversions declaratively. For example, transforming a mobile app's request format to OpenAI's Chat Completions API:
// Client request (simplified mobile format)
{
"message": "What's the weather in SF?",
"userId": "user_123"
}
// Gateway transforms to OpenAI format
{
"model": "gpt-4",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What's the weather in SF?"}
],
"user": "user_123",
"temperature": 0.7
}
This keeps mobile clients lightweight while maintaining compatibility with OpenAI's API specifications.
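A gateway-side version of this transformation might look like the following sketch; the mobile payload shape and the defaults mirror the example above and are assumptions rather than a fixed contract:
// transform-mobile-request.ts — mobile payload to OpenAI Chat Completions format
interface MobileChatRequest {
  message: string;
  userId: string;
}

interface OpenAIChatRequest {
  model: string;
  messages: { role: 'system' | 'user' | 'assistant'; content: string }[];
  user: string;
  temperature: number;
}

export function toOpenAIRequest(req: MobileChatRequest): OpenAIChatRequest {
  return {
    model: 'gpt-4',
    messages: [
      { role: 'system', content: 'You are a helpful assistant.' },
      { role: 'user', content: req.message }
    ],
    user: req.userId,
    temperature: 0.7
  };
}

// toOpenAIRequest({ message: "What's the weather in SF?", userId: 'user_123' })
// produces the OpenAI-formatted payload shown above.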
For more on architectural patterns in ChatGPT applications, see Microservices Architecture for ChatGPT Apps.
Kong Gateway Setup for ChatGPT Apps
Kong is an open-source API gateway built on NGINX and OpenResty (Lua), offering exceptional performance (50,000+ requests/second) and a rich plugin ecosystem. Here's a production-ready Kong setup for ChatGPT applications using declarative configuration.
Kong Declarative Configuration
# kong.yml - Declarative configuration for ChatGPT app gateway
_format_version: "3.0"
_transform: true
services:
- name: openai-chat-service
url: https://api.openai.com/v1/chat/completions
protocol: https
port: 443
connect_timeout: 5000
write_timeout: 60000
read_timeout: 60000
retries: 3
routes:
- name: chat-completions-route
paths:
- /v1/chat/completions
methods:
- POST
strip_path: false
preserve_host: false
plugins:
# Rate limiting: 100 requests/minute per consumer
- name: rate-limiting
config:
minute: 100
policy: redis
redis_host: redis.yourapp.com
redis_port: 6379
redis_timeout: 2000
fault_tolerant: true
hide_client_headers: false
# JWT authentication
- name: jwt
config:
uri_param_names:
- jwt
cookie_names:
- jwt
key_claim_name: kid
secret_is_base64: false
claims_to_verify:
- exp
maximum_expiration: 3600
# Request transformer: Add OpenAI API key
- name: request-transformer
config:
add:
headers:
- Authorization:Bearer ${OPENAI_API_KEY}
body:
- model:gpt-4
- temperature:0.7
remove:
headers:
- X-Internal-User-ID
# Response caching: Cache identical requests for 5 minutes
- name: proxy-cache
config:
strategy: memory
content_type:
- application/json
cache_ttl: 300
cache_control: false
memory:
dictionary_name: kong_cache
# CORS support for web clients
- name: cors
config:
origins:
- https://yourapp.com
- https://app.yourapp.com
methods:
- GET
- POST
- OPTIONS
headers:
- Accept
- Authorization
- Content-Type
exposed_headers:
- X-RateLimit-Limit
- X-RateLimit-Remaining
credentials: true
max_age: 3600
- name: pinecone-vector-service
url: https://your-index.pinecone.io
protocol: https
port: 443
routes:
- name: vector-search-route
paths:
- /v1/vector/search
methods:
- POST
plugins:
- name: rate-limiting
config:
minute: 500
policy: redis
redis_host: redis.yourapp.com
- name: request-transformer
config:
add:
headers:
- Api-Key:${PINECONE_API_KEY}
consumers:
- username: mobile-app
custom_id: app_mobile_v1
jwt_secrets:
- key: mobile-app-key
algorithm: HS256
secret: your-jwt-secret-here
- username: web-app
custom_id: app_web_v1
jwt_secrets:
- key: web-app-key
algorithm: HS256
secret: your-jwt-secret-here
plugins:
# Global request ID for tracing
- name: correlation-id
config:
header_name: X-Request-ID
generator: uuid
echo_downstream: true
# Global response compression
- name: response-transformer
config:
add:
headers:
- X-Gateway-Version:1.0.0
- X-Powered-By:Kong
This declarative configuration defines two services (OpenAI Chat Completions and Pinecone vector search) with comprehensive plugin configurations. The rate-limiting plugin uses Redis for distributed rate limiting across multiple Kong instances, critical for horizontal scaling.
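Once this configuration is loaded, clients call the gateway route instead of OpenAI directly. A minimal client sketch—the gateway hostname and how the JWT is issued are specific to your deployment and assumed here:
// call-gateway.ts — client call through the Kong route defined above
export async function chatViaGateway(jwt: string, userMessage: string) {
  const response = await fetch('https://api.yourapp.com/v1/chat/completions', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${jwt}`, // validated by the jwt plugin
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      messages: [{ role: 'user', content: userMessage }]
      // model and temperature are injected by the request-transformer plugin
    })
  });

  // Rate-limit headers are exposed to browsers via the CORS plugin configuration
  console.log('Remaining:', response.headers.get('X-RateLimit-Remaining'));

  if (!response.ok) {
    throw new Error(`Gateway error: ${response.status}`);
  }
  return response.json();
}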
Custom Rate Limiting Plugin
For advanced rate limiting scenarios (e.g., token-based limits for OpenAI APIs), create a custom Lua plugin:
-- kong/plugins/token-rate-limit/handler.lua
-- Custom rate limiting based on OpenAI token consumption
local kong = kong
local redis = require "resty.redis"
local cjson = require "cjson"
local TokenRateLimitHandler = {
VERSION = "1.0.0",
PRIORITY = 901, -- Execute before other rate limiting plugins
}
function TokenRateLimitHandler:access(conf)
local consumer = kong.client.get_consumer()
if not consumer then
return kong.response.exit(401, {
message = "Authentication required"
})
end
local identifier = consumer.id
local redis_client = redis:new()
redis_client:set_timeout(conf.redis_timeout)
local ok, err = redis_client:connect(conf.redis_host, conf.redis_port)
if not ok then
kong.log.err("Failed to connect to Redis: ", err)
if conf.fault_tolerant then
return -- Allow request through if Redis is down
end
return kong.response.exit(500, {
message = "Rate limiting service unavailable"
})
end
-- Check current token usage
local cache_key = "token_limit:" .. identifier .. ":" .. os.date("%Y%m%d%H%M")
local current_usage = redis_client:get(cache_key)
if current_usage == ngx.null then
current_usage = 0
else
current_usage = tonumber(current_usage)
end
-- Estimate tokens from request (rough approximation)
local request_body = kong.request.get_raw_body()
local estimated_tokens = 0
if request_body then
local body_json = cjson.decode(request_body)
if body_json.messages then
for _, message in ipairs(body_json.messages) do
-- Rough estimate: 1 token ≈ 4 characters
estimated_tokens = estimated_tokens + math.ceil(#message.content / 4)
end
end
-- Add max_tokens if specified
if body_json.max_tokens then
estimated_tokens = estimated_tokens + body_json.max_tokens
else
estimated_tokens = estimated_tokens + 500 -- Default assumption
end
end
-- Check if limit would be exceeded
if current_usage + estimated_tokens > conf.tokens_per_minute then
kong.response.set_header("X-RateLimit-Limit", conf.tokens_per_minute)
kong.response.set_header("X-RateLimit-Remaining", 0)
kong.response.set_header("X-RateLimit-Reset", os.time() + 60)
return kong.response.exit(429, {
message = "Token rate limit exceeded",
limit = conf.tokens_per_minute,
current_usage = current_usage,
estimated_request_tokens = estimated_tokens,
reset_at = os.time() + 60
})
end
-- Atomically increment the usage counter (avoids read-modify-write races)
local new_usage, incr_err = redis_client:incrby(cache_key, estimated_tokens)
if not new_usage then
kong.log.err("Failed to record token usage: ", incr_err)
new_usage = current_usage + estimated_tokens
end
redis_client:expire(cache_key, 60) -- TTL: 1 minute
-- Set rate limit headers
kong.response.set_header("X-RateLimit-Limit", conf.tokens_per_minute)
kong.response.set_header("X-RateLimit-Remaining", conf.tokens_per_minute - new_usage)
kong.response.set_header("X-RateLimit-Reset", os.time() + 60)
redis_client:set_keepalive(10000, 100)
end
return TokenRateLimitHandler
-- kong/plugins/token-rate-limit/schema.lua
local typedefs = require "kong.db.schema.typedefs"
return {
name = "token-rate-limit",
fields = {
{ config = {
type = "record",
fields = {
{ tokens_per_minute = {
type = "number",
default = 10000,
required = true,
gt = 0,
}},
{ redis_host = typedefs.host({ required = true }) },
{ redis_port = typedefs.port({ required = true, default = 6379 }) },
{ redis_timeout = { type = "number", default = 2000 } },
{ fault_tolerant = { type = "boolean", default = true } },
},
}},
},
}
This custom plugin implements token-based rate limiting, crucial for ChatGPT apps that need to respect OpenAI's token-per-minute quotas. The plugin estimates token usage from request payloads and tracks consumption in Redis with per-minute granularity.
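The same chars/4 heuristic can run on the client or in an edge function before a request ever reaches the gateway. This sketch mirrors the plugin's estimate; it is an approximation, not an exact tokenizer:
// estimate-tokens.ts — rough token estimate matching the Lua plugin's heuristic
interface ChatMessage {
  role: string;
  content: string;
}

export function estimateTokens(messages: ChatMessage[], maxTokens?: number): number {
  // ~1 token per 4 characters of message content
  const promptTokens = messages.reduce(
    (total, m) => total + Math.ceil(m.content.length / 4),
    0
  );
  // Reserve the requested completion budget (500 if unspecified, as in the plugin)
  return promptTokens + (maxTokens ?? 500);
}

// estimateTokens([{ role: 'user', content: 'What is the weather in SF?' }], 256) ≈ 263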
For more on rate limiting strategies, see Rate Limiting Patterns for ChatGPT Applications.
JWT Authentication Plugin
Configure JWT authentication to validate tokens issued by your identity provider:
-- kong/plugins/jwt-validator/handler.lua
-- Custom JWT validation with OpenID Connect support
local jwt_decoder = require "kong.plugins.jwt.jwt_parser"
local http = require "resty.http"
local cjson = require "cjson"
local JWTValidatorHandler = {
VERSION = "1.0.0",
PRIORITY = 1005, -- Execute early in plugin chain
}
-- Cache for JWKS (JSON Web Key Set)
local jwks_cache = {}
local jwks_cache_ttl = 3600 -- 1 hour
function JWTValidatorHandler:fetch_jwks(conf)
local cache_key = conf.jwks_uri
local cached_jwks = jwks_cache[cache_key]
if cached_jwks and cached_jwks.expires_at > ngx.time() then
return cached_jwks.keys
end
local httpc = http.new()
httpc:set_timeout(conf.http_timeout)
local res, err = httpc:request_uri(conf.jwks_uri, {
method = "GET",
headers = {
["Accept"] = "application/json",
},
})
if not res or res.status ~= 200 then
kong.log.err("Failed to fetch JWKS: ", err or res.status)
return nil, "JWKS fetch failed"
end
local jwks = cjson.decode(res.body)
jwks_cache[cache_key] = {
keys = jwks.keys,
expires_at = ngx.time() + jwks_cache_ttl,
}
return jwks.keys
end
function JWTValidatorHandler:access(conf)
local authorization = kong.request.get_header("Authorization")
if not authorization then
return kong.response.exit(401, {
message = "Missing Authorization header"
})
end
local token = authorization:match("Bearer%s+(.+)")
if not token then
return kong.response.exit(401, {
message = "Invalid Authorization header format"
})
end
-- Decode JWT without verification first to get key ID
local jwt, err = jwt_decoder:new(token)
if err then
return kong.response.exit(401, {
message = "Invalid JWT format",
error = err
})
end
local header = jwt.header
local kid = header.kid
if not kid then
return kong.response.exit(401, {
message = "Missing key ID in JWT header"
})
end
-- Fetch JWKS to get public key
local jwks, jwks_err = self:fetch_jwks(conf)
if jwks_err then
return kong.response.exit(500, {
message = "Failed to validate JWT",
error = jwks_err
})
end
-- Find matching key
local public_key = nil
for _, key in ipairs(jwks) do
if key.kid == kid then
public_key = key
break
end
end
if not public_key then
return kong.response.exit(401, {
message = "Public key not found for key ID",
kid = kid
})
end
-- Verify JWT signature
-- Note: jwt_parser:verify_signature expects a PEM-encoded public key, so convert
-- the matching JWK entry to PEM (e.g., with lua-resty-openssl) before this call.
local verified = jwt:verify_signature(public_key)
if not verified then
return kong.response.exit(401, {
message = "JWT signature verification failed"
})
end
-- Validate claims
local claims = jwt.claims
local now = ngx.time()
if claims.exp and claims.exp < now then
return kong.response.exit(401, {
message = "JWT has expired",
expired_at = claims.exp,
current_time = now
})
end
if claims.nbf and claims.nbf > now then
return kong.response.exit(401, {
message = "JWT not yet valid",
valid_from = claims.nbf,
current_time = now
})
end
if conf.audience_required and not claims.aud then
return kong.response.exit(401, {
message = "Missing audience claim"
})
end
if conf.audience_required and claims.aud ~= conf.expected_audience then
return kong.response.exit(401, {
message = "Invalid audience",
expected = conf.expected_audience,
received = claims.aud
})
end
-- Set consumer based on subject claim
if claims.sub then
kong.service.request.set_header("X-Consumer-ID", claims.sub)
kong.service.request.set_header("X-Consumer-Email", claims.email or "")
end
-- Store validated claims for downstream plugins
kong.ctx.shared.jwt_claims = claims
end
return JWTValidatorHandler
This production-grade JWT validator implements OpenID Connect support with JWKS (JSON Web Key Set) fetching, signature verification, and comprehensive claim validation. It's essential for ChatGPT apps using OAuth 2.1 PKCE authentication as required by the OpenAI App Store.
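For Node-based gateways, the same JWKS-backed validation can be done with the jose library; this is a minimal sketch assuming your identity provider exposes a standard JWKS endpoint and that jose is an added dependency:
// verify-jwt.ts — JWKS-backed JWT verification with jose
import { createRemoteJWKSet, jwtVerify } from 'jose';

// createRemoteJWKSet caches keys and re-fetches when it sees an unknown kid
const JWKS = createRemoteJWKSet(
  new URL('https://your-auth-provider.com/.well-known/jwks.json')
);

export async function verifyToken(token: string) {
  const { payload } = await jwtVerify(token, JWKS, {
    issuer: 'https://your-auth-provider.com',
    audience: 'https://api.yourapp.com'
  });
  // exp/nbf are checked by jwtVerify; payload.sub identifies the consumer
  return payload;
}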
For complete authentication implementation guides, see OAuth 2.1 PKCE for ChatGPT Apps.
AWS API Gateway Implementation
AWS API Gateway provides a fully managed service for creating, deploying, and securing APIs at scale. Here's a production Terraform configuration for a ChatGPT application gateway:
Terraform Configuration
# terraform/api-gateway.tf
# AWS API Gateway for ChatGPT application
terraform {
required_version = ">= 1.6"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
}
# REST API Gateway
resource "aws_api_gateway_rest_api" "chatgpt_api" {
name = "chatgpt-app-gateway"
description = "API Gateway for ChatGPT application with rate limiting and authentication"
endpoint_configuration {
types = ["REGIONAL"]
}
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Principal = "*"
Action = "execute-api:Invoke"
Resource = "*"
Condition = {
IpAddress = {
"aws:SourceIp" = [
"0.0.0.0/0" # Restrict to your IP ranges in production
]
}
}
}
]
})
}
# Lambda authorizer function
resource "aws_lambda_function" "jwt_authorizer" {
filename = "lambda/jwt-authorizer.zip"
function_name = "chatgpt-jwt-authorizer"
role = aws_iam_role.lambda_authorizer_role.arn
handler = "index.handler"
runtime = "nodejs20.x"
timeout = 10
memory_size = 256
environment {
variables = {
JWKS_URI = "https://your-auth-provider.com/.well-known/jwks.json"
AUDIENCE = "https://api.yourapp.com"
ISSUER = "https://your-auth-provider.com"
}
}
tags = {
Environment = "production"
Service = "chatgpt-gateway"
}
}
# API Gateway authorizer
resource "aws_api_gateway_authorizer" "jwt_authorizer" {
name = "jwt-authorizer"
rest_api_id = aws_api_gateway_rest_api.chatgpt_api.id
type = "TOKEN"
authorizer_uri = aws_lambda_function.jwt_authorizer.invoke_arn
authorizer_credentials = aws_iam_role.api_gateway_authorizer_role.arn
identity_source = "method.request.header.Authorization"
authorizer_result_ttl_in_seconds = 300 # Cache for 5 minutes
}
# /v1 resource
resource "aws_api_gateway_resource" "v1" {
rest_api_id = aws_api_gateway_rest_api.chatgpt_api.id
parent_id = aws_api_gateway_rest_api.chatgpt_api.root_resource_id
path_part = "v1"
}
# /v1/chat resource
resource "aws_api_gateway_resource" "chat" {
rest_api_id = aws_api_gateway_rest_api.chatgpt_api.id
parent_id = aws_api_gateway_resource.v1.id
path_part = "chat"
}
# /v1/chat/completions resource
resource "aws_api_gateway_resource" "completions" {
rest_api_id = aws_api_gateway_rest_api.chatgpt_api.id
parent_id = aws_api_gateway_resource.chat.id
path_part = "completions"
}
# POST method with request validation
resource "aws_api_gateway_method" "completions_post" {
rest_api_id = aws_api_gateway_rest_api.chatgpt_api.id
resource_id = aws_api_gateway_resource.completions.id
http_method = "POST"
authorization = "CUSTOM"
authorizer_id = aws_api_gateway_authorizer.jwt_authorizer.id
request_validator_id = aws_api_gateway_request_validator.chatgpt_validator.id
request_models = {
"application/json" = aws_api_gateway_model.chat_completion_request.name
}
request_parameters = {
"method.request.header.Authorization" = true
}
}
# Request validator
resource "aws_api_gateway_request_validator" "chatgpt_validator" {
name = "chatgpt-request-validator"
rest_api_id = aws_api_gateway_rest_api.chatgpt_api.id
validate_request_body = true
validate_request_parameters = true
}
# Request model schema
resource "aws_api_gateway_model" "chat_completion_request" {
rest_api_id = aws_api_gateway_rest_api.chatgpt_api.id
name = "ChatCompletionRequest"
description = "Schema for chat completion requests"
content_type = "application/json"
schema = jsonencode({
"$schema" = "http://json-schema.org/draft-04/schema#"
type = "object"
required = ["messages"]
properties = {
messages = {
type = "array"
minItems = 1
items = {
type = "object"
required = ["role", "content"]
properties = {
role = {
type = "string"
enum = ["system", "user", "assistant"]
}
content = {
type = "string"
minLength = 1
}
}
}
}
model = {
type = "string"
default = "gpt-4"
}
temperature = {
type = "number"
minimum = 0
maximum = 2
}
max_tokens = {
type = "integer"
minimum = 1
maximum = 4096
}
}
})
}
# HTTP integration with OpenAI
resource "aws_api_gateway_integration" "openai_integration" {
rest_api_id = aws_api_gateway_rest_api.chatgpt_api.id
resource_id = aws_api_gateway_resource.completions.id
http_method = aws_api_gateway_method.completions_post.http_method
type = "HTTP"
integration_http_method = "POST"
uri = "https://api.openai.com/v1/chat/completions"
request_templates = {
"application/json" = <<EOF
#set($inputRoot = $input.path('$'))
{
"model": "$inputRoot.model",
"messages": $inputRoot.messages,
"temperature": $inputRoot.temperature,
"max_tokens": $inputRoot.max_tokens,
"user": "$context.authorizer.principalId"
}
EOF
}
request_parameters = {
"integration.request.header.Authorization" = "'Bearer ${var.openai_api_key}'"
"integration.request.header.Content-Type" = "'application/json'"
}
timeout_milliseconds = 29000 # Max for API Gateway
}
# Usage plan for rate limiting
resource "aws_api_gateway_usage_plan" "chatgpt_usage_plan" {
name = "chatgpt-usage-plan"
description = "Rate limiting for ChatGPT API"
api_stages {
api_id = aws_api_gateway_rest_api.chatgpt_api.id
stage = aws_api_gateway_stage.production.stage_name
}
quota_settings {
limit = 100000 # 100K requests per month
period = "MONTH"
}
throttle_settings {
burst_limit = 200 # Allow bursts up to 200 requests
rate_limit = 100 # 100 requests per second sustained
}
}
# API Gateway stage
resource "aws_api_gateway_stage" "production" {
deployment_id = aws_api_gateway_deployment.production.id
rest_api_id = aws_api_gateway_rest_api.chatgpt_api.id
stage_name = "production"
cache_cluster_enabled = true
cache_cluster_size = "0.5" # 0.5 GB cache
xray_tracing_enabled = true
access_log_settings {
destination_arn = aws_cloudwatch_log_group.api_gateway_logs.arn
format = jsonencode({
requestId = "$context.requestId"
ip = "$context.identity.sourceIp"
caller = "$context.identity.caller"
user = "$context.identity.user"
requestTime = "$context.requestTime"
httpMethod = "$context.httpMethod"
resourcePath = "$context.resourcePath"
status = "$context.status"
protocol = "$context.protocol"
responseLength = "$context.responseLength"
})
}
}
# CloudWatch Logs
resource "aws_cloudwatch_log_group" "api_gateway_logs" {
name = "/aws/api-gateway/chatgpt-app"
retention_in_days = 30
}
# IAM roles (simplified - expand for production)
resource "aws_iam_role" "lambda_authorizer_role" {
name = "lambda-authorizer-role"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Principal = {
Service = "lambda.amazonaws.com"
}
Action = "sts:AssumeRole"
}
]
})
}
resource "aws_iam_role" "api_gateway_authorizer_role" {
name = "api-gateway-authorizer-role"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Principal = {
Service = "apigateway.amazonaws.com"
}
Action = "sts:AssumeRole"
}
]
})
}
# Outputs
# Note: the aws_api_gateway_deployment.production and aws_api_gateway_api_key.chatgpt_api_key
# resources referenced in this file are omitted from the snippet for brevity; define them
# alongside this configuration before applying.
output "api_gateway_url" {
value = "${aws_api_gateway_stage.production.invoke_url}/v1/chat/completions"
}
output "api_key_id" {
value = aws_api_gateway_api_key.chatgpt_api_key.id
}
This Terraform configuration provisions a complete AWS API Gateway with JWT authentication, request validation, rate limiting via usage plans, and CloudWatch logging integration.
Lambda Authorizer Implementation
// lambda/jwt-authorizer/index.js
// Lambda authorizer for JWT validation with JWKS support
const jwt = require('jsonwebtoken');
const jwksClient = require('jwks-rsa');
// JWKS client with caching
const client = jwksClient({
cache: true,
cacheMaxAge: 3600000, // 1 hour
rateLimit: true,
jwksRequestsPerMinute: 10,
jwksUri: process.env.JWKS_URI
});
/**
* Get signing key from JWKS
*/
function getKey(header, callback) {
client.getSigningKey(header.kid, (err, key) => {
if (err) {
console.error('Failed to get signing key:', err);
return callback(err);
}
const signingKey = key.publicKey || key.rsaPublicKey;
callback(null, signingKey);
});
}
/**
* Generate IAM policy document
*/
function generatePolicy(principalId, effect, resource, context = {}) {
const authResponse = {
principalId: principalId
};
if (effect && resource) {
authResponse.policyDocument = {
Version: '2012-10-17',
Statement: [
{
Action: 'execute-api:Invoke',
Effect: effect,
Resource: resource
}
]
};
}
// Add context for downstream Lambda functions
authResponse.context = context;
return authResponse;
}
/**
* Lambda handler
*/
exports.handler = async (event, context) => {
console.log('Authorization request:', JSON.stringify(event, null, 2));
// Extract token from Authorization header
const token = event.authorizationToken?.replace(/^Bearer\s+/, '');
if (!token) {
console.error('No token provided');
throw new Error('Unauthorized');
}
try {
// Decode token without verification to get header
const decoded = jwt.decode(token, { complete: true });
if (!decoded || !decoded.header) {
console.error('Invalid token format');
throw new Error('Unauthorized');
}
// Verify token signature and claims
const verified = await new Promise((resolve, reject) => {
jwt.verify(
token,
(header, callback) => getKey(header, callback),
{
audience: process.env.AUDIENCE,
issuer: process.env.ISSUER,
algorithms: ['RS256']
},
(err, decoded) => {
if (err) {
console.error('Token verification failed:', err);
return reject(err);
}
resolve(decoded);
}
);
});
console.log('Token verified:', verified);
// Generate allow policy
const policy = generatePolicy(
verified.sub,
'Allow',
event.methodArn,
{
userId: verified.sub,
email: verified.email || '',
scope: verified.scope || '',
tier: verified.tier || 'free'
}
);
return policy;
} catch (error) {
console.error('Authorization error:', error);
// Generate deny policy
return generatePolicy(
'unknown',
'Deny',
event.methodArn
);
}
};
This Lambda authorizer validates JWT tokens using JWKS (JSON Web Key Set) from your identity provider, verifies signature and claims, and generates IAM policies to allow/deny API Gateway invocations. The verified claims (user ID, email, tier) are passed as context to downstream integrations.
Request Validator
// lambda/request-validator/index.ts
// Advanced request validation with business logic
import { APIGatewayProxyEvent, APIGatewayProxyResult } from 'aws-lambda';
import Ajv, { JSONSchemaType } from 'ajv';
import addFormats from 'ajv-formats';
interface ChatMessage {
role: 'system' | 'user' | 'assistant';
content: string;
name?: string;
}
interface ChatCompletionRequest {
messages: ChatMessage[];
model?: string;
temperature?: number;
max_tokens?: number;
top_p?: number;
frequency_penalty?: number;
presence_penalty?: number;
user?: string;
}
// JSON Schema for request validation
const chatCompletionSchema: JSONSchemaType<ChatCompletionRequest> = {
type: 'object',
properties: {
messages: {
type: 'array',
minItems: 1,
maxItems: 50, // Prevent abuse
items: {
type: 'object',
properties: {
role: {
type: 'string',
enum: ['system', 'user', 'assistant']
},
content: {
type: 'string',
minLength: 1,
maxLength: 10000 // Prevent token abuse
},
name: {
type: 'string',
nullable: true
}
},
required: ['role', 'content']
}
},
model: {
type: 'string',
enum: ['gpt-4', 'gpt-4-turbo', 'gpt-3.5-turbo'],
nullable: true
},
temperature: {
type: 'number',
minimum: 0,
maximum: 2,
nullable: true
},
max_tokens: {
type: 'integer',
minimum: 1,
maximum: 4096,
nullable: true
},
top_p: {
type: 'number',
minimum: 0,
maximum: 1,
nullable: true
},
frequency_penalty: {
type: 'number',
minimum: -2,
maximum: 2,
nullable: true
},
presence_penalty: {
type: 'number',
minimum: -2,
maximum: 2,
nullable: true
},
user: {
type: 'string',
nullable: true
}
},
required: ['messages'],
additionalProperties: false
};
const ajv = new Ajv({ allErrors: true });
addFormats(ajv);
const validate = ajv.compile(chatCompletionSchema);
export const handler = async (
event: APIGatewayProxyEvent
): Promise<APIGatewayProxyResult> => {
console.log('Request validation event:', JSON.stringify(event, null, 2));
try {
// Parse request body
if (!event.body) {
return {
statusCode: 400,
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
error: 'Missing request body'
})
};
}
const request: ChatCompletionRequest = JSON.parse(event.body);
// Schema validation
const valid = validate(request);
if (!valid) {
return {
statusCode: 400,
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
error: 'Request validation failed',
details: validate.errors
})
};
}
// Business logic validation
const userTier = event.requestContext.authorizer?.tier || 'free';
const model = request.model || 'gpt-3.5-turbo';
// Tier-based model restrictions
if (userTier === 'free' && model === 'gpt-4') {
return {
statusCode: 403,
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
error: 'GPT-4 requires Professional or Business tier',
current_tier: userTier,
upgrade_url: 'https://yourapp.com/pricing'
})
};
}
// Estimate token usage
const estimatedTokens = request.messages.reduce((total, msg) => {
return total + Math.ceil(msg.content.length / 4);
}, 0) + (request.max_tokens || 500);
// Tier-based token limits
const tokenLimits: Record<string, number> = {
free: 1000,
starter: 10000,
professional: 50000,
business: 200000
};
if (estimatedTokens > tokenLimits[userTier]) {
return {
statusCode: 429,
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
error: 'Estimated token usage exceeds tier limit',
estimated_tokens: estimatedTokens,
tier_limit: tokenLimits[userTier],
current_tier: userTier
})
};
}
// Request is valid - pass through
return {
statusCode: 200,
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
validated: true,
estimated_tokens: estimatedTokens
})
};
} catch (error) {
console.error('Validation error:', error);
return {
statusCode: 500,
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
error: 'Internal validation error'
})
};
}
};
This request validator implements comprehensive validation including JSON schema validation, tier-based model restrictions, and token usage estimation. It prevents abuse by enforcing business rules before requests reach OpenAI.
For more on AWS Lambda patterns in ChatGPT apps, see Serverless Architecture for ChatGPT Applications.
Advanced Gateway Patterns
Beyond basic routing and authentication, production ChatGPT applications require sophisticated patterns for resilience and performance.
Circuit Breaker Middleware
// middleware/circuit-breaker.ts
// Circuit breaker pattern for resilient API gateway
interface CircuitBreakerConfig {
failureThreshold: number; // Number of failures before opening circuit
successThreshold: number; // Successes needed to close circuit
timeout: number; // Timeout before attempting retry (ms)
monitoringPeriod: number; // Time window for failure tracking (ms)
}
enum CircuitState {
CLOSED = 'CLOSED', // Normal operation
OPEN = 'OPEN', // Failures detected, blocking requests
HALF_OPEN = 'HALF_OPEN' // Testing if service recovered
}
class CircuitBreaker {
private state: CircuitState = CircuitState.CLOSED;
private failureCount: number = 0;
private successCount: number = 0;
private nextAttempt: number = Date.now();
private failures: number[] = [];
constructor(private config: CircuitBreakerConfig) {}
async execute<T>(
operation: () => Promise<T>,
fallback?: () => Promise<T>
): Promise<T> {
// Check if circuit is open
if (this.state === CircuitState.OPEN) {
if (Date.now() < this.nextAttempt) {
console.log('Circuit is OPEN, using fallback');
if (fallback) {
return fallback();
}
throw new Error('Service unavailable (circuit breaker open)');
}
// Transition to half-open to test service
this.state = CircuitState.HALF_OPEN;
console.log('Circuit transitioning to HALF_OPEN');
}
try {
const result = await operation();
this.onSuccess();
return result;
} catch (error) {
this.onFailure();
if (fallback) {
console.log('Operation failed, using fallback');
return fallback();
}
throw error;
}
}
private onSuccess(): void {
this.failureCount = 0;
if (this.state === CircuitState.HALF_OPEN) {
this.successCount++;
if (this.successCount >= this.config.successThreshold) {
console.log('Circuit closing after successful recovery');
this.state = CircuitState.CLOSED;
this.successCount = 0;
}
}
}
private onFailure(): void {
const now = Date.now();
this.failures.push(now);
// Remove old failures outside monitoring period
this.failures = this.failures.filter(
timestamp => now - timestamp < this.config.monitoringPeriod
);
this.failureCount = this.failures.length;
if (this.failureCount >= this.config.failureThreshold) {
console.log(`Circuit opening after ${this.failureCount} failures`);
this.state = CircuitState.OPEN;
this.nextAttempt = now + this.config.timeout;
this.successCount = 0;
}
}
getState(): CircuitState {
return this.state;
}
getStats() {
return {
state: this.state,
failureCount: this.failureCount,
successCount: this.successCount,
nextAttempt: this.nextAttempt
};
}
}
// Usage in API gateway middleware
export const createCircuitBreakerMiddleware = (
config: CircuitBreakerConfig
) => {
const breaker = new CircuitBreaker(config);
return async (req: any, res: any, next: any) => {
try {
await breaker.execute(
async () => {
// Proxy request to backend.
// Caveat: in Express, next() resolves before the downstream handler finishes, so
// failures there are not counted by the breaker; in practice, wrap the actual
// upstream call instead (see the usage sketch below).
return next();
},
async () => {
// Fallback: Return cached response or error
res.status(503).json({
error: 'Service temporarily unavailable',
circuit_state: breaker.getState(),
retry_after: Math.ceil(
(breaker.getStats().nextAttempt - Date.now()) / 1000
)
});
}
);
} catch (error) {
res.status(500).json({
error: 'Request failed',
circuit_state: breaker.getState()
});
}
};
};
Circuit breakers prevent cascading failures when OpenAI or other backend services experience outages. The circuit "opens" after a threshold of failures, immediately returning cached responses or error messages instead of overwhelming the failing service.
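In practice the breaker should wrap the actual upstream call rather than Express's next(), which resolves before the proxy completes. A usage sketch, assuming the CircuitBreaker class above is exported and that the cached fallback value is supplied by your own cache lookup:
// Using CircuitBreaker around the real OpenAI call, with a cached-response fallback
const openaiBreaker = new CircuitBreaker({
  failureThreshold: 5,
  successThreshold: 2,
  timeout: 30_000,          // stay open for 30s before probing again
  monitoringPeriod: 60_000  // count failures over a 1-minute window
});

export async function chatWithFallback(body: unknown, cachedResponse?: unknown) {
  return openaiBreaker.execute(
    async () => {
      const res = await fetch('https://api.openai.com/v1/chat/completions', {
        method: 'POST',
        headers: {
          Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
          'Content-Type': 'application/json'
        },
        body: JSON.stringify(body)
      });
      if (!res.ok) throw new Error(`OpenAI error ${res.status}`); // counts as a failure
      return res.json();
    },
    async () => {
      // Fallback while the circuit is open: serve a cached answer if one exists
      if (cachedResponse) return cachedResponse;
      throw new Error('Service unavailable and no cached response');
    }
  );
}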
Response Caching with Redis
// middleware/response-cache.ts
// Intelligent response caching for ChatGPT API calls
import { createClient, RedisClientType } from 'redis';
import crypto from 'crypto';
interface CacheConfig {
ttl: number; // Time to live (seconds)
keyPrefix: string; // Redis key prefix
varyOn: string[]; // Request properties to vary cache on
excludePaths?: string[]; // Paths to exclude from caching
}
export class ResponseCache {
private redis: RedisClientType;
constructor(private config: CacheConfig) {
this.redis = createClient({
url: process.env.REDIS_URL || 'redis://localhost:6379'
});
this.redis.on('error', (err) => {
console.error('Redis connection error:', err);
});
this.redis.connect();
}
/**
* Generate cache key from request
*/
private generateCacheKey(req: any): string {
const varyData: any = {};
for (const key of this.config.varyOn) {
if (key === 'body') {
varyData.body = req.body;
} else if (key === 'user') {
varyData.user = req.user?.id || 'anonymous';
} else if (key.startsWith('header.')) {
const headerName = key.substring(7);
varyData[key] = req.headers[headerName];
}
}
const hash = crypto
.createHash('sha256')
.update(JSON.stringify(varyData))
.digest('hex');
return `${this.config.keyPrefix}:${hash}`;
}
/**
* Check if path should be cached
*/
private shouldCache(path: string): boolean {
if (this.config.excludePaths) {
return !this.config.excludePaths.some(excluded =>
path.startsWith(excluded)
);
}
return true;
}
/**
* Cache middleware
*/
middleware() {
return async (req: any, res: any, next: any) => {
// Only cache GET and POST requests
if (!['GET', 'POST'].includes(req.method)) {
return next();
}
// Check if path should be cached
if (!this.shouldCache(req.path)) {
return next();
}
const cacheKey = this.generateCacheKey(req);
try {
// Check cache
const cached = await this.redis.get(cacheKey);
if (cached) {
console.log(`Cache HIT: ${cacheKey}`);
const response = JSON.parse(cached);
res.setHeader('X-Cache', 'HIT');
res.setHeader('X-Cache-Key', cacheKey);
return res.status(200).json(response);
}
console.log(`Cache MISS: ${cacheKey}`);
res.setHeader('X-Cache', 'MISS');
// Intercept response to cache it
const originalJson = res.json.bind(res);
res.json = async (body: any) => {
// Only cache successful responses
if (res.statusCode === 200) {
try {
await this.redis.setEx(
cacheKey,
this.config.ttl,
JSON.stringify(body)
);
console.log(`Cached response: ${cacheKey} (TTL: ${this.config.ttl}s)`);
} catch (error) {
console.error('Cache write error:', error);
}
}
return originalJson(body);
};
next();
} catch (error) {
console.error('Cache read error:', error);
// Fail open: Continue without cache
next();
}
};
}
/**
* Invalidate cache entries by pattern
*/
async invalidate(pattern: string): Promise<number> {
try {
const keys = await this.redis.keys(`${this.config.keyPrefix}:${pattern}*`); // note: KEYS blocks Redis; prefer SCAN at scale
if (keys.length === 0) {
return 0;
}
await this.redis.del(keys);
console.log(`Invalidated ${keys.length} cache entries`);
return keys.length;
} catch (error) {
console.error('Cache invalidation error:', error);
return 0;
}
}
/**
* Get cache statistics
*/
async getStats(): Promise<any> {
try {
const info = await this.redis.info('stats');
const keyspace = await this.redis.info('keyspace');
return {
info,
keyspace,
prefix: this.config.keyPrefix
};
} catch (error) {
console.error('Failed to get cache stats:', error);
return null;
}
}
}
// Usage example
const cache = new ResponseCache({
ttl: 300, // 5 minutes
keyPrefix: 'chatgpt-api',
varyOn: ['body', 'user'],
excludePaths: ['/v1/admin']
});
export default cache;
This caching middleware dramatically reduces OpenAI API costs (40-60% savings) and latency for common queries. It generates cache keys based on request body and user ID, stores responses in Redis with TTL, and provides cache invalidation for content updates.
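When backing content changes (for example, a knowledge-base update), stale entries can be evicted through the invalidate helper. A short usage sketch—the Express app and admin route are illustrative:
// Wiring the cache middleware and flushing entries after a content update
import express from 'express';
import cache from './middleware/response-cache';

const app = express();
app.use(express.json());
app.use(cache.middleware()); // serve cache hits before requests reach the proxy handler

// Illustrative admin endpoint: flush all cached chat responses under the configured prefix
app.post('/v1/admin/cache/flush', async (_req, res) => {
  const removed = await cache.invalidate(''); // empty pattern matches every key under the prefix
  res.json({ invalidated: removed });
});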
Request Transformation Pipeline
// middleware/request-transformer.ts
// Advanced request transformation for ChatGPT gateway
interface TransformRule {
match: (req: any) => boolean;
transform: (req: any) => any;
}
export class RequestTransformer {
private rules: TransformRule[] = [];
/**
* Add transformation rule
*/
addRule(rule: TransformRule): void {
this.rules.push(rule);
}
/**
* Transform request through pipeline
*/
async transform(req: any): Promise<any> {
let transformed = { ...req };
for (const rule of this.rules) {
if (rule.match(transformed)) {
transformed = await rule.transform(transformed);
}
}
return transformed;
}
/**
* Express middleware
*/
middleware() {
return async (req: any, res: any, next: any) => {
try {
req.body = await this.transform(req.body);
next();
} catch (error) {
console.error('Request transformation error:', error);
res.status(400).json({
error: 'Invalid request format',
details: error.message
});
}
};
}
}
// Create transformer with common ChatGPT rules
const transformer = new RequestTransformer();
// Rule 1: Add system message if missing
transformer.addRule({
match: (req) => req.messages && !req.messages.some((m: any) => m.role === 'system'),
transform: (req) => ({
...req,
messages: [
{ role: 'system', content: 'You are a helpful assistant.' },
...req.messages
]
})
});
// Rule 2: Set default model based on user tier
transformer.addRule({
match: (req) => !req.model,
transform: (req) => {
const userTier = req.user?.tier || 'free';
const modelMap: Record<string, string> = {
free: 'gpt-3.5-turbo',
starter: 'gpt-3.5-turbo',
professional: 'gpt-4-turbo',
business: 'gpt-4-turbo'
};
return {
...req,
model: modelMap[userTier]
};
}
});
// Rule 3: Inject user ID for usage tracking
transformer.addRule({
match: (req) => !req.user,
transform: (req) => ({
...req,
user: req.userId || 'anonymous'
})
});
// Rule 4: Apply temperature constraints
transformer.addRule({
match: (req) => req.temperature !== undefined,
transform: (req) => ({
...req,
temperature: Math.max(0, Math.min(2, req.temperature))
})
});
export default transformer;
Request transformation pipelines standardize incoming requests, apply business rules (tier-based model selection), and enrich payloads with metadata. This keeps client code simple while enforcing consistent API usage.
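Putting the pieces together, one plausible middleware ordering in an Express-based gateway looks like the sketch below; the file paths refer to the modules defined in this section and the proxy handler is a placeholder:
// gateway.ts — illustrative wiring of the transformation and caching middleware
import express from 'express';
import cache from './middleware/response-cache';
import transformer from './middleware/request-transformer';

const app = express();
app.use(express.json());

app.post(
  '/v1/chat/completions',
  transformer.middleware(), // normalize the payload first so cache keys reflect the final body
  cache.middleware(),       // then serve repeat queries straight from Redis
  async (req, res) => {
    // Proxy handler: forward req.body upstream, ideally wrapped in the circuit
    // breaker from the previous section so outages trigger the fallback path.
    res.status(502).json({ error: 'proxy handler omitted in this sketch' });
  }
);

app.listen(8080);
Transforming before caching means equivalent requests from different clients normalize to the same cache key, which keeps the hit rate up.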
For additional gateway patterns, see Microservices Architecture for ChatGPT Apps.
Monitoring & Observability with OpenTelemetry
Production API gateways require comprehensive observability to detect issues, optimize performance, and track costs.
// telemetry/opentelemetry.ts
// OpenTelemetry instrumentation for API gateway
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
import { trace, context, SpanStatusCode } from '@opentelemetry/api';
// Initialize OpenTelemetry SDK
const sdk = new NodeSDK({
resource: new Resource({
[SemanticResourceAttributes.SERVICE_NAME]: 'chatgpt-api-gateway',
[SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
[SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV || 'development'
}),
traceExporter: new OTLPTraceExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4318/v1/traces'
}),
metricReader: new PeriodicExportingMetricReader({
exporter: new OTLPMetricExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4318/v1/metrics'
}),
exportIntervalMillis: 60000 // 1 minute
}),
instrumentations: [
getNodeAutoInstrumentations({
'@opentelemetry/instrumentation-http': {
ignoreIncomingPaths: ['/health', '/metrics']
},
'@opentelemetry/instrumentation-express': {
enabled: true
}
})
]
});
sdk.start();
// Graceful shutdown
process.on('SIGTERM', () => {
sdk.shutdown()
.then(() => console.log('OpenTelemetry SDK shut down'))
.catch((error) => console.error('Error shutting down OpenTelemetry SDK', error))
.finally(() => process.exit(0));
});
/**
* Custom span creation for ChatGPT API calls
*/
export const traceChatGPTRequest = async (
operation: string,
attributes: any,
fn: () => Promise<any>
): Promise<any> => {
const tracer = trace.getTracer('chatgpt-gateway');
return tracer.startActiveSpan(operation, async (span) => {
try {
// Set span attributes
span.setAttributes({
'chatgpt.model': attributes.model || 'unknown',
'chatgpt.user_id': attributes.userId || 'anonymous',
'chatgpt.message_count': attributes.messageCount || 0,
'chatgpt.estimated_tokens': attributes.estimatedTokens || 0
});
const result = await fn();
// Record success
span.setStatus({ code: SpanStatusCode.OK });
// Add result attributes
if (result.usage) {
span.setAttributes({
'chatgpt.tokens.prompt': result.usage.prompt_tokens,
'chatgpt.tokens.completion': result.usage.completion_tokens,
'chatgpt.tokens.total': result.usage.total_tokens
});
}
return result;
} catch (error: any) {
// Record error
span.setStatus({
code: SpanStatusCode.ERROR,
message: error.message
});
span.recordException(error);
throw error;
} finally {
span.end();
}
});
};
/**
* Express middleware for request tracing
*/
export const tracingMiddleware = () => {
return (req: any, res: any, next: any) => {
const tracer = trace.getTracer('chatgpt-gateway');
const span = tracer.startSpan(`${req.method} ${req.path}`, {
attributes: {
'http.method': req.method,
'http.url': req.url,
'http.target': req.path,
'http.user_agent': req.headers['user-agent'],
'user.id': req.user?.id || 'anonymous'
}
});
// Store span in request context
req.span = span;
// Intercept response to record status
const originalEnd = res.end.bind(res);
res.end = (...args: any[]) => {
span.setAttribute('http.status_code', res.statusCode);
if (res.statusCode >= 400) {
span.setStatus({
code: SpanStatusCode.ERROR,
message: `HTTP ${res.statusCode}`
});
} else {
span.setStatus({ code: SpanStatusCode.OK });
}
span.end();
return originalEnd(...args);
};
next();
};
};
export default sdk;
This OpenTelemetry implementation provides distributed tracing across your API gateway, backend services, and OpenAI API calls. You'll see exact latency breakdowns, error rates, and token usage in tools like Jaeger, Grafana, or Datadog.
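Wrapping the upstream call with traceChatGPTRequest captures per-request spans and token counts. A usage sketch—the fetch call and attribute values are illustrative:
// Tracing an OpenAI call through the helper defined above
import { traceChatGPTRequest } from './telemetry/opentelemetry';

export async function tracedChatCompletion(body: any, userId: string) {
  return traceChatGPTRequest(
    'openai.chat.completion',
    {
      model: body.model,
      userId,
      messageCount: body.messages?.length ?? 0,
      estimatedTokens: 0 // plug in your token estimator here if available
    },
    async () => {
      const res = await fetch('https://api.openai.com/v1/chat/completions', {
        method: 'POST',
        headers: {
          Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
          'Content-Type': 'application/json'
        },
        body: JSON.stringify(body)
      });
      return res.json(); // usage.total_tokens is recorded on the span when present
    }
  );
}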
Key Metrics to Monitor:
- Request Rate: Requests per second (track spikes that might trigger rate limits)
- Latency Percentiles: p50, p95, p99 response times (target <500ms for p95)
- Error Rate: 4xx and 5xx errors (alert on >5% error rate)
- Token Usage: Total tokens consumed per hour (cost monitoring)
- Cache Hit Rate: Percentage of cached responses (target >40%)
- Circuit Breaker State: OPEN/CLOSED/HALF_OPEN counts (detect backend failures)
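Several of these metrics are not emitted automatically and are worth defining as custom instruments. A minimal sketch using the OpenTelemetry metrics API—the instrument names are suggestions, not a standard:
// telemetry/custom-metrics.ts — custom instruments for the metrics listed above
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('chatgpt-gateway');

export const tokenCounter = meter.createCounter('chatgpt.tokens.total', {
  description: 'Total OpenAI tokens consumed',
  unit: 'tokens'
});

export const cacheHitCounter = meter.createCounter('gateway.cache.hits', {
  description: 'Responses served from the Redis cache'
});

export const requestLatency = meter.createHistogram('gateway.request.duration', {
  description: 'End-to-end gateway request latency',
  unit: 'ms'
});

// Example: record usage after a completed chat request
export function recordChatMetrics(totalTokens: number, latencyMs: number, model: string): void {
  tokenCounter.add(totalTokens, { model });
  requestLatency.record(latencyMs, { model });
}
These instruments are exported through the PeriodicExportingMetricReader configured in the SDK above.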
For complete monitoring strategies, see Observability Patterns for ChatGPT Applications.
Production Deployment Checklist
Before deploying your API gateway to production, validate these critical requirements:
Security:
- JWT validation with JWKS rotation support
- HTTPS/TLS 1.3 enforcement
- API key rotation mechanism
- Rate limiting per user/tier
- DDoS protection (CloudFlare, AWS Shield)
- Security headers (HSTS, CSP, X-Frame-Options)
Performance:
- Response caching with Redis
- Circuit breakers for OpenAI API
- Request timeouts (30s max)
- Connection pooling
- Horizontal scaling (3+ gateway instances)
Observability:
- OpenTelemetry instrumentation
- Centralized logging (CloudWatch, Datadog)
- Alerting on error rates >5%
- Dashboard for key metrics
Cost Optimization:
- Cache hit rate >40%
- Token usage monitoring
- Tier-based model routing
Compliance:
- OAuth 2.1 PKCE implementation
- Access token verification
- Audit logging for sensitive operations
For comprehensive deployment guides, see ChatGPT Applications Guide.
Conclusion: Building Production-Grade ChatGPT Infrastructure
API gateways are the foundation of scalable, secure, and cost-effective ChatGPT applications. By implementing the patterns covered in this guide—Kong declarative configurations, AWS API Gateway with Lambda authorizers, circuit breakers, response caching, and OpenTelemetry observability—you'll build infrastructure that handles production traffic reliably.
Key Takeaways:
- Kong Gateway excels for high-throughput scenarios (50,000+ requests/second) with a rich Lua plugin ecosystem and declarative configuration
- AWS API Gateway provides fully managed infrastructure with built-in rate limiting, request validation, and seamless AWS service integration
- Circuit breakers prevent cascading failures when OpenAI experiences outages, keeping your application responsive during incidents
- Response caching can cut OpenAI API costs by an estimated 40-60% while dropping latency from hundreds of milliseconds to tens of milliseconds for cached requests
- OpenTelemetry provides end-to-end observability with distributed tracing, metrics, and logging across your entire stack
For teams building ChatGPT applications targeting the OpenAI App Store, proper API gateway architecture is non-negotiable. It ensures you meet OpenAI's OAuth 2.1 PKCE security requirements, maintain sub-200ms response times, and provide enterprise-grade reliability.
Whether you choose Kong for maximum control and performance, or AWS API Gateway for managed simplicity, the patterns in this guide give you production-ready starting points. Combine them with comprehensive monitoring, intelligent caching, and resilience patterns to build ChatGPT applications that scale from 100 to 100 million users.
Ready to build your ChatGPT application with professional API gateway architecture? Start your free trial at MakeAIHQ.com and deploy production-ready ChatGPT apps in 48 hours—no coding required. Our platform automatically generates API gateways, authentication flows, and infrastructure based on best practices from this guide.
For more ChatGPT app architecture patterns, explore:
- Complete Guide to Building ChatGPT Applications (pillar article)
- Rate Limiting Patterns for ChatGPT Apps
- Microservices Architecture for ChatGPT Apps
- OAuth 2.1 PKCE Authentication for ChatGPT Apps
Building the future of conversational AI, one gateway at a time.
References:
- Kong Gateway Documentation - Official Kong Gateway documentation with plugin guides
- AWS API Gateway Best Practices - AWS best practices for production API Gateway deployments
- OpenTelemetry Specification - Official OpenTelemetry specification for distributed tracing and metrics