SLI/SLO/SLA Definition for ChatGPT Apps
Site Reliability Engineering (SRE) principles from Google provide the foundation for defining measurable service reliability for ChatGPT applications. Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) create a framework for quantifying user experience, setting engineering targets, and establishing customer commitments. Understanding these concepts is critical for production ChatGPT apps serving real users at scale.
ChatGPT applications present unique SLI/SLO challenges compared to traditional web services. MCP server latency, OpenAI API dependencies, widget rendering performance, and conversational context all impact user experience. This guide demonstrates how to define production-grade SLIs, calculate SLOs with error budgets, implement SLA compliance monitoring, and make data-driven reliability decisions using Google SRE methodology.
Service Level Indicators (SLIs)
Service Level Indicators are quantitative measurements of service behavior that matter to users. For ChatGPT applications, three primary SLI categories drive user satisfaction: availability (can users access your app), latency (how fast does it respond), and quality (does it return correct results).
Availability SLI measures the proportion of successful requests. A ChatGPT app serving widgets in conversation threads must handle MCP tool calls, widget rendering, and state updates. Availability captures whether these operations succeed:
availability_sli = successful_requests / total_requests
Successful requests return an HTTP 2xx status with a valid MCP response. Failed requests include 5xx errors, timeouts exceeding 30 seconds, and malformed responses rejected by ChatGPT.
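As a minimal sketch of this classification (the outcome fields are illustrative, not part of the MCP spec; the 2xx and 30-second checks mirror the criteria above):
// Availability SLI sketch: classify each MCP request as a good or bad event.
interface RequestOutcome {
  statusCode: number;
  durationMs: number;
  acceptedByChatGPT: boolean; // response parsed and not rejected as malformed
}

function isSuccessfulRequest(r: RequestOutcome): boolean {
  // Success: 2xx status, under the 30-second timeout, and a well-formed response
  return r.statusCode >= 200 && r.statusCode < 300 && r.durationMs < 30_000 && r.acceptedByChatGPT;
}

function availabilitySLI(outcomes: RequestOutcome[]): number {
  if (outcomes.length === 0) return 1;
  return outcomes.filter(isSuccessfulRequest).length / outcomes.length;
}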
Latency SLI measures response time distribution. ChatGPT users expect near-instant responses (< 1 second for simple queries, < 5 seconds for complex operations). Latency percentiles (p50, p95, p99) reveal tail performance:
latency_p95 = 95th_percentile(request_duration)
Measuring latency requires instrumenting MCP tool handlers, database queries, external API calls, and widget render time.
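For illustration, a percentile can be computed from raw durations with the nearest-rank method. This is a simplification; production systems typically derive quantiles from histogram buckets, as shown in the Prometheus setup later in this guide.
// Nearest-rank percentile over raw request durations (in seconds).
function latencyPercentile(durations: number[], percentile: number): number {
  if (durations.length === 0) return 0;
  const sorted = [...durations].sort((a, b) => a - b);
  const rank = Math.ceil((percentile / 100) * sorted.length);
  return sorted[Math.max(rank - 1, 0)];
}

const p95 = latencyPercentile([0.2, 0.4, 0.4, 0.7, 1.1, 3.2], 95); // 3.2 seconds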
Quality SLI measures correctness of responses. For ChatGPT apps, quality includes valid JSON responses, schema compliance, widget state consistency, and user-reported errors:
quality_sli = valid_responses / total_responses
Quality failures include schema validation errors, widget runtime crashes, or state corruption requiring user intervention.
Implementation Strategy: Instrument SLIs at request boundaries using middleware, OpenTelemetry, or custom logging. Store measurements in time-series databases (Prometheus, CloudWatch, Datadog) with labels for tool name, user tier, region, and failure mode.
Here's a production TypeScript implementation for SLI measurement:
// sli-measurement.ts
import express, { Request, Response, NextFunction } from 'express';
import { Counter, Histogram, Registry } from 'prom-client';
interface SLIMetrics {
requestCounter: Counter;
requestDuration: Histogram;
errorCounter: Counter;
qualityCounter: Counter;
}
export class SLIMeasurement {
private metrics: SLIMetrics;
private registry: Registry;
constructor(registry?: Registry) {
this.registry = registry || new Registry();
this.metrics = {
requestCounter: new Counter({
name: 'chatgpt_app_requests_total',
help: 'Total number of MCP requests',
labelNames: ['tool', 'status', 'tier', 'region'],
registers: [this.registry]
}),
requestDuration: new Histogram({
name: 'chatgpt_app_request_duration_seconds',
help: 'Request duration in seconds',
labelNames: ['tool', 'tier'],
buckets: [0.1, 0.5, 1, 2, 5, 10, 30],
registers: [this.registry]
}),
errorCounter: new Counter({
name: 'chatgpt_app_errors_total',
help: 'Total number of errors',
labelNames: ['tool', 'error_type', 'tier'],
registers: [this.registry]
}),
qualityCounter: new Counter({
name: 'chatgpt_app_quality_total',
help: 'Response quality measurements',
labelNames: ['tool', 'valid', 'tier'],
registers: [this.registry]
})
};
}
// Middleware for automatic SLI collection
middleware() {
return (req: Request, res: Response, next: NextFunction) => {
const startTime = Date.now();
const tool = req.body?.tool || 'unknown';
      const tier = (req as any).user?.tier || 'free'; // user/tier attached by upstream auth middleware, if any
      const region = (req.headers['cloudfront-viewer-country'] as string) || 'unknown';
      // Wrap res.send so SLIs are recorded when the response is written
      const originalSend = res.send.bind(res);
      res.send = (data: any) => {
const duration = (Date.now() - startTime) / 1000;
const status = res.statusCode >= 200 && res.statusCode < 300 ? 'success' : 'error';
// Record availability SLI
this.metrics.requestCounter.inc({
tool,
status,
tier,
region
});
// Record latency SLI
this.metrics.requestDuration.observe({ tool, tier }, duration);
// Record errors
if (status === 'error') {
const errorType = this.getErrorType(res.statusCode);
this.metrics.errorCounter.inc({ tool, error_type: errorType, tier });
}
// Record quality SLI
const isValid = this.validateResponse(data);
this.metrics.qualityCounter.inc({
tool,
valid: isValid ? 'true' : 'false',
tier
});
        return originalSend(data);
      };
next();
};
}
private getErrorType(statusCode: number): string {
if (statusCode >= 500) return '5xx_server_error';
if (statusCode === 429) return 'rate_limit';
if (statusCode >= 400) return '4xx_client_error';
return 'unknown';
}
private validateResponse(data: any): boolean {
try {
const parsed = typeof data === 'string' ? JSON.parse(data) : data;
// Validate MCP response structure
if (!parsed.content && !parsed.structuredContent) {
return false;
}
// Validate widget state if present
if (parsed._meta?.widgetState) {
const state = parsed._meta.widgetState;
if (typeof state !== 'object' || state === null) {
return false;
}
}
return true;
} catch (error) {
return false;
}
}
// Get current SLI values
async getCurrentSLIs(timeWindow: number = 300): Promise<SLISnapshot> {
    // Calculate SLIs over the last timeWindow seconds
    // (simplified; production should query the metrics backend, e.g. Prometheus)
const availability = this.calculateAvailability(timeWindow);
const latencyP95 = this.calculateLatencyPercentile(95, timeWindow);
const quality = this.calculateQuality(timeWindow);
return {
timestamp: Date.now(),
window_seconds: timeWindow,
availability_percent: availability,
latency_p95_seconds: latencyP95,
quality_percent: quality,
total_requests: this.getTotalRequests(timeWindow)
};
}
private calculateAvailability(windowSeconds: number): number {
// Query Prometheus or calculate from in-memory metrics
// This is a simplified example
const successCount = this.getMetricValue('chatgpt_app_requests_total', { status: 'success' });
const totalCount = this.getMetricValue('chatgpt_app_requests_total');
return totalCount > 0 ? (successCount / totalCount) * 100 : 100;
}
private calculateLatencyPercentile(percentile: number, windowSeconds: number): number {
// Calculate percentile from histogram buckets
const histogram = this.metrics.requestDuration;
const buckets = (histogram as any).hashMap || {};
// Simplified percentile calculation
// Production should use proper histogram quantile calculation
return 0.95; // Placeholder
}
private calculateQuality(windowSeconds: number): number {
const validCount = this.getMetricValue('chatgpt_app_quality_total', { valid: 'true' });
const totalCount = this.getMetricValue('chatgpt_app_quality_total');
return totalCount > 0 ? (validCount / totalCount) * 100 : 100;
}
private getMetricValue(metricName: string, labels?: Record<string, string>): number {
// Simplified metric retrieval
// Production should query actual metric values with label matching
return 0;
}
private getTotalRequests(windowSeconds: number): number {
return this.getMetricValue('chatgpt_app_requests_total');
}
// Export metrics in Prometheus format
async getMetrics(): Promise<string> {
return this.registry.metrics();
}
}
export interface SLISnapshot {
timestamp: number;
window_seconds: number;
availability_percent: number;
latency_p95_seconds: number;
quality_percent: number;
total_requests: number;
}
// Usage example
const app = express();
app.use(express.json()); // needed so req.body.tool is populated for MCP requests

const sliMeasurement = new SLIMeasurement();
app.use(sliMeasurement.middleware());
// Expose metrics endpoint
app.get('/metrics', async (req, res) => {
res.set('Content-Type', 'text/plain');
res.send(await sliMeasurement.getMetrics());
});
Service Level Objectives (SLOs)
Service Level Objectives define target values for SLIs over time windows. SLOs balance user expectations with engineering cost: stricter SLOs (99.99% availability) require far more resources than relaxed SLOs (95% availability). Google SRE guidance is to set SLOs no stricter than necessary: just high enough that users stay happy, and no higher.
Availability SLO for ChatGPT apps typically targets 99.5% to 99.9% over rolling 30-day windows. This allows between roughly 3.6 hours (at 99.5%) and 43 minutes (at 99.9%) of monthly downtime for maintenance, incidents, and dependencies:
availability_slo = 99.9% over 30 days
error_budget = (1 - 0.999) * 30 days = 43.2 minutes
Latency SLO defines acceptable response time percentiles. ChatGPT users tolerate higher latency for complex operations. A common SLO: 95% of requests complete in < 1 second, 99% in < 5 seconds:
latency_p95_slo = 1.0 seconds
latency_p99_slo = 5.0 seconds
Quality SLO targets valid response rates. Schema violations, widget crashes, or state corruption degrade user experience. A typical quality SLO: 99.5% of responses pass validation:
quality_slo = 99.5%
Rolling Windows: Calculate SLOs over rolling time windows (7 days, 30 days) rather than calendar months. This prevents "reset" behavior where teams burn entire error budgets early in the month.
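A minimal sketch of rolling-window aggregation, assuming per-day good/total request counts are already available:
// Rolling-window availability from per-day counts (most recent day last).
interface DailyCounts {
  date: string;  // e.g. '2026-11-03'
  good: number;  // successful requests that day
  total: number; // all requests that day
}

function rollingAvailabilityPercent(days: DailyCounts[], windowDays: number): number {
  const window = days.slice(-windowDays); // always the most recent N days
  const good = window.reduce((sum, d) => sum + d.good, 0);
  const total = window.reduce((sum, d) => sum + d.total, 0);
  return total > 0 ? (good / total) * 100 : 100;
}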
Error Budget: The complement of the SLO target is your error budget, the amount of unreliability you can tolerate. A 99.9% SLO grants a 0.1% error budget. Every failed request consumes budget, and once availability for the window falls below 99.9% the budget is exhausted. When the error budget is exhausted, freeze feature development and focus on reliability.
Here's a TypeScript implementation for SLO calculation:
// slo-calculator.ts
import { SLISnapshot } from './sli-measurement';
interface SLOTarget {
name: string;
sli_type: 'availability' | 'latency' | 'quality';
target_percent: number;
window_days: number;
}
export interface SLOStatus {
target: SLOTarget;
current_percent: number;
slo_met: boolean;
error_budget_remaining_percent: number;
error_budget_minutes: number;
burn_rate: number;
time_to_exhaustion_hours: number | null;
}
export class SLOCalculator {
private targets: SLOTarget[];
constructor(targets: SLOTarget[]) {
this.targets = targets;
}
// Calculate SLO status for all targets
async calculateStatus(
snapshots: SLISnapshot[],
windowDays: number
): Promise<SLOStatus[]> {
const statuses: SLOStatus[] = [];
for (const target of this.targets) {
if (target.window_days !== windowDays) continue;
const status = this.calculateSLOStatus(target, snapshots);
statuses.push(status);
}
return statuses;
}
private calculateSLOStatus(
target: SLOTarget,
snapshots: SLISnapshot[]
): SLOStatus {
const current = this.calculateCurrentSLI(target, snapshots);
const sloMet = current >= target.target_percent;
    const errorBudgetPercent = 100 - target.target_percent;
    // Budget consumed equals the observed error rate (100 - current SLI), in percentage points
    const consumedBudgetPercent = 100 - current;
    // Express remaining budget as a percentage of the total budget (negative when overspent)
    const remainingBudgetPercent = errorBudgetPercent > 0
      ? ((errorBudgetPercent - consumedBudgetPercent) / errorBudgetPercent) * 100
      : 0;
    const windowMinutes = target.window_days * 24 * 60;
    // Remaining budget translated into minutes of allowable downtime
    const errorBudgetMinutes =
      (Math.max(remainingBudgetPercent, 0) / 100) * (errorBudgetPercent / 100) * windowMinutes;
const burnRate = this.calculateBurnRate(target, snapshots);
const timeToExhaustion = this.calculateTimeToExhaustion(
remainingBudgetPercent,
burnRate,
windowMinutes
);
return {
target,
current_percent: current,
slo_met: sloMet,
error_budget_remaining_percent: remainingBudgetPercent,
error_budget_minutes: errorBudgetMinutes,
burn_rate: burnRate,
time_to_exhaustion_hours: timeToExhaustion
};
}
private calculateCurrentSLI(
target: SLOTarget,
snapshots: SLISnapshot[]
): number {
if (snapshots.length === 0) return 100;
const recent = snapshots[snapshots.length - 1];
switch (target.sli_type) {
case 'availability':
return recent.availability_percent;
case 'quality':
return recent.quality_percent;
case 'latency':
        // For latency SLOs, estimate the % of requests meeting the 1s target.
        // Rough approximation: assume 5% miss the target when p95 exceeds it;
        // production should compute the good-event ratio from histogram buckets.
        return 100 - (recent.latency_p95_seconds > 1.0 ? 5 : 0);
default:
return 100;
}
}
private calculateBurnRate(
target: SLOTarget,
snapshots: SLISnapshot[]
): number {
if (snapshots.length < 2) return 0;
// Calculate error rate over recent snapshots
const recentWindow = snapshots.slice(-12); // Last hour (5min intervals)
const errorRates = recentWindow.map(snap => {
const current = this.calculateCurrentSLI(target, [snap]);
return 100 - current;
});
const avgErrorRate = errorRates.reduce((a, b) => a + b, 0) / errorRates.length;
const allowedErrorRate = 100 - target.target_percent;
return allowedErrorRate > 0 ? avgErrorRate / allowedErrorRate : 0;
}
private calculateTimeToExhaustion(
remainingBudgetPercent: number,
burnRate: number,
windowMinutes: number
): number | null {
if (burnRate <= 0) return null;
if (remainingBudgetPercent <= 0) return 0;
    // At burn rate B, the full budget lasts windowMinutes / B of wall-clock time;
    // with a fraction f of budget left, exhaustion comes after f * windowMinutes / B.
    const exhaustionMinutes = (remainingBudgetPercent / 100) * windowMinutes / burnRate;
    return exhaustionMinutes / 60; // Convert to hours
}
// Generate SLO report
generateReport(statuses: SLOStatus[]): string {
let report = '# SLO Status Report\n\n';
report += `Generated: ${new Date().toISOString()}\n\n`;
for (const status of statuses) {
report += `## ${status.target.name}\n`;
report += `- Type: ${status.target.sli_type}\n`;
report += `- Target: ${status.target.target_percent}%\n`;
report += `- Current: ${status.current_percent.toFixed(2)}%\n`;
report += `- Status: ${status.slo_met ? '✅ MET' : '❌ VIOLATED'}\n`;
report += `- Error Budget Remaining: ${status.error_budget_remaining_percent.toFixed(2)}% (${Math.floor(status.error_budget_minutes)} minutes)\n`;
report += `- Burn Rate: ${status.burn_rate.toFixed(2)}x\n`;
if (status.time_to_exhaustion_hours !== null) {
report += `- Time to Budget Exhaustion: ${status.time_to_exhaustion_hours.toFixed(1)} hours\n`;
}
report += '\n';
}
return report;
}
}
// Example SLO targets
const sloTargets: SLOTarget[] = [
{
name: 'Availability (30-day)',
sli_type: 'availability',
target_percent: 99.9,
window_days: 30
},
{
name: 'Availability (7-day)',
sli_type: 'availability',
target_percent: 99.5,
window_days: 7
},
{
name: 'Quality (30-day)',
sli_type: 'quality',
target_percent: 99.5,
window_days: 30
},
{
name: 'Latency P95 (7-day)',
sli_type: 'latency',
target_percent: 95.0,
window_days: 7
}
];
const sloCalculator = new SLOCalculator(sloTargets);
Service Level Agreements (SLAs)
Service Level Agreements are contractual commitments to customers with financial penalties for non-compliance. SLAs must be less strict than internal SLOs to provide safety margin. If your availability SLO is 99.9%, set customer SLA at 99.5%. This buffer prevents SLA violations during normal operational variance.
SLA Structure includes three components: commitment percentage, measurement window, and penalty terms. A typical ChatGPT app SLA:
Commitment: 99.5% monthly availability
Measurement: Calendar month, measured by server logs
Penalty: 10% monthly fee credit per 0.1% below SLA
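One way to make such an agreement machine-readable is a small typed definition. The field names below are illustrative, not a standard schema:
// Illustrative SLA definition; field names are assumptions, not a standard schema.
interface SLADefinition {
  commitmentPercent: number;           // e.g. 99.5
  measurementWindow: 'calendar_month';
  measurementSource: 'server_logs';
  creditPercentPerTenthBelow: number;  // fee credit per 0.1 percentage points below commitment
  exclusions: string[];
}

const exampleSLA: SLADefinition = {
  commitmentPercent: 99.5,
  measurementWindow: 'calendar_month',
  measurementSource: 'server_logs',
  creditPercentPerTenthBelow: 10,
  exclusions: ['scheduled maintenance', 'customer-caused outages', 'force majeure']
};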
Measurement Methodology must be transparent and auditable. ChatGPT app SLAs measure availability from server-side logs (not client-side, which includes network issues). Document exclusions: scheduled maintenance, customer-caused outages, force majeure.
Penalty Tiers scale with SLA violation severity. Example tiered structure:
- 99.5% - 99.0%: 10% credit
- 99.0% - 98.0%: 25% credit
- Below 98.0%: 50% credit + termination right
Legal Considerations: SLAs are contracts requiring legal review. Terms must define measurement methodology, exclusion criteria, the credit request process, and dispute resolution. Enterprise customers typically require an SLA for ChatGPT apps as part of procurement approval.
Operational Impact: SLAs drive architectural decisions. A 99.5% SLA allows 3.6 hours of monthly downtime, sufficient for maintenance windows and incident response. A 99.99% SLA allows only about 4 minutes of monthly downtime, requiring multi-region deployment, automated failover, and on-call engineering.
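The arithmetic behind these allowances is straightforward; a small helper makes the trade-off concrete:
// Allowed downtime implied by an availability target over a 30-day month.
function allowedDowntimeMinutes(targetPercent: number, windowDays: number = 30): number {
  return ((100 - targetPercent) / 100) * windowDays * 24 * 60;
}

console.log(allowedDowntimeMinutes(99.5));  // 216 minutes (~3.6 hours)
console.log(allowedDowntimeMinutes(99.9));  // 43.2 minutes
console.log(allowedDowntimeMinutes(99.99)); // ~4.3 minutes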
Error Budget
Error budgets transform reliability from philosophical debate to data-driven decision framework. When error budget remains, invest in feature velocity. When error budget exhausts, freeze features and improve reliability.
Error Budget Calculation:
error_budget = (1 - slo_target) * time_window
availability_slo = 99.9%
error_budget = 0.001 * 30 days = 43.2 minutes per month
Burn Rate measures how quickly you consume error budget. A burn rate of 1.0 means you're consuming budget at exactly the sustainable rate. Burn rate > 1.0 indicates reliability problems:
burn_rate = actual_error_rate / allowed_error_rate
allowed_error_rate = 1 - slo_target = 0.001
actual_error_rate = errors / total_requests
If errors = 0.005 (0.5%):
burn_rate = 0.005 / 0.001 = 5.0x
A 5x burn rate exhausts monthly error budget in 6 days.
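The same arithmetic in TypeScript, reproducing the example above:
// Burn-rate arithmetic for a 99.9% availability SLO.
const sloTarget = 0.999;
const allowedErrorRate = 1 - sloTarget;              // ~0.001
const actualErrorRate = 0.005;                       // 0.5% of requests failing
const burnRate = actualErrorRate / allowedErrorRate; // ~5.0

// At a constant burn rate, a 30-day budget lasts windowDays / burnRate days.
const daysToExhaustion = 30 / burnRate;              // ~6 days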
Decision Framework:
- Error budget remaining > 50%: Optimize for velocity, deploy features aggressively
- Error budget remaining 20-50%: Balanced approach, increase testing rigor
- Error budget remaining < 20%: Focus on reliability, freeze risky deploys
- Error budget exhausted: Code freeze, incident review, reliability improvements
Multi-Window Budgets: Track error budgets across multiple windows (7-day, 30-day) to detect short-term incidents vs. chronic issues. A 7-day budget violation with healthy 30-day budget indicates a recent incident. Violations across both windows indicate systemic problems.
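A minimal sketch of this multi-window check, reusing the SLOStatus shape produced by the calculator above:
// multi-window-check.ts (sketch)
import { SLOStatus } from './slo-calculator';

function classifyBudgetHealth(sevenDay: SLOStatus, thirtyDay: SLOStatus): string {
  if (!sevenDay.slo_met && !thirtyDay.slo_met) {
    return 'systemic: both windows violated, chronic reliability problem';
  }
  if (!sevenDay.slo_met) {
    return 'acute: recent incident, 30-day budget still healthy';
  }
  if (!thirtyDay.slo_met) {
    return 'recovering: earlier damage still visible in the 30-day window';
  }
  return 'healthy: both windows within SLO';
}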
Here's a TypeScript implementation for error budget tracking:
// error-budget-tracker.ts
import { SLOStatus } from './slo-calculator';
interface BudgetPolicy {
remaining_threshold_percent: number;
action: 'optimize_velocity' | 'balanced' | 'focus_reliability' | 'code_freeze';
description: string;
}
interface BudgetAlert {
severity: 'info' | 'warning' | 'critical';
message: string;
recommended_action: string;
burn_rate: number;
time_to_exhaustion_hours: number | null;
}
export class ErrorBudgetTracker {
private policies: BudgetPolicy[] = [
{
remaining_threshold_percent: 50,
action: 'optimize_velocity',
description: 'Healthy error budget - optimize for feature velocity'
},
{
remaining_threshold_percent: 20,
action: 'balanced',
description: 'Moderate error budget - balance velocity and reliability'
},
{
remaining_threshold_percent: 5,
action: 'focus_reliability',
description: 'Low error budget - focus on reliability improvements'
},
{
remaining_threshold_percent: 0,
action: 'code_freeze',
description: 'Error budget exhausted - implement code freeze'
}
];
// Determine current policy based on error budget
getCurrentPolicy(status: SLOStatus): BudgetPolicy {
const remaining = status.error_budget_remaining_percent;
for (let i = 0; i < this.policies.length; i++) {
if (remaining >= this.policies[i].remaining_threshold_percent) {
return this.policies[i];
}
}
return this.policies[this.policies.length - 1];
}
// Generate alerts based on burn rate and remaining budget
generateAlerts(statuses: SLOStatus[]): BudgetAlert[] {
const alerts: BudgetAlert[] = [];
for (const status of statuses) {
// Critical burn rate alert
if (status.burn_rate > 10) {
alerts.push({
severity: 'critical',
message: `Extreme burn rate detected for ${status.target.name}: ${status.burn_rate.toFixed(1)}x`,
recommended_action: 'Immediate incident response required. Error budget will exhaust in hours.',
burn_rate: status.burn_rate,
time_to_exhaustion_hours: status.time_to_exhaustion_hours
});
} else if (status.burn_rate > 5) {
alerts.push({
severity: 'critical',
message: `High burn rate for ${status.target.name}: ${status.burn_rate.toFixed(1)}x`,
recommended_action: 'Investigate cause and implement mitigation. Consider rolling back recent changes.',
burn_rate: status.burn_rate,
time_to_exhaustion_hours: status.time_to_exhaustion_hours
});
} else if (status.burn_rate > 2) {
alerts.push({
severity: 'warning',
message: `Elevated burn rate for ${status.target.name}: ${status.burn_rate.toFixed(1)}x`,
recommended_action: 'Monitor closely. Review recent deployments and error rates.',
burn_rate: status.burn_rate,
time_to_exhaustion_hours: status.time_to_exhaustion_hours
});
}
// Budget exhaustion alerts
if (status.error_budget_remaining_percent <= 0) {
alerts.push({
severity: 'critical',
message: `Error budget exhausted for ${status.target.name}`,
recommended_action: 'Implement code freeze. Focus all engineering on reliability improvements.',
burn_rate: status.burn_rate,
time_to_exhaustion_hours: 0
});
} else if (status.error_budget_remaining_percent < 5) {
alerts.push({
severity: 'warning',
message: `Error budget critically low for ${status.target.name}: ${status.error_budget_remaining_percent.toFixed(1)}% remaining`,
recommended_action: 'Defer risky changes. Increase testing and monitoring.',
burn_rate: status.burn_rate,
time_to_exhaustion_hours: status.time_to_exhaustion_hours
});
}
}
return alerts;
}
// Generate weekly report
generateWeeklyReport(statuses: SLOStatus[]): string {
let report = '# Error Budget Weekly Report\n\n';
report += `Period: ${new Date().toISOString()}\n\n`;
report += '## Executive Summary\n\n';
const exhaustedBudgets = statuses.filter(s => s.error_budget_remaining_percent <= 0);
const criticalBudgets = statuses.filter(s =>
s.error_budget_remaining_percent > 0 &&
s.error_budget_remaining_percent < 10
);
if (exhaustedBudgets.length > 0) {
report += `⚠️ **${exhaustedBudgets.length} error budget(s) exhausted**\n\n`;
report += 'Action Required: Implement code freeze and focus on reliability.\n\n';
} else if (criticalBudgets.length > 0) {
report += `⚠️ **${criticalBudgets.length} error budget(s) critically low**\n\n`;
report += 'Action Required: Defer risky changes and increase reliability focus.\n\n';
} else {
report += '✅ All error budgets healthy\n\n';
}
report += '## Detailed Status\n\n';
for (const status of statuses) {
const policy = this.getCurrentPolicy(status);
report += `### ${status.target.name}\n`;
report += `- Current SLI: ${status.current_percent.toFixed(2)}%\n`;
report += `- Target SLO: ${status.target.target_percent}%\n`;
report += `- Budget Remaining: ${status.error_budget_remaining_percent.toFixed(2)}%\n`;
report += `- Burn Rate: ${status.burn_rate.toFixed(2)}x\n`;
report += `- Policy: **${policy.action}**\n`;
report += `- Guidance: ${policy.description}\n\n`;
}
const alerts = this.generateAlerts(statuses);
if (alerts.length > 0) {
report += '## Active Alerts\n\n';
for (const alert of alerts) {
const emoji = alert.severity === 'critical' ? '🚨' : '⚠️';
report += `${emoji} **${alert.severity.toUpperCase()}**: ${alert.message}\n`;
report += `- Action: ${alert.recommended_action}\n\n`;
}
}
return report;
}
}
// Usage example
const tracker = new ErrorBudgetTracker();
// sloStatus: one SLOStatus entry produced by SLOCalculator.calculateStatus() for the 30-day window
const policy = tracker.getCurrentPolicy(sloStatus);
if (policy.action === 'code_freeze') {
console.log('🚨 CODE FREEZE: Error budget exhausted');
console.log('Block all non-critical deployments');
console.log('Focus team on reliability improvements');
}
Implementation
Implementing SLI/SLO/SLA monitoring requires instrumentation, storage, querying, and alerting infrastructure. Prometheus provides the standard foundation for SRE metrics with recording rules, alerting rules, and Grafana dashboards.
Prometheus Recording Rules pre-compute SLI values to reduce query load:
# prometheus-rules.yaml
groups:
- name: chatgpt_slis
interval: 60s
rules:
# Availability SLI
- record: chatgpt:availability:ratio_rate5m
expr: |
sum(rate(chatgpt_app_requests_total{status="success"}[5m]))
/
sum(rate(chatgpt_app_requests_total[5m]))
# Latency SLI (% meeting target)
- record: chatgpt:latency:good_ratio_rate5m
expr: |
sum(rate(chatgpt_app_request_duration_seconds_bucket{le="1.0"}[5m]))
/
sum(rate(chatgpt_app_request_duration_seconds_count[5m]))
# Quality SLI
- record: chatgpt:quality:ratio_rate5m
expr: |
sum(rate(chatgpt_app_quality_total{valid="true"}[5m]))
/
sum(rate(chatgpt_app_quality_total[5m]))
# 30-day availability SLO
- record: chatgpt:availability:ratio_rate30d
expr: |
sum(rate(chatgpt_app_requests_total{status="success"}[30d]))
/
sum(rate(chatgpt_app_requests_total[30d]))
# Error budget remaining (as fraction)
      - record: chatgpt:error_budget:remaining_ratio
        expr: |
          ((1 - 0.999) - (1 - chatgpt:availability:ratio_rate30d))
          /
          (1 - 0.999)
# Burn rate (actual error rate / allowed error rate)
      - record: chatgpt:error_budget:burn_rate_1h
        expr: |
          (
            1 - (
              sum(rate(chatgpt_app_requests_total{status="success"}[1h]))
              /
              sum(rate(chatgpt_app_requests_total[1h]))
            )
          )
          /
          (1 - 0.999)
Burn Rate Alerts detect rapid error budget consumption:
# burn-rate-alerts.yaml
groups:
- name: chatgpt_error_budget
rules:
# Page: 2% budget consumed in 1 hour
- alert: ErrorBudgetBurnRateCritical
expr: chatgpt:error_budget:burn_rate_1h > 14.4
for: 5m
labels:
severity: critical
annotations:
summary: "Critical error budget burn rate"
description: "Burning through monthly error budget in 2 days. Current rate: {{ $value }}x"
      # Warn: 5% budget consumed in 6 hours
- alert: ErrorBudgetBurnRateHigh
expr: chatgpt:error_budget:burn_rate_1h > 6
for: 30m
labels:
severity: warning
annotations:
summary: "High error budget burn rate"
description: "Burning through monthly error budget in 5 days. Current rate: {{ $value }}x"
# Ticket: Budget critically low
- alert: ErrorBudgetLow
expr: chatgpt:error_budget:remaining_ratio < 0.1
for: 1h
labels:
severity: warning
annotations:
summary: "Error budget critically low"
description: "Only {{ $value | humanizePercentage }} of monthly error budget remaining"
SLO Dashboard (Grafana JSON):
{
"dashboard": {
"title": "ChatGPT App SLOs",
"panels": [
{
"title": "30-Day Availability",
"targets": [
{
"expr": "chatgpt:availability:ratio_rate30d * 100",
"legendFormat": "Current"
},
{
"expr": "99.9",
"legendFormat": "SLO Target"
}
],
"yaxes": [
{
"min": 99,
"max": 100,
"format": "percent"
}
]
},
{
"title": "Error Budget Remaining",
"targets": [
{
"expr": "chatgpt:error_budget:remaining_ratio * 100",
"legendFormat": "Budget %"
}
],
"thresholds": [
{
"value": 0,
"color": "red"
},
{
"value": 20,
"color": "yellow"
},
{
"value": 50,
"color": "green"
}
]
},
{
"title": "Burn Rate (1-hour)",
"targets": [
{
"expr": "chatgpt:error_budget:burn_rate_1h",
"legendFormat": "Current"
}
],
"thresholds": [
{
"value": 1.0,
"color": "green"
},
{
"value": 2.0,
"color": "yellow"
},
{
"value": 5.0,
"color": "red"
}
]
},
{
"title": "Request Success Rate",
"targets": [
{
"expr": "sum(rate(chatgpt_app_requests_total{status='success'}[5m])) / sum(rate(chatgpt_app_requests_total[5m])) * 100",
"legendFormat": "Success Rate"
}
]
},
{
"title": "P95 Latency",
"targets": [
{
"expr": "histogram_quantile(0.95, rate(chatgpt_app_request_duration_seconds_bucket[5m]))",
"legendFormat": "P95"
},
{
"expr": "1.0",
"legendFormat": "SLO Target"
}
]
},
{
"title": "Quality Score",
"targets": [
{
"expr": "sum(rate(chatgpt_app_quality_total{valid='true'}[5m])) / sum(rate(chatgpt_app_quality_total[5m])) * 100",
"legendFormat": "Valid %"
},
{
"expr": "99.5",
"legendFormat": "SLO Target"
}
]
}
]
}
}
Compliance Reporter generates monthly SLA compliance reports:
// compliance-reporter.ts
import { SLOStatus } from './slo-calculator';
interface ComplianceReport {
period_start: Date;
period_end: Date;
sla_target_percent: number;
actual_percent: number;
sla_met: boolean;
downtime_minutes: number;
allowed_downtime_minutes: number;
credit_percent: number;
incidents: IncidentSummary[];
}
interface IncidentSummary {
start_time: Date;
end_time: Date;
duration_minutes: number;
impact_percent: number;
root_cause: string;
excluded: boolean;
exclusion_reason?: string;
}
export class ComplianceReporter {
private slaTarget: number;
constructor(slaTargetPercent: number = 99.5) {
this.slaTarget = slaTargetPercent;
}
async generateMonthlyReport(
year: number,
month: number,
incidents: IncidentSummary[]
): Promise<ComplianceReport> {
const periodStart = new Date(year, month - 1, 1);
const periodEnd = new Date(year, month, 0, 23, 59, 59);
const totalMinutes = (periodEnd.getTime() - periodStart.getTime()) / (1000 * 60);
const allowedDowntimeMinutes = ((100 - this.slaTarget) / 100) * totalMinutes;
// Calculate actual downtime (excluding scheduled maintenance)
const downtimeMinutes = incidents
.filter(i => !i.excluded)
.reduce((sum, i) => sum + i.duration_minutes, 0);
const actualPercent = ((totalMinutes - downtimeMinutes) / totalMinutes) * 100;
const slaMet = actualPercent >= this.slaTarget;
const creditPercent = this.calculateCreditPercent(actualPercent);
return {
period_start: periodStart,
period_end: periodEnd,
sla_target_percent: this.slaTarget,
actual_percent: actualPercent,
sla_met: slaMet,
downtime_minutes: downtimeMinutes,
allowed_downtime_minutes: allowedDowntimeMinutes,
credit_percent: creditPercent,
incidents
};
}
private calculateCreditPercent(actualPercent: number): number {
if (actualPercent >= this.slaTarget) return 0;
if (actualPercent >= 99.0) return 10;
if (actualPercent >= 98.0) return 25;
return 50;
}
formatReport(report: ComplianceReport): string {
let output = '# SLA Compliance Report\n\n';
output += `**Period**: ${report.period_start.toISOString().split('T')[0]} to ${report.period_end.toISOString().split('T')[0]}\n\n`;
output += '## Summary\n\n';
output += `- SLA Target: ${report.sla_target_percent}%\n`;
output += `- Actual Availability: ${report.actual_percent.toFixed(3)}%\n`;
output += `- Status: ${report.sla_met ? '✅ SLA MET' : '❌ SLA VIOLATED'}\n`;
output += `- Downtime: ${Math.floor(report.downtime_minutes)} minutes (allowed: ${Math.floor(report.allowed_downtime_minutes)})\n`;
if (report.credit_percent > 0) {
output += `- **Service Credit**: ${report.credit_percent}% of monthly fee\n`;
}
output += '\n## Incidents\n\n';
for (const incident of report.incidents) {
output += `### ${incident.start_time.toISOString()}\n`;
output += `- Duration: ${incident.duration_minutes} minutes\n`;
output += `- Impact: ${incident.impact_percent}% of requests\n`;
output += `- Root Cause: ${incident.root_cause}\n`;
if (incident.excluded) {
output += `- **Excluded**: ${incident.exclusion_reason}\n`;
}
output += '\n';
}
return output;
}
}
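A usage example might look like the following (the incident data is illustrative):
// Usage example
const reporter = new ComplianceReporter(99.5);

const novemberIncidents = [
  {
    start_time: new Date('2026-11-12T09:14:00Z'),
    end_time: new Date('2026-11-12T09:52:00Z'),
    duration_minutes: 38,
    impact_percent: 100,
    root_cause: 'Upstream API outage',
    excluded: false
  }
];

reporter.generateMonthlyReport(2026, 11, novemberIncidents).then((report) => {
  console.log(reporter.formatReport(report));
});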
Conclusion
Service Level Indicators, Objectives, and Agreements provide the quantitative foundation for production ChatGPT applications. SLIs measure what users care about—availability, latency, and quality. SLOs define engineering targets with error budgets that balance velocity and reliability. SLAs create contractual commitments with financial accountability.
Production ChatGPT apps require disciplined SRE practices. Instrument MCP handlers with Prometheus metrics, calculate SLOs from rolling time windows, track error budgets to guide deployment decisions, and generate compliance reports for enterprise customers. Tools like Prometheus metrics collection, Grafana monitoring dashboards, and incident management complete the reliability stack.
Google SRE methodology transforms reliability from reactive firefighting to proactive engineering. Define clear SLIs, set achievable SLOs, honor error budgets, and build systems that meet customer expectations. For end-to-end coverage of building production ChatGPT applications with comprehensive observability, explore our complete guide to building ChatGPT applications.
Ready to build ChatGPT apps with production-grade SLI/SLO/SLA monitoring? MakeAIHQ provides built-in observability, error budget tracking, and compliance reporting. Deploy ChatGPT applications that meet enterprise SLAs without infrastructure expertise. Start your free trial and launch ChatGPT apps with Google SRE reliability.
Last updated: December 2026