Alerting Strategies & On-Call for ChatGPT Apps

Production ChatGPT applications require intelligent alerting systems that notify the right people at the right time without overwhelming your team with false positives. A well-designed alerting strategy is what separates confident scaling from constant firefighting.

This comprehensive guide covers production-ready alerting strategies, on-call management, escalation policies, and alert fatigue prevention specifically designed for ChatGPT applications. You'll learn how to design symptom-based alerts, implement PagerDuty integration, configure escalation chains, and prevent the alert fatigue that plagues many production systems.

Whether you're supporting 100 users or 100,000, effective alerting ensures you catch critical issues before users do while maintaining team sanity. Alert design isn't just about setting thresholds—it's about creating actionable, contextual notifications that enable rapid response and resolution.

By the end of this guide, you'll have production-tested alert configurations, escalation policies, and integration patterns that scale from startup to enterprise. Let's build an alerting system that protects your application without burning out your team.

The Foundation: Symptom-Based Alert Design

The most critical principle of effective alerting is focusing on symptoms (user-facing impact) rather than causes (internal component failures). A symptom-based approach ensures every alert represents a real problem requiring immediate attention.

Symptoms vs. Causes

Symptom-based alerts (actionable):

  • "API response time p95 > 5 seconds for 5 minutes" (users experiencing slowness)
  • "Error rate > 5% for 10 minutes" (users encountering failures)
  • "ChatGPT API quota exhausted" (users unable to get responses)

Cause-based alerts (often noise):

  • "Container CPU > 80%" (may be normal under load)
  • "Redis connection count > 100" (may not impact users)
  • "Disk usage > 70%" (not immediately critical)

Alert Severity Levels

Structure alerts into clear severity tiers with defined response expectations:

# alert-rules.yaml - Production Alert Configuration
#
# Severity Levels:
# - P0/Critical: Immediate response required, page on-call
# - P1/High: Response within 15 minutes, notify team
# - P2/Medium: Response within 1 hour, ticket created
# - P3/Low: Review during business hours

groups:
  - name: chatgpt_app_critical
    interval: 30s
    rules:
      # P0: Complete Service Outage
      - alert: ChatGPTAppDown
        expr: up{job="chatgpt-app"} == 0
        for: 2m
        labels:
          severity: critical
          team: platform
          component: core
        annotations:
          summary: "ChatGPT app {{ $labels.instance }} is down"
          description: "Application has been unreachable for 2 minutes. This is a complete service outage."
          runbook_url: "https://docs.company.com/runbooks/app-down"
          dashboard_url: "https://grafana.company.com/d/chatgpt-overview"

      # P0: High Error Rate
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > 0.05
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Error rate above 5% (current: {{ $value | humanizePercentage }})"
          description: "Users are experiencing significant failures. Investigate immediately."
          impact: "{{ $value | humanizePercentage }} of user requests failing"

      # P0: API Latency Degradation
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
          ) > 5
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "API latency p95 > 5s on {{ $labels.endpoint }}"
          description: "Users experiencing severe slowness. Current p95: {{ $value }}s"

      # P1: ChatGPT API Quota Warning
      - alert: ChatGPTQuotaNearLimit
        expr: |
          (
            chatgpt_api_quota_used
            /
            chatgpt_api_quota_limit
          ) > 0.90
        for: 10m
        labels:
          severity: high
          team: platform
        annotations:
          summary: "ChatGPT API quota at {{ $value | humanizePercentage }}"
          description: "Approaching quota limit. May need to throttle or upgrade."

      # P1: Memory Pressure
      - alert: HighMemoryUsage
        expr: |
          (
            container_memory_usage_bytes{container="chatgpt-app"}
            /
            container_spec_memory_limit_bytes{container="chatgpt-app"}
          ) > 0.90
        for: 15m
        labels:
          severity: high
          team: platform
        annotations:
          summary: "Memory usage at {{ $value | humanizePercentage }}"
          description: "Risk of OOM kills. Consider scaling horizontally."

      # P2: Elevated Error Rate
      - alert: ModerateErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[10m]))
            /
            sum(rate(http_requests_total[10m]))
          ) > 0.01
        for: 10m
        labels:
          severity: medium
          team: platform
        annotations:
          summary: "Error rate above 1% (current: {{ $value | humanizePercentage }})"
          description: "Elevated but not critical error rate. Monitor for escalation."

      # P3: Certificate Expiration Warning
      - alert: TLSCertificateExpiringSoon
        expr: |
          (
            probe_ssl_earliest_cert_expiry{job="blackbox"}
            - time()
          ) / 86400 < 14
        for: 1h
        labels:
          severity: low
          team: platform
        annotations:
          summary: "TLS certificate expires in {{ $value | humanizeDuration }}"
          description: "Renew certificate before expiration to avoid service disruption."

Making Alerts Actionable

Every alert must answer three questions:

  1. What is wrong? (Clear symptom description)
  2. Why does it matter? (User impact)
  3. What should I do? (Runbook link or first steps)
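
To keep this checklist enforceable rather than aspirational, you can lint rule annotations in CI and reject any alert that skips one of the three. A minimal sketch in TypeScript (the field names mirror the annotations used in the rules above; the lint helper itself is hypothetical):

// annotation-lint.ts - Reject alert rules that are not actionable
interface AlertAnnotations {
  summary?: string;     // What is wrong?
  description?: string; // Why does it matter? (user impact)
  runbook_url?: string; // What should I do? (first steps)
}

export function lintAnnotations(alertName: string, annotations: AlertAnnotations): string[] {
  const problems: string[] = [];

  if (!annotations.summary) problems.push(`${alertName}: missing "summary" (what is wrong?)`);
  if (!annotations.description) problems.push(`${alertName}: missing "description" (why does it matter?)`);
  if (!annotations.runbook_url) problems.push(`${alertName}: missing "runbook_url" (what should I do?)`);

  return problems;
}

// Example: fail a CI step when a rule lacks actionable context
const issues = lintAnnotations('HighErrorRate', {
  summary: 'Error rate above 5%',
  description: 'Users are experiencing significant failures.'
  // runbook_url omitted -> lint failure
});
if (issues.length > 0) {
  console.error(issues.join('\n'));
  process.exit(1);
}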

Threshold Configuration Strategies

Setting appropriate thresholds prevents both alert fatigue (thresholds too sensitive) and missed incidents (thresholds too lax).

Static Thresholds

For well-understood metrics with stable baselines:

// static-thresholds.ts - Fixed Threshold Configuration
interface ThresholdConfig {
  metric: string;
  threshold: number;
  duration: string;
  severity: 'critical' | 'high' | 'medium' | 'low';
  description: string;
}

const STATIC_THRESHOLDS: ThresholdConfig[] = [
  {
    metric: 'error_rate',
    threshold: 0.05, // 5%
    duration: '5m',
    severity: 'critical',
    description: 'User-facing errors above acceptable threshold'
  },
  {
    metric: 'response_time_p95',
    threshold: 5.0, // 5 seconds
    duration: '5m',
    severity: 'critical',
    description: 'API slowness impacting user experience'
  },
  {
    metric: 'chatgpt_api_quota_usage',
    threshold: 0.90, // 90%
    duration: '10m',
    severity: 'high',
    description: 'Risk of hitting quota limit'
  },
  {
    metric: 'disk_usage',
    threshold: 0.85, // 85%
    duration: '1h',
    severity: 'medium',
    description: 'Disk space running low'
  }
];

// Generate Prometheus alert rules from threshold config
export function generateAlertRules(thresholds: ThresholdConfig[]): string {
  const rules = thresholds.map(t => `
  - alert: ${metricToAlertName(t.metric)}
    expr: ${t.metric} > ${t.threshold}
    for: ${t.duration}
    labels:
      severity: ${t.severity}
    annotations:
      summary: "${t.description}"
      current_value: "{{ $value }}"
      threshold: "${t.threshold}"
  `);

  return `groups:\n  - name: static_thresholds\n    rules:${rules.join('')}`;
}

function metricToAlertName(metric: string): string {
  return metric
    .split('_')
    .map(word => word.charAt(0).toUpperCase() + word.slice(1))
    .join('');
}

Dynamic Thresholds with Anomaly Detection

For metrics with time-based patterns or seasonal variation:

// dynamic-thresholds.ts - Anomaly Detection for Alerting
// Note: this class only generates PromQL-based rule text; the baselines are
// computed by Prometheus itself at evaluation time, so no query client is needed.

interface AnomalyConfig {
  metric: string;
  windowSize: string; // e.g., '1h', '1d', '1w'
  stdDevMultiplier: number; // 2.0 = 2 standard deviations
  minSamples: number;
}

export class AnomalyDetector {

  /**
   * Generate dynamic alert rule based on historical baseline
   *
   * Example: Alert if error rate exceeds (baseline + 2*stddev)
   */
  async generateDynamicThreshold(config: AnomalyConfig): Promise<string> {
    const { metric, windowSize, stdDevMultiplier, minSamples } = config;

    // Calculate baseline (mean) over historical window
    const baselineQuery = `avg_over_time(${metric}[${windowSize}])`;

    // Calculate standard deviation
    const stdDevQuery = `stddev_over_time(${metric}[${windowSize}])`;

    // Dynamic threshold = baseline + (stdDev * multiplier)
    // Kept on a single line so it can be embedded in the quoted annotations below
    const thresholdQuery = `(${baselineQuery} + (${stdDevQuery} * ${stdDevMultiplier}))`;

    // Only alert if we have sufficient historical data
    const alertExpr = `
      (
        ${metric} > ${thresholdQuery}
        and
        count_over_time(${metric}[${windowSize}]) >= ${minSamples}
      )
    `;

    return `
  - alert: ${metric.toUpperCase()}_ANOMALY
    expr: |
      ${alertExpr}
    for: 10m
    labels:
      severity: high
      type: anomaly
    annotations:
      summary: "Anomaly detected in ${metric}"
      current_value: "{{ $value }}"
      baseline: "{{ query \\"${baselineQuery}\\" | first | value }}"
      threshold: "{{ query \\"${thresholdQuery}\\" | first | value }}"
      description: "Metric exceeds baseline by ${stdDevMultiplier} standard deviations"
    `;
  }

  /**
   * Time-based threshold adjustment
   * Example: Higher error tolerance during maintenance windows
   */
  generateTimeBasedThreshold(
    metric: string,
    businessHoursThreshold: number,
    offHoursThreshold: number
  ): string {
    return `
  - alert: ${metric.toUpperCase()}_HIGH
    expr: |
      (
        (${metric} > ${businessHoursThreshold})
        and
        (hour() >= 9 and hour() < 17)  # Business hours: 9am-5pm (hour() is UTC)
        and
        (day_of_week() > 0 and day_of_week() < 6)  # Monday-Friday
      )
      or
      (
        (${metric} > ${offHoursThreshold})
        and
        (
          hour() < 9 or hour() >= 17
          or day_of_week() == 0 or day_of_week() == 6
        )
      )
    for: 5m
    labels:
      severity: high
    annotations:
      summary: "High ${metric} during {{ if and (ge (hour) 9) (lt (hour) 17) }}business hours{{ else }}off-hours{{ end }}"
    `;
  }
}

Escalation Policies & On-Call Management

Well-designed escalation policies ensure incidents reach the right people without delay or confusion.

Escalation Chain Configuration

// escalation-policy.ts - PagerDuty-Style Escalation
interface OnCallSchedule {
  teamId: string;
  primaryOnCall: string;
  secondaryOnCall: string;
  managerEscalation: string;
  schedule: {
    timezone: string;
    rotationWeeks: number;
  };
}

interface EscalationRule {
  delay: number; // minutes
  targets: string[]; // user IDs or team IDs
  notifyChannels: ('sms' | 'email' | 'phone' | 'push')[];
}

interface EscalationPolicy {
  id: string;
  name: string;
  description: string;
  rules: EscalationRule[];
  acknowledgementTimeout: number; // minutes
  autoResolveTimeout: number; // minutes
}

export const CHATGPT_APP_ESCALATION: EscalationPolicy = {
  id: 'chatgpt-app-critical',
  name: 'ChatGPT App Critical Escalation',
  description: 'Escalation policy for P0/P1 incidents',
  acknowledgementTimeout: 5,
  autoResolveTimeout: 60,
  rules: [
    {
      delay: 0, // Immediate
      targets: ['oncall-primary'],
      notifyChannels: ['sms', 'push', 'phone']
    },
    {
      delay: 5, // After 5 minutes if not acknowledged
      targets: ['oncall-primary', 'oncall-secondary'],
      notifyChannels: ['sms', 'phone']
    },
    {
      delay: 10, // After 10 minutes total
      targets: ['oncall-primary', 'oncall-secondary', 'team-lead'],
      notifyChannels: ['sms', 'phone']
    },
    {
      delay: 20, // After 20 minutes total - critical escalation
      targets: ['oncall-primary', 'oncall-secondary', 'team-lead', 'engineering-manager'],
      notifyChannels: ['sms', 'phone']
    }
  ]
};

export const CHATGPT_APP_ESCALATION_MEDIUM: EscalationPolicy = {
  id: 'chatgpt-app-medium',
  name: 'ChatGPT App Medium Priority',
  description: 'Escalation for P2/P3 incidents',
  acknowledgementTimeout: 15,
  autoResolveTimeout: 240,
  rules: [
    {
      delay: 0,
      targets: ['oncall-primary'],
      notifyChannels: ['push', 'email']
    },
    {
      delay: 15,
      targets: ['oncall-primary', 'oncall-secondary'],
      notifyChannels: ['push', 'sms']
    }
  ]
};

// Map alert severity to escalation policy
export function getEscalationPolicy(severity: string): EscalationPolicy {
  switch (severity) {
    case 'critical':
    case 'high':
      return CHATGPT_APP_ESCALATION;
    case 'medium':
    case 'low':
      return CHATGPT_APP_ESCALATION_MEDIUM;
    default:
      return CHATGPT_APP_ESCALATION_MEDIUM;
  }
}
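
Before wiring a policy into your paging tool, it helps to render the notification timeline it implies: who is contacted at which minute if nobody acknowledges. A small sketch built on the interfaces above (the rendering helper is hypothetical and assumes it lives alongside the policy definitions):

// escalation-preview.ts - Render an escalation timeline for review
function renderEscalationTimeline(policy: EscalationPolicy): string {
  const lines = policy.rules.map(rule => {
    const when = rule.delay === 0 ? 'immediately' : `after ${rule.delay} min unacknowledged`;
    return `- ${when}: notify ${rule.targets.join(', ')} via ${rule.notifyChannels.join('/')}`;
  });

  return [
    `${policy.name} (ack timeout: ${policy.acknowledgementTimeout} min)`,
    ...lines
  ].join('\n');
}

console.log(renderEscalationTimeline(CHATGPT_APP_ESCALATION));
// ChatGPT App Critical Escalation (ack timeout: 5 min)
// - immediately: notify oncall-primary via sms/push/phone
// - after 5 min unacknowledged: notify oncall-primary, oncall-secondary via sms/phone
// - after 10 min unacknowledged: notify oncall-primary, oncall-secondary, team-lead via sms/phone
// - after 20 min unacknowledged: notify oncall-primary, oncall-secondary, team-lead, engineering-manager via sms/phone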

PagerDuty Integration

// pagerduty-integration.ts - Production PagerDuty Integration
import axios from 'axios';

interface PagerDutyEvent {
  routing_key: string; // Integration key
  event_action: 'trigger' | 'acknowledge' | 'resolve';
  dedup_key?: string; // Unique incident identifier
  payload: {
    summary: string;
    severity: 'critical' | 'error' | 'warning' | 'info';
    source: string;
    timestamp?: string;
    component?: string;
    group?: string;
    class?: string;
    custom_details?: Record<string, any>;
  };
  links?: Array<{
    href: string;
    text: string;
  }>;
  images?: Array<{
    src: string;
    href?: string;
    alt?: string;
  }>;
}

export class PagerDutyIntegration {
  private readonly apiUrl = 'https://events.pagerduty.com/v2/enqueue';
  private routingKey: string;

  constructor(routingKey: string) {
    this.routingKey = routingKey;
  }

  /**
   * Trigger a new incident in PagerDuty
   */
  async triggerAlert(
    summary: string,
    severity: 'critical' | 'error' | 'warning' | 'info',
    details: {
      source: string;
      component?: string;
      runbookUrl?: string;
      dashboardUrl?: string;
      customDetails?: Record<string, any>;
      dedupKey?: string; // Pass a stable key so later acknowledge/resolve calls match this incident
    }
  ): Promise<{ dedup_key: string; status: string }> {
    const dedupKey = details.dedupKey ?? `chatgpt-${details.source}-${Date.now()}`;

    const event: PagerDutyEvent = {
      routing_key: this.routingKey,
      event_action: 'trigger',
      dedup_key: dedupKey,
      payload: {
        summary,
        severity,
        source: details.source,
        timestamp: new Date().toISOString(),
        component: details.component,
        custom_details: details.customDetails
      },
      links: [
        details.runbookUrl && {
          href: details.runbookUrl,
          text: 'Runbook'
        },
        details.dashboardUrl && {
          href: details.dashboardUrl,
          text: 'Dashboard'
        }
      ].filter(Boolean) as any
    };

    const response = await axios.post(this.apiUrl, event);

    return {
      dedup_key: dedupKey,
      status: response.data.status
    };
  }

  /**
   * Acknowledge an existing incident
   */
  async acknowledgeAlert(dedupKey: string): Promise<void> {
    const event: PagerDutyEvent = {
      routing_key: this.routingKey,
      event_action: 'acknowledge',
      dedup_key: dedupKey,
      payload: {
        summary: 'Acknowledged',
        severity: 'info',
        source: 'automation'
      }
    };

    await axios.post(this.apiUrl, event);
  }

  /**
   * Resolve an incident
   */
  async resolveAlert(dedupKey: string, resolutionNote?: string): Promise<void> {
    const event: PagerDutyEvent = {
      routing_key: this.routingKey,
      event_action: 'resolve',
      dedup_key: dedupKey,
      payload: {
        summary: resolutionNote || 'Resolved',
        severity: 'info',
        source: 'automation'
      }
    };

    await axios.post(this.apiUrl, event);
  }

  /**
   * Send alert from Prometheus AlertManager webhook
   */
  async handlePrometheusWebhook(alertmanagerPayload: any): Promise<void> {
    const alerts = alertmanagerPayload.alerts;

    for (const alert of alerts) {
      const dedupKey = `${alert.labels.alertname}-${alert.labels.instance || 'global'}`;

      if (alert.status === 'firing') {
        await this.triggerAlert(
          alert.annotations.summary || alert.labels.alertname,
          this.mapSeverity(alert.labels.severity),
          {
            dedupKey, // Reuse the Alertmanager-derived key so the resolve branch below matches
            source: alert.labels.instance || 'unknown',
            component: alert.labels.component,
            runbookUrl: alert.annotations.runbook_url,
            dashboardUrl: alert.annotations.dashboard_url,
            customDetails: {
              labels: alert.labels,
              annotations: alert.annotations,
              startsAt: alert.startsAt,
              generatorURL: alert.generatorURL
            }
          }
        );
      } else if (alert.status === 'resolved') {
        await this.resolveAlert(
          dedupKey,
          'Alert resolved automatically'
        );
      }
    }
  }

  private mapSeverity(prometheusSeverity: string): 'critical' | 'error' | 'warning' | 'info' {
    switch (prometheusSeverity?.toLowerCase()) {
      case 'critical':
        return 'critical';
      case 'high':
        return 'error';
      case 'medium':
        return 'warning';
      default:
        return 'info';
    }
  }
}

// Usage example
const pagerduty = new PagerDutyIntegration(process.env.PAGERDUTY_ROUTING_KEY!);

// Trigger critical alert
await pagerduty.triggerAlert(
  'ChatGPT App Down - Complete Outage',
  'critical',
  {
    source: 'prod-chatgpt-app-1',
    component: 'core-api',
    runbookUrl: 'https://docs.company.com/runbooks/app-down',
    dashboardUrl: 'https://grafana.company.com/d/chatgpt-overview',
    customDetails: {
      instance: 'prod-chatgpt-app-1',
      region: 'us-east-1',
      downtime_duration: '2m',
      last_successful_health_check: '2026-12-25T10:00:00Z'
    }
  }
);
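
To connect Alertmanager to this integration, expose a small HTTP endpoint that forwards webhook payloads to handlePrometheusWebhook and point an Alertmanager webhook receiver at it. A minimal sketch using Express; the route path and port are assumptions:

// alertmanager-webhook.ts - Forward Alertmanager webhooks to PagerDuty
import express from 'express';
import { PagerDutyIntegration } from './pagerduty-integration';

const app = express();
app.use(express.json());

const pagerduty = new PagerDutyIntegration(process.env.PAGERDUTY_ROUTING_KEY!);

// Alertmanager's webhook_config url would point here, e.g. http://alert-bridge:3000/webhooks/alertmanager
app.post('/webhooks/alertmanager', async (req, res) => {
  try {
    await pagerduty.handlePrometheusWebhook(req.body);
    res.status(200).send('ok');
  } catch (err) {
    console.error('Failed to forward alert to PagerDuty', err);
    res.status(500).send('error');
  }
});

app.listen(3000, () => console.log('Alertmanager webhook bridge listening on :3000'));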

Alert Fatigue Prevention

Alert fatigue—when teams become desensitized to alerts—is one of the biggest risks in production monitoring. Prevention requires intentional alert design.

Alert Grouping & Deduplication

// alert-grouping.ts - Intelligent Alert Aggregation
export interface Alert {
  id: string;
  name: string;
  severity: string;
  labels: Record<string, string>;
  annotations: Record<string, string>;
  timestamp: Date;
  fingerprint: string;
}

interface AlertGroup {
  groupKey: string;
  alerts: Alert[];
  firstFired: Date;
  lastFired: Date;
  count: number;
}

export class AlertGrouper {
  private groups = new Map<string, AlertGroup>();
  private groupWindow = 5 * 60 * 1000; // 5 minutes

  /**
   * Group similar alerts together to prevent notification spam
   *
   * Example: 10 pods crashing → 1 notification, not 10
   */
  addAlert(alert: Alert): { isNew: boolean; group: AlertGroup } {
    const groupKey = this.generateGroupKey(alert);

    let group = this.groups.get(groupKey);

    if (!group) {
      // New group
      group = {
        groupKey,
        alerts: [alert],
        firstFired: alert.timestamp,
        lastFired: alert.timestamp,
        count: 1
      };
      this.groups.set(groupKey, group);

      return { isNew: true, group };
    }

    // Check if alert is within grouping window
    const timeSinceLastAlert = alert.timestamp.getTime() - group.lastFired.getTime();

    if (timeSinceLastAlert < this.groupWindow) {
      // Add to existing group
      group.alerts.push(alert);
      group.lastFired = alert.timestamp;
      group.count++;

      return { isNew: false, group };
    } else {
      // Old group expired, create new one
      const newGroup: AlertGroup = {
        groupKey,
        alerts: [alert],
        firstFired: alert.timestamp,
        lastFired: alert.timestamp,
        count: 1
      };
      this.groups.set(groupKey, newGroup);

      return { isNew: true, group: newGroup };
    }
  }

  /**
   * Generate group key based on alert characteristics
   *
   * Alerts with same name, severity, and component are grouped together
   */
  private generateGroupKey(alert: Alert): string {
    const keyComponents = [
      alert.name,
      alert.severity,
      alert.labels.component || 'unknown',
      alert.labels.environment || 'production'
    ];

    return keyComponents.join(':');
  }

  /**
   * Format grouped alert notification
   */
  formatGroupNotification(group: AlertGroup): string {
    const firstAlert = group.alerts[0];

    if (group.count === 1) {
      return `${firstAlert.annotations.summary}`;
    }

    return `
🚨 ${firstAlert.name} (${group.count} instances)

First occurrence: ${group.firstFired.toISOString()}
Last occurrence: ${group.lastFired.toISOString()}

Affected components:
${this.getAffectedComponents(group)}

Summary: ${firstAlert.annotations.summary}
    `.trim();
  }

  private getAffectedComponents(group: AlertGroup): string {
    const components = new Set(
      group.alerts
        .map(a => a.labels.instance || a.labels.pod || 'unknown')
        .slice(0, 10) // Limit to 10 to avoid huge notifications
    );

    const componentList = Array.from(components).join('\n- ');
    const remaining = group.count - components.size;

    if (remaining > 0) {
      return `- ${componentList}\n... and ${remaining} more`;
    }

    return `- ${componentList}`;
  }
}
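
A usage sketch showing how the grouper gates notifications: only the first alert in a group sends a message, later occurrences within the window just bump the counter (the notify callback is hypothetical):

// Usage sketch: suppress duplicate notifications for the same logical problem
const grouper = new AlertGrouper();

async function handleIncomingAlert(
  alert: Alert,
  notify: (message: string) => Promise<void>
): Promise<void> {
  const { isNew, group } = grouper.addAlert(alert);

  if (isNew) {
    // First alert in this group: send exactly one notification
    await notify(grouper.formatGroupNotification(group));
    return;
  }

  // Subsequent alerts within the 5-minute window only update the count
  console.log(`Grouped ${alert.name} (${group.count} occurrences so far)`);
}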

Alert Silencing & Maintenance Windows

// alert-silencing.ts - Silence Alerts During Maintenance
import { Alert } from './alert-grouping';

interface Silence {
  id: string;
  matchers: Array<{
    name: string;
    value: string;
    isRegex: boolean;
  }>;
  startsAt: Date;
  endsAt: Date;
  createdBy: string;
  comment: string;
}

export class AlertSilencer {
  private silences: Silence[] = [];

  /**
   * Create silence for planned maintenance
   *
   * Example: Silence all alerts during deployment window
   */
  createSilence(
    matchers: Silence['matchers'],
    duration: number, // minutes
    comment: string,
    createdBy: string,
    startsAt: Date = new Date()
  ): Silence {
    const silence: Silence = {
      id: `silence-${Date.now()}`,
      matchers,
      startsAt,
      endsAt: new Date(startsAt.getTime() + duration * 60 * 1000),
      createdBy,
      comment
    };

    this.silences.push(silence);

    return silence;
  }

  /**
   * Check if alert should be silenced
   */
  isSilenced(alert: Alert): boolean {
    const now = new Date();

    return this.silences.some(silence => {
      // Check if silence is active
      if (now < silence.startsAt || now > silence.endsAt) {
        return false;
      }

      // Check if alert matches silence criteria
      return silence.matchers.every(matcher => {
        const alertValue = alert.labels[matcher.name];
        if (!alertValue) return false;

        if (matcher.isRegex) {
          return new RegExp(matcher.value).test(alertValue);
        }

        return alertValue === matcher.value;
      });
    });
  }

  /**
   * Create maintenance window silence
   */
  createMaintenanceWindow(
    startTime: Date,
    endTime: Date,
    component: string,
    engineer: string
  ): Silence {
    return this.createSilence(
      [
        { name: 'component', value: component, isRegex: false },
        { name: 'environment', value: 'production', isRegex: false }
      ],
      (endTime.getTime() - startTime.getTime()) / (60 * 1000),
      `Planned maintenance on ${component}`,
      engineer,
      startTime // Silence only during the window, not from the moment it is created
    );
  }

  /**
   * Silence all non-critical alerts during incident response
   */
  createIncidentFocus(incidentId: string, responder: string): Silence {
    return this.createSilence(
      [
        { name: 'severity', value: 'medium|low', isRegex: true }
      ],
      60, // 1 hour
      `Focusing on incident ${incidentId}`,
      responder
    );
  }
}
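
A usage sketch tying silences into the notification path: create a maintenance window before a deploy and drop matching alerts while it is active (the timestamps, component, and email are examples):

// Usage sketch: silence the core API during a planned 30-minute deploy window
const silencer = new AlertSilencer();

silencer.createMaintenanceWindow(
  new Date('2026-01-10T02:00:00Z'),
  new Date('2026-01-10T02:30:00Z'),
  'core-api',
  'alice@example.com'
);

function shouldNotify(alert: Alert): boolean {
  if (silencer.isSilenced(alert)) {
    console.log(`Silenced during maintenance: ${alert.name}`);
    return false;
  }
  return true;
}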

Root Cause Alert Suppression

// root-cause-suppression.ts - Suppress Downstream Alerts
interface AlertDependency {
  upstream: string; // Alert name
  downstream: string[]; // Dependent alert names
  suppressionWindow: number; // minutes
}

const ALERT_DEPENDENCIES: AlertDependency[] = [
  {
    upstream: 'ChatGPTAppDown',
    downstream: [
      'HighLatency',
      'HighErrorRate',
      'LowThroughput',
      'HealthCheckFailing'
    ],
    suppressionWindow: 30
  },
  {
    upstream: 'DatabaseDown',
    downstream: [
      'HighDatabaseLatency',
      'DatabaseConnectionPoolExhausted',
      'HighErrorRate'
    ],
    suppressionWindow: 30
  },
  {
    upstream: 'ChatGPTAPIQuotaExhausted',
    downstream: [
      'ChatGPTAPIErrors',
      'HighLatency',
      'UserComplaintsHigh'
    ],
    suppressionWindow: 60
  }
];

export class RootCauseSuppressor {
  private activeUpstreamAlerts = new Map<string, Date>();

  /**
   * Record when upstream alert fires
   */
  recordUpstreamAlert(alertName: string): void {
    this.activeUpstreamAlerts.set(alertName, new Date());
  }

  /**
   * Check if downstream alert should be suppressed
   */
  shouldSuppress(alertName: string): boolean {
    const now = new Date();

    for (const dep of ALERT_DEPENDENCIES) {
      if (!dep.downstream.includes(alertName)) continue;

      const upstreamFiredAt = this.activeUpstreamAlerts.get(dep.upstream);
      if (!upstreamFiredAt) continue;

      const timeSinceUpstream = (now.getTime() - upstreamFiredAt.getTime()) / (60 * 1000);

      if (timeSinceUpstream < dep.suppressionWindow) {
        console.log(
          `Suppressing ${alertName} due to upstream alert ${dep.upstream} ` +
          `(fired ${timeSinceUpstream.toFixed(1)}m ago)`
        );
        return true;
      }
    }

    return false;
  }

  /**
   * Clear upstream alert (when resolved)
   */
  clearUpstreamAlert(alertName: string): void {
    this.activeUpstreamAlerts.delete(alertName);
  }
}

Notification Channels & Routing

Different alert severities require different notification mechanisms.

Multi-Channel Notification Strategy

// notification-router.ts - Route Alerts to Appropriate Channels
import { WebClient as SlackClient } from '@slack/web-api';
import { PagerDutyIntegration } from './pagerduty-integration';
import { Alert } from './alert-grouping';

interface NotificationChannel {
  name: string;
  type: 'slack' | 'pagerduty' | 'email' | 'webhook';
  config: any;
}

export class NotificationRouter {
  private slackClient: SlackClient;
  private pagerduty: PagerDutyIntegration;

  constructor(
    slackToken: string,
    pagerdutyRoutingKey: string
  ) {
    this.slackClient = new SlackClient(slackToken);
    this.pagerduty = new PagerDutyIntegration(pagerdutyRoutingKey);
  }

  /**
   * Route alert to appropriate channels based on severity
   */
  async routeAlert(alert: Alert): Promise<void> {
    switch (alert.severity) {
      case 'critical':
        // P0: PagerDuty page + Slack alert
        await Promise.all([
          this.sendToPagerDuty(alert),
          this.sendToSlack(alert, '#incidents-critical', true)
        ]);
        break;

      case 'high':
        // P1: Slack alert + PagerDuty notification
        await Promise.all([
          this.sendToSlack(alert, '#incidents-high', true),
          this.sendToPagerDuty(alert)
        ]);
        break;

      case 'medium':
        // P2: Slack notification only
        await this.sendToSlack(alert, '#monitoring-alerts', false);
        break;

      case 'low':
        // P3: Log only (or slack during business hours)
        if (this.isBusinessHours()) {
          await this.sendToSlack(alert, '#monitoring-info', false);
        }
        break;
    }
  }

  private async sendToPagerDuty(alert: Alert): Promise<void> {
    // Map internal severities (critical/high/medium/low) onto PagerDuty event severities
    const severityMap: Record<string, 'critical' | 'error' | 'warning' | 'info'> =
      { critical: 'critical', high: 'error', medium: 'warning', low: 'info' };

    await this.pagerduty.triggerAlert(
      alert.annotations.summary,
      severityMap[alert.severity] ?? 'info',
      {
        source: alert.labels.instance || 'unknown',
        component: alert.labels.component,
        runbookUrl: alert.annotations.runbook_url,
        dashboardUrl: alert.annotations.dashboard_url,
        customDetails: {
          labels: alert.labels,
          annotations: alert.annotations
        }
      }
    );
  }

  private async sendToSlack(
    alert: Alert,
    channel: string,
    mentionOnCall: boolean
  ): Promise<void> {
    const color = this.getSlackColor(alert.severity);
    const emoji = this.getEmoji(alert.severity);

    const blocks: any[] = [ // loosely typed so the optional actions block can be pushed below
      {
        type: 'header',
        text: {
          type: 'plain_text',
          text: `${emoji} ${alert.name}`,
          emoji: true
        }
      },
      {
        type: 'section',
        fields: [
          {
            type: 'mrkdwn',
            text: `*Severity:*\n${alert.severity.toUpperCase()}`
          },
          {
            type: 'mrkdwn',
            text: `*Component:*\n${alert.labels.component || 'Unknown'}`
          },
          {
            type: 'mrkdwn',
            text: `*Environment:*\n${alert.labels.environment || 'production'}`
          },
          {
            type: 'mrkdwn',
            text: `*Instance:*\n${alert.labels.instance || 'N/A'}`
          }
        ]
      },
      {
        type: 'section',
        text: {
          type: 'mrkdwn',
          text: `*Description:*\n${alert.annotations.description || alert.annotations.summary}`
        }
      }
    ];

    // Add action buttons
    if (alert.annotations.runbook_url || alert.annotations.dashboard_url) {
      blocks.push({
        type: 'actions',
        elements: [
          alert.annotations.runbook_url && {
            type: 'button',
            text: { type: 'plain_text', text: 'Runbook', emoji: true },
            url: alert.annotations.runbook_url,
            style: 'primary'
          },
          alert.annotations.dashboard_url && {
            type: 'button',
            text: { type: 'plain_text', text: 'Dashboard', emoji: true },
            url: alert.annotations.dashboard_url
          }
        ].filter(Boolean) as any
      });
    }

    let text = `${emoji} *${alert.name}*`;
    if (mentionOnCall) {
      text = `<!subteam^S01234ABCDE> ${text}`; // Replace with actual on-call group ID
    }

    await this.slackClient.chat.postMessage({
      channel,
      text,
      blocks,
      attachments: [{
        color,
        fallback: alert.annotations.summary
      }]
    });
  }

  private getSlackColor(severity: string): string {
    switch (severity) {
      case 'critical': return '#FF0000'; // Red
      case 'high': return '#FF6600'; // Orange
      case 'medium': return '#FFCC00'; // Yellow
      default: return '#0099FF'; // Blue
    }
  }

  private getEmoji(severity: string): string {
    switch (severity) {
      case 'critical': return '🚨';
      case 'high': return '⚠️';
      case 'medium': return '⚡';
      default: return 'ℹ️';
    }
  }

  private isBusinessHours(): boolean {
    const now = new Date();
    const hour = now.getHours();
    const day = now.getDay();

    // Monday-Friday, 9am-5pm
    return day >= 1 && day <= 5 && hour >= 9 && hour < 17;
  }
}
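
Putting the pieces together, a typical pipeline checks root-cause suppression and active silences first, then deduplicates through the grouper, and routes only new groups. A hedged end-to-end sketch combining the classes above; module paths assume the filenames used in this guide, alerts are assumed to be parsed into the Alert shape already, and SLACK_TOKEN is an assumed environment variable:

// alert-pipeline.ts - Sketch of an end-to-end alert handling pipeline
import { Alert, AlertGrouper } from './alert-grouping';
import { AlertSilencer } from './alert-silencing';
import { RootCauseSuppressor } from './root-cause-suppression';
import { NotificationRouter } from './notification-router';

const grouper = new AlertGrouper();
const silencer = new AlertSilencer();
const suppressor = new RootCauseSuppressor();
const router = new NotificationRouter(
  process.env.SLACK_TOKEN!,
  process.env.PAGERDUTY_ROUTING_KEY!
);

export async function processAlert(alert: Alert): Promise<void> {
  // 1. Record the alert so alerts downstream of it can be suppressed later
  suppressor.recordUpstreamAlert(alert.name);

  // 2. Drop alerts that are downstream of an already-firing root cause
  if (suppressor.shouldSuppress(alert.name)) return;

  // 3. Drop alerts covered by an active silence or maintenance window
  if (silencer.isSilenced(alert)) return;

  // 4. Group duplicates; only the first alert in a group proceeds
  const { isNew } = grouper.addAlert(alert);
  if (!isNew) return;

  // 5. Route by severity (PagerDuty for critical/high, Slack otherwise)
  await router.routeAlert(alert);
}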

Conclusion: Building a Sustainable Alerting Culture

Effective alerting isn't just about configuration—it's about culture. Every alert should represent a real problem requiring human intervention, with clear ownership and actionable guidance.

Alerting Best Practices Recap

  1. Symptom-based alerts: Focus on user impact, not internal component states
  2. Appropriate thresholds: Use static thresholds for stable metrics, dynamic for variable ones
  3. Clear escalation: Define who gets notified when, with appropriate timeouts
  4. Prevent fatigue: Group related alerts, suppress downstream noise, silence during maintenance
  5. Actionable notifications: Include runbook links, dashboard URLs, and context
  6. Multi-channel routing: Route by severity—PagerDuty for P0, Slack for P2
  7. Regular review: Tune thresholds based on false positive rates and incident data
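
For the last point, even a rough measure helps: if most firings of an alert never lead to action, its threshold or duration needs tuning. A hypothetical review sketch, assuming you can export alert history with an actionTaken flag from your incident tool:

// alert-review.ts - Rough false-positive rate per alert name (data shape is an assumption)
interface AlertHistoryEntry {
  alertName: string;
  firedAt: Date;
  actionTaken: boolean; // Did a human actually do something in response?
}

export function falsePositiveRates(history: AlertHistoryEntry[]): Map<string, number> {
  const byAlert = new Map<string, { fired: number; ignored: number }>();

  for (const entry of history) {
    const stats = byAlert.get(entry.alertName) ?? { fired: 0, ignored: 0 };
    stats.fired++;
    if (!entry.actionTaken) stats.ignored++;
    byAlert.set(entry.alertName, stats);
  }

  // Alerts where more than ~50% of firings were ignored are candidates for retuning or removal
  const rates = new Map<string, number>();
  for (const [name, stats] of byAlert) {
    rates.set(name, stats.ignored / stats.fired);
  }
  return rates;
}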

Take Your ChatGPT App Monitoring Further

Ready to build production ChatGPT applications with world-class monitoring and alerting?

MakeAIHQ provides the complete infrastructure for ChatGPT app development, including:

  • Pre-configured monitoring dashboards with Prometheus + Grafana
  • Built-in PagerDuty integration for instant alerting
  • Production-ready alert rules for ChatGPT-specific metrics
  • On-call rotation management and escalation policies
  • Real-time incident tracking and postmortem tools

Start building with MakeAIHQ →

About the Author: The MakeAIHQ team has built and scaled ChatGPT applications serving millions of users, implementing production monitoring systems that catch issues before customers notice. We've learned these alerting strategies through years of on-call experience and hundreds of incidents.