Cloud-Native Application Pipelines: Part 4 – Production Deployment and Observability

Production deployment is the point at which an automated CI/CD pipeline begins serving real users. Operating at that stage requires monitoring, logging, and observability systems that provide visibility into application performance, user behavior, and system health. Modern observability practice extends beyond traditional monitoring to distributed tracing, performance profiling, and intelligent alerting that enable proactive issue resolution.

The observability ecosystem has evolved around OpenTelemetry standards, Prometheus metrics collection, and Grafana visualization platforms. These technologies work together to provide comprehensive insight into application behavior across distributed systems. Production deployment strategies must account for scaling patterns, traffic management, and incident response procedures that maintain service reliability while supporting continuous delivery practices.

Production Kubernetes Configuration

Production Kubernetes deployments require careful consideration of resource allocation, security policies, and operational procedures. Unlike development environments, production configurations emphasize reliability, performance, and security over convenience and flexibility.

# kubernetes/production/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-application
  namespace: production
  labels:
    app: web-application
    version: v1.0.0
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%
  selector:
    matchLabels:
      app: web-application
  template:
    metadata:
      labels:
        app: web-application
        version: v1.0.0
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "3000"
        prometheus.io/path: "/metrics"
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1001
        fsGroup: 1001
      containers:
      - name: web-application
        image: gcr.io/project-id/web-application:v1.0.0
        ports:
        - containerPort: 3000
          name: http
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /health/live
            port: http
          initialDelaySeconds: 45
          periodSeconds: 10
        # A readiness probe keeps traffic off pods that are not yet serving
        # during rolling updates; the /health/ready path mirrors the liveness
        # path and is illustrative.
        readinessProbe:
          httpGet:
            path: /health/ready
            port: http
          initialDelaySeconds: 10
          periodSeconds: 5
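
The Deployment above pins the replica count at five. In production, scaling and availability during voluntary disruptions are usually delegated to a HorizontalPodAutoscaler and a PodDisruptionBudget. The sketch below is illustrative; the CPU target, replica bounds, and minAvailable value are assumptions rather than values taken from this pipeline.

# kubernetes/production/autoscaling.yaml (illustrative sketch)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-application
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-application
  minReplicas: 5
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
---
# Keep at least four pods available during node drains and other voluntary disruptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-application
  namespace: production
spec:
  minAvailable: 4
  selector:
    matchLabels:
      app: web-application

When the HorizontalPodAutoscaler owns scaling, the static replicas field in the Deployment manifest is typically removed so the two controllers do not fight over the replica count.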

Comprehensive Monitoring with Prometheus

Prometheus provides metrics collection and alerting capabilities that form the foundation of production monitoring systems. Modern Prometheus deployments emphasize service discovery, efficient data storage, and integration with alerting systems that enable rapid incident response.

# monitoring/prometheus-config.yaml
global:
  scrape_interval: 30s
  evaluation_interval: 30s
  external_labels:
    cluster: 'production'
    environment: 'production'

rule_files:
  - "/etc/prometheus/rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
    - role: pod
    relabel_configs:
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: true
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)
    # Honor the prometheus.io/port annotation set on the deployment above.
    - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
      action: replace
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
      target_label: __address__
    - action: labelmap
      regex: __meta_kubernetes_pod_label_(.+)
    - source_labels: [__meta_kubernetes_namespace]
      action: replace
      target_label: namespace
    - source_labels: [__meta_kubernetes_pod_name]
      action: replace
      target_label: pod
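
The rule_files directive above loads every file under /etc/prometheus/rules/. Recording rules are a natural fit for that directory: they precompute the request-rate and error-ratio expressions that dashboards and alerts otherwise re-evaluate on every query. The file below is a sketch; it assumes the same http_requests_total counter used by the alert rules later in this article.

# monitoring/rules/recording-rules.yml (illustrative sketch)
groups:
- name: application.recording
  interval: 30s
  rules:
  - record: job:http_requests:rate5m
    expr: sum(rate(http_requests_total[5m])) by (job)
  - record: job:http_request_errors:ratio5m
    expr: |
      sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
      /
      sum(rate(http_requests_total[5m])) by (job)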

Observability with OpenTelemetry and Grafana

OpenTelemetry provides standardized telemetry collection that enables comprehensive observability across distributed systems. Integration with Grafana creates unified dashboards that correlate metrics, traces, and logs for effective troubleshooting and performance analysis.

# monitoring/otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
      - job_name: 'otel-collector'
        scrape_interval: 30s
        static_configs:
        - targets: ['localhost:8888']

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  memory_limiter:
    # check_interval must be greater than zero or the collector refuses to start
    check_interval: 1s
    limit_mib: 512
  resource:
    attributes:
    - key: service.instance.id
      value: ${HOSTNAME}
      action: upsert

exporters:
  prometheusremotewrite:
    endpoint: "http://prometheus:9090/api/v1/write"
    tls:
      insecure: true
  jaeger:
    endpoint: jaeger-collector:14250
    tls:
      insecure: true
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
    tenant_id: production

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [jaeger]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, resource, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [loki]
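
On the Grafana side, the collector's three export targets map directly onto provisioned data sources, which is what lets a single dashboard correlate metrics, traces, and logs. The provisioning file below is a sketch; the URLs assume the in-cluster service names used in the exporter configuration above and a standard Jaeger query service.

# monitoring/grafana-datasources.yaml (illustrative sketch)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
  - name: Jaeger
    type: jaeger
    access: proxy
    url: http://jaeger-query:16686
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100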

Incident Response and Troubleshooting

Effective incident response requires structured procedures, comprehensive tooling, and team coordination that enables rapid problem resolution. Modern incident response integrates automated alerting, runbook automation, and post-incident analysis that prevents recurring issues.

# monitoring/alert-rules.yaml
groups:
- name: application.rules
  interval: 30s
  rules:
  - alert: HighErrorRate
    expr: |
      (
        rate(http_requests_total{status=~"5.."}[5m]) /
        rate(http_requests_total[5m])
      ) > 0.05
    for: 5m
    labels:
      severity: critical
      team: platform
    annotations:
      summary: "High error rate detected"
      description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.job }}"
      runbook_url: "https://runbooks.example.com/high-error-rate"
  
  - alert: HighLatency
    expr: |
      histogram_quantile(0.95,
        rate(http_request_duration_seconds_bucket[5m])
      ) > 0.5
    for: 10m
    labels:
      severity: warning
      team: platform
    annotations:
      summary: "High latency detected"
      description: "95th percentile latency is {{ $value }}s for {{ $labels.job }}"
  
  - alert: PodCrashLooping
    expr: |
      increase(kube_pod_container_status_restarts_total[15m]) > 3
    for: 5m
    labels:
      severity: critical
      team: platform
    annotations:
      summary: "Pod is crash looping"
      description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is crash looping"

The production deployment and observability foundation completes the cloud-native application pipeline from development through production operations. This comprehensive approach integrates sophisticated monitoring, automated incident response, and continuous optimization practices that support reliable, scalable application delivery. The combination of Kubernetes orchestration, Prometheus monitoring, OpenTelemetry observability, and Grafana visualization creates a robust platform that enables teams to operate complex distributed systems with confidence and efficiency.
