Production deployment is where the output of an automated CI/CD pipeline becomes a live application serving real users. Running that application well requires monitoring, logging, and observability systems that expose application performance, user behavior, and system health. Modern observability extends beyond traditional monitoring to include distributed tracing, performance profiling, and alerting that lets teams address issues before users notice them.
The observability ecosystem has largely converged on OpenTelemetry for telemetry standards, Prometheus for metrics collection, and Grafana for visualization. Together they provide insight into application behavior across distributed systems. Production deployment strategies must also account for scaling patterns, traffic management, and incident response procedures that keep services reliable while supporting continuous delivery.
Production Kubernetes Configuration
Production Kubernetes deployments require careful consideration of resource allocation, security policies, and operational procedures. Unlike development environments, production configurations emphasize reliability, performance, and security over convenience and flexibility.
# kubernetes/production/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-application
  namespace: production
  labels:
    app: web-application
    version: v1.0.0
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%
  selector:
    matchLabels:
      app: web-application
  template:
    metadata:
      labels:
        app: web-application
        version: v1.0.0
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "3000"
        prometheus.io/path: "/metrics"
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1001
        fsGroup: 1001
      containers:
        - name: web-application
          image: gcr.io/project-id/web-application:v1.0.0
          ports:
            - containerPort: 3000
              name: http
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          livenessProbe:
            httpGet:
              path: /health/live
              port: http
            initialDelaySeconds: 45
            periodSeconds: 10
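The percentage values in the rolling update strategy translate into concrete pod counts. The following minimal sketch works through that arithmetic for the Deployment above, relying on the Kubernetes behavior that a percentage `maxUnavailable` is rounded down and a percentage `maxSurge` is rounded up. The function name `rolling_update_bounds` is hypothetical and exists only for illustration.

```python
def rolling_update_bounds(replicas: int, max_unavailable_pct: int, max_surge_pct: int) -> dict:
    """Return pod-count bounds during a RollingUpdate with percentage settings.

    maxUnavailable is rounded down and maxSurge is rounded up, both relative
    to the desired replica count. Integer arithmetic avoids float rounding.
    """
    max_unavailable = (replicas * max_unavailable_pct) // 100        # floor
    max_surge = -((-replicas * max_surge_pct) // 100)                # ceiling
    return {
        "max_unavailable": max_unavailable,
        "max_surge": max_surge,
        "min_available": replicas - max_unavailable,
        "max_total": replicas + max_surge,
    }


# Values taken from the Deployment: 5 replicas, 25% / 25%.
bounds = rolling_update_bounds(replicas=5, max_unavailable_pct=25, max_surge_pct=25)
print(bounds)
# → {'max_unavailable': 1, 'max_surge': 2, 'min_available': 4, 'max_total': 7}
```

With these settings, at least four pods should stay available during a rollout and at most seven pods exist at once, which is worth checking against the namespace's resource quota.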
Comprehensive Monitoring with Prometheus
Prometheus provides metrics collection and alerting capabilities that form the foundation of production monitoring systems. Modern Prometheus deployments emphasize service discovery, efficient data storage, and integration with alerting systems that enable rapid incident response.
# monitoring/prometheus-config.yaml
global:
  scrape_interval: 30s
  evaluation_interval: 30s
  external_labels:
    cluster: 'production'
    environment: 'production'
rule_files:
  - "/etc/prometheus/rules/*.yml"
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
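To make the effect of the three relabel rules in the `kubernetes-pods` job more concrete, here is a rough Python simulation of how they treat a single discovered target. It is a simplified sketch, not Prometheus's implementation: the function name `relabel_pod_target`, the sample addresses, and the sample label sets are all hypothetical. It does reflect that Prometheus anchors relabel regexes, which is why `re.fullmatch` is used.

```python
import re
from typing import Dict, Optional


def relabel_pod_target(labels: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Apply the keep / replace / labelmap rules to one target's labels.

    Returns the relabeled label set, or None when the target is dropped.
    """
    out = dict(labels)

    # Rule 1 (keep): drop the target unless the scrape annotation is "true".
    scrape = out.get("__meta_kubernetes_pod_annotation_prometheus_io_scrape", "")
    if not re.fullmatch(r"true", scrape):
        return None

    # Rule 2 (replace): copy a non-empty path annotation into __metrics_path__.
    path = out.get("__meta_kubernetes_pod_annotation_prometheus_io_path", "")
    match = re.fullmatch(r"(.+)", path)
    if match:
        out["__metrics_path__"] = match.group(1)

    # Rule 3 (labelmap): expose each pod label under its captured name.
    for name, value in labels.items():
        match = re.fullmatch(r"__meta_kubernetes_pod_label_(.+)", name)
        if match:
            out[match.group(1)] = value
    return out


# Hypothetical discovered targets: one annotated pod, one without annotations.
annotated = {
    "__address__": "10.0.0.12:3000",
    "__meta_kubernetes_pod_annotation_prometheus_io_scrape": "true",
    "__meta_kubernetes_pod_annotation_prometheus_io_path": "/metrics",
    "__meta_kubernetes_pod_label_app": "web-application",
    "__meta_kubernetes_pod_label_version": "v1.0.0",
}
unannotated = {"__address__": "10.0.0.13:8080"}

kept = relabel_pod_target(annotated)
print(kept["app"], kept["version"], kept["__metrics_path__"])
# → web-application v1.0.0 /metrics
print(relabel_pod_target(unannotated))
# → None
```

None of the three rules reads the `prometheus.io/port` annotation, so in this sketch `__address__` passes through unchanged.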
Observability with OpenTelemetry and Grafana
OpenTelemetry provides standardized telemetry collection that enables comprehensive observability across distributed systems. Integration with Grafana creates unified dashboards that correlate metrics, traces, and logs for effective troubleshooting and performance analysis.
# monitoring/otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          scrape_interval: 30s
          static_configs:
            - targets: ['localhost:8888']
processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s  # must be set for memory_limiter; 1s is the commonly recommended value
    limit_mib: 512
  resource:
    attributes:
      - key: service.instance.id
        value: ${HOSTNAME}
        action: upsert
exporters:
  prometheusremotewrite:
    endpoint: "http://prometheus:9090/api/v1/write"
    tls:
      insecure: true
  # Note: recent Collector releases no longer ship the jaeger exporter;
  # newer setups typically send OTLP to Jaeger instead.
  jaeger:
    endpoint: jaeger-collector:14250
    tls:
      insecure: true
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
    tenant_id: production
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [jaeger]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch, resource]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [loki]
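A common source of Collector startup failures is a pipeline that references a receiver, processor, or exporter that is not defined in its top-level section. The sketch below illustrates that wiring rule on a plain dictionary that mirrors only the component names from the configuration above. The helper `check_pipelines` is hypothetical, and the Collector performs its own validation at startup, so this is an illustration of the rule rather than a replacement for it. The check that `memory_limiter` comes first reflects a commonly recommended ordering, not a hard requirement.

```python
from typing import Dict, List


def check_pipelines(config: Dict) -> List[str]:
    """Return human-readable problems found in config['service']['pipelines']."""
    problems = []
    for name, pipeline in config["service"]["pipelines"].items():
        # Every referenced component must exist in its top-level section.
        for kind in ("receivers", "processors", "exporters"):
            defined = config.get(kind, {})
            for component in pipeline.get(kind, []):
                if component not in defined:
                    problems.append(f"{name}: {kind[:-1]} '{component}' is not defined")
        # memory_limiter is usually placed first so it can refuse data early.
        processors = pipeline.get("processors", [])
        if "memory_limiter" in processors and processors[0] != "memory_limiter":
            problems.append(f"{name}: memory_limiter should be the first processor")
    return problems


# Component names copied from the configuration; settings are omitted.
processors = ["memory_limiter", "batch", "resource"]
config = {
    "receivers": {"otlp": {}, "prometheus": {}},
    "processors": {"batch": {}, "memory_limiter": {}, "resource": {}},
    "exporters": {"prometheusremotewrite": {}, "jaeger": {}, "loki": {}},
    "service": {
        "pipelines": {
            "traces": {"receivers": ["otlp"], "processors": processors, "exporters": ["jaeger"]},
            "metrics": {
                "receivers": ["otlp", "prometheus"],
                "processors": processors,
                "exporters": ["prometheusremotewrite"],
            },
            "logs": {"receivers": ["otlp"], "processors": processors, "exporters": ["loki"]},
        }
    },
}

print(check_pipelines(config))
# → []

# Removing an exporter definition while a pipeline still uses it is reported.
broken = dict(config, exporters={"prometheusremotewrite": {}, "jaeger": {}})
print(check_pipelines(broken))
# → ["logs: exporter 'loki' is not defined"]
```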
Incident Response and Troubleshooting
Effective incident response requires structured procedures, comprehensive tooling, and team coordination that enables rapid problem resolution. Modern incident response integrates automated alerting, runbook automation, and post-incident analysis that prevents recurring issues.
# monitoring/alert-rules.yaml
groups:
  - name: application.rules
    interval: 30s
    rules:
      - alert: HighErrorRate
        # Aggregate by job so the 5xx rate is divided by the total request
        # rate for that job rather than matched series-by-series.
        expr: |
          (
            sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) /
            sum by (job) (rate(http_requests_total[5m]))
          ) > 0.05
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.job }}"
          runbook_url: "https://runbooks.example.com/high-error-rate"
      - alert: HighLatency
        # Sum bucket rates per job (keeping le) so the quantile covers all instances.
        expr: |
          histogram_quantile(0.95,
            sum by (le, job) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 0.5
        for: 10m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "High latency detected"
          description: "95th percentile latency is {{ $value }}s for {{ $labels.job }}"
      - alert: PodCrashLooping
        expr: |
          rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Pod is crash looping"
          description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is crash looping"
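The ratio that the `HighErrorRate` rule is meant to capture, and the effect of its `for: 5m` clause, can be worked through with a small amount of arithmetic. The sketch below is a deliberate simplification: Prometheus's `rate()` also handles counter resets and extrapolation, which are ignored here. The counter samples and the function names `error_ratio` and `alert_state` are hypothetical.

```python
import re
from typing import List, Tuple

# (status label, counter value at window start, counter value at window end)
Sample = Tuple[str, float, float]


def error_ratio(samples: List[Sample], window_seconds: float) -> float:
    """Share of 5xx requests, computed from per-second counter increases."""
    total_rate = sum((end - start) / window_seconds for _, start, end in samples)
    error_rate = sum(
        (end - start) / window_seconds
        for status, start, end in samples
        if re.fullmatch(r"5..", status)  # same pattern as status=~"5.."
    )
    return error_rate / total_rate if total_rate else 0.0


def alert_state(ratios: List[float], threshold: float, interval_s: int, for_s: int) -> str:
    """Replay one ratio per evaluation and return the final alert state."""
    active_since = None
    state = "inactive"
    for i, ratio in enumerate(ratios):
        now = i * interval_s
        if ratio > threshold:
            if active_since is None:
                active_since = now  # condition just became true: pending
            state = "firing" if now - active_since >= for_s else "pending"
        else:
            active_since = None  # any dip below the threshold resets the timer
            state = "inactive"
    return state


# Hypothetical 5-minute window: 2700 successful and 300 failed requests.
window = [("200", 10_000.0, 12_700.0), ("500", 100.0, 400.0)]
ratio = error_ratio(window, window_seconds=300)
print(ratio)
# → 0.1

# Evaluated every 30s with a 5m hold, as in the rule group.
print(alert_state([ratio] * 10, threshold=0.05, interval_s=30, for_s=300))
# → pending
print(alert_state([ratio] * 11, threshold=0.05, interval_s=30, for_s=300))
# → firing
```

The hold period is what keeps a brief spike from paging anyone: in this sketch the condition has to stay true across eleven consecutive evaluations before the alert moves from pending to firing.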
Production deployment and observability complete the cloud-native pipeline from development through production operations. Kubernetes handles orchestration, Prometheus handles metrics collection and alerting, OpenTelemetry standardizes telemetry collection, and Grafana provides visualization. Together they give teams the visibility and incident response tooling needed to operate complex distributed systems reliably.