Container Monitoring and Observability

Overview

Container monitoring and observability are critical for understanding the health, performance, and behavior of containerized applications. This article explores comprehensive monitoring and observability strategies for containerized environments, covering metrics, logging, tracing, and visualization.

Observability Fundamentals

The Three Pillars of Observability

Observability consists of three core components that provide insights into system behavior:

Metrics

Quantitative measurements of system behavior over time.

  • System metrics: CPU, memory, disk, network
  • Application metrics: Response times, error rates, throughput
  • Business metrics: User actions, conversions, revenue

Logs

Structured and unstructured records of events and activities.

  • Application logs: Business logic and operational events
  • System logs: Infrastructure and platform events
  • Access logs: Request and response information

Traces

End-to-end tracking of requests through distributed systems.

  • Request flow: Path through services
  • Performance analysis: Timing and bottlenecks
  • Error propagation: Issue tracking across services

Why Observability Matters

Operational Benefits:

  • Faster problem resolution: Quickly identify and fix issues
  • Performance optimization: Understand system behavior
  • Capacity planning: Predict resource needs
  • SLA compliance: Monitor service level agreements

Business Benefits:

  • Customer experience: Monitor application performance
  • Revenue protection: Prevent outages and degradation
  • Cost optimization: Right-size resources
  • Compliance: Meet regulatory requirements

Container Metrics

Key Container Metrics

Resource Utilization Metrics:

  • CPU usage: Percentage of CPU utilized
  • Memory usage: Memory consumption and limits
  • Network I/O: Bytes sent/received
  • Disk I/O: Read/write operations and throughput

Container-Specific Metrics:

  • Container count: Running, stopped, paused containers
  • Image pull time: Time to pull container images
  • Container startup time: Time to start containers
  • Restarts: Container restart counts

Metrics Collection

Docker Metrics:

BASH
# View container stats
docker stats

# Get detailed stats in JSON format
docker stats --format "{{json .}}"

# Monitor specific containers
docker stats container1 container2

Kubernetes Metrics:

  • kubelet: Node and pod metrics
  • cAdvisor: Container resource usage
  • kube-state-metrics: Kubernetes object metrics
  • metrics-server: Resource metrics API

Metrics Formats and Standards

Prometheus Metrics Format:

TEXT
# HELP container_cpu_usage_seconds_total Cumulative cpu time consumed by the container in seconds.
# TYPE container_cpu_usage_seconds_total counter
container_cpu_usage_seconds_total{id="/system.slice/docker.service",cpu="cpu0"} 1234567890
container_memory_usage_bytes{id="/docker/abcd1234"} 1048576

OpenMetrics Standard:

  • Extension of Prometheus format
  • Improved data types and metadata
  • Better internationalization support

Prometheus for Container Monitoring

Prometheus Architecture

Prometheus follows a pull-based monitoring model with several key components:

Core Components:

  • Prometheus Server: Stores and queries metrics
  • Pushgateway: For batch jobs and short-lived metrics
  • Alertmanager: Handles alerts and notifications
  • Exporters: Convert metrics to Prometheus format

Container-Specific Exporters:

  • cAdvisor: Container resource metrics
  • kube-state-metrics: Kubernetes object metrics
  • node_exporter: Node-level metrics
  • blackbox_exporter: Blackbox monitoring
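
Node-level and container-level exporters are usually deployed as a DaemonSet so that every node in the cluster exposes metrics to Prometheus. The following is a minimal sketch for node_exporter; the namespace, image tag, and security settings are assumptions to adapt to your cluster.

YAML
# Minimal node_exporter DaemonSet sketch (namespace, image tag, and settings are illustrative)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostNetwork: true          # expose metrics on each node's own address
      containers:
      - name: node-exporter
        image: quay.io/prometheus/node-exporter:latest
        ports:
        - containerPort: 9100    # default node_exporter metrics port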

Prometheus Configuration

prometheus.yml:

YAML
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - alertmanager:9093

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'kubernetes-nodes'
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    kubernetes_sd_configs:
    - role: node
    relabel_configs:
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
    - role: pod
    relabel_configs:
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: true
    - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
      action: replace
      regex: '([^:]+)(?::\d+)?;(\d+)'
      replacement: '${1}:${2}'
      target_label: __address__
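
With the kubernetes-pods job above, a pod opts into scraping through annotations on its metadata. A minimal example follows; the pod name, image, and port are placeholders.

YAML
# Pod that will be discovered by the kubernetes-pods job above (names and port are placeholders)
apiVersion: v1
kind: Pod
metadata:
  name: web-api
  annotations:
    prometheus.io/scrape: "true"   # matched by the keep rule
    prometheus.io/port: "8080"     # substituted into the scrape address
spec:
  containers:
  - name: web-api
    image: example/web-api:latest
    ports:
    - containerPort: 8080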

PromQL Queries for Containers

Common Container Queries:

PROMQL
# CPU usage by container
rate(container_cpu_usage_seconds_total[5m])

# Memory usage by container
container_memory_usage_bytes

# Network I/O by container
rate(container_network_receive_bytes_total[5m])
rate(container_network_transmit_bytes_total[5m])

# Disk usage by container
container_fs_usage_bytes

# Container uptime
time() - container_start_time_seconds

Container Logging

Container Logging Architecture

Docker Logging:

  • JSON File: Default logging driver
  • Syslog: Send logs to syslog
  • Journald: Use systemd journal
  • Fluentd: Forward to fluentd
  • Splunk: Forward to Splunk
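
The driver and its options can also be set per container or per service. A hedged Docker Compose sketch follows, keeping the default json-file driver but adding log rotation; the service name, image, and limits are illustrative.

YAML
# docker-compose.yml fragment (service name, image, and limits are illustrative)
services:
  web-api:
    image: example/web-api:latest
    logging:
      driver: json-file      # default driver, made explicit
      options:
        max-size: "10m"      # rotate once a log file reaches 10 MB
        max-file: "3"        # keep at most three rotated files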

Kubernetes Logging:

  • Node-level logging: Agent on each node
  • Application-level logging: Sidecar containers
  • Cluster-level logging: Centralized solution
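
The sidecar approach can be sketched as a pod in which the application writes to a shared volume and a lightweight sidecar streams the file to stdout for the node-level agent to pick up. The images, paths, and names below are illustrative.

YAML
# Sidecar logging sketch (images, paths, and names are illustrative)
apiVersion: v1
kind: Pod
metadata:
  name: app-with-log-sidecar
spec:
  containers:
  - name: app
    image: example/app:latest
    volumeMounts:
    - name: app-logs
      mountPath: /var/log/app
  - name: log-sidecar
    image: busybox:latest
    command: ["sh", "-c", "tail -n+1 -F /var/log/app/app.log"]
    volumeMounts:
    - name: app-logs
      mountPath: /var/log/app
  volumes:
  - name: app-logs
    emptyDir: {}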

Log Aggregation Solutions

ELK Stack (Elasticsearch, Logstash, Kibana):

  • Elasticsearch: Search and analytics engine
  • Logstash: Data processing pipeline
  • Kibana: Data visualization

Alternative Solutions:

  • EFK Stack: Elasticsearch, Fluentd, Kibana
  • Graylog: Centralized log management
  • Loki: Lightweight log aggregation (by Grafana Labs)
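
Loki is usually fed by an agent such as Promtail running on each node. A minimal Promtail configuration sketch follows; the Loki URL and log paths are assumptions.

YAML
# Minimal Promtail configuration sketch (Loki URL and paths are assumptions)
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml      # tracks how far each file has been read

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: containers
    static_configs:
      - targets: [localhost]
        labels:
          job: containerlogs
          __path__: /var/log/containers/*.log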

Log Processing Pipeline

Fluentd Configuration:

TEXT
<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  format json
  time_key time
  time_format %Y-%m-%dT%H:%M:%S.%NZ
</source>

<filter kubernetes.**>
  @type kubernetes_metadata
</filter>

<match kubernetes.**>
  @type elasticsearch
  host "#{ENV['OUTPUT_HOST']}"
  port "#{ENV['OUTPUT_PORT']}"
  logstash_format true
  include_tag_key true
  type_name _doc
</match>

Structured Logging

JSON Logging Format:

JSON
{
  "timestamp": "2026-01-08T10:00:00.000Z",
  "level": "INFO",
  "service": "web-api",
  "traceId": "abc123",
  "spanId": "def456",
  "message": "User authenticated successfully",
  "userId": "user123",
  "requestId": "req789"
}

Benefits of Structured Logging:

  • Machine-readable: Easy to parse and analyze
  • Consistent format: Uniform log structure
  • Rich metadata: Additional context and information
  • Query-friendly: Efficient searching and filtering

Distributed Tracing

Tracing Concepts

Trace Structure:

  • Trace: End-to-end request journey
  • Span: Individual operation within trace
  • Span Context: Propagated across services
  • Tags: Metadata attached to spans

Tracing Data:

  • Timeline: Operation sequence and timing
  • Relationships: Service interaction map
  • Performance: Latency and bottleneck analysis
  • Errors: Exception and failure tracking

OpenTelemetry

OpenTelemetry is a CNCF project that has become the de facto industry standard for collecting observability data (traces, metrics, and logs).

SDK Components:

  • Traces: Distributed tracing
  • Metrics: Metric collection
  • Logs: Log collection and processing
  • Context propagation: Distributed context management

Instrumentation:

JAVASCRIPT
// Example Node.js instrumentation
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter(),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

Jaeger for Distributed Tracing

Jaeger is a popular distributed tracing system.

Jaeger Components:

  • Agent: Collects and forwards spans
  • Collector: Receives and processes spans
  • Query: UI for exploring traces
  • Storage: Backend storage (Cassandra, Elasticsearch)

Jaeger Integration:

YAML
# Jaeger deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
spec:
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
      - name: jaeger
        image: jaegertracing/all-in-one:latest  # all-in-one is intended for development and testing, not production
        ports:
        - containerPort: 16686  # Query UI
        - containerPort: 14268  # Collector
        env:
        - name: COLLECTOR_ZIPKIN_HTTP_PORT
          value: "9411"

Grafana for Visualization

Grafana Dashboards

Grafana provides powerful visualization capabilities for container metrics.

Dashboard Panels:

  • Graphs: Time-series data visualization
  • Tables: Tabular metric display
  • Single stats: Key metric highlighting
  • Heatmaps: Distribution visualization
  • Alerts: Visual alert indicators

Container Dashboard Example:

JSON
{
  "dashboard": {
    "title": "Container Monitoring",
    "panels": [
      {
        "title": "CPU Usage by Container",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(container_cpu_usage_seconds_total[5m])",
            "legendFormat": "{{name}}"
          }
        ]
      },
      {
        "title": "Memory Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "container_memory_usage_bytes",
            "legendFormat": "{{name}}"
          }
        ]
      }
    ]
  }
}

Grafana Data Sources

Supported Data Sources:

  • Prometheus: Primary time-series database
  • Loki: Log aggregation system
  • Elasticsearch: Search and analytics
  • InfluxDB: Time-series database
  • MySQL/PostgreSQL: Relational databases
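
Data sources can be added through the UI or provisioned from files (typically placed under /etc/grafana/provisioning/datasources/) so that Grafana starts preconfigured. A provisioning sketch for Prometheus follows; the URL assumes Prometheus is reachable in-cluster as http://prometheus:9090.

YAML
# Grafana data source provisioning sketch (the Prometheus URL is an assumption)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy            # Grafana proxies queries to the data source
    url: http://prometheus:9090
    isDefault: true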

Container Monitoring Patterns

Golden Signals

The four golden signals of monitoring for containerized applications:

Latency

Time taken to serve requests.

PROMQL
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

Traffic

The rate of requests being served.

PROMQL
sum(rate(http_requests_total[5m])) by (status)

Errors

Rate of failed requests.

PROMQL
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

Saturation

How full the system's resources are; for example, the fraction of node memory in use.

PROMQL
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

RED Method

The RED method focuses on request-level metrics for each service (a recording-rule sketch follows the list):

  • Rate: Requests per second
  • Errors: Error rate
  • Duration: Request duration distribution
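
As with the golden signals above, RED metrics are typically expressed in PromQL; one way to standardize them is as Prometheus recording rules. The sketch below assumes an http_requests_total counter and an http_request_duration_seconds histogram like those used earlier.

YAML
# RED metrics as recording rules (metric names assume the instrumentation used earlier)
groups:
- name: red_method
  rules:
  - record: job:http_requests:rate5m
    expr: sum(rate(http_requests_total[5m])) by (job)
  - record: job:http_request_errors:ratio5m
    expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (job) / sum(rate(http_requests_total[5m])) by (job)
  - record: job:http_request_duration_seconds:p95_5m
    expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))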

USE Method

The USE method monitors system resources:

  • Utilization: Percentage of time busy
  • Saturation: Queue length or wait time
  • Errors: Count of errors

Kubernetes Monitoring

Kube-State-Metrics

kube-state-metrics generates metrics from Kubernetes API objects.

Sample Metrics:

TEXT
kube_deployment_status_replicas{deployment="my-app",namespace="production"} 3
kube_pod_status_phase{pod="my-app-7d5b6c9f8c-xyz",namespace="production",phase="Running"} 1
kube_node_status_condition{node="node-1",condition="Ready",status="true"} 1
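
These object-level metrics pair naturally with alerting. For example, a hedged rule on restart churn (the threshold and window are illustrative):

YAML
# Example alert on frequent restarts using kube-state-metrics (threshold and window are illustrative)
groups:
- name: kube_state_alerts
  rules:
  - alert: PodRestartingFrequently
    expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"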

Metrics Server

Metrics Server exposes the Kubernetes resource metrics API, which backs kubectl top and the Horizontal Pod Autoscaler.

BASH
# Get node metrics
kubectl top nodes

# Get pod metrics
kubectl top pods

# Get pod metrics by namespace
kubectl top pods -n production

Cluster Monitoring Architecture

YAML
apiVersion: apps/v1
kind: Deployment
metadata:
  name: monitoring-stack
spec:
  replicas: 1
  selector:
    matchLabels:
      app: monitoring
  template:
    metadata:
      labels:
        app: monitoring
    spec:
      containers:
      - name: prometheus
        image: prom/prometheus:latest
        ports:
        - containerPort: 9090
        volumeMounts:
        - name: config
          mountPath: /etc/prometheus/
      - name: grafana
        image: grafana/grafana:latest
        ports:
        - containerPort: 3000
      volumes:
      - name: config
        configMap:
          name: prometheus-config
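
The Deployment above mounts a ConfigMap named prometheus-config that is not shown; a minimal sketch of it might look like this (contents are illustrative).

YAML
# Minimal ConfigMap backing the config volume above (contents are illustrative)
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']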

Alerting and Notification

Alert Configuration

Prometheus Alert Rules:

YAML
groups:
- name: container_alerts
  rules:
  - alert: ContainerHighCpuUsage
    expr: rate(container_cpu_usage_seconds_total[5m]) > 0.8
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage detected"
      description: "Container {{ $labels.name }} has CPU usage above 80% for more than 2 minutes"

  - alert: ContainerMemoryPressure
    expr: container_memory_working_set_bytes / container_spec_memory_limit_bytes > 0.9
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Memory pressure on container"
      description: "Container {{ $labels.name }} is using more than 90% of its memory limit"

Alertmanager Configuration

YAML
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: default-receiver
  routes:
  - match:
      severity: critical
    receiver: critical-team

receivers:
- name: default-receiver
  slack_configs:
  - channel: '#alerts'
    send_resolved: true
- name: critical-team
  email_configs:
  - to: '[email protected]'
    send_resolved: true

Notification Channels

Common Notification Methods:

  • Email: Traditional notification method
  • Slack: Team communication platform
  • PagerDuty: Incident response platform
  • Webhook: Custom integration endpoints
  • SMS: Emergency notification
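
Most of these channels map to Alertmanager receiver types (email_configs, slack_configs, pagerduty_configs, webhook_configs). A hedged webhook receiver sketch, where the URL is a placeholder for a custom integration endpoint:

YAML
# Webhook receiver sketch for Alertmanager (the URL is a placeholder)
receivers:
- name: custom-webhook
  webhook_configs:
  - url: http://alert-handler.internal:8080/alerts
    send_resolved: true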

Monitoring Best Practices

Metrics Best Practices

Metric Naming:

  • Use consistent naming conventions
  • Include appropriate labels
  • Follow dimensional modeling
  • Make metrics queryable

Metric Collection:

  • Collect at appropriate intervals
  • Use histograms for latency data
  • Implement metric expiry
  • Monitor metric cardinality

Logging Best Practices

Log Management:

  • Use structured logging
  • Include correlation IDs
  • Implement log rotation
  • Secure sensitive data
  • Maintain log retention policies

Log Analysis:

  • Centralize log storage
  • Implement log searching
  • Create log-based alerts
  • Perform trend analysis

Tracing Best Practices

Trace Sampling:

  • Implement adaptive sampling
  • Use probabilistic sampling
  • Sample based on request characteristics
  • Monitor sampling effectiveness

Trace Analysis:

  • Identify performance bottlenecks
  • Map service dependencies
  • Track error propagation
  • Optimize critical paths

Troubleshooting and Debugging

Common Issues

Metric Collection Issues:

  • Missing metrics: Check scraping configuration
  • Incomplete data: Verify time ranges and retention
  • High cardinality: Monitor label combinations
  • Performance impact: Tune collection intervals

Log Issues:

  • Missing logs: Check log driver configuration
  • Log flooding: Implement log level controls
  • Parsing errors: Validate log formats
  • Retention problems: Configure proper retention

Diagnostic Commands

Container Diagnostics:

BASH
# View container logs
docker logs container-name

# Follow container logs
docker logs -f container-name

# View container stats
docker stats container-name

# Inspect container
docker inspect container-name

Kubernetes Diagnostics:

BASH
# View pod logs
kubectl logs pod-name

# Follow pod logs
kubectl logs -f pod-name

# View logs from previous instance
kubectl logs pod-name --previous

# View cluster events
kubectl get events --sort-by='.lastTimestamp'

# Describe pod for detailed information
kubectl describe pod pod-name

Security Considerations

Monitoring Data Security

Data Protection:

  • Encryption at rest: Secure stored metrics/logs
  • Encryption in transit: Secure data transmission
  • Access controls: Limit access to monitoring data
  • Data anonymization: Protect sensitive information

Monitoring System Security:

  • Secure endpoints: Protect monitoring interfaces
  • Authentication: Require authentication for access
  • Authorization: Implement role-based access control
  • Audit trails: Monitor access to monitoring systems

Emerging Technologies

eBPF Observability:

  • Kernel-level insights: Deep system visibility
  • Low overhead: Minimal performance impact
  • Rich telemetry: Comprehensive data collection

AI/ML in Observability:

  • Anomaly detection: Automated issue identification
  • Predictive analytics: Forecast system behavior
  • Root cause analysis: Automated problem determination

Observability Platforms:

  • Unified platforms: Integrated metrics, logs, traces
  • Cloud-native solutions: Kubernetes-native observability
  • Open standards: Interoperable observability tools

Conclusion

Container monitoring and observability are essential for maintaining healthy, performant containerized applications. By implementing comprehensive monitoring strategies that include metrics, logging, and tracing, organizations can gain valuable insights into their containerized infrastructure and applications, enabling faster problem resolution and better decision-making.

In the next article, we'll explore container deployment strategies, covering different approaches to deploying and managing containerized applications.
