Monitoring and Observability
Overview
Monitoring and observability are critical components of DevOps practices that provide insight into system health, performance, and behavior. While monitoring focuses on predefined metrics and alerts, observability encompasses the ability to understand system internals through external outputs like logs, metrics, and traces. Together, they enable teams to proactively identify issues, troubleshoot problems, and optimize system performance.
Understanding Monitoring vs. Observability
Monitoring
Monitoring is the systematic observation of systems, applications, and infrastructure to track performance, availability, and health indicators. It typically involves collecting predefined metrics and setting up alerts based on specific thresholds.
Key Characteristics:
- Predefined Metrics: Focus on known, expected measurements
- Reactive: Primarily responds to predefined conditions
- Alert-Driven: Generates alerts when thresholds are crossed
- Historical Analysis: Tracks trends over time for capacity planning
Types of Monitoring (example checks for each are sketched after this list):
- Infrastructure Monitoring: CPU, memory, disk, network usage
- Application Monitoring: Response times, error rates, throughput
- Business Monitoring: User engagement, conversion rates, revenue metrics
- Synthetic Monitoring: Simulated user interactions to test availability
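As a rough sketch of those checks, assuming hypothetical metric names and thresholds rather than values from any real system:
// monitoring-types-sketch.js (illustrative examples only)
const monitoringTypes = {
  infrastructure: {
    exampleMetric: 'node_memory_MemAvailable_bytes',               // host-level resource metric
    alertWhen: 'available memory falls below 10% of total'
  },
  application: {
    exampleMetric: 'rate(http_requests_total{status=~"5.."}[5m])', // failure rate seen by the app
    alertWhen: 'error rate exceeds 5% for 5 minutes'
  },
  business: {
    exampleMetric: 'orders_completed_total',                       // metric tied to a business outcome
    alertWhen: 'order volume drops well below the weekly baseline'
  },
  synthetic: {
    exampleMetric: 'probe_success',                                // result of a scripted availability probe
    alertWhen: 'the probe fails on several consecutive runs'
  }
};

module.exports = { monitoringTypes };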
Observability
Observability is the ability to understand the internal state of a system by examining its external outputs. It goes beyond traditional monitoring by providing deeper insights into system behavior and enabling teams to ask new questions about system performance.
Key Characteristics:
- Exploratory: Enables investigation of unknown issues
- Proactive: Helps identify potential problems before they occur
- Rich Context: Provides detailed information for troubleshooting
- Three Pillars: Metrics, logs, and traces working together
Three Pillars of Observability (a correlation sketch follows this list):
- Metrics: Quantitative measurements of system behavior over time
- Logs: Timestamped records of discrete events
- Traces: End-to-end request journeys through distributed systems
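The pillars are most useful when they can be correlated for a single request: the metric answers "how many and how fast?", the log answers "what exactly happened?", and the trace ties both to one request journey. A minimal, library-free sketch, assuming illustrative field names and an in-memory counter:
// pillars-sketch.js (illustrative only)
const { randomUUID } = require('crypto');

const requestCounts = {}; // metric: in-memory counter keyed by route

function handleRequest(route, work) {
  const traceId = randomUUID();                           // trace: correlation ID for this request
  const start = Date.now();
  requestCounts[route] = (requestCounts[route] || 0) + 1; // metric: count the request
  const result = work();
  const span = {                                          // trace: one span of the request journey
    traceId,
    name: `GET ${route}`,
    durationMs: Date.now() - start
  };
  console.log(JSON.stringify({                            // log: structured event sharing the traceId
    level: 'info',
    msg: 'request handled',
    route,
    traceId,
    durationMs: span.durationMs
  }));
  return { result, span };
}

// All three signals can be joined on traceId when troubleshooting.
handleRequest('/users', () => 'ok');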
Metrics Collection and Storage
Prometheus for Metrics
Prometheus is a popular open-source monitoring system that collects and stores metrics as time series data.
Prometheus Configuration:
# prometheus/prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
rule_files:
- "alert_rules.yml"
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
- job_name: 'node-exporter'
static_configs:
- targets: ['node-exporter:9100']
- job_name: 'application'
static_configs:
- targets: ['app-service:3000']
metrics_path: '/metrics'
scrape_interval: 5s
relabel_configs:
- source_labels: [__address__]
target_label: instance
regex: '(.+?):\d+'
replacement: '$1'
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
Alert Rules Configuration:
# prometheus/alert_rules.yml
groups:
- name: application_alerts
rules:
- alert: HighErrorRate
expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (job) / sum(rate(http_requests_total[5m])) by (job) > 0.05
for: 2m
labels:
severity: critical
annotations:
summary: "High error rate detected for {{ $labels.job }}"
description: "HTTP error rate is above 5% for more than 2 minutes. Current rate: {{ $value }}"
runbook_url: "https://docs.example.com/runbooks/high_error_rate"
- alert: HighCPUUsage
expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 5m
labels:
severity: warning
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is above 80% for more than 5 minutes. Current usage: {{ $value }}%"
- alert: LowDiskSpace
expr: (node_filesystem_size_bytes - node_filesystem_free_bytes) * 100 / node_filesystem_size_bytes > 85
for: 2m
labels:
severity: warning
annotations:
summary: "Low disk space on {{ $labels.instance }}"
description: "Disk usage is above 85%. Current usage: {{ $value }}%"
- alert: ServiceDown
expr: up == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Service {{ $labels.instance }} is down"
description: "Service has been down for more than 1 minute"
- alert: HighMemoryUsage
expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) * 100 / node_memory_MemTotal_bytes > 90
for: 5m
labels:
severity: critical
annotations:
summary: "High memory usage on {{ $labels.instance }}"
description: "Memory usage is above 90%. Current usage: {{ $value }}%"Application Metrics Instrumentation
Node.js Application Metrics:
// src/metrics.js
const client = require('prom-client');
const express = require('express');
const app = express();
// Create a Registry which registers the metrics
const register = new client.Registry();
// Add a default label which is applied to all metrics
register.setDefaultLabels({
app: 'my-application'
});
// Enable the collection of default metrics
client.collectDefaultMetrics({
register,
prefix: 'myapp_'
});
// Custom metrics
const httpRequestDurationMs = new client.Histogram({
name: 'http_request_duration_ms',
help: 'Duration of HTTP requests in ms',
labelNames: ['method', 'route', 'status_code'],
buckets: [0.1, 5, 15, 50, 100, 200, 300, 400, 500]
});
const httpRequestTotal = new client.Counter({
name: 'http_requests_total',
help: 'Total number of HTTP requests',
labelNames: ['method', 'route', 'status_code']
});
const activeUsers = new client.Gauge({
name: 'active_users',
help: 'Number of active users'
});
// Register custom metrics
register.registerMetric(httpRequestDurationMs);
register.registerMetric(httpRequestTotal);
register.registerMetric(activeUsers);
// Middleware to track metrics
app.use((req, res, next) => {
const start = Date.now();
res.on('finish', () => {
const duration = Date.now() - start;
httpRequestDurationMs
.labels(req.method, req.route?.path || req.path, res.statusCode)
.observe(duration);
httpRequestTotal
.labels(req.method, req.route?.path || req.path, res.statusCode)
.inc();
});
next();
});
// Endpoint to serve metrics
app.get('/metrics', async (req, res) => {
res.set('Content-Type', register.contentType);
res.end(await register.metrics());
});
// Health check endpoint
app.get('/health', (req, res) => {
res.status(200).json({ status: 'healthy', timestamp: new Date().toISOString() });
});
module.exports = { app, register, activeUsers };
Python Application Metrics:
# app.py
from flask import Flask, Response, request
from prometheus_client import Counter, Histogram, Gauge, generate_latest, REGISTRY, CONTENT_TYPE_LATEST
import time
app = Flask(__name__)
# Custom metrics
REQUEST_COUNT = Counter(
'http_requests_total',
'Total HTTP Requests',
['method', 'endpoint', 'status']
)
REQUEST_LATENCY = Histogram(
'http_request_duration_seconds',
'HTTP Request Latency',
['method', 'endpoint']
)
ACTIVE_CONNECTIONS = Gauge(
'active_connections',
'Active connections gauge'
)
@app.before_request
def before_request():
request.start_time = time.time()
@app.after_request
def after_request(response):
# Record metrics
REQUEST_COUNT.labels(
method=request.method,
endpoint=request.endpoint,
status=response.status_code
).inc()
if hasattr(request, 'start_time'):
latency = time.time() - request.start_time
REQUEST_LATENCY.labels(
method=request.method,
endpoint=request.endpoint
).observe(latency)
return response
@app.route('/metrics')
def metrics():
return Response(generate_latest(REGISTRY), mimetype=CONTENT_TYPE_LATEST)
@app.route('/health')
def health():
return {'status': 'healthy', 'timestamp': time.time()}
@app.route('/')
def hello():
return {'message': 'Hello World!'}
@app.route('/slow')
def slow_endpoint():
time.sleep(2) # Simulate slow operation
return {'message': 'Slow response'}
if __name__ == '__main__':
app.run(host='0.0.0.0', port=3000)
Metrics Best Practices
Metric Naming Conventions:
// Good metric naming practices
const metrics = {
// Use consistent prefixes
'app_http_requests_total': 'Counter for total HTTP requests',
'app_http_request_duration_seconds': 'Histogram for request duration',
'app_database_connections': 'Gauge for active DB connections',
// Use descriptive labels
'http_requests_total{method="GET", status="200", handler="/users"}': 'Specific request counter',
// Follow naming conventions
'namespace_component_metric_type': 'Recommended format',
// Examples of good names:
'user_service_login_attempts_total': 'Total login attempts',
'payment_processing_duration_seconds': 'Payment processing time',
'cache_hit_ratio': 'Cache hit ratio metric',
// Avoid these naming patterns:
'metric1': 'Too generic',
'myCoolMetric': 'Not descriptive',
'cpu_percentage': 'Should be cpu_usage_percent'
};
Histogram Bucket Selection:
// Example histogram bucket selection for different use cases
const histograms = {
// API response times (typically fast)
api_response_time: new client.Histogram({
name: 'api_response_time_seconds',
buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
}),
// Database query times (medium range)
db_query_duration: new client.Histogram({
name: 'db_query_duration_seconds',
buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30]
}),
// Batch job durations (potentially long)
batch_job_duration: new client.Histogram({
name: 'batch_job_duration_seconds',
buckets: [10, 30, 60, 120, 300, 600, 1800, 3600]
})
};
Log Management and Analysis
Structured Logging
JSON Logging Configuration:
// src/logger.js
const winston = require('winston');
const { createLogger, format, transports } = winston;
const logger = createLogger({
level: process.env.LOG_LEVEL || 'info',
format: format.combine(
format.timestamp(),
format.errors({ stack: true }),
format.splat(),
format.json()
),
defaultMeta: { service: 'my-service' },
transports: [
new transports.File({ filename: 'error.log', level: 'error' }),
new transports.File({ filename: 'combined.log' }),
new transports.Console({
format: format.combine(
format.colorize(),
format.simple()
)
})
]
});
// Custom logging middleware
const logMiddleware = (req, res, next) => {
const startTime = Date.now();
res.on('finish', () => {
const duration = Date.now() - startTime;
logger.info('HTTP Request', {
method: req.method,
url: req.url,
statusCode: res.statusCode,
durationMs: duration,
userAgent: req.get('User-Agent'),
ip: req.ip,
userId: req.user?.id // if authenticated
});
});
next();
};
module.exports = { logger, logMiddleware };
Python Structured Logging:
# logger_config.py
import logging
import json
from pythonjsonlogger import jsonlogger
from datetime import datetime
class CustomJsonFormatter(jsonlogger.JsonFormatter):
def add_fields(self, log_record, record, message_dict):
super(CustomJsonFormatter, self).add_fields(log_record, record, message_dict)
if not log_record.get('timestamp'):
log_record['timestamp'] = datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%S.%fZ')
if log_record.get('level'):
log_record['level'] = log_record['level'].lower()
else:
log_record['level'] = record.levelname.lower()
def setup_logger(name, log_file=None, level=logging.INFO):
"""Function to setup as many loggers as you want"""
formatter = CustomJsonFormatter('%(timestamp)s %(level)s %(name)s %(message)s')
handler = logging.StreamHandler()
handler.setFormatter(formatter)
logger = logging.getLogger(name)
logger.setLevel(level)
logger.addHandler(handler)
return logger
# Usage
logger = setup_logger('my_app')
def log_api_request(method, url, status_code, duration, user_id=None):
logger.info("API Request", extra={
'event_type': 'api_request',
'method': method,
'url': url,
'status_code': status_code,
'duration_ms': duration,
'user_id': user_id
})
def log_error(error_message, error_type, context=None):
logger.error("Application Error", extra={
'event_type': 'error',
'error_message': error_message,
'error_type': error_type,
'context': context
})
Log Aggregation with ELK Stack
Filebeat Configuration:
# filebeat/filebeat.yml
filebeat.inputs:
- type: log
enabled: true
paths:
- /var/log/myapp/*.log
- /var/log/nginx/*.log
fields:
service: myapp
environment: production
multiline.pattern: '^[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}'
multiline.negate: true
multiline.match: after
filebeat.config.modules:
path: ${path.config}/modules.d/*.yml
reload.enabled: false
setup.template.settings:
index.number_of_shards: 1
index.number_of_replicas: 1
index.codec: best_compression
setup.kibana:
host: "kibana:5601"
output.elasticsearch:
hosts: ["elasticsearch:9200"]
username: "elastic"
password: "${ELASTIC_PASSWORD}"
processors:
- add_host_metadata:
when.not.contains.tags: forwarded
- add_docker_metadata: ~
- add_kubernetes_metadata: ~
Logstash Configuration:
# logstash/pipeline/logstash.conf
input {
beats {
port => 5044
}
file {
path => "/var/log/myapp/*.log"
start_position => "beginning"
codec => json
}
}
filter {
if [type] == "application" {
grok {
match => {
"message" => "%{TIMESTAMP_ISO8601:timestamp} \[%{LOGLEVEL:loglevel}\] %{GREEDYDATA:logger} - %{GREEDYDATA:message}"
}
}
date {
match => [ "timestamp", "ISO8601" ]
}
}
# Parse structured logs
if [message] {
json {
source => "message"
target => "parsed_message"
}
}
# Add geographic information for IP addresses
if [client_ip] {
geoip {
source => "client_ip"
target => "geo_location"
}
}
# Filter out debug logs in production
if [loglevel] == "DEBUG" and [environment] == "production" {
drop {}
}
}
output {
elasticsearch {
hosts => ["elasticsearch:9200"]
index => "myapp-%{+YYYY.MM.dd}"
}
# Also send critical alerts to Slack
if [loglevel] == "ERROR" or [loglevel] == "FATAL" {
http {
url => "${SLACK_WEBHOOK_URL}"
http_method => "post"
format => "json"
mapping => {
"text" => "Critical error in %{service}: %{message}"
}
}
}
stdout {
codec => rubydebug
}
}
Distributed Tracing
OpenTelemetry Implementation
Node.js Tracing Setup:
// src/tracing.js
const opentelemetry = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
const traceExporter = new OTLPTraceExporter({
url: process.env.OTLP_ENDPOINT || 'http://jaeger:4318/v1/traces',
});
const sdk = new opentelemetry.NodeSDK({
traceExporter,
instrumentations: [getNodeAutoInstrumentations()],
resource: new Resource({
[SemanticResourceAttributes.SERVICE_NAME]: process.env.SERVICE_NAME || 'my-service',
[SemanticResourceAttributes.SERVICE_VERSION]: process.env.SERVICE_VERSION || '1.0.0',
}),
});
sdk.start();
process.on('SIGTERM', () => {
sdk.shutdown()
.then(() => console.log('Tracing terminated'))
.catch((error) => console.log('Error terminating tracing', error))
.finally(() => process.exit(0));
});
module.exports = { sdk };
Express.js Tracing Middleware:
// src/middleware/tracing.js
const { trace, context, propagation } = require('@opentelemetry/api');
const { SpanStatusCode } = require('@opentelemetry/api');
const tracingMiddleware = (req, res, next) => {
// Extract context from incoming request
const extractedContext = propagation.extract(context.active(), req.headers);
// Create a new span for the request
const tracer = trace.getTracer(process.env.SERVICE_NAME || 'my-service');
const span = tracer.startSpan(`${req.method} ${req.path}`, {
attributes: {
'http.method': req.method,
'http.url': req.url,
'http.user_agent': req.get('User-Agent'),
'http.client_ip': req.ip,
'net.host.name': req.hostname,
'net.host.port': req.socket.localPort
}
}, extractedContext);
// Store span in request context
context.with(trace.setSpan(context.active(), span), () => {
// Capture response details when response finishes
res.on('finish', () => {
span.setAttributes({
'http.status_code': res.statusCode,
'http.response_content_length': res.getHeader('Content-Length')
});
// Set span status based on response status
if (res.statusCode >= 500) {
span.setStatus({
code: SpanStatusCode.ERROR,
message: `HTTP ${res.statusCode}`
});
} else if (res.statusCode >= 400) {
span.setStatus({
code: SpanStatusCode.UNSET
});
} else {
span.setStatus({
code: SpanStatusCode.OK
});
}
span.end();
});
next();
});
};
module.exports = { tracingMiddleware };
Database Query Tracing:
// src/database/tracing.js
const { trace } = require('@opentelemetry/api');
const { SpanStatusCode } = require('@opentelemetry/api');
class TracedDatabaseClient {
constructor(client) {
this.client = client;
}
async query(sql, params) {
const tracer = trace.getTracer('database');
const span = tracer.startSpan('db.query', {
attributes: {
'db.statement': sql,
'db.operation': this.getOperationType(sql),
'db.params.count': Array.isArray(params) ? params.length : 0
}
});
try {
const result = await this.client.query(sql, params);
span.setAttributes({
'db.rows.affected': result.rowCount || 0
});
return result;
} catch (error) {
span.setStatus({
code: SpanStatusCode.ERROR,
message: error.message
});
span.recordException(error);
throw error;
} finally {
span.end();
}
}
getOperationType(sql) {
const trimmedSql = sql.trim().toUpperCase();
if (trimmedSql.startsWith('SELECT')) return 'read';
if (trimmedSql.startsWith('INSERT')) return 'write';
if (trimmedSql.startsWith('UPDATE')) return 'write';
if (trimmedSql.startsWith('DELETE')) return 'write';
return 'other';
}
}
module.exports = { TracedDatabaseClient };
Jaeger Configuration
# docker-compose.jaeger.yml
version: '3.8'
services:
jaeger:
image: jaegertracing/all-in-one:latest
environment:
- COLLECTOR_OTLP_ENABLED=true
ports:
- "16686:16686" # UI
- "4317:4317" # gRPC
- "4318:4318" # HTTP
- "14268:14268" # Legacy HTTP
volumes:
- jaeger_data:/badger
volumes:
jaeger_data:
Alerting and Notification Systems
AlertManager Configuration
# alertmanager/config.yml
global:
smtp_smarthost: 'smtp.gmail.com:587'
smtp_from: '[email protected]'
smtp_auth_username: '[email protected]'
smtp_auth_password: 'your-smtp-password'
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 30s
group_interval: 5m
repeat_interval: 1h
receiver: 'default-receiver'
routes:
- match:
severity: critical
receiver: 'critical-receiver'
group_wait: 10s
repeat_interval: 30m
- match:
severity: warning
receiver: 'warning-receiver'
- match_re:
service: ^database.*
receiver: 'database-team'
group_interval: 1m
receivers:
- name: 'default-receiver'
email_configs:
- to: '[email protected]'
send_resolved: true
slack_configs:
- channel: '#alerts'
send_resolved: true
text: '{{ .CommonAnnotations.summary }}'
- name: 'critical-receiver'
email_configs:
- to: '[email protected]'
send_resolved: true
webhook_configs:
- url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
send_resolved: true
- name: 'warning-receiver'
email_configs:
- to: '[email protected]'
send_resolved: true
- name: 'database-team'
email_configs:
- to: '[email protected]'
send_resolved: true
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'cluster', 'service']
Custom Alert Handlers
// src/alert-handlers.js
const axios = require('axios');
class AlertHandler {
constructor(config) {
this.config = config;
}
async handleAlert(alert) {
const { labels, annotations, status, startsAt, endsAt } = alert;
// Determine severity and route to appropriate handler
switch (labels.severity) {
case 'critical':
await this.handleCriticalAlert(alert);
break;
case 'warning':
await this.handleWarningAlert(alert);
break;
default:
await this.handleInfoAlert(alert);
}
}
async handleCriticalAlert(alert) {
const { labels, annotations } = alert;
// Send immediate notification to on-call team
await this.sendSlackNotification({
channel: '#critical-alerts',
text: `🚨 CRITICAL ALERT: ${annotations.summary}`,
attachments: [{
color: 'danger',
fields: [
{ title: 'Service', value: labels.service, short: true },
{ title: 'Severity', value: labels.severity, short: true },
{ title: 'Description', value: annotations.description }
]
}]
});
// Trigger incident response workflow
await this.triggerIncidentResponse(labels.service, annotations.summary);
}
async handleWarningAlert(alert) {
const { labels, annotations } = alert;
// Send notification to appropriate team
const teamChannel = this.getTeamChannel(labels.service);
await this.sendSlackNotification({
channel: teamChannel,
text: `⚠️ WARNING: ${annotations.summary}`,
attachments: [{
color: 'warning',
fields: [
{ title: 'Service', value: labels.service, short: true },
{ title: 'Severity', value: labels.severity, short: true },
{ title: 'Description', value: annotations.description }
]
}]
});
}
async sendSlackNotification(message) {
try {
await axios.post(this.config.slackWebhookUrl, message);
} catch (error) {
console.error('Failed to send Slack notification:', error.message);
}
}
async triggerIncidentResponse(service, summary) {
// Integrate with incident management system
// This could trigger PagerDuty, OpsGenie, etc.
console.log(`Triggering incident response for ${service}: ${summary}`);
}
getTeamChannel(service) {
const serviceTeams = {
'user-service': '#user-team',
'payment-service': '#payments-team',
'notification-service': '#notifications-team'
};
return serviceTeams[service] || '#general';
}
}
module.exports = { AlertHandler };
Visualization and Dashboards
Grafana Dashboard Configuration
{
"__inputs": [
{
"name": "DS_PROMETHEUS",
"label": "Prometheus",
"description": "",
"type": "datasource",
"pluginId": "prometheus",
"pluginName": "Prometheus"
}
],
"__requires": [
{
"type": "grafana",
"id": "grafana",
"name": "Grafana",
"version": "9.0.0"
},
{
"type": "panel",
"id": "timeseries",
"name": "Time series",
"version": ""
},
{
"type": "datasource",
"id": "prometheus",
"name": "Prometheus",
"version": "1.0.0"
}
],
"annotations": {
"list": [
{
"builtIn": 1,
"datasource": {
"type": "grafana",
"uid": "-- Grafana --"
},
"enable": true,
"hide": true,
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Annotations & Alerts",
"target": {
"limit": 100,
"matchAny": false,
"tags": [],
"type": "dashboard"
},
"type": "dashboard"
}
]
},
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 0,
"id": null,
"links": [],
"liveNow": false,
"panels": [
{
"collapsed": false,
"gridPos": {
"h": 1,
"w": 24,
"x": 0,
"y": 0
},
"id": 18,
"panels": [],
"title": "System Overview",
"type": "row"
},
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "auto",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
}
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 1
},
"id": 2,
"options": {
"legend": {
"calcs": [],
"displayMode": "list",
"placement": "bottom"
},
"tooltip": {
"mode": "single",
"sort": "none"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"legendFormat": "{{instance}} CPU Usage",
"range": true,
"refId": "A"
}
],
"title": "CPU Usage",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "auto",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
}
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 1
},
"id": 4,
"options": {
"legend": {
"calcs": [],
"displayMode": "list",
"placement": "bottom"
},
"tooltip": {
"mode": "single",
"sort": "none"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"expr": "(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) * 100 / node_memory_MemTotal_bytes",
"legendFormat": "{{instance}} Memory Usage",
"range": true,
"refId": "A"
}
],
"title": "Memory Usage",
"type": "timeseries"
},
{
"collapsed": false,
"gridPos": {
"h": 1,
"w": 24,
"x": 0,
"y": 9
},
"id": 16,
"panels": [],
"title": "Application Metrics",
"type": "row"
},
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "auto",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "reqps"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 10
},
"id": 6,
"options": {
"legend": {
"calcs": [],
"displayMode": "list",
"placement": "bottom"
},
"tooltip": {
"mode": "single",
"sort": "none"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"expr": "sum(rate(http_requests_total[5m])) by (method, handler)",
"legendFormat": "{{method}} {{handler}}",
"range": true,
"refId": "A"
}
],
"title": "Request Rate",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "auto",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 5
}
]
},
"unit": "s"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 10
},
"id": 8,
"options": {
"legend": {
"calcs": [],
"displayMode": "list",
"placement": "bottom"
},
"tooltip": {
"mode": "single",
"sort": "none"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, handler))",
"legendFormat": "{{handler}} P95",
"range": true,
"refId": "A"
}
],
"title": "Request Duration (P95)",
"type": "timeseries"
}
],
"refresh": "5s",
"schemaVersion": 36,
"style": "dark",
"tags": [],
"templating": {
"list": [
{
"current": {
"selected": false,
"text": "Prometheus",
"value": "Prometheus"
},
"hide": 0,
"includeAll": false,
"label": "Data Source",
"multi": false,
"name": "DS_PROMETHEUS",
"options": [],
"query": "prometheus",
"queryValue": "",
"refresh": 1,
"regex": "",
"skipUrlSync": false,
"type": "datasource"
}
]
},
"time": {
"from": "now-1h",
"to": "now"
},
"timepicker": {},
"timezone": "",
"title": "System and Application Dashboard",
"uid": "system-dashboard",
"version": 1,
"weekStart": ""
}
Monitoring Best Practices
SLOs and SLIs
Service level indicators (SLIs) are the measurements that matter to users, such as availability or request latency; service level objectives (SLOs) are the targets you commit to for those indicators over a defined window.
// src/slo-calculator.js
class SLOCalculator {
constructor(config) {
this.config = config;
}
calculateAvailability(errors, totalRequests) {
return ((totalRequests - errors) / totalRequests) * 100;
}
calculateLatencyPercentile(latencies, percentile) {
const sorted = latencies.sort((a, b) => a - b);
const index = Math.floor((percentile / 100) * sorted.length);
return sorted[index];
}
// Example SLO definitions
getSLOs() {
return {
availability: {
objective: 99.9, // 99.9% availability
window: '28d', // 28-day sliding window
description: 'Overall system availability'
},
latency: {
objective: 95, // 95% of requests served within threshold
threshold: 200, // 200ms response time
window: '7d', // 7-day sliding window
description: 'Request latency P95'
},
freshness: {
objective: 99, // 99% of data fresh within threshold
threshold: 60, // 60 seconds data freshness
window: '1d', // 1-day window
description: 'Data freshness'
}
};
}
checkSLOBreach(name, slo, currentValue) {
const breaches = [];
if (name === 'availability' && currentValue < slo.objective) {
breaches.push({
slo: name,
current: currentValue,
objective: slo.objective,
status: 'breach',
impact: 'high'
});
}
return breaches;
}
}
module.exports = { SLOCalculator };
Golden Signals
The four golden signals (latency, traffic, errors, and saturation) are a widely used minimal set of service health measurements, popularized by Google's SRE practice; the sketch below expresses each as a Prometheus query and an alert condition.
// src/golden-signals.js
const goldenSignals = {
// Latency: Time taken to service a request
latency: {
metric: 'http_request_duration_seconds',
query: 'histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, handler))',
alert: 'latency > 1.0' // Alert if p95 latency > 1 second
},
// Traffic: Requests per second
traffic: {
metric: 'http_requests_total',
query: 'sum(rate(http_requests_total[5m])) by (handler)',
alert: 'traffic < 10' // Alert if traffic drops below 10 RPS
},
// Errors: Fraction of requests that fail
errors: {
metric: 'http_requests_total',
query: 'sum(rate(http_requests_total{status=~"5.."}[5m])) by (handler) / sum(rate(http_requests_total[5m])) by (handler)',
alert: 'error_rate > 0.05' // Alert if more than 5% of requests fail
},
// Saturation: How full the system is
saturation: {
metric: 'process_cpu_seconds_total',
query: 'rate(process_cpu_seconds_total[5m]) * 100',
alert: 'saturation > 80' // Alert if CPU > 80%
}
};
module.exports = { goldenSignals };
Synthetic Monitoring
// src/synthetic-monitor.js
const puppeteer = require('puppeteer');
const prometheus = require('prom-client');
// Synthetic monitoring metrics
const syntheticUp = new prometheus.Gauge({
name: 'synthetic_monitor_up',
help: 'Synthetic monitor status',
labelNames: ['monitor_name', 'url']
});
const syntheticResponseTime = new prometheus.Histogram({
name: 'synthetic_monitor_response_time_seconds',
help: 'Synthetic monitor response time',
labelNames: ['monitor_name', 'url'],
buckets: [0.1, 0.5, 1, 2, 5, 10]
});
async function runSyntheticCheck(url, monitorName) {
const startTime = Date.now();
let browser;
try {
browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
// Add performance metrics
await page.evaluateOnNewDocument(() => {
window.performance.mark('navigation-start');
});
const response = await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 });
if (!response.ok()) {
throw new Error(`HTTP ${response.status()}`);
}
// Measure response time
const responseTime = (Date.now() - startTime) / 1000;
syntheticUp.labels(monitorName, url).set(1);
syntheticResponseTime.labels(monitorName, url).observe(responseTime);
return { success: true, responseTime, status: response.status() };
} catch (error) {
syntheticUp.labels(monitorName, url).set(0);
return { success: false, error: error.message };
} finally {
// Always release the browser, even when the check fails
if (browser) {
await browser.close();
}
}
}
// Schedule synthetic checks
setInterval(async () => {
const checks = [
{ name: 'homepage', url: 'https://example.com' },
{ name: 'login', url: 'https://example.com/login' },
{ name: 'api-status', url: 'https://api.example.com/status' }
];
for (const check of checks) {
await runSyntheticCheck(check.url, check.name);
}
}, 60000); // Run every minute
module.exports = { runSyntheticCheck };
Monitoring in CI/CD Pipelines
Monitoring Validation in Pipelines
# .github/workflows/monitoring-validation.yml
name: Monitoring Validation
on:
pull_request:
branches: [ main ]
paths:
- 'monitoring/**'
- 'src/**/*'
- 'Dockerfile'
jobs:
validate-monitoring:
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v3
- name: Setup Node.js
uses: actions/setup-node@v3
with:
node-version: '18'
- name: Install dependencies
run: npm ci
- name: Validate Prometheus configuration
run: |
# Install promtool
curl -sSL https://github.com/prometheus/prometheus/releases/download/v2.37.0/prometheus-2.37.0.linux-amd64.tar.gz | tar xz
./prometheus-2.37.0.linux-amd64/promtool check config monitoring/prometheus/prometheus.yml
./prometheus-2.37.0.linux-amd64/promtool check rules monitoring/prometheus/alert_rules.yml
- name: Run monitoring unit tests
run: npm run test:monitoring
- name: Validate Grafana dashboards
run: |
# Check dashboard JSON validity
for dashboard in monitoring/grafana/dashboards/*.json; do
jq empty "$dashboard" || exit 1
done
Chaos Engineering for Monitoring
// chaos-engineering.js
const { exec } = require('child_process');
const prometheus = require('./src/metrics'); // Our metrics client
class ChaosEngineer {
constructor() {
// Only the scenarios implemented below are registered
this.scenarios = [
this.cpuHog.bind(this),
this.networkDelay.bind(this)
];
}
async cpuHog(duration = 30) {
console.log(`Starting CPU hog for ${duration} seconds`);
// Monitor CPU usage during chaos
const cpuBefore = await this.getCPUUsage();
// Create CPU intensive process
const child = exec(`stress-ng --cpu 4 --timeout ${duration}s`);
setTimeout(() => {
child.kill();
console.log('CPU hog completed');
}, duration * 1000);
}
async networkDelay(delayMs = 1000, duration = 60) {
console.log(`Adding ${delayMs}ms network delay on eth0 for ${duration}s`);
// This would typically be run on the target system
exec(`sudo tc qdisc add dev eth0 root netem delay ${delayMs}ms`, (error) => {
if (error) {
console.error('Network delay setup failed:', error);
return;
}
setTimeout(() => {
exec('sudo tc qdisc del dev eth0 root', () => {
console.log('Network delay removed');
});
}, duration * 1000);
});
}
async runExperiments() {
for (const scenario of this.scenarios) {
console.log(`Running experiment: ${scenario.name}`);
// Record metrics before experiment
const metricsBefore = await this.getCurrentMetrics();
await scenario();
// Wait for metrics to propagate
await new Promise(resolve => setTimeout(resolve, 10000));
// Record metrics after experiment
const metricsAfter = await this.getCurrentMetrics();
console.log('Metrics comparison:', {
before: metricsBefore,
after: metricsAfter
});
// Wait between experiments
await new Promise(resolve => setTimeout(resolve, 30000));
}
}
async getCurrentMetrics() {
// Collect current system metrics
return {
cpu: await this.getCPUUsage(),
memory: await this.getMemoryUsage(),
disk: await this.getDiskUsage()
};
}
async getCPUUsage() {
// Implementation to get current CPU usage
return new Promise((resolve) => {
exec('top -bn1 | grep "Cpu(s)" | awk \'{print $2}\' | sed "s/%us,//"', (error, stdout) => {
resolve(parseFloat(stdout.trim()));
});
});
}
async getMemoryUsage() {
// Implementation to get current memory usage
return new Promise((resolve) => {
exec('free | grep Mem | awk \'{printf("%.2f", $3/$2 * 100.0)}\'', (error, stdout) => {
resolve(parseFloat(stdout.trim()));
});
});
}
async getDiskUsage() {
// Implementation to get current disk usage
return new Promise((resolve) => {
exec('df / | tail -1 | awk \'{print $5}\' | sed "s/%//"', (error, stdout) => {
resolve(parseFloat(stdout.trim()));
});
});
}
}
module.exports = { ChaosEngineer };
Troubleshooting and Root Cause Analysis
Diagnostic Tools and Techniques
#!/bin/bash
# diagnostic-tools.sh
# Comprehensive system diagnostics script
echo "=== System Diagnostics Report ==="
echo "Generated on: $(date)"
echo
echo "--- System Information ---"
echo "Hostname: $(hostname)"
echo "OS: $(uname -s)"
echo "Kernel: $(uname -r)"
echo "Architecture: $(uname -m)"
echo
echo "--- CPU Information ---"
echo "CPU Count: $(nproc)"
echo "Load Average: $(uptime | awk -F'load average:' '{print $2}')"
echo
echo "--- Memory Information ---"
free -h
echo
echo "--- Disk Usage ---"
df -h
echo
echo "--- Network Connections ---"
netstat -tuln | head -20
echo
echo "--- Process Information ---"
echo "Top 10 processes by CPU:"
ps aux --sort=-%cpu | head -11
echo
echo "Top 10 processes by Memory:"
ps aux --sort=-%mem | head -11
echo
echo "--- Application Logs (last 50 lines) ---"
if [ -f "/var/log/myapp/combined.log" ]; then
tail -50 /var/log/myapp/combined.log
else
echo "Application logs not found"
fi
echo
echo "--- Recent System Messages ---"
dmesg | tail -20
echo
echo "--- Prometheus Metrics Sample ---"
if curl -sf http://localhost:9090/metrics >/dev/null 2>&1; then
curl -s http://localhost:9090/metrics | grep -E "up|requests_total|errors_total" | head -20
else
echo "Prometheus not accessible"
fi
echo
echo "=== End of Report ==="Log Analysis Queries
-- SQL-like queries for log analysis (using tools like Loki or Elasticsearch)
-- Top error patterns in last hour
SELECT
log_level,
message,
COUNT(*) as error_count
FROM logs
WHERE timestamp > NOW() - INTERVAL 1 HOUR
AND log_level = 'ERROR'
GROUP BY log_level, message
ORDER BY error_count DESC
LIMIT 10;
-- Slowest API endpoints
SELECT
endpoint,
AVG(response_time) as avg_response_time,
MAX(response_time) as max_response_time,
COUNT(*) as request_count
FROM api_requests
WHERE timestamp > NOW() - INTERVAL 1 HOUR
GROUP BY endpoint
HAVING avg_response_time > 1000 -- More than 1 second
ORDER BY avg_response_time DESC;
-- User activity patterns
SELECT
DATE_TRUNC('hour', timestamp) as hour,
COUNT(DISTINCT user_id) as active_users,
COUNT(*) as total_requests
FROM user_requests
WHERE timestamp > NOW() - INTERVAL 24 HOURS
GROUP BY hour
ORDER BY hour;
-- Error correlation analysis
SELECT
error_type,
COUNT(*) as error_count,
COUNT(DISTINCT session_id) as affected_sessions,
STRING_AGG(DISTINCT service_name, ', ') as affected_services
FROM error_events
WHERE timestamp > NOW() - INTERVAL 1 HOUR
GROUP BY error_type
HAVING error_count > 10
ORDER BY error_count DESC;
Conclusion
Monitoring and observability are fundamental to successful DevOps practices, providing the visibility needed to maintain system health, performance, and reliability. Effective monitoring combines traditional metrics collection with modern observability practices, including distributed tracing and structured logging.
The key to successful monitoring is to start with the basics—system metrics, application performance, and error tracking—and gradually add more sophisticated observability practices. Organizations should focus on implementing the "four golden signals" (latency, traffic, errors, and saturation) as a foundation, then expand to include business metrics and user experience monitoring.
In the next article, we'll explore security practices in DevOps, covering DevSecOps principles, security automation, and how to integrate security throughout the software delivery lifecycle.