Monitoring and Observability
Overview
Monitoring and observability are critical components of DevOps practices that provide insight into system health, performance, and behavior. While monitoring focuses on predefined metrics and alerts, observability encompasses the ability to understand system internals through external outputs like logs, metrics, and traces. Together, they enable teams to proactively identify issues, troubleshoot problems, and optimize system performance.
Understanding Monitoring vs. Observability
Monitoring
Monitoring is the systematic observation of systems, applications, and infrastructure to track performance, availability, and health indicators. It typically involves collecting predefined metrics and setting up alerts based on specific thresholds.
Key Characteristics:
- Predefined Metrics: Focus on known, expected measurements
- Reactive: Primarily responds to predefined conditions
- Alert-Driven: Generates alerts when thresholds are crossed
- Historical Analysis: Tracks trends over time for capacity planning
Types of Monitoring (example checks for each are sketched after this list):
- Infrastructure Monitoring: CPU, memory, disk, network usage
- Application Monitoring: Response times, error rates, throughput
- Business Monitoring: User engagement, conversion rates, revenue metrics
- Synthetic Monitoring: Simulated user interactions to test availability
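As a rough sketch of those checks, assuming hypothetical metric names and thresholds rather than values from any real system:
// monitoring-types-sketch.js (illustrative examples only)
const monitoringTypes = {
  infrastructure: {
    exampleMetric: 'node_memory_MemAvailable_bytes',               // host-level resource metric
    alertWhen: 'available memory falls below 10% of total'
  },
  application: {
    exampleMetric: 'rate(http_requests_total{status=~"5.."}[5m])', // failure rate seen by the app
    alertWhen: 'error rate exceeds 5% for 5 minutes'
  },
  business: {
    exampleMetric: 'orders_completed_total',                       // metric tied to a business outcome
    alertWhen: 'order volume drops well below the weekly baseline'
  },
  synthetic: {
    exampleMetric: 'probe_success',                                // result of a scripted availability probe
    alertWhen: 'the probe fails on several consecutive runs'
  }
};

module.exports = { monitoringTypes };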
Observability
Observability is the ability to understand the internal state of a system by examining its external outputs. It goes beyond traditional monitoring by providing deeper insights into system behavior and enabling teams to ask new questions about system performance.
Key Characteristics:
- Exploratory: Enables investigation of unknown issues
- Proactive: Helps identify potential problems before they occur
- Rich Context: Provides detailed information for troubleshooting
- Three Pillars: Metrics, logs, and traces working together
Three Pillars of Observability (a correlation sketch follows this list):
- Metrics: Quantitative measurements of system behavior over time
- Logs: Timestamped records of discrete events
- Traces: End-to-end request journeys through distributed systems
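The pillars are most useful when they can be correlated for a single request: the metric answers "how many and how fast?", the log answers "what exactly happened?", and the trace ties both to one request journey. A minimal, library-free sketch, assuming illustrative field names and an in-memory counter:
// pillars-sketch.js (illustrative only)
const { randomUUID } = require('crypto');

const requestCounts = {}; // metric: in-memory counter keyed by route

function handleRequest(route, work) {
  const traceId = randomUUID();                           // trace: correlation ID for this request
  const start = Date.now();
  requestCounts[route] = (requestCounts[route] || 0) + 1; // metric: count the request
  const result = work();
  const span = {                                          // trace: one span of the request journey
    traceId,
    name: `GET ${route}`,
    durationMs: Date.now() - start
  };
  console.log(JSON.stringify({                            // log: structured event sharing the traceId
    level: 'info',
    msg: 'request handled',
    route,
    traceId,
    durationMs: span.durationMs
  }));
  return { result, span };
}

// All three signals can be joined on traceId when troubleshooting.
handleRequest('/users', () => 'ok');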
Metrics Collection and Storage
Prometheus for Metrics
Prometheus is a popular open-source monitoring system that collects and stores metrics as time series data.
Prometheus Configuration:
# prometheus/prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
rule_files:
- "alert_rules.yml"
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
- job_name: 'node-exporter'
static_configs:
- targets: ['node-exporter:9100']
- job_name: 'application'
static_configs:
- targets: ['app-service:3000']
metrics_path: '/metrics'
scrape_interval: 5s
relabel_configs:
- source_labels: [__address__]
target_label: instance
regex: '(.+?):\d+'
replacement: '$1'
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
Alert Rules Configuration:
# prometheus/alert_rules.yml
groups:
- name: application_alerts
rules:
- alert: HighErrorRate
expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (job) / sum(rate(http_requests_total[5m])) by (job) > 0.05
for: 2m
labels:
severity: critical
annotations:
summary: "High error rate detected for {{ $labels.job }}"
description: "HTTP error rate is above 5% for more than 2 minutes. Current rate: {{ $value }}"
runbook_url: "https://docs.example.com/runbooks/high_error_rate"
- alert: HighCPUUsage
expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 5m
labels:
severity: warning
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is above 80% for more than 5 minutes. Current usage: {{ $value }}%"
- alert: LowDiskSpace
expr: (node_filesystem_size_bytes - node_filesystem_free_bytes) * 100 / node_filesystem_size_bytes > 85
for: 2m
labels:
severity: warning
annotations:
summary: "Low disk space on {{ $labels.instance }}"
description: "Disk usage is above 85%. Current usage: {{ $value }}%"
- alert: ServiceDown
expr: up == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Service {{ $labels.instance }} is down"
description: "Service has been down for more than 1 minute"
- alert: HighMemoryUsage
expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) * 100 / node_memory_MemTotal_bytes > 90
for: 5m
labels:
severity: critical
annotations:
summary: "High memory usage on {{ $labels.instance }}"
description: "Memory usage is above 90%. Current usage: {{ $value }}%"Application Metrics Instrumentation
Node.js Application Metrics:
// src/metrics.js
const client = require('prom-client');
const express = require('express');
const app = express();
// Create a Registry which registers the metrics
const register = new client.Registry();
// Add a default label which is applied to all metrics
register.setDefaultLabels({
app: 'my-application'
});
// Enable the collection of default metrics
client.collectDefaultMetrics({
register,
prefix: 'myapp_'
});
// Custom metrics
const httpRequestDurationMs = new client.Histogram({
name: 'http_request_duration_ms',
help: 'Duration of HTTP requests in ms',
labelNames: ['method', 'route', 'status_code'],
buckets: [0.1, 5, 15, 50, 100, 200, 300, 400, 500]
});
const httpRequestTotal = new client.Counter({
name: 'http_requests_total',
help: 'Total number of HTTP requests',
labelNames: ['method', 'route', 'status_code']
});
const activeUsers = new client.Gauge({
name: 'active_users',
help: 'Number of active users'
});
// Register custom metrics
register.registerMetric(httpRequestDurationMs);
register.registerMetric(httpRequestTotal);
register.registerMetric(activeUsers);
// Middleware to track metrics
app.use((req, res, next) => {
const start = Date.now();
res.on('finish', () => {
const duration = Date.now() - start;
httpRequestDurationMs
.labels(req.method, req.route?.path || req.path, res.statusCode)
.observe(duration);
httpRequestTotal
.labels(req.method, req.route?.path || req.path, res.statusCode)
.inc();
});
next();
});
// Endpoint to serve metrics
app.get('/metrics', async (req, res) => {
res.set('Content-Type', register.contentType);
res.end(await register.metrics());
});
// Health check endpoint
app.get('/health', (req, res) => {
res.status(200).json({ status: 'healthy', timestamp: new Date().toISOString() });
});
module.exports = { app, register, activeUsers };
Python Application Metrics:
# app.py
from flask import Flask, Response, request
from prometheus_client import Counter, Histogram, Gauge, generate_latest, REGISTRY, CONTENT_TYPE_LATEST
import time
app = Flask(__name__)
# Custom metrics
REQUEST_COUNT = Counter(
'http_requests_total',
'Total HTTP Requests',
['method', 'endpoint', 'status']
)
REQUEST_LATENCY = Histogram(
'http_request_duration_seconds',
'HTTP Request Latency',
['method', 'endpoint']
)
ACTIVE_CONNECTIONS = Gauge(
'active_connections',
'Active connections gauge'
)
@app.before_request
def before_request():
request.start_time = time.time()
@app.after_request
def after_request(response):
# Record metrics
REQUEST_COUNT.labels(
method=request.method,
endpoint=request.endpoint,
status=response.status_code
).inc()
if hasattr(request, 'start_time'):
latency = time.time() - request.start_time
REQUEST_LATENCY.labels(
method=request.method,
endpoint=request.endpoint
).observe(latency)
return response
@app.route('/metrics')
def metrics():
return Response(generate_latest(REGISTRY), mimetype=CONTENT_TYPE_LATEST)
@app.route('/health')
def health():
return {'status': 'healthy', 'timestamp': time.time()}
@app.route('/')
def hello():
return {'message': 'Hello World!'}
@app.route('/slow')
def slow_endpoint():
time.sleep(2) # Simulate slow operation
return {'message': 'Slow response'}
if __name__ == '__main__':
app.run(host='0.0.0.0', port=3000)
Metrics Best Practices
Metric Naming Conventions:
// Good metric naming practices
const metrics = {
// Use consistent prefixes
'app_http_requests_total': 'Counter for total HTTP requests',
'app_http_request_duration_seconds': 'Histogram for request duration',
'app_database_connections': 'Gauge for active DB connections',
// Use descriptive labels
'http_requests_total{method="GET", status="200", handler="/users"}': 'Specific request counter',
// Follow naming conventions
'namespace_component_metric_type': 'Recommended format',
// Examples of good names:
'user_service_login_attempts_total': 'Total login attempts',
'payment_processing_duration_seconds': 'Payment processing time',
'cache_hit_ratio': 'Cache hit ratio metric',
// Avoid these naming patterns:
'metric1': 'Too generic',
'myCoolMetric': 'Not descriptive',
'cpu_percentage': 'Should be cpu_usage_percent'
};
Histogram Bucket Selection:
// Example histogram bucket selection for different use cases
const histograms = {
// API response times (typically fast)
api_response_time: new client.Histogram({
name: 'api_response_time_seconds',
buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
}),
// Database query times (medium range)
db_query_duration: new client.Histogram({
name: 'db_query_duration_seconds',
buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30]
}),
// Batch job durations (potentially long)
batch_job_duration: new client.Histogram({
name: 'batch_job_duration_seconds',
buckets: [10, 30, 60, 120, 300, 600, 1800, 3600]
})
};
Log Management and Analysis
Structured Logging
JSON Logging Configuration:
// src/logger.js
const winston = require('winston');
const { createLogger, format, transports } = winston;
const logger = createLogger({
level: process.env.LOG_LEVEL || 'info',
format: format.combine(
format.timestamp(),
format.errors({ stack: true }),
format.splat(),
format.json()
),
defaultMeta: { service: 'my-service' },
transports: [
new transports.File({ filename: 'error.log', level: 'error' }),
new transports.File({ filename: 'combined.log' }),
new transports.Console({
format: format.combine(
format.colorize(),
format.simple()
)
})
]
});
// Custom logging middleware
const logMiddleware = (req, res, next) => {
const startTime = Date.now();
res.on('finish', () => {
const duration = Date.now() - startTime;
logger.info('HTTP Request', {
method: req.method,
url: req.url,
statusCode: res.statusCode,
durationMs: duration,
userAgent: req.get('User-Agent'),
ip: req.ip,
userId: req.user?.id // if authenticated
});
});
next();
};
module.exports = { logger, logMiddleware };
Python Structured Logging:
# logger_config.py
import logging
import json
from pythonjsonlogger import jsonlogger
from datetime import datetime
class CustomJsonFormatter(jsonlogger.JsonFormatter):
def add_fields(self, log_record, record, message_dict):
super(CustomJsonFormatter, self).add_fields(log_record, record, message_dict)
if not log_record.get('timestamp'):
log_record['timestamp'] = datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%S.%fZ')
if log_record.get('level'):
log_record['level'] = log_record['level'].lower()
else:
log_record['level'] = record.levelname.lower()
def setup_logger(name, log_file=None, level=logging.INFO):
"""Function to setup as many loggers as you want"""
formatter = CustomJsonFormatter('%(timestamp)s %(level)s %(name)s %(message)s')
handler = logging.StreamHandler()
handler.setFormatter(formatter)
logger = logging.getLogger(name)
logger.setLevel(level)
logger.addHandler(handler)
return logger
# Usage
logger = setup_logger('my_app')
def log_api_request(method, url, status_code, duration, user_id=None):
logger.info("API Request", extra={
'event_type': 'api_request',
'method': method,
'url': url,
'status_code': status_code,
'duration_ms': duration,
'user_id': user_id
})
def log_error(error_message, error_type, context=None):
logger.error("Application Error", extra={
'event_type': 'error',
'error_message': error_message,
'error_type': error_type,
'context': context
})
Log Aggregation with ELK Stack
Filebeat Configuration:
# filebeat/filebeat.yml
filebeat.inputs:
- type: log
enabled: true
paths:
- /var/log/myapp/*.log
- /var/log/nginx/*.log
fields:
service: myapp
environment: production
multiline.pattern: '^[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}'
multiline.negate: true
multiline.match: after
filebeat.config.modules:
path: ${path.config}/modules.d/*.yml
reload.enabled: false
setup.template.settings:
index.number_of_shards: 1
index.number_of_replicas: 1
index.codec: best_compression
setup.kibana:
host: "kibana:5601"
output.elasticsearch:
hosts: ["elasticsearch:9200"]
username: "elastic"
password: "${ELASTIC_PASSWORD}"
processors:
- add_host_metadata:
when.not.contains.tags: forwarded
- add_docker_metadata: ~
- add_kubernetes_metadata: ~
Logstash Configuration:
# logstash/pipeline/logstash.conf
input {
beats {
port => 5044
}
file {
path => "/var/log/myapp/*.log"
start_position => "beginning"
codec => json
}
}
filter {
if [type] == "application" {
grok {
match => {
"message" => "%{TIMESTAMP_ISO8601:timestamp} \[%{LOGLEVEL:loglevel}\] %{GREEDYDATA:logger} - %{GREEDYDATA:message}"
}
}
date {
match => [ "timestamp", "ISO8601" ]
}
}
# Parse structured logs
if [message] {
json {
source => "message"
target => "parsed_message"
}
}
# Add geographic information for IP addresses
if [client_ip] {
geoip {
source => "client_ip"
target => "geo_location"
}
}
# Filter out debug logs in production
if [loglevel] == "DEBUG" and [environment] == "production" {
drop {}
}
}
output {
elasticsearch {
hosts => ["elasticsearch:9200"]
index => "myapp-%{+YYYY.MM.dd}"
}
# Also send critical alerts to Slack
if [loglevel] == "ERROR" or [loglevel] == "FATAL" {
http {
url => "${SLACK_WEBHOOK_URL}"
http_method => "post"
format => "json"
mapping => {
"text" => "Critical error in %{service}: %{message}"
}
}
}
stdout {
codec => rubydebug
}
}
Distributed Tracing
OpenTelemetry Implementation
Node.js Tracing Setup:
// src/tracing.js
const opentelemetry = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
const traceExporter = new OTLPTraceExporter({
url: process.env.OTLP_ENDPOINT || 'http://jaeger:4318/v1/traces',
});
const sdk = new opentelemetry.NodeSDK({
traceExporter,
instrumentations: [getNodeAutoInstrumentations()],
resource: new Resource({
[SemanticResourceAttributes.SERVICE_NAME]: process.env.SERVICE_NAME || 'my-service',
[SemanticResourceAttributes.SERVICE_VERSION]: process.env.SERVICE_VERSION || '1.0.0',
}),
});
sdk.start();
process.on('SIGTERM', () => {
sdk.shutdown()
.then(() => console.log('Tracing terminated'))
.catch((error) => console.log('Error terminating tracing', error))
.finally(() => process.exit(0));
});
module.exports = { sdk };
Express.js Tracing Middleware:
// src/middleware/tracing.js
const { trace, context, propagation } = require('@opentelemetry/api');
const { SpanStatusCode } = require('@opentelemetry/api');
const tracingMiddleware = (req, res, next) => {
// Extract context from incoming request
const extractedContext = propagation.extract(context.active(), req.headers);
// Create a new span for the request
const tracer = trace.getTracer(process.env.SERVICE_NAME || 'my-service');
const span = tracer.startSpan(`${req.method} ${req.path}`, {
attributes: {
'http.method': req.method,
'http.url': req.url,
'http.user_agent': req.get('User-Agent'),
'http.client_ip': req.ip,
'net.host.name': req.hostname,
'net.host.port': req.socket.localPort
}
}, extractedContext);
// Store span in request context
context.with(trace.setSpan(context.active(), span), () => {
// Capture response details when response finishes
res.on('finish', () => {
span.setAttributes({
'http.status_code': res.statusCode,
'http.response_content_length': res.getHeader('Content-Length')
});
// Set span status based on response status
if (res.statusCode >= 500) {
span.setStatus({
code: SpanStatusCode.ERROR,
message: `HTTP ${res.statusCode}`
});
} else if (res.statusCode >= 400) {
span.setStatus({
code: SpanStatusCode.UNSET
});
} else {
span.setStatus({
code: SpanStatusCode.OK
});
}
span.end();
});
next();
});
};
module.exports = { tracingMiddleware };
Database Query Tracing:
// src/database/tracing.js
const { trace } = require('@opentelemetry/api');
const { SpanStatusCode } = require('@opentelemetry/api');
class TracedDatabaseClient {
constructor(client) {
this.client = client;
}
async query(sql, params) {
const tracer = trace.getTracer('database');
const span = tracer.startSpan('db.query', {
attributes: {
'db.statement': sql,
'db.operation': this.getOperationType(sql),
'db.params.count': Array.isArray(params) ? params.length : 0
}
});
try {
const result = await this.client.query(sql, params);
span.setAttributes({
'db.rows.affected': result.rowCount || 0
});
return result;
} catch (error) {
span.setStatus({
code: SpanStatusCode.ERROR,
message: error.message
});
span.recordException(error);
throw error;
} finally {
span.end();
}
}
getOperationType(sql) {
const trimmedSql = sql.trim().toUpperCase();
if (trimmedSql.startsWith('SELECT')) return 'read';
if (trimmedSql.startsWith('INSERT')) return 'write';
if (trimmedSql.startsWith('UPDATE')) return 'write';
if (trimmedSql.startsWith('DELETE')) return 'write';
return 'other';
}
}
module.exports = { TracedDatabaseClient };
Jaeger Configuration
# docker-compose.jaeger.yml
version: '3.8'
services:
jaeger:
image: jaegertracing/all-in-one:latest
environment:
- COLLECTOR_OTLP_ENABLED=true
ports:
- "16686:16686" # UI
- "4317:4317" # gRPC
- "4318:4318" # HTTP
- "14268:14268" # Legacy HTTP
volumes:
- jaeger_data:/badger
volumes:
jaeger_data:
Alerting and Notification Systems
AlertManager Configuration
# alertmanager/config.yml
global:
smtp_smarthost: 'smtp.gmail.com:587'
smtp_from: '[email protected]'
smtp_auth_username: '[email protected]'
smtp_auth_password: 'your-smtp-password'
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 30s
group_interval: 5m
repeat_interval: 1h
receiver: 'default-receiver'
routes:
- match:
severity: critical
receiver: 'critical-receiver'
group_wait: 10s
repeat_interval: 30m
- match:
severity: warning
receiver: 'warning-receiver'
- match_re:
service: ^database.*
receiver: 'database-team'
group_interval: 1m
receivers:
- name: 'default-receiver'
email_configs:
- to: '[email protected]'
send_resolved: true
slack_configs:
- channel: '#alerts'
send_resolved: true
text: '{{ .CommonAnnotations.summary }}'
- name: 'critical-receiver'
email_configs:
- to: '[email protected]'
send_resolved: true
webhook_configs:
- url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
send_resolved: true
- name: 'warning-receiver'
email_configs:
- to: '[email protected]'
send_resolved: true
- name: 'database-team'
email_configs:
- to: '[email protected]'
send_resolved: true
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'cluster', 'service']
Custom Alert Handlers
// src/alert-handlers.js
const axios = require('axios');
class AlertHandler {
constructor(config) {
this.config = config;
}
async handleAlert(alert) {
const { labels, annotations, status, startsAt, endsAt } = alert;
// Determine severity and route to appropriate handler
switch (labels.severity) {
case 'critical':
await this.handleCriticalAlert(alert);
break;
case 'warning':
await this.handleWarningAlert(alert);
break;
default:
await this.handleInfoAlert(alert);
}
}
async handleCriticalAlert(alert) {
const { labels, annotations } = alert;
// Send immediate notification to on-call team
await this.sendSlackNotification({
channel: '#critical-alerts',
text: `🚨 CRITICAL ALERT: ${annotations.summary}`,
attachments: [{
color: 'danger',
fields: [
{ title: 'Service', value: labels.service, short: true },
{ title: 'Severity', value: labels.severity, short: true },
{ title: 'Description', value: annotations.description }
]
}]
});
// Trigger incident response workflow
await this.triggerIncidentResponse(labels.service, annotations.summary);
}
async handleWarningAlert(alert) {
const { labels, annotations } = alert;
// Send notification to appropriate team
const teamChannel = this.getTeamChannel(labels.service);
await this.sendSlackNotification({
channel: teamChannel,
text: `⚠️ WARNING: ${annotations.summary}`,
attachments: [{
color: 'warning',
fields: [
{ title: 'Service', value: labels.service, short: true },
{ title: 'Severity', value: labels.severity, short: true },
{ title: 'Description', value: annotations.description }
]
}]
});
}
async sendSlackNotification(message) {
try {
await axios.post(this.config.slackWebhookUrl, message);
} catch (error) {
console.error('Failed to send Slack notification:', error.message);
}
}
async triggerIncidentResponse(service, summary) {
// Integrate with incident management system
// This could trigger PagerDuty, OpsGenie, etc.
console.log(`Triggering incident response for ${service}: ${summary}`);
}
getTeamChannel(service) {
const serviceTeams = {
'user-service': '#user-team',
'payment-service': '#payments-team',
'notification-service': '#notifications-team'
};
return serviceTeams[service] || '#general';
}
}
module.exports = { AlertHandler };
Visualization and Dashboards
Grafana Dashboard Configuration
{
"__inputs": [
{
"name": "DS_PROMETHEUS",
"label": "Prometheus",
"description": "",
"type": "datasource",
"pluginId": "prometheus",
"pluginName": "Prometheus"
}
],
"__requires": [
{
"type": "grafana",
"id": "grafana",
"name": "Grafana",
"version": "9.0.0"
},
{
"type": "panel",
"id": "timeseries",
"name": "Time series",
"version": ""
},
{
"type": "datasource",
"id": "prometheus",
"name": "Prometheus",
"version": "1.0.0"
}
],
"annotations": {
"list": [
{
"builtIn": 1,
"datasource": {
"type": "grafana",
"uid": "-- Grafana --"
},
"enable": true,
"hide": true,
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Annotations & Alerts",
"target": {
"limit": 100,
"matchAny": false,
"tags": [],
"type": "dashboard"
},
"type": "dashboard"
}
]
},
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 0,
"id": null,
"links": [],
"liveNow": false,
"panels": [
{
"collapsed": false,
"gridPos": {
"h": 1,
"w": 24,
"x": 0,
"y": 0
},
"id": 18,
"panels": [],
"title": "System Overview",
"type": "row"
},
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "auto",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
}
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 1
},
"id": 2,
"options": {
"legend": {
"calcs": [],
"displayMode": "list",
"placement": "bottom"
},
"tooltip": {
"mode": "single",
"sort": "none"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"legendFormat": "{{instance}} CPU Usage",
"range": true,
"refId": "A"
}
],
"title": "CPU Usage",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "auto",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
}
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 1
},
"id": 4,
"options": {
"legend": {
"calcs": [],
"displayMode": "list",
"placement": "bottom"
},
"tooltip": {
"mode": "single",
"sort": "none"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"expr": "(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) * 100 / node_memory_MemTotal_bytes",
"legendFormat": "{{instance}} Memory Usage",
"range": true,
"refId": "A"
}
],
"title": "Memory Usage",
"type": "timeseries"
},
{
"collapsed": false,
"gridPos": {
"h": 1,
"w": 24,
"x": 0,
"y": 9
},
"id": 16,
"panels": [],
"title": "Application Metrics",
"type": "row"
},
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "auto",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "reqps"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 10
},
"id": 6,
"options": {
"legend": {
"calcs": [],
"displayMode": "list",
"placement": "bottom"
},
"tooltip": {
"mode": "single",
"sort": "none"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"expr": "sum(rate(http_requests_total[5m])) by (method, handler)",
"legendFormat": "{{method}} {{handler}}",
"range": true,
"refId": "A"
}
],
"title": "Request Rate",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "auto",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 5
}
]
},
"unit": "s"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 10
},
"id": 8,
"options": {
"legend": {
"calcs": [],
"displayMode": "list",
"placement": "bottom"
},
"tooltip": {
"mode": "single",
"sort": "none"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, handler))",
"legendFormat": "{{handler}} P95",
"range": true,
"refId": "A"
}
],
"title": "Request Duration (P95)",
"type": "timeseries"
}
],
"refresh": "5s",
"schemaVersion": 36,
"style": "dark",
"tags": [],
"templating": {
"list": [
{
"current": {
"selected": false,
"text": "Prometheus",
"value": "Prometheus"
},
"hide": 0,
"includeAll": false,
"label": "Data Source",
"multi": false,
"name": "DS_PROMETHEUS",
"options": [],
"query": "prometheus",
"queryValue": "",
"refresh": 1,
"regex": "",
"skipUrlSync": false,
"type": "datasource"
}
]
},
"time": {
"from": "now-1h",
"to": "now"
},
"timepicker": {},
"timezone": "",
"title": "System and Application Dashboard",
"uid": "system-dashboard",
"version": 1,
"weekStart": ""
}
Monitoring Best Practices
SLOs and SLIs
Service level indicators (SLIs) are the measurements that matter to users, such as availability or request latency; service level objectives (SLOs) are the targets you commit to for those indicators over a defined window.
// src/slo-calculator.js
class SLOCalculator {
constructor(config) {
this.config = config;
}
calculateAvailability(errors, totalRequests) {
return ((totalRequests - errors) / totalRequests) * 100;
}
calculateLatencyPercentile(latencies, percentile) {
const sorted = latencies.sort((a, b) => a - b);
const index = Math.floor((percentile / 100) * sorted.length);
return sorted[index];
}
// Example SLO definitions
getSLOs() {
return {
availability: {
objective: 99.9, // 99.9% availability
window: '28d', // 28-day sliding window
description: 'Overall system availability'
},
latency: {
objective: 95, // 95% of requests served within threshold
threshold: 200, // 200ms response time
window: '7d', // 7-day sliding window
description: 'Request latency P95'
},
freshness: {
objective: 99, // 99% of data fresh within threshold
threshold: 60, // 60 seconds data freshness
window: '1d', // 1-day window
description: 'Data freshness'
}
};
}
checkSLOBreach(name, slo, currentValue) {
const breaches = [];
if (name === 'availability' && currentValue < slo.objective) {
breaches.push({
slo: name,
current: currentValue,
objective: slo.objective,
status: 'breach',
impact: 'high'
});
}
return breaches;
}
}
module.exports = { SLOCalculator };
Golden Signals
The four golden signals (latency, traffic, errors, and saturation) are a widely used minimal set of service health measurements, popularized by Google's SRE practice; the sketch below expresses each as a Prometheus query and an alert condition.
// src/golden-signals.js
const goldenSignals = {
// Latency: Time taken to service a request
latency: {
metric: 'http_request_duration_seconds',
query: 'histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, handler))',
alert: 'latency > 1.0' // Alert if p95 latency > 1 second
},
// Traffic: Requests per second
traffic: {
metric: 'http_requests_total',
query: 'sum(rate(http_requests_total[5m])) by (handler)',
alert: 'traffic < 10' // Alert if traffic drops below 10 RPS
},
// Errors: Fraction of requests that fail
errors: {
metric: 'http_requests_total',
query: 'sum(rate(http_requests_total{status=~"5.."}[5m])) by (handler) / sum(rate(http_requests_total[5m])) by (handler)',
alert: 'error_rate > 0.05' // Alert if more than 5% of requests fail
},
// Saturation: How full the system is
saturation: {
metric: 'process_cpu_seconds_total',
query: 'rate(process_cpu_seconds_total[5m]) * 100',
alert: 'saturation > 80' // Alert if CPU > 80%
}
};
module.exports = { goldenSignals };
Synthetic Monitoring
// src/synthetic-monitor.js
const puppeteer = require('puppeteer');
const prometheus = require('prom-client');
// Synthetic monitoring metrics
const syntheticUp = new prometheus.Gauge({
name: 'synthetic_monitor_up',
help: 'Synthetic monitor status',
labelNames: ['monitor_name', 'url']
});
const syntheticResponseTime = new prometheus.Histogram({
name: 'synthetic_monitor_response_time_seconds',
help: 'Synthetic monitor response time',
labelNames: ['monitor_name', 'url'],
buckets: [0.1, 0.5, 1, 2, 5, 10]
});
async function runSyntheticCheck(url, monitorName) {
const startTime = Date.now();
let browser;
try {
browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
// Add performance metrics
await page.evaluateOnNewDocument(() => {
window.performance.mark('navigation-start');
});
const response = await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 });
if (!response.ok()) {
throw new Error(`HTTP ${response.status()}`);
}
// Measure response time
const responseTime = (Date.now() - startTime) / 1000;
syntheticUp.labels(monitorName, url).set(1);
syntheticResponseTime.labels(monitorName, url).observe(responseTime);
return { success: true, responseTime, status: response.status() };
} catch (error) {
syntheticUp.labels(monitorName, url).set(0);
return { success: false, error: error.message };
} finally {
// Always release the browser, even when the check fails
if (browser) {
await browser.close();
}
}
}
// Schedule synthetic checks
setInterval(async () => {
const checks = [
{ name: 'homepage', url: 'https://example.com' },
{ name: 'login', url: 'https://example.com/login' },
{ name: 'api-status', url: 'https://api.example.com/status' }
];
for (const check of checks) {
await runSyntheticCheck(check.url, check.name);
}
}, 60000); // Run every minute
module.exports = { runSyntheticCheck };
Monitoring in CI/CD Pipelines
Monitoring Validation in Pipelines
# .github/workflows/monitoring-validation.yml
name: Monitoring Validation
on:
pull_request:
branches: [ main ]
paths:
- 'monitoring/**'
- 'src/**/*'
- 'Dockerfile'
jobs:
validate-monitoring:
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v3
- name: Setup Node.js
uses: actions/setup-node@v3
with:
node-version: '18'
- name: Install dependencies
run: npm ci
- name: Validate Prometheus configuration
run: |
# Install promtool
curl -sSL https://github.com/prometheus/prometheus/releases/download/v2.37.0/prometheus-2.37.0.linux-amd64.tar.gz | tar xz
./prometheus-2.37.0.linux-amd64/promtool check config monitoring/prometheus/prometheus.yml
./prometheus-2.37.0.linux-amd64/promtool check rules monitoring/prometheus/alert_rules.yml
- name: Run monitoring unit tests
run: npm run test:monitoring
- name: Validate Grafana dashboards
run: |
# Check dashboard JSON validity
for dashboard in monitoring/grafana/dashboards/*.json; do
jq empty "$dashboard" || exit 1
done
Chaos Engineering for Monitoring
// chaos-engineering.js
const { exec } = require('child_process');
const prometheus = require('./src/metrics'); // Our metrics client
class ChaosEngineer {
constructor() {
// Only the scenarios implemented below are registered
this.scenarios = [
this.cpuHog.bind(this),
this.networkDelay.bind(this)
];
}
async cpuHog(duration = 30) {
console.log(`Starting CPU hog for ${duration} seconds`);
// Monitor CPU usage during chaos
const cpuBefore = await this.getCPUUsage();
// Create CPU intensive process
const child = exec(`stress-ng --cpu 4 --timeout ${duration}s`);
setTimeout(() => {
child.kill();
console.log('CPU hog completed');
}, duration * 1000);
}
async networkDelay(delayMs = 1000, duration = 60) {
console.log(`Adding ${delayMs}ms network delay on eth0 for ${duration}s`);
// This would typically be run on the target system
exec(`sudo tc qdisc add dev eth0 root netem delay ${delayMs}ms`, (error) => {
if (error) {
console.error('Network delay setup failed:', error);
return;
}
setTimeout(() => {
exec('sudo tc qdisc del dev eth0 root', () => {
console.log('Network delay removed');
});
}, duration * 1000);
});
}
async runExperiments() {
for (const scenario of this.scenarios) {
console.log(`Running experiment: ${scenario.name}`);
// Record metrics before experiment
const metricsBefore = await this.getCurrentMetrics();
await scenario();
// Wait for metrics to propagate
await new Promise(resolve => setTimeout(resolve, 10000));
// Record metrics after experiment
const metricsAfter = await this.getCurrentMetrics();
console.log('Metrics comparison:', {
before: metricsBefore,
after: metricsAfter
});
// Wait between experiments
await new Promise(resolve => setTimeout(resolve, 30000));
}
}
async getCurrentMetrics() {
// Collect current system metrics
return {
cpu: await this.getCPUUsage(),
memory: await this.getMemoryUsage(),
disk: await this.getDiskUsage()
};
}
async getCPUUsage() {
// Implementation to get current CPU usage
return new Promise((resolve) => {
exec('top -bn1 | grep "Cpu(s)" | awk \'{print $2}\' | sed "s/%us,//"', (error, stdout) => {
resolve(parseFloat(stdout.trim()));
});
});
}
async getMemoryUsage() {
// Implementation to get current memory usage
return new Promise((resolve) => {
exec('free | grep Mem | awk \'{printf("%.2f", $3/$2 * 100.0)}\'', (error, stdout) => {
resolve(parseFloat(stdout.trim()));
});
});
}
async getDiskUsage() {
// Implementation to get current disk usage
return new Promise((resolve) => {
exec('df / | tail -1 | awk \'{print $5}\' | sed "s/%//"', (error, stdout) => {
resolve(parseFloat(stdout.trim()));
});
});
}
}
module.exports = { ChaosEngineer };
Troubleshooting and Root Cause Analysis
Diagnostic Tools and Techniques
#!/bin/bash
# diagnostic-tools.sh
# Comprehensive system diagnostics script
echo "=== System Diagnostics Report ==="
echo "Generated on: $(date)"
echo
echo "--- System Information ---"
echo "Hostname: $(hostname)"
echo "OS: $(uname -s)"
echo "Kernel: $(uname -r)"
echo "Architecture: $(uname -m)"
echo
echo "--- CPU Information ---"
echo "CPU Count: $(nproc)"
echo "Load Average: $(uptime | awk -F'load average:' '{print $2}')"
echo
echo "--- Memory Information ---"
free -h
echo
echo "--- Disk Usage ---"
df -h
echo
echo "--- Network Connections ---"
netstat -tuln | head -20
echo
echo "--- Process Information ---"
echo "Top 10 processes by CPU:"
ps aux --sort=-%cpu | head -11
echo
echo "Top 10 processes by Memory:"
ps aux --sort=-%mem | head -11
echo
echo "--- Application Logs (last 50 lines) ---"
if [ -f "/var/log/myapp/combined.log" ]; then
tail -50 /var/log/myapp/combined.log
else
echo "Application logs not found"
fi
echo
echo "--- Recent System Messages ---"
dmesg | tail -20
echo
echo "--- Prometheus Metrics Sample ---"
if curl -sf http://localhost:9090/metrics >/dev/null 2>&1; then
curl -s http://localhost:9090/metrics | grep -E "up|requests_total|errors_total" | head -20
else
echo "Prometheus not accessible"
fi
echo
echo "=== End of Report ==="Log Analysis Queries
-- SQL-like queries for log analysis (using tools like Loki or Elasticsearch)
-- Top error patterns in last hour
SELECT
log_level,
message,
COUNT(*) as error_count
FROM logs
WHERE timestamp > NOW() - INTERVAL 1 HOUR
AND log_level = 'ERROR'
GROUP BY log_level, message
ORDER BY error_count DESC
LIMIT 10;
-- Slowest API endpoints
SELECT
endpoint,
AVG(response_time) as avg_response_time,
MAX(response_time) as max_response_time,
COUNT(*) as request_count
FROM api_requests
WHERE timestamp > NOW() - INTERVAL 1 HOUR
GROUP BY endpoint
HAVING avg_response_time > 1000 -- More than 1 second
ORDER BY avg_response_time DESC;
-- User activity patterns
SELECT
DATE_TRUNC('hour', timestamp) as hour,
COUNT(DISTINCT user_id) as active_users,
COUNT(*) as total_requests
FROM user_requests
WHERE timestamp > NOW() - INTERVAL 24 HOURS
GROUP BY hour
ORDER BY hour;
-- Error correlation analysis
SELECT
error_type,
COUNT(*) as error_count,
COUNT(DISTINCT session_id) as affected_sessions,
STRING_AGG(DISTINCT service_name, ', ') as affected_services
FROM error_events
WHERE timestamp > NOW() - INTERVAL 1 HOUR
GROUP BY error_type
HAVING error_count > 10
ORDER BY error_count DESC;
Conclusion
Monitoring and observability are fundamental to successful DevOps practices, providing the visibility needed to maintain system health, performance, and reliability. Effective monitoring combines traditional metrics collection with modern observability practices, including distributed tracing and structured logging.
The key to successful monitoring is to start with the basics—system metrics, application performance, and error tracking—and gradually add more sophisticated observability practices. Organizations should focus on implementing the "four golden signals" (latency, traffic, errors, and saturation) as a foundation, then expand to include business metrics and user experience monitoring.
In the next article, we'll explore security practices in DevOps, covering DevSecOps principles, security automation, and how to integrate security throughout the software delivery lifecycle.