Monitoring and Log Management
Overview
Monitoring and log management are core components of modern application operations: by monitoring system state in real time and by collecting and analyzing log data, they help operations teams detect problems promptly, localize faults, and optimize performance. This document describes best practices, tool selection, and implementation approaches for monitoring and log management.
Monitoring Architecture
Monitoring Layer Model
The monitoring stack is layered from business concerns down to infrastructure, each layer building on the one below it:

1. Business Monitoring: user behavior analysis, business KPIs, conversion rates, revenue
2. Application Monitoring: response time, error rate, throughput, service dependencies
3. System Monitoring: CPU usage, memory usage, disk I/O, network traffic
4. Infrastructure Monitoring: server status, network devices, storage systems, cloud resources
Monitoring Data Flow
Monitoring data moves through a pipeline from collection to presentation:

1. Data Collection: gather metrics from exporters and agents, relying on Data Transport to move them
2. Data Storage: persist time series and serve Data Query
3. Data Analysis: evaluate stored data and drive Alerting
4. Visualization: build dashboards on top of the query layer
Prometheus Monitoring in Practice
Prometheus Configuration
# prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
external_labels:
cluster: 'production'
region: 'us-west-2'
rule_files:
- "rules/*.yml"
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
scrape_configs:
# Prometheus self-monitoring
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
scrape_interval: 5s
metrics_path: /metrics
# Node Exporter: system metrics
- job_name: 'node-exporter'
static_configs:
- targets:
- 'node1:9100'
- 'node2:9100'
- 'node3:9100'
scrape_interval: 10s
relabel_configs:
- source_labels: [__address__]
target_label: instance
regex: '([^:]+):.+'
replacement: '${1}'
# Kubernetes monitoring
- job_name: 'kubernetes-apiservers'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
- job_name: 'kubernetes-nodes'
kubernetes_sd_configs:
- role: node
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics
- job_name: 'kubernetes-cadvisor'
kubernetes_sd_configs:
- role: node
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
# Application monitoring
- job_name: 'myapp'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
# Database monitoring
- job_name: 'postgres-exporter'
static_configs:
- targets: ['postgres-exporter:9187']
scrape_interval: 30s
- job_name: 'redis-exporter'
static_configs:
- targets: ['redis-exporter:9121']
scrape_interval: 30s
# Message queue monitoring
- job_name: 'rabbitmq-exporter'
static_configs:
- targets: ['rabbitmq-exporter:9419']
scrape_interval: 30s
# Nginx monitoring
- job_name: 'nginx-exporter'
static_configs:
- targets: ['nginx-exporter:9113']
scrape_interval: 15s
# Blackbox exporter: external service availability
- job_name: 'blackbox'
metrics_path: /probe
params:
module: [http_2xx]
static_configs:
- targets:
- https://myapp.com
- https://api.myapp.com
- https://admin.myapp.com
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: blackbox-exporter:9115
# Remote write (optional)
remote_write:
- url: "https://prometheus-remote-write.example.com/api/v1/write"
basic_auth:
username: "prometheus"
password: "password"
queue_config:
max_samples_per_send: 1000
max_shards: 200
capacity: 2500
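Before reloading Prometheus, the configuration and rule files can be validated with promtool, which ships with the Prometheus distribution. The guard below keeps the snippet runnable on machines where it is not installed:

```shell
# Validate the main config and the rule files before a reload.
# promtool is bundled with Prometheus; skip gracefully if absent.
if command -v promtool >/dev/null 2>&1; then
  promtool check config prometheus.yml
  promtool check rules rules/*.yml
else
  echo "promtool not installed; skipping validation"
fi
```

A failed check exits non-zero, which makes this easy to wire into CI before shipping configuration changes.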
Alerting Rule Configuration
# rules/alerts.yml
groups:
- name: system.rules
rules:
# System-level alerts
- alert: HighCPUUsage
expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 5m
labels:
severity: warning
team: infrastructure
annotations:
summary: "High CPU usage detected"
description: "CPU usage is above 80% on {{ $labels.instance }} for more than 5 minutes. Current value: {{ $value }}%"
runbook_url: "https://runbooks.example.com/high-cpu"
- alert: HighMemoryUsage
expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
for: 5m
labels:
severity: warning
team: infrastructure
annotations:
summary: "High memory usage detected"
description: "Memory usage is above 85% on {{ $labels.instance }}. Current value: {{ $value }}%"
- alert: DiskSpaceUsage
expr: (1 - (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"})) * 100 > 90
for: 5m
labels:
severity: critical
team: infrastructure
annotations:
summary: "Disk space usage critical"
description: "Disk usage is above 90% on {{ $labels.instance }} mount {{ $labels.mountpoint }}. Current value: {{ $value }}%"
- alert: NodeDown
expr: up{job="node-exporter"} == 0
for: 1m
labels:
severity: critical
team: infrastructure
annotations:
summary: "Node is down"
description: "Node {{ $labels.instance }} has been down for more than 1 minute"
- name: application.rules
rules:
# Application-level alerts
- alert: HighErrorRate
expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100 > 5
for: 5m
labels:
severity: critical
team: backend
annotations:
summary: "High error rate detected"
description: "Error rate is above 5% for {{ $labels.service }}. Current value: {{ $value }}%"
- alert: HighResponseTime
expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
for: 5m
labels:
severity: warning
team: backend
annotations:
summary: "High response time detected"
description: "95th percentile response time is above 1s for {{ $labels.service }}. Current value: {{ $value }}s"
- alert: LowThroughput
expr: rate(http_requests_total[5m]) < 10
for: 10m
labels:
severity: warning
team: backend
annotations:
summary: "Low throughput detected"
description: "Request rate is below 10 req/s for {{ $labels.service }}. Current value: {{ $value }} req/s"
- alert: ApplicationDown
expr: up{job="myapp"} == 0
for: 1m
labels:
severity: critical
team: backend
annotations:
summary: "Application is down"
description: "Application {{ $labels.instance }} is not responding"
- name: database.rules
rules:
# Database alerts
- alert: PostgreSQLDown
expr: pg_up == 0
for: 1m
labels:
severity: critical
team: database
annotations:
summary: "PostgreSQL is down"
description: "PostgreSQL instance {{ $labels.instance }} is down"
- alert: PostgreSQLTooManyConnections
expr: sum by (instance) (pg_stat_activity_count) > on (instance) pg_settings_max_connections * 0.8
for: 5m
labels:
severity: warning
team: database
annotations:
summary: "PostgreSQL too many connections"
description: "PostgreSQL instance {{ $labels.instance }} has too many connections. Current: {{ $value }}"
- alert: PostgreSQLSlowQueries
expr: rate(pg_stat_activity_max_tx_duration[5m]) > 60
for: 5m
labels:
severity: warning
team: database
annotations:
summary: "PostgreSQL slow queries detected"
description: "PostgreSQL instance {{ $labels.instance }} has slow running queries"
- name: kubernetes.rules
rules:
# Kubernetes alerts
- alert: KubernetesPodCrashLooping
expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
for: 5m
labels:
severity: critical
team: platform
annotations:
summary: "Pod is crash looping"
description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
- alert: KubernetesPodNotReady
expr: kube_pod_status_ready{condition="false"} == 1
for: 5m
labels:
severity: warning
team: platform
annotations:
summary: "Pod not ready"
description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is not ready"
- alert: KubernetesNodeNotReady
expr: kube_node_status_ready{condition="false"} == 1
for: 5m
labels:
severity: critical
team: platform
annotations:
summary: "Node not ready"
description: "Node {{ $labels.node }} is not ready"
- alert: KubernetesDeploymentReplicasMismatch
expr: kube_deployment_spec_replicas != kube_deployment_status_available_replicas
for: 5m
labels:
severity: warning
team: platform
annotations:
summary: "Deployment replicas mismatch"
description: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has had a mismatch between desired and available replicas for more than 5 minutes"
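The HighResponseTime rule relies on histogram_quantile(), which estimates a quantile by linearly interpolating inside the cumulative bucket that contains the target rank. A minimal JavaScript sketch of that calculation, using hypothetical bucket counts:

```javascript
// Estimate a quantile from cumulative Prometheus histogram buckets.
// Buckets must be sorted by `le` (upper bound), counts cumulative,
// with a final +Inf bucket; interpolation mirrors histogram_quantile.
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  const rank = q * total;
  let prevLe = 0, prevCount = 0;
  for (const b of buckets) {
    if (b.count >= rank) {
      // Cannot interpolate into the open-ended +Inf bucket.
      if (b.le === Infinity) return prevLe;
      const fraction = (rank - prevCount) / (b.count - prevCount);
      return prevLe + (b.le - prevLe) * fraction;
    }
    prevLe = b.le;
    prevCount = b.count;
  }
  return NaN;
}

const buckets = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1.0, count: 99 },
  { le: Infinity, count: 100 },
];
console.log(histogramQuantile(0.95, buckets)); // ≈ 0.778 (inside the (0.5, 1.0] bucket)
```

This is also why bucket boundaries matter: the estimate is only as precise as the bucket the quantile lands in.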
Alertmanager Configuration
# alertmanager.yml
global:
smtp_smarthost: 'smtp.gmail.com:587'
smtp_from: 'alerts@mycompany.com'
smtp_auth_username: 'alerts@mycompany.com'
smtp_auth_password: 'app-password'
slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 10s
group_interval: 10s
repeat_interval: 1h
receiver: 'default'
routes:
# Notify critical alerts immediately
- match:
severity: critical
receiver: 'critical-alerts'
group_wait: 0s
repeat_interval: 5m
# Infrastructure alerts
- match:
team: infrastructure
receiver: 'infrastructure-team'
# Application alerts
- match:
team: backend
receiver: 'backend-team'
# Database alerts
- match:
team: database
receiver: 'database-team'
# Platform alerts
- match:
team: platform
receiver: 'platform-team'
# Warning alerts: this route is only active during business hours
- match:
severity: warning
receiver: 'warning-alerts'
active_time_intervals:
- business-hours
inhibit_rules:
# If a node is down, inhibit all other alerts from that node
- source_match:
alertname: 'NodeDown'
target_match_re:
instance: '.*'
equal: ['instance']
# If the application is down, inhibit its high-error-rate alert
- source_match:
alertname: 'ApplicationDown'
target_match:
alertname: 'HighErrorRate'
equal: ['service']
receivers:
- name: 'default'
slack_configs:
- channel: '#alerts'
title: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
- name: 'critical-alerts'
slack_configs:
- channel: '#critical-alerts'
title: '🚨 CRITICAL: {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
color: 'danger'
email_configs:
- to: 'oncall@mycompany.com'
subject: '🚨 CRITICAL Alert: {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
body: |
{{ range .Alerts }}
Alert: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
Severity: {{ .Labels.severity }}
Instance: {{ .Labels.instance }}
{{ if .Annotations.runbook_url }}Runbook: {{ .Annotations.runbook_url }}{{ end }}
{{ end }}
pagerduty_configs:
- service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
description: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
- name: 'infrastructure-team'
slack_configs:
- channel: '#infrastructure'
title: '🔧 Infrastructure Alert: {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
- name: 'backend-team'
slack_configs:
- channel: '#backend'
title: '⚙️ Backend Alert: {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
- name: 'database-team'
slack_configs:
- channel: '#database'
title: '🗄️ Database Alert: {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
- name: 'platform-team'
slack_configs:
- channel: '#platform'
title: '☸️ Platform Alert: {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
- name: 'warning-alerts'
slack_configs:
- channel: '#warnings'
title: '⚠️ Warning: {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
color: 'warning'
time_intervals:
- name: business-hours
time_intervals:
- times:
- start_time: '09:00'
end_time: '18:00'
weekdays: ['monday:friday']
location: 'Asia/Shanghai'
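In the routing tree above, the first child route whose matchers all equal the alert's labels wins; otherwise the root receiver applies. A simplified sketch of that selection (routes and receivers mirror the config but drop grouping and time intervals):

```javascript
// First-match routing over the alert's label set, as Alertmanager
// does for a flat list of child routes without `continue`.
const routes = [
  { match: { severity: 'critical' }, receiver: 'critical-alerts' },
  { match: { team: 'infrastructure' }, receiver: 'infrastructure-team' },
  { match: { team: 'backend' }, receiver: 'backend-team' },
  { match: { team: 'database' }, receiver: 'database-team' },
  { match: { team: 'platform' }, receiver: 'platform-team' },
  { match: { severity: 'warning' }, receiver: 'warning-alerts' },
];

function pickReceiver(labels, defaultReceiver = 'default') {
  for (const route of routes) {
    const matches = Object.entries(route.match)
      .every(([key, value]) => labels[key] === value);
    if (matches) return route.receiver;
  }
  return defaultReceiver;
}

console.log(pickReceiver({ severity: 'critical', team: 'backend' })); // critical-alerts
console.log(pickReceiver({ severity: 'warning', team: 'database' })); // database-team
```

Note the ordering consequence: a critical database alert goes to critical-alerts, not database-team, because the severity route is listed first.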
Grafana Visualization
Grafana Configuration
# grafana.yml
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-config
data:
grafana.ini: |
[server]
protocol = http
http_port = 3000
domain = grafana.mycompany.com
root_url = https://grafana.mycompany.com
[database]
type = postgres
host = postgres:5432
name = grafana
user = grafana
password = grafana_password
ssl_mode = require
[session]
provider = redis
provider_config = addr=redis:6379,pool_size=100,db=grafana
[security]
admin_user = admin
admin_password = admin_password
secret_key = your_secret_key
disable_gravatar = true
cookie_secure = true
cookie_samesite = strict
[auth]
disable_login_form = false
disable_signout_menu = false
[auth.ldap]
enabled = true
config_file = /etc/grafana/ldap.toml
allow_sign_up = true
[auth.generic_oauth]
enabled = true
[auth.github]
enabled = true
allow_sign_up = true
client_id = your_github_client_id
client_secret = your_github_client_secret
scopes = user:email,read:org
auth_url = https://github.com/login/oauth/authorize
token_url = https://github.com/login/oauth/access_token
api_url = https://api.github.com/user
allowed_organizations = mycompany
[smtp]
enabled = true
host = smtp.gmail.com:587
user = alerts@mycompany.com
password = app_password
from_address = alerts@mycompany.com
from_name = Grafana
[alerting]
enabled = true
execute_alerts = true
[metrics]
enabled = true
interval_seconds = 10
[log]
mode = console file
level = info
[log.console]
level = info
format = console
[log.file]
level = info
format = text
log_rotate = true
max_lines = 1000000
max_size_shift = 28
daily_rotate = true
max_days = 7
Example Dashboard Configuration
{
"dashboard": {
"id": null,
"title": "Application Performance Dashboard",
"tags": ["application", "performance"],
"timezone": "browser",
"panels": [
{
"id": 1,
"title": "Request Rate",
"type": "graph",
"targets": [
{
"expr": "rate(http_requests_total[5m])",
"legendFormat": "{{service}} - {{method}} {{status}}",
"refId": "A"
}
],
"yAxes": [
{
"label": "Requests/sec",
"min": 0
}
],
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 0
}
},
{
"id": 2,
"title": "Response Time",
"type": "graph",
"targets": [
{
"expr": "histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))",
"legendFormat": "50th percentile",
"refId": "A"
},
{
"expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
"legendFormat": "95th percentile",
"refId": "B"
},
{
"expr": "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))",
"legendFormat": "99th percentile",
"refId": "C"
}
],
"yAxes": [
{
"label": "Seconds",
"min": 0
}
],
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 0
}
},
{
"id": 3,
"title": "Error Rate",
"type": "singlestat",
"targets": [
{
"expr": "rate(http_requests_total{status=~\"5..\"}[5m]) / rate(http_requests_total[5m]) * 100",
"refId": "A"
}
],
"valueName": "current",
"format": "percent",
"thresholds": "1,5",
"colorBackground": true,
"colors": ["#299c46", "#e24d42", "#d44a3a"],
"gridPos": {
"h": 4,
"w": 6,
"x": 0,
"y": 8
}
},
{
"id": 4,
"title": "Active Connections",
"type": "singlestat",
"targets": [
{
"expr": "sum(http_connections_active)",
"refId": "A"
}
],
"valueName": "current",
"format": "short",
"gridPos": {
"h": 4,
"w": 6,
"x": 6,
"y": 8
}
},
{
"id": 5,
"title": "Memory Usage",
"type": "graph",
"targets": [
{
"expr": "process_resident_memory_bytes",
"legendFormat": "{{instance}}",
"refId": "A"
}
],
"yAxes": [
{
"label": "Bytes",
"min": 0
}
],
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 8
}
}
],
"time": {
"from": "now-1h",
"to": "now"
},
"refresh": "5s"
}
}
Log Management
ELK Stack Configuration
Elasticsearch Configuration
# elasticsearch.yml
cluster.name: "logging-cluster"
node.name: "elasticsearch-master"
network.host: 0.0.0.0
http.port: 9200
transport.port: 9300
# Cluster settings
discovery.seed_hosts: ["elasticsearch-master", "elasticsearch-data-1", "elasticsearch-data-2"]
cluster.initial_master_nodes: ["elasticsearch-master"]
# Memory settings
bootstrap.memory_lock: true
# Index settings
action.auto_create_index: true
action.destructive_requires_name: true
# Security settings
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: elastic-certificates.p12
# Monitoring
xpack.monitoring.collection.enabled: true
# Index lifecycle management
xpack.ilm.enabled: true
Logstash Configuration
# logstash.conf
input {
# Filebeat input
beats {
port => 5044
}
# Syslog input
syslog {
port => 5514
}
# HTTP input
http {
port => 8080
codec => json
}
# Kafka input
kafka {
bootstrap_servers => "kafka:9092"
topics => ["application-logs", "system-logs"]
group_id => "logstash"
codec => json
}
}
filter {
# Parse application logs
if [fields][log_type] == "application" {
# Parse JSON-formatted log lines
if [message] =~ /^\{.*\}$/ {
json {
source => "message"
}
}
# Parse the timestamp
date {
match => [ "timestamp", "ISO8601" ]
}
# Add GeoIP information
if [client_ip] {
geoip {
source => "client_ip"
target => "geoip"
}
}
# Parse the User-Agent header
if [user_agent] {
useragent {
source => "user_agent"
target => "ua"
}
}
}
# Parse Nginx access logs
if [fields][log_type] == "nginx" {
grok {
match => {
"message" => "%{NGINXACCESS}"
}
}
date {
match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
}
mutate {
convert => { "response" => "integer" }
convert => { "bytes" => "integer" }
convert => { "responsetime" => "float" }
}
}
# Parse system logs
if [fields][log_type] == "system" {
grok {
match => {
"message" => "%{SYSLOGTIMESTAMP:timestamp} %{IPORHOST:host} %{DATA:program}(?:\[%{POSINT:pid}\])?: %{GREEDYDATA:message}"
}
overwrite => [ "message" ]
}
date {
match => [ "timestamp", "MMM d HH:mm:ss", "MMM dd HH:mm:ss" ]
}
}
# Add common fields
mutate {
add_field => { "[@metadata][index_prefix]" => "logs" }
}
# Route to an index prefix based on log level
if [level] {
if [level] in ["ERROR", "FATAL"] {
mutate {
add_field => { "[@metadata][index_prefix]" => "logs-error" }
}
} else if [level] == "WARN" {
mutate {
add_field => { "[@metadata][index_prefix]" => "logs-warning" }
}
}
}
# Remove unneeded fields
mutate {
remove_field => [ "host", "agent", "ecs", "@version" ]
}
}
output {
# Ship to Elasticsearch
elasticsearch {
hosts => ["elasticsearch:9200"]
index => "%{[@metadata][index_prefix]}-%{+YYYY.MM.dd}"
template_name => "logs"
template => "/usr/share/logstash/templates/logs.json"
template_overwrite => true
# Authentication
user => "logstash_writer"
password => "password"
}
# Also write error-level logs to a file
if [level] in ["ERROR", "FATAL"] {
file {
path => "/var/log/logstash/errors.log"
codec => line { format => "%{timestamp} [%{level}] %{logger}: %{message}" }
}
}
# Debug output
if [@metadata][debug] {
stdout {
codec => rubydebug
}
}
}
Filebeat Configuration
# filebeat.yml
filebeat.inputs:
# Application logs
- type: log
enabled: true
paths:
- /var/log/myapp/*.log
- /var/log/myapp/*/*.log
fields:
log_type: application
service: myapp
fields_under_root: true
multiline.pattern: '^\d{4}-\d{2}-\d{2}'
multiline.negate: true
multiline.match: after
# Nginx access logs
- type: log
enabled: true
paths:
- /var/log/nginx/access.log
fields:
log_type: nginx
service: nginx
fields_under_root: true
# Nginx error logs
- type: log
enabled: true
paths:
- /var/log/nginx/error.log
fields:
log_type: nginx-error
service: nginx
fields_under_root: true
multiline.pattern: '^\d{4}/\d{2}/\d{2}'
multiline.negate: true
multiline.match: after
# System logs
- type: log
enabled: true
paths:
- /var/log/syslog
- /var/log/messages
fields:
log_type: system
fields_under_root: true
# Docker container logs
- type: container
enabled: true
paths:
- '/var/lib/docker/containers/*/*.log'
processors:
- add_docker_metadata:
host: "unix:///var/run/docker.sock"
- decode_json_fields:
fields: ["message"]
target: ""
overwrite_keys: true
# Kubernetes logs: hints-based autodiscover. Note that this is not a
# filebeat.inputs entry; it belongs under filebeat.autodiscover.
filebeat.autodiscover:
providers:
- type: kubernetes
hints.enabled: true
hints.default_config:
type: container
paths:
- /var/log/containers/*${data.kubernetes.container.id}.log
# Processors
processors:
# Add host metadata
- add_host_metadata:
when.not.contains.tags: forwarded
# Add Docker metadata
- add_docker_metadata: ~
# Add Kubernetes metadata
- add_kubernetes_metadata: ~
# Drop empty messages
- drop_event:
when:
equals:
message: ""
# Rename fields
- rename:
fields:
- from: "agent.hostname"
to: "host.name"
# Output
output.logstash:
hosts: ["logstash:5044"]
# Optional: ship directly to Elasticsearch instead of Logstash
# output.elasticsearch:
# hosts: ["elasticsearch:9200"]
# username: "filebeat_writer"
# password: "password"
# index: "filebeat-%{+yyyy.MM.dd}"
# Logging
logging.level: info
logging.to_files: true
logging.files:
path: /var/log/filebeat
name: filebeat
keepfiles: 7
permissions: 0644
# Monitoring
monitoring.enabled: true
monitoring.elasticsearch:
hosts: ["elasticsearch:9200"]
username: "beats_system"
password: "password"
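The multiline settings above (negate: true, match: after) mean that any line which does NOT start with a date is appended to the previous event, which is how stack traces stay attached to the log line that produced them. A dependency-free sketch of that merge, with hypothetical log lines:

```javascript
// Merge continuation lines into the preceding event, mirroring
// Filebeat's multiline.pattern / negate: true / match: after.
const pattern = /^\d{4}-\d{2}-\d{2}/;

function mergeMultiline(lines) {
  const events = [];
  for (const line of lines) {
    if (pattern.test(line) || events.length === 0) {
      events.push(line); // a new event starts with a date
    } else {
      events[events.length - 1] += '\n' + line; // continuation line
    }
  }
  return events;
}

const raw = [
  '2024-05-01 12:00:00 ERROR Unhandled exception',
  '  at Object.<anonymous> (app.js:10)',
  '  at Module._compile (node:internal/modules)',
  '2024-05-01 12:00:01 INFO Request handled',
];
console.log(mergeMultiline(raw).length); // 2 events
```

Getting the pattern wrong typically shows up as one giant event (pattern never matches) or shredded stack traces (pattern matches continuation lines).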
Log Index Template
{
"index_patterns": ["logs-*"],
"settings": {
"number_of_shards": 3,
"number_of_replicas": 1,
"index.refresh_interval": "5s",
"index.codec": "best_compression",
"index.mapping.total_fields.limit": 2000,
"index.lifecycle.name": "logs-policy",
"index.lifecycle.rollover_alias": "logs"
},
"mappings": {
"properties": {
"@timestamp": {
"type": "date"
},
"level": {
"type": "keyword"
},
"logger": {
"type": "keyword"
},
"message": {
"type": "text",
"analyzer": "standard"
},
"service": {
"type": "keyword"
},
"host": {
"properties": {
"name": {
"type": "keyword"
},
"ip": {
"type": "ip"
}
}
},
"kubernetes": {
"properties": {
"namespace": {
"type": "keyword"
},
"pod": {
"properties": {
"name": {
"type": "keyword"
}
}
},
"container": {
"properties": {
"name": {
"type": "keyword"
}
}
}
}
},
"http": {
"properties": {
"method": {
"type": "keyword"
},
"status_code": {
"type": "integer"
},
"url": {
"type": "keyword"
},
"response_time": {
"type": "float"
}
}
},
"error": {
"properties": {
"type": {
"type": "keyword"
},
"message": {
"type": "text"
},
"stack_trace": {
"type": "text"
}
}
},
"geoip": {
"properties": {
"location": {
"type": "geo_point"
},
"country_name": {
"type": "keyword"
},
"city_name": {
"type": "keyword"
}
}
}
}
}
}
Index Lifecycle Management
{
"policy": {
"phases": {
"hot": {
"actions": {
"rollover": {
"max_size": "10GB",
"max_age": "1d"
},
"set_priority": {
"priority": 100
}
}
},
"warm": {
"min_age": "2d",
"actions": {
"set_priority": {
"priority": 50
},
"allocate": {
"number_of_replicas": 0
},
"forcemerge": {
"max_num_segments": 1
}
}
},
"cold": {
"min_age": "7d",
"actions": {
"set_priority": {
"priority": 0
},
"allocate": {
"number_of_replicas": 0
}
}
},
"delete": {
"min_age": "30d",
"actions": {
"delete": {}
}
}
}
}
}
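The policy above moves an index through hot, warm, cold, and delete phases as it ages past each min_age. Elasticsearch manages the actual transitions (and rollover timing); this sketch only illustrates the age thresholds the policy encodes:

```javascript
// Map an index's age in days (since rollover) to the ILM phase the
// policy above would place it in. Thresholds mirror the min_age values.
function ilmPhase(ageDays) {
  if (ageDays >= 30) return 'delete';
  if (ageDays >= 7) return 'cold';
  if (ageDays >= 2) return 'warm';
  return 'hot';
}

console.log(ilmPhase(1));  // hot
console.log(ilmPhase(3));  // warm
console.log(ilmPhase(10)); // cold
console.log(ilmPhase(45)); // delete
```

With the 1d/10GB rollover in the hot phase, this works out to roughly a day of fast, replicated writes, then progressively cheaper storage until deletion at 30 days.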
Application Performance Monitoring (APM)
Distributed Tracing with Jaeger
Jaeger Deployment Configuration
# jaeger-all-in-one.yml
apiVersion: apps/v1
kind: Deployment
metadata:
name: jaeger
labels:
app: jaeger
spec:
replicas: 1
selector:
matchLabels:
app: jaeger
template:
metadata:
labels:
app: jaeger
spec:
containers:
- name: jaeger
image: jaegertracing/all-in-one:1.35
ports:
- containerPort: 16686
name: ui
- containerPort: 14268
name: collector
- containerPort: 6831
name: agent-compact
- containerPort: 6832
name: agent-binary
env:
- name: COLLECTOR_ZIPKIN_HOST_PORT
value: ":9411"
- name: SPAN_STORAGE_TYPE
value: "elasticsearch"
- name: ES_SERVER_URLS
value: "http://elasticsearch:9200"
- name: ES_USERNAME
value: "jaeger"
- name: ES_PASSWORD
value: "jaeger_password"
resources:
requests:
memory: "512Mi"
cpu: "250m"
limits:
memory: "1Gi"
cpu: "500m"
---
apiVersion: v1
kind: Service
metadata:
name: jaeger
labels:
app: jaeger
spec:
ports:
- port: 16686
name: ui
targetPort: 16686
- port: 14268
name: collector
targetPort: 14268
- port: 6831
name: agent-compact
targetPort: 6831
protocol: UDP
- port: 6832
name: agent-binary
targetPort: 6832
protocol: UDP
selector:
app: jaeger
type: ClusterIP
Integrating Jaeger in a Node.js Application
// tracing.js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const { PrometheusExporter } = require('@opentelemetry/exporter-prometheus');
// Configure the Jaeger exporter
const jaegerExporter = new JaegerExporter({
endpoint: process.env.JAEGER_ENDPOINT || 'http://jaeger:14268/api/traces',
});
// Configure the Prometheus metrics exporter
const prometheusExporter = new PrometheusExporter({
port: 9090,
endpoint: '/metrics',
});
// Configure the SDK
const sdk = new NodeSDK({
resource: new Resource({
[SemanticResourceAttributes.SERVICE_NAME]: process.env.SERVICE_NAME || 'myapp',
[SemanticResourceAttributes.SERVICE_VERSION]: process.env.SERVICE_VERSION || '1.0.0',
[SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV || 'development',
}),
traceExporter: jaegerExporter,
metricReader: new PeriodicExportingMetricReader({
exporter: prometheusExporter,
exportIntervalMillis: 5000,
}),
instrumentations: [getNodeAutoInstrumentations({
// Disable selected auto-instrumentations
'@opentelemetry/instrumentation-fs': {
enabled: false,
},
// Configure HTTP instrumentation
'@opentelemetry/instrumentation-http': {
enabled: true,
ignoreIncomingRequestHook: (req) => {
// Ignore health check and metrics requests
return req.url === '/health' || req.url === '/metrics';
},
ignoreOutgoingRequestHook: (options) => {
// Ignore internal requests
return options.hostname === 'localhost';
},
},
// Configure Express instrumentation
'@opentelemetry/instrumentation-express': {
enabled: true,
},
// Configure database instrumentation
'@opentelemetry/instrumentation-pg': {
enabled: true,
},
'@opentelemetry/instrumentation-redis': {
enabled: true,
},
})],
});
// Start the SDK
sdk.start();
// Graceful shutdown
process.on('SIGTERM', () => {
sdk.shutdown()
.then(() => console.log('Tracing terminated'))
.catch((error) => console.log('Error terminating tracing', error))
.finally(() => process.exit(0));
});
module.exports = sdk;
Custom Tracing Code
// custom-tracing.js
const { trace, context, SpanStatusCode } = require('@opentelemetry/api');
const tracer = trace.getTracer('myapp', '1.0.0');
class TracingService {
// Create a custom span
static async traceFunction(name, fn, attributes = {}) {
const span = tracer.startSpan(name, {
attributes,
});
try {
const result = await context.with(trace.setSpan(context.active(), span), fn);
span.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (error) {
span.recordException(error);
span.setStatus({
code: SpanStatusCode.ERROR,
message: error.message,
});
throw error;
} finally {
span.end();
}
}
// Trace database operations
static async traceDbOperation(operation, query, params = []) {
return this.traceFunction(`db.${operation}`, async () => {
const span = trace.getActiveSpan();
span?.setAttributes({
'db.system': 'postgresql',
'db.statement': query,
'db.operation': operation,
});
// Execute the database operation (`db` is assumed to be a pg Pool/Client in scope)
const result = await db.query(query, params);
span?.setAttributes({
'db.rows_affected': result.rowCount,
});
return result;
});
}
// Trace outgoing HTTP requests
static async traceHttpRequest(url, options = {}) {
return this.traceFunction('http.request', async () => {
const span = trace.getActiveSpan();
span?.setAttributes({
'http.method': options.method || 'GET',
'http.url': url,
'http.user_agent': options.headers?.['user-agent'],
});
const response = await fetch(url, options);
span?.setAttributes({
'http.status_code': response.status,
'http.response.size': response.headers.get('content-length'),
});
return response;
});
}
// Trace business operations
static async traceBusinessOperation(operationName, userId, fn) {
return this.traceFunction(`business.${operationName}`, fn, {
'user.id': userId,
'operation.type': 'business',
});
}
}
// Middleware example
function tracingMiddleware(req, res, next) {
const span = trace.getActiveSpan();
// Attach request information
span?.setAttributes({
'http.route': req.route?.path,
'user.id': req.user?.id,
'request.id': req.headers['x-request-id'],
});
// Attach response information
res.on('finish', () => {
span?.setAttributes({
'http.response.status_code': res.statusCode,
});
});
next();
}
module.exports = { TracingService, tracingMiddleware };
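traceFunction above is an instance of a generic wrap-and-record pattern. Stripped of the OpenTelemetry API, the control flow looks like this (a dependency-free sketch; the plain span object and its fields are hypothetical stand-ins, not OTel types):

```javascript
// Wrap an async function, recording status and timing on a plain
// object instead of an OpenTelemetry span.
async function withSpan(name, fn, spans = []) {
  const span = { name, status: 'UNSET', start: Date.now() };
  spans.push(span);
  try {
    const result = await fn();
    span.status = 'OK';
    return result;
  } catch (err) {
    span.status = 'ERROR';
    span.error = err.message;
    throw err; // rethrow so callers still see the failure
  } finally {
    span.end = Date.now(); // always close the span, success or failure
  }
}

// Usage: both outcomes leave a finished span behind.
(async () => {
  const spans = [];
  await withSpan('ok-op', async () => 42, spans);
  await withSpan('bad-op', async () => { throw new Error('boom'); }, spans)
    .catch(() => {});
  console.log(spans.map((s) => `${s.name}:${s.status}`).join(' ')); // ok-op:OK bad-op:ERROR
})();
```

The key property, which the real traceFunction shares, is that the span is ended in finally, so no code path can leak an open span.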
New Relic APM Integration
// newrelic.js
'use strict';
exports.config = {
app_name: [process.env.NEW_RELIC_APP_NAME || 'MyApp'],
license_key: process.env.NEW_RELIC_LICENSE_KEY,
// Logging
logging: {
level: 'info',
filepath: 'stdout',
},
// Error collector
error_collector: {
enabled: true,
ignore_status_codes: [404],
capture_events: true,
max_event_samples_stored: 100,
},
// Transaction tracer
transaction_tracer: {
enabled: true,
transaction_threshold: 'apdex_f',
record_sql: 'obfuscated',
explain_threshold: 500,
stack_trace_threshold: 500,
top_n: 20,
},
// Slow SQL
slow_sql: {
enabled: true,
max_samples: 10,
},
// Browser monitoring
browser_monitoring: {
enable: true,
},
// Custom attributes
attributes: {
enabled: true,
include: [
'request.headers.userAgent',
'request.headers.referer',
],
exclude: [
'request.headers.authorization',
'request.headers.cookie',
],
},
// Distributed tracing
distributed_tracing: {
enabled: true,
},
// Application log forwarding
application_logging: {
enabled: true,
forwarding: {
enabled: true,
max_samples_stored: 10000,
},
metrics: {
enabled: true,
},
local_decorating: {
enabled: true,
},
},
};
Log Analysis and Visualization
Kibana Dashboard Configuration
{
"version": "7.15.0",
"objects": [
{
"id": "application-logs-dashboard",
"type": "dashboard",
"attributes": {
"title": "Application Logs Dashboard",
"hits": 0,
"description": "Overview of application logs and metrics",
"panelsJSON": "[\n {\n \"version\": \"7.15.0\",\n \"gridData\": {\n \"x\": 0,\n \"y\": 0,\n \"w\": 24,\n \"h\": 15,\n \"i\": \"1\"\n },\n \"panelIndex\": \"1\",\n \"embeddableConfig\": {},\n \"panelRefName\": \"panel_1\"\n },\n {\n \"version\": \"7.15.0\",\n \"gridData\": {\n \"x\": 24,\n \"y\": 0,\n \"w\": 24,\n \"h\": 15,\n \"i\": \"2\"\n },\n \"panelIndex\": \"2\",\n \"embeddableConfig\": {},\n \"panelRefName\": \"panel_2\"\n }\n]",
"timeRestore": false,
"timeTo": "now",
"timeFrom": "now-24h",
"refreshInterval": {
"pause": false,
"value": 30000
},
"kibanaSavedObjectMeta": {
"searchSourceJSON": "{\"query\":{\"query\":\"\",\"language\":\"kuery\"},\"filter\":[]}"
}
},
"references": [
{
"name": "panel_1",
"type": "visualization",
"id": "log-levels-pie-chart"
},
{
"name": "panel_2",
"type": "visualization",
"id": "error-rate-timeline"
}
]
}
]
}
Log Alerting Configuration
{
"trigger": {
"schedule": {
"interval": "1m"
}
},
"input": {
"search": {
"request": {
"search_type": "query_then_fetch",
"indices": ["logs-error-*"],
"body": {
"query": {
"bool": {
"must": [
{
"range": {
"@timestamp": {
"gte": "now-5m"
}
}
},
{
"term": {
"level": "ERROR"
}
}
]
}
},
"aggs": {
"error_count": {
"cardinality": {
"field": "message.keyword"
}
}
}
}
}
}
},
"condition": {
"compare": {
"ctx.payload.aggregations.error_count.value": {
"gt": 10
}
}
},
"actions": {
"send_email": {
"email": {
"profile": "standard",
"to": ["alerts@mycompany.com"],
"subject": "High Error Rate Alert",
"body": "High error rate detected: {{ctx.payload.aggregations.error_count.value}} unique errors in the last 5 minutes."
}
},
"send_slack": {
"slack": {
"message": {
"to": ["#alerts"],
"text": "🚨 High error rate detected: {{ctx.payload.aggregations.error_count.value}} unique errors in the last 5 minutes."
}
}
}
}
}
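上述 Watcher 中 `condition.compare` 与 `{{ctx.payload...}}` 模板的语义,可以用几行 JavaScript 说明(仅为演示阈值比较与占位符替换的逻辑,并非 Watcher 的真实实现):

```javascript
// 模拟 compare 条件:error_count.value > 阈值时触发告警
function shouldAlert(payload, threshold = 10) {
  return payload.aggregations.error_count.value > threshold;
}

// 模拟 {{ctx.payload...}} 形式的模板替换:按点分路径取值
function renderTemplate(template, ctx) {
  return template.replace(/\{\{([^}]+)\}\}/g, (_, path) =>
    path.trim().split('.').reduce((obj, key) => obj[key], { ctx })
  );
}

const payload = { aggregations: { error_count: { value: 23 } } };
console.log(shouldAlert(payload)); // true(23 > 10)
console.log(renderTemplate(
  'High error rate detected: {{ctx.payload.aggregations.error_count.value}} unique errors in the last 5 minutes.',
  { payload }
));
```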
监控最佳实践
1. 监控指标设计原则
四个黄金信号
# 延迟 (Latency)
metrics:
- name: http_request_duration_seconds
type: histogram
help: "HTTP request duration in seconds"
buckets: [0.1, 0.25, 0.5, 1, 2.5, 5, 10]
# 流量 (Traffic)
- name: http_requests_total
type: counter
help: "Total HTTP requests"
labels: [method, status, endpoint]
# 错误 (Errors)
- name: http_requests_errors_total
type: counter
help: "Total HTTP request errors"
labels: [method, status, error_type]
# 饱和度 (Saturation)
- name: system_cpu_usage_percent
type: gauge
help: "System CPU usage percentage"
- name: system_memory_usage_percent
type: gauge
help: "System memory usage percentage"
- name: database_connections_active
type: gauge
help: "Active database connections"
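上面定义的指标语义可以用一个不依赖任何客户端库的最小实现来说明(生产环境应直接使用 prom-client 等官方客户端;下面的类与埋点函数均为示意):

```javascript
// 四个黄金信号的最小内存实现示意(仅用于说明 counter/histogram 的语义)
class Counter {
  constructor() { this.value = 0; }
  inc(n = 1) { this.value += n; }
}

class Histogram {
  constructor(buckets) {
    this.buckets = buckets;                               // 桶上边界,升序
    this.counts = new Array(buckets.length + 1).fill(0);  // 末位为 +Inf 桶
    this.sum = 0;
    this.count = 0;
  }
  observe(v) {
    this.sum += v;
    this.count += 1;
    const i = this.buckets.findIndex((b) => v <= b);
    this.counts[i === -1 ? this.buckets.length : i] += 1;
  }
}

// 延迟:http_request_duration_seconds(桶边界沿用上文配置)
const latency = new Histogram([0.1, 0.25, 0.5, 1, 2.5, 5, 10]);
// 流量:http_requests_total;错误:http_requests_errors_total
const requests = new Counter();
const errors = new Counter();

// 一次请求的埋点:同时更新延迟、流量与错误三个信号
function recordRequest(durationSeconds, isError) {
  requests.inc();
  latency.observe(durationSeconds);
  if (isError) errors.inc();
}

recordRequest(0.2, false);
recordRequest(3.0, true);
console.log(requests.value, errors.value, latency.count);
```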
业务指标监控
// business-metrics.js
const client = require('prom-client');
// 业务指标定义
const businessMetrics = {
// 用户注册
userRegistrations: new client.Counter({
name: 'user_registrations_total',
help: 'Total user registrations',
labelNames: ['source', 'plan'],
}),
// 订单处理
orderProcessing: new client.Histogram({
name: 'order_processing_duration_seconds',
help: 'Order processing duration',
labelNames: ['status', 'payment_method'],
buckets: [1, 5, 10, 30, 60, 300],
}),
// 收入指标
revenue: new client.Gauge({
name: 'revenue_total',
help: 'Total revenue',
labelNames: ['currency', 'period'],
}),
// 活跃用户
activeUsers: new client.Gauge({
name: 'active_users_current',
help: 'Current active users',
labelNames: ['type'],
}),
// 转化率
conversionRate: new client.Gauge({
name: 'conversion_rate_percent',
help: 'Conversion rate percentage',
labelNames: ['funnel_step'],
}),
};
// 业务事件追踪
class BusinessMetricsCollector {
static trackUserRegistration(source, plan) {
businessMetrics.userRegistrations.inc({ source, plan });
}
static trackOrderProcessing(duration, status, paymentMethod) {
businessMetrics.orderProcessing
.labels(status, paymentMethod)
.observe(duration);
}
static updateRevenue(amount, currency, period) {
businessMetrics.revenue
.labels(currency, period)
.set(amount);
}
static updateActiveUsers(count, type) {
businessMetrics.activeUsers
.labels(type)
.set(count);
}
static updateConversionRate(rate, funnelStep) {
businessMetrics.conversionRate
.labels(funnelStep)
.set(rate);
}
}
module.exports = { businessMetrics, BusinessMetricsCollector };
2. 告警策略
告警分级
# 告警级别定义
alert_levels:
critical:
description: "需要立即响应的严重问题"
response_time: "5分钟内"
notification_channels: ["pagerduty", "phone", "slack", "email"]
examples:
- "服务完全不可用"
- "数据丢失"
- "安全漏洞"
warning:
description: "需要关注但不紧急的问题"
response_time: "30分钟内"
notification_channels: ["slack", "email"]
examples:
- "性能下降"
- "资源使用率高"
- "非关键功能异常"
info:
description: "信息性告警,用于趋势分析"
response_time: "工作时间内"
notification_channels: ["email"]
examples:
- "容量规划提醒"
- "定期报告"
- "配置变更通知"
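这份分级表可以直接落地为一个简单的渠道路由函数(级别与渠道沿用上表,纯示意):

```javascript
// 按告警级别返回通知渠道,与上文 alert_levels 一一对应
const ALERT_CHANNELS = {
  critical: ['pagerduty', 'phone', 'slack', 'email'],
  warning: ['slack', 'email'],
  info: ['email'],
};

function channelsFor(level) {
  const channels = ALERT_CHANNELS[level];
  // 未知级别直接报错,避免告警被静默丢弃
  if (!channels) throw new Error(`unknown alert level: ${level}`);
  return channels;
}

console.log(channelsFor('critical')); // [ 'pagerduty', 'phone', 'slack', 'email' ]
```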
告警抑制规则
# 告警抑制配置
inhibit_rules:
# 服务级抑制
- source_match:
alertname: 'ServiceDown'
target_match_re:
alertname: '(HighErrorRate|HighResponseTime|LowThroughput)'
equal: ['service']
# 节点级抑制
- source_match:
alertname: 'NodeDown'
target_match_re:
alertname: '(HighCPU|HighMemory|DiskFull)'
equal: ['instance']
# 集群级抑制
- source_match:
alertname: 'ClusterDown'
target_match_re:
alertname: '.*'
equal: ['cluster']
# 时间窗口抑制
- source_match:
alertname: 'MaintenanceMode'
target_match_re:
alertname: '.*'
equal: ['service']
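抑制规则的语义是:当 source 告警处于触发状态时,丢弃在 equal 标签上取值相同、且名称匹配 target 正则的其他告警。下面用纯 JavaScript 模拟这一匹配逻辑(仅演示语义,并非 Alertmanager 的实现):

```javascript
// 模拟 Alertmanager 抑制:source 触发时,按 equal 标签过滤匹配的 target 告警
function applyInhibition(alerts, rules) {
  return alerts.filter((target) =>
    !rules.some((rule) =>
      alerts.some((source) =>
        source.alertname === rule.sourceAlertname &&
        source !== target &&
        new RegExp(`^(?:${rule.targetRe})$`).test(target.alertname) &&
        rule.equal.every((label) => source.labels[label] === target.labels[label])
      )
    )
  );
}

const alerts = [
  { alertname: 'ServiceDown', labels: { service: 'api' } },
  { alertname: 'HighErrorRate', labels: { service: 'api' } },  // 将被抑制
  { alertname: 'HighErrorRate', labels: { service: 'web' } },  // service 不同,不受影响
];

// 对应上文的"服务级抑制"规则
const rules = [
  {
    sourceAlertname: 'ServiceDown',
    targetRe: 'HighErrorRate|HighResponseTime|LowThroughput',
    equal: ['service'],
  },
];

const remaining = applyInhibition(alerts, rules);
console.log(remaining.map((a) => `${a.alertname}/${a.labels.service}`));
```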
3. 性能优化
监控数据优化
# Prometheus配置优化
# prometheus.yml 本身只包含抓取与规则相关配置
global:
  scrape_interval: 15s        # 根据需要调整
  evaluation_interval: 15s    # 规则评估频率,可适当调大以降低开销
rule_files:
  - "rules/*.yml"
# 注意:存储与查询参数不写在 prometheus.yml 中,
# 而是通过启动时的命令行标志设置:
# --storage.tsdb.retention.time=15d       # 数据保留时长,根据需要调整
# --storage.tsdb.retention.size=100GB     # 数据保留大小上限
# --storage.tsdb.min-block-duration=2h
# --storage.tsdb.max-block-duration=25h
# --query.max-concurrency=20              # 并发查询上限
# --query.timeout=2m                      # 查询超时
# --query.max-samples=50000000            # 单次查询最大样本数
日志优化策略
# 日志级别配置
log_levels:
production:
default: "INFO"
database: "WARN"
security: "DEBUG"
staging:
default: "DEBUG"
development:
default: "DEBUG"
# 日志采样配置
sampling:
# 高频日志采样
high_frequency:
rate: 0.1 # 10%采样率
patterns:
- "HTTP request"
- "Database query"
# 错误日志全量收集
errors:
rate: 1.0 # 100%采样率
levels: ["ERROR", "FATAL"]
# 日志字段优化
field_optimization:
# 移除敏感字段
exclude_fields:
- "password"
- "token"
- "credit_card"
# 压缩长字段
compress_fields:
- "stack_trace"
- "request_body"
# 字段重命名
rename_fields:
"@timestamp": "ts"
"message": "msg"
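上述采样与字段优化策略可以合成一个最小的日志处理函数来说明(省略了长字段压缩;采样率与字段名沿用上文配置,`rng` 参数用于注入确定性随机数以便测试):

```javascript
// 日志采样 + 字段优化的最小流水线示意
const config = {
  sampling: {
    highFrequency: { rate: 0.1, patterns: [/HTTP request/, /Database query/] },
    errorLevels: ['ERROR', 'FATAL'],                // 错误日志全量收集
  },
  excludeFields: ['password', 'token', 'credit_card'],
  renameFields: { '@timestamp': 'ts', message: 'msg' },
};

function processLog(entry, rng = Math.random) {
  // 1. 采样:错误级别全量保留,高频模式按 rate 概率采样
  if (!config.sampling.errorLevels.includes(entry.level)) {
    const isHighFreq = config.sampling.highFrequency.patterns
      .some((p) => p.test(entry.message));
    if (isHighFreq && rng() >= config.sampling.highFrequency.rate) {
      return null; // 被采样丢弃
    }
  }
  // 2. 字段优化:移除敏感字段、重命名字段
  const out = {};
  for (const [key, value] of Object.entries(entry)) {
    if (config.excludeFields.includes(key)) continue;
    out[config.renameFields[key] || key] = value;
  }
  return out;
}

const kept = processLog({
  '@timestamp': '2024-01-01T00:00:00Z',
  level: 'ERROR',
  message: 'boom',
  token: 'secret',
});
console.log(kept); // token 被移除,@timestamp/message 被重命名
```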
总结
监控与日志管理是现代应用运维的基石。建立完善的监控体系和日志管理流程,需要把握以下关键要素与最佳实践。
关键要素
1. 全面监控覆盖
   - 基础设施监控
   - 应用性能监控
   - 业务指标监控
   - 用户体验监控
2. 高效日志管理
   - 结构化日志格式
   - 集中化日志收集
   - 智能日志分析
   - 生命周期管理
3. 智能告警机制
   - 分级告警策略
   - 告警抑制规则
   - 多渠道通知
   - 自动化响应
4. 可视化分析
   - 实时仪表盘
   - 趋势分析
   - 异常检测
   - 根因分析
5. 性能优化
   - 数据采样策略
   - 存储优化
   - 查询优化
   - 成本控制
最佳实践
- 监控即代码:将监控配置纳入版本控制
- 渐进式部署:从核心指标开始,逐步完善
- 团队协作:建立监控责任制和响应流程
- 持续改进:定期评估和优化监控策略
- 文档维护:保持监控文档的及时更新
通过实施这些监控与日志管理实践,可以显著提升系统的可观测性、可靠性和运维效率,为业务的稳定运行提供强有力的保障。