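Monitoring Processes with Prometheus
process-exporter is mainly used for process monitoring: for example, how many processes a given service is running, how much CPU they consume, and how much memory and other resources they use.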
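1. Using process-exporter
1.1 Download process-exporter
process-exporter GitHub repository: https://github.com/ncabatoff/process-exporter
process-exporter downloads: https://github.com/ncabatoff/process-exporter/releases
process-exporter can be started either with command-line arguments or with a configuration file.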
1.2 Configure process-exporter

vim /usr/local/process-exporter/process-exporter.yaml   # config file; the path must match -config.path in the systemd unit (section 1.3)

process_names:
#  - name: "{{.Comm}}"
#    cmdline:
#    - '.+'
  - name: "{{.Matches}}"
    cmdline:
    - 'nginx'   # unique identifier
  - name: "{{.Matches}}"
    cmdline:
    - '/opt/atlassian/confluence/bin/tomcat-juli.jar'
  - name: "{{.Matches}}"
    cmdline:
    - 'vsftpd'
  - name: "{{.Matches}}"
    cmdline:
    - 'redis-server'
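Example:
cmdline: a regular expression that uniquely identifies the selected process in its command line, as shown by ps -ef. If no such process exists, no monitoring data is collected for it.
For example:

> ps -ef | grep redis
redis 4287 4127 0 Oct31 ? 00:58:12 redis-server *:6379

With {{.Matches}}, groupname="map[:redis]" means that the keyword "redis" was matched.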
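The way a cmdline pattern turns into a groupname label can be sketched in Python. This is only an illustration of the idea, not process-exporter's actual Go code: group_name is a hypothetical helper, and it assumes that every pattern is searched (unanchored) in the full command line, that all patterns must match, and that the whole match is stored under an empty key, which is what the "map[:redis]" label suggests.

```python
import re


def group_name(cmdline_patterns, command_line):
    """Return a Go-style 'map[...]' group name, or None if any pattern fails to match.

    Hypothetical helper: it only mimics the naming shown in the article.
    """
    matches = {}
    for pattern in cmdline_patterns:
        m = re.search(pattern, command_line)
        if m is None:
            # Assumption: a process is only tracked when every pattern matches.
            return None
        matches[""] = m.group(0)        # whole match, stored under the empty key
        matches.update(m.groupdict())   # named capture groups, if the pattern has any
    body = " ".join(f"{key}:{value}" for key, value in sorted(matches.items()))
    return f"map[{body}]"


# Command line taken from the ps -ef output above.
print(group_name(["redis"], "redis-server *:6379"))
print(group_name(["nginx"], "redis-server *:6379"))
```

The first call produces the map[:redis] label from the example, and the second returns None because nginx does not appear in that command line.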
1.3 Write the startup script (systemd unit)

vim /usr/lib/systemd/system/process_exporter.service

[Unit]
Description=Prometheus exporter for process metrics, written in Go with pluggable metric collectors.
Documentation=https://github.com/ncabatoff/process-exporter
After=network.target

[Service]
Type=simple
User=root
WorkingDirectory=/usr/local/process-exporter
ExecStart=/usr/local/process-exporter/process-exporter -config.path=/usr/local/process-exporter/process-exporter.yaml
Restart=on-failure

[Install]
WantedBy=multi-user.target
1.4 Start process-exporter

systemctl daemon-reload
systemctl start process_exporter
systemctl enable process_exporter
Verify the monitoring data

curl http://localhost:9256/metrics

# Sample output
# HELP http_response_size_bytes The HTTP response sizes in bytes.
# TYPE http_response_size_bytes summary
http_response_size_bytes{handler="prometheus",quantile="0.5"} 2988
http_response_size_bytes{handler="prometheus",quantile="0.9"} 2996
http_response_size_bytes{handler="prometheus",quantile="0.99"} 3006
http_response_size_bytes_sum{handler="prometheus"} 1.34205181e+08
http_response_size_bytes_count{handler="prometheus"} 45188
# HELP namedprocess_namegroup_context_switches_total Context switches
# TYPE namedprocess_namegroup_context_switches_total counter
namedprocess_namegroup_context_switches_total{ctxswitchtype="nonvoluntary",groupname="map[:bladebit]"} 7.7977455e+07
namedprocess_namegroup_context_switches_total{ctxswitchtype="nonvoluntary",groupname="map[:pw_python.py]"} 2.02666e+06
namedprocess_namegroup_context_switches_total{ctxswitchtype="voluntary",groupname="map[:bladebit]"} 3.335109e+06
namedprocess_namegroup_context_switches_total{ctxswitchtype="voluntary",groupname="map[:pw_python.py]"} 8.22652233e+08
# HELP namedprocess_namegroup_cpu_system_seconds_total Cpu system usage in seconds
# TYPE namedprocess_namegroup_cpu_system_seconds_total counter
namedprocess_namegroup_cpu_system_seconds_total{groupname="map[:bladebit]"} 94275.01000000017
namedprocess_namegroup_cpu_system_seconds_total{groupname="map[:pw_python.py]"} 64818.93000000004
# HELP namedprocess_namegroup_cpu_user_seconds_total Cpu user usage in seconds
# TYPE namedprocess_namegroup_cpu_user_seconds_total counter
namedprocess_namegroup_cpu_user_seconds_total{groupname="map[:bladebit]"} 2.42621264299998e+07
namedprocess_namegroup_cpu_user_seconds_total{groupname="map[:pw_python.py]"} 85.29000000000613
# HELP namedprocess_namegroup_major_page_faults_total Major page faults
# TYPE namedprocess_namegroup_major_page_faults_total counter
namedprocess_namegroup_major_page_faults_total{groupname="map[:bladebit]"} 18261
namedprocess_namegroup_major_page_faults_total{groupname="map[:pw_python.py]"} 1236
# HELP namedprocess_namegroup_memory_bytes number of bytes of memory in use
# TYPE namedprocess_namegroup_memory_bytes gauge
namedprocess_namegroup_memory_bytes{groupname="map[:bladebit]",memtype="resident"} 4.46810939392e+11
namedprocess_namegroup_memory_bytes{groupname="map[:bladebit]",memtype="swapped"} 0
namedprocess_namegroup_memory_bytes{groupname="map[:bladebit]",memtype="virtual"} 4.47847292928e+11
namedprocess_namegroup_memory_bytes{groupname="map[:pw_python.py]",memtype="resident"} 1.2959744e+07
namedprocess_namegroup_memory_bytes{groupname="map[:pw_python.py]",memtype="swapped"} 0
namedprocess_namegroup_memory_bytes{groupname="map[:pw_python.py]",memtype="virtual"} 2.4733696e+08
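To check the exporter output programmatically rather than by eye, the text format can be parsed with a few lines of Python. The sketch below is a deliberately minimal parser, not a full implementation of the Prometheus exposition format: parse_metrics and resident_memory_by_group are hypothetical helper names, and SAMPLE is a shortened copy of the output shown above.

```python
import re

# Shortened copy of the sample output from the article.
SAMPLE = """\
# HELP namedprocess_namegroup_memory_bytes number of bytes of memory in use
# TYPE namedprocess_namegroup_memory_bytes gauge
namedprocess_namegroup_memory_bytes{groupname="map[:bladebit]",memtype="resident"} 4.46810939392e+11
namedprocess_namegroup_memory_bytes{groupname="map[:pw_python.py]",memtype="resident"} 1.2959744e+07
namedprocess_namegroup_memory_bytes{groupname="map[:pw_python.py]",memtype="swapped"} 0
"""

# metric_name{label="value",...} value   (lines with a timestamp are skipped)
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples, ignoring comments and blank lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = LINE_RE.match(line)
        if m is None:
            continue  # not a plain "name{labels} value" line
        labels = dict(LABEL_RE.findall(m.group("labels") or ""))
        samples.append((m.group("name"), labels, float(m.group("value"))))
    return samples


def resident_memory_by_group(text):
    """Map each groupname to its resident memory in bytes."""
    return {
        labels["groupname"]: value
        for name, labels, value in parse_metrics(text)
        if name == "namedprocess_namegroup_memory_bytes"
        and labels.get("memtype") == "resident"
    }


for group, size in resident_memory_by_group(SAMPLE).items():
    print(f"{group}: {size / 1024 ** 2:.1f} MiB resident")
```

In practice the text would come from the curl command above instead of the embedded sample.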
2. Prometheus configuration
Add or modify the scrape configuration:

- job_name: 'dev_prometheus'
  scrape_interval: 10s
  honor_labels: true
  metrics_path: '/metrics'
  static_configs:
  - targets: ['127.0.0.1:9090','127.0.0.1:9100']
    labels: {cluster: 'dev',type: 'basic',env: 'dev',job: 'prometheus',export: 'prometheus'}
  - targets: ['127.0.0.1:9256']
    labels: {cluster: 'dev',type: 'process',env: 'dev',job: 'prometheus',export: 'process_exporter'}
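Reload the Prometheus configuration (the /-/reload endpoint is only available when Prometheus is started with --web.enable-lifecycle):

curl -X POST http://127.0.0.1:9090/-/reload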
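If the reload is triggered from a deployment script instead of a shell, the same POST can be issued from Python with the standard library only. This is a small sketch: build_reload_request and reload_prometheus are hypothetical helper names, and the default address is simply the one used in the curl command.

```python
import urllib.request


def build_reload_request(base_url="http://127.0.0.1:9090"):
    """Build (but do not send) the POST request for the reload endpoint."""
    url = base_url.rstrip("/") + "/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


def reload_prometheus(base_url="http://127.0.0.1:9090", timeout=5):
    """Send the reload request; urllib raises an exception on HTTP errors or if the server is unreachable."""
    with urllib.request.urlopen(build_reload_request(base_url), timeout=timeout) as resp:
        return resp.status == 200


request = build_reload_request()
print(request.get_method(), request.full_url)
```

Only the request is built at import time; reload_prometheus() has to be called explicitly on a host that can reach Prometheus.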
3. Grafana dashboards
The dashboard for process-exporter is: https://grafana.com/grafana/dashboards/249
Result: [dashboard screenshot]
4. Common alerting rules
Process count

alert: ProcessAlert
expr: sum(namedprocess_namegroup_states) by (cluster,job,instance) > 500
for: 20s
labels:
  severity: warning
annotations:
  value: "The server currently has {{ $value }} processes, which exceeds the alert threshold"
Zombie process count

alert: ProcessAlert
expr: sum by(cluster, job, instance, groupname) (namedprocess_namegroup_states{state="Zombie"}) > 0
for: 1m
labels:
  severity: warning
annotations:
  value: "There are currently {{ $value }} zombie processes"
Process restart

alert: ProcessRestartAlert
expr: ceil(time() - max by(cluster, job, instance, groupname) (namedprocess_namegroup_oldest_start_time_seconds)) < 60
for: 25s
labels:
  label: alert_once
  severity: warning
annotations:
  value: "The process restarted {{ $value }} seconds ago"
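The arithmetic behind the restart rule is simple: take the current time, subtract the start time of the oldest process in the group, round up, and compare with 60 seconds. A rough Python equivalent, where restarted_recently is a hypothetical helper and the timestamps are made-up values for illustration:

```python
import math
import time


def restarted_recently(oldest_start_time_seconds, now=None, threshold_seconds=60):
    """Mirror ceil(time() - oldest_start_time) < 60 from the alert expression.

    Returns (condition_holds, age_in_seconds).
    """
    now = time.time() if now is None else now
    age = math.ceil(now - oldest_start_time_seconds)
    return age < threshold_seconds, age


# Illustrative timestamps only: a process started at t=1000.
print(restarted_recently(1000.0, now=1030.5))   # process is about 31 s old
print(restarted_recently(1000.0, now=1060.0))   # process is 60 s old
```

Because the rule also has for: 25s, the condition has to hold continuously for 25 seconds before the alert fires, so in effect it catches groups whose oldest process is between roughly 25 and 60 seconds old.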
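Process exit

alert: ProcessExitAlert
# The expression is incomplete as published: the part before up{...} is missing, so the parentheses do not balance.
expr: up{groupname=~"^map.*"}[10d])) < 0
for: 55s
labels:
  severity: warning
annotations:
  value: "Process {{ $labels.export }} has exited"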
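5. Bulk deployment with Ansible
Targets are registered through Consul service discovery here; background on Consul itself is easy to find online.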
5.1 Consul registration script

#!/bin/bash
service_name=$1
instance_id=$2
ip=$3
port=$4

curl -X PUT -d '{"id": "'"$instance_id"'","name": "'"$service_name"'","address": "'"$ip"'","port": '"$port"',"tags": ["'"$service_name"'"],"checks": [{"http": "http://'"$ip"':'"$port"'","interval": "5s"}]}' http://10.1.8.202:8500/v1/agent/service/register
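The nested shell quoting in the registration script is easy to get wrong. As an alternative, the same JSON body can be built with Python's json module. This is a sketch under assumptions: build_registration and register are hypothetical helper names, the field layout is copied from the script, and the example arguments are invented.

```python
import json
import urllib.request


def build_registration(service_name, instance_id, ip, port):
    """Build the service definition that the shell script sends to Consul."""
    return {
        "id": instance_id,
        "name": service_name,
        "address": ip,
        "port": int(port),
        "tags": [service_name],
        "checks": [{"http": f"http://{ip}:{port}", "interval": "5s"}],
    }


def register(payload, consul_url="http://10.1.8.202:8500/v1/agent/service/register"):
    """PUT the payload to the Consul agent (same URL as in the script)."""
    req = urllib.request.Request(
        consul_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status


# Example values are made up for illustration; nothing is sent here.
payload = build_registration("process_exporter", "h-node1", "10.0.0.5", 9256)
print(json.dumps(payload, indent=2))
```

Calling register(payload) would perform the actual registration; it is left out of the example so that the snippet runs without network access.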
5.2 Ansible playbook

[root@openvpn process]# cat playbook.yml
- hosts: Harvester
  remote_user: root
  gather_facts: no
  tasks:
    - name: Push the exporter package
      unarchive: src=process-exporter.tar.gz dest=/usr/local/
    - name: Rename the directory
      shell: |
        cd /usr/local/
        if [ ! -d process-exporter ];then
          mv process-exporter-0.4.0.linux-amd64 process-exporter
        fi
    - name: Get the host name
      shell: echo "h-`hostname`"
      register: name_host
    - name: Push the systemd unit file
      copy: src=process_exporter.service dest=/usr/lib/systemd/system
    - name: Start the service
      systemd: name=process_exporter state=started enabled=yes
    - name: Push the registration script
      copy: src=consul-register.sh dest=/usr/local/process-exporter
    - name: Register the current node
      # consul-register.sh expects four arguments (service_name, instance_id, ip, port); only the last two appear here.
      shell: /bin/sh /usr/local/process-exporter/consul-register.sh {{ inventory_hostname }} 9256