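Qwen3.5-9B Hands-On Tutorial: Prometheus Monitoring, GPU Metrics Collection, and Alert Rule Configuration

1. Project Overview

Qwen3.5-9B is an open-source large language model with 9 billion parameters, offering strong logical reasoning, code generation, and multi-turn dialogue capabilities. Its multimodal variant, Qwen3.5-9B-VL, also accepts combined image and text input, with a context length of up to 128K tokens.

This tutorial walks through building a complete monitoring system for the Qwen3.5-9B model, covering:

- Prometheus monitoring service deployment
- GPU metrics collection
- Alert rule configuration
- Visualization dashboard creation

2. Environment Preparation

2.1 Base Environment

Make sure the following components are installed:

```bash
# Check the NVIDIA driver
nvidia-smi

# Check Docker
docker --version

# Check the Python environment
python3 --version
```

2.2 Project Structure

```text
/root/qwen3.5-9b/
├── monitoring/          # Monitoring configuration
│   ├── prometheus/      # Prometheus configuration
│   ├── grafana/         # Grafana configuration
│   └── alertmanager/    # Alertmanager configuration
└── app.py               # Main program
```

3. Prometheus Monitoring Deployment

3.1 Install Prometheus

```bash
# Create the Prometheus configuration directory
mkdir -p /root/qwen3.5-9b/monitoring/prometheus

# Write the prometheus.yml configuration file
cat > /root/qwen3.5-9b/monitoring/prometheus/prometheus.yml << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'gpu'
    static_configs:
      - targets: ['localhost:9400']
EOF

# Start the Prometheus container
docker run -d \
  --name=prometheus \
  -p 9090:9090 \
  -v /root/qwen3.5-9b/monitoring/prometheus:/etc/prometheus \
  prom/prometheus
```

3.2 Verify the Installation

Open http://<server-IP>:9090 to view the Prometheus web interface. Under Status > Targets you should see the three configured targets.

Because Prometheus runs in a container, `localhost` in `prometheus.yml` refers to the Prometheus container itself, so the `node` and `gpu` targets report DOWN until their exporters are reachable at those addresses. The `node` target also requires node_exporter listening on port 9100, which this tutorial does not install.

4. GPU Metrics Collection

4.1 Install the NVIDIA GPU Exporter

```bash
# Start the NVIDIA GPU Exporter container
docker run -d \
  --name=nvidia_gpu_exporter \
  --gpus=all \
  -p 9400:9400 \
  nvidia/gpu-monitoring-tools:latest
```

4.2 Configure Prometheus to Scrape GPU Metrics

Edit /root/qwen3.5-9b/monitoring/prometheus/prometheus.yml and append the following job to the existing `scrape_configs` list, rather than adding a second `scrape_configs:` key:

```yaml
scrape_configs:
  # ...existing jobs...
  - job_name: 'nvidia-gpu'
    static_configs:
      - targets: ['nvidia_gpu_exporter:9400']
    metrics_path: '/metrics'
```

Container names such as `nvidia_gpu_exporter` resolve only between containers attached to the same user-defined Docker network.

Restart Prometheus to apply the configuration:

```bash
docker restart prometheus
```

5. Key Monitoring Metrics

The metric names below are the ones this tutorial assumes. Names differ between exporters, so confirm them against your exporter's `/metrics` output before relying on them in rules.

5.1 GPU Metrics

| Metric | Description | Suggested alert threshold |
| --- | --- | --- |
| nvidia_gpu_utilization | GPU utilization | > 90% for 5 minutes |
| nvidia_gpu_memory_used_bytes | GPU memory in use | > 90% of total GPU memory |
| nvidia_gpu_temperature_celsius | GPU temperature | > 85°C |

5.2 System Resource Metrics

| Metric | Description | Suggested alert threshold |
| --- | --- | --- |
| node_memory_MemAvailable_bytes | Available memory | < 1 GB |
| node_cpu_seconds_total | CPU time counter (basis for CPU utilization) | > 90% utilization for 5 minutes |
| node_filesystem_avail_bytes | Available disk space | < 10 GB |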
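To make the suggested thresholds in section 5 concrete, here is a minimal Python sketch that checks a single point-in-time reading against them. The `Sample` class, its field names, and the sample values are all hypothetical and are not produced by any exporter. The sketch also ignores the "for 5 minutes" duration, which Prometheus handles through the `for:` clause of an alert rule.

```python
from __future__ import annotations

from dataclasses import dataclass

GIB = 1024 ** 3  # bytes in one GiB


@dataclass
class Sample:
    """One point-in-time reading; field names are illustrative only."""
    gpu_utilization_pct: float
    gpu_memory_used_bytes: float
    gpu_memory_total_bytes: float
    gpu_temperature_c: float
    mem_available_bytes: float
    disk_available_bytes: float


def breached_thresholds(s: Sample) -> list[str]:
    """Return the names of the suggested thresholds that this sample crosses."""
    breaches = []
    if s.gpu_utilization_pct > 90:
        breaches.append("gpu_utilization")
    if s.gpu_memory_used_bytes / s.gpu_memory_total_bytes * 100 > 90:
        breaches.append("gpu_memory")
    if s.gpu_temperature_c > 85:
        breaches.append("gpu_temperature")
    if s.mem_available_bytes < 1 * GIB:
        breaches.append("memory_available")
    if s.disk_available_bytes < 10 * GIB:
        breaches.append("disk_available")
    return breaches


# Made-up values: a busy GPU with nearly full memory and a nearly full disk.
sample = Sample(
    gpu_utilization_pct=95.0,
    gpu_memory_used_bytes=22 * GIB,
    gpu_memory_total_bytes=24 * GIB,  # 22 / 24 is about 91.7%
    gpu_temperature_c=70.0,
    mem_available_bytes=8 * GIB,
    disk_available_bytes=5 * GIB,
)
print(breached_thresholds(sample))
# → ['gpu_utilization', 'gpu_memory', 'disk_available']
```

The GPU memory check divides used bytes by total bytes, the same ratio an alert rule would need, since the threshold is stated as a percentage while the metric is reported in bytes.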
6. Alert Rule Configuration

6.1 Create the Alert Rules File

The heredoc delimiter is quoted (`'EOF'`) so that the shell leaves `$labels` and `$value` untouched for Prometheus to expand.

```bash
mkdir -p /root/qwen3.5-9b/monitoring/prometheus/alerts

cat > /root/qwen3.5-9b/monitoring/prometheus/alerts/rules.yml << 'EOF'
groups:
  - name: qwen3.5-9b-alerts
    rules:
      - alert: HighGPUUsage
        expr: avg(nvidia_gpu_utilization) by (gpu) > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High GPU utilization on {{ $labels.gpu }}"
          description: "GPU {{ $labels.gpu }} is at {{ $value }}% utilization for more than 5 minutes"

      - alert: HighGPUTemperature
        expr: nvidia_gpu_temperature_celsius > 85
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High GPU temperature on {{ $labels.gpu }}"
          description: "GPU {{ $labels.gpu }} temperature is at {{ $value }}°C"

      - alert: LowMemory
        expr: node_memory_MemAvailable_bytes / 1024 / 1024 < 1024
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low available memory"
          description: "Available memory is only {{ $value }}MB"
EOF
```

6.2 Update the Prometheus Configuration

Edit prometheus.yml and add the alert rules file as a top-level entry:

```yaml
rule_files:
  - '/etc/prometheus/alerts/rules.yml'
```

Restart Prometheus:

```bash
docker restart prometheus
```

7. Alertmanager Configuration

7.1 Install Alertmanager

The `prom/alertmanager` image loads /etc/alertmanager/alertmanager.yml by default, so the command below passes `--config.file` explicitly to point at config.yml.

```bash
mkdir -p /root/qwen3.5-9b/monitoring/alertmanager

cat > /root/qwen3.5-9b/monitoring/alertmanager/config.yml << 'EOF'
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 3h
  receiver: 'web.hook'

receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://your-webhook-url/alert'
EOF

docker run -d \
  --name=alertmanager \
  -p 9093:9093 \
  -v /root/qwen3.5-9b/monitoring/alertmanager:/etc/alertmanager \
  prom/alertmanager \
  --config.file=/etc/alertmanager/config.yml
```

7.2 Integrate Prometheus with Alertmanager

Edit prometheus.yml and add:

```yaml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager:9093'
```

Restart Prometheus and Alertmanager:

```bash
docker restart prometheus
docker restart alertmanager
```
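The `webhook_configs` receiver in section 7.1 posts a JSON document to the configured URL. As a rough illustration of what the receiving side might do, the Python sketch below turns such a payload into one line per alert. The function `summarize_alerts`, the class `AlertHandler`, the sample payload, and port 5001 are all assumptions made for illustration; only the top-level `alerts` list and each alert's `status`, `labels`, and `annotations` fields follow Alertmanager's webhook format.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize_alerts(payload: dict) -> list:
    """Turn an Alertmanager webhook payload into one line per alert."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append(
            "[{status}] {name} ({severity}): {summary}".format(
                status=alert.get("status", "unknown"),
                name=labels.get("alertname", "unnamed"),
                severity=labels.get("severity", "none"),
                summary=annotations.get("summary", ""),
            )
        )
    return lines


class AlertHandler(BaseHTTPRequestHandler):
    """Accepts POST requests on any path and prints a summary of each alert."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
        except json.JSONDecodeError:
            # Reject bodies that are not valid JSON.
            self.send_response(400)
            self.end_headers()
            return
        for line in summarize_alerts(payload):
            print(line)
        self.send_response(200)
        self.end_headers()


# Hypothetical payload, trimmed to the fields summarize_alerts reads.
sample_payload = {
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {
                "alertname": "HighGPUTemperature",
                "severity": "critical",
                "gpu": "0",
            },
            "annotations": {"summary": "High GPU temperature on 0"},
        }
    ],
}
print(summarize_alerts(sample_payload))
# → ['[firing] HighGPUTemperature (critical): High GPU temperature on 0']

# To serve for real (blocks until interrupted; port 5001 is an arbitrary choice):
# HTTPServer(("0.0.0.0", 5001), AlertHandler).serve_forever()
```

Keeping the formatting in a plain function, separate from the HTTP handler, makes it possible to check the output against a saved payload without starting a server.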
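8. Grafana Visualization

8.1 Install Grafana

```bash
docker run -d \
  --name=grafana \
  -p 3000:3000 \
  grafana/grafana
```

8.2 Configure the Data Source

Open http://<server-IP>:3000 (default credentials: admin/admin) and add a Prometheus data source with these settings:

- URL: http://prometheus:9090
- Access: Server

8.3 Import the Qwen3.5-9B Monitoring Dashboard

Create a new dashboard, or import the following JSON template:

```json
{
  "title": "Qwen3.5-9B Monitoring",
  "panels": [
    {
      "title": "GPU Utilization",
      "type": "graph",
      "targets": [
        {
          "expr": "nvidia_gpu_utilization",
          "legendFormat": "GPU {{gpu}}"
        }
      ]
    },
    {
      "title": "GPU Memory Usage",
      "type": "graph",
      "targets": [
        {
          "expr": "nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes * 100",
          "legendFormat": "GPU {{gpu}}"
        }
      ]
    }
  ]
}
```

9. Monitoring System Maintenance

9.1 Routine Check Commands

```bash
# Check Prometheus status
docker ps | grep prometheus

# Check GPU Exporter status
docker ps | grep nvidia_gpu_exporter

# Check the loaded alert rules
curl -s http://localhost:9090/api/v1/rules | jq
```

9.2 Log Rotation

```bash
# Create the log rotation configuration
cat > /etc/logrotate.d/qwen3.5-9b-monitoring << 'EOF'
/root/qwen3.5-9b/monitoring/*/*.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    create 644 root root
}
EOF
```

10. Summary

This tutorial set up a complete monitoring system for the Qwen3.5-9B model:

- Prometheus serves as the monitoring core and collects system and GPU metrics
- The NVIDIA GPU Exporter supplies detailed GPU monitoring data
- Alertmanager delivers alert notifications
- Grafana provides the visualization dashboards

With this system in place you can:

- Track the model's runtime status in real time
- Spot performance bottlenecks early
- Prevent failures caused by resource exhaustion
- Tune the model's deployment configuration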
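The rule check in section 9.1 pipes the `/api/v1/rules` response through `jq`. The same check can be scripted. The Python sketch below is one possible way to reduce that response to a name-to-state mapping. The function names and the sample response are hypothetical; the sample is trimmed to the fields the code reads (`data.groups[].rules[]` with `name`, `type`, and `state`). `fetch_rules` is defined but not called, since it needs a running Prometheus.

```python
import json
from urllib.request import urlopen


def alerting_rule_states(response: dict) -> dict:
    """Map each alerting rule name to its state (inactive, pending, or firing)."""
    states = {}
    for group in response.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            # Recording rules have type "recording" and carry no alert state.
            if rule.get("type") == "alerting":
                states[rule["name"]] = rule.get("state", "unknown")
    return states


def fetch_rules(base_url: str = "http://localhost:9090") -> dict:
    """Fetch the rules document from a running Prometheus (not called below)."""
    with urlopen(base_url + "/api/v1/rules", timeout=5) as resp:
        return json.load(resp)


# Hypothetical response, trimmed to the fields alerting_rule_states reads.
sample_response = {
    "status": "success",
    "data": {
        "groups": [
            {
                "name": "qwen3.5-9b-alerts",
                "rules": [
                    {"name": "HighGPUUsage", "type": "alerting", "state": "firing"},
                    {"name": "HighGPUTemperature", "type": "alerting", "state": "inactive"},
                ],
            }
        ]
    },
}
print(alerting_rule_states(sample_response))
# → {'HighGPUUsage': 'firing', 'HighGPUTemperature': 'inactive'}
```

Against a live server, `alerting_rule_states(fetch_rules())` would give the same kind of mapping for whatever rules Prometheus has loaded.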