Author: Xiang Kaihua (向凯华), public cloud architect at MeshCloud (脉时云)
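Background:
- Google Cloud Monitoring cannot send alerts directly to Feishu (or DingTalk).
- The default email notification channel showed very large delays in testing; emails sometimes arrived several hours late.
- We therefore wrote a user-webhook script that receives Google Cloud Monitoring's own webhook alerts, extracts and reformats the information, and forwards it to a Feishu group bot webhook. This delivers Google Cloud Monitoring alerts to a Feishu group bot.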
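Alert flow diagram:
GCP alert message format:
The messages below are examples of the Google Cloud Monitoring webhook alert payloads as received by the user-webhook script.
Note:
- Alerts from conditions configured manually through SELECT A METRIC in the console differ somewhat in format from alerts from rules configured with MQL.
- For example, notifications from MQL-based alert rules do not contain the following two values.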
# alert threshold
"threshold_value": "0.7",
# observed value
"observed_value": "1.000",
Alert message format:
{
  "incident": {
    "condition": {
      "conditionThreshold": {
        "aggregations": [
          {
            "alignmentPeriod": "60s",
            "perSeriesAligner": "ALIGN_MEAN"
          }
        ],
        "comparison": "COMPARISON_GT",
        "duration": "120s",
        "evaluationMissingData": "EVALUATION_MISSING_DATA_ACTIVE",
        "filter": "resource.type = \"gce_instance\" AND metric.type = \"compute.googleapis.com/instance/cpu/utilization\" AND metric.labels.instance_name = monitoring.regex.full_match(\"xkh-test.*\")",
        "thresholdValue": 0.7,
        "trigger": {
          "count": 1
        }
      },
      "displayName": "VM Instance - CPU utilization",
      "name": "projects/mec-test-344202/alertPolicies/2705589879665297521/conditions/7587364556342473822"
    },
    "condition_name": "VM Instance - CPU utilization",
    "documentation": {
      "content": "test",
      "mime_type": "text/markdown"
    },
    "ended_at": "None",
    "incident_id": "0.mnshxcgwudjy",
    "metadata": {
      "system_labels": {},
      "user_labels": {}
    },
    "metric": {
      "displayName": "CPU utilization",
      "labels": {
        "instance_name": "xkh-test-g02-01"
      },
      "type": "compute.googleapis.com/instance/cpu/utilization"
    },
    "observed_value": "1.000",
    "policy_name": "xkh-tset-cpu-alert",
    "policy_user_labels": {
      "kind": "cpu"
    },
    "resource": {
      "labels": {
        "instance_id": "8011306628148215104",
        "project_id": "mec-test-344202",
        "zone": "europe-west6-a"
      },
      "type": "gce_instance"
    },
    "resource_display_name": "xkh-test-g02-01",
    "resource_id": "",
    "resource_name": "mec-test-344202 xkh-test-g02-01",
    "resource_type_display_name": "VM Instance",
    "scoping_project_id": "mec-test-344202",
    "scoping_project_number": 328842067835,
    "started_at": 1665459241,
    "state": "open",
    "summary": "CPU utilization for mec-test-344202 xkh-test-g02-01 with metric labels {instance_name=xkh-test-g02-01} is above the threshold of 0.700 with a value of 1.000.",
    "threshold_value": "0.7",
    "url": "https://console.cloud.google.com/monitoring/alerting/incidents/0.mnshxcgwudjy?project=mec-test-344202"
  },
  "version": "1.2"
}
Recovery message format:
{
  "incident": {
    "condition": {
      "conditionThreshold": {
        "aggregations": [
          {
            "alignmentPeriod": "60s",
            "perSeriesAligner": "ALIGN_MEAN"
          }
        ],
        "comparison": "COMPARISON_GT",
        "duration": "120s",
        "evaluationMissingData": "EVALUATION_MISSING_DATA_ACTIVE",
        "filter": "resource.type = \"gce_instance\" AND metric.type = \"compute.googleapis.com/instance/cpu/utilization\" AND metric.labels.instance_name = monitoring.regex.full_match(\"xkh-test.*\")",
        "thresholdValue": 0.7,
        "trigger": {
          "count": 1
        }
      },
      "displayName": "VM Instance - CPU utilization",
      "name": "projects/mec-test-344202/alertPolicies/2705589879665297521/conditions/7587364556342473822"
    },
    "condition_name": "VM Instance - CPU utilization",
    "documentation": {
      "content": "test",
      "mime_type": "text/markdown"
    },
    "ended_at": 1665461592,
    "incident_id": "0.mnshxcgwudjy",
    "metadata": {
      "system_labels": {},
      "user_labels": {}
    },
    "metric": {
      "displayName": "CPU utilization",
      "labels": {
        "instance_name": "xkh-test-g02-01"
      },
      "type": "compute.googleapis.com/instance/cpu/utilization"
    },
    "observed_value": "0.258",
    "policy_name": "xkh-tset-cpu-alert",
    "policy_user_labels": {
      "kind": "cpu"
    },
    "resource": {
      "labels": {
        "instance_id": "8011306628148215104",
        "project_id": "mec-test-344202",
        "zone": "europe-west6-a"
      },
      "type": "gce_instance"
    },
    "resource_display_name": "xkh-test-g02-01",
    "resource_id": "",
    "resource_name": "mec-test-344202 xkh-test-g02-01",
    "resource_type_display_name": "VM Instance",
    "scoping_project_id": "mec-test-344202",
    "scoping_project_number": 328842067835,
    "started_at": 1665459241,
    "state": "closed",
    "summary": "CPU utilization for mec-test-344202 xkh-test-g02-01 with metric labels {instance_name=xkh-test-g02-01} returned to normal with a value of 0.258.",
    "threshold_value": "0.7",
    "url": "https://console.cloud.google.com/monitoring/alerting/incidents/0.mnshxcgwudjy?project=mec-test-344202"
  },
  "version": "1.2"
}
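The two samples differ mainly in state, ended_at, observed_value, and summary. The following is a minimal sketch of telling them apart, assuming the payload has already been parsed into a dict. The helper names `to_utc_iso` and `incident_summary` are hypothetical, and the sketch formats times in UTC so that its output does not depend on the host time zone.

```python
from datetime import datetime, timezone


def to_utc_iso(epoch_seconds):
    """Format a Unix timestamp as an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


def incident_summary(incident):
    """Extract state, times, and the optional value fields from an incident dict."""
    ended_raw = incident.get("ended_at")
    # The open sample carries the string "None" here, so only numbers are accepted.
    ended = ended_raw if isinstance(ended_raw, (int, float)) else None
    started = incident["started_at"]
    return {
        "is_open": incident["state"] == "open",
        "started_at": to_utc_iso(started),
        "ended_at": to_utc_iso(ended) if ended is not None else None,
        "duration_seconds": ended - started if ended is not None else None,
        # MQL-based policies omit these two fields, hence .get() with a fallback.
        "threshold_value": incident.get("threshold_value", "not found"),
        "observed_value": incident.get("observed_value", "not found"),
    }


# Trimmed copies of the alert and recovery samples.
open_incident = {"state": "open", "started_at": 1665459241, "ended_at": "None",
                 "threshold_value": "0.7", "observed_value": "1.000"}
closed_incident = {"state": "closed", "started_at": 1665459241, "ended_at": 1665461592,
                   "threshold_value": "0.7", "observed_value": "0.258"}

print(incident_summary(open_incident))
print(incident_summary(closed_incident))
```

For the recovery sample the incident lasted 2351 seconds, the difference between ended_at and started_at.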
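Deployment
Step 1: Add a Feishu bot and get its webhook URL
Feishu group -> Settings -> Group bots -> Add bot -> Custom bot
As shown: set the bot name and description.
Click Add.
Copy the bot's webhook URL.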
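Before wiring up GCP, it can help to confirm that the copied URL works. The following is a minimal sketch that posts a plain text message to a Feishu custom bot using only the standard library. The helper names and the placeholder URL are assumptions for illustration. If the bot has security settings enabled (for example custom keywords or signature verification), Feishu may reject this message.

```python
import json
import urllib.request


def build_text_message(text):
    """Build the JSON body for a plain text message to a Feishu custom bot."""
    return {"msg_type": "text", "content": {"text": text}}


def build_request(bot_url, payload):
    """Wrap the payload in a POST request with a JSON content type."""
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        bot_url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(bot_url, text, timeout=10):
    """Send the message and return the raw response body as text."""
    req = build_request(bot_url, build_text_message(text))
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Placeholder only: replace with the webhook URL copied from the bot settings.
BOT_URL = "https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxxxxx"
print(build_text_message("webhook test"))
# send(BOT_URL, "webhook test")  # uncomment once BOT_URL holds a real URL
```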
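Step 2: Deploy user-webhook on a VM
Manual deployment of user-webhook
Operating system: any Linux distribution; Python 3.6 or later is required.
Install python3 and the third-party modules (example environment: CentOS 7):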
yum install -y python3
pip3 install flask
pip3 install requests
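The user-webhook script:
- The script exposes three webhook alert channels that can be used at the same time: /alermhooka, /alermhookb, and /alermhookc.
- Each channel has its own Feishu message title and Feishu bot URL.
- When the script runs in a container, it reads this configuration from environment variables passed at startup.
The full script follows (saved as webapp.py, the file name used by the start commands and the Dockerfile in this article):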
#!/usr/bin/python3
import json
import os
import requests
from flask import Flask, request
from datetime import datetime

app = Flask(__name__)


def feishu_mes(json_obj, fs_title, fs_url):
    """Convert a GCP Monitoring incident into a Feishu card and post it."""
    incident = json_obj["incident"]
    stats = incident["state"]
    project_id = incident["resource"]["labels"].get("project_id", "not found")
    resource_type = incident["resource"]["type"]
    resource_name = incident["resource_name"]
    alerm_name = incident["policy_name"]
    stime = incident["started_at"]
    rtime = incident["ended_at"]
    alert_url = incident["url"]
    # Threshold and observed value; alerts from MQL-based policies lack both fields.
    threshold_value = incident.get("threshold_value", "not found")
    observed_value = incident.get("observed_value", "not found")
    # Normalize documentation to a string, to accommodate the GCP TEST CONNECTION payload.
    comments = incident["documentation"]
    if isinstance(comments, dict):
        comments = comments.get("content", "not found")
    alert_time = datetime.fromtimestamp(stime).isoformat()
    now_time = datetime.now().isoformat(timespec="seconds")
    if stats == "open":
        stat = "Fire"
        corlor = "red"
    else:
        stat = "Resolved"
        corlor = "green"
        # ended_at is a Unix timestamp only once the incident has closed.
        end_time = datetime.fromtimestamp(rtime).isoformat()
    # Card labels are kept in Chinese: message type, alert name, project ID,
    # resource type, resource name, threshold, current value, alert time,
    # current/recovery time, description, link to the alerting page.
    message = ""
    message += "**消息类型:** {} \n".format(stat)
    message += "**告警名称:** {} \n".format(alerm_name)
    message += "**项目ID:** {} \n".format(project_id)
    message += "**资源类型:** {} \n".format(resource_type)
    message += "**资源名称:** {} \n".format(resource_name)
    message += "**告警阈值:** {} \n".format(threshold_value)
    message += "**当前值:** {} \n".format(observed_value)
    message += "**告警时间:** {} \n".format(alert_time)
    if stats == "open":
        message += "**当前时间:** {} \n".format(now_time)
    else:
        message += "**恢复时间:** {} \n".format(end_time)
    message += "**告警描述:** {} \n".format(comments)
    message += "**打开告警页面:** [google-console alerting页面]({})\n".format(alert_url)
    body = {
        "msg_type": "interactive",
        "card": {
            "header": {
                "title": {
                    "tag": "plain_text",
                    "content": fs_title
                },
                "template": corlor
            },
            "elements": [
                {
                    "tag": "markdown",
                    "content": message
                }
            ]
        }
    }
    headers = {'Content-Type': 'application/json'}
    return requests.post(fs_url, headers=headers, json=body)


@app.route("/healthy")
def healthy_check():
    return "ok\n"


@app.route("/getconfig")
def get_config():
    # Shows the configuration a container was started with. For a manual start,
    # edit and inspect the values in the webhook functions instead.
    confa = "alermhooka:\n title: {}\n url: {}".format(os.environ.get("FS_TITLE_A", "no input"), os.environ.get("FS_URL_A", "no input"))
    confb = "alermhookb:\n title: {}\n url: {}".format(os.environ.get("FS_TITLE_B", "no input"), os.environ.get("FS_URL_B", "no input"))
    confc = "alermhookc:\n title: {}\n url: {}".format(os.environ.get("FS_TITLE_C", "no input"), os.environ.get("FS_URL_C", "no input"))
    return "{}\n{}\n{}\n".format(confa, confb, confc)


@app.route("/alermhooka", methods=["POST"])
def alerm_hooka():
    inpt_data = json.loads(request.data)
    print(json.dumps(inpt_data), flush=True)
    # Configure through environment variables, or edit the Feishu title and bot URL here.
    fs_title = os.environ.get("FS_TITLE_A", "GCP-告警消息A")
    fs_url = os.environ.get("FS_URL_A", 'https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxxxxxxxxxxxxxxxxxxxxx')
    return "{}\n".format(feishu_mes(inpt_data, fs_title, fs_url))


@app.route("/alermhookb", methods=["POST"])
def alerm_hookb():
    inpt_data = json.loads(request.data)
    print(json.dumps(inpt_data), flush=True)
    # Configure through environment variables, or edit the Feishu title and bot URL here.
    fs_title = os.environ.get("FS_TITLE_B", "GCP-告警消息B")
    fs_url = os.environ.get("FS_URL_B", 'https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxxxxxxxxxxxxxxxxx')
    return "{}\n".format(feishu_mes(inpt_data, fs_title, fs_url))


@app.route("/alermhookc", methods=["POST"])
def alerm_hookc():
    inpt_data = json.loads(request.data)
    print(json.dumps(inpt_data), flush=True)
    # Configure through environment variables, or edit the Feishu title and bot URL here.
    fs_title = os.environ.get("FS_TITLE_C", "GCP-告警消息C")
    fs_url = os.environ.get("FS_URL_C", 'https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxxxxxxxxxxxxxxxxxxxx')
    return "{}\n".format(feishu_mes(inpt_data, fs_title, fs_url))


if __name__ == "__main__":
    # Debug mode is off because the service listens on all interfaces and the
    # Flask debugger must not be reachable from the network.
    app.run(debug=False, host="0.0.0.0", port=15015)
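The three channel handlers in the script differ only in the suffix of the route and of the environment variable names. As an illustration, and not the author's implementation, the lookup could be expressed once. The helper name `channel_config` is hypothetical, and the empty-string fallback for the URL is an assumption (the script falls back to a placeholder URL instead).

```python
import os

CHANNELS = ("A", "B", "C")


def channel_config(suffix, environ=None):
    """Return route, title, and bot URL for one alert channel (A, B, or C)."""
    env = os.environ if environ is None else environ
    key = suffix.upper()
    if key not in CHANNELS:
        raise KeyError("unknown channel: {}".format(suffix))
    return {
        "route": "/alermhook" + key.lower(),
        # Same default title pattern as the script: GCP-告警消息A/B/C.
        "title": env.get("FS_TITLE_" + key, "GCP-告警消息" + key),
        # Assumption: an empty string marks a channel without a configured URL.
        "url": env.get("FS_URL_" + key, ""),
    }


example_env = {
    "FS_TITLE_A": "feishu-alert-test",
    "FS_URL_A": "https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxxxxx",
}
for name in CHANNELS:
    print(channel_config(name, example_env))
```

Passing the environment as a parameter keeps the helper testable without touching the process environment.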
Configure environment variables:
Set the title shown in Feishu alert messages and the Feishu group bot webhook URL:
export FS_TITLE_A="feishu-alert-test"
export FS_URL_A="https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxxxxx"
Start:
Foreground start (mainly for testing):
python3 -u webapp.py
Output on successful start:
Background start:
nohup python3 -u webapp.py &> logs &
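Testing the API:
1) Health check; returns ok when the service is healthy.
/healthy
2) Show the title and bot URL currently configured for each alermhook API.
/getconfig
3) Webhook alert channels:
- Three webhook APIs are provided; use as many as you have configured.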
/alermhooka
/alermhookb
/alermhookc
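The endpoints above can be exercised from a small client. The following is a minimal sketch using only the standard library, assuming the service runs locally on the default port 15015. The helper names and the trimmed sample payload are illustrative. A POST to an alert channel is forwarded to the configured Feishu bot, so it produces a real message in the group.

```python
import json
import urllib.request

# Assumption: the service is reachable locally on the default port.
BASE_URL = "http://127.0.0.1:15015"

# Trimmed alert payload with the fields the script reads.
SAMPLE_PAYLOAD = {
    "incident": {
        "state": "open",
        "policy_name": "xkh-tset-cpu-alert",
        "resource": {"labels": {"project_id": "mec-test-344202"}, "type": "gce_instance"},
        "resource_name": "mec-test-344202 xkh-test-g02-01",
        "started_at": 1665459241,
        "ended_at": "None",
        "documentation": {"content": "test", "mime_type": "text/markdown"},
        "threshold_value": "0.7",
        "observed_value": "1.000",
        "url": "https://console.cloud.google.com/monitoring/alerting/incidents/0.mnshxcgwudjy?project=mec-test-344202",
    },
    "version": "1.2",
}


def get_request(path, base=BASE_URL):
    """Build a GET request for /healthy or /getconfig."""
    return urllib.request.Request(base + path, method="GET")


def post_alert_request(path, payload, base=BASE_URL):
    """Build a POST request carrying an alert payload for an alert channel."""
    data = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(base + path, data=data, headers=headers, method="POST")


def call(req, timeout=10):
    """Send a prepared request and return the response body as text."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Usage, once the service is running:
#   print(call(get_request("/healthy")))
#   print(call(get_request("/getconfig")))
#   print(call(post_alert_request("/alermhooka", SAMPLE_PAYLOAD)))
```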
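Step 3: Add a webhook channel under GCP Monitoring – Alerting – CHANNELS
Configuration path:
GCP console -> Monitoring -> Alerting -> MANAGE CHANNELS
-> Webhooks -> ADD NEW
Screenshot 1:
Screenshot 2:
Note: the default user-webhook address is http://IP:15015/alermhooka, matching the /alermhooka route in the script.
Click TEST CONNECTION to test the webhook.
Step 4: Configure a GCP Monitoring alerting policy and use the webhook channel
The configuration is now complete.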
Containerizing the user-webhook script:
Docker build
Dockerfile:
FROM python:3.7.15-alpine
LABEL usage="gcp alerm webhook"
RUN pip install flask requests
COPY webapp.py /opt/webhook/
RUN apk add -U tzdata && cp /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
WORKDIR /opt/webhook/
ENV LANG="en_US.UTF-8" \
LC_ALL="en_US.UTF-8" \
FS_TITLE_A="GCP-告警消息" \
FS_URL_A="https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxx" \
FS_TITLE_B="GCP-告警消息" \
FS_URL_B="https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxx" \
FS_TITLE_C="GCP-告警消息" \
FS_URL_C="https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxx"
ENTRYPOINT ["/usr/local/bin/python3"]
CMD ["-u","webapp.py"]
build:
# docker build -t mesh-gcp-webhook:v1.0.1-alpine .
# docker build -t mesh-gcp-webhook:v1.0.2-alpine .
docker build -t mesh-gcp-webhook:v1.0.3-alpine .
Run test:
docker run -dit --name webhook-alpine -p 15015:15015 -e FS_URL_A='https://open.feishu.cn/open-apis/bot/v2/hook/f265193d-c107-4fce-852c-c5e554d5388b' mesh-gcp-webhook:v1.0.1-alpine
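Passing configuration when starting the container:
All three webhooks can be used at the same time.
Each webhook has its own Feishu message title and Feishu bot URL.
Pass the configuration as environment variables: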
/alermhooka
FS_TITLE_A="GCP-告警消息" \
FS_URL_A="https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxx" \
/alermhookb
FS_TITLE_B="GCP-告警消息" \
FS_URL_B="https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxx" \
/alermhookc
FS_TITLE_C="GCP-告警消息" \
FS_URL_C="https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxx"
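When several channels are configured, the `-e` flags for `docker run` can be generated rather than typed by hand. The following is a minimal sketch. The helper name `docker_env_args` and the shape of the `channels` dict are assumptions for illustration, and only the variable names FS_TITLE_x and FS_URL_x come from the script.

```python
def docker_env_args(channels):
    """Build `-e NAME=value` arguments for `docker run` from per-channel settings."""
    args = []
    for suffix in sorted(channels):
        settings = channels[suffix]
        for field in ("title", "url"):
            if field in settings:
                name = "FS_{}_{}".format(field.upper(), suffix.upper())
                args += ["-e", "{}={}".format(name, settings[field])]
    return args


# Hypothetical configuration for a single channel; the URL is a placeholder.
channels = {
    "a": {
        "title": "GCP-告警消息",
        "url": "https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxx",
    },
}
print(docker_env_args(channels))
```

The returned list can be spliced into an argument list such as `["docker", "run", "-d", "-p", "15015:15015", *args, image]` and passed to `subprocess.run`, which avoids shell quoting problems with the non-ASCII title.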
Deploying user-webhook on Cloud Run
Upload the image to GCP Container Registry
- The Container Registry API must be enabled beforehand.
Using Cloud Shell:
Push the built image to Docker Hub first, then in Cloud Shell pull it, tag it, and push it to GCR.
Example:
docker pull suiyueran/mesh-gcp-webhook:v1.0.1-alpine
docker tag suiyueran/mesh-gcp-webhook:v1.0.1-alpine gcr.io/mec-test-344202/mesh-gcp-webhook:v1.0.1-alpine
docker push gcr.io/mec-test-344202/mesh-gcp-webhook:v1.0.1-alpine
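The tag format is:
gcr.io/PROJECT-ID/IMAGE-NAME:IMAGE-VERSION
gcr.io is the Container Registry host for images stored in the United States. Hosts for other regions can also be used; see the official GCR documentation: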
https://cloud.google.com/container-registry/docs/overview
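The tag format can be assembled programmatically, for example in a build script. The following is a minimal sketch. The helper name `gcr_image_ref` is hypothetical, and the default host gcr.io is taken from the commands in this article.

```python
def gcr_image_ref(project_id, image, version, host="gcr.io"):
    """Return HOST/PROJECT-ID/IMAGE-NAME:IMAGE-VERSION for a Container Registry image."""
    for label, value in (("host", host), ("project_id", project_id),
                         ("image", image), ("version", version)):
        if not value:
            raise ValueError("{} must not be empty".format(label))
    return "{}/{}/{}:{}".format(host, project_id, image, version)


# Values taken from the example commands in this article.
print(gcr_image_ref("mec-test-344202", "mesh-gcp-webhook", "v1.0.1-alpine"))
# A regional host such as eu.gcr.io can be passed through the host parameter.
print(gcr_image_ref("mec-test-344202", "mesh-gcp-webhook", "v1.0.1-alpine", host="eu.gcr.io"))
```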
Authenticating with gcloud on a VM:
Authenticate with gcloud directly on the machine that built the image, then tag and push.
# log in
gcloud auth login --no-launch-browser
# set project
gcloud config set project mec-test-344202
# tag
docker tag mesh-gcp-webhook:v1.0.1-alpine gcr.io/mec-test-344202/mesh-gcp-webhook:v1.0.1-alpine
# enable the API and configure Docker authentication
gcloud services enable containerregistry.googleapis.com
gcloud auth configure-docker
# push
docker push gcr.io/mec-test-344202/mesh-gcp-webhook:v1.0.1-alpine
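Starting the service on Cloud Run
Start the service directly from the Console page.
Note:
- The container port is 15015.
- Pass the configuration as environment variables. If only one hook is used, pass only that hook's configuration; the Feishu URL is the essential value.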
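The container port has to be set to 15015 because the script hard-codes that port. Cloud Run also passes the configured port to the container in the PORT environment variable, so a variant of the script could read it instead. The following is a minimal sketch of that idea, not part of the original script. The helper name `listen_port` is hypothetical, and falling back to 15015 is an assumption chosen to match the rest of this article.

```python
import os

DEFAULT_PORT = 15015  # the port hard-coded in the original script


def listen_port(environ=None):
    """Return the port from the PORT variable, or the default when unset or invalid."""
    env = os.environ if environ is None else environ
    raw = env.get("PORT", "")
    try:
        port = int(raw)
    except ValueError:
        return DEFAULT_PORT
    return port if 0 < port < 65536 else DEFAULT_PORT


print(listen_port({}))
print(listen_port({"PORT": "8080"}))
```

In the script this would be used as `app.run(host="0.0.0.0", port=listen_port())`.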
Get the service URL:
Updating the Cloud Run service
Step 1: Upload the new image version.
Step 2: Service -> EDIT & DEPLOY NEW REVISION -> select the new image version -> DEPLOY