Google Monitoring -> Feishu Alert Webhook

Author: Xiang Kaihua, public cloud architect at MeshCloud (脉时云)

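Background:

  • Google Monitoring does not support sending alerts directly to Feishu (or DingTalk).
  • In testing, the default email notification channel showed very large delays; emails sometimes arrived several hours late.
  • We therefore wrote a user-webhook script that receives Google's own webhook alert messages, extracts and reformats the information, and sends it to a Feishu group bot webhook, so that Google Monitoring alerts reach a Feishu group bot.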

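Alert flow diagram:

GCP alert message format:

The message formats below (examples) are the text of Google Monitoring webhook alert messages as received by the user-webhook script.

Note:

  • Alert messages from policies configured manually in the console via SELECT A METRIC differ somewhat in format from those of alert rules configured with MQL.
  • For example, notifications from alert rules configured with MQL do not contain the following two values.
# Alert threshold
"threshold_value": "0.7",
# Observed value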
"observed_value": "1.000",

Alert message format:

{
  "incident": {
    "condition": {
      "conditionThreshold": {
        "aggregations": [
          {
            "alignmentPeriod": "60s",
            "perSeriesAligner": "ALIGN_MEAN"
          }
        ],
        "comparison": "COMPARISON_GT",
        "duration": "120s",
        "evaluationMissingData": "EVALUATION_MISSING_DATA_ACTIVE",
        "filter": "resource.type = \"gce_instance\" AND metric.type = \"compute.googleapis.com/instance/cpu/utilization\" AND metric.labels.instance_name = monitoring.regex.full_match(\"xkh-test.*\")",
        "thresholdValue": 0.7,
        "trigger": {
          "count": 1
        }
      },
      "displayName": "VM Instance - CPU utilization",
      "name": "projects/mec-test-344202/alertPolicies/2705589879665297521/conditions/7587364556342473822"
    },
    "condition_name": "VM Instance - CPU utilization",
    "documentation": {
      "content": "test",
      "mime_type": "text/markdown"
    },
    "ended_at": "None",
    "incident_id": "0.mnshxcgwudjy",
    "metadata": {
      "system_labels": {},
      "user_labels": {}
    },
    "metric": {
      "displayName": "CPU utilization",
      "labels": {
        "instance_name": "xkh-test-g02-01"
      },
      "type": "compute.googleapis.com/instance/cpu/utilization"
    },
    "observed_value": "1.000",
    "policy_name": "xkh-tset-cpu-alert",
    "policy_user_labels": {
      "kind": "cpu"
    },
    "resource": {
      "labels": {
        "instance_id": "8011306628148215104",
        "project_id": "mec-test-344202",
        "zone": "europe-west6-a"
      },
      "type": "gce_instance"
    },
    "resource_display_name": "xkh-test-g02-01",
    "resource_id": "",
    "resource_name": "mec-test-344202 xkh-test-g02-01",
    "resource_type_display_name": "VM Instance",
    "scoping_project_id": "mec-test-344202",
    "scoping_project_number": 328842067835,
    "started_at": 1665459241,
    "state": "open",
    "summary": "CPU utilization for mec-test-344202 xkh-test-g02-01 with metric labels {instance_name=xkh-test-g02-01} is above the threshold of 0.700 with a value of 1.000.",
    "threshold_value": "0.7",
    "url": "https://console.cloud.google.com/monitoring/alerting/incidents/0.mnshxcgwudjy?project=mec-test-344202"
  },
  "version": "1.2"
}
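
Because MQL-based policies omit threshold_value and observed_value, any consumer of this payload has to tolerate their absence. A minimal sketch of that tolerant lookup follows; the helper name read_values and the two trimmed-down sample payloads are illustrative assumptions, not part of the original script.

```python
import json

def read_values(payload):
    """Return (threshold, observed) from a webhook payload.

    Falls back to "not found" when a key is absent, as happens for
    MQL-based alert policies according to the note above.
    """
    incident = payload.get("incident", {})
    threshold = incident.get("threshold_value", "not found")
    observed = incident.get("observed_value", "not found")
    return threshold, observed

# Hypothetical, trimmed-down payloads for illustration only.
console_style = json.loads(
    '{"incident": {"threshold_value": "0.7", "observed_value": "1.000"}}'
)
mql_style = json.loads('{"incident": {"state": "open"}}')

print(read_values(console_style))
print(read_values(mql_style))
```

Using dict.get with a default keeps the handler from raising KeyError on the MQL variant, which is the same approach the full script takes for these two fields.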

Recovery message format:

{
  "incident": {
    "condition": {
      "conditionThreshold": {
        "aggregations": [
          {
            "alignmentPeriod": "60s",
            "perSeriesAligner": "ALIGN_MEAN"
          }
        ],
        "comparison": "COMPARISON_GT",
        "duration": "120s",
        "evaluationMissingData": "EVALUATION_MISSING_DATA_ACTIVE",
        "filter": "resource.type = \"gce_instance\" AND metric.type = \"compute.googleapis.com/instance/cpu/utilization\" AND metric.labels.instance_name = monitoring.regex.full_match(\"xkh-test.*\")",
        "thresholdValue": 0.7,
        "trigger": {
          "count": 1
        }
      },
      "displayName": "VM Instance - CPU utilization",
      "name": "projects/mec-test-344202/alertPolicies/2705589879665297521/conditions/7587364556342473822"
    },
    "condition_name": "VM Instance - CPU utilization",
    "documentation": {
      "content": "test",
      "mime_type": "text/markdown"
    },
    "ended_at": 1665461592,
    "incident_id": "0.mnshxcgwudjy",
    "metadata": {
      "system_labels": {},
      "user_labels": {}
    },
    "metric": {
      "displayName": "CPU utilization",
      "labels": {
        "instance_name": "xkh-test-g02-01"
      },
      "type": "compute.googleapis.com/instance/cpu/utilization"
    },
    "observed_value": "0.258",
    "policy_name": "xkh-tset-cpu-alert",
    "policy_user_labels": {
      "kind": "cpu"
    },
    "resource": {
      "labels": {
        "instance_id": "8011306628148215104",
        "project_id": "mec-test-344202",
        "zone": "europe-west6-a"
      },
      "type": "gce_instance"
    },
    "resource_display_name": "xkh-test-g02-01",
    "resource_id": "",
    "resource_name": "mec-test-344202 xkh-test-g02-01",
    "resource_type_display_name": "VM Instance",
    "scoping_project_id": "mec-test-344202",
    "scoping_project_number": 328842067835,
    "started_at": 1665459241,
    "state": "closed",
    "summary": "CPU utilization for mec-test-344202 xkh-test-g02-01 with metric labels {instance_name=xkh-test-g02-01} returned to normal with a value of 0.258.",
    "threshold_value": "0.7",
    "url": "https://console.cloud.google.com/monitoring/alerting/incidents/0.mnshxcgwudjy?project=mec-test-344202"
  },
  "version": "1.2"
}
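
Comparing the two samples, the recovery message differs mainly in state ("closed" instead of "open") and in ended_at, which is the string "None" while the incident is open and a Unix timestamp once it closes. The sketch below shows one way to read those fields safely and derive the incident duration. The function name incident_times is an assumption for illustration, and UTC is used here only to make the result independent of the host timezone.

```python
from datetime import datetime, timezone

def incident_times(incident):
    """Return (started, ended, duration_seconds) for an incident dict.

    ended and duration_seconds are None while the incident is open,
    where the sample payload carries the string "None" in ended_at.
    """
    started = datetime.fromtimestamp(incident["started_at"], tz=timezone.utc)
    ended_raw = incident.get("ended_at")
    if incident.get("state") == "closed" and isinstance(ended_raw, (int, float)):
        ended = datetime.fromtimestamp(ended_raw, tz=timezone.utc)
        return started, ended, int((ended - started).total_seconds())
    return started, None, None

# Values copied from the two sample messages above.
open_incident = {"state": "open", "started_at": 1665459241, "ended_at": "None"}
closed_incident = {"state": "closed", "started_at": 1665459241, "ended_at": 1665461592}

print(incident_times(open_incident))
print(incident_times(closed_incident))
```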

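Deployment process

Step 1: Add a Feishu bot and get its URL

Feishu group -> Settings -> Group bots -> Add bot -> Custom bot

As shown in the figure: configure the bot name and description

Click Add.

Copy the bot's webhook URL.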
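
Before wiring up GCP, it can help to confirm that the copied URL accepts messages. The sketch below builds a plain text message for a Feishu custom bot and provides a sender based on the standard library. The helper names are assumptions for illustration, the URL in the usage note is the article's placeholder, and a bot configured with keyword or signature checks may reject such a message.

```python
import json
import urllib.request

def build_text_message(text):
    """Build the JSON body for a plain text message to a Feishu custom bot."""
    return {"msg_type": "text", "content": {"text": text}}

def send_to_feishu(hook_url, body, timeout=5):
    """POST the body to the bot webhook and return (HTTP status, response text)."""
    data = json.dumps(body).encode("utf-8")
    req = urllib.request.Request(
        hook_url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8")

body = build_text_message("webhook connectivity test")
print(json.dumps(body))
```

To actually send the message, call send_to_feishu("https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxxxxx", body) with the real URL in place of the placeholder.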

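Step 2: Deploy user-webhook on a VM

Deploying user-webhook manually

Operating system: any Linux distribution works; Python 3.6+ is required.

Install python3 and the third-party modules (example environment: CentOS 7):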

yum install -y python3
pip3 install flask
pip3 install requests

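The user-webhook script:

  • The script supports three webhook alert channels at the same time: /alermhooka, /alermhookb, and /alermhookc.
  • Each alert channel can be configured separately with its own Feishu message title and Feishu bot URL.
  • Once packaged as a container, the script reads its configuration from environment variables passed in at startup.

The full script follows: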

#!/usr/bin/python3
import json
import sys,os
import requests
from flask import Flask,request
from datetime import datetime

app = Flask(__name__)

def feishu_mes(json_obj,fs_title,fs_url):
    stats = json_obj["incident"]["state"]
    project_id = json_obj["incident"]["resource"]["labels"].get("project_id","not found")
    resource_type = json_obj["incident"]["resource"]["type"]
    resource_name = json_obj["incident"]["resource_name"]
    alerm_name = json_obj["incident"]["policy_name"]
    stime = json_obj["incident"]["started_at"]
    rtime = json_obj["incident"]["ended_at"]
    alert_url = json_obj["incident"]["url"]

    # Threshold and current value; alert messages from MQL-based policies lack these two fields.
    threshold_value = json_obj["incident"].get("threshold_value","not found")
    observed_value = json_obj["incident"].get("observed_value","not found")

    # Handle comments, to accommodate GCP TEST CONNECTION
    comments = json_obj["incident"]["documentation"]
    if isinstance(comments,dict):
        comments = comments.get("content","not found")

    alert_time = datetime.fromtimestamp(stime).isoformat()
    now_time = datetime.now().isoformat(timespec="seconds")

    if stats == "open":
        stat = "Fire"
        corlor = "red"
    else:
        stat = "Resolved"
        corlor = "green"
        end_time = datetime.fromtimestamp(rtime).isoformat()

    message = ""
    message += "**Message type:** {} \n".format(stat)
    message += "**Alert name:** {} \n".format(alerm_name)
    message += "**Project ID:** {} \n".format(project_id)
    message += "**Resource type:** {} \n".format(resource_type)
    message += "**Resource name:** {} \n".format(resource_name)
    message += "**Alert threshold:** {} \n".format(threshold_value)
    message += "**Current value:** {} \n".format(observed_value)
    message += "**Alert time:** {} \n".format(alert_time)

    if stats == "open":
        message += "**Current time:** {} \n".format(now_time)
    else:
        message += "**Recovery time:** {} \n".format(end_time)

    message += "**Alert description:** {} \n".format(comments)
    message += "**Open alert page:** [google-console alerting page]({})\n".format(alert_url)

    body = {
      "msg_type": "interactive",
      "card": {
        "header": {
          "title": {
            "tag": "plain_text",
            "content": fs_title
          },
          "template": corlor
        },
        "elements": [
          {
            "tag": "markdown",
            "content": message
          }
        ]
      }
    }

    headers = {'Content-Type': 'application/json'}

    return requests.post(fs_url, headers=headers, json=body)


@app.route("/healthy")
def healthy_check():
    # Liveness probe: always answers ok while the process is serving requests.
    return "ok\n"

@app.route("/getconfig")
def get_config():
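    # Shows the configuration in effect when running as a container. When started manually, edit and check the values in the webhook functions instead.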
    confa = "alermhooka:\n    title: {}\n    url: {}".format(os.environ.get("FS_TITLE_A","no input"),os.environ.get("FS_URL_A","no input"))
    confb = "alermhookb:\n    title: {}\n    url: {}".format(os.environ.get("FS_TITLE_B","no input"),os.environ.get("FS_URL_B","no input"))
    confc = "alermhookc:\n    title: {}\n    url: {}".format(os.environ.get("FS_TITLE_C","no input"),os.environ.get("FS_URL_C","no input"))

    return "{}\n{}\n{}\n".format(confa,confb,confc)

@app.route("/alermhooka",methods=["POST"])
def alerm_hooka():
    inpt_data = json.loads(request.data)
    print(json.dumps(inpt_data),flush=True)

    # Configure via environment variables, or edit the Feishu title and alert bot URL here
    fs_title = os.environ.get("FS_TITLE_A","GCP-Alert-A")
    fs_url = os.environ.get("FS_URL_A",'https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxxxxxxxxxxxxxxxxxxxxx')

    return "{}\n".format(feishu_mes(inpt_data,fs_title,fs_url))

@app.route("/alermhookb",methods=["POST"])
def alerm_hookb():
    inpt_data = json.loads(request.data)
    print(json.dumps(inpt_data),flush=True)

    # Configure via environment variables, or edit the Feishu title and alert bot URL here
    fs_title = os.environ.get("FS_TITLE_B","GCP-Alert-B")
    fs_url = os.environ.get("FS_URL_B",'https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxxxxxxxxxxxxxxxxx')

    return "{}\n".format(feishu_mes(inpt_data,fs_title,fs_url))

@app.route("/alermhookc",methods=["POST"])
def alerm_hookc():
    inpt_data = json.loads(request.data)
    print(json.dumps(inpt_data),flush=True)

    # Configure via environment variables, or edit the Feishu title and alert bot URL here
    fs_title = os.environ.get("FS_TITLE_C","GCP-Alert-C")
    fs_url = os.environ.get("FS_URL_C",'https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxxxxxxxxxxxxxxxxxxxx')

    return "{}\n".format(feishu_mes(inpt_data,fs_title,fs_url))

if __name__ == "__main__":
    # debug is off: the Flask debugger should not be exposed on 0.0.0.0
    app.run(debug=False,host="0.0.0.0",port=15015)
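
The three handlers above differ only in the suffix of the environment variables they read. As a sketch of how that per-channel lookup could be expressed once, the helper below resolves a channel letter to its route, title, and URL. The name channel_config and the fallback title are assumptions for illustration; this is not how the original script is organized.

```python
import os

CHANNELS = ("A", "B", "C")

def channel_config(suffix, environ=os.environ):
    """Resolve the route, Feishu title, and bot URL for channel A, B, or C."""
    suffix = suffix.upper()
    if suffix not in CHANNELS:
        raise KeyError("unknown channel: {}".format(suffix))
    return {
        "route": "/alermhook" + suffix.lower(),
        # Hypothetical fallback title, used only when FS_TITLE_<X> is unset.
        "title": environ.get("FS_TITLE_" + suffix, "GCP-Alert-" + suffix),
        # None signals that no bot URL has been configured for this channel.
        "url": environ.get("FS_URL_" + suffix),
    }

example_env = {
    "FS_TITLE_A": "feishu-alert-test",
    "FS_URL_A": "https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxxxxx",
}
print(channel_config("a", example_env))
print(channel_config("b", example_env))
```

Passing the environment as a parameter keeps the lookup easy to exercise without touching the real process environment.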

Configure environment variables:

Set the TITLE used in Feishu alert messages and the Feishu group bot webhook URL yourself:

export FS_TITLE_A="feishu-alert-test"
export FS_URL_A="https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxxxxx"

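Start:

Start in the foreground (direct startup is mainly for testing):

python3 -u webapp.py

Output on successful startup:

Start in the background:

nohup python3 -u webapp.py &> logs &

Test the APIs:

1) Health check; a response of ok means the service is healthy.

/healthy

2) Get the title and bot URL currently configured for each alermhook API.

/getconfig

3) Webhook alert channels:

  • Three webhook APIs are provided out of the box; configure and use as many as you need.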
/alermhooka

/alermhookb

/alermhookc
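
To exercise an alert channel without waiting for a real incident, a hand-built payload can be posted to it. The sketch below only constructs the request; the helper name, the local base URL, and the trimmed sample payload (limited to fields the script reads) are assumptions for illustration.

```python
import json
import urllib.request

def build_alert_request(base_url, channel, payload):
    """Build a POST request for /alermhook<channel> carrying a JSON payload."""
    url = "{}/alermhook{}".format(base_url.rstrip("/"), channel)
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Trimmed payload containing only the fields the script accesses.
sample = {
    "incident": {
        "state": "open",
        "resource": {"labels": {"project_id": "mec-test-344202"}, "type": "gce_instance"},
        "resource_name": "mec-test-344202 xkh-test-g02-01",
        "policy_name": "xkh-tset-cpu-alert",
        "started_at": 1665459241,
        "ended_at": "None",
        "url": "https://console.cloud.google.com/monitoring/alerting/incidents/0.mnshxcgwudjy?project=mec-test-344202",
        "documentation": {"content": "test", "mime_type": "text/markdown"},
    }
}

req = build_alert_request("http://127.0.0.1:15015", "a", sample)
print(req.get_method(), req.full_url)
```

With the service running, urllib.request.urlopen(req, timeout=5) sends the request; the configured Feishu group should then receive a card.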

Step 3: Add a webhook channel under GCP Monitoring - Alerting - CHANNELS

Configuration path:

GCP console -> Monitoring -> Alerting -> MANAGE CHANNELS

-> Webhooks -> ADD NEW

Screenshot 1:

Screenshot 2:

Note that the default user-webhook address is: http://IP:15015/alermhooka

Click TEST CONNECTION to test the webhook.
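
The script notes that its handling of the documentation field exists to accommodate TEST CONNECTION, which suggests the test payload does not always carry a dict there. The exact test payload is not shown in this article, so the sketch below simply accepts either shape; the function name extract_comments and the sample values are assumptions for illustration.

```python
def extract_comments(incident):
    """Return the documentation text whether the field is a dict or a scalar."""
    doc = incident.get("documentation", "not found")
    if isinstance(doc, dict):
        # Regular alerts: {"content": "...", "mime_type": "text/markdown"}
        return doc.get("content", "not found")
    # Any other shape (for example a plain string) is passed through as text.
    return str(doc)

print(extract_comments({"documentation": {"content": "test", "mime_type": "text/markdown"}}))
print(extract_comments({"documentation": "plain text documentation"}))
print(extract_comments({}))
```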

Step 4: Configure a GCP Monitoring - Alerting policy and use the webhook

The configuration process is now complete.

Packaging the user-webhook script as a container:

Docker build

Dockerfile:

FROM python:3.7.15-alpine
LABEL usage="gcp alerm webhook"
RUN pip install flask requests
COPY webapp.py /opt/webhook/
RUN apk add -U tzdata && cp /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
WORKDIR /opt/webhook/
ENV LANG="en_US.UTF-8" \
    LC_ALL="en_US.UTF-8" \
    FS_TITLE_A="GCP-Alert" \
    FS_URL_A="https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxx" \
    FS_TITLE_B="GCP-Alert" \
    FS_URL_B="https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxx" \
    FS_TITLE_C="GCP-Alert" \
    FS_URL_C="https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxx"

ENTRYPOINT ["/usr/local/bin/python3"]
CMD ["-u","webapp.py"]

build:

# docker build -t mesh-gcp-webhook:v1.0.1-alpine .
# docker build -t mesh-gcp-webhook:v1.0.2-alpine .
docker build -t mesh-gcp-webhook:v1.0.3-alpine .

Run test:

docker run -dit --name webhook-alpine -p 15015:15015 -e FS_URL_A='https://open.feishu.cn/open-apis/bot/v2/hook/f265193d-c107-4fce-852c-c5e554d5388b' mesh-gcp-webhook:v1.0.1-alpine

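Passing configuration when starting the container:

Up to three webhooks can be used at the same time.

Each webhook has its own Feishu message title and Feishu bot URL.

The configuration is passed in through environment variables: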

/alermhooka
    FS_TITLE_A="GCP-Alert" \
    FS_URL_A="https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxx" \

/alermhookb
    FS_TITLE_B="GCP-Alert" \
    FS_URL_B="https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxx" \

/alermhookc
    FS_TITLE_C="GCP-Alert" \
    FS_URL_C="https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxx"
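
To make the mapping from channels to -e flags concrete, the sketch below assembles a docker run command line from a per-channel configuration. It reuses the flags from the Run test command above; the function name, the channels dictionary layout, and the sample title are assumptions for illustration.

```python
import shlex

def docker_run_command(image, channels, name="webhook-alpine", port=15015):
    """Build a docker run argument list with one FS_TITLE/FS_URL pair per channel."""
    args = ["docker", "run", "-dit", "--name", name, "-p", "{0}:{0}".format(port)]
    for suffix in sorted(channels):
        title, url = channels[suffix]
        args += ["-e", "FS_TITLE_{}={}".format(suffix, title)]
        args += ["-e", "FS_URL_{}={}".format(suffix, url)]
    args.append(image)
    return args

cmd = docker_run_command(
    "mesh-gcp-webhook:v1.0.1-alpine",
    {"A": ("GCP-Alert", "https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxx")},
)
# Quote each argument so the printed line can be pasted into a shell.
print(" ".join(shlex.quote(a) for a in cmd))
```

Channels that are left out of the dictionary fall back to the ENV defaults baked into the image.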

Deploying user-webhook on Cloud Run

Upload the container image to GCP Container Registry

  • The Container Registry API must be enabled beforehand.

From Cloud Shell:

First push the built image to Docker Hub, then pull, tag, and push it to GCR from Cloud Shell.

Example:

docker pull suiyueran/mesh-gcp-webhook:v1.0.1-alpine
docker tag suiyueran/mesh-gcp-webhook:v1.0.1-alpine gcr.io/mec-test-344202/mesh-gcp-webhook:v1.0.1-alpine
docker push gcr.io/mec-test-344202/mesh-gcp-webhook:v1.0.1-alpine

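The tag format is:
gcr.io/PROJECT-ID/IMAGE-NAME:IMAGE-VERSION
gcr.io is the Container Registry hostname for storage in the United States. Hostnames for other regions can also be used; see the official GCR documentation for details:
https://cloud.google.com/container-registry/docs/overview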
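
The tag format can be expressed as a small helper, shown here as a sketch. The function name gcr_image is an assumption for illustration, and the host parameter defaults to gcr.io as in the examples above.

```python
def gcr_image(project_id, image, version, host="gcr.io"):
    """Compose a Container Registry image reference: HOST/PROJECT-ID/IMAGE:VERSION."""
    return "{}/{}/{}:{}".format(host, project_id, image, version)

print(gcr_image("mec-test-344202", "mesh-gcp-webhook", "v1.0.1-alpine"))
# A regional hostname can be supplied instead of the default.
print(gcr_image("mec-test-344202", "mesh-gcp-webhook", "v1.0.1-alpine", host="asia.gcr.io"))
```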

Authenticating with gcloud on a VM:

Authenticate gcloud directly on the machine where the image was built, then run tag and push.

# Log in
gcloud auth login --no-launch-browser

# Set the project
gcloud config set project mec-test-344202

# Tag
docker tag mesh-gcp-webhook:v1.0.1-alpine gcr.io/mec-test-344202/mesh-gcp-webhook:v1.0.1-alpine

# Enable the API and configure Docker authentication
gcloud services enable containerregistry.googleapis.com
gcloud auth configure-docker

# Push
docker push gcr.io/mec-test-344202/mesh-gcp-webhook:v1.0.1-alpine

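Starting the service on Cloud Run

Start it directly from the console page.

Note:

  • Container port: 15015
  • Pass the configuration in through environment variables. If you use only one hook, passing that hook's configuration is enough; the Feishu URL is the essential value.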
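
The script listens on a fixed port, which is why the container port must be set to 15015. Cloud Run also passes the port the container is expected to listen on through the PORT environment variable, so an alternative is to read it at startup. The sketch below shows that variant; the helper name listen_port is an assumption for illustration and is not part of the original script.

```python
import os

def listen_port(environ=os.environ, default=15015):
    """Return the port from PORT, falling back to the default if unset or invalid."""
    raw = environ.get("PORT", "")
    try:
        port = int(raw)
    except ValueError:
        return default
    return port if 0 < port < 65536 else default

print(listen_port({}))
print(listen_port({"PORT": "8080"}))
```

In the script this value would replace the literal in app.run, for example app.run(host="0.0.0.0", port=listen_port()).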

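Get the service's access domain:

Rolling out updates to the Cloud Run service

Step 1: Upload the new image version.

Step 2: service -> EDIT & DEPLOY NEW REVISION -> select the new image version -> DEPLOY

Example alert message: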
