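Prometheus Study Notes | 03. Understanding and Applying the Four Metric Types

What Is a Metric

Measurement assigns a number to some property of an object or event so that it can be compared with the same property of other objects or events. A measurement may be an estimate or determination of a physical quantity (such as length, size, or capacity), or of a more abstract characteristic.

Put simply, a metric is quantified data: a measurement turned into a data indicator.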

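Prometheus Metric Format

To help users understand and distinguish different kinds of monitoring data, Prometheus defines 4 metric types: Counter, Gauge, Histogram, and Summary. In the sample data returned by an Exporter, the comment lines also state the type of each metric. For example: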


# HELP node_cpu Seconds the cpus spent in each mode.
# TYPE node_cpu counter
node_cpu{cpu="cpu0",mode="idle"} 362812.7890625

In Prometheus, a metric is written in the following format:

<metric name>{<label name>=<label value>, ...}

The identifier consists of the metric name and a set of labels:

api_http_requests_total{method="GET", handler="/user/info"}
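To make the notation concrete, here is a minimal sketch that renders a metric name, a label set, and a value into a line of that shape. The function name formatSample is hypothetical, and strconv.Quote is only an approximation of the escaping rules of the real exposition format; it uses the standard library only and is not how the client library does it.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// formatSample renders one sample as
//   <metric name>{<label name>="<label value>",...} <value>
// Labels are sorted by name so the output is deterministic.
// formatSample is a hypothetical helper for illustration only.
func formatSample(name string, labels map[string]string, value float64) string {
	names := make([]string, 0, len(labels))
	for k := range labels {
		names = append(names, k)
	}
	sort.Strings(names)

	pairs := make([]string, 0, len(names))
	for _, k := range names {
		// strconv.Quote adds the surrounding double quotes and escapes
		// characters such as backslashes, quotes, and newlines.
		pairs = append(pairs, k+"="+strconv.Quote(labels[k]))
	}

	v := strconv.FormatFloat(value, 'g', -1, 64)
	if len(pairs) == 0 {
		return name + " " + v
	}
	return name + "{" + strings.Join(pairs, ",") + "} " + v
}

func main() {
	line := formatSample("api_http_requests_total",
		map[string]string{"method": "GET", "handler": "/user/info"}, 1)
	fmt.Println(line)
	// prints: api_http_requests_total{handler="/user/info",method="GET"} 1
}
```

The metric name together with the full label set identifies one time series, which is why changing any label value produces a different series.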

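Counter: a counter that only goes up

A Counter metric works like a tally counter: its value only increases (it goes back to zero only when the process restarts). Common metrics such as http_requests_total and node_cpu are Counters. When naming a Counter, the recommended convention is to use _total as the suffix.

A Counter is simple but powerful. For example, an application can record how many times an event has occurred; because the values are stored as a time series, it is easy to see how the rate of that event changes over time. PromQL's built-in aggregation operators and functions allow further analysis of the data.

For example, use rate() to get the per-second rate of increase of HTTP requests, averaged over the last 5 minutes: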

rate(http_requests_total[5m])
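As a rough illustration of what such a rate means for a counter that may reset, the sketch below computes an average per-second increase from a handful of samples. It is a simplification under stated assumptions: the sample type and simpleRate are hypothetical names, and unlike PromQL's rate() it does not extrapolate to the edges of the time window.

```go
package main

import "fmt"

// sample is one scraped counter value with its Unix timestamp in seconds.
type sample struct {
	ts    float64
	value float64
}

// simpleRate returns the average per-second increase across the samples.
// A drop in value is treated as a counter reset: counting restarted from
// zero, so the value after the reset is itself the increase for that step.
// Simplification: no extrapolation to the window boundaries is performed.
func simpleRate(samples []sample) float64 {
	if len(samples) < 2 {
		return 0
	}
	increase := 0.0
	for i := 1; i < len(samples); i++ {
		prev, cur := samples[i-1].value, samples[i].value
		if cur < prev {
			increase += cur // counter reset between the two scrapes
		} else {
			increase += cur - prev
		}
	}
	elapsed := samples[len(samples)-1].ts - samples[0].ts
	if elapsed <= 0 {
		return 0
	}
	return increase / elapsed
}

func main() {
	// The counter restarts between the second and third scrape.
	samples := []sample{{0, 100}, {100, 160}, {200, 30}, {300, 90}}
	fmt.Println(simpleRate(samples)) // (60 + 30 + 60) / 300 = 0.5
}
```

This reset handling is the reason a raw subtraction of two counter values is not enough, and why counters are normally queried through rate() or increase().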

Query the 10 HTTP endpoints with the most requests in the current system:

topk(10, http_requests_total)

The Counter type has two regular methods:

Inc()           // Increments the counter by 1.
Add(float64)    // Adds the given value to the counter; panics if the value is < 0.

Example:


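package main

import (
    "github.com/gin-gonic/gin"
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// AccessCounter counts handled requests, partitioned by HTTP method and path.
var AccessCounter = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "api_requests_total",
    },
    []string{"method", "path"},
)

func init() {
    prometheus.MustRegister(AccessCounter)
}

func main() {
    engine := gin.Default() // assumed setup; the original elides this part
    engine.GET("/counter", func(c *gin.Context) {
        // c.Request.URL is already parsed, so no separate url.Parse call is needed.
        AccessCounter.With(prometheus.Labels{
            "method": c.Request.Method,
            "path":   c.Request.URL.Path,
        }).Inc()
    })
    // Expose all registered metrics for Prometheus to scrape.
    engine.GET("/metrics", gin.WrapH(promhttp.Handler()))
    engine.Run(":10001")
}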

Now visit http://127.0.0.1:10001/counter and then check /metrics; the counter has increased by 1:


# HELP api_requests_total 
# TYPE api_requests_total counter
api_requests_total{method="GET",path="/counter"} 1

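Gauge: a gauge that can go up and down

Unlike a Counter, a Gauge reflects the current state of the system, so its sample values can both increase and decrease. Common examples are node_memory_MemFree (the amount of memory currently free on the host) and node_memory_MemAvailable (the amount of memory available); both are Gauges.

With a Gauge, users can read the current state of the system directly: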

node_memory_MemFree

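For Gauge metrics, the PromQL built-in function delta() returns how much a series changed over a time range. For example, the difference in CPU temperature between now and two hours ago: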

delta(cpu_temp_celsius{host="zeus"}[2h])

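You can also use deriv(), which estimates the per-second rate of change of a series using simple linear regression, or use predict_linear() to extrapolate the trend directly. For example, predict the free disk space 4 hours from now, based on the last hour of samples: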

predict_linear(node_filesystem_free{job="node"}[1h], 4 * 3600)
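The idea behind such a prediction can be sketched as an ordinary least-squares line fitted through the samples and evaluated some seconds after the last one. This is an illustration only: the point type and the predictLinear function below are hypothetical names, the data is made up, and Prometheus' own implementation differs in details such as which instant it treats as "now".

```go
package main

import "fmt"

// point is one gauge sample: Unix timestamp in seconds and its value.
type point struct{ ts, value float64 }

// predictLinear fits value = intercept + slope*t by ordinary least squares
// and evaluates the line secondsAhead after the last sample. It also returns
// the slope, which is the per-second rate of change of the fitted line.
// ok is false when fewer than two distinct timestamps are available.
func predictLinear(points []point, secondsAhead float64) (predicted, slope float64, ok bool) {
	n := float64(len(points))
	if n < 2 {
		return 0, 0, false
	}
	last := points[len(points)-1].ts
	var sumT, sumV, sumTV, sumTT float64
	for _, p := range points {
		t := p.ts - last // shift so that the last sample sits at t = 0
		sumT += t
		sumV += p.value
		sumTV += t * p.value
		sumTT += t * t
	}
	denom := n*sumTT - sumT*sumT
	if denom == 0 {
		return 0, 0, false // all timestamps are identical
	}
	slope = (n*sumTV - sumT*sumV) / denom
	intercept := (sumV - slope*sumT) / n
	return intercept + slope*secondsAhead, slope, true
}

func main() {
	// Made-up free-space readings taken every 15 minutes over one hour,
	// shrinking by 2 units per second.
	pts := []point{{0, 100000}, {900, 98200}, {1800, 96400}, {2700, 94600}, {3600, 92800}}
	predicted, slope, _ := predictLinear(pts, 4*3600)
	fmt.Println(predicted, slope)
}
```

A negative predicted value is a useful alerting signal in itself: it means the fitted trend reaches zero before the chosen horizon.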

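The Gauge type has six regular methods:

Set(float64)        // Sets the gauge to an arbitrary value.
Inc()               // Increments the gauge by 1.
Dec()               // Decrements the gauge by 1.
Add(float64)        // Adds the given value to the gauge; a negative value decreases the gauge.
Sub(float64)        // Subtracts the given value from the gauge; a negative value increases the gauge.
SetToCurrentTime()  // Sets the gauge to the current Unix time in seconds.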

Example:


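package main

import (
    "net/http"
    "strconv"

    "github.com/gin-gonic/gin"
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// QueueGauge holds the current length of each named queue.
var QueueGauge = prometheus.NewGaugeVec(
    prometheus.GaugeOpts{
        Name: "queue_num_total",
    },
    []string{"name"},
)

func init() {
    prometheus.MustRegister(QueueGauge)
}

func main() {
    engine := gin.Default() // assumed setup; the original elides this part
    engine.GET("/queue", func(c *gin.Context) {
        // Reject input that is not a number instead of silently storing 0.
        fnum, err := strconv.ParseFloat(c.Query("num"), 64)
        if err != nil {
            c.String(http.StatusBadRequest, "num must be a number")
            return
        }
        QueueGauge.With(prometheus.Labels{"name": "queue"}).Set(fnum)
    })
    engine.GET("/metrics", gin.WrapH(promhttp.Handler()))
    engine.Run(":10001")
}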

After visiting http://127.0.0.1:10001/queue?num=5, check the metrics again:


# HELP queue_num_total 
# TYPE queue_num_total gauge
queue_num_total{name="queue"} 5

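Analyzing Data Distribution with Histogram and Summary

Histogram (cumulative histogram)

A Histogram samples observations (typically request durations or response sizes) and counts them in configurable buckets. Samples can later be selected by bucket boundary, and the total number of observations and their sum are recorded as well.

Put simply, when configuring a Histogram we choose the bucket boundaries. To analyze response times, for example, we might choose boundaries at 100ms, 500ms, and 1000ms; the metrics endpoint then reports one series per boundary, each counting the observations less than or equal to that boundary.

The Histogram type has one regular method:

Observe(float64)    // Adds a single observation to the histogram.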

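Example:

Histograms are very common in practice because they record how observations are distributed across ranges. Distributed systems are now the norm, which makes tracing and per-dependency latency analysis essential: computing the P90, P95, and P99 of RPC, SQL, HTTP, and Redis calls, and alerting on them, lets you notice slow paths early and then locate and reduce the impact of third-party systems.

Here we simulate recording the response time of HTTP calls: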


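package main

import (
    "math/rand"

    "github.com/gin-gonic/gin"
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// HttpDurationsHistogram records request durations in seconds, per path.
var HttpDurationsHistogram = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "http_durations_histogram_seconds",
        Buckets: []float64{0.2, 0.5, 1, 2, 5, 10, 30},
    },
    []string{"path"},
)

func init() {
    prometheus.MustRegister(HttpDurationsHistogram)
}

func main() {
    engine := gin.Default() // assumed setup; the original elides this part
    engine.GET("/histogram", func(c *gin.Context) {
        // Simulated duration: a random whole number of seconds in [0, 30).
        simulated := float64(rand.Intn(30))
        HttpDurationsHistogram.With(prometheus.Labels{"path": c.Request.URL.Path}).Observe(simulated)
    })
    engine.GET("/metrics", gin.WrapH(promhttp.Handler()))
    engine.Run(":10001")
}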

Call http://127.0.0.1:10001/histogram several times, then check the metrics:


# HELP http_durations_histogram_seconds 
# TYPE http_durations_histogram_seconds histogram
http_durations_histogram_seconds_bucket{path="/histogram",le="0.2"} 1
http_durations_histogram_seconds_bucket{path="/histogram",le="0.5"} 1
http_durations_histogram_seconds_bucket{path="/histogram",le="1"} 3
http_durations_histogram_seconds_bucket{path="/histogram",le="2"} 3
http_durations_histogram_seconds_bucket{path="/histogram",le="5"} 3
http_durations_histogram_seconds_bucket{path="/histogram",le="10"} 3
http_durations_histogram_seconds_bucket{path="/histogram",le="30"} 13
http_durations_histogram_seconds_bucket{path="/histogram",le="+Inf"} 13
http_durations_histogram_seconds_sum{path="/histogram"} 191
http_durations_histogram_seconds_count{path="/histogram"} 13

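Looking at the histogram output, it has three parts:

http_durations_histogram_seconds_bucket: there are 8 buckets, with upper bounds (the le label) of 0.2s, 0.5s, 1s, 2s, 5s, 10s, 30s, and +Inf. The first seven are the values we set in HistogramOpts.Buckets; +Inf is added automatically. The buckets are cumulative: each one counts every observation less than or equal to its upper bound, so le="1" also includes everything counted in le="0.5", and le="+Inf" always equals the total count.
http_durations_histogram_seconds_sum: the sum of all observed durations.
http_durations_histogram_seconds_count: the total number of calls.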
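The cumulative behaviour of the _bucket series can be sketched with a small stand-in type. This is not the client library's implementation: the histogram type, newHistogram, observe, and cumulative below are hypothetical names, and the observations in main are made up.

```go
package main

import (
	"fmt"
	"sort"
)

// histogram is a simplified, single-goroutine stand-in for a Prometheus histogram.
type histogram struct {
	bounds []float64 // sorted upper bounds, without +Inf
	counts []uint64  // per-bucket (non-cumulative) counts; the last slot is +Inf
	sum    float64   // sum of all observed values (the _sum series)
	count  uint64    // number of observations (the _count series)
}

func newHistogram(bounds []float64) *histogram {
	return &histogram{bounds: bounds, counts: make([]uint64, len(bounds)+1)}
}

// observe records v in the first bucket whose upper bound is >= v,
// or in the +Inf slot when v exceeds every bound.
func (h *histogram) observe(v float64) {
	i := sort.SearchFloat64s(h.bounds, v)
	h.counts[i]++
	h.sum += v
	h.count++
}

// cumulative returns the counts in the form exposed as _bucket series:
// each entry includes every observation at or below its upper bound.
func (h *histogram) cumulative() []uint64 {
	out := make([]uint64, len(h.counts))
	var running uint64
	for i, c := range h.counts {
		running += c
		out[i] = running
	}
	return out
}

func main() {
	h := newHistogram([]float64{0.2, 0.5, 1, 2, 5, 10, 30})
	for _, v := range []float64{0, 1, 1, 12, 29, 45} {
		h.observe(v)
	}
	fmt.Println(h.cumulative(), h.sum, h.count)
	// prints: [1 1 3 3 3 3 5 6] 88 6
}
```

Because the exposed counts are cumulative, a query for "requests faster than 1s" only needs the single le="1" series rather than a sum over several ranges.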

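Histogram is a subtle type. The bucket boundaries must be chosen to fit the real distribution of your data; otherwise observations pile up unevenly in a few buckets, PXX values (P95, P99, and so on) become inaccurate, and Grafana graphs are skewed as well. It pays to understand the theory first and then choose the boundaries, because changing them later is troublesome.

We can also divide http_durations_histogram_seconds_sum by http_durations_histogram_seconds_count to get the average duration; for the output above that is 191 / 13 ≈ 14.7s.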

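Summary

A Summary also samples observations, but unlike a Histogram it stores quantiles, computed on the client side, instead of counting observations into configured buckets.

After sampling, a Summary exposes three kinds of series:

The quantiles of the observed values.
The sum of all observed values.
The number of observations.

The Summary type has one regular method:

Observe(float64)    // Adds a single observation to the summary.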

Example:


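package main

import (
    "math/rand"

    "github.com/gin-gonic/gin"
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// HttpDurations tracks request durations in seconds, per path.
// Objectives maps each target quantile to its allowed absolute error.
var HttpDurations = prometheus.NewSummaryVec(
    prometheus.SummaryOpts{
        Name:       "http_durations_seconds",
        Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
    },
    []string{"path"},
)

func init() {
    prometheus.MustRegister(HttpDurations)
}

func main() {
    engine := gin.Default() // assumed setup; the original elides this part
    engine.GET("/summary", func(c *gin.Context) {
        // Simulated duration: a random whole number of seconds in [0, 30).
        simulated := float64(rand.Intn(30))
        HttpDurations.With(prometheus.Labels{"path": c.Request.URL.Path}).Observe(simulated)
    })
    engine.GET("/metrics", gin.WrapH(promhttp.Handler()))
    engine.Run(":10001")
}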

Call http://127.0.0.1:10001/summary several times, then check the metrics:


# HELP http_durations_seconds 
# TYPE http_durations_seconds summary
http_durations_seconds{path="/summary",quantile="0.5"} 17
http_durations_seconds{path="/summary",quantile="0.9"} 29
http_durations_seconds{path="/summary",quantile="0.99"} 29
http_durations_seconds_sum{path="/summary"} 85
http_durations_seconds_count{path="/summary"} 5

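The summary output likewise has three parts:

http_durations_seconds: the median (0.5), the 90th percentile (0.9), and the 99th percentile (0.99), matching the quantiles we defined in SummaryOpts.Objectives. Here the median duration is 17s, the 90th percentile is 29s, and the 99th percentile is 29s.
http_durations_seconds_sum: the sum of all observed durations.
http_durations_seconds_count: the total number of calls.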

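Timer

Timer is a helper type for timing. It measures a duration and passes it, in seconds, to the Observer it was created with.

A Timer is typically used to time a function call like this: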

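func TimeMe() {
    // myHistogram is any Observer, for example a registered Histogram.
    timer := prometheus.NewTimer(myHistogram)
    // Runs when TimeMe returns and observes the elapsed time.
    defer timer.ObserveDuration()
    // Do actual work.
}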

The Timer type has two functions:

func NewTimer(o Observer) *Timer                  // Creates a new Timer that reports to the given Observer.
func (t *Timer) ObserveDuration() time.Duration   // Observes the time elapsed since NewTimer was called (in seconds) and returns it.

Example:


package main

import (
    "math/rand"
    "time"

    "github.com/prometheus/client_golang/prometheus"
)

var (
    requestDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
        Name:    "example_request_duration_seconds",
        Help:    "Histogram for the runtime of a simple example function.",
        Buckets: prometheus.LinearBuckets(0.01, 0.01, 10),
    })

    funcDuration = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "example_function_duration_seconds",
        Help: "Duration of the last call of an example function.",
    })
)

func main() {
    // timer1 adds the elapsed time to the histogram as one observation.
    timer1 := prometheus.NewTimer(requestDuration)
    defer timer1.ObserveDuration()

    // timer2 stores the elapsed time in the gauge; ObserverFunc adapts
    // the gauge's Set method to the Observer interface.
    timer2 := prometheus.NewTimer(prometheus.ObserverFunc(funcDuration.Set))
    defer timer2.ObserveDuration()

    // Do something here that takes time.
    time.Sleep(time.Duration(rand.NormFloat64()*10000+50000) * time.Microsecond)
}

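Recap

In this section we introduced and tried out the four Prometheus metric types: Counter, Gauge, Histogram, and Summary. Each one is representative of a kind of measurement: a Counter is a monotonically increasing count, a Gauge is a value that can be set freely up or down, a Histogram counts observations per bucket, and a Summary reports quantiles.

Histogram and Summary are similar to a degree, because both can yield quantiles: for a Histogram the histogram_quantile function computes them, and a Summary provides them directly. The difference is where the computation happens. With a Histogram the quantiles are computed on the server from the exposed bucket counts; with a Summary they are computed in the client before being exposed. One exposes finished quantiles and the other exposes raw bucket counts, so the data has different dimensions and supports different uses. Each has pros and cons, and the choice depends on the metric in question.

Finally, we looked separately at Timer, the helper type for timing, which is mainly used to measure durations.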
