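Aggregation Analysis

Introduction to Aggregations

What Is Aggregation?

Aggregation analysis is an important database feature, for example finding the maximum, minimum, average, or sum of a field.

Computing max, min, avg, sum, and so on over a data set is called a metric aggregation in ES.

A relational database can also group the queried rows with Group By and then apply max, min, and so on to each group; in ES this is called a bucket aggregation (bucketing).

ES also provides matrix aggregations, pipeline aggregations, and others.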

Aggregation Syntax

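{
    "aggregations": {       // aggregation keyword; can be shortened to aggs
        "<AGG_NAME>": {     // aggregation name
            "<AGG_TYPE>": { // aggregation type
                <AGG_BODY>  // aggregation body: which fields to aggregate on
            }
            [, "aggregations": {[<SUB_AGGREGATION>]+ }]?    // sub-aggregations nested in this aggregation
            [, "meta": {[<META_DATA_BODY>]}]?               // metadata attached to this aggregation
        }
        [, "<AGG_NAME_2>": {...}]*  // other aggregations, 0 or N
    }
}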

*: 0 or N, +: 1 or N, ?: 0 or 1.
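
As a concrete instance of the template above, the following sketch groups the bank accounts by state and computes the average balance per group. The aggregation names `by_state` and `avg_balance` and the `meta` content are hypothetical; the fields `state.keyword` and `balance` are assumed to exist in the `bank` index, as in the other examples on this page. The `//` comments are annotations only and should be removed before sending the body to `POST /bank/_search`.

```json
{
    "size": 0,
    "aggs": {
        "by_state": {                                   // <AGG_NAME> (hypothetical)
            "terms": {"field": "state.keyword"},        // <AGG_TYPE> and <AGG_BODY>
            "aggs": {                                   // sub-aggregation, computed once per bucket
                "avg_balance": {                        // hypothetical name
                    "avg": {"field": "balance"}
                }
            },
            "meta": {"purpose": "illustration only"}    // free-form metadata, echoed back with the result
        }
    }
}
```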

Sources of Aggregated Values

An aggregation can compute over the values of a field or over the result of a script.

Metric Aggregations

max/min/sum/avg

Find the maximum balance across all customers
POST /bank/_search HTTP/1.1
Content-Type: application/json

{
    "size": 0, 
    "aggs": {
        "maxbalance": {
            "max": {"field": "balance"}
        }
    }
}
{
    ...
    "hits": {
        "total": 1000,
        "max_score": 0,
        "hits": []
    },
    "aggregations": {   // the aggregation result is here
        "maxbalance": {
            "value": 49989
        }
    }
}
Find the maximum balance among customers aged 24
POST /bank/_search HTTP/1.1
Content-Type: application/json

{
    "size": 2, 
    "query": {"match": {"age": 24}},
    "sort": [{"balance": {"order": "desc"}}],
    "aggs": {
        "max_balance": {
            "max": {"field": "balance"}
        }
    }
}
{
    "aggregations": {
        "max_balance": {
            "value": 48745
        }
    },
    ...
    "hits": {
        "total": 42,
        "max_score": null,
        "hits": [
            {
                ...
                "_source": {
                    "account_number": 697,
                    "balance": 48745,
                    "firstname": "Mallory",
                    "lastname": "Emerson",
                    "age": 24,
                    "gender": "F",
                    "address": "318 Dunne Court",
                    "employer": "Exoplode",
                    "email": "malloryemerson@exoplode.com",
                    "city": "Montura",
                    "state": "LA"
                },
                "sort": [48745]
            },
            {
                ...
                "_source": {
                    "account_number": 917,
                    "balance": 47782,
                    "firstname": "Parks",
                    "lastname": "Hurst",
                    "age": 24,
                    "gender": "M",
                    "address": "933 Cozine Avenue",
                    "employer": "Pyramis",
                    "email": "parkshurst@pyramis.com",
                    "city": "Lindcove",
                    "state": "GA"
                },
                "sort": [47782]
            }
        ]
    }
}
Values from a script: compute the average age of all customers, and the average age plus 10

POST /bank/_search?size=0 HTTP/1.1
Content-Type: application/json

{
    "aggs": {
        "avg_age": {
            "avg": {"script": {"source": "doc.age.value"}}
        },
        "avg_age10": {
            "avg": {"script": {"source": "doc.age.value + 10"}}
        }
    }
}

{
    ...
    "hits": {
        "total": 1000,
        "max_score": 0,
        "hits": []
    },
    "aggregations": {
        "avg_age": {"value": 30.171},
        "avg_age10": {"value": 40.171}
    }
}

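count: number of documents

Value count: counts the values of a field (for a single-valued field, the number of documents that have a value)

cardinality: number of distinct values (an approximate count)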

Example
POST /bank/_search HTTP/1.1
Content-Type: application/json

{
    "aggs": {
        "age_count": {
            "cardinality": {"field": "age"}
        },
        "state_count": {
            "cardinality": {"field": "state.keyword"}
        }
    }
}
{
    ...
    "hits": {
        "total": 1000,
        "max_score": 0,
        "hits": []
    },
    "aggregations": {
        "state_count": {"value": 51},
        "age_count": { "value": 21}
    }
}
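
For comparison with `cardinality`, a `value_count` request might look like the sketch below, sent to `POST /bank/_search` like the requests above. The aggregation name `age_value_count` is hypothetical, and the `//` comment is an annotation to strip before sending. Where `cardinality` on `age` counts distinct ages, `value_count` counts every value present, so for a single-valued field it equals the number of documents that have the field.

```json
{
    "size": 0,
    "aggs": {
        "age_value_count": {                    // hypothetical aggregation name
            "value_count": {"field": "age"}
        }
    }
}
```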

stats: returns five values: count, max, min, avg, sum
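
A minimal `stats` request could look like the following sketch, assuming the same `bank` index and `balance` field used earlier; the aggregation name `balance_stats` is hypothetical, and the `//` comment should be removed before sending. The response is expected to carry `count`, `min`, `max`, `avg`, and `sum` under that name.

```json
{
    "size": 0,
    "aggs": {
        "balance_stats": {                      // hypothetical aggregation name
            "stats": {"field": "balance"}
        }
    }
}
```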

Extended stats

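Returns 4 more results than stats: sum of squares, variance, standard deviation, and the interval of the mean plus/minus two standard deviations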

Example
POST /bank/_search HTTP/1.1
Content-Type: application/json

{
    "aggs": {
        "age_stats": {
            "extended_stats": {"field": "age"}
        }
    }
}
{
    ...
    "hits": {
        "total": 1000,
        "max_score": 0,
        "hits": []
    },
    "aggregations": {
        "age_stats": {
            "count": 1000,
            "min": 20,
            "max": 40,
            "avg": 30.171,
            "sum": 30171,
            "sum_of_squares": 946393,
            "variance": 36.10375899999996,
            "std_deviation": 6.008640362012022,
            "std_deviation_bounds": {
                "upper": 42.18828072402404,
                "lower": 18.153719275975956
            }
        }
    }
}
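
The extra fields in this response can be cross-checked by hand from `count`, `sum`, and `sum_of_squares`. The worked arithmetic below assumes the population variance (dividing by `count`) and a bound of two standard deviations, which is what the numbers in the response above are consistent with.

```latex
\[
\begin{aligned}
\text{avg} &= \frac{\text{sum}}{\text{count}} = \frac{30171}{1000} = 30.171 \\
\text{variance} &= \frac{\text{sum\_of\_squares}}{\text{count}} - \text{avg}^2
                 = 946.393 - 910.289241 = 36.103759 \\
\text{std\_deviation} &= \sqrt{36.103759} \approx 6.00864 \\
\text{bounds} &= \text{avg} \pm 2 \cdot \text{std\_deviation}
               \approx 30.171 \pm 12.01728 = [18.15372,\ 42.18828]
\end{aligned}
\]
```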

Percentiles: computes the value found at each requested percentile
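
A `percentiles` request might be sketched as follows, again against the `bank` index. The aggregation name `balance_percentiles` and the chosen percentiles are hypothetical; `percents` is optional, and the `//` comments should be stripped before sending.

```json
{
    "size": 0,
    "aggs": {
        "balance_percentiles": {                // hypothetical aggregation name
            "percentiles": {
                "field": "balance",
                "percents": [50, 95, 99]        // illustrative choice of percentiles
            }
        }
    }
}
```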

Percentile ranks: computes the percentage of documents whose value is less than or equal to a given value
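
The inverse question, what share of accounts falls at or below a given balance, could be asked with a `percentile_ranks` request like the sketch below. The aggregation name `balance_ranks` and the two threshold values are hypothetical, picked only to fall inside the balance range seen in the earlier responses; remove the `//` comments before sending.

```json
{
    "size": 0,
    "aggs": {
        "balance_ranks": {                      // hypothetical aggregation name
            "percentile_ranks": {
                "field": "balance",
                "values": [10000, 30000]        // illustrative thresholds
            }
        }
    }
}
```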

Bucket Aggregations