ElasticSearch 2 Basic Operations (04: About Tokenization)

Getting a feel for it by now? Good. Let's step back and look at the most basic building blocks.

The common data types in ES are:

Type name               Data type
String                  string
Integer                 byte, short, integer, long
Floating point          float, double
Boolean                 boolean
Date                    date
Object                  object
Nested structure        nested
Geolocation (lat/lon)   geo_point
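
To make this concrete, here is a minimal sketch of a mapping that uses several of these types. The index name myindex matches the one used in the tests below, but the type name person and all field names are hypothetical:

# Hypothetical mapping sketch -- type and field names are illustrative only
curl -XPUT 'http://localhost:9200/myindex' -d '
{
    "mappings": {
        "person": {
            "properties": {
                "name":     { "type": "string" },
                "age":      { "type": "integer" },
                "score":    { "type": "double" },
                "active":   { "type": "boolean" },
                "birthday": { "type": "date" },
                "location": { "type": "geo_point" }
            }
        }
    }
}'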

The common analysis options for a field are:

Analysis option  Meaning
analyzed         Analyze the string first, then index it. In other words, index the field as full text.
not_analyzed     Index the field so it is searchable, but index the value exactly as given. Do not analyze it.
no               Do not index the field at all. It cannot be searched.
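
These options are set per field through the index property in the mapping (in ES 2.x, string fields default to analyzed). A minimal sketch, with a hypothetical index and hypothetical field names:

# Hypothetical sketch showing the three analysis options on string fields
curl -XPUT 'http://localhost:9200/blog' -d '
{
    "mappings": {
        "post": {
            "properties": {
                "body":   { "type": "string", "index": "analyzed" },
                "tag":    { "type": "string", "index": "not_analyzed" },
                "secret": { "type": "string", "index": "no" }
            }
        }
    }
}'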

Now let's put some analyzers to the test.

1. First, tokenize a sentence with the standard analyzer

curl -XPOST 'http://localhost:9200/_analyze?analyzer=standard&text=小明同学大吃一惊'

{
    "tokens": [
        {
            "token": "小",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<IDEOGRAPHIC>",
            "position": 0
        },
        {
            "token": "明",
            "start_offset": 1,
            "end_offset": 2,
            "type": "<IDEOGRAPHIC>",
            "position": 1
        },
        {
            "token": "同",
            "start_offset": 2,
            "end_offset": 3,
            "type": "<IDEOGRAPHIC>",
            "position": 2
        },
        {
            "token": "学",
            "start_offset": 3,
            "end_offset": 4,
            "type": "<IDEOGRAPHIC>",
            "position": 3
        },
        {
            "token": "大",
            "start_offset": 4,
            "end_offset": 5,
            "type": "<IDEOGRAPHIC>",
            "position": 4
        },
        {
            "token": "吃",
            "start_offset": 5,
            "end_offset": 6,
            "type": "<IDEOGRAPHIC>",
            "position": 5
        },
        {
            "token": "一",
            "start_offset": 6,
            "end_offset": 7,
            "type": "<IDEOGRAPHIC>",
            "position": 6
        },
        {
            "token": "惊",
            "start_offset": 7,
            "end_offset": 8,
            "type": "<IDEOGRAPHIC>",
            "position": 7
        }
    ]
}

2. Then compare with the IK analyzer. Note that the standard analyzer above split every Chinese character into its own token; IK recognizes actual words:

curl -XGET 'http://localhost:9200/_analyze?analyzer=ik&text=小明同学大吃一惊'

{
    "tokens": [
        {
            "token": "小明",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "同学",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "大吃一惊",
            "start_offset": 4,
            "end_offset": 8,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "大吃",
            "start_offset": 4,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "吃",
            "start_offset": 5,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 4
        },
        {
            "token": "一惊",
            "start_offset": 6,
            "end_offset": 8,
            "type": "CN_WORD",
            "position": 5
        },
        {
            "token": "一",
            "start_offset": 6,
            "end_offset": 7,
            "type": "TYPE_CNUM",
            "position": 6
        },
        {
            "token": "惊",
            "start_offset": 7,
            "end_offset": 8,
            "type": "CN_CHAR",
            "position": 7
        }
    ]
}

3. Analyze text against the 家庭住址 (home address) field
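
Passing field= asks ES to analyze the text with whatever analyzer that field is mapped to, so the field must already exist in the mapping. A minimal sketch of how such a mapping might look -- the analyzer choices here are an assumption inferred from the outputs below (家庭住址 behaves like an IK-analyzed field, 性别 like a not_analyzed one), and the type name user is hypothetical:

# Assumed mapping behind the next two tests -- not shown in the original
curl -XPUT 'http://localhost:9200/myindex' -d '
{
    "mappings": {
        "user": {
            "properties": {
                "家庭住址": { "type": "string", "analyzer": "ik" },
                "性别":     { "type": "string", "index": "not_analyzed" }
            }
        }
    }
}'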

curl -XGET 'http://localhost:9200/myindex/_analyze?field=家庭住址&text=我爱北京天安门'

{
    "tokens": [
        {
            "token": "我",
            "start_offset": 0,
            "end_offset": 1,
            "type": "CN_CHAR",
            "position": 0
        },
        {
            "token": "爱",
            "start_offset": 1,
            "end_offset": 2,
            "type": "CN_CHAR",
            "position": 1
        },
        {
            "token": "北京",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "京",
            "start_offset": 3,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "天安门",
            "start_offset": 4,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 4
        },
        {
            "token": "天安",
            "start_offset": 4,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 5
        },
        {
            "token": "门",
            "start_offset": 6,
            "end_offset": 7,
            "type": "CN_CHAR",
            "position": 6
        }
    ]
}

4. Analyze the same text against the 性别 (gender) field

curl -XGET 'http://localhost:9200/myindex/_analyze?field=性别&text=我爱北京天安门'

{
    "tokens": [
        {
            "token": "我爱北京天安门",
            "start_offset": 0,
            "end_offset": 7,
            "type": "word",
            "position": 0
        }
    ]
}
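
The whole sentence comes back as a single token of type word, which is exactly what a not_analyzed field does: the value is indexed verbatim. If you are ever unsure how a field is analyzed, you can inspect the index mapping directly:

# Check which analyzer / index option each field actually uses
curl -XGET 'http://localhost:9200/myindex/_mapping?pretty'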

As you can see, different analyzers target different languages and use cases, so choose one that fits your data.
Beyond that, picking the right analysis option and analyzer for each individual field will get you twice the result for half the effort.
