Getting an initial feel for it? Good. Now let's step back and look at the fundamentals.

The common data types in ES are:
| Type | Data type |
| --- | --- |
| String | string |
| Integer | byte, short, integer, long |
| Floating point | float, double |
| Boolean | boolean |
| Date | date |
| Object | object |
| Nested structure | nested |
| Geo point (latitude/longitude) | geo_point |
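To make the table concrete, here is a minimal mapping sketch that uses several of these types. The index name `demo`, type name `goods`, and all field names are hypothetical, and the syntax targets the pre-5.x era this article describes (where `string` still exists):

```
curl -XPUT 'http://localhost:9200/demo' -d '
{
  "mappings": {
    "goods": {
      "properties": {
        "name":     { "type": "string" },
        "stock":    { "type": "integer" },
        "price":    { "type": "double" },
        "on_sale":  { "type": "boolean" },
        "created":  { "type": "date" },
        "location": { "type": "geo_point" }
      }
    }
  }
}'
```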
The common field analysis options are:

| Option | Meaning |
| --- | --- |
| analyzed | Analyze the string first, then index it. In other words, index the field as full text. |
| not_analyzed | Index the field so it is searchable, but index the exact value as given. Do not analyze it. |
| no | Do not index the field. It cannot be searched. |
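As a sketch of how the three options appear in a mapping (the index name `demo2`, type name `user`, and field names are all hypothetical):

```
curl -XPUT 'http://localhost:9200/demo2' -d '
{
  "mappings": {
    "user": {
      "properties": {
        "bio":    { "type": "string", "index": "analyzed" },
        "tag":    { "type": "string", "index": "not_analyzed" },
        "secret": { "type": "string", "index": "no" }
      }
    }
  }
}'
```

Here `bio` is tokenized for full-text search, `tag` is searchable only by its exact value, and `secret` is stored but not searchable at all.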
Next, let's test the analyzers.

1. First, tokenize with the standard analyzer:
```
curl -XPOST 'http://localhost:9200/_analyze?analyzer=standard&text=小明同学大吃一惊'
{
  "tokens": [
    { "token": "小", "start_offset": 0, "end_offset": 1, "type": "<IDEOGRAPHIC>", "position": 0 },
    { "token": "明", "start_offset": 1, "end_offset": 2, "type": "<IDEOGRAPHIC>", "position": 1 },
    { "token": "同", "start_offset": 2, "end_offset": 3, "type": "<IDEOGRAPHIC>", "position": 2 },
    { "token": "学", "start_offset": 3, "end_offset": 4, "type": "<IDEOGRAPHIC>", "position": 3 },
    { "token": "大", "start_offset": 4, "end_offset": 5, "type": "<IDEOGRAPHIC>", "position": 4 },
    { "token": "吃", "start_offset": 5, "end_offset": 6, "type": "<IDEOGRAPHIC>", "position": 5 },
    { "token": "一", "start_offset": 6, "end_offset": 7, "type": "<IDEOGRAPHIC>", "position": 6 },
    { "token": "惊", "start_offset": 7, "end_offset": 8, "type": "<IDEOGRAPHIC>", "position": 7 }
  ]
}
```

Note that the URL must be quoted, otherwise the shell interprets the `&` and truncates the request.
2. Now compare with the IK analyzer:
```
curl -XGET 'http://localhost:9200/_analyze?analyzer=ik&text=小明同学大吃一惊'
{
  "tokens": [
    { "token": "小明", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0 },
    { "token": "同学", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 1 },
    { "token": "大吃一惊", "start_offset": 4, "end_offset": 8, "type": "CN_WORD", "position": 2 },
    { "token": "大吃", "start_offset": 4, "end_offset": 6, "type": "CN_WORD", "position": 3 },
    { "token": "吃", "start_offset": 5, "end_offset": 6, "type": "CN_WORD", "position": 4 },
    { "token": "一惊", "start_offset": 6, "end_offset": 8, "type": "CN_WORD", "position": 5 },
    { "token": "一", "start_offset": 6, "end_offset": 7, "type": "TYPE_CNUM", "position": 6 },
    { "token": "惊", "start_offset": 7, "end_offset": 8, "type": "CN_CHAR", "position": 7 }
  ]
}
```

The standard analyzer splits Chinese text into single characters, while IK produces real words, which is far more useful for search.
3. Test analysis through the 家庭住址 (home address) field:
```
curl -XGET 'http://localhost:9200/myindex/_analyze?field=家庭住址&text=我爱北京天安门'
{
  "tokens": [
    { "token": "我", "start_offset": 0, "end_offset": 1, "type": "CN_CHAR", "position": 0 },
    { "token": "爱", "start_offset": 1, "end_offset": 2, "type": "CN_CHAR", "position": 1 },
    { "token": "北京", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 2 },
    { "token": "京", "start_offset": 3, "end_offset": 4, "type": "CN_WORD", "position": 3 },
    { "token": "天安门", "start_offset": 4, "end_offset": 7, "type": "CN_WORD", "position": 4 },
    { "token": "天安", "start_offset": 4, "end_offset": 6, "type": "CN_WORD", "position": 5 },
    { "token": "门", "start_offset": 6, "end_offset": 7, "type": "CN_CHAR", "position": 6 }
  ]
}
```
4. Test analysis through the 性别 (gender) field:
```
curl -XGET 'http://localhost:9200/myindex/_analyze?field=性别&text=我爱北京天安门'
{
  "tokens": [
    { "token": "我爱北京天安门", "start_offset": 0, "end_offset": 7, "type": "word", "position": 0 }
  ]
}
```
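The contrast between the last two results comes from the field mappings: 家庭住址 goes through the ik analyzer, while 性别 is not analyzed at all, so its entire value is kept as one token. A mapping sketch that would produce this behavior (the type name `person` is an assumption, since the original mapping isn't shown):

```
curl -XPUT 'http://localhost:9200/myindex' -d '
{
  "mappings": {
    "person": {
      "properties": {
        "家庭住址": { "type": "string", "analyzer": "ik" },
        "性别":     { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}'
```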
As you can see, different analyzers target different use cases and languages, so choose one that fits your data. Beyond that, picking the appropriate analysis option and analyzer for each individual field, as the last two examples show, will get you twice the result for half the effort.