Getting a feel for it? Now let's step back and look at the basics.
The common data types in ES are:
| Category | Data type(s) |
| --- | --- |
| String | string |
| Integer | byte, short, integer, long |
| Floating point | float, double |
| Boolean | boolean |
| Date | date |
| Object | object |
| Nested structure | nested |
| Geolocation (latitude/longitude) | geo_point |
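As a sketch of how these types appear in practice (the index, type, and field names below are made up for illustration), a mapping might declare fields like this:

```json
{
  "mappings": {
    "user": {
      "properties": {
        "name":     { "type": "string" },
        "age":      { "type": "integer" },
        "score":    { "type": "double" },
        "active":   { "type": "boolean" },
        "birthday": { "type": "date" },
        "location": { "type": "geo_point" }
      }
    }
  }
}
```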
The common analysis options for a field (the `index` setting) are:
| Analysis option | Meaning |
| --- | --- |
| analyzed | Analyze the string first, then index it. In other words, index this field as full text. |
| not_analyzed | Index the field so it is searchable, but index exactly the value given. Do not analyze it. |
| no | Do not index this field at all; it cannot be searched. |
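The three options above are set per field in the mapping. A minimal sketch (field names here are invented for illustration) might look like:

```json
{
  "mappings": {
    "user": {
      "properties": {
        "title":    { "type": "string", "index": "analyzed" },
        "tag":      { "type": "string", "index": "not_analyzed" },
        "internal": { "type": "string", "index": "no" }
      }
    }
  }
}
```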
Now let's try out the analyzers.
1. First, tokenize with the standard analyzer:
curl -XPOST 'http://localhost:9200/_analyze?analyzer=standard&text=小明同学大吃一惊'
{
"tokens": [
{
"token": "小",
"start_offset": 0,
"end_offset": 1,
"type": "<IDEOGRAPHIC>",
"position": 0
},
{
"token": "明",
"start_offset": 1,
"end_offset": 2,
"type": "<IDEOGRAPHIC>",
"position": 1
},
{
"token": "同",
"start_offset": 2,
"end_offset": 3,
"type": "<IDEOGRAPHIC>",
"position": 2
},
{
"token": "学",
"start_offset": 3,
"end_offset": 4,
"type": "<IDEOGRAPHIC>",
"position": 3
},
{
"token": "大",
"start_offset": 4,
"end_offset": 5,
"type": "<IDEOGRAPHIC>",
"position": 4
},
{
"token": "吃",
"start_offset": 5,
"end_offset": 6,
"type": "<IDEOGRAPHIC>",
"position": 5
},
{
"token": "一",
"start_offset": 6,
"end_offset": 7,
"type": "<IDEOGRAPHIC>",
"position": 6
},
{
"token": "惊",
"start_offset": 7,
"end_offset": 8,
"type": "<IDEOGRAPHIC>",
"position": 7
}
]
}
2. Then compare with the IK analyzer:
curl -XGET 'http://localhost:9200/_analyze?analyzer=ik&text=小明同学大吃一惊'
{
"tokens": [
{
"token": "小明",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "同学",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 1
},
{
"token": "大吃一惊",
"start_offset": 4,
"end_offset": 8,
"type": "CN_WORD",
"position": 2
},
{
"token": "大吃",
"start_offset": 4,
"end_offset": 6,
"type": "CN_WORD",
"position": 3
},
{
"token": "吃",
"start_offset": 5,
"end_offset": 6,
"type": "CN_WORD",
"position": 4
},
{
"token": "一惊",
"start_offset": 6,
"end_offset": 8,
"type": "CN_WORD",
"position": 5
},
{
"token": "一",
"start_offset": 6,
"end_offset": 7,
"type": "TYPE_CNUM",
"position": 6
},
{
"token": "惊",
"start_offset": 7,
"end_offset": 8,
"type": "CN_CHAR",
"position": 7
}
]
}
3. Test analysis against the "家庭住址" (home address) field:
curl -XGET 'http://localhost:9200/myindex/_analyze?field=家庭住址&text=我爱北京天安门'
{
"tokens": [
{
"token": "我",
"start_offset": 0,
"end_offset": 1,
"type": "CN_CHAR",
"position": 0
},
{
"token": "爱",
"start_offset": 1,
"end_offset": 2,
"type": "CN_CHAR",
"position": 1
},
{
"token": "北京",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 2
},
{
"token": "京",
"start_offset": 3,
"end_offset": 4,
"type": "CN_WORD",
"position": 3
},
{
"token": "天安门",
"start_offset": 4,
"end_offset": 7,
"type": "CN_WORD",
"position": 4
},
{
"token": "天安",
"start_offset": 4,
"end_offset": 6,
"type": "CN_WORD",
"position": 5
},
{
"token": "门",
"start_offset": 6,
"end_offset": 7,
"type": "CN_CHAR",
"position": 6
}
]
}
4. Test analysis against the "性别" (gender) field:
curl -XGET 'http://localhost:9200/myindex/_analyze?field=性别&text=我爱北京天安门'
{
"tokens": [
{
"token": "我爱北京天安门",
"start_offset": 0,
"end_offset": 7,
"type": "word",
"position": 0
}
]
}
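The mapping of `myindex` is not shown here, but the two results are consistent with something like the following sketch (the type name `mytype` is hypothetical): 家庭住址 is analyzed with IK, so the text is split into words, while 性别 is not_analyzed, so the whole input comes back as a single token.

```json
{
  "mappings": {
    "mytype": {
      "properties": {
        "家庭住址": { "type": "string", "analyzer": "ik" },
        "性别":     { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}
```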
As you can see, different analyzers target different use cases and different languages, so choose the one that fits your data.
Likewise, picking the right analysis option and analyzer for each individual field will get you twice the result for half the effort.