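Searching in ES does not work like LIKE in a relational database. Results are matched and ranked by the relevance between the search criteria and each document.
For each search, every matching document carries a floating-point field, _score, that expresses how relevant the document is to the query. The higher the _score, the more relevant the document.
How the score is calculated depends on the query type: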
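A fuzzy query scores by how similar the spelling of the found word is to the search term.
A terms query scores by the percentage of the search terms that were found.
Full-text search scores by how similar the content of the field is to the query string.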
ES uses TF/IDF (Term Frequency/Inverse Document Frequency) as its relevance measure, which depends on the following three factors:
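Term frequency (TF): within a record, the more often a search term appears in the queried field, the higher the relevance. For example, if a search term appears four times in the field of the first record and three times in the second, the first record's TF is somewhat higher.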
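Inverse document frequency (IDF): the more often a search term appears in this field across all documents, the less weight that term carries. For example, of five search terms, if one appears in every document while another appears in only one, the term found everywhere can almost be ignored, and the term that appears only once gets a very high weight.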
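Field length: within a record, the longer the queried field, the lower the relevance. For example, if the field is 10 words long in one record and 100 words long in another, and a keyword appears once in each, the 10-word record is considerably more relevant than the 100-word one.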
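The three factors can be sketched in a few lines of Python. This is a minimal illustration, assuming Lucene's classic TF/IDF similarity formulas (square-root term frequency, logarithmic IDF, inverse-square-root length norm). The function names are made up for this example, and real field norms are stored in a lossy encoding, so values reported by ES can differ from what this sketch computes.

```python
import math


def tf(freq: float) -> float:
    """Term frequency: grows with the square root of the raw count."""
    return math.sqrt(freq)


def idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency: rarer terms get a larger weight."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_norm(num_terms: int) -> float:
    """Field-length norm: shorter fields get a larger weight."""
    return 1.0 / math.sqrt(num_terms)


# A term that occurs four times in a field vs. one that occurs once.
print(tf(4), tf(1))
# In a 1000-document index: a term found in every document vs. in one document.
print(idf(1000, 1000), idf(1, 1000))
# A 10-term field vs. a 100-term field, each containing the term once.
print(field_norm(10), field_norm(100))
```

Under these assumptions, idf(1, 2) evaluates to 1, which agrees with the idf(docFreq=1, maxDocs=2) entries that ES reports when asked to explain a score.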
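Understanding TF/IDF lets you explain results that seem as if they should not have appeared. Keep in mind that this is not an exact-match algorithm but a scoring algorithm: results are ranked by relevance.
If the scores look unreasonable, you can use the following request to inspect how they were calculated: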
# Explain how the query is scored
curl -XPOST 'http://127.0.0.1:9200/myindex/user/_search?explain' -d '
{
  "query": { "match": { "家庭住址": "魔都大街" } }
}'
# The result:
{
  "took": 7,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 4,
    "max_score": 4,
    "hits": [
      {
        "_shard": 4,
        "_node": "5Tv2a5YaQDqmzUFbTp4iaw",
        "_index": "myindex",
        "_type": "user",
        "_id": "u002",
        "_score": 4,
        "_source": {
          "用户ID": "u002",
          "姓名": "李四",
          "性别": "男",
          "年龄": "25",
          "家庭住址": "上海市闸北区魔都大街007号",
          "注册时间": "2015-02-01 08:30:00"
        },
        "_explanation": { "value": 4, "description": "sum of:", "details": [
          { "value": 4, "description": "sum of:", "details": [
            { "value": 1, "description": "weight(家庭住址:魔 in 0) [PerFieldSimilarity], result of:", "details": [
              { "value": 1, "description": "score(doc=0,freq=1.0), product of:", "details": [
                { "value": 0.5, "description": "queryWeight, product of:", "details": [
                  { "value": 1, "description": "idf(docFreq=1, maxDocs=2)", "details": [] },
                  { "value": 0.5, "description": "queryNorm", "details": [] }
                ] },
                { "value": 2, "description": "fieldWeight in 0, product of:", "details": [
                  { "value": 1, "description": "tf(freq=1.0), with freq of:", "details": [
                    { "value": 1, "description": "termFreq=1.0", "details": [] }
                  ] },
                  { "value": 1, "description": "idf(docFreq=1, maxDocs=2)", "details": [] },
                  { "value": 2, "description": "fieldNorm(doc=0)", "details": [] }
                ] }
              ] }
            ] },
            { "value": 1, "description": "weight(家庭住址:都 in 0) [PerFieldSimilarity], result of:", "details": [
              { "value": 1, "description": "score(doc=0,freq=1.0), product of:", "details": [
                { "value": 0.5, "description": "queryWeight, product of:", "details": [
                  { "value": 1, "description": "idf(docFreq=1, maxDocs=2)", "details": [] },
                  { "value": 0.5, "description": "queryNorm", "details": [] }
                ] },
                { "value": 2, "description": "fieldWeight in 0, product of:", "details": [
                  { "value": 1, "description": "tf(freq=1.0), with freq of:", "details": [
                    { "value": 1, "description": "termFreq=1.0", "details": [] }
                  ] },
                  { "value": 1, "description": "idf(docFreq=1, maxDocs=2)", "details": [] },
                  { "value": 2, "description": "fieldNorm(doc=0)", "details": [] }
                ] }
              ] }
            ] },
            { "value": 1, "description": "weight(家庭住址:大街 in 0) [PerFieldSimilarity], result of:", "details": [
              { "value": 1, "description": "score(doc=0,freq=1.0), product of:", "details": [
                { "value": 0.5, "description": "queryWeight, product of:", "details": [
                  { "value": 1, "description": "idf(docFreq=1, maxDocs=2)", "details": [] },
                  { "value": 0.5, "description": "queryNorm", "details": [] }
                ] },
                { "value": 2, "description": "fieldWeight in 0, product of:", "details": [
                  { "value": 1, "description": "tf(freq=1.0), with freq of:", "details": [
                    { "value": 1, "description": "termFreq=1.0", "details": [] }
                  ] },
                  { "value": 1, "description": "idf(docFreq=1, maxDocs=2)", "details": [] },
                  { "value": 2, "description": "fieldNorm(doc=0)", "details": [] }
                ] }
              ] }
            ] },
            { "value": 1, "description": "weight(家庭住址:街 in 0) [PerFieldSimilarity], result of:", "details": [
              { "value": 1, "description": "score(doc=0,freq=1.0), product of:", "details": [
                { "value": 0.5, "description": "queryWeight, product of:", "details": [
                  { "value": 1, "description": "idf(docFreq=1, maxDocs=2)", "details": [] },
                  { "value": 0.5, "description": "queryNorm", "details": [] }
                ] },
                { "value": 2, "description": "fieldWeight in 0, product of:", "details": [
                  { "value": 1, "description": "tf(freq=1.0), with freq of:", "details": [
                    { "value": 1, "description": "termFreq=1.0", "details": [] }
                  ] },
                  { "value": 1, "description": "idf(docFreq=1, maxDocs=2)", "details": [] },
                  { "value": 2, "description": "fieldNorm(doc=0)", "details": [] }
                ] }
              ] }
            ] }
          ] },
          { "value": 0, "description": "match on required clause, product of:", "details": [
            { "value": 0, "description": "# clause", "details": [] },
            { "value": 0.5, "description": "_type:user, product of:", "details": [
              { "value": 1, "description": "boost", "details": [] },
              { "value": 0.5, "description": "queryNorm", "details": [] }
            ] }
          ] }
        ] }
      },
      {
        "_shard": 0,
        "_node": "5Tv2a5YaQDqmzUFbTp4iaw",
        "_index": "myindex",
        "_type": "user",
        "_id": "u003",
        "_score": 0.71918744,
        "_source": {
          "用户ID": "u003",
          "姓名": "王五",
          "性别": "男",
          "年龄": "26",
          "家庭住址": "广州市花都区花城大街010号",
          "注册时间": "2015-03-01 08:30:00"
        },
        "_explanation": { "value": 0.71918744, "description": "sum of:", "details": [
          { "value": 0.71918744, "description": "product of:", "details": [
            { "value": 1.4383749, "description": "sum of:", "details": [
              { "value": 0.71918744, "description": "weight(家庭住址:大街 in 0) [PerFieldSimilarity], result of:", "details": [
                { "value": 0.71918744, "description": "score(doc=0,freq=1.0), product of:", "details": [
                  { "value": 0.35959372, "description": "queryWeight, product of:", "details": [
                    { "value": 1, "description": "idf(docFreq=1, maxDocs=2)", "details": [] },
                    { "value": 0.35959372, "description": "queryNorm", "details": [] }
                  ] },
                  { "value": 2, "description": "fieldWeight in 0, product of:", "details": [
                    { "value": 1, "description": "tf(freq=1.0), with freq of:", "details": [
                      { "value": 1, "description": "termFreq=1.0", "details": [] }
                    ] },
                    { "value": 1, "description": "idf(docFreq=1, maxDocs=2)", "details": [] },
                    { "value": 2, "description": "fieldNorm(doc=0)", "details": [] }
                  ] }
                ] }
              ] },
              { "value": 0.71918744, "description": "weight(家庭住址:街 in 0) [PerFieldSimilarity], result of:", "details": [
                { "value": 0.71918744, "description": "score(doc=0,freq=1.0), product of:", "details": [
                  { "value": 0.35959372, "description": "queryWeight, product of:", "details": [
                    { "value": 1, "description": "idf(docFreq=1, maxDocs=2)", "details": [] },
                    { "value": 0.35959372, "description": "queryNorm", "details": [] }
                  ] },
                  { "value": 2, "description": "fieldWeight in 0, product of:", "details": [
                    { "value": 1, "description": "tf(freq=1.0), with freq of:", "details": [
                      { "value": 1, "description": "termFreq=1.0", "details": [] }
                    ] },
                    { "value": 1, "description": "idf(docFreq=1, maxDocs=2)", "details": [] },
                    { "value": 2, "description": "fieldNorm(doc=0)", "details": [] }
                  ] }
                ] }
              ] }
            ] },
            { "value": 0.5, "description": "coord(2/4)", "details": [] }
          ] },
          { "value": 0, "description": "match on required clause, product of:", "details": [
            { "value": 0, "description": "# clause", "details": [] },
            { "value": 0.35959372, "description": "_type:user, product of:", "details": [
              { "value": 1, "description": "boost", "details": [] },
              { "value": 0.35959372, "description": "queryNorm", "details": [] }
            ] }
          ] }
        ] }
      },
      ......
    ]
  }
}
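As you can see, not only the record containing "魔都大街" was returned; every record containing "大街" was returned as well. The match query splits the query string into the terms 魔, 都, 大街 and 街, and any document that matches at least one of them is a hit. The explanation also tells you why "u002" ranks first: it matches all four terms, while "u003" matches only two.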
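The arithmetic in the explain output can be checked by hand. The sketch below is only an illustration: it copies the tf, idf, fieldNorm, and queryNorm values from the output above and recombines them the way the "product of:" and "sum of:" nodes describe. The helper names term_score and doc_score are made up for this example and are not part of any ES API.

```python
import math


def term_score(tf: float, idf: float, field_norm: float, query_norm: float) -> float:
    """Score of one weight(...) node: queryWeight times fieldWeight."""
    query_weight = idf * query_norm        # "queryWeight, product of:"
    field_weight = tf * idf * field_norm   # "fieldWeight in 0, product of:"
    return query_weight * field_weight


def doc_score(term_scores: list, total_terms: int) -> float:
    """Sum the matched terms, then scale by coord(matched/total)."""
    coord = len(term_scores) / total_terms
    return sum(term_scores) * coord


# u002 matches all four analyzed terms; its queryNorm in the output is 0.5.
u002 = doc_score([term_score(1, 1, 2, 0.5)] * 4, total_terms=4)
# u003 matches two of the four terms; its queryNorm in the output is 0.35959372.
u003 = doc_score([term_score(1, 1, 2, 0.35959372)] * 2, total_terms=4)

print(u002, u003)
```

The "match on required clause" node contributes 0 in both hits, so it is left out here. With these inputs, the sketch gives 4 for u002 and roughly 0.719 for u003, matching the reported _score values; the gap comes mainly from u003 matching only two of the four terms, which halves its score through coord(2/4).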
Another use is to have ES tell you what is wrong with a query:
curl -XPOST 'http://127.0.0.1:9200/myindex/user/_validate/query?explain' -d '
{
  "query": { "matchA": { "家庭住址": "魔都大街" } }
}'
# The result:
{
  "valid": false,
  "_shards": { "total": 1, "successful": 1, "failed": 0 },
  "explanations": [
    {
      "index": "myindex",
      "valid": false,
      "error": "org.elasticsearch.index.query.QueryParsingException: No query registered for [matchA]"
    }
  ]
}
ES tells you that the problem is at matchA.
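If you validate queries from a script, the error messages can be pulled out of the response programmatically. The following is a rough sketch that assumes the response has the same shape as the one shown above (a top-level "valid" flag and an "explanations" list). The function name validation_errors is hypothetical, and the sample string is simply the response from this article.

```python
import json


def validation_errors(response_text: str) -> list:
    """Return one 'index: error' string per index that rejected the query."""
    body = json.loads(response_text)
    if body.get("valid", False):
        return []
    return [
        f'{item.get("index")}: {item.get("error")}'
        for item in body.get("explanations", [])
        if not item.get("valid", False)
    ]


# Sample response copied from the _validate/query call in this article.
sample = """
{
  "valid": false,
  "_shards": { "total": 1, "successful": 1, "failed": 0 },
  "explanations": [
    {
      "index": "myindex",
      "valid": false,
      "error": "org.elasticsearch.index.query.QueryParsingException: No query registered for [matchA]"
    }
  ]
}
"""

for line in validation_errors(sample):
    print(line)
```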