TensorFlow入门02:Tensor

Tensorflow中所有的数据都称为Tensor,可以是一个变量、数组或者多维数组。Tensor 有几个重要的属性:
Rank:纬数,比如scalar rank=0, vector rank=1, matrix rank=2
Shape:形状,比如vector shape=[D0], matrix shape=[D0, D1]
类型:数据类型,比如tf.float32, tc.uint8等

Rank与Shape关系如下表所示

Rank Shape Dimension number Example
0 [] 0-D A 0-D tensor. A scalar.
1 [D0] 1-D A 1-D tensor with shape [5].
2 [D0, D1] 2-D A 2-D tensor with shape [3, 4].
3 [D0, D1, D2] 3-D A 3-D tensor with shape [1, 4, 3].
n [D0, D1, … Dn-1] n-D A tensor with shape [D0, D1, … Dn-1].

TensorFlow入门01:环境搭建

1、CPU版本安装

1.1 安装tensorflow

pip3 install --upgrade tensorflow

1.2 Python验证,看到版本信息就可以了

python3
>>> import tensorflow as tf
>>> print('Tensorflow version ', tf.__version__)

Tensorflow version  1.12.0

2、GPU版本安装(需要NVIDIA显卡)

2.1 检查驱动信息

nvidia-smi

Fri Nov 16 21:22:13 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.77                 Driver Version: 390.77                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   52C    P2    27W /  N/A |   5938MiB /  6078MiB |     22%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      7953      G   /usr/lib/xorg/Xorg                           126MiB |
|    0      8215      G   /usr/bin/gnome-shell                         109MiB |
|    0     13578    C+G   python3                                     5689MiB |
+-----------------------------------------------------------------------------+

2.2 安装CUDA

# 查看网站 https://developer.nvidia.com/cuda-90-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1704&target_type=runfilelocal
# 选择下载这个版本 Linux x86_64 Ubuntu 17.04 runfile
# 安装,但注意不要更新驱动
sudo chmod +x cuda_9.0.176_384.81_linux.run
./cuda_9.0.176_384.81_linux.run --override

2.3 安装CUDNN

# 查看网站 https://developer.nvidia.com/rdp/cudnn-download
# 选择下载这个版本 9.0 cuDNN Library for Linux
# 解压
tar -zxvf cudnn-9.0-linux-x64-v7.tgz
# 手工安装
sudo cp -P cuda/lib64/libcudnn* /usr/local/cuda-9.0/lib64/
sudo cp  cuda/include/cudnn.h /usr/local/cuda-9.0/include/
# 调整权限
sudo chmod a+r /usr/local/cuda-9.0/include/cudnn.h /usr/local/cuda-9.0/lib64/libcudnn*

2.3 安装libcupti-dev

sudo apt-get install libcupti-dev

2.4 修改.bashrc

# 增加下面两行
export PATH=/usr/local/cuda-9.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

2.5 安装tensorflow-gpu

pip3 install --upgrade tensorflow-gpu

2.6 Python验证,看到GPU就可以啦

Python3
>>> from tensorflow.python.client import device_lib
>>> device_lib.list_local_devices()

[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
...
incarnation: 2559160109308400478
physical_device_desc: "device: 0, name: GeForce GTX 1060 with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 6.1"
]

3、Docker方式安装
3.1 CPU版

# 运行tensorflow
docker run -it -p 8888:8888 gcr.io/tensorflow/tensorflow

3.2 GPU版

# 安装nvidia-docker
wget https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker_1.0.1-1_amd64.deb
sudo dpkg -i nvidia-docker*.deb

# 测试nvidia-docker,执行nvidia-smi命令
nvidia-docker run --rm nvidia/cuda nvidia-smi

# 运行tensorflow
nvidia-docker run -it -p 8888:8888 gcr.io/tensorflow/tensorflow:latest-gpu

4、编译CUDA Demo(非必须)

# 咱们选用的版本只支持到gcc6
apt-get install gcc-6 g++-6
ln -s /bin/gcc /bin/gcc-6

# 安装libmpich-dev
sudo apt-get install libmpich-dev


# 切换路径
cd PATH_TO_DEMO

# 编译
make

知识图谱03:JENA

1、下载apache-jena-fuseki和apache-jena
https://jena.apache.org/download/index.cgi

2、将上一篇教程的nt文件转换为tdb格式

cd apache-jena-3.9.0\bat
tdbloader.bat --loc="PATH_TO_TDB\tdb" "PATH_TO_NT\movies_mapping.nt"

3、切换到apache-jena-fuseki-3.9.0目录,启动一次服务,然后退出

4、将教程1里面的Movies.owl,拷贝到apache-jena-fuseki-3.9.0\run\databases路径下面,并重命名为Movies.ttl

5、创建配置文件apache-jena-fuseki-3.9.0\run\configuration\fuseki_conf.ttl

@prefix fuseki: <http://jena.apache.org/fuseki#> . 
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> . 
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> . 
@prefix tdb: <http://jena.hpl.hp.com/2008/tdb#> . 
@prefix ja: <http://jena.hpl.hp.com/2005/11/Assembler#> . 
@prefix : <http://base/#> . 

<#service> rdf:type fuseki:Service ; 
    fuseki:name "movies" ;
    fuseki:serviceQuery "sparql" ;
    fuseki:dataset <#dataset> ; 
    fuseki:serviceReadGraphStore      "get" ;
    fuseki:serviceReadWriteGraphStore "data" ;
    fuseki:serviceUpdate              "update" ;
    fuseki:serviceUpload              "upload"
    . 

<#dataset> rdf:type ja:RDFDataset ;
	ja:defaultGraph <#modelInf> ;
	.

<#modelInf> 
    rdf:type ja:InfModel ;
    #ja:reasoner [ja:reasonerURL <http://jena.hpl.hp.com/2003/OWLFBRuleReasoner>]  
    ja:reasoner [ 
        ja:reasonerURL <http://jena.hpl.hp.com/2003/GenericRuleReasoner> ; 
        ja:rulesFrom <file:///D:/ProjectsMy/KG/apache-jena-fuseki-3.9.0/run/databases/Rules.ttl> ] ; 
    ja:baseModel <#baseModel> ; 
    . 

<#baseModel> rdf:type tdb:GraphTDB ; 
    tdb:location "D:/ProjectsMy/KG/workspace/data/tdb" ; 
    tdb:unionDefaultGraph true ; 
    .

6、创建规则文件apache-jena-fuseki-3.9.0\run\databases\Movies.ttl
这个规则规定了,演过喜剧的演员,叫做喜剧演员(Comedian)

@prefix xsd: <XML Schema> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix : <http://www.neohope.com/hansen/ontologies/2018/movies#> .

[ruleComedian: (?aPerson :hasActedIn ?aMovie) (?aMovie :hasGenre ?aGenre) (?aGenre :genreName '喜剧') -> (?aPerson rdf:type :Comedian)]
[ruleInverse: (?aPerson :hasActedIn ?aMove) -> (?aMovie :hasActor ?aPerson)]

7、启动apache-jena-fuseki-3.9.0

8、访问http://localhost:3030/

9、进行查询,上一篇的例子也都可以用
http://localhost:3030/dataset.html?tab=query&ds=/movies

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix owl: <http://www.w3.org/2002/07/owl#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix : <http://www.neohope.com/hansen/ontologies/2018/movies#>

SELECT ?name WHERE {
?aComedian rdf:type :Comedian.
?aComedian :personName ?name.
}
LIMIT 10

10、通过python访问
https://github.com/neohope/kg-demo-for-movie/tree/master/src/query-jena.py

参考链接:
https://zhuanlan.zhihu.com/knowledgegraph
https://github.com/SimmerChan/KG-demo-for-movie

PS:
参考教程中,原作者通过结巴分词+正则匹配+Jena,实现了一个简单的问答系统,感兴趣的话,大家可以看下。

知识图谱02:RDF

1、安装MySQL5,并新建movies库

2、导入数据
https://github.com/neohope/kg-demo-for-movie/tree/master/data/movies.sql

3、下载d2rq,并配置好JDK环境变量
http://d2rq.org/

4、利用d2rq生成mapping

generate-mapping -u movie -p password -o movies_mapping.ttl jdbc:mysql:///movies

5、手工编辑ttl,任务如下
设置正确的域名
修正类名与属性名
删除一些不需要的字段
修改前后的数据可以在这里找到
https://github.com/neohope/kg-demo-for-movie/tree/master/data/movies_mapping.ttl
https://github.com/neohope/kg-demo-for-movie/tree/master/data/movies_mapping_ok.ttl

6、输出RDF文件,用于后续的教程

dump-rdf.bat -o movies_mapping.nt movies_mapping_ok.ttl

7、启动d2r服务

d2r-server.bat movies_mapping_ok.ttl

8、访问及浏览数据
http://localhost:2020/

9、查询
http://localhost:2020/snorql/

#周星驰演过的电影
SELECT ?title WHERE {
  ?aPerson rdf:type :Person.
  ?aPerson :personName '周星驰'.
  ?aPerson :hasActedIn ?aMovie.
  ?aMovie :movieTitle ?title
}
LIMIT 10


#英雄的演员
SELECT ?actor WHERE {
  ?aMovie rdf:type :Movie.
  ?aMovie :movieTitle '英雄'.
  ?aPerson :hasActedIn ?aMovie.
  ?aPerson :personName ?actor
}
LIMIT 10


#巩俐参演的,评分高于7的电影
SELECT ?title WHERE {
  ?aPerson rdf:type :Person.
  ?aPerson  :personName '巩俐'.
  ?aPerson  :hasActedIn ?aMovie.
  ?aMovie :movieTitle ?title.
  ?aMovie :movieRating ?rating.
  FILTER (?rating>=7)
}
LIMIT 10

10、通过python访问
https://github.com/neohope/kg-demo-for-movie/tree/master/src/query-d2rq.py

参考链接:
https://zhuanlan.zhihu.com/knowledgegraph
https://github.com/SimmerChan/KG-demo-for-movie

知识图谱01:本体建模

1、下载Protege工具

https://protege.stanford.edu/

2、安装JDK,并在配置好JDK环境变量

3、打开Protege

4、在Active Ontology页面,填写两个IRI,我分别填写了下面的数值

#Ontology IRI
http://www.neohope.com/hansen/ontologies/2018/movies
#Ontology Version IRI
http://www.neohope.com/hansen/ontologies/2018/movies/1.0.0

5、在Entities页面,切换到Classes,新建三个Class

Genre
Movie
Person

6、Entities页面,切换到Data properties,新建以下属性

genereId{Domain=Genre,Ranges=xsd:string}
genereName{Domain=Genre,Ranges=xsd:string}
movieId{Domain=Movie,Ranges=xsd:string}
movieIntroduction{Domain=Movie,Ranges=xsd:string}
movieRating{Domain=Movie,Ranges=xsd:string}
movieReleaseDate{Domain=Movie,Ranges=xsd:string}
movieTitile{Domain=Movie,Ranges=xsd:string}
personAppellation{Domain=Person,Ranges=xsd:string}
->personEnglishName{Domain=Person,Ranges=xsd:string}
->personName{Domain=Person,Ranges=xsd:string}
personBiography{Domain=Person,Ranges=xsd:string}
personbirthDay{Domain=Person,Ranges=xsd:string}
personBirthPlace{Domain=Person,Ranges=xsd:string}
personDeathDay{Domain=Person,Ranges=xsd:string}
personId{Domain=Person,Ranges=xsd:string}

7、Entities页面,切换到Object Properties,新建以下属性

hasActedIn{Domain=Person,Range=Movie,InverseOf=hasActor}
hasActor{Domain=Movie,Range=Person}
hasGenre{Domain=Person,Range=Genre}

8、保存为Movies.owl,这个文件可以在后面jena的例子中用到

9、建模后的结果,可以在这里获取:
https://github.com/neohope/kg-demo-for-movie/tree/master/protege

参考链接:
https://zhuanlan.zhihu.com/knowledgegraph
https://github.com/SimmerChan/KG-demo-for-movie

利用Speech Recognition进行语音识别

首先说明一下,我用的是MacOS,多数命令Linux可用,但部分相关命令要进行调整。不建议用Windows。

1、安装环境

#安装pyaudio
brew install portaudio
pip install pyaudio

#安装Sphinx
pip install PocketSphinx

#安装tensorflow
pip install tensorflow

#安装SpeechRecognition
pip install SpeechRecognition

2、测试

#提示你进行语音输入
#默认用Google Speech Recognition
#所以需要梯子,才能用这个命令哦
python -m speech_recognition

3、建议大家看一下自带例子
https://github.com/Uberi/speech_recognition/tree/master/examples
这个库,可以支持多种语音引擎,只有PocketSphinx是离线的,其余都是在线的。
而PocketSphinx的识别率,实在是有些差,需要一些辅助才能达到较好的效果。

PS:
tensorflow需要2018年(3.8.1以后,不包括3.8.1)以后的版本才支持。

利用Face Recognition进行人脸识别

首先说明一下,我用的是MacOS,多数命令Linux可用,但部分相关命令要进行调整。不建议用Windows。

1、安装环境

#安装cmake
brew install cmake
#安装boost
brew install boost-python
#设置boost环境
export CMAKE_PREFIX_PATH="/usr/local:/usr/local/Cellar/boost/1.61.0:$PATH"
#安装pip
sudo easy_install pip
#安装fc
pip install face_recognition
#安装opencv
pip install opencv-python

2、测试

#将已知单人照片,放到iknow文件夹中,而且照片名称就是人名,比如zhangsan.jpg等
#将需要识别的照片,放到unknown文件夹中,照片可以是多人的
#tolerance是精确度,感觉默认的库是以欧美人标准进行训练的,不太适用于亚洲人,建议将tolerance设置到0.5-0.6之间
#命令行会通过iknow的照片,识别unknown中的图片,并输入每幅图片中有哪些人
face_recognition --tolerance 0.56 ./iknow/ ./unknown/

3、建议大家去看一下项目自带的example
https://github.com/ageitgey/face_recognition/tree/master/examples

4、建议大家去看一下作者写的文章,真的写的很通俗易懂
https://medium.com/@ageitgey

可能遇到的问题:
1、如果安装环境时,遇到下面的问题

OSError: [Errno 1] Operation not permitted: '/PATH/XXX-info'

解决步骤如下:
1.1、重启电脑,按command+R进入恢复模式
1.2、点击菜单【实用工具】,打开【终端】,输入csrutil disable,关闭System Integrity Protection
1.3、重启电脑,正常开机
1.4、打开【终端】输入csrutil status,验证配置已生效

ElasticSearch2基本操作(06关于查询条件及过滤)

过滤

match_all 全部匹配,不做过滤,默认
term 精确匹配
terms 精确匹配多个词
range 范围匹配
exists 文档包含某属性
missing 文档不包含某属性
bool 多个过滤条件的组合

其中,对于bool过滤,可以有下面的组合条件:

must 多个查询条件的完全匹配,相当于 and。
must_not 多个查询条件的相反匹配,相当于 not。
should 至少有一个查询条件匹配, 相当于 or。

查询

match_all 全部匹配,默认
match 首先对查询条件进行分词,然后用TF/IDF评分
multi_match 与match类似,但可以用多个条件
bool 多个条件的组合查询

其中,对于bool查询,可以有下面的组合条件:

must 多个查询条件的完全匹配,相当于 and。
must_not 多个查询条件的相反匹配,相当于 not。
should 至少有一个查询条件匹配, 相当于 or。
#查询性别为男,年龄不是25,家庭住址最好有魔都两个字的记录
curl -XPOST http://127.0.0.1:9200/myindex/user/_search -d'
{
    "query": {
        "bool": {
            "must": {
                "term": {
                    "性别": "男"
                }
            },
            "must_not": {
                "match": {
                    "年龄": "25"
                }
            },
            "should": {
                "match": {
                    "家庭住址": "魔都"
                }
            }
        }
    }
}'

#查询注册时间从2015-04-01到2016-04-01的用户
curl -XPOST http://127.0.0.1:9200/myindex/user/_search -d'
{
    "query": {
        "bool": {
            "must": {
                "range": {
                    "注册时间": {
                        "gte": "2015-04-01 00:00:00",
                        "lt": "2016-04-01 00:00:00"
                    }
                }
            }
        }
    }
}'

#查询没有年龄字段的记录
curl -XPOST http://127.0.0.1:9200/myindex/user/_search -d'
{
    "query": {
        "bool": {
            "must": {
                "missing": {
                    "field": "年龄"
                }
            }
        }
    }
}'

#查询家庭地址或工作地址中包含北京的用户
curl -XPOST http://127.0.0.1:9200/myindex/user/_search -d'
{
    "query": {
        "multi_match": {
            "query": "北京",
            "type": "most_fields",
            "fields": [
                "家庭住址",
                "工作地址"
            ]
        }
    }
}'

#查询性别为男的用户
curl -XPOST http://127.0.0.1:9200/myindex/user/_search -d'
{
    "query": {
        "filtered": {
            "query": {
                "match_all": {}
            },
            "filter": {
                "term": {
                    "性别": "男"
                }
            }
        }
    }
}'

#查询注册时间为两年内的用户
curl -XPOST http://127.0.0.1:9200/myindex/user/_search -d'
{
    "query": {
        "filtered": {
            "query": {
                "match_all": {}
            },
            "filter": {
                "range": {
                    "注册时间": {"gt" : "now-2y"}
                }
            }
        }
    }
}'

排序

#查询所有用户,按注册时间进行排序
curl -XPOST http://127.0.0.1:9200/myindex/user/_search -d'
{
    "query": {
        "match_all": {}
    },
    "sort": {
        "注册时间": {
            "order": "desc"
        }
    }
}'

分页

#查询前三条记录
curl -XPOST http://127.0.0.1:9200/myindex/user/_search -d'
{
    "query": {
        "match_all": {}
    },
    "from": 0,
    "size": 3
}'

带缓存的分页

#进行分页
curl -XPOST http://127.0.0.1:9200/myindex/user/_search?search_type=scan&scroll=5m -d'
{
    "query": { "match_all": {}},
    "size":  10
}'

#返回_scroll_id
{"_scroll_id":"c2Nhbjs1OzE1MzE6NVR2MmE1WWFRRHFtelVGYlRwNGlhdzsxNTMzOjVUdjJhNVlhUURxbXpVRmJUcDRpYXc7MTUzNDo1VHYyYTVZYVFEcW16VUZiVHA0aWF3OzE1MzU6NVR2MmE1WWFRRHFtelVGYlRwNGlhdzsxNTMyOjVUdjJhNVlhUURxbXpVRmJUcDRpYXc7MTt0b3RhbF9oaXRzOjc7","took":2,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":7,"max_score":0.0,"hits":[]}}

#发送_scroll_id开始查询
curl -XPOST http://127.0.0.1:9200/_search/scroll?scroll=5m
c2Nhbjs1OzE1MzE6NVR2MmE1WWFRRHFtelVGYlRwNGlhdzsxNTMzOjVUdjJhNVlhUURxbXpVRmJUcDRpYXc7MTUzNDo1VHYyYTVZYVFEcW16VUZiVHA0aWF3OzE1MzU6NVR2MmE1WWFRRHFtelVGYlRwNGlhdzsxNTMyOjVUdjJhNVlhUURxbXpVRmJUcDRpYXc7MTt0b3RhbF9oaXRzOjc7

ElasticSearch2基本操作(05关于搜索)

ES的搜索,不是关系数据库中的LIKE,而是通过搜索条件及文档之间的相关性来进行的。

对于一次搜索,对于每一个文档,都有一个浮点数字段_score 来表示文档与搜索主题的相关性, _score 的评分越高,相关性越高。

评分的计算方式取决于不同的查询类型:
fuzzy查询会计算与关键词的拼写相似程度
terms查询会计算找到的内容与关键词组成部分匹配的百分比
而全文本搜索是指计算内容与关键词的类似程度。

ES通过计算TF/IDF(即检索词频率/反向文档频率, Term Frequency/Inverse Document Frequency)作为相关性指标,具体与下面三个指标相关:
检索词频率TF: 对于一条记录,检索词在查询字段中出现的频率越高,相关性也越高。比如,一共有5个检索词,有4个出现在第一条记录,3条出现在第二条记录,则第一条记录TF会比第二条高一些。

反向文档频率IDF: 每个检索词在所有文档的该字段中出现的频率越高,则该词相关性越低。比如有5个检索词,如果一个词在所有文档中都出现,而另一个词之出现了一次,则所有文档中都包含的词几乎可以被忽略,只出现了一次的这个词权重会很高。

字段长度: 对于一条记录,查询字段的长度越长,相关性越低。比如有一条记录长度为10个词,另一条记录长度为100个词,而一个关键词,在两条记录里都出现了一次。则长度为10个词的记录,比长度为100个词的记录,相关性会高很多。

通过对TF/IDF的了解,可以让你解释一些看似不应该出现的结果。同时,你应该清楚,这不是一种精确匹配算法,而是一种评分算法,根据相关性进行了排序。

如果认为评分结果不合理,可以用下面的语句,查看评分过程:

#解释查询是如何进行评分的
crul -XPost http://127.0.0.1:9200/myindex/user/_search?explain -d'
{
   "query"   : { "match" : { "家庭住址" : "魔都大街" }}
}'

#结果如下:
{
    "took": 7,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 4,
        "max_score": 4,
        "hits": [
            {
                "_shard": 4,
                "_node": "5Tv2a5YaQDqmzUFbTp4iaw",
                "_index": "myindex",
                "_type": "user",
                "_id": "u002",
                "_score": 4,
                "_source": {
                    "用户ID": "u002",
                    "姓名": "李四",
                    "性别": "男",
                    "年龄": "25",
                    "家庭住址": "上海市闸北区魔都大街007号",
                    "注册时间": "2015-02-01 08:30:00"
                },
                "_explanation": {
                    "value": 4,
                    "description": "sum of:",
                    "details": [
                        {
                            "value": 4,
                            "description": "sum of:",
                            "details": [
                                {
                                    "value": 1,
                                    "description": "weight(家庭住址:魔 in 0) [PerFieldSimilarity], result of:",
                                    "details": [
                                        {
                                            "value": 1,
                                            "description": "score(doc=0,freq=1.0), product of:",
                                            "details": [
                                                {
                                                    "value": 0.5,
                                                    "description": "queryWeight, product of:",
                                                    "details": [
                                                        {
                                                            "value": 1,
                                                            "description": "idf(docFreq=1, maxDocs=2)",
                                                            "details": []
                                                        },
                                                        {
                                                            "value": 0.5,
                                                            "description": "queryNorm",
                                                            "details": []
                                                        }
                                                    ]
                                                },
                                                {
                                                    "value": 2,
                                                    "description": "fieldWeight in 0, product of:",
                                                    "details": [
                                                        {
                                                            "value": 1,
                                                            "description": "tf(freq=1.0), with freq of:",
                                                            "details": [
                                                                {
                                                                    "value": 1,
                                                                    "description": "termFreq=1.0",
                                                                    "details": []
                                                                }
                                                            ]
                                                        },
                                                        {
                                                            "value": 1,
                                                            "description": "idf(docFreq=1, maxDocs=2)",
                                                            "details": []
                                                        },
                                                        {
                                                            "value": 2,
                                                            "description": "fieldNorm(doc=0)",
                                                            "details": []
                                                        }
                                                    ]
                                                }
                                            ]
                                        }
                                    ]
                                },
                                {
                                    "value": 1,
                                    "description": "weight(家庭住址:都 in 0) [PerFieldSimilarity], result of:",
                                    "details": [
                                        {
                                            "value": 1,
                                            "description": "score(doc=0,freq=1.0), product of:",
                                            "details": [
                                                {
                                                    "value": 0.5,
                                                    "description": "queryWeight, product of:",
                                                    "details": [
                                                        {
                                                            "value": 1,
                                                            "description": "idf(docFreq=1, maxDocs=2)",
                                                            "details": []
                                                        },
                                                        {
                                                            "value": 0.5,
                                                            "description": "queryNorm",
                                                            "details": []
                                                        }
                                                    ]
                                                },
                                                {
                                                    "value": 2,
                                                    "description": "fieldWeight in 0, product of:",
                                                    "details": [
                                                        {
                                                            "value": 1,
                                                            "description": "tf(freq=1.0), with freq of:",
                                                            "details": [
                                                                {
                                                                    "value": 1,
                                                                    "description": "termFreq=1.0",
                                                                    "details": []
                                                                }
                                                            ]
                                                        },
                                                        {
                                                            "value": 1,
                                                            "description": "idf(docFreq=1, maxDocs=2)",
                                                            "details": []
                                                        },
                                                        {
                                                            "value": 2,
                                                            "description": "fieldNorm(doc=0)",
                                                            "details": []
                                                        }
                                                    ]
                                                }
                                            ]
                                        }
                                    ]
                                },
                                {
                                    "value": 1,
                                    "description": "weight(家庭住址:大街 in 0) [PerFieldSimilarity], result of:",
                                    "details": [
                                        {
                                            "value": 1,
                                            "description": "score(doc=0,freq=1.0), product of:",
                                            "details": [
                                                {
                                                    "value": 0.5,
                                                    "description": "queryWeight, product of:",
                                                    "details": [
                                                        {
                                                            "value": 1,
                                                            "description": "idf(docFreq=1, maxDocs=2)",
                                                            "details": []
                                                        },
                                                        {
                                                            "value": 0.5,
                                                            "description": "queryNorm",
                                                            "details": []
                                                        }
                                                    ]
                                                },
                                                {
                                                    "value": 2,
                                                    "description": "fieldWeight in 0, product of:",
                                                    "details": [
                                                        {
                                                            "value": 1,
                                                            "description": "tf(freq=1.0), with freq of:",
                                                            "details": [
                                                                {
                                                                    "value": 1,
                                                                    "description": "termFreq=1.0",
                                                                    "details": []
                                                                }
                                                            ]
                                                        },
                                                        {
                                                            "value": 1,
                                                            "description": "idf(docFreq=1, maxDocs=2)",
                                                            "details": []
                                                        },
                                                        {
                                                            "value": 2,
                                                            "description": "fieldNorm(doc=0)",
                                                            "details": []
                                                        }
                                                    ]
                                                }
                                            ]
                                        }
                                    ]
                                },
                                {
                                    "value": 1,
                                    "description": "weight(家庭住址:街 in 0) [PerFieldSimilarity], result of:",
                                    "details": [
                                        {
                                            "value": 1,
                                            "description": "score(doc=0,freq=1.0), product of:",
                                            "details": [
                                                {
                                                    "value": 0.5,
                                                    "description": "queryWeight, product of:",
                                                    "details": [
                                                        {
                                                            "value": 1,
                                                            "description": "idf(docFreq=1, maxDocs=2)",
                                                            "details": []
                                                        },
                                                        {
                                                            "value": 0.5,
                                                            "description": "queryNorm",
                                                            "details": []
                                                        }
                                                    ]
                                                },
                                                {
                                                    "value": 2,
                                                    "description": "fieldWeight in 0, product of:",
                                                    "details": [
                                                        {
                                                            "value": 1,
                                                            "description": "tf(freq=1.0), with freq of:",
                                                            "details": [
                                                                {
                                                                    "value": 1,
                                                                    "description": "termFreq=1.0",
                                                                    "details": []
                                                                }
                                                            ]
                                                        },
                                                        {
                                                            "value": 1,
                                                            "description": "idf(docFreq=1, maxDocs=2)",
                                                            "details": []
                                                        },
                                                        {
                                                            "value": 2,
                                                            "description": "fieldNorm(doc=0)",
                                                            "details": []
                                                        }
                                                    ]
                                                }
                                            ]
                                        }
                                    ]
                                }
                            ]
                        },
                        {
                            "value": 0,
                            "description": "match on required clause, product of:",
                            "details": [
                                {
                                    "value": 0,
                                    "description": "# clause",
                                    "details": []
                                },
                                {
                                    "value": 0.5,
                                    "description": "_type:user, product of:",
                                    "details": [
                                        {
                                            "value": 1,
                                            "description": "boost",
                                            "details": []
                                        },
                                        {
                                            "value": 0.5,
                                            "description": "queryNorm",
                                            "details": []
                                        }
                                    ]
                                }
                            ]
                        }
                    ]
                }
            },
            {
                "_shard": 0,
                "_node": "5Tv2a5YaQDqmzUFbTp4iaw",
                "_index": "myindex",
                "_type": "user",
                "_id": "u003",
                "_score": 0.71918744,
                "_source": {
                    "用户ID": "u003",
                    "姓名": "王五",
                    "性别": "男",
                    "年龄": "26",
                    "家庭住址": "广州市花都区花城大街010号",
                    "注册时间": "2015-03-01 08:30:00"
                },
                "_explanation": {
                    "value": 0.71918744,
                    "description": "sum of:",
                    "details": [
                        {
                            "value": 0.71918744,
                            "description": "product of:",
                            "details": [
                                {
                                    "value": 1.4383749,
                                    "description": "sum of:",
                                    "details": [
                                        {
                                            "value": 0.71918744,
                                            "description": "weight(家庭住址:大街 in 0) [PerFieldSimilarity], result of:",
                                            "details": [
                                                {
                                                    "value": 0.71918744,
                                                    "description": "score(doc=0,freq=1.0), product of:",
                                                    "details": [
                                                        {
                                                            "value": 0.35959372,
                                                            "description": "queryWeight, product of:",
                                                            "details": [
                                                                {
                                                                    "value": 1,
                                                                    "description": "idf(docFreq=1, maxDocs=2)",
                                                                    "details": []
                                                                },
                                                                {
                                                                    "value": 0.35959372,
                                                                    "description": "queryNorm",
                                                                    "details": []
                                                                }
                                                            ]
                                                        },
                                                        {
                                                            "value": 2,
                                                            "description": "fieldWeight in 0, product of:",
                                                            "details": [
                                                                {
                                                                    "value": 1,
                                                                    "description": "tf(freq=1.0), with freq of:",
                                                                    "details": [
                                                                        {
                                                                            "value": 1,
                                                                            "description": "termFreq=1.0",
                                                                            "details": []
                                                                        }
                                                                    ]
                                                                },
                                                                {
                                                                    "value": 1,
                                                                    "description": "idf(docFreq=1, maxDocs=2)",
                                                                    "details": []
                                                                },
                                                                {
                                                                    "value": 2,
                                                                    "description": "fieldNorm(doc=0)",
                                                                    "details": []
                                                                }
                                                            ]
                                                        }
                                                    ]
                                                }
                                            ]
                                        },
                                        {
                                            "value": 0.71918744,
                                            "description": "weight(家庭住址:街 in 0) [PerFieldSimilarity], result of:",
                                            "details": [
                                                {
                                                    "value": 0.71918744,
                                                    "description": "score(doc=0,freq=1.0), product of:",
                                                    "details": [
                                                        {
                                                            "value": 0.35959372,
                                                            "description": "queryWeight, product of:",
                                                            "details": [
                                                                {
                                                                    "value": 1,
                                                                    "description": "idf(docFreq=1, maxDocs=2)",
                                                                    "details": []
                                                                },
                                                                {
                                                                    "value": 0.35959372,
                                                                    "description": "queryNorm",
                                                                    "details": []
                                                                }
                                                            ]
                                                        },
                                                        {
                                                            "value": 2,
                                                            "description": "fieldWeight in 0, product of:",
                                                            "details": [
                                                                {
                                                                    "value": 1,
                                                                    "description": "tf(freq=1.0), with freq of:",
                                                                    "details": [
                                                                        {
                                                                            "value": 1,
                                                                            "description": "termFreq=1.0",
                                                                            "details": []
                                                                        }
                                                                    ]
                                                                },
                                                                {
                                                                    "value": 1,
                                                                    "description": "idf(docFreq=1, maxDocs=2)",
                                                                    "details": []
                                                                },
                                                                {
                                                                    "value": 2,
                                                                    "description": "fieldNorm(doc=0)",
                                                                    "details": []
                                                                }
                                                            ]
                                                        }
                                                    ]
                                                }
                                            ]
                                        }
                                    ]
                                },
                                {
                                    "value": 0.5,
                                    "description": "coord(2/4)",
                                    "details": []
                                }
                            ]
                        },
                        {
                            "value": 0,
                            "description": "match on required clause, product of:",
                            "details": [
                                {
                                    "value": 0,
                                    "description": "# clause",
                                    "details": []
                                },
                                {
                                    "value": 0.35959372,
                                    "description": "_type:user, product of:",
                                    "details": [
                                        {
                                            "value": 1,
                                            "description": "boost",
                                            "details": []
                                        },
                                        {
                                            "value": 0.35959372,
                                            "description": "queryNorm",
                                            "details": []
                                        }
                                    ]
                                }
                            ]
                        }
                    ]
                }
            },
            ......
        ]
    }
}

你可以看到,不仅是“魔都大街”的记录被查询出来了,只要有“大街”的记录也被查出来了哦。同时,也告诉了你,为什么”u002″是最靠前的。

还有一种用法,就是让ES告诉你,查询语句哪里错了:

curl -XPOST http://127.0.0.1:9200/myindex/user/_validate/query?explain -d'
{
   "query"   : { "matchA" : { "家庭住址" : "魔都大街" }}
}'

{
    "valid": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "failed": 0
    },
    "explanations": [
        {
            "index": "myindex",
            "valid": false,
            "error": "org.elasticsearch.index.query.QueryParsingException: No query registered for [matchA]"
        }
    ]
}

ES会告诉你matchA这里错了哦。

ElasticSearch2基本操作(04关于分词)

恩,有些初步的感觉了没?那回过头来我们看下最基础的东西:

ES中,常见数据类型如下:

类型名称 数据类型
字符串 string
整数 byte, short, integer, long
浮点数 float, double
布尔 boolean
日期 date
对象 object
嵌套结构 nested
地理位置(经纬度) geo_point

常用字段分析类型如下:

分析类型 含义
analyzed 首先分析这个字符串,然后索引。换言之,以全文形式索引此字段。
not_analyzed 索引这个字段,使之可以被搜索,但是索引内容和指定值一样。不分析此字段。
no 不索引这个字段。这个字段不能被搜索到。

然后,我们测试一下分词器

1、首先测试一下用标准分词进行分词

curl -XPOST http://localhost:9200/_analyze?analyzer=standard&text=小明同学大吃一惊

{
    "tokens": [
        {
            "token": "小",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<IDEOGRAPHIC>",
            "position": 0
        },
        {
            "token": "明",
            "start_offset": 1,
            "end_offset": 2,
            "type": "<IDEOGRAPHIC>",
            "position": 1
        },
        {
            "token": "同",
            "start_offset": 2,
            "end_offset": 3,
            "type": "<IDEOGRAPHIC>",
            "position": 2
        },
        {
            "token": "学",
            "start_offset": 3,
            "end_offset": 4,
            "type": "<IDEOGRAPHIC>",
            "position": 3
        },
        {
            "token": "大",
            "start_offset": 4,
            "end_offset": 5,
            "type": "<IDEOGRAPHIC>",
            "position": 4
        },
        {
            "token": "吃",
            "start_offset": 5,
            "end_offset": 6,
            "type": "<IDEOGRAPHIC>",
            "position": 5
        },
        {
            "token": "一",
            "start_offset": 6,
            "end_offset": 7,
            "type": "<IDEOGRAPHIC>",
            "position": 6
        },
        {
            "token": "惊",
            "start_offset": 7,
            "end_offset": 8,
            "type": "<IDEOGRAPHIC>",
            "position": 7
        }
    ]
}

2、然后对比一下用IK分词进行分词

curl -XGET http://localhost:9200/_analyze?analyzer=ik&text=小明同学大吃一惊

{
    "tokens": [
        {
            "token": "小明",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "同学",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "大吃一惊",
            "start_offset": 4,
            "end_offset": 8,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "大吃",
            "start_offset": 4,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "吃",
            "start_offset": 5,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 4
        },
        {
            "token": "一惊",
            "start_offset": 6,
            "end_offset": 8,
            "type": "CN_WORD",
            "position": 5
        },
        {
            "token": "一",
            "start_offset": 6,
            "end_offset": 7,
            "type": "TYPE_CNUM",
            "position": 6
        },
        {
            "token": "惊",
            "start_offset": 7,
            "end_offset": 8,
            "type": "CN_CHAR",
            "position": 7
        }
    ]
}

3、测试一下按”家庭住址”字段进行分词

curl -XGET http://localhost:9200/myindex/_analyze?field=家庭住址&text=我爱北京天安门

{
    "tokens": [
        {
            "token": "我",
            "start_offset": 0,
            "end_offset": 1,
            "type": "CN_CHAR",
            "position": 0
        },
        {
            "token": "爱",
            "start_offset": 1,
            "end_offset": 2,
            "type": "CN_CHAR",
            "position": 1
        },
        {
            "token": "北京",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "京",
            "start_offset": 3,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "天安门",
            "start_offset": 4,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 4
        },
        {
            "token": "天安",
            "start_offset": 4,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 5
        },
        {
            "token": "门",
            "start_offset": 6,
            "end_offset": 7,
            "type": "CN_CHAR",
            "position": 6
        }
    ]
}

4、测试一下按”性别”字段进行分词

curl -XGET http://localhost:9200/myindex/_analyze?field=性别&text=我爱北京天安门

{
    "tokens": [
        {
            "token": "我爱北京天安门",
            "start_offset": 0,
            "end_offset": 7,
            "type": "word",
            "position": 0
        }
    ]
}

大家可以看到,不同的分词器,使用场景、针对语言是不一样的,所以要选择合适的分词器。
此外,对于不同的字段,要选择不同的分析方式及适用的分词器,会让你事半功倍。