谈谈ChatGPT

春节期间,试用了ChatGPT,让我被惊艳到了。
大模型的涌现效果,能达到ChatGPT3.5的程度,着实让我有些吃惊,真是大力出奇迹啊。

在此之前,我一直认为本次AI技术革命已经接近尾声了:
1、在影像和视频方面,AI已经可以实现商业化:医疗影像AI诊断、自动驾驶、人脸识别、图像搜索、P图滤镜等;
2、在语音方面,语音识别和语音合成已经很成熟;
3、在NLP方面,简单重复任务可以较好完成,比如机器翻译、文本搜索等。但在复杂任务上,还处于有多少人工就有多少智能的尴尬阶段,距离商业化有较长的路需要走;
而且,无论是哪个领域,大家可以发现,AI还是只是一种能力、一个工具,也就是处于“业务X+AI”的模式。
就算AI是生产力,但想象空间也就那么大,因为领域已经被限制死了。

但ChatGPT改变了这个局面,聊天这个场景,可以让ChatGPT成为一个各种能力的插座。
也就是说,一个类似于ChatGPT的大模型,如果能快速整合各种外部能力,从“业务X+AI”,变成“AI+业务X、业务Y、业务Z”的模式,很可能会成为下一代互联网的入口,并从各种维度给人类带来全新体验。

钢铁侠的贾维斯(J.A.R.V.I.S.)还是保守了,我们有更广阔的空间,十分期待这个时代能尽快到来。

最近看了一些ChatGPT资料,整理了一些相关示例:
https://github.com/neohope/NeoDemosChatGPT

多数例子,来源于极客时间徐文浩老师的课程《AI大模型之美》:
https://time.geekbang.org/column/intro/100541001?tab=catalog

TensorFlow入门02:Tensor

Tensorflow中所有的数据都称为Tensor,可以是一个变量、数组或者多维数组。Tensor 有几个重要的属性:
Rank:纬数,比如scalar rank=0, vector rank=1, matrix rank=2
Shape:形状,比如vector shape=[D0], matrix shape=[D0, D1]
类型:数据类型,比如tf.float32, tc.uint8等

Rank与Shape关系如下表所示

Rank Shape Dimension number Example
0 [] 0-D A 0-D tensor. A scalar.
1 [D0] 1-D A 1-D tensor with shape [5].
2 [D0, D1] 2-D A 2-D tensor with shape [3, 4].
3 [D0, D1, D2] 3-D A 3-D tensor with shape [1, 4, 3].
n [D0, D1, … Dn-1] n-D A tensor with shape [D0, D1, … Dn-1].

TensorFlow入门01:环境搭建

1、CPU版本安装

1.1 安装tensorflow

pip3 install --upgrade tensorflow

1.2 Python验证,看到版本信息就可以了

python3
>>> import tensorflow as tf
>>> print('Tensorflow version ', tf.__version__)

Tensorflow version  1.12.0

2、GPU版本安装(需要NVIDIA显卡)

2.1 检查驱动信息

nvidia-smi

Fri Nov 16 21:22:13 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.77                 Driver Version: 390.77                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   52C    P2    27W /  N/A |   5938MiB /  6078MiB |     22%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      7953      G   /usr/lib/xorg/Xorg                           126MiB |
|    0      8215      G   /usr/bin/gnome-shell                         109MiB |
|    0     13578    C+G   python3                                     5689MiB |
+-----------------------------------------------------------------------------+

2.2 安装CUDA

# 查看网站 https://developer.nvidia.com/cuda-90-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1704&target_type=runfilelocal
# 选择下载这个版本 Linux x86_64 Ubuntu 17.04 runfile
# 安装,但注意不要更新驱动
sudo chmod +x cuda_9.0.176_384.81_linux.run
./cuda_9.0.176_384.81_linux.run --override

2.3 安装CUDNN

# 查看网站 https://developer.nvidia.com/rdp/cudnn-download
# 选择下载这个版本 9.0 cuDNN Library for Linux
# 解压
tar -zxvf cudnn-9.0-linux-x64-v7.tgz
# 手工安装
sudo cp -P cuda/lib64/libcudnn* /usr/local/cuda-9.0/lib64/
sudo cp  cuda/include/cudnn.h /usr/local/cuda-9.0/include/
# 调整权限
sudo chmod a+r /usr/local/cuda-9.0/include/cudnn.h /usr/local/cuda-9.0/lib64/libcudnn*

2.3 安装libcupti-dev

sudo apt-get install libcupti-dev

2.4 修改.bashrc

# 增加下面两行
export PATH=/usr/local/cuda-9.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

2.5 安装tensorflow-gpu

pip3 install --upgrade tensorflow-gpu

2.6 Python验证,看到GPU就可以啦

Python3
>>> from tensorflow.python.client import device_lib
>>> device_lib.list_local_devices()

[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
...
incarnation: 2559160109308400478
physical_device_desc: "device: 0, name: GeForce GTX 1060 with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 6.1"
]

3、Docker方式安装
3.1 CPU版

# 运行tensorflow
docker run -it -p 8888:8888 gcr.io/tensorflow/tensorflow

3.2 GPU版

# 安装nvidia-docker
wget https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker_1.0.1-1_amd64.deb
sudo dpkg -i nvidia-docker*.deb

# 测试nvidia-docker,执行nvidia-smi命令
nvidia-docker run --rm nvidia/cuda nvidia-smi

# 运行tensorflow
nvidia-docker run -it -p 8888:8888 gcr.io/tensorflow/tensorflow:latest-gpu

4、编译CUDA Demo(非必须)

# 咱们选用的版本只支持到gcc6
apt-get install gcc-6 g++-6
ln -s /bin/gcc /bin/gcc-6

# 安装libmpich-dev
sudo apt-get install libmpich-dev


# 切换路径
cd PATH_TO_DEMO

# 编译
make

知识图谱03:JENA

1、下载apache-jena-fuseki和apache-jena
https://jena.apache.org/download/index.cgi

2、将上一篇教程的nt文件转换为tdb格式

cd apache-jena-3.9.0\bat
tdbloader.bat --loc="PATH_TO_TDB\tdb" "PATH_TO_NT\movies_mapping.nt"

3、切换到apache-jena-fuseki-3.9.0目录,启动一次服务,然后退出

4、将教程1里面的Movies.owl,拷贝到apache-jena-fuseki-3.9.0\run\databases路径下面,并重命名为Movies.ttl

5、创建配置文件apache-jena-fuseki-3.9.0\run\configuration\fuseki_conf.ttl

@prefix fuseki: <http://jena.apache.org/fuseki#> . 
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> . 
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> . 
@prefix tdb: <http://jena.hpl.hp.com/2008/tdb#> . 
@prefix ja: <http://jena.hpl.hp.com/2005/11/Assembler#> . 
@prefix : <http://base/#> . 

<#service> rdf:type fuseki:Service ; 
    fuseki:name "movies" ;
    fuseki:serviceQuery "sparql" ;
    fuseki:dataset <#dataset> ; 
    fuseki:serviceReadGraphStore      "get" ;
    fuseki:serviceReadWriteGraphStore "data" ;
    fuseki:serviceUpdate              "update" ;
    fuseki:serviceUpload              "upload"
    . 

<#dataset> rdf:type ja:RDFDataset ;
	ja:defaultGraph <#modelInf> ;
	.

<#modelInf> 
    rdf:type ja:InfModel ;
    #ja:reasoner [ja:reasonerURL <http://jena.hpl.hp.com/2003/OWLFBRuleReasoner>]  
    ja:reasoner [ 
        ja:reasonerURL <http://jena.hpl.hp.com/2003/GenericRuleReasoner> ; 
        ja:rulesFrom <file:///D:/ProjectsMy/KG/apache-jena-fuseki-3.9.0/run/databases/Rules.ttl> ] ; 
    ja:baseModel <#baseModel> ; 
    . 

<#baseModel> rdf:type tdb:GraphTDB ; 
    tdb:location "D:/ProjectsMy/KG/workspace/data/tdb" ; 
    tdb:unionDefaultGraph true ; 
    .

6、创建规则文件apache-jena-fuseki-3.9.0\run\databases\Movies.ttl
这个规则规定了,演过喜剧的演员,叫做喜剧演员(Comedian)

@prefix xsd: <XML Schema> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix : <http://www.neohope.com/hansen/ontologies/2018/movies#> .

[ruleComedian: (?aPerson :hasActedIn ?aMovie) (?aMovie :hasGenre ?aGenre) (?aGenre :genreName '喜剧') -> (?aPerson rdf:type :Comedian)]
[ruleInverse: (?aPerson :hasActedIn ?aMove) -> (?aMovie :hasActor ?aPerson)]

7、启动apache-jena-fuseki-3.9.0

8、访问http://localhost:3030/

9、进行查询,上一篇的例子也都可以用
http://localhost:3030/dataset.html?tab=query&ds=/movies

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix owl: <http://www.w3.org/2002/07/owl#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix : <http://www.neohope.com/hansen/ontologies/2018/movies#>

SELECT ?name WHERE {
?aComedian rdf:type :Comedian.
?aComedian :personName ?name.
}
LIMIT 10

10、通过python访问
https://github.com/neohope/kg-demo-for-movie/tree/master/src/query-jena.py

参考链接:
https://zhuanlan.zhihu.com/knowledgegraph
https://github.com/SimmerChan/KG-demo-for-movie

PS:
参考教程中,原作者通过结巴分词+正则匹配+Jena,实现了一个简单的问答系统,感兴趣的话,大家可以看下。

知识图谱02:RDF

1、安装MySQL5,并新建movies库

2、导入数据
https://github.com/neohope/kg-demo-for-movie/tree/master/data/movies.sql

3、下载d2rq,并配置好JDK环境变量
http://d2rq.org/

4、利用d2rq生成mapping

generate-mapping -u movie -p password -o movies_mapping.ttl jdbc:mysql:///movies

5、手工编辑ttl,任务如下
设置正确的域名
修正类名与属性名
删除一些不需要的字段
修改前后的数据可以在这里找到
https://github.com/neohope/kg-demo-for-movie/tree/master/data/movies_mapping.ttl
https://github.com/neohope/kg-demo-for-movie/tree/master/data/movies_mapping_ok.ttl

6、输出RDF文件,用于后续的教程

dump-rdf.bat -o movies_mapping.nt movies_mapping_ok.ttl

7、启动d2r服务

d2r-server.bat movies_mapping_ok.ttl

8、访问及浏览数据
http://localhost:2020/

9、查询
http://localhost:2020/snorql/

#周星驰演过的电影
SELECT ?title WHERE {
  ?aPerson rdf:type :Person.
  ?aPerson :personName '周星驰'.
  ?aPerson :hasActedIn ?aMovie.
  ?aMovie :movieTitle ?title
}
LIMIT 10


#英雄的演员
SELECT ?actor WHERE {
  ?aMovie rdf:type :Movie.
  ?aMovie :movieTitle '英雄'.
  ?aPerson :hasActedIn ?aMovie.
  ?aPerson :personName ?actor
}
LIMIT 10


#巩俐参演的,评分高于7的电影
SELECT ?title WHERE {
  ?aPerson rdf:type :Person.
  ?aPerson  :personName '巩俐'.
  ?aPerson  :hasActedIn ?aMovie.
  ?aMovie :movieTitle ?title.
  ?aMovie :movieRating ?rating.
  FILTER (?rating>=7)
}
LIMIT 10

10、通过python访问
https://github.com/neohope/kg-demo-for-movie/tree/master/src/query-d2rq.py

参考链接:
https://zhuanlan.zhihu.com/knowledgegraph
https://github.com/SimmerChan/KG-demo-for-movie

知识图谱01:本体建模

1、下载Protege工具

https://protege.stanford.edu/

2、安装JDK,并在配置好JDK环境变量

3、打开Protege

4、在Active Ontology页面,填写两个IRI,我分别填写了下面的数值

#Ontology IRI
http://www.neohope.com/hansen/ontologies/2018/movies
#Ontology Version IRI
http://www.neohope.com/hansen/ontologies/2018/movies/1.0.0

5、在Entities页面,切换到Classes,新建三个Class

Genre
Movie
Person

6、Entities页面,切换到Data properties,新建以下属性

genereId{Domain=Genre,Ranges=xsd:string}
genereName{Domain=Genre,Ranges=xsd:string}
movieId{Domain=Movie,Ranges=xsd:string}
movieIntroduction{Domain=Movie,Ranges=xsd:string}
movieRating{Domain=Movie,Ranges=xsd:string}
movieReleaseDate{Domain=Movie,Ranges=xsd:string}
movieTitile{Domain=Movie,Ranges=xsd:string}
personAppellation{Domain=Person,Ranges=xsd:string}
->personEnglishName{Domain=Person,Ranges=xsd:string}
->personName{Domain=Person,Ranges=xsd:string}
personBiography{Domain=Person,Ranges=xsd:string}
personbirthDay{Domain=Person,Ranges=xsd:string}
personBirthPlace{Domain=Person,Ranges=xsd:string}
personDeathDay{Domain=Person,Ranges=xsd:string}
personId{Domain=Person,Ranges=xsd:string}

7、Entities页面,切换到Object Properties,新建以下属性

hasActedIn{Domain=Person,Range=Movie,InverseOf=hasActor}
hasActor{Domain=Movie,Range=Person}
hasGenre{Domain=Person,Range=Genre}

8、保存为Movies.owl,这个文件可以在后面jena的例子中用到

9、建模后的结果,可以在这里获取:
https://github.com/neohope/kg-demo-for-movie/tree/master/protege

参考链接:
https://zhuanlan.zhihu.com/knowledgegraph
https://github.com/SimmerChan/KG-demo-for-movie

利用Speech Recognition进行语音识别

首先说明一下,我用的是MacOS,多数命令Linux可用,但部分相关命令要进行调整。不建议用Windows。

1、安装环境

#安装pyaudio
brew install portaudio
pip install pyaudio

#安装Sphinx
pip install PocketSphinx

#安装tensorflow
pip install tensorflow

#安装SpeechRecognition
pip install SpeechRecognition

2、测试

#提示你进行语音输入
#默认用Google Speech Recognition
#所以需要梯子,才能用这个命令哦
python -m speech_recognition

3、建议大家看一下自带例子
https://github.com/Uberi/speech_recognition/tree/master/examples
这个库,可以支持多种语音引擎,只有PocketSphinx是离线的,其余都是在线的。
而PocketSphinx的识别率,实在是有些差,需要一些辅助才能达到较好的效果。

PS:
tensorflow需要2018年(3.8.1以后,不包括3.8.1)以后的版本才支持。

利用Face Recognition进行人脸识别

首先说明一下,我用的是MacOS,多数命令Linux可用,但部分相关命令要进行调整。不建议用Windows。

1、安装环境

#安装cmake
brew install cmake
#安装boost
brew install boost-python
#设置boost环境
export CMAKE_PREFIX_PATH="/usr/local:/usr/local/Cellar/boost/1.61.0:$PATH"
#安装pip
sudo easy_install pip
#安装fc
pip install face_recognition
#安装opencv
pip install opencv-python

2、测试

#将已知单人照片,放到iknow文件夹中,而且照片名称就是人名,比如zhangsan.jpg等
#将需要识别的照片,放到unknown文件夹中,照片可以是多人的
#tolerance是精确度,感觉默认的库是以欧美人标准进行训练的,不太适用于亚洲人,建议将tolerance设置到0.5-0.6之间
#命令行会通过iknow的照片,识别unknown中的图片,并输入每幅图片中有哪些人
face_recognition --tolerance 0.56 ./iknow/ ./unknown/

3、建议大家去看一下项目自带的example
https://github.com/ageitgey/face_recognition/tree/master/examples

4、建议大家去看一下作者写的文章,真的写的很通俗易懂
https://medium.com/@ageitgey

可能遇到的问题:
1、如果安装环境时,遇到下面的问题

OSError: [Errno 1] Operation not permitted: '/PATH/XXX-info'

解决步骤如下:
1.1、重启电脑,按command+R进入恢复模式
1.2、点击菜单【实用工具】,打开【终端】,输入csrutil disable,关闭System Integrity Protection
1.3、重启电脑,正常开机
1.4、打开【终端】输入csrutil status,验证配置已生效

Windows下编译word2vec

首先要声明,如果条件允许,不要在windows下做类似的事情,我这里是在折腾。

如果只需要下载代码,相应的代码,我已经上传了github,可以在这里下载到:
word2vec_win32

编译工具为:VS2013

具体的做法为:

1、到google code下载代码https://code.google.com/p/word2vec/

2、根据makefile,创建VS2013工程

3、进行调整,保证编译成功
3.1、所有c文件,添加下面的宏定义

#define _CRT_SECURE_NO_WARNINGS

3.2、将部分const修改为define,比如

    #define MAX_STRING 100

3.3、用_aligned_malloc函数,替换posix_memalign函数

    #define posix_memalign(p, a, s) (((*(p)) = _aligned_malloc((s), (a))), *(p) ?0 :errno)

3.4、下载windows下的pthread库,pthreads-win32,并修改include及link配置

3.5、编译成功

4、可执行文件说明
word2vec:词转向量,或者进行聚类
word2phrase:词转词组,用于预处理,可重复使用(运行一遍则生成2个词的短语,运行两遍则形成4个词的短语)
compute-accuracy:校验模型精度
distance:输入一个词A,返回最相近的词(A=》?)
word-analogy:输入三个词A,B,C,返回(如果A=》B,C=》?)

5、进行测试
5.1下载测试资料
http://mattmahoney.net/dc/text8.zip

5.2建立模型

>word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15
Starting training using file text8
Vocab size: 71291
Words in train file: 16718843
Alpha: 0.000005  Progress: 100.10%  Words/thread/sec: 13.74k

5.3校验模型精度

>compute-accuracy vectors.bin 30000 < questions-word
s.txt
capital-common-countries:
ACCURACY TOP1: 80.83 %  (409 / 506)
Total accuracy: 80.83 %   Semantic accuracy: 80.83 %   Syntactic accuracy: -1.#J
 %
capital-world:
ACCURACY TOP1: 62.65 %  (884 / 1411)
Total accuracy: 67.45 %   Semantic accuracy: 67.45 %   Syntactic accuracy: -1.#J
 %
currency:
ACCURACY TOP1: 23.13 %  (62 / 268)
Total accuracy: 62.01 %   Semantic accuracy: 62.01 %   Syntactic accuracy: -1.#J
 %
city-in-state:
ACCURACY TOP1: 46.85 %  (736 / 1571)
Total accuracy: 55.67 %   Semantic accuracy: 55.67 %   Syntactic accuracy: -1.#J
 %
family:
ACCURACY TOP1: 77.45 %  (237 / 306)
Total accuracy: 57.31 %   Semantic accuracy: 57.31 %   Syntactic accuracy: -1.#J
 %
gram1-adjective-to-adverb:
ACCURACY TOP1: 19.44 %  (147 / 756)
Total accuracy: 51.37 %   Semantic accuracy: 57.31 %   Syntactic accuracy: 19.44
 %
gram2-opposite:
ACCURACY TOP1: 24.18 %  (74 / 306)
Total accuracy: 49.75 %   Semantic accuracy: 57.31 %   Syntactic accuracy: 20.81
 %
gram3-comparative:
ACCURACY TOP1: 64.92 %  (818 / 1260)
Total accuracy: 52.74 %   Semantic accuracy: 57.31 %   Syntactic accuracy: 44.75
 %
gram4-superlative:
ACCURACY TOP1: 39.53 %  (200 / 506)
Total accuracy: 51.77 %   Semantic accuracy: 57.31 %   Syntactic accuracy: 43.81
 %
gram5-present-participle:
ACCURACY TOP1: 40.32 %  (400 / 992)
Total accuracy: 50.33 %   Semantic accuracy: 57.31 %   Syntactic accuracy: 42.91
 %
gram6-nationality-adjective:
ACCURACY TOP1: 84.46 %  (1158 / 1371)
Total accuracy: 55.39 %   Semantic accuracy: 57.31 %   Syntactic accuracy: 53.88
 %
gram7-past-tense:
ACCURACY TOP1: 39.79 %  (530 / 1332)
Total accuracy: 53.42 %   Semantic accuracy: 57.31 %   Syntactic accuracy: 51.00
 %
gram8-plural:
ACCURACY TOP1: 61.39 %  (609 / 992)
Total accuracy: 54.11 %   Semantic accuracy: 57.31 %   Syntactic accuracy: 52.38
 %
gram9-plural-verbs:
ACCURACY TOP1: 33.38 %  (217 / 650)
Total accuracy: 53.01 %   Semantic accuracy: 57.31 %   Syntactic accuracy: 50.86
 %
Questions seen / total: 12227 19544   62.56 %

5.4查找关系最近的单词

>distance vectors.bin
Enter word or sentence (EXIT to break): china

Word: china  Position in vocabulary: 486

                                              Word       Cosine distance
------------------------------------------------------------------------
                                            taiwan              0.649276
                                             japan              0.624836
                                            hainan              0.567946
                                          kalmykia              0.562871
                                             tibet              0.562600
                                               prc              0.553833
                                              tuva              0.553255
                                             korea              0.552685
                                           chinese              0.545661
                                            xiamen              0.542703
                                              liao              0.542607
                                             jiang              0.540888
                                         manchuria              0.540783
                                             wuhan              0.537735
                                            yunnan              0.535809
                                             hunan              0.535770
                                          hangzhou              0.524340
                                              yong              0.523802
                                           sichuan              0.517254
                                         guangdong              0.514874
                                             liang              0.511881
                                               jin              0.511389
                                             india              0.508853
                                          xinjiang              0.505971
                                         taiwanese              0.503072
                                              qing              0.502909
                                          shanghai              0.502771
                                          shandong              0.499169
                                           jiangxi              0.495940
                                           nanjing              0.492893
                                         guangzhou              0.492788
                                              zhao              0.490396
                                          shenzhen              0.489658
                                         singapore              0.489428
                                             hubei              0.488228
                                            harbin              0.488112
                                          liaoning              0.484283
                                          zhejiang              0.484192
                                            joseon              0.483718
                                          mongolia              0.481411
Enter word or sentence (EXIT to break):

5.5根据A=>B,得到C=>?

>word-analogy vectors.bin
Enter three words (EXIT to break): china beijing canada

Word: china  Position in vocabulary: 486

Word: beijing  Position in vocabulary: 3880

Word: canada  Position in vocabulary: 474

                                              Word              Distance
------------------------------------------------------------------------
                                           toronto              0.624131
                                          montreal              0.559667
                                            mcgill              0.519338
                                           calgary              0.518366
                                           ryerson              0.515524
                                            ottawa              0.515316
                                           alberta              0.509334
                                          edmonton              0.498436
                                           moncton              0.488861
                                            quebec              0.487712
                                          canadian              0.475655
                                      saskatchewan              0.460744
                                       fredericton              0.460354
                                           ontario              0.458213
                                       montrealers              0.435611
                                         vancouver              0.429893
                                         saskatoon              0.416954
                                            dieppe              0.404408
                                           iqaluit              0.401143
                                         canadians              0.398137
                                          winnipeg              0.397547
                                            labatt              0.393893
                                              city              0.386245
                                      bilingualism              0.386245
                                          columbia              0.384754
                                        provincial              0.383439
                                             banff              0.382603
                                             metro              0.382367
                                            molson              0.379343
                                           nunavut              0.375992
                                             montr              0.373883
                                      francophones              0.373512
                                         brunswick              0.364261
                                          manitoba              0.360447
                                               bec              0.359977
                                       francophone              0.358556
                                             leafs              0.353035
                                        ellensburg              0.352787
                                           curling              0.351973
                                               cdn              0.347580
Enter three words (EXIT to break):

5.6进行聚类,输出结果(classes为0时,就是向量输出了)

>word2vec -train text8 -output classes.txt -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -iter 15 -classes 500
Starting training using file text8
Vocab size: 71291
Words in train file: 16718843
Alpha: 0.000005  Progress: 100.10%  Words/thread/sec: 14.72k

5.7原来程序中,还有三个测试脚本,是处理词组的,由于要用到linux命令sed,awk等,大家还是到Cygwin或MinGW下运行吧

分词及词性标注总结

近期,尝试了各类的分词及词性标注工具,包括如下软件:

工具 中英文支持 其他说明
中科院的ICTCLAS 中英 CPP,多语言接口
清华大学的THULANC 中,英较差 多语言支持
哈工大的LTP CPP,多语言接口
复旦的FudanDNN Java
东北大学的NiuParser 中,英较差 CPP
斯坦福的Stanford 中英 Java
Ansj Java
Jieba Python
Word Java
HanLP Java
LingPipe 英,中较差 Java
OpenNLP Java
NLTK Python
Gate Java,GUI,但不太符合程序员思维逻辑
lucene-analyzers-smartcn Java,只分词,不标词性

此外,还有几个工具,由于时间关系,没有进行测试,有兴趣的话可以看一下:
mmseg4j
paoding
jcseg
IK-Analyzer

总结如下:
1、无论是英文还是中文,其分词及标注词性的技术已经相对比较成熟;
2、英文和中文完全是两个体系,中文还是国内做的好一些
3、算法是公开的,因此很多时候,模型库比算法要一些
4、模型库够用就好,不是越大越好。尤其是特定语境下的模型库,自己训练的会更好用
5、英文的模型库比国内好太多了,看着好羡慕啊
6、希望国内的科研可以更有套路、更有组织、更专业化一些