First, a disclaimer: if you have any choice, don't do this kind of thing under Windows; I am doing it here purely for the sake of tinkering.
If you only need the code, I have already uploaded it to GitHub; it can be downloaded here:
word2vec_win32
The build tool is VS2013.
The steps are as follows:
1. Download the source from Google Code: https://code.google.com/p/word2vec/
2. Create a VS2013 project based on the makefile
3. Adjust the code until it builds successfully
3.1. Add the following macro definition to every .c file
#define _CRT_SECURE_NO_WARNINGS
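The macro only takes effect if it is seen before the CRT headers, so put it at the very top of each file (or, equivalently, add it to the project's Preprocessor Definitions). A minimal illustration:

/* Must come before any CRT header, otherwise the C4996 deprecation
   warnings for fopen/strcpy/sscanf etc. are emitted anyway. */
#define _CRT_SECURE_NO_WARNINGS
#include <stdio.h>
#include <string.h>

int main(void) {
    char buf[16];
    strcpy(buf, "no C4996");      /* compiles without the warning now */
    printf("%s\n", buf);
    return 0;
}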
3.2. Change some of the const declarations to #define, for example
#define MAX_STRING 100
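For reference, the consts that actually trip up MSVC are the ones used as array sizes in distance.c, word-analogy.c and compute-accuracy.c: the MSVC C compiler has no variable-length arrays, so declarations such as char st1[max_size] only compile once the size is a compile-time constant. A hedged sketch of the change:

/* Original (distance.c):
     const long long max_size = 2000;   // max length of strings
     const long long N = 40;            // number of closest words that will be shown
     const long long max_w = 50;        // max length of vocabulary entries
   Windows build - plain macros, so char st1[max_size] etc. become ordinary arrays: */
#define max_size 2000
#define N 40
#define max_w 50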
3.3. Replace the posix_memalign function with _aligned_malloc
#define posix_memalign(p, a, s) (((*(p)) = _aligned_malloc((s), (a))), *(p) ? 0 : errno)
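One caveat worth noting: memory obtained from _aligned_malloc has to be released with _aligned_free rather than free(). The stock word2vec code allocates syn0/syn1 once and never frees them before exiting, so the shim above is enough, but any cleanup code you add yourself needs the matching call. A minimal sketch:

#include <malloc.h>   /* _aligned_malloc, _aligned_free */
#include <errno.h>
#include <stdio.h>

/* Same shim as above: emulate posix_memalign on top of _aligned_malloc. */
#define posix_memalign(p, a, s) (((*(p)) = _aligned_malloc((s), (a))), *(p) ? 0 : errno)

int main(void) {
    float *syn0 = NULL;
    /* 128-byte alignment, as used for syn0 in word2vec.c */
    if (posix_memalign((void **)&syn0, 128, 1000 * sizeof(float)) != 0) {
        printf("allocation failed\n");
        return 1;
    }
    /* ... use syn0 ... */
    _aligned_free(syn0);   /* NOT free(): aligned allocations need _aligned_free */
    return 0;
}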
3.4. Download pthreads-win32, the pthread library for Windows, and adjust the include and link settings accordingly
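A minimal check that the headers and import library are wired up, assuming the MSVC build of pthreads-win32 (pthreadVC2.lib / pthreadVC2.dll):

#include <pthread.h>
#include <stdio.h>

#pragma comment(lib, "pthreadVC2.lib")   /* or add it under Linker > Input > Additional Dependencies */

static void *hello(void *arg) {
    printf("hello from thread %d\n", *(int *)arg);
    return NULL;
}

int main(void) {
    pthread_t t;
    int id = 0;
    pthread_create(&t, NULL, hello, &id);
    pthread_join(t, NULL);
    return 0;
}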
3.5. The build now succeeds
4. What the executables do
word2vec: turns words into vectors, or performs clustering
word2phrase: turns words into phrases, used as a preprocessing step; it can be applied repeatedly (one pass produces two-word phrases, two passes produce four-word phrases)
compute-accuracy: checks the accuracy of a model
distance: given a word A, returns the most similar words (A => ?)
word-analogy: given three words A, B and C, returns the analogy (if A => B, then C => ?)
5. Testing
5.1. Download the test corpus
http://mattmahoney.net/dc/text8.zip
5.2. Train a model (a small reader for the resulting vectors.bin follows the output below)
>word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15
Starting training using file text8
Vocab size: 71291
Words in train file: 16718843
Alpha: 0.000005 Progress: 100.10% Words/thread/sec: 13.74k
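With -binary 1, vectors.bin uses a simple layout: a text header "vocab_size vector_size\n", then for each word the word itself, one space, vector_size raw floats, and a newline. A minimal sketch (not part of the original package) that reads the header and the first entry:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    long long words, size, i;
    char word[100];
    float *vec;
    FILE *f = fopen("vectors.bin", "rb");
    if (f == NULL) { printf("vectors.bin not found\n"); return 1; }
    fscanf(f, "%lld %lld", &words, &size);      /* header: vocabulary size, vector size */
    fscanf(f, "%99s", word);                    /* first word (skips the newline, stops at the space) */
    fgetc(f);                                   /* consume the space before the raw floats */
    vec = (float *)malloc((size_t)size * sizeof(float));
    fread(vec, sizeof(float), (size_t)size, f);
    printf("%lld words, %lld dimensions, first word: %s\n", words, size, word);
    for (i = 0; i < 5 && i < size; i++) printf("%f ", vec[i]);
    printf("\n");
    free(vec);
    fclose(f);
    return 0;
}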
5.3. Check the model's accuracy. The second argument (30000) restricts the evaluation to the most frequent words for a faster, approximate check. In the output, "Total accuracy" is cumulative over the questions processed so far, and "-1.#J" is simply how the MSVC runtime prints NaN (the syntactic accuracy before any gram* category has been evaluated). The question format is described after the output.
>compute-accuracy vectors.bin 30000 < questions-words.txt
capital-common-countries:
ACCURACY TOP1: 80.83 % (409 / 506)
Total accuracy: 80.83 % Semantic accuracy: 80.83 % Syntactic accuracy: -1.#J %
capital-world:
ACCURACY TOP1: 62.65 % (884 / 1411)
Total accuracy: 67.45 % Semantic accuracy: 67.45 % Syntactic accuracy: -1.#J %
currency:
ACCURACY TOP1: 23.13 % (62 / 268)
Total accuracy: 62.01 % Semantic accuracy: 62.01 % Syntactic accuracy: -1.#J %
city-in-state:
ACCURACY TOP1: 46.85 % (736 / 1571)
Total accuracy: 55.67 % Semantic accuracy: 55.67 % Syntactic accuracy: -1.#J %
family:
ACCURACY TOP1: 77.45 % (237 / 306)
Total accuracy: 57.31 % Semantic accuracy: 57.31 % Syntactic accuracy: -1.#J %
gram1-adjective-to-adverb:
ACCURACY TOP1: 19.44 % (147 / 756)
Total accuracy: 51.37 % Semantic accuracy: 57.31 % Syntactic accuracy: 19.44 %
gram2-opposite:
ACCURACY TOP1: 24.18 % (74 / 306)
Total accuracy: 49.75 % Semantic accuracy: 57.31 % Syntactic accuracy: 20.81 %
gram3-comparative:
ACCURACY TOP1: 64.92 % (818 / 1260)
Total accuracy: 52.74 % Semantic accuracy: 57.31 % Syntactic accuracy: 44.75 %
gram4-superlative:
ACCURACY TOP1: 39.53 % (200 / 506)
Total accuracy: 51.77 % Semantic accuracy: 57.31 % Syntactic accuracy: 43.81 %
gram5-present-participle:
ACCURACY TOP1: 40.32 % (400 / 992)
Total accuracy: 50.33 % Semantic accuracy: 57.31 % Syntactic accuracy: 42.91 %
gram6-nationality-adjective:
ACCURACY TOP1: 84.46 % (1158 / 1371)
Total accuracy: 55.39 % Semantic accuracy: 57.31 % Syntactic accuracy: 53.88 %
gram7-past-tense:
ACCURACY TOP1: 39.79 % (530 / 1332)
Total accuracy: 53.42 % Semantic accuracy: 57.31 % Syntactic accuracy: 51.00 %
gram8-plural:
ACCURACY TOP1: 61.39 % (609 / 992)
Total accuracy: 54.11 % Semantic accuracy: 57.31 % Syntactic accuracy: 52.38 %
gram9-plural-verbs:
ACCURACY TOP1: 33.38 % (217 / 650)
Total accuracy: 53.01 % Semantic accuracy: 57.31 % Syntactic accuracy: 50.86 %
Questions seen / total: 12227 19544 62.56 %
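Each non-header line of questions-words.txt is one analogy question "A B C D", grouped into the categories printed above by ": category" lines; a question counts as correct when the vocabulary word whose vector is closest (by cosine) to vec(B) - vec(A) + vec(C) is exactly D. The first category starts roughly like this:

: capital-common-countries
Athens Greece Baghdad Iraq
Athens Greece Bangkok Thailand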
5.4. Find the most closely related words (a sketch of the cosine computation follows the session below)
>distance vectors.bin
Enter word or sentence (EXIT to break): china
Word: china Position in vocabulary: 486
Word Cosine distance
------------------------------------------------------------------------
taiwan 0.649276
japan 0.624836
hainan 0.567946
kalmykia 0.562871
tibet 0.562600
prc 0.553833
tuva 0.553255
korea 0.552685
chinese 0.545661
xiamen 0.542703
liao 0.542607
jiang 0.540888
manchuria 0.540783
wuhan 0.537735
yunnan 0.535809
hunan 0.535770
hangzhou 0.524340
yong 0.523802
sichuan 0.517254
guangdong 0.514874
liang 0.511881
jin 0.511389
india 0.508853
xinjiang 0.505971
taiwanese 0.503072
qing 0.502909
shanghai 0.502771
shandong 0.499169
jiangxi 0.495940
nanjing 0.492893
guangzhou 0.492788
zhao 0.490396
shenzhen 0.489658
singapore 0.489428
hubei 0.488228
harbin 0.488112
liaoning 0.484283
zhejiang 0.484192
joseon 0.483718
mongolia 0.481411
Enter word or sentence (EXIT to break):
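distance loads all vectors, normalizes each one to unit length, and the "Cosine distance" column above is then just a dot product against the query vector. A minimal sketch of the computation:

#include <math.h>

/* Cosine similarity for raw (unnormalized) vectors; for unit-length vectors
   the two norms are 1 and this reduces to the plain dot product. */
float cosine_similarity(const float *a, const float *b, long long size) {
    long long i;
    float dot = 0.0f, na = 0.0f, nb = 0.0f;
    for (i = 0; i < size; i++) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (sqrtf(na) * sqrtf(nb));   /* in [-1, 1]; larger means more similar */
}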
5.5. Given A => B, find C => ? (the vector arithmetic behind this is sketched after the session below)
>word-analogy vectors.bin
Enter three words (EXIT to break): china beijing canada
Word: china Position in vocabulary: 486
Word: beijing Position in vocabulary: 3880
Word: canada Position in vocabulary: 474
Word Distance
------------------------------------------------------------------------
toronto 0.624131
montreal 0.559667
mcgill 0.519338
calgary 0.518366
ryerson 0.515524
ottawa 0.515316
alberta 0.509334
edmonton 0.498436
moncton 0.488861
quebec 0.487712
canadian 0.475655
saskatchewan 0.460744
fredericton 0.460354
ontario 0.458213
montrealers 0.435611
vancouver 0.429893
saskatoon 0.416954
dieppe 0.404408
iqaluit 0.401143
canadians 0.398137
winnipeg 0.397547
labatt 0.393893
city 0.386245
bilingualism 0.386245
columbia 0.384754
provincial 0.383439
banff 0.382603
metro 0.382367
molson 0.379343
nunavut 0.375992
montr 0.373883
francophones 0.373512
brunswick 0.364261
manitoba 0.360447
bec 0.359977
francophone 0.358556
leafs 0.353035
ellensburg 0.352787
curling 0.351973
cdn 0.347580
Enter three words (EXIT to break):
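For "china beijing canada" the tool builds the target vector vec(beijing) - vec(china) + vec(canada) and then scans the vocabulary for the unit vector with the highest cosine similarity to it, skipping the three input words, which is why Canadian cities dominate the list. A minimal sketch of the target construction:

#include <math.h>

/* target = b - a + c, normalized to unit length so that a plain dot product
   against the (unit) word vectors gives the cosine similarity. */
void analogy_target(const float *a, const float *b, const float *c,
                    float *target, long long size) {
    long long i;
    float norm = 0.0f;
    for (i = 0; i < size; i++) target[i] = b[i] - a[i] + c[i];
    for (i = 0; i < size; i++) norm += target[i] * target[i];
    norm = sqrtf(norm);
    if (norm > 0.0f) for (i = 0; i < size; i++) target[i] /= norm;
}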
5.6. Perform clustering and write the result (with -classes 0 you get the raw vectors instead); a small reader for classes.txt follows the output below
>word2vec -train text8 -output classes.txt -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -iter 15 -classes 500
Starting training using file text8
Vocab size: 71291
Words in train file: 16718843
Alpha: 0.000005 Progress: 100.10% Words/thread/sec: 14.72k
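With -classes 500 the trained vectors are clustered with K-means, and classes.txt contains one "word cluster_id" pair per line instead of the vectors. A minimal sketch (the file name and cluster id on the command line are only illustrative) that prints the members of one cluster:

#include <stdio.h>
#include <stdlib.h>

/* Print all words assigned to one K-means cluster in classes.txt.
   Usage (hypothetical): print_class classes.txt 42 */
int main(int argc, char **argv) {
    char word[100];
    int id, wanted;
    FILE *f;
    if (argc < 3) { printf("usage: %s classes.txt cluster_id\n", argv[0]); return 1; }
    wanted = atoi(argv[2]);
    f = fopen(argv[1], "r");
    if (f == NULL) { printf("cannot open %s\n", argv[1]); return 1; }
    while (fscanf(f, "%99s %d", word, &id) == 2)
        if (id == wanted) printf("%s\n", word);
    fclose(f);
    return 0;
}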
5.7. The original package also ships three demo scripts that work on phrases; since they depend on Linux commands such as sed and awk, it is easier to run them under Cygwin or MinGW.
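The word2phrase passes themselves do not need the shell scripts, though; a hedged sketch of the two-pass chaining described in section 4, run directly from cmd (the output names and thresholds are only illustrative):

>word2phrase -train text8 -output text8-phrase1 -threshold 200 -debug 2
>word2phrase -train text8-phrase1 -output text8-phrase2 -threshold 100 -debug 2
>word2vec -train text8-phrase2 -output vectors-phrase.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15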