About neohope

Still working hard, and not thinking of giving up yet...

ElasticSearch 2 basic operations (01: CRUD over REST)

First, adjust your mental model. Compared with an ordinary relational database, you can think of the concepts this way for now:

Relational DB    Elasticsearch
Databases        Indices
Tables           Types
Rows             Documents
Columns          Fields

1. Create the index myindex

curl -XPUT http://localhost:9200/myindex

2. Create the type user

curl -XPOST http://localhost:9200/myindex/user/_mapping -d'
{
    "user": {
        "_all": {
            "analyzer": "ik_max_word",
            "search_analyzer": "ik_max_word",
            "term_vector": "no",
            "store": "false"
        },
        "properties": {
            "用户ID": {
                "type": "string",
                "store": "no",
                "analyzer": "keyword",
                "search_analyzer": "keyword",
                "include_in_all": "true",
                "boost": 8
            },
            "姓名": {
                "type": "string",
                "store": "no",
                "term_vector": "with_positions_offsets",
                "analyzer": "ik_max_word",
                "search_analyzer": "ik_max_word",
                "include_in_all": "true",
                "boost": 8
            },
            "性别": {
                "type": "string",
                "store": "no",
                "analyzer": "keyword",
                "search_analyzer": "keyword",
                "include_in_all": "true",
                "boost": 8
            },
            "年龄": {
                "type": "integer",
                "store": "no",
                "index": "not_analyzed",
                "include_in_all": "true",
                "boost": 8
            },
            "家庭住址": {
                "type": "string",
                "store": "no",
                "term_vector": "with_positions_offsets",
                "analyzer": "ik_max_word",
                "search_analyzer": "ik_max_word",
                "include_in_all": "true",
                "boost": 8
            },
            "注册时间": {
                "type": "date",
                "format": "yyyy-MM-dd HH:mm:ss",
                "store": "no",
                "index": "not_analyzed",
                "include_in_all": "true",
                "boost": 8
            }
        }
    }
}'

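The user type here uses several kinds of indexing:

Key         Type      Analysis
用户ID      string    keyword
姓名        string    ik_max_word
性别        string    keyword
年龄        integer   not_analyzed
家庭住址    string    ik_max_word
注册时间    date      not_analyzed

Where:
ik_max_word: the value is segmented with the ik analyzer and each resulting token becomes a term. Fields that need segmented full-text search should be handled this way.
keyword: no segmentation; the whole value is kept as a single term. This suits IDs and dictionary (code) values.
not_analyzed: no analysis at all. Values such as numbers and dates do not need it.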
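
To make the difference between a single-term field and a segmented field concrete, here is a minimal sketch in Python. It does not call Elasticsearch; `keyword_analyze` and `bigram_analyze` are hypothetical helpers, and the bigram splitter is only a toy stand-in for a real segmenter (it is not how ik_max_word works).

```python
def keyword_analyze(text):
    """Keep the whole value as one term, as a keyword-style analyzer does."""
    return [text]


def bigram_analyze(text):
    """Toy segmenter: overlapping 2-character terms.

    This is NOT the ik algorithm; it only shows that one value
    can produce many searchable terms.
    """
    return [text[i:i + 2] for i in range(len(text) - 1)]


address = "北京市崇文区"
keyword_terms = keyword_analyze(address)
segmented_terms = bigram_analyze(address)

print(keyword_terms)
print(segmented_terms)

# A term lookup for "北京" only succeeds against the segmented terms.
print("北京" in keyword_terms)
print("北京" in segmented_terms)
```

With the single-term form, only an exact match on the full value can hit, which is why it suits IDs and code values but not addresses.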

3. Index documents

curl -XPUT http://localhost:9200/myindex/user/u001 -d'
{
"用户ID": "u001",
"姓名":"张三",
"性别":"男",
"年龄":"25",
"家庭住址":"北京市崇文区天朝大街001号",
"注册时间":"2015-01-01 08:30:00"
}'

curl -XPUT http://localhost:9200/myindex/user/u002 -d'
{
"用户ID": "u002",
"姓名":"李四",
"性别":"男",
"年龄":"25",
"家庭住址":"上海市闸北区魔都大街007号",
"注册时间":"2015-02-01 08:30:00"
}'

curl -XPUT http://localhost:9200/myindex/user/u003 -d'
{
"用户ID": "u003",
"姓名":"王五",
"性别":"男",
"年龄":"26",
"家庭住址":"广州市花都区花城大街010号",
"注册时间":"2015-03-01 08:30:00"
}'

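4. Check whether a document exists

# Check whether the document with id u003 exists (-I sends a HEAD request; curl -XHEAD can hang waiting for a body)
curl -I http://localhost:9200/myindex/user/u003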

5. Get a document

# Get the document with id u003
curl -XGET http://localhost:9200/myindex/user/u003

# Get only the 姓名 and 性别 fields of the document with id u003
curl -XGET 'http://localhost:9200/myindex/user/u003?_source=姓名,性别'

6. Search documents

# Search documents; the first 10 are returned by default
curl -XGET http://localhost:9200/myindex/user/_search

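# Search with URL parameters
# Records where 年龄 equals 25
curl -XGET 'http://localhost:9200/myindex/user/_search?q=年龄:25'
# Records where 姓名 equals 王五
curl -XGET 'http://localhost:9200/myindex/user/_search?q=姓名:王五'
# Records where 姓名 equals 王五 and 年龄 equals 26
# (a literal + must be sent as %2B, because a bare + in a URL decodes to a space)
curl -XGET 'http://localhost:9200/myindex/user/_search?q=%2B姓名:王五+%2B年龄:26'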
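
The `+` operator in a query string is easy to lose, because in a URL a bare `+` decodes to a space. A small sketch of building the search URL safely with Python's standard library follows; `search_url` is a hypothetical helper, and the base URL is the one used in the examples above.

```python
from urllib.parse import urlencode


def search_url(query, base="http://localhost:9200/myindex/user/_search"):
    """Build a URI search URL, percent-encoding the query string."""
    # urlencode turns "+" into %2B and spaces into "+",
    # so the "must match" operators survive the trip to the server.
    return base + "?" + urlencode({"q": query})


url = search_url("+姓名:王五 +年龄:26")
print(url)
```

The printed URL can be passed to curl in single quotes, or opened with any HTTP client.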

# Find users whose 年龄 equals 25
curl -XGET http://localhost:9200/myindex/user/_search -d'
{
    "query" : {
        "match" : {
            "年龄" : "25"
        }
    }
}'

# Find male users older than 25
curl -XGET http://localhost:9200/myindex/user/_search -d'
{
    "query": {
        "filtered": {
            "filter": {
                "range": {
                    "年龄": {
                        "gt": 25
                    }
                }
            },
            "query": {
                "match": {
                    "性别": "男"
                }
            }
        }
    }
}'

# Find users whose 家庭住址 contains 北京 or 上海
curl -XGET http://localhost:9200/myindex/user/_search -d'
{
    "query" : {
        "match" : {
            "家庭住址" : "北京 上海"
        }
    }
}'

# Phrase search
curl -XGET http://localhost:9200/myindex/user/_search -d'
{
    "query" : {
        "match_phrase" : {
            "家庭住址" : "北京 崇文"
        }
    }
}'

# Group by 年龄 and count each group
curl -XGET http://localhost:9200/myindex/user/_search -d'
{
  "aggs": {
    "all_interests": {
      "terms": { "field": "年龄" }
    }
  }
}'

# Male users only, grouped by 年龄 with a count per group
curl -XGET http://localhost:9200/myindex/user/_search -d'
{
  "query": {
    "match": {
      "性别": "男"
    }
  },
  "aggs": {
    "all_interests": {
      "terms": {
        "field": "年龄"
      }
    }
  }
}'
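
As a rough picture of what a terms aggregation computes, the sketch below groups the three sample documents by 年龄 in plain Python. It runs locally and does not talk to Elasticsearch; `terms_count` is a hypothetical helper, and the optional `where` argument plays the role of the query that narrows the documents before they are bucketed.

```python
from collections import Counter

# The three sample users indexed earlier (ages stored as integers here).
users = [
    {"姓名": "张三", "性别": "男", "年龄": 25},
    {"姓名": "李四", "性别": "男", "年龄": 25},
    {"姓名": "王五", "性别": "男", "年龄": 26},
]


def terms_count(docs, field, where=None):
    """Count documents per distinct value of `field`, after an optional filter."""
    selected = [d for d in docs if where is None or where(d)]
    return Counter(d[field] for d in selected)


print(terms_count(users, "年龄"))
print(terms_count(users, "年龄", where=lambda d: d["性别"] == "男"))
```

Each key of the resulting counter corresponds to one bucket, and its value to that bucket's document count.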

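Cluster analysis in R

1. Hierarchical clustering

# Load R's built-in test data (iris is available by default, so data() is optional)
#data(iris)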
#attach(iris)
inData = iris[,1:4]

# Compute the distance matrix and plot it
inData.dist = dist(inData)
#inData.dist = dist(inData,method='euclidean')
heatmap(as.matrix(inData.dist), labRow = F, labCol = F)

# Run hierarchical clustering
# and plot the dendrogram
inData.hc <- hclust(inData.dist)
#inData.hc <- hclust(inData.dist,method='ward.D')
plot(inData.hc, labels = FALSE, hang = -1)

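# Mark the clusters on the dendrogram, using 3 clusters
rect.hclust(inData.hc, k = 3)

# Cut the tree into 3 groups
inData.groups <- cutree(inData.hc, 3)
# Print the cross-tabulation of clusters against the true species
table(inData.groups, iris$Species)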

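# Reduce the data to two dimensions
# and plot to compare the results:
# shape shows the true species,
# colour shows the assigned cluster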
mds=cmdscale(inData.dist,k=2,eig=T)
x = mds$points[,1]
y = mds$points[,2]
library(ggplot2)
p=ggplot(data.frame(x,y),aes(x,y))
p+geom_point(size=3,alpha=0.8,aes(colour=factor(inData.groups),shape=iris$Species))

ClusterAnalysisH.png

2. K-means clustering

# Load R's built-in test data (iris is available by default, so data() is optional)
#data(iris)
#attach(iris)
inData = iris[,1:4]

# Compute the distance matrix and plot it
inData.dist = dist(inData)
#inData.dist = dist(inData,method='euclidean')
heatmap(as.matrix(inData.dist), labRow = F, labCol = F)

# Run k-means clustering (kmeans expects the observations themselves, not a distance matrix)
inData.kc <- kmeans(inData,centers=3)

# Plot to compare the results:
# shape shows the true species,
# colour shows the assigned cluster
library(ggplot2)
x=inData$Sepal.Length
y=inData$Sepal.Width
p=ggplot(data.frame(x,y),aes(x,y))
p+geom_point(size=3,alpha=0.8,aes(colour=factor(inData.kc$cluster),shape=iris$Species))

ClusterAnalysisK.png
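
For readers who want to see what k-means does internally, here is a minimal sketch of Lloyd's algorithm in plain Python. It is an illustration only, not what R's kmeans runs by default; the four toy points and the choice of initial centers are assumptions made up for the example.

```python
def kmeans(points, centers, iterations=10):
    """Lloyd's algorithm on a list of equal-length tuples."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center
        # (squared Euclidean distance).
        labels = [
            min(range(len(centers)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centers[k])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers


points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.8)]
labels, centers = kmeans(points, centers=[points[0], points[2]])
print(labels)
print(centers)
```

The algorithm works directly on the observations: distances to the centers are recomputed on every pass, so no precomputed distance matrix is needed.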

Common ElasticSearch 2 plugins

1. Installing common plugins online

#head
bin\plugin install mobz/elasticsearch-head

#gui
bin\plugin install jettro/elasticsearch-gui

#bigdesk
#bin\plugin install lukas-vlcek/bigdesk
bin\plugin install hlstudio/bigdesk

#kopf
bin\plugin install lmenezes/elasticsearch-kopf

#carrot2
bin\plugin install org.carrot2/elasticsearch-carrot2/2.2.1

#inquisitor
bin\plugin install polyfractal/elasticsearch-inquisitor

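2. Installing common plugins offline

# All of the plugins above can also be downloaded manually and installed offline from the command line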
bin\plugin install file:///PATH_TO_PLUGIN/PLUGIN.zip

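3. Installing the analysis (word segmentation) plugins manually

# Download a release build from the addresses below, unzip it into the plugins directory of ES, then restart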
https://github.com/medcl/elasticsearch-analysis-ik
https://github.com/medcl/elasticsearch-analysis-pinyin
https://github.com/medcl/elasticsearch-analysis-mmseg

Compiling wkhtmltopdf

1. Download the source code

git clone https://github.com/wkhtmltopdf/wkhtmltopdf.git  D:\GitHub\wkhtmltopdf
git clone https://github.com/wkhtmltopdf/qt.git  D:\GitHub\wkhtmltopdf\qt

2. Install the following software
VS2013
Python2.7
ActivePerl
NSIS

3. Initialize the environment variables

set GIT_HOME=C:\Program Files\Git
set NSIS_HOME=C:\NeoLanguages\NSIS
set PYTHON_HOME=C:\NeoLanguages\Python27_x86
set PERL_HOME=C:\NeoLanguages\Perl
set PATH=%GIT_HOME%\bin;%PYTHON_HOME%;%PERL_HOME%\bin;%NSIS_HOME%;%PATH%
@call "C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\vcvarsall.bat" x86

4. Build

python scripts\build.py msvc2013-win32

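PS:
If the build fails, it is usually because the sites hosting the dependency packages cannot be reached.
In that case, open build.py, find DEPENDENT_LIBS, download the packages yourself (through a proxy if necessary), and put them directly into the static-build directory.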

Git 04: configuring a proxy

1. Set an HTTP proxy

git config --global http.proxy "http://127.0.0.1:1080"
git config --global https.proxy "https://127.0.0.1:1080"

2. Set a SOCKS proxy

git config --global http.proxy "socks5://127.0.0.1:9527"
git config --global https.proxy "socks5://127.0.0.1:9527"

3. Remove the proxy

git config --global --unset http.proxy
git config --global --unset https.proxy

Adding page numbers with wkhtmltopdf

There are two ways to add page numbers with wkhtmltopdf (wkhtmltox).

The first is to pass [page]/[topage] to any of the following six options.

--header-center
--header-left
--header-right
--footer-center
--footer-left
--footer-right
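
As a sketch of the first approach, the snippet below assembles a wkhtmltopdf command line that puts "current page/total pages" in the centre of the footer. `wkhtmltopdf_command` is a hypothetical helper, and input.html and output.pdf are placeholder file names.

```python
def wkhtmltopdf_command(source, target, footer="[page]/[topage]"):
    """Build the argument list for a run with a centred page-number footer."""
    # [page] and [topage] are replaced by wkhtmltopdf for each page.
    return ["wkhtmltopdf", "--footer-center", footer, source, target]


cmd = wkhtmltopdf_command("input.html", "output.pdf")
print(" ".join(cmd))

# To actually run it (requires wkhtmltopdf on the PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with the square brackets.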

The second is to pass an HTML file for the header or footer through the --header-html or --footer-html option and let it generate the page numbers.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
  <meta http-equiv="Content-Type" content="text/html;charset=UTF-8"/>
  <title>footer</title>
  <script>
  function subst() {
    var vars={};
    var x=window.location.search.substring(1).split('&');
    for (var i in x) {var z=x[i].split('=',2);vars[z[0]] = unescape(z[1]);}
    var x=['frompage','topage','page','webpage','section','subsection','subsubsection'];
    for (var i in x) {
      var y = document.getElementsByClassName(x[i]);
      for (var j=0; j<y.length; ++j) y[j].textContent = vars[x[i]];
    }
  }
  </script>
</head>
<body style="border:0; margin: 0;" onload="subst()">
  <table style="border-bottom: 1px solid black; width: 100%">
    <tr>
      <td style="text-align:right">
        Page <span class="page"></span> of <span class="topage"></span>
      </td>
    </tr>
  </table>
</body>
</html>
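
The script in this footer reads values such as page and topage from the query string of the footer URL and writes them into the elements with the matching class. The same substitution can be sketched in Python; `footer_text` is a hypothetical helper and the sample query string is made up for illustration.

```python
from urllib.parse import parse_qs


def footer_text(query_string, template="Page {page} of {topage}"):
    """Fill a footer template from query-string values, as subst() does in the page."""
    # parse_qs returns a list per key; keep the first value of each.
    values = {key: items[0] for key, items in parse_qs(query_string).items()}
    return template.format(**values)


print(footer_text("page=3&topage=10&section=Intro"))
```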

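Compiling word2vec on Windows

First, a caveat: if you have the choice, do not do this kind of thing on Windows. I am just tinkering here.

If you only need the code, I have uploaded it to GitHub, and it can be downloaded here:
word2vec_win32

Build tool: VS2013

The steps are:

1. Download the code from Google Code: https://code.google.com/p/word2vec/

2. Create a VS2013 project based on the makefile

3. Adjust the code until it compiles
3.1 Add the following macro definition to every .c file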

#define _CRT_SECURE_NO_WARNINGS

3.2 Change some const declarations to #define, for example

    #define MAX_STRING 100

3.3 Replace posix_memalign with _aligned_malloc

    #define posix_memalign(p, a, s) (((*(p)) = _aligned_malloc((s), (a))), *(p) ?0 :errno)

3.4 Download a pthread library for Windows, pthreads-win32, and update the include and link settings

3.5 The build succeeds

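4. The executables
word2vec: converts words to vectors, or clusters them
word2phrase: converts words to phrases; used for preprocessing and can be run repeatedly (one pass produces two-word phrases, two passes produce phrases of up to four words)
compute-accuracy: checks the accuracy of the model
distance: given a word A, returns the closest words (A => ?)
word-analogy: given three words A, B and C, returns the answer to (if A => B, then C => ?)

5. Testing
5.1 Download the test data
http://mattmahoney.net/dc/text8.zip

5.2 Build the model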

>word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15
Starting training using file text8
Vocab size: 71291
Words in train file: 16718843
Alpha: 0.000005  Progress: 100.10%  Words/thread/sec: 13.74k

5.3 Check the model accuracy

>compute-accuracy vectors.bin 30000 < questions-words.txt
capital-common-countries:
ACCURACY TOP1: 80.83 %  (409 / 506)
Total accuracy: 80.83 %   Semantic accuracy: 80.83 %   Syntactic accuracy: -1.#J %
capital-world:
ACCURACY TOP1: 62.65 %  (884 / 1411)
Total accuracy: 67.45 %   Semantic accuracy: 67.45 %   Syntactic accuracy: -1.#J %
currency:
ACCURACY TOP1: 23.13 %  (62 / 268)
Total accuracy: 62.01 %   Semantic accuracy: 62.01 %   Syntactic accuracy: -1.#J %
city-in-state:
ACCURACY TOP1: 46.85 %  (736 / 1571)
Total accuracy: 55.67 %   Semantic accuracy: 55.67 %   Syntactic accuracy: -1.#J %
family:
ACCURACY TOP1: 77.45 %  (237 / 306)
Total accuracy: 57.31 %   Semantic accuracy: 57.31 %   Syntactic accuracy: -1.#J %
gram1-adjective-to-adverb:
ACCURACY TOP1: 19.44 %  (147 / 756)
Total accuracy: 51.37 %   Semantic accuracy: 57.31 %   Syntactic accuracy: 19.44 %
gram2-opposite:
ACCURACY TOP1: 24.18 %  (74 / 306)
Total accuracy: 49.75 %   Semantic accuracy: 57.31 %   Syntactic accuracy: 20.81 %
gram3-comparative:
ACCURACY TOP1: 64.92 %  (818 / 1260)
Total accuracy: 52.74 %   Semantic accuracy: 57.31 %   Syntactic accuracy: 44.75 %
gram4-superlative:
ACCURACY TOP1: 39.53 %  (200 / 506)
Total accuracy: 51.77 %   Semantic accuracy: 57.31 %   Syntactic accuracy: 43.81 %
gram5-present-participle:
ACCURACY TOP1: 40.32 %  (400 / 992)
Total accuracy: 50.33 %   Semantic accuracy: 57.31 %   Syntactic accuracy: 42.91 %
gram6-nationality-adjective:
ACCURACY TOP1: 84.46 %  (1158 / 1371)
Total accuracy: 55.39 %   Semantic accuracy: 57.31 %   Syntactic accuracy: 53.88 %
gram7-past-tense:
ACCURACY TOP1: 39.79 %  (530 / 1332)
Total accuracy: 53.42 %   Semantic accuracy: 57.31 %   Syntactic accuracy: 51.00 %
gram8-plural:
ACCURACY TOP1: 61.39 %  (609 / 992)
Total accuracy: 54.11 %   Semantic accuracy: 57.31 %   Syntactic accuracy: 52.38 %
gram9-plural-verbs:
ACCURACY TOP1: 33.38 %  (217 / 650)
Total accuracy: 53.01 %   Semantic accuracy: 57.31 %   Syntactic accuracy: 50.86 %
Questions seen / total: 12227 19544   62.56 %

5.4 Find the most closely related words

>distance vectors.bin
Enter word or sentence (EXIT to break): china

Word: china  Position in vocabulary: 486

                                              Word       Cosine distance
------------------------------------------------------------------------
                                            taiwan              0.649276
                                             japan              0.624836
                                            hainan              0.567946
                                          kalmykia              0.562871
                                             tibet              0.562600
                                               prc              0.553833
                                              tuva              0.553255
                                             korea              0.552685
                                           chinese              0.545661
                                            xiamen              0.542703
                                              liao              0.542607
                                             jiang              0.540888
                                         manchuria              0.540783
                                             wuhan              0.537735
                                            yunnan              0.535809
                                             hunan              0.535770
                                          hangzhou              0.524340
                                              yong              0.523802
                                           sichuan              0.517254
                                         guangdong              0.514874
                                             liang              0.511881
                                               jin              0.511389
                                             india              0.508853
                                          xinjiang              0.505971
                                         taiwanese              0.503072
                                              qing              0.502909
                                          shanghai              0.502771
                                          shandong              0.499169
                                           jiangxi              0.495940
                                           nanjing              0.492893
                                         guangzhou              0.492788
                                              zhao              0.490396
                                          shenzhen              0.489658
                                         singapore              0.489428
                                             hubei              0.488228
                                            harbin              0.488112
                                          liaoning              0.484283
                                          zhejiang              0.484192
                                            joseon              0.483718
                                          mongolia              0.481411
Enter word or sentence (EXIT to break):

5.5 Given A => B, find C => ?

>word-analogy vectors.bin
Enter three words (EXIT to break): china beijing canada

Word: china  Position in vocabulary: 486

Word: beijing  Position in vocabulary: 3880

Word: canada  Position in vocabulary: 474

                                              Word              Distance
------------------------------------------------------------------------
                                           toronto              0.624131
                                          montreal              0.559667
                                            mcgill              0.519338
                                           calgary              0.518366
                                           ryerson              0.515524
                                            ottawa              0.515316
                                           alberta              0.509334
                                          edmonton              0.498436
                                           moncton              0.488861
                                            quebec              0.487712
                                          canadian              0.475655
                                      saskatchewan              0.460744
                                       fredericton              0.460354
                                           ontario              0.458213
                                       montrealers              0.435611
                                         vancouver              0.429893
                                         saskatoon              0.416954
                                            dieppe              0.404408
                                           iqaluit              0.401143
                                         canadians              0.398137
                                          winnipeg              0.397547
                                            labatt              0.393893
                                              city              0.386245
                                      bilingualism              0.386245
                                          columbia              0.384754
                                        provincial              0.383439
                                             banff              0.382603
                                             metro              0.382367
                                            molson              0.379343
                                           nunavut              0.375992
                                             montr              0.373883
                                      francophones              0.373512
                                         brunswick              0.364261
                                          manitoba              0.360447
                                               bec              0.359977
                                       francophone              0.358556
                                             leafs              0.353035
                                        ellensburg              0.352787
                                           curling              0.351973
                                               cdn              0.347580
Enter three words (EXIT to break):
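
The idea behind the analogy query can be sketched with a few hand-made vectors: take vec(B) - vec(A) + vec(C) and rank the remaining words by cosine similarity to that target. The three-dimensional vectors below are invented for illustration and are not taken from the trained model, which is why this toy ranks ottawa first while the real run above ranks toronto first.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(vectors, a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding the inputs."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Hand-made toy vectors: [is_china, is_canada, is_capital]
toy = {
    "china":    [1.0, 0.0, 0.0],
    "beijing":  [1.0, 0.0, 1.0],
    "shanghai": [1.0, 0.0, 0.5],
    "canada":   [0.0, 1.0, 0.0],
    "ottawa":   [0.0, 1.0, 1.0],
    "toronto":  [0.0, 1.0, 0.5],
}

print(analogy(toy, "china", "beijing", "canada"))
```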

5.6 Cluster the words and write out the result (when classes is 0, the output is the vectors instead)

>word2vec -train text8 -output classes.txt -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -iter 15 -classes 500
Starting training using file text8
Vocab size: 71291
Words in train file: 16718843
Alpha: 0.000005  Progress: 100.10%  Words/thread/sec: 14.72k

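5.7 The original program also ships three test scripts for phrase handling. They rely on Linux commands such as sed and awk, so run them under Cygwin or MinGW.

Implementing posix_memalign on Windows

posix_memalign is mainly used to request aligned memory. The corresponding Windows function is _aligned_malloc, but the two take their parameters differently (note also that memory obtained from _aligned_malloc must be released with _aligned_free rather than free):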

int posix_memalign(void **memptr, size_t alignment, size_t size);
void * _aligned_malloc(size_t size, size_t alignment);

I found two implementations on Stack Overflow. As for the first one, all I can say is: hats off.

1. The most concise implementation

#define posix_memalign(p, a, s) (((*(p)) = _aligned_malloc((s), (a))), *(p) ?0 :errno)

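2. A more robust implementation

#ifdef _WIN32
#include <errno.h>   /* errno, EINVAL, ENOMEM */
#include <malloc.h>  /* _aligned_malloc */
#include <stddef.h>  /* size_t */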
static int check_align(size_t align)
{
    for (size_t i = sizeof(void *); i != 0; i *= 2)
    if (align == i)
        return 0;
    return EINVAL;
}

int posix_memalign(void **ptr, size_t align, size_t size)
{
    if (check_align(align))
        return EINVAL;

    int saved_errno = errno;
    void *p = _aligned_malloc(size, align);
    if (p == NULL)
    {
        errno = saved_errno;
        return ENOMEM;
    }

    *ptr = p;
    return 0;
}
#endif
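
The alignment check in the second implementation accepts only sizeof(void *) multiplied by a power of two. The same rule can be sketched in Python to make it easy to try values; `is_valid_alignment` is a hypothetical helper, and the pointer size defaults to that of the running interpreter.

```python
import ctypes


def is_valid_alignment(align, pointer_size=ctypes.sizeof(ctypes.c_void_p)):
    """True if align is a power of two and at least the pointer size."""
    # Pointer sizes are themselves powers of two (4 or 8), so any larger
    # power of two is automatically a multiple of the pointer size.
    is_power_of_two = align > 0 and (align & (align - 1)) == 0
    return is_power_of_two and align >= pointer_size


for value in (0, 4, 8, 16, 24, 128):
    print(value, is_valid_alignment(value, pointer_size=8))
```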

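Word segmentation and part-of-speech tagging: a summary

I recently tried a range of word segmentation and part-of-speech tagging tools, including the following (a "-" means the original notes gave no entry):

Tool                                     Language support           Other notes
ICTCLAS (Chinese Academy of Sciences)    Chinese and English        C++, multi-language interfaces
THULAC (Tsinghua University)             Chinese; English weaker    multi-language support
LTP (Harbin Institute of Technology)     -                          C++, multi-language interfaces
FudanDNN (Fudan University)              -                          Java
NiuParser (Northeastern University)      Chinese; English weaker    C++
Stanford (Stanford University)           Chinese and English        Java
Ansj                                     -                          Java
Jieba                                    -                          Python
Word                                     -                          Java
HanLP                                    -                          Java
LingPipe                                 English; Chinese weaker    Java
OpenNLP                                  -                          Java
NLTK                                     -                          Python
Gate                                     -                          Java, GUI, but not a good fit for how programmers think
lucene-analyzers-smartcn                 -                          Java; segmentation only, no part-of-speech tags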
There are also a few tools I did not test for lack of time. Have a look if you are interested:
mmseg4j
paoding
jcseg
IK-Analyzer

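In summary:
1. For both English and Chinese, segmentation and part-of-speech tagging are already fairly mature technologies.
2. English and Chinese are two completely separate ecosystems; for Chinese, the tools built in China are somewhat better.
3. The algorithms are public, so the model library often matters more than the algorithm.
4. A model library only needs to be good enough; bigger is not always better. For a specific domain in particular, a model you train yourself will work better.
5. The English model libraries are far better than the domestic ones, which is enviable.
6. I hope research in China can become more methodical, better organized and more professional.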

Word segmentation and part-of-speech tagging with Word

1. Download the jar or the source code
https://github.com/ysc/word/

2. Write some code

public static void tag(String sentence) throws Exception {
        List<Word> words = WordSegmenter.segWithStopWords(sentence, SegmentationAlgorithm.BidirectionalMaximumMatching);
        PartOfSpeechTagging.process(words);
        System.out.println(words);
    	/* Other segmentation algorithms that can be swapped in for the call above:
    	List<Word> words = WordSegmenter.segWithStopWords(sentence, SegmentationAlgorithm.BidirectionalMaximumMinimumMatching);
    	List<Word> words = WordSegmenter.segWithStopWords(sentence, SegmentationAlgorithm.BidirectionalMinimumMatching);
    	List<Word> words = WordSegmenter.segWithStopWords(sentence, SegmentationAlgorithm.FullSegmentation);
    	List<Word> words = WordSegmenter.segWithStopWords(sentence, SegmentationAlgorithm.MaximumMatching);
    	List<Word> words = WordSegmenter.segWithStopWords(sentence, SegmentationAlgorithm.MaxNgramScore);
    	List<Word> words = WordSegmenter.segWithStopWords(sentence, SegmentationAlgorithm.MinimalWordCount);
    	List<Word> words = WordSegmenter.segWithStopWords(sentence, SegmentationAlgorithm.MinimumMatching);
    	List<Word> words = WordSegmenter.segWithStopWords(sentence, SegmentationAlgorithm.PureEnglish);
    	List<Word> words = WordSegmenter.segWithStopWords(sentence, SegmentationAlgorithm.ReverseMaximumMatching);
    	List<Word> words = WordSegmenter.segWithStopWords(sentence, SegmentationAlgorithm.ReverseMinimumMatching);
    	*/
    }

3. Input
zh.txt

别让别人告诉你你成不了才,即使是我也不行。
如果你有梦想的话,就要去捍卫它。
那些一事无成的人想告诉你你也成不了大器。
如果你有理想的话,就要去努力实现。
就这样。

4. Output
zhout.txt

[别让/i, 别人/r, 告诉/v, 你/r, 你/r, 成不了/l, 才/d, 即使/c, 是/v, 我/r, 也/d, 不行/v, 如果/c, 你/r, 有/v, 梦想/n, 的话/u, 就要/d, 去/v, 捍卫/v, 它/r, 那些/r, 一事无成/l, 的/uj, 人/n, 想/v, 告诉/v, 你/r, 你/r, 也/d, 成不了/l, 大器/n, 如果/c, 你/r, 有理想/i, 的话/u, 就要/d, 去/v, 努力实现/nr, 就这样/i]