Spark Environment Setup 01

1. Download scala-2.11.1 and extract it to /usr/scala/scala-2.11.1

2. Download spark-2.0.0-bin-hadoop2.4 and extract it to /home/hadoop/Deploy/spark-2.0.0
(* To follow the later posts in this series, hadoop-2.5.2, hbase-1.1.2, hive-1.2.1, and spark-2.0.0 are recommended.)

3. Copy spark-env.sh.template to spark-env.sh and add the following lines

export JAVA_HOME=/usr/java/jdk1.7.0_79
export SCALA_HOME=/usr/scala/scala-2.11.1/
# Spark 2.x prefers SPARK_MASTER_HOST; SPARK_MASTER_IP is deprecated but still honored
export SPARK_MASTER_IP=hiup01
export SPARK_WORKER_MEMORY=1g
export HADOOP_CONF_DIR=/home/hadoop/Deploy/hadoop-2.5.2/etc/hadoop

4. Copy slaves.template to slaves and add the following lines

hiup01
hiup02
hiup03

5. Copy scala-2.11.1 and spark-2.0.0 to hiup02 and hiup03

6. The environment setup is complete.
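
To verify the cluster, a quick sketch (assuming default ports and passwordless SSH from hiup01 to the workers):

# On hiup01, start the standalone master and all workers listed in slaves
sbin/start-all.sh
# The master web UI at http://hiup01:8080 should show all three workers
# Smoke-test against the standalone master
bin/spark-shell --master spark://hiup01:7077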

Importing Data Between Hive and HBase

1. Hive to HBase
1.1. Create the Hive table

create table inpatient_hv(
PATIENT_NO String COMMENT '住院号',
NAME String COMMENT '姓名',
SEX_CODE String COMMENT '性别',
BIRTHDATE TIMESTAMP COMMENT '生日',
BALANCE_COST String COMMENT '总费用')
COMMENT '住院患者基本信息'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\'
STORED AS TEXTFILE;

1.2. Load data into the Hive table

load data inpath '/usr/hadoop/inpatient.txt' into table inpatient_hv;

1.3. Create the HBase-backed Hive table

create table inpatient_hb(
PATIENT_NO String COMMENT '住院号',
NAME String COMMENT '姓名',
SEX_CODE String COMMENT '性别',
BIRTHDATE TIMESTAMP COMMENT '生日',
BALANCE_COST String COMMENT '总费用')
COMMENT '住院患者基本信息'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,pinfo:NAME,pinfo:SEX_CODE,pinfo:BIRTHDATE,pinfo:BALANCE_COST") 
TBLPROPERTIES ("hbase.table.name" = "inpatient_hb");

1.4. Import the data from Hive into HBase

INSERT OVERWRITE TABLE inpatient_hb SELECT * FROM inpatient_hv;
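
To spot-check the result from the HBase side (a quick sketch; PATIENT_NO becomes the row key):

# In the hbase shell, sample one row of the target table
scan 'inpatient_hb', {LIMIT => 1}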

2. HBase to Hive
2.1. Create the HBase table

create 'inpatient_hb','pinfo'

2.2. Load data into the HBase table

./hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator="," -Dimporttsv.columns=HBASE_ROW_KEY,pinfo:INPATIENT_NO,pinfo:NAME,pinfo:SEX_CODE,pinfo:BIRTHDATE,pinfo:BALANCE_COST inpatient_hb /usr/hadoop/inpatient.txt
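
Without -Dimporttsv.bulk.output, ImportTsv writes through normal Puts; once it finishes, a quick row count from the hbase shell can confirm the load (malformed lines are skipped):

count 'inpatient_hb'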

2.3. Create the Hive tables

# Create the HBase-backed external table
create external table inpatient_hb(
PATIENT_NO String COMMENT '住院号',
INPATIENT_NO String  COMMENT '住院流水号',
NAME String COMMENT '姓名',
SEX_CODE String COMMENT '性别',
BIRTHDATE TIMESTAMP COMMENT '生日',
BALANCE_COST String COMMENT '总费用')
COMMENT '住院患者基本信息'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,pinfo:INPATIENT_NO,pinfo:NAME,pinfo:SEX_CODE,pinfo:BIRTHDATE,pinfo:BALANCE_COST") 
TBLPROPERTIES ("hbase.table.name" = "inpatient_hb");

# Create the native Hive table
create table inpatient_hv(
PATIENT_NO String COMMENT '住院号',
INPATIENT_NO String  COMMENT '住院流水号',
NAME String COMMENT '姓名',
SEX_CODE String COMMENT '性别',
BIRTHDATE TIMESTAMP COMMENT '生日',
BALANCE_COST String COMMENT '总费用')
COMMENT '住院患者基本信息'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\'
STORED AS TEXTFILE;

2.4. Import the data from HBase into Hive

INSERT OVERWRITE TABLE inpatient_hv SELECT * FROM inpatient_hb;
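
A quick sanity check on the Hive side:

select count(*) from inpatient_hv;
select * from inpatient_hv limit 3;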

Hive Environment Setup 04

1. Create the table

create table inpatient(
PATIENT_NO String COMMENT '住院号',
INPATIENT_NO String COMMENT '住院流水号',
NAME String COMMENT '姓名',
SEX_CODE String COMMENT '性别',
BIRTHDATE TIMESTAMP COMMENT '生日',
BALANCE_COST String COMMENT '总费用')
COMMENT '住院患者基本信息'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,pinfo:INPATIENT_NO,pinfo:NAME,pinfo:SEX_CODE,pinfo:BIRTHDATE,pinfo:BALANCE_COST") 
TBLPROPERTIES ("hbase.table.name" = "inpatient");

2. Loading data into HBase
2.1. Direct import with ImportTsv

./hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator="," -Dimporttsv.columns=HBASE_ROW_KEY,pinfo:INPATIENT_NO,pinfo:NAME,pinfo:SEX_CODE,pinfo:BIRTHDATE,pinfo:BALANCE_COST inpatient /usr/hadoop/inpatient.txt
......
2016-12-22 10:33:36,985 INFO  [main] client.RMProxy: Connecting to ResourceManager at hadoop-master/172.16.172.13:8032
2016-12-22 10:33:37,340 INFO  [main] Configuration.deprecation: io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
2016-12-22 10:33:43,450 INFO  [main] input.FileInputFormat: Total input paths to process : 1
2016-12-22 10:33:44,640 INFO  [main] mapreduce.JobSubmitter: number of splits:1
2016-12-22 10:33:44,952 INFO  [main] Configuration.deprecation: io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
2016-12-22 10:33:47,173 INFO  [main] mapreduce.JobSubmitter: Submitting tokens for job: job_1482371551462_0002
2016-12-22 10:33:50,830 INFO  [main] impl.YarnClientImpl: Submitted application application_1482371551462_0002
2016-12-22 10:33:51,337 INFO  [main] mapreduce.Job: The url to track the job: http://hadoop-master:8088/proxy/application_1482371551462_0002/
2016-12-22 10:33:51,338 INFO  [main] mapreduce.Job: Running job: job_1482371551462_0002
2016-12-22 10:34:39,499 INFO  [main] mapreduce.Job: Job job_1482371551462_0002 running in uber mode : false
2016-12-22 10:34:39,572 INFO  [main] mapreduce.Job:  map 0% reduce 0%
2016-12-22 10:35:48,228 INFO  [main] mapreduce.Job:  map 1% reduce 0%
2016-12-22 10:36:06,876 INFO  [main] mapreduce.Job:  map 3% reduce 0%
2016-12-22 10:36:09,981 INFO  [main] mapreduce.Job:  map 5% reduce 0%
2016-12-22 10:36:13,739 INFO  [main] mapreduce.Job:  map 7% reduce 0%
2016-12-22 10:36:17,592 INFO  [main] mapreduce.Job:  map 10% reduce 0%
2016-12-22 10:36:22,891 INFO  [main] mapreduce.Job:  map 12% reduce 0%
2016-12-22 10:36:45,217 INFO  [main] mapreduce.Job:  map 17% reduce 0%
2016-12-22 10:37:14,914 INFO  [main] mapreduce.Job:  map 20% reduce 0%
2016-12-22 10:37:35,739 INFO  [main] mapreduce.Job:  map 25% reduce 0%
2016-12-22 10:37:39,013 INFO  [main] mapreduce.Job:  map 34% reduce 0%
2016-12-22 10:38:24,289 INFO  [main] mapreduce.Job:  map 42% reduce 0%
2016-12-22 10:38:36,644 INFO  [main] mapreduce.Job:  map 49% reduce 0%
2016-12-22 10:38:57,618 INFO  [main] mapreduce.Job:  map 54% reduce 0%
2016-12-22 10:39:00,808 INFO  [main] mapreduce.Job:  map 56% reduce 0%
2016-12-22 10:39:07,879 INFO  [main] mapreduce.Job:  map 58% reduce 0%
2016-12-22 10:39:11,489 INFO  [main] mapreduce.Job:  map 60% reduce 0%
2016-12-22 10:39:24,708 INFO  [main] mapreduce.Job:  map 62% reduce 0%
2016-12-22 10:39:29,188 INFO  [main] mapreduce.Job:  map 63% reduce 0%
2016-12-22 10:39:34,165 INFO  [main] mapreduce.Job:  map 65% reduce 0%
2016-12-22 10:40:12,473 INFO  [main] mapreduce.Job:  map 66% reduce 0%
2016-12-22 10:40:39,471 INFO  [main] mapreduce.Job:  map 73% reduce 0%
2016-12-22 10:40:40,910 INFO  [main] mapreduce.Job:  map 74% reduce 0%
2016-12-22 10:40:42,936 INFO  [main] mapreduce.Job:  map 75% reduce 0%
2016-12-22 10:40:46,471 INFO  [main] mapreduce.Job:  map 77% reduce 0%
2016-12-22 10:40:50,495 INFO  [main] mapreduce.Job:  map 79% reduce 0%
2016-12-22 10:40:53,267 INFO  [main] mapreduce.Job:  map 81% reduce 0%
2016-12-22 10:41:06,843 INFO  [main] mapreduce.Job:  map 83% reduce 0%
2016-12-22 10:41:13,140 INFO  [main] mapreduce.Job:  map 92% reduce 0%
2016-12-22 10:41:22,305 INFO  [main] mapreduce.Job:  map 93% reduce 0%
2016-12-22 10:41:27,671 INFO  [main] mapreduce.Job:  map 96% reduce 0%
2016-12-22 10:41:48,688 INFO  [main] mapreduce.Job:  map 100% reduce 0%
2016-12-22 10:43:20,552 INFO  [main] mapreduce.Job: Job job_1482371551462_0002 completed successfully
2016-12-22 10:43:28,574 INFO  [main] mapreduce.Job: Counters: 31
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=127746
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=43306042
                HDFS: Number of bytes written=0
                HDFS: Number of read operations=2
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=0
        Job Counters
                Launched map tasks=1
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=460404
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=460404
                Total vcore-seconds taken by all map tasks=460404
                Total megabyte-seconds taken by all map tasks=471453696
        Map-Reduce Framework
                Map input records=115411
                Map output records=115152
                Input split bytes=115
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=26590
                CPU time spent (ms)=234550
                Physical memory (bytes) snapshot=83329024
                Virtual memory (bytes) snapshot=544129024
                Total committed heap usage (bytes)=29036544
        ImportTsv
                Bad Lines=259
        File Input Format Counters
                Bytes Read=43305927
        File Output Format Counters
                Bytes Written=0

2.2. Bulk load with completebulkload

# Add the following line to /etc/profile first
#export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$HBASE_HOME/lib/*"
./hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator="," -Dimporttsv.bulk.output=/usr/hadoop/inpatient.tmp -Dimporttsv.columns=HBASE_ROW_KEY,pinfo:INPATIENT_NO,pinfo:NAME,pinfo:SEX_CODE,pinfo:BIRTHDATE,pinfo:BALANCE_COST inpatient /usr/hadoop/inpatient.txt
......
2016-12-22 12:26:04,496 INFO  [main] client.RMProxy: Connecting to ResourceManager at hadoop-master/172.16.172.13:8032
2016-12-22 12:26:12,411 INFO  [main] input.FileInputFormat: Total input paths to process : 1
2016-12-22 12:26:12,563 INFO  [main] mapreduce.JobSubmitter: number of splits:1
2016-12-22 12:26:12,577 INFO  [main] Configuration.deprecation: io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
2016-12-22 12:26:13,220 INFO  [main] mapreduce.JobSubmitter: Submitting tokens for job: job_1482371551462_0005
2016-12-22 12:26:13,764 INFO  [main] impl.YarnClientImpl: Submitted application application_1482371551462_0005
2016-12-22 12:26:13,832 INFO  [main] mapreduce.Job: The url to track the job: http://hadoop-master:8088/proxy/application_1482371551462_0005/
2016-12-22 12:26:13,833 INFO  [main] mapreduce.Job: Running job: job_1482371551462_0005
2016-12-22 12:26:35,952 INFO  [main] mapreduce.Job: Job job_1482371551462_0005 running in uber mode : false
2016-12-22 12:26:36,156 INFO  [main] mapreduce.Job:  map 0% reduce 0%
2016-12-22 12:27:15,839 INFO  [main] mapreduce.Job:  map 3% reduce 0%
2016-12-22 12:27:18,868 INFO  [main] mapreduce.Job:  map 53% reduce 0%
2016-12-22 12:27:21,981 INFO  [main] mapreduce.Job:  map 58% reduce 0%
2016-12-22 12:27:29,195 INFO  [main] mapreduce.Job:  map 67% reduce 0%
2016-12-22 12:27:41,582 INFO  [main] mapreduce.Job:  map 83% reduce 0%
2016-12-22 12:27:52,819 INFO  [main] mapreduce.Job:  map 85% reduce 0%
2016-12-22 12:27:59,189 INFO  [main] mapreduce.Job:  map 93% reduce 0%
2016-12-22 12:28:07,498 INFO  [main] mapreduce.Job:  map 100% reduce 0%
2016-12-22 12:29:11,199 INFO  [main] mapreduce.Job:  map 100% reduce 67%
2016-12-22 12:29:24,353 INFO  [main] mapreduce.Job:  map 100% reduce 70%
2016-12-22 12:29:32,324 INFO  [main] mapreduce.Job:  map 100% reduce 74%
2016-12-22 12:29:37,001 INFO  [main] mapreduce.Job:  map 100% reduce 79%
2016-12-22 12:29:38,011 INFO  [main] mapreduce.Job:  map 100% reduce 82%
2016-12-22 12:29:41,038 INFO  [main] mapreduce.Job:  map 100% reduce 84%
2016-12-22 12:29:45,082 INFO  [main] mapreduce.Job:  map 100% reduce 88%
2016-12-22 12:29:48,115 INFO  [main] mapreduce.Job:  map 100% reduce 90%
2016-12-22 12:29:51,154 INFO  [main] mapreduce.Job:  map 100% reduce 92%
2016-12-22 12:29:54,186 INFO  [main] mapreduce.Job:  map 100% reduce 94%
2016-12-22 12:29:57,205 INFO  [main] mapreduce.Job:  map 100% reduce 97%
2016-12-22 12:30:00,236 INFO  [main] mapreduce.Job:  map 100% reduce 100%
2016-12-22 12:30:06,388 INFO  [main] mapreduce.Job: Job job_1482371551462_0005 completed successfully
2016-12-22 12:30:09,203 INFO  [main] mapreduce.Job: Counters: 50
        File System Counters
                FILE: Number of bytes read=237707880
                FILE: Number of bytes written=357751428
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=43306042
                HDFS: Number of bytes written=195749237
                HDFS: Number of read operations=8
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=3
        Job Counters
                Launched map tasks=1
                Launched reduce tasks=1
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=99691
                Total time spent by all reduces in occupied slots (ms)=83330
                Total time spent by all map tasks (ms)=99691
                Total time spent by all reduce tasks (ms)=83330
                Total vcore-seconds taken by all map tasks=99691
                Total vcore-seconds taken by all reduce tasks=83330
                Total megabyte-seconds taken by all map tasks=102083584
                Total megabyte-seconds taken by all reduce tasks=85329920
        Map-Reduce Framework
                Map input records=115411
                Map output records=115152
                Map output bytes=118397787
                Map output materialized bytes=118853937
                Input split bytes=115
                Combine input records=115152
                Combine output records=115077
                Reduce input groups=115077
                Reduce shuffle bytes=118853937
                Reduce input records=115077
                Reduce output records=3337137
                Spilled Records=345231
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=2017
                CPU time spent (ms)=38130
                Physical memory (bytes) snapshot=383750144
                Virtual memory (bytes) snapshot=1184014336
                Total committed heap usage (bytes)=231235584
        ImportTsv
                Bad Lines=259
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=43305927
        File Output Format Counters
                Bytes Written=195749237

3. Run queries from Hive

hive> select * from inpatient limit 1;
OK
......
Time taken: 12.419 seconds, Fetched: 1 row(s)

hive> select count(*) from inpatient;
Query ID = hadoop_20161222114304_b247c745-a6ec-4e52-b76d-daefb657ac20
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1482371551462_0004, Tracking URL = http://hadoop-master:8088/proxy/application_1482371551462_0004/
Kill Command = /home/hadoop/Deploy/hadoop-2.5.2/bin/hadoop job  -kill job_1482371551462_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-12-22 11:44:22,634 Stage-1 map = 0%,  reduce = 0%
2016-12-22 11:45:08,704 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.74 sec
2016-12-22 11:45:50,754 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 8.19 sec
MapReduce Total cumulative CPU time: 8 seconds 190 msec
Ended Job = job_1482371551462_0004
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 8.19 sec   HDFS Read: 13353 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 8 seconds 190 msec
OK
115077
Time taken: 170.801 seconds, Fetched: 1 row(s)
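
The bulk-output run in 2.2 only wrote HFiles to /usr/hadoop/inpatient.tmp; they still have to be handed off to HBase with completebulkload: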

./hadoop jar /home/hadoop/Deploy/hbase-1.1.2/lib/hbase-server-1.1.2.jar completebulkload /usr/hadoop/inpatient.tmp inpatient
......
16/12/22 12:42:04 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=90000 watcher=hconnection-0x4df040780x0, quorum=localhost:2181, baseZNode=/hbase
16/12/22 12:42:04 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
16/12/22 12:42:04 INFO zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session
16/12/22 12:42:04 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x5924755d380005, negotiated timeout = 90000
16/12/22 12:42:06 INFO zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x7979cd9c connecting to ZooKeeper ensemble=localhost:2181
16/12/22 12:42:06 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=90000 watcher=hconnection-0x7979cd9c0x0, quorum=localhost:2181, baseZNode=/hbase
16/12/22 12:42:06 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
16/12/22 12:42:06 INFO zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session
16/12/22 12:42:07 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x5924755d380006, negotiated timeout = 90000
16/12/22 12:42:07 WARN mapreduce.LoadIncrementalHFiles: Skipping non-directory hdfs://hadoop-master:9000/usr/hadoop/inpatient.tmp/_SUCCESS
16/12/22 12:42:08 INFO hfile.CacheConfig: CacheConfig:disabled
16/12/22 12:42:08 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://hadoop-master:9000/usr/hadoop/inpatient.tmp/pinfo/7ee330c0f66c4d36b5d614a337d3929f first=" last="B301150360"
16/12/22 12:42:08 INFO client.ConnectionManager$HConnectionImplementation: Closing master protocol: MasterService
16/12/22 12:42:08 INFO client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x5924755d380006
16/12/22 12:42:08 INFO zookeeper.ZooKeeper: Session: 0x5924755d380006 closed
16/12/22 12:42:08 INFO zookeeper.ClientCnxn: EventThread shut down

Hive Environment Setup 03

The previous posts covered integrating Hive with Hadoop; now let's look at integrating Hive with HBase.

Integrating Hive with HBase is simpler than you could imagine:

1. Set the environment variables

export HADOOP_HOME=/home/hadoop/Deploy/hadoop-2.5.2
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/mynative"
export HBASE_HOME=/home/hadoop/Deploy/hbase-1.1.2
export HIVE_HOME=/home/hadoop/Deploy/hive-1.2.1

2. Start Hadoop

3. Start HBase

4. Start the metastore

./hive --service metastore &

5. Start hive

# To emit DEBUG logs to the console:
#./hive -hiveconf hive.root.logger=DEBUG,console
# Connect to a single HBase master
./hive -hiveconf hbase.master=hadoop-master:60000
# Connect through the ZooKeeper quorum
./hive -hiveconf hbase.zookeeper.quorum=hadoop-master:2181,hadoop-slave01:2181,hadoop-slave02:2181
# For convenience, the same settings can be added as properties in hive-site.xml
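
For example (a sketch; adjust the host list to your cluster):

  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>hadoop-master,hadoop-slave01,hadoop-slave02</value>
  </property>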

6. All done with the integration? Well, not so fast...

7. Download the hive-1.2.1 source from GitHub; the hive-hbase-handler-1.2.1 jar needs to be rebuilt (to resolve a version incompatibility)
7.1. Option 1: change the versions below in pom.xml and build with mvn on Linux (the build uses bash); you will have to resolve some compatibility issues yourself

    <hbase.hadoop1.version>0.98.9-hadoop1</hbase.hadoop1.version>
    <hbase.hadoop2.version>1.2.1-hadoop2</hbase.hadoop2.version>

7.2. Option 2: copy the hbase-handler source, pull together the jars under hadoop, hbase, and hive as dependencies (removing hive-hbase-handler-1.2.1.jar), and build the jar from that. Crude but effective.

8. Replace the jar under hive's lib directory with your rebuilt hive-hbase-handler-1.2.1.jar

9. And that's it.

Hive Environment Setup 02

1. Start Hadoop

2. Start the metastore

./hive --service metastore &

3. Start hive

./hive

4. Create a table

show tables;
create table inpatient(
PATIENT_NO String COMMENT '住院号',
NAME String COMMENT '姓名',
SEX_CODE String COMMENT '性别',
BIRTHDATE TIMESTAMP COMMENT '生日',
BALANCE_COST String COMMENT '总费用'
)
COMMENT '住院患者基本信息'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\'
STORED AS TEXTFILE;

5. Inspect the table

show tables;
describe inpatient;

6. Load data

# Load from HDFS
./hadoop fs -put ~/Deploy/testdata/inpatient.txt /usr/hadoop/
load data inpath '/usr/hadoop/inpatient.txt' into table inpatient;
# Load from the local filesystem
load data local inpath '/home/hadoop/Deploy/testdata/inpatient.txt' into table inpatient;

7. Now run a query to see how many patients there are of each sex

hive> select count(*),sex_code from inpatient group by sex_code;
Query ID = hadoop_20161221104618_d530ab5e-d1c9-48c9-a5da-dff47f81786d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1482286286190_0005, Tracking URL = http://hadoop-master:8088/proxy/application_1482286286190_0005/
Kill Command = /home/hadoop/Deploy/hadoop-2.5.2/bin/hadoop job  -kill job_1482286286190_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-12-21 10:46:39,640 Stage-1 map = 0%,  reduce = 0%
2016-12-21 10:47:08,810 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.39 sec
2016-12-21 10:47:33,013 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 5.5 sec
MapReduce Total cumulative CPU time: 5 seconds 500 msec
Ended Job = job_1482286286190_0005
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 5.5 sec   HDFS Read: 43317170 HDFS Write: 26 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 500 msec
OK
4       "0"
60479   "F"
54928   "M"
Time taken: 77.198 seconds, Fetched: 3 row(s)

8. Drop the table

drop table inpatient;

9. Run dfs commands (from the Hive CLI)

dfs -ls /

Hive Environment Setup 01

1. Install MySQL

2. Create the Hive metastore database

CREATE DATABASE hive CHARACTER SET latin1;
CREATE USER 'hive'@'localhost' IDENTIFIED BY 'hive';
GRANT ALL PRIVILEGES ON hive.* to 'hive'@'localhost';

If you accidentally create the database above with the utf8 character set, as I did, you will hit the following error:

com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was too long; max key length is 767 bytes.

The fix:

alter database hive character set latin1;

3. Download hive-1.2.1 and extract it
(* To follow the later posts in this series, hadoop-2.5.2, hbase-1.1.2, hive-1.2.1, and spark-2.0.0 are recommended.)

4. Download the MySQL JDBC driver mysql-connector-java-xxx.jar and drop it into hive's lib directory

5. Replace hadoop's share/hadoop/yarn/lib/jline-xxx.jar with the jline-2.12.jar from hive's lib directory
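
For example (a sketch; the exact bundled version varies by Hadoop release, jline-0.9.94.jar is assumed here):

# Back up hadoop's old jline, then copy in hive's newer one
mv $HADOOP_HOME/share/hadoop/yarn/lib/jline-0.9.94.jar ~/jline-0.9.94.jar.bak
cp $HIVE_HOME/lib/jline-2.12.jar $HADOOP_HOME/share/hadoop/yarn/lib/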

6. Copy hive-default.xml.template to hive-site.xml

7. Edit the hive-site.xml configuration; the changes fall into the groups below
7.1. MySQL settings

  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive</value>
    <description>password to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <!--value>org.apache.derby.jdbc.EmbeddedDriver</value-->
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
    <description>Username to use against metastore database</description>
  </property>
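
With the connection settings in place, the metastore schema can also be initialized up front (an optional sketch; Hive can otherwise create it on first use):

# Initialize the metastore schema in MySQL
$HIVE_HOME/bin/schematool -dbType mysql -initSchema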

7.2. Hadoop settings (these paths must exist on HDFS; a sketch to create them follows the properties)

  <property>
    <name>hive.exec.scratchdir</name>
    <value>hdfs://hadoop-master:9000/hive/tmp</value>
    <description>HDFS root scratch dir for Hive jobs which gets created with write all (733) permission. For each connecting user, an HDFS scratch dir: ${hive.exec.scratchdir}/&lt;username&gt; is created, with ${hive.scratch.dir.permission}.</description>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>hdfs://hadoop-master:9000/hive/warehouse</value>
    <description>location of default database for the warehouse</description>
  </property>
  <property>
    <name>hive.user.install.directory</name>
    <value>hdfs://hadoop-master:9000/hive/usr</value>
    <description>
      If hive (in tez mode only) cannot find a usable hive jar in "hive.jar.directory", 
      it will upload the hive jar to "hive.user.install.directory/user.name"
      and use it to run queries.
    </description>
  </property>
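
A sketch to create those HDFS directories beforehand (the 733 permission on the scratch dir follows the description above):

# Create the HDFS directories Hive expects
./hadoop fs -mkdir -p /hive/tmp /hive/warehouse /hive/usr
./hadoop fs -chmod 733 /hive/tmp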

7.3. Hive settings

  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://hadoop-master:9083</value>
    <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
  </property>

7.4. Hive local path settings
Set each value below to a directory that actually exists on your deployment; an example follows the properties.

  <property>
    <name>hive.exec.local.scratchdir</name>
    <value>${system:java.io.tmpdir}/${system:user.name}</value>
    <description>Local scratch space for Hive jobs</description>
  </property>
  <property>
    <name>hive.downloaded.resources.dir</name>
    <value>${system:java.io.tmpdir}/${hive.session.id}_resources</value>
    <description>Temporary local directory for added resources in the remote file system.</description>
  </property>
  <property>
    <name>hive.querylog.location</name>
    <value>${system:java.io.tmpdir}/${system:user.name}</value>
    <description>Location of Hive run time structured log file</description>
  </property>
  <property>
    <name>hive.server2.logging.operation.log.location</name>
    <value>${system:java.io.tmpdir}/${system:user.name}/operation_logs</value>
    <description>Top level directory where operation logs are stored if logging functionality is enabled</description>
  </property>
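
For example (a sketch, assuming /home/hadoop/hive-local exists and is writable by the hive user):

  <property>
    <name>hive.exec.local.scratchdir</name>
    <value>/home/hadoop/hive-local/scratch</value>
  </property>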

8. Configuration complete :)

Cassandra 3 Basic Operations 01

1. First, check the status of each node

neohope@debian8-node01:~/Deploy/apache-cassandra-3.0.1$ bin/nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens       Owns    Host ID                               Rack
UN  172.16.172.23  230.25 KB  256          ?       760879a3-e06b-43bb-97bf-b045a655da9c  rack1
UN  172.16.172.24  237.17 KB  256          ?       5bf3e77d-ccf4-4f81-9465-de0babe1e0cc  rack1
UN  172.16.172.25  241.55 KB  256          ?       5a7ed53f-c8e0-4658-af9a-41880dada5ef  rack1

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless

2. Then connect to Cassandra with cqlsh

neohope@debian8-node01:~/Deploy/apache-cassandra-3.0.1$ bin/cqlsh debian8-node01
Connected to NeoCluster at debian8-node01:9042.
[cqlsh 5.0.1 | Cassandra 3.0.1 | CQL spec 3.3.1 | Native protocol v4]
Use HELP for help.

cqlsh:neokeyspace> help

Documented shell commands:
===========================
CAPTURE  CLS          COPY  DESCRIBE  EXPAND  LOGIN   SERIAL  SOURCE   UNICODE
CLEAR    CONSISTENCY  DESC  EXIT      HELP    PAGING  SHOW    TRACING

CQL help topics:
================
AGGREGATES               CREATE_KEYSPACE           DROP_TRIGGER      TEXT     
ALTER_KEYSPACE           CREATE_MATERIALIZED_VIEW  DROP_TYPE         TIME     
ALTER_MATERIALIZED_VIEW  CREATE_ROLE               DROP_USER         TIMESTAMP
ALTER_TABLE              CREATE_TABLE              FUNCTIONS         TRUNCATE 
ALTER_TYPE               CREATE_TRIGGER            GRANT             TYPES    
ALTER_USER               CREATE_TYPE               INSERT            UPDATE   
APPLY                    CREATE_USER               INSERT_JSON       USE      
ASCII                    DATE                      INT               UUID     
BATCH                    DELETE                    JSON            
BEGIN                    DROP_AGGREGATE            KEYWORDS        
BLOB                     DROP_COLUMNFAMILY         LIST_PERMISSIONS
BOOLEAN                  DROP_FUNCTION             LIST_ROLES      
COUNTER                  DROP_INDEX                LIST_USERS      
CREATE_AGGREGATE         DROP_KEYSPACE             PERMISSIONS     
CREATE_COLUMNFAMILY      DROP_MATERIALIZED_VIEW    REVOKE          
CREATE_FUNCTION          DROP_ROLE                 SELECT          
CREATE_INDEX             DROP_TABLE                SELECT_JSON     
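
The prompt above shows a keyspace named neokeyspace already in use; a sketch of how such a keyspace and a table could be created (the table name and columns are illustrative):

CREATE KEYSPACE neokeyspace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
USE neokeyspace;
CREATE TABLE users (user_id uuid PRIMARY KEY, name text);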


Simple HBase Client Code

First, a word about configuration. There are two ways to configure the HBase client: through a configuration file, or programmatically in code.

1. Configuration file
The file must be named hbase-site.xml and placed on the CLASSPATH to take effect. An example:
hbase-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
	<property>
		<name>hbase.rootdir</name>
		<value>hdfs://hadoop-master:9000/hbase</value>
	</property>
	<property>
		<name>hbase.cluster.distributed</name>
		<value>true</value>
	</property>
	<property>
		<name>hbase.master</name>
		<value>hadoop-master:60000</value>
	</property>
	<property>
		<name>hbase.zookeeper.quorum</name>
		<value>hadoop-master,hadoop-slave01,hadoop-slave02</value>
	</property>
</configuration>

2. Programmatic configuration

        Configuration hbaseConfig = HBaseConfiguration.create();
        hbaseConfig.setInt("timeout", 120000);
        hbaseConfig.set("hbase.master", "hdfs://hadoop-master:60000");
        hbaseConfig.set("hbase.zookeeper.quorum", "hadoop-master,hadoop-slave01,hadoop-slave02");
        hbaseConfig.setInt("hbase.zookeeper.property.clientPort", 2181);
        hbaseConfig.setInt("hbase.client.retries.number", 1);


HBase Basic Operations 04

Finally, let's go over some less commonly used operations:

Add a column family:

hbase(main):036:0> alter 'patientvisit', NAME=>'cftest'
Updating all regions with the new schema...
0/1 regions updated.
1/1 regions updated.
Done.
0 row(s) in 3.5200 seconds

Modify a column family:

hbase(main):039:0> alter 'patientvisit', NAME=>'cftest', VERSIONS=>3
Updating all regions with the new schema...
0/1 regions updated.
1/1 regions updated.
Done.
0 row(s) in 3.0070 seconds

Delete a column family:

hbase(main):040:0> alter 'patientvisit', NAME=>'cftest', METHOD=>'delete'
Updating all regions with the new schema...
1/1 regions updated.
Done.
0 row(s) in 2.4890 seconds
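
Each change can be confirmed from the hbase shell:

describe 'patientvisit'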
