1. First run
docker run -d -p 3000:3000 --name metabase metabase/metabase
2. Subsequent runs
docker start metabase
3. Open http://ip:3000 in a browser, where ip is the address of the Docker host.
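The first-run and later-run commands can be folded into one helper, so you do not have to remember whether the container already exists. This is a minimal sketch rather than part of the original steps: the function name `start_metabase` is made up for illustration, and it assumes a Docker version that provides `docker container inspect`.

```shell
#!/bin/sh
# Create the Metabase container on the first run, reuse it on later runs.
# start_metabase is a hypothetical helper name, not a Docker or Metabase command.
start_metabase() {
    name="metabase"
    # `docker container inspect` exits non-zero when no such container exists.
    if docker container inspect "$name" >/dev/null 2>&1; then
        docker start "$name"
    else
        docker run -d -p 3000:3000 --name "$name" metabase/metabase
    fi
}
```

Calling `start_metabase` then replaces both steps 1 and 2.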
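1. Install the dependencies
sudo apt-get install build-essential libssl-dev libffi-dev python-dev python-pip libsasl2-dev libldap2-dev
pip install virtualenv
2. Use a virtual environment
# Create the sandbox
virtualenv supersetenv
# Enter the sandbox
source supersetenv/bin/activate
3. Install
# Upgrade the packaging tools, then install Superset
pip install --upgrade setuptools pip
pip install superset
# Create an admin user
fabmanager create-admin --app superset
# Reset the admin password
#fabmanager reset-password admin --app superset
# Upgrade the database
superset db upgrade
# Load the example data
superset load_examples
# Initialize
superset init
# Run the server
superset runserver
4. You can now log in at http://ip:8088
5. Install database drivers
# MySQL
apt-get install libmysqlclient-dev
pip install mysqlclient
# Oracle
#pip install cx_Oracle
# SQL Server
#pip install pymssql
6. Shut down
# Press Ctrl+C to stop the server
# Leave the sandbox
deactivate
# Delete the sandbox (rmvirtualenv comes from virtualenvwrapper; with plain virtualenv, remove the supersetenv directory instead)
#rmvirtualenv supersetenv
7. Upgrade
# Enter the sandbox
source supersetenv/bin/activate
# Upgrade the server
pip install superset --upgrade
# Upgrade the database
superset db upgrade
# Initialize
superset init
# Leave the sandbox
deactivate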
This post covers Spark SQL, specifically how to import data from MySQL.
// Beforehand, the MySQL JDBC driver jar was copied into Spark's jars directory
// Create the DataFrame
val jdbcDF = spark.read.format("jdbc").option("url", "jdbc:mysql://localhost:3307").option("dbtable", "hive.TBLS").option("user", "hive").option("password", "hive").load()
// Schema
jdbcDF.schema
// Row count
jdbcDF.count()
// Show the rows
jdbcDF.show()
This post covers basic Spark SQL operations.
1. Working with data through the DataFrame API
// Load JSON data
val df = spark.read.json("/usr/hadoop/person.json")
// Load CSV data
//val df = spark.read.csv("/usr/hadoop/person.csv")
// Show the first 20 rows
df.show()
// Print the schema
df.printSchema()
// Select one column
df.select("NAME").show()
// Filter rows by a condition
df.filter($"BALANCE_COST" < 10 && $"BALANCE_COST" > 1).show()
// Group and count
df.groupBy("SEX_CODE").count().show()
2. Working with data through SQL
// Create a temporary view
df.createOrReplaceTempView("person")
// Query the data
spark.sql("SELECT * FROM person").show()
// Count the rows
spark.sql("SELECT * FROM person").count()
// Select with a condition
spark.sql("SELECT * FROM person WHERE BALANCE_COST<10 and BALANCE_COST>1 order by BALANCE_COST").show()
3. Converting to a Dataset
// The case class field names must match the column names in the JSON data
case class PERSONC(PATIENT_NO : String,NAME : String,SEX_CODE : String,BIRTHDATE : String,BALANCE_COST : String)
var personDS = spark.read.json("/usr/hadoop/person.json").as[PERSONC]
4. Mixing SQL with map and reduce
// Both lines compute the maximum BALANCE_COST, treating null as 0.0
personDS.select("BALANCE_COST").map(row=>if(row(0)==null) 0.0 else (row(0)+"").toDouble).reduce((a,b)=>if(a>b) a else b)
spark.sql("select BALANCE_COST from person").map(row=>if(row(0)==null) 0.0 else (row(0)+"").toDouble).reduce((a,b)=>if(a>b) a else b)
5. Mapping data to class objects
// Person is not defined in the original post; this definition is an assumption that matches the five fields used below
case class Person(PATIENT_NO: String, NAME: String, SEX_CODE: String, BIRTHDATE: String, BALANCE_COST: Double)
val personRDD = spark.sparkContext.textFile("/usr/hadoop/person.txt")
val persons = personRDD.map(_.split(",")).map(attributes => Person(attributes(0), attributes(1), attributes(2), attributes(3), attributes(4).toDouble))
6. Defining a custom schema
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
// Load the data
val personRDD = spark.sparkContext.textFile("/usr/hadoop/person.txt")
// Convert each line to an org.apache.spark.sql.Row
val rowRDD = personRDD.map(_.split(",")).map(attributes => Row(attributes(0), attributes(1), attributes(2), attributes(3), attributes(4).replace("\"","").toDouble))
// Define the new schema
val personSchema = StructType(List(StructField("PatientNum",StringType,nullable = true), StructField("Name",StringType,nullable = true), StructField("SexCode",StringType,nullable = true), StructField("BirthDate",StringType,nullable = true), StructField("BalanceCode",DoubleType,nullable = true)))
// Build the new DataFrame
val personDF = spark.createDataFrame(rowRDD, personSchema)
// Use the DataFrame
personDF.select("PatientNum").show()
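The previous post showed how to integrate Spark with Hadoop; this one shows how to integrate Spark with HBase.
1. Get the HBase classpath
# Remove the netty and jetty jars from the result, otherwise there will be jar conflicts
HBASE_PATH=`/home/hadoop/Deploy/hbase-1.1.2/bin/hbase classpath`
2. Start Spark
bin/spark-shell --driver-class-path $HBASE_PATH
3. Run a few simple operations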
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "inpatient_hb")
val admin = new HBaseAdmin(conf)
admin.isTableAvailable("inpatient_hb")
res1: Boolean = true
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result])
hBaseRDD.count()
2017-01-03 20:46:29,854 INFO [main] scheduler.DAGScheduler (Logging.scala:logInfo(58)) - Job 0 finished: count at <console>:36, took 23.170739 s
res2: Long = 115077
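Step 1 says to drop the netty and jetty jars from the HBase classpath, but does not show how. One possible way is to filter the colon-separated classpath string before handing it to spark-shell. This is a sketch under assumptions: `filter_classpath` is a hypothetical helper name, and it only removes entries whose path contains "netty" or "jetty", so wildcard entries such as `lib/*` would still pull those jars in.

```shell
#!/bin/sh
# Remove every classpath entry whose path mentions netty or jetty.
# filter_classpath is a hypothetical helper, not an HBase or Spark command.
filter_classpath() {
    # Split on ':' into one entry per line, drop matching entries, rejoin with ':'.
    printf '%s\n' "$1" | tr ':' '\n' | grep -Ev 'netty|jetty' | paste -sd: -
}

# Possible usage with the paths from this post:
#   HBASE_PATH=$(filter_classpath "$(/home/hadoop/Deploy/hbase-1.1.2/bin/hbase classpath)")
#   bin/spark-shell --driver-class-path "$HBASE_PATH"
```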
1. Start Spark
sbin/start-all.sh
The Spark status page is available at http://hiup:8080/.
2. Start the shell
bin/spark-shell
3. Test
$ ./run-example SparkPi 10
Pi is roughly 3.140408
$ ./run-example SparkPi 100
Pi is roughly 3.1412528
$ ./run-example SparkPi 1000
Pi is roughly 3.14159016
4. Basic operations
// Load data from HDFS
scala> var textFile=sc.textFile("/usr/hadoop/inpatient.txt")
// First line
scala> textFile.first()
res1: String = "content of the first line"
// First line split on commas, first field (inpatient number)
textFile.first().split(",")(0)
res2: String = "0000718165"
// First line split on commas, field at index 5 (cost)
textFile.first().split(",")(5)
res3: String = "100.01"
// Number of lines
scala> textFile.count()
res4: Long = 115411
// Number of lines containing "ICU"
textFile.filter(line=>line.contains("ICU")).count()
res5: Long = 912
// Length of each line
var lineLengths = textFile.map(s=>s.length)
// Total length
var totalLength = lineLengths.reduce((a,b)=>a+b)
totalLength: Int = 32859905
// Maximum cost
textFile.map(line=>if(line.split(",").size==30) line.split(",")(23).replace("\"","") else "0").reduce((a,b)=>if(a.toDouble>b.toDouble) a else b)
res6: String = 300
// Define a class
@SerialVersionUID(100L)
class PATIENT(var PATIENT_NO : String,var NAME : String,var SEX_CODE : String,var BIRTHDATE : String,var BALANCE_COST : String) extends Serializable
// Create an instance
var p=new PATIENT("PATIENT_NO","NAME","SEX_CODE","BIRTHDATE","BALANCE_COST")
// Define a map function
def mapFunc(line:String) : PATIENT = {
var cols=line.split(",")
return new PATIENT(cols(0),cols(1),cols(2),cols(3),cols(4))
}
// Maximum cost
textFile.filter(line=>line.split(",").size==30).map(mapFunc).reduce((a,b)=>if(a.BALANCE_COST.replace("\"","").toDouble>b.BALANCE_COST.replace("\"","").toDouble) a else b).BALANCE_COST
// Maximum cost among male patients
textFile.filter(line=>line.split(",").size==30).map(mapFunc).filter(p=>p.SEX_CODE=="\"M\"").reduce((a,b)=>if(a.BALANCE_COST.replace("\"","").toDouble>b.BALANCE_COST.replace("\"","").toDouble) a else b).BALANCE_COST
// Maximum cost among female patients
textFile.filter(line=>line.split(",").size==30).map(mapFunc).filter(p=>p.SEX_CODE=="\"F\"").reduce((a,b)=>if(a.BALANCE_COST.replace("\"","").toDouble>b.BALANCE_COST.replace("\"","").toDouble) a else b).BALANCE_COST
// Quit (the Scala 2.11 shell used by Spark 2.0 no longer accepts a bare exit)
scala> :quit
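To sanity-check the Spark results on a small local sample of the file, the "maximum cost" logic can be approximated with awk. This is a rough sketch: `max_field` is a hypothetical helper, the expected field count and the field position are supplied by the caller, and lines with a different field count are skipped. Unlike Scala's `split(",")`, awk does not drop trailing empty fields, so the two can disagree on lines that end in commas.

```shell
#!/bin/sh
# Print the largest numeric value found in one comma-separated field.
# $1 = file, $2 = expected number of fields per line, $3 = 1-based field index
# max_field is a hypothetical helper name used only for this illustration.
max_field() {
    awk -F',' -v n="$2" -v c="$3" '
        NF == n {
            v = $c
            gsub(/"/, "", v)                 # strip the double quotes around the value
            if (!seen || v + 0 > max) {      # compare numerically
                max = v + 0; best = v; seen = 1
            }
        }
        END { if (seen) print best }         # print the value as it appeared in the file
    ' "$1"
}
```

For example, `max_field sample.txt 30 24` looks at the same field as `split(",")(23)` in the Spark one-liner, since awk fields are 1-based.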
1. Download scala-2.11.1 and extract it to /usr/scala/scala-2.11.1
2. Download spark-2.0.0-bin-hadoop2.4 and extract it to /home/hadoop/Deploy/spark-2.0.0
(* To follow the later posts, hadoop-2.5.2, hbase-1.1.2, hive-1.2.1 and spark-2.0.0 are recommended)
3. Copy spark-env.sh.template to spark-env.sh and add the following lines
export JAVA_HOME=/usr/java/jdk1.7.0_79
export SCALA_HOME=/usr/scala/scala-2.11.1/
export SPARK_MASTER_IP=hiup01
export SPARK_WORKER_MEMORY=1g
export HADOOP_CONF_DIR=/home/hadoop/Deploy/hadoop-2.5.2/etc/hadoop
4. Copy slaves.template to slaves and add the following lines
hiup01
hiup02
hiup03
5. Copy scala-2.11.1 and spark-2.0.0 to hiup02 and hiup03
6. The environment is now set up.
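Step 5 can be scripted with scp. The sketch below is only one way to do it and rests on assumptions the post does not state: passwordless SSH from hiup01 to the workers, the same directory layout on every host, and write permission on the target directories. The function name `distribute` is made up for illustration.

```shell
#!/bin/sh
# Copy the Scala and Spark directories to each worker host.
# distribute is a hypothetical helper; the host names and paths come from the steps above.
distribute() {
    for host in hiup02 hiup03; do
        # -r copies each directory recursively into the same parent path on the worker
        scp -r /usr/scala/scala-2.11.1 "$host:/usr/scala/"
        scp -r /home/hadoop/Deploy/spark-2.0.0 "$host:/home/hadoop/Deploy/"
    done
}
```

Copying after spark-env.sh and slaves have been edited means the workers receive the same configuration as the master.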
1. Hive to HBase
1.1. Create the Hive table
create table inpatient_hv(
PATIENT_NO String COMMENT 'inpatient number',
NAME String COMMENT 'name',
SEX_CODE String COMMENT 'sex',
BIRTHDATE TIMESTAMP COMMENT 'date of birth',
BALANCE_COST String COMMENT 'total cost')
COMMENT 'basic information on inpatients'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\'
STORED AS TEXTFILE;
1.2. Load data into the Hive table
load data inpath '/usr/hadoop/inpatient.txt' into table inpatient_hv;
1.3. Create the HBase-backed table (from Hive)
create table inpatient_hb(
PATIENT_NO String COMMENT 'inpatient number',
NAME String COMMENT 'name',
SEX_CODE String COMMENT 'sex',
BIRTHDATE TIMESTAMP COMMENT 'date of birth',
BALANCE_COST String COMMENT 'total cost')
COMMENT 'basic information on inpatients'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,pinfo:NAME,pinfo:SEX_CODE,pinfo:BIRTHDATE,pinfo:BALANCE_COST")
TBLPROPERTIES ("hbase.table.name" = "inpatient_hb");
1.4. Copy the data from Hive into HBase
INSERT OVERWRITE TABLE inpatient_hb SELECT * FROM inpatient_hv;
2. HBase to Hive
2.1. Create the HBase table (in the HBase shell)
create 'inpatient_hb','pinfo'
2.2. Load data into the HBase table
./hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator="," -Dimporttsv.columns=HBASE_ROW_KEY,pinfo:INPATIENT_NO,pinfo:NAME,pinfo:SEX_CODE,pinfo:BIRTHDATE,pinfo:BALANCE_COST inpatient_hb /usr/hadoop/inpatient.txt
2.3. Create the Hive tables
-- Create the external table over the HBase table
create external table inpatient_hb(
PATIENT_NO String COMMENT 'inpatient number',
INPATIENT_NO String COMMENT 'inpatient serial number',
NAME String COMMENT 'name',
SEX_CODE String COMMENT 'sex',
BIRTHDATE TIMESTAMP COMMENT 'date of birth',
BALANCE_COST String COMMENT 'total cost')
COMMENT 'basic information on inpatients'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,pinfo:INPATIENT_NO,pinfo:NAME,pinfo:SEX_CODE,pinfo:BIRTHDATE,pinfo:BALANCE_COST")
TBLPROPERTIES ("hbase.table.name" = "inpatient_hb");
-- Create the Hive table
create table inpatient_hv(
PATIENT_NO String COMMENT 'inpatient number',
INPATIENT_NO String COMMENT 'inpatient serial number',
NAME String COMMENT 'name',
SEX_CODE String COMMENT 'sex',
BIRTHDATE TIMESTAMP COMMENT 'date of birth',
BALANCE_COST String COMMENT 'total cost')
COMMENT 'basic information on inpatients'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\'
STORED AS TEXTFILE;
2.4. Copy the data from HBase into Hive
INSERT OVERWRITE TABLE inpatient_hv SELECT * FROM inpatient_hb;
1. Create the table
create table inpatient(
PATIENT_NO String COMMENT 'inpatient number',
INPATIENT_NO String COMMENT 'inpatient serial number',
NAME String COMMENT 'name',
SEX_CODE String COMMENT 'sex',
BIRTHDATE TIMESTAMP COMMENT 'date of birth',
BALANCE_COST String COMMENT 'total cost')
COMMENT 'basic information on inpatients'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,pinfo:INPATIENT_NO,pinfo:NAME,pinfo:SEX_CODE,pinfo:BIRTHDATE,pinfo:BALANCE_COST")
TBLPROPERTIES ("hbase.table.name" = "inpatient");
2. Load data through HBase
2.1. Direct import with ImportTsv
./hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator="," -Dimporttsv.columns=HBASE_ROW_KEY,pinfo:INPATIENT_NO,pinfo:NAME,pinfo:SEX_CODE,pinfo:BIRTHDATE,pinfo:BALANCE_COST inpatient /usr/hadoop/inpatient.txt
......
2016-12-22 10:33:36,985 INFO [main] client.RMProxy: Connecting to ResourceManager at hadoop-master/172.16.172.13:8032
2016-12-22 10:33:37,340 INFO [main] Configuration.deprecation: io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
2016-12-22 10:33:43,450 INFO [main] input.FileInputFormat: Total input paths to process : 1
2016-12-22 10:33:44,640 INFO [main] mapreduce.JobSubmitter: number of splits:1
2016-12-22 10:33:44,952 INFO [main] Configuration.deprecation: io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
2016-12-22 10:33:47,173 INFO [main] mapreduce.JobSubmitter: Submitting tokens for job: job_1482371551462_0002
2016-12-22 10:33:50,830 INFO [main] impl.YarnClientImpl: Submitted application application_1482371551462_0002
2016-12-22 10:33:51,337 INFO [main] mapreduce.Job: The url to track the job: http://hadoop-master:8088/proxy/application_1482371551462_0002/
2016-12-22 10:33:51,338 INFO [main] mapreduce.Job: Running job: job_1482371551462_0002
2016-12-22 10:34:39,499 INFO [main] mapreduce.Job: Job job_1482371551462_0002 running in uber mode : false
2016-12-22 10:34:39,572 INFO [main] mapreduce.Job: map 0% reduce 0%
2016-12-22 10:35:48,228 INFO [main] mapreduce.Job: map 1% reduce 0%
2016-12-22 10:36:06,876 INFO [main] mapreduce.Job: map 3% reduce 0%
2016-12-22 10:36:09,981 INFO [main] mapreduce.Job: map 5% reduce 0%
2016-12-22 10:36:13,739 INFO [main] mapreduce.Job: map 7% reduce 0%
2016-12-22 10:36:17,592 INFO [main] mapreduce.Job: map 10% reduce 0%
2016-12-22 10:36:22,891 INFO [main] mapreduce.Job: map 12% reduce 0%
2016-12-22 10:36:45,217 INFO [main] mapreduce.Job: map 17% reduce 0%
2016-12-22 10:37:14,914 INFO [main] mapreduce.Job: map 20% reduce 0%
2016-12-22 10:37:35,739 INFO [main] mapreduce.Job: map 25% reduce 0%
2016-12-22 10:37:39,013 INFO [main] mapreduce.Job: map 34% reduce 0%
2016-12-22 10:38:24,289 INFO [main] mapreduce.Job: map 42% reduce 0%
2016-12-22 10:38:36,644 INFO [main] mapreduce.Job: map 49% reduce 0%
2016-12-22 10:38:57,618 INFO [main] mapreduce.Job: map 54% reduce 0%
2016-12-22 10:39:00,808 INFO [main] mapreduce.Job: map 56% reduce 0%
2016-12-22 10:39:07,879 INFO [main] mapreduce.Job: map 58% reduce 0%
2016-12-22 10:39:11,489 INFO [main] mapreduce.Job: map 60% reduce 0%
2016-12-22 10:39:24,708 INFO [main] mapreduce.Job: map 62% reduce 0%
2016-12-22 10:39:29,188 INFO [main] mapreduce.Job: map 63% reduce 0%
2016-12-22 10:39:34,165 INFO [main] mapreduce.Job: map 65% reduce 0%
2016-12-22 10:40:12,473 INFO [main] mapreduce.Job: map 66% reduce 0%
2016-12-22 10:40:39,471 INFO [main] mapreduce.Job: map 73% reduce 0%
2016-12-22 10:40:40,910 INFO [main] mapreduce.Job: map 74% reduce 0%
2016-12-22 10:40:42,936 INFO [main] mapreduce.Job: map 75% reduce 0%
2016-12-22 10:40:46,471 INFO [main] mapreduce.Job: map 77% reduce 0%
2016-12-22 10:40:50,495 INFO [main] mapreduce.Job: map 79% reduce 0%
2016-12-22 10:40:53,267 INFO [main] mapreduce.Job: map 81% reduce 0%
2016-12-22 10:41:06,843 INFO [main] mapreduce.Job: map 83% reduce 0%
2016-12-22 10:41:13,140 INFO [main] mapreduce.Job: map 92% reduce 0%
2016-12-22 10:41:22,305 INFO [main] mapreduce.Job: map 93% reduce 0%
2016-12-22 10:41:27,671 INFO [main] mapreduce.Job: map 96% reduce 0%
2016-12-22 10:41:48,688 INFO [main] mapreduce.Job: map 100% reduce 0%
2016-12-22 10:43:20,552 INFO [main] mapreduce.Job: Job job_1482371551462_0002 completed successfully
2016-12-22 10:43:28,574 INFO [main] mapreduce.Job: Counters: 31
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=127746
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=43306042
HDFS: Number of bytes written=0
HDFS: Number of read operations=2
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Launched map tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=460404
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=460404
Total vcore-seconds taken by all map tasks=460404
Total megabyte-seconds taken by all map tasks=471453696
Map-Reduce Framework
Map input records=115411
Map output records=115152
Input split bytes=115
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=26590
CPU time spent (ms)=234550
Physical memory (bytes) snapshot=83329024
Virtual memory (bytes) snapshot=544129024
Total committed heap usage (bytes)=29036544
ImportTsv
Bad Lines=259
File Input Format Counters
Bytes Read=43305927
File Output Format Counters
Bytes Written=0
2.2. Import with completebulkload
# Add the following line to /etc/profile
#export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$HBASE_HOME/lib/*"
./hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator="," -Dimporttsv.bulk.output=/usr/hadoop/inpatient.tmp -Dimporttsv.columns=HBASE_ROW_KEY,pinfo:INPATIENT_NO,pinfo:NAME,pinfo:SEX_CODE,pinfo:BIRTHDATE,pinfo:BALANCE_COST inpatient /usr/hadoop/inpatient.txt
......
2016-12-22 12:26:04,496 INFO [main] client.RMProxy: Connecting to ResourceManager at hadoop-master/172.16.172.13:8032
2016-12-22 12:26:12,411 INFO [main] input.FileInputFormat: Total input paths to process : 1
2016-12-22 12:26:12,563 INFO [main] mapreduce.JobSubmitter: number of splits:1
2016-12-22 12:26:12,577 INFO [main] Configuration.deprecation: io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
2016-12-22 12:26:13,220 INFO [main] mapreduce.JobSubmitter: Submitting tokens for job: job_1482371551462_0005
2016-12-22 12:26:13,764 INFO [main] impl.YarnClientImpl: Submitted application application_1482371551462_0005
2016-12-22 12:26:13,832 INFO [main] mapreduce.Job: The url to track the job: http://hadoop-master:8088/proxy/application_1482371551462_0005/
2016-12-22 12:26:13,833 INFO [main] mapreduce.Job: Running job: job_1482371551462_0005
2016-12-22 12:26:35,952 INFO [main] mapreduce.Job: Job job_1482371551462_0005 running in uber mode : false
2016-12-22 12:26:36,156 INFO [main] mapreduce.Job: map 0% reduce 0%
2016-12-22 12:27:15,839 INFO [main] mapreduce.Job: map 3% reduce 0%
2016-12-22 12:27:18,868 INFO [main] mapreduce.Job: map 53% reduce 0%
2016-12-22 12:27:21,981 INFO [main] mapreduce.Job: map 58% reduce 0%
2016-12-22 12:27:29,195 INFO [main] mapreduce.Job: map 67% reduce 0%
2016-12-22 12:27:41,582 INFO [main] mapreduce.Job: map 83% reduce 0%
2016-12-22 12:27:52,819 INFO [main] mapreduce.Job: map 85% reduce 0%
2016-12-22 12:27:59,189 INFO [main] mapreduce.Job: map 93% reduce 0%
2016-12-22 12:28:07,498 INFO [main] mapreduce.Job: map 100% reduce 0%
2016-12-22 12:29:11,199 INFO [main] mapreduce.Job: map 100% reduce 67%
2016-12-22 12:29:24,353 INFO [main] mapreduce.Job: map 100% reduce 70%
2016-12-22 12:29:32,324 INFO [main] mapreduce.Job: map 100% reduce 74%
2016-12-22 12:29:37,001 INFO [main] mapreduce.Job: map 100% reduce 79%
2016-12-22 12:29:38,011 INFO [main] mapreduce.Job: map 100% reduce 82%
2016-12-22 12:29:41,038 INFO [main] mapreduce.Job: map 100% reduce 84%
2016-12-22 12:29:45,082 INFO [main] mapreduce.Job: map 100% reduce 88%
2016-12-22 12:29:48,115 INFO [main] mapreduce.Job: map 100% reduce 90%
2016-12-22 12:29:51,154 INFO [main] mapreduce.Job: map 100% reduce 92%
2016-12-22 12:29:54,186 INFO [main] mapreduce.Job: map 100% reduce 94%
2016-12-22 12:29:57,205 INFO [main] mapreduce.Job: map 100% reduce 97%
2016-12-22 12:30:00,236 INFO [main] mapreduce.Job: map 100% reduce 100%
2016-12-22 12:30:06,388 INFO [main] mapreduce.Job: Job job_1482371551462_0005 completed successfully
2016-12-22 12:30:09,203 INFO [main] mapreduce.Job: Counters: 50
File System Counters
FILE: Number of bytes read=237707880
FILE: Number of bytes written=357751428
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=43306042
HDFS: Number of bytes written=195749237
HDFS: Number of read operations=8
HDFS: Number of large read operations=0
HDFS: Number of write operations=3
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=99691
Total time spent by all reduces in occupied slots (ms)=83330
Total time spent by all map tasks (ms)=99691
Total time spent by all reduce tasks (ms)=83330
Total vcore-seconds taken by all map tasks=99691
Total vcore-seconds taken by all reduce tasks=83330
Total megabyte-seconds taken by all map tasks=102083584
Total megabyte-seconds taken by all reduce tasks=85329920
Map-Reduce Framework
Map input records=115411
Map output records=115152
Map output bytes=118397787
Map output materialized bytes=118853937
Input split bytes=115
Combine input records=115152
Combine output records=115077
Reduce input groups=115077
Reduce shuffle bytes=118853937
Reduce input records=115077
Reduce output records=3337137
Spilled Records=345231
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=2017
CPU time spent (ms)=38130
Physical memory (bytes) snapshot=383750144
Virtual memory (bytes) snapshot=1184014336
Total committed heap usage (bytes)=231235584
ImportTsv
Bad Lines=259
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=43305927
File Output Format Counters
Bytes Written=195749237
3. Query from Hive
hive> select * from inpatient limit 1;
OK
......
Time taken: 12.419 seconds, Fetched: 1 row(s)
hive> select count(*) from inpatient;
Query ID = hadoop_20161222114304_b247c745-a6ec-4e52-b76d-daefb657ac20
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1482371551462_0004, Tracking URL = http://hadoop-master:8088/proxy/application_1482371551462_0004/
Kill Command = /home/hadoop/Deploy/hadoop-2.5.2/bin/hadoop job -kill job_1482371551462_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-12-22 11:44:22,634 Stage-1 map = 0%, reduce = 0%
2016-12-22 11:45:08,704 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.74 sec
2016-12-22 11:45:50,754 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 8.19 sec
MapReduce Total cumulative CPU time: 8 seconds 190 msec
Ended Job = job_1482371551462_0004
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 8.19 sec HDFS Read: 13353 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 8 seconds 190 msec
OK
115077
Time taken: 170.801 seconds, Fetched: 1 row(s)
# Load the HFiles written in step 2.2 (/usr/hadoop/inpatient.tmp) into the inpatient table
./hadoop jar /home/hadoop/Deploy/hbase-1.1.2/lib/hbase-server-1.1.2.jar completebulkload /usr/hadoop/inpatient.tmp inpatient
......
16/12/22 12:42:04 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=90000 watcher=hconnection-0x4df040780x0, quorum=localhost:2181, baseZNode=/hbase
16/12/22 12:42:04 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
16/12/22 12:42:04 INFO zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session
16/12/22 12:42:04 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x5924755d380005, negotiated timeout = 90000
16/12/22 12:42:06 INFO zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x7979cd9c connecting to ZooKeeper ensemble=localhost:2181
16/12/22 12:42:06 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=90000 watcher=hconnection-0x7979cd9c0x0, quorum=localhost:2181, baseZNode=/hbase
16/12/22 12:42:06 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
16/12/22 12:42:06 INFO zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session
16/12/22 12:42:07 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x5924755d380006, negotiated timeout = 90000
16/12/22 12:42:07 WARN mapreduce.LoadIncrementalHFiles: Skipping non-directory hdfs://hadoop-master:9000/usr/hadoop/inpatient.tmp/_SUCCESS
16/12/22 12:42:08 INFO hfile.CacheConfig: CacheConfig:disabled
16/12/22 12:42:08 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://hadoop-master:9000/usr/hadoop/inpatient.tmp/pinfo/7ee330c0f66c4d36b5d614a337d3929f first=" last="B301150360"
16/12/22 12:42:08 INFO client.ConnectionManager$HConnectionImplementation: Closing master protocol: MasterService
16/12/22 12:42:08 INFO client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x5924755d380006
16/12/22 12:42:08 INFO zookeeper.ZooKeeper: Session: 0x5924755d380006 closed
16/12/22 12:42:08 INFO zookeeper.ClientCnxn: EventThread shut down
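Both ImportTsv runs report Bad Lines=259, which matches the gap between 115411 map input records and 115152 map output records. One plausible cause is lines whose number of comma-separated fields differs from the six entries in importtsv.columns. ImportTsv may reject lines for other reasons too, so treat the sketch below as a first check only. The helper names are hypothetical, and it expects a local copy of the input file, for example one fetched with `hdfs dfs -get`.

```shell
#!/bin/sh
# Count lines whose comma-separated field count differs from the expected one.
# $1 = local copy of the input file, $2 = expected number of fields
# count_bad_lines and list_bad_lines are hypothetical helper names.
count_bad_lines() {
    awk -F',' -v n="$2" 'NF != n { bad++ } END { print bad + 0 }' "$1"
}

# Print the line number and content of each suspicious line for inspection.
list_bad_lines() {
    awk -F',' -v n="$2" 'NF != n { print NR ": " $0 }' "$1"
}
```

Running `count_bad_lines inpatient.txt 6` and comparing the result with 259 shows whether field-count mismatches account for the rejected lines.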
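The previous post showed how to integrate Hive with Hadoop; this one shows how to integrate Hive with HBase.
Integrating Hive with HBase looks simpler than you might expect:
1. Set the environment variables
export HADOOP_HOME=/home/hadoop/Deploy/hadoop-2.5.2
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/mynative"
export HBASE_HOME=/home/hadoop/Deploy/hbase-1.1.2
export HIVE_HOME=/home/hadoop/Deploy/hive-1.2.1
2. Start Hadoop
3. Start HBase
4. Start the metastore
./hive --service metastore &
5. Start Hive
# Print DEBUG logs
#./hive -hiveconf hive.root.logger=DEBUG,console
# Connect through a single node
./hive -hiveconf hbase.master=hadoop-master:6000
# Connect through multiple nodes
./hive -hiveconf hbase.zookeeper.quorum=hadoop-master:2181,hadoop-slave01:2181,hadoop-slave02:2181
# For convenience, the same properties can be set in hive-site.xml instead
6. Integration finished? Not quite.
7. Download the hive-1.2.1 source from GitHub and rebuild the hive-hbase-handler-1.2.1 jar to resolve the version incompatibility
7.1. Option 1: edit the following properties in pom.xml and build with mvn on Linux (the build uses bash); you will need to resolve some compatibility issues yourself
<hbase.hadoop1.version>0.98.9-hadoop1</hbase.hadoop1.version>
<hbase.hadoop2.version>1.2.1-hadoop2</hbase.hadoop2.version>
7.2. Option 2: copy the hbase-handler source, collect the jars shipped with Hadoop, HBase and Hive as dependencies (excluding hive-hbase-handler-1.2.1.jar), and package the jar yourself; crude but effective
8. Replace the jar in Hive's lib directory with your rebuilt hive-hbase-handler-1.2.1.jar
9. That's it.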
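Step 8 overwrites a jar that ships with Hive, so it is worth keeping the original around. The sketch below is one cautious way to do the swap. `replace_handler_jar` is a hypothetical helper, and the location of the rebuilt jar is whatever your build produced. The backup gets a `.bak` suffix so that it no longer matches `*.jar`.

```shell
#!/bin/sh
# Back up the stock hive-hbase-handler jar, then install the rebuilt one.
# $1 = path to the rebuilt jar, $2 = Hive lib directory (for example "$HIVE_HOME/lib")
# replace_handler_jar is a hypothetical helper name.
replace_handler_jar() {
    target="$2/hive-hbase-handler-1.2.1.jar"
    if [ ! -f "$1" ]; then
        echo "rebuilt jar not found: $1" >&2
        return 1
    fi
    # Keep a copy of the original jar if one is present.
    if [ -f "$target" ]; then
        cp -p "$target" "$target.bak"
    fi
    cp "$1" "$target"
}
```

Restart the metastore and the Hive CLI afterwards so that they pick up the new jar.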