版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
1、1 部署环境1.1 CDH环境DataNode有5个节点,Hive、Spark、Impala都有3个节点,每个节点128G内存。1.2 创建表1、 原始数据表:taglibs_text,以STORED AS TEXTFILE方式存储表。2、 RCFILE格式表:taglibs_rc,以STORED AS RCFILE方式存储表。3、 PARQUET格式表:taglibs_parquet,以STORED AS PARQUET方式存储表。4、 关联查询表:taglibs_busy,只有一个telno字段,用于跟数据表做关联查询。1.3 制作测试数据原始数据是实际数据,517列,966929行,以&
2、作为分隔符1.3.1 原始数据行扩展roothadoop-9 gaoyang# for(i=10087;i HIVE_201606050001_201606_1.txt ;done;原始数据3G左右,行扩展一次替换不要太大30G左右即可,否则会慢,跟系统内存有关1.3.2 原始数据列扩展roothadoop-9 gaoyang# cat HIVE_201606050001_201606_1.txt |awk BEGINOFS=&print $0$0 HIVE1.txt行扩展完成后,进行列扩展,517扩大一倍变成1034列roothadoop-9 gaoyang# cat HIVE_201606
3、050001_201606_1.txt |awk BEGINOFS=&print $0$0$0$0 HIVE1.txt行扩展完成后,进行列扩展,517扩大四倍变成2068列1.4 导入数据1、 导入原始text格式数据:load data local inpath /root/gaoyang/HIVE_201606050001_201606.txt into table taglibs_text;2、 将原始数据导入到RCFILE格式表:insert into taglibs_rc select * from taglibs_text;3、 将原始数据导入到PARQUET格式表:insert
4、into taglibs_parquet select * from taglibs_text;4、 将业务数据导入到busy表:load data local inpath /root/gaoyang/busy.txt into table taglibs_busy;1.5 占用空间1.5.1 taglibs_rc517占用空间hive desc formatted taglibs_rc517;# Detailed Table Information Database: default Owner: root CreateTime: Wed Nov 16 11:12:55 CST 2016
5、LastAccessTime: UNKNOWN Protect Mode: None Retention: 0 Location: hdfs:/hadoop-9:8020/user/hive/warehouse/taglibs_rc517 Table Type: MANAGED_TABLE Table Parameters: COLUMN_STATS_ACCURATE true numFiles 1113 numRows 99818156 rawDataSize 245389068794 totalSize 265033234899 transient_lastDdlTime 14792868
6、03 # Storage Information SerDe Library: org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe InputFormat: org.apache.hadoop.hive.ql.io.RCFileInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.RCFileOutputFormat Compressed: No Num Buckets: -1 Bucket Columns: Sort Columns: Storage Desc Params: fie
7、ld.delim & serialization.format & Time taken: 0.32 seconds, Fetched: 548 row(s)1.5.2 taglibs_rc1034占用空间hive desc formatted taglibs_rc1034;# Detailed Table Information Database: default Owner: root CreateTime: Mon Nov 21 10:41:09 CST 2016 LastAccessTime: UNKNOWN Protect Mode: None Retention: 0 Locati
8、on: hdfs:/hadoop-9:8020/user/hive/warehouse/taglibs_rc1034 Table Type: MANAGED_TABLE Table Parameters: COLUMN_STATS_ACCURATEtrue last_modified_by root last_modified_time 1479780464 numFiles 1109 numRows 50280308 rawDataSize 221080904200 totalSize 238925225228 transient_lastDdlTime1479782247 # Storag
9、e Information SerDe Library: org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe InputFormat: org.apache.hadoop.hive.ql.io.RCFileInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.RCFileOutputFormat Compressed: No Num Buckets: -1 Bucket Columns: Sort Columns: Storage Desc Params: field.delim &
10、serialization.format& Time taken: 0.399 seconds, Fetched: 1067 row(s)1.5.3 taglibs_parquet517占用空间hive desc formatted taglibs_parquet;# Detailed Table Information Database: default Owner: root CreateTime: Thu Nov 17 16:52:18 CST 2016 LastAccessTime: UNKNOWN Protect Mode: None Retention: 0 Location: h
11、dfs:/hadoop-9:8020/user/hive/warehouse/taglibs_parquet517 Table Type: MANAGED_TABLE Table Parameters: COLUMN_STATS_ACCURATE true numFiles 791 numRows 100785085 rawDataSize 52105888945 totalSize 122656264171 transient_lastDdlTime 1479453914 # Storage Information SerDe Library: org.apache.hadoop.hive.
12、ql.io.parquet.serde.ParquetHiveSerDe InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat Compressed: No Num Buckets: 0 Bucket Columns: Sort Columns: Storage Desc Params: field.delim & serialization.fo
13、rmat & Time taken: 0.249 seconds, Fetched: 548 row(s)1.5.4 taglibs_parquet517占用空间hive desc formatted taglibs_parquet1034;# Detailed Table Information Database: default Owner: root CreateTime: Tue Nov 22 13:46:26 CST 2016 LastAccessTime: UNKNOWN Protect Mode: None Retention: 0 Location: hdfs:/hadoop-
14、9:8020/user/hive/warehouse/taglibs_parquet1034 Table Type: MANAGED_TABLE Table Parameters: COLUMN_STATS_ACCURATEtrue numFiles 762 numRows 50280308 rawDataSize 51989838472 totalSize 92736041837 transient_lastDdlTime1479796596 # Storage Information SerDe Library: org.apache.hadoop.hive.ql.io.parquet.s
15、erde.ParquetHiveSerDe InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat Compressed: No Num Buckets: -1 Bucket Columns: Sort Columns: Storage Desc Params: field.delim & serialization.format& Time tak
16、en: 0.333 seconds, Fetched: 1065 row(s)2 Hadoop+hive+spark测试2.1 测试Hive-RCFILE-517列2.1.1 条件查询2.1.1.1 查询总数hive select count(*) from taglibs_rc;结果:99818156Time taken: 119.816 seconds, Fetched: 1 row(s)2.1.1.2 10个条件count(*)hive select count(*) from taglibs_rc where age50 and net=4 and times12 and month_
17、fee200 and data_fee10 and downtime=0 and term_times12 and call_times100 and called_times100 and province_roam50;结果:50808Time taken: 69.47 seconds, Fetched: 1 row(s)2.1.1.3 10个条件11个目标取100条hive select telno,age,net,times,month_fee,data_fee,downtime,term_times,call_times,called_times,province_roam from
18、 taglibs_rc where age50 and net=4 and times12 and month_fee200 and data_fee10 and downtime=0 and term_times12 and call_times100 and called_times100 and province_roam50 limit 100;结果:Time taken: 55.384 seconds, Fetched: 100 row(s)2.1.1.4 多表联合查询hive select taglibs_rc.telno,age,net,times,month_fee,data_
19、fee,downtime,term_times,call_times,called_times,province_roam from taglibs_rc join taglibs_busy on taglibs_rc.telno=taglibs_busy.telno;结果:Time taken: 72.727 seconds, Fetched: 9999 row(s)2.1.2 查询结论以517列,99818156行数据为例,hive on spark查询性能如下:序号查询功能查询性能1查询总数119秒210个条件count(*)69秒310个条件11个目标取100条55秒4多表联合查询72
20、秒2.2 测试Hive-RCFILE-1034列查询结果2.2.1 条件查询2.2.1.1 查询总数hive select count(*) from taglibs_rc1034;结果:49313379Time taken: 76.645 seconds, Fetched: 1 row(s)2.2.1.2 10个条件count(*)hive select count(*) from taglibs_rc1034 where age50 and net=4 and times12 and month_fee200 and data_fee10 and downtime=0 and term_t
21、imes12 and call_times100 and called_times100 and province_roam50;结果:25143Time taken: 56.702 seconds, Fetched: 1 row(s)2.2.1.3 10个条件11个目标取100条hive select telno,age,net,times,month_fee,data_fee,downtime,term_times,call_times,called_times,province_roam from taglibs_rc1034 where age50 and net=4 and time
22、s12 and month_fee200 and data_fee10 and downtime=0 and term_times12 and call_times100 and called_times100 and province_roam50 limit 100;结果:Time taken: 59.5 seconds, Fetched: 25143 row(s)2.2.1.4 多表联合查询hive select taglibs_rc1034.telno,age,net,times,month_fee,data_fee,downtime,term_times,call_times,cal
23、led_times,province_roam from taglibs_rc1034 join taglibs_busy on taglibs_rc1034.telno=taglibs_busy.telno;结果:Time taken: 59.85 seconds, Fetched: 9999 row(s)2.2.2 查询结论以1034列,49313379行数据为例,hive on spark查询性能如下:序号查询功能查询性能1查询总数76秒210个条件count(*)56秒310个条件11个目标取100条59秒4多表联合查询59秒2.3 测试Hive-RCFILE-2068列查询结果2.3
24、.1 条件查询2.3.1.1 查询总数hive select count(*) from taglibs_rc2068;结果:49313379Time taken: 258.158 seconds, Fetched: 1 row(s)2.3.1.2 10个条件count(*)hive select count(*) from taglibs_rc2068 where age50 and net=4 and times12 and month_fee200 and data_fee10 and downtime=0 and term_times12 and call_times100 and c
25、alled_times100 and province_roam50;结果:25143Time taken: 267.088 seconds, Fetched: 1 row(s)2.3.1.3 10个条件11个目标取100条hive select telno,age,net,times,month_fee,data_fee,downtime,term_times,call_times,called_times,province_roam from taglibs_rc2068 where age50 and net=4 and times12 and month_fee200 and data
26、_fee10 and downtime=0 and term_times12 and call_times100 and called_times100 and province_roam50 limit 100;结果:Time taken: 256.999 seconds, Fetched: 25143 row(s)2.3.1.4 多表联合查询hive select taglibs_rc2068.telno,age,net,times,month_fee,data_fee,downtime,term_times,call_times,called_times,province_roam fr
27、om taglibs_rc2068 join taglibs_busy on taglibs_rc2068.telno=taglibs_busy.telno;结果:Time taken: 252.211 seconds, Fetched: 9999 row(s)2.3.2 查询结论以2068列,49313379行数据为例,hive on spark查询性能如下:序号查询功能查询性能1查询总数258秒210个条件count(*)267秒310个条件11个目标取100条256秒4多表联合查询252秒2.4 测试Hive-PARQUET-1034列2.4.1 条件查询2.4.1.1 查询总数hive
28、 select count(*) from taglibs_parquet1034;结果:49313379Time taken: 44.62 seconds, Fetched: 1 row(s)2.4.1.2 10个条件count(*)hive select count(*) from taglibs_parquet1034 where age50 and net=4 and times12 and month_fee200 and data_fee10 and downtime=0 and term_times12 and call_times100 and called_times100
29、and province_roam50;结果:19720Time taken: 25.573 seconds, Fetched: 1 row(s)2.4.1.3 10个条件11个目标取100条hive select telno,age,net,times,month_fee,data_fee,downtime,term_times,call_times,called_times,province_roam from taglibs_parquet1034 where age50 and net=4 and times12 and month_fee200 and data_fee10 and
30、downtime=0 and term_times12 and call_times100 and called_times100 and province_roam50 limit 100;结果:Time taken: 23.349 seconds, Fetched: 19720 row(s)2.4.1.4 多表联合查询hive select taglibs_parquet1034.telno,age,net,times,month_fee,data_fee,downtime,term_times,call_times,called_times,province_roam from tagl
31、ibs_parquet1034 join taglibs_busy on taglibs_parquet1034.telno=taglibs_busy.telno;结果:Time taken: 25.67 seconds, Fetched: 9999 row(s)2.4.2 查询结论以517列,99818156行数据为例,hive查询性能如下:序号查询功能查询性能1查询总数44秒210个条件count(*)25秒310个条件11个目标取100条23秒4多表联合查询25秒2.5 测试Hive-PARQUET-2068列2.5.1 条件查询2.5.1.1 查询总数hive select count
32、(*) from taglibs_parquet2068;结果:49313379Time taken: 39.584 seconds, Fetched: 1 row(s)2.5.1.2 10个条件count(*)hive select count(*) from taglibs_parquet2068 where age50 and net=4 and times12 and month_fee200 and data_fee10 and downtime=0 and term_times12 and call_times100 and called_times100 and province
33、_roam50;结果:25143Time taken: 44.613 seconds, Fetched: 1 row(s)2.5.1.3 10个条件11个目标取100条hive select telno,age,net,times,month_fee,data_fee,downtime,term_times,call_times,called_times,province_roam from taglibs_parquet2068 where age50 and net=4 and times12 and month_fee200 and data_fee10 and downtime=0 a
34、nd term_times12 and call_times100 and called_times100 and province_roam50 limit 100;结果:Time taken: 43.795 seconds, Fetched: 25143 row(s)2.5.1.4 多表联合查询hive select taglibs_parquet2068.telno,age,net,times,month_fee,data_fee,downtime,term_times,call_times,called_times,province_roam from taglibs_parquet2
35、068 join taglibs_busy on taglibs_parquet2068.telno=taglibs_busy.telno;结果:Time taken: 32.994 seconds, Fetched: 9999 row(s)2.5.2 查询结论以2068列,49313379行数据为例,hive查询性能如下:序号查询功能查询性能1查询总数39秒210个条件count(*)44秒310个条件11个目标取100条43秒4多表联合查询32秒3 Hadoop+Hive+Impala测试3.1 测试Impala-RCFILE-517列3.1.1 条件查询3.1.1.1 查询总数hadoo
36、p-3:21000 select count(*) from taglibs_rc;结果:+-+| count(*) |+-+| 99818156 |+-+Fetched 1 row(s) in 697.82s3.1.1.2 10个条件count(*)hadoop-3:21000 select count(*) from taglibs_rc where age50 and net=4 and times12 and month_fee200 and data_fee10 and downtime=0 and term_times12 and call_times100 and called_
37、times100 and province_roam50;结果:+-+| count(*) |+-+| 50808 |+-+Fetched 1 row(s) in 460.03s3.1.1.3 10个条件11个目标取100条hadoop-3:21000 select telno,age,net,times,month_fee,data_fee,downtime,term_times,call_times,called_times,province_roam from taglibs_rc where age50 and net=4 and times12 and month_fee200 an
38、d data_fee10 and downtime=0 and term_times12 and call_times100 and called_times100 and province_roam50 limit 100;结果:立即返回结果3.1.1.4 多表联合查询hadoop-3:21000 select taglibs_rc.telno,age,net,times,month_fee,data_fee,downtime,term_times,call_times,called_times,province_roam from taglibs_rc join taglibs_busy
39、on taglibs_rc.telno=taglibs_busy.telno;结果:Fetched 9999 row(s) in 341.30s3.1.2 查询结论以517列,99818156行数据为例,impala查询性能如下:序号查询功能查询性能1查询总数697秒210个条件count(*)460秒310个条件11个目标取100条立即返回4多表联合查询341秒3.2 测试Impala-PARQUET-517列3.2.1 条件查询3.2.1.1 查询总数hadoop-3:21000 select count(*) from taglibs_parquet;结果:+-+| count(*) |
40、+-+| 100785085 |+-+Fetched 1 row(s) in 3.55s3.2.1.2 10个条件count(*)hadoop-3:21000 select count(*) from taglibs_parquet where age50 and net=4 and times12 and month_fee200 and data_fee10 and downtime=0 and term_times12 and call_times100 and called_times100 and province_roam50;结果:+-+| count(*) |+-+| 5130
41、1 |+-+Fetched 1 row(s) in 69.30s3.2.1.3 10个条件11个目标取100条hadoop-3:21000 select telno,age,net,times,month_fee,data_fee,downtime,term_times,call_times,called_times,province_roam from taglibs_parquet where age50 and net=4 and times12 and month_fee200 and data_fee10 and downtime=0 and term_times12 and cal
42、l_times100 and called_times100 and province_roam50 limit 100;结果:立即返回结果3.2.1.4 多表联合查询hadoop-3:21000 select taglibs_parquet.telno,age,net,times,month_fee,data_fee,downtime,term_times,call_times,called_times,province_roam from taglibs_parquet join taglibs_busy on taglibs_parquet.telno=taglibs_busy.teln
43、o;结果:Fetched 19998 row(s) in 55.85s3.2.2 查询结论以517列,99818156行数据为例,impala查询性能如下:序号查询功能查询性能1查询总数3秒210个条件count(*)69秒310个条件11个目标取100条立即返回4多表联合查询37秒3.3 测试Impala-PARQUET-1034列以1034列,49313379行数据为例,impala查询性能如下:序号查询功能查询性能1查询总数3秒210个条件count(*)40秒310个条件11个目标取100条立即返回4多表联合查询37秒3.4 测试Impala-PARQUET-2068列3.4.1 条件查询3.4.1.1 查询总数hadoop-3:21000 select count(*) from taglibs_parquet2068;结果:+-+| count(*) |+-+| 49313379 |+-+Fetched 1 row(s) in 8.44s3.4.1.2 10个条件count(*)hadoop-3:21000 select count(*) from taglibs_parquet2068 where age50 and net=4 and times12 and mo
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 护理多媒体课件制作的交互设计
- 汞中毒患者的眼部并发症护理
- 环境职业健康与安全管理方案1
- 2026年保险公司运营成本优化与合理利润水平控制规范
- 2025年前台服务规范模拟试卷
- 2026年流域上下游横向生态补偿机制建立方案
- 2026年公共数据授权运营三种模式:整体授权 分领域授权 依场景授权适用指南
- 老年病人皮肤黏膜护理常规
- 2026年腾讯WorkBuddy打通企业微信QQ飞书钉钉自动化办公
- 2026年矫形器配置服务流程与质量控制规范
- 人教版数学六年级下册数第四单元《比例》集体备课教案
- 美丽的夏牧场同声合唱谱
- 新进人员院感培训
- 山西职业技术学院单招《语文》考试复习题库(含答案)
- 新版《技规》工务普速课件
- 浙江华峰新材料股份有限公司年产32万吨聚氨酯原液和32万吨聚氨酯中间体技改项目环境影响报告书
- 护理学腮腺炎的护理课件
- 机械设备技术参数登记表
- 特种水处理工艺运行与管理-含铁含锰水给水处理
- 地大水文地质学基础-课件
- 四年级数学智算365(课后拓展题)
评论
0/150
提交评论