




版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
1、数据透视 - 商场(如沃尔玛)选址应用本文章来自于阿里云云栖社区摘要:标签 PostgreSQL , 数据透视 , cube , grouping sets , rollup 背景 人群透视是商业与数据结合的案例之一,比如大型商场的选址,可与分析的数据包括车流、人流量等等。标签PostgreSQL , 数据透视 , cube , grouping sets , rollup背景人群透视是商业与数据结合的案例之一,比如大型商场的选址,可与分析的数据包括车流、人流量等等。结合数据可以更深入的分析人群的组成结构,消费能力等等,给大型商场的选址带来更多的参考价值。那么如何使用数据库透视人群数据呢?场景
2、构建1. 人群属性表记载了每个人的各个属性段落,比如收入、车龄、固定资产等等。如下create table people( id serial8 primary key, - 用户ID c1 int2, - 年龄分段, 假设分5个档, 使用0,1,2,3,4表示 c2 int2, - 个人收入分段, 假设分3个档, 使用0,1,2表示 c3 int2, - 车龄分段, 假设分5个档, 使用0,1,2,3,4表示 c4 int2, - 家庭收入分段, 假设分3个档, 使用0,1,2表示 c5 int2, - 固定资产分段, 假设分3个档, 使用0,1,2表示 c6 int2 - 存款分段, 假设
3、分3个档, 使用0,1,2表示 ); 2. 人群动态轨迹记录的是人群的活动位置或轨迹使用PostgreSQL PostGIS插件,可以很方便的记录轨迹数据,并且支持GIST索引,可以快速的根据某个区域或范围搜索对应的人群。create table people_loc( id int8, - 用户ID - loc geometry, - 位置 crt_time timestamp - 时间 ); 生成测试数据1. 生成1000万人群的测试数据, 其中车龄为4, 年龄段为4的不插入,制造一些空洞。insert into people (c1,c2,c3,c4,c5,c6) select mod(
4、random()*10):int,4), mod(random()*10):int,3), mod(random()*10):int,4), mod(random()*10):int,3), mod(random()*10):int,3), mod(random()*10):int,3) from generate_series(1,); postgres=# select * from people limit 10; id | c1 | c2 | c3 | c4 | c5 | c6 -+-+-+-+-+-+- 1 | 2 | 1 | 3 | 0 | 1 | 2 2 | 0 | 0 | 1
5、| 0 | 1 | 0 3 | 2 | 1 | 0 | 2 | 0 | 2 4 | 1 | 0 | 0 | 0 | 1 | 2 5 | 3 | 2 | 2 | 1 | 2 | 1 6 | 1 | 2 | 0 | 0 | 1 | 1 7 | 2 | 1 | 0 | 1 | 0 | 0 8 | 1 | 1 | 0 | 1 | 0 | 2 9 | 3 | 0 | 3 | 1 | 2 | 1 10 | 3 | 2 | 2 | 0 | 2 | 1 (10 rows) 2. 生成1000万人群轨迹数据insert into people_loc (id, crt_time) select random()
6、*, now()+format(%L, (-random()*):interval from generate_series(1,); postgres=# select * from people_loc limit 10; id | crt_time -+- | 2017-03-05 16:35:13. | 2017-03-07 09:08:26. | 2017-03-04 18:47:49. | 2017-03-11 08:46:31. | 2017-03-11 14:48:55. | 2017-03-04 08:17:28. | 2017-03-01 15:37:11.57363 |
7、2017-03-11 17:51:46. | 2017-03-05 08:07:45. | 2017-03-09 14:10:42. (10 rows) 数据透视1. 选择人群以某个点为中心、或者根据某个闭环区域,圈一部分人群,(采用PostGIS)这里不举例GIS(跟兴趣的童鞋可以使用PostGIS测试一下,性能杠杠的),我直接以时间为度量直接圈人。select id from people_loc where crt_time between 2017-03-06:date and 2017-03-08:date; 有人可能要问,如果这个时间段,同一个人出现了多条轨迹,怎么处理呢?这里使用
8、了IN,PostgreSQL 的优化器很强大,JOIN时数据库会自动聚合,不必在这里GROUP BY,原理可参考如下文章。聊一下PostgreSQL优化器 - in里面有重复值时PostgreSQL如何处理?(原文链接:/digoal/blog/blob/master/_01.md?spm=5176.blogcont71875.16.5QXS0X&file=_01.md)2. 数据透视PostgreSQL的SQL兼容性非常强大,对于数据透视,可以使用grouping sets, cube, rollup等语法。GROUPING SETS, CUBE and R
9、OLLUP(原文链接:/digoal/blog/blob/master/_02.md?spm=5176.blogcont71875.17.p1ibfO&file=_02.md)select c1,c2,c3,c4,c5,c6,count(*) cnt from people where id in ( select id from people_loc where crt_time between 2017-03-06:date and 2017-03-08:date ) GROUP BY GROUPING SETS (c1,c2,c3,c4,c5,c6,()
10、; c1 | c2 | c3 | c4 | c5 | c6 | cnt -+-+-+-+-+-+- | 0 | | | | | | 1 | | | | | | 2 | | | | | | | | | | | | | | 0 | | | | | | 1 | | | | | | 2 | | | | | | | | 0 | | | | | | 1 | | | | | | 2 | | | | | 0 | | | | | | 1 | | | | | | 2 | | 0 | | | | | | 1 | | | | | | 2 | | | | | | 3 | | | | | | | | 0 | | | |
11、| | 1 | | | | | | 2 | | | | | | 3 | | | | (21 rows) 更多透视用法参考cube, rollup, grouping sets用法。目前PostgreSQL, HybridDB, Greenplum都支持以上语法。3. 结果转换使用WITH语法,将以上结果进行转换with tmp as ( select c1,c2,c3,c4,c5,c6,count(*) cnt from people where id in ( select id from people_loc where crt_time between 2017-03-06:date a
12、nd 2017-03-08:date ) GROUP BY GROUPING SETS (c1,c2,c3,c4,c5,c6,() ) select case when c1 is not null then c1_|c1 when c2 is not null then c2_|c2 when c3 is not null then c3_|c3 when c4 is not null then c4_|c4 when c5 is not null then c5_|c5 when c6 is not null then c6_|c6 else cnt end AS col, t
13、as private, t as all, t:numeric/t as ratio from tmp t1, (select cnt from tmp where tmp.c1 is null and tmp.c2 is null and tmp.c3 is null and tmp.c4 is null and tmp.c5 is null and tmp.c6 is null) t2 ; col | private | all | ratio -+-+-+- c2_0 | | | 0. c2_1 | | | 0. c2_2 | | | 0. cnt | |
14、| 1. c4_0 | | | 0. c4_1 | | | 0. c4_2 | | | 0. c6_0 | | | 0. c6_1 | | | 0. c6_2 | | | 0. c5_0 | | | 0. c5_1 | | | 0. c5_2 | | | 0. c1_0 | | | 0. c1_1 | | | 0. c1_2 | | | 0. c1_3 | | | 0. c3_0 | | | 0. c3_1 | | | 0. c3_2 | | | 0. c3_3 | | | 0. (21 rows) Time: 8466.507 ms perf report# Events: 8K cycle
15、s # # Overhead Command Shared Object Symbol # . . . . # 6.29% postgres postgres . comparetup_heap | - comparetup_heap | |-41.84%- (nil) | |-33.36%- 0x1 | |-8.44%- 0x23e8e | |-8.43%- 0x2 | -7.93%- 0x3 5.16% postgres postgres . slot_deform_tuple.lto_priv.1138 | - slot_deform_tuple.lto_priv.1138 3.82%
16、postgres postgres . mergeprereadone | - mergeprereadone 3.79% postgres postgres . qsort_ssup | - qsort_ssup 3.51% postgres postgres . tuplesort_gettuple_common.lto_priv.1348 | - tuplesort_gettuple_common.lto_priv.1348 | |-32.14%- 0x1 | |-22.28%- 0x2 | |-18.95%- (nil) | |-11.41%- 0x10 | |-5.72%- 0x3
17、| |-1.91%- 0x3d84d9 | |-1.91%- 0xef259 | |-1.91%- get_select_query_def.lto_priv.1324 | |-1.91%- 0x95c9af | -1.88%- 0x3a0e54 4. left join 补缺(可选)对于空洞值,如果你要补齐的话,使用left join即可select * from (values (c1_0),(c1_1),(c1_2),(c1_3),(c1_4),(c2_0),(c2_1),(c2_2),(c3_0),(c3_1),(c3_2),(c3_3),(c3_4),(c4_0),(c4_1),(c
18、4_2),(c5_0),(c5_1),(c5_2),(c6_0),(c6_1),(c6_2) t (col); col - c1_0 c1_1 c1_2 c1_3 c1_4 c2_0 c2_1 c2_2 c3_0 c3_1 c3_2 c3_3 c3_4 c4_0 c4_1 c4_2 c5_0 c5_1 c5_2 c6_0 c6_1 c6_2 (22 rows) 补缺如下with tmp as ( select c1,c2,c3,c4,c5,c6,count(*) cnt from people where id in ( select id from people_loc where crt_
19、time between 2017-03-06:date and 2017-03-08:date ) GROUP BY GROUPING SETS (c1,c2,c3,c4,c5,c6,() ), tmp2 as ( select case when c1 is not null then c1_|c1 when c2 is not null then c2_|c2 when c3 is not null then c3_|c3 when c4 is not null then c4_|c4 when c5 is not null then c5_|c5 when c6 is not null
20、 then c6_|c6 else cnt end AS col, t as private, t as all, t:numeric/t as ratio from tmp t1, (select cnt from tmp where tmp.c1 is null and tmp.c2 is null and tmp.c3 is null and tmp.c4 is null and tmp.c5 is null and tmp.c6 is null) t2 ) select t1.col,coalesce(t2.ratio,0) ratio from
21、 (values (c1_0),(c1_1),(c1_2),(c1_3),(c1_4),(c2_0),(c2_1),(c2_2),(c3_0),(c3_1),(c3_2),(c3_3),(c3_4),(c4_0),(c4_1),(c4_2),(c5_0),(c5_1),(c5_2),(c6_0),(c6_1),(c6_2) t1 (col) left join tmp2 t2 on (t1.col=t2.col) order by t1.col; col | ratio -+- c1_0 | 0. c1_1 | 0. c1_2 | 0. c1_3 | 0. c1_4 | 0 c2_0 | 0.
22、 c2_1 | 0. c2_2 | 0. c3_0 | 0. c3_1 | 0. c3_2 | 0. c3_3 | 0. c3_4 | 0 c4_0 | 0. c4_1 | 0. c4_2 | 0. c5_0 | 0. c5_1 | 0. c5_2 | 0. c6_0 | 0. c6_1 | 0. c6_2 | 0. (22 rows) 5. 行列变换(可选)如果要将以上数据,多行转换为单行,可以使用tablefunc插件,PostgreSQL玩法巨多哦。/docs/9.6/static/tablefunc.htmlcreate extensi
23、on tablefunc; select * from crosstab($ with tmp as ( select c1,c2,c3,c4,c5,c6,count(*) cnt from people where id in ( select id from people_loc where crt_time between 2017-03-06:date and 2017-03-08:date ) GROUP BY GROUPING SETS (c1,c2,c3,c4,c5,c6,() ), tmp2 as ( select case when c1 is not null then c
24、1_|c1 when c2 is not null then c2_|c2 when c3 is not null then c3_|c3 when c4 is not null then c4_|c4 when c5 is not null then c5_|c5 when c6 is not null then c6_|c6 else cnt end AS col, t as private, t as all, t:numeric/t as ratio from tmp t1, (select cnt from tmp where tmp.c1 i
25、s null and tmp.c2 is null and tmp.c3 is null and tmp.c4 is null and tmp.c5 is null and tmp.c6 is null) t2 ) select row:text , t1.col,coalesce(t2.ratio,0) ratio from (values (c1_0),(c1_1),(c1_2),(c1_3),(c1_4),(c2_0),(c2_1),(c2_2),(c3_0),(c3_1),(c3_2),(c3_3),(c3_4),(c4_0),(c4_1),(c4_2),(c5_0),(c5_1),(
26、c5_2),(c6_0),(c6_1),(c6_2) t1 (col) left join tmp2 t2 on (t1.col=t2.col) order by t1.col $ ) as ( row text, c1_0 numeric, c1_1 numeric, c1_2 numeric, c1_3 numeric, c1_4 numeric, c2_0 numeric, c2_1 numeric, c2_2 numeric, c3_0 numeric, c3_1 numeric, c3_2 numeric, c3_3 numeric, c3_4 numeric, c4_0 numer
27、ic, c4_1 numeric, c4_2 numeric, c5_0 numeric, c5_1 numeric, c5_2 numeric, c6_0 numeric, c6_1 numeric, c6_2 numeric ); row | c1_0 | c1_1 | c1_2 | c1_3 | c1_4 | c2_0 | c2_1 | c2_2 | c3_0 | c3_1 | c3_2 | c3_3 | c3_4 | c4_0 | c4_1 | c4_2 | c5_0 | c5_1 | c5_2 | c6_0 | c6_1 | c6_2 -+-+-+-+-+-+-+-+-+-+- +-
28、+-+-+-+-+-+-+-+-+-+- -+- row | 0. | 0. | 0. | 0. | 0 | 0. | 0. | 0. | 0. | 0. | 0. | 0. | 0 | 0. | 0. | 0. | 0. | 0. | 0. | 0. | 0.3 | 0. (1 row) 透视优化1. 关于索引(BRIN, GIST, BTREE_GIST)通常我们会限定两个维度,筛选人群,1时间范围,2地理位置范围。由于轨迹数据通常是时间和堆的线性相关性很好的,所以,在索引方面,可以使用BRIN索引。brin索引详见PostgreSQL 聚集存储 与 BRIN索引 - 高并发行为、轨迹类大
29、吞吐数据查询场景解说(原文链接:/digoal/blog/blob/master/_01.md?spm=5176.blogcont71875.20.ssOHpy&file=_01.md)而对于地理位置,如果要进行快速筛选的话,可以建立GIST索引如果要建立两者的复合索引,可以使用btree_gist插件,那么时间和地理位置就能放在一个GIST索引中了。create extension btree_gist; 2. 递归优化如果轨迹点很多,但是大多数为重复人群,可使用递归优化IN查询参考用PostgreSQL找回618秒逝去的青春 - 递归收敛优化(原文链接:h
30、ttps://digoal/blog/blob/master/_01.md?spm=5176.blogcont71875.21.SFhL9D&file=_01.md)distinct xx和count(distinct xx)的变态递归优化方法 - 索引收敛(skip scan)扫描(原文链接:/digoal/blog/blob/master/_02.md?spm=5176.blogcont71875.22.n8C9Q4&file=_02.md)时序数据合并场景加速分析和实现 - 复合索引,窗口分组查询加速,变态递归加速(原文链接:http
31、s://digoal/blog/blob/master/_01.md?spm=5176.blogcont71875.23.oaTnr7&file=_01.md)3. case when 优化,在使用本例的cube,grouping sets,rollup前,或者其他不支持数据透视语法的数据库中,可以使用case when的方法来聚合,但是每条数据都要经过case when的计算,耗费很大的CPU。select sum(case when c1=0 then 1 else 0 end)/(count(*):numeric as c1_0, sum(case when c1=1
32、 then 1 else 0 end)/(count(*):numeric as c1_1, sum(case when c1=2 then 1 else 0 end)/(count(*):numeric as c1_2, sum(case when c1=3 then 1 else 0 end)/(count(*):numeric as c1_3, sum(case when c1=4 then 1 else 0 end)/(count(*):numeric as c1_4, sum(case when c2=0 then 1 else 0 end)/(count(*):numeric as
33、 c2_0, sum(case when c2=1 then 1 else 0 end)/(count(*):numeric as c2_1, sum(case when c2=2 then 1 else 0 end)/(count(*):numeric as c2_2, sum(case when c3=0 then 1 else 0 end)/(count(*):numeric as c3_0, sum(case when c3=1 then 1 else 0 end)/(count(*):numeric as c3_1, sum(case when c3=2 then 1 else 0
34、end)/(count(*):numeric as c3_2, sum(case when c3=3 then 1 else 0 end)/(count(*):numeric as c3_3, sum(case when c3=4 then 1 else 0 end)/(count(*):numeric as c3_4, sum(case when c4=0 then 1 else 0 end)/(count(*):numeric as c4_0, sum(case when c4=1 then 1 else 0 end)/(count(*):numeric as c4_1, sum(case
35、 when c4=2 then 1 else 0 end)/(count(*):numeric as c4_2, sum(case when c5=0 then 1 else 0 end)/(count(*):numeric as c5_0, sum(case when c5=1 then 1 else 0 end)/(count(*):numeric as c5_1, sum(case when c5=2 then 1 else 0 end)/(count(*):numeric as c5_2, sum(case when c6=0 then 1 else 0 end)/(count(*):
36、numeric as c6_0, sum(case when c6=1 then 1 else 0 end)/(count(*):numeric as c6_1, sum(case when c6=2 then 1 else 0 end)/(count(*):numeric as c6_2 from people where id in ( select id from people_loc where crt_time between 2017-03-06:date and 2017-03-08:date ); c1_0 | c1_1 | c1_2 | c1_3 | c1_4 | c2_0 | c2_1 | c2_2 | c3_0 | c3_1 | c3_2 | c3_3 | c3_4 | c4_0 | c4_1 | c4_2 | c5_0 | c5_1 | c5_2 | c6_0 | c6_1 | c6_2 -+-+-+-+-+-+-+-+-+- -+-+-+-+-+-+-+-+-+- -+-+-+- 0. | 0. | 0. | 0. | 0.00 | 0. | 0. | 0. | 0. |
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 2025年叔丁基苯酚项目发展计划
- 2025年试验机械相关检测仪器项目合作计划书
- 健康饮食产业园项目实施方案
- 2025年DCA-1皮革固色剂项目建议书
- xx河流排水防涝设施建设项目建议书(参考范文)
- 2025年糖、加工糖及制糖副产品项目发展计划
- 2025年锆合金管材项目建议书
- 2025年稀有金属及稀土金属材料合作协议书
- 2025年城市污水处理厂智能化升级改造对智能化控制系统的影响报告
- 工业互联网平台数据备份与恢复策略在2025年智能证券分析领域的应用研究报告
- 2025年中国邮政集团有限公司辽宁省分公司校园招聘笔试备考试题及完整答案详解1套
- 多灾种耦合应对-洞察及研究
- 朗读协会工作报告
- T/CERDS 1-2021企业高质量发展评价指标
- 2025农发银行笔试题库及答案
- 湖北省黄冈市黄梅实验中学2025届数学八下期末统考试题含解析
- 2025届上海市闵行区七下数学期末学业水平测试模拟试题含解析
- GB/T 18487.4-2025电动汽车传导充放电系统第4部分:车辆对外放电要求
- 《成人慢性肾脏病食养指南(2024年版)》解读
- 离婚不离家协议书
- 社区干事考试试题及答案
评论
0/150
提交评论