Greenplum数据库基础培训

上传人：大*** IP属地：天津上传时间：2022-06-29 格式：PPT 页数：99 大小：7.43MB 积分：20 举报 版权申诉

已阅读5页，还剩94页未读，继续免费阅读

版权说明：本文档由用户提供并上传，收益归属内容提供方，若内容存在侵权，请进行举报或认领

文档简介

1、Wei.LiSenior Data ArchitectureAlpine Solution2010/05l 海量并行处理 (Massively Parallel Processing) DBMSl 基于 PostgreSQL 8.2 相同的客户端功能增加支持并行处理的技术增加支持数据仓库和BI的特性外部表(external tables)/并行加载(parallel loading) 资源管理查询优化器增强(query optimizer enhancements)l psqll pgAdmin IIIl ODBC/Datadirectl JDBCl Perl DBIl Pythonl

2、 libpqlOLE DBl 访问系统的入口l 数据库侦听进程 (postgres)l 处理所有用户连接l 建立查询计划l 协调工作处理过程l 管理工具l 系统目录表和元数据（数据字典）l 不存放任何用户数据l 每段（Segment）存放一部分用户数据l 一个系统可以有多段l 用户不能直接存取访问l 所有对段的访问都经过Masterl 数据库监听进程(postgres)监听来自Master的连接l Greenplum数据库之间的连接层l 进程间协调和管理l 基于千兆以太网架构l 属于系统内部私网配置l 支持两种协议：TCP or UDPl Standby 节点用于当 Master 节点损坏时提

3、供 Master 服务l Standby 实时与 Master 节点的 Catalog 和事务日志保持同步l 每个Segment的数据冗余存放在另一个Segment上，数据实时同步l 当Primary Segment失败时，Mirror Segment将自动提供服务l Primary Segment恢复正常后，使用gprecoverseg F 同步数据。l Hash分布 CREATE TABLE DISTRIBUTED BY (column ,) 同样数值的内容被分配到同一个Segment上l 循环分布 CREATE TABLE DISTRIBUTED RANDOMLY 具有同样数值的行内容并

4、不一定在同一个Segment上表分布的策略表分布的策略-并行计算的基础并行计算的基础查询命令的执行查询命令的执行l 举例说明：按卡号、客户号、机构的分布方式优劣点查询命令的执行查询命令的执行SELECT customer, amount FROM sales JOIN customer USING (cust_id) WHERE date=04302008;l 将一张大表逻辑性地分成多个部分，如按照分区条件进行查询，将减少数据的扫描范围，提高系统性能。l 提高对于特定类型数据的查询速度和性能l 也可以更方便数据库的维护和更新l 两种类型: Range分区 (日期范围或数字范围)/如日期、价格等

5、 List 分区,例如地区、产品等l Greenplum中的表分区在使用中具有总表的继承性，并通过Check参数指定相应的子表l 分区的子表依然根据分布策略分布在各segment上l 分区是一种非常有用的优化措施，例如一年的交易按交易日期分区后，查询一天的交易性能将提高365倍！Segment 1ASegment 1BSegment 1CSegment 1DSegment 2ASegment 2BSegment 2CSegment 2DSegment 3ASegment 3BSegment 3CSegment 3DJan 2005Feb 2005Mar 2005Apr 2005May 2005

6、Jun 2005Jul 2005Aug 2005Sep 2005Oct 2005Nov 2005Dec 2005每个分区表的数据平均分布到各个节点表分区可减少数据的搜索范围，提高查询性能Segment 1ASegment 1BSegment 1CSegment 1DSegment 2ASegment 2BSegment 2CSegment 2DSegment 3ASegment 3BSegment 3CSegment 3DSegment 1ASegment 1BSegment 1CSegment 1DSegment 2ASegment 2BSegment 2CSegment 2DSegment

7、 3ASegment 3BSegment 3CSegment 3DSegment 1ASegment 1BSegment 1CSegment 1DSegment 2ASegment 2BSegment 2CSegment 2DSegment 3ASegment 3BSegment 3CSegment 3DSELECT COUNT(*) FROM orders WHERE order_date= Oct 20 2005 AND order_date Oct 27 2005VSHash DistributionHash Distribution+ Table PartitioningSegment

8、 1ASegment 1BSegment 1CSegment 1DSegment 2ASegment 2BSegment 2CSegment 2DSegment 3ASegment 3BSegment 3CSegment 3Dl 压缩存储支持ZLIB和QUICKLZ方式的压缩,压缩比可到10:1 压缩表只能是Append Only方式压缩数据不一定会带来性能的下降，压缩表将消耗CPU资源，而减少I/O资源占用l 语法CREATE TABLE foo (a int, b text) WITH (appendonly=true, compresstype=zlib, compresslevel

9、=5); l Greenplum支持行或列存储模式列模式目前只支持Append Only 如果常用的查询只取表中少量字段，则列模式效率更高，如查询需要取表中的大量字段，行模式效率更高语法： CREATE TABLE sales2 (LIKE sales) WITH (appendonly=true, orientation=column); 效率比较测试：测试1：需要去表中所有字段，此时行存储更快。select * from dw_ods.s1_sema_scmcaccp_row where crdacct_nbr= 4033930000166380411;41秒select * from

10、 dw_ods.s1_sema_scmcaccp_col where crdacct_nbr= 4033930000166380411;116秒测试2：只取表中少量字段，列存储更快select crdacct_status from dw_ods.s1_sema_scmcaccp_row where crdacct_nbr= 4033930000166380411;35秒select crdacct_status from dw_ods.s1_sema_scmcaccp_col where crdacct_nbr= 4033930000166380411;3秒l 外部表的特征 Read-onl

11、y 数据存放在数据库外可执行SELECT, JOIN, SORT等命令，类似正规表的操作l 外部表的优点并行方式加载 ETL的灵活性格式错误行的容错处理支持多种数据源l 两种方式 External Tables: 基于文件 Web Tables: 基于URL或指令l 利用并行数据流引擎，Greenplum可以直接用SQL操作外部表l 数据加载完全并行Master主机Segment主机内部互联网千兆以太网交换机gpfdistgpfdistSegment主机Segment主机Segment主机外部表文件外部表文件ETL服务器内部网络l 并行数据加载提供最好的性能l 能够处理远程存储的文件l

12、采用HTTP协议l 200 MB/s data distribution rate per gpfdistl gpfdist文件分发守护进程启动:gpfdist -d /var/load_files/expenses -p 8080 -l /home/gpadmin/log &l 外部表定义:CREATE EXTERNAL TABLE ext_expenses ( name text, date date, amount float4, description text )LOCATION (gpfdist/etlhost:8081/*,gpfdist/etlhost:8082/*)

13、FORMAT TEXT (DELIMITER |)ENCODING UTF-8LOG ERRORS INTO ext_expenses_loaderrors SEGMENT REJECT LIMIT 10000 ROWS ;lLoad good rows and catch poorly formatted rows, such as: rows with missing or extra attributes rows with attributes of the wrong data type rows with invalid client encoding sequenceslDoes

14、 not apply to constraint errors: PRIMARY KEY, NOT NULL, CHECK or UNIQUE constraintslOptional error handling clause for external tables:LOG ERRORS INTO error_table SEGMENT REJECT LIMIT count ROWS | PERCENT( PERCENT based on gp_reject_percent_threshold parameter )lExampleCREATE EXTERNAL TABLE ext_cust

15、omer (id int, name text, sponsor text) LOCATION ( gpfdist:/filehost:8081/*.txt ) FORMAT TEXT ( DELIMITER | NULL ) LOG ERRORS INTO err_customer SEGMENT REJECT LIMIT 5 ROWS; l Data resides outside the databasel No database statistics for external table datal Not meant for frequent or ad-hoc accessl Ca

16、n manually set rough statistics in pg_class: UPDATE pg_class SET reltuples=400000, relpages=400 WHERE relname=myexttable;l PostgreSQL commandSupport loading and unloadingOptimized for loading a large number of rowsLoads all rows in one command (not parallel)Loads data from a file or from standard in

17、putSupports error handling as does external tablesl EXAMPLECOPY mytable FROM /data/myfile.csv WITH CSV HEADER; （文件生成在Master）COPY mytable FROM /data/myfile.csv WITH CSV HEADER;（文件生成在本地）COPY country FROM /data/gpdb/country_data WITH DELIMITER | LOG ERRORS INTO err_country SEGMENT REJECT LIMIT 10 ROWS;

18、l Drop indexes and recreate after load Increase maintenance_work_mem parameter to speed up CREATE INDEX operationsl Run ANALYZE after loadl Run VACUUM after load errors、delete、upate。l Do not use ODBC INSERT to load large volumes of datal 限制查询队列的激活数l 防止系统过载(CPU, disk I/O, memory)l Resource Queue Limi

19、ts ACTIVE THRESHOLDEXAMPLE: CREATE RESOURCE QUEUE adhoc ACTIVE THRESHOLD 10 IGNORE THRESHOLD 1000.0; COST THRESHOLDEXAMPLES: CREATE RESOURCE QUEUE batch1 COST THRESHOLD 1000000.0 NOOVERCOMMIT;CREATE RESOURCE QUEUE batch1 COST THRESHOLD 1e+6;l Highly interactive web-based performance monitoringl Real

20、-time and historic views of:Resource utilizationQueries and query internalsDashboardSystem MetricsQuery MonitorBackups and Restoresl Parallel backups (gp_dump)l Parallel restores (gp_restore)l Automating dump/restores (gpcrondump, gpdbrestore)l Non-parallel backups and restores(pg_dump/pg_restore/ps

21、ql)l 用于在同构环境间迁移数据结构、数据、function备份恢复l 用于在异构环境间迁移数据结构、数据、functionl Each active segment is dumped in parallell Dump files created in segment data directory by defaultl Supports compression (gzip)l Ensure sufficient disk space for dump filesl A backup set is identified by a timestamp keylExample:Back up

22、 a database: gp_dump gpdbBack up a database, and create dump files in a centralized location on all hosts: gp_dump -gp-d=/home/gpadmin/backups gpdbBack up a particular schema only: gp_dump -n myschema mydatabaseBack up a single segment instance only (by noting the dbid of the segment instance): gp_d

23、ump -gp-s=i5 gpdblOn the master host gp_catalog_1_ gp_cdatabase_1_ gp_dump_1_ gp_dump_status_1_ lOn the segment hosts gp_dump_0_gp_dump_status_0_l Use gp_restore commandl Need timestamp key from gp_dumpl Make sure dump files are placed on correct segment hostsl Make sure database exists before resto

24、ringl Database-level server configuration settings are not restoredl Examples Restore an Greenplum database using backup files created by gp_dump:gp_restore -gp-k=2005103112453 -d gpdb Restore a single segment instance only (by noting the dbid of the segment instance):gp_restore -gp-k=2005103112453

25、-d gpdb -gp-s=i5l gpcrondumpCalls to gp_dumpCan be called directly or can schedule using CRONSend email notificationsFlexible dump optionsCan copy configuration filesCan dump system catalogsCan dump global objectsCan include a post-dump scriptl gpdbrestoreRestores from gpcrondump filesCan restore fr

26、om an archive host no need to pre-place dump files on segmentsl Green plum also supports pg_dump and pg_restore l Useful for migrating data to/from other DBMSl pg_dump creates a single dump fileCan be slow on very large databasesRun at low-usage timesSupports compressionCan dump data as INSERT or CO

27、PY commandsgp-syntax option includes DISTRIBUTED BY statements in DDLl Dump a database called mydb into a SQL-script file:pg_dump mydb db.sqll To reload such a script into a (freshly created) database named newdb:psql -d newdb -f db.sqll Dump a Greenplum database in tar file format and include distr

28、ibution policy information:pg_dump -Ft -gp-syntax mydb db.tarl To dump a database into a custom-format archive file:pg_dump -Fc mydb db.dumpl To reload an archive file into a (freshly created) database named newdb:pg_restore -d newdb db.dumpl To dump a single table named mytab:pg_dump -t mytab mydb

29、db.sqll To specify an upper-case or mixed-case name in -t and related switches, you need to double-quote the name; else it will be folded to lower case. But double quotes are special to the shell, so in turn they must be quoted. Thus, to dump a single table with a mixed-case name, you need something

30、 like:pg_dump -t MixedCaseName mydb mytab.sqll pgAdmin3 图形化管理和SQL执行/分析/监控工具l psql 行命令操作和管理工具l pgAdmin3 is the leading graphical Open Source management, development and administration tool for PostgreSQL Greenplum has contributed extensive GPDB-specific enhancements With GPDB 3.3, Greenplum ships and

31、 supports this tooll 监控活动session ,同SQL：select * from pg_stat_activity;l 监控锁，从pg_lock中获取信息l 可以停止正在运行的SQLl Connect through the masterl Connection information database name (-d | PGDATABASE) master host name (-h | PGHOST) master port (-p | PGPORT) user name (-U | PGUSER)l First time connections templat

32、e1 database default superuser account (gpadmin)l Interactive modepsql mydatabasemydatabase=# SELECT * FROM foo; l Non-interactive mode (single command) psql mydatabase ac “SELECT * FROM foo;”l Non-interactive mode (multiple commands) psql mydatabase af /home/lab1/sql/createdb.sqll (Use semi-colon (;

33、) to denote end of a statement)l ? (help on psql meta-commands) l h (help on SQL command syntax)l dt (show tables) l dtS (show system tables) l dg or du (show roles) l l (show databases)l c db_name (connect to this database) l q (quit psql) l ! (Enter into shell mode)l df (show function)l dn(show sc

34、hema)l Set search_path=l timingpostgresql.confLocalLocalLocalLocall 参数参考Adminguidel 重要参数:max_connection,share_buff,work_meml Local 变量的修改，如max_stack_depth需要修改所有segment上的valuel Located in master or segment instances data directoryl Used to set configuration parameters on the system levell Parameters t

35、hat are using the default setting are commented out (#)l Requires a restart (or reload using gpstop -u) for changes to take effectl View a specific parameter setting Example: SHOW search_path;l View all parameter settings Example: SHOW ALL;l Set parameter Example: set search_path=public set client_e

36、ncoding=gb18030 l 客户端授权是否允许从某个客户端的连接用户是否可以连接到所请求的数据库l pg_hba.conf file 基于host address, database, and/or DB user account控制权限位于master和segment实例的数据目录中系统初始化时进行default配置l Local connections allowed for Greenplum superuserl Remote connections not allowedl EXAMPLE# TYPE DATABASE USER CIDR-ADDRESS METHOD

37、 local all gpadmin ident sameuser local all all ident sameuser host all gpadmin /0 trust l EXAMPLE# TYPE DATABASE USER CIDR-ADDRESS METHOD localallall trust hostall all/32 trust hostall all:1/128 trust hostall all06/32 trust hostcarddwetl2/32md5 hostgpadminall21

38、.104.138.0/24md5l gpstop -u 可与在不重启数据库方式下，让设置生效l 具体参考Gp sql language：l 注意事项： DELETE ,UPDATE在两表关联时，两个表的distribution必须一致。如：delete from tablea using tableb where tablea.id=tableb.idUPDATE table a as a SET desc= b.desc FROM tableb as b WHERE a.id=b.id 以上操作 table a,table b 必须使用相同的分布，必要时可以使用 alter table s

39、et distribution 进行分布修改。l 常用数据类型 CHAR,VARCHAR,TEXT Smallint ,integer,bigint Numeric, real,double precision Timestamp,date,time Boolean Array 类型。如 integerl 其它数据类型请参考lAll system catalogs in pg_catalog schemalStandard PostgreSQL system catalogs (pg_*)lGreenplum-specific catalogs: gp_configuration gp_dis

40、tribution_policy gp_id gp_version_at_initdb pg_resqueue pg_exttable pg_tables pg_class pg_stat_activitylTo list all system catalog tables in psql: dtSlTo list all system views in psql: dvSl其它 catalog 参考 l 日期函数Extract（day|month|year。 From date);Select date + 1 day:interval,date+ 1 month:intervalSELEC

41、T date_part(day, TIMESTAMP 2001-02-16 20:38:40); Result: 16SELECT date_trunc(hour, TIMESTAMP 2001-02-16 20:38:40); Result: 2001-02-16 20:00:00 pg_sleep(seconds); 系统日期变量Current_dateCurrent_timeCurrent_timestampNow()Timeofday() 在事务中发生变化，以上函数在事务中不变l 字符串处理函数Substr/length/lower/upper/trim/replace/positi

42、onrPad/lpadTo_char, | (字符串连接)substring(string from pattern/ * ，like，simillar to (模式匹配)l 其它杂类Case 。When/Coalescenullifgenerate_seriesIn/not in/exists/any/all各类函数参考：l Greenplum支持SQL/PYTHON/PERL/C语言构建函数，以下着重介绍SQL 存储过程。一个存储过程就是一个事务，包括对子过程的调用都在一个事务内存储过程结构：CREATE FUNCTION somefunc() RETURNS integer AS $

43、DECLARE quantity integer := 30;BEGIN RETURN .;END;$ LANGUAGE plpgsql; 赋值给一个变量或行/记录赋值用下面方法：identifier := expression例子：user_id := 20; 执行一个没有结果的查询: PERFORM query;一个例子：PERFORM create_mv(cs_session_page_requests_mv, my_query);存储过程请参考：存储过程请参考：l动态SQLEXECUTE command-string INTO STRICT target;lSELECT INTOExa

44、mple：SELECT ID INTO VAR_ID FROM TABLEAl获取结果状态GET DIAGNOSTICS variable = item , .;一个例子：GET DIAGNOSTICS integer_var = ROW_COUNT;lSQL返回变量SQLERRM， SQLSTATEl控制结构IF . THEN . ELSEIF . THEN . ELSELOOP, EXIT, CONTINUE, WHILE, FOR l从函数返回有两个命令可以用来从函数中返回数据：RETURN 和 RETURN NEXT 。Syntax：RETURN expression;l设置回调EXE

45、C SQL WHENEVER condition action; condition 可以是下列之一： SQLERROR，SQLWARNING，NOT FOUNDl异常处理EXCEPTION WHEN unique_violation THEN - do nothing END;忽略错误：EXCEPTION WHEN OTHERS THEN RAISE NOTICE an EXCEPTION is about to be raised; RAISE EXCEPTION NUM:%, DETAILS:%, SQLSTATE, SQLERRM;END;l错误和消息RAISE level forma

46、t , expression , .;Level:Info:信息输入Notice:信息提示Exception:产生一个例外，将退出存储过程Example: RAISE NOTICE Calling cs_create_job(%), v_job_id;Performance Tuningl Introduction to performance tuningl Common performance problemsl Tracking down a performance probleml Query profiling (EXPLAIN, EXPLAIN ANALYZE)l Query tu

47、ningl 我的经验：lSet performance expectationsacceptable response times, queries per minute, etc.BenchmarkslKnow your baseline hardware performancethroughput / capacitylKnow your workloadheavy usage timesresource contentiondata contentionlFocus your optimizationslHardware Issues / Failed SegmentslResource

48、 AllocationlContention Between Concurrent WorkloadslInaccurate Database StatisticslUneven Data DistributionlSQL FormulationlDatabase Design lDisk failureslHost failureslNetwork failureslOS not tuned for GreenplumlDisk Capacity 70% maximum recommended VACUUM after updates, deletes and loadslVACUUM co

49、nfiguration parameters max_fsm_relations = tables + indexes + system tables max_fsm_pages = 16 * max_fsm_relations l Greenplum resource queueslimit active queries in the systemlimit the size of a query a particular user can runlPerform admin tasks at low usage timesData loading, ETLVACUUM, ANALYZEBa

50、ckupslDesign applications to prevent lock conflictsConcurrent sessions not updating the same data at the same timelResource-related Configuration Parameterswork_mem = 32MBmaintenance_work_mem = 64MBshared_buffers = 125MBlDatabase statistics used by the query plannerlRun ANALYZE afterData loadsRestor

51、es from backupsChanges to schema (adding indexes, etc.)Inserts, Updates, or DeleteslCan configure statistics collection default_statistics_target = 25 gp_analyze_relative_error = .25 on specific table columnsALTER TABLE name ALTER column SET STATISTICS # lTable Distribution Key Considerations Even d

52、ata distributionLocal vs. distributed operationsEven query processing lChecking for data skewgpskew t schema.tableUnix system utilities (gpssh)：gpssh f seg_host -sar 1 100lRebalancing a Table CREATE TABLE sales_temp (LIKE sales) DISTRIBUTED BY (date, total, customer); INSERT INTO sales_temp SELECT *

53、 FROM sales; DROP sales; ALTER TABLE sales_temp RENAME TO sales; l Data Type Selectionsmallest size possible to fit data INTEGER, not NUMERIC(11,2) for integer fieldssame data type across tables for join columnsconsider hash of wide join keys, using BYTEA instead of CHAR(100) varchar or text for cha

54、racter data lDenormalization (star schema)lTable Partitioning lUse sparingly in Greenplum DatabaselTry workload without indexes firstlCan improve performance of OLTP type workloadslOther index considerations:Avoid on frequently updated columnsAvoid overlapping indexesUse bitmap indexes where applica

55、ble instead of B-treeDrop indexes for loadsConsider a clustered indexlConfiguring Index Usage:enable_indexscan = on | offlGeneral ConsiderationsKnow your dataMinimize returned rowsAvoid unnecessary columns/tables in result setAvoid large sorts if possibleMatch data types in joinslGreenplum-specific

56、ConsiderationsJoin on common distribution key columns when possibleConsider data distribution policy and query predicateslSystem Catalog Tables and Viewspg_stat_activitypg_locks / pg_classlDatabase LogsLocated in master (and segment) data directorieslUNIX system utilities (gpssh)lExamine query plans

57、 to identify tuning opportunitieslWhat to look for?Plan operations that are taking exceptionally longAre the planners estimates close to reality? (EXPLAIN ANALYZE)Is the planner applying selective predicates early?Is the planner choosing the best join order?Is the planner selectively scanning partit

58、ioned tables?Is the planner choosing hash aggregate and hash join operations where applicable? Is there sufficient work memory?lTo see the plan for a queryEXPLAIN EXPLAIN ANALYZE lQuery plans are read from bottom to topMotions (Gather, Redistribute, Broadcast)Joins, sorts, aggregationsTable scanslTh

59、e following metrics are given for each operationcost (units of disk page fetches)rows (rows output by this node)width (bytes of the rows produced by this node)EXPLAIN SELECT * FROM names WHERE name = Joelle; - QUERY PLAN -Gather Motion 2:1 (slice1) (cost=0.00.20.88 rows=1 width=13) - Seq Scan on nam

60、es (cost=0.00.20.88 rows=1 width=13) Filter: name:text Joelle:text EXPLAIN ANALYZE SELECT * FROM names WHERE name = Joelle;- QUERY PLAN -Gather Motion 2:1 (slice1) (cost=0.00.20.88 rows=1 width=13) recv: Total 1 rows with 0.305 ms to first row, 0.537 ms to end. - Seq Scan on names (cost=0.00.20.88 rows=

人人文库> 全部分类> 行业资料 > 管理策划

温馨提示

1. 本站所有资源如无特殊说明，都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
2. 本站的文档不包含任何第三方提供的附件图纸等，如果需要附件，请联系上传者。文件的所有权益归上传用户所有。
3. 本站RAR压缩包中若带图纸，网页内容里面会有图纸预览，若没有图纸预览就没有图纸。
4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
5. 人人文库网仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对用户上传分享的文档内容本身不做任何修改或编辑，并不能对任何下载内容负责。
6. 下载文件中如有侵权或不适当内容，请与我们联系，我们立即纠正。
7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

Greenplum数据库基础培训

文档简介

温馨提示

最新文档

评论

Greenplum数据库基础培训

文档简介

温馨提示

最新文档

评论

相关文档