kc第13讲-数据散列技术.ppt_第1页
kc第13讲-数据散列技术.ppt_第2页
kc第13讲-数据散列技术.ppt_第3页
kc第13讲-数据散列技术.ppt_第4页
kc第13讲-数据散列技术.ppt_第5页
已阅读5页,还剩24页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

1、第13讲: (第12章) 数据散列技术 重庆大学计算机学院,课程名称: 数据库系统 -,第13讲:数据散列技术,项目驱动目标: 如何实现基于散列的文件组织和索引: 一、静态散列 二、动态散列 三、SQL中的索引定义 主要讨论问题: 什么是静态散列 好的散列函数该是什么样 什么是动态散列 如何在动态散列中插入记录 如何删除动态散列中的记录 与静态散列相比,动态散列有特点 顺序索引与散列的区别 SQL中如何定义索引,Exercise 13,Static Hashing(静态散列),A bucket(桶) is a unit of storage containing one or more rec

2、ords (a bucket is typically a disk block). In a hash file organization(散列文件组织) we obtain the bucket of a record directly from its search-key value using a hash function(散列函数). Hash function h is a function from the set of all search-key values K to the set of all bucket addresses B. Hash function is

3、 used to locate records for access, insertion as well as deletion. Records with different search-key values may be mapped to the same bucket; thus entire bucket has to be searched sequentially to locate a record.,一 静态散列,1-1 什么是静态散列?,问题1答案,1.1 基本概念,Example of Hash File Organization,There are 10 bucke

4、ts, The binary representation of the ith character is assumed to be the integer i. The hash function returns the sum of the binary representations of the characters modulo 10. E.g. h(Perryridge) = 5 h(Round Hill) = 3 h(B r igh t o n)= 3 h(Mia nu s) = 7,Hash file organization of account file, using d

5、ept_name as key,1.1 基本概念,Hash Functions(散列函数),Worst hash function maps all search-key values to the same bucket; this makes access time proportional to the number of search-key values in the file. An ideal hash function is uniform(均匀的), i.e., each bucket is assigned the same number of search-key val

6、ues from the set of all possible values. (桶的尺寸一样大!) An ideal hash function is random(随机的), so each bucket will have the same number of records assigned to it irrespective of the actual distribution of search-key values in the file. (桶中放的记录一样多!) Typical hash functions perform computation on the inter

7、nal binary representation of the search-key. For example, for a string search-key (如account的部门名), the binary representations of all the characters in the string could be added and the sum modulo the number of buckets could be returned. .,1-2 好的散列函数该是什么样?,1.1 基本概念,问题2答案,Problems of Bucket Overflows(溢

8、出问题),Bucket overflow can occur because of Insufficient buckets(桶不足) 若记录总数是nr,fr一个桶能存放的记录数(分布的空间大小),则桶数目nB应满足: nB nr / fr 但遗憾的是: 在选择散列函数时, nr不是已知的 Skew(偏斜) in distribution of records. This can occur due to two reasons: multiple records have same search-key value chosen hash function produces non-unif

9、orm distribution of key values Although the probability of bucket overflow can be reduced(减少), it cannot be eliminated(消除); it is handled by using overflow buckets(溢出桶).,1.2 桶溢出问题,Handling of Bucket Overflows (溢出处理),解决方法:Overflow chaining(溢出链) The overflow buckets of a given bucket are chained toget

10、her in a linked list. Above scheme is called closed hashing(闭散列). 注: An alternative, called open hashing(开散列) which does not use overflow buckets, is not suitable for database applications.,1.2 桶溢出问题,Hash Indices(散列索引),Hashing can be used not only for file organization, but also for index-structure

11、creation. A hash index organizes the search keys, with their associated record pointers, into a hash file structure. Strictly speaking, hash indices are always secondary indices if the file itself is organized using hashing, a separate primary hash index on it using the same search-key is unnecessar

12、y. However, we use the term hash index to refer to both secondary index structures and hash organized files. 例子:散列索引,1.3 散列索引,Example of Hash Index,1.3 散列索引,Deficiencies of Static Hashing,In static hashing, function h maps search-key values to a fixed set of B of bucket addresses. Databases(指记录数) gr

13、ow升 or shrink降 with time. If initial number of buckets is too small, and file grows, performance will degrade due to too much overflows. If space is allocated for anticipated growth, a significant amount of space will be wasted initially (and buckets will be underfull). If database shrinks, again sp

14、ace will be wasted. One solution: periodic re-organization of the file with a new hash function Expensive, disrupts干扰 normal operations Better solution: allow the number of buckets to be modified dynamically. (详见下页),1.4 静态散列的弱点,Dynamic Hashing(动态散列),Good for database that grows and shrinks in size A

15、llows the hash function to be modified dynamically Extendable hashing(可扩充散列) one form of dynamic hashing: Hash function generates values over a large range typically b-bit integers, with b = 32. (典型值) (但并不为每个散列值创建1个桶) At any time use only a prefix of the hash function to index into a table of bucket

16、 addresses. (这前i个位,用于计算桶地址偏移量) Let the length of the prefix be i bits, 0 i 32. Bucket address table size = 2i. Initially i = 0 Value of i grows and shrinks as the size of the database grows and shrinks. Multiple多个 entries in the bucket address table may point to a bucket (why?) Thus, actual number o

17、f buckets is 2i The number of buckets also changes dynamically due to coalescing 整合 and splitting of buckets. (因重组时一次仅作用1个桶, 索引更新的维护开销低) 一般结构:可扩展散列,1-1 什么是动态散列?,二 动态散列,问题3答案,2.1 动态散列的基本知识,General Extendable Hash Structure -一般结构,In this structure, i2 = i3 = i (=2), whereas i1 = i 1 (=1) 而 i=2 (see ne

18、xt slide for details),一般地iki,2.1 动态散列的基本知识,Use of Extendable Hash Structure-如何该使用索引,如何知道指向同一桶: Each bucket j stores a valueij 正整数 All the entries that point to the same bucket have the same values on the first ij bits. 如何定位属于哪一桶:To locate the bucket containing search-key Kj: 1.Compute h(Kj) = X 即计算该

19、搜索码相应的二进数表示 Use the first i high order bits of X as a displacement偏转/位移 into bucket address table, and follow跟随 the pointer to appropriate bucket 桶地址表中指向桶j的表项编号(计算公式)为: 2(i-ij) 如何插入记录:To insert a record with search-key value Kj follow same procedure as look-up and locate the bucket, say j. (桶j) If t

20、here is room in the bucket j, insert record in the bucket. Else the bucket must be split and insertion re-attempted 重试 (next slide) Overflow buckets used instead in some cases (will see in next slide),2.1 动态散列的基本知识,1-2 如何在动态散列中插入记录?,问题4答案,Insertion in Extendable Hash Structure (桶分裂),:If i ij (more t

21、han one pointer to bucket j) 可在该桶后增加桶 allocate a new bucket z, and set ij = iz = (ij + 1) Update the second half of the bucket address table entries originally pointing to j, to point to z remove each record in bucket j and reinsert (in j or z) recompute new bucket for Kj and insert record in the bu

22、cket (further splitting is required if the bucket is still full) If i = ij (only one pointer to bucket j) 不能在该桶后增加桶 If i reaches some limit b, or too many splits have happened in this insertion, create an overflow bucket 只能扩展溢出桶 Else (做下三步工作) 还可成倍增加桶指针 increment i (to i+1) and double the size of the

23、 bucket address table. 表项指针翻倍! replace each entry in the table by two entries that point to the same bucket. (因指针空间加倍,用指向同一桶的两个指针去替代表中每个指针) recompute new bucket address table entry for Kj (当前待插记录) (Now i ij so use the first case above.) 桶加倍和桶内记录处理!,To split a bucket j when inserting record with sear

24、ch-key value Kj: (满时),返回例子中,2.2 散列桶动态分裂,Deletion in Extendable Hash Structure,locate it in its bucket and remove it. The bucket itself can be removed if it becomes empty (with appropriate updates to the bucket address table). Coalescing整合 of buckets can be done (can coalesce only with a “buddy” buck

25、et having same value of ij and same ij 1 prefix, if it is present) Decreasing bucket address table size is also possible Note: decreasing bucket address table size is an expensive operation and should be done only if number of buckets becomes much smaller than the size of the table,2.2 散列桶动态分裂,1-3 如

26、何删除动态散列中的记录?,问题5答案,To delete a key value,Use Example of Extendable Hash Structure,1)Initial Hash structure:假设bucket size = 2 (记录,桶大小),b=2,Branch_name的32位散列值: (即将它们按照前面指出的单词到整数的映射,并将该数字采用二进制表示),2.3 散列桶动态分裂过程实例,Example (续1),2)Hash structure after insertion of one Brighton and two Downtown records -在依次

27、插入Brighton记录和第一个Downtown记录时,i=0,仅一个初始桶,不需桶定位,都可直接装入; -在插入第二个Downtown记录时,桶已装满,需要分裂 (参分裂方法) i=i0且i2 ,属于“需要成倍增加桶指针”情形,分裂如下: (i=1,桶地址表加倍,桶数量加倍!由前i=1位为0、1两表项指针分别确定) (并且,原桶中记录重定位时,根据前i=1位是0或1分别装入到第1、2桶中),2.3 散列桶动态分裂过程实例,Example (续2),3)Hash structure after insertion of Mianus record 过程说明: Downtown的散列码首位也为1

28、 ,Mianus的散列码首位也为1; 但第二个桶已装满,故桶指针数量又增加一倍,到4个; 这时本来仍只有两个桶,但Mianus定位到第二个桶时发现已满; 故在该桶后又增加一个桶,并将Mianus记录插入。,2.3 散列桶动态分裂过程实例,Example (续3),4)Hash structure after insertion of three Perryridge records 说明:在插入第3个Perryridge记录时,因与已有两个Perryridge的散列值相同,无法增加桶来存放(因无法区分这三个记录)。这是只能采用增加溢出桶来解决!,Example (续4),5) Hash str

29、ucture after insertion of Redwood and Round Hill records 说明:因桶中有空间,在确定桶位置后(散列值的前2位分别是00和11),直接插入既可,2.3 散列桶动态分裂过程实例,Extendable Hashing vs. Other Schemes(与静态散列比),Benefits of extendable hashing: Hash performance does not degrade with growth of file Minimal space overhead负载 (桶地址表中仅为每个当前前缀长度的散列值存放一个指针) D

30、isadvantages of extendable hashing Extra level of indirection间接层 to find desired record 增加访问桶地址表开销 Bucket address table may itself become very big (larger than memory) Cannot allocate very large contiguous areas on disk either Solution: B+-tree structure to locate desired record in bucket address ta

31、ble Changing size of bucket address table is an expensive operation 自学:* Linear hashing(线性散列-另一种动态散列) is an alternative mechanism Allows incremental growth of its directory 避免使用额外的间接层 (equivalent to bucket address table) At the cost of more bucket overflows,2.4 动态散列与静态散列比较,2-4 与静态散列相比,动态散列有特点?,问题6答案

32、,Comparison of Ordered Indexing and Hashing,是否使用索引和使用什么类型的索引应考虑的因素: Cost of periodic re-organization Relative frequency of insertions and deletions Is it desirable to optimize average access time at the expense of worst-case access time? 什么时候用什么索引会更好 Expected type of queries: Hashing is generally be

33、tter at retrieving records having a specified value of the key. If range queries are common, ordered indices are to be preferred 哪些DBMS支持哪些类型索引: In practice: PostgreSQL supports hash indices, but discourages use due to poor performance Oracle supports static hash organization, but not hash indices S

34、QLServer supports only B+-trees,2.5 有序索引与散列的比较,2-5 何时使用顺序索引和散列?,问题7答案,Index Definition in SQL,Create an index create index on () E.g. create index b-index on branch(branch_name) Use create unique index to indirectly specify and enforce the condition that the search key is a candidate key. E.g. creat

35、e unique index branch-index on branch(branch_name) Not really required if SQL unique integrity constraint is supported To drop an index drop index Most database systems allow specification of type of index(比如:B+树或散列) Some database systems allow specification of clustering index.,三 SQL中的索引定义,3-1 SQL中

36、如何定义索引?,问题8答案,3-1 SQL索引如何被使用?,系统自动分析选择使用,用于无法显示指定! 应在SQL查询语句中的WHERE条件中使用到相应字段!,项目驱动目标: 数据库系统如何实现查询快速处理: 一 、查询处理概述 二 、基本查询操作的处理 三 、连接运算的处理 主要讨论问题: 你知道查询处理的基本过程 查询代价大致该如何估算 选择操作代价如何 如何实现外排序归并 外排序归并的复杂度如何 嵌套循环连接代价如何 索引嵌套循环连接代价如何 归并连接算法是什么样,练习 13:,P.350 12.18 12.19,Thank you !,End !,Bitmap Indices(位图索引)

37、,Bitmap indices are a special type of index designed for efficient querying on multiple keys Records in a relation are assumed to be numbered sequentially from, say, 0 Given a number n it must be easy to retrieve record n Particularly easy if records are of fixed size Applicable on attributes that t

38、ake on a relatively small number of distinct values E.g. gender, country, state, E.g. income-level (income broken up into a small number of levels such as 0-9999, 10000-19999, 20000-50000, 50000- infinity) A bitmap is simply an array of bits,*四 位图索引,4-1 什么是位图索引?,问题9答案,Bitmap Indices (Cont.),In its s

39、implest form a bitmap index on an attribute has a bitmap for each value of the attribute Bitmap has as many bits as records In a bitmap for value v, the bit for a record is 1 if the record has the value v for the attribute, and is 0 otherwise,二 位图索引,Bitmap Indices (Cont.),Bitmap indices are useful f

40、or queries on multiple attributes not particularly useful for single attribute queries Queries are answered using bitmap operations Intersection (and) Union (or) Complementation (not) Each operation takes two bitmaps of the same size and applies the operation on corresponding bits to get the result bitmap E.g. 100110 AND 110011 = 100010 100110 OR 110011 = 110111 NOT 100110 = 011001 Males with income level L1: 10010 AND 10100 = 10000 Can then retrieve required tuples. Counting number of matching tuples is even faster,二 位图索引,Bitmap Indices (Cont.),Bitmap ind

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论