Translation of Google's Classic BigTable Paper

Preface

A few years ago, when I first read Google's BigTable paper, I did not really understand the ideas it expressed. I skimmed through it and never noticed the concept of the SSTable. Later, as I started following HBase's design and source code, the ideas behind BigTable gradually became clearer to me, but I was too busy to set aside time to reread the paper. On our project I began promoting HBase because I was learning it myself, while a colleague who was more interested in Cassandra focused on Cassandra's design. The two of us occasionally traded views on technology and design, and at one point he casually remarked that Cassandra and HBase both store data in the SSTable format. I instinctively asked: what is an SSTable? He did not answer. Perhaps it cannot be explained in a few sentences, or perhaps he had never asked himself the question. The question kept bothering me, so now that I have some time to study the designs of HBase and Cassandra in depth, I want to settle it first.

Definition of SSTable

The best way to explain what this term really means is to go back to its source, so I reopened the BigTable paper. The paper first describes the SSTable as follows (end of page 3 and beginning of page 4):

SSTable

The Google SSTable file format is used internally to store Bigtable data. An SSTable provides a persistent, ordered immutable map from keys to values, where both keys and values are arbitrary byte strings. Operations are provided to look up the value associated with a specified key, and to iterate over all key/value pairs in a specified key range. Internally, each SSTable contains a sequence of blocks (typically each block is 64KB in size, but this is configurable). A block index (stored at the end of the SSTable) is used to locate blocks; the index is loaded into memory when the SSTable is opened. A lookup can be performed with a single disk seek: we first find the appropriate block by performing a binary search in the in-memory index, and then reading the appropriate block from disk. Optionally, an SSTable can be completely mapped into memory, which allows us to perform lookups and scans without touching disk.

A loose, non-literal rendering: SSTable is the file format Bigtable uses internally to store data. The file itself is a sorted, immutable, persistent map of key/value pairs, where both keys and values can be arbitrary byte strings. It supports looking up a value by key and iterating over all key/value pairs in a given key range. Each SSTable contains a sequence of blocks (a block is typically 64KB, but the size is configurable). A block index at the end of the SSTable is used to locate blocks, and this index is loaded into memory when the SSTable is opened. A lookup first binary-searches the in-memory index to find the block, and then reads that block with a single disk seek. Alternatively, the whole SSTable can be mapped into memory, so that lookups and scans do not need to touch disk.

This looks very much like the format of the first version of HFile.

[Figure: HFile version 1 format]

In HBase, this version of HFile ran into the following problems (see the reference linked in the original post):

1. Memory usage during parsing is fairly high.
2. The Bloom filter and the block index can grow very large and hurt startup performance. Specifically, the Bloom filter can grow to 100MB per HFile and the block index to 300MB; if one HRegionServer hosts 20 HRegions, they can grow to 2GB and 6GB respectively. An HRegion has to load all block indexes into memory when it is opened, which slows startup. On the first request, the entire Bloom filter has to be loaded into memory before the lookup can begin, so an oversized Bloom filter increases the latency of the first request.

HFile version 2 made some optimizations for these problems; the details will be covered when HFile is analyzed.

SSTable as storage

Continuing through the BigTable paper, section 5.3 Tablet Serving (page 6) says:

Tablet Serving

Updates are committed to a commit log that stores redo records. Of these updates, the recently committed ones are stored in memory in a sorted buffer called a memtable; the older updates are stored in a sequence of SSTables. To recover a tablet, a tablet server reads its metadata from the METADATA table. This metadata contains the list of SSTables that comprise a tablet and a set of a redo points, which are pointers into any commit logs that may contain data for the tablet. The server reads the indices of the SSTables into memory and reconstructs the memtable by applying all of the updates that have committed since the redo points.

When a write operation arrives at a tablet server, the server checks that it is well-formed, and that the sender is authorized to perform the mutation. Authorization is performed by reading the list of permitted writers from a Chubby file (which is almost always a hit in the Chubby client cache). A valid mutation is written to the commit log. Group commit is used to improve the throughput of lots of small mutations [13, 16]. After the write has been committed, its contents are inserted into the memtable.

When a read operation arrives at a tablet server, it is similarly checked for well-formedness and proper authorization. A valid read operation is executed on a merged view of the sequence of SSTables and the memtable. Since the SSTables and the memtable are lexicographically sorted data structures, the merged view can be formed efficiently.

Incoming read and write operations can continue while tablets are split and merged.

A brief summary of the first and third paragraphs (not a translation): when new data is written, the operation is first committed to the log as a redo record. The most recent data is kept in memory in a sorted buffer, the memtable; older data is stored in a sequence of SSTables. During recovery, the tablet server reads the metadata from the METADATA table. The metadata lists all SSTables that make up the tablet (recording information about each SSTable, such as its location, StartKey, and EndKey) together with a set of redo points in the logs. The tablet server reads the SSTable indexes into memory and rebuilds the memtable by replaying the updates committed after those redo points. On a read, once the well-formedness and authorization checks pass, the read consults both the SSTables and the memtable (in HBase, this also includes data in the BlockCache) and merges their results. Because the SSTables and the memtable are both sorted lexicographically, the merge can be done efficiently.

SSTable in the compaction process

Section 5.4 Compaction of the BigTable paper says:

Compaction

As write operations execute, the size of the memtable increases. When the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable and written to GFS. This minor compaction process has two goals: it shrinks the memory usage of the tablet server, and it reduces the amount of data that has to be read from the commit log during recovery if this server dies. Incoming read and write operations can continue while compactions occur.

Every minor compaction creates a new SSTable. If this behavior continued unchecked, read operations might need to merge updates from an arbitrary number of SSTables. Instead, we bound the number of such files by periodically executing a merging compaction in the background. A merging compaction reads the contents of a few SSTables and the memtable, and writes out a new SSTable. The input SSTables and memtable can be discarded as soon as the compaction has finished.

A merging compaction that rewrites all SSTables into exactly one SSTable is called a major compaction. SSTables produced by non-major compactions can contain special deletion entries that suppress deleted data in older SSTables that are still live. A major compaction, on the other hand, produces an SSTable that contains no deletion information or deleted data. Bigtable cycles through all of its tablets and regularly applies major compactions to them. These major compactions allow Bigtable to reclaim resources used by deleted data, and also allow it to ensure that deleted data disappears from the system in a timely fashion, which is important for services that store sensitive data.

When the memtable grows to a threshold, it is frozen and a new memtable is created for use, while the old memtable is converted to an SSTable and written to GFS. This process is called a minor compaction. A minor compaction reduces memory usage and also reduces the amount of log data that must be read during recovery, because data that has already been persisted no longer needs to be replayed from the log. Reads and writes can continue to be served while a minor compaction runs.

Each minor compaction produces a new SSTable file. As the number of SSTable files grows, read performance suffers, because every read has to consult all SSTable files and merge the results. The number of SSTable files therefore needs an upper bound, and a merging compaction has to run in the background from time to time. A merging compaction reads the contents of a few SSTable files and the memtable and merges them into a new SSTable. Once it finishes, the source SSTables and memtable can be discarded.

If a merging compaction merges all SSTables into a single SSTable, it is called a major compaction. A major compaction physically removes deletion markers and the data they mark as deleted, whereas the other two kinds of compaction keep both (in marked form). Bigtable periodically cycles through all tablets and applies major compactions to them. A major compaction truly removes the data that should be deleted, which reclaims space and ensures that deleted data disappears from the system in a timely fashion.

SSTable locality and in-memory settings

In Bigtable, locality is defined by locality groups: several column families can be combined into one locality group, and within a tablet the column families of the same locality group are stored in their own separate SSTable. HBase simplifies this model: each column family in each HRegion is stored in its own HFile. HFile has no concept of a locality group; put another way, each column family is its own locality group.

Bigtable also supports declaring, at the locality group level, that all data of that locality group should be loaded into memory; in HBase this is set when the column family is defined. The in-memory data is loaded lazily. The setting is mainly used for small, frequently accessed column families, where it improves read performance because the data no longer has to be read from disk.

SSTable compression

Compression in Bigtable is configured at the locality group level:

Compression

Clients can control whether or not the SSTables for a locality group are compressed, and if so, which compression format is used. The user-specified compression format is applied to each SSTable block (whose size is controllable via a locality group specific tuning parameter). Although we lose some space by compressing each block separately, we benefit in that small portions of an SSTable can be read without decompressing the entire file. Many clients use a two-pass custom compression scheme. The first pass uses Bentley and McIlroy's scheme [6], which compresses long common strings across a large window. The second pass uses a fast compression algorithm that looks for repetitions in a small 16 KB window of the data. Both compression passes are very fast: they encode at 100-200 MB/s, and decode at 400-1000 MB/s on modern machines.

Bigtable compresses one SSTable block at a time. Compressing each block separately costs some space, but it means we can read, decompress, and parse data one block at a time instead of reading, decompressing, and parsing an entire "large" SSTable every time.

SSTable read caching

To improve read performance, Bigtable uses a two-level caching mechanism:

Caching for read performance

To improve read performance, tablet servers use two levels of caching. The Scan Cache is a higher-level cache that caches the key-value pairs returned by the SSTable interface to the tablet server code. The Block Cache is a lower-level cache that caches SSTables blocks that were read from GFS. The Scan Cache is most useful for applications that tend to read the same data repeatedly. The Block Cache is useful for applications that tend to read data that is close to the data they recently read (e.g., sequential reads, or random reads of different columns in the same locality group within a hot row).

The two levels are:

1. High level (Scan Cache): caches key/value pairs read from SSTables. It helps operations that tend to read the same data repeatedly (the principle of locality of reference).
2. Low level (Block Cache): caches SSTable blocks. It helps operations that tend to read data close to what they recently read.

Bloom Filter

As mentioned earlier, Bigtable uses merged reads: a read has to fetch the relevant data from every SSTable and merge it into a single result. Consulting every SSTable on every read is naturally expensive, which is why Bloom filters were introduced. A Bloom filter can very quickly establish that a RowKey is not in a given SSTable (note: the converse does not hold; a positive answer only means the key may be present).

Bloom Filter

As described in Section 5.3, a read operation has to read from all SSTables that make up the state of a tablet. If these SSTables are not in memory, we may end up doing many disk accesses. We reduce the number of accesses by allowing clients to specify that Bloom filters [7] should be created for SSTables in a particular locality group. A Bloom filter allows us to ask whether an SSTable might contain any data for a specified row/column pair. For certain applications, a small amount of tablet server memory used for storing Bloom filters drastically reduces the number of disk seeks required for read operations. Our use of Bloom filters also implies that most lookups for non-existent rows or columns do not need to touch disk.

Benefits of designing SSTable as immutable

The SSTable definition already states that an SSTable is an immutable ordered map. This immutable design makes the system much simpler:

Exploiting Immutability

Besides the SSTable caches, various other parts of the Bigtable system have been simplified by the fact that all of the SSTables that we generate are immutable. For example, we do not need any synchronization of accesses to the file system when reading from SSTables. As a result, concurrency control over rows can
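
To make the lookup path from the SSTable definition concrete, here is a minimal sketch of an immutable sorted map split into blocks, with an in-memory block index that is binary-searched before a single block is read. It is only an illustration under assumptions: the class name `ToySSTable`, the tiny `block_size`, and the `block_reads` counter (standing in for disk seeks) are hypothetical and come from neither Bigtable nor HBase.

```python
import bisect


class ToySSTable:
    """Immutable sorted key/value map split into blocks (illustrative only)."""

    def __init__(self, items, block_size=2):
        # Sort once at build time; the table is never modified afterwards.
        pairs = sorted(items.items(), key=lambda kv: kv[0])
        # Each small list stands in for one on-disk block (about 64KB in the paper).
        self._blocks = [pairs[i:i + block_size]
                        for i in range(0, len(pairs), block_size)]
        # The block index holds the last key of every block and stays in memory.
        self._index = [block[-1][0] for block in self._blocks]
        self.block_reads = 0  # hypothetical counter for simulated disk reads

    def _read_block(self, block_no):
        self.block_reads += 1
        return self._blocks[block_no]

    def get(self, key):
        # Binary search: first block whose last key is >= the wanted key.
        block_no = bisect.bisect_left(self._index, key)
        if block_no == len(self._index):
            return None  # key is larger than every key in the table
        for k, v in self._read_block(block_no):
            if k == key:
                return v
        return None

    def scan(self, start, end):
        """Yield (key, value) pairs with start <= key < end, in key order."""
        block_no = bisect.bisect_left(self._index, start)
        while block_no < len(self._blocks):
            for k, v in self._read_block(block_no):
                if k >= end:
                    return
                if k >= start:
                    yield k, v
            block_no += 1


table = ToySSTable({b"a": b"1", b"b": b"2", b"c": b"3", b"d": b"4", b"e": b"5"})
print(table.get(b"c"), table.block_reads)
print(list(table.scan(b"b", b"e")))
```

A point lookup touches at most one block, which mirrors the "single disk seek" claim in the quoted passage; a key beyond the last index entry is rejected without reading any block.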

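The merged read and the three kinds of compaction described above can be sketched in a few lines. This is a toy model under stated assumptions, not Bigtable's or HBase's implementation: a memtable is a plain dict, an SSTable is a sorted list of pairs, sources are always passed newest first, and `TOMBSTONE`, `merged_view`, `minor_compaction`, `merging_compaction`, and `read` are hypothetical names chosen for illustration.

```python
import heapq

TOMBSTONE = object()  # hypothetical stand-in for a deletion entry


def merged_view(runs):
    """Merge sorted (key, value) runs, newest run first; the newest value per key wins."""
    # Tag entries with the age of their run so equal keys sort newest first.
    tagged = [[(key, age, value) for key, value in run]
              for age, run in enumerate(runs)]
    last_key = object()  # sentinel that equals no real key
    for key, _age, value in heapq.merge(*tagged, key=lambda e: (e[0], e[1])):
        if key == last_key:
            continue  # an older version of a key that was already emitted
        last_key = key
        yield key, value


def minor_compaction(memtable):
    """Freeze a memtable into a new immutable sorted run."""
    return sorted(memtable.items(), key=lambda kv: kv[0])


def merging_compaction(runs, major=False):
    """Rewrite several runs as one; only a major compaction drops tombstones."""
    merged = merged_view(runs)
    if major:
        return [(k, v) for k, v in merged if v is not TOMBSTONE]
    # Tombstones are kept: older runs outside this compaction may still hold the data.
    return list(merged)


def read(key, memtable, sstables):
    """Look up key in the merged view of the memtable and the SSTables."""
    for k, v in merged_view([minor_compaction(memtable)] + sstables):
        if k == key:
            return None if v is TOMBSTONE else v
        if k > key:
            break  # the view is sorted, so the key cannot appear later
    return None


old = [(b"a", b"1"), (b"b", b"2")]
newer = minor_compaction({b"b": TOMBSTONE, b"c": b"3"})
print(read(b"a", {b"a": b"9"}, [newer, old]))
print(merging_compaction([newer, old], major=True))
```

Because every input is already sorted, the merge is a single streaming pass, which is the efficiency argument the paper makes for the merged view. The linear scan in `read` is kept only for brevity; a real system would seek within each run instead.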