Lustre Explained (Lustre详解.doc)

Uploaded by c*** (IP location: Henan) on 2020-02-08. Format: DOC; 8 pages; 149.80 KB.


Lustre is a massively parallel distributed file system, generally used for large-scale cluster computing. The name Lustre is a portmanteau derived from Linux and cluster. [1] Available under the GNU GPL, the project provides a high-performance file system for clusters of tens of thousands of nodes with petabytes of storage capacity.

Lustre file systems are used in computer clusters ranging from small workgroup clusters to large-scale, multi-site clusters. Fifteen of the top 30 supercomputers in the world use Lustre file systems, including the world's fastest TOP500 supercomputer, the K computer. [2] Lustre file systems can support tens of thousands of client systems, tens of petabytes (PB) of storage, and hundreds of gigabytes per second (GB/s) of I/O throughput. Due to Lustre's high scalability, businesses such as Internet service providers, financial institutions, and the oil and gas industry deploy Lustre file systems in their data centers. [3]

History

The Lustre file system architecture was developed as a research project in 1999 by Peter Braam, who was a Senior Systems Scientist at Carnegie Mellon University at the time. Braam went on to found his own company, Cluster File Systems, which released Lustre 1.0 in 2003. In 2007, Sun Microsystems acquired Cluster File Systems Inc. [4] [5] Sun included Lustre with its HPC hardware offerings, with the intent to bring the benefits of Lustre technologies to Sun's ZFS file system and the Solaris operating system.
In November 2008, Braam left Sun Microsystems to work on another filesystem, leaving Eric Barton and Andreas Dilger in charge of Lustre architecture and development.

In 2010, Oracle Corporation, by way of its acquisition of Sun, began to manage and release Lustre. In April 2010, Oracle announced it would limit paid support for new Lustre 2.0 deployments to Oracle hardware or hardware provided by approved third-party vendors. Lustre remained available under the GPL to all users, and existing Lustre 1.8 customers would continue to receive support from Oracle. [6]

In December 2010, Oracle announced that it would cease Lustre development. The Lustre 1.8 release was placed into maintenance-only support, [7] creating uncertainty around the future development of the file system. Following this announcement, new Lustre support and development was provided by a community including Whamcloud, [8] Xyratex, [9] OpenSFS, European Open File Systems (EOFS) SCE, and others. In the same year, Eric Barton and Andreas Dilger left Oracle for the Lustre-centric startup Whamcloud, [10] where they continue to work on Lustre.

Release history

The Lustre file system was first installed for production use in March 2003 on the MCR Linux Cluster at LLNL, [11] one of the largest supercomputers at the time. [12]

Lustre 1.2.0, released in March 2004, provided Linux kernel 2.6 support, a "size glimpse" feature to avoid lock revocation on files undergoing write, and client-side data write-back cache accounting (grant).

Lustre 1.4.0, released in November 2004, provided protocol compatibility between versions, InfiniBand network support, and support for extents/mballoc in the ldiskfs on-disk filesystem.

Lustre 1.6.0, released in April 2007, supported mount configuration ("mountconf"), allowing servers to be configured with mkfs and mount; supported dynamic addition of object storage targets (OSTs); enabled Lustre distributed lock manager (LDLM) scalability on symmetric multiprocessing (SMP) servers; and supported free space management for object allocations.

Lustre 1.8.0, released in May 2009, provided OSS Read Cache, improved recovery in the face of multiple failures, and added basic heterogeneous storage management via OST Pools, adaptive network timeouts, and version-based recovery. It also served as a transition release, being interoperable with both Lustre 1.6 and Lustre 2.0. [13]

Lustre 2.0.0, released in August 2010, provided a rewritten metadata server stack as a basis for Clustered Metadata (CMD), which allows Lustre metadata to be distributed across multiple metadata servers; a new Client IO stack (CLIO) for portability to other client operating systems such as Mac OS, Windows, and Solaris; and an abstracted Object Storage Device (OSD) back end for portability to other filesystems such as ZFS.

The Lustre file system and associated open source software have been adopted by many partners. Both Red Hat and SUSE (Novell) offer Linux kernels that work without patches on the client for easy deployment.

Architecture

A Lustre file system has three major functional units:

1. A single metadata server (MDS) that has a single metadata target (MDT) per Lustre filesystem. The MDT stores namespace metadata, such as filenames, directories, access permissions, and file layout. The MDT data is stored in a single local disk filesystem.
2. One or more object storage servers (OSSes) that store file data on one or more object storage targets (OSTs). Depending on the server's hardware, an OSS typically serves between two and eight OSTs, with each OST managing a single local disk filesystem. The capacity of a Lustre file system is the sum of the capacities provided by the OSTs.
3. Client(s) that access and use the data.
Lustre presents all clients with a unified namespace for all of the files and data in the filesystem, using standard POSIX semantics, and allows concurrent and coherent read and write access to the files in the filesystem.

The MDT, OST, and client can be on the same node, but in typical installations these functions are on separate nodes communicating over a network. The Lustre Network (LNET) layer supports several network interconnects, including native InfiniBand verbs, TCP/IP on Ethernet and other networks, Myrinet, Quadrics, and other proprietary network technologies. Lustre takes advantage of remote direct memory access (RDMA) transfers, when available, to improve throughput and reduce CPU usage.

The storage used for the MDT and OST backing filesystems is partitioned, optionally organized with logical volume management (LVM) and/or RAID, and normally formatted as ext4 file systems. The Lustre OSS and MDS servers read, write, and modify data in the format imposed by these file systems.

An OST is a dedicated filesystem that exports an interface to byte ranges of objects for read/write operations. An MDT is a dedicated filesystem that controls file access and tells clients which object(s) make up a file. MDTs and OSTs currently use an enhanced version of ext4 called ldiskfs to store data. Work started in 2008 at Sun to port Lustre to Sun's ZFS/DMU for back-end data storage [14] and continues as an open source project. [15]
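
To make the relationship between the functional units concrete, here is a minimal Python sketch of the architecture described above. It is only a toy model: the class names, server names, and capacities are hypothetical and do not correspond to any real Lustre API or configuration. It shows one MDT, several OSSes that each serve some OSTs, and the total capacity computed as the sum of the OST capacities.

```python
from dataclasses import dataclass, field
from typing import List

TIB = 1024 ** 4  # bytes in one tebibyte


@dataclass
class OST:
    """Object storage target: one local backing filesystem (toy model)."""
    name: str
    capacity_bytes: int


@dataclass
class OSS:
    """Object storage server that serves one or more OSTs."""
    name: str
    osts: List[OST] = field(default_factory=list)


@dataclass
class LustreFS:
    """Toy file system: a single MDT plus any number of OSSes."""
    mdt_name: str
    osses: List[OSS] = field(default_factory=list)

    def all_osts(self) -> List[OST]:
        # Flatten the OSTs served by every OSS.
        return [ost for oss in self.osses for ost in oss.osts]

    def capacity_bytes(self) -> int:
        # File system capacity is the sum of the OST capacities.
        return sum(ost.capacity_bytes for ost in self.all_osts())


# Hypothetical layout: two OSSes serving three OSTs in total.
fs = LustreFS(
    mdt_name="MDT0000",
    osses=[
        OSS("oss1", [OST("OST0000", 8 * TIB), OST("OST0001", 8 * TIB)]),
        OSS("oss2", [OST("OST0002", 16 * TIB)]),
    ],
)
print(fs.capacity_bytes() // TIB)  # 32
```

Adding another OST to either OSS in this model raises the reported capacity by that OST's size, which mirrors the statement that capacity is the sum of what the OSTs provide.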

When a client accesses a file, it completes a filename lookup on the MDS. As a result, either a file is created on behalf of the client or the layout of an existing file is returned to the client. For read or write operations, the client then interprets the layout in the logical object volume (LOV) layer, which maps the offset and size to one or more objects, each residing on a separate OST. The client then locks the file range being operated on and executes one or more parallel read or write operations directly to the OSTs. With this approach, bottlenecks for client-to-OST communications are eliminated, so the total bandwidth available for the clients to read and write data scales almost linearly with the number of OSTs in the filesystem.

Clients do not directly modify the objects on the OST filesystems but instead delegate this task to OSSes. This approach ensures scalability for large-scale clusters and supercomputers, as well as improved security and reliability. In contrast, shared block-based filesystems such as Global File System and OCFS must allow direct access to the underlying storage by all of the clients in the filesystem, which increases the risk of filesystem corruption from misbehaving or defective clients.

Implementation

In a typical Lustre installation on a Linux client, a Lustre filesystem driver module is loaded into the kernel and the filesystem is mounted like any other local or network filesystem. Client applications see a single, unified filesystem even though it may be composed of tens to thousands of individual servers and MDT/OST filesystems.

On some massively parallel processor (MPP) installations, computational processors can access a Lustre file system by redirecting their I/O requests to a dedicated I/O node configured as a Lustre client. This approach is used in the Blue Gene installation [16] at LLNL.

Another approach used in the past was the liblustre library, which provided userspace applications with direct filesystem access. Liblustre was a user-level library that allowed computational processors to mount and use the Lustre file system as a client. Using liblustre, the computational processors could access a Lustre file system even if the service node on which the job was launched was not a Lustre client. Liblustre allowed data movement directly between application space and the Lustre OSSes without requiring an intervening data copy through the kernel, thus providing low-latency, high-bandwidth access from computational processors to the Lustre file system.

Data objects and file striping

In a traditional UNIX disk file system, an inode data structure contains basic information about each file, such as where the data contained in the file is stored. The Lustre file system also uses inodes, but inodes on MDTs point to one or more OST objects associated with the file rather than to data blocks.
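
As a rough illustration of how a layout can map a file offset and size to objects on separate OSTs, the Python sketch below assumes simple round-robin striping with a fixed stripe size and stripe count. These parameters and the function names `map_offset` and `map_range` are assumptions made for this example; the text does not specify the actual layout format or the LOV implementation.

```python
from typing import List, Tuple

MIB = 1024 * 1024


def map_offset(offset: int, stripe_size: int, stripe_count: int) -> Tuple[int, int]:
    """Map a file offset to (object index, offset inside that object).

    Assumes round-robin striping: stripe 0 goes to object 0, stripe 1 to
    object 1, and so on, wrapping around after stripe_count objects.
    """
    stripe_number = offset // stripe_size
    obj_index = stripe_number % stripe_count
    # Each full round over all objects adds one stripe to every object.
    obj_offset = (stripe_number // stripe_count) * stripe_size + offset % stripe_size
    return obj_index, obj_offset


def map_range(offset: int, length: int, stripe_size: int,
              stripe_count: int) -> List[Tuple[int, int, int]]:
    """Split [offset, offset + length) into per-object chunks.

    Returns a list of (object index, object offset, chunk length) tuples,
    one per stripe touched by the range.
    """
    chunks = []
    end = offset + length
    while offset < end:
        obj_index, obj_offset = map_offset(offset, stripe_size, stripe_count)
        room_in_stripe = stripe_size - offset % stripe_size
        chunk_len = min(room_in_stripe, end - offset)
        chunks.append((obj_index, obj_offset, chunk_len))
        offset += chunk_len
    return chunks


# A 300-byte access that straddles the first stripe boundary touches two objects.
print(map_range(MIB - 100, 300, stripe_size=MIB, stripe_count=4))
# [(0, 1048476, 100), (1, 0, 200)]
```

Because each chunk names a different object, a client in this model could issue the two transfers in parallel, which is the property the text credits for bandwidth scaling with the number of OSTs.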

These objects are implemented as files on the OSTs. When a client opens a file, the file open operation transfers a set of object pointers and their layout from the MDS to the client, so that the client can directly interact with the OSS node where the object is stored. This allows the client to perform I/O on the file without further communication with the MDS.

If only one OST object is associated with an MDT inode, that object contains all the data in the Lustre file. When more than one object is associated with a file, data in the file is "striped" across the objects, similar to RAID 0. Striping a file over multiple objects provides significant performance benefits. When striping is used, the maximum file size is not limited by the size of a single target. Capacity and aggregate I/O bandwidth scale with the number of OSTs a file is striped over. Also, since the locking of each object is managed independently for each OST, adding more stripes (OSTs) scales the file IO locking capability of the filesystem proportionately. Each file in the filesystem can have a different striping layout, so that performance and capacity can be tuned optimally for each file.

Locking

Lustre has a distributed lock manager in the VMS style to protect the integrity of each file's data and metadata.
Access and modification of a Lustre file is completely cache coherent among all of the clients.

Metadata locks are managed by the MDT that stores the inode for the file, using the 128-bit Lustre File Identifier (FID, composed of the sequence number and object ID) as the resource name. The metadata locks are split into multiple bits that protect the lookup of the file (file owner and group, permission and mode, and access control list (ACL)), the state of the inode (directory size, directory contents, link count, timestamps), and the layout (file striping). A client can fetch multiple metadata lock bits for a single inode with a single RPC request, but currently clients are only ever granted a read lock for the inode. The MDS manages all modifications to the inode in order to avoid lock resource contention and is currently the only node that gets write locks on inodes.
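
The idea of requesting several metadata lock bits for one inode in a single request can be sketched with a bit-flag type. The Python example below is an illustration only: the names `Fid`, `MetaLockBits`, `LockRequest`, and `request_read_locks`, the bit values, and the sample FID are hypothetical and are not taken from the Lustre source code. The FID is modeled simply as a (sequence number, object ID) pair, without assuming how the 128 bits are divided.

```python
from enum import IntFlag
from typing import NamedTuple


class Fid(NamedTuple):
    """Simplified file identifier used as the lock resource name."""
    seq: int  # sequence number
    oid: int  # object ID


class MetaLockBits(IntFlag):
    """Hypothetical bit values for the three protected areas named in the text."""
    LOOKUP = 0x1  # owner and group, permission and mode, ACL
    UPDATE = 0x2  # inode state: directory size and contents, link count, timestamps
    LAYOUT = 0x4  # file striping


class LockRequest(NamedTuple):
    resource: Fid
    bits: MetaLockBits
    mode: str


def request_read_locks(fid: Fid, bits: MetaLockBits) -> LockRequest:
    """Build one request covering several lock bits.

    Clients only ever get read locks on inodes in this model, so the mode
    is fixed rather than passed in.
    """
    return LockRequest(resource=fid, bits=bits, mode="read")


# One request that covers both the lookup and layout bits of the same inode.
req = request_read_locks(
    Fid(seq=0x200000401, oid=0x5),
    MetaLockBits.LOOKUP | MetaLockBits.LAYOUT,
)
print(MetaLockBits.LOOKUP in req.bits)  # True
print(MetaLockBits.UPDATE in req.bits)  # False
```

Combining the bits with `|` keeps the request to a single message per inode, which matches the single-RPC behavior described above.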

File data locks are managed by the OST on which each object of the file is striped, using byte-range extent locks. Clients can be granted both overlapping read extent locks for part or all of the file, allowing multiple concurrent readers of the same file, and/or non-overlapping write extent locks for regions of the file. This allows many Lustre clients to access a single file concurrently for both read and write, avoiding bottlenecks during file IO. In practice, because Linux clients manage their data cache in units of pages, the clients will request locks that are always an integer multiple of the page size (4096 bytes on most clients). When a client requests an extent lock, the OST may grant a lock for a larger extent than requested, in order to reduce the number of lock requests that the client makes. The actual size of the granted lock depends on several factors, including the number of currently granted locks, whether there are conflicting write locks, and the number of outstanding lock requests. The granted lock is never smaller than the originally requested extent. OST extent locks use the Lustre FID as the resource name for the lock. Since the number of extent lock servers scales with the number of OSTs in the filesystem, this also scales the aggregate locking performance of the filesystem, and of a single file if it is striped over multiple OSTs.

Networking

In a cluster with a Lustre file system, the system network connecting the servers
