Lustre (file system)

Lustre
Developed by: Sun Microsystems
Latest release: 1.6.7 / 2009-02-20
Preview release: 2.0 Alpha / 2009-04-15
Operating system: Linux
Type: Distributed file system
License: GPL
Website: http://www.lustre.org, http://sun.com/lustre/

Lustre is an object-based, distributed file system, generally used for large-scale cluster computing. The name Lustre is a blend of the words Linux and cluster. The project aims to provide a file system for clusters of tens of thousands of nodes with petabytes of storage capacity, without compromising speed or security. Lustre is available under the GNU GPL.

Lustre is designed, developed, and maintained by Sun Microsystems, Inc. with input from many other individuals and companies. Sun completed its acquisition[1][2] of Cluster File Systems, Inc., including the Lustre file system, on October 2, 2007, with the intention of bringing the benefits of Lustre technologies to Sun's ZFS file system and the Solaris operating system.

Lustre file systems are used in computer clusters ranging from small workgroup clusters to large-scale, multi-site clusters. Fifteen of the top 30 supercomputers in the world use Lustre file systems, including the world's second fastest supercomputer, the Blue Gene/L at Lawrence Livermore National Laboratory (LLNL).[3] Other supercomputers that use the Lustre file system include systems at Oak Ridge National Laboratory, Pacific Northwest National Laboratory, and NASA[4] in North America, the largest system in Asia at the Tokyo Institute of Technology,[5] and one of the largest systems in Europe at CEA.[6]

Lustre file systems can support up to tens of thousands of client systems, petabytes (PBs) of storage and hundreds of gigabytes per second (GB/s) of I/O throughput. Businesses ranging from Internet service providers to large financial institutions deploy Lustre file systems in their data centers. Due to the high scalability of Lustre file systems, Lustre deployments are popular in the oil and gas, manufacturing, rich media and finance sectors.[7]

History

The Lustre file system architecture was developed as a research project in 1999 by Peter Braam, who was a Senior Systems Scientist at Carnegie Mellon University at the time. Braam went on to found his own company, Cluster File Systems, which released Lustre 1.0 in 2003. Cluster File Systems was acquired by Sun Microsystems in October 2007.

The Lustre file system was first installed for production use in March 2003 on the MCR Linux Cluster at LLNL,[8] one of the largest supercomputers at the time.[9]

Lustre 1.2.0, released in March 2004, provided Linux kernel 2.6 support, a "size glimpse" feature to avoid lock revocation on files undergoing write, and client-side write-back cache accounting.

Lustre 1.4.0, released in November 2004, provided protocol compatibility between versions, InfiniBand network support, and support for extents/mballoc in the ldiskfs on-disk filesystem.

Lustre 1.6.0, released in April 2007, supported mount configuration (“mountconf”) allowing servers to be configured with "mkfs" and "mount", supported dynamic addition of object storage targets (OSTs), enabled Lustre distributed lock manager (LDLM) scalability on symmetric multiprocessing (SMP) servers, and supported free space management for object allocation.

The Lustre file system and associated open-source software have been adopted by many partners. Both Red Hat and SUSE (Novell) offer Linux kernels that work without patches on the client side, easing deployment.

Architecture

A Lustre file system has three major functional units:

  • A single metadata target (MDT) per filesystem that stores metadata, such as filenames, directories, permissions, and file layout, on the metadata server (MDS).
  • One or more object storage targets (OSTs) that store file data on one or more object storage servers (OSSes). Depending on the server's hardware, an OSS typically serves between two and eight targets, each target being a local disk filesystem of up to 8 terabytes (TB) in size. The capacity of a Lustre file system is the sum of the capacities provided by the targets.
  • Clients that access and use the data. Lustre presents all clients with standard POSIX semantics and concurrent read and write access to the files in the filesystem.

The MDT, OST, and client can be on the same node or on different nodes, but in typical installations, these functions are on separate nodes with two to four OSTs per OSS node communicating over a network. Lustre supports several network types, including InfiniBand, TCP/IP on Ethernet, Myrinet, Quadrics, and other proprietary technologies. Lustre can take advantage of remote direct memory access (RDMA) transfers, when available, to improve throughput and reduce CPU usage.

The storage attached to the servers is partitioned, optionally organized with logical volume management (LVM) and/or RAID, and formatted as file systems. The Lustre OSS and MDS servers read, write, and modify data in the format imposed by these file systems.

An OST is a dedicated filesystem that exports an interface to byte ranges of objects for read/write operations. An MDT is a dedicated filesystem that controls file access and tells clients which object(s) make up a file. MDTs and OSTs currently use a modified version of ext3 to store data. In the future, Sun's ZFS/DMU will also be used to store data.

When a client accesses a file, it completes a filename lookup on the MDS. As a result, either a new file is created on behalf of the client or the layout of an existing file is returned to the client. For read or write operations, the client then passes the layout to a logical object volume (LOV), which maps the offset and size to one or more objects, each residing on a separate OST. The client then locks the file range being operated on and executes one or more parallel read or write operations directly to the OSTs. With this approach, bottlenecks for client-to-OST communications are eliminated, so the total bandwidth available for the clients to read and write data scales almost linearly with the number of OSTs in the filesystem.
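
The offset-to-object calculation can be illustrated with a short sketch. The following C program is a simplified model rather than actual Lustre code: the layout structure, its field names, and the parameter values are hypothetical, and it assumes a fixed stripe size with stripe units placed round-robin across the file's objects, as in RAID 0.

    /* Simplified model of the offset-to-object mapping (not Lustre code).
     * Given a hypothetical layout (stripe size and stripe count), compute which
     * object, and which offset within that object, holds a given file offset. */
    #include <stdio.h>
    #include <stdint.h>

    struct file_layout {
        uint64_t stripe_size;   /* bytes per stripe unit, e.g. 1 MiB */
        unsigned stripe_count;  /* number of OST objects backing the file */
    };

    static void map_offset(const struct file_layout *lo, uint64_t file_off,
                           unsigned *obj_idx, uint64_t *obj_off)
    {
        uint64_t stripe_no = file_off / lo->stripe_size;      /* which stripe unit */
        *obj_idx = (unsigned)(stripe_no % lo->stripe_count);  /* round-robin over objects */
        *obj_off = (stripe_no / lo->stripe_count) * lo->stripe_size
                   + file_off % lo->stripe_size;
    }

    int main(void)
    {
        struct file_layout lo = { .stripe_size = 1 << 20, .stripe_count = 4 };
        uint64_t off = 5 * (uint64_t)(1 << 20) + 1234;   /* 5 MiB + 1234 bytes into the file */
        unsigned obj;
        uint64_t obj_off;

        map_offset(&lo, off, &obj, &obj_off);
        printf("file offset %llu -> object %u, offset %llu within that object\n",
               (unsigned long long)off, obj, (unsigned long long)obj_off);
        return 0;
    }

A request that spans several stripe units decomposes into one such extent per object, which is what lets the client issue the resulting reads or writes to the affected OSTs in parallel.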

Clients do not directly modify the objects on the OST filesystems, but, instead, delegate this task to OSSes. This approach ensures scalability for large-scale clusters and supercomputers, as well as improved security and reliability. In contrast, shared block-based filesystems such as Global File System and OCFS must allow direct access to the underlying storage by all of the clients in the filesystem and risk filesystem corruption from misbehaving/defective clients.

Implementation

In a typical Lustre installation on a Linux client, a Lustre filesystem driver module is loaded into the kernel and the filesystem is mounted like any other local or network filesystem. Client applications see a single, unified filesystem even though it may be composed of tens to thousands of individual servers and MDT/OST filesystems.
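
As a minimal sketch, and assuming the Lustre client modules are already loaded, such a mount reduces to an ordinary mount(2) call with filesystem type "lustre". The management-node address, filesystem name, and mount point below are placeholders; in practice the equivalent mount command or an fstab entry is used rather than a program.

    /* Sketch of a Lustre client mount at the system-call level.
     * Placeholder addresses and paths; requires root privileges and
     * the Lustre client modules to be loaded. */
    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        const char *source = "10.0.0.1@tcp0:/testfs";  /* <MGS node>@<network>:/<fsname> (example values) */
        const char *target = "/mnt/lustre";            /* example mount point */

        if (mount(source, target, "lustre", 0, "") != 0) {
            perror("mount");
            return 1;
        }
        printf("mounted %s on %s\n", source, target);
        return 0;
    }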

On some massively parallel processor (MPP) installations, computational processors can access a Lustre file system by redirecting their I/O requests to a dedicated I/O node configured as a Lustre client. This approach was used in the LLNL Blue Gene installation[10].

Another approach uses the liblustre library to provide userspace applications with direct filesystem access. Liblustre is a user-level library that allows computational processors to mount and use the Lustre file system as a client. Using liblustre, the computational processors can access a Lustre file system even if the service node on which the job was launched is not a Lustre client. Liblustre allows data movement directly between application space and the Lustre OSSs without requiring an intervening data copy through the kernel, thus providing low-latency, high-bandwidth access from computational processors to the Lustre file system. Good performance characteristics and scalability make this approach the most suitable for using Lustre with MPP systems. Liblustre is the most significant design difference between Lustre implementations on MPPs such as the Cray XT3 and Lustre implementations on conventional clustered computational systems.

Lustre 1.8, expected to be released in Q4 2008, will focus on improving recovery in the face of multiple failures and on storage management via OST Pools.[11]

Lustre technology has been integrated into the HP StorageWorks Scalable File Share product.[12]

Data objects and file striping

In a traditional UNIX disk file system, an inode data structure contains basic information about each file, such as where the data contained in the file is stored. The Lustre file system also uses inodes, but inodes on MDSs point to one or more objects associated with the file rather than to data blocks. These objects are implemented as files on the OSSs. When a client opens a file, the file open operation transfers a set of object pointers and their layout from the MDS to the client, so that the client can directly interact with the OSS node where the object is stored, allowing the client to perform I/O on the file without further communication with the MDS.

If only one object is associated with an MDS inode, that object contains all the data in the Lustre file. When more than one object is associated with a file, data in the file is “striped” across the objects in a manner similar to RAID 0. Striping provides significant performance benefits: when striping is used, the maximum file size is not limited by the size of a single target, and capacity and aggregate I/O bandwidth scale with the number of OSTs.
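
For example, with a stripe count of four over targets that are each limited to 8 TB (as noted above), a single file can in principle grow to roughly 32 TB, and reads and writes of that file are serviced by four OSTs in parallel rather than by one.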

Networking

In a cluster with a Lustre file system, the system network connecting the servers and the clients is implemented using Lustre Networking (LNET), which provides the communication infrastructure required by the Lustre file system. Disk storage is connected to the Lustre file system MDSs and OSSs using traditional storage area network (SAN) technologies.

LNET supports many commonly used network types, such as InfiniBand and IP networks, and allows simultaneous availability across multiple network types with routing between them. Remote direct memory access (RDMA) is permitted when supported by underlying networks such as Quadrics Elan, Myrinet, and InfiniBand. High availability and recovery features enable transparent recovery in conjunction with failover servers.

LNET provides end-to-end throughput over Gigabit Ethernet (GigE) networks in excess of 100 MB/s[13], throughput up to 1.5 GB/s over InfiniBand double data rate (DDR) links, and throughput over 1 GB/s across 10GigE interfaces.

High availability

Lustre file system high availability features include a robust failover and recovery mechanism, making server failures and reboots transparent. Version interoperability between successive minor versions of the Lustre software enables a server to be upgraded by taking it offline (or failing it over to a standby server), performing the upgrade, and restarting it, while all active jobs continue to run, merely experiencing a delay.

Lustre MDSs are configured as an active/passive pair, while OSSs are typically deployed in an active/active configuration that provides redundancy without extra overhead. Often the standby MDS is the active MDS for another Lustre file system, so no nodes are idle in the cluster.

ZFS integration

Lustre 3.0 will allow users to choose between ZFS and ldiskfs as back-end storage.[14]

References

  1. ^ "Sun Assimilates Lustre Filesystem". Linux Magazine. 2007-09-13. http://www.linux-magazine.com/online/news/sun_assimilates_lustre_filesystem?category=13402. 
  2. ^ "Sun Welcomes Cluster File Systems Customers and Partners". Sun Microsystems, Inc. 2007-10-02. http://www.sun.com/software/clusterfs/index.xml. 
  3. ^ "TOP500 List - June 2008". TOP500.Org. 2008-06-18. http://top500.org/list/2008/06/100. 
  4. ^ "Pleiades Supercomputer". www.nas.nasa.gov. 2008-08-18. http://www.nas.nasa.gov/Resources/Systems/pleiades.html. 
  5. ^ "TOP500 List - November 2006". TOP500.Org. http://www.top500.org/system/8216. 
  6. ^ "TOP500 List - June 2006". TOP500.Org. http://www.top500.org/system/8237. 
  7. ^ "Lustre File System presentation". Google Video. http://video.google.ca/videoplay?docid=965817030102279091. Retrieved on 2008-01-28. 
  8. ^ http://goliath.ecnext.com/coms2/gi_0199-2958660/LINUX-NETWORX-CLUSTER-BECOMES-3RD.html. 
  9. ^ http://www.top500.org/system/6085. 
  10. ^ "DataDirect SELECTED AS STORAGE TECH POWERING BLUEGENE/L". HPC Wire, October 15, 2004: Vol. 13, No. 41. http://www.hpcwire.com/hpcwire/hpcwireWWW/04/1015/108577.html. 
  11. ^ "Lustre Roadmap and Future Plans". Sun Microsystems. http://blogs.sun.com/HPC/resource/ISC08-Lustre_Update.pdf. Retrieved on 2008-08-21. 
  12. ^ "Global File systems (SFS) collaboration". Hewlett Packard. http://www.hp.com/techservers/hpccn/linux_gfs/. 
  13. ^ Lafoucrière, Jacques-Charles. "Lustre Experience at CEA/DIF". HEPiX Forum, April 2007. http://hepix.caspur.it/afs/hepix.org/project/strack/hep_pdf/2007/Spring/Lustre-CEA-hepix2007.pdf. 
  14. ^ "Architecture ZFS for Lustre". Sun Microsystems. http://arch.lustre.org/index.php?title=Architecture_ZFS_for_Lustre. Retrieved on 2008-02-18. 
