Jeff Bonwick, the leader of the team at Sun Microsystems that developed ZFS, called it “…the last word in filesystems.” The praise is well deserved, considering its advanced yet easily maintainable features. ZFS, a pseudo-acronym for what was earlier called the Zettabyte File System, is a 128-bit filesystem, as opposed to the presently available 64-bit filesystems like ext4 and others.
Some of its excellent features include:
- Simplified administration: ZFS has a well-planned hierarchical structure, with the uberblock (the parent of all blocks) and the disk label at the top, followed by pool-wide metadata, each filesystem’s metadata, directories and files. The uberblock checksum is used as the digital signature for the entire filesystem. Besides property inheritance (utilising the hierarchical structure), ZFS automatically manages mounting, sharing, compression, ACLs, quotas, reservations, etc, making administration easier and more effective. Filesystems in ZFS can be compared to directories in ordinary filesystems like ext3, and most administration tasks are done using just two commands: zpool and zfs.
- Pooled storage: ZFS has revolutionised filesystem implementation and management with the introduction of storage pools. Concepts like datasets (a generic term for volumes, filesystems, snapshots and clones) and pools (a large storage area available to the datasets) make filesystem handling easier for the administrator. Like the virtual memory model for a process, a filesystem can grow its usage of space as required, without any pre-determined limits unless ‘quotas’ are set within the pool. Quotas can be set, changed or removed at will, and a minimum ‘reservation’ of space can be specified for each filesystem. One important aspect of the storage pool is that it does away with the separate volume management layer, removing a lot of complexity for the administrator.
- Transactional paradigm: ZFS is a transactional filesystem, which its developers guarantee to be consistent. Data management in ZFS uses copy-on-write semantics, which ensure that data is never overwritten in place; an old reference to the data is always maintained. A sequence of filesystem operations is either committed or ignored as a whole, preventing any corruption of the filesystem due to a power failure or some other outage. This, in effect, removes the need for the fsck tool, the traditional filesystem check and repair tool.
- Scrubbing and self-healing: Since data, and even metadata, is checksummed, data scrubbing (an operation that checks data integrity within a filesystem; in other words, data validation) is performed easily within ZFS. The checksum algorithm can be selected by the user, from fletcher2 up to SHA-256, which produces 256-bit checksums. Besides checking for data integrity and preventing silent corruption, ZFS also provides mechanisms for self-healing, mainly through RAID-Z and mirroring. The two RAID-Z variations, single- and double-parity, are in fact slight variations of RAID-5 and RAID-6, respectively; they mainly aim to eliminate the write hole, solidifying data integrity. Besides, techniques like resilvering (resyncing) help in replacing a corrupted or faulty device with a new one.
- Scalability: The team behind ZFS made the decision to go for a 128-bit filesystem, even though 64-bit filesystems like ext4 have come up only recently. Its data limit is an enormous 256 quadrillion zettabytes of storage, which is an almost impossible limit to reach in the near future, since fully populating a 128-bit storage pool would, literally, require more energy than it takes to boil the oceans, as Bonwick pointed out. Directories can have up to 2^48 (256 trillion) entries, and no limit exists on the number of filesystems, or on the number of files a filesystem can contain.
- Snapshots and clones: A snapshot is a read-only copy of a filesystem or volume at a particular point in time. It is designed so that space is consumed only when data changes, and data is never freed from the filesystem unless explicitly requested, giving further options for maintaining data integrity. A clone is a writable filesystem generated from a snapshot. Creating snapshots and clones in ZFS is very simple, and this is often pointed out as one of its big advantages.
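As a sketch of how simple this administration is in practice, the commands below create a pool, set a quota and reservation, and take a snapshot and a clone. The pool name ‘tank’, the dataset names and the device names are all hypothetical; run such commands as root, on disks you can afford to wipe.

```shell
# Create a mirrored pool from two (hypothetical) spare disks; it is
# auto-mounted at /tank -- no mkfs, no fstab entry, no volume manager.
zpool create tank mirror /dev/sdb /dev/sdc

# Create a filesystem inside the pool and tune it with 'zfs set'.
zfs create tank/home
zfs set quota=10G tank/home        # upper limit on space usage
zfs set reservation=1G tank/home   # guaranteed minimum space

# A snapshot (read-only, point-in-time) and a clone (writable) of it.
zfs snapshot tank/home@today
zfs clone tank/home@today tank/home-clone
```

Pool-level operations go through zpool, dataset-level ones through zfs; nearly every administrative task follows this pattern.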
ZFS and Linux
ZFS is the standard filesystem for the Solaris/OpenSolaris OS, and its source code is published under the CDDL (Common Development and Distribution License). However, from the beginning (and hopefully forever) the Linux kernel has remained licensed under the GPLv2, which prevents any other code from being linked with the GPL’d Linux kernel unless that code’s licence is GPLv2-compatible. So the open-sourced ZFS code cannot be added to or linked with the kernel code like any other filesystem, either as part of the kernel or as a kernel module. As workarounds, some solutions pointed out by the open source community are:
- A ‘court ruling’ (either in the US or EU, where ZFS is mainly used) stating that GPL and CDDL are compatible.
- Either of the parties (Linux and Solaris) needs to change the licence of its code to a mutually compatible one.
- A GPL’d ZFS reimplementation from scratch, which would have to be free of the 56 patents Sun has taken on the ZFS code.
- A method of implementing ZFS so that it is usable on Linux through dynamic linking between the two code bases, which is allowed.
The possibility of Options 1 and 2 is remote, compelling us to choose between Options 3 and 4. Along the lines of Option 3, a project named BTRFS, guided by Chris Mason at Oracle, is under development; it has been merged into an ‘rc’ pre-release of the current Linux kernel (2.6.30) and is under testing. This is definitely going to take a long time, as ZFS itself was under development for five years. Option 4, which works through a utility called FUSE and seems the most stable option as of now, is what I am going to discuss as we go on.
Filesystem in USErspace, or FUSE, helps implement a fully functional filesystem in a userspace program rather than directly in the kernel. It is implemented in OSs like Linux, FreeBSD, etc. Its components (as of version 2.7.4) are a FUSE kernel module; a FUSE library, comprising libfuse and libulockmgr; and a special device file named /dev/fuse, used for communication between the kernel module and the userspace library. For user convenience, a program named ‘fusermount’ is provided along with the FUSE package, as an easy usermode tool to link the user-defined filesystem with the FUSE module.
ZFS on FUSE
ZFS on FUSE is a project developed by Ricardo Manuel da Silva Correia, a computer engineering student, and sponsored by Google as part of the Google Summer of Code 2006. On completion of this project, ZFS will have a port on the FUSE framework, which effectively means that operating systems like Linux can use ZFS. A rough performance comparison of ZFS on FUSE with NTFS-3G, XFS and JFS can be found at www.csamuel.org/2007/04/25/comparing-ntfs-3g-to-zfs-fuse-for-fuse-performance.
How it works
The zfs-fuse daemon acts like a server, managing ZFS on the system through the FUSE framework. Every filesystem operation on a mounted ZFS device, from any application, goes through the standard C library system calls. This results in the kernel calling the appropriate function in its virtual filesystem (VFS) interface, which is hooked to the FUSE module; the FUSE module, in turn, acts like a filesystem module and forwards the request through a special-purpose device named /dev/fuse. This device acts as a bridge between the FUSE module and the ZFS implementation (which, in this case, is zfs-fuse): the module communicates with zfs-fuse through the FUSE library, libfuse, whose functions are similar to those of the VFS interface. The user program returns the results of the filesystem request, in the required format, through the FUSE framework to the application.
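Seen from the application's side, this whole detour is invisible: any ordinary tool works on a zfs-fuse mount through plain system calls. The mountpoint /tank/home below is hypothetical and assumes a pool is already mounted there.

```shell
# Ordinary write() and read() calls; the kernel's FUSE module forwards
# each request through /dev/fuse to the zfs-fuse daemon and back.
echo "hello" > /tank/home/test.txt
cat /tank/home/test.txt

# The mount appears as an ordinary fuse-type entry in the mount table.
mount | grep fuse
```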
The latest source code of ZFS on FUSE can be downloaded from the project site. It is available in two forms: as a release version packed as a bzip2 archive, or directly in source form from the Mercurial repository. Installing from source requires scons instead of make, though the commands and options are almost the same for both. It’s better to read the README and INSTALL files in the source directory before proceeding. Besides, for certain distributions like Gentoo, Debian, Fedora, Ubuntu, etc, zfs-fuse is available via the regular package management system, making installation much easier. Please use your package manager and search for “zfs”.
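A source build looks roughly like the following sketch. The tarball name is an assumption based on the 0.5 release and may differ for your download; check the README and INSTALL files for the project’s actual instructions.

```shell
# Unpack the bzip2-compressed release tarball and enter the source tree.
tar xjf zfs-fuse-0.5.tar.bz2
cd zfs-fuse-0.5/src

# scons plays the role that make usually does.
scons            # build the zfs-fuse daemon and tools
scons install    # install them (run as root)
```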
Installation on Fedora 10
As I was using Fedora 10 while testing ZFS, my commands and configuration files are more specific to Fedora, though with minor variations the same should apply to most distros.
First, install the zfs-fuse package using the following command (all commands from here on should be executed as the root user, unless otherwise mentioned):
yum install zfs-fuse
This installed zfs-fuse version 0.5 on my Fedora 10 system.
Setting up ZFS
Before executing any commands, verify that the zfs-fuse daemon is running. If it’s not, issue the following command:
service zfs-fuse start
…or directly run the script file as follows: