Nexenta: Bringing Enterprise Storage to the Open Source World

Storing all that!

This article discusses some of the challenges and technologies related to data storage, and introduces the Nexenta Core Platform, which is an OS based on OpenSolaris.

Storage is arguably the most overlooked component in IT systems’ design. Long considered the realm of those long-haired denizens who stalk the feared corridors in the infrastructure department, storage has been treated almost like hardware, to be requisitioned and justified in terms of capacity and nothing else.

However, as computers become faster, they also become hungrier for data, and the storage system starts playing an increasingly important role. Moore’s Law tells us that the number of transistors you can pack onto a chip doubles roughly every two years; the often-quoted 18-month figure refers to the resulting doubling in chip performance. In layman’s English, this means that computers 18 months from now can be expected to perform roughly twice as many calculations in the same time as they do today. However, a CPU’s ability to perform calculations counts for little if the bottleneck is I/O. Calculations can only be carried out on data that is provided by the I/O subsystem, which in turn is limited by the speed of the storage system.

In times when all parts of a system were homogeneous, it was possible to predict the I/O requirements, and design methods into the software to limit the effects of storage latency in processing, such as data serialisation. As IT systems become more and more diversified in their constituents, their I/O patterns become less predictable.

What this means for systems architects is that now, in addition to thinking about the number of bytes, they also need to think about another quantity: IOPS.

IOPS
Input/Output Operations Per Second (IOPS) is a measure of how many I/O requests a storage medium can process each second. IOPS depends on:

  1. I/O size
  2. The type of I/O (read or write)
  3. Whether the I/O is sequential or random

I/O size, or block size, is the amount of data involved in one input or output transaction. This is typically measured in bytes. Read I/O is typically faster, since no modification of existing data is required. Write I/O is slower, since existing data must be changed.

Sequential I/O occurs when the data to be read or written is laid out at consecutive addresses on the storage medium. When the access pattern is not sequential, I/O is said to be random. For example, in a mechanical disk drive, sequential I/O requires little or no head movement, but random I/O forces the head to seek. One of the reasons that SSDs (solid-state drives) provide more IOPS than magnetic disks is that random and sequential access times are roughly the same for SSDs, since there are no mechanical parts to move.

In addition to the above, when dealing with RAID arrays, IOPS also depends on the number of disks in the array, as well as the RAID type used. For example, mirrored RAID configurations are typically slower than striped RAID configurations, particularly for write operations. This is because in a mirrored RAID configuration, each write operation must be carried out twice. Striping helps increase performance, since the I/O is distributed between multiple disks.

In most cases, more disks in the array means more IOPS. What this boils down to is that even though each of your individual disks may have a high capacity, you may still need more disks than anticipated to provide the IOPS required by your application.
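
To make the sizing argument concrete, here is a minimal sketch that works out how many disks an array needs for capacity and for IOPS separately. The function name, the workload figures, the per-disk IOPS rating and the write penalty of 2 for a mirror are all assumptions chosen for illustration, and the capacity figure ignores the space lost to redundancy.

```python
import math

def disks_needed(capacity_gb, read_iops, write_iops,
                 disk_gb, disk_iops, write_penalty):
    """Return (disks needed for capacity, disks needed for IOPS)."""
    # Capacity sizing: raw space only, ignoring redundancy overhead.
    for_capacity = math.ceil(capacity_gb / disk_gb)
    # Each front-end write costs `write_penalty` back-end operations,
    # e.g. 2 for a two-way mirror, which writes every block twice.
    backend_iops = read_iops + write_iops * write_penalty
    for_iops = math.ceil(backend_iops / disk_iops)
    return for_capacity, for_iops

# Hypothetical workload: 4 TB of data, 3000 read and 1000 write IOPS,
# on 2 TB disks assumed to deliver about 150 IOPS each, mirrored.
for_capacity, for_iops = disks_needed(
    capacity_gb=4000, read_iops=3000, write_iops=1000,
    disk_gb=2000, disk_iops=150, write_penalty=2)
print("disks for capacity:", for_capacity)
print("disks for IOPS:", for_iops)
```

With these assumed numbers, the IOPS requirement rather than the capacity requirement dictates the size of the array.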

While estimating the Total Cost of Ownership of a system, the IOPS calculation tends to throw a wrench in the works. The catch here is that while the cost per Gigabyte, taken by itself, may not be much, the cost of the same capacity at a greater number of IOPS is typically much higher. As a result, many architects now distinguish between storage classes.

Primary storage is the fastest and most expensive storage available, while secondary and tertiary storage classes are cheaper and slower. This makes it easier to regulate the storage component of the TCO by correctly sizing different classes of storage.

Most enterprise storage vendors are now rating their devices based on both capacity and IOPS. Additionally, storage vendors are focusing on helping customers get the most out of their storage by developing tools and methods such as in-line data de-duplication and caching.

De-duplication
Data de-duplication is a technique used to improve storage utilisation by storing only one copy of duplicate chunks of data. These chunks are identified by the byte pattern during analysis, and when a duplicate of one of the chunks analysed so far is found, it is replaced with a reference to the original, which is much smaller than the original data. This substantially reduces the storage used.

There are two types of de-duplication: in-line and post-process.

  • In-line de-duplication: In this process, the de-duplication of data occurs on the fly as the data enters the device. As a result, only one copy of each data block is ever stored, so storage never has to be over-provisioned. The drawback is that de-duplication is a CPU-intensive process and can slow down the incoming I/O substantially.
  • Post-process de-duplication: In this process, data is first written to the storage medium, and the de-duplication process is run afterwards. While this is more efficient in terms of I/O speed, it takes away from the value of the de-duplication, since the storage must still be provisioned for the entire data set, not just the de-duplicated size.
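
The in-line variant can be sketched in a few lines. The example below is a toy model, not any vendor's implementation: the class name DedupStore, the fixed chunk size and the use of SHA-256 digests to recognise identical chunks are all assumptions made for illustration.

```python
import hashlib

class DedupStore:
    """Toy in-line de-duplicating store (illustrative only)."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}  # digest -> chunk bytes, each unique chunk stored once
        self.files = {}   # file name -> list of digests (the references)

    def write(self, name, data):
        refs = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            # Store the chunk only if this byte pattern has not been seen.
            self.chunks.setdefault(digest, chunk)
            refs.append(digest)
        self.files[name] = refs

    def read(self, name):
        return b"".join(self.chunks[d] for d in self.files[name])

    def stored_bytes(self):
        return sum(len(c) for c in self.chunks.values())

store = DedupStore(chunk_size=4)
store.write("a", b"AAAABBBBAAAA")
store.write("b", b"BBBBCCCC")
print(store.stored_bytes())  # bytes physically held after de-duplication
print(store.read("a"))
```

Here 20 bytes are written in total, but only the three distinct 4-byte chunks are kept. The hashing done on every write is where the CPU cost of in-line de-duplication comes from.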

Caching

Caching is the process of storing part of a large data set stored in a comparatively slower medium in a fast storage medium. For example, primary storage can be set up to act as a cache for secondary storage. The cache is essentially an interim place to hold data that is quicker for the I/O system to reach than the actual storage system. The transfer time between the cache and the processing unit is typically less than the transfer time between the processor and the storage system.

Since the cache is smaller than the actual storage, periodically, some part of the cache must be cleared to make room for new data that the system has requested. The selection of the data to be evicted is made through one of several algorithms:

  • LRU (Least Recently Used): This algorithm evicts the data that has not been requested from the cache for the longest period.
  • LFU (Least Frequently Used): This algorithm evicts the data that has been requested the fewest times from the cache.
  • n-way set associative caches: This scheme uses the incoming data's address to calculate a set of "n" locations where the data can go in the cache. If all n locations are occupied, the least recently used one is typically evicted.
  • Direct mapped cache: This is a special case of the n-way Set Associative algorithm, where each block of data can occupy exactly one location in the cache. Whatever is already in the cache at that location is evicted.
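
Two of the policies above can be sketched as follows. This is a minimal illustration: the class names are made up, and mapping an address to a set by taking it modulo the number of sets is an assumption (real caches typically use selected address bits).

```python
from collections import OrderedDict

class LRUCache:
    """Evicts the entry that has gone unrequested for the longest time."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()  # oldest entry first

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used

class SetAssociativeCache:
    """n-way cache: each address maps to one set, each set is a small LRU."""

    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.sets = [LRUCache(ways) for _ in range(num_sets)]

    def _set_for(self, address):
        return self.sets[address % self.num_sets]  # assumed mapping

    def get(self, address):
        return self._set_for(address).get(address)

    def put(self, address, value):
        self._set_for(address).put(address, value)

lru = LRUCache(2)
lru.put(1, "a")
lru.put(2, "b")
lru.get(1)        # touch 1, so 2 becomes the least recently used
lru.put(3, "c")   # evicts 2

direct = SetAssociativeCache(num_sets=4, ways=1)  # ways=1 is direct mapped
direct.put(0, "x")
direct.put(4, "y")  # 4 maps to the same slot as 0 and replaces it
```

Setting ways to 1 gives the direct mapped case, in which whatever occupies the slot is replaced regardless of how recently it was used.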

Major vendors in this space are EMC, NetApp, HP and IBM. Most of the solutions provided by these companies are out of reach for SMEs on account of their cost, but this may change with the entry of Nexenta.

A little history

To fully understand this new entrant to the storage playing field, we must travel a little back in time and look at the history of ZFS.

Back in September 2004, Sun Microsystems announced that it had developed "the last word in file systems." According to Jeff Bonwick, the leader of the team that developed it, ZFS doesn't stand for anything in particular. The team just wanted something that "vaguely suggests a way to store files that gets you a lot of points in Scrabble," so they called it ZFS.

It must be noted, though, that the expansion "Zettabyte File System" has caught on in many circles. According to Sun's press release at the time, ZFS was designed to provide simple administration, provable data integrity, unlimited scalability and blazing performance.

ZFS departed from the traditional volume manager approach. Instead of first organising disks into partitions and volumes using a volume manager like LVM, and then laying a file system on top of the volumes, ZFS introduced a new layer of abstraction called the storage pool. A storage pool aggregates all the available disks into a single logical unit, and file systems are then allocated from the pool. This means that administrators can simply add disks to a storage pool, and file systems automatically consume space as needed.

As you can imagine, this drastically simplifies the work of a storage administrator. Expanding or re-organising storage becomes pretty simple when volumes and partitions do not have to be edited every time the underlying disk configuration changes.

Additionally, ZFS provides a superb performance boost with a few more clever features. For one, it dynamically stripes I/O across all devices in a storage pool. It also makes much of the I/O sequential by following a copy-on-write approach, uses I/O recognition techniques to aggregate and sort I/O, and alters the block size of the I/O operation depending on the workload.

What really sets ZFS apart in the world of storage, though, is its ability to de-duplicate data in-line (added in October 2009) and its use of the Adaptive Replacement Cache (ARC).

ARC

The ARC, or adaptive replacement cache, is a high-performance caching algorithm. ARC differs from the traditional LRU/LFU cache eviction algorithms in that, instead of tracking only how recently or only how frequently blocks have been used, it keeps track of both, as well as a history of recently evicted blocks.
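
The idea can be sketched as below, following the general shape of the published adaptive replacement algorithm. This is a simplified illustration and not the ZFS code: it stores keys only, uses integer arithmetic when adapting the target size, and has none of the ZFS modifications. The class and attribute names are assumptions.

```python
from collections import OrderedDict

class ARCCache:
    """Simplified adaptive replacement cache that tracks keys only."""

    def __init__(self, capacity):
        self.c = capacity
        self.p = 0               # target size of the "recent" list t1
        self.t1 = OrderedDict()  # cached, seen once recently
        self.t2 = OrderedDict()  # cached, seen at least twice
        self.b1 = OrderedDict()  # history of keys evicted from t1
        self.b2 = OrderedDict()  # history of keys evicted from t2

    def _replace(self, key):
        # Evict from t1 or t2 depending on the current target p.
        from_t1 = bool(self.t1) and (
            len(self.t1) > self.p or
            (key in self.b2 and len(self.t1) == self.p))
        if from_t1 or not self.t2:
            old, _ = self.t1.popitem(last=False)
            self.b1[old] = None
        else:
            old, _ = self.t2.popitem(last=False)
            self.b2[old] = None

    def access(self, key):
        """Record an access to key and return True on a cache hit."""
        if key in self.t1:            # second access: promote to t2
            del self.t1[key]
            self.t2[key] = None
            return True
        if key in self.t2:
            self.t2.move_to_end(key)
            return True
        if key in self.b1:            # recently evicted: favour recency
            self.p = min(self.c, self.p + max(len(self.b2) // len(self.b1), 1))
            self._replace(key)
            del self.b1[key]
            self.t2[key] = None
            return False
        if key in self.b2:            # recently evicted: favour frequency
            self.p = max(0, self.p - max(len(self.b1) // len(self.b2), 1))
            self._replace(key)
            del self.b2[key]
            self.t2[key] = None
            return False
        # Key not seen before: make room if needed, then insert into t1.
        l1 = len(self.t1) + len(self.b1)
        if l1 == self.c:
            if len(self.t1) < self.c:
                self.b1.popitem(last=False)
                self._replace(key)
            else:
                self.t1.popitem(last=False)
        else:
            total = l1 + len(self.t2) + len(self.b2)
            if total >= self.c:
                if total == 2 * self.c:
                    self.b2.popitem(last=False)
                self._replace(key)
        self.t1[key] = None
        return False

cache = ARCCache(4)
sequence = [1, 2, 3, 4, 1, 2, 5, 6, 1, 2, 7, 8, 3, 4, 1, 2]
hits = [cache.access(k) for k in sequence]
print(sum(hits), "hits out of", len(sequence))
```

The two history lists are what make the cache adaptive: a hit in b1 suggests that the recency list was too small, a hit in b2 suggests the opposite, and the target p shifts accordingly.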

ARC was originally developed by IBM, and is deployed in its DS6000 and DS8000 storage controllers. In ZFS, ARC has been modified to allow for locked data that cannot be evicted.

In ZFS, the ARC itself lives in RAM, and a fast device such as an SSD can be added as a second-level cache (the L2ARC). Data served from either is returned far more quickly than it would be by going back to the disks to retrieve it. With the adaptive replacement algorithm, the cache hit rate stays high even for varying I/O patterns.

In ZFS, ARC is used to improve read performance. For the write side, ZFS' copy-on-write transactional I/O model comes to the rescue, because in ZFS, no block containing active data is overwritten in-place. A new block is allocated, and the metadata blocks referencing the original are changed to point to the new one. As a result, most I/O in ZFS is sequential, and therefore faster.
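
The copy-on-write idea can be illustrated with a toy block store. This is only a sketch of the principle, not ZFS's on-disk format: the class name CowStore, the append-only list and the single-level pointer table are assumptions for illustration.

```python
class CowStore:
    """Toy copy-on-write store: live blocks are never overwritten in place."""

    def __init__(self):
        self.blocks = []    # append-only block log
        self.pointers = {}  # metadata: name -> index of the live block

    def write(self, name, data):
        self.blocks.append(data)                    # allocate a new block
        self.pointers[name] = len(self.blocks) - 1  # repoint the metadata

    def read(self, name):
        return self.blocks[self.pointers[name]]

cow = CowStore()
cow.write("file", b"version 1")
cow.write("other", b"something else")
cow.write("file", b"version 2")  # an update, yet the write goes to the end

print(cow.read("file"))  # the live data
print(cow.blocks[0])     # the old block is still intact
print(cow.pointers)      # "file" now points at the newest block
```

Because every write, including an update, lands on the next free block, logically random updates turn into sequential writes on the medium.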

Moreover, ZFS uses an intent log to keep track of write operations. In the event of a failure, the intent log is used to recreate the transactions affected by the failure. For data sets where the I/O speed is more important than the data integrity, the intent log can be turned off.

Nexenta

And so we come back to Nexenta. NexentaOS, or the Nexenta Core Platform as it is officially known, is an OS based on OpenSolaris. Its development was initiated by Nexenta Systems Inc and the first release came out in February 2008. It incorporates a kernel that allows proprietary closed-source drivers to be included in an open source operating system.

Nexenta Systems' tag-line is "Enterprise Class Storage for Everyone", and it is not an unfounded claim. Nexenta.org, the home of the Nexenta community, brings together developers as only OSS can -- to provide a free, community edition of the NexentaStor proprietary OS, which is built on top of Nexenta Core.

NexentaStor adds an Ubuntu user-space environment on top of the Solaris core that provides the ZFS filesystem. It is also optimised for virtualised environments, one of the most challenging I/O environments on account of a very random I/O pattern.

Using Nexenta is remarkably simple. The NexentaStor OS is available for download as a CD or as a Virtual Machine image. Once the system is up and running, an administrator simply needs to add disks and then use the very well designed Web UI to configure the storage.

Nexenta allows multiple disks to be aggregated into storage pools, also called z-pools. While aggregating, admins can also specify which RAID level to use. Nexenta's parity RAID options are RAID-Z1, Z2 and Z3, for single, double and triple parity respectively. Nexenta uses checksums and hashes of data to ensure data integrity. Moreover, RAID-Z1, while quite similar to RAID-5, has one advantage over it, in that it converts all writes to full-stripe writes, thereby serialising and speeding up the I/O.
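
To show what single parity buys, here is a generic sketch of XOR parity as used in RAID-5-style schemes, along with a rough usable-capacity estimate. It does not reproduce the actual RAID-Z on-disk layout; the function names are assumptions, and the capacity figure ignores metadata and other overheads.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def usable_capacity(num_disks, disk_tb, parity_disks):
    """Rough usable space: parity consumes `parity_disks` disks' worth."""
    return (num_disks - parity_disks) * disk_tb

# One stripe spread over three data disks, plus one parity block.
data = [b"disk", b"one!", b"two."]
parity = xor_blocks(data)

# Simulate losing the second disk and rebuilding it from the survivors.
recovered = xor_blocks([data[0], data[2], parity])
print(recovered)

# Six hypothetical 2 TB disks with single, double and triple parity.
for parity_disks in (1, 2, 3):
    print(parity_disks, usable_capacity(6, 2, parity_disks), "TB")
```

Double and triple parity rely on additional, more involved parity calculations that are not shown here; the trade-off between protection and usable space is the same.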

Once z-pools are created, administrators can create as many ZFS file systems (or folders, as they are called in Nexenta) as they need on top of these pools. These are, in turn, exposed to the client system using NAS, iSCSI or Fibre Channel.

NexentaStor also allows administrators to hot-add disks (that is, add hardware without a power cycle) to increase capacity as needed. This means that you can get up and running with far fewer disks than you will ultimately need, allowing architects to spread out the cost of storage rather than expecting customers to commit to a large number of disks up front.

Applications

Nexenta truly does bring the power of enterprise storage to the general public. It has quickly found its way into the labs of solutions architects everywhere, particularly those who design large complex systems with lots of I/O. The space savings from in-line de-duplication, and the IOPS gain from the caching and copy-on-write mechanisms, make Nexenta a great addition to the toolbox.

Nexenta could be used in cluster computing, as well as by small enterprises and for cloud computing. However, the biggest use of Nexenta is likely to be in virtualised environments, where several VMs may have much of their data in common, and which have a high IOPS requirement.

In these environments, Nexenta could be installed as a guest alongside the other VMs on the same host, and the storage could be connected to the server through the Nexenta appliance. Given enough RAM on the server, and enough CPU allocated to the Nexenta VM, this design could help speed up even local disks to server-level performance.

All published articles are released under Creative Commons Attribution-NonCommercial 3.0 Unported License, unless otherwise noted.