Long week for me! Met many, many customers, and tossed many ideas around with people. I was going to blog next on the SPC-2 and share some of my observations on that. But now, I'm inspired by Open Systems Storage Guy's comment, and more recently, the storageanarchist's challenge on what a usable storage performance benchmark should look like. So I'm writing a series of posts to stimulate some discussion around this topic. Here is the first.
I know this can lead to a loaded political discussion, and I hope to stay clear of that. I'd rather focus on what can and cannot be done from a technology perspective. My hope is to develop, through discussion and contribution, a framework on what such a benchmark should measure, and how one might construct it, and how results should be normalized and interpreted.
I have no illusions about knowing the answers, or even knowing if there is an answer! But it's about time we had an open discussion on this. And I don't just mean between vendors, but end-users too. I'll throw some open-ended ideas out, and let's see what develops.
What is a storage system?
I'll start with a truism - modern storage systems are not a bunch of disks. Many modern storage arrays have processing power, memory and algorithmic sophistication that rival or exceed those of the hosts connected to them. They support not one, but sometimes thousands of hosts with a mixture of almost every operating system known.
These array resources have enabled a host of software functions in the arrays themselves - around local and remote replication, workload optimization, data movement, fault tolerance and the like.
For the purposes of this discussion, I'd like to propose that we focus on storage arrays, devices that satisfy the following properties:
- The device is accessible by the IO channels of the hosts attached to it
- The data in the device is accessible online for random access and sequential IO by the host requesting it
- The media storing the data is non-volatile and not "removable"
- The device has the ability to provide hardware redundancy for the media
- The device has the ability to attach multiple operating system images to it.
- The device has the ability to abstract the physical medium and present a logical storage device to the operating system, which appears as a physical device to the host.
So tape is out, as are Zip drives and CD/DVD/optical type media. Dual-ported SSA or SCSI drives are out too, as they can't do the abstraction thing. Virtualization engines like the IBM SVC and EMC Invista qualify. When FSSD becomes reality, that would fit too. Network attached storage is very interesting, but has very different characteristics for attachment to a host, so for this discussion, I'd like to stick to block protocols.
I think this should cover most controller-based arrays today - but I am open to any suggestions to tighten this up. The basic architecture for these arrays is usually a front-end, cache/processing units, and a back-end. The front-end has host connectivity (SCSI, ESCON, Fibre Channel, FICON etc.). The cache/processing unit attempts to ensure that as many IO requests as possible are satisfied from cache. The back-end has direct connectivity to the media, usually SCSI or Fibre Channel disk drives (but possibly other technologies).
Anatomy of IO requests
What should a benchmark tell us? More importantly, is one benchmark enough?
Naive answer - the standard metrics: IOPS, MB/s, response time for a given configuration. Too naive - needs qualification.
Let's look at host IO requests, starting with a single-threaded IO driver. Workloads are characterized as reads or writes, random or sequential, with different IO sizes. The response of a storage system is dependent on the nature of the IO request.
In most storage arrays, writes are written to and acknowledged by the cache memory in the array. There are exceptions - if the array gets overwhelmed, it may pause to actually do the backend write. This manifests itself as delayed acknowledgment of the IO to the host. In some systems, the array may also bypass cache and commit the IO directly to the media if a single component failure puts data integrity at risk. When the array is overwhelmed, the backend's ability to drain writes starts to become the limiting factor in throughput.
When an array gets overwhelmed, and how it reacts when that happens, are obviously of much interest and often define upper bounds to some of these metrics. Its failure modes are also of consequence in choosing a system. In normal operation, though, writes are usually "hits" in cache - at memory speeds.
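The write behaviour described above can be sketched with a toy model. This is only an illustration, not how any particular array works: the function name `simulate_write_cache`, the per-tick time steps, and the parameters `cache_slots` and `drain_per_tick` are all assumptions made for the example.

```python
def simulate_write_cache(arrivals, cache_slots, drain_per_tick):
    """Toy write-back cache model (illustrative assumptions only).

    arrivals       -- list with the number of host writes arriving in each tick
    cache_slots    -- how many un-destaged writes the cache can hold
    drain_per_tick -- how many writes the backend can destage per tick

    Returns a list of (admitted, backlog) tuples, one per tick, where
    `admitted` is the number of writes acknowledged from cache in that tick
    and `backlog` is the number of host writes still waiting for an ack.
    """
    pending = 0   # writes held in cache, not yet destaged to media
    backlog = 0   # host writes whose acknowledgment is being delayed
    history = []
    for incoming in arrivals:
        # The backend destages first, freeing cache slots.
        pending = max(0, pending - drain_per_tick)
        waiting = backlog + incoming
        # Only as many writes as there are free slots can be acknowledged.
        admitted = min(waiting, cache_slots - pending)
        pending += admitted
        backlog = waiting - admitted
        history.append((admitted, backlog))
    return history


if __name__ == "__main__":
    # Hypothetical numbers: 12 writes arrive per tick, the backend drains 8.
    for tick, (admitted, backlog) in enumerate(
        simulate_write_cache([12] * 10, cache_slots=20, drain_per_tick=8), 1
    ):
        print(f"tick {tick:2d}: admitted={admitted:2d} backlog={backlog:2d}")
```

With these made-up numbers, every write is acknowledged at cache speed for the first three ticks. From the fourth tick on, the cache is full, only 8 writes per tick are admitted (the backend drain rate), and the backlog of delayed acknowledgments grows by 4 each tick.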
Reads may or may not need a backend IO, depending on whether the IO requested is already in cache or not.
The data requested by a read may be in cache, in which case the response is at memory speeds. Many arrays rely on a couple of factors for enhancing performance. First, they sense sequential IO request streams early and preemptively prefetch subsequent data and place it in cache. Secondly, they often rely on the "principle of locality" - that is, if a piece of data is accessed, chances are it will be used again soon, or data "near" it will be used (think database transactions - insert a row, update it a few times, commit the transaction. The same block is used again and again). So large cache systems will retain recently used data blocks in cache for as long as possible, maximizing the probability that a read request will be a cache hit or a "read hit".
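To make the locality argument concrete, here is a minimal sketch of a least-recently-used (LRU) read cache and the hit ratio it achieves on a stream of block requests. LRU is an assumption chosen for simplicity; real arrays use their own, more elaborate replacement and prefetch algorithms, and prefetching is not modelled here. The name `lru_hit_ratio` and the workload parameters are hypothetical.

```python
import random
from collections import OrderedDict


def lru_hit_ratio(block_requests, cache_blocks):
    """Replay read requests against an LRU cache and return the hit ratio."""
    cache = OrderedDict()   # keys are block numbers, oldest entry first
    hits = 0
    total = 0
    for block in block_requests:
        total += 1
        if block in cache:
            hits += 1
            cache.move_to_end(block)        # mark as most recently used
        else:
            cache[block] = True             # read miss: stage block into cache
            if len(cache) > cache_blocks:
                cache.popitem(last=False)   # evict the least recently used
    return hits / total if total else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical workloads over 100,000 blocks with a 1,000-block cache.
    # "Local": 90% of reads go to a hot set of 100 blocks.
    local = [
        rng.randrange(100) if rng.random() < 0.9 else rng.randrange(100_000)
        for _ in range(20_000)
    ]
    # "Uniform": every block is equally likely.
    uniform = [rng.randrange(100_000) for _ in range(20_000)]
    print("local   hit ratio:", round(lru_hit_ratio(local, 1_000), 3))
    print("uniform hit ratio:", round(lru_hit_ratio(uniform, 1_000), 3))
```

The workload with a small hot set should see most reads served from cache, while the uniform workload sees very few hits because the cache holds only about 1% of the address space.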
If the read is not in cache, backend IO is needed to bring it into the cache. This is the most "expensive" operation for an array as the backend is the slowest, usually mechanical, part of the configuration. This is referred to as a "read miss". Much work has been done in minimizing the time required to retrieve an IO request from media. Striping, aggressive prefetching, making use of multiple copies of data, smaller form factor drives, faster rotation of platters and the like have been used extensively to improve this.
Workloads where read hits dominate are referred to as "cache-friendly" and those where read misses dominate are called "cache-hostile". In the absence of any other component limiting throughput (like cache or processing power), cache-hostile workloads are limited by backend media throughput.
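A back-of-the-envelope way to see why the hit ratio matters so much: average read response time is a weighted mix of hit and miss times, and if every read miss costs one backend IO, the backend caps host read throughput at its own IOPS divided by the miss ratio. The sketch below encodes just that arithmetic. The function names and all the numbers (0.5 ms hit, 8 ms miss, 150 IOPS per drive) are illustrative assumptions, not measurements of any product.

```python
def average_read_time_ms(hit_ratio, hit_ms, miss_ms):
    """Weighted average of cache-hit and cache-miss service times."""
    return hit_ratio * hit_ms + (1.0 - hit_ratio) * miss_ms


def backend_bound_read_iops(hit_ratio, drives, iops_per_drive):
    """Upper bound on host read IOPS when only the backend is limiting.

    Assumes each read miss costs exactly one backend IO and that misses
    spread evenly over all drives (both are simplifications).
    """
    miss_ratio = 1.0 - hit_ratio
    backend_iops = drives * iops_per_drive
    if miss_ratio == 0:
        return float("inf")   # nothing reaches the backend in this model
    return backend_iops / miss_ratio


if __name__ == "__main__":
    for hit_ratio in (0.0, 0.5, 0.9):
        rt = average_read_time_ms(hit_ratio, hit_ms=0.5, miss_ms=8.0)
        cap = backend_bound_read_iops(hit_ratio, drives=100, iops_per_drive=150)
        print(f"hit ratio {hit_ratio:.1f}: avg read {rt:.2f} ms, cap {cap:,.0f} IOPS")
```

At a hit ratio of zero, the bound is simply the number of drives times the per-drive rate, so a fully cache-hostile workload scales with drive count and little else.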
What should we ask for?
The request from host is a random or sequential read/write of a specific size.
The response of a storage array comes in many forms. Write hits, write misses, read hits and read misses. For sequential and random access patterns. For different IO sizes.
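One way to picture how quickly these combinations multiply is to enumerate them. The sketch below simply builds the cross product of operation, access pattern, cache outcome and IO size. The field names and the list of IO sizes are assumptions for illustration; a real benchmark would choose its own dimensions.

```python
from itertools import product

OPERATIONS = ("read", "write")
PATTERNS = ("random", "sequential")
CACHE_OUTCOMES = ("hit", "miss")


def workload_matrix(io_sizes_kib):
    """Return one dict per (operation, pattern, cache outcome, IO size) cell."""
    return [
        {"op": op, "pattern": pattern, "cache": outcome, "io_size_kib": size}
        for op, pattern, outcome, size in product(
            OPERATIONS, PATTERNS, CACHE_OUTCOMES, io_sizes_kib
        )
    ]


if __name__ == "__main__":
    cells = workload_matrix([4, 8, 64, 256])   # hypothetical IO sizes in KiB
    print(len(cells), "basic IO components to characterize")
    for cell in cells[:4]:
        print(cell)
```

Even with only four IO sizes, this yields 32 basic components for a single thread, before any mixing, concurrency or array functions are considered.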
Each of these responses is dependent on the architecture, electronic components and intellectual property of the specific vendors.
What makes realistic benchmarks hard to construct is that real workloads have
- Threads with dynamic composites of these IO components
- Many concurrent threads from each host
- Many hosts sharing resources, including spindles, in the same array
- Complex array based functions in operation (such as local and remote replication) while servicing IO requests
- Different relative priorities of workloads from different hosts
A good set of benchmarks must be able to distinguish different vendor configurations with respect to their architecture, components and intellectual property investment. The ability to normalize different vendor results must be available for such a benchmark - i.e., to be useful, physical configurations must be comparable for results to be comparable between vendors. Standard IO drivers and workload simulators must be constructed and agreed upon. The drivers should be able to generate enough workload to run the arrays to saturation.
For example, if my system can sustain a certain response time under a specific workload, what if I was also doing a backup to disk on the same array? What is the performance degradation? What is the hit on throughput with synchronous replication active? How well does the array handle spikes in workload? What is the performance hit if I take a point-in-time copy of a large system? These are questions my customers ask me about performance. I answer them on a case by case basis for the DMX. Good benchmarks should be able to do this across vendors.
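Questions like these boil down to comparing a baseline run against the same workload with an array function active. A minimal sketch of how such results might be reported is below. The function names, the scenario labels and every number are hypothetical placeholders, not measurements from any array.

```python
def degradation_pct(baseline, loaded):
    """Percentage drop of `loaded` relative to `baseline` (e.g. IOPS)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - loaded) / baseline


def degradation_report(baseline_iops, scenario_iops):
    """Map each scenario name to its throughput degradation in percent."""
    return {
        name: degradation_pct(baseline_iops, iops)
        for name, iops in scenario_iops.items()
    }


if __name__ == "__main__":
    # Made-up throughput figures purely to show the shape of the output.
    report = degradation_report(
        10_000,
        {
            "backup to disk running": 8_000,
            "synchronous replication active": 7_000,
            "point-in-time copy in progress": 9_000,
        },
    )
    for name, pct in report.items():
        print(f"{name}: {pct:.1f}% lower throughput than baseline")
```

The same comparison could be made on response time instead of throughput; the point is that a cross-vendor benchmark would need to fix both the baseline workload and the set of active functions for such numbers to be comparable.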
My criticism of the SPC benchmark, as it stands today, is that the posture it has adopted is that of a static, cache-hostile, single-host composite which by construction tests primarily backend components and is virtually agnostic to the architecture and intellectual property contributions of the vendors. For modern systems, it can only weakly distinguish between high-end and mid-range architectures. In both cases, it effectively counts the number of spindles in the tested systems. This has made SPC benchmarking a marketing circus for the vendors who participate.
Granted, construction of a multi-host benchmark with other array functions active is difficult before a single host benchmark is acceptable. I have ideas on how to improve this, but before I go there, I would love to hear any ideas on this.