Hi folks,
Long week for me! Met many, many customers, and tossed many ideas around with people. I was going to blog next on SPC-2 and share some of my observations on that. But now I'm inspired by Open Systems Storage Guy's comment, and more recently, the storageanarchist's challenge on what a usable storage performance benchmark should look like. So I'm writing a series of posts to stimulate some discussion around this topic. Here is the first.
Ground rules:
I know this can lead to a loaded political discussion, and I hope to steer clear of that. I'd rather focus on what can and cannot be done from a technology perspective. My hope is to develop, through discussion and contribution, a framework for what such a benchmark should measure, how one might construct it, and how results should be normalized and interpreted.
I have no illusions about knowing the answers, or even knowing if there is an answer! But it's about time we had an open discussion on this. And I don't just mean between vendors, but with end-users too. I'll throw some open-ended ideas out, and let's see what develops.
What is a storage system?
I'll start with a truism - modern storage systems are not just a bunch of disks. Many modern storage arrays have processing power, memory and algorithmic sophistication that rival or exceed the hosts connected to them. They support not one, but sometimes thousands of hosts running a mixture of almost every operating system known.
These array resources have enabled a host of software functions in the arrays themselves - around local and remote replication, workload optimization, data movement, fault tolerance and the like.
For the purposes of this discussion, I'd like to propose that we focus on storage arrays, devices that satisfy the following properties:
- The device is accessible by the IO channels of the hosts attached to it
- The data in the device is accessible online for random access and sequential IO by the host requesting it
- The media storing the data is non-volatile and not "removable"
- The device has the ability to provide hardware redundancy for the media
- The device has the ability to attach multiple operating system images to it.
- The device has the ability to abstract the physical medium and present a logical storage device to the operating system, which appears as a physical device to the host.
So tape is out, as are Zip drives and CD/DVD/optical media. Dual-ported SSA or SCSI drives are out too, as they can't do the abstraction thing. Virtualization engines like the IBM SVC and EMC Invista qualify. When FSSD becomes reality, that would fit too. Network attached storage is very interesting, but it has very different characteristics for attachment to a host, so for this discussion I'd like to stick to block protocols.
I think this should cover most controller-based arrays today - but I am open to any suggestions to tighten this up. The basic architecture for these arrays is usually a front-end, cache/processing units, and a back-end. The front-end provides host connectivity (SCSI, ESCON, Fibre Channel, FICON, etc.). The cache/processing unit attempts to ensure that as many IO requests as possible are satisfied from cache. The back-end has direct connectivity to the media, usually SCSI or Fibre Channel disk drives (but possibly other technologies).
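To make that three-stage picture concrete, here is a deliberately crude sketch in Python (my own toy model, not any vendor's design; the latency constants are assumed round numbers) of an IO passing through front-end, cache and back-end:

```python
class ToyArray:
    """Skeletal front-end / cache / back-end model of a storage array."""

    FRONT_END_US = 50      # assumed port + protocol overhead per IO (microseconds)
    CACHE_US     = 200     # assumed cache-hit service time
    BACK_END_US  = 8000    # assumed disk service time on a miss

    def __init__(self):
        self.cache = set()  # set of cached block numbers

    def service(self, op, block):
        """Return the service time in microseconds for one IO."""
        latency = self.FRONT_END_US
        if op == "write":
            self.cache.add(block)          # write absorbed by cache
            latency += self.CACHE_US
        elif block in self.cache:
            latency += self.CACHE_US       # read hit
        else:
            latency += self.BACK_END_US    # read miss: back-end IO needed
            self.cache.add(block)
        return latency

array = ToyArray()
print(array.service("read", 42))   # cold read: miss, back-end speed
print(array.service("read", 42))   # warm read: hit, memory speed
```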
Anatomy of IO requests
What should a benchmark tell us? More importantly, is one benchmark enough?
Naive answer - the standard metrics: IOPS, MB/s, response time for a given configuration. Too naive - needs qualification.
Let's look at host IO requests, starting with a single-threaded IO driver. Workloads are characterized as reads or writes, random or sequential, with different IO sizes. The response of a storage system depends on the nature of the IO request.
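For illustration, here is a minimal sketch of such a single-threaded IO driver in Python. Everything in it - the class name, parameters and defaults - is my own invention for this post, not any standard tool:

```python
import random
from dataclasses import dataclass

@dataclass
class IORequest:
    op: str        # "read" or "write"
    offset: int    # byte offset into the logical device
    size: int      # IO size in bytes

def single_threaded_driver(device_size, io_size, read_pct, sequential, count, seed=0):
    """Generate a stream of IO requests for one thread.

    read_pct   -- fraction of requests that are reads (0.0 .. 1.0)
    sequential -- if True, offsets advance linearly; otherwise random
    """
    rng = random.Random(seed)
    offset = 0
    for _ in range(count):
        op = "read" if rng.random() < read_pct else "write"
        if sequential:
            offset = (offset + io_size) % device_size
        else:
            offset = rng.randrange(0, device_size // io_size) * io_size
        yield IORequest(op, offset, io_size)

# Example: 70% reads, random access, 8 KB IOs against a 100 GB logical device
for req in single_threaded_driver(100 * 2**30, 8192, 0.7, sequential=False, count=5):
    print(req)
```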
Writes
In most storage arrays, writes are written to and acknowledged by the cache memory in the array. There are exceptions - if the array gets overwhelmed, it may pause to actually do the backend write, which manifests itself as delayed acknowledgment of the IO to the host. It may also bypass cache (in some systems) if a single component failure puts data integrity at risk, and commit the IO directly to the media. When the array is overwhelmed, the backend's ability to drain writes starts to become the limiting factor in throughput.
When an array gets overwhelmed, and how it reacts when it is, is obviously of much interest and often defines the upper bounds of some of these metrics. Its failure modes are also of consequence in choosing a system. But in the common case, writes are "hits" in cache - serviced at memory speeds.
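As a hedged illustration of that behavior, the toy model below (a gross simplification of mine, not any vendor's destage algorithm, with assumed latency numbers) acknowledges writes at memory speed until dirty data outruns the backend drain rate, at which point write latency collapses to backend speed:

```python
CACHE_LATENCY_MS = 0.2     # assumed write-hit service time
BACKEND_LATENCY_MS = 8.0   # assumed time to destage one write to disk

def simulate_writes(n_writes, dirty_limit, drains_per_write):
    """Return per-write latency for a toy write-back cache.

    dirty_limit      -- max writes the cache will hold before forcing destage
    drains_per_write -- how many cached writes the backend destages for every
                        new host write (its relative drain rate)
    """
    dirty = 0.0
    latencies = []
    for _ in range(n_writes):
        dirty = max(0.0, dirty - drains_per_write)  # backend drains in background
        if dirty < dirty_limit:
            latencies.append(CACHE_LATENCY_MS)       # write hit: ack from cache
        else:
            latencies.append(BACKEND_LATENCY_MS)     # cache full: wait for destage
        dirty += 1
    return latencies

# With a drain rate below 1.0 the cache eventually fills and latency jumps
lat = simulate_writes(n_writes=10000, dirty_limit=1000, drains_per_write=0.8)
print("first write: %.1f ms, last write: %.1f ms" % (lat[0], lat[-1]))
```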
Reads
Reads may or may not need a backend IO, depending on whether the data requested is already in cache.
The read IO request may be in cache, in which case the response is at memory speeds. Many arrays rely on a couple of factors to enhance performance. First, they sense sequential IO request streams early and preemptively prefetch subsequent data into cache. Second, they rely on the "principle of locality" - if a piece of data is accessed, chances are it will be used again soon, or data "near" it will be used. (Think database transactions - insert a row, update it a few times, commit the transaction. The same block is used again and again.) So large-cache systems retain recently used data blocks in memory for as long as possible, maximizing the probability that a read request will be a cache hit, or "read hit".
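A minimal sketch of both effects, assuming a plain LRU cache and a naive "two adjacent blocks means sequential" prefetch trigger - both assumptions of mine, far cruder than what real arrays actually do:

```python
from collections import OrderedDict

class ToyReadCache:
    """LRU read cache with a naive sequential prefetch heuristic."""

    def __init__(self, capacity_blocks, prefetch_depth=4):
        self.capacity = capacity_blocks
        self.prefetch_depth = prefetch_depth
        self.cache = OrderedDict()   # block number -> True, kept in LRU order
        self.last_block = None
        self.hits = 0
        self.misses = 0

    def _insert(self, block):
        self.cache[block] = True
        self.cache.move_to_end(block)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict least recently used

    def read(self, block):
        if block in self.cache:
            self.hits += 1                   # read hit: memory speed
            self.cache.move_to_end(block)
        else:
            self.misses += 1                 # read miss: backend IO
            self._insert(block)
        # crude sequential detection: two consecutive blocks trigger prefetch
        if self.last_block is not None and block == self.last_block + 1:
            for b in range(block + 1, block + 1 + self.prefetch_depth):
                self._insert(b)
        self.last_block = block

cache = ToyReadCache(capacity_blocks=1000)
for b in range(5000):                        # sequential scan: mostly hits after warm-up
    cache.read(b)
print("hit ratio: %.2f" % (cache.hits / (cache.hits + cache.misses)))
```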
If the read is not in cache, a backend IO is needed to bring it into the cache. This is the most "expensive" operation for an array, as the backend is the slowest, usually mechanical, part of the configuration. This is referred to as a "read miss". Much work has been done in minimizing the time required to retrieve data from media: striping, aggressive prefetching, making use of multiple copies of data, smaller form-factor drives, faster rotation of platters and the like have all been used extensively to improve this.
Workloads where the read hits dominate are referred to as "cache friendly" and those where read misses dominate are called "cache-hostile". In the absence of any other component limiting throughput (like cache or processing power), cache hostile workloads are limited by backend media throughput.
What should we ask for?
The request from the host is a random or sequential read or write of a specific size.
The response of a storage array comes in many forms: write hits, write misses, read hits and read misses; for sequential and random access patterns; for different IO sizes.
Each of these responses is dependent on the architecture, electronic components and intellectual property of the specific vendors.
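One way to picture what a benchmark would have to report is a result matrix keyed by exactly those dimensions. A sketch, with field names and sample latencies that are purely illustrative:

```python
from collections import defaultdict
from statistics import mean

# (operation, access pattern, IO size, cache outcome) -> list of latencies in ms
results = defaultdict(list)

def record(op, pattern, io_size, outcome, latency_ms):
    results[(op, pattern, io_size, outcome)].append(latency_ms)

# Hypothetical samples, just to show the shape of the matrix
record("read", "random", 8192, "hit", 0.3)
record("read", "random", 8192, "miss", 7.5)
record("write", "sequential", 65536, "hit", 0.4)

for key, samples in sorted(results.items()):
    op, pattern, size, outcome = key
    print("%-5s %-10s %6dB %-4s  avg %.1f ms over %d IOs"
          % (op, pattern, size, outcome, mean(samples), len(samples)))
```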
What makes realistic benchmarks hard to construct is that real workloads have (see the sketch after this list):
- Threads with dynamic composites of these IO components
- Many concurrent threads from each host
- Many hosts sharing resources, including spindles, in the same array
- Complex array based functions in operation (such as local and remote replication) while servicing IO requests
- Different relative priorities of workloads from different hosts
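Here is a rough sketch of what such a composite might look like as a workload description - the host names, thread counts and mixes are hypothetical, just to show the shape of the problem:

```python
import random
from collections import Counter

# Hypothetical per-host thread profiles; real workloads would also vary over time.
HOSTS = {
    "oltp-db":    [dict(read=0.70, seq=0.1, size=8192,   threads=32),
                   dict(read=0.00, seq=1.0, size=65536,  threads=2)],   # log writer
    "dss-db":     [dict(read=0.95, seq=0.8, size=262144, threads=8)],
    "fileserver": [dict(read=0.60, seq=0.4, size=32768,  threads=16)],
}

def draw_io(rng, profile):
    """Draw one IO (op, pattern, size) from a thread's composite mix."""
    op = "read" if rng.random() < profile["read"] else "write"
    pattern = "sequential" if rng.random() < profile["seq"] else "random"
    return op, pattern, profile["size"]

rng = random.Random(42)
mix = Counter()
for host, profiles in HOSTS.items():
    for profile in profiles:
        for _ in range(profile["threads"] * 100):   # 100 IOs per thread
            op, pattern, _size = draw_io(rng, profile)
            mix[(host, op, pattern)] += 1

for key, count in sorted(mix.items()):
    print(key, count)
```

A real driver would issue these streams concurrently, maintain a queue depth per thread, and shift the mixes over time; the point here is only that the composite is dynamic and multi-dimensional.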
A good set of benchmarks must be able to distinguish different vendor configurations with respect to their architecture, components and intellectual property investment. Such a benchmark must also make it possible to normalize different vendor results - i.e., to be useful, physical configurations must be comparable for results to be comparable between vendors. Standard IO drivers and workload simulators must be constructed and agreed upon, and the drivers should be able to generate enough workload to run the arrays to saturation.
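As one crude illustration of what "normalizing" might mean (the metric choices here are mine, not a proposal), divide the headline numbers by the physical resources that produced them, so results read per spindle, per GB of cache and per port rather than as raw totals:

```python
def normalize(result):
    """Express raw benchmark output relative to the tested configuration."""
    return {
        "iops_per_spindle":  result["iops"] / result["spindles"],
        "iops_per_gb_cache": result["iops"] / result["cache_gb"],
        "mbps_per_fe_port":  result["mb_per_s"] / result["front_end_ports"],
        "resp_time_ms":      result["resp_time_ms"],   # latency is reported as-is
    }

# Hypothetical results from two vendors' tested configurations
vendor_a = dict(iops=120000, mb_per_s=2400, resp_time_ms=4.1,
                spindles=480, cache_gb=128, front_end_ports=16)
vendor_b = dict(iops=90000,  mb_per_s=2100, resp_time_ms=3.2,
                spindles=240, cache_gb=256, front_end_ports=8)

for name, raw in [("A", vendor_a), ("B", vendor_b)]:
    print(name, normalize(raw))
```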
For example, if my system can sustain a certain response time under a specific workload, what if I was also doing a backup to disk on the same array? What is the performance degradation? What is the hit on throughput with synchronous replication active? How well does the array handle spikes in workload? What is the performance hit if I take a point-in-time copy of a large system? These are questions my customers ask me about performance. I answer them on a case by case basis for the DMX. Good benchmarks should be able to do this across vendors.
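In benchmark terms, those questions amount to running the same workload with and without the array function active, and reporting the delta. A sketch of the bookkeeping, with made-up scenario names and numbers:

```python
def degradation(baseline, with_feature):
    """Percentage change in IOPS and response time when a feature is active."""
    return {
        "iops_drop_pct":
            100.0 * (baseline["iops"] - with_feature["iops"]) / baseline["iops"],
        "resp_time_increase_pct":
            100.0 * (with_feature["resp_time_ms"] - baseline["resp_time_ms"])
            / baseline["resp_time_ms"],
    }

# Hypothetical measurements of the same workload under different array activity
baseline = dict(iops=100000, resp_time_ms=5.0)
scenarios = {
    "disk-to-disk backup running":  dict(iops=82000, resp_time_ms=7.4),
    "synchronous replication on":   dict(iops=74000, resp_time_ms=9.1),
    "point-in-time copy in flight": dict(iops=91000, resp_time_ms=5.9),
}

for name, measured in scenarios.items():
    print(name, degradation(baseline, measured))
```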
My criticism of the SPC benchmark, as it stands today, is that it has adopted the posture of a static, cache-hostile, single-host composite which, by construction, tests primarily back-end components and is virtually agnostic to the architecture and intellectual property contributions of the vendors. For modern systems, it can only weakly distinguish between high-end and mid-range architectures; in both cases, it effectively counts the number of spindles in the tested systems. This has made SPC benchmarking a marketing circus for the vendors who participate.
Granted, construction of a multi-host benchmark with other array functions active is difficult before a single host benchmark is acceptable. I have ideas on how to improve this, but before I go there, I would love to hear any ideas on this.
Cheers,
.Connector.
A couple of things you should probably consider from the start.
First is the notion of the workload test "size." Given that there are rather significant differences in the number of ports, hosts, LUNs, GB of RAM, and drives that various arrays can support, I think you'll have to plan on having more than one "test size." Maybe two or three standard class sizes (small, medium, large), each with a specific MAXIMUM number of ports/hosts/RAM, etc. And maybe a Ginormous, no-holds-barred max configuration class.
Second, regarding your restricting this to block protocols only: clearly, that would include iSCSI (and other block-over-network protocols). But you should also consider that NFS/CIFS is a characteristically unique protocol that should be included in the test domain. That is, a Celerra can front-end a Symmetrix, and NetApp supports a similar configuration. Running standard database or file applications over CIFS/NFS undoubtedly puts a different strain/stress on a storage array than does running those same applications directly over block FC. Thus, when you get around to discussing the workloads you want to model, I think you should include NFS/CIFS-based front-end servers as one of the host types.
Looking forward to the discussion...
Posted by: the storage anarchist | November 04, 2007 at 06:12 AM
Connector,
Thanks for taking the initiative to write up your thoughts on this. I see your proposed effort as having real potential, and the notion of extending the discourse to customers is fundamental to that. For our part, the Wikibon community is committed to working with folks like you to advance the metrics and standards by which customers can obtain useful decision-making tools. Our users are asking for this, and we'll commit to recruiting them into the process, hosting meetups/telebriefings on the topic, moderating discussions, writing about this effort, making the press aware of it so more users will be informed, and lending a helping hand wherever it makes sense.
Posted by: David Vellante | November 04, 2007 at 06:52 AM
Hey Barry,
Both valid points.
Having multiple "scales" for tests makes a lot of sense. This helps the smaller guys from being crushed by the highly scalable large systems, so having the McDonalds model of small, medium, large and Biggie should help delineate the tested systems.
On the NFS/CIFS attach - I struggled with whether or not to include it, for exactly the reasons you stated. The network attach means that two distinct pieces have to be specified - the access to the storage, and the access to the network - which would increase the complexity of such a benchmark. My inclination was to have a separate set of metrics for that, as there is significant intertwining with some other standards like SPECnfs, and variants like network topology and switching gear.
But, I agree, NFS/CIFS have a definite place alongside iSCSI in a comprehensive benchmark. We should discuss if a common framework works for both.
Cheers,
dotConnector.
Posted by: dotConnector | November 04, 2007 at 08:33 AM
Hi Dave,
Thank you very much for your support and offer to assist. As you did, BarryB also pointed out in his recent post (http://storageararchist.typepad.com) that the real question is whether such testing and certification belong in the end-user domain, as opposed to industry consortiums. We, as vendors, should be doing what the customer feels they need.
Building out a grass-roots, user-driven, externalized framework for performance measurement, with Web 2.0 collaborative technologies like blogs and wikis, is, IMHO, the lasting approach. And even though some of it may succumb to the tyranny of the majority, it also offers a distinct voice to everyone, and hopefully the convergence will be rapid and meaningful.
In my day job, I have the privilege of working with hundreds of customers at many scales, and I can collect a lot of good ideas by word of mouth. I will pass the word around and get them to participate. I'd rather have this be a customer-driven vendor-facilitated effort than a vendor-driven effort.
I find efforts like the Wikibon gratifying for that reason, and will definitely collaborate in making that successful as well.
Thanks,
Cheers, dotConnector.
Posted by: dotConnector | November 04, 2007 at 09:02 AM
I'd certainly be happy to get involved. The undertaking is not small, but the end result could be a wonderful tool for end users and vendors alike.
Specific tests should be included for today's feature-rich products - snapshot, replication, data migration - and their impact on the 'background' processing, along with running nightly backup/restore processing, etc.
I could imagine something that has a base 'background' workload and then a series of 'delta'-like benchmarks to run on top.
I like the idea.
Posted by: Barry Whyte | November 05, 2007 at 07:47 AM
BarryW,
Wonderful! Let's put aside our competitive postures for a while, and I believe something very useful for our customers could emerge.
Thanks for the support!
Cheers, K.
Posted by: dotConnector | November 05, 2007 at 08:27 AM
The complexities of benchmarking mean that the best approach is probably a suite of benchmarks broken into several classes. The classes should be tackled easiest first - for example, why worry about different front-end host types if you can't recreate a valid back-end test?
The first would be storage vendor centric (i.e. a test that is cache hostile like SPC would be one, cache friendly would be another, etc).
The second class would be usage model driven. Ideally we would recruit the application vendors to provide some of these components.
The third would cover special features such as snapshots or different front-end host types.
Our friends in the consumer PC space have been using suites for years - admittedly their problems are simpler.
Example of hard drive suite: http://www.tomshardware.com/2007/11/05/the_terabyte_battle/
From this simple test series, I can see what different products are optimized for (bandwidth, access time, or power consumption), and how those trade-offs play out in different scenarios (database simulation, Windows startup). Benchmark users can gain useful information about what to expect and what the trade-offs are, which is hard to do with SPC.
Posted by: TechEnki | November 05, 2007 at 08:53 AM
TechEnki,
Excellent post! I like the idea of layered benchmarks (akin to the baseline+delta idea from BarryW).
Between your suggestions and the others (BarryB and BarryW), I believe there is a seed for a substantial, solid approach to develop.
Let me summarize your contributions, and put forth a consolidated approach in my next blog post, and we can build on that.
Thanks for your support!
Cheers, dotConnector.
Posted by: dotConnector | November 05, 2007 at 09:11 AM
This is great :) I hope that by putting competitive doublespeak aside for long enough to get an agreed upon benchmark, a real service will have been done for the companies who use storage products.
I have a few things I'd like to start with. First, and most importantly, I believe that for this benchmark to have any real meaning, any result must be agreed upon by at least two parties, at least one of which must be neutral or a competitor.
Second, I think the definition of a storage array should simply state that it must either allow access to its storage using SCSI disk LUNs, or be compatible with most standard email and database packages. These requirements would simplify the definition and avoid the whole question of NFS and iSCSI.
Thirdly, why don't we start a Wikipedia-style wiki? It's the easiest way to allow collaboration, and it also has tools allowing us to police for vandals and enforce a neutral point of view.
I'll volunteer to get the thing off the ground, but if anyone's interested in helping, please email me at opensystemsguy@gmail.com. I can provide the hosting and bandwidth; however, I could use a hand with the wiki software - I've never used it before.
Posted by: open systems storage guy | November 05, 2007 at 09:28 AM
Good discussion.
I agree with the folks who want different "sizes" of tests. Based on experience, the scaling factors are relevant.
I also like the idea of a skilled practitioner being able to run the test themselves. Results would not be as comparable, but I believe it would ultimately generate more value for the industry and customers.
The challenge will be in coming up with a "standardized" multi-threaded workload.
I think one of the central arguments here is that "storage arrays" have to be able to handle multiple flavors of workload simultaneously, and cope with rapid changes.
Similarly, there would ideally be a notion of "standardized" failure scenarios, e.g. drive bad, link fails, cache card fails, etc.
Perhaps we could combine the thoughts as follows:
"Small" workloads do not have much variability in either performance characteristics, or envisioned failure characteristics.
"Medium" has a bit more. And "large" has a very wide range of dynamic behavior in terms of workload and envisioned failure modes.
Man, I hope this goes somewhere. Having to respond to the umpteenth person as to why SPCs are irrelevant (and actually harmful) is getting tiresome.
Posted by: Chuck Hollis | November 05, 2007 at 10:10 AM
Chuck, OSSG,
Thanks for the input. Scale is obviously something we have to consider in this effort.
The idea of workload generation is thorny - and I agree will be very challenging. But I have some ideas that I can throw out in my next post.
A wiki for this would be very useful - in fact, Dave Vellante has his Wikibon community and has already indicated that he would try to help as much as he can. Dave, I don't want to put you on the spot, but is hosting a wiki under Wikibon viable?
Cheers, Kartik.
Posted by: dotConnector | November 05, 2007 at 10:28 AM
OK, so Mr Burke has commented that nobody wants to join in - where are you then, Dr... I'm ready, willing and happy to join the merry band, but it's going to take more than just us - maybe you guys need to join back with the SPC and together we can create an SPC-4 (EMC-happy) benchmark...
Posted by: Barry Whyte | November 11, 2008 at 06:52 PM