November 13, 2008

Should end-users do the benchmarking?

OSSG shared some interesting ideas about who does the benchmarking, and how to manage these results. The core question here is: are vendors the only ones who can benchmark? Why not allow customers who own the equipment to benchmark and publish their own results?

Perhaps the right approach is to create a support structure for this. A Web 2.0 collaborative portal with a wiki such as the one from Wikibon can give guidance on what constitutes a benchmark, how to run it, and what the pitfalls are, as well as what the utility of the result is in making a storage platform decision. Then the results can be commented on and sliced and diced by the world at large, and in true internet anarchy style, converge on the “truth” and “validity” of each result.

I have misgivings about this, although the approach is very intriguing. First, most customers I know don't benchmark storage, they benchmark an entire stack. For example, a customer of mine recently re-platformed (Unix variant change) one of their critical OLTP applications running on a Progress database - their benchmark was the wall clock timing of a particular piece of batch code, which we used as a basis for tuning storage performance. Can this be used as a general purpose result? Likely not.

Another fundamental problem here is constructing a benchmark for storage that can bring an array to its knees without the IO driver infrastructure saturating first. Most customers don't have the capability of using, say, 12 Linux systems with a distributed IO driving piece of code to really test such large arrays, and run the risk of saturating the driving server before the array chokes.

But, there is something appealing about anarchy...... and regardless of whether customer benchmarks should be collated, interpreted and governed, a central body of guidelines and definitions would be very helpful.

Thoughts and comments?

Sidebar

Marc Farley asked why the SPC couldn't be modified to satisfy EMC's objections, and whether EMC's objections were religious or technical.

Disclaimer: These are my opinions only, and may not reflect the opinions of EMC as a company.

Well, I don't believe EMC will (re)join the SPC because SPC-1 set a direction that has proven to be unproductive. The reasons are technical, and well described in earlier posts in this blog. In a nutshell, the SPC-1's cache hostile workload profile reduces its utility to counting the number of spindles in an array, and therefore a) provides no new usable information and b) deliberately "dumbs down" storage arrays and the intelligence inherent in them. It has no discriminating power between offerings from different vendors. EMC left the SPC for this reason, and I suspect (now, this is only my personal opinion) EMC won't join again. So by all means we can discuss modifying the SPC - I'm not holding my breath that it will be the standard benchmark from EMC's point of view. This whole effort is to discuss the creation of a new set of guidelines on how to characterize performance outside the SPC.

Cheers,

.Connector.

November 12, 2008

The Quest continues....

Hullo there Barry W.!

Thanks for the comment on my old post - woke me up out of a year-long hibernation. You succeeded where others have tried and failed....;^)

Yeah.. let's look at realistic benchmarks again - and I agree, it's going to take more than the two of us. And don't hold your breath on EMC joining the SPC anytime soon.....

Here's one way of looking at it

Let me grant that the intent behind the SPC was noble - to have a benchmark that customers could look to as a guideline for their performance needs from storage. There are then two major aspects that I find objectionable about it today, namely:

1. The benchmark itself is very narrow - cache hostile, basically counts spindles, not representative of real life workloads.

2. Governance - as the specific HW configurations are unconstrained, many of the tested systems are highly optimized, short-stroked, and not representative of what customers would buy. Apples-to-apples comparisons are very difficult. Even though the cost of the configuration is a weak measure of how efficiently assets were used, let's face it, no one pays attention to that. Every press release focuses on IOPS, not the cost, and that is where customer attention is drawn.

So instead of designing the uber-benchmark from first principles, perhaps addressing these deficiencies for the SPC is one way of converging quickly. So for example, include more measurements for a range of workloads - ones that will let the underlying array architecture show its mettle. And demand high asset utilization (say 70%+) for ports, spindles, capacity etc. to discourage jury-rigging configurations for the test.

This way, the work done by SPC can be leveraged, and the meaningfulness of the results can be enhanced. A more complete benchmark with mixed concurrent workloads, backend processes like array replication, etc. is desirable, but will take a lot longer to craft.

Thoughts?

Cheers,

.Connector

November 14, 2007

Back again!

Sorry folks! My day job and family effectively chewed up all my spare time for the past week- hence the delay in responding to the excellent suggestions in the comments to the last one.

Inch observed correctly that Open Source workload generators may end up writing to buffer cache instead. True.  Reminds me of an old story...

Many years ago, when I was still doing REAL work, a customer called me and said they were running Oracle RAC on a Symmetrix 8830. We had told them they should get about 240 MB/s large block throughput from a well partitioned workload across the two node cluster.

The customer measured about 160 MB/s and was understandably concerned. So I went over and looked at the configuration - 2-node Solaris 8 VCS DBED/AC with Veritas' cluster file system. Enough HBAs, no architectural bottlenecks on the SAN. So my interest was piqued. I tuned up max_phys in /etc/system to 1 MB and that improved stuff a bit - 200 MB/s. Still short.

I used one of our internal workload generators, and measured the throughput for the vxfs filesystems. Sure enough, I got 200 MB/s too. So I went to raw devices, and voilà, 240 MB/s. Turned out that going through vxfs and CFS, even with Direct IO turned on, there was still a lot of overhead from operating system meta-structures like file systems.

Moral: Be very careful how the operating system is configured. Many unintended effects may rear their head if not done carefully. Raw devices are the best to get a true measure of storage performance. (I still maintain that the DS8300 would have posted better results on SPC-1 if only the queue_depth had been increased - one man's opinion!)

So if one is careful, maybe open source workload generators could be used, by stipulating only raw devices as targets. It's the multi-threading and building correlated IO threads that is daunting to me - any ideas? And then moving to multi-host benchmarks after that.....

Constraints!

Overwhelmingly, the sentiment echoed by people seemed to be - if someone wants to short-stroke, let 'em!

OSSG and TechEnki said just disclose it clearly; for all you know, for a fixed price, that may be the best configuration to achieve a certain metric. The open user community will decide if it is relevant or useful. The Anarchist suggested a minimum limit on utilization - spindles, ports, etc.

Here's my suggestion. I like Anarchist's idea of setting a limit, but tend to agree with what OSSG and TechEnki suggest as well. So why not have a default benchmark report with a utilization limit (such as 70% for component utilization), which I hope will be the majority, and have the tester explicitly call out any result which had been optimized for a specific metric like response time or back-end throughput?

The issue is one of constraints, IMHO. One can optimize a configuration for cost, throughput, power consumption (as Anarchist suggested as well) or other metrics. This should be highlighted in the results. So I may have Copan giving me results that suck from an IOPS perspective, but which rock from a power consumption standpoint because of the MAID technology it incorporates. Inquiring minds should want to know!

So should we have a loose framework of constraints, and have classes of results optimizing for the maximum bang for that constraint?  Or have the constraints (spindles, total capacity, cache, $$, power, etc.) for a given configuration declared prominently?

This way, if someone wants to short-stroke, or have a much greater number of spindles than needed for the storage, so be it. But it's optimized for response time and throughput, and we know it explicitly because it's posted in that class. BarryW echoed this sentiment as well - constraining # spindles may not make sense as spindle sizes grow.
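To make the idea of result classes a bit more concrete, here is a minimal Python sketch. Everything in it is hypothetical: the BenchmarkResult record, its field names and the class labels are made up for illustration, and only the 70% floor comes from the suggestion above. It is one possible way to sort results, not a proposed standard.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# The 70% component utilization floor suggested in the post.
UTILIZATION_FLOOR = 0.70


@dataclass
class BenchmarkResult:
    """Hypothetical record for one published result."""
    name: str
    # Component -> fraction actually used, e.g. {"spindles": 0.9, "capacity": 0.34}
    utilization: Dict[str, float]
    # Constraints the tester declares up front (spindles, capacity, cache, $$, power...)
    declared_constraints: Dict[str, str] = field(default_factory=dict)
    # Metric the configuration was deliberately tuned for, if any
    optimized_for: Optional[str] = None

    def result_class(self) -> str:
        """Sort the result into a class based on utilization and disclosure."""
        below_floor = [c for c, u in self.utilization.items()
                       if u < UTILIZATION_FLOOR]
        if not below_floor:
            return "default"
        if self.optimized_for:
            # Overconfigured, but the tester said so and said why.
            return "optimized:" + self.optimized_for
        # Overconfigured with no explanation: flag it.
        return "undeclared"
```

A short-stroked configuration with 34% capacity utilization and optimized_for="response time" would land in the "optimized:response time" class rather than being mixed in with the default results.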

BarryW also called for requirements on replication "overlay" tests. I agree - I'm only wondering if we should dive into this now, or hash out the pure performance benchmark first, and then move to overlays.

So my modified Postulate #2 could be:

Postulate #2: Don't overconfigure! (unless you have to)

and

Postulate #3: If you DO overconfigure, publish all constraints.

On a tangent...

I was tickled by the financial analyst reaction to Oracle VM - suddenly a bunch of them were going gaga over how this was going to compete with VMWare! Huh? Didn't anyone see that it was Xen based? Xen is nice, but it's still got a ways to go.... even funnier is how both VMWare and Citrix tanked! Priceless!

TTFN!

Cheers, .Connector

November 07, 2007

Moving right along....

First,

Kudos to OSSG! Check out his  Storage Benchmark Wiki on Wikibon - awesome work to get us started! Thanks! And to answer your question, absolutely! Any of the material I post is fair game to be cut and pasted into it to get things started.

Folks, this is your golden chance to get a neutral vote in! End users especially - but any interested party is welcome to participate.

The couple of people who commented (inch, OSSG,..) seemed to lean towards an open source effort for benchmark workload generators.  I found this post by Jacob Gsoeld on SearchStorage.com that describes some generators like IOMeter, IOZone and NetBench - should we be checking these out? Are more people interested in the open source approach?

You saw my Postulate #1 last time. Without any further ado....

Postulate #2: No over configuring!

Don't claim to benchmark 25 TB while there is 100TB in the array. Less is more. Minimalism rules! Brownie points for getting nearly as good results with less HW.  Our customers are struggling to work through 60% growth per year with a flat budget - efficiency is key here.

A very nice side effect of this is that benchmark comparisons may actually make sense now. Apples to apples is the only way to go. Now, constraints like cost, power consumption, floorspace can be used as optimizers for the right platform. Tricks like shortstroking and increasing spindle count artificially go away.

So with Postulate #1, we get different views for the same HW configuration - cache-friendly, cache-hostile, random, sequential, small block, large block - and with #2 we level the playing field.

We've still got major hurdles to cross to get to multi-host, dynamic, composite workloads - but I believe that if we start with some well defined simple workloads, a workbench approach where these can be combined could be possible. Inch, any thoughts?

Or anyone else?

Cheers, .Connector

November 05, 2007

A trip down memory lane...

Hi folks,

Very nice ideas from a lot of people. I'll summarize them in this post, and put forth some of my own. But first, from my dim and distant past...

Everything I need to know I learned in Particle Physics

My soul is that of a physicist. In my younger days, I studied high-energy collisions between different elementary particles to understand their internal structure and interactions between them.

We know of four kinds of interactions: Strong, Weak, Electromagnetic and Gravitational. (Strictly speaking, Electromagnetic and Weak are now unified into the Electroweak interaction). Elementary particles came in three kinds - quarks, leptons (electrons, for example) and the carriers of the interactions, the gauge bosons (the photon is the best known).

Charting the properties of an unknown entity or structure meant using a probe to see the effects of a particular interaction. For example, if I wanted to probe electromagnetic structure, I'd use an electromagnetic probe, like the electron. If I wanted to probe weak structure, I would use a weak probe, like the neutrino. Now quark structure was tricky, as it could participate in strong, weak, electromagnetic and gravitational interactions.

So the trick to getting a solid composite view of the quark was to use different probes to build out a picture of all the aspects of this beast - not unlike using the input of 6 blind men to figure out the structure of an elephant. Any single probe would give incomplete information, and the results of all the probes had to be triangulated to get the correct interpretation.

I believe that any approach to storage benchmarking is no different, and has to take into account that the probe can only reveal substructure from its dominant interaction. In the storage world, particles are storage arrays; probes are workloads. So my first draft of the first of the Storage Benchmarking Postulates...

Postulate #1: A usable benchmark must subject the storage array to multiple workloads, testing different aspects of response while keeping the physical configuration static.

For example, I think the same physical array configuration should be subject to workloads that stress back end (cache hostile), cache (throughput) and front-end (cache friendly) (and others) application workloads. Then vendor doctoring or optimizing for a specific measurement should die out. Something optimized for OLTP may not do well for data warehouse workloads.

The figures-of-merit for this would be a composite of these results for a fixed configuration. The issue with the SPC-1 wasn't so much that thought didn't go into it, but rather that it ended up testing just one of these aspects, leading to skewed interpretations. What should a minimal set of probes be to fully characterize an array?
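As a rough illustration of what a composite figure-of-merit could look like, here is a small Python sketch. The choice of a geometric mean of baseline-normalized results is purely my assumption for the example (it keeps one stellar probe from hiding a poor one); the probe names and the composite_score helper are hypothetical, and nothing here is an agreed metric.

```python
import math


def composite_score(results, baselines):
    """Geometric mean of per-probe results, each normalized to a baseline.

    results   -- probe name -> measured value on ONE fixed configuration
    baselines -- probe name -> reference value for that probe
    Both mappings must cover exactly the same probes, so no workload
    can be quietly dropped from the composite.
    """
    if set(results) != set(baselines):
        raise ValueError("every probe needs both a result and a baseline")
    ratios = [results[probe] / baselines[probe] for probe in results]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical usage: three probes against the same physical array.
# score = composite_score(
#     {"cache_hostile": ..., "cache_friendly": ..., "sequential": ...},
#     {"cache_hostile": ..., "cache_friendly": ..., "sequential": ...},
# )
```

The individual probe results should still be published alongside any composite, since the whole point of the postulate is that a single number hides the structure.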

I have other postulates in my head, but let me solicit some feedback here first, and see if this is an avenue people think is worth pursuing.

The great ideas from you

As I mentioned earlier, many readers had great suggestions.

The Anarchist and Chuck Hollis commented that size or scale is a critical factor. Storage arrays come in many sizes, and for effective normalization of the results, one should really have small, medium and large (and maybe "Ginormous no-holds-barred") configurations that one should plan to test for.

BarryB also suggested that NAS be included in this (seconded by OSSG) so even though NFS and CIFS introduced some unique challenges, they do belong in the realm of devices to be tested. OSSG proposed a way to be inclusive - "to allow access to its storage using SCSI disk LUNs" and skirt the issue of host connection protocol. I like that - and would like to modify my definitions for an array to reflect that.

The germ for Postulate #1 was contained in TechEnki's and BarryW's comments as well. They suggested layered tests, one with a baseline of array properties, before going to higher levels of functionality. So one would test the array envelope performance for components, layer on application workloads, and then move on to advanced functionality tests running concurrently with the workload. Chuck also wanted the addition of standard failure scenarios and the system's response to that, like drive rebuilds or cache disabling.

Chuck pointed out that coming up with "standardized" workloads would be tough (in the spirit of Postulate #1, that would be multiple standardized workloads!), and that making this a suite of tests an end user could run would probably make it even more usable. OSSG also pointed out that "for this benchmark to have any real meaning, any result must be agreed upon by at least two parties, at least one of which must be neutral or a competitor".

These are tough governance issues. Should there be an external body (outside of the vendor community) that should govern such a benchmark? Should this be an open source set of workload drivers with configuration guidelines, with a results database from end-user testers? My inclination is the latter - let our customers tell us how well they think the arrays do.

David Vellante and OSSG offered to support such an effort, perhaps in the form of end-user driven wikis or other collaboration tools. Perhaps vendors can help to create a support structure in the move to return power to the customers.

Please ask others you know to contribute their ideas. I am a big fan of collaborative think. The more the merrier!

Ta-ta for now!

Cheers, dotConnector.

November 03, 2007

The quest for a better benchmark

Hi folks,

Long week for me! Met many many customers, and tossed many ideas around with people. I was going to blog next on the SPC-2 and share some of my observations on that. But now, I'm inspired by Open Systems Storage Guy's comment, and more recently, the storageanarchist's challenge on what a usable storage performance benchmark should look like. So I'm writing a series of posts to stimulate some discussion around this topic. Here is the first.

Ground rules:

I know this can lead to a loaded political discussion, and I hope to stay clear of that. I'd rather focus on what can and cannot be done from a technology perspective. My hope is to develop, through discussion and contribution, a framework on what such a benchmark should measure, and how one might construct it, and how results should be normalized and interpreted.

I have no illusions about knowing the answers, or even knowing if there is an answer! But it's about time we had an open discussion on this. And I don't just mean between vendors, but end-users too. I'll throw some open-ended ideas out, and let's see what develops.

What is a storage system?

I'll start with a truism - modern storage systems are not a bunch of disks. Many modern storage arrays have processing power, memory and algorithmic sophistication that rival or exceed those of the hosts connected to them. They support not one, but sometimes thousands of hosts with a mixture of almost every operating system known.

These array resources have enabled a host of software functions in the arrays themselves - around local and remote replication, workload optimization, data movement, fault tolerance and the like.

For the purposes of this discussion, I'd like to propose that we focus on storage arrays, devices that satisfy the following properties:

  • The device is accessible by the IO channels of the hosts attached to it
  • The data in the device is accessible online for random access and sequential IO by the host requesting it
  • The media storing the data is non-volatile and not "removable"
  • The device has the ability to provide hardware redundancy for the media
  • The device has the ability to attach multiple operating system images to it.
  • The device has the ability to abstract the physical medium and present a logical storage device to the operating system, which appears as a physical device to the host.

So Tape is out, as are Zip drives and CD/DVD/Optical type media. Dual-ported SSA or SCSI drives are out too, as they can't do the abstraction thing. Virtualization engines like the IBM SVC and EMC Invista qualify. When FSSD becomes reality, that would fit too. Network attached storage is very interesting, but has very different characteristics for attachment to a host, so for this discussion, I'd like to stick to block protocols.

I think this should cover most controller-based arrays today - but I am open to any suggestions to tighten this up.  The basic architecture for these arrays is usually a front-end, cache/processing units, and back-end. The front-end has host connectivity (SCSI, ESCON, Fiber Channel, FICON etc). The cache/processing unit attempts to ensure that as many IO requests as possible are satisfied from cache. The back-end has direct connectivity to the media, usually SCSI or Fiber Channel disk drives (but possibly other technologies).

Anatomy of IO requests

What should a benchmark tell us? More importantly, is one benchmark enough?

Naive answer - the standard metrics: IOPS, MB/s, response time for a given configuration. Too naive - needs qualification.

Let's look at host IO requests, starting with a single threaded IO Driver. Workloads are characterized as reads or writes, random or sequential, with different IO sizes. The response of a storage system is dependent on the nature of the IO request.
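A minimal Python sketch of that characterization follows. The IOProfile record and generate_requests function are hypothetical names invented for this example; the generator only produces a request stream in memory (operation, offset, size) and does not issue any real IO, so it shows the shape of a single-threaded driver's workload rather than being a driver itself.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class IOProfile:
    """Hypothetical description of a single-threaded workload."""
    read_fraction: float    # 1.0 = all reads, 0.0 = all writes
    random_fraction: float  # 1.0 = fully random, 0.0 = fully sequential
    io_size: int            # bytes per request


def generate_requests(profile, device_size, count, seed=0):
    """Yield (op, offset, size) tuples; no IO is actually performed."""
    rng = random.Random(seed)           # seeded so a run can be repeated
    blocks = device_size // profile.io_size
    next_block = 0                      # where a sequential stream continues
    for _ in range(count):
        op = "read" if rng.random() < profile.read_fraction else "write"
        if rng.random() < profile.random_fraction:
            block = rng.randrange(blocks)   # jump anywhere on the device
        else:
            block = next_block              # continue the sequential run
        next_block = (block + 1) % blocks   # wrap at the end of the device
        yield op, block * profile.io_size, profile.io_size
```

For instance, IOProfile(read_fraction=1.0, random_fraction=0.0, io_size=8192) describes a pure sequential read stream of 8 KB requests, while raising random_fraction to 1.0 turns the same profile into small-block random reads.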

Writes

In most storage arrays, writes are written to and acknowledged by the cache memory in the array. There are exceptions - if the array gets overwhelmed, it may pause to actually do the backend write. This manifests itself as delayed acknowledgment of the IO to the host. It may also bypass cache (in some systems) if single component failure puts data integrity at risk, and commit the IO directly to the media. When the array is overwhelmed, the backend's ability to drain writes starts to become the limiting factor in throughput.

When an array gets overwhelmed, and how it reacts to it is obviously of much interest and often defines upper bounds to some of these metrics. Its failure modes are also of consequence in choosing a system. So writes are usually "hits" in cache - at memory speeds.

Reads

Reads may or may not need a backend IO, depending on whether the IO requested is already in cache or not.

The read IO request may be in cache, in which case, the response is at memory speeds. Many arrays rely on a couple of factors for enhancing performance. First, they sense sequential IO request streams early and preemptively prefetch subsequent data and place it in cache. Secondly, they often rely on the "principle of locality" - that is, if a piece of data is accessed, chances are it will be used again soon, or data "near" it will be used (Think database transactions - insert a row, update it a few times, commit the transaction. The same block is used again and again). So large cache systems will retain recently used data blocks in memory for as long as possible in cache, maximizing the probability that a read request will be a cache hit or a "read hit".

If the read is not in cache, backend IO is needed to bring it into the cache. This is the most "expensive" operation for an array as the backend is the slowest, usually mechanical, part of the configuration. This is referred to as a "read miss". Much work has been done in minimizing the time required to retrieve an IO request from media. Striping, aggressive prefetching, making use of multiple copies of data, smaller form factor drives, faster rotation of platters and the like have been used extensively to improve this.

Workloads where the read hits dominate are referred to as "cache friendly" and those where read misses dominate are called "cache-hostile". In the absence of any other component limiting throughput (like cache or processing power), cache hostile workloads are limited by backend media throughput.
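The difference between cache friendly and cache hostile workloads can be shown with a toy simulation. The Python sketch below assumes a plain LRU cache of fixed size, which is my simplification and not how any particular array manages its cache (real arrays add prefetch, write handling and much more). The function names and the latency parameters are hypothetical.

```python
import random
from collections import OrderedDict


def read_hit_ratio(block_stream, cache_blocks):
    """Fraction of reads served from a simple LRU cache of cache_blocks entries."""
    cache = OrderedDict()
    hits = 0
    total = 0
    for block in block_stream:
        total += 1
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            cache[block] = True            # read miss: stage block into cache
            if len(cache) > cache_blocks:
                cache.popitem(last=False)  # evict the least recently used block
    return hits / total if total else 0.0


def expected_read_latency(hit_ratio, hit_ms, miss_ms):
    """Average read response time as a hit/miss weighted mix."""
    return hit_ratio * hit_ms + (1.0 - hit_ratio) * miss_ms


# Illustrative streams only: a working set that fits in cache versus
# uniformly random reads over an address space far larger than cache.
rng = random.Random(0)
friendly = [rng.randrange(50) for _ in range(10_000)]
hostile = [rng.randrange(1_000_000) for _ in range(10_000)]
print("cache friendly hit ratio:", read_hit_ratio(friendly, 100))
print("cache hostile hit ratio: ", read_hit_ratio(hostile, 100))
```

With the hit ratio near zero, expected_read_latency collapses to the miss time, which is the sense in which a cache hostile workload ends up measuring the backend and little else.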

What should we ask for

The request from host is a random or sequential read/write of a specific size.

The response of a storage array comes in many forms. Write hits, write misses, read hits and read misses. For sequential and random access patterns. For different IO sizes.

Each of these responses is dependent on the architecture, electronic components and intellectual property of the specific vendors.

What makes realistic benchmarks hard to construct is that real workloads have

  • Threads with dynamic composites of these IO components
  • Many concurrent threads from each host
  • Many hosts sharing resources, including spindles, in the same array
  • Complex array based functions in operation (such as local and remote replication) while servicing IO requests
  • Different relative priorities of workloads from different hosts

A good set of benchmarks must be able to distinguish different vendor configurations with respect to their architecture, components and intellectual property investment. The ability to normalize different vendor results must be available for such a benchmark - i.e., to be useful, physical configurations must be comparable for results to be comparable between vendors. Standard IO drivers and workload simulators must be constructed and agreed upon. The drivers should be able to generate enough workload to run the arrays to saturation.

For example, if my system can sustain a certain response time under a specific workload, what if I was also doing a backup to disk on the same array? What is the performance degradation? What is the hit on throughput with synchronous replication active? How well does the array handle spikes in workload? What is the performance hit if I take a point-in-time copy of a large system? These are questions my customers ask me about performance. I answer them on a case by case basis for the DMX. Good benchmarks should be able to do this across vendors.

My criticism of the SPC benchmark, as it stands today, is that the posture it has adopted is that of a static cache-hostile single host composite which by construction tests primarily backend components and is virtually agnostic to the architecture and intellectual property contributions of the vendors. For modern systems, it can only weakly distinguish between high-end and mid-range architectures. In both cases, it effectively counts the number of spindles in the tested systems. This has made SPC benchmarking a marketing circus for the vendors who participate.

Granted, construction of a multi-host benchmark with other array functions active is difficult before a single host benchmark is acceptable. I have ideas on how to improve this, but before I go there, I would love to hear any ideas on this.

Cheers,

.Connector.

October 30, 2007

SPC-1: What have you done for me - lately!

Hi folks!

Lots of good stuff in response to my last post! Good comments from Open Systems Storage Guy; Barry Burke's tongue-firmly-in-cheek DMX-4 SPC-1 IOPS calculation was very entertaining; and an interesting Barry Whyte post in response to the observations I made.

Out with the old...

My interest is in what the SPC-1 benchmark has been used for - lately! Like in the last 3 years or so. Apparently BarryW has a significant amount of academic interest in a LOT of older results, from an era when cache was at a cost premium, and modular 2-controller systems were the new rage, and the benchmark results highlighted that. Many of the systems have been EOL'd, or superseded.

If anyone wants to harp on how the SPC-1 IOPS results can help an end user to decide whether to buy Fujitsu Eternus 3000 Model 300 or the Dell PERQ/3 QC today - have a blast! I just don't see the relevance.

And in with the new...

Welcome to 2007! My customers are likely to be using much more modern stuff, and that's MY interest. So I restricted my analysis to submission dates roughly after October 2004 (and stated in my post that I was excluding older results).

I missed a few SUN/STK systems in the cutoff (my bad! Add them in...won't change a thing), and I did include the SVC 1.1.1 from June 2004 (I was curious about that!). Others are from another era - not relevant for the points I am making.

My observation is simple: for the results in the last 3 years, all the results reflect direct proportionality between the SPC-1 IOPS benchmark and spindle count. This is a significantly stronger statement than stating that there is a monotonic increase of SPC-1 IOPS with spindle count (which is intuitive). The data speaks for itself.

The curious tale of two brothers: DS8300 and The DS8300 Turbo

However, thanks, BarryW, for reminding me to discuss the DS8300 family SPC-1 IOPS. During my data collection phase for this series of posts, I too noticed this very enigmatic puzzle.

Here were two systems, identical in HW configuration (same number of disks, same testing configuration, same amount of cache, processors, 32 channels) driven by exactly the same server, a P5 p595 Model 9119 32-way processor with 32 channels running AIX 5.3.

Yes, the addressable storage was a bit different (6.6 TB for the DS8300 and 8.9 TB for the Turbo), but in true form like other vendors, the data only occupied about 32-36% of the total storage in the system, ostensibly to increase spindle count and thereby inflate the SPC-1 IOPS result (just curious: so what is the end customer supposed to do with the rest of the storage?). So that couldn't be the reason for the mystery below....

So it was intriguing to see the Turbo post 123,033 SPC-1 IOPS and the DS8300 post 101,102 SPC-1 IOPS! 22% better!

Could it be... that the microcode on the DS8300 Turbo was better, and the SPC-1 actually caught that? Wow! I mean, Gee Whiz! Holy Cow! Or...

Could it be... Satan?

The devil in the details

Let me ask this question:

Why is the driving host configuration different for the Turbo benchmark compared to the vanilla 8300?

Specifically, why were the queue_depth and max_transfer parameters  changed from their defaults (20,256K) to higher values (64, 1024K) for the DS8300 Turbo benchmark? This is buried in the Full Disclosure report, in the link above.

The queue_depth parameter increase lets more IOs queue up at the host - a good thing for large arrays, where many disks may appear aggregated as one logical volume. This makes sure that the disks are not twiddling their thumbs while the operating system is working under the wrong assumption that the volume will be overwhelmed if more IOs are pushed. Set it too low, and the disk array seems to underperform, as not enough work is going its way. I have seen many instances when increasing operating system queue_depth gives significant gains in IO throughput, especially with internal striping in the array (like the RAID-10 the DS8300 uses).

The max_transfer parameter makes sure that large IOs are broken up into bite sized chunks. Set it too small, and the operating system does a lot of work for nothing.

Could it be.... that the DS8300 test was actually throttled by the inability of the host to drive enough IOs? So actually the DS8300 and the Turbo might have posted the same result - except, the driving P595 was not queuing enough IOs for the older array.
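One way to see how a host-side queue limit can cap a result is Little's Law: the IOs in flight equal throughput times response time, so throughput cannot exceed the allowed outstanding IOs divided by the response time. The Python sketch below applies that to the two queue_depth settings (20 and 64). The LUN count and the response time are placeholder inputs I made up for illustration; they are not taken from the Full Disclosure reports, and the max_iops helper is a hypothetical name.

```python
def max_iops(queue_depth, luns, response_time_ms):
    """Little's Law ceiling on host-driven IOPS.

    queue_depth      -- IOs the host will keep outstanding per LUN
    luns             -- number of LUNs the driver spreads IO across (assumed)
    response_time_ms -- average response time per IO (assumed)
    """
    outstanding = queue_depth * luns
    return outstanding * 1000.0 / response_time_ms


# Placeholder inputs, NOT measured values: 10 LUNs at 5 ms per IO.
for depth in (20, 64):
    print("queue_depth", depth, "-> ceiling", max_iops(depth, 10, 5.0), "IOPS")
```

Whatever the real LUN count and response times were, the ceiling scales with queue_depth, so going from 20 to 64 raises it by a factor of 3.2. That does not prove the vanilla DS8300 run was host-limited, only that the default setting makes it a possibility that the published data cannot rule out.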

Or was it really the SPC-1 suddenly becoming sensitive to something other than spindle count, just for the DS8300 family? Somehow, I don't think so...

I could be wrong... but on the other hand....

I know I am speculating - but right now, the SPC-1 IOPS Full Disclosure report for both is no help in clarifying what reality is. Short of a new measurement with the new parameters for the vanilla DS8300, I don't see how one can argue that the storage performance under SPC-1 actually improved for the DS8300 Turbo compared to the DS8300. I would submit that this could equally be an artifact of a restricted IO driver. If two things changed, which one caused the difference in the measurement?

But, oh, didn't I hear somewhere that the DS8300 just got EOL'd?

So it is an interesting and anomalous discrepancy, but not one that can be resolved conclusively with the data at hand. I'm not buying it, BarryW.

Is the SVC really high-end?

Hmmm... the fact that the USP V and the SVC seem to perform the same means one of two things:

a) The SVC is truly on par with a USP V from an array capability point of view or

b) the SPC-1 benchmark has done great disservice to the USP V, and forced it down to the least common denominator - spindle count.

The USP V SPC-1 result did show one thing - that a full configuration shows no other component saturation effects. It is purely spindle bound - with no discernable choke points outside of that. My personal belief (not EMCs!) is that HDS sold their technology short by participating in this benchmark. Like the DMX, their microcode engineers have spent many hundreds if not thousands of man-years optimizing their systems for MF, multiple-host workloads, etc. The benchmark lets none of that shine through. I think they can do a lot better with a real life workload than the SVC.

But, as I said, that's one man's opinion..... the same man who is also positive that the DMX will do better than both the SVC and USP V with real customer multi-host multi-function workloads, with a boatload more functionality to boot.

Thanks for the welcome to the blogosphere, BarryW! I am sure we will agree and disagree on many other topics over the coming years, and I hope to continue to learn from that as we go on.

Cheers,

.Connector.

October 29, 2007

Mathematics of the SPC-1 Benchmark

Hi folks,

When the Storage Performance Council SPC-1 benchmark was introduced several years ago, and the first few results hit the streets, I noticed something astounding: the SPC-1 IOPS scaled linearly with the number of drives in the tested storage array, independent of vendor, array, drive size or drive speed!

It was not astounding as a technical result - I would expect a cache hostile benchmark like the SPC-1 to be loosely dependent on spindle count. What was astounding was that no one ever called any attention to it!

Vendor Olympics:

I saw vendor after vendor claiming technical superiority based on the magnitude of the SPC-1 IOPS measurement they had made. I saw challenges to EMC and NetApp to participate in this. So I decided to look at the most recent results to see if the benchmark or vendor arrays had changed so significantly in the past few years.

I took "high-end" systems - with a lot of scalability, and plotted their SPC-1 IOPS against the number of drives they had. The results:

[Figure: SPC-1 IOPS versus number of drives for the high-end systems, with a straight-line fit]

The equation is a simple straight-line fit done with Excel. The R^2 value measures the quality of the fit: 1 is the best possible, so 0.996 is pretty darned good.
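For anyone who wants to repeat the fit outside of Excel, here is a minimal sketch of an ordinary least-squares line and its R^2 in plain Python. The function name `linear_fit` is mine, and the drive counts and IOPS values below are made-up placeholders chosen only to show the mechanics; they are NOT published SPC-1 results. Substitute the real numbers from the Full Disclosure reports.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept, plus R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared


# PLACEHOLDER data for illustration only, not real SPC-1 measurements.
drives = [256, 512, 1024, 1536]
iops = [50_500, 101_000, 199_000, 302_500]

slope, intercept, r_squared = linear_fit(drives, iops)
print(f"IOPS per drive (slope): {slope:.1f}")
print(f"intercept: {intercept:.1f}")
print(f"R^2: {r_squared:.4f}")
```

An R^2 close to 1 says that drive count alone accounts for nearly all of the variation in the IOPS numbers, which is the whole point of the plot.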

What did this mean?

Well... I could have probably saved HDS a lot of money and testing! Just knowing the previous results with the DS8300 Turbo, SVC 3.1 and SVC 4.2, I could have predicted the SPC-1 IOPS number for the 1024 drive HDS USP V almost exactly! Pretty good for arithmetic!

The HDS USP V has a tested configuration (ASU Storage) of 26 TB, but had 150+ TB of raw storage in it. In fact, even discounting for the RAID-1 protection for the storage, the benchmark only used 34% of the storage in the array. But spread out over all 1024 spindles. The USP V had 146 GB 15K Drives. The press release says none of this! You have to read the Full Disclosure report to find this out. What gets the headlines is just the fact that they have 200,000+ SPC-1 IOPS for 26 TB!
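The capacity arithmetic is easy to check. The sketch below assumes nominal 146 GB drives in decimal units and a rounded 26 TB tested capacity, so it lands near, rather than exactly on, the figures quoted above; the precise values depend on the formatted drive capacity listed in the Full Disclosure report.

```python
DRIVES = 1024
DRIVE_GB = 146   # nominal capacity per drive, decimal GB (assumption)
ASU_TB = 26      # tested (ASU) capacity, rounded as quoted in the post

raw_tb = DRIVES * DRIVE_GB / 1000   # total raw capacity in TB
usable_tb = raw_tb / 2              # RAID-1 mirroring halves usable space
utilization = ASU_TB / usable_tb    # share of usable capacity under test

print(f"raw capacity:    {raw_tb:.1f} TB")
print(f"after RAID-1:    {usable_tb:.1f} TB")
print(f"used by the ASU: {utilization:.0%}")
```

Roughly a third of the mirrored capacity is exercised, while the IO is still spread across every one of the 1024 spindles.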

Even more surprising, the SVC matches the performance almost exactly as well - a completely different architecture! And it had sub-arrays that used 73 GB 15K drives.

So all HDS has to do to beat the standing SVC record is to load up the USP V with 1536 drives.

Ooops - can't do that.

Or for the DS8300 to match the USP V record - go from 512 to 1024 drives.

Regardless, the SPC-1 IOPS has no discriminatory power for any of these 4 systems. The benchmark results are completely determined by one thing: the number of physical spindles in the configuration.

Let's look at it all!

What if I included not just highly-scalable systems, but mid-sized systems too? Well... you be the judge.

[Figure: SPC-1 IOPS versus number of drives for both the high-end and the mid-range systems]

Here we see two classes of storage - the scalable arrays, and the mid-range ones. The lines have the same slope, and the correlation is pretty linear in both the blue and red bands. For almost every system tested, the results can be predicted to within a few percent with just knowledge of the number of spindles. [Note: I have excluded some older results at the low end (red band) in the interest of time (mine!). This is an exercise I encourage everyone to do at home - the data is public at the SPC Home Page]

The really useful metric from the SPC-1 is $/SPC-1 IOPS - one that has unfortunately faded from prominence. The over-configuration of the USP V makes the SVC give the best bang for the buck.

So I'd love to see what these platforms can do without the spindle count advantage. How about a test with 26 TB where there is 52 TB in a RAID-1 configuration in the array? I can predict the result: 1/3 of the current SPC-1 IOPS for the USP V with 1024 drives.
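The "1/3" prediction follows from spindle count alone. Here is a back-of-the-envelope sketch, assuming the same nominal 146 GB drives (decimal units) and assuming SPC-1 IOPS scale in direct proportion to the number of drives; both are my simplifications for illustration, not measured facts.

```python
import math

FULL_DRIVES = 1024    # drives in the tested USP V configuration
DRIVE_GB = 146        # nominal capacity per drive, decimal GB (assumption)
RAW_TB_NEEDED = 52    # 26 TB tested capacity, mirrored with RAID-1

# Smallest number of drives that provides the required raw capacity
drives_needed = math.ceil(RAW_TB_NEEDED * 1000 / DRIVE_GB)

# If IOPS are proportional to spindle count (assumption), the expected
# result is this fraction of the 1024-drive number.
fraction = drives_needed / FULL_DRIVES

print(f"drives needed: {drives_needed}")
print(f"predicted share of the 1024-drive result: {fraction:.2f}")
```

The share comes out at about a third of the drives, and so, under the proportionality assumption, about a third of the IOPS.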

Why I am upset:

This is no magic, folks - I cannot imagine that this is not a well-known fact for vendors and the SPC - that the benchmark dumbs down ALL vendors to the level close to JBOD. If SPC-1 performance is predetermined by drive count, customer decisions on storage investments are purely an exercise in pricing - all vendors are the same from a performance standpoint.

If this was known - not drawing customer attention to it is, mildly put, disingenuous.

If this wasn't known - this is not rocket science, people. That's hard to believe.

Let's work instead to see if there is a real way to benchmark these systems - which is actually useful to our customers. This is exactly why EMC pulled out of the SPC years ago.

I know this is not easy, but am willing to work on it.

Cheers,

.Connector

My mug shot!

In case anyone is interested.... my true mad scientist spirit captured by my daughter!

[Photo]

.Connector

Hi folks!

Hello World!

My first venture into blogosphere... guess I should introduce myself. I have been described, through my years, as a particle physicist, CTO, architect and most recently an EMC Distinguished Engineer, and consider myself an honorary southerner (Yes - I did live in and love Louisiana in my past, and still miss the food). I like to think of myself as a curious soul who is ever in awe of what I don't know and what no one knows.

I grew up with real-time computing and loosely-coupled compute grids in data acquisition environments in high-energy accelerator experiments at Fermilab, DESY and CERN during the birth of HTTP, my baptism by fire in UNIX, which I grew to love. Now I am comfortable with most operating systems, and a smattering of languages, databases, ERPs and networks, and of course, storage technologies.

Fair warning: I am an employee of EMC Corporation and proud of it. However, any and all opinions expressed by me in this blog are mine and mine alone, and may not reflect EMC's official stance on anything. EMC is not responsible for any of this content, and no one controls my posts.

I find technology and data of any kind fascinating, and people even more so, and will be blogging quite a bit on both.

'Nuff for now, I do have some interesting observations to share on certain benchmarks in the storage industry, but more on that in my next few posts!

Cheers,

.Connector.