November 2007

November 14, 2007

Back again!

Sorry folks! My day job and family effectively chewed up all my spare time for the past week - hence the delay in responding to the excellent suggestions in the comments to the last post.

Inch observed correctly that Open Source workload generators may end up writing to the host's buffer cache instead of to the array. True. Reminds me of an old story...

Many years ago, when I was still doing REAL work, a customer called me and said they were running Oracle RAC on a Symmetrix 8830. We had told them they should get about 240 MB/s large block throughput from a well partitioned workload across the two-node cluster.

The customer measured about 160 MB/s and was understandably concerned. So I went over and looked at the configuration - 2-node Solaris 8 VCS DBED/AC with Veritas' cluster file system. Enough HBAs, no architectural bottlenecks on the SAN. So my interest was piqued. I tuned up maxphys in /etc/system to 1 MB and that improved things a bit - 200 MB/s. Still short.

I used one of our internal workload generators, and measured the throughput for the vxfs filesystems. Sure enough, I got 200 MB/s too. So I went to raw devices, and voilà, 240 MB/s. It turned out that going through vxfs and CFS, even with Direct IO turned on, still added a lot of overhead from operating system meta-structures like file systems.

Moral: Be very careful how the operating system is configured. Many unintended effects may rear their heads if this is not done carefully. Raw devices are the best way to get a true measure of storage performance. (I still maintain that the DS8300 would have posted better results on SPC-1 if only the queue_depth had been increased - one man's opinion!)

So if one is careful, maybe open source workload generators could be used, by stipulating only raw devices as targets. It's the multi-threading and building correlated IO threads that is daunting to me - any ideas? And then moving to multi-host benchmarks after that.....
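To make the multi-threading part a little less daunting, here is a minimal sketch of a multi-threaded read driver in Python. Everything in it is an assumption for illustration (the function name run_readers, its parameters, the random aligned-offset pattern); it is not any existing generator, and it does nothing about correlating the threads with each other. Note that pointed at an ordinary file it will go through the OS buffer cache, which is exactly Inch's point; the idea would be to point it at a raw device instead. It relies on os.pread, so Unix-like systems only.

```python
import os
import random
import threading
import time


def run_readers(path, threads=4, io_size=8192, ios_per_thread=100, seed=0):
    """Issue random, aligned reads from several threads.

    Returns (total_bytes_read, elapsed_seconds).
    """
    # Seeking to the end gives the size of a regular file or a block device.
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)
    finally:
        os.close(fd)

    blocks = size // io_size
    if blocks == 0:
        raise ValueError("target is smaller than one IO")

    totals = [0] * threads  # one slot per thread, so no locking is needed

    def worker(idx):
        rng = random.Random(seed + idx)  # reproducible per-thread offsets
        wfd = os.open(path, os.O_RDONLY)
        try:
            for _ in range(ios_per_thread):
                offset = rng.randrange(blocks) * io_size
                totals[idx] += len(os.pread(wfd, io_size, offset))
        finally:
            os.close(wfd)

    start = time.perf_counter()
    pool = [threading.Thread(target=worker, args=(i,)) for i in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    elapsed = time.perf_counter() - start
    return sum(totals), elapsed
```

Dividing the byte count by the elapsed time gives a throughput figure for that one host. Making the threads depend on each other, and coordinating several hosts, is the hard part and is left open here.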

Constraints!

Overwhelmingly, the sentiment echoed by people seemed to be - if someone wants to short-stroke, let 'em!

OSSG and TechEnki said just disclose it clearly; for all you know, for a fixed price, that may be the best configuration to achieve a certain metric. The open user community will decide if it is relevant or useful. The Anarchist suggested a minimum limit on utilization - spindles, ports, etc.

Here's my suggestion. I like Anarchist's idea of setting a limit, but tend to agree with what OSSG and TechEnki suggest as well. So why not have a default benchmark report with a utilization limit (such as 70% for component utilization), which I hope will be the majority, and have the tester explicitly call out any result which had been optimized for a specific metric like response time or back-end throughput?
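As a rough sketch of how that rule might look in practice, the snippet below sorts a result into the default class or an explicitly declared "optimized" class. It assumes the limit is a floor, in line with the Anarchist's minimum utilization idea; the function name classify_result, the component names and the 0.70 default are illustrative only.

```python
def classify_result(utilization, floor=0.70, optimized_for=None):
    """Classify a benchmark result by component utilization.

    utilization: dict mapping component name -> fraction in use (0.0 to 1.0).
    floor: assumed minimum utilization for the default report class.
    optimized_for: metric the tester declares when a component is below the floor.
    """
    below = sorted(name for name, used in utilization.items() if used < floor)
    if not below:
        return "default"
    if optimized_for is None:
        # Under-utilized components are allowed, but only if called out.
        raise ValueError("below floor, must declare a metric: " + ", ".join(below))
    return "optimized for {} (below floor: {})".format(optimized_for, ", ".join(below))


print(classify_result({"spindles": 0.80, "ports": 0.75}))
print(classify_result({"spindles": 0.30, "ports": 0.90}, optimized_for="response time"))
```

A short-stroked configuration is still publishable under this scheme; it simply cannot be reported without naming what it was optimized for.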

The issue is one of constraints, IMHO. One can optimize a configuration for cost, throughput, power consumption (as Anarchist suggested as well) or other metrics. This should be highlighted in the results. So I may have Copan giving me results that suck from an IOPS perspective, but which rock from a power consumption standpoint because of the MAID technology it incorporates. Inquiring minds should want to know!

So should we have a loose framework of constraints, and have classes of results optimizing for the maximum bang for that constraint?  Or have the constraints (spindles, total capacity, cache, $$, power, etc.) for a given configuration declared prominently?

This way, if someone wants to short-stroke, or have a much greater number of spindles than needed for the storage, so be it. But it's optimized for response time and throughput, and we know it explicitly because it's posted in that class. BarryW echoed this sentiment as well - constraining # spindles may not make sense as spindle sizes grow.

BarryW also called for requirements on replication "overlay" tests. I agree - I'm only wondering if we should dive into this now, or hash out the pure performance benchmark first, and then move to overlays.

So my modified Postulate #2 could be:

Postulate #2: Don't overconfigure! (unless you have to)

and

Postulate #3: If you DO overconfigure, publish all constraints.

On a tangent...

I was tickled by the financial analyst reaction to Oracle VM - suddenly a bunch of them were going gaga over how this was going to compete with VMware! Huh? Didn't anyone see that it was Xen based? Xen is nice, but it's still got a ways to go.... even funnier is how both VMware and Citrix tanked! Priceless!

TTFN!

Cheers, .Connector

November 07, 2007

Moving right along....

First,

Kudos to OSSG! Check out his  Storage Benchmark Wiki on Wikibon - awesome work to get us started! Thanks! And to answer your question, absolutely! Any of the material I post is fair game to be cut and pasted into it to get things started.

Folks, this is your golden chance to get a neutral vote in! End users especially - but any interested party is welcome to participate.

The couple of people who commented (inch, OSSG,..) seemed to lean towards an open source effort for benchmark workload generators.  I found this post by Jacob Gsoeld on SearchStorage.com that describes some generators like IOMeter, IOZone and NetBench - should we be checking these out? Are more people interested in the open source approach?

You saw my Postulate #1 last time. Without any further ado....

Postulate #2: No over configuring!

Don't claim to benchmark 25 TB while there is 100 TB in the array. Less is more. Minimalism rules! Brownie points for getting nearly as good results with less HW. Our customers are struggling to work through 60% growth per year with a flat budget - efficiency is key here.

A very nice side effect of this is that benchmark comparison may actually make sense now. Apples to apples is the only way to go. Now, constraints like cost, power consumption and floorspace can be used as optimizers for the right platform. Tricks like short-stroking and increasing spindle count artificially go away.

So with Postulate #1, we get different views for the same HW configuration - cache-friendly, cache-hostile, random, sequential, small block, large block - and with #2 we level the playing field.

We've still got major hurdles to cross to get to multi-host, dynamic, composite workloads - but I believe that if we start with some well-defined simple workloads, a workbench approach where these can be combined could be possible. Inch, any thoughts?

Or anyone else?

Cheers, .Connector

November 05, 2007

A trip down memory lane...

Hi folks,

Very nice ideas from a lot of people. I'll summarize them in this post, and put forth some of my own. But first, from my dim and distant past...

Everything I need to know I learned in Particle Physics

My soul is that of a physicist. In my younger days, I studied high-energy collisions between different elementary particles to understand their internal structure and interactions between them.

We know of four kinds of interactions: Strong, Weak, Electromagnetic and Gravitational. (Strictly speaking, Electromagnetic and Weak are now unified into the Electroweak interaction.) Elementary particles come in three kinds - quarks, leptons (electrons, for example) and the carriers of the interactions, the gauge bosons (the photon is the best known).

Charting the properties of an unknown entity or structure meant using a probe to see the effects of a particular interaction. For example, if I wanted to probe electromagnetic structure, I'd use an electromagnetic probe, like the electron. If I wanted to probe weak structure, I would use a weak probe, like the neutrino. Now quark structure was tricky, as it could participate in strong, weak, electromagnetic and gravitational interactions.

So the trick to getting a solid composite view of the quark was to use different probes to build out a picture of all the aspects of this beast - not unlike using the input of 6 blind men to figure out the structure of an elephant. Any single probe would give incomplete information, and the results of all the probes had to be triangulated to get the correct interpretation.

I believe that any approach to storage benchmarking is no different, and has to take into account that the probe can only reveal substructure from its dominant interaction. In the storage world, particles are storage arrays; probes are workloads. So my first draft of the first of the Storage Benchmarking Postulates...

Postulate #1: A usable benchmark must subject the storage array to multiple workloads, testing different aspects of response while keeping the physical configuration static.

For example, I think the same physical array configuration should be subject to workloads that stress back end (cache hostile), cache (throughput) and front-end (cache friendly) (and others) application workloads. Then vendor doctoring or optimizing for a specific measurement should die out. Something optimized for OLTP may not do well for data warehouse workloads.

The figures-of-merit for this would be a composite of these results for a fixed configuration. The issue with the SPC-1 wasn't so much that thought didn't go into it, but rather that it ended up testing just one of these aspects, leading to skewed interpretations. What should a minimal set of probes be to fully characterize an array?
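Here is one hedged sketch of what "a composite of these results for a fixed configuration" could mean in code. The probe names come from the example workloads above, but treating them as the required set is an assumption, as are the names ProbeResult and composite_report. The sketch does not collapse the results into a single number; it only refuses to build a report unless every probe ran against the same physical configuration.

```python
from dataclasses import dataclass

# Assumed minimal probe set, taken from the example workloads in the text.
REQUIRED_PROBES = ("cache-hostile", "cache-throughput", "cache-friendly")


@dataclass(frozen=True)
class ProbeResult:
    config_id: str      # identifies the physical configuration under test
    probe: str          # which workload was used as the probe
    iops: float
    mb_per_s: float
    response_ms: float


def composite_report(results):
    """Return {probe: (iops, mb_per_s, response_ms)} for one fixed configuration."""
    configs = {r.config_id for r in results}
    if len(configs) != 1:
        raise ValueError("all probes must run against one physical configuration")
    by_probe = {r.probe: r for r in results}
    missing = [p for p in REQUIRED_PROBES if p not in by_probe]
    if missing:
        raise ValueError("missing probes: " + ", ".join(missing))
    return {
        p: (by_probe[p].iops, by_probe[p].mb_per_s, by_probe[p].response_ms)
        for p in REQUIRED_PROBES
    }
```

Keeping the per-probe numbers side by side, rather than averaging them, is a deliberate choice in this sketch: an array tuned for one probe shows up as lopsided instead of hiding behind a blended score.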

I have other postulates in my head, but let me solicit some feedback here first, and see if this is an avenue people think is worth pursuing.

The great ideas from you

As I mentioned earlier, many readers had great suggestions.

The Anarchist and Chuck Hollis commented that size or scale is a critical factor. Storage arrays come in many sizes, and for effective normalization of the results, one should really have small, medium and large (and maybe "Ginormous no-holds-barred") configurations that one should plan to test for.

BarryB also suggested that NAS be included in this (seconded by OSSG) so even though NFS and CIFS introduced some unique challenges, they do belong in the realm of devices to be tested. OSSG proposed a way to be inclusive - "to allow access to its storage using SCSI disk LUNs" and skirt the issue of host connection protocol. I like that - and would like to modify my definitions for an array to reflect that.

The germ for Postulate #1 was contained in TechEnki's and BarryW's comments as well. They suggested layered tests, one with a baseline of array properties, before going to higher levels of functionality. So one would test the array envelope performance for components, layer on application workloads, and then move on to advanced functionality tests running concurrently with the workload. Chuck also wanted the addition of standard failure scenarios and the system's response to them, like drive rebuilds or cache disabling.

Chuck pointed out that coming up with "standardized" workloads would be tough (in the spirit of Postulate #1, that would be multiple standardized workloads!), and that making this a suite of tests an end user could run would probably make it even more usable. OSSG also pointed out that "for this benchmark to have any real meaning, any result must be agreed upon by at least two parties, at least one of which must be neutral or a competitor".

These are tough governance issues. Should there be an external body (outside of the vendor community) that governs such a benchmark? Or should this be an open source set of workload drivers with configuration guidelines, with a results database from end-user testers? My inclination is the latter - let our customers tell us how well they think the arrays do.

David Vellante and OSSG offered to support such an effort, perhaps in the form of end-user driven wikis or other collaboration tools. Perhaps vendors can help to create a support structure in the move to return power to the customers.

Please ask others you know to contribute their ideas. I am a big fan of collaborative thinking. The more the merrier!

Ta-ta for now!

Cheers, dotConnector.

November 03, 2007

The quest for a better benchmark

Hi folks,

Long week for me! Met many, many customers, and tossed many ideas around with people. I was going to blog next on the SPC-2 and share some of my observations on that. But now, I'm inspired by Open Systems Storage Guy's comment, and more recently, the storageanarchist's challenge on what a usable storage performance benchmark should look like. So I'm writing a series of posts to stimulate some discussion around this topic. Here is the first.

Ground rules:

I know this can lead to a loaded political discussion, and I hope to stay clear of that. I'd rather focus on what can and cannot be done from a technology perspective. My hope is to develop, through discussion and contribution, a framework on what such a benchmark should measure, and how one might construct it, and how results should be normalized and interpreted.

I have no illusions about knowing the answers, or even knowing if there is an answer! But it's about time we had an open discussion on this. And I don't just mean between vendors, but end-users too. I'll throw some open-ended ideas out, and let's see what develops.

What is a storage system?

I'll start with a truism - modern storage systems are not a bunch of disks. Many modern storage arrays have processing power, memory and algorithmic sophistication that rival or exceed those of the hosts that are connected to them. They support not one, but sometimes thousands of hosts with a mixture of almost every operating system known.

These array resources have enabled a host of software functions in the arrays themselves - around local and remote replication, workload optimization, data movement, fault tolerance and the like.

For the purposes of this discussion, I'd like to propose that we focus on storage arrays, devices that satisfy the following properties:

  • The device is accessible by the IO channels of the hosts attached to it
  • The data in the device is accessible online for random access and sequential IO by the host requesting it
  • The media storing the data is non-volatile and not "removable"
  • The device has the ability to provide hardware redundancy for the media
  • The device has the ability to attach multiple operating system images to it.
  • The device has the ability to abstract the physical medium and present a logical storage device to the operating system, which appears as a physical device to the host.

So Tape is out, as are Zip drives and CD/DVD/Optical type media. Dual-ported SSA or SCSI drives are out too, as they can't do the abstraction thing. Virtualization engines like the IBM SVC and EMC Invista qualify. When FSSD becomes reality, that would fit too. Network attached storage is very interesting, but has very different characteristics for attachment to a host, so for this discussion, I'd like to stick to block protocols.

I think this should cover most controller-based arrays today - but I am open to any suggestions to tighten this up. The basic architecture for these arrays is usually a front-end, cache/processing units, and a back-end. The front-end has host connectivity (SCSI, ESCON, Fibre Channel, FICON etc). The cache/processing unit attempts to ensure that as many IO requests as possible are satisfied from cache. The back-end has direct connectivity to the media, usually SCSI or Fibre Channel disk drives (but possibly other technologies).

Anatomy of IO requests

What should a benchmark tell us? More importantly, is one benchmark enough?

Naive answer - the standard metrics: IOPS, MB/s, response time for a given configuration. Too naive - needs qualification.

Let's look at host IO requests, starting with a single-threaded IO driver. Workloads are characterized as reads or writes, random or sequential, with different IO sizes. The response of a storage system is dependent on the nature of the IO request.
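To make that characterization concrete, here is a small sketch of a single-threaded request generator described by just those three knobs: read/write mix, random or sequential, and IO size. The names Workload and io_stream, and the choice to wrap sequential access at the end of the device, are assumptions for illustration; the generator only produces the request stream and does not issue any IO.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    read_fraction: float   # 1.0 = all reads, 0.0 = all writes
    sequential: bool       # False = random access
    io_size: int           # bytes per request


def io_stream(workload, capacity, count, seed=0):
    """Yield (op, offset, size) tuples for a device of `capacity` bytes."""
    rng = random.Random(seed)
    blocks = capacity // workload.io_size
    if blocks == 0:
        raise ValueError("capacity is smaller than one IO")
    next_block = 0
    for _ in range(count):
        op = "read" if rng.random() < workload.read_fraction else "write"
        if workload.sequential:
            offset = next_block * workload.io_size
            next_block = (next_block + 1) % blocks  # wrap at end of device
        else:
            offset = rng.randrange(blocks) * workload.io_size
        yield op, offset, workload.io_size


# Example: five sequential 4 KiB reads over a 16 KiB device.
for request in io_stream(Workload(1.0, True, 4096), 16384, 5):
    print(request)
```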

Writes

In most storage arrays, writes are written to and acknowledged by the cache memory in the array. There are exceptions - if the array gets overwhelmed, it may pause to actually do the backend write. This manifests itself as delayed acknowledgment of the IO to the host. It may also bypass cache (in some systems) if a single component failure puts data integrity at risk, and commit the IO directly to the media. When the array is overwhelmed, the backend's ability to drain writes starts to become the limiting factor in throughput.

When an array gets overwhelmed, and how it reacts to it is obviously of much interest and often defines upper bounds to some of these metrics. Its failure modes are also of consequence in choosing a system. So writes are usually "hits" in cache - at memory speeds.

Reads

Reads may or may not need a backend IO, depending on whether the IO requested is already in cache or not.

The read IO request may be in cache, in which case the response is at memory speeds. Many arrays rely on a couple of factors for enhancing performance. First, they sense sequential IO request streams early and preemptively prefetch subsequent data and place it in cache. Secondly, they often rely on the "principle of locality" - that is, if a piece of data is accessed, chances are it will be used again soon, or data "near" it will be used. (Think database transactions - insert a row, update it a few times, commit the transaction. The same block is used again and again.) So large cache systems will retain recently used data blocks in cache for as long as possible, maximizing the probability that a read request will be a cache hit or a "read hit".

If the read is not in cache, backend IO is needed to bring it into the cache. This is the most "expensive" operation for an array as the backend is the slowest, usually mechanical, part of the configuration. This is referred to as a "read miss". Much work has been done in minimizing the time required to retrieve an IO request from media. Striping, aggressive prefetching, making use of multiple copies of data, smaller form factor drives, faster rotation of platters and the like have been used extensively to improve this.

Workloads where the read hits dominate are referred to as "cache friendly" and those where read misses dominate are called "cache-hostile". In the absence of any other component limiting throughput (like cache or processing power), cache hostile workloads are limited by backend media throughput.
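A toy simulation shows the difference between the two kinds of workload. The sketch below models the array cache as a plain LRU list of block numbers, which is a deliberate simplification (real arrays add prefetch and much smarter replacement), and the names read_hit_ratio and mean_response_ms are made up for this example. The second function is just the weighted average of hit and miss service times.

```python
import random
from collections import OrderedDict


def read_hit_ratio(block_requests, cache_blocks):
    """Replay read requests through an LRU cache; return the fraction of hits."""
    cache = OrderedDict()
    hits = 0
    total = 0
    for block in block_requests:
        total += 1
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            cache[block] = True            # read miss: stage block into cache
            if len(cache) > cache_blocks:
                cache.popitem(last=False)  # evict least recently used block
    return hits / total if total else 0.0


def mean_response_ms(hit_ratio, hit_ms, miss_ms):
    """Average read response time as a hit/miss weighted mean."""
    return hit_ratio * hit_ms + (1.0 - hit_ratio) * miss_ms


rng = random.Random(1)
# Cache friendly: a hot set of 500 blocks that fits in a 1000-block cache.
friendly = [rng.randrange(500) for _ in range(20000)]
# Cache hostile: uniform random reads over a million blocks.
hostile = [rng.randrange(1_000_000) for _ in range(20000)]
print(read_hit_ratio(friendly, 1000), read_hit_ratio(hostile, 1000))
```

With the hot set smaller than the cache, nearly every read after warm-up is a hit; with uniform access over a space far larger than the cache, almost every read is a miss and the backend sets the pace.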

What should we ask for

The request from the host is a random or sequential read/write of a specific size.

The response of a storage array comes in many forms. Write hits, write misses, read hits and read misses. For sequential and random access patterns. For different IO sizes.
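Written out as a test matrix, those dimensions multiply quickly. The short sketch below simply enumerates the combinations; the function name response_matrix and the two example IO sizes are placeholders, not a proposed standard.

```python
from itertools import product

OUTCOMES = ("read hit", "read miss", "write hit", "write miss")
PATTERNS = ("sequential", "random")


def response_matrix(io_sizes):
    """List every (outcome, pattern, io_size) cell a benchmark would need to cover."""
    return list(product(OUTCOMES, PATTERNS, io_sizes))


cells = response_matrix([4096, 65536])
print(len(cells))  # 4 outcomes x 2 patterns x 2 sizes = 16
```

Every extra IO size adds eight more cells, and that is before any mixing of these components within a single thread.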

Each of these responses is dependent on the architecture, electronic components and intellectual property of the specific vendors.

What makes realistic benchmarks hard to construct is that real workloads have

  • Threads with dynamic composites of these IO components
  • Many concurrent threads from each host
  • Many hosts sharing resources, including spindles, in the same array
  • Complex array based functions in operation (such as local and remote replication) while servicing IO requests
  • Different relative priorities of workloads from different hosts

A good set of benchmarks must be able to distinguish different vendor configurations with respect to their architecture, components and intellectual property investment. The ability to normalize different vendor results must be available for such a benchmark - i.e., to be useful, physical configurations must be comparable for results to be comparable between vendors. Standard IO drivers and workload simulators must be constructed and agreed upon. The drivers should be able to generate enough workload to run the arrays to saturation.

For example, if my system can sustain a certain response time under a specific workload, what if I was also doing a backup to disk on the same array? What is the performance degradation? What is the hit on throughput with synchronous replication active? How well does the array handle spikes in workload? What is the performance hit if I take a point-in-time copy of a large system? These are questions my customers ask me about performance. I answer them on a case by case basis for the DMX. Good benchmarks should be able to do this across vendors.

My criticism of the SPC benchmark, as it stands today, is that the posture it has adopted is that of a static, cache-hostile, single-host composite which by construction tests primarily backend components and is virtually agnostic to the architecture and intellectual property contributions of the vendors. For modern systems, it can only weakly distinguish between high-end and mid-range architectures. In both cases, it effectively counts the number of spindles in the tested systems. This has made SPC benchmarking a marketing circus for the vendors who participate.

Granted, construction of a multi-host benchmark with other array functions active is difficult before a single host benchmark is acceptable. I have ideas on how to improve this, but before I go there, I would love to hear any ideas on this.

Cheers,

.Connector.