NEWS & NOTES
FROM THE THOUGHT LEADERS OF DATA STORAGE
— Storage Horizons Blog —
What are the specific aspects that an array can and should have, to be efficient and TCO friendly? How does ISE meet and exceed these aspects by making the “whole greater than the sum of the parts”? I will deal with these questions across my next few blogs, but first consider the design of a storage product from the ground up. A storage array is a specialized computer system. It has a clear focus on data storage, but it’s also much more than that. A storage array has a few laws it must live by:
- It must protect data from at least a single failure
- It must never lose data after a power failure
- It must withstand a component failure that occurs as a result of a power failure (see the first law)
- Reads and writes should be expected, and be serviceable, at the proper duty cycle for the tier of storage. ISE, for example, is a Tier 0/1 device: its duty cycle should be 100%, meaning anything at any time, all the time, with low latency and high IOPS/throughput.
So, what makes up an array that meets these “laws” in such a way that it’s not just a small server or even a PC with a bunch of Band-Aids on top (or “perfume on a pig”!)?
Array Hardware and Its Effect on TCO
Given that a storage array typically has two controllers, aspects that make or break TCO include:
1. Are both controllers active at the same time, with access to the same data volumes? If not, the system is either active/passive or active only for some volumes on each controller, which causes availability and/or software reliability issues that drive up cost. An active/passive system will most likely throw heftier hardware at each controller, driving up power consumption, to make up for the performance lost in the normal case when both controllers are operating. Also, where active-active operation is not available within an array, multi-pathing driver software must be put into play, which adds complexity, sometimes costs extra money, and drives up the overall solution cost; either way, storage companies seek to recover development and support costs by hiding them inside high warranty costs.
2. Do both controllers have a communication link that has near zero latency? This makes a difference in case 1, above, when failover is to occur; but most importantly to solve issues with an application’s write workload with the lowest latency and overall cost. Mirroring of write data between controllers is the best method to ensure data integrity in the case of failure, and also for lowest latency across the widest range of host access patterns. True active-active operation with a dual controller array is possible when this communication link is fast enough. Not only does this allow for faster failover, in the event of a controller reboot or failure, but also additive performance to all volumes when both controllers are operational. In addition, servers no longer need special drivers to control multiple paths to the storage.
3. Related to case 2 is how the dynamic random access memory (DRAM) cache is used for writes and how it is protected. A good write-back cache can smooth out most application I/O “outliers” from the standpoint of overall access to the dataset for the application. A small amount of DRAM with non-volatility, as well as a very fast inter-controller communication link, allows for I/O latency to be reduced on the first order. Remember, DRAM is 1000x faster than SSD, which in turn is much faster than HDD (for random I/O). Using DRAM in the proper quantities can reduce TCO, but throwing a large amount at it without intelligence just drives up cost and power usage.
4. Good cache algorithms that can aggregate I/O, pre-fetch, perform full RAID stripe writes, atomic writes, parity caching, etc., all make for very cost-effective use of a small amount of DRAM fronting the back-end storage devices, to which I/O must ultimately be performed in the most efficient way possible for each back-end device type.
5. What kind of back-end device types should be considered? Nearline HDD (SATA or SAS), enterprise HDD (10K or 15K), SSD in drive or plug-in card form factor? It all depends on the mission of the array. If the mission is price/performance and TCO, then my mind goes to how to use the 10K HDD, along with MLC SSD for some applications, for the job. Nearline HDD has its place in very low-performance or sequential I/O environments, mainly backup and archive use cases, because its extremely low I/O density prevents efficient utilization of the full capacity behind these typically high-capacity drives. Remember, though, that low-cost, high-capacity drives have a different duty cycle than enterprise drives. For example, throwing multiple sequential workloads against high-capacity drives looks just like a random workload and will kill these drives prematurely, resulting in more service events, slower performance during long rebuilds, potential data loss, and sub-optimal performance.
6. Does the array have the ability to drive I/O to all attached capacity? This is a key metric in effective TCO versus the old adage of $/GB. If an array can utilize ALL the capacity under load, then efficiency drives down TCO. The ability to utilize all the capacity is a function of the data layout, effective utilization of back-end devices, and also how the caching and controller cooperation work. All of this can drive TCO way down or way up depending on how well it's done.
7. Does the array have a warranty greater than three years? If so, it's either because the technology reduces service events OR it's a sales tactic. If it's the former, then it truly drives TCO down as more storage is purchased. If it's not, then it's "pay me now or pay me later." Technology that provides for less service is based on a design for reliability and availability that goes far past just dealing with errors that occur in a system. It's a system approach, similar to Six Sigma, that reduces variation in the system, which reduces the chance of failure. In an array, that means how the devices are packaged, how the removable pieces are grouped together, and how the software can deal with potential faults in the system and keep the application running without loss of QoS. A system that can do this drives TCO down because customers don't have to design for failure, or in other words, design around the shortcomings of the array by over-provisioning (as many cloud vendors do). Many cloud providers have designed for failure with mass amounts of over-provisioned storage, n-way mirroring, etc. The industry has been trained around the shortcomings of array design and error recovery, so those that build their own datacenters just go for the cheapest design with the cheapest parts. In contrast, a storage system that really does provide magnitudes-greater reliability, availability, capacity utilization, and performance across that capacity can actually change this mindset. However, it takes belief that a design of this nature is possible . . . and it has been done with the ISE from X-IO.
8. Does the array provide real-time tiering that maintains a consistent I/O stream for multiple applications across the largest amount of capacity possible? An array that can effectively do this with the highest I/O and largest capacity, at the lowest cost, wins the TCO battle. Beware of marketing fear, uncertainty, and doubt (FUD) that sounds the same; the architecture and design of the product, as well as its results, are what matter.
9. Does an array add features that, under the right circumstances, reduce capacity footprint via de-dupe or compression? If so, I smell snake oil, because in most tier-1 applications, compression and de-dupe just drive up the cost of the controller while giving dubious results. On paper it might look good for $/GB, but other aspects, like space, power, and utilization, suffer. And if it's done with all SSD, in order to artificially claim the cost is less, all the worse.
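To make the cache arithmetic in points 3 and 4 concrete, here is a back-of-envelope sketch of why a small DRAM write-back cache reduces first-order latency. The latency figures are illustrative assumptions (roughly 100ns for DRAM, 100µs for SSD, 5ms for random HDD I/O), not measurements of any particular array:

```python
# Back-of-envelope model of how a small DRAM write-back cache reduces
# first-order I/O latency. All latency figures are illustrative assumptions.

DRAM_US = 0.1     # ~100 ns per access
SSD_US = 100.0    # ~100 us per random I/O
HDD_US = 5000.0   # ~5 ms per random I/O

def effective_latency_us(hit_rate: float, backend_us: float) -> float:
    """Average latency when a fraction hit_rate of I/Os is absorbed by DRAM."""
    return hit_rate * DRAM_US + (1.0 - hit_rate) * backend_us

# Even an 80% hit rate cuts the average random-HDD latency about five-fold:
print(effective_latency_us(0.80, HDD_US))  # ~1000 us vs 5000 us uncached
print(effective_latency_us(0.80, SSD_US))  # ~20 us vs 100 us uncached
```

Even a modest hit rate collapses average latency, which is why a small, intelligently managed cache beats simply throwing DRAM at the problem.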
Why am I harping on the way that arrays are designed? It is because all of this drives the TCO up or down based on architecture and methods used to drive up performance, capacity utilization, reliability, and availability . . . or NOT!
Most arrays today are very wasteful when it comes to the:
- amount of compute power inside the array
- amount of actual usable capacity
- overall reliability (or aversion to service events)
- availability of the array to the application
Also, adding features such as those noted above, as well as many kinds of replication, makes the performance of the array inconsistent, causing IT architects to over-provision their gear and "work around the SAN." SANs got a bad name for bloated, framed architectures with big iron, big license fees for every feature on the planet, poor performance, poor reliability, poor capacity utilization, etc., etc., etc. . . . A SAN was originally meant to simply put storage on a private network that servers could share. Oh, how things get polluted over time when vendor greed takes over.
As noted before, putting the right amount of compute, against the right amount of storage, will drive costs down in power, space, and application efficiency.
Most arrays also have the mindset of “when in doubt, throw it out” when it comes to replaceable components within the system, also known as Field Replaceable Units (FRUs). This leads to more service events, higher warranty costs, as well as potential and real performance loss at the application, and even down time.
What Makes ISE Tick?
X-IO is now in its second generation of ISE, a balanced storage system that breaks all the molds of the traditional storage system. Unique aspects of ISE and its second generation are:
1. All the things the first-generation ISE already solves, including two to three times the I/O per HDD of any other array manufacturer.
2. Dual super-capacitor subsystems, able to hold up both controllers for up to 8 minutes, long enough to flush the mirrored write-back cache on both controllers to a small SSD on each controller. This ENDS the reliance on batteries or a UPS to either hold up cache or hold up the entire array while write-back cache is written out to a set of log disks. Reliability goes up exponentially over a battery-backed design, which was already good; it not only keeps the price the same, but also makes the data readily available for server usage when power comes back on. (Note: Two super-caps are in each ISE, but only one is necessary for hold-up. Two are provided for high availability and no single point of failure.)
3. Reliability that is increased tenfold over the first-generation ISE for the back-end devices in datapacs using the new Hyper ISE 7-series (with additional groupings of HDDs). This extends ISE's hallmark of deferring service, and is backed by the 5-year hardware warranty that X-IO extends to all its ISE systems.
4. Unique Performance Tiering in the Hyper ISE hybrid that allows for full use of the HDD capacity with a small % of SSD. The new 7-series extends this capability, with varying capacities of the Hyper ISE, as well as SSD capacity for application acceleration.
5. No features that are not necessary for application performance. ISE does NOT do de-duplication, as it's not necessary if the application does it (which most do); moreover, since we are the only company in the world that allows for full utilization of the storage purchased, de-duplication/compression is relegated to where it belongs: data at rest, NOT tier-1 storage. Furthermore, features like thin provisioning are not necessary, as mainline OSes such as Windows and Linux, let alone VMware, allow for proper growing and shrinking of volumes, which ISE does support.
Read more on http://stevesicola.com.
— Storage Horizons Blog —
When it comes to storage systems, the cost to build the product and the subsequent acquisition cost are only two aspects of the overall cost of owning and operating the storage. The $/GB argument no longer holds up as the only important point in storage, because the enterprise and this world demand much, much more. Price/performance is very important, but other aspects, in this day and age, play equal roles in most cases. Aspects like how the storage array is designed, how much capacity can be utilized (e.g., getting I/Os to it), and how the software is layered on it make all the difference in the world to the total cost of operation (TCO) of storage. Many aspects make up a good array that provides performance, reliability, availability, and capacity utilization; what matters is not any one specific aspect, but how they play together. It's all about making "the whole greater than the sum of its parts." Environmental aspects make up a huge part of the cost of owning and managing storage in a datacenter and are many times "invisible" costs because of departmental silos.
The aspects to consider in a storage system, today, when buying and then owning the system are:
- Cost of acquisition
- Cost of warranty service
- Cost of power
- Cost of space
- Cost of features with licenses, etc.
- Cost for managing and attaching the storage to the system/application
How the array is designed—from mechanicals and electronics to the software that runs it all—plays a key role to drive TCO up or down.
When I consider building storage, I look at what gives the biggest bang for the buck when it comes to performance, at the lowest cost. I also look at reliability, availability, and the usable capacity. Stan Zaffos of Gartner coined the term, “Available Performance,” which seems to sum it up pretty well. Can you make the storage available, all the time, with a consistent amount of performance? That ties together price/performance, reliability, availability, and usable capacity.
TCO is not just about $ per GB anymore, nor has it been for some time, but many storage companies still seem to focus on it. Then there are others that now seem to focus only on $ per I/O, which is like fishing with dynamite when using all RAM or SSD! It's also NOT about putting every feature on the planet inside the storage system, because today most applications provide features that obviate the need for features within the array. Focusing on the wrong things drives TCO up, not down. Our online whitepaper about common mistakes made in storage purchases, "How to Minimize Data Storage Costs and Avoid Expensive Mistakes," puts this all in a business perspective. Putting all of the data management/protection features inside the storage reduces the scalability of the storage, locks customers to a vendor, and also drives down the efficiency of the storage, in terms of consistent performance and capacity utilization, let alone reliability and availability.
When considering the build of a storage array, I look at multiple factors:
- Processor Speed and Capability: If a processor can have speed, as well as RAID acceleration, without the need of having multiple additional components or custom chips, it is the winner. New x64 processors, from Intel’s Jasper Forest to the new Sandy Bridge, provide that capability. Choosing the right processor is important, because too many times, the processor that is recommended is more than what is necessary and this drives up power costs, needlessly.
- Memory Capability: Dynamic random-access memory (DRAM) is still the fastest. It's 1000x the speed of flash, but of course it's much more expensive. Using the right amount, for the job of buffering and caching, is a key to cost containment, as well as array efficiency.
- Write-back Cache: This feature is amazingly effective, if the algorithms used, smooth out the accesses to the back-end devices, whether they are HDD or SSD.
- Non-volatility and Mirrored Cache: This feature, for most applications, is a key point when it makes a storage subsystem appear to be faster than it really is. It also provides for data integrity and availability in first- and second-order benefits.
- Back-end Storage Device Choice (Enterprise HDD, Nearline/High Cap HDD, and SSD): Each of these choices has ramifications to all aspects of the array from cost, reliability, performance, and availability.
- Storage Tiering: Tiering has been around for a long time. It was initially coined as Hierarchical Storage Management (HSM), then Life Cycle Management, etc., and now tiering. But tiering can be different, depending on what the goal is. Is it the performance or some all-in-one desire to have tier-one storage mixed with tier-x storage? Is it within the array or across arrays?
- Design for Reliability and Availability: These are subtly different and relate to things, such as how many different pieces there are to the solution, and how the intelligent parts of the array allow for availability, in the event of failures (fault tolerance). Packaging of the devices and the different components—without cables and with fewer replaceable components—are keys to driving up reliability and availability, as well as driving TCO down. In the end, design for reliability is all about reducing service events that affect the storage consumer, in one way or another, while availability is all about making sure the storage is available for access, all the time.
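One concrete payoff of a write-back cache with good algorithms (the aggregation, parity caching, and full-stripe writes discussed above) can be sketched with textbook RAID-5 I/O counts. The stripe width of seven data drives below is an illustrative assumption:

```python
# Disk I/O cost of one logical write to a RAID-5 set, textbook model.
# data_drives is the number of data drives per stripe (plus one parity
# drive); the 7-data-drive stripe used below is an illustrative assumption.

def raid5_ios(data_drives: int, chunks_written: int) -> int:
    """Disk I/Os needed to commit a write of chunks_written stripe chunks."""
    if chunks_written >= data_drives:
        # Full-stripe write: parity is computed from the new data alone,
        # so we just write every data chunk plus the new parity.
        return data_drives + 1
    # Partial-stripe write: read old data and old parity, then write
    # new data and new parity (the classic read-modify-write penalty).
    return 2 * chunks_written + 2

print(raid5_ios(data_drives=7, chunks_written=1))  # 4 (read-modify-write)
print(raid5_ios(data_drives=7, chunks_written=7))  # 8 (full-stripe write)
```

Seven un-coalesced small writes would cost 28 disk I/Os; coalesced by the cache into one full stripe, they cost 8. That is the kind of efficiency a small, well-managed cache buys.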
The world of computing is complicated enough. We do not need to see so many start-ups confusing the basics of computing and storage with statements like “SSD for the cost of HDD,” or “one tier for all,” or “automatic QoS,” or “no caching,” even to the extreme of “No more HDDs.” It’s all a game to try and sell people on how price/performance could be, not what it SHOULD be!
Basically, architecture is everything. Brute force only works so far, by "hiding the cheese" and adding features that mask the overall cost with dubious claims about cost savings (via de-dupe and compression). They are like the wares of the old "snake oil" salesmen of the 19th century. So what are the aspects of a storage array that really make a difference?
Read more on http://stevesicola.com.
There are titanic shifts occurring in data storage requirements today, often resulting in buyers making expensive storage purchasing mistakes. The most significant disruption to traditional storage thinking is a new problem brought about by the appeal of all-SSD systems: the over-provisioning of performance (IOPS) in order to achieve the proper capacity (TBs). At the same time, we are faced with the opposite problem of legacy storage, that is, the over-provisioning of capacity (TB) in order to have the performance required. Fortunately, with today's technology, especially systems that combine the best of SSD and HDD, it is possible to find the balance, leading to outstanding financial and operational results. As I meet with our customers and prospects, I have noticed how the strong appeal of all-flash solutions, offering millions of IOPS, has caused them to settle on a very high $/GB solution when they simply require only 50,000 low-latency IOPS, which could be delivered for half the cost. There is considerable capital savings to be realized by sizing the IOPS and GBs that are required and buying the storage that matches the need. The following paper outlines a structure for thinking about data storage and how to avoid those expensive mistakes.
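The two over-provisioning traps can be sketched with a toy sizing model. The per-device figures below (TB and IOPS per 10K HDD and per SSD) are rough planning assumptions for illustration, not vendor specifications; the workload is the 50,000 low-latency IOPS example above, with an assumed 20TB capacity need:

```python
import math

# Toy sizing model: the device count is driven by whichever dimension,
# capacity (TB) or performance (IOPS), requires more devices. Per-device
# figures are rough planning assumptions, not vendor specifications.

def devices_needed(req_tb: float, req_iops: float,
                   dev_tb: float, dev_iops: float) -> int:
    return max(math.ceil(req_tb / dev_tb), math.ceil(req_iops / dev_iops))

REQ_TB, REQ_IOPS = 20, 50_000  # assumed workload: 20 TB, 50,000 IOPS

hdd_only = devices_needed(REQ_TB, REQ_IOPS, dev_tb=0.9, dev_iops=150)
ssd_only = devices_needed(REQ_TB, REQ_IOPS, dev_tb=1.0, dev_iops=50_000)
print(hdd_only)  # 334 -> IOPS-bound: capacity is over-provisioned ~15x
print(ssd_only)  # 20  -> capacity-bound: IOPS are over-provisioned 20x
```

Under these assumptions, an all-HDD build is IOPS-bound and strands most of its capacity, while an all-SSD build is capacity-bound and strands most of its IOPS; sizing both dimensions to the actual need is where the savings come from.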
— Storage Horizons Blog —
As I began to go down the path to describe the "50 Shades of Storage" in my last blog, I noticed a blog from David Black (see http://www.blackliszt.com/2013/03/in-storage-there-is-x-io-and-then-there-are-all-the-others.html). He talks about X-IO in a unique way, which is ripe for me to discuss as we delve into the storage products that work well—and those that don't work well—in today's world of Cloud and IT datacenters, in general. What I mean by "work and don't work" speaks to fundamentals like reliability, availability, capacity utilization, and performance—all key metrics in the mission to drive overall costs down, not just the acquisition cost.
David makes the point that “In storage, there is X-IO, and there are all the others . . . ” It is true because the X-IO ISE and open storage management with RESTful web services, application integration, etc., are the perfect complements for what has transpired in the industry. It is no longer about what your array can do for your system. It is what the system can already do with the fundamentals of a good array!
The big players, as David puts it, are just turning the crank, assuming that everyone still wants a general-purpose array that has every feature on the planet in there. As a matter of fact, most of the start-ups are doing this as well. These products work, but in a world that has seen the maturation of Windows and Linux, as well as all the applications and virtualization software, the NEED for features in the array is JUST GOING AWAY!
David’s note that “ISE is a simple building block” is to the point and very much what I was after when ISE was being developed at Seagate. The goal was to make a building block of storage that reduced the Total Cost of Operation (TCO), so significantly, that service would be the exception versus the norm. Performance is effectively delivered with efficiencies and consistency, availability is almost at five 9s, and all the capacity can be used instead of being stranded like most arrays.
When ISE is compared to the others within the industry, X-IO comes up short in some people’s minds because it does NOT have the features that storage analysts look for. But when one looks at the system (like David’s blog talks about the Cloud), features do NOT matter. A normal Cloud uses mass replication (RAID is not dead, it’s just RAID-1 on steroids!). ISE can eliminate the extra copies while making the overall solution work better and have a much longer, useful life.
The models of Cloud that are either private or new Clouds that are designed to run enterprise applications, need to do so with as low of TCO as possible, in order to make money. That’s why ISE is so important, in this world, because it raises the bar in all the pertinent areas: capacity utilization, reliability, availability, and performance. For backup or content retrieval, some of these points are not as important; but for virtualization, database/business intelligence, and VDI in the Cloud or private datacenter, ALL these points are important in order to make money.
I highly recommend that my readers take a look at David Black's blog. It is not only spot-on about the point we make at X-IO, it speaks to the lack of system awareness in the world. Without system awareness, redundant features—in operating systems/applications and storage—are everywhere, and the efficiency of datacenter operations falls well short of goals set by CIOs and CFOs, let alone Cloud providers that want to make money by selling Cloud-based services other than backup.
X-IO is focused on TCO, from the ground up, by working on the fundamentals of storage. This “50 Shades of Storage” blog series will clearly define those things that really do matter and those which are, well . . . just SPIN! ISE is built from all commodity parts and is just put together in a different way to make the “whole greater than the sum of the parts,” as well as taking storage to its rightful place—the trustworthy depository of customer data!
Read more on http://stevesicola.com.
By Gavin McLaughlin, Solutions Development Director at X-IO
It’s that time again in the storage industry. Every few years a disruptive technology causes VCs to get their wallets out and early adopters to over-spend while, thankfully, most sensible data centre architects sit, watch and wait to see what really happens.
The trend went this way a few years ago with ‘cheap and deep’ SATA drives while this year it’s heading the same way with all-flash arrays.
With SATA, after many years stuck in the cycle of replacing three-year-old 15K RPM FC drives with newer 15K RPM drives, lower-cost options for wide-striping more cost-effective SATA drives generated some excitement.
For many organisations this led to an imbalance – low capital costs versus ridiculously high operating costs and, in some cases, business disruption. This is often the case when architects and designers look to promote the use of a trending new tool as the next big solution.
There’s no doubt SATA drives have a place in the data centre, but they are not suitable for primary use where risk and performance are critical. They’re great for backup and archive where the performance hit of RAID-6 can be tolerated and businesses can handle downtime, but other than that, it’s buyer beware. All too often we’ve seen key workloads such as OLTP apps and VDI sitting on SATA where reliability or growth issues have seen systems grind to a standstill and cause serious impact to the customers’ business.
Fast forward to today and a multitude of new vendors are shouting from the rooftops that flash is our true saviour of the storage universe. Among other things, I’ve heard outlandish claims that all flash arrays can do everything from run your OLTP apps 4,000% faster to saving $4m per annum in power costs.
Sounds impressive, but are they right or is this just another extreme example of a tool being positioned as a business solution?
Like any new media technology, flash has its place in the storage world given the appropriate use case but hard drives are far from dead in the water. This is despite some really tacky gravestone pictures doing the rounds out there. When you think about it, for how long has everyone been saying tape was dead?
The reality is that flash is a great tool to help solve storage issues, but it's not a one-size-fits-all solution in itself.
There are many positives about flash arrays, but there are also associated negatives. The major one is price, unless you're prepared to put your business at risk by using cheap, consumer-grade flash-based products with short duty cycles. Few businesses are likely to do this.
The second issue is power and heat, something that may surprise those fresh to the flash debate. In one instance we've heard a vendor claim that its flash modules use only five watts per TB of storage, giving the impression that a 10TB array requires a miserly-but-impressive 50W power supply. After digging just a little deeper, it appears this vendor neglected to mention that the processors needed to run the unit and deal with some of the background issues – such as write endurance and garbage collection – all need power too. And so does the on-board memory. In fact, when you tot it all up, the real answer is that you need 2kW of power for 10TB. That's 40 times more than you might originally be led to believe.
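The arithmetic behind that "40 times" figure is worth spelling out, using the numbers quoted above:

```python
# Sanity-checking the marketing math described above: "5 W per TB" implies
# 50 W for 10 TB of flash modules, but the whole unit (controller CPUs,
# DRAM, garbage collection, etc.) draws roughly 2 kW.

CLAIMED_W_PER_TB = 5
CAPACITY_TB = 10
WHOLE_UNIT_W = 2000  # ~2 kW for the complete array

implied_w = CLAIMED_W_PER_TB * CAPACITY_TB  # 50 W, the impression given
print(WHOLE_UNIT_W / implied_w)  # 40.0 -> forty times the implied figure
```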
And it's not just one flash-only array vendor pushing the power angle, it's pretty much all of them. It's hard to blame them. The space is super hot – EMC reportedly bought XtremIO in 2012 for close to a half billion bucks without having shipped a system.
But, from the user’s perspective, it’s essential to keep a cool head and pragmatic perspective when it comes to deploying flash.
The common myth that enterprise flash storage uses less power than hard drive arrays may have been true of some of the old-school 15K RPM Fibre drive units, but it’s not the case with new hybrid arrays. Hybrid arrays help overcome power consumption issues by melding different media technologies together, including flash, and playing to the strengths of each of them.
In a recent real-world customer test, we saw one hybrid 10TB unit sit alongside two other flash vendors. It gave the same performance – but used less than half the power and put out less than half the heat. Most crucially for anyone with an IT budget based in the real world, the hybrid solution cost approximately 70% less than the flash-only alternatives.
Even if that’s not enough to reconsider your flash strategy, the other aspect to consider is length of warranty offered. Flash-only warranties typically tend to last one year as standard. Hybrid arrays offer five-year warranties as standard, which demonstrates a far higher degree of confidence.
A five-year warranty ensures you’re buying into an architecture that can provide crucial uptime to the business. Hybrid can deliver this thanks to its non-reliance on a single media type along with, of course, some clever software stuff that self-heals hard drives.
With flash grabbing all the headlines, here are a few tips how not to lose your head. When considering flash-only arrays, ask yourself:
- Do you need 1,000,000 IOPS in 10TB of storage? Really?
- Do you need 300µs latency? Really?
- Can you justify to your boss a £300,000 outlay for something that comes with a one-year return-to-base warranty? Really?
For the majority of users today, for now, it’s probably best to use flash where it’s appropriate and as a tool rather than a solution. Of all the trends we’ve witnessed in storage the one thing we probably all agree on is how to avoid ‘overspend on a storage trend’. Opt for tried and tested, rather than choose trial and error. This approach should be plenty good enough for most of us.
— Storage Horizons Blog —
HDD, Hybrid HDD/SSD, Or All SSD Storage: So Many “Shades Of Grey” That It Is No Wonder People Are Confused!
There is a significant amount of hype out there around all SSD solutions, in the marketplace today, with the likes of Violin, Pure Storage, Nimbus, etc. There is also a lot of hype about some start-ups that put SSD in as cache, along with SATA drives, to be an “all-in-one” box that includes every feature on the planet. But in the end, when building a storage array, then using it, it comes down to price performance and how people think about storage with respect to the rest of the computing application.
While hybrids offer a hedge against decreased quality of service (when heavy usage of applications occur), and all SSD solutions seem to be overkill because of the cost differences between flash and HDD capacity (except for the most demanding of applications), in this day and age, good HDD solutions still fit the bill for most applications. After all, what is the center of the universe—storage or the system/application? (Hint: It is the system/application!)
Applications such as a database, virtualization with multiple applications, VDI, and many others have “signatures” with various types of I/O that make up the “signatures.” They range from sequential read/write, localized random I/O (tight random across a relatively small range of capacity), to very un-localized random I/O (for application metadata or seldom touched pieces of data in the app). These applications have not changed in years and can be tuned to do more or less of each type of I/O, in some instances. However, to use just SATA HDD or just SSD technology to cover these applications seems folly. Anyone can improve capacity by using lots of HDDs to get enough performance to drive an application—while on the other hand, lots of performance can be thrown at an application with lower capacity and at higher costs. The optimum answer is somewhere in the middle. There is a good reason why enterprise HDDs are still used, from 7.2K, 10K, to 15K RPM drives, as well as the reason that enterprise SSD exists, in drive form or a plug-in card. The point is to use them for the right workload and in the right mix.
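To illustrate that "somewhere in the middle" numerically, here is a hedged sketch of a hybrid that fronts 10K HDDs with a small SSD tier absorbing the localized "hot" I/O. All $/TB and IOPS/TB figures are rough assumptions for illustration only, not quotes for any real product:

```python
# Toy model of a hybrid array: a small SSD tier absorbs the "hot" fraction
# of the I/O while 10K HDDs hold the bulk of the capacity. All $/TB and
# IOPS/TB figures are rough assumptions for illustration only.

HDD_IOPS_PER_TB, HDD_USD_PER_TB = 150.0, 100.0
SSD_IOPS_PER_TB, SSD_USD_PER_TB = 30_000.0, 1_000.0

def hybrid_profile(total_tb: float, ssd_frac: float, hot_frac: float):
    """Return (cost, peak IOPS) when hot_frac of all I/O lands on the SSD tier."""
    ssd_tb = total_tb * ssd_frac
    hdd_tb = total_tb - ssd_tb
    cost = ssd_tb * SSD_USD_PER_TB + hdd_tb * HDD_USD_PER_TB
    # Peak throughput is capped by whichever tier saturates first.
    ssd_cap = ssd_tb * SSD_IOPS_PER_TB
    hdd_cap = hdd_tb * HDD_IOPS_PER_TB
    ssd_limit = ssd_cap / hot_frac if hot_frac > 0 else float("inf")
    hdd_limit = hdd_cap / (1 - hot_frac) if hot_frac < 1 else float("inf")
    return cost, min(ssd_limit, hdd_limit)

print(hybrid_profile(20, 0.00, 0.0))  # all-HDD baseline: ~$2,000 / ~3,000 IOPS
print(hybrid_profile(20, 0.05, 0.8))  # 5% SSD tier:      ~$2,900 / ~14,250 IOPS
```

Under these assumptions, spending about 45% more than an all-HDD build buys nearly five times the deliverable IOPS across the full capacity, which is exactly the price/performance middle ground argued for above.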
My conclusion is that a good hybrid storage system, mixing HDD with SSD, provides the best price/performance, as well as the lowest total cost of ownership (TCO). It also provides the most predictable and consistent I/O for applications, regardless of the application, if done right—and it will work across the entire purchased capacity.
However, if someone wants to buy a “SAN in a CAN” like most of the other hybrids that basically put consumer-grade flash in with high capacity and zero I/O drives (along with every feature on the planet), that’s an SMB play, not an enterprise play. These kinds of boxes are very much like large SAN arrays (Ethernet or Fibre Channel) and include features that the applications and operating systems already have, as well.
Then there are the all-SSD arrays. We liken them to "fishing with dynamite"; in some workloads they are very necessary, just not for the broader market of today. The variation among these vendors, and among their features, gives me pause. Once again, the actual designs and implementations of these arrays range from scary to insane, adding features to validate their existence and claiming to match the price of disk through subjective de-dupe and compression capabilities. Once again, many shades of grey, with features that already exist within operating systems and applications.
This new series of blogs, starting in April 2013, will give some insight into hybrids and all-flash arrays, and why ISE, with its linear scalability and matched storage management for IT and Cloud, as well as incredible performance and TCO, should be the product that all others on the market are compared against first.
— Storage Horizons Blog —
What’s a cloud? To me, it is an automated datacenter that can provide compute, application, and backup services. It can be a private cloud for a company, or it can be public where the cloud provider wants to make money on the services it provides.
In both private and public clouds, efficiency is the key to a successful and profitable business. From the number of humans needed to administer it, to the efficiency of compute, infrastructure, and storage—it is all about the money. In old terms, it's about total cost of ownership (TCO). Somehow fear, uncertainty, and doubt (FUD) have overshadowed this most important metric of all, and it's time it made a comeback.
TCO is NOT just about how much it costs to put something in the cloud operation. Sure, that's a component, but what about how much it can do for the cloud provider from the standpoint of work units per hour, power, space, and cooling costs? After these basics, there are the big metrics in computing like availability and reliability (how much service does it need); and for storage, it's about capacity utilization of what was purchased. The last aspect is how much is charged for the service contracts on each technical component of the system. It is amazing that this probably drives 20-40% of the costs for a cloud, as well as for any datacenter. The three-year life cycle is artificial and drives pure money for the vendors through forklift upgrades, when components can and do last longer. Why can't we just make stuff that is reliable and requires little service, like ISE with its five-year hardware warranty? We put a man on the moon over 40 years ago, so why can't we make products that just work, like ISE? The answer lies partly in the desire for service revenues, as well as the lack of innovation.
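That 20-40% figure is easy to sanity-check with a toy model. The purchase price and annual service rate below are hypothetical placeholders, chosen only to show the shape of the math:

```python
def service_share(acquisition, annual_service_rate, years):
    # Fraction of total spend consumed by the service contract alone
    # (ignoring power, space, and cooling for simplicity).
    service = acquisition * annual_service_rate * years
    return service / (acquisition + service)

# A hypothetical 15%-per-year contract over a three-year life cycle:
share = service_share(100_000, 0.15, 3)
print(f"{share:.0%}")  # lands inside the 20-40% band
```

Stretch the same hardware to five years under an included warranty and that whole service slice, plus one forklift upgrade, disappears from the bill.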
So much noise has been made about $/GB over the years that it seems the entire world has been brainwashed into thinking SATA drives are the panacea for storage. Recently, the big hype has been around flash storage, whether in some type of card installed in a computer, in SSDs, or in a big box of flash. That's going the other way on $/GB, even though it is as fast as can be.
Both SATA and flash today have a TCO that is higher than people think. They have good aspects, but they also have bad aspects that bring their TCO down to earth. SATA drives may be cheap, but they are unreliable and have almost no performance, and when you push them hard, they die quicker. SATA drives are also unable to access all of their capacity in any reasonable time, which means that for anything but backup or content they are a bad choice for any real application, because your capacity utilization is well below 50%. It is stranded capacity that is wasted, and it causes people to buy more storage, waste more storage, waste more space, and waste more power! So for those who just buy more drives, power, space, and cooling become issues, as do drive failure rates, repair frequency, and the potential for data loss. For those in the cloud who then just say, "Buy many of them and have many copies of the data," the capacity utilization and effective $/GB continue to dwindle while all the other cost metrics soar as well.
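The capacity-utilization point can be made concrete with a small sketch of effective $/GB. The prices and utilization fractions below are illustrative assumptions, not quotes:

```python
def effective_cost_per_gb(list_price_per_gb, usable_fraction, copies=1):
    # Dollars per gigabyte you can actually use at acceptable performance,
    # including any extra copies kept for protection.
    return list_price_per_gb * copies / usable_fraction

cheap_sata = effective_cost_per_gb(0.10, 0.40)                 # under 50% usable
replicated_sata = effective_cost_per_gb(0.10, 0.40, copies=3)  # "just keep copies"
enterprise = effective_cost_per_gb(0.30, 1.00)                 # fully usable
print(cheap_sata, replicated_sata, enterprise)
```

Once the below-50% utilization and the extra replicas are priced in, the "cheap" drive is no longer cheap, and that is before power, space, and repair events are counted.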
For all-SSD solutions today, the $/GB is prohibitive. I love flash as an option for the future, along with other up-and-coming non-volatile storage options. However, most device makers currently using flash waste a lot of space because of the lifetime and failure modes of flash. Today, when you add this up, it increases the $/GB. Stay tuned, because if prices were to come down significantly relative to enterprise HDD, things could change quickly, but all other aspects of TCO must still be addressed no matter what the storage type. SSD in ISE is already here with Hyper ISE—it is the most efficient use of SSD with HDD in the industry. When SSDs are more cost-effective, ISE will be ready for them. So TCO is all about the cost of storage procurement, the cost of service contracts, as well as power, cooling, capacity utilization, overall reliability (the need for repair events), availability to applications, and performance. Those all-SSD suppliers that use de-dupe and compression to market a lower $/GB really add power and cost to the solution, with a subjective amount of capacity savings vs. the 100% capacity utilization that ISE provides.
ISE density today allows 40 2.5” devices to be housed in every 3U of rack space. While it is possibly not the densest packaging, ISE is all about making sure that density does not cause reliability problems with vibration and heat. Remember, environmental factors are key metrics for all storage devices, whether HDD or SSD.
By design, ISE experiences 100 times fewer service events than other enterprise arrays because of its technology. My team (some of whom have been with me for 10-30 years) and I developed it while working at Seagate between 2002 and 2007. This was proven in its first generation, first shipped in 2008 and now running in most countries of the world. ISE also has unprecedented availability, with straightforward test metrics from the ISE architecture that allow the software to be adequately tested. Architecture wins in this world, even though brute force can hide the facts for a while.
ISE performance has been shown, in industry benchmarks, to lead in efficient application performance. ISE gets three to four times the I/O out of an HDD, vs. all competition, and with Hyper-ISE, a unique fusion of HDD and SSD (NOT flash cache), the applications just scream. Recent Redknee and Temenos benchmarks, for mobile billing and banking applications that use Microsoft SQL 2012, show that ISE is 25 times more efficient than traditional enterprise storage for applications. This is an incredible savings of space and power, along with the fact that ISE allows full capacity utilization while maintaining full performance.
ISE was built with cloud in mind, first and foremost, when a cloud was basically an automated or autonomic datacenter. We have done all the work with many datacenters, and the numbers are clear about ISE efficiency for enterprise applications against traditional storage:
- ISE reduces space costs by 3X
- ISE reduces power costs in datacenters by 10X
- ISE reduces service events by 100X with proven ISE packaging and self-healing technology
In this world, where every CFO looks to costs and human resources for savings in clouds and datacenters, ISE should be considered by all. X-IO is shown by Gartner to be the ONLY innovator in storage, and we are executing on this vision for all customers.
In previous blogs, I’ve written about the myths and truths of storage. I’ve written about how frequently drives fail, or look like they fail, and cause service events for the rest of the industry. I’ve written about performance and its need to cover all of the capacity purchased for enterprise applications. I’ve written about availability and how it relates to efficient use, 24-hours-per-day, for datacenter applications. Space and cooling also played a part in my writings, because if you can use what you buy, it takes less space and less cooling to run your datacenter.
All of these essays apply to Cloud datacenters, whether private or public. ISE is all about enterprise applications and private clouds, as well as new “application” public clouds. It is purpose-built to be efficient, easy-to-use and last five years or more. There is nothing like ISE in today’s industry, regardless of marketing and FUD by the competition. ISE is needed because it is the most efficient storage on the planet—and that’s about TCO.
Read more on http://stevesicola.com.
— Storage Horizons Blog —
Another recent myth, the third in the Five Myths and Half-Truths series, is that RAID is dead or that RAID-6 is the answer to everything. Where did doing the math go? What about the economics of wastage, or doing the right thing for the right reasons? From distributed file systems with copies of files in multiple places, it's nice to see that 1980s RAID-1 server replication worked.
It's just that with this method, the work is spread to the network, and in general the cheapest SATA drives are used in these environments. Compare this to replicated RAID-5 sets, which the more disaster-tolerant-minded would deploy not only more economically, but with more predictable performance and availability.
The other part is the claim that RAID-6 is the answer. RAID-6 helps cover two drive failures, but has anyone actually calculated the failure and repair rates to see what is necessary where, before making a disaster-tolerant copy elsewhere? I believe in the right RAID level for the right UER (uncorrectable error rate) of the drive or drives involved in the data protection scheme. This is modeled together with the repair rate, which can vary and can force some arrays into RAID-6 depending on the speed of their software and/or hardware design. Performance suffers as the protection level increases, and so does $/IO/GB, or I/O density, the key metric in my mind for application performance.
The bottom-line answer is to use the right data protection technique for the desired UER or MTBDL (mean time between data loss). Combinations of RAID-1, RAID-5, or even RAID-6 can be employed, but all have costs which must be factored into each IT decision. ISE uses RAID-1 and RAID-5, along with ISE Mirroring, to provide proper redundancy for enterprise HDDs as well as extended data protection for those customers who require it for disasters, etc.
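For readers who want the math behind the repair-rate argument, here is a sketch using the standard Markov-model approximations for mean time to data loss. It deliberately ignores unrecoverable read errors during rebuild, which only make the picture worse, and the drive counts and repair times are illustrative:

```python
def mttdl_raid5(n, mtbf_h, mttr_h):
    # Single parity: data is lost when a second drive in the n-drive set
    # fails before the first repair completes.
    return mtbf_h ** 2 / (n * (n - 1) * mttr_h)

def mttdl_raid6(n, mtbf_h, mttr_h):
    # Dual parity survives two failures; a third failure within two
    # overlapping repair windows loses data.
    return mtbf_h ** 3 / (n * (n - 1) * (n - 2) * mttr_h ** 2)

# Eight drives rated at 1M hours MTBF: slowing repair tenfold cuts
# RAID-5 MTTDL tenfold and RAID-6 MTTDL a hundredfold, so repair rate
# matters as much as the extra parity drive.
fast_repair = mttdl_raid5(8, 1e6, mttr_h=24)
slow_repair = mttdl_raid5(8, 1e6, mttr_h=240)
```

This is why the protection level cannot be chosen in isolation: an array with fast, well-designed rebuilds can meet an MTBDL target with RAID-5 that a slow one needs RAID-6 (and the associated performance penalty) to reach.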
Another half-truth is that SANs are basically "available." Availability is different from reliability, because it has nothing to do with the repair rate of the SAN. Availability is about keeping storage access to the servers maintained without disruption. The problem here is that the answer is "kinda"…
Availability means "always on" and accessible. Without it, server applications can't run and the entire business stops. I call these severity "1" issues. To avoid them, many aspects of the day-to-day operation of storage systems must be examined. Even though some storage systems perform "ok" at "always on," various forms of "reduced availability" occur based upon weaknesses in storage systems.
Most SAN arrays (IP or FC) have their weaknesses, and they vary from company to company as to which weakness(es) they have. They range across:
- Back-End device weakness
- Caching Weakness
- Failover/Recovery Weakness
- Code Efficiency/Open Source weakness
- Maintenance Weakness as related to risk and loss of availability
Back-end device weaknesses manifest as "slow" drive performance, limiting access to storage for extended periods of time and causing dissatisfaction among application owners. This is typically due to a lack of attention in the storage software to back-end control and observation of the back-end infrastructure, as well as the attached storage devices. The infrastructure today is rapidly moving to SAS (Serial Attached SCSI), which is very similar to a Fibre Channel network, albeit reduced in function. SAS is very powerful but requires knowledge of how to handle networks, even on the back-end, and even with a small number of devices. Bus resets, accessing of enclosure information, etc. must be done properly in order not to get in the way of normal I/O. The other aspect of back-end weakness comes in dealing with the storage devices themselves, as getting data on and off the devices efficiently either stalls applications or makes them fly. Brute force can work with an overkill pile of flash or DRAM, but money talks here… what price glory?
Caching weaknesses in SANs still plague users, with the only indication being how much they have to pay to get performance. Efficient caching methods can be overwhelmed by having too much cache, with searching dominating the workload. It seems amazing that the caches in many SANs are huge when they don't really need to be. Efficiency is key here once again, as caching is all about knowing "when to hold 'em" and "when to fold 'em", as in cache it or flush it. "Sensing" knowledge of the data is key to continuously available performance for an application. Another aspect of cache weakness is the method(s) by which cache mirroring, to enable safe write-back caching, is performed. Many SANs use an external bus to effect write-back caching. Most have latency issues because network interfaces such as FC, IB, or SAS are used. To avoid write-back caching penalties from mirroring, the speed and latency of the mirror "bus" should approach or match the internal memory bus speed of the processor in the SAN. This weakness causes a ripple effect on performance, but also on availability when an actual failover occurs. That weakness comes from a lack of full "active-active" access through all controllers in the SAN to the same volumes.
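The mirror-link penalty can be sketched as a simple latency model. The microsecond figures below are assumed orders of magnitude for illustration, not measurements of any particular array:

```python
def write_ack_latency_us(local_cache_write_us, mirror_link_us):
    # A write-back write can only be acknowledged once it is safe in BOTH
    # controllers' caches, so the mirror hop adds to every single write.
    return local_cache_write_us + mirror_link_us

memory_bus_mirror = write_ack_latency_us(1, 2)   # mirror near memory-bus speed
network_mirror = write_ack_latency_us(1, 30)     # external FC/IB/SAS hop
print(memory_bus_mirror, network_mirror)
```

When the mirror path is an external network hop, it dominates the acknowledgment time of every write, which is exactly why the mirror "bus" needs to approach internal memory speed.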
Failover and recovery, when a redundant storage system on the SAN loses part of its processing power due to faults in hardware or software, are a key component of availability and a common weakness in storage systems. The desire is zero-time failover to the surviving parts of the system after such a fault. However, due to weaknesses in most storage systems, the time taken to perform such a "failover" or "recovery" (when the failing part returns) can cause downtime for servers, or severe performance degradation that is noticeable to applications. The process of failover or recovery is similar to a "v-motion" activity in VMware, in that the entire state of the failing component is taken over by the survivor. How long it takes relates directly to the complexity present at the point of failure or recovery: the more work that has to be done to complete the failover or recovery, the longer it takes. Many SANs are indeterminate as to how long the process will take; weaknesses not only in the failover/recovery code but in all the other parts of the system add up to cause issues here. Some arrays can take minutes to fail over, causing periodic downtime for applications, but also stretching timeouts and causing the fateful "pause" in computing that slows businesses down.
Overall, weaknesses in SANs are due to old software, patch-worked software, open-source software, stacked software with feature upon feature, lack of design vs. "seat of the pants" development, etc. This is the single biggest reason that storage systems do NOT get the outward efficiency that they should. Typical systems get one-third to one-half of the performance they could, while wasting processing power and storage to overcome this. This relates to the cost of storage, the quality of the storage from a user perspective, and the cost of available storage. For many, it's one charge after another for propping up against these weaknesses, which the customer is told is an enhancement or a new service offering. I have seen array generations get a whopping 5-10% increase in performance from new processing power that is 2-4 times as fast. There is nothing like bad software to ruin your plans. And as quality suffers, so does availability. With the mix of open source, in-house software, glued-on software, etc., it's a wonder many startups can get into an enterprise operation, let alone stay there.
The last weakness, and the one most overlooked when customers buy storage, is the maintenance of the system over its life. A storage system is only as good as its last upgrade in terms of stability, performance, and availability. Many customers ignore this because they have been burned in the past by failed upgrades or planned downtimes that cause datacenter outages that can go on forever. Many storage companies don't even talk about it on their websites, and even the big companies have "planned downtimes" for upgrades of attached storage devices. This is crazy in this world of non-stop computing, and it actually causes more money to be spent to cover it up, as above, to hide weaknesses in specific parts of a system. In many cases, customers who figure it out end up putting in DR solutions just to cover their maintenance operations on the SAN. What a waste! It should be just like updating a PC or a Mac, where it's non-intrusive and, with respect to redundant systems, performed one component at a time so as NOT to lose any availability or cause any downtime!
ISE, on the other hand, with its extreme focus on performance, reliability, AND availability, is now at about 66 years between severity 1 events that cause downtime for servers. This is based on a strong focus on all aspects of the storage software stack as well as the hardware design of ISE. Our performance is 2-4 times that of any HDD device, and with Hyper-ISE combining HDD and SSD, we have set records in benchmarks across the world with our ability to "sense" application loads and adapt accordingly for random, sequential, and mixed loads. Efficiency is the key to cost-effective datacenters, and ISE is the building block with the most efficiency on earth.
Read more on http://stevesicola.com.
— Storage Horizons Blog —
Another myth, the second in the Five Myths and Half-Truths series, is that the "storage shelf" or "drive tray," or the new "massively dense drive packages" and servers with many HDDs, are actually decent at preventing vibration and providing enough cooling for all the drives in the enclosure. THEY ARE NOT. Vibration across drives, hot spots in the packaging, vibration across entire packages in the rack, and more cause wastage in the datacenter, from performance loss to significant increases in failure rate and drive replacement. Adding to this is the possible exposure to data loss when drives fail under these circumstances, with parts of the data set left without redundancy, or the many hours required to restore redundancy given the size of today's drives. Failure rates in this case go well past the design specification of 1%, up to 3-5%, which is untenable for most businesses; many storage vendors try band-aids with fancier algorithms or more copies of data while increasing service costs to the user. The underlying problem should be addressed first.
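The "many hours to restore redundancy" point is simple arithmetic: the window of exposure scales with drive capacity divided by sustained rebuild rate. The capacities and rates below are assumed for illustration, since real rebuild rates vary with load and implementation:

```python
def rebuild_hours(capacity_tb, rebuild_mb_per_s):
    # Hours a set spends degraded: bytes to reconstruct over sustained rate.
    return capacity_tb * 1e12 / (rebuild_mb_per_s * 1e6) / 3600

older_drive = rebuild_hours(0.3, 100)  # 300 GB at 100 MB/s: under an hour
modern_drive = rebuild_hours(4.0, 50)  # 4 TB throttled to 50 MB/s under load
print(older_drive, modern_drive)
```

A rebuild that once finished over lunch now runs the better part of a day, all of it spent with part of the data set unprotected, which is why the underlying failure rate matters more than ever.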
Excessive heat and vibration in the drive environment directly cause real drive failures, while environment and software together indirectly drive replacements by causing "No Trouble Founds," or NTFs. NTFs are the bane of the computer industry: they waste time and money and risk a customer's data set on the recovery techniques employed in storage or elsewhere. What causes NTFs is the lack of thorough error recovery in almost every storage device or host driver that communicates with storage devices. Excessive heat and vibration can and do cause extended times for read or write completion, retries, missed sectors, re-syncs/reboots of drives, etc. This causes software that is not well designed or thought through to simply mark the drive bad and go into "rebuild," assuming it's alright to slow things down for the application, pay more for a service call with higher service costs, and put the customer's data in jeopardy while "rebuilding" the data from the "failed" drive. Service contracts for storage devices are high-priced because of this lack of attention to detail, and money is made whether or not drives fail; the customer always pays.
The design of ISE places drives in an extremely low-vibration, low-heat environment and employs state-of-the-art intelligent error recovery, with 4-plus years of field experience across over 5,000 ISE deployed worldwide, as well as 6 years with hundreds of units in the lab. This has given X-IO proof that good packaging and recovery algorithms (not just RAID) drastically reduce the failure rate of drives and the repair frequency of the entire system, while maximizing the performance of the drives and the storage system. X-IO uses patented DataPac technology that houses 10-20 drives, as well as patented Managed Reliability software, allowing X-IO to provide a no-charge 5-year hardware maintenance warranty on ISE. The peace of mind of a storage system that has NO NTFs, an extremely low failure rate, and performance across all the purchased capacity is unheard of in the storage industry. When has buying less been better than buying more? Efficiency and simplicity with little or no human intervention have always been important, and ISE solves the foundational issues that have been plaguing the industry for 20+ years. SSDs require similar attention now that they are part of the mix in IT environments, and I'll cover them in a bit.
Read more on http://stevesicola.com.
— Storage Horizons Blog —
Five Myths and Half-Truths is my second installment, which I'll continue on the Storage Horizons blog in increasing depth, because I'm convinced that efficiency is key to cloud computing being successful for Tier 1 applications, as well as for any new IT configuration where overall cost is now a metric in the business.
The first myth is that all drives are the same. I think this comes from the bifurcation of the PC/desktop market and the enterprise IT sector, as well as the margin that storage providers have placed upon enterprise drives. There are key differences, and they relate not only to performance but to reliability, data integrity, and overall ruggedness for 24×7 utilization. Contrast this with a drive that's meant to be used 8 hours a day, period, or a drive that is meant to carry a basically continuous but less strenuous workload. The key here is that since HDDs look rather the same on the outside, a good number of people just think they ARE all the same. It's unfortunate, and it's exacerbated by the fact that drives are used incorrectly because of this, as well as by the environment they are placed within (packaging, which I will cover later). All of this relates to how and why drives fail, or at least seem to fail.
Enterprise drives vs. nearline and desktop drives are always a subject of conversation about whether they are worth it, on both ends of the argument. The point is that they all have a place, and they should be used there!
It's always interesting to see the "new idea" of using cheaper drives in place of enterprise drives to achieve the same thing. When you consider the type of drive and the solution chosen to protect a datacenter's data, considerations like sheer drive counts come into play: the difference in solution size between drive types can be staggeringly small or, unfortunately, very large, driving service costs constantly.
A half-truth is that hard drives are unreliable. Well, if you believe the first myth, this one follows from it directly and indirectly. Depending on drive type, service load, environmental conditions, and the storage software in servers or arrays that interfaces with the drives, the pull rates of drives in the field would suggest they are failing at a rate of almost 10%! This, when hard drives are typically specified at a 1% failure rate if used at the proper operating rate and in the proper environment.
To get specific (my daughter says nerdy), HDDs (and SSDs, for that matter) have a specified duty cycle (how much they are used per day and how hard they are pushed) from which the 1% failure rate specification comes. It's actually called the MTBF (mean time between failures), and drive manufacturers typically strive for the 1-million-hour number, after which there are diminishing returns on striving for more.
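The 1-million-hour MTBF and the roughly 1% failure rate are the same specification seen two ways, since a year of powered-on operation is 8,760 hours:

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 powered-on hours in a year

def annual_failure_rate(mtbf_hours):
    # Good approximation while one year is much shorter than the MTBF;
    # it assumes the drive stays inside its rated duty cycle and environment.
    return HOURS_PER_YEAR / mtbf_hours

print(f"{annual_failure_rate(1_000_000):.2%}")  # a 1M-hour MTBF is ~0.9%/year
```

The field pull rates of almost 10% mentioned above are therefore roughly ten times what the specification predicts, which is the gap that duty cycle, packaging, and error-recovery software have to explain.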
Enterprise hard drives (also called mission-critical) are rated for 24×7 usage at full-speed operation. This means using the drive all day, every day, for say a database that moves the heads around constantly. This generates the most heat and the most wear, depending on the type of enclosure environment the drive is contained within. Nearline hard drives (also called business-critical) are rated at about a 30% duty cycle per day for their 1-million-hour MTBF, or 1% failure rate. This means a nearline drive can run 24 hours a day at 30% usage without undue wear and a commensurate failure rate. This type of drive is not for a database, but rather for backup and archive. The 30% duty cycle also assumes the type of environment the drive is supposed to be used within. If only that assumption held, and I'll explain later.
The last type of HDD is the desktop drive, used for PCs. This is what people normally think of when they talk about SATA drives. This is the cheap drive you find at the electronics store, the one that makes you wonder why the drives you buy from your storage vendor cost so much more. The cost difference is not as large as you think, but most storage vendors do rake customers over the coals for enterprise and even nearline drives, all for the sake of "extended testing." Suffice it to say, it's basically a rip-off of the customers, and it does explain a lot about some of the new datacenters that chose to use the cheapest drives possible and just deploy mass numbers with n-way RAID-1, while dealing with the massive fallout of drive failures from overused drives in bad environmental situations. Getting back to desktop drives: these drives are meant to be used no more than 8 hours per day, period. The metrics used are based on long-standing design discipline within drive manufacturers. This is how they cut costs across the three basic drive types.
To recap, enterprise drives are built for performance and high utilization, while nearline drives are meant for back-up and archive. The desktop drive is meant for the PC or external backup drive at home or in a small office.
The actual environment that drives are placed within has become something of note recently. I was interviewed by Bloomberg, after which an article about vibration in the datacenter was published in Business Week late last year. The reporter asked me many questions related to loss of performance caused by microvibration within the datacenter racks in which drives are housed. While this is real, and "bad" packaging can cause up to a 90% loss of performance in a drive, the key point goes well past performance. It's about the reliability of the actual drives and the potential for early failures, as well as false failures or "NTFs" (no trouble found).
Half of the reasons that hard drives actually fail come down to heat and vibration. An HDD is an amazing device; if treated as specified, with low external vibration and heat, it will last a very long time and will most likely suffer slow degradation over time rather than a total failure. I've been in the disc and storage engineering world for 32 years, and these facts have been buried for way too long.
Read more on http://stevesicola.com.