NEWS & NOTES
FROM THE THOUGHT LEADERS OF DATA STORAGE
03.07.12
The Truth About Storage, Part Four
— Storage Horizons Blog —
Another myth, the second in the Five Myths and Half-Truths series, is that the ‘storage shelf’ or ‘drive tray’, the new ‘massively dense drive packages’, and servers with many HDDs are actually decent at preventing vibration and providing enough cooling for all the drives in the enclosure. THEY ARE NOT. Vibration across drives, hot spots in the packaging, vibration across entire packages in the rack, and more are the cause of wastage in the data center, from performance loss to significant increases in failure rate and drive replacement. Adding to this is the possible exposure to data loss when drives fail under these circumstances: parts of the data set are left without redundancy, and with the size of today’s drives it can take many hours to restore that redundancy. Failure rates in this case go well past the design specification of 1%, up to 3-5%, which is untenable for most businesses. Many storage vendors try band-aids, with fancier algorithms or more copies of data, while increasing service costs to the user. The underlying problem should be addressed first.
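To put rough numbers on the paragraph above, here is a minimal Python sketch. Only the 1% specification and the 3-5% range come from the text; the fleet size, drive capacity, and rebuild throughput are hypothetical values picked purely for illustration.

```python
# Illustrative arithmetic only. The fleet size, drive capacity, and rebuild
# rate are hypothetical assumptions, not measurements from any product.


def expected_replacements(drive_count: int, annual_failure_rate: float) -> float:
    """Expected number of drives replaced per year in a fleet."""
    return drive_count * annual_failure_rate


def hours_to_restore_redundancy(capacity_tb: float, rebuild_mb_per_s: float) -> float:
    """Hours needed to rewrite one full drive at a sustained rebuild rate."""
    capacity_bytes = capacity_tb * 1e12
    seconds = capacity_bytes / (rebuild_mb_per_s * 1e6)
    return seconds / 3600


if __name__ == "__main__":
    fleet = 1000  # hypothetical number of drives
    for rate in (0.01, 0.03, 0.05):
        replaced = expected_replacements(fleet, rate)
        print(f"{rate:.0%} failure rate: about {replaced:.0f} replacements per year")
    # Hypothetical 2 TB drive rebuilt at a sustained 50 MB/s
    window = hours_to_restore_redundancy(2, 50)
    print(f"time without full redundancy: about {window:.1f} hours")
```

The point of the sketch is the shape of the cost: replacements scale linearly with the failure rate, and every one of them opens a rebuild window during which part of the data set has reduced protection.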
Placing drives in environments where heat and vibration are excessive relates directly to real drive failures, while the combined effects of environment and software relate indirectly to drive replacements by causing “No Trouble Founds”, or NTFs. NTFs are the bane of the computer industry: they waste time and money, and they risk a customer’s data set on whatever recovery techniques are employed in storage or elsewhere. What causes NTFs is the lack of thorough error recovery in almost every storage device or host driver that communicates with storage devices. Excessive heat and vibration can and do cause extended times for read or write completion, retries, missed sectors, re-syncs or reboots of drives, etc. This causes software that is not well designed or thought through to just mark the drive bad and go into ‘rebuild’. That assumes it’s alright to slow things down for the application, pay for a service call with higher service costs, and put the customer data in jeopardy while ‘rebuilding’ the data from the ‘failed’ drive. Service contracts for storage devices are high priced because of this lack of attention to detail; money is made whether or not drives fail, and the customer always pays.
The design of ISE places drives in an extremely low vibration and low heat environment and has state-of-the-art intelligent error recovery, backed by 4-plus years of field experience with over 5,000 ISEs deployed worldwide, as well as 6 years of running hundreds of units in the lab. This has provided XIO with the proof that good packaging and recovery algorithms (not just RAID) drastically reduce the failure rate of drives and the repair frequency of the entire system, and maximize the performance of the drives and storage system. XIO uses patented DataPac technology that houses 10-20 drives, as well as patented Managed Reliability software, allowing XIO to provide a no-charge 5-year HW maintenance warranty on ISE. The peace of mind of a storage system that has NO NTFs, an extremely low failure rate, and performance across all the purchased capacity is unheard of in the storage industry. When has buying less been better than buying more? Efficiency and simplicity with little or no human intervention have always been important, and ISE solves the foundational issues that have been plaguing the industry for 20+ years. SSDs require similar attention as they are now part of the mix in IT environments, and I’ll cover them in a bit.
Read more on http://stevesicola.com.
02.24.12
The Truth About Storage, Part Three
— Storage Horizons Blog —
Five Myths and Half-Truths is my second installment, which I’ll continue on the Storage Horizons blog in increasing depth, because I’m convinced that efficiency is key to cloud computing being successful for tier-1 applications, as well as for any new IT configuration where overall cost is now a business metric.
The first myth is that all drives are the same. I think it’s because of the bifurcation of the PC/desktop market and the enterprise IT sector, as well as the margin that storage providers have placed upon enterprise drives. There are key differences, and they relate not only to performance, but also to reliability, data integrity, and overall ruggedness for 24×7 utilization. Contrast this with a drive that’s meant to be used 8 hours a day, period, or a drive that is meant to carry a basically continuous but less strenuous workload. The key here is that since HDDs look much the same on the outside, a good number of people just think they ARE all the same. It’s unfortunate, and it’s exacerbated by the fact that drives are used incorrectly because of this, and by the environment they are placed within (packaging, which I will cover later). All of this relates to how and why drives fail, or at least seem to fail.
Whether enterprise drives are worth it compared with nearline and desktop drives is always a subject of conversation, argued from both ends. The point is that they all have a place, and they should be used there!
It’s always interesting to see the ‘new idea’ of using cheaper drives in place of enterprise drives to achieve the same thing. When you consider the type of drive and the solution used for data protection in the data center, along with the sheer numbers involved, the entropy difference between solutions built from different drive types can be staggeringly small or, unfortunately, very large, which drives service costs constantly.
A half-truth is that hard drives are unreliable. Well, if you believe the first myth, it leads to this one directly and indirectly. Depending on the drive type, service load, environmental conditions, and the storage software in servers or arrays that interfaces with the drives, the pull rates of drives in the field would suggest they are failing at a rate of almost 10%! This, when hard drives are typically specified at a 1% failure rate if used at the proper operating rate and in the proper environment.
To get specific (my daughter says nerdy), HDDs (and SSDs for that matter) have a specified duty cycle (the amount they are used per day and how hard they are pushed) that the 1% failure rate specification comes from. The underlying figure is the MTBF (mean time between failures), and drive manufacturers typically strive for the 1 million hour number, which works out to a bit under 1% of drives failing per year; beyond that there are diminishing returns on striving for more.
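The link between a 1 million hour MTBF and a roughly 1% failure rate can be checked with a few lines of Python. This is a sketch that assumes a constant failure rate (the simple exponential model) and continuous operation; real drives do not follow it exactly, and the function names are just illustrative.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8,760 hours of continuous operation


def annualized_failure_rate(mtbf_hours: float) -> float:
    """Probability that a drive fails within one year of continuous use,
    assuming a constant failure rate (exponential model)."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def implied_mtbf_hours(annual_rate: float) -> float:
    """Inverse of the above: the MTBF implied by an observed annual rate."""
    return -HOURS_PER_YEAR / math.log(1.0 - annual_rate)


if __name__ == "__main__":
    spec = annualized_failure_rate(1_000_000)
    print(f"1M hour MTBF -> about {spec:.2%} failures per year")
    # The almost-10% field pull rate mentioned earlier, turned back into hours
    print(f"10% pull rate -> about {implied_mtbf_hours(0.10):,.0f} hours MTBF")
```

Read the other way around, a 10% pull rate behaves like a drive whose MTBF is more than ten times worse than the specification, which is why the environment and the software around the drive matter so much.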
Enterprise hard drives (also called mission critical) are rated for 24×7 usage at full-speed operation. This means using the drive all day, every day, for, say, a database that moves the heads around constantly. This generates the most heat and the most wear, depending on the type of enclosure the drive is contained within. Nearline hard drives (also called business critical) are rated at about a 30% duty cycle per day for their 1 million hour MTBF or 1% failure rate. This means a nearline drive can be powered 24 hours a day at about 30% usage without undue wear and a commensurate rise in failure rate. This type of drive is not for a database, but rather for backup and archive. The 30% duty cycle relates to the type of environment the drive is supposed to be used within. If only it were true, and I’ll explain later.
The last type of HDD is the desktop drive, used for PCs. This is what people normally think of when they talk about SATA drives. This is the cheap drive you find at the electronics store, the one that makes you wonder why drives from your storage vendor cost so much more. The cost difference is not as large as you think, but most storage vendors do rake customers over the coals for enterprise and even nearline drives, all for the sake of ‘extended testing.’ Suffice it to say, it’s basically a rip-off of the customers, and it explains a lot about some of the new data centers that chose to use the cheapest drives possible and just employ mass numbers with n-way RAID-1 while dealing with the massive fallout of drive failures from overused drives in bad environmental situations. Getting back to desktop drives, these drives are meant to be used no more than 8 hours per day, period. The metrics used are based on long-standing degrees of design discipline within drive manufacturers. This is the way they cut costs between the three basic drive types.
To recap, enterprise drives are built for performance and high utilization, while nearline drives are meant for back-up and archive. The desktop drive is meant for the PC or external backup drive at home or in a small office.
The actual environment that drives are placed within has been a topic of note recently. I was interviewed by Bloomberg, after which an article about vibration in the data center was published in Business Week late last year. The reporter asked me many questions related to the loss of performance caused by microvibration within the data center racks in which the drives are housed. While this is true, and ‘bad’ packaging can cause up to 90% loss of performance in a drive, the key point goes well beyond performance. It’s about the reliability of the actual drives and the potential for early failures as well as false failures, or ‘NTFs’ (no trouble found).
Heat and vibration account for half of the reasons that hard drives actually fail. An HDD is an amazing device, and if it is treated as specified, with low external vibration and heat placed upon it, it will last a very long time and will most likely suffer slow degradation over time rather than a total failure. I’ve been in the disc and storage engineering world for 32 years, and the facts have been buried for way too long.
Read more on http://stevesicola.com.
02.13.12
The Truth About Storage, Part Two
— Storage Horizons Blog —
This blog is the second in a two-part series.
First, let’s start with what storage is about. The fundamentals of storage are about:
- Performance to the application in the way of I/O and BW;
- System and component reliability;
- System availability; and
- Data protection/integrity.
Storage is also about data management, which has come to be known as ‘thick storage controller features’ or the ‘application/OS controlled features’. These features are manifested in controllers with lots of JBOD shelves of drives or software in servers attaching to dumb or intelligent storage that provides some subset of the fundamentals of storage.
The storage industry has focused over the past 20-plus years on data management within storage controllers, essentially since the late 80s, when Windows and Linux started becoming the low-cost replacement for ‘proprietary solutions’. The problem is that those proprietary solutions from the likes of IBM, DEC, Amdahl, and Unisys actually provided all the data management within these ‘hardware and software solutions’. In other words, these solutions were basically doing what has been done at Google and Facebook with ‘no RAID’ and zillions of SATA drives!
The problem with this focus on features from the storage vendors is that all the fundamentals have been pushed aside, which means that performance is not really a focus, reliability can drive service revenue, availability can do the same, and data integrity is ‘good enough’. SSDs are now the big ‘buzz’ and seen as the panacea of all ills, which they are not. They are an evolution of the storage device, while HDDs still command the lion’s share of the shipments around the world.
XIO is all about storage fundamentals, blowing past all the myths and half-truths to get what customers want: storage that performs, is always there, and is never lost. XIO is also about the new world of ‘applications/OS’ features that will replace every single feature within array controllers. Over the next few months, I’ll be writing and debunking every myth and half truth this industry has put forth, and also what customers can do about it. I’ll also be uncovering why this is so important, given the scale of storage growth and the emergence of the cloud, so that a cloud-based solution will actually be worth something rather than just the hype it seems to be generating today.
Read more on http://stevesicola.com.
02.03.12
The Truth About Storage, Part One
— Storage Horizons Blog —
This blog is the first in a two-part series. The second part will uncover the truths, half-truths and myths about storage.
I’ve been in the storage industry for more than 32 years with my team at XIO. Digital Equipment, Compaq, and Seagate were incredible places to learn a trade that is not something taught in any college I know of…yet.
Storage is the most interdisciplinary science there is, encompassing everything about computer architecture, of course, but, perhaps more importantly, encompassing everything mechanically-, environmentally-, fault tolerance-related (and that’s just the tip of the iceberg).
I wonder how many people really know this, let alone the real truth about storage, as opposed to all the myths and half-truths most everyone has either questioned or accepted over the past 20-plus years, during which the industry has seen more or less ‘open’ systems and standards such as SCSI and Fibre Channel. The focus always seems to be on the next interconnect or the next ‘feature’ to make living with storage the ‘way it is’ a little better.
Some of the myths out there are that not only are all hard drives the same, but they’re also all unreliable. Furthermore, with SSDs, the mantra is now that SSDs are more reliable than hard drives. OMG, this cannot be further from the truth on both points. More myths and half-truths include ideas such as: RAID is dead, and the cloud is something magical that will solve all CIOs’ problems associated with owning aging data centers.
I’m here to tell the ‘truths’ about storage, systems, and about the industry. Having worked inside companies that make the actual devices that go into servers or arrays, as well as having designed and built 10 generations of array controllers with the team at XIO (who have been with me most or all of my 32 years in the storage industry), I have a unique perspective on not only the truths, but also on how the myths and half-truths have become accepted as fact.
In future blogs, I plan to examine the truth about everything from HDDs and SSDs to service costs and system partitioning. Stay tuned; we’re just getting started!
Read more on http://stevesicola.com.
08.25.11
XIO — Fast Forever
XIO means performance-driven storage. Our Hyper ISE storage system represents a breakthrough in price/performance/capacity ratios that literally changes the economics of the data center. This name change represents our new focus, revolutionary products, and our role as a category leader. Hyper ISE is the only storage system that combines SSD and HDD, and can achieve 200,000 IOPS. And you get this performance in a 3U, 14.4 TB storage system. That is a lot of power in a small package.
SSD storage systems are more expensive on a per GB basis. Traditional HDD storage systems are more expensive on a per IOPS basis. And adding SSDs into these systems isn’t really going to lower the cost or improve performance all that much.
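The per-GB and per-IOPS views can be combined into a single I/O density figure, meaning IOPS available per GB of capacity. The sketch below uses the 200,000 IOPS and 14.4 TB figures quoted above; the price is a made-up placeholder, and the function names are illustrative rather than part of any XIO tool.

```python
# I/O density sketch. The 200,000 IOPS and 14.4 TB figures come from the
# text above; the price used below is a made-up placeholder, not a quote.

GB_PER_TB = 1000  # decimal terabytes, as drive vendors count them


def io_density(iops: float, capacity_tb: float) -> float:
    """IOPS available per GB of capacity."""
    return iops / (capacity_tb * GB_PER_TB)


def cost_per_gb(price: float, capacity_tb: float) -> float:
    """Purchase price divided by capacity in GB."""
    return price / (capacity_tb * GB_PER_TB)


def cost_per_iops(price: float, iops: float) -> float:
    """Purchase price divided by deliverable IOPS."""
    return price / iops


if __name__ == "__main__":
    print(f"I/O density: {io_density(200_000, 14.4):.1f} IOPS per GB")
    placeholder_price = 100_000.0  # hypothetical, for illustration only
    print(f"cost per GB:   {cost_per_gb(placeholder_price, 14.4):.2f}")
    print(f"cost per IOPS: {cost_per_iops(placeholder_price, 200_000):.2f}")
```

Comparing systems on both cost figures at once, rather than on one alone, is what exposes a system that looks cheap per GB but expensive per IOPS, or the reverse.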
Instead of spending hundreds of thousands, and in some cases millions, of dollars on big data platforms and accelerators, our customers are using Hyper ISE to improve database performance for a fraction of the cost. Customers successfully implement VDI with up to thousands of virtual desktops running on a single Hyper ISE. And we enable server virtualization 2.0 by breaking the I/O bottleneck, enabling customers to achieve virtual-to-physical consolidation ratios of up to 50-to-1.
We can no longer ignore the power, cooling, and floor space challenges of the data center. Hyper ISE’s design allows it to efficiently dissipate heat. And, because we use far fewer disk drives to achieve high performance, our power consumption is far less than other storage systems. Add to that the additional floor space savings you get thanks to our compact footprint.
Hyper ISE is fast forever. Performance in other storage systems degrades over time due to wear and tear, RAID rebuilds, and mirroring operations. Hyper ISE was built to maintain high performance through its entire life cycle while other storage systems fall off a cliff. That is a dirty little secret in the storage world that is rarely talked about.
Our name means performance. XIO.
08.09.11
The Need for Speed
We are seeing a real “need for speed” and so is industry analyst Tony Asaro. In his latest article “The Need for Speed” in Storage Magazine he discusses the imbalance in the data center. Servers are getting faster but relatively speaking, storage is not. He does mention us specifically in the article: “XIO has a unique approach with its Hyper ISE product using Continuous Adaptive Data Placement (CADP) that creates a single pool of storage from SSDs and hard disk drives (HDDs). Instead of promoting and demoting data based on activity/ inactivity, XIO monitors application performance and places data on SSD or HDD based on whether there will be an actual improvement perceivable to the user. The goal is to ensure that price, performance and capacity are in optimal balance.”
The article is very interesting and makes some important points about how performance optimized storage is a new category of storage that has wide applicability. We are certainly seeing that and it is driven by this imbalance in the data center. Storage has become the bottleneck and the Hyper ISE breaks it.
06.23.11
Hyper ISE Buzz
As you know, we went GA with Hyper ISE, and we are seeing amazing market traction with our customers and partners. There has been a great deal of media buzz, including two articles in TMCnet: “The Future of Faster, Smaller Storage with Hyper Solid-State and Spinning Media” and “Interview with XIO”. Other articles include “XIO Unscabbards Go-faster Flash-mungous ISE Blade,” in the Register and “XIO Enhances Data Center Efficiency with Hyper ISE Storage Platform,” in Computer Technology Review. There is also a blog by David Black, “Fusion IO and XIO Hyper ISE”. And the most recent is “XIO’s SSD Strategy: Beat Fusion-io” in TechTarget. Related to Hyper ISE is Steve Duplessie’s blog on the inevitable rise of SSD within enterprise storage.
06.13.11
CTO Insight on CloudExpo 2011
The 2011 NYC Cloud Expo was a great experience for me, XIO, and the entire crowd who attended. I was able to dispel myths and explain the realities of Storage and Storage for the Cloud in my keynote as well as the breakout session and technology executive panel.
The response I got from the crowd was encouraging, because just accepting the status quo of storage inefficiency in architecture, design, and back-end costs is NOT in the best interest of any customer. I am happy with the hunger for knowledge, and with being able to show after the fact what has been done to solve this with Hyper ISE and the entire ISE family. It was a great opportunity for us to announce the General Availability of our new Hyper ISE storage platform, designed to provide our customers with the best of both worlds: high performance and capacity in one pool of storage.
Here’s a great video interview that our CEO, Alan Atkinson, conducted at the Cloud Expo event in NYC, talking about the rapid adoption of SSD working alongside spinning disk with the help of Continuous Adaptive Data Placement (CADP).
To the Cloud…that can handle thunderstorms of IO!
06.09.11
Fusion-io IPO – The Need for Speed
The Fusion-io IPO today is the second most exciting thing to happen in storage this week. The first of course was the Hyper ISE Launch! Having said that, we see the Fusion-io IPO as an important validation for the need for speed. The market rewarded Fusion-io with a healthy surge with a near $2B market cap. The Wall St Journal did point out the good and the bad — one point is Fusion-io gets 91% of their revenue from just 10 customers. And analyst Steve Duplessie isn’t a big fan.
One customer that selected Hyper ISE over Fusion-io told us that we gave him the high performance he required AND the capacity he needed for his application. Additionally, his high-performance application was also mission-critical, so our reliability and high availability were requisite. On top of that, he didn’t want a point solution but something that he could use for multiple applications. When you add it all up, it was a no-brainer for this customer.

04.19.12
The Truth About Storage, Part Five
Posted by Steve Sicola
— Storage Horizons Blog —
Another recent myth, the third in the Five myths and Half-truths series, is that RAID is dead or that RAID-6 is the answer to everything. Where did doing the math go? What about the economics here of wastage or doing the right thing for the right reasons? From distributed file systems with copies of files in multiple places, it’s nice to see that 1980’s RAID-1 Server Replication worked.
It’s just that with this method, the work is spread to the network, and in general the cheapest SATA drives are used in these environments. This is versus replicated RAID-5 sets, which for the more disaster-tolerance-minded people would do the job not only more economically, but with more predictable performance and availability.
The other part is the claim that RAID-6 is the answer. RAID-6 helps with two drive failures, but has anyone looked at how failure and repair are really calculated, to see what is necessary where, before making a disaster-tolerant copy elsewhere? I believe in the right RAID level for the right UER (uncorrectable error rate) of the drive or drives involved in the data protection scheme. This is modeled together with the repair rate, which can vary, and a slow repair rate can cause some arrays to need RAID-6 depending on the speed of their software and/or hardware design. Performance suffers as the protection level increases, and so does $/IO/GB, or I/O density, the key metric in my mind for application performance.
The bottom line answer is to use the right data protection technique based upon the desired UER or MTBDL (mean time between data loss). Combinations of RAID-1, RAID-5, or even RAID-6 can be employed but all have their costs which must be factored into each IT decision. ISE uses RAID-1 and RAID-5 along with ISE Mirroring to provide for proper redundancy for enterprise HDDs as well as extended data protection for those customers who require extended data loss protection for disasters, etc.
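“Doing the math” on UER, repair time, and MTBDL can be sketched in a few lines of Python. The formulas below are the classic simplified textbook approximations, which assume independent failures and a constant failure rate; they are not XIO’s model, and the drive counts, capacities, error rates, and repair times are hypothetical inputs.

```python
import math


def p_unrecoverable_read_during_rebuild(drives: int, capacity_tb: float,
                                        uer_per_bit: float) -> float:
    """Chance of at least one unrecoverable read error while reading every
    surviving drive in full to rebuild a single-parity (RAID-5) set."""
    bits_read = (drives - 1) * capacity_tb * 1e12 * 8
    # log1p keeps precision for very small per-bit error rates
    return 1.0 - math.exp(bits_read * math.log1p(-uer_per_bit))


def mtbdl_raid5_hours(drives: int, mtbf_hours: float, repair_hours: float) -> float:
    """Textbook approximation: data is lost when a second drive fails
    before the first failure has been repaired."""
    return mtbf_hours ** 2 / (drives * (drives - 1) * repair_hours)


def mtbdl_raid6_hours(drives: int, mtbf_hours: float, repair_hours: float) -> float:
    """Textbook approximation: data is lost on a third overlapping failure."""
    return mtbf_hours ** 3 / (
        drives * (drives - 1) * (drives - 2) * repair_hours ** 2
    )


if __name__ == "__main__":
    # Hypothetical 8-drive set of 2 TB drives
    for uer in (1e-14, 1e-15):
        p = p_unrecoverable_read_during_rebuild(8, 2.0, uer)
        print(f"UER {uer:g} per bit: {p:.1%} chance of a read error during rebuild")
    # Hypothetical repair windows with a 1 million hour MTBF
    for repair in (6, 24, 96):
        r5 = mtbdl_raid5_hours(8, 1_000_000, repair)
        r6 = mtbdl_raid6_hours(8, 1_000_000, repair)
        print(f"repair {repair:>2} h: RAID-5 {r5:.2e} h, RAID-6 {r6:.2e} h")
```

Even this crude model shows the two levers at work: a better UER shrinks the rebuild risk by roughly an order of magnitude, and a faster repair raises MTBDL without paying the performance cost of a higher protection level.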
Another Half-Truth is that SANs are basically ‘Available’. Availability is different from reliability, because it has nothing to do with the repair rate of the SAN. Availability is about keeping storage access to the servers maintained without disruption. The problem here is that the answer is ‘kinda’…
Availability means ‘always on’ and accessible. Without it, server applications can’t run and the entire business stops. I call these severity ‘1’ issues. In order to avoid these things many aspects of day to day operation of storage systems must be looked at. Even though some storage systems perform ‘ok’ on ‘always on’, various aspects of ‘reduced availability’ occur based upon weaknesses in storage systems.
Most SAN arrays (IP or FC) have their weaknesses, which vary from company to company. They range from:
Back-end device weaknesses are manifested in ‘slow’ drive performance, limiting access to storage for extended periods of time, causing dissatisfaction by application owners. This is typically due to lack of attention in storage software in back-end control and observation of the back-end infrastructure as well as the attached storage devices. The infrastructure today is rapidly moving to SAS (serial attached SCSI), which is very similar to a fibre channel network, albeit reduced in function. SAS is very powerful but requires knowledge of how to handle networks, even on the back-end, and even with a small number of devices. Bus resets, accessing of enclosure information, etc must be done properly in order to not get in the way of normal I/O. The other aspect of back-end weakness comes in dealing with the storage devices themselves, as getting data on and off the devices efficiently either stalls applications or makes them fly. Brute force can work with overkill of a pile of flash or DRAM, but money talks here…what price glory?
Caching weaknesses in SANs still plague users, with the only indication being how much they have to pay to get performance. Efficient caching methods can be overwhelmed by having too much cache, with searching dominating the workload. It seems amazing that caches in many SANs are huge when they don’t really need to be. Efficiency is key here once again, as caching is all about knowing ‘when to hold em’ and ‘when to fold em’, as in cache it or flush it. ‘Sensing’ knowledge of the data is key to continuously available performance for an application. Another aspect of cache weakness is the method by which cache mirroring is performed to enable safe write-back caching. Many SANs use an external bus to effect write-back caching. Most have issues with latency because network interfaces such as FC, IB, or SAS are used. To avoid write-back caching penalties from mirroring, the speed and latency of the mirror ‘bus’ should approach or match the internal memory bus speed of the processor in the SAN. This weakness causes a ripple effect not only on performance but also on availability when an actual failover occurs. That weakness comes from a lack of full ‘active-active’ access through all controllers in the SAN to the same volumes.
Failover and recovery, when a redundant storage system on the SAN loses part of its processing power due to faults in hardware or software, are a key component of availability and a key weakness in storage systems. The desire is to have zero-time failover to the surviving parts of the system after such a fault. However, due to weaknesses in most storage systems, the time taken to perform such a ‘failover’ or ‘recovery’ (when the failing part returns) can be excessive, causing downtime for servers or performance degradation severe enough to be noticed by the applications. The process of a failover or recovery is similar to a vMotion activity in VMware, as the entire state of the failing component is taken over by the survivor. How this occurs depends on the complexity at the point of failure or recovery. The more work that has to be done to effect the failover or recovery, the longer it takes to complete. Many SANs are indeterminate as to how long the process will take, not only because of weaknesses in the failover/recovery code, but because all the other weaknesses in the system add up to cause issues here. Some arrays can take minutes to fail over, causing periodic downtime for applications, but also making timeouts longer and causing the fateful ‘pause’ in computing that slows businesses down.
Overall, weaknesses in SANs are due to old software, patch-worked software, open source software, stacked software with feature upon feature, ‘seat of the pants’ development in place of design, etc. This is the single biggest reason that storage systems do NOT get the outward efficiency that they should. Typical systems get 1/3 to 1/2 of the performance they could, while wasting processing power and storage to compensate. This relates to the cost of storage, the quality of the storage from a user perspective, and the cost of available storage. For many, it’s one more charge after another for propping up these weaknesses, which the customer is told is an enhancement or a new service offering. I have seen array generations get a whopping 5-10% increase in performance with new processing power that is 2-4 times as fast. There is nothing like bad software to ruin your plans. And as quality suffers, so does availability. With the mix of open source, written software, glued-on software, etc., it’s a wonder many startups can get into an enterprise operation, let alone stay in there.
The last weakness, and the one most overlooked when customers buy storage, is the maintenance of the system over its life. A storage system is only as good as its last upgrade in terms of stability, performance, and availability. Many customers ignore this, as they have been burned in the past by failed upgrades or planned downtimes that cause data center outages that can go on forever. Many storage companies don’t even talk about it on their websites, and even the big companies have ‘planned downtimes’ for upgrades of attached storage devices. This is crazy in this world of non-stop computing, and it causes more money to be spent to cover it up, just as money is spent to hide the weaknesses described above in specific parts of a system. In many cases, customers who figure it out end up putting in DR solutions just to cover the maintenance operations of their SAN. What a waste! It should be just like updating a PC or a Mac, where it’s non-intrusive, and with redundant systems, upgrades should be performed one at a time so as NOT to lose any availability or cause any downtime!
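The one-at-a-time upgrade idea can be shown as a small Python sketch. The `Controller` class and `rolling_upgrade` function are hypothetical names invented for illustration, and assigning a version string stands in for a real firmware update; this is not how any particular array implements it.

```python
from dataclasses import dataclass


@dataclass
class Controller:
    """Hypothetical stand-in for one redundant controller in an array."""
    name: str
    version: str
    online: bool = True


def rolling_upgrade(controllers: list, new_version: str) -> list:
    """Upgrade redundant controllers one at a time and return how many
    controllers stayed online during each step."""
    if len(controllers) < 2:
        raise ValueError("a non-disruptive upgrade needs redundancy")
    online_during_step = []
    for ctrl in controllers:
        if not all(c.online for c in controllers):
            raise RuntimeError("refusing to upgrade while a partner is down")
        ctrl.online = False          # the partner(s) carry the I/O load
        online_during_step.append(sum(c.online for c in controllers))
        ctrl.version = new_version   # stand-in for the real firmware update
        ctrl.online = True           # rejoin before touching the next one
    return online_during_step


if __name__ == "__main__":
    pair = [Controller("A", "1.0"), Controller("B", "1.0")]
    steps = rolling_upgrade(pair, "1.1")
    print("controllers online during each step:", steps)
    print("versions after upgrade:", [c.version for c in pair])
```

The guard at the top of the loop is the important design choice: the upgrade proceeds only while every partner is healthy, so servers always have at least one path to their storage.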
ISE, on the other hand, with its extreme focus on performance, reliability, AND availability, is now at about 66 years between severity 1 events that cause downtime for servers. This is based on a strong focus on all aspects of the storage software stack as well as the hardware design of ISE. Our performance is 2-4 times that of any HDD device, and with Hyper ISE, which combines HDD and SSD, we have set records in benchmarks across the world with our ability to ‘sense’ application loads and adapt accordingly for random, sequential, and mixed loads. Efficiency is the key to cost-effective data centers, and ISE is the building block with the most efficiency on earth.
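For a sense of scale, a figure like 66 years between severity 1 events can be turned into expected downtime per year. The sketch below takes the 66-year number from the text; the outage duration is a hypothetical assumption, since the text does not state how long such an event lasts.

```python
HOURS_PER_YEAR = 24 * 365


def availability(years_between_outages: float, outage_hours: float) -> float:
    """Fraction of time the storage is accessible, given the mean time
    between outages and the length of a single outage."""
    uptime_hours = years_between_outages * HOURS_PER_YEAR
    return uptime_hours / (uptime_hours + outage_hours)


def downtime_minutes_per_year(years_between_outages: float,
                              outage_hours: float) -> float:
    """Average downtime per year implied by the same two inputs."""
    unavailable = 1.0 - availability(years_between_outages, outage_hours)
    return unavailable * HOURS_PER_YEAR * 60


if __name__ == "__main__":
    assumed_outage_hours = 1.0  # hypothetical length of a severity 1 event
    a = availability(66, assumed_outage_hours)
    d = downtime_minutes_per_year(66, assumed_outage_hours)
    print(f"availability: {a:.7f}")
    print(f"average downtime: about {d:.2f} minutes per year")
```

Changing the assumed outage length changes the result proportionally, so the same interval between events can mean very different availability depending on how quickly access is restored.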
Read more on http://stevesicola.com.
Tags: HDD, RAID, RAID-6, SAN, Steve Sicola, storage weaknesses