The Truth About Storage, Part Five

— Storage Horizons Blog —

Another recent myth, the third in the Five Myths and Half-Truths series, is that RAID is dead or that RAID-6 is the answer to everything. Where did doing the math go? What about the economics of wasted capacity, or doing the right thing for the right reasons? Looking at distributed file systems that keep copies of files in multiple places, it’s nice to see that 1980s-style RAID-1 server replication still works.

It’s just that with this method the work is pushed onto the network, and in general the cheapest SATA drives are used in these environments. Compare that with replicated RAID-5 sets, which for the more disaster-tolerance-minded would do the job not only more economically but with more predictable performance and availability.

The other part of the myth is that RAID-6 is the answer. RAID-6 does protect against two drive failures, but has anyone looked at how failure and repair are really calculated, to see what is necessary where, before making a disaster-tolerant copy elsewhere? I believe in the right RAID level for the right UER (uncorrectable error rate) of the drive or drives involved in the data protection scheme. This is modeled together with the repair rate, which varies with the speed of the software and/or hardware design and can push some arrays to need RAID-6. Performance suffers as the protection level increases, and so does $/IO/GB, or I/O density, the key metric in my mind for application performance.
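To make ‘doing the math’ concrete, here is a rough sketch of one piece of it: the chance that a RAID-5 rebuild hits an unreadable sector on a surviving drive. It assumes independent bit errors at the drive’s quoted UER, and the drive count, capacity, and UER values are illustrative assumptions, not figures from any particular array or drive.

```python
import math

def p_ure_during_rebuild(surviving_drives, drive_tb, bits_per_error):
    """Chance of at least one uncorrectable read error while reading every
    surviving drive end to end, assuming independent bit errors."""
    bits_read = surviving_drives * drive_tb * 1e12 * 8  # decimal TB -> bits
    # 1 - (1 - 1/bits_per_error) ** bits_read, computed stably for tiny rates
    return -math.expm1(bits_read * math.log1p(-1.0 / bits_per_error))

# Illustrative only: an 8-drive RAID-5 set of 2 TB drives with one drive failed.
for label, bits_per_error in [("1 in 1e14", 1e14), ("1 in 1e15", 1e15), ("1 in 1e16", 1e16)]:
    p = p_ure_during_rebuild(surviving_drives=7, drive_tb=2.0, bits_per_error=bits_per_error)
    print(f"UER {label} bits: {p:.1%} chance the rebuild hits an unreadable sector")
```

Under these assumptions the same RAID level can look reckless on one class of drive and perfectly reasonable on another, which is the point of matching the RAID level to the UER.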

The bottom-line answer is to use the right data protection technique based upon the desired UER or MTBDL (mean time between data loss). Combinations of RAID-1, RAID-5, or even RAID-6 can be employed, but all have their costs, which must be factored into each IT decision. ISE uses RAID-1 and RAID-5 along with ISE Mirroring to provide proper redundancy for enterprise HDDs, as well as extended data protection for customers who require protection against data loss from disasters, etc.
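As a hedged illustration of how repair rate feeds into MTBDL (more often written MTTDL elsewhere), the sketch below uses the classic textbook approximations for independent drive failures. These formulas ignore UER entirely, and the drive MTTF, set size, and repair times are assumed values chosen only to show the trend.

```python
HOURS_PER_YEAR = 24 * 365

def mtbdl_raid1(mttf_h, mttr_h):
    """Mirrored pair: data is lost if the partner fails during the repair."""
    return mttf_h ** 2 / (2 * mttr_h)

def mtbdl_raid5(n, mttf_h, mttr_h):
    """n-drive RAID-5: a second failure during repair loses data."""
    return mttf_h ** 2 / (n * (n - 1) * mttr_h)

def mtbdl_raid6(n, mttf_h, mttr_h):
    """n-drive RAID-6: a third failure during repair loses data."""
    return mttf_h ** 3 / (n * (n - 1) * (n - 2) * mttr_h ** 2)

# Illustrative inputs only: 8 drives, 1,000,000-hour drive MTTF.
for mttr_h in (6, 24, 96):
    r5 = mtbdl_raid5(8, 1_000_000, mttr_h) / HOURS_PER_YEAR
    r6 = mtbdl_raid6(8, 1_000_000, mttr_h) / HOURS_PER_YEAR
    print(f"repair {mttr_h:>2} h: RAID-5 {r5:,.0f} years, RAID-6 {r6:,.0f} years")
```

In this model RAID-5 MTBDL falls in direct proportion to repair time, so a design that repairs quickly can reach a target that a slower design could only meet by paying the RAID-6 performance and capacity cost.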

Another half-truth is that SANs are basically ‘available’. Availability is different from reliability, because it has nothing to do with the repair rate of the SAN. Availability is about keeping storage access to the servers maintained without disruption. The problem here is that the answer is ‘kinda’…

Availability means ‘always on’ and accessible. Without it, server applications can’t run and the entire business stops. I call these severity ‘1’ issues. To avoid them, many aspects of the day-to-day operation of storage systems must be looked at. Even though some storage systems perform ‘ok’ on ‘always on’, various forms of ‘reduced availability’ occur because of weaknesses in storage systems.

Most SAN arrays (IP or FC) have their weaknesses, and which ones they have varies from company to company. They include:

  1. Back-end device weakness
  2. Caching weakness
  3. Failover/recovery weakness
  4. Code efficiency/open source weakness
  5. Maintenance weakness as related to risk and loss of availability

Back-end device weaknesses show up as ‘slow’ drive performance that limits access to storage for extended periods and leaves application owners dissatisfied. This is typically due to storage software paying too little attention to controlling and observing the back-end infrastructure and the attached storage devices. The infrastructure today is rapidly moving to SAS (Serial Attached SCSI), which is very similar to a Fibre Channel network, albeit reduced in function. SAS is very powerful but requires knowledge of how to handle networks, even on the back end, and even with a small number of devices. Bus resets, accessing enclosure information, etc., must be done properly so they do not get in the way of normal I/O. The other aspect of back-end weakness comes in dealing with the storage devices themselves: how efficiently data gets on and off the devices either stalls applications or makes them fly. Brute force can work, with an overkill pile of flash or DRAM, but money talks here…what price glory?

Caching weaknesses in SANs still plague users, and the only indication is how much they have to pay to get performance. Efficient caching methods can be defeated by having too much cache, with searching dominating the workload. It seems amazing that caches in many SANs are huge when they don’t really need to be. Efficiency is key here once again, as caching is all about knowing ‘when to hold ’em’ and ‘when to fold ’em’, as in cache it or flush it. ‘Sensing’ the nature of the data is key to continuously available performance for an application. Another aspect of cache weakness is the method by which cache mirroring, which enables safe write-back caching, is performed. Many SANs use an external bus to effect write-back caching. Most have issues with latency because network interfaces such as FC, IB, or SAS are used. To avoid the write-back caching penalty from mirroring, the speed and latency of the mirror ‘bus’ should approach or match the internal memory bus speed of the processor in the SAN. This weakness causes a ripple effect not only on performance but also on availability when an actual failover occurs. That weakness comes from a lack of full ‘active-active’ access through all controllers in the SAN to the same volumes.

Failover and recovery, when a redundant storage system on the SAN loses part of its processing power due to faults in hardware or software, are a key component of availability and a key source of weakness in storage systems. The desire is for zero-time failover to the surviving parts of the system after such a fault. However, due to weaknesses in most storage systems, the time taken to perform a ‘failover’ or a ‘recovery’ (when the failed part returns) can cause downtime for servers or severe performance degradation that is noticeable to the applications. The process of a failover or recovery is similar to a vMotion operation in VMware, as the entire state of the failing component is taken over by the survivor. How this is done depends on the complexity at the point of failure or recovery: the more work that has to be done to complete the failover or recovery, the longer it takes. In many SANs the duration is indeterminate, because weaknesses not only in the failover/recovery code but everywhere else in the system add up. Some arrays can take minutes to fail over, causing periodic downtime for applications, forcing longer timeouts, and causing the fateful ‘pause’ in computing that slows businesses down.
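A toy model of that last point, sketched under stated assumptions: failover time is treated as the sum of the work the survivor has to do, and the result is compared with what the hosts will tolerate. The step names, durations, and timeout values are all hypothetical, chosen only to show how extra work turns a transparent failover into an outage.

```python
from dataclasses import dataclass

@dataclass
class Step:
    name: str       # hypothetical step label
    seconds: float  # assumed duration, not a measurement

def classify_failover(steps, host_io_timeout_s, app_tolerance_s):
    """Add up the failover work and compare it with what hosts will tolerate."""
    total = sum(step.seconds for step in steps)
    if total >= host_io_timeout_s:
        return total, "outage"        # host I/O times out: a severity 1 event
    if total >= app_tolerance_s:
        return total, "degraded"      # I/O survives but the stall is noticed
    return total, "transparent"

lean = [Step("detect fault", 0.5), Step("take over volumes", 1.0)]
heavy = lean + [Step("rebuild cache state", 45.0), Step("rescan back end", 30.0)]

for label, steps in (("lean", lean), ("heavy", heavy)):
    total, verdict = classify_failover(steps, host_io_timeout_s=60.0, app_tolerance_s=5.0)
    print(f"{label}: {total:.1f} s -> {verdict}")
```

Raising the host timeout would hide the ‘heavy’ case from the outage count, but only by lengthening the pause the applications sit through.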

Overall, weaknesses in SANs are due to old software, patchwork software, open source software, software stacked feature upon feature, ‘seat of the pants’ development in place of design, etc. This is the single biggest reason that storage systems do NOT get the outward efficiency that they should. Typical systems get 1/3 to 1/2 of the performance they could, while wasting processing power and storage to make up for it. This affects the cost of storage, the quality of the storage from a user perspective, and the cost of available storage. For many it’s one charge after another to prop up these weaknesses, which the customer is told is an enhancement or a new service offering. I have seen array generations get a whopping 5-10% increase in performance from new processing power that is 2-4 times as fast. There is nothing like bad software to ruin your plans. And as quality suffers so does availability. With the mix of open source, custom-written, and glued-on software, etc., it’s a wonder many startups can get into an enterprise operation, let alone stay there.

The last weakness, and the one most overlooked when customers buy storage, is the maintenance of the system over its life. A storage system is only as good as its last upgrade in terms of stability, performance, and availability. Many customers ignore this, as they have been burned in the past by failed upgrades or by planned downtimes that cause data center outages that seem to go on forever. Many storage companies don’t even talk about it on their websites, and even the big companies have ‘planned downtimes’ for upgrades of attached storage devices. This is crazy in this world of non-stop computing, and it actually causes more money to be spent covering it up, just as money is spent to hide the weaknesses described above in other parts of a system. In many cases, customers who figure it out end up putting in DR solutions just to cover maintenance operations on the SAN. What a waste! It should be just like updating a PC or a Mac, where it’s non-intrusive, and on redundant systems it should be performed one component at a time so as NOT to lose any availability or cause any downtime!
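The ‘one at a time’ idea can be sketched in a few lines. This is a minimal illustration, not how any particular array does it: the `upgrade` and `is_healthy` callbacks and the controller names are hypothetical stand-ins for whatever a real system exposes.

```python
def rolling_upgrade(controllers, upgrade, is_healthy):
    """Upgrade redundant controllers one at a time.

    A controller is only taken out of service while every other controller
    is healthy, so hosts always keep at least one path to their volumes.
    """
    if len(controllers) < 2:
        raise RuntimeError("no redundancy: upgrade would require downtime")
    done = []
    for ctrl in controllers:
        peers = [c for c in controllers if c != ctrl]
        if not all(is_healthy(c) for c in peers):
            raise RuntimeError(f"peer of {ctrl} unhealthy; stopping before upgrade")
        upgrade(ctrl)  # hypothetical callback that upgrades one controller
        if not is_healthy(ctrl):
            raise RuntimeError(f"{ctrl} unhealthy after upgrade; halting rollout")
        done.append(ctrl)
    return done

# Toy usage with in-memory stand-ins for real controllers.
versions = {"ctrl-A": 1, "ctrl-B": 1}
health = {"ctrl-A": True, "ctrl-B": True}

def fake_upgrade(name):
    versions[name] += 1

print(rolling_upgrade(list(versions), fake_upgrade, health.get))
```

The important design choice is that the rollout stops rather than pressing on when a peer is unhealthy, which trades a delayed upgrade for never dropping the last working path.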

ISE, on the other hand, with its extreme focus on performance, reliability, AND availability, is now at about 66 years between severity 1 events that cause downtime for servers. This is based on a strong focus on all aspects of the storage software stack as well as the hardware design of ISE. Our performance is 2-4 times that of any HDD device, and with our Hyper-ISE, which combines HDD and SSD, we have set records in benchmarks across the world with our ability to ‘sense’ application loads and adapt accordingly for random, sequential, and mixed loads. Efficiency is the key to cost-effective data centers, and ISE is the building block with the most efficiency on earth.

Read more on http://stevesicola.com.
