I have been deploying Storage Area Networks for almost ten of my 18 years in Information Technology. I have deployed various traditional, software-defined and converged SANs from global vendors such as IBM, EMC, NetApp, HP and Dell. In my previous role I was tasked with deploying Dell Compellent for several clients. I was excited about the opportunity, but paused after reading the documentation presented to me. I could not correlate the implementation of the SAN with the outcome the customers expected. When a wild sales pitch is sold to businesses with high promises, hidden risks always come with it.
- Lesson number one: never trust someone blindly, even if they have a very decent track record. Resellers are often after a quick sale and a quick exit.
- Lesson number two: make sure you know whom to trust as your partner in the transition to a new SAN.
Decide what to procure based on your business case, TCO, workload analysis, capacity planning, lessons learnt and the outcome of your requirements analysis. Consider current technology trends, where you are now, your technology roadmap and where you want to be in the future, e.g. Google Cloud, AWS or Azure. Capital investment can be a one-off exercise these days, before you pull the plug on the on-premises infrastructure and forklift to Azure, Amazon or Google Cloud. Consider aligning your technology stream with the business you do.
I have written this article to share my experience and disclose everything I learnt through my engagement on Dell Compellent deployment projects, so that you can make the call yourself. I will go through each feature of Dell Compellent and explain what it actually does once you deploy a Compellent.
For the record, I have no beef with Dell. Let’s start now… “Marketing/sales pitch” vs “practical implication.”
Target Market: Small Business
Let’s not go into detail; that is a topic for another day. Please read Dell’s business proposition: “Ideally suited to smaller deployments across a variety of workloads, the SC Series products are easy to use and value optimized. We will continue to optimize the SC Series for value and server-attach.”
Management Interface: Dell Compellent Storage Center has a GUI that is allegedly designed for accessibility and ease of use. Wizards cover a few everyday tasks such as allocation, configuration and administration. Storage Center’s monitoring tools provide very little insight into how the storage backend is doing; you have to engage Dell remote support for diagnostics and for monitoring tools with alert and notification services. Storage Center is not as granular as its competitors NetApp and EMC, and gives little information on storage performance, bottlenecks and backend storage issues. Compellent is thin-provisioned by design; there is no option in the management center to create a thick-provisioned volume. The IOPS and latency reported at volume level are very different from those reported at disk level, and the disk level is where the real load shows. You may see low IOPS on a volume, but click through to drive-level IOPS and you will see the storage controller struggling to cope. The management center gives no clue as to who is generating that many IOPS.
Contact technical support and they will say a RAID scrub is killing your storage. Your standard request is for tech support to stop the RAID scrub during business hours. “You cannot do it” is another classic reply from tech support. If you go through the Compellent management center, you will find nothing that can schedule or stop a RAID scrub.
Data Progression: In theory, Data Progression is an automated tiering technology that should optimise the location of data, both on a schedule and on demand as prompted by a storage profile. Compellent’s tiering profiles streamline policy administration by assigning tier attributes based on the profile. In practice, on-demand data progression during business hours will drive a Compellent crazy. If Citrix VDI is your mainstream workload, it is pretty much dead until data progression is complete.
A side effect of this technology is that the storage controller struggles to service on-demand data progression and IO requests at the same time, so queue depth builds up and seek times in the backend storage run higher than normal.
Storage Profile: In layman’s terms, a storage profile segregates expensive and cheap disks and groups them into tier 1 (SSD, RAID 10), tier 2 (15K Fibre Channel, RAID 10, RAID 5 or RAID 6) and tier 3 (7.2K SATA, RAID 5 or RAID 6). The storage profile determines how the system reads and writes data to disk for each volume (as they are known in Compellent terms) and how the data ages over time, a feature called Data Progression. For example, random read requests go to tier 1, where you keep hot data, and year-old emails go to tier 3.
Storage Profiles are supposed to let the administrator manage both writable blocks and replay blocks for a volume. Fundamentally, this is tiering of storage in a controlled way. In theory it is controlled; in reality it adds extra workload to the Dell Compellent controller. Let’s say you have tiered your storage according to read-intensive and write-intensive IO. What happens when a read- and write-intensive volume gets full? The storage controller automatically triggers an on-demand data progression from the upper tier to a lower tier to make room. That generates write-intensive IO in the lower tier, which is exactly what you wanted to avoid when you profiled or tiered your storage in the first place. Mixing data progression with storage tiering defeats the whole purpose of storage profiling.
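To make the spill-over effect concrete, here is a minimal Python sketch. It is a toy model under my own assumptions, not Compellent's actual algorithm: the `Tier` class, the `write_with_demotion` function and the capacities are all hypothetical. It only shows that a write to a nearly full upper tier forces data down to the lower tier first.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    name: str
    capacity_gb: int
    used_gb: int = 0

    @property
    def free_gb(self):
        return self.capacity_gb - self.used_gb


def write_with_demotion(upper, lower, size_gb):
    """Write size_gb to the upper tier, demoting older data first if needed.

    Returns the number of GB that had to be written to the lower tier.
    Hypothetical placement rule, for illustration only.
    """
    if size_gb > upper.capacity_gb:
        raise ValueError("write is larger than the upper tier")
    shortfall = max(0, size_gb - upper.free_gb)
    if shortfall > lower.free_gb:
        raise RuntimeError("lower tier cannot absorb the demoted data")
    # Demotion: data leaves the fast tier and lands on the slow tier.
    upper.used_gb -= shortfall
    lower.used_gb += shortfall
    # Only now does the new write fit in the fast tier.
    upper.used_gb += size_gb
    return shortfall


tier1 = Tier("tier1-ssd", capacity_gb=100, used_gb=95)
tier3 = Tier("tier3-sata", capacity_gb=1000, used_gb=200)
demoted_gb = write_with_demotion(tier1, tier3, size_gb=10)
print(f"{demoted_gb} GB written to {tier3.name} to make room in {tier1.name}")
```

In this made-up case a 10 GB write to tier 1 produces 5 GB of extra writes on tier 3, which is the lower-tier write load the profile was meant to prevent.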
Compellent Replay: Replay is essentially Dell’s term for a storage snapshot. Dell Compellent Data Instant Replay software creates point-in-time copies called Replays. With Data Instant Replay, Dell Compellent storage can take Replays at any time interval with minimal storage capacity. But here is the catch: you will most likely run storage Replays during the daily backup window. Backup generates lots of read IOPS, and Replays generate lots of read and write IOPS in the same window. Hence your backup is going to be dead slow. You will run out of backup window and never finish the backup before business hours. It will be a nightmare to meet data retention SLAs and to restore file systems and sensitive applications.
IOPS & Latency: Input/output operations per second (IOPS) is a unit of measurement for any hard disk or storage area network (SAN). It is a key performance metric of a SAN regardless of manufacturer, and that does not change. If you are to measure a SAN, this is where you begin. Never think that because you just have a bunch of virtual machines it is okay to buy a SAN without considering IOPS. There is a difference between a virtualised DHCP server and a virtualised SQL server. A DHCP server may generate 20 IOPS, but a SQL server can generate 5000 IOPS depending on what you run on it. Every query sent to a SQL server, directly or by an application that depends on it, generates both read and write IOPS. For a Citrix XenApp and XenDesktop customer, take into consideration that every time a user launches a VDI session or opens a Word document, they generate IOPS, and once they click the save button on a Word document, they generate write IOPS. Now multiply the IOPS of each VDI session by the number of users, applications, VDI sessions and user inputs to estimate your real IOPS.
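The multiplication described above can be sketched in a few lines of Python. The DHCP and SQL figures are the ones used in this article; the VDI session count, the per-session IOPS and all the write ratios are assumptions I made up for illustration, so substitute numbers from your own workload analysis.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    instances: int            # number of VMs or VDI sessions
    iops_per_instance: float  # front-end IOPS each instance generates
    write_ratio: float        # fraction of IOPS that are writes (0.0 to 1.0)


def estimate_iops(workloads):
    """Return (total, read, write) front-end IOPS across all workloads."""
    total = sum(w.instances * w.iops_per_instance for w in workloads)
    write = sum(w.instances * w.iops_per_instance * w.write_ratio
                for w in workloads)
    return total, total - write, write


workloads = [
    Workload("DHCP server", 1, 20, 0.5),      # 20 IOPS, as in the text
    Workload("SQL server", 1, 5000, 0.3),     # 5000 IOPS, as in the text
    Workload("VDI sessions", 200, 15, 0.8),   # assumed values
]
total, read, write = estimate_iops(workloads)
print(f"total={total:.0f} read={read:.0f} write={write:.0f}")
```

Treat the result as a starting point for sizing only; it says nothing about peaks such as boot storms or about what the array adds in the backend.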
Now think about latency. In plain English, latency is the number of seconds or milliseconds you wait to retrieve information from a hard disk drive. It is measured as the round trip between your request and the disk serving it. Now imagine millions of requests bombarding the storage area network. A SAN must sustain those requests and serve the applications; again, it depends on what sort of workload you are running on the SAN. For example, file servers, Citrix profiles, Citrix VDI, Exchange Server and SQL Server need a low-latency SAN.
In Dell Compellent, you may see a volume at, say, 2000 IOPS, but if you view the disks hosting the same volume you might see 5000 IOPS. Then you must ask how the extra 5000 - 2000 = 3000 IOPS are being generated. Does Compellent have any tool to interrogate the storage controller and see where the additional workload comes from? No, it doesn’t. Your only bet is Dell support telling you the truth, if you are lucky. The answer is that the automated RAID scrub generates the extra workload, i.e. 3000 IOPS that could have been used for real workloads.
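The gap between the two views is simple arithmetic, sketched below in Python with the example figures from this article. The `background_iops` helper is hypothetical, and it is a simplification: it attributes the whole gap to background work, whereas some of the gap on any array can also come from RAID write overhead.

```python
def background_iops(volume_iops, disk_iops):
    """Return (extra backend IOPS, share of disk IOPS not explained by the volume)."""
    extra = disk_iops - volume_iops
    share = extra / disk_iops if disk_iops else 0.0
    return extra, share


extra, share = background_iops(volume_iops=2000, disk_iops=5000)
print(f"{extra:.0f} extra IOPS, {share:.0%} of disk activity")
# prints "3000 extra IOPS, 60% of disk activity"
```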
To relate this analysis to an all-flash array such as Dell Compellent: the SAN must be able to offer you the major benefits of a storage area network. If the storage cannot provide low latency and high IO throughput for sensitive applications and workloads, then you need to go back to the drawing board or hire a consultant who can analyse your requirements and recommend options that match your needs and budget. For further reading, look up Citrix validated solutions and the storage best practices recommended by VMware and Microsoft. There are many tools on the market for analysing application workloads on virtual or physical infrastructure.
RAID Scrub: Data scrubbing is an error correction technique that uses a background task to inspect storage for errors periodically, and then correct detected errors using redundant data in the form of different checksums or copies of data. Data scrubbing reduces the likelihood that single correctable errors will accumulate, leading to reduced risks of uncorrectable errors.
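A toy version of the idea, written in Python, looks like this. It assumes two mirrored copies of each block plus a stored CRC32 checksum; real arrays use parity and stronger codes, and nothing here reflects how SCOS implements its scrub. The point is only that a scrub has to read every block of every copy, which is where the extra backend IO comes from.

```python
import zlib


def scrub(primary, mirror, checksums):
    """Check each block against its stored CRC32 and repair from the good copy.

    Returns the indexes of the blocks that were repaired.
    """
    repaired = []
    for i, expected in enumerate(checksums):
        primary_ok = zlib.crc32(primary[i]) == expected
        mirror_ok = zlib.crc32(mirror[i]) == expected
        if primary_ok and not mirror_ok:
            mirror[i] = primary[i]
            repaired.append(i)
        elif mirror_ok and not primary_ok:
            primary[i] = mirror[i]
            repaired.append(i)
        elif not primary_ok and not mirror_ok:
            raise IOError(f"block {i}: uncorrectable, both copies are bad")
    return repaired


blocks = [b"alpha", b"bravo", b"charlie"]
checksums = [zlib.crc32(b) for b in blocks]
primary, mirror = list(blocks), list(blocks)
primary[1] = b"brav0"  # simulate silent corruption of one copy
repaired = scrub(primary, mirror, checksums)
print(repaired)  # prints [1]
```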
In NetApp, you can schedule a RAID scrub to suit your time and needs; in Dell Compellent, however, you cannot schedule a RAID scrub through the GUI or the command line. Dell technical support advised that this is an automated process that takes place every day to correct RAID groups in Dell Compellent. Running an automatic RAID scrub has a major side effect: it drives your storage to insane IOPS levels and latency peaks, causing production volumes to suffer and underperform. Virtualisation performance degrades so badly that the production environment struggles to serve IO requests. Dell advised that it can do nothing about the RAID scrub because it is an automated process in the SCOS operating system.
Compellent Multipathing: By implementing an MPIO solution you eliminate any single point of failure in the physical and logical paths among components such as adapters, cables, fabric switches, servers and storage. If one or more of these elements fails, causing a path to fail, the multipathing logic uses an alternate path for I/O so that applications can still access their data. Each network interface card (in the iSCSI case) or HBA should be connected through redundant switch infrastructure to provide continued access to storage if a storage fabric component fails. This is the fundamental concept of any storage area network (SAN).
New-generation SANs come with integrated multipath I/O (MPIO) support. Both Microsoft and VMware virtualisation architectures support iSCSI, Fibre Channel and Serial Attached SCSI (SAS) SAN connectivity by establishing multiple sessions or connections to the storage array. Failover times vary by storage vendor and can be configured in various ways, but the logic of MPIO remains unchanged.
New MPIO features in Windows Server include a Device Specific Module (DSM) designed to work with storage arrays that support the asymmetric logical unit access (ALUA) controller model (as defined in SPC-3), as well as storage arrays that follow the Active/Active controller model.
The Microsoft DSM provides the following load balancing policies. Microsoft load balance policies are generally dependent on the controller design (ALUA or true Active/Active) of the storage array attached to Windows-based computers.
- Round-robin with a subset of paths
- Dynamic Least Queue Depth
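To show what these two policies mean in practice, here is a rough Python sketch of the selection logic. It is not the Microsoft DSM code; the `Path` class, the path names and the queue depths are invented for illustration.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class Path:
    name: str
    queue_depth: int = 0  # outstanding I/Os on this path
    active: bool = True


def least_queue_depth(paths):
    """Pick the healthy path with the fewest outstanding I/Os."""
    healthy = [p for p in paths if p.active]
    if not healthy:
        raise RuntimeError("no active path")
    return min(healthy, key=lambda p: p.queue_depth)


def round_robin_subset(paths, subset_names):
    """Rotate over the chosen subset of paths; the rest stay on standby."""
    subset = [p for p in paths if p.name in subset_names and p.active]
    return cycle(subset)


paths = [Path("hba0", 4), Path("hba1", 1), Path("hba2", 7)]
chosen = least_queue_depth(paths).name
rotation = round_robin_subset(paths, {"hba0", "hba2"})
order = [next(rotation).name for _ in range(4)]
print(chosen, order)
```

Least queue depth sends the next I/O to the least busy path, while round-robin with a subset simply alternates between the paths you nominated and ignores the others until a failover.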
VMware-based systems also provide Fixed Path, Most Recently Used (MRU) and Round-Robin policies; Round-Robin is the optimum configuration for VMware virtual infrastructure.
To explain ALUA in simple terms: a server can see any LUN as active through both storage processors (controllers or NAS heads), but only one of them “owns” the LUN. Both storage processors can see the logical activity of the storage over the physical connections, either via a SAN switch to the server or via direct SAS cabling. The Hyper-V or vSphere ESXi server knows which processor owns which LUNs and preferably sends traffic directly to the owner. If a controller, processor or NAS head fails, the Hyper-V or vSphere server automatically sends traffic to a surviving processor without any loss of productivity. This is an essential feature of EMC, NetApp and HP products.
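The owner-preference behaviour described above can be sketched in Python as follows. This is a simplified model, not any vendor's multipath driver; the `ControllerPath` class and the controller names "SP-A" and "SP-B" are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ControllerPath:
    controller: str
    alive: bool = True


def select_path(paths, owner):
    """Prefer a live path to the controller that owns the LUN, else any live path."""
    live = [p for p in paths if p.alive]
    if not live:
        raise RuntimeError("all paths are down")
    to_owner = [p for p in live if p.controller == owner]
    return (to_owner or live)[0]


paths = [ControllerPath("SP-A"), ControllerPath("SP-B")]
before = select_path(paths, owner="SP-A").controller
paths[0].alive = False  # simulate failure of the owning controller
after = select_path(paths, owner="SP-A").controller
print(before, "->", after)  # prints "SP-A -> SP-B"
```

The host keeps a usable path the whole time, which is why failover on an ALUA array should not take the volume down.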
Let’s look at Dell Compellent now. Dell Compellent does not offer true Active/Active controllers for any storage. From the Dell forum thread “Dell Controllers Explained!” (Dell Verified Answer):
“In the Compellent Architecture, both controllers are active. Failover is done at either the port or controller level depending on how the system was installed. Volumes are “owned” by a particular controller for mapping to servers. Changing the owning controller can be done – but it does take a volume down.”
I can confirm that this is exactly what Dell customer support advised when I called them. Dell Compellent can take 60 to 90 seconds to fail over from one controller to the other, which means the entire virtual environment goes offline for a while and then comes back online. To update firmware or replace a controller you have to bring everything down and then bring everything back online, which causes a major outage and productivity loss for the entire organisation.
Performance Issue: To identify Dell Compellent bottlenecks for a virtualisation platform hosted on Compellent, use Windows perfmon on a virtual or physical machine to which a Compellent volume is presented via HBA or iSCSI initiator. Create a data collector set with the counters below and generate a report using the PAL tool. Extract seek time, latency, IOPS and queue depth for the Compellent storage. You will see a bottleneck in every area of storage you can think of. Read further on Windows Performance Monitoring Tools.
\LogicalDisk\Avg. Disk Sec/Read
\LogicalDisk\Avg. Disk Sec/Write
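Both counters report seconds per operation, so the first step in reading them is converting to milliseconds. The Python sketch below summarises a handful of samples; the sample values and the 20 ms warning threshold are my own assumptions, not a Dell or Microsoft figure, so pick a threshold that suits your applications.

```python
import statistics


def summarise_latency(samples_sec, warn_ms=20.0):
    """Convert 'Avg. Disk sec/Read' or 'sec/Write' samples to ms and summarise."""
    if not samples_sec:
        raise ValueError("no samples")
    ms = sorted(s * 1000.0 for s in samples_sec)
    # Simple nearest-rank style 95th percentile on the sorted samples.
    p95 = ms[min(len(ms) - 1, int(0.95 * len(ms)))]
    return {
        "avg_ms": statistics.mean(ms),
        "p95_ms": p95,
        "max_ms": ms[-1],
        "over_threshold": sum(1 for v in ms if v > warn_ms),
    }


read_samples = [0.004, 0.006, 0.005, 0.045, 0.007]  # made-up counter values
summary = summarise_latency(read_samples)
print(summary)
```

An average can look healthy while the worst samples are far above it, so look at the high percentiles and the count over threshold, not only the mean.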
Use the following tools to analyse workloads and storage performance in your storage area network:
The cost of each gigabyte of storage is declining rapidly in every segment of the market. Enterprise storage today costs what desktop storage did less than a decade ago. So why are your overall costs increasing when you buy storage? Let’s make it simple and ask a few questions.
How much will the storage cost? How much will the SAN cost to implement? How much will the SAN cost to operate? Now use the tools below to calculate the real cost of owning a black box.
So what should you look for in a SAN?
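The three questions add up to a simple sum, sketched here in Python. Every figure is hypothetical and the function names are mine; a real TCO model would also count support renewals, power, cooling, training and migration.

```python
def total_cost_of_ownership(purchase, implementation, annual_operating, years):
    """Purchase plus implementation plus operating cost over the SAN's life."""
    return purchase + implementation + annual_operating * years


def cost_per_usable_gb(tco, usable_gb):
    """Spread the full cost over usable, not raw, capacity."""
    return tco / usable_gb


tco = total_cost_of_ownership(
    purchase=100_000,         # hypothetical figures throughout
    implementation=20_000,
    annual_operating=15_000,
    years=5,
)
per_gb = cost_per_usable_gb(tco, usable_gb=50_000)
print(f"TCO over 5 years: {tco}, per usable GB: {per_gb:.2f}")
```

Comparing vendors on cost per usable gigabyte over the full life of the SAN, instead of on purchase price, is what exposes the cost of the black box.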
- Lower TCO
- Storage Performance
- Quality of Service
- Uncompromised Availability and uptime
- Cloud Ready
- Reduction of Data (de-duplication)
- Reduction of backup
- Analytics and automation
- Reduction of Data Centre footprint
Summary: Dell Compellent makes a compelling argument for all-flash performance tiers. Yes, but that argument lives in the sales pitch, not in reality. A price-conscious buyer who just needs any SAN and has a low-IO environment can live with Compellent. For mainstream enterprise storage, Dell Compellent is a bad experience and can bring disaster to a corporate Storage Area Network (SAN).
I have no doubt that Compellent was innovative when it introduced all-flash arrays, but Compellent’s best days are gone. Just shop around and you will find better all-flash, converged, hybrid and virtual arrays built on better software, controllers and SSDs. There are flash arrays on the market that run clever code and algorithms to deliver high IO, low latency and performance for sensitive applications.