I have been deploying Storage Area Networks (SANs) for almost 10 of my 16 years in Information Technology. I have deployed traditional, software-defined and converged SANs from global vendors such as IBM, EMC, NetApp, HP and Dell. In a previous role I was tasked with deploying Dell Compellent for several clients. I was excited by the opportunity, and then paused after reading the documentation presented to me: I could not correlate the implementation of this SAN with the outcomes the clients expected. When an over-hyped sales pitch full of big promises is sold to a business, there are always hidden risks that come with it.

Lesson number one: never trust anyone blindly, even with a decent track record; resellers are often after a quick sale and a quick exit. Lesson number two: make sure you know whom to trust as your partner in the transition to a new SAN. Decide what to procure based on your business case, ROI, workload analysis, capacity planning and the outcome of requirements analysis. Consider the current technology trend, where you are now, your technology road map, and where you want to be in future, e.g. AWS or Azure. Capital investment can be a one-off exercise these days before you pull the plug on on-premises infrastructure and fork-lift it to Azure or Amazon. Consider aligning your technology stream with the business you are in.

I have written this article to share my own experience and disclose everything I learnt through my engagements on Dell Compellent deployment projects, so that you can make the call yourself. I will go through each feature of Dell Compellent and what that feature actually does when you deploy one. FYI, I have no beef with Dell. Let's start now… "Marketing/sales pitch" vs "practical implication"
Target Market: Small Business
Let's not go into detail; that is a different topic for another day. Just read Dell's own business proposition: "Ideally suited to smaller deployments across a variety of workloads, the SC Series products are easy to use and value optimized. We will continue to optimize the SC Series for value and server-attach."
Management: Dell Compellent Storage Center has a GUI allegedly designed for ease of use. Wizards cover a few common tasks such as allocation, configuration and administration. The Storage Center monitoring tools, however, provide very little insight into how the storage backend is doing; for diagnostics, and for monitoring with alerting and notification, you have to engage Dell remote support. Storage Center is not as granular as the NetApp and EMC competition: it exposes little information on storage performance, bottlenecks and backend storage issues. Compellent is thin provisioned by design; there is no option in Storage Center to create a thick provisioned volume. IOPS and latency as reported at the volume level and at the disk level are far apart from each other and from the real figures. You may see modest IOPS on a volume, but click down to the disk level and you will see the storage controller struggling to cope. Storage Center gives no clue as to what is generating all that I/O.
Contact technical support and they will tell you a RAID scrub is killing your storage. Make the obvious request that they stop RAID scrubs during business hours, and you get another classic reply: "You cannot do it." Go through the Compellent management center yourself and you will find nothing that can schedule or stop a RAID scrub.
Data Progression: In theory, Data Progression is an automated tiering technology that optimizes the location of data, both on a schedule and on demand as directed by a storage profile. Compellent's tiering profiles streamline policy administration by assigning tier attributes through the profile. In practice, on-demand data progression during business hours drives the Compellent crazy. If you run a mainstream Citrix VDI workload, it is pretty much dead until data progression completes.
The side effect of this technology is that the storage controller struggles to service on-demand data progression and I/O requests at the same time, so queue depths grow and seek times in the backend storage climb well above normal.
Storage Profile: A storage profile, in layman's terms, segregates expensive and cheap disks into tiers: tier 1 (SSD, RAID 10), tier 2 (15K Fibre Channel, RAID 10/5/6) and tier 3 (7.2K SATA, RAID 5/6). The storage profile determines how the system reads and writes data to disk for each volume, and how the data ages over time via the feature called Data Progression. For example, a random read request goes to tier 1 where you keep hot data, while a year-old email lands in tier 3.
Storage Profiles are supposed to let the administrator manage both writable blocks and Replay blocks for a volume; it is essentially tiering of storage in a controlled way. In reality, though, it adds extra workload to the Compellent controller. Say you have tiered your storage according to your read- and write-intensive I/O. What happens when the write-intensive volume's tier gets full? The storage controller automatically triggers an on-demand data progression from the upper tier to the lower tier to make room. Write-intensive I/O is thereby generated in the lower tier, which is exactly what you profiled and tiered your storage to avoid in the first place. Mixing data progression with storage tiering defeats the whole purpose of storage profiling.
Replay: A Replay is essentially a storage snapshot in Dell terms. Dell Compellent Data Instant Replay software creates point-in-time copies called Replays, at any interval and with minimal storage capacity. But here is the catch: you will most likely be running Replays during the daily backup window. Backup generates lots of read IOPS, while Replays generate lots of read and write IOPS, at the same time of day. Your backup will be dead slow; you will overrun the backup window and never finish before business hours. Meeting your data retention SLA, and restoring file systems and sensitive applications, becomes a nightmare.
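To see why the overrun happens, here is a back-of-the-envelope sketch. All the figures (data size, window length, throughput with and without Replay contention) are illustrative assumptions, not Compellent measurements; plug in your own numbers.

```python
# Back-of-the-envelope check: does the backup fit in its window once
# Replay I/O contends with it? All figures below are assumptions.

def backup_hours(data_tb: float, throughput_mbps: float) -> float:
    """Hours needed to back up data_tb terabytes at throughput_mbps MB/s."""
    seconds = (data_tb * 1024 * 1024) / throughput_mbps  # TB -> MB
    return seconds / 3600

window_hours = 8          # e.g. 22:00 to 06:00
data_tb = 20              # data to back up
clean_mbps = 900          # assumed throughput with the array otherwise idle
contended_mbps = 300      # assumed throughput while Replays are running

for label, mbps in [("no Replays", clean_mbps), ("with Replays", contended_mbps)]:
    hours = backup_hours(data_tb, mbps)
    verdict = "fits" if hours <= window_hours else "OVERRUNS"
    print(f"{label}: {hours:.1f} h -> {verdict} the {window_hours} h window")
```

With these (made-up) numbers the clean run finishes in about 6.5 hours, while the contended run needs roughly 19 hours, far past the window.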
IOPS & Latency: Input/output operations per second (IOPS) is a unit of measurement for any hard disk or storage area network. It is a key performance metric of a SAN regardless of manufacturer, and that never changes. If you are going to evaluate a SAN, this is where you begin. Never think that because you have a bunch of virtual machines it is okay to buy a SAN without considering IOPS. There is a difference between a virtualized DHCP server and a virtualized SQL server: a DHCP server may generate 20 IOPS, while a SQL server can generate 5,000 depending on what you run on it. Every query sent to the SQL server, or to an application that depends on it, generates both read and write IOPS. For a Citrix VDI and apps customer, take into account that every Word document a user loads generates read IOPS, and every click of the save button generates write IOPS. Now multiply by the number of users and sessions you are running.
Now think about latency. In plain English, latency is the number of seconds or milliseconds you wait to retrieve information from a disk drive, measured as the round trip between your request and the disk serving it. Millions of requests are bombarded at a storage area network; the SAN must sustain them and keep serving the applications, and how hard that is depends on the workload you run. File servers, Citrix profiles, Citrix VDI, Exchange Server and SQL servers all need a low-latency SAN.
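To make the multiplication above concrete, here is a minimal sketch of the kind of aggregate-IOPS estimate you should do before buying any SAN. The per-workload figures are illustrative assumptions, not vendor numbers; measure your own with perfmon or a similar tool.

```python
# Rough aggregate-IOPS estimate for a mixed environment.
# The per-workload IOPS figures are illustrative assumptions only.

workloads = {
    # name: (count, read IOPS each, write IOPS each)
    "DHCP server":        (1,    15,    5),
    "SQL server":         (2,  3000, 2000),
    "Citrix VDI session": (300,   8,    4),
}

total_read = sum(n * r for n, r, _ in workloads.values())
total_write = sum(n * w for n, _, w in workloads.values())
peak = total_read + total_write

print(f"steady-state read IOPS:  {total_read}")
print(f"steady-state write IOPS: {total_write}")
# Size for peak plus headroom; 30% is a common rule of thumb.
print(f"size the SAN for roughly {int(peak * 1.3)} IOPS")
```

Even this toy environment lands in the mid five figures once headroom is added, which is why "just any SAN" is not a sizing strategy.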
In a Dell Compellent, you may see a volume doing, say, 2,000 IOPS, but view the disks hosting that same volume and you might see 5,000. You must then ask where the extra 5,000 - 2,000 = 3,000 IOPS are coming from. Does Compellent ship any tool to interrogate the storage controller and see how those additional workloads are generated? No, it doesn't. Your only bet is Dell support telling you the truth, if you are lucky. The answer is the automated RAID scrub, generating 3,000 IOPS of extra workload that could have been serving real workloads.
To relate this analysis to an all-flash array such as Dell Compellent: the SAN must be able to deliver the fundamental benefits of a storage area network. If the storage cannot offer low latency and high I/O throughput for sensitive applications and workloads, go back to the drawing board, or hire a consultant who can analyse your requirements and recommend options that match your needs and budget. For further reading, see the Citrix validated solutions and the storage best practices published by VMware and Microsoft. There are many tools on the market for analysing application workloads on virtual or physical infrastructure.
RAID Scrub: Data scrubbing is an error correction technique that uses a background task to periodically inspect storage for errors, then correct detected errors using redundant data in the form of checksums or copies of the data. Data scrubbing reduces the likelihood that single correctable errors will accumulate, reducing the risk of uncorrectable errors.
On a NetApp you can schedule a RAID scrub at a time that suits you; on a Dell Compellent you cannot schedule a RAID scrub at all, through GUI or command line. Dell technical support advised that it is an automated process that runs every day to correct RAID groups. Running an automated RAID scrub has a major side effect: it drives the storage to insane IOPS levels and latency peaks so high that production volumes suffer and underperform. Virtualization performance degrades so badly that the production environment struggles to serve I/O requests. Dell advised that they can do nothing about it, because the RAID scrub in the SCOS operating system is an automated process.
Multipathing: By implementing an MPIO solution you eliminate any single point of failure on the physical and logical paths between components such as adapters, cables, fabric switches, servers and storage. If one of these components fails, taking a path down, the multipathing logic uses an alternate path for I/O so that applications can still access their data. Each network interface card (in the iSCSI case) or HBA should be connected through redundant switch infrastructure to provide continued access to storage if a fabric component fails. This is a fundamental concept of any storage area network, a.k.a. SAN.
New-generation SANs ship with multipath I/O (MPIO) support. Both the Microsoft and VMware virtualization architectures support iSCSI, Fibre Channel and Serial Attached SCSI (SAS) connectivity by establishing multiple sessions or connections to the storage array. Failover times vary by storage vendor and can be configured in various ways, but the logic of MPIO remains the same.
New MPIO features in Windows Server include a Device Specific Module (DSM) designed to work with storage arrays that support the Asymmetric Logical Unit Access (ALUA) controller model (as defined in SPC-3), as well as storage arrays that follow the true Active/Active controller model.
The Microsoft DSM provides the following load-balancing policies. Which policies apply generally depends on the controller model (ALUA or true Active/Active) of the storage array attached to the Windows-based computers.
- Round-robin with a subset of paths
- Dynamic Least Queue Depth
- Weighted Path
VMware-based systems also provide Fixed, Most Recently Used (MRU) and Round Robin path selection policies, of which Round Robin is generally the optimal configuration for a VMware virtual infrastructure.
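The round-robin policy listed above is simple enough to sketch in a few lines. This is an illustrative model of the idea, not Microsoft's or VMware's implementation; the path names are made up.

```python
from itertools import cycle

# Minimal sketch of MPIO round-robin path selection: I/O rotates
# across healthy paths, and a failed path is simply skipped over.

class RoundRobinMpio:
    def __init__(self, paths):
        self.paths = list(paths)
        self.healthy = set(self.paths)
        self._rr = cycle(self.paths)

    def fail(self, path):
        self.healthy.discard(path)

    def restore(self, path):
        self.healthy.add(path)

    def next_path(self):
        # Advance the rotation until a healthy path turns up.
        for _ in range(len(self.paths)):
            p = next(self._rr)
            if p in self.healthy:
                return p
        raise IOError("all paths down")

mpio = RoundRobinMpio(["HBA0->SP-A", "HBA1->SP-B"])
print([mpio.next_path() for _ in range(4)])  # alternates across both paths
mpio.fail("HBA0->SP-A")
print([mpio.next_path() for _ in range(2)])  # only the surviving path
```

The point of the sketch is the failover behaviour: applications keep getting a path as long as at least one survives, which is exactly the property a controller failover can break, as discussed below.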
To explain ALUA in simple terms: a server can see a LUN as active via both storage processors (controllers, or NAS heads), but only one of them "owns" the LUN. Both storage processors can view the logical activity of the storage over physical connections, either through a SAN switch or via direct SAS cabling. A Hyper-V or vSphere ESXi server knows which processor owns which LUNs and preferentially sends traffic directly to the owner. If a controller, processor or NAS head fails, the Hyper-V or vSphere server automatically sends traffic to the surviving active processor without any loss of productivity. This is a key feature of EMC, NetApp and HP products.
Now let's look at Dell Compellent. Dell Compellent does not offer true Active/Active controllers for any storage. Here is Dell's own verified answer on the Dell forum, explaining the controllers:
“In the Compellent Architecture, both controllers are active. Failover is done at either the port or controller level depending on how the system was installed. Volumes are “owned” by a specific controller for the purposes of mapping to servers. Changing the owning controller can be done – but it does take a volume down.”
I can confirm that this is exactly what Dell customer support advised me when I called them. A Dell Compellent can take 60 to 90 seconds to fail over from one controller to the other, which means the entire virtual environment goes offline for a while before coming back. To update firmware or replace a controller you have to bring everything down and then back online, causing a major outage and productivity loss for the entire organization.
Performance Issues: To identify a Dell Compellent bottleneck for a virtualization platform hosted on the Compellent, run Windows perfmon in a virtual machine, or on a physical machine, where a Compellent volume is presented via HBA or iSCSI initiator. Create a data collector set with the counters below and generate a report using the PAL tool. Extract seek time, latency, IOPS and queue depth for the Compellent storage; you will see bottlenecks in every area of storage you can think of. Read further on the Windows performance monitoring tools.
\LogicalDisk(*)\Avg. Disk sec/Read
\LogicalDisk(*)\Avg. Disk sec/Write
\LogicalDisk(*)\Disk Transfers/sec
\LogicalDisk(*)\Avg. Disk Queue Length
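Once you have exported the counters (perfmon and `typeperf` can both write CSV), a few lines of scripting will flag the bad samples for you. The CSV snippet below is fabricated sample data in the shape these tools produce, and 20 ms is a common rule-of-thumb latency threshold; adjust both to your environment.

```python
import csv
import io

# Flag perfmon/typeperf samples where average disk latency breaches a
# threshold. SAMPLE is fabricated data in typeperf CSV export shape.
SAMPLE = """\
"Time","\\\\HOST\\LogicalDisk(E:)\\Avg. Disk sec/Read","\\\\HOST\\LogicalDisk(E:)\\Avg. Disk sec/Write"
"10:00:00","0.004","0.006"
"10:00:15","0.035","0.052"
"10:00:30","0.003","0.005"
"""

def slow_samples(csv_text, threshold_s=0.020):
    """Return (time, read_s, write_s) rows where either latency breaches the threshold."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    flagged = []
    for t, rd, wr in rows[1:]:            # skip the header row
        rd, wr = float(rd), float(wr)
        if rd > threshold_s or wr > threshold_s:
            flagged.append((t, rd, wr))
    return flagged

for t, rd, wr in slow_samples(SAMPLE):
    print(f"{t}: read {rd * 1000:.0f} ms, write {wr * 1000:.0f} ms  <- investigate")
```

On the sample data only the 10:00:15 row is flagged, at 35 ms read and 52 ms write; sustained samples like that during a RAID scrub or data progression are exactly the signature described above.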
Use tools such as perfmon and PAL, as described above, to analyse workloads and storage performance in your storage area network.
Summary: Dell Compellent makes an interesting argument for an all-flash performance tier. But that argument lives in the sales pitch, not in reality. A price-conscious buyer who just needs any SAN and has a low-I/O environment can get by with a Compellent. For mainstream enterprise storage, Dell Compellent is a bad experience and can bring disaster to a corporate storage area network.
I have no doubt that when Compellent introduced all-flash arrays it was innovative, but Compellent's best days are gone. Just shop around: you will find better flash arrays nowadays, built on better software, controllers and SSDs. There are flash arrays on the market running clever code and algorithms in software to deliver high I/O, low latency and solid performance for sensitive applications.