*Ed note* I wrote this over 3 months ago but didn’t get around to posting. Still relevant though.
Everyone knew SSDs would change the storage landscape dramatically, but the speed of development and the rate at which capacities are growing is still impressive. Dell has taken the next step in SSD evolution by announcing support for TLC SSDs in our SC series storage arrays. We were first to market with high-capacity TLC drives in an enterprise storage array, and as far as I know we are still the only vendor that can mix multiple SSD types in the same pool.
Why do you care? It all comes down to the different SSD drive costs, quality, resiliency, and perhaps most importantly, *capacity*.
There is a lot of information out there about the various SSD types and their use cases, so I won’t go into much detail here (reference Pure SSD breakdown article). There are three types of SSD drives supported in Dell SC series (Compellent) arrays:
| NAND type | Dell class | Writes | Reads | $/TB and capacity |
|---|---|---|---|---|
| SLC | WI (Write Intensive) | Great | Great | $/TB high |
| MLC | PRI (Premium Read Intensive) | OK | Great | $/TB better, higher capacity drives |
| TLC | MRI (Mainstream Read Intensive) | Average | Still great | $/TB excellent, highest capacity |

The TLC drives come in a 2.5 inch form factor to boot, and massively outperform a 15K drive.
*Ed Note* The 1.6TB WI drives are Mixed Use WI drives.
Where SC series has its *magic sauce* is that it uses tiers of different disk types and speeds to move data within the array: hot data on Tier 1, warm data or write heavy data on Tier 2, and older, colder data on Tier 3. Typically Tier 3 would be NLSAS 7.2K spinning drives as they offer the best cost/TB. SC series can mix and match the drive types in the same pool because of *data progression*. New writes and the heavy lifting are handled by the top tiers, while the bottom tiers are used only periodically and only for reads.
The largest TLC drive at time of writing (Sept 2015) is 3.8TB. That is 3.8TB in a 2.5 inch caddy with low power consumption and no moving parts. I don’t have exact performance details, but for read workloads the MRI drives perform about the same as the PRI drives, while for random write workloads they deliver about half the performance of a PRI SSD. (Rule of thumb only, every workload is different. Speak to your local friendly storage specialist to get the right solution for your workloads.) Compared with a 15K spindle, you get better rack density, power savings and a huge performance boost per drive. Then consider a 4TB NLSAS drive: it is 3.5 inch, delivers 80 – 100 IOPS with a random workload, and spins constantly, so it has higher power consumption and moving parts. Obviously there are situations where an NLSAS drive can spin down when it is not being used, but that’s not the norm. The TLC drive is going to be more expensive than the NLSAS drive, but when you take into account power, footprint and added performance over the life of the array, it becomes a different calculation.
*magic sauce – secret sauce and magic dust together, like ghost chilli sauce, just with more tiers (geddit? hot sauce .. tears? )
You can see there are 4 capacities being supported at the moment.
- 480GB, 960GB, 1.9TB, 3.8TB!!!
Yup, a nearly 4TB drive in a 2.5 inch form factor that is low power and 1000s of times faster than a 15K spinning drive, at about the same cost per GB as a 15K drive. This is just the beginning; there are larger capacities on the roadmap. I wouldn’t be surprised to see the end of 15K and 10K drives in any of our storage arrays by the end of next year.
While we are on the topic, this is an excellent blog on the newer types of flash storage being tested and developed to help take Enterprise Storage into the future, whatever it looks like.
What are the gotchas? It can’t all be peaches and cream. You can see in the table above, there are different SSD types for different workloads. If you have a high write environment then the RI drives may not be a good fit because of the high erase cost and NAND cell resiliency. For that workload you would be better off with the WI SSDs.
However, most of the workloads I see, and the stats that come from our awesome free DPACK tool, show that most environments are about 70/30 R/W% with an average 32K IO size (a typical VM environment). These are great candidates for the RI drives.
Here is the great part for Compellent SC: if you want the best of both worlds, we can do that by using tiering and Data Progression to leverage a small group of WI drives to handle the write workload and a larger group of RI drives to handle all the read traffic, even though to the application it’s just one bucket of flash. Now we can provide an all-flash array, or a hybrid array with loads of flash, at a much lower $/GB, which is essential with the current data growth rates.
Data Progression in SC series
Here is an example. You have a VMware workload that you would like to turbo charge. You want to be able to support more IOPS but you also want those IOPS to be sub millisecond. You reach out to me, I talk about myself for the first 15 mins and then we run the free DPACK tool to analyse your workload.
- DPACK reports 70/30 R/W% and average 32K IOs, with 95% of the time sitting at 5000 IOPS peaking to 12000 during backups.
- There are also latency spikes throughout the day when the SQL devs run large queries at 10am and 2pm, but latency usually sits at about 3ms – 10ms. Not too bad, although during backups read latency sometimes jumps up to 30ms.
- Queue depth is pretty good and CPU/MEM usage is fine. Capacity is 60TB used, but a lot of that is probably cold data.
- Looking at the backups, about 2TB of data changes per day.
- The SQL devs want to lock the SQL volumes into flash because they write shitty queries and can’t be assed optimising them. (I used to be an Oracle DBA, devs are lazy).
- Growth is no more than 30% a year, but a lot of that will be data growth, not workload growth.
This is a very common workload I see. It helps that Australia and New Zealand are very highly virtualised, so a lot of the workloads we see are ESX, with Hyper-V becoming more common. With this much information it’s reasonably simple to design an SC array that I would be 100% confident would nail that workload.
It’s not a massive system and growth will mainly be in Tier 3, but there are a few writes from the SQL databases, so an SC4020 array with WI SSD, RI SSD, and NLSAS for the cold tier should do the trick.
The SC array handles tiering and incoming writes very differently to a lot of arrays in the market. All new writes come into the array in Tier 1 (the fastest tier) as RAID 10 (the most efficient write). This is done on purpose to get the write committed and the ack back to the application as fast as possible. The challenge is that R10 has 50% overhead, and with flash that can mean $$$. This is where the two tiers of SSD come into their own. Every couple of hours (2 hours by default), the SC array takes a replay (snapshot), which marks the volume’s blocks as read only. That data is then instantly migrated to the second, RI flash tier as R5 to maximise usable capacity. Because the data isn’t R/W anymore, there is no need for it to be R10. SC uses redirect on write, so new writes are written into Tier 1 as R10 and the volume pointers are simply updated.
That is a lot of info in a small paragraph, but you can see what is happening there: the WI tier does all the heavy lifting in the array, and then older data is moved to the RI tier to be read from. Then, as the data gets cold, it is typically moved to Tier 3 (NLSAS in my example) as R6. Same data, moved to the right tier at the right time to maximise performance and $/GB.
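To make that flow a bit more concrete, here is a toy Python model of it. This is only a sketch of the behaviour described above, not how the SC firmware actually works: the `Page` and `Volume` classes, the `age_out` trigger and all the names are made up for illustration. The only things taken from the post are the placements (new writes in Tier 1 as R10, replayed data in the RI tier as R5, cold data on NLSAS as R6) and the redirect on write behaviour.

```python
from dataclasses import dataclass, field

# Tier/RAID placements as described in the post. Everything else here
# (class names, the "cold" trigger) is a hypothetical illustration.
TIER1_WI = ("Tier 1 (WI SSD)", "RAID 10")
TIER2_RI = ("Tier 2 (RI SSD)", "RAID 5")
TIER3_NL = ("Tier 3 (NLSAS)", "RAID 6")


@dataclass
class Page:
    data: str
    placement: tuple = TIER1_WI  # new writes always land in Tier 1 as RAID 10
    read_only: bool = False


@dataclass
class Volume:
    pointers: dict = field(default_factory=dict)  # logical block -> current Page
    frozen: list = field(default_factory=list)    # pages captured by replays

    def write(self, block: int, data: str) -> None:
        # Redirect on write: never overwrite an existing page, just point
        # the block at a fresh Tier 1 page. Frozen pages stay where they are.
        self.pointers[block] = Page(data)

    def take_replay(self) -> None:
        # Freeze every still-writable page and move it down to the RI tier.
        for page in self.pointers.values():
            if not page.read_only:
                page.read_only = True
                page.placement = TIER2_RI
                self.frozen.append(page)

    def age_out(self, block: int) -> None:
        # Hypothetical trigger: treat this block's frozen page as cold.
        page = self.pointers[block]
        if page.read_only:
            page.placement = TIER3_NL


vol = Volume()
vol.write(0, "a")
vol.write(1, "b")
vol.take_replay()   # both pages become read only and move to Tier 2 as R5
vol.write(0, "a2")  # block 0 now points at a new Tier 1 R10 page
vol.age_out(1)      # block 1 has gone cold, so it drops to Tier 3 as R6

for block, page in sorted(vol.pointers.items()):
    print(block, page.data, page.placement, page.read_only)
```

The point of the sketch is that a write never touches a frozen page. It only updates a pointer, which is why the replayed data can sit on read-optimised flash.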
The replay is taken every 2 hours and the data is then moved down to Tier 2. This means we only need to size Tier 1 for the required IOPS and enough capacity to hold 2 hours’ worth of writes x 2 (R10 overhead). In my example above there is about 2TB of data written every day (assuming the worst case, where every write is a new write). If you break that into 2 hour chunks it’s less than 200GB per replay; double it for R10 and I would only need 400GB of WI SSD to service that 60TB workload. The reality is that there are spikes during the day, and the DPACK tool identifies those, but you get my drift.
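For anyone who wants to plug in their own numbers, the back-of-envelope Tier 1 maths looks roughly like this in Python. It is a sketch only: the function name is made up, it assumes writes are spread evenly across the day (which, as noted, they never are) and it uses decimal units (1TB = 1000GB).

```python
def tier1_capacity_gb(daily_change_tb: float,
                      replay_interval_hours: float = 2.0,
                      raid10_overhead: float = 2.0) -> float:
    """Raw Tier 1 capacity (GB) needed to hold one replay interval of writes.

    Assumes writes are spread evenly over 24 hours, which real workloads
    are not, so treat the result as a floor rather than a design number.
    """
    replays_per_day = 24 / replay_interval_hours
    writes_per_replay_gb = daily_change_tb * 1000 / replays_per_day
    return writes_per_replay_gb * raid10_overhead


# 2TB of change per day with the default 2 hour replay interval
print(f"{tier1_capacity_gb(2.0):.0f} GB")  # prints "333 GB"
```

The exact figure is about 167GB per replay and 333GB after R10. Rounding each replay up to 200GB is where the 400GB number comes from, and it leaves a little headroom for the spikes.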
So .. for Tier 1 let’s go with 6 x 400GB WI drives (1 is a hot spare). I won’t put the exact figures here, but those drives would smash that workload out of the park with 0.2ms latency.
Now I can focus on Tier 2 almost purely from a capacity standpoint. Remember, this tier will hold the data being moved down from the WI tier, but it’s also holding data classified as hot that gets read a lot. Everything in this tier will be R5 to get the best usable capacity number. They have 60TB, change 2TB a day, and the SQL DB they want to pin is 10TB. So I want to aim for about 18TB usable in this tier just to be safe. I don’t have to worry about SSD write performance on this tier because it will be nearly 100% read, except when data is moved down every couple of hours.
So .. for Tier 2 I’ll use 12 x 1.9TB MRI drives (1 hot spare). This gives me 18TB usable (not raw, you’ll find Dell guys always talk usable). Plenty of room for hot data and to lock the entire SQL workload in this tier. You would need shelves of 15K drives to get the same performance.
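If you want to sanity check a usable number like that yourself, a rough Python sketch is below. Big caveat: the post only says R5, so the 8 data + 1 parity stripe width is my assumption, the function name is made up, and it ignores formatting and metadata overhead, which is why it lands slightly above 18TB.

```python
def usable_tb(total_drives: int, drive_tb: float, hot_spares: int,
              data_strips: int, parity_strips: int) -> float:
    """Approximate usable capacity (TB) after hot spares and RAID parity.

    Ignores formatting and metadata overhead, so real usable capacity
    will come in a bit lower than this.
    """
    active_drives = total_drives - hot_spares
    raw_tb = active_drives * drive_tb
    return raw_tb * data_strips / (data_strips + parity_strips)


# 12 x 1.9TB MRI drives, 1 hot spare, assumed 8+1 RAID 5 stripe
tier2 = usable_tb(12, 1.9, hot_spares=1, data_strips=8, parity_strips=1)
print(f"{tier2:.1f} TB")  # prints "18.6 TB"
```

Swap in a narrower stripe (say 4+1) and the usable number drops, so check which RAID width your array will actually use before trusting the result.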
By splitting up the WI & RI tiers we get a level of flexibility that is difficult without tiers. If the write workload stays static, in other words stays around the same IOPS number and TB/day, there is no need to grow the WI tier. However, some other business units see the benefits that the SQL guys are getting and want in on that action. We can grow the WI and RI tiers separately. Simply add a couple more 1.9TB RI drives and that tier gets bigger. We then change the Storage Profile on that volume (and with VVols we’ll change it on that VM) and voila, that volume is now pinned to flash.
Finally, we need another 40TB for the rest of the workload + 30% a year for growth over three years = approx 90TB.
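The growth sum works out if you assume the 30% compounds each year (my reading, since simple growth would only get you to about 76TB). A quick Python sketch, with a made-up function name:

```python
def projected_tb(current_tb: float, annual_growth: float, years: int) -> float:
    """Capacity after compounding annual growth for the given number of years."""
    return current_tb * (1 + annual_growth) ** years


# 40TB today, 30% growth a year, three years out
print(f"{projected_tb(40, 0.30, 3):.1f} TB")  # prints "87.9 TB"
```

That is close enough to call it 90TB for sizing purposes.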
Note: you can add drives to an SC array at any time and the pool will expand and rebalance, so you don’t have to purchase everything upfront. Also, with thin provisioning, thin writes, compression, RAID optimisation etc. there are extra savings, but I’ll leave those out for now.
Like the RI tier, all the data in here will be read only, and for larger drives it will be R6. Because we don’t write to this tier besides internal data movement, we are squeezing as much performance out of the spinning rust as possible. The key is to not have too much of a divide between the SSD tiers and the NLSAS tier. Again, DPACK allows us to size for the workload instead of guessing. We know the workload is 5000 IOPS, so I want this tier to handle about 15-20% of the total number: 1000 IOPS (that’s convenient). The NLSAS drives aren’t being written to, so there is no RAID write penalty and I can assume 80 IOPS per drive. 12 drives gets me very close to my IOPS number with a hot spare, and magically it’s also the number of 3.5 inch drives we can fit in a 2U enclosure. It’s almost like I’m making this up. I have the drive number for performance, but I want to get to 90TB usable at R6. Different story: with 24 x 6TB drives we get about 100TB usable. The good thing is I know I have met the performance brief.
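Here is the Tier 3 check as a small Python sketch, covering both the IOPS target and the capacity. The 5000 IOPS, 80 IOPS per drive, 20% share and 24 x 6TB figures come from the example above. The 8+2 RAID 6 stripe width and the 2 hot spares (one per enclosure) are my assumptions for illustration, and the function names are made up.

```python
WORKLOAD_IOPS = 5000       # steady state figure from the DPACK example
NLSAS_IOPS_PER_DRIVE = 80  # read only tier, so no RAID write penalty applied


def tier3_iops_target(workload_iops: int, share: float) -> float:
    """IOPS the cold tier should be able to serve (a share of the total)."""
    return workload_iops * share


def delivered_iops(drives: int, per_drive: int = NLSAS_IOPS_PER_DRIVE) -> int:
    """Aggregate random read IOPS from a group of spinning drives."""
    return drives * per_drive


def raid6_usable_tb(active_drives: int, drive_tb: float,
                    data_strips: int = 8, parity_strips: int = 2) -> float:
    """Approximate usable TB at RAID 6, ignoring formatting overhead."""
    return active_drives * drive_tb * data_strips / (data_strips + parity_strips)


target = tier3_iops_target(WORKLOAD_IOPS, 0.20)  # 1000.0
print(target, delivered_iops(12), delivered_iops(24))

# 24 drives minus 2 assumed hot spares leaves 22 active drives
print(f"{raid6_usable_tb(22, 6):.1f} TB")  # prints "105.6 TB"
```

Twelve drives give 960 IOPS, which is the "very close" number. The 24 drives needed for capacity deliver nearly double the target, which is why the performance brief is comfortably met. After formatting overhead the capacity figure ends up in the region of 100TB usable.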
Still with me? This has been a longer explanation than I intended. Speaking of puns, I hoped some of the 10 puns I have in this post would make you laugh, but sadly no pun in ten did.
End result: I have an SC4020 with 18 SSD drives (6 spare slots for expansion), plus 2 extra SC200 enclosures with 24 x 6TB NLSAS drives. 6RU in total, and it nails the performance and growth rates needed.
You can see, having the option for multiple flash types makes for very flexible and cost effective solutions.
Where to from here? I’m sure drive capacities will continue to grow and grow, with the newer types becoming more mainstream. Samsung released a 12GB SSD recently and without doubt we’ll see that sort of capacity in our arrays over time. Imagine having a 16TB SSD in 2.5 inch. 32TB? Or a 1RU XC630 Nutanix node with 24 x 4TB 1.8 inch SSDs? The only issue is we still have to back it up!!!
*Final Ed note* Since I wrote this post Dell has released the SC9000 platform. When it is paired up with all flash it is a monster.