Data storage was revolutionized by the hard disk drive, but the trend has now shifted toward solid-state drives. Researchers at Facebook and Carnegie Mellon, however, have shared some striking findings from years of SSD experience. They concluded that SSD performance and reliability suffer from write fatigue and temperature. Read on for the details.
Facebook was an early adopter of SSDs, so the study is based on millions of device-days of experience. Unfortunately, the results aren't broken down by vendor, but they are broken down by SSD age, which means the oldest drives are roughly first-generation devices and the newest are second-generation.
More important is the team's definition of failure: an uncorrectable read error (URE) leading to data loss. This doesn't mean the SSD was completely dead; it means the drive reported at least one read error it could not correct, and such errors tended to be followed by more. Furthermore, since SSDs don't report the internal read errors their controllers manage to correct, the only read errors the study could capture were those reported to the host server, which can sometimes reconstruct data that SSD controllers can't.
Facebook's study revealed several factors that negatively influence the performance and lifespan of SSDs. They are:
Temperature- SSDs are sensitive to temperature: when they get hot, they may throttle back on performance, which led to unexplained slowdowns on some servers. Facebook's study found that some first-generation SSDs failed more often as temperature rose, possibly due to a lack of throttling. Some second-generation SSDs throttled aggressively enough to reduce failure rates, while others kept the failure curve flat.
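As a quick illustration of how you might act on this finding, here is a minimal Python sketch that flags drives whose reported temperature is high enough that thermal throttling could explain a slowdown. The 60 °C threshold and the sample readings are assumptions for illustration only; real thresholds vary by vendor, and real temperatures would come from a tool such as smartctl.

```python
# Illustrative sketch, not part of the study: flag SSDs running hot
# enough that throttling may explain server slowdowns.
THROTTLE_SUSPECT_C = 60  # assumed threshold; check your vendor's datasheet

def flag_hot_drives(temps_c):
    """Given {drive_name: temperature_in_C}, return the names of drives
    that are hot enough to warrant a throttling investigation."""
    return sorted(name for name, t in temps_c.items()
                  if t >= THROTTLE_SUSPECT_C)

# Example usage with made-up readings:
readings = {"sda": 42, "sdb": 63, "sdc": 71}
print(flag_hot_drives(readings))  # → ['sdb', 'sdc']
```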
Bus power- Facebook's researchers confirm that SSDs are thirsty. PCIe v2 SSDs ran anywhere from 8 to 14.5 watts, a high and surprisingly wide range. The team found that as power consumption rose, so did failure rates.
Write fatigue- the researchers found that the level of system write activity correlated with SSD failure, since SSD writes draw a lot of power and wear out flash cells. For write-heavy applications such as logging, HDDs may therefore be a better choice than SSDs.
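To make the write-fatigue concern concrete, here is a back-of-the-envelope sketch for estimating how much of a drive's rated write endurance a workload has consumed. The TBW (terabytes-written) rating and the logging-server numbers are hypothetical, not figures from the study.

```python
# Illustrative sketch with assumed numbers: rough endurance accounting
# for a write-heavy workload on an SSD with a vendor TBW rating.

def endurance_used(bytes_written, rated_tbw):
    """Fraction of the drive's rated write endurance consumed so far."""
    return bytes_written / (rated_tbw * 1e12)

def days_until_worn_out(bytes_written, rated_tbw, bytes_per_day):
    """Rough days remaining at the current write rate."""
    remaining = rated_tbw * 1e12 - bytes_written
    return remaining / bytes_per_day

# Hypothetical logging server: 150 TB already written to a 300 TBW
# drive, currently writing 0.5 TB per day.
print(endurance_used(150e12, 300))               # → 0.5
print(days_until_worn_out(150e12, 300, 0.5e12))  # → 300.0
```

The same arithmetic, run against real SMART counters, is one way to decide whether a logging tier belongs on HDDs instead.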
Other reasons for SSD failures- A.) Returning to UREs: SSD failures are relatively common, and the study found that 4.2 to 34.1 percent of SSDs reported uncorrectable errors. In fact, an SSD that reported an error in one week had a 99.8 percent chance of reporting another the following week.
B.) Life cycle and failures- Hard disks show infant mortality, then a few years of good reliability, and then age catches up with them. SSDs instead show an early period of UREs as faulty cells are identified, after which reliability improves, until wear-out leads to increasing read failures.
C.) The data layout menace- With disk drives, data layout doesn't affect reliability unless it involves a lot of random seeks. With SSDs, the situation is different: the study found that both sparse logical data layouts (non-contiguous data) and certain dense data structures lead to higher SSD failure rates. Sparse allocation corresponds to access patterns that write small amounts of non-contiguous data, forcing the SSD controller to erase and copy data more often than contiguous writes would.
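The contiguous-versus-scattered effect can be sketched with a toy model. This is an assumption-laden illustration, not the study's methodology: it uses a simplified erase-block geometry and crudely treats each distinct block touched by rewrites as one forced erase.

```python
# Toy model of why scattered small writes stress an SSD more than
# contiguous writes: rewriting any page in an erase block forces the
# controller to erase (and copy) that whole block, so the number of
# distinct blocks touched is a crude lower bound on erase work.
PAGES_PER_BLOCK = 64  # assumed geometry for illustration

def blocks_erased(page_addresses, pages_per_block=PAGES_PER_BLOCK):
    """Count the distinct erase blocks touched by a set of page writes."""
    return len({addr // pages_per_block for addr in page_addresses})

# Writing 64 pages contiguously touches a single block ...
contiguous = list(range(64))
# ... while the same 64 pages scattered one-per-block touch 64 blocks.
scattered = [i * PAGES_PER_BLOCK for i in range(64)]

print(blocks_erased(contiguous))  # → 1
print(blocks_erased(scattered))   # → 64
```

Under this model, the same volume of data costs 64 times the erase work when it lands non-contiguously, which is the intuition behind the study's layout finding.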
If you manage servers that use SSDs, you should read the paper "A Large-Scale Study of Flash Memory Failures in the Field". It offers an evidence-based view of SSD behavior, with empirical detail about SSDs available nowhere else.