Hard Drive Sustainability

Your hard drive with very important family pictures has just failed, and now all of the data is lost forever. Could you have prevented this from happening? This article is a quick walkthrough of how to detect hard drive errors before the disk becomes unusable.

Stephen Dunn

At the end of this article, you should be able to spot hard drive errors before they become fatal, and take steps to back up your data before a total disaster has taken place. RAID is helpful in this situation, but it will not be the primary focus of this article, since it does not detect errors; it only helps you recover from one. Instead, we will be talking about smartmontools.

For a brief overview, smartmontools is software that allows you to monitor the SMART capabilities of your hard drives. It is available for both Windows and Linux, but this tutorial only covers how to use it on a Linux system. The software package typically comes with both smartctl and smartd. With just a few small steps, we will be able to continuously monitor the hard drives and stay alert to any issues.

In order to get started, your hard drives must support SMART. The majority of hard drives do, so this should not be an issue. To verify that your hard drive supports SMART, run:

:~$ sudo smartctl -i /dev/sda | grep support
    SMART support is: Available - device has SMART capability. 
    SMART support is: Enabled

The output should say both Available and Enabled. If the hard drive does not say Available, I suggest you stop reading and replace that hard drive with one that does. Most drives, however, will report both. If for some reason your drive says Available but not Enabled, simply run:

:~$ sudo smartctl -s on /dev/sda
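
The check above can also be scripted. The following is a minimal sketch, not part of smartmontools: the function name smart_support_state is made up for illustration, and it assumes the "SMART support is:" lines look like the output shown earlier.

```shell
#!/bin/sh
# Classify the "SMART support is:" lines that `smartctl -i` prints.
# Reads smartctl output on stdin; prints enabled, disabled, or unsupported.
# (smart_support_state is a hypothetical helper name.)
smart_support_state() {
    info=$(cat)
    if ! printf '%s\n' "$info" | grep -q 'SMART support is: *Available'; then
        echo unsupported
    elif printf '%s\n' "$info" | grep -q 'SMART support is: *Enabled'; then
        echo enabled
    else
        echo disabled
    fi
}

# Example (needs root and a real drive):
#   sudo smartctl -i /dev/sda | smart_support_state
```

If the function prints disabled, the smartctl -s on command above is the next step.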

Now that we have verified the drive’s capability, we can start to monitor its status by performing regular self-tests. There are two types of tests: a short test that takes at most 5 minutes to complete, and a long test that might take 2 hours to complete. The main difference between the two is that the short test only checks a segment of the drive, while the long test checks all segments. The tests are typically not demanding on your drives, so it is recommended to run the short test daily and the long test weekly or monthly. Let’s get started with editing some configuration files in order to have the tests run for us automatically.
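
Before automating anything, a self-test can also be started by hand with smartctl's -t option. The helper below is only a sketch: selftest_command is a hypothetical name, and all it does is validate the test type and print the command line so you can review it before running it as root.

```shell
#!/bin/sh
# Build the smartctl command line that starts a self-test by hand.
# $1 is the test type (short or long), $2 is the device node.
# (selftest_command is a hypothetical helper name.)
selftest_command() {
    case "$1" in
        short|long) printf 'smartctl -t %s %s\n' "$1" "$2" ;;
        *) echo "usage: selftest_command short|long DEVICE" >&2; return 1 ;;
    esac
}

# Example (needs root and a real drive):
#   sudo $(selftest_command short /dev/sda)
```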

Two configuration files need to be edited on a new install of smartmontools. The two files are:

:~$ vim /etc/default/smartmontools
:~$ vim /etc/smartd.conf

In the first file, /etc/default/smartmontools, un-comment the last two lines:

# Automatically starts the process after boot in order to continuously monitor hard drives
start_smartd=yes
# 30 minute intervals
smartd_opts="--interval=1800"

Now we can edit the smartd.conf file in order to set up the hard drives that we would like to monitor. Upon opening it, you will see a well-commented configuration file with several examples and options. Today, however, we are going to schedule our first hard drive, /dev/sda, to have a short test each day at 10pm and a long test every Saturday at 1am. In order to do that, please comment out the line that starts with DEVICESCAN and add the following:

#DEVICESCAN -d removable -n standby -m root -M exec /usr/share/smartmontools/smartd-runner 
/dev/sda -a -s (S/../.././22|L/../../6/01) -m root -M exec /usr/share/smartmontools/smartd-runner

Let’s explain what is happening in the configuration above in order to better help you write your own. We first disable DEVICESCAN. Its basic purpose is to scan for all of your devices automatically, but this does not work on every system, so we comment it out and list each drive explicitly.
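
With DEVICESCAN disabled, every drive needs its own line. As a rough sketch, the function below prints one entry per device, reusing the options from the /dev/sda line above. The name smartd_entries and the second device /dev/sdb are illustrative assumptions.

```shell
#!/bin/sh
# Print one smartd.conf entry per device given as an argument,
# reusing the schedule and notification options from the /dev/sda line.
# (smartd_entries is a hypothetical helper name.)
smartd_entries() {
    for dev in "$@"; do
        printf '%s -a -s (S/../.././22|L/../../6/01) -m root -M exec /usr/share/smartmontools/smartd-runner\n' "$dev"
    done
}

# Example: review the output first, then append it to the config file.
#   smartd_entries /dev/sda /dev/sdb
#   smartd_entries /dev/sda /dev/sdb | sudo tee -a /etc/smartd.conf
```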

We pass in -a after the hard drive name to say that we want to monitor all attributes. There are several attributes that you might not care about, so if you specify something like -I 194 it would ignore the temperature sensor attribute. The next option, -s (S/../.././22|L/../../6/01), directly translates to our schedule from above. It is a regular expression in which S/ means short and L/ means long. The fields that follow are month/day of month/day of week/hour, where each . matches any single character (so .. matches any two-digit value) and | means or. Days of the week are numbered from 1 (Monday) to 7 (Sunday). So the above reads: run a short test on any day of the year at 10pm, or a long test on Saturday, the 6th day of the week, at 1am.
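
To sanity-check a schedule before handing it to smartd, you can match it against a time string yourself. This sketch approximates smartd's behavior with grep -E and whole-line matching; the names schedule_matches and current_slot are hypothetical, and the sample dates are arbitrary.

```shell
#!/bin/sh
# Succeed if a time string of the form T/MM/DD/d/HH matches a -s regex.
# $1 is the regex from smartd.conf, $2 is the string to check.
# (schedule_matches is a hypothetical helper; grep -Ex only approximates smartd.)
schedule_matches() {
    printf '%s\n' "$2" | grep -Eqx -- "$1"
}

# Build the time string for the current hour.
# $1 is the test type letter (S or L); %u is the day of the week,
# 1 = Monday through 7 = Sunday.
current_slot() {
    date "+$1/%m/%d/%u/%H"
}

# Example:
#   schedule_matches '(S/../.././22|L/../../6/01)' 'L/03/16/6/01' && echo "long test due"
#   schedule_matches '(S/../.././22|L/../../6/01)' "$(current_slot S)" && echo "short test due now"
```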

Finally, the -m root option mails root if any test fails, and -M exec /usr/share/smartmontools/smartd-runner executes all of the scripts located in /etc/smartmontools/run.d. Both of the options can be removed, and then a manual check of the test results can always be performed using:

:~$ sudo smartctl -l selftest /dev/sda
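
Besides the printed log, smartctl reports problems through its exit status, which is a bit mask described in its man page. The sketch below decodes that status into short messages. The function name describe_smartctl_status is hypothetical and the wording of each message is paraphrased from the man page, so treat it as a starting point for a manual check script.

```shell
#!/bin/sh
# Print the meaning of each bit set in a smartctl exit status.
# Bit meanings are paraphrased from the smartctl man page.
# (describe_smartctl_status is a hypothetical helper name.)
describe_smartctl_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no problems reported"
        return 0
    fi
    bit=0
    for meaning in \
        "command line did not parse" \
        "device open failed" \
        "a SMART or ATA command failed" \
        "SMART status: DISK FAILING" \
        "prefail attributes at or below threshold" \
        "attributes were at or below threshold in the past" \
        "device error log contains errors" \
        "self-test log contains errors"
    do
        # Test bit number $bit of the status value.
        if [ $(( (status >> bit) & 1 )) -eq 1 ]; then
            echo "$meaning"
        fi
        bit=$((bit + 1))
    done
}

# Example (needs root and a real drive):
#   sudo smartctl -l selftest /dev/sda >/dev/null
#   describe_smartctl_status $?
```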

If you’re anything like me and like to write scripts, then placing a script into /etc/smartmontools/run.d could prove to be very fun. There are several variables that you can use in your scripts; they are documented in man smartd.conf.
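
As a small example of such a script, the sketch below appends one line per warning to a log file. It relies on the SMARTD_DEVICE, SMARTD_FAILTYPE and SMARTD_MESSAGE environment variables that smartd sets for the -M exec program, and assumes they are passed through to the scripts in run.d. The function name log_smartd_alert and the log path are hypothetical.

```shell
#!/bin/sh
# Append one line describing the current smartd warning to a log file.
# smartd exports SMARTD_DEVICE, SMARTD_FAILTYPE and SMARTD_MESSAGE to the
# -M exec program (see man smartd.conf); fall back to placeholders if unset.
# (log_smartd_alert is a hypothetical helper name.)
log_smartd_alert() {
    logfile=$1
    printf '%s device=%s type=%s message=%s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" \
        "${SMARTD_DEVICE:-unknown}" \
        "${SMARTD_FAILTYPE:-unknown}" \
        "${SMARTD_MESSAGE:-none}" >> "$logfile"
}

# In a run.d script this could be called as, for example:
#   log_smartd_alert /var/log/smartd-alerts.log   # hypothetical path
```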

Finally, don’t forget to restart your smartd service. There are a few ways to do this depending on your distribution, but the command below might work for you:

:~$ sudo service smartmontools restart

Congratulations! You now have an automated hard drive checking system. This is a great way to detect problems in your hard drives before it is too late. Hopefully you are now able to set something like this up on your servers.