Smartd removes the need to have smartctl run from a cron job - all you need to do is ensure that smartd runs at startup, and you modify the conf file appropriately (man smartd) It will then check. In this article I will give you an overveiw on the smartmontools which is a set of applications that can test hard drives, automatically notify you when the failure rate rises and read the harddisk SMART statistics to detect failures early. I will cover installation, usage on the.

PLEASE READ FOR FUTURE UPGRADE INFORMATION

The easy to use, hard drive diagnostic software.

Download a Free Trial
See compatibility with 10.15

Works on any Mac running OS X with internal HDDs or SSDs

Also has limited support for external HDDs or SSDs.

Buy Now

Purchasing link coming soon!

Only $25 for a personal license!

Other also licenses available:

$40 for a family license, $100 for a business license, $65 for an educational site license, $350 for a consultant license

WHAT IS SMART UTILITY?
SMART Utility is an application to scan the hardware diagnostics system of hard drives. SMART (Self-Monitoring, Analysis, and Reporting Technology) is a system built into hard drives by their manufacturers to report on various measurements(called attributes) of a hard drive’s operation. The attributes can be used to detect when a hard drive is having mechanical or electrical problems, and can indicate when the hard drive is failing. SMART Utility can read and display these attributes. This allows time to hopefully backup, and then replace the drive. SMART Utility also allows running a drive’s built in self test, which can also indicate malfunctions on the drive.

WHY USE SMART UTILITY?
SMART Utility is different from other drive utilities, such as Disk Utility, which only read the overall SMART Status. SMART Utility not only displays the individual attributes to see their status and information, but it also uses an internal algorithm based on those attributes to detect drives failing before SMART indicates it has failed. This pre-fail detection can save precious data before SMART has determined that the drive has failed. And, while the raw information can be viewed on the command line with smartmontools (which is what SMART Utility is based on, SMART Utility presents it in an easy to read format, as well as running its internal pre-fail algorithm. Plus, with the ability to run self tests, problems can be detected even sooner.

FEATURES
Displays all supported internal drives and their partitions, as well as some external drives (if optional SAT SMART driver is installed)
Displays important information in the main window, such as drive model, capacity, power on hours, temperature, bad sector counts, and error counts and types
Displays easy to read overall SMART status with color coded text
Displays more detailed information in separate windows, including capabilities, all available attributes, and the past five errors
Displays information using the Growl notification service (if installed) and email notifications (if configured)
Displays information in menu bar
Supports scanning in the background
Supports running a hard drive’s built-in test, and displays the results of the test
Supports scanning OS X software RAID drives, as well as drives in many RAID enclosures and cards (including SeriTek drives)
Supports logging all information to a log file for verifying SMART data
Supports customizing the pre-fail algorithm, including only alerting new bad sectors and error counts
Supports saving drive reports for later viewing
Supports printing drive reports
Supports HDDs and SSDs
Supports Mac OS X 10.9 through 10.14
Fully localized in French, thanks to Ronald A. Leroux

View the FAQ that is also available in the app under the Help menu.

Note: SMART Utility is based on the command line “smartmontools”, an open source software package that does the actual scanning of SMART attributes. SMART Utility only parses the data that smartmontools outputs, and it would not exist without it. It is available on their site.

SCREENSHOTS

Main Window
Info Window
Attribute Window

Test Window
Main Window with Errors
Attribute Window with Errors

Errors Window
Menu Extra

Copyright © 2003, 2004, 2005, 2006 Douglas Gilbert

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation; with no Invariant Sections, with no Front-Cover Texts, and with no Back-Cover Texts.

For an online copy of the license see www.fsf.org/copyleft/fdl.html.

Revision History
Revision 1.62006-10-22dpg
auto '-d sat', background scan, windows device names
Revision 1.52006-06-24dpg
device type 'sat'
Revision 1.42006-05-08dpg
5.38 update, SATA, SAS
Revision 1.32004-09-25dpg
error counter descriptions, error events log page
Revision 1.22004-05-27dpg
reorganise, details in appendix, version 5.31
Revision 1.12003-10-13dpg
freebsd, timestamp
Revision 1.02003-05-26dpg
first cut

Abstract

This article describes how smartmontools interacts with SCSI storage devices (mainly hard disks and tape drives). Smartmontools is a SMART utility toolset. SMART is an acronym for Self-Monitoring, Analysis and Reporting Technology. Smartmontools is available for the these operating systems: Darwin (Mac OS X but with no SCSI support yet), FreeBSD, Linux, NetBSD, OpenBSD, OS/2 (no SCSI support), Solaris and Windows.

Table of Contents

Introduction
Overview of Smartmontools
Operating Systems
SCSI disks
SATA disks
SMART
smartctl command line utility
Self Tests
Error Logs
Background scan
smartd daemon
TapeAlert
Examples
RAID, JBOD and Enclosures
A. Details
Standards
Informational Exceptions
IE reporting
smartctl debug
Links

Smartmontools controls and monitors storage devices using theSelf-Monitoring, Analysis and Reporting Technology (SMART) system. This toolset was originally builtfor the Linux operating system and has been ported to Darwin forMac OS X (no SCSI support yet), FreeBSD, NetBSD, OpenBSD, OS/2 (no SCSI support), Solaris and Windows. This article describes how smartmontools interacts with SCSI devices.Passing reference is also made to devices that use the SCSI commandset such as USB mass storage devices and IEEE1394 devices that usethe 'sbp2' protocol. In many situations SATA disks are accessedusing a (partial) SCSI command set.

The primary web site for smartmontools is atsmartmontools.sourceforge.net from which thelatest versions (both source and binaries) can be obtained. Smartmontoolsgrew out of the now dormant smartsuite project whichis still available on its own sourceforge site. The smartmontools main pageconcentrates on ATA devices.This article supplies some SCSI specific information forthose users of smartmontools that wish to monitor SCSI storage devices.

This document outlines the features found in smartmontoolsversion 5.37 that are relevant to SCSI disks and tape drives.This document was last altered on 22nd October 2006.

Smartmontools is made up of two executable programs, a configuration fileand online documentation (on Unix systems in the form of 'man' pages).The two executable programs are:

  • smartctl: a command line utility

  • smartd: a daemon program providing amonitoring service

SCSI disks and tape drives allow self tests of their media, often monitorthe temperature of the device, maintain error counters and report whenvarious failure prediction thresholds are exceeded. To view the informationavailable try a command like: smartctl -a /dev/sda. IfSMART reporting has not been turned on for this disk then use this commandfirst: smartctl -s on /dev/sda. [For operating systemsother than Linux replace /dev/sda with a SCSI disk device name.]

The smartd daemon program is a service typically startedwhen a machine boots up. In can monitor multiple disks (both ATA and SCSI).In Unix systems its configuration file canbe found /etc/smartd.conf. It sends alerts to thesystem logs and can be configured to email system administrators whenpending failures are reported.

Smartmontools was originally written for Linux. Since then it has beenported to various other Unix based systems and Windows. Note that thedevice names are based on the transport that an operating system sees.These days it is not uncommon for an operating system to see atransport that only conveys SCSI commands connected, via some commandtranslation bridge, to an ATA disk. Examples are USB external diskenclosures and SATA disks behind a SCSI to ATA Translation Layer (SATL)in a SAS or FC domain.

The names of SCSI disk and tape devices vary with the operating system.Here is a summary:

Table 1. SCSI device names in various systems

diskstapesNotes
Linux/dev/sd[a-z]/dev/[n]st[0-9]
FreeBSD/dev/da[0-9]/dev/[n e]sa[0-9]
NetBSD/dev/sd[0-9]+c/dev/st[0-9]+c
OpenBSD/dev/sd[0-9]+c/dev/st[0-9]+c
Solaris/dev/rdsk/c?t?d?s?/dev/rmt/*
Windows/dev/scsi[0-9][0-f]/dev/scsi[0-9][0-f]ASPI adapter:0-9, ID:0-15
/dev/sd[a-z]for '.PhysicalDrive[0-25]'
/dev/pd[0-255]for '.PhysicalDrive[0-255]'
/dev/tape[0-255]for '.Tape[0-255]'
Darwinno support for SCSI devices
OS/2no support for SCSI devices

The above list is a simplification. In Linux there can be multipledrive letters followed by a partition number (1 to 15). Smartmontools willignore the partition number if it is given and query the underlying device.In Linux the SCSI tape device name can be 'nst' and a letter can beappended to the device name, both decorations are ignored by smartmontoolsas it accesses the underlying tape drive. Also in Linux, SCSI devices canbe accessed via their generic name which is of the form/dev/sg[0-9].

Linux also has an optional Solaris likenaming scheme for SCSI device (scsidev), devfs (mainly used in the lk 2.4series) and udev (devfs's replacement in the lk 2.6 series). In short, devicenaming is a complex area and smartmontools does its best to findand identify (i.e. whether ATA or SCSI) a device depending on its name. Insome cases smartmontools needs guidance from the user and this can be givenby the '-d ata scsi sat marvell 3ware,N' option in thesmartctl utility and in smartddaemon's configuration file.

Windows has several schemes for naming devices. The 'scsi[0-9][0-f]' schemeuses the aspi dll from Adaptec. That dll is not distributed with Windows. Theother schemes use the 'SCSI Pass Through' interface which is native toWindows in NT and later. In all cases for Windows, the leading/dev/ is optional.

What is a SCSI disk? A SCSI disk is a storage device that 'talks' the SCSIcommand set. An ATA disk is a storage device that 'talks' the ATA command set. That seems pretty clear. However the command set that adisk uses at its connector (and thus shown on its label) may not bethe command set that the operating system needs to use due to commandset translation between the OS and the disk.

The ATA command set is used over native ATA transports which areparallel ATA (PATA) up to 133 MB/sec and serial ATA (SATA) at linkspeeds of 1.5 Gbps (approximately 150 MB/sec) or 3 Gbps. In the pastwhen ATA disks needed to use some other transport (e.g. USB and IEEE1394)the SCSI command set was sent over the foreign transport. So in thiscase the operating system sees a device 'talking' the SCSI command setbut the device is really an ATA disk. Many current disk external enclosurescontains ATA disks yet seen from the operating systems view point areUSB mass storage devices talking the SCSI command set.

The SCSI command set is used over various transports: the SCSI ParallelInterface (SPI), Fibre Channel (FCP), Serial Attached SCSI (SAS),IEEE1394 (SBP), USB (mass storage) and iSCSI. Many of these transports canconvey multiple command sets (i.e. not just the SCSI command set). TheSAS transport is interesting as it can convey both the SCSIand ATA command sets. There is also the case of a RAID made up of ATAdisks which communicates to host operating system with the SCSI commandset (e.g. 3ware RAID controller).

So what does all this mean for smartmontools? In most cases the answer isnot good news. Devices such as USB external disk enclosures translate incoming (from the host) SCSI commands to their ATA equivalents and processresponses as required. This translation is limited typically to a smallnumber of SCSI commands (e.g. READ and WRITE) but notthose commands needed by smartmontools. The author does not know of any SCSI_over_USB devices that support Smartmontools. The 3ware RAID (6000,7000, 8000 and 9000 series Escalade) controllers are supportedon several operating systems with special code.[1]

There is an emerging SCSI to ATA Translation (SAT) standardat www.t10.orgthat may lead to improvements in this area. Apart from definingsome of the facilities smartmontools needs, it defines two ATA PASS THROUGHSCSI commands. These pass through commands could be used in much thesame way that the 3ware RAID tunnels ATA commands.

The device type '-d sat' instructs the smartctlcommand and the smartd daemon, to form SMARTcommands for the ATA command set and then package those commandswithin the ATA PASS THROUGH SCSI commands. The SCSI commandsare then sent to the 'SCSI' device that the operating systemhas been given. In version 5.37 of smartmontools it is no longernecessary to specify '-d sat' in this situation. All that isneeded is a SATL that complies with the emerging SAT standard.If the automatic detection of an ATA disk behind a SATL istricked, '-d scsi' (or some other device type) can be used tooverride.

It has been reported that many external USB enclosures use a 'Cypress'chipset. This contains an ATACB proprietary pass through (for ATAcommands passed through SCSI commands) for which some publiclyavailable information is available. Smartmontools has no ATACB specificcode but may move in this direction in the future. Another approach isto hope USB and SBP2 external enclosures adopt the SAT standard in thenear future. One interesting comment about ATACB is that it should notbe used at the same time as other types of access to the disk (e.g. amounted file system)! That implies that a disk should be takenoffline before smartmontools is used on it. It also implies thatthe smartd background daemon should not be used.

SATA disks use a 1.5 or 3 Gbps serial transport which carries theATA command set. The serial connection is point to point so each SATA diskneeds its own cable and plug on the host adapter or motherboard.[2]Many aspects of SATA are like SCSI and some operatingsystems use existing SCSI infrastructure to handle SATA hosts (e.g.Linux's libata).

Serial Attached SCSI (SAS) can be viewed as a superset of SATA.It can directly connect thousands of SAS disks to one or more controllersspread across multiple machines in one SAS 'domain'. Such a domain canalso contain SATA disks, connected to intermediate fanout devices calledexpanders (similar to switches in networking). Most SAS host adapterscan also have SATA disks connected directly to the adapter (whichtechnically is not a usage of SAS but that is of little concern tothe end user).

So a SATA disk may be connected

  • to a SATA host controller (on a motherboard or an adapter)

  • directly to a SAS host adapter

  • to a SAS expander which is connected to one or more SAS host adapters

  • or connected via a bridge which is connected to the host computer viasome other transport (e.g. fibre channel)

Since all but the first item might have other disks connected whichuse the SCSI command set (e.g. SAS and FC disks) often the SATA diskshave a SAT layer put in front of them so they look like SCSI disks.That SAT layer may be in:

  • the operating system kernel (e.g. libata in Linux)

  • in the host adapter firmware (or RAID controller)

  • or external to the host computer: within a disk enclosure (e.g.associated with a SAS expander)

For normal file system work, a SCSI to ATA Translation Layer (SATL) onlyneeds to concern itself with around 6 commands. Unfortunatelysmartmontools uses other commands (both in the SCSI and ATAcommand sets). Probably the simplest way to handle SMART for SATA disksbehind a SAT layer is to use the ATA PASS THROUGH SCSI commands.

smartmontools guesses the disk command set (i.e. ATA or SCSI)based on the device node it is given. For example in Linux,/dev/hda would be assumed to use the ATA command setwhile /dev/sda would be assumed to use the SCSIcommand set. [3]By using either the '-d ata' or the '-d scsi' option, the command setguess made by smartmontools can be overridden. The '-d sat' devicetype causes smartmontools to generate ATA commands which are then packagedwithin the ATA PASS THROUGH SCSI commands (defined by the SAT standard)and then sent to the device via a SCSI pass through mechanism.As noted in the previous section, version 5.37 of smartmontools nowautomatically detects a SATA disk behind a SAT layer and acts asif '-d sat' has been given.

SMART never attained the status of 'standard' and itsoriginal documents have been withdrawn. Its catchy name lives on, especiallyon vendors' web sites and obviously in the name of this toolset. Luckilythe good ideas in SMART have been incorporated into theATA and SCSI standards albeit in slightly different forms.

Initially SMART began on SCSI disks as vendorspecific extensions. Gradually the SMART functionality hasmoved into the standards (often by other names) and vendors are improvingtheir standards' compliance. [In the vendors' defence some of the'standards' are drafts and are yet to be ratified.] Some SCSI disk vendorshave product manuals (available on the net) that cover the parts of the SCSIcommand set that their disks support. Some of these manuals fill in detailsthat are left deliberately vague in the the standards.[4]

SCSI standards (found at www.t10.org)only make one footnote reference to the term SMART.In its place the awkward term 'Informational Exceptions' is used. For SCSItapes the term 'TapeAlert' is used.

The smartctl command line utility getsSMART information from the nominated device. In somecases SMART information held by the nominated device can be modified by the smartctl command. The command has many options that can be viewed by the long usage message output beeither of these invocations: smartctl -h orsmartctl --help. Those options that are onlyavailable to ATA disks (i.e. not available to SCSI disks or tape drives)are marked with '(ATA)'. Unix style 'man' page documentation is alsoavailable.

The following options are currently available for SCSI disks and tapedrives unless otherwise noted:

  • -a --all: equivalent to thecombination -i -H -A -l error -l selftest optionsinvoked in that order.

  • -A --attributes: outputs thecurrent device temperature, trip temperature, the number of elementsin the grown defect list (GLIST) and data from the start-stop log page.Outputs some vendor specific information if available.

  • -C --captive: used in conjunctionwith -t short or -t long options todo short or long self tests in the foreground. [Has no effect on tapedrives.]

  • -d TYPE --device=TYPE where TYPE is 'ata', 'scsi', 'sat', 'marvell', '3ware,N', 'hpt,L/N[,M]'or 'cciss,N'. Overrides utility's guess about the class of the devicewhich is based on the form of the nominated device's name.

  • -h --help: outputs lengthy usagemessage and exits without any other action.

  • -H --health: outputs single devicehealth metric determined by the device manufacturer. This will be 'OK'or a failure message.

  • -i --info: outputs device identification information (derived from a SCSI INQUIRY command) andwhether the device supports SMART (and temperature warnings) and if those facilities are currently enabled. Thetype of transport (e.g. FC or SAS) is also reported, if available.Some users have reported disks that report the wrong transport.

  • -l TYPE --log=TYPE where TYPE iseither 'background', 'selftest' or 'error'. Decodes are outputs therequested log. Note that --all does not include--log=background .

  • -q TYPE --quietmode=TYPE where TYPE iseither 'silent' or 'errorsonly'. When the type is silent then nothing isoutput to the console but the exit status is set (so it is suitable forscripts). For 'errorsonly' only errors are output to the console. Theexit status is always set. [See the smartctl man page.]

  • -r TYPE --report=TYPE where TYPE iseither 'ioctl[,<n>]' or 'scsiioctl[,<n>]'. Turns on low leveldebugging of issued commands and responses. These commands are issuedthrough a system command called an 'ioctl' in Unix. The debug can be forall issued commands (i.e. 'ioctl') or only SCSI commands ('scsiioctl').Optionally the TYPE can have a comma and a number post pended to increasethe volume of debug. See this section formore details.

  • -s VALUE --smart=VALUE where VALUE iseither 'on' or 'off'. Enables or disables SMART monitoring (and temperature warnings).

  • -S VALUE --saveauto=VALUE where VALUEis either 'on' or 'off'. Controls whether the error log values arepreserved across device power cycles.

  • -t TEST --test=TEST where TESTis either 'offline', 'short' or 'long'. Despite its name 'offline' isa short foreground test that all SCSI devices should support. A 'short'self test is typically 2 minutes or less. A 'long' self test will beconsiderably longer than 2 minutes, depending on the size of the media.The estimated time that a 'long' self test will take is printed afterthe 'selftest' log (i.e. with '-l selftest' or '-a')

  • -V --version: outputs the smartctlversion number (including the cvs version of all its source files)and build information then exits without any other action.

  • -X --abort: will terminate abackground short or long self test. Usually the self test log notesthat a self test has been aborted. [Has no effect on tape drives.]

After the options smartctl expects a device name.This device name is not required for the '--help' or '--version' options.If no options are given and a valid device name is given then the copyrightnotice is output and the program exits. If the device name is invalidthen that is reported. Only one device name can be given.

Examples of various invocations of smartctl on aSCSI disk follow:

Rather than wait for thresholds to be tripped, an administrator canrequest a self test. Alternatively a self test can be scheduledperiodically (e.g. at 3 a.m. every night or perhaps weekly) withsmartd. All SCSI disks and tape drives shouldsupport a default self test since it is mandatory.This can be invoked with thesmartctl -t offline <device> command. Despitethe term 'offline' this is actually a foreground test of less than 2minutes. On completion the default self test reports any errors detectedin its response. The default self test makes no entry into the self testlog. Most SCSI devices perform a default self test when they are beingpowered up.

The other self tests that are optionally supported by the device are listedhere with the smartctl invocation in brackets:

  • background short [smartctl -t short <device>]

  • background extended [smartctl -t long <device>]

  • foreground short [smartctl -C -t short <device>]

  • foreground extended [smartctl -C -t long <device>]

Short self tests should take less than two minutes to complete. The extendedself tests have been known to take more than one hour for disks that are over 100 GBytes in size. Care should be taken with foreground tests on diskswith mounted file systems as the OS may not take kindly to an hour delayon a simple READ command.[5]

Background self tests can be aborted with the smartctl -X <device> command. The self test log will note that anabort was requested.

Self tests other than the default self test cause an entry to be placedin the self test results log page. The 20 most recent self tests areheld. The self test results can be viewed with thesmartctl -l selftest <device> command. All testsoutput the accumulated power on hours when the test was performed andthe success or otherwise (e.g. the self test was aborted by the user'srequest) of the test. Unsuccessful self tests output a self test segment number (vendor specific), the logical block address of the first failure(if appropriate) and a sense_key,asc,ascq triple (see appendix). Followingthe self test result table is the expected duration of an uninterrupted extended self test (when that figure is provided by the device).

Here is an example of a self test log:

The smartctl -l error <device> command displaysthe error counters maintained in the device's log pages. Here is anexample of an error log:

The displayed error logs (if available) are displayed on separate lines:

  • write error counters

  • read error counters

  • verify error counters (only displayed if non-zero)

  • non-medium error counter (only a single number displayed). This representsthe number of recoverable events other than write, read or verify errors.

  • error events are held in the 'Last n error events' log page. The numberof error event records held (i.e. 'n') is vendor specific (e.g. up to 23records are held for Hitachi 10K300 model disks). The contents of eacherror event record is in ASCII and vendor specific. The parameter codeassociated with each error event record indicates the relative time atwhich the error event occurred. A higher parameter code indicates that theerror event occurred later in time.If this log page is not supported by the device then 'Error Events loggingnot supported' is output. If this log page is supported and there areerror event records then each one is prefixed by 'Error event <n>:'where <n> is the parameter code.

Each of the write, read and verify error counter logs has variousparameters codes. They are itemized below with the smartctl columnname followed, in brackets, with SCSI standard's description andparameter code). A description taken from Seagate's SCSImanual (publication 77738479, Rev J) is then given.

  • Errors Corrected by ECC, fast [Errors corrected without substantial delay:00h]. An error correction was applied to get perfect data (a.k.a. ECCon-the-fly). 'Without substantial delay' means the correction did notpostpone reading of later sectors (e.g. a revolution was not lost). Thecounter is incremented once for each logical block that requires correction.Two different blocks corrected during the same command are counted as twoevents.

  • Errors Corrected by ECC: delayed [Errors corrected with possible delays: 01h].An error code or algorithm (e.g. ECC, checksum) is applied in order toget perfect data with substantial delay. 'With possible delay' means thecorrection took longer than a sector time so that reading/writing ofsubsequent sectors was delayed (e.g. a lost revolution). The counter isincremented once for each logical block that requires correction. Ablock with a double error that is correctable counts as one event andtwo different blocks corrected during the same command count as twoevents.

  • Error corrected by rereads/rewrites [Total (e.g. rewrites and rereads): 02h].This parameter code specifies the counter counting the number of errorsthat are corrected by applying retries. This counts errors recovered,not the number of retries. If five retries were required to recover oneblock of data, the counter increments by one, not five. The counter isincremented once for each logical block that is recovered using retries.If an error is not recoverable while applying retries and is recoveredby ECC, it isn't counted by this counter; it will be counted by thecounter specified by parameter code 01h - Errors Corrected With PossibleDelays.

  • Total errors corrected [Total errors corrected: 03h].This counter counts the total of parameter code errors 00h, 01h and02h (i.e. error corrected by ECC: fast and delayed plus errors correctedby rereads and rewrites). There is no 'double counting' of data errorsamong these three counters. The sum of all correctable errors can bereached by adding parameter code 01h and 02h errors, not by using thistotal. [The author does not understand the previous sentence from theSeagate manual.]

  • Correction algorithm invocations [Total times correction algorithmprocessed: 04h]. This parameter code specifies the counter that countsthe total number of retries, or 'times the retry algorithm is invoked'.If after five attempts a counter 02h type error is recovered, then fiveis added to this counter. If three retries are required to get stableECC syndrome before a counter 01h type error is corrected, then thosethree retries are also counted here. The number of retries applied tounsuccessfully recover an error (counter 06h type error) are alsocounted by this counter.

  • Gigabytes processed {10^9} [Total bytes processed: 05h]. This parametercode specifies the counter that counts the total number of bytes eithersuccessfully or unsuccessfully read, written or verified (dependingon the log page) from the drive. If a transfer terminates early becauseof an unrecoverable error, only the logical blocks up to and includingthe one with the uncorrected data are counted. [smartmontools dividesthis counter by 10^9 before displaying it with three digits to theright of the decimal point. This makes this 64 bit counter easier toread.]

  • Total uncorrected errors [Total uncorrected errors: 06h]. This parametercode specifies the counter that contains the total number of blocks forwhich an uncorrected data error has occurred.

The SCSI standard (SPC-3) cautions that the exactdefinitions of the error counters is not part of the standard (i.e. theyare vendor specific). As noted the above list contains Seagate'sexplanation for its disk products (the last revision of that documentwas 1999). Seagate's disk product manuals imply that the disk firmwarecollects these counter values and periodically commit them to persistentstorage (disk or non-volatile RAM).[6]They also imply that their firmware is monitoring these error countersand if they exceed some threshold (e.g. in a certain time interval)then the firmware will report a thresholds exceeded.

The error counter logs for some disks (e.g. some Seagate models) canlook worrying:

The 'fast' ECC corrected number is high. However the '-H' option reports the disk isin good health as does an extended (long) background self test. The uncorrected errorswould be a problem had in not been for the fact that the author caused them onpurpose (by writing a bad sector with the SCSI WRITE LONG command).

Recent SCSI disks can perform what are termed as 'background scans'. Theseare reads of the whole media with recoverable errors acted on andunrecoverable errors noted. If a sector (block) is found with a recoverableerror (i.e. the error correction codes (ECC) detect a problem but containenough redundant information to fix the problem) it may be fixed with are-write 'in place'. Alternatively the disk may decide to re-assign therecovered data to another physical sector which is assigned the same logicalblock address (and the original faulted sector is unmapped and placed onthe grown defect list (GLIST)). Since unrecoverable errors potentiallyinvolve user data being lost, no automatic recovery action is undertaken bythe disk. However logical block addresses that contain either recovereddata or unrecoverable errors are noted in the Background Scan Resultslog page. The smartctl --log=background command decodesand outputs that log page.

Background scans may be performed periodically (e.g. every 24 hours) orevery time the disk is powered up (or both). These parameters can becontrolled via the Background Control mode page. The sdparm utility can be used to access andmodify this mode page.

Here is an example of the output from the Background Scan Results log page.The first descriptor in that log page shows the status followed by upto 2048 entries for background scan 'events'. In this case a backgroundscan is still in progress and 3 scans have been completed in the past.The 'events' shown are all recoverable errors that the disk dealt withby rewriting the block.

In this case the reassign_status shows that no user intervention isrequired. The other 'don't worry (too much)' reassign_status is 'Logicalblock successfully reassigned'. Any other reassign_status will requireuser intervention to correct. There is a LOWIR ('log only when interventionrequired') bit in the Background Control mode page that the user canset (e.g. with the sdparm utility) to filterout 'noisy' entries like those shown above.

The user can manually re-assign logical blocks with a utility likesg_reassign found in thesg3_utils package. The background scanoutput contains a '[sk,asc,ascq]' tuple of numbers. The one shown abovetranslates to 'recovered error, recovered data with retries'. Unrecoverableerrors would most likely have 3 ('medium eror') or 4 ('hardware error')as the first number. A decoding of the latter two numbers can be foundin the 'Numeric Order Codes' annex of SPC-4 (see www.t10.org) in the Additional Sense Codes section.

smartd is a daemon for monitoring disks (both ATA andSCSI). It is recommended that tape drives and medium changers are monitoredin a more manual fashion with the smartctl commandas discussed in the section called “TapeAlert”.

The configuration file for smartdis called /etc/smartd.conf and has a man page (as doesthe smartd command). The controlling daemon scriptis placed in the normal place for a distribution, typically/etc/rc.d/init.d/smartd.

smartd polls the devices it has recognized when itwas started. By default it polls every 30 minutes. It reports any adversefinding and noteworthy occurrences (e.g. disk drive temperature changes)to a log file (/var/log/messages). smartd can be configured to take other actions, for example sendemail to a system administrator.

SCSI disks can be discovered by smartd via a scan of device nodes (for linux: /dev/sda through to /dev/sdz) by placing the word 'DEVICESCAN' in/etc/smartd.conf file. Alternatively the'DEVICESCAN' word can be removed (or commented out) and SCSI devicesnamed explicitly:

The '-d scsi' argument overrides what smartd wouldguess as the deviceclass (i.e. 'ata', 'scsi', 'sat', 'marvell', '3ware,N', 'hpt,L/N[,M]'or 'cciss,N').

TapeAlert (or 'tape alerts') is closely related to the SMART infrastructure provided for SCSI disks.TapeAlert is specialized for tape and medium changer devices. An example ofa TapeAlert is an indication that the tape drive heads need to be cleaned.

Pending TapeAlert errors can be read from the TapeAlert log page(using smartctl). This can be done even when SMARTmonitoring is disabled (e.g. after smartctl -s off <tape_device>). In fact, the best way to use the TapeAlert mechanism is to poll the flags (with smartctl) at relevant times whenusing the tape, for example:

  • when starting a new job using the tape drive

  • after an unrecoverable error

  • at the end of using each tape (and before it is unloaded)

The TapeAlert information is divided into three severity classes:Critical, Warning, and Information. The critical messages requireurgent user intervention. Both critical and warning errors may lead toloss of data. Some of the errors are related to the medium and othersto the tape drive itself. This is why the TapeAlert information should bechecked when the tape is in use and not polled periodically (i.e. the smartd daemon with its periodic polling is notparticularly useful for TapeAlert mechanism).

Different sets of flags are defined for tape drives and mediachangers. Most of the flags are optional and the set of flagssupported depends on the device. TapeAlert is being included into theSCSI-3 standards. Many SCSI-2 drives support TapeAlert but theimplementation may not fully conform to the SCSI-3 draft definitionused by smartmontools.

It is important that only one application(or OS driver) is monitoring tape alerts since reading the TapeAlert log page deactivates all flags after they are read. [7]Currently the Linux SCSI tape drivers (st and osst) do not check the TapeAlert log page. In Linux, a medium changer device (i.e. the robot ina tape jukebox) is accessed via its SCSI generic (sg) device name.

Code and information on the TapeAlert mechanism have been provided by Kai Mäkisara <Kai.Makisara at kolumbus dot fi>.

Here is some output from the smartctlcommand. Mostly it is for the '--all' option.

  • StorageTek LT20 tape 'jukebox': thetape reading mechanismand themedium changer (robot).Note the TapeAlert warnings in the medium changer output.

  • HP DDS-4 tapedrive.

  • Generic ATAPI CD-RWcd writer is an example of a device thatdoes not support SMART.

  • IBM DDRS 39130disk manufactured in 1998.

  • Fujitsu MAM3184MP 18 GigaByte disk when all is well. Here is the output fromthe smartctl -H command after the IEC Test bit has been set (with the smartctl -s on -r ioctl,3 command) on thesame Fujitsu disk .

  • Fujitsu MAP3735NP 73 GigaByte disk

  • Quantum ATLAS IV 36 WLS, 36 GigaByte disk Mac os install usb el capitan dmg less than second graders.

  • Seagate Cheetah ST336754 36 GigaBytedisk.

It is unlikely that a hardware RAID controller will directly support smartmontools. A SCSI RAID controller is a virtual target device that essentially remaps the SCSI commands it receives to the physical disks on its internal buses. The physical disks in a 'SCSI' RAID could be ATA or sATAdisks, in this case a SCSI bus is used between the host computer and anexternal RAID controller since LVD SCSI buses (SPI-2,3 and 4) can run up to 25 metres (plus other protocol related issues).

Some SCSI RAIDs equipped internally with SCSI disks allow access to the physical disks via logical unit numbers (LUNs) greater than 0. The SCSI RAIDcontroller itself takes a LUN equal to 0. In this case smartmontools couldbe applied to the LUNs greater than 0 that refer to physical disks.

Some SCSI RAIDs equipped internally ATA disks have a mechanism thatallows ATA commands to be tunnelled to the ATA disks. The 3ware 6000and 7000 series Escalade controllers are examples. In this case,special provision has been made in smartmontools (starting withrelease 5.1-16) to tunnel the ATA command required through to thephysical disks. This is done by using the -d 3ware,Noption/Directive. See the smartctl and smartd man pages for details.

The approach that smartmontools takes is to communicate directlywith physical storage devices (e.g. a disk). Another approach isto collectively monitor and manage a group of disks and/or tapedrives (be they a RAID, 'Just a Bunch Of Disks' JBODor a collection of disks and tape drives) in an enclosure. The SCSIEnclosure Services SES (reference: SES-2 atwww.t10.org) is designed for this task.Both SCSI device and recent SATA disk enclosures are using SES. Amongstother things SES can monitor the state of individual devices within theenclosure, the temperature, power supplies and fans. A user can setthresholds, define alarm types and remotely administer the enclosure.

A. Details

One of the first surprises working with SCSI devices and smartmontoolsis that the SCSI standards (found at www.t10.org)do not use the term SMART. In itsplace the awkward term 'Informational Exceptions' (IE) is used.

The original SCSI standard (over 20 years old now) and the SCSI-2 standardwere monolithic documents. In SCSI-3 and beyond the SCSI standards havebeen sub-divided and three categories of interest are the:

  • architectural model [SAM-4]

  • command sets [SPC-4, SBC-3, SSC-3, SMC-2, etc]

  • transports [SPI-4, SBP-2, FCP-3, SAS, etc]

The architectural model while interesting says nothing specific aboutInformational Exceptions or related topics. With respect to the transportsthe term SCSI has often been synonymous with oneof the SCSI Parallel Interface transports (e.g. SPI-4 which is often knowas 'Ultra320') however this is unhelpful. For the purpose of smartmontoolsthe SCSI command sets are more interesting. The main reference is theSCSI Primary Commands (SPC-4) document, specifically these sections:

  • self test operations; SEND DIAGNOSTIC command (which isthe mechanism for requesting self tests)

  • MODE SENSE and MODE SELECT commands (both 6 and 10 bytevariants); Mode parameters [the Informational Exceptions Control (IEC) modepage and the Control mode page]

  • LOG SENSE and LOG SELECT commands;Log parameters [these log pages: Informational exceptions,read/write/verify error counters, non medium error count, temperature, start-stop cycle counter and the self test results]

The SCSI Block Commands (SBC-3) document covers random access storagedevices such as disks (but excluding CD/DVD readers and writers which arecovered by MMC-4) while the SCSI Streaming Commands (SSC-3) document covers tape systems. The SBC-3 standard does not contain any additional information (compared with SPC-4) about Informational Exceptions. The SSC-3 standard covers TapeAlert (section 4.2.15), some extra facilities inthe IEC mode page (see the mode parameters section) and some additionallog pages. Medium changers, typically the 'robots' in jukebox tape systems,often support the TapeAlert mechanism and are described in the SMC-2 standard.

So what are Informational Exceptions in the SCSI context? They are aset of vendor specific parameters that the device firmware monitors and if a 'failure prediction threshold' is exceeded then an exception isreported. A user is also able to set thresholds on error counters andhave an exception reported if a condition is met. Additionally mostmodern disks monitor their temperature and will issue a warning ifa temperature threshold is exceeded.

The 'failure prediction threshold' exception reporting and the temperaturewarning are separately controlled (in byte 2 of the Informational ExceptionsControl (IEC) mode page).[8]In smartmontools thesmartctl -s on <device> command turns on IE.There are various reasons why this may not (fully) work (e.g. IEC modepage not available or not changeable) so this command queries the deviceagain after it has attempted the change and reports the state.The smartctl -s off <device> command turns offIE reporting.[9]

Informational Exceptions are reported via the standard SCSI statusreporting mechanism of an additional sense code (asc) and an additionalsense code qualifier (ascq) pair. A selection of these pairs and the associated message (there is full list in the SPC-3 document) is listed here:

The last entry in the above table results from setting the TEST bit andis for exercising the reporting mechanism rather than the indicationof an actual error.See this footnote for more information.

One difficulty with IE is that the device firmware may detect theseconditions independently of any command executing. Even if it detectsan informational exception during a command it needs to be carefulsending IE error notifications back with a command especially ifthat command succeeded (Linux will not handle this too well in the2.4 kernel series). There is asynchronous event notification (AEN) in SCSI but it is notreliably supported across all transports. So smartmontools relieson a poll from the smartd daemon (the defaultis every 30 minutes) to detect informational exceptions.

The additional sense code and its qualifier are part of what is termed asthe sense buffer which is the response to a REQUEST SENSE command. The sense key is also found in the sense buffer.Synchronous SCSI commands that fail return a single byte status code ofCHECK CONDITION. An OS kernel would see this error/warning status andthen check the sense buffer (by doing a REQUEST SENSE or by other means)and decide how to continue. From smartmontools's point of view, itssmartd daemon would like to process Informational Exceptions without interference from the OS. This is done by setting upthe IEC mode page's MRIE field set to 6. This instructs the SCSI device to hold a pending exception until an unsolicited REQUEST SENSE is sent. If an exception is pending then the sense key will be 'NO SENSE'and the asc, ascq pair will be set accordingly. In the case of no pendingexception the asc,ascq pair will both be zero. The pending exception is also visible in the IE log page, if that is supported. So smartd can check the device during its normal polling cycle.

Pending informational exceptions can also be checked by runningsmartctl -H <device>. A message of'SMART Health Status: OK' indicates that there is no pending IE.[10]

Debug information for smartctl is output when the -r ioctl or the -r scsiioctloption is used. More debug is output when the -r ioctl,<n> form is used (where 'n' is a number greater or equal to 1). Both -r ioctl and r scsiioctl,1 selectthe same amount of SCSI debug information. The debug levels currently defined are:

  • 1 - output SCSI commands sent to the device and the status received fromthe device

  • 2 - additionally, output the first 64 bytes of data sent to or received fromthe device

  • 3 - additionally, set the IEC mode page TEST bit if accompanying the '-s on'option

See this footnote for more information about theuse of the IEC mode page TEST bit.

One shortcoming of the Informational Exception data provided bySCSI devices (at least as defined in the current standard) is thatno LOG SENSE page tells the user how many hours the device has beenin use for. The device needs to track its 'age' for applying timestampsto self test results (seen in the 'Lifetime (hours)' column of thesmartctl -l selftest command) if they are supported.So one way to circumvent this shortcoming is to do dummy self tests. Hence do a smartctl -t short command and thenwait 2 minutes to see the result in the self test log in which the mostrecent self test row (i.e. the first) will have the current lifetime ofthe device.

Here are some links to related projects and packages:

  • the primary reference site for SCSI architecture, command sets and transportsis www.t10.org. The main documents of interestto smartmontools are the 'Primary Commands' (SPC-4), the 'BlockCommands' (SBC-3) for disks and the 'Streaming Commands' (SSC-3) fortape drives. This www.t10.org/scsi-3.htm page contains a diagramshowing the relationships of various SCSI standards.[11]

  • SCSI raid monitoring tools plus a firmware update utility and other low leveltools scsirastools.sourceforge.net .

  • The sdparm utility allows mode page settings to beviewed and changed. It can decode Vital Product Data (VPD) pages.It implements a small number of commands to start and stop media,and to eject and load removable media.See this page www.torque.net/sg/sdparm.html .sdparm is available on Linux with ports toFreeBSD, Tru64 and Windows.

  • A package of SCSI low level tools for Linux called sg3_utils can be foundon this page www.torque.net/sg/sg3_utils.html (the most recentversion is sg3_utils-1.22). Allows command level access to SCSI devicesand is available on Linux with ports to FreeBSD, Tru64 and Windows.

  • There is a HOWTO on the Linux SCSI subsystem in the 2.4 series here:www.tldp.org/HOWTO/SCSI-2.4-HOWTO.

CVS $Id: smartmontools_scsi.xml,v 1.14 2006/06/24 13:05:58 dpgilbert Exp $


[1] The 3ware RAID solution tunnels the ATA commands needed forsmartmontools (together with a disk number) through a vendor specific SCSI command.

[2] There are SATA devices called port multipliers that allow up to 15SATA drives to be connected to one host. SAS expanders seem to be abetter approach to the problem of connecting a large number of disksto one or more hosts.

[3] Even sending trial ATA and SCSI commands to see which one a deviceresponds to could be tricked. ATAPI cd/dvd drives respond toboth ATA commands (a few, for example IDENTIFY PACKET DEVICE) and SCSIcommands (found in MMC).

[4] For example: Seagate's 'Cheetah 15K.3 Product Manual, Rev F' contains sections on SMART,thermal monitor, and drive self test (section 5.2.7 to 5.2.9). It alsolists the supported mode pages with their default and changeable values.

[5] Linux has an additional problem with the foreground extended self tests:it will attempt to time out the command after 10 seconds. This will appearin the self test log page as an aborted self test. This problem is fixedin lk 2.4.22 and the lk 2.6 series (by extending thetimeout to 2 hours). To be on the safe side use the background extendedtest instead. Also some disks silently ignore foreground self tests (e.g. the Seagate Cheetah series).

[6] This is why some models spring to life after minutes of inactivity andperform some operation even though there are no external commandspending.

[7] In a multi initiator environment (e.g. several computers sharing the sametape jukebox) there should only be one application monitoring tape alertsper initiator.

[8] Henceforth the term Informational Exceptions(or IE) will include both Informational Exceptions and thetemperature (or 'enclosure degraded') warnings.

[9] IE have a (minor) performance impact on a disk. Mac os x snow leopard retail torrent. There are various othersettings in the IEC mode page (e.g. PERF, EBF and LOGERR) that addressthis. The standard gives a lot of latitude to the vendor in implementingthese additional flags. This finer level of control may be added to smartmontools if the need arises.

[10] One might worry whether the smartd daemon is properly setup or if the device really will issue IE when the need arises. The mechanismcan be tested by setting the TEST bit in the IEC mode page. That isdone by this command: smartctl -r ioctl,3 -s on <device> [ignore the extra debugging output that '-r ioctl,3' causes]. Aspecial asc/ascq pair is reserved for testing (0x5d,0xff)and the standard associates with it this awkward message: 'Failure prediction threshold exceeded (false)'. A call to smartctl -H <device> or waiting until the next smartd poll should produce that message if the mechanism is working. The IEC mode page TEST bit can be turned off (i.e. back to normalIE) with smartctl -s on <device>. The outputafter the TEST bit has been activated is shown in the Examples section for the Fujitsu MAM3184 disk.

[11] The documents found on the t10 site are actually draftstandards. Once they are ratified they become available from ANSI fora fee. The t10 site maintains the last draft prior to ratification andthe most recent draft of yet to be ratified standards.