Diagnosing Hard Drive Reads

https://www.lisenet.com/2014/measure-and-troubleshoot-linux-disk-io-resource-usage/

smartctl -a /dev/sda

 

SMART attributes to watch out for:

 

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       12029
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       5
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   050    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   074   074   000    Old_age   Always       -       10582
 10 Spin_Retry_Count        0x0033   100   100   030    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       5
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       4
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       6
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       22 (Min/Max 14/25)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
220 Disk_Shift              0x0002   100   100   000    Old_age   Always       -       0
222 Loaded_Hours            0x0032   074   074   000    Old_age   Always       -       10650
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
224 Load_Friction           0x0022   100   100   000    Old_age   Always       -       0
226 Load-in_Time            0x0026   100   100   000    Old_age   Always       -       202
240 Head_Flying_Hours       0x0001   100   100   001    Pre-fail  Offline      -       0

Raw_Read_Error_Rate is a vendor-specific value. Most of the drives we use now in SLC/London report an incrementing raw value, i.e. the higher this is, the worse. We used to use drives (I think some Seagates) which count *down* instead, so take this one with a grain of salt.

Reallocated_Sector_Ct is a metric to watch. If the drive hits a ‘bad’ sector while reading or writing, the data is instead re-mapped to a ‘spare’ area on the disk platter, and the original sector is marked ‘reallocated’. A high or growing number here is a good indication of imminent failure. Performance also gets worse as this grows, because reaching a remapped sector costs the heads an extra trip to the ‘spare’ area to get the data.

Seek_Error_Rate is a vendor-specific value, however on all the drives I have seen, the higher this is, the more likely the drive is to fail. It’s basically a counter of how many times the heads failed to land on the track they were seeking to.

Current_Pending_Sector is a count of sectors that the drive firmware suspects are bad (usually after a failed read) but cannot remap yet. The firmware waits for the system to write to the pending sector: if the write succeeds in place, the sector is cleared, otherwise the data is written to a spare area and the sector is reallocated. The count *should* be decremented after a successful re-write, but any non-zero value here is bad!

Offline_Uncorrectable is another indicator of physical issues on the disk platter. It’s a generic “tried to read/write, but couldn’t for some reason” counter: sectors the drive could not recover, typically found during its offline scans.

Loaded_Hours is a much more useful indicator than Power_On_Hours of the disk’s real production hours (the time the heads were loaded over the platters rather than parked).
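The checks above are easy to script. Here is a minimal Python sketch, assuming the attribute table has the same ten-column layout as the output above (other drives and smartctl versions may differ). The function names and the WATCHED set are my own, and the non-zero pending-sector row in SAMPLE is made up for illustration.

```python
import re

# Attributes singled out in the text; any non-zero raw value deserves a look.
# (Raw_Read_Error_Rate and Seek_Error_Rate are vendor-specific, so treat
# hits on those as a prompt to investigate, not as proof of failure.)
WATCHED = {
    "Raw_Read_Error_Rate",
    "Reallocated_Sector_Ct",
    "Seek_Error_Rate",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}

# ID, name, flag, value, worst, thresh, type, updated, when_failed, raw value.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(0x[0-9a-fA-F]+)\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} for every attribute row in text."""
    attrs = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            attrs[match.group(2)] = int(match.group(10))
    return attrs


def suspicious(attrs):
    """Return the watched attributes whose raw value is non-zero."""
    return {name: raw for name, raw in attrs.items()
            if name in WATCHED and raw > 0}


# Hypothetical sample: the pending-sector count of 8 is invented.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       0
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       22 (Min/Max 14/25)
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       8
"""

if __name__ == "__main__":
    print(suspicious(parse_smart_attributes(SAMPLE)))
```

In practice you would feed it the text printed by smartctl instead of SAMPLE.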

iostat

iostat is a very useful command. It gives you insight into the i/o requests being issued to your disk drive/array. Understanding how your system utilizes i/o is a good first step at optimization and resolving potential issues.

My typical iostat usage:

iostat -x 1 10 -Nz

-x will provide extended statistics (if available).
1 10 tells iostat to run a report every 1 second, for 10 iterations
-N resolves LVM names
-z tells iostat to omit any idle devices during the reporting period

Here’s an example from border (table used for clarity):

Linux 3.10.0-229.4.2.el7.x86_64 (border.100tb.com) 08/13/2015 _x86_64_ (8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.02    0.00    0.02    0.08    0.00   99.88

Device:        rrqm/s wrqm/s   r/s   w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda                 0   0.02  0.01  1.63  0.28 30.66    37.77     0.01  7.97    7.31    7.97  6.54  1.07
bordervg-root       0      0  0.01  1.55  0.28 30.61    39.77     0.01  8.42    7.39    8.42  6.85  1.06
bordervg-swap       0      0     0     0     0     0        8        0  6.07    6.07       0   1.5     0
bordervg-home       0      0     0  0.01     0  0.06    10.43        0 10.93    7.53   10.96  8.33  0.01

Important fields are:
r/s – # of read requests issued to device per second
w/s – # of write requests issued to device per second
rkB/s – # of kilobytes read from the device per second
wkB/s – # of kilobytes written to the device per second
avgrq-sz – The average size (in sectors) per request issued to the device
avgqu-sz – The average # of i/o requests queued on the device over the reporting period. You want this to be as small as possible, through tuning, using flash/ssd storage, etc.
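await – The average time (in milliseconds) i/o requests took to be served. This is start to end, including time in queue and actual service time
%util – The percentage of elapsed time (up to 100%) during which the device was busy servicing i/o requests. The closer to 100% this is, the closer the device is to saturation and the closer you are to being what’s called i/o-bound. For devices that serve requests in parallel (RAID arrays, SSDs), 100% does not necessarily mean the device has hit its limit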
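If you want to work with these fields in a script, the report is simple to parse. Below is a minimal Python sketch, assuming a single extended report whose header line starts with "Device". The function names and the 80% busy threshold are my own choices, not anything iostat defines. The samples are the sda rows from the two reports in this post.

```python
def parse_iostat(text):
    """Parse one `iostat -x` device report into {device: {column: float}}."""
    devices, columns = {}, None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("Device"):
            # Header row: remember the column names that follow "Device:".
            columns = fields[1:]
        elif columns and len(fields) == len(columns) + 1:
            # Data row: device name followed by one number per column.
            devices[fields[0]] = dict(zip(columns, map(float, fields[1:])))
    return devices


def busy(devices, threshold=80.0):
    """Return the names of devices whose %util is at or above threshold."""
    return sorted(name for name, stats in devices.items()
                  if stats["%util"] >= threshold)


IDLE = """\
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0 0.02 0.01 1.63 0.28 30.66 37.77 0.01 7.97 7.31 7.97 6.54 1.07
"""

BUSY = """\
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0 0 627 5 56952 40 90.18 5.49 8.8 1.58 100
vg-root 0 0 626 0 56888 0 90.88 5.25 8.48 1.6 100
vg-tmp 0 0 0 5 0 40 8 0.23 46.8 20.2 10.1
"""

if __name__ == "__main__":
    print(busy(parse_iostat(IDLE)))
    print(busy(parse_iostat(BUSY)))
```

Reading the column names from the header rather than hard-coding them matters because iostat versions differ (rkB/s in one report, rsec/s in the other).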

So, looking at the above, the system is hardly utilized (to be expected). We have a small amount of read/write going on and practically no requests queued.

Let’s look at a server which has more going on:

Device:  rrqm/s wrqm/s  r/s  w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda           0      0  627    5  56952     40    90.18     5.49   8.8  1.58   100
vg-root       0      0  626    0  56888      0    90.88     5.25  8.48   1.6   100
vg-tmp        0      0    0    5      0     40        8     0.23  46.8  20.2  10.1

The first thing that pops out: the %util is at 100! Taking a closer look, we can see the following:

– There were 626 read requests per second (r/s) issued to vg-root. This in and of itself isn’t alarming, but it is important: if data is fragmented, or sectors remapped, the disks must work harder to service each request
– There were 56,888 sectors (rsec/s) read in a one-second period. This is an extension of the above point: if data is fragmented or sectors remapped, there’s a huge amount of overhead involved
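To put those numbers in perspective, here is a small Python sketch that converts the sda row of the busy report into more familiar units and cross-checks two of the columns. It assumes iostat's 512-byte sectors. The queue estimate uses Little's law (requests in flight = arrival rate × time in the system), so it should only roughly match avgqu-sz. The function names are my own.

```python
SECTOR_BYTES = 512  # iostat counts sectors in 512-byte units


def read_throughput_mib(rsec_per_s):
    """Convert sectors read per second into MiB per second."""
    return rsec_per_s * SECTOR_BYTES / 2**20


def avg_request_sectors(r_s, w_s, rsec_s, wsec_s):
    """Average request size in sectors; should agree with avgrq-sz."""
    return (rsec_s + wsec_s) / (r_s + w_s)


def expected_queue_depth(r_s, w_s, await_ms):
    """Little's law estimate of avgqu-sz from request rate and await."""
    return (r_s + w_s) * await_ms / 1000.0


if __name__ == "__main__":
    # sda row: r/s=627, w/s=5, rsec/s=56952, wsec/s=40, await=8.8
    print(round(read_throughput_mib(56952), 2))             # MiB/s read
    print(round(avg_request_sectors(627, 5, 56952, 40), 2))  # vs avgrq-sz 90.18
    print(round(expected_queue_depth(627, 5, 8.8), 2))       # vs avgqu-sz 5.49
```

So the disk is pinned at 100% while delivering under 30 MiB/s in requests of about 45 KiB each, which suggests the workload is seek-bound rather than limited by raw bandwidth.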

So given just the information above, we can see that there’s a read-heavy workload on this machine, at least right now. This system uses RAID10, which is a balanced choice but not necessarily the fastest for reads. Depending on the controller and the number of disks, a different RAID level (5, for example) may provide better read performance. A different storage medium (e.g. SSDs or flash) would almost certainly help, though it’s not a viable option for this production system. Probably the last thing I could do would be to look at different kernel i/o schedulers, or at tuning the amount of RAM dedicated to i/o in this system.
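Before changing the i/o scheduler it helps to know which one is active. On Linux the kernel lists the available schedulers in /sys/block/<device>/queue/scheduler, with the active one in square brackets. A minimal Python sketch for reading it follows. The function names are my own, and the sample string in the comment is just a typical value for a kernel of this vintage.

```python
import re
from pathlib import Path


def active_scheduler(contents):
    """Return the bracketed scheduler from a sysfs line like 'noop deadline [cfq]'."""
    match = re.search(r"\[([^\]]+)\]", contents)
    return match.group(1) if match else None


def read_scheduler(device="sda"):
    """Read the active scheduler for a block device, or None if unavailable."""
    path = Path("/sys/block") / device / "queue" / "scheduler"
    if not path.exists():
        return None
    return active_scheduler(path.read_text())


if __name__ == "__main__":
    print(read_scheduler("sda"))
```

Writing one of the listed names back to the same file (as root) switches the scheduler for that device until the next reboot.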

iostat doesn’t give you the answer to “why is my I/O poor?”, but it does give you insight into what the system is doing and the averages for each i/o task it performs. This is the first step at figuring out what to do to fix it 🙂

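For a quick sequential write test, dd can write 1 GiB of zeros to a file named test. With conv=fdatasync it flushes the data to disk before reporting the throughput, so you are not just measuring the page cache:

dd bs=1M count=1024 if=/dev/zero of=test conv=fdatasync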
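To make it clearer what the dd command measures, here is a rough Python equivalent: write a number of zero-filled blocks, force the data to disk with fdatasync, and only then stop the clock. This is a sketch with names of my own choosing, not a replacement for dd, and the numbers will not match dd exactly.

```python
import os
import tempfile
import time


def write_benchmark(path, block_size=1024 * 1024, count=1024):
    """Write count blocks of zeros to path, sync, and return MB/s (decimal)."""
    block = b"\0" * block_size
    # fdatasync is not available on every platform; fall back to fsync.
    sync = getattr(os, "fdatasync", os.fsync)
    start = time.perf_counter()
    with open(path, "wb") as handle:
        for _ in range(count):
            handle.write(block)
        handle.flush()          # push Python's buffer to the kernel
        sync(handle.fileno())   # push the kernel's cache to the disk
    elapsed = time.perf_counter() - start
    return block_size * count / elapsed / 1e6


if __name__ == "__main__":
    # Small 8 MiB run in a temporary directory; use count=1024 for dd's 1 GiB.
    with tempfile.TemporaryDirectory() as tmp:
        rate = write_benchmark(os.path.join(tmp, "test"), count=8)
        print(f"{rate:.1f} MB/s")
```

Including the sync inside the timed section is the whole point: without it the result mostly reflects how fast the kernel can copy into RAM.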

Some resources you can use to understand various things:

http://linux.die.net/man/1/iostat
http://xenserver.org/blog/entry/avgqusz.html
http://www.monperrus.net/martin/scheduler+queue+size+and+resilience+to+heavy+IO
http://www.dba-oracle.com/t_linux_disk_i_o.htm
https://rhsummit.files.wordpress.com/2013/06/rao_t_0340_tuning_rhel_for_databases3.pdf
http://www.linuxjournal.com/article/3910
https://romanrm.net/dd-benchmark

