Friday, 16 September 2011

Investigate a SCSI disk error from Linux


Investigate a SCSI disk error from Linux



Collect the following information when a platform reports SCSI disk errors:
The following files:
/var/log/messages*



The output from the following commands:
/sbin/fdisk -l
/bin/cat /proc/scsi/scsi
/bin/dmesg






Once you have collected the data, review the output for any similarities in the errors. If the error always reports the same target, then consider replacing the device itself.
If the errors change to different targets on the same bus, then further troubleshooting is required. In the following references, the id is the target number of the device.

Here are examples of the output you will see when the scsi disks have errors from Red Hat Linux.

Notice in this example, we can see that the channel is 0, id is 1 and lun is 0. Each line of the error refers to the same id. We can also see that the disk is getting errors on different sectors. In this case, id 1 is target 1. This would indicate that target 1 is getting the errors and is the suspect fru.

/var/log/messages

Dec 20 10:33:23 localhost kernel: SCSI disk error : host 0 channel 0 id 1 lun 0 return code = 27010000
Dec 20 10:33:23 localhost kernel: I/O error: dev 08:03, sector 0
Dec 20 10:33:23 localhost kernel: SCSI disk error : host 0 channel 0 id 1 lun 0 return code = 27010000
Dec 20 10:33:23 localhost kernel: I/O error: dev 08:03, sector 13631520
Dec 20 10:33:23 localhost kernel: EXT3-fs error (device sd(8,3)): ext3_get_inode_loc: unable to read inode block – inode=852064, block=1703940
Dec 20 10:33:23 localhost kernel: SCSI disk error : host 0 channel 0 id 1 lun 0 return code = 27010000
Dec 20 10:33:23 localhost kernel: I/O error: dev 08:03, sector 0
Dec 20 10:33:23 localhost kernel: EXT3-fs error (device sd(8,3)) in ext3_reserve_inode_write: IO failure
Dec 20 10:33:24 localhost kernel: SCSI disk error : host 0 channel 0 id 1 lun 0 return code = 27010000
Dec 20 10:33:24 localhost kernel: I/O error: dev 08:03, sector 0
Dec 20 10:33:24 localhost kernel: SCSI disk error : host 0 channel 0 id 1 lun 0 return code = 27010000
Dec 20 10:33:24 localhost kernel: I/O error: dev 08:03, sector 13631520
Dec 20 10:33:24 localhost kernel: EXT3-fs error (device sd(8,3)): ext3_get_inode_loc: unable to read inode block – inode=852064, block=1703940
Dec 20 10:33:24 localhost kernel: SCSI disk error : host 0 channel 0 id 1 lun 0 return code = 27010000
Dec 20 10:33:24 localhost kernel: I/O error: dev 08:03, sector 0
Dec 20 10:33:24 localhost kernel: EXT3-fs error (device sd(8,3)) in ext3_reserve_inode_write: IO failure
Dec 20 10:33:24 localhost kernel: SCSI disk error : host 0 channel 0 id 1 lun 0 return code = 27010000
Dec 20 10:33:24 localhost kernel: I/O error: dev 08:03, sector 0

Notice in the following example We see that the errors also report the same channel 0, id 1, and lun 0 each time. We can also note scsi bus timeouts along with scsi check condition messages for the same device. In this case, target 1 is the suspect fru.

Another example of /var/log/messages

Sep 21 23:35:41 localhost kernel: klogd 1.4.1, log source = /proc/kmsg started.
Sep 21 23:35:41 localhost kernel: Inspecting /boot/System.map-2.4.18-17.7.x.4smp
Sep 21 23:35:41 localhost kernel: Loaded 17857 symbols from /boot/System.map-2.4.18-17.7.x.4smp.
Sep 21 23:35:41 localhost kernel: Symbols match kernel version 2.4.18.
Sep 21 23:35:41 localhost kernel: Loaded 256 symbols from 11 modules.
Sep 21 23:35:41 localhost kernel: SCSI disk error : host 0 channel 0 id 1 lun 0
return code = 27010000
Sep 21 23:35:41 localhost kernel: I/O error: dev 08:17, sector 66453508
Sep 21 23:35:41 localhost kernel: SCSI disk error : host 0 channel 0 id 1 lun 0
return code = 27010000
:
:
Sep 21 23:35:49 localhost kernel: scsi :”’ aborting command due to timeout : pid
43891492, scsi0, channel 0, id 1, lun 0 Write (10) 00 00 4b ae 5b 00 00 02 00”’
Sep 21 23:35:49 localhost kernel: mptscsih: OldAbort scheduling ABORT SCSI IO
(sc=c2db7200)
Sep 21 23:35:49 localhost kernel: IOs outstanding = 5
Sep 21 23:35:49 localhost kernel: scsi : aborting command due to timeout : pid
43891493, scsi0, channel 0, id 1, lun 0 Write (10) 00 00 43 2e 5d 00 00 02 00
:
:
Sep 21 23:35:49 localhost kernel: mptscsih: ioc0: Issue of TaskMgmt Successful!
Sep 21 23:35:49 localhost kernel: SCSI host 0 abort (pid 43891492) timed out – resetting
Sep 21 23:35:49 localhost kernel: SCSI bus is being reset for host 0 channel 0.
Sep 21 23:35:50 localhost kernel: mptscsih: OldReset scheduling BUS_RESET (sc=c2db7200)
Sep 21 23:35:50 localhost kernel: IOs outstanding = 6
Sep 21 23:35:50 localhost kernel: SCSI host 0 abort (pid 43891493) timed out – resetting
:
:
Sep 21 23:35:51 localhost kernel: SCSI host 0 reset (pid 43891492) timed out again -
Sep 21 23:35:51 localhost kernel: probably an unrecoverable SCSI bus or device hang.
Sep 21 23:35:51 localhost kernel: SCSI host 0 reset (pid 43891493) timed out again -
Sep 21 23:35:51 localhost kernel: SCSI Error Report =-=-= (0:0:0)
Sep 21 23:35:51 localhost kernel: SCSI_Status=02h (CHECK CONDITION)
Sep 21 23:35:51 localhost kernel: Original_CDB[]: 28 00 02 B1 4E 62 00 00 04 00
Sep 21 23:35:51 localhost kernel: SenseData[12h]: 70 00 06 00 00 00 00 0A 00 00 00 00 29 02 02 00 00 00
Sep 21 23:35:51 localhost kernel: SenseKey=6h (UNIT ATTENTION); FRU=02h
Sep 21 23:35:51 localhost kernel: ASC/ASCQ=29h/02h “SCSI BUS RESET OCCURRED”
Sep 21 23:35:51 localhost kernel: SCSI Error Report =-=-= (0:1:0)
Sep 21 23:35:51 localhost kernel: SCSI_Status=02h (CHECK CONDITION)
Sep 21 23:35:51 localhost kernel: Original_CDB[]: 2A 00 00 45 EE 5F 00 00 02 00
Sep 21 23:35:51 localhost kernel: SenseData[12h]: 70 00 06 00 00 00 00 0A 00 00 00 00 29 02 02 00 00 00
Sep 21 23:35:51 localhost kernel: SenseKey=6h (UNIT ATTENTION); FRU=02h
Sep 21 23:35:51 localhost kernel: ASC/ASCQ=29h/02h “SCSI BUS RESET OCCURRED”
Sep 21 23:35:51 localhost kernel: md3: no spare disk to reconstruct array! — continuing in degraded mode
Sep 21 23:35:51 localhost kernel: md: updating md2 RAID superblock on device
Sep 21 23:35:52 localhost kernel: md: (skipping faulty sdb5 )
Sep 21 23:35:52 localhost kernel: md: sda5 [events: 00000012]<6>(write) sda5′s sb offset: 4192832
Sep 21 23:35:52 localhost kernel: raid1: sda7: redirecting sector 30736424 to another mirror

In the following example, we see write errors on channel 0, id 0, lun 0 along with a medium error indicating an issue with the device itself. In this example, target 0 is the suspect part.

/var/log/messages

scsi0: ERROR on channel 0, id 0, lun 0, CDB: Write (10) 00 06 1d 3a 0d 00 00 08 00
Info fld=0x61d3a0d, Deferred sd08:02: sense key Medium Error
Additional sense indicates Write error
I/O error: dev 08:02, sector 102498376
SCSI Error: (0:0:0) Status=02h (CHECK CONDITION)
Key=3h (MEDIUM ERROR); FRU=0Ch
ASC/ASCQ=0Ch/02h “”
CDB: 2A 00 07 F9 3A 2D 00 00 08 00

scsi0: ERROR on channel 0, id 0, lun 0, CDB: Write (10) 00 07 f9 3a 2d 00 00 08 00
Info fld=0x8153a0d, Deferred sd08:02: sense key Medium Error
Additional sense indicates Write error – auto reallocation failed I/O error: dev 08:02, sector 133693544

Here is an example of the output you would see from cat /proc/scsi/scsi. This will give you a model number of the drive which could help with determining a part number and disk ID. Output will deviate depending on platform model and configuration:

Attached devices:
cat /proc/scsi/scsi

Host: scsi0 Channel: 00 Id: 00 Lun: 00
Vendor: SEAGATE Model: ST373307LC Rev: 0007
Type: Direct-Access ANSI SCSI revision: 03
Host: scsi0 Channel: 00 Id: 01 Lun: 00
Vendor: SEAGATE Model: ST373307LC Rev: 0007
Type: Direct-Access

Here’s an example of what you might see with dmesg. In the example shown, we can see that there is an issue with channel 0, id 1 and lun 0. This example needs more data to determine the failure. However, one may suspect that the disk needs to be replaced as it shows an i/o error 2 times on the same disk.

Sector 3228343
scsi0 (1:0): rejecting I/O to offline device
RAID1 conf printout:
— wd:1 rd:2
disk 0, wo:0, o:1, dev:sda1
disk 1, wo:1, o:0, dev:sdb1
RAID1 conf printout:
— wd:1 rd:2
disk 0, wo:0, o:1, dev:sda1
scsi0 (1:0): rejecting I/O to offline device
md: write_disk_sb failed for device sdb2
md: errors occurred during superblock update, repeating scsi0 (1:0): rejecting I/O to offline device

No comments:

Post a Comment

Twitter Bird Gadget