How to replace a failed disk in a Solaris x86 ZFS pool

I followed this procedure on a SUN FIRE X4270 M2 server.

Replacing a disk sounds simple, and that was my view too until yesterday. I suspect the article below will change yours as well. Here is a quick glimpse of the issues I ran into while replacing the disk:

Issue 1 ) The new disk was not detected at the OS level.
Issue 2 ) Both hot spares went into use for a single disk failure.
Issue 3 ) The old entry c0t12d0/old did not get removed from the pool.
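
For quick reference, this is the overall command flow I ended up following; every command is covered in detail in the steps below, and the pool name, controller, and enclosure:slot obviously need to match your environment:

zpool status -xv
zpool offline mypool c0t12d0
cfgadm -c unconfigure c0::dsk/c0t12d0
# physically replace the disk in slot 12
/opt/MegaRAID/CLI/MegaCli -PDList -aALL
/opt/MegaRAID/CLI/MegaCli -CfgLDAdd -R0 -[32:12] -a0
devfsadm
zpool online mypool c0t12d0
zpool replace mypool c0t12d0
zpool detach mypool c0t22d0      # after resilvering completes
zpool detach mypool c0t21d0
zpool detach mypool <vdev-guid-of-c0t12d0/old>   # GUID obtained in Step 10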

Step 1 : Output I captured before replacing the failed disk. Here only one spare is in use, which is expected.

[root@vikrant /]# zpool status -xv
pool: mypool
state: DEGRADED
status: One or more devices has been removed by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Online the device using 'zpool online' or replace the device with
'zpool replace'.
scan: resilvered 141G in 0h31m with 0 errors on Thu Jul 3 00:06:25 2014
config:

    NAME           STATE     READ WRITE CKSUM
    mypool         DEGRADED     0     0     0
      raidz1-0     DEGRADED     0     0     0
        c0t11d0    ONLINE       0     0     0
        spare-1    DEGRADED     0     0     0
          c0t12d0  REMOVED      0     0     0
          c0t21d0  ONLINE       0     0     0
        c0t13d0    ONLINE       0     0     0
        c0t14d0    ONLINE       0     0     0
        c0t15d0    ONLINE       0     0     0
      raidz1-1     ONLINE       0     0     0
        c0t16d0    ONLINE       0     0     0
        c0t17d0    ONLINE       0     0     0
        c0t18d0    ONLINE       0     0     0
        c0t19d0    ONLINE       0     0     0
        c0t20d0    ONLINE       0     0     0
    logs
      mirror-2     ONLINE       0     0     0
        c0t7d0     ONLINE       0     0     0
        c0t8d0     ONLINE       0     0     0
      mirror-3     ONLINE       0     0     0
        c0t9d0     ONLINE       0     0     0
        c0t10d0    ONLINE       0     0     0
    cache
      c0t1d0       ONLINE       0     0     0
      c0t2d0       ONLINE       0     0     0
      c0t3d0       ONLINE       0     0     0
      c0t4d0       ONLINE       0     0     0
      c0t5d0       ONLINE       0     0     0
      c0t6d0       ONLINE       0     0     0
    spares
      c0t21d0      INUSE     currently in use
      c0t22d0      AVAIL

errors: No known data errors
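
Before going any further, it can also help to cross-check the suspect disk from the fault-management and driver side. The commands below are not from my captured session, just a sketch of the checks I would normally run at this point:

# Any faults logged by FMA for this system (look for the disk or its target)
fmadm faulty

# Soft/hard/transport error counters for the suspect disk as seen by the sd driver
iostat -En c0t12d0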

Step 2 : Issuing the offline command on the failed disk.

[root@vikrant /opt/MegaRAID/CLI]# zpool offline mypool c0t12d0

[root@vikrant /opt/MegaRAID/CLI]# zpool status -xv
pool: mypool
state: DEGRADED
status: One or more devices has been taken offline by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Online the device using 'zpool online' or replace the device with
'zpool replace'.
scan: resilvered 141G in 0h31m with 0 errors on Thu Jul 3 00:06:25 2014
config:

    NAME           STATE     READ WRITE CKSUM
    mypool         DEGRADED     0     0     0
      raidz1-0     DEGRADED     0     0     0
        c0t11d0    ONLINE       0     0     0
        spare-1    DEGRADED     0     0     0
          c0t12d0  OFFLINE      0     0     0
          c0t21d0  ONLINE       0     0     0
        c0t13d0    ONLINE       0     0     0
        c0t14d0    ONLINE       0     0     0
        c0t15d0    ONLINE       0     0     0
      raidz1-1     ONLINE       0     0     0
        c0t16d0    ONLINE       0     0     0
        c0t17d0    ONLINE       0     0     0
        c0t18d0    ONLINE       0     0     0
        c0t19d0    ONLINE       0     0     0
        c0t20d0    ONLINE       0     0     0
    logs
      mirror-2     ONLINE       0     0     0
        c0t7d0     ONLINE       0     0     0
        c0t8d0     ONLINE       0     0     0
      mirror-3     ONLINE       0     0     0
        c0t9d0     ONLINE       0     0     0
        c0t10d0    ONLINE       0     0     0
    cache
      c0t1d0       ONLINE       0     0     0
      c0t2d0       ONLINE       0     0     0
      c0t3d0       ONLINE       0     0     0
      c0t4d0       ONLINE       0     0     0
      c0t5d0       ONLINE       0     0     0
      c0t6d0       ONLINE       0     0     0
    spares
      c0t21d0      INUSE     currently in use
      c0t22d0      AVAIL

Step 3 : Locate the disk in cfgadm and unconfigure it.

[root@vikrant /opt/MegaRAID/CLI]# cfgadm -al | grep c0t12d0
c0::dsk/c0t12d0 disk connected configured unknown

[root@vikrant /opt/MegaRAID/CLI]# cfgadm -c unconfigure c0::dsk/c0t12d0

[root@vikrant /opt/MegaRAID/CLI]# cfgadm -al | grep c0t12d0
c0::dsk/c0t12d0 disk connected unconfigured unknown
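
When the time comes to physically swap the disk, it is worth making absolutely sure you have the right slot. On a MegaRAID controller the slot LED can be blinked from MegaCli; this was not part of my session, just a sketch for enclosure 32, slot 12:

cd /opt/MegaRAID/CLI/

# Blink the locate LED on the drive in enclosure 32, slot 12
./MegaCli -PdLocate -start -physdrv[32:12] -a0

# ...swap the disk, then switch the LED off again
./MegaCli -PdLocate -stop -physdrv[32:12] -a0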

Step 4 : With the new disk physically in slot 12, I tried to rescan it on the server, but it was not detected.

[root@vikrant /opt/MegaRAID/CLI]# devfsadm -v
[root@vikrant /opt/MegaRAID/CLI]# cfgadm -al | grep c0t12d0
c0::dsk/c0t12d0 disk connected unconfigured unknown
[root@vikrant /opt/MegaRAID/CLI]# cfgadm -c configure c0::dsk/c0t12d0
cfgadm: Hardware specific failure: failed to configure SCSI device: I/O error
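
For completeness, these are the generic rescan steps that are usually worth a try when a new disk does not show up. They are a sketch rather than captured output, and in my case none of them helped, because (as Step 6 shows) the controller did not yet have a virtual drive defined for slot 12:

# Clean up stale /dev links and rebuild them verbosely
devfsadm -Cv

# Try configuring the whole controller instead of the single attachment point
cfgadm -c configure c0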

Step 5 : The server has a MegaRAID controller.

I used the command below to check the status of the physical disks. Partial output is shown, only for our faulted disk in slot 12.

cd /opt/MegaRAID/CLI/
./MegaCli -PDList -aALL

Enclosure Device ID: 32
Slot Number: 12
Enclosure position: 0
Device Id: 34
Sequence Number: 5
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 279.396 GB [0x22ecb25c Sectors]
Non Coerced Size: 278.896 GB [0x22dcb25c Sectors]
Coerced Size: 278.464 GB [0x22cee000 Sectors]
Firmware state: Unconfigured(good), Spun Up
SAS Address(0): 0x5000cca025371d7d
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: HITACHI H106030SDSUN300GA2B01209NZ9N8B
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive: Not Certified
Drive Temperature :27C (80.60 F)
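
Paging through -PDList on a full enclosure gets tedious. As far as I know, MegaCli can also report a single physical drive directly, which is what I would use to re-check just the drive in enclosure 32, slot 12:

cd /opt/MegaRAID/CLI/
./MegaCli -PDInfo -PhysDrv [32:12] -aALL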

Step 6 : I issued the command below to check the virtual drive status.

[root@vikrant /opt/MegaRAID/CLI]# ./MegaCli -LDInfo -Lall -aALL

I was not able to see a virtual drive for slot 12.

I created that virtual drive using the command below. Here 32 is the enclosure device ID and 12 is the slot of the faulted disk, which together make the 32:12 in the command.

[root@vikrant /opt/MegaRAID/CLI]# ./MegaCli -CfgLDAdd -R0 -[32:12] -a0

Adapter 0: Created VD 12

Adapter 0: Configured the Adapter!!

Exit Code: 0x00

The virtual drive for slot 12 was now visible in the -LDInfo output:

Virtual Drive: 12 (Target Id: 12)
Name :
RAID Level : Primary-0, Secondary-0, RAID Level Qualifier-0
Size : 278.464 GB
State : Optimal
Strip Size : 64 KB
Number Of Drives : 1
Span Depth : 1
Default Cache Policy: WriteBack, ReadAheadNone, Cached, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Cached, No Write Cache if Bad BBU
Access Policy : Read/Write
Disk Cache Policy : Disk's Default
Encryption Type : None
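
In my case the new drive already showed Firmware state: Unconfigured(good) and Foreign State: None, so creating the virtual drive was enough. If a replacement drive comes out of another array it may instead carry a foreign configuration or show Unconfigured(bad); the usual MegaCli cleanup for that, sketched here and not taken from my session, would be:

cd /opt/MegaRAID/CLI/

# Show and clear any foreign configuration the drive carried over
./MegaCli -CfgForeign -Scan -a0
./MegaCli -CfgForeign -Clear -a0

# If the drive is flagged Unconfigured(bad), mark it good again (enclosure 32, slot 12)
./MegaCli -PDMakeGood -PhysDrv [32:12] -a0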

Step 7 : I ran devfsadm to scan for the disk at the OS level.

After that, the disk showed up in the format output.

[root@vikrant /opt/MegaRAID/CLI]# devfsadm
[root@vikrant /opt/MegaRAID/CLI]# echo | format
Searching for disks...done

c0t12d0: configured with capacity of 275.53GB

AVAILABLE DISK SELECTIONS:
0. c0t0d0
/pci@0,0/pci8086,340a@3/pci1000,9263@0/sd@0,0
1. c0t1d0
/pci@0,0/pci8086,340a@3/pci1000,9263@0/sd@1,0
2. c0t2d0
/pci@0,0/pci8086,340a@3/pci1000,9263@0/sd@2,0
3. c0t3d0
/pci@0,0/pci8086,340a@3/pci1000,9263@0/sd@3,0
4. c0t4d0
/pci@0,0/pci8086,340a@3/pci1000,9263@0/sd@4,0
5. c0t5d0
/pci@0,0/pci8086,340a@3/pci1000,9263@0/sd@5,0
6. c0t6d0
/pci@0,0/pci8086,340a@3/pci1000,9263@0/sd@6,0
7. c0t7d0
/pci@0,0/pci8086,340a@3/pci1000,9263@0/sd@7,0
8. c0t8d0
/pci@0,0/pci8086,340a@3/pci1000,9263@0/sd@8,0
9. c0t9d0
/pci@0,0/pci8086,340a@3/pci1000,9263@0/sd@9,0
10. c0t10d0
/pci@0,0/pci8086,340a@3/pci1000,9263@0/sd@a,0
11. c0t11d0
/pci@0,0/pci8086,340a@3/pci1000,9263@0/sd@b,0
12. c0t12d0
/pci@0,0/pci8086,340a@3/pci1000,9263@0/sd@c,0
13. c0t13d0
/pci@0,0/pci8086,340a@3/pci1000,9263@0/sd@d,0
14. c0t14d0
/pci@0,0/pci8086,340a@3/pci1000,9263@0/sd@e,0
15. c0t15d0
/pci@0,0/pci8086,340a@3/pci1000,9263@0/sd@f,0
16. c0t16d0
/pci@0,0/pci8086,340a@3/pci1000,9263@0/sd@10,0
17. c0t17d0
/pci@0,0/pci8086,340a@3/pci1000,9263@0/sd@11,0
18. c0t18d0
/pci@0,0/pci8086,340a@3/pci1000,9263@0/sd@12,0
19. c0t19d0
/pci@0,0/pci8086,340a@3/pci1000,9263@0/sd@13,0
20. c0t20d0
/pci@0,0/pci8086,340a@3/pci1000,9263@0/sd@14,0
21. c0t21d0
/pci@0,0/pci8086,340a@3/pci1000,9263@0/sd@15,0
22. c0t22d0
/pci@0,0/pci8086,340a@3/pci1000,9263@0/sd@16,0
Specify disk (enter its number): Specify disk (enter its number):

[root@vikrant /opt/MegaRAID/CLI]# cfgadm -al | grep c0t12d0
c0::dsk/c0t12d0 disk connected configured unknown
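
Before handing the disk to ZFS, I also like to confirm it carries a label the OS can read. This is a general check, not captured from this run; on an SMI-labelled Solaris x86 disk, slice 2 conventionally covers the whole disk:

prtvtoc /dev/rdsk/c0t12d0s2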

Step 8 : I issued the online command on the failed disk in the pool, followed by zpool replace.

[root@vikrant /opt/MegaRAID/CLI]# zpool online mypool c0t12d0
warning: device 'c0t12d0' onlined, but remains in faulted state
use 'zpool replace' to replace devices that are no longer present
[root@vikrant /opt/MegaRAID/CLI]# zpool replace mypool c0t12d0
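
The replace kicks off a resilver, which took several hours here (307G). A simple way to keep an eye on it, sketched from memory rather than copied from this session, is to poll the pool status periodically; Oracle Doc 1470448.1 in the references below covers speeding resilvering up:

# Print the pool status every 5 minutes while the resilver runs
while :; do
    date
    zpool status mypool
    sleep 300
done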

Step 9 : As soon as I did the above step, the pool status became the following.

pool: mypool
state: DEGRADED
scan: resilvered 307G in 6h29m with 0 errors on Fri Jul 4 05:31:23 2014
config:

    NAME                   STATE     READ WRITE CKSUM
    mypool                 DEGRADED     0     0     0
      raidz1-0             DEGRADED     0     0     0
        c0t11d0            ONLINE       0     0     0
        spare-1            DEGRADED     0     0     0
          replacing-0      DEGRADED     0     0     0
            spare-0        DEGRADED     0     0     0
              c0t12d0/old  FAULTED      0     0     0  corrupted data
              c0t22d0      ONLINE       0     0     0
            c0t12d0        ONLINE       0     0     0
          c0t21d0          ONLINE       0     0     0
        c0t13d0            ONLINE       0     0     0
        c0t14d0            ONLINE       0     0     0
        c0t15d0            ONLINE       0     0     0
      raidz1-1             ONLINE       0     0     0
        c0t16d0            ONLINE       0     0     0
        c0t17d0            ONLINE       0     0     0
        c0t18d0            ONLINE       0     0     0
        c0t19d0            ONLINE       0     0     0
        c0t20d0            ONLINE       0     0     0
    logs
      mirror-2             ONLINE       0     0     0
        c0t7d0             ONLINE       0     0     0
        c0t8d0             ONLINE       0     0     0
      mirror-3             ONLINE       0     0     0
        c0t9d0             ONLINE       0     0     0
        c0t10d0            ONLINE       0     0     0
    cache
      c0t1d0               ONLINE       0     0     0
      c0t2d0               ONLINE       0     0     0
      c0t3d0               ONLINE       0     0     0
      c0t4d0               ONLINE       0     0     0
      c0t5d0               ONLINE       0     0     0
      c0t6d0               ONLINE       0     0     0
    spares
      c0t21d0              INUSE     currently in use
      c0t22d0              INUSE     currently in use

Strangely, both spare disks came into use for a single disk failure. Oracle Doc 1522855.1 (listed in the references below) describes this behavior: FAULTED or UNAVAIL vdevs within a ZFS pool may get spared more than once.

Step 10 : After the resilvering completed, I detached the spare disks from the pool.

zpool detach mypool c0t22d0

Before detaching the other spare, verify the label of the replaced disk with the command below.

[root@vikrant /var/crash/x4270-01]# zdb -lluuu /dev/rdsk/c0t12d0s0
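
What I look for in that output is that all four labels are readable and that the pool and vdev GUIDs are consistent with the pool. As a rough shortcut (not from my captured session), the GUID lines can be pulled out like this:

zdb -lluuu /dev/rdsk/c0t12d0s0 | grep -i guid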

zpool detach mypool c0t21d0

Even after detaching the spares, the old c0t12d0/old entry was still present in the pool, as shown below. I used the following method to remove it: look up the vdev GUID of the stale entry with mdb, then detach it from the pool by that GUID.

pool: mypool
state: DEGRADED
scan: resilvered 307G in 6h29m with 0 errors on Fri Jul 4 05:31:23 2014
config:

    NAME              STATE     READ WRITE CKSUM
    mypool            DEGRADED     0     0     0
      raidz1-0        DEGRADED     0     0     0
        c0t11d0       ONLINE       0     0     0
        replacing-1   DEGRADED     0     0     0
          c0t12d0/old FAULTED      0     0     0  corrupted data
          c0t12d0     ONLINE       0     0     0
        c0t13d0       ONLINE       0     0     0
        c0t14d0       ONLINE       0     0     0
        c0t15d0       ONLINE       0     0     0
      raidz1-1        ONLINE       0     0     0
        c0t16d0       ONLINE       0     0     0
        c0t17d0       ONLINE       0     0     0
        c0t18d0       ONLINE       0     0     0
        c0t19d0       ONLINE       0     0     0
        c0t20d0       ONLINE       0     0     0
    logs
      mirror-2        ONLINE       0     0     0
        c0t7d0        ONLINE       0     0     0
        c0t8d0        ONLINE       0     0     0
      mirror-3        ONLINE       0     0     0
        c0t9d0        ONLINE       0     0     0
        c0t10d0       ONLINE       0     0     0
    cache
      c0t1d0          ONLINE       0     0     0
      c0t2d0          ONLINE       0     0     0
      c0t3d0          ONLINE       0     0     0
      c0t4d0          ONLINE       0     0     0
      c0t5d0          ONLINE       0     0     0
      c0t6d0          ONLINE       0     0     0
    spares
      c0t21d0         AVAIL
      c0t22d0         AVAIL

[root@vikrant /var/crash/x4270-01]# mdb -k
Loading modules: [ unix krtld genunix specfs dtrace uppc pcplusmp cpu.generic zfs mr_sas sockfs ip hook neti dls sctp arp usba uhci fctl nca lofs sata md cpc fcip random crypto logindmux ptm ufs sppp nfs ipc ]
> ::spa -v ! more
ADDR                 STATE     NAME
ffffffffa68e9040     ACTIVE    USB_drive

    ADDR             STATE     AUX           DESCRIPTION
    ffffffffaba41300 HEALTHY   -             root
    ffffffffaba40cc0 HEALTHY   -               /dev/dsk/c1t0d0p0
ffffffffa6691a80     ACTIVE    mypool

    ffffffffac607cc0 DEGRADED  -             root
    ffffffffac607040 DEGRADED  -               raidz
    ffffffffac607680 HEALTHY   -                 /dev/dsk/c0t11d0s0
    ffffffffac60a2c0 DEGRADED  -                 spare
    ffffffffac608300 DEGRADED  -                   replacing
    ffffffffac609000 CANT_OPEN CORRUPT_DATA          /dev/dsk/c0t12d0s0/old
    ffffffffac609c80 HEALTHY   -                     /dev/dsk/c0t12d0s0
    ffffffffaba40040 HEALTHY   -                   /dev/dsk/c0t21d0s0
    ffffffffac60a900 HEALTHY   -                 /dev/dsk/c0t13d0s0
    ffffffffac60b0c0 HEALTHY   -                 /dev/dsk/c0t14d0s0
    ffffffffac60b700 HEALTHY   -                 /dev/dsk/c0t15d0s0
    ffffffffac60e340 HEALTHY   -               raidz
    ffffffffac60bd40 HEALTHY   -                 /dev/dsk/c0t16d0s0
    ffffffffac60dd00 HEALTHY   -                 /dev/dsk/c0t17d0s0
    ffffffffac60c380 HEALTHY   -                 /dev/dsk/c0t18d0s0
    ffffffffac60d6c0 HEALTHY   -                 /dev/dsk/c0t19d0s0
    ffffffffac60c9c0 HEALTHY   -                 /dev/dsk/c0t20d0s0
    ffffffffac60d080 HEALTHY   -               mirror
    ffffffffac60e980 HEALTHY   -                 /dev/dsk/c0t7d0s0
    ffffffffac6056c0 HEALTHY   -                 /dev/dsk/c0t8d0s0
    ffffffffac605080 HEALTHY   -               mirror
    ffffffffad9219c0 HEALTHY   -                 /dev/dsk/c0t9d0s0
    ffffffffad921380 HEALTHY   -                 /dev/dsk/c0t10d0s0
    -                -         -               cache
    ffffffffad920d40 HEALTHY   -                 /dev/dsk/c0t1d0s0
    ffffffffad920700 HEALTHY   -                 /dev/dsk/c0t2d0s0
    ffffffffad9200c0 HEALTHY   -                 /dev/dsk/c0t3d0s0
    ffffffffae251900 HEALTHY   -                 /dev/dsk/c0t4d0s0
    ffffffffae2512c0 HEALTHY   -                 /dev/dsk/c0t5d0s0
    ffffffffae250c80 HEALTHY   -                 /dev/dsk/c0t6d0s0
    -                -         -               spares
    ffffffffa768a0c0 HEALTHY   -                 /dev/dsk/c0t21d0s0
    ffffffffa768a700 HEALTHY   -                 /dev/dsk/c0t22d0s0
ffffffffa66a1040     ACTIVE    rpool
    ffffffff94d1d980 HEALTHY   -             root
    ffffffff80188040 HEALTHY   -               /dev/dsk/c0t0d0s0

> ffffffffac609000::print vdev_t vdev_guid | =E
3037552813781476140

The last command detaches the old disk entry, which did not disappear automatically, using its vdev GUID.

[root@vikrant /var/crash/x4270-01]# zpool detach mypool 3037552813781476140
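
As an aside, the vdev GUID of the stale entry can usually also be read from user space, without mdb, by dumping the cached pool configuration; as far as I remember, zdb prints a guid field for every child vdev, including the /old entry:

zdb -C mypool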

After that, everything came back as expected. 🙂 Below are the reference docs for this activity.

Reference Oracle Docs:
1470448.1 (Increasing resilvering speed)
1496593.1 (Understanding resilvering operation)
1522855.1 (FAULTED or UNAVAIL VDEVs within a ZFS Pool may get SPARED more than once)
