Wednesday, October 7, 2009

XenServer iSCSI SRs not connecting

The other day one of our XenServer VMs hung, so I attempted a forced reboot. Unfortunately the reboot failed (it never timed out). I tried to cancel the task in question with xe task-cancel uuid=xxx, which didn't work, and then followed some advice I had found and ran xe-toolstack-restart (DO NOT DO THIS). The toolstack restart failed miserably and I had to restart the host. Once it was back online the iSCSI fun began.

  • The restarted host was the master
  • It showed 'almost' all SRs as broken, including the local DVD drive
  • SRs that didn't show as broken still couldn't be booted from
  • Broken SRs couldn't be repaired successfully
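
For anyone who wants to check the same symptoms from the command line on the pool master, here is a minimal Python sketch. It is an illustration rather than something that was run during this outage. It assumes the standard xe commands pbd-list and pbd-plug are available, and the helper names (run_xe, parse_minimal, replug_unattached_pbds) are made up for this example. A replug can only succeed once the SAN is answering iSCSI logins again, so expect failures while the storage side is still struggling.

```python
import subprocess


def run_xe(*args):
    """Run an xe command on the host and return its trimmed standard output."""
    result = subprocess.run(
        ["xe", *args], check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def parse_minimal(output):
    """Split the comma-separated list printed by `--minimal` into items."""
    return [item for item in output.strip().split(",") if item]


def replug_unattached_pbds(xe=run_xe):
    """Try to plug every PBD that is not attached; return UUIDs that failed.

    `xe` is injectable so the logic can be exercised without a real host.
    """
    listing = xe(
        "pbd-list", "currently-attached=false", "params=uuid", "--minimal"
    )
    failed = []
    for uuid in parse_minimal(listing):
        try:
            xe("pbd-plug", f"uuid={uuid}")
        except subprocess.CalledProcessError:
            # Leave it unplugged and report it; the SAN may not be ready yet.
            failed.append(uuid)
    return failed
```

Calling replug_unattached_pbds() on the master returns the PBD UUIDs that still refuse to plug, which gives a quick way to see whether the storage side has recovered.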

This led me to start looking at the iSCSI SAN, which is an HP Lefthand Networks SAN/iQ v8.1. After opening the SAN/iQ management console I found that many of the Snapshot schedules I had set up were 'paused' due to backlog. In addition, all the Snapshots that I had deleted were still listed, but were reported as already deleted if I tried to delete them again.

Things to note:

  • The week prior one LH node had the RAID controller card fail and had to be replaced
  • The failed card had been replaced and the system powered back on so that it could restripe
  • All VMs run off the LH cluster that contained the failed LH node
  • Snapshots wouldn't delete from either of the 2 clusters in the LH setup (VM cluster or Storage cluster)
  • Gateway connections to the XenServer host showed in a 'failed' status

The LH rep very quickly pointed out that the Local Bandwidth Priority was set to .25 MB/sec. Yikes! That's not right. Changing this setting back to the recommended 4 MB/sec helped a 'little', but not very much. We then changed it to 10 MB/sec, which wasn't much better. Fortunately I did notice that within a few minutes the XenServer host had picked up its SRs again. YEAH!
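
To put those numbers in perspective, here is a rough back-of-the-envelope calculation in Python. The 100 GB backlog is a made-up figure for illustration, not a measurement from this incident, and the sketch assumes the setting behaves as a simple throughput cap with 1 GB = 1024 MB.

```python
def hours_to_transfer(gigabytes, mb_per_sec):
    """Hours needed to move `gigabytes` of data at a fixed rate in MB/sec."""
    seconds = gigabytes * 1024 / mb_per_sec  # assumes 1 GB = 1024 MB
    return seconds / 3600


BACKLOG_GB = 100  # hypothetical backlog size, for illustration only

for rate in (0.25, 4, 10):
    hours = hours_to_transfer(BACKLOG_GB, rate)
    print(f"{rate:>5} MB/sec: {hours:6.1f} hours")
```

Under those assumptions the backlog needs almost five days at .25 MB/sec, about seven hours at 4 MB/sec and under three hours at 10 MB/sec.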

As I was waiting for things to replicate so that speeds would pick up again on the network (I had set it back to 4 MB/sec by this point), it occurred to me that the node with the failed RAID controller would still be attempting to resync, on top of all the Snapshot data.

BINGO! I shut down the LH node that had failed and instantly everything picked up and ran at lightning speed again. XenServer kicked in and all admin tasks worked great again. Once everything was connected and all the Snapshots were taken care of, I turned the failed LH node back on and let it resync, which was fairly quick at this point and caused no more heart failures.

Lessons learned:

  • .25 MB/sec is way too slow for admin tasks on LH nodes (I already knew this, but now I know to check it)
  • Backlogged LH admin tasks can slow iSCSI connection initiation to a crawl (I was told this shouldn't affect it, but imo it clearly did)
  • Don't run xe-toolstack-restart unless you absolutely have to. I could have easily fixed the root of the issue (LH replication) without the outage had I not run this command
  • After a major failure such as the RAID controller, check on the node periodically to make sure the restripe / resync has finished or is at least progressing in a timely manner. Had I done this I would have found the Snapshot issue and resync backlog days in advance
