Can a bad LUN cause all nodes of a cluster to blue screen?


I need help determining the cause of a failure of cluster nodes.

This is a 2-node cluster running approximately ten clustered SQL Server instances. The clustered LUNs are served by a NetApp device via Fibre Channel. The cluster had been running fine for a year when the entire cluster crashed. I am unaware of anything changing on either node.

It started out with a SQL instance failing back and forth a few times, reason unknown. Then one of the nodes blue screened, and a few minutes later the other blue screened.

Attempts to restart either node kept resulting in blue screens. The dump indicated that the blue screen was occurring in clusres.dll on both nodes.

We repaired the cluster by offlining all of the LUNs, removing them from cluster storage, onlining the LUNs, and adding them back to the cluster one by one. We found that several had become corrupt and could not be mounted. We had to reformat the drives, rebuild the SQL instances, and restore the databases.
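The one-at-a-time re-add described above is essentially an isolation loop. A minimal Python sketch of the idea, purely illustrative: `isolate_bad_disks` and the `try_add` callback are hypothetical names and not part of any cluster API, and the disk names are made up. In practice `try_add` would stand in for whatever manual or scripted step adds a disk to cluster storage and checks that it mounts.

```python
from typing import Callable, Iterable, List, Tuple


def isolate_bad_disks(
    disks: Iterable[str],
    try_add: Callable[[str], bool],
) -> Tuple[List[str], List[str]]:
    """Add disks back one at a time and sort them into (healthy, failed).

    try_add is a hypothetical callback: it returns True if the disk was
    added and mounted, and returns False or raises if it was not.
    """
    healthy: List[str] = []
    failed: List[str] = []
    for disk in disks:
        try:
            ok = try_add(disk)
        except Exception:
            # Treat any error while adding the disk as a failure of that disk only.
            ok = False
        (healthy if ok else failed).append(disk)
    return healthy, failed


if __name__ == "__main__":
    # Made-up example: three of the disks are assumed to be corrupt.
    corrupt = {"Quorum", "SQL03", "SQL07"}
    healthy, failed = isolate_bad_disks(
        ["Quorum", "SQL01", "SQL03", "SQL07"],
        lambda disk: disk not in corrupt,
    )
    print("healthy:", healthy)
    print("failed:", failed)
```

Adding the disks individually means a single bad volume shows up as one failed step instead of taking the whole storage set down with it.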

Whether the corrupt volumes were the cause of the failure or a result of it is not known. Unfortunately, I can't look at the SQL Server logs for the servers the disks belonged to, since the logs were on the disks themselves.

Eight days later, the exact same thing happened again!

Can a corrupt LUN that is part of a SQL Server resource group cause this? If so, it seems to me to be a huge vulnerability in clustering if one failed SQL instance can bring the entire cluster down and cause corruption in other cluster storage volumes.

In the first incident, one of the corrupt volumes was the quorum disk. I can understand that causing the cluster to go down, but should it cause both servers to blue screen? The other two corrupt volumes belonged to SQL Server resource groups.

In the second incident, the quorum disk was not corrupt. I am still not sure of the root cause here. I do know that a snapshot of a single LUN failed on the NetApp. That LUN was the cluster disk for one SQL Server instance, and I was able to bring the instance online by first deleting the snapshot. So again it appears that a single failed resource group somehow caused every node of the cluster to blue screen.



Chuck





In short, yes. Issues with connectivity to a LUN can cause the cluster to blue screen with a Stop 0x9E. Actually, issues with any clustered resource can cause the box to blue screen in 2008 clustering.

I'd recommend reviewing the following, since the behavior is configurable if you don't want the box to BSOD:

http://blogs.technet.com/b/askcore/archive/2009/06/12/why-is-my-2008-failover-clustering-node-blue-screening-with-a-stop-0x0000009e.aspx




