2 Node - SQL Cluster fails suddenly without much information


Hey,

We have a 2-node SQL Server 2008 R2 cluster that failed mysteriously, without any apparent reason. Looking at the logs on one of the nodes, I found:

Log Name:      System
Source:        Tcpip
Date:          08/10/2012 00:44:46
Event ID:      4199
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      server-db4.local
Description:
The system detected an address conflict for IP address x.x.x.x with the system having network hardware address AA-11-BB-22-CC-44. Network operations on this system may be disrupted as a result.

The other error:

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          08/10/2012 00:44:43
Event ID:      1135
Task Category: Node Mgr
Level:         Critical
Keywords:
User:          SYSTEM
Computer:      server-db4.local
Description:
Cluster node 'server-db3' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

Just to add: this cluster is not a new setup. It has been running for a year now.

Looking at the cluster log, I found:

000004e0.000020e8::2012/10/08-00:35:43.540 warn [res] physical disk <quorum>: pr reserve failed, status 170
000004e0.000020e8::2012/10/08-00:35:43.540 info [res] physical disk: validatereservations: size of reservations 16
000004e0.000020e8::2012/10/08-00:35:43.540 info [res] physical disk: key: 1c9f7466734d, type 5 scope 0
000004e0.000020e8::2012/10/08-00:35:43.540 info [res] physical disk: sleeping 6 secs

00001ad0.00001f48::2012/10/08-00:35:43.789 err [quorum] node 1: death timer expired after 20 seconds (death timer started @ 2012/10/08-00:35:23.368). lost quorum.
00001ad0.00001f48::2012/10/08-00:35:43.789 err lost quorum (status = 5925)
00001ad0.00001f48::2012/10/08-00:35:43.789 err lost quorum (status = 5925), executing onstop
00001ad0.00001f48::2012/10/08-00:35:43.789 info [dm]: shutting down, unloading cluster database.
00001ad0.00001f48::2012/10/08-00:35:43.789 info [dm] shutting down, unloading cluster database (waitforlock: false).
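For anyone sifting through a longer cluster log, here is a minimal Python sketch that splits lines of the shape shown above into fields, keeps only the warn/err entries, and works out how long the death timer actually ran. The regular expressions are assumptions based only on the lines quoted in this post, and the function names (`parse_line`, `problems`, `death_timer_elapsed`) are made up for illustration; they are not part of any Microsoft tool.

```python
import re
from datetime import datetime

# Assumed line shape, taken from the excerpt above:
#   <pid>.<tid>::<yyyy/mm/dd-hh:mm:ss.mmm> <level> <message>
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-fA-F]+)\.(?P<tid>[0-9a-fA-F]+)::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>\w+)\s+(?P<msg>.*)$"
)
DEATH_RE = re.compile(
    r"death timer started @ (\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})"
)
TS_FORMAT = "%Y/%m/%d-%H:%M:%S.%f"


def parse_line(line):
    """Return a dict for one cluster log line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["ts"] = datetime.strptime(record["ts"], TS_FORMAT)
    record["level"] = record["level"].lower()
    return record


def problems(lines):
    """Keep only the parsed records logged at warn or err level."""
    parsed = (parse_line(line) for line in lines)
    return [r for r in parsed if r and r["level"] in ("warn", "err")]


def death_timer_elapsed(record):
    """Seconds between the logged timer start and the line's own timestamp."""
    match = DEATH_RE.search(record["msg"])
    if match is None:
        return None
    started = datetime.strptime(match.group(1), TS_FORMAT)
    return (record["ts"] - started).total_seconds()


# Sample lines copied from the log excerpt in this post.
SAMPLE = [
    "000004e0.000020e8::2012/10/08-00:35:43.540 warn [res] physical disk <quorum>: pr reserve failed, status 170",
    "000004e0.000020e8::2012/10/08-00:35:43.540 info [res] physical disk: sleeping 6 secs",
    "00001ad0.00001f48::2012/10/08-00:35:43.789 err [quorum] node 1: death timer expired after 20 seconds (death timer started @ 2012/10/08-00:35:23.368). lost quorum.",
    "00001ad0.00001f48::2012/10/08-00:35:43.789 err lost quorum (status = 5925)",
]

if __name__ == "__main__":
    for rec in problems(SAMPLE):
        print(rec["ts"].isoformat(), rec["level"], rec["msg"])
        elapsed = death_timer_elapsed(rec)
        if elapsed is not None:
            print("  death timer ran for", elapsed, "seconds")
```

Filtering this way makes the order of events easier to see: the failed reserve on the quorum disk and the expired death timer land within the same second.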

Any help would be appreciated.

I would investigate a possible network hardware failure, or other hardware failure, on one node. It looks like the cluster tried to fail over to the alternate node, but the first node did not release the IP address(es) after becoming unresponsive. The second node tried to bring the IP address(es) online, and the first node re-asserted ownership. This can happen with NIC cards when the OS stops responding but doesn't lock up.
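One way to check this theory is to see whether the hardware address reported in event 4199 belongs to the other cluster node. Below is a minimal Python sketch that pulls the MAC address out of the event description and looks it up in a hand-maintained inventory. The `INVENTORY` mapping is entirely hypothetical (the post does not say which NIC belongs to which node), and the helper names are made up for illustration.

```python
import re

# Six hex pairs separated by '-' or ':', e.g. aa-11-bb-22-cc-44.
MAC_RE = re.compile(r"\b([0-9A-Fa-f]{2}(?:[-:][0-9A-Fa-f]{2}){5})\b")


def normalize_mac(mac):
    """Lower-case a MAC address and strip separators so formats compare equal."""
    return re.sub(r"[^0-9a-f]", "", mac.lower())


def conflicting_mac(description):
    """Return the normalized MAC found in an event description, or None."""
    match = MAC_RE.search(description)
    return normalize_mac(match.group(1)) if match else None


def owner_of(mac, inventory):
    """Return the host whose NIC list contains the MAC, or None if unknown."""
    for host, macs in inventory.items():
        if mac in {normalize_mac(m) for m in macs}:
            return host
    return None


# Description text as quoted in the question (event 4199 on server-db4).
EVENT_4199 = (
    "system detected address conflict ip address x.x.x.x system having "
    "network hardware address aa-11-bb-22-cc-44."
)

# HYPOTHETICAL inventory: replace with the real MAC addresses of each node's
# NICs (for example, as listed by `ipconfig /all` on each node).
INVENTORY = {
    "server-db3": ["AA:11:BB:22:CC:44"],
    "server-db4": ["AA:11:BB:22:CC:55"],
}

if __name__ == "__main__":
    mac = conflicting_mac(EVENT_4199)
    print("conflicting MAC:", mac)
    print("belongs to:", owner_of(mac, INVENTORY))
```

If the address resolves to the node that was removed from the cluster, that would be consistent with the first node holding on to the IP address. If it resolves to nothing in the inventory, some other device on the network may be claiming the address.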

Geoff N. Hiten, Principal Consultant, Microsoft SQL Server MVP


