
Version 25 (modified by ssugrim, 16 years ago)

List of Node Failures

Node Failure Mode Solution / Notes
[1,5] PXE Halt - Locks up during execution of PXE code Multiple resets (more than one) may be required
Might require node change
[1,5] Dead Node ID box top LED (the blinking one) Power cycle fixed it
Rabbit Issue?
[3,8] First Power on Halt Locks during the first attempt
POSTs after reset
[17,4] First Power on Halt Locks during the first attempt
no serial console output
[1,14] First Power on Halt Locks during the first attempt
Reset Fixes it
has new disk
[20,19] Disk Failure Kernel throws errors during imaging
Disk Changed
[12,9] Disk Controller Failure Disk controller was having issues, disks were being incorrectly recognised
[3,18] Disk Failure Disk Write errors
Disk replaced
[5,11] Disk Failure Disk Write errors
Disk replaced
[14,11] Disk Failure Disk Write errors
Disk replaced
[13,5] Lock Up Rabbit and Node were halted
Power cycled
[4,11] Disk Failure Disk Write errors
Disk replaced
[5,9] Disk Failure Disk Write errors
Disk replaced
[9,11] Disk Failure Disk Write errors
Disk replaced
[3,19] Bad Node Motherboard failure, refused to boot
Replaced
[14,8] Disk Failure Kernel Throws Disk Errors
Disk Changed
[17,9] Disk Failure Disk write halts, imaging times out
Disk replaced
[18,3] Overheat CM measures internal temp at 106 °F, fails to boot reliably
[20,2] Disk Failure Disk Write errors
Disk replaced
[8,13] Disk Failure Disk Write errors
Disk replaced
[9,10] Disk Failure Disk Write errors
Disk replaced
[5,2] Disk Failure Disk Write errors
Disk replaced
[17,13] Disk Failure Disk Write errors
Disk replaced
[12,1] Disk Failure Disk Write errors
Disk replaced
[6,14] Disk Failure Disk Write errors
Disk replaced
[17,19] Memory Failure Memory pins did not make proper contact; bent case and reinserted memory
[7,2] Disk Failure Disk Write errors
Disk replaced
[5,15] Lock Up Rabbit and Node were halted, node ID box LED was solid
Power cycled
[7,2] Lock Up Rabbit and Node were halted, node ID box LED was off
Power cycled
[16,1] Lock Up Rabbit and Node were halted
Power cycled
[1,9] Intermittent failure Power cycled
[1,5] Disk Failure Failing disk caused disk controller to fail
CM also had issues; both replaced
[9,4] Disk Failure Failing disk caused disk controller to fail
CM also had issues; both replaced
[15,6] Disk Failure Disk Write errors
Disk replaced
[18,16] Disk Failure Disk Write errors
Disk replaced
[3,11] Disk Failure Disk Write errors
Disk replaced
[16,19] Disk Failure Disk Write errors
Disk replaced
[5,17] Disk Failure Disk Write errors
Disk replaced
[20,4] Node Failure Node was replaced
[15,4] Node Failure Node was replaced, bad left antenna connector. Replacement was used
[5,14] Overheat Fan was not plugged in
[17,4] Disk Failure smartctl reports impending disk death
[9,9] Memory Failure Memory pins did not make proper contact; bent case and reinserted memory
[11,4] Disk Failure Disk Write errors
Disk replaced
[12,7] Disk Failure Disk Write errors
Disk replaced
[13,2] Disk Failure Successfully booted from disk, but kernel was throwing disk errors
[16,6] Disk Failure SMART overall-health self-assessment test result: FAILED!
[13,5] Disk Failure kernel throwing disk errors
[17,3] Disk Failure kernel throwing disk errors
[14,12] PXE Halt - Locks up during execution of PXE code Not fixed
[11,15] Network Failure PXE gives media check failure
Node replaced
[19,6] PXE Halt Powers down during PXE
[15,7] PXE Halt Halts at random stages in the PXE image download process, before control is handed over to the kernel
[16,8] CM crash Power Cycled
[20,20] CM crash CM light stays solid, Power Cycled
[7,2] CM crash Node ID light stays off, Power Cycled
[2,20] CM crash CM light stays solid, Power Cycled
[14,12] Disk Failure Disk Write errors
Disk replaced
[10,7] Disk Failure Disk Write errors
Disk replaced
[11,18] Disk Failure Disk Write errors
Disk replaced
[1,15] Disk Failure Disk Write errors
Disk replaced
[8,3] Disk Failure Disk Write errors
Disk replaced
[2,11] Disk Failure Disk Write errors
Disk replaced
[11,16] Disk Failure Disk Write errors
Disk replaced
[7,8] Disk Failure BIOS does not detect disk
Disk replaced
[18,7] Disk Failure BIOS does not detect disk
Disk replaced
[2,17] Disk Failure BIOS does not detect disk
Disk replaced
[5,19] Disk Failure BIOS does not detect disk
Disk replaced
[7,2] Disk Failure kernel throwing disk errors
Disk replaced
[12,4] Disk Failure kernel throwing disk errors
Disk replaced
[1,8] Disk Failure kernel throwing disk errors
Disk replaced
[18,18] Disk Failure kernel throwing disk errors
Disk replaced
[14,20] Disk Failure kernel throwing disk errors
Disk replaced
[9,16] Disk Failure kernel throwing disk errors
Disk replaced
[4,6] Disk Failure kernel throwing disk errors
Disk replaced
[6,8] Disk Failure kernel throwing disk errors
Disk replaced
[3,13] Disk Failure kernel throwing disk errors
Disk replaced
[5,4] Disk Failure kernel throwing disk errors
Disk replaced
[10,5] Disk Failure kernel throwing disk errors
Disk replaced
[10,8] Disk Failure kernel throwing disk errors
Disk replaced
[8,8] Network Failure Kernel throws network hardware complaint during DHCP
[12,4] Disk Failure BIOS does not detect disk
Disk replaced
[8,10] Disk Failure BIOS does not detect disk
Disk replaced
[15,17] Disk Failure BIOS does not detect disk
Disk replaced
[10,2] Disk Failure BIOS does not detect disk
Disk replaced
[1,6] Disk Failure kernel throwing disk errors
Disk replaced
hda: dma_timer_expiry: dma status == 0x21
[18,12] Disk Failure kernel throwing disk errors
Disk replaced
hda: dma_timer_expiry: dma status == 0x21
[1,10] Disk Failure kernel throwing disk errors
Disk replaced
[13,8] Disk Failure kernel throwing disk errors
Disk replaced
[12,12] Disk Failure kernel throwing disk errors
Disk replaced
[8,8] Disk Failure kernel throwing disk errors
Disk replaced
[2,3] Disk Failure kernel throwing disk errors
Disk replaced
[2,14] Disk Failure kernel throwing disk errors
Disk replaced
[13,17] Disk Failure kernel throwing disk errors
Disk replaced
[16,17] Disk Failure kernel throwing disk errors
Disk replaced
[1,2] Node Failure Can't isolate problem: seems to overheat and kernel panic
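
Several of the disk entries above were diagnosed from the smartctl health line ("SMART overall-health self-assessment test result: FAILED!"). The sketch below shows one way that check might be scripted. It is only an illustration, not the procedure actually used on these nodes: the function names are hypothetical, and the device path /dev/hda is an assumption based on the "hda:" kernel messages quoted in the list. It relies only on the standard smartctl -H option.

```python
import re
import subprocess

# Matches the health line quoted in the failure list above.
HEALTH_RE = re.compile(
    r"SMART overall-health self-assessment test result:\s*(\S+)"
)


def parse_smart_health(report):
    """Return the verdict token (e.g. 'PASSED' or 'FAILED!'), or None if absent."""
    match = HEALTH_RE.search(report)
    return match.group(1) if match else None


def disk_health(device="/dev/hda"):
    """Run `smartctl -H` on `device` (assumed path) and return the verdict.

    smartctl signals problems through a non-zero exit status, so the
    exit code is deliberately not treated as an error here.
    """
    result = subprocess.run(
        ["smartctl", "-H", device],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_smart_health(result.stdout)
```

A node whose verdict is anything other than PASSED (or for which no verdict line is printed at all) would be a candidate for the "Disk replaced" treatment recorded above.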