= ORBIT Reliability 2/2007 =

== Power Supplies ==

The power supplies in some ORBIT nodes are failing. Two power supply
failure modes arising from regular operation have been identified. In
the first, the power supply degrades to the point where the CM has
enough power to report back to the CMC, but not enough to reliably
turn the node PC on or off. Though the evidence is not conclusive,
this first failure mode also appears to cause incorrect communication
between the CM and the Node ID box. In the second, the power supply
degrades further, until there is not enough power to operate the CM
at all. A node can operate in one of these failure modes for a while
and then recover, so, for example, retrying the power-on operation
might work on a node in the first failure mode. The power supplies
appear to degrade with age rather than with any particular pattern of
use: nodes that are used more frequently, such as those around
(1, 1), do not fail any more often than other nodes. The only known
remedy for a node with a failed power supply is to replace the power
supply entirely. It is presently unclear how best to do this. The
power supplies in the nodes are not a standard ATX form factor, and
replacing a part in all 400 nodes of the grid is not a trivial
undertaking. Currently, a small stock of known-good power supplies is
used to replace power supplies in nodes in either failure mode during
weekly scheduled maintenance, if not sooner.

Once a node enters the first failure mode, the problem cascades into
the software. The CMC receives regular watchdog messages from each
CM, with which it makes decisions about node availability. In the
first failure mode, the CM reports back to the CMC as if nothing were
wrong: nodes appear as "available" on the status page even when the
CM cannot reliably turn the node on or off. The CMC in turn reports
incorrect node availability to the NodeAgent and NodeHandler, which
frustrates any attempt to run an experiment on every available node.
Once the power supply has degraded into the second failure mode, the
CMC stops getting watchdog messages and can correctly mark the node
as unavailable.

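The watchdog-based availability decision described above can be sketched as follows. This is an illustration, not the actual CMC code: the 30-second timeout, the class, and its method names are all assumptions. Note that a tracker like this can only catch the second failure mode; in the first, watchdogs keep arriving and the node wrongly looks available.

```python
import time

# Hypothetical watchdog interval; the real CMC's timeout is not
# specified in this document.
WATCHDOG_TIMEOUT = 30.0  # seconds

class AvailabilityTracker:
    """Track node availability from CM watchdog messages.

    A node is considered available only while watchdog messages keep
    arriving; silence longer than the timeout marks it unavailable
    (the second power-supply failure mode).  This cannot detect the
    first failure mode, in which the CM still reports normally.
    """

    def __init__(self, timeout=WATCHDOG_TIMEOUT):
        self.timeout = timeout
        self.last_seen = {}  # node id -> timestamp of last watchdog

    def watchdog(self, node, now=None):
        """Record a watchdog message from a node's CM."""
        self.last_seen[node] = time.time() if now is None else now

    def available(self, node, now=None):
        """A node is available if a watchdog arrived recently enough."""
        now = time.time() if now is None else now
        seen = self.last_seen.get(node)
        return seen is not None and (now - seen) <= self.timeout
```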
== CM/CMC Software ==

We do not have enough evidence to be sure of this, but the CMC
issuing UDP commands to CMs seems to fail more often than Expect
scripts issuing equivalent telnet commands to CM consoles.
Furthermore, the UDP commands seem to upset the internal state of the
CM, such that a reset makes future commands more reliable. There are
also error conditions in which the CM operates incorrectly, or
freezes, such that issuing it a reset command does nothing; power
must be interrupted to recover the CM from such a state. This is
exceptionally bad for remote users, who cannot physically manipulate
the grid to clear the error.

There is uncertainty associated with the development environment,
Dynamic C. Dynamic C is not a mature compiler: many language features
a C programmer would expect have been left out or are subtly
different. Dynamic C provides several different programming
constructs for cooperative (or preemptive!) multitasking, and it is
unclear whether the current CM code uses them correctly.

== Network Infrastructure ==

We regularly experience bugs in our network switches. Momentarily
interrupting power to the switches often clears otherwise
unidentifiable network errors. We strongly suspect that any strenuous
utilization of the switches, such as would cause packets to be queued
or discarded, makes future errors more likely. Additionally, we seem
to lose one or two of our 27 Netgear switches every month: the switch
becomes completely inoperable and must be sent back to Netgear for
replacement. Higher quality switches are too expensive for us to
obtain.

== Software Remedies ==

Rewriting the CMC as a properly threaded web service would prevent
problems in failed CM software, as well as power supplies in the
first failure mode described above, from cascading into the rest of
the system. Changing the protocol between the CMC and CM to a
stateful, TCP-based protocol would make detection quicker still.
Ultimately, failing power supplies must be replaced, and the CM code
must be made more robust. Having CMs reset their nodes, rather than
turn them on and off, can extend the lifetime of the current grid.
There is little we can do about the switches, but we can at least
detect switch problems more quickly.

=== Threaded CMC ===

It is difficult to instrument the current CMC to compensate for a
failed command telling a CM to turn its node on or off. One could
imagine a CMC which checked the status of nodes after telling them to
turn on, perhaps retrying if the first failure mode is detected.
However, because the CM and the CMC communicate using a stateless,
asynchronous protocol over UDP, and because the present
implementation of the CMC is not threaded, it is impractical to
determine whether status check results came from before or after a
restart command was issued. Each interaction between the CMC and a CM
would need to wait 20 to 40 seconds to be sure the status being
reported reflects the state after the command was issued. Because the
present CMC implementation can only interact in this way with one
node at a time, this mandatory wait time does not scale.

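A threaded CMC could overlap those mandatory waits instead of summing them across nodes. The sketch below is hypothetical: `send_command` and `check_status` stand in for the CMC's real UDP exchange with each CM, and `check_status` is assumed to block long enough (20 to 40 seconds in the real system) that its answer reflects state from after the command.

```python
from concurrent.futures import ThreadPoolExecutor

def command_and_verify(node, send_command, check_status, retries=1):
    """Issue a command to one node's CM and verify it took effect.

    Retries once by default, since a node in the first failure mode
    may come back if the operation is attempted again.
    """
    for _attempt in range(retries + 1):
        send_command(node)
        if check_status(node):
            return node, True  # command verified
    return node, False  # persistent failure: flag the node for repair

def run_on_all(nodes, send_command, check_status, workers=32):
    """Verify a command on every node concurrently, so each node's
    mandatory wait is overlapped with the others rather than summed."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(command_and_verify, n,
                               send_command, check_status)
                   for n in nodes]
        return dict(f.result() for f in futures)
```

With 32 workers, 400 nodes at 40 seconds each take on the order of 400/32 × 40 ≈ 500 seconds instead of 400 × 40 = 16,000 seconds serially.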
=== New CM ===

The CM is a relatively large program, and we do not have the
resources to rewrite it all. However, a smaller feature set would not
only make a rewrite possible, it would reduce the amount of code.
Less code gives the Dynamic C compiler less opportunity to err, and
gives us less to maintain in the long run.

=== Switch Tools ===

We update the firmware in the switches as often as the vendor
supplies changes, but this does not seem to help. Because the
software on the switches is closed source on a closed hardware
platform, there is nothing we can do to fix the problem directly. We
are developing better tools for detecting when switch ports
autonegotiate or otherwise enter unexpected states.

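The core of such a tool is a comparison of successive polls of port state. The sketch below assumes each poll is a mapping from port number to a (speed, duplex) pair; how the states are actually collected (SNMP, the switch's web interface, etc.) is deliberately left open, since it depends on what the switches expose.

```python
def port_changes(previous, current):
    """Compare two polls of switch port state and report ports that
    renegotiated or otherwise changed.

    Each poll maps port number -> (speed_mbps, duplex).  A port that
    disappears from a poll shows up with None as its new state.
    Returns {port: (old_state, new_state)} for every changed port.
    """
    changes = {}
    for port in set(previous) | set(current):
        before = previous.get(port)
        after = current.get(port)
        if before != after:
            changes[port] = (before, after)
    return changes
```

A monitoring loop would poll each switch periodically, feed consecutive snapshots through `port_changes`, and alert on any port that dropped from gigabit full duplex, since that is the kind of silent renegotiation we currently only discover through failing experiments.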
=== Reset to 'Off Image' ===

Even in the first failure mode of a power supply, a CM can reliably
reset the node, causing it to reboot. The CMC could be modified to
send reset commands in place of on and off commands. Additionally,
the CMC could arrange for these reset commands to boot the node from
the network, with the network boot image being a special 'off image'
in the case of what would normally be an off command. The current
software is careful to separate the job of selecting an image for a
node into the NodeHandler and NodeAgent software, so this change
would be a kludge.

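The translation the modified CMC would perform can be stated in a few lines. This is a sketch only; the command names and the 'off image' name are placeholders, not the actual identifiers used in the system.

```python
# Hypothetical name for the special network boot image that leaves
# the node logically "off".
OFF_IMAGE = "off-image"

def translate(command, experiment_image):
    """Map a logical on/off request to (cm_command, boot_image).

    Both on and off become resets; only the chosen network boot
    image differs.  Other commands pass through unchanged.
    """
    if command == "on":
        return ("reset", experiment_image)
    if command == "off":
        return ("reset", OFF_IMAGE)
    return (command, None)
```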
Using just this kludge, the CM would always report the node as being
on, and it would therefore be impossible to distinguish between a
node being active or inactive in an experiment. The 'off image' would
therefore be made to run an echo service on an obscure port number,
and the CMC would need to be further modified to detect this service
in order to determine each node's activation state. Because it is the
only software issuing commands that could change the activation
state, the CMC could instead keep a record of which nodes are active
and which are not; however, this is a fragile arrangement. If the CMC
failed for any reason, something like the obscurely numbered echo
port would be needed to rediscover what was going on.
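The detection side of this scheme amounts to a simple probe. The sketch below is hypothetical: the port number is a placeholder (the actual obscure port has not been chosen), and the probe assumes the 'off image' echoes whatever it receives.

```python
import socket

# Placeholder port number; the actual obscure port is yet to be chosen.
OFF_IMAGE_PORT = 49152

def running_off_image(host, port=OFF_IMAGE_PORT, timeout=2.0):
    """Return True if the node appears to be running the 'off image',
    i.e. a service on the obscure port echoes our probe back.

    A refused connection or a wrong reply means the node is either
    active in an experiment or down; the caller must distinguish
    those cases by other means, such as the CM's status report.
    """
    probe = b"orbit-off-probe"
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.sendall(probe)
            reply = s.recv(len(probe))
        return reply == probe
    except OSError:
        return False
```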