= ORBIT Reliability 2/2007 =

== Power Supplies ==

The power supplies in some ORBIT nodes are failing. Two power supply
failure modes arising from regular operation have been identified. In
the first, the power supply degrades to the point where the CM has
enough power to report back to the CMC, but not enough to reliably
turn the node PC on or off. This first failure mode may also cause
incorrect communication between the CM and the Node ID box, although
we are not certain of that. In the second, the power supply degrades
further, to the point where there is not enough power to operate the
CM at all. A node can operate in one of these failure modes for a
while and then recover, so, for example, retrying a power-on
operation might succeed on a node in the first failure mode. The
power supplies appear to degrade with age rather than with how many
times they are used in a particular way: nodes that are used more
frequently, such as those around (1, 1), do not fail any more often
than other nodes. The only known remedy for a failed power supply is
to replace it entirely, and it is presently unclear how best to do
that. The power supplies in the nodes do not use a standard ATX form
factor, and replacing a part in all 400 nodes of the grid is not a
trivial undertaking. Currently, a small number of known-good power
supplies is used to replace power supplies in nodes in either failure
mode during weekly scheduled maintenance, if not sooner.

Once a node enters the first failure mode, the problem cascades into
the software. The CMC receives regular watchdog messages from each
CM, and uses them to make decisions about node availability. In the
first failure mode, the CM reports back to the CMC as if nothing is
wrong. That is, nodes appear as "available" on the status page even
when the CM cannot reliably turn the node on or off. The CMC in turn
reports incorrect node availability to the NodeAgent and NodeHandler,
which frustrates any attempt to run an experiment on every available
node. Once the power supply has degraded into the second failure
mode, the CMC stops receiving watchdog messages and can correctly
mark the node as unavailable.

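The role the watchdog messages play in this cascade can be sketched
roughly as follows. This is only a simplified illustration: the port,
message format, timeout, and function names are invented for the
example and do not reflect the real CMC code. The point is that a
watchdog only proves the CM itself is alive, so a node in the first
failure mode still looks available.

{{{#!python
# Simplified illustration of watchdog-based availability tracking.
# The port, message format, and timeout are invented for this sketch.
import socket
import time

WATCHDOG_PORT = 9030      # hypothetical port the CMs send watchdogs to
WATCHDOG_TIMEOUT = 60.0   # seconds without a watchdog => "unavailable"

last_heard = {}           # (x, y) grid coordinate -> time of last watchdog

def collect_watchdogs(sock):
    """Record the arrival time of each watchdog datagram."""
    sock.settimeout(1.0)
    try:
        data, addr = sock.recvfrom(1024)
    except socket.timeout:
        return
    # Assume the datagram identifies the sending node, e.g. "watchdog 3 7".
    parts = data.decode(errors="replace").split()
    if len(parts) == 3 and parts[0] == "watchdog" \
            and parts[1].isdigit() and parts[2].isdigit():
        last_heard[(int(parts[1]), int(parts[2]))] = time.time()

def available_nodes():
    """A node is called 'available' if its CM sent a watchdog recently.
    This says nothing about whether the CM can actually power the node
    PC on or off, which is why the first failure mode goes unnoticed."""
    now = time.time()
    return [n for n, t in last_heard.items() if now - t < WATCHDOG_TIMEOUT]

if __name__ == "__main__":
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", WATCHDOG_PORT))
    while True:
        collect_watchdogs(sock)
        print("available:", sorted(available_nodes()))
}}}
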
== CM/CMC Software ==

We do not have enough evidence to be certain, but the CMC issuing UDP
commands to CMs appears to fail more often than expect scripts
issuing equivalent telnet commands to the CM consoles. Furthermore,
the UDP commands seem to upset the internal state of the CM, such
that a reset makes future commands more reliable. There are also
error conditions in which the CM operates incorrectly or freezes such
that a reset command has no effect; power must be interrupted to
recover the CM from such a state. This is exceptionally bad for
remote users, who cannot physically manipulate the grid to clear the
error.

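The escalation we fall back on conceptually looks like the sketch
below: issue the command, retry a bounded number of times, then try a
reset and retry again. The port, command strings, and
acknowledgement handling are placeholders, not the real CM protocol.

{{{#!python
# Sketch of the retry-then-reset escalation described above.  The port
# and command strings are placeholders, not the real CM protocol.
import socket

CM_PORT = 9020        # hypothetical UDP command port on the CM
ACK_TIMEOUT = 2.0     # seconds to wait for a reply before retrying

def send_command(cm_addr, command, retries=3):
    """Send a command datagram and wait briefly for any acknowledgement."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(ACK_TIMEOUT)
    for _ in range(retries):
        sock.sendto(command.encode(), (cm_addr, CM_PORT))
        try:
            sock.recvfrom(1024)
            return True          # got some reply; treat as success
        except socket.timeout:
            continue             # no reply; retry
    return False

def command_with_reset(cm_addr, command):
    """Escalate: plain retries first, then a reset, then retry again.
    If the CM is frozen hard, even the reset will not help and a
    human has to interrupt power at the node."""
    if send_command(cm_addr, command):
        return True
    send_command(cm_addr, "reset")   # try to clear the CM's internal state
    return send_command(cm_addr, command)
}}}
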
There is also uncertainty associated with the development
environment, "Dynamic C". Dynamic C is not a mature compiler: many
language features a C programmer would expect are missing or are
subtly different. Dynamic C provides several different programming
constructs for cooperative (or preemptive!) multitasking, and it is
unclear whether the current CM code uses them correctly.

== Network Infrastructure ==

We regularly run into bugs in our network switches. Momentarily
interrupting power to the switches often clears otherwise
unidentifiable network errors. We strongly suspect that any strenuous
use of the switches, such as load that causes packets to be queued or
discarded, makes their subsequent operation more likely to be faulty.
Additionally, we seem to lose one or two of our 27 Netgear switches
every month, such that the switch becomes completely inoperable and
must be sent back to Netgear for replacement. Higher-quality switches
are too expensive for us to obtain.

== Software Remedies ==

Rewriting the CMC as a properly threaded web service would prevent
problems in failed CM software, as well as power supplies in the
first failure mode described above, from cascading into the rest of
the system. Changing the protocol between the CMC and CM to a
stateful, TCP-based protocol would make failure detection even
quicker. Ultimately, failing power supplies must be replaced, and the
CM code must be made more robust. Having CMs reset their nodes,
rather than turn them on and off, can extend the lifetime of the
current grid. There is little we can do about the switches, but we
can at least detect switch problems more quickly.

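As a rough illustration of why a stateful TCP channel would speed up
failure detection, consider the sketch below: the CMC holds a
connection open to each CM, and a refused, dropped, or timed-out
connection is an immediate sign of trouble rather than something
inferred after a missed watchdog interval. The port number is a
placeholder; no such TCP interface exists on the CM today.

{{{#!python
# Rough illustration of why a persistent TCP connection detects CM
# failure faster than waiting on UDP watchdogs: the connection itself
# carries the liveness information.  Port number is a placeholder.
import socket

CM_TCP_PORT = 9021   # hypothetical TCP port a future CM would listen on

def cm_reachable(cm_addr, timeout=3.0):
    """Try to open a TCP connection to the CM.  A refused or timed-out
    connect (or, on a held connection, a drop) signals trouble right
    away instead of after a full watchdog interval."""
    try:
        with socket.create_connection((cm_addr, CM_TCP_PORT), timeout=timeout):
            return True
    except OSError:
        return False
}}}
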
=== Threaded CMC ===

It is difficult to instrument the current CMC to compensate for a
failed command to a CM to turn its node on or off. One could imagine
a CMC that checked the status of nodes after telling them to turn on,
perhaps retrying if the first failure mode is detected. However,
because the CM and the CMC communicate using a stateless,
asynchronous protocol over UDP, and because the present
implementation of the CMC is not threaded, it is impractical to
determine whether status check results came from before or after the
command was issued. Each interaction between the CMC and the CM would
need to wait 20 to 40 seconds to be sure the reported status reflects
the state after the command was issued. Because the present CMC
implementation can only interact with one node at a time in this way,
this mandatory wait time does not scale.

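A threaded CMC could pay that 20 to 40 second settling time once for
all nodes in parallel, rather than once per node in sequence. The
sketch below shows the shape of that idea; the turn_node_on and
node_status callables and the retry policy are stand-ins, not the
real CMC interfaces.

{{{#!python
# Shape of a threaded "command, wait, verify, retry" loop.
# turn_node_on() and node_status() stand in for the real CMC/CM
# interactions; the 40 s settling time comes from the discussion above.
import threading
import time

SETTLE_TIME = 40     # seconds to wait so status reflects the new command
MAX_RETRIES = 2

def power_on_and_verify(node, turn_node_on, node_status):
    """Issue the on command, wait out the settling time, then check and
    retry.  Run in its own thread so the waits for all nodes overlap."""
    for attempt in range(MAX_RETRIES + 1):
        turn_node_on(node)
        time.sleep(SETTLE_TIME)
        if node_status(node) == "on":
            return True
    return False     # likely a node in the first failure mode

def power_on_all(nodes, turn_node_on, node_status):
    """One worker thread per node; total wall time stays near one
    settling period instead of growing with the number of nodes."""
    threads = [
        threading.Thread(target=power_on_and_verify,
                         args=(n, turn_node_on, node_status))
        for n in nodes
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
}}}
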
=== New CM ===

The CM is a relatively large program, and we do not have the
resources to rewrite it all. However, a smaller feature set would not
only make a rewrite feasible but would also reduce the amount of
code. Less code gives the Dynamic C compiler less opportunity to err,
and gives us less to maintain in the long run.

=== Switch Tools ===

We update the firmware in the switches whenever the vendor supplies
changes, but this does not seem to make things better. Because the
software on the switches is closed source, running on a closed
hardware platform, there is nothing we can do to fix the problem
directly. We are instead developing better tools for detecting when
switch ports autonegotiate or otherwise enter unexpected states.

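One possible shape for such a tool, assuming the switches expose the
standard IF-MIB over SNMP and that the net-snmp command-line client
is installed, is to poll each port's operational status and
negotiated speed and report any change. The community string, switch
addresses, and port range below are placeholders.

{{{#!python
# Possible shape of a port-watching tool.  Assumes the switches speak
# SNMP v2c with the standard IF-MIB and that net-snmp's snmpget is
# installed; community string and switch addresses are placeholders.
import subprocess
import time

SWITCHES = ["switch1.example.org"]   # placeholder switch addresses
COMMUNITY = "public"                 # placeholder community string
PORTS = range(1, 25)                 # interface indexes to watch

def snmp_get(host, oid):
    """Fetch one OID value with the net-snmp command-line client."""
    out = subprocess.run(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Ovq", host, oid],
        capture_output=True, text=True)
    return out.stdout.strip()

def port_state(host, ifindex):
    """Operational status and negotiated speed for one port."""
    status = snmp_get(host, f"IF-MIB::ifOperStatus.{ifindex}")
    speed = snmp_get(host, f"IF-MIB::ifSpeed.{ifindex}")
    return status, speed

if __name__ == "__main__":
    previous = {}
    while True:
        for host in SWITCHES:
            for port in PORTS:
                state = port_state(host, port)
                key = (host, port)
                if key in previous and previous[key] != state:
                    print(f"{host} port {port} changed: "
                          f"{previous[key]} -> {state}")
                previous[key] = state
        time.sleep(60)
}}}
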
=== Reset to 'Off Image' ===

Even in the first failure mode of a power supply, a CM can reliably
reset the node, causing it to reboot. The CMC could be modified to
send reset commands in place of on and off commands. Additionally,
the CMC could arrange for these reset commands to boot the node from
the network, with the network boot image being a special 'off image'
in the case of what would normally be an off command. The current
software is careful to keep the job of selecting an image for a node
inside the NodeHandler and NodeAgent software, so this change would
be a kludge.

Using just this kludge, the CM would always report the node as being
on, so it would be impossible to distinguish whether a node is active
or inactive in an experiment. The 'off image' would therefore be made
to run an echo service on an obscure port number, and the CMC would
need to be further modified to probe that port to determine each
node's activation state. Because it is the only software issuing
commands that could change the activation state, the CMC could
instead keep a record of which nodes are active and which are not.
However, this is a fragile arrangement: if the CMC failed for any
reason, something like the obscurely numbered echo port would still
be needed to rediscover each node's state.
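
A probe for that obscurely numbered echo port might look like the
sketch below. The port number is an arbitrary placeholder; the real
choice would have to avoid ports experimenters are likely to use.

{{{#!python
# Sketch of probing the 'off image' echo service to recover a node's
# activation state.  The port number and payload are placeholders.
import socket

OFF_IMAGE_ECHO_PORT = 61234   # placeholder "obscure" port number

def node_is_off(node_addr, timeout=2.0):
    """If the node answers the probe on the obscure port, it is running
    the 'off image' and should be treated as inactive.  If the
    connection is refused or times out, the node is presumed to be
    running an experimenter's image (or is genuinely down)."""
    try:
        with socket.create_connection((node_addr, OFF_IMAGE_ECHO_PORT),
                                      timeout=timeout) as s:
            s.sendall(b"orbit-off-image-probe")
            s.settimeout(timeout)
            return len(s.recv(64)) > 0   # any echoed bytes count as a reply
    except OSError:
        return False
}}}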