Node Management¶

managed node behavior¶

node “stalling”¶

There is one difference in how nodes behave when they’re managed (i.e. in MANAGED mode), compared to their base behavior in AUTO mode.

In MANAGED mode, nodes don’t automatically recover after jump transitions. They instead hold in the state they jumped to. This is called a “STALL”.

This allows the manager to see that there has been a jump and coordinate its recovery as needed.

NodeManager interface¶

The NodeManager provides an interface whereby one guardian node can “manage” other nodes. The NodeManager object has methods for fully controlling subordinate nodes, as well as monitoring their state, status, and progress towards achieving their requests.

The NodeManager is instantiated in the main body of the module by passing it a list of nodes to be managed:

from guardian import NodeManager

nodes = NodeManager(['SUS_MC1', 'SUS_MC2', 'SUS_MC3'])

Guardian will initialize connections to the nodes automatically. The nodes object is then usable throughout the system to manage the specified nodes.

managing nodes¶

If the manager is going to be setting the requests of the subordinates, it should set the nodes to be in MANAGED mode in the INIT state:

class INIT(GuardState):
    def main(self):
        nodes.set_managed()
        ...

Requests can be made of the nodes, and their progress can be monitored by inspecting their state:

# set the request
nodes['SUS_MC2'] = 'ALIGNED'
# check the current state
if nodes['SUS_MC2'] == 'ALIGNED':
    ...

The arrived property is True if all nodes have arrived at their requested states:

if nodes.arrived:
    ...
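
As a rough sketch of how requests and the arrived property might be combined, the helper below sets a request on each subordinate through item assignment and then reports arrived. The name request_all and the requests mapping are illustrative assumptions, not part of guardian; the helper relies only on the item-assignment and arrived behavior described above.

```python
def request_all(nodes, requests):
    """Request a state from each named node, then report arrival.

    `nodes` is assumed to behave like the NodeManager described above:
    item assignment sets a node's request, and `arrived` is True once
    every managed node has reached its requested state.
    `requests` is a hypothetical dict mapping node name -> state name.
    """
    for name, state in requests.items():
        # Item assignment on the manager sets the request for that node.
        nodes[name] = state
    return nodes.arrived
```

Right after a new request the result will usually be False, so arrival generally has to be checked again later rather than assumed.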

reviving stalled nodes¶

If a managed node has “stalled”, i.e. experienced a jump transition, there are two ways to revive it:

issue a new request:

if nodes['SUS_MC2'].STALLED:
    nodes['SUS_MC2'] = 'ALIGNED'

issue a guardian.Node.revive() command, which re-requests the last requested state:
```
for node in nodes.get_stalled_nodes():
    node.revive()
```
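
The revive loop can also be wrapped so that the manager knows which nodes it revived, for example to log them. This is a minimal sketch: revive_stalled is a hypothetical helper, and it assumes only get_stalled_nodes(), revive(), and the name attribute listed in the class reference in this document.

```python
def revive_stalled(nodes):
    """Revive every stalled subordinate and return the revived node names.

    `nodes` is assumed to provide get_stalled_nodes(); each returned node
    is assumed to provide revive() and a `name` attribute.
    """
    revived = []
    for node in nodes.get_stalled_nodes():
        # revive() re-requests the state last requested through this node object.
        node.revive()
        revived.append(node.name)
    return revived
```

If the names are not needed, the revive_all() method of NodeManager covers the same case without a helper.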

checking node status¶

The checker method returns a decorator that looks for faults in the nodes. It will report if there are connection errors, node errors, notifications, or if the node mode has been changed:

@nodes.checker()
def main(self):
    ...

By default it only reports faults via the NOTIFICATION interface. To make the state jump when there is a fault, pass the jump target as fail_return:

@nodes.checker(fail_return='DOWN')
def main(self):
    ...

The node checker should be run in all states.

Node and NodeManager classes¶

class guardian.NodeManager(nodes)¶

Manager interface to a set of subordinate Guardian nodes.

This should be instantiated with a list of node names to be managed. Node objects are instantiated for each node.

>>> nodes = NodeManager(['SUS_ITMX','SUS_ETMX'])
>>> nodes.init()                   # initialize (handled automatically in daemon)
>>> nodes.set_managed()            # set all nodes to be in MANAGED mode
>>> nodes['SUS_ETMX'] = 'ALIGNED'  # request state of node
>>> nodes['SUS_ITMX'] = 'ALIGNED'  # request state of node
>>> nodes.arrived                  # True if all nodes have arrived at their
                                   # requested states
>>> nodes.check_fault()            # Check for management-related "faults" in all nodes

init()¶

Initialize all nodes.

Under normal circumstances, i.e. in a running guardian daemon, node initialization is handled automatically. This function therefore does not need to be executed in user code.

set_managed(nodes=None)¶

Set all nodes to be managed by this manager.

nodes can be a list of node names to set managed.

release(nodes=None)¶

Release all nodes from management by this manager.

nodes can be a list of node names to release.

arrived¶: Return True if all nodes have arrived at their requested state.

completed¶: Return True if all nodes are arrived and done.

get_stalled_nodes()¶: Return a list of all stalled nodes.

revive_all()¶: Revive all stalled nodes.

not_ok()¶: Return set of node names not currently reporting OK status.

check_fault()¶

Check fault status of all nodes.

Runs check_fault() method for all nodes. Returns True if any nodes are in fault.

checker(fail_return=None)¶

Return GuardStateDecorator for checking fault status of Nodes.

node_manager is a Node or NodeManager object with a check_fault() method. Returns a GuardStateDecorator with its pre_exec method set to be the check_fault method. The “fail_return” option should specify an alternate return value for the decorated state method in case the check fails, i.e. a jump state name (default None).
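
To illustrate the behavior described here, the following is a simplified stand-in written as a plain function decorator. It is not guardian's GuardStateDecorator and does not use its pre_exec hook; the name checker_sketch and the check_fault callable argument are assumptions made for this example.

```python
import functools


def checker_sketch(check_fault, fail_return=None):
    """Simplified stand-in for the decorator returned by checker().

    `check_fault` is any callable returning True when a fault is present.
    This only mimics the described behavior; it is not guardian code.
    """
    def decorator(method):
        @functools.wraps(method)
        def wrapper(*args, **kwargs):
            # Run the fault check before the state method (the pre_exec step).
            if check_fault() and fail_return is not None:
                # Fault found and a jump target was given: return it
                # instead of running the decorated state method.
                return fail_return
            # No fault, or no fail_return: run the state method as usual.
            return method(*args, **kwargs)
        return wrapper
    return decorator
```

With fail_return left at None, a fault is still checked for (and, in guardian, reported), but the decorated method runs normally.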

class guardian.Node(name)¶

Manager interface to a single Guardian node.

>>> SUS_ETMX = Node('SUS_ETMX')  # create the node object
>>> SUS_ETMX.init()              # initialize (handled automatically in daemon)
>>> SUS_ETMX.set_managed()       # set node to be in MANAGED mode
>>> SUS_ETMX.set_request('DAMPED') # request DAMPED state from node
>>> SUS_ETMX.arrived             # True if node arrived at requested state
>>> SUS_ETMX.check_fault()       # Check for management-related "faults" in the Node
>>> SUS_ETMX.release()           # release node from management

name¶: Node name

init()¶

Initialize the node.

Under normal circumstances, i.e. in a running guardian daemon, node initialization is handled automatically. This function therefore does not need to be executed in user code.

OP¶: node OP

MODE¶: node MODE

managed¶: True if node is MANAGED

MANAGER¶: MANAGER string of node

set_managed()¶: Set node to be managed by this manager.

release()¶: Release node from management by this manager (MODE=>AUTO).

i_manage¶: True if node is being managed by this system

ERROR¶: True if node in ERROR.

NOTIFICATION¶: True if node NOTIFICATION present.

OK¶: Current OK status of node.

REQUEST¶: Current REQUEST state of node.

request¶: Current REQUEST state of node.

set_request(state)¶: Set REQUEST state for node.

STATE¶: Current STATE of node.

state¶: Current STATE of node.

TARGET¶: Current TARGET state of node.

arrived¶

True if node STATE equals the last manager-requested state.

NOTE: This will be False if STATE == REQUEST but REQUEST was not last set by this Node manager object. This prevents false positives in the case that the REQUEST has been changed out of band.
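
The distinction in the note can be shown with a toy model. ArrivalTracker below is a hypothetical class, not guardian's implementation; it only reproduces the rule that arrival is judged against the state last requested through this object, not against whatever REQUEST the node currently shows.

```python
class ArrivalTracker:
    """Toy model of the `arrived` rule; not guardian's implementation."""

    def __init__(self):
        self.last_request = None  # last state requested through this object
        self.REQUEST = None       # node's current REQUEST (may change out of band)
        self.STATE = None         # node's current STATE

    def set_request(self, state):
        # Remember what this manager asked for, and set it on the node.
        self.last_request = state
        self.REQUEST = state

    @property
    def arrived(self):
        # Compare STATE with what this manager requested, not with REQUEST.
        # Treating "no request yet" as not arrived is an assumption of this sketch.
        return self.last_request is not None and self.STATE == self.last_request
```

If something else changes REQUEST to another state and the node follows it, STATE equals REQUEST but arrived stays False, because the manager's own request was never satisfied.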

STATUS¶: Current STATUS of node.

done¶

True if STATUS is DONE.

A state is DONE if it is the requested state and the state method has returned True.

completed¶: True if node has arrived at the requested state, and the state is done.

STALLED¶

True if the node has stalled in the current state.

This is true when STATE == TARGET != REQUEST, which is typically the result of a jump transition while in managed mode.
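
The stall condition is a chained comparison and can be written out directly. The function is_stalled is a hypothetical illustration of that condition only; it takes plain state names rather than reading them from a node.

```python
def is_stalled(state, target, request):
    """Return True when STATE == TARGET != REQUEST."""
    # The node has settled (STATE == TARGET) somewhere other than
    # where it was asked to go (TARGET != REQUEST).
    return state == target and target != request


# A node that jumped to DOWN while ALIGNED was requested is stalled.
is_stalled('DOWN', 'DOWN', 'ALIGNED')        # True
# A node still in transit toward its request is not stalled.
is_stalled('MOVING', 'ALIGNED', 'ALIGNED')   # False
```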

revive()¶

Re-request last requested state.

The last requested state in this case is the one requested from this Node object.

Useful for reviving stalled nodes: it counteracts the stall caused by a jump transition while in MANAGED mode. See the ‘STALLED’ property.

check_fault()¶

Return fault status of node.

Runs a series of checks on the “management status” of the node, and returns True if any of the following checks fail:

node still alive and running

node does not show ERROR status

REQUEST hasn’t deviated from last set value

if node had been set MANAGED, it is still set, and MANAGER hasn’t changed

node has no notifications (failure does not produce fault)

Any failure of the above also produces a NOTIFICATION message.
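
The checks above can be sketched as a single function over a snapshot of the node's status. This is a toy version under stated assumptions: check_fault_sketch, the status dict and its keys, and the message strings are all hypothetical, and the real method reads these values from the node itself.

```python
def check_fault_sketch(status, last_request, was_managed, manager_name):
    """Toy version of the listed checks; returns (fault, messages).

    `status` is a hypothetical dict with keys 'alive', 'ERROR', 'REQUEST',
    'managed', 'MANAGER', and 'NOTIFICATION'.
    """
    messages = []
    fault = False
    if not status['alive']:
        messages.append('node is not alive and running')
        fault = True
    if status['ERROR']:
        messages.append('node shows ERROR status')
        fault = True
    if status['REQUEST'] != last_request:
        messages.append('REQUEST deviated from last set value')
        fault = True
    if was_managed and not (status['managed'] and status['MANAGER'] == manager_name):
        messages.append('node is no longer managed by this manager')
        fault = True
    if status['NOTIFICATION']:
        # Notifications are reported but do not count as a fault.
        messages.append('node has notifications')
    return fault, messages
```

Every failed check adds a message, but only the first four set the fault flag, matching the note that notifications alone do not produce a fault.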