Node Management

managed node behavior

node “stalling”

There is one difference in how nodes behave when they’re managed (i.e. in MANAGED mode), compared to their base behavior in AUTO mode.

In MANAGED mode, nodes don’t automatically recover after jump transitions. They instead hold in the state they jumped to. This is called a “STALL”.

This allows the manager to see that there has been a jump and coordinate its recovery as needed.
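The difference can be illustrated with a toy model. The sketch below is not guardian code: ToyNode and its jump() method are hypothetical stand-ins, and treating the target as either the request or the jumped-to state is a simplification made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ToyNode:
    """Hypothetical stand-in for a node (not part of the guardian package)."""
    mode: str     # 'AUTO' or 'MANAGED'
    request: str  # state last requested of the node
    state: str    # state the node is currently in
    target: str   # state the node is currently heading for (simplified)

    def jump(self, new_state):
        """Simulate a jump transition into new_state."""
        self.state = new_state
        if self.mode == 'MANAGED':
            # Managed nodes hold where they landed: this is the stall.
            self.target = new_state
        else:
            # AUTO nodes head back toward the request on their own.
            self.target = self.request

    @property
    def stalled(self):
        # Holding in a state that is not the requested one.
        return self.state == self.target != self.request


auto = ToyNode('AUTO', 'ALIGNED', 'ALIGNED', 'ALIGNED')
managed = ToyNode('MANAGED', 'ALIGNED', 'ALIGNED', 'ALIGNED')
for node in (auto, managed):
    node.jump('TRIPPED')

print(auto.stalled, managed.stalled)  # prints: False True
```

Only the managed node ends up holding in the jumped-to state, which is what gives a manager the chance to notice the jump.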

NodeManager interface

The NodeManager provides an interface whereby one guardian node can “manage” other nodes. The NodeManager object has methods for fully controlling subordinate nodes, as well as monitoring their state, status, and progress towards achieving their requests.

The NodeManager is instantiated in the main body of the module by passing it a list of nodes to be managed:

from guardian import NodeManager

nodes = NodeManager(['SUS_MC1', 'SUS_MC2', 'SUS_MC3'])

Guardian will initialize connections to the nodes automatically. The nodes object is then usable throughout the system to manage the specified nodes.

managing nodes

If the manager is going to be setting the requests of the subordinates, it should set the nodes to be in MANAGED mode in the INIT state:

class INIT(GuardState):
    def main(self):
        nodes.set_managed()
        ...

Requests can be made of the nodes, and their progress can be monitored by inspecting their state:

# set the request
nodes['SUS_MC2'] = 'ALIGNED'
# check the current state
if nodes['SUS_MC2'] == 'ALIGNED':
    ...

The arrived property is True if all nodes have arrived at their requested states:

if nodes.arrived:
    ...
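One way the item syntax above could map onto per-node operations is sketched below. StubNode and StubManager are hypothetical stand-ins written for this example; the real guardian classes talk to running nodes and may be implemented differently. The sketch assumes that item assignment forwards to set_request() and that comparing a node with a string compares its current state.

```python
class StubNode:
    """Hypothetical in-memory node; no connection to a real guardian node."""

    def __init__(self, name, state='INIT'):
        self.name = name
        self.state = state
        self.request = None  # nothing requested through this object yet

    def set_request(self, state):
        self.request = state

    def __eq__(self, other):
        # Assumption: comparing against a string checks the current state.
        return self.state == other

    @property
    def arrived(self):
        return self.request is not None and self.state == self.request


class StubManager:
    """Hypothetical manager exposing nodes through item access."""

    def __init__(self, names):
        self._nodes = {name: StubNode(name) for name in names}

    def __getitem__(self, name):
        return self._nodes[name]

    def __setitem__(self, name, state):
        # Assumption: assignment is shorthand for requesting a state.
        self._nodes[name].set_request(state)

    @property
    def arrived(self):
        return all(node.arrived for node in self._nodes.values())


nodes = StubManager(['SUS_MC1', 'SUS_MC2'])
nodes['SUS_MC1'] = 'ALIGNED'
nodes['SUS_MC2'] = 'ALIGNED'
nodes['SUS_MC2'].state = 'ALIGNED'  # pretend SUS_MC2 got there first
print(nodes['SUS_MC2'] == 'ALIGNED', nodes.arrived)
```

In this sketch the manager-level arrived stays False until every node has reached its own request.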

reviving stalled nodes

If a managed node has “stalled”, i.e. experienced a jump transition, there are two ways to revive it:

  • issue a new request:

    if nodes['SUS_MC2'].stalled:
        nodes['SUS_MC2'] = 'ALIGNED'
    
  • issue a guardian.Node.revive() command, which re-requests the last requested state:

    for node in nodes.get_stalled_nodes():
        node.revive()
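The second approach can be sketched with an in-memory stand-in. Everything below is hypothetical illustration code, not the guardian implementation: StubNode, its jump() helper, and the free function revive_stalled() are invented for this example, and the stub reaches any requested state immediately.

```python
class StubNode:
    """Hypothetical node that reaches any requested state immediately."""

    def __init__(self, name):
        self.name = name
        self.state = self.target = self.request = 'INIT'
        self._last_request = None  # last state requested via this object

    def set_request(self, state):
        self._last_request = state
        self.request = state
        self.state = self.target = state  # toy model: instant arrival

    def jump(self, state):
        """Simulate a jump transition while in MANAGED mode."""
        self.state = self.target = state

    @property
    def stalled(self):
        return self.state == self.target != self.request

    def revive(self):
        # Re-request whatever was last requested through this object.
        self.set_request(self._last_request)


def revive_stalled(nodes):
    """Revive every stalled node and return the names that were revived."""
    revived = []
    for node in nodes:
        if node.stalled:
            node.revive()
            revived.append(node.name)
    return revived


mc1, mc2 = StubNode('SUS_MC1'), StubNode('SUS_MC2')
for node in (mc1, mc2):
    node.set_request('ALIGNED')
mc2.jump('TRIPPED')
print(revive_stalled([mc1, mc2]), mc2.state)
```

Because revive() reuses the remembered request, the manager does not need to know which state each node was asked for.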
    

checking node status

The checker method returns a decorator that looks for faults in the nodes. It will report if there are connection errors, node errors, notifications, or if the node mode has been changed:

@nodes.checker()
def main(self):
    ...

By default the checker only reports through the NOTIFICATION interface. To make the state jump when a fault is found, pass the fail_return option:

@nodes.checker(fail_return='DOWN')
def main(self):
    ...

The node checker should be run in all states.
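The effect of fail_return can be sketched with an ordinary function decorator. This is a plain-Python analogue written for illustration, not the GuardStateDecorator that guardian actually returns; the checker() function, FakeNodes, and the LOCKED and IDLE classes below are all hypothetical.

```python
import functools


def checker(check_fault, fail_return=None):
    """Build a decorator that runs check_fault() before a state method."""
    def decorator(method):
        @functools.wraps(method)
        def wrapper(self, *args, **kwargs):
            faulted = check_fault()  # reporting is assumed to happen in here
            if faulted and fail_return is not None:
                # Jump: return the alternate state name instead of
                # running the decorated method.
                return fail_return
            return method(self, *args, **kwargs)
        return wrapper
    return decorator


class FakeNodes:
    """Hypothetical manager whose fault status is set by hand."""

    def __init__(self):
        self.fault = False

    def check_fault(self):
        return self.fault


fake = FakeNodes()


class LOCKED:
    @checker(fake.check_fault, fail_return='DOWN')
    def main(self):
        return True


class IDLE:
    @checker(fake.check_fault)  # report only, never jump
    def main(self):
        return True


fake.fault = True
print(LOCKED().main(), IDLE().main())  # prints: DOWN True
```

Without fail_return the decorated method still runs after the check, which matches the report-only default described above.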

Node and NodeManager classes

class guardian.NodeManager(nodes)

Manager interface to a set of subordinate Guardian nodes.

This should be instantiated with a list of node names to be managed. Node objects are instantiated for each node.

>>> nodes = NodeManager(['SUS_ITMX','SUS_ETMX'])
>>> nodes.init()                   # initialize (handled automatically in daemon)
>>> nodes.set_managed()            # set all nodes to be in MANAGED mode
>>> nodes['SUS_ETMX'] = 'ALIGNED'  # request state of node
>>> nodes['SUS_ITMX'] = 'ALIGNED'  # request state of node
>>> nodes.arrived                  # True if all nodes have arrived at their
                                   # requested states
>>> nodes.check_fault()            # Check for management-related "faults" in all nodes

init()

Initialize all nodes.

Under normal circumstances, i.e. in a running guardian daemon, node initialization is handled automatically. This function therefore does not need to be executed in user code.

set_managed(nodes=None)

Set all nodes to be managed by this manager.

nodes can be a list of node names to set managed.

release(nodes=None)

Release all nodes from management by this manager.

nodes can be a list of node names to release.

arrived

Return True if all nodes have arrived at their requested state.

completed

Return True if all nodes have arrived and are done.

get_stalled_nodes()

Return a list of all stalled nodes.

revive_all()

Revive all stalled nodes.

not_ok()

Return set of node names not currently reporting OK status.

check_fault()

Check fault status of all nodes.

Runs check_fault() method for all nodes. Returns True if any nodes are in fault.

checker(fail_return=None)

Return GuardStateDecorator for checking fault status of Nodes.

node_manager is a Node or NodeManager object with a check_fault() method. Returns a GuardStateDecorator with its pre_exec method set to be the check_fault method. The “fail_return” option should specify an alternate return value for the decorated state method in case the check fails (i.e. a jump state name) (default None).

class guardian.Node(name)

Manager interface to a single Guardian node.

>>> SUS_ETMX = Node('SUS_ETMX')  # create the node object
>>> SUS_ETMX.init()              # initialize (handled automatically in daemon)
>>> SUS_ETMX.set_managed()       # set node to be in MANAGED mode
>>> SUS_ETMX.set_request('DAMPED') # request DAMPED state from node
>>> SUS_ETMX.arrived             # True if node arrived at requested state
>>> SUS_ETMX.check_fault()       # Check for management-related "faults" in the Node
>>> SUS_ETMX.release()           # release node from management

name

Node name

init()

Initialize the node.

Under normal circumstances, i.e. in a running guardian daemon, node initialization is handled automatically. This function therefore does not need to be executed in user code.

OP

node OP

MODE

node MODE

managed

True if node is MANAGED

MANAGER

MANAGER string of node

set_managed()

Set node to be managed by this manager.

release()

Release node from management by this manager (MODE=>AUTO).

i_manage

True if node is being managed by this system

ERROR

True if node in ERROR.

NOTIFICATION

True if node NOTIFICATION present.

OK

Current OK status of node.

REQUEST

Current REQUEST state of node.

request

Current REQUEST state of node.

set_request(state)

Set REQUEST state for node.

STATE

Current STATE of node.

state

Current STATE of node.

TARGET

Current TARGET state of node.

arrived

True if node STATE equals the last manager-requested state.

NOTE: This will be False if STATE == REQUEST but REQUEST was not last set by this Node manager object. This prevents false positives in the case that the REQUEST has been changed out of band.
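The note above can be made concrete with a small sketch. RequestTracker is a hypothetical helper invented for this example; it only models the bookkeeping of which state was last requested through this manager object and is not how guardian stores it.

```python
class RequestTracker:
    """Hypothetical helper remembering what this object last requested."""

    def __init__(self):
        self.last_request = None  # nothing requested yet

    def set_request(self, state):
        self.last_request = state

    def arrived(self, node_state):
        """True only if the node is in the state this object last requested."""
        return (self.last_request is not None
                and node_state == self.last_request)


tracker = RequestTracker()
tracker.set_request('ALIGNED')

# Out of band, something else changes REQUEST and the node follows it.
node_request = 'MISALIGNED'
node_state = 'MISALIGNED'

print(node_state == node_request)   # True: the node reached *a* request
print(tracker.arrived(node_state))  # False: not the one requested here
```

Comparing against the remembered request rather than the node's live REQUEST is what rules out the false positive.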

STATUS

Current STATUS of node.

done

True if STATUS is DONE.

A state is DONE if it is the requested state and the state method has returned True.

completed

True if node has arrived at the requested state and the state is done.

STALLED

True if the node has stalled in the current state.

This is true when STATE == TARGET != REQUEST, which is typically the result of a jump transition while in managed mode.

revive()

Re-request last requested state.

The last requested state in this case is the one requested from this Node object.

Useful for reviving stalled nodes: it counteracts the stall caused by a jump transition while in MANAGED mode. See the ‘STALLED’ property.

check_fault()

Return fault status of node.

Runs a series of checks on the “management status” of the node, and returns True if any of the following checks fail:

  • node still alive and running
  • node does not show ERROR status
  • REQUEST hasn’t deviated from last set value
  • if node had been set MANAGED, it is still set, and MANAGER hasn’t changed
  • node has no notifications (a failure of this check does not produce a fault)

Any failure of the above also produces a NOTIFICATION message.
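The checks listed above can be sketched as a pure function over a snapshot of node values. This is an illustrative sketch, not the guardian implementation: NodeSnapshot, its field names, the extra arguments, and the manager name IMC_LOCK are all hypothetical, and the messages are returned alongside the fault flag here instead of being posted as NOTIFICATION messages.

```python
from dataclasses import dataclass


@dataclass
class NodeSnapshot:
    """Hypothetical record of what a manager can read from one node."""
    running: bool
    error: bool
    request: str
    mode: str
    manager: str
    notification: bool


def check_fault(snap, last_request, was_set_managed, my_name):
    """Return (fault, messages) for one node snapshot."""
    fault = False
    messages = []

    def fail(message, is_fault=True):
        nonlocal fault
        messages.append(message)
        fault = fault or is_fault

    if not snap.running:
        fail('node is not running')
    if snap.error:
        fail('node is in ERROR')
    if last_request is not None and snap.request != last_request:
        fail('REQUEST changed to %s' % snap.request)
    if was_set_managed and (snap.mode != 'MANAGED' or snap.manager != my_name):
        fail('node is no longer managed by %s' % my_name)
    if snap.notification:
        # Reported, but on its own this does not count as a fault.
        fail('node has a notification', is_fault=False)
    return fault, messages


healthy = NodeSnapshot(True, False, 'ALIGNED', 'MANAGED', 'IMC_LOCK', False)
taken_over = NodeSnapshot(True, False, 'DAMPED', 'AUTO', '', False)
print(check_fault(healthy, 'ALIGNED', True, 'IMC_LOCK'))
print(check_fault(taken_over, 'ALIGNED', True, 'IMC_LOCK')[0])
```

Every check runs even after an earlier one fails, so in this sketch a single call reports all problems found on the node.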

checker(fail_return=None)

Return GuardStateDecorator for checking fault status of Nodes.

node_manager is a Node or NodeManager object with a check_fault() method. Returns a GuardStateDecorator with its pre_exec method set to be the check_fault method. The “fail_return” option should specify an alternate return value for the decorated state method in case the check fails (i.e. a jump state name) (default None).