Site Administration

Node Supervision

Guardian daemon processes run on dedicated machines in the CDS front-end networks: {h,l}1guardian0. They are managed by a process supervision system called [[http://smarden.org/runit/|runit]] ([[http://packages.ubuntu.com/precise/admin/runit|Ubuntu package]]). Runit handles starting, stopping, and logging of the guardian daemons.

Runit manages individual daemon processes with a program called ‘runsv’ and its sister ‘svlogd’ that handles logging. Each runsv process supervises an individual guardian daemon. The overall pool of runsv-supervised guardian daemons is itself supervised by the runsvdir program. At the very top is the Ubuntu Upstart init daemon (process 1).

Here’s an example process tree, with a single active guardian process (‘SUS_SR2’):

init                                               # Upstart init daemon (process 1)
  +-runsvdir -P /etc/guardian/service              # runit main guardian runsvdir process
      +-runsv SUS_SR2                              # runit individual daemon supervisor
          +-guardian SUS_SR2                       # guardian main process
          |   +-guardian SUS_SR2 (worker)          # guardian worker subprocess
          |   |   +-{python}                       # guardian auxiliary threads
          |   |   +-{python}
          |   |   +-{python}
          |   |   +-{python}
          |   +-{python}
          |   +-{python}
          |   +-{python}
          +-svlogd -tt /var/log/guardian/SUS_SR2   # svlogd logging daemon

Interaction with this infrastructure is done via the guardctrl command line utility.
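For low-level debugging it can help to see what runit itself has recorded for each node. The following is a minimal sketch, not part of guardctrl: it assumes runit's documented layout in which runsv writes a human-readable state (such as "run" or "down") to supervise/stat inside each service directory. The function name list_node_states is hypothetical.

```shell
# List each supervised node with the state runsv last recorded for it.
# Assumption: <service>/supervise/stat holds a human-readable state,
# as described in the runsv manual. Nodes without that file are
# reported as "unknown".
list_node_states() (
    service_dir=${1:-/etc/guardian/service}
    for node in "$service_dir"/*; do
        [ -d "$node" ] || continue
        stat_file=$node/supervise/stat
        if [ -r "$stat_file" ]; then
            state=$(cat "$stat_file")
        else
            state=unknown
        fi
        printf '%s %s\n' "$(basename "$node")" "$state"
    done
)
```

Calling list_node_states with no argument reads /etc/guardian/service; passing another directory lets the function be tried against a scratch tree.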

configuration directories and files

The main configuration directory on the guardian machine is /etc/guardian:

controls@h1guardian0:~ 0$ ls -al /etc/guardian
total 24
drwxr-xr-x   4 root     root     4096 Feb 28 18:28 .
drwxr-xr-x 104 root     root     4096 Feb 22 15:57 ..
-rw-r--r--   1 cdsadmin cdsadmin  441 Feb 28 18:27 local-env
lrwxrwxrwx   1 root     root       18 Feb 20 15:34 logs -> /var/log/guardian/
drwxr-xr-x  41 controls controls 4096 Mar 28 16:18 nodes
-rwxr-xr-x   1 cdsadmin cdsadmin  141 Feb 18 15:37 runsvdir-start
drwxr-xr-x   2 controls controls 4096 Mar 28 16:18 service
lrwxrwxrwx   1 root     root       13 Feb 20 15:33 supervise -> /run/guardian

Each node is given a unique configuration directory in /etc/guardian/nodes. Each node directory describes how the guardian daemon should be started and logged:

controls@h1guardian0:~ 0$ find /etc/guardian/nodes/SUS_SR2
/etc/guardian/nodes/SUS_SR2/
/etc/guardian/nodes/SUS_SR2/env
/etc/guardian/nodes/SUS_SR2/supervise
/etc/guardian/nodes/SUS_SR2/log
/etc/guardian/nodes/SUS_SR2/log/supervise
/etc/guardian/nodes/SUS_SR2/log/config
/etc/guardian/nodes/SUS_SR2/log/main
/etc/guardian/nodes/SUS_SR2/log/run
/etc/guardian/nodes/SUS_SR2/run
/etc/guardian/nodes/SUS_SR2/guardian
/etc/guardian/nodes/SUS_SR2/finish

The existence of this directory does not by itself start the service. To start the service, the node directory has to be linked into the “service” directory /etc/guardian/service so that it can be instantiated by the main runsvdir supervisor.
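The linking step can be sketched as follows. This is an illustration only, normally guardctrl is the tool that does this; the function name enable_node and the use of an absolute symlink target are assumptions. According to its manual, runsvdir rescans the service directory every five seconds, so a new link is picked up shortly after it appears.

```shell
# Link a node directory into the service directory so that runsvdir
# starts a runsv supervisor for it.
# base_dir defaults to /etc/guardian; pass another root to experiment
# without touching the live configuration.
enable_node() (
    node=$1
    base_dir=${2:-/etc/guardian}
    if [ ! -d "$base_dir/nodes/$node" ]; then
        echo "no such node directory: $base_dir/nodes/$node" >&2
        exit 1
    fi
    # Fails (non-zero status) if the link already exists.
    ln -s "$base_dir/nodes/$node" "$base_dir/service/$node"
)
```

For example, enable_node SUS_SR2 would create /etc/guardian/service/SUS_SR2 pointing at /etc/guardian/nodes/SUS_SR2. Removing the link again causes runsvdir to stop supervising the node.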

The supervise directory, /etc/guardian/supervise, holds runit-specific supervision files for each node (fifos, sockets, etc.). It should be a link to a tmpfs, in this case mounted at /run/guardian.

The logs are stored in /var/log/guardian.

When first creating the infrastructure, it’s important to create the needed extra directories (/run/guardian, /var/log/guardian) with the correct permissions, and then link them into /etc/guardian. The [[Guardian/guardctrl|guardctrl]] utility should then create the necessary subdirectories as needed.
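A minimal sketch of that initial setup is shown here. The function name init_guardian_dirs is hypothetical, all three paths are parameters so the sketch can be tried under a scratch root, and setting ownership and permissions for the daemon user is deliberately left out because the correct values are site-specific.

```shell
# Create the runtime and log directories and link them into the
# configuration directory as "supervise" and "logs".
# Ownership and permission setup is omitted from this sketch.
init_guardian_dirs() (
    etc_dir=${1:-/etc/guardian}
    run_dir=${2:-/run/guardian}
    log_dir=${3:-/var/log/guardian}
    mkdir -p "$etc_dir/nodes" "$etc_dir/service" "$run_dir" "$log_dir" || exit 1
    # -n replaces an existing link instead of descending into its target.
    ln -sfn "$run_dir" "$etc_dir/supervise" &&
    ln -sfn "$log_dir" "$etc_dir/logs"
)
```

Run as root with no arguments, this would reproduce the supervise and logs links shown in the /etc/guardian listing above.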

guardian run environment

The run script in the node directory is what actually starts the guardian process. The guardian run scripts source the main environment file, /etc/guardian/local-env, before execution. This file should be configured with any local environmental settings that are necessary for guardian to function properly:

export PATH=/bin:/usr/bin
export LD_LIBRARY_PATH=
export PYTHONPATH=
. /ligo/apps/linux-x86_64/epics/etc/epics-user-env.sh
. /ligo/apps/linux-x86_64/nds2-client/etc/nds2-client-user-env.sh || true
. /ligo/apps/linux-x86_64/cdsutils/etc/cdsutils-user-env.sh
. /ligo/apps/linux-x86_64/guardian/etc/guardian-user-env.sh
ifo=`ls /ligo/cdscfg/ifo`
site=`ls /ligo/cdscfg/site`
export IFO=${ifo^^*}
export NDSSERVER=${ifo}nds0:8088,${ifo}nds1:8088
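The last four lines derive the interferometer-specific settings from a lowercase prefix. The sketch below shows what they would produce, assuming /ligo/cdscfg/ifo contains a single entry such as h1 (inferred from the host name h1guardian0, not stated in the file). The function name guardian_ifo_env is hypothetical, and tr is used as a portable stand-in for the bash-only uppercase expansion in the original.

```shell
# Print the IFO and NDSSERVER values that local-env would export for a
# given lowercase interferometer prefix.
# tr replaces bash's ${ifo^^*} expansion so the sketch runs in any sh.
guardian_ifo_env() (
    ifo=$1
    IFO=$(printf '%s' "$ifo" | tr '[:lower:]' '[:upper:]')
    printf 'IFO=%s\n' "$IFO"
    printf 'NDSSERVER=%s\n' "${ifo}nds0:8088,${ifo}nds1:8088"
)
```

With the prefix h1, guardian_ifo_env h1 prints IFO=H1 followed by NDSSERVER=h1nds0:8088,h1nds1:8088.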

upstart init

The main service pool manager (runsvdir) is managed by Upstart (Ubuntu’s init daemon). The service is called guardian-runsvdir, and uses the following Upstart ‘init’ file that specifies when it should be started and stopped, and any pre/post run conditions:

/etc/init/guardian-runsvdir.conf:

start on runlevel [2345]
stop on shutdown
respawn
pre-start script
    mkdir -p /run/guardian
    mountpoint -q /run/guardian || mount -t tmpfs -o size=10M,user=controls tmpfs /run/guardian
end script
exec /etc/guardian/runsvdir-start
kill signal HUP

Note that the service depends on a tmpfs being mounted at /run/guardian; the pre-start script mounts one if it is not already present.
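When troubleshooting, it can be useful to confirm that the mount is really a tmpfs rather than just an ordinary directory. The following sketch checks a mount table in the /proc/mounts layout; the function name is_tmpfs_mounted is hypothetical, and the table path is a parameter so the check can be tried against a sample file.

```shell
# Succeed if mount_point is listed as a tmpfs in the given mount table.
# The table uses the /proc/mounts layout: device, mount point, type, ...
is_tmpfs_mounted() (
    mount_point=$1
    mounts_file=${2:-/proc/mounts}
    awk -v mp="$mount_point" '
        $2 == mp && $3 == "tmpfs" { found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$mounts_file"
)
```

On the guardian machine, is_tmpfs_mounted /run/guardian && echo ok would print ok once the pre-start script has run.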

The init file executes the /etc/guardian/runsvdir-start script, which clears the environment for all guardian processes (keeping only PATH) and then execs the runsvdir process as the controls user:

/etc/guardian/runsvdir-start:

#!/bin/bash
PATH=/bin:/usr/bin
srvdir=/etc/guardian/service
exec env - \
    PATH=$PATH \
    chpst -u controls /usr/bin/runsvdir -P $srvdir
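The effect of the "env -" idiom in this script can be demonstrated in isolation. The sketch below is not part of the guardian setup; run_clean is a hypothetical helper that runs any command with an emptied environment carrying only PATH, which is what runsvdir and every process beneath it inherit.

```shell
# Run a command with an emptied environment that carries only PATH,
# mirroring the "env - PATH=$PATH ..." idiom in runsvdir-start.
run_clean() {
    env - PATH=/bin:/usr/bin "$@"
}
```

For example, after export GUARDIAN_DEMO=leaked, the command run_clean sh -c 'echo "${GUARDIAN_DEMO:-unset}"' prints unset, because the variable is not passed through. This is why anything guardian needs at run time has to be set in /etc/guardian/local-env rather than in a login shell.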

Upstart processes are controlled with the initctl utility. This can be used to start, stop, and check the status of guardian-runsvdir. Any user can query the status of an Upstart-supervised daemon, but you must be root to start or stop the process:

controls@h1guardian0:~ 0$ initctl status guardian-runsvdir
guardian-runsvdir start/running, process 23769
controls@h1guardian0:~ 0$
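For scripted health checks, the status line can be tested rather than read by eye. This is a minimal sketch: job_is_running is a hypothetical helper that takes the status line as an argument, so it can be tried without Upstart installed, and it relies only on the "start/running" text shown in the output above.

```shell
# Succeed if an "initctl status" line reports the job as start/running.
# The status line is passed in as the first argument.
job_is_running() {
    case $1 in
        *" start/running"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

On the guardian machine this could be used as job_is_running "$(initctl status guardian-runsvdir)" || echo "guardian-runsvdir is not running".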