Group Experimentation Support
Experiment scheduler
Installing and configuring packages on Ubuntu
This section describes how to set up torque PBS. If it's already set up, skip this section. First, we install the torque PBS system:
apt-get install torque-server torque-scheduler torque-mom torque-client
Then stop all the running torque processes:
/etc/init.d/torque-mom stop
/etc/init.d/torque-scheduler stop
/etc/init.d/torque-server stop
Create the PBS server (say "yes" when prompted):
pbs_server -t create
Then kill the PBS server process:
killall pbs_server
Set up the PBS server:
echo $(hostname -f) > /etc/torque/server_name
echo $(hostname -f) > /var/spool/torque/server_priv/acl_svr/acl_hosts
echo $(hostname -f) > /var/spool/torque/mom_priv/config
echo root@$(hostname -f) > /var/spool/torque/server_priv/acl_svr/operators
echo root@$(hostname -f) > /var/spool/torque/server_priv/acl_svr/managers
echo "$(hostname -f) np=4" > /var/spool/torque/server_priv/nodes
If you have a line in your /etc/hosts file that resolves your hostname to 127.0.1.1, you have to comment it out, e.g.
#127.0.1.1 console.grid.orbit-lab.org console
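The manual edit above could also be scripted. The following is a minimal sketch, not part of the original procedure: the function name `comment_out_127_0_1_1` is made up for illustration, and it assumes a `sed` that accepts `-i.bak` and `-E` (as GNU sed on Ubuntu does). It leaves lines that are already commented out untouched and keeps a backup copy with a `.bak` suffix.

```shell
# Hypothetical helper: comment out every active line that maps 127.0.1.1
# in the hosts file given as the first argument.
# A backup of the original file is written alongside it with a .bak suffix.
comment_out_127_0_1_1() {
    # Match lines that start (after optional whitespace) with 127.0.1.1
    # followed by whitespace, and prefix them with '#'.
    sed -i.bak -E 's/^([[:space:]]*127\.0\.1\.1[[:space:]])/#\1/' "$1"
}

# Example (as root):
#   comment_out_127_0_1_1 /etc/hosts
```

It is worth inspecting the file afterwards, since a hosts file with unusual formatting may need a manual edit instead.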
Once you've done that, start everything back up again:
/etc/init.d/torque-server start
/etc/init.d/torque-scheduler start
/etc/init.d/torque-mom start
Now we'll set up some configuration values:
qmgr -c "set server scheduling = True"
qmgr -c "set server acl_host_enable = True"
qmgr -c "set server acl_hosts = $(hostname -f)"
qmgr -c "set server allow_node_submit = True"
You'll have to run the commands above as root, since you've set up the root user as the only PBS operator and manager.
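To confirm that the settings took effect, one option is to inspect the output of `qmgr -c "print server"`. The sketch below is an illustration only: the function name `missing_server_settings` is hypothetical, and it assumes that `print server` lists attributes as lines of the form `set server NAME = VALUE`. It reads that output on standard input and prints the name of each boolean setting from the commands above that is not set to True.

```shell
# Hypothetical helper: read `qmgr -c "print server"` output on stdin and
# print each expected boolean attribute that is NOT set to True.
# Prints nothing when all three attributes are present.
missing_server_settings() {
    output=$(cat)
    for attr in scheduling acl_host_enable allow_node_submit; do
        # Assumes lines of the form: set server <attr> = True
        if ! printf '%s\n' "$output" |
                grep -Eq "^set server ${attr} = True[[:space:]]*$"; then
            echo "$attr"
        fi
    done
}

# Example (as root):
#   qmgr -c "print server" | missing_server_settings
```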
Set up queues
Next we'll set up queues: one for each node.
We've decided to make one queue per node and have the console be the single "compute" node, instead of having the nodes themselves be the "compute" nodes, mainly because we can then still use legacy disk images and don't have to configure the nodes to work with torque. It might seem neater to use torque "nodes" instead of queues, because this would make it simpler to run an experiment with multiple nodes. In practice, though, it would still be awkward to run an experiment with multiple nodes, because in that scenario you generally care which specific nodes you get (e.g. you want nodes that are close to each other).
First we get a list of nodes, then we'll set up a queue for each one:
list=$(omf stat -t system:topo:all | grep "Node:" | awk -F" " '{print $2}' | cut -f1 -d$'.')
for l in $list
do
    qmgr -c "create queue $l"
    qmgr -c "set queue $l queue_type = Execution"
    qmgr -c "set queue $l max_running = 1"
    qmgr -c "set queue $l enabled = True"
    qmgr -c "set queue $l started = True"
done
(Again, this must be done as root or as a queue manager.)
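Before creating queues for real, it can help to preview what the loop above would do. The following is a rough dry-run sketch: the function names `node_names` and `print_queue_commands` are hypothetical, and, like the pipeline above, it assumes that each "Node:" line of the `omf stat` output carries the fully qualified node name as its second whitespace-separated field. It only prints the `qmgr` commands and does not execute them.

```shell
# Hypothetical helper: extract short node names from the output of
# `omf stat -t system:topo:all` on stdin, using the same grep/awk/cut
# steps as the loop above (second field, text before the first dot).
node_names() {
    grep "Node:" | awk '{print $2}' | cut -f1 -d'.'
}

# Hypothetical helper: print (do not run) the qmgr commands that would
# create one execution queue for each name given as an argument.
print_queue_commands() {
    for q in "$@"; do
        printf 'qmgr -c "create queue %s"\n' "$q"
        printf 'qmgr -c "set queue %s queue_type = Execution"\n' "$q"
        printf 'qmgr -c "set queue %s max_running = 1"\n' "$q"
        printf 'qmgr -c "set queue %s enabled = True"\n' "$q"
        printf 'qmgr -c "set queue %s started = True"\n' "$q"
    done
}

# Example:
#   print_queue_commands $(omf stat -t system:topo:all | node_names)
```

Reviewing the printed commands first makes it easier to catch a malformed node name before any queue is created.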
Now we'll disable all the queues, because we are going to re-enable them selectively.
list=$(qstat -Q | tail -n+3 | awk -F" " '{print $1}')
for l in $list
do
    qdisable "$l"
done
Check with
qstat -Q
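Instead of reading the table by eye, the check can be narrowed to the queues that are still enabled. This is a small sketch under one assumption: that `qstat -Q` prints two header lines followed by one row per queue, with the queue name in the first column and the enabled flag ("yes" or "no", under the "Ena" heading) in the fourth. The function name `enabled_queues` is made up for illustration.

```shell
# Hypothetical helper: read `qstat -Q` output on stdin and print the names
# of queues whose enabled flag (assumed to be column 4, "Ena") is "yes".
enabled_queues() {
    # Skip the two header lines, as in the disable loop above.
    tail -n +3 | awk '$4 == "yes" {print $1}'
}

# Example: should print nothing once every queue has been disabled.
#   qstat -Q | enabled_queues
```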