Pre-Installation

User virtualization (consistent username, UID, and GID values)

The username of the user submitting a job must be recognized on the compute host where the job runs and each user must have unique and consistent UID/GID values.
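The consistency requirement can be checked mechanically. The following is a minimal sketch, not part of Grid Engine: `collect_ids` and `find_id_mismatches` are hypothetical helper names, and the sketch assumes passwordless SSH to each host and that `getent` is available there.

```shell
# Reads lines of the form "<host> <user> <uid> <gid>" on stdin and prints
# one line per user whose UID/GID pair is not identical on every host.
find_id_mismatches() {
    awk '
        {
            pair = $3 ":" $4
            if (!($2 in first)) {
                first[$2] = pair          # first UID:GID seen for this user
            } else if (first[$2] != pair && !($2 in bad)) {
                bad[$2] = 1               # report each user only once
                print "MISMATCH " $2
            }
        }
    '
}

# Hypothetical helper: emit "<host> <user> <uid> <gid>" for one user on one
# host. Assumes passwordless SSH and getent on the remote host.
collect_ids() {
    host=$1; user=$2
    ssh "$host" getent passwd "$user" |
        awk -F: -v h="$host" '{ print h, $1, $3, $4 }'
}
```

Possible usage, with placeholder host and user names: `for h in node1 node2; do collect_ids "$h" alice; done | find_id_mismatches`. No output means the IDs agree on all hosts that were queried.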

Creating the sgeadmin user account

The sgeadmin account must be virtualized in the same way: the same username, UID, and GID on every host.

Home directories

Grid Engine runs jobs in the user's home directory. Every user therefore needs a home directory on every compute host, and it must contain all the desired dot-file configurations.

Hostnames and DNS

Grid Engine depends on consistent hostname resolution, so both forward and reverse DNS lookups must work for every host.
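A quick round-trip check could look like the sketch below. `check_forward_reverse` is a hypothetical helper name. It uses `getent hosts`, which follows the system resolver order and therefore also consults /etc/hosts, not DNS alone. It assumes you pass the fully qualified name that the reverse lookup is expected to return.

```shell
# Command used to resolve names and addresses. `getent hosts` follows the
# system resolver order (files, DNS, ...); RESOLVE can be overridden.
RESOLVE=${RESOLVE:-"getent hosts"}

# Succeeds only if <hostname> resolves to an address and that address
# resolves back to the same hostname.
check_forward_reverse() {
    name=$1
    # Forward lookup: the first field of the first answer is the address.
    addr=$($RESOLVE "$name" | awk 'NR == 1 { print $1 }')
    [ -n "$addr" ] || { echo "no forward record for $name" >&2; return 1; }
    # Reverse lookup: the second field is the canonical name for the address.
    back=$($RESOLVE "$addr" | awk 'NR == 1 { print $2 }')
    [ "$back" = "$name" ] || {
        echo "reverse lookup of $addr gave '$back', expected '$name'" >&2
        return 1
    }
}
```

Run it for the master and for every execution and submit host, for example `check_forward_reverse node1.example.com` (placeholder name).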

On all hosts, edit /etc/services
sge_qmaster     536/tcp                         # Sun Grid Engine queue master
sge_execd       537/tcp                         # Sun Grid Engine exec daemon
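To confirm that a host has both entries, something like the following sketch could be used. `missing_sge_services` is a hypothetical helper name. It only checks that each service name is present with a TCP port and does not compare port numbers between hosts.

```shell
# Prints each Grid Engine service name that is missing from the given
# services file (default /etc/services). Returns non-zero if any is missing.
missing_sge_services() {
    file=${1:-/etc/services}
    status=0
    for svc in sge_qmaster sge_execd; do
        # Match the service name at the start of a line, followed by <port>/tcp.
        if ! grep -Eq "^${svc}[[:space:]]+[0-9]+/tcp" "$file"; then
            echo "$svc"
            status=1
        fi
    done
    return $status
}
```

Calling `missing_sge_services` with no argument checks the local /etc/services and prints nothing when both entries are present.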

Creating the SGE root directory and exporting it via NFS to all cluster nodes

All compute farm members must share a common path to the SGE root, so make sure the path to the Grid Engine files is the same on the master node as on the other servers and compute elements. Use this path globally as the SGE root directory. For example:

/opt/sge

On execution hosts, NFS mount the Grid Engine directory of the master node, $SGE_ROOT.

On Submit hosts

Add the following line to the system-wide or per-user .bashrc file so that the Grid Engine environment is loaded at login.

. /opt/sge/default/common/settings.sh
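If the SGE root is NFS-mounted, it can be worth guarding the line so that logins still work when the mount is missing. A minimal sketch, where `load_sge_settings` is a hypothetical function name and the default path is the one used in this guide:

```shell
# Source the Grid Engine environment only when the settings file is readable,
# so logins do not fail on hosts where the SGE root is not (yet) mounted.
load_sge_settings() {
    settings=${1:-/opt/sge/default/common/settings.sh}
    if [ -r "$settings" ]; then
        . "$settings"
    else
        return 1
    fi
}
```

In a .bashrc this could be called as `load_sge_settings || true`, which keeps the shell startup quiet when the file is absent.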

Application and data files

The prolog and epilog script feature of Grid Engine provides a generic mechanism for implementing a site-specific stage-in/stage-out facility. Alternatively, these steps could be embedded directly into job scripts.
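As an illustration of the idea, a stage-in/stage-out pair might be sketched as below. `stage_in` and `stage_out` are hypothetical function names, and the directory layout is an assumption. How the functions are wired into a prolog, an epilog, or a job script is site-specific.

```shell
# stage_in <source_dir> <scratch_dir>: copy input files to node-local scratch.
stage_in() {
    src=$1; scratch=$2
    # "src/." copies the directory contents, including dot-files.
    mkdir -p "$scratch" && cp -Rp "$src"/. "$scratch"/
}

# stage_out <scratch_dir> <results_dir>: copy results back to shared storage.
stage_out() {
    scratch=$1; dest=$2
    mkdir -p "$dest" && cp -Rp "$scratch"/. "$dest"/
}
```

A job script could call, for example, `stage_in /shared/inputs/myjob "$TMPDIR"` before the computation and `stage_out "$TMPDIR" /shared/results/myjob` afterwards. Both /shared paths are placeholders, and this assumes the job's scratch directory is available as $TMPDIR.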

Shared filesystem options

If you plan to install into a shared NFS filesystem, make sure the server is not exporting the filesystem with options that block the root user or remap the root UID to a non-privileged value. Grid Engine can run as a non-root user, but it needs to be started by root. There are also setuid binaries in the distribution that will break if root squashing is enabled.
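On a Linux NFS server, root squashing is the default, so the export has to carry no_root_squash explicitly. The sketch below checks an exports(5)-style file for that option. `export_allows_root` is a hypothetical helper name, and other NFS servers are configured differently.

```shell
# Succeeds if the export entry for <path> in an exports(5)-style file
# carries the no_root_squash option (default file: /etc/exports).
export_allows_root() {
    path=$1
    file=${2:-/etc/exports}
    awk -v p="$path" '
        # The first field is the exported path; options sit in parentheses.
        $1 == p && /[(,]no_root_squash[,)]/ { found = 1 }
        END { exit found ? 0 : 1 }
    ' "$file"
}
```

On the NFS server, `export_allows_root /opt/sge` would then succeed only if the SGE root is exported with root squashing disabled.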

Classic Spooling vs. Berkeley-DB Spooling

If you are just starting out with Grid Engine, use classic spooling. If your cluster has fewer than 20 nodes, use classic spooling. Once the system has been up and running for a while, you will easily be able to tell whether your usual workloads and workflows are being affected by spool performance. By that time, you will be comfortable enough with Grid Engine to back up your configuration and reinstall with Berkeley DB spooling enabled.

The automatic install scripts are not worth dealing with on small clusters

For clusters smaller than 30 nodes in size (where I already have passwordless SSH access set up) it is actually quicker to manually log into each node and invoke the "./install_execd" script by hand.

Qmaster Installation

Unpacking and initial setup

[root@host ~]# SGE_ROOT=/opt/sge; export SGE_ROOT
[root@host ~]# cd ${SGE_ROOT}
[root@host ~]# gzip -dc sge-6.0u8-common.tar.gz | tar xvpf -
[root@host ~]# gzip -dc sge-6.0u8-bin-lx24-x86.tar.gz | tar xvpf -
[root@host ~]# gzip -dc sge-6.0u8-bin-lx24-amd64.tar.gz | tar xvpf -
[root@host ~]# util/setfileperm.sh $SGE_ROOT

Create a db spool dir and start the installation on the master host

[root@host ~]# export SGE_ROOT=/opt/sge
[root@host ~]# mkdir -p /var/spool/sge
[root@host ~]# chown -R sgeadmin /var/spool/sge
[root@host ~]# cd $SGE_ROOT
[root@host ~]# ./install_qmaster

Accept defaults except
  • User name to install as sgeadmin
  • Grid Engine group id range of 20000-20200
  • <administrator_mail> set to sgeadmin@example.com
  • Adding admin and submit hosts set to server1 server2 server3
  • Do you want to add your shadow host(s) now? (y/n) [y] >> n

Execution Host Installation

Add execution hosts as administrative hosts

All execution hosts must be administrative hosts during their installation. You may verify your administrative hosts with the command

[root@host ~]# qconf -sh

and you may add new administrative hosts on the master host with the command

[root@host ~]# qconf -ah <hostname>
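With more than a handful of hosts, the two commands can be combined in a loop. A minimal sketch: `add_admin_hosts` is a hypothetical function name, and it assumes that the names you pass are spelled exactly as `qconf -sh` prints them.

```shell
# Register each named host as an administrative host unless `qconf -sh`
# already lists it. Run on the master host.
add_admin_hosts() {
    current=$(qconf -sh)
    for host in "$@"; do
        # -F: fixed string, -x: whole-line match, -q: no output.
        if printf '%s\n' "$current" | grep -Fxq "$host"; then
            echo "$host is already an administrative host"
        else
            qconf -ah "$host"
        fi
    done
}
```

For example, `add_admin_hosts node1 node2 node3` (placeholder host names).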

Create spooling directories on each execution host:

[root@host ~]# mkdir -p /var/spool/sge
[root@host ~]# chown sgeadmin /var/spool/sge

Run the installer script in auto-install mode

The install_execd script accepts options that install the exec daemon with default settings, without interactive input, and optionally without creating the default queue.

[root@host ~]# export SGE_ROOT=/opt/sge
[root@host ~]# cd ${SGE_ROOT}
[root@host ~]# ./install_execd -auto -fast [-noqueue]

Run the installer script in interactive mode

[root@host ~]# export SGE_ROOT=/opt/sge
[root@host ~]# cd ${SGE_ROOT}
[root@host ~]# ./install_execd

Accept defaults except
  1. Do you want to configure a local spool directory for this host (y/n) [n] >> y
  2. Enter path /var/spool/sge

When the install script is done, Grid Engine should be installed and running. Run

[root@host ~]# qstat -f

and you should see an entry for all.q@hostname. If so, everything is set up.
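The same check can be scripted when many hosts are installed in one go. A minimal sketch: `has_queue_instance` is a hypothetical helper name. It assumes the queue instance appears as the first field of a `qstat -f` line and that the host name is spelled as `qstat` prints it.

```shell
# Reads `qstat -f` output on stdin and succeeds if some line starts with
# the queue instance all.q@<hostname>.
has_queue_instance() {
    awk -v q="all.q@$1" '
        $1 == q { found = 1 }
        END { exit found ? 0 : 1 }
    '
}
```

For example: `qstat -f | has_queue_instance node1 && echo "node1 is up"` (placeholder host name).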

Troubleshooting

Reinstallation

BEFORE you reinstall the server for any reason, you MUST stop the execution host daemons. After the reinstall, you must also reinstall the execution hosts.

Grid Engine messages

Grid Engine messages can be found at:

/tmp/qmaster_messages (during qmaster startup)
/tmp/execd_messages (during execution daemon startup)

After startup the daemons log their messages in their spool directories.

Qmaster: /var/spool/qmaster/messages
Exec daemon: <execd_spool_dir>/<hostname>/messages

Queue error states

If a queue enters an error state, the queue must be reset before further jobs will be scheduled on that queue. To reset a queue, become sgeadmin on the qmaster and run the command

[root@host ~]# qmod -cq <queuename>

For NFS-mounted spool dirs, ensure a spool dir exists and permissions are set

[root@host ~]# mkdir <SGE_CELL>/spool/<HOSTNAME>
[root@host ~]# chown sgeadmin:root <SGE_CELL>/spool/<HOSTNAME>/

Resources

  1. aims to enable Java developers to easily and efficiently use the Sun Grid Compute Utility as a platform for the distributed execution of parallel computations.