Torque
Torque is a resource manager: it watches over the nodes and controls which jobs run on which compute nodes, GPUs, and other resources in the batch queue. Adaptive Computing is the main developer of this open source product, and the download for Torque can be found here.
I usually prefer to stick to the releases; however, at this time there is a very annoying bug that is only fixed in the github version. Both building from github and building from the release are covered, but as of 20130518 there are still issues with the release build where the github build works. Fingers crossed that the 4.2.3 release will build and install without problems.
sudo yum groupinstall "development tools"
sudo yum install git imake libtool libxml2-devel openssl-devel
sudo yum install pinentry-gui
If you have a GUI, you will need this package.
mkdir ~/Code
It is good to sign your RPMs so that the servers can trust the RPMs from your repository safely. To sign your RPMs, you need to make a GPG key as the dev user authorized to do so. You should not ever compile as root. Bad things are bound to happen.
gpg --gen-key
- Select default type of key: 1
- Set the key size: 4096
- Key is valid for the life of the cluster: 4y
- Is this correct?: y
- Real Name: Users Fullname
- Email Address: [email protected]
- Comment: Not Needed
- Is this OK?: O
- It will ask for a new password to be created.
- This next part may take time as it generates the keys.
Now find your gpg fingerprint
gpg --fingerprint
The number needed is the key ID on the pub line. The field right after "pub" has the form keysize/keyID; take the part after the slash. This is YOUR_FINGERPRINT_ID in the following steps. The command below prints it directly.
gpg --fingerprint | awk '/^pub/{print $2}' | awk -F'/' '{print $2}'
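If the key ID is needed in a script, the same extraction can be wrapped in a small function. This is a minimal sketch: `key_id_from_listing` is a hypothetical helper name, and it assumes the older GnuPG listing format in which the pub line reads like `pub   4096R/1A2B3C4D 2013-05-18` (the sample ID is made up). Other GnuPG versions may print a different format. Unlike the one-liner, it stops at the first pub line, so a keyring holding several keys still yields a single ID.

```shell
# Hypothetical helper: read a "gpg --fingerprint" listing on stdin and
# print the key ID found on the first "pub" line.
key_id_from_listing() {
    # $2 is the "size/keyID" field; split it on "/" and keep the second part.
    awk '/^pub/ { split($2, parts, "/"); print parts[2]; exit }'
}
```

Typical use would be `YOUR_FINGERPRINT_ID=$(gpg --fingerprint | key_id_from_listing)`.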
Export the public key for that key ID.
gpg -a -o ~/RPM-GPG-KEY-your_user_name_or_project_name --export $YOUR_FINGERPRINT_ID
Add the fingerprint to your RPM macro file.
echo "%_gpg_name $YOUR_FINGERPRINT_ID" >> ~/.rpmmacros
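Because `>>` appends, running the echo twice leaves two %_gpg_name lines in the file. One way to keep the entry unique is sketched below. `set_gpg_name` is a hypothetical helper, not part of rpm, and its optional second argument exists only so the function can be tried on a scratch file.

```shell
# Hypothetical helper: make sure the macros file holds exactly one
# %_gpg_name line, set to the given key ID.
set_gpg_name() {
    key_id=$1
    macros=${2:-"$HOME/.rpmmacros"}
    touch "$macros"
    tmp=$(mktemp) || return 1
    # Copy every line except an earlier %_gpg_name entry.
    # grep exits non-zero when it prints nothing, hence the "|| true".
    grep -v '^%_gpg_name[[:space:]]' "$macros" > "$tmp" || true
    # Append the new entry; "%%" prints a literal percent sign.
    printf '%%_gpg_name %s\n' "$key_id" >> "$tmp"
    # Write back through the original file so its permissions are kept.
    cat "$tmp" > "$macros"
    rm -f "$tmp"
}
```

It would be called as `set_gpg_name "$YOUR_FINGERPRINT_ID"`.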
Copy the key to the HTTP webserver that was created in the earlier CreateRepo step.
scp ~/RPM-GPG-KEY-your_user_name_or_project_name [email protected]:/var/www/html/Cluster_Repo/.
cd ~/Code
git clone https://github.com/adaptivecomputing/torque.git
cd torque
./autogen.sh
./configure
touch README.torque
# The makefile needs this file which doesn't exist; I already reported this bug a few months ago and as of this moment it still hasn't been fixed...
make rpm
or
make srpm
If you want to build a signed RPM, make an SRPM, then rebuild and sign it.
rpmbuild --rebuild --sign ~/rpmbuild/SRPMS/torque-4.2.3-1.adaptive.el6.src.rpm
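The file name in that command hard-codes the version and release. A sketch that instead picks up whichever torque SRPM was built most recently, assuming the default ~/rpmbuild/SRPMS output directory (`latest_torque_srpm` is a hypothetical helper):

```shell
# Hypothetical helper: print the path of the most recently modified
# torque source RPM in the given directory (default: ~/rpmbuild/SRPMS).
latest_torque_srpm() {
    srpm_dir=${1:-"$HOME/rpmbuild/SRPMS"}
    # ls -t sorts newest first; head keeps only the most recent match.
    # Prints nothing if no torque SRPM is present.
    ls -t "$srpm_dir"/torque-*.src.rpm 2>/dev/null | head -n 1
}
```

The rebuild would then be `rpmbuild --rebuild --sign "$(latest_torque_srpm)"`.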
Find the latest release of Torque. The link is given at the top of this section. At this time the latest version is 4.2.2.
To build the rpm run:
rpmbuild -tb torque-4.2.2.tar.gz
To build the signed rpm run:
rpmbuild -tb --sign torque-4.2.2.tar.gz
To build the SRPM run:
rpmbuild -ts torque-4.2.2.tar.gz
On the webserver http.cluster.domain, copy the rpms and build the repo.
cd /var/www/html/Cluster_Repo
scp -r [email protected]:rpmbuild/RPMS/x86_64 .
createrepo .
Now that it is in the repository, install Torque server and devel packages. See the CreateRepo section for details on the yum repo file.
Download the cluster.repo file into /etc/yum.repos.d, where yum looks for repo files.
sudo wget -P /etc/yum.repos.d http://http.cluster.domain/repo_files/cluster.repo
Verify that the repo file works and torque is coming from your repository.
sudo yum clean all && sudo yum update && sudo yum info torque
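Checking by eye which repository torque comes from is easy to skip, so the check can be scripted. This is a sketch under assumptions: it expects `yum info` to print a line of the form `Repo        : <id>` for an available package, and the repo id `cluster` in the usage line is a guess at what cluster.repo defines. `repo_from_yum_info` is a hypothetical helper name.

```shell
# Hypothetical helper: read "yum info <package>" output on stdin and
# print the value of the first "Repo" field.
repo_from_yum_info() {
    # Fields are "Name : value"; match the field named exactly "Repo"
    # so that "From repo" lines are not picked up.
    awk -F' *: *' '$1 == "Repo" { print $2; exit }'
}
```

For example: `[ "$(yum info torque | repo_from_yum_info)" = "cluster" ] || echo "torque is not coming from the cluster repo"`.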
Install the torque packages. Verify that the packages are installing from your repository!
sudo yum install torque-devel torque-server
Create the checkpoint directory.
$ sudo mkdir /var/spool/torque/checkpoint
Edit the /var/spool/torque/server_priv/nodes file to tell it what nodes have resources for use. For now, configure just how many processors each node has.
$ sudo vim /var/spool/torque/server_priv/nodes
node01 np=2
node02 np=2
node03 np=2
node04 np=2
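On a larger cluster, typing these lines by hand gets tedious. A minimal sketch that generates them, assuming every node has the same processor count and follows the nodeNN naming used above (`gen_nodes` is a hypothetical helper):

```shell
# Hypothetical helper: print "nodeNN np=P" lines for nodes 1..COUNT.
# Usage: gen_nodes COUNT NP
gen_nodes() {
    count=$1
    np=$2
    i=1
    while [ "$i" -le "$count" ]; do
        # %02d zero-pads the node number to two digits (node01, node02, ...).
        printf 'node%02d np=%s\n' "$i" "$np"
        i=$((i + 1))
    done
}
```

The output could be written in place of editing by hand with `gen_nodes 4 2 | sudo tee /var/spool/torque/server_priv/nodes`.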
Set the server hostname in the file /var/spool/torque/server_name.
$ sudo sh -c 'echo frontend01.cluster.domain > /var/spool/torque/server_name'
Restart the Torque service.
$ sudo service pbs_server restart
Verify that the Torque service is running.
$ qmgr -c 'p s'
Common concerns.
- When you start Torque for the first time you will see a message saying "Creating inital TORQUE configuration". It is not uncommon for this to take an absurd amount of time even though your CPU is idle and seemingly doing nothing. I don't know why.
- Frequently, the database creation goes screwy on the first run. If qmgr is really slow or hangs, rerun the create step and start over by answering yes to the prompt from the following command:
$ sudo /usr/sbin/pbs_server -t create
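The hang described above can be detected from a script by putting a time limit on the command. A minimal sketch using `timeout` from GNU coreutils; `responds_within` is a hypothetical wrapper name, and the 30-second limit in the usage line is an arbitrary choice.

```shell
# Hypothetical helper: run a command with a time limit and discard its output.
# Returns 0 only if the command finishes successfully within the limit.
# Usage: responds_within SECONDS COMMAND [ARGS...]
responds_within() {
    limit=$1
    shift
    # timeout stops the command and exits with status 124 once the limit passes.
    timeout "$limit" "$@" > /dev/null 2>&1
}
```

For example: `responds_within 30 qmgr -c 'p s' || echo "qmgr did not answer in time or failed"`.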