<?xml version='1.0' encoding="UTF-8"?>

<!-- $Header: /var/cvsroot/gentoo/xml/htdocs/doc/en/hpc-howto.xml,v 1.3 2005/05/13 20:15:50 neysx Exp $ -->

<!DOCTYPE guide SYSTEM "/dtd/guide.dtd">
<guide link="/doc/en/hpc-howto.xml">

<title>High Performance Computing on Gentoo Linux</title>

<author title="Author">
  <mail link="marc@adelielinux.com">Marc St-Pierre</mail>
</author>
<author title="Author">
  <mail link="benoit@adelielinux.com">Benoit Morin</mail>
</author>
<author title="Assistant/Research">
  <mail link="jean-francois@adelielinux.com">Jean-Francois Richard</mail>
</author>
<author title="Assistant/Research">
  <mail link="olivier@adelielinux.com">Olivier Crete</mail>
</author>
<author title="Reviewer">
  <mail link="spyderous@gentoo.org">Donnie Berkholz</mail>
</author>

<!-- No licensing information; this document has been written by a third-party
     organisation without additional licensing information.

     In other words, this is copyright adelielinux R&D; Gentoo only has
     permission to distribute this document as-is and update it when appropriate
     as long as the adelie linux R&D notice stays
-->

<abstract>
This document was written by people at the Adelie Linux R&amp;D Center
&lt;http://www.adelielinux.com&gt; as a step-by-step guide to turn a Gentoo
system into a High Performance Computing (HPC) system.
</abstract>

<version>1.1</version>
<date>2003-08-01</date>

<chapter>
<title>Introduction</title>
<section>
<body>

<p>
Gentoo Linux is a special flavor of Linux that can be automatically optimized
and customized for just about any application or need. Extreme performance,
configurability and a top-notch user and developer community are all hallmarks
of the Gentoo experience.
</p>

<p>
Thanks to a technology called Portage, Gentoo Linux can become an ideal secure
server, development workstation, professional desktop, gaming system, embedded
solution or... a High Performance Computing system. Because of its
near-unlimited adaptability, we call Gentoo Linux a metadistribution.
</p>

<p>
This document explains how to turn a Gentoo system into a High Performance
Computing system. Step by step, it explains what packages one may want to
install and helps configure them.
</p>

<p>
Obtain Gentoo Linux from the website <uri>http://www.gentoo.org</uri>, and
refer to the <uri link="/doc/en/">documentation</uri> at the same location to
install it.
</p>

</body>
</section>
</chapter>

<chapter>
<title>Configuring Gentoo Linux for Clustering</title>
<section>
<title>Recommended Optimizations</title>
<body>

<note>
We refer to the <uri link="/doc/en/handbook/">Gentoo Linux Handbooks</uri> in
this section.
</note>

<p>
During the installation process, you will have to set your USE flags in
<path>/etc/make.conf</path>. We recommend that you deactivate all the
defaults (see <path>/etc/make.profile/make.defaults</path>) by negating them
in <path>make.conf</path>. However, you may want to keep such USE flags as
x86, 3dnow, gpm, mmx, sse, ncurses, pam and tcpd. Refer to the USE
documentation for more information.
</p>

<pre caption="USE Flags">
USE="-oss 3dnow -apm -arts -avi -berkdb -crypt -cups -encode -gdbm
-gif gpm -gtk -imlib -java -jpeg -kde -gnome -libg++ -libwww -mikmod
mmx -motif -mpeg ncurses -nls -oggvorbis -opengl pam -pdflib -png
-python -qt -qtmt -quicktime -readline -sdl -slang -spell -ssl
-svga tcpd -truetype -X -xml2 -xmms -xv -zlib"
</pre>

<p>
Or simply:
</p>

<pre caption="USE Flags - simplified version">
USE="-* 3dnow gpm mmx ncurses pam sse tcpd"
</pre>

<note>
The <e>tcpd</e> USE flag increases security for packages such as xinetd.
</note>

<p>
In step 15 ("Installing the kernel and a System Logger"), we recommend the
vanilla-sources for stability reasons. These are the official kernel sources
released on <uri>http://www.kernel.org/</uri>. Use another kernel only if you
require special support such as XFS.
</p>

<pre caption="Installing vanilla-sources">
# <i>emerge -p syslog-ng vanilla-sources</i>
# <i>emerge syslog-ng vanilla-sources</i>
</pre>

<p>
When you install miscellaneous packages, we recommend installing the
following:
</p>

<pre caption="Installing necessary packages">
# <i>emerge -p nfs-utils portmap tcpdump ssmtp iptables xinetd</i>
# <i>emerge nfs-utils portmap tcpdump ssmtp iptables xinetd</i>
</pre>

</body>
</section>
<section>
<title>Communication Layer (TCP/IP Network)</title>
<body>

<p>
A cluster requires a communication layer to interconnect the slave nodes to
the master node. Typically, a Fast Ethernet or Gigabit Ethernet LAN can be
used since they have a good price/performance ratio. Other possibilities
include use of products like <uri link="http://www.myricom.com/">Myrinet</uri>,
<uri link="http://quadrics.com/">QsNet</uri> or others.
</p>

<p>
A cluster is composed of two node types: master and slave. Typically, your
cluster will have one master node and several slave nodes.
</p>

<p>
The master node is the cluster's server. It is responsible for telling the
slave nodes what to do. This server will typically run such daemons as dhcpd,
nfs, pbs-server, and pbs-sched. Your master node will allow interactive
sessions for users, and accept job executions.
</p>

<p>
The slave nodes listen for instructions (via ssh/rsh perhaps) from the master
node. They should be dedicated to crunching results and therefore should not
run any unnecessary services.
</p>

<p>
The rest of this documentation will assume a cluster configuration as per the
hosts file below. You should maintain such a hosts file
(<path>/etc/hosts</path>) on every node, with entries for each node
participating in the cluster.
</p>

<pre caption="/etc/hosts">
# Adelie Linux Research &amp; Development Center
# /etc/hosts

127.0.0.1       localhost

192.168.1.100   master.adelie master

192.168.1.1     node01.adelie node01
192.168.1.2     node02.adelie node02
</pre>

<p>
To set up your cluster's dedicated LAN, edit your
<path>/etc/conf.d/net</path> file on the master node.
</p>

<pre caption="/etc/conf.d/net">
# Copyright 1999-2002 Gentoo Technologies, Inc.
# Distributed under the terms of the GNU General Public License, v2 or later

# Global config file for net.* rc-scripts

# This is basically the ifconfig argument without the ifconfig $iface
#

iface_eth0="192.168.1.100 broadcast 192.168.1.255 netmask 255.255.255.0"
# Network Connection to the outside world using dhcp -- configure as required for your network
iface_eth1="dhcp"
</pre>

<p>
Finally, set up a DHCP daemon on the master node to avoid having to maintain a
network configuration on each slave node.
</p>

<pre caption="/etc/dhcp/dhcpd.conf">
# Adelie Linux Research &amp; Development Center
# /etc/dhcp/dhcpd.conf

log-facility local7;
ddns-update-style none;
use-host-decl-names on;

subnet 192.168.1.0 netmask 255.255.255.0 {
        option domain-name "adelie";
        range 192.168.1.10 192.168.1.99;
        option routers 192.168.1.100;

        host node01.adelie {
                # MAC address of network card on node 01
                hardware ethernet 00:07:e9:0f:e2:d4;
                fixed-address 192.168.1.1;
        }
        host node02.adelie {
                # MAC address of network card on node 02
                hardware ethernet 00:07:e9:0f:e2:6b;
                fixed-address 192.168.1.2;
        }
}
</pre>
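
<p>
With the DHCP daemon handing out fixed addresses, each slave node only needs
to request its configuration at boot. A minimal sketch of a slave's
<path>/etc/conf.d/net</path> follows. It assumes that the cluster LAN is
attached to eth0 on the node, which this guide does not state; adjust the
interface name to your hardware.
</p>

<pre caption="/etc/conf.d/net on a slave node (illustrative)">
# Assumption: eth0 is the interface connected to the cluster LAN
iface_eth0="dhcp"
</pre>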

</body>
</section>
<section>
<title>NFS/NIS</title>
<body>

<p>
The Network File System (NFS) was developed to allow machines to mount a disk
partition on a remote machine as if it were on a local hard drive. This allows
for fast, seamless sharing of files across a network.
</p>

<p>
There are other systems that provide similar functionality to NFS which could
be used in a cluster environment. The <uri
link="http://www.transarc.com/Product/EFS/AFS/index.html">Andrew File System
from IBM</uri>, recently open-sourced, provides a file sharing mechanism with
some additional security and performance features. The <uri
link="http://www.coda.cs.cmu.edu/">Coda File System</uri> is still in
development, but is designed to work well with disconnected clients. Many
of the features of the Andrew and Coda file systems are slated for inclusion
in the next version of <uri link="http://www.nfsv4.org">NFS (Version 4)</uri>.
The advantage of NFS today is that it is mature, standard, well understood,
and supported robustly across a variety of platforms.
</p>

<pre caption="Ebuilds for NFS-support">
# <i>emerge -p nfs-utils portmap</i>
# <i>emerge nfs-utils portmap</i>
</pre>

<p>
Configure and install a kernel to support NFS v3 on all nodes:
</p>

<pre caption="Required Kernel Configurations for NFS">
CONFIG_NFS_FS=y
CONFIG_NFSD=y
CONFIG_SUNRPC=y
CONFIG_LOCKD=y
CONFIG_NFSD_V3=y
CONFIG_LOCKD_V4=y
</pre>
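
<p>
The options listed cover NFS v3 on the server side
(<c>CONFIG_NFSD_V3</c>). The slave nodes act as NFS clients, and on kernels
where client-side NFS v3 support is a separate option they would probably need
it as well. The line below is a sketch under that assumption and is not part
of the original list; check the kernel configuration help of your kernel
version.
</p>

<pre caption="Client-side NFS v3 support (illustrative)">
CONFIG_NFS_V3=y
</pre>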

<p>
On the master node, edit your <path>/etc/hosts.allow</path> file to allow
connections from slave nodes. If your cluster LAN is on 192.168.1.0/24,
your <path>hosts.allow</path> will look like:
</p>

<pre caption="hosts.allow">
portmap:192.168.1.0/255.255.255.0
</pre>

<p>
Edit the <path>/etc/exports</path> file of the master node to export a work
directory structure (<path>/home</path> is good for this).
</p>

<pre caption="/etc/exports">
/home/ *(rw)
</pre>
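
<p>
A wildcard export lets any host that can reach the master mount
<path>/home</path>. If you prefer to limit the export to the cluster LAN, an
entry along the following lines may be used instead. The network matches the
192.168.1.0/24 example used in this guide, and the <c>sync</c> option is an
assumption on our part rather than something the original setup requires.
After changing the file, <c>exportfs -ra</c> re-exports the listed
directories.
</p>

<pre caption="/etc/exports restricted to the cluster LAN (illustrative)">
/home/ 192.168.1.0/255.255.255.0(rw,sync)
</pre>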

<p>
Add nfs to your master node's default runlevel:
</p>

<pre caption="Adding NFS to the default runlevel">
# <i>rc-update add nfs default</i>
</pre>

<p>
To mount the NFS-exported filesystem from the master, you also have to
configure your slave nodes' <path>/etc/fstab</path>. Add a line like this
one:
</p>

<pre caption="/etc/fstab">
master:/home/ /home nfs rw,exec,noauto,nouser,async 0 0
</pre>

<p>
You'll also need to set up your nodes so that they mount the NFS filesystem by
issuing this command:
</p>

<pre caption="Adding nfsmount to the default runlevel">
# <i>rc-update add nfsmount default</i>
</pre>
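
<p>
To check the setup from a slave node, commands like the ones below can be
used. This is a sketch that assumes the master is reachable under the name
<e>master</e>, as in the hosts file of this guide. Note that an fstab entry
marked <c>noauto</c> is skipped by <c>mount -a</c>; if <path>/home</path> is
not mounted after boot, mount it explicitly as shown or drop that option.
</p>

<pre caption="Checking the NFS export from a slave node (illustrative)">
<comment>(List the directories exported by the master)</comment>
# <i>showmount -e master</i>
<comment>(Mount /home using the options from /etc/fstab)</comment>
# <i>mount /home</i>
<comment>(List the NFS filesystems that are currently mounted)</comment>
# <i>mount -t nfs</i>
</pre>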

</body>
</section>
<section>
<title>RSH/SSH</title>
<body>

<p>
SSH is a protocol for secure remote login and other secure network services
over an insecure network. OpenSSH uses public key cryptography to provide
secure authentication. The first step in configuring OpenSSH on the cluster is
to generate the public key, which is shared with remote systems, and the
private key, which is kept on the local system.
</p>

<p>
For transparent cluster usage, private/public keys may be used. This process
has two steps:
</p>

<ul>
  <li>Generate public and private keys</li>
  <li>Copy public key to slave nodes</li>
</ul>

<p>
For user-based authentication, generate and copy as follows:
</p>

<pre caption="SSH key authentication">
# <i>ssh-keygen -t dsa</i>
Generating public/private dsa key pair.
Enter file in which to save the key (/root/.ssh/id_dsa): /root/.ssh/id_dsa
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_dsa.
Your public key has been saved in /root/.ssh/id_dsa.pub.
The key fingerprint is:
f1:45:15:40:fd:3c:2d:f7:9f:ea:55:df:76:2f:a4:1f root@master

<comment>WARNING! If you already have an "authorized_keys" file,
please append to it, do not use the following command.</comment>

# <i>scp /root/.ssh/id_dsa.pub node01:/root/.ssh/authorized_keys</i>
root@node01's password:
id_dsa.pub   100%  234     2.0MB/s   00:00

# <i>scp /root/.ssh/id_dsa.pub node02:/root/.ssh/authorized_keys</i>
root@node02's password:
id_dsa.pub   100%  234     2.0MB/s   00:00
</pre>
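
<p>
If a node already has an <path>authorized_keys</path> file, the key can be
appended instead of copied over it. One common way to do this, shown here as
a sketch for node01 and assuming that <path>/root/.ssh</path> already exists
on the node, is to pipe the public key through <c>ssh</c>:
</p>

<pre caption="Appending the public key to an existing authorized_keys (illustrative)">
# <i>cat /root/.ssh/id_dsa.pub | ssh node01 'cat &gt;&gt; /root/.ssh/authorized_keys'</i>
</pre>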

<note>
Host keys must have an empty passphrase. RSA is required for host-based
authentication.
</note>

<p>
For host-based authentication, you will also need to edit your
<path>/etc/ssh/shosts.equiv</path>.
</p>

<pre caption="/etc/ssh/shosts.equiv">
node01.adelie
node02.adelie
master.adelie
</pre>

<p>
You will also need to make a few modifications to the
<path>/etc/ssh/sshd_config</path> file:
</p>

<pre caption="sshd configurations">
# $OpenBSD: sshd_config,v 1.42 2001/09/20 20:57:51 mouring Exp $
# This sshd was compiled with PATH=/usr/bin:/bin:/usr/sbin:/sbin

# This is the sshd server system-wide configuration file. See sshd(8)
# for more information.

# HostKeys for protocol version 2
HostKey /etc/ssh/ssh_host_rsa_key
</pre>
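
<p>
The sshd_config excerpt in this guide only shows the host key. Host-based
authentication itself is normally switched on with the
<c>HostbasedAuthentication</c> option, on the server side and on the client
side. The fragment below is a sketch based on standard OpenSSH options, not a
copy of the authors' files; check sshd_config(5) and ssh_config(5) for your
OpenSSH version. The server also needs the public host keys of the connecting
machines, typically in <path>/etc/ssh/ssh_known_hosts</path>.
</p>

<pre caption="Enabling host-based authentication (illustrative)">
<comment>(In /etc/ssh/sshd_config on every machine that accepts connections)</comment>
HostbasedAuthentication yes

<comment>(In /etc/ssh/ssh_config on every machine that initiates connections)</comment>
HostbasedAuthentication yes
EnableSSHKeysign yes
</pre>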

<p>
If your application requires RSH communications, you will need to emerge
net-misc/netkit-rsh and sys-apps/xinetd.
</p>

<pre caption="Installing necessary applications">
# <i>emerge -p xinetd</i>
# <i>emerge xinetd</i>
# <i>emerge -p netkit-rsh</i>
# <i>emerge netkit-rsh</i>
</pre>

<p>
Then configure the rsh daemon. Edit your <path>/etc/xinetd.d/rsh</path> file.
</p>

<pre caption="rsh">
# Adelie Linux Research &amp; Development Center
# /etc/xinetd.d/rsh

service shell
{
        socket_type     = stream
        protocol        = tcp
        wait            = no
        user            = root
        group           = tty
        server          = /usr/sbin/in.rshd
        log_type        = FILE /var/log/rsh
        log_on_success  = PID HOST USERID EXIT DURATION
        log_on_failure  = USERID ATTEMPT
        disable         = no
}
</pre>

<p>
Edit your <path>/etc/hosts.allow</path> to permit rsh connections:
</p>

<pre caption="hosts.allow">
# Adelie Linux Research &amp; Development Center
# /etc/hosts.allow

in.rshd:192.168.1.0/255.255.255.0
</pre>

<p>
Or you can simply trust your cluster LAN:
</p>

<pre caption="hosts.allow">
# Adelie Linux Research &amp; Development Center
# /etc/hosts.allow

ALL:192.168.1.0/255.255.255.0
</pre>

<p>
Finally, configure host authentication in <path>/etc/hosts.equiv</path>.
</p>

<pre caption="hosts.equiv">
# Adelie Linux Research &amp; Development Center
# /etc/hosts.equiv

master
node01
node02
</pre>

<p>
Then add xinetd to your default runlevel:
</p>

<pre caption="Adding xinetd to the default runlevel">
# <i>rc-update add xinetd default</i>
</pre>

</body>
</section>
<section>
<title>NTP</title>
<body>

<p>
The Network Time Protocol (NTP) is used to synchronize the time of a computer
client or server to another server or reference time source, such as a radio
or satellite receiver or modem. It provides accuracies typically within a
millisecond on LANs and up to a few tens of milliseconds on WANs relative to
Coordinated Universal Time (UTC) via a Global Positioning System (GPS)
receiver, for example. Typical NTP configurations utilize multiple redundant
servers and diverse network paths in order to achieve high accuracy and
reliability.
</p>

<p>
Select an NTP server geographically close to you from <uri
link="http://www.eecis.udel.edu/~mills/ntp/servers.html">Public NTP Time
Servers</uri>, and configure your <path>/etc/conf.d/ntp</path> and
<path>/etc/ntp.conf</path> files on the master node.
</p>

<pre caption="Master /etc/conf.d/ntp">
# Copyright 1999-2002 Gentoo Technologies, Inc.
# Distributed under the terms of the GNU General Public License v2
# /etc/conf.d/ntpd

# NOTES:
#  - NTPDATE variables below are used if you wish to set your
#    clock when you start the ntp init.d script
#  - make sure that the NTPDATE_CMD will close by itself ...
#    the init.d script will not attempt to kill/stop it
#  - ntpd will be used to maintain synchronization with a time
#    server regardless of what NTPDATE is set to
#  - read each of the comments above each of the variable

# Comment this out if you don't want the init script to warn
# about not having ntpdate setup
NTPDATE_WARN="n"

# Command to run to set the clock initially
# Most people should just uncomment this line ...
# however, if you know what you're doing, and you
# want to use ntpd to set the clock, change this to 'ntpd'
NTPDATE_CMD="ntpdate"

# Options to pass to the above command
# Most people should just uncomment this variable and
# change 'someserver' to a valid hostname which you
# can acquire from the URLs below
NTPDATE_OPTS="-b ntp1.cmc.ec.gc.ca"

##
# A list of available servers is available here:
# http://www.eecis.udel.edu/~mills/ntp/servers.html
# Please follow the rules of engagement and use a
# Stratum 2 server (unless you qualify for Stratum 1)
##

# Options to pass to the ntpd process that will *always* be run
# Most people should not uncomment this line ...
# however, if you know what you're doing, feel free to tweak
#NTPD_OPTS=""

</pre>

<p>
Edit your <path>/etc/ntp.conf</path> file on the master to set up an external
synchronization source:
</p>

<pre caption="Master ntp.conf">
# Adelie Linux Research &amp; Development Center
# /etc/ntp.conf

# Synchronization source #1
server ntp1.cmc.ec.gc.ca
restrict ntp1.cmc.ec.gc.ca
# Synchronization source #2
server ntp2.cmc.ec.gc.ca
restrict ntp2.cmc.ec.gc.ca
stratum 10
driftfile /etc/ntp.drift.server
logfile /var/log/ntp
broadcast 192.168.1.255
restrict default kod
restrict 127.0.0.1
restrict 192.168.1.0 mask 255.255.255.0
</pre>

<p>
On all your slave nodes, set up the master node as the synchronization
source.
</p>

<pre caption="Node /etc/conf.d/ntp">
# Copyright 1999-2002 Gentoo Technologies, Inc.
# Distributed under the terms of the GNU General Public License v2
# /etc/conf.d/ntpd

NTPDATE_WARN="n"
NTPDATE_CMD="ntpdate"
NTPDATE_OPTS="-b master"
</pre>

<pre caption="Node ntp.conf">
# Adelie Linux Research &amp; Development Center
# /etc/ntp.conf

# Synchronization source #1
server master
restrict master
stratum 11
driftfile /etc/ntp.drift.server
logfile /var/log/ntp
restrict default kod
restrict 127.0.0.1
</pre>

<p>
Then add ntpd to the default runlevel of all your nodes:
</p>

<pre caption="Adding ntpd to the default runlevel">
# <i>rc-update add ntpd default</i>
</pre>

<note>
NTP will not update the local clock if the time difference between your
synchronization source and the local clock is too great.
</note>
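
<p>
Once ntpd is running, you can check which time sources a machine is using
with the <c>ntpq</c> utility that ships with the NTP distribution. The
listing below is a sketch for a slave node; it assumes the init script is
installed as <path>/etc/init.d/ntpd</path>, matching the runlevel entry used
in this guide. On the master, replace <e>master</e> with your external time
server.
</p>

<pre caption="Checking NTP synchronization (illustrative)">
<comment>(List the configured peers and their synchronization state)</comment>
# <i>ntpq -p</i>
<comment>(If the clock is too far off: stop ntpd, set the clock once, start ntpd again)</comment>
# <i>/etc/init.d/ntpd stop</i>
# <i>ntpdate -b master</i>
# <i>/etc/init.d/ntpd start</i>
</pre>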

</body>
</section>
<section>
<title>IPTABLES</title>
<body>

<p>
To set up a firewall on your cluster, you will need iptables.
</p>

<pre caption="Installing iptables">
# <i>emerge -p iptables</i>
# <i>emerge iptables</i>
</pre>

<p>
Required kernel configuration:
</p>

<pre caption="IPtables kernel configuration">
CONFIG_NETFILTER=y
CONFIG_IP_NF_CONNTRACK=y
CONFIG_IP_NF_IPTABLES=y
CONFIG_IP_NF_MATCH_STATE=y
CONFIG_IP_NF_FILTER=y
CONFIG_IP_NF_TARGET_REJECT=y
CONFIG_IP_NF_NAT=y
CONFIG_IP_NF_NAT_NEEDED=y
CONFIG_IP_NF_TARGET_MASQUERADE=y
CONFIG_IP_NF_TARGET_LOG=y
</pre>

<p>
And the rules required for this firewall:
</p>

<pre caption="rule-save">
# Adelie Linux Research &amp; Development Center
# /var/lib/iptables/rule-save

*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT -p tcp -m tcp --dport 22 -j ACCEPT
-A INPUT -s 192.168.1.0/255.255.255.0 -i eth0 -j ACCEPT
-A INPUT -s 127.0.0.1 -i lo -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -j LOG
-A INPUT -j REJECT --reject-with icmp-port-unreachable
COMMIT
*nat
:PREROUTING ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A POSTROUTING -s 192.168.1.0/255.255.255.0 -j MASQUERADE
COMMIT
</pre>

<p>
Then add iptables to the default runlevel of all your nodes:
</p>

<pre caption="Adding iptables to the default runlevel">
# <i>rc-update add iptables default</i>
</pre>
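
<p>
The rules file can also be loaded and inspected by hand before relying on the
init script. A sketch, assuming the rules were saved under the path given in
the rules listing of this section:
</p>

<pre caption="Loading and listing the rules (illustrative)">
<comment>(Load the saved rule set)</comment>
# <i>iptables-restore &lt; /var/lib/iptables/rule-save</i>
<comment>(List the filter and nat tables with packet counters)</comment>
# <i>iptables -L -n -v</i>
# <i>iptables -t nat -L -n -v</i>
</pre>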

</body>
</section>
</chapter>

<chapter>
<title>HPC Tools</title>
<section>
<title>OpenPBS</title>
<body>

<p>
The Portable Batch System (PBS) is a flexible batch queueing and workload
management system originally developed for NASA. It operates on networked,
multi-platform UNIX environments, including heterogeneous clusters of
workstations, supercomputers, and massively parallel systems. Development of
PBS is provided by Altair Grid Technologies.
</p>

<pre caption="Installing openpbs">
# <i>emerge -p openpbs</i>
# <i>emerge openpbs</i>
</pre>

<note>
The OpenPBS ebuild does not currently set proper permissions on the
var-directories used by OpenPBS.
</note>

<p>
Before you start using OpenPBS, some configuration is required. The files
you will need to personalize for your system are:
</p>

<ul>
  <li>/etc/pbs_environment</li>
  <li>/var/spool/PBS/server_name</li>
  <li>/var/spool/PBS/server_priv/nodes</li>
  <li>/var/spool/PBS/mom_priv/config</li>
  <li>/var/spool/PBS/sched_priv/sched_config</li>
</ul>
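
<p>
As an illustration for the cluster described in this guide, the simplest of
these files might look as follows. This is a sketch under the assumption that
the PBS server runs on the master node and that each slave offers one
processor; the <c>np</c> value and the <c>$clienthost</c> entry are our
choices for the example, so check them against your OpenPBS documentation.
</p>

<pre caption="Sample server_name, nodes and mom config files (illustrative)">
<comment>(/var/spool/PBS/server_name)</comment>
master

<comment>(/var/spool/PBS/server_priv/nodes)</comment>
node01 np=1
node02 np=1

<comment>(/var/spool/PBS/mom_priv/config on each slave node)</comment>
$clienthost master
</pre>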

<p>
Here is a sample sched_config:
</p>

<pre caption="/var/spool/PBS/sched_priv/sched_config">
#
# Create queues and set their attributes.
#
#
# Create and define queue upto4nodes
#
create queue upto4nodes
set queue upto4nodes queue_type = Execution
set queue upto4nodes Priority = 100
set queue upto4nodes resources_max.nodect = 4
set queue upto4nodes resources_min.nodect = 1
set queue upto4nodes enabled = True
set queue upto4nodes started = True
#
# Create and define queue default
#
create queue default
set queue default queue_type = Route
set queue default route_destinations = upto4nodes
set queue default enabled = True
set queue default started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_host_enable = True
set server default_queue = default
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server resources_default.neednodes = 1
set server resources_default.nodect = 1
set server resources_default.nodes = 1
set server scheduler_iteration = 60
</pre>

<p>
To submit a task to OpenPBS, the command <c>qsub</c> is used with some
optional parameters. In the example below, "-l" allows you to specify
the resources required, "-j" provides for redirection of standard out and
standard error, and "-m" will e-mail the user at the beginning (b), at the
end (e) and on abort (a) of the job.
</p>

<pre caption="Submitting a task">
<comment>(submit and request from OpenPBS that myscript be executed on 2 nodes)</comment>
# <i>qsub -l nodes=2 -j oe -m abe myscript</i>
</pre>
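
<p>
The same options can also be placed inside the job script as <c>#PBS</c>
directives, so that a plain <c>qsub myscript</c> is enough. The script below
is a sketch: the job name is made up for the example, and
<c>PBS_O_WORKDIR</c> and <c>PBS_NODEFILE</c> are environment variables that
PBS sets for the job.
</p>

<pre caption="A sample myscript (illustrative)">
#!/bin/sh
# Hypothetical job name, chosen for this example
#PBS -N example
# Same resource, output and mail options as on the qsub command line
#PBS -l nodes=2
#PBS -j oe
#PBS -m abe

# Start in the directory from which the job was submitted
cd "$PBS_O_WORKDIR"
# Print the nodes allocated to this job
cat "$PBS_NODEFILE"
</pre>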
| 785 |
|
|
|
| 786 |
|
|
<p>
|
| 787 |
|
|
Normally jobs submitted to OpenPBS are in the form of scripts. Sometimes, you
|
| 788 |
|
|
may want to try a task manually. To request an interactive shell from OpenPBS,
|
| 789 |
|
|
use the "-I" parameter.
|
| 790 |
|
|
</p>
|
| 791 |
|
|
|
| 792 |
|
|
<pre caption="Requesting an interactive shell">
|
| 793 |
|
|
# <i>qsub -I</i>
|
| 794 |
|
|
</pre>

<p>
To check the status of your jobs, use the <c>qstat</c> command:
</p>

<pre caption="Checking the status of the jobs">
# <i>qstat</i>
Job id  Name  User   Time Use S Queue
------  ----  ----   -------- - -----
2.geist STDIN adelie 0        R upto1nodes
</pre>
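
<p>
When many jobs are queued, it can help to filter this listing. The sketch
below is a hypothetical helper, <c>running_jobs</c>, that prints the ids of
the jobs in state R. It assumes the default <c>qstat</c> column layout shown
above, where the state letter is the second-to-last column.
</p>

<pre caption="A hypothetical filter for running jobs">
```shell
# Print the ids of all jobs that qstat reports in state R (running).
# Reads the default qstat listing on standard input.
running_jobs() {
    # Blank lines are skipped; the two header lines never carry an "R"
    # in the second-to-last column, so they are filtered out as well.
    awk 'NF == 0 { next } $(NF - 1) == "R" { print $1 }'
}
```
</pre>

<p>
It would typically be used as <c>qstat | running_jobs</c>.
</p>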

</body>
</section>
<section>
<title>MPICH</title>
<body>

<p>
Message passing is a paradigm used widely on certain classes of parallel
machines, especially those with distributed memory. MPICH is a freely
available, portable implementation of MPI, the standard for message-passing
libraries.
</p>

<p>
The mpich ebuild provided by Adelie Linux supports two USE flags:
<e>doc</e> and <e>crypt</e>. <e>doc</e> causes documentation to be
installed, while <e>crypt</e> configures MPICH to use <c>ssh</c> instead
of <c>rsh</c>.
</p>

<pre caption="Installing the mpich application">
# <i>emerge -p mpich</i>
# <i>emerge mpich</i>
</pre>

<p>
You may need to export an MPICH work directory to all your slave nodes in
<path>/etc/exports</path>:
</p>

<pre caption="/etc/exports">
/home *(rw)
</pre>

<p>
Most massively parallel processors (MPPs) provide a way to start a program on
a requested number of processors; <c>mpirun</c> makes use of the appropriate
command whenever possible. In contrast, workstation clusters require that each
process in a parallel job be started individually, though programs to help
start these processes exist. Because workstation clusters are not already
organized as an MPP, additional information is required to make use of them.
MPICH should be installed with a list of participating workstations in the
file <path>machines.LINUX</path> in the directory
<path>/usr/share/mpich/</path>. This file is used by <c>mpirun</c> to choose
processors to run on.
</p>

<p>
Edit this file to reflect your cluster LAN configuration:
</p>

<pre caption="/usr/share/mpich/machines.LINUX">
# Change this file to contain the machines that you want to use
# to run MPI jobs on. The format is one host name per line, with either
# hostname
# or
# hostname:n
# where n is the number of processors in an SMP. The hostname should
# be the same as the result from the command "hostname"
master
node01
node02
# node03
# node04
# ...
</pre>
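
<p>
To pick a sensible process count for <c>mpirun</c>, it can be handy to add up
the processors listed in this file. The following is a minimal sketch under
the format described in the comments above; the function name
<c>count_slots</c> is hypothetical and not part of MPICH.
</p>

<pre caption="A hypothetical processor counter for the machines file">
```shell
# Print the number of processors listed in a machines file.
# Usage: count_slots FILE
count_slots() {
    awk '
        /^[ \t]*#/ { next }    # skip comment lines
        NF == 0    { next }    # skip blank lines
        {
            # "hostname:n" contributes n processors, "hostname" one.
            n = split($1, part, ":")
            total += (n == 2 ? part[2] : 1)
        }
        END { print total + 0 }
    ' "$1"
}
```
</pre>

<p>
For the file above, <c>count_slots /usr/share/mpich/machines.LINUX</c> would
count the three uncommented hosts, one processor each.
</p>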

<p>
Use the script <c>tstmachines</c> in <path>/usr/sbin/</path> to ensure that
you can use all of the machines that you have listed. This script performs
an <c>rsh</c> and a short directory listing; this tests both that you have
access to the node and that a program in the current directory is visible on
the remote node. If there are any problems, they will be listed. These
problems must be fixed before proceeding.
</p>

<p>
The only argument to <c>tstmachines</c> is the name of the architecture; this
is the same name as the extension on the machines file. For example, the
following tests that a program in the current directory can be executed by
all of the machines in the LINUX machines list.
</p>

<pre caption="Running a test">
# <i>/usr/sbin/tstmachines LINUX</i>
</pre>

<note>
This program is silent if all is well; if you want to see what it is doing,
use the -v (for verbose) argument:
</note>

<pre caption="Running a test verbosely">
# <i>/usr/sbin/tstmachines -v LINUX</i>
</pre>

<p>
The output from this command might look like:
</p>

<pre caption="Output of the above command">
Trying true on host1.uoffoo.edu ...
Trying true on host2.uoffoo.edu ...
Trying ls on host1.uoffoo.edu ...
Trying ls on host2.uoffoo.edu ...
Trying user program on host1.uoffoo.edu ...
Trying user program on host2.uoffoo.edu ...
</pre>

<p>
If <c>tstmachines</c> finds a problem, it will suggest possible reasons and
solutions. In brief, there are three tests:
</p>

<ul>
  <li>
    <e>Can processes be started on remote machines?</e> <c>tstmachines</c>
    attempts to run the shell command <c>true</c> on each machine in the
    machines file by using the remote shell command.
  </li>
  <li>
    <e>Is the current working directory available to all machines?</e> This
    attempts to <c>ls</c> a file that <c>tstmachines</c> creates by running
    <c>ls</c> using the remote shell command.
  </li>
  <li>
    <e>Can user programs be run on remote systems?</e> This checks that shared
    libraries and other components have been properly installed on all
    machines.
  </li>
</ul>
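
<p>
When one of these tests fails, it may help to repeat the first two by hand.
The sketch below is a simplified approximation, not the actual
<c>tstmachines</c> logic: the function names and the <c>RSH</c> variable are
hypothetical, and the second check only verifies that the current directory
exists on the remote host instead of listing a freshly created file. The
third test (running a user program) is not covered.
</p>

<pre caption="A hypothetical manual version of the first two tests">
```shell
# Remote shell to use; an assumption: ssh, as with MPICH built with
# USE=crypt. Set RSH=rsh before loading this for an rsh-based setup.
RSH=${RSH:-ssh}

# Print the host names from a machines file, without comments or ":n".
list_hosts() {
    awk '/^[ \t]*#/ || NF == 0 { next } { split($1, p, ":"); print p[1] }' "$1"
}

# Repeat the first two checks on every host and report failures.
# Usage: check_hosts FILE
check_hosts() {
    for host in $(list_hosts "$1"); do
        if ! $RSH "$host" true; then
            echo "cannot start a process on $host"
        elif ! $RSH "$host" test -d "$PWD"; then
            echo "current directory not visible on $host"
        fi
    done
}
```
</pre>

<p>
Like <c>tstmachines</c> itself, <c>check_hosts
/usr/share/mpich/machines.LINUX</c> prints nothing when every host passes.
</p>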

<p>
Finally, run the test that is customary for every development tool:
</p>

<pre caption="Testing a development tool">
# <i>cd ~</i>
# <i>cp /usr/share/mpich/examples1/hello++.c ~</i>
# <i>make hello++</i>
# <i>mpirun -machinefile /usr/share/mpich/machines.LINUX -np 1 hello++</i>
</pre>

<p>
For further information on MPICH, consult the documentation at <uri
link="http://www-unix.mcs.anl.gov/mpi/mpich/docs/mpichman-chp4/mpichman-chp4.htm">http://www-unix.mcs.anl.gov/mpi/mpich/docs/mpichman-chp4/mpichman-chp4.htm</uri>.
</p>

</body>
</section>
<section>
<title>LAM</title>
<body>

<p>
(Coming Soon!)
</p>

</body>
</section>
<section>
<title>OMNI</title>
<body>

<p>
(Coming Soon!)
</p>

</body>
</section>
</chapter>

<chapter>
<title>Bibliography</title>
<section>
<body>

<p>
The original document is published at the <uri
link="http://www.adelielinux.com">Adelie Linux R&amp;D Centre</uri> web site,
and is reproduced here with the permission of the authors and <uri
link="http://www.cyberlogic.ca">Cyberlogic</uri>'s Adelie Linux R&amp;D
Centre.
</p>

<ul>
  <li><uri>http://www.gentoo.org</uri>, Gentoo Technologies, Inc.</li>
  <li>
    <uri link="http://www.adelielinux.com">http://www.adelielinux.com</uri>,
    Adelie Linux Research and Development Centre
  </li>
  <li>
    <uri link="http://nfs.sourceforge.net/">http://nfs.sourceforge.net/</uri>,
    Linux NFS Project
  </li>
  <li>
    <uri link="http://www-unix.mcs.anl.gov/mpi/mpich/">http://www-unix.mcs.anl.gov/mpi/mpich/</uri>,
    Mathematics and Computer Science Division, Argonne National Laboratory
  </li>
  <li>
    <uri link="http://www.ntp.org/">http://www.ntp.org/</uri>
  </li>
  <li>
    <uri link="http://www.eecis.udel.edu/~mills/">http://www.eecis.udel.edu/~mills/</uri>,
    David L. Mills, University of Delaware
  </li>
  <li>
    <uri link="http://www.ietf.org/html.charters/secsh-charter.html">http://www.ietf.org/html.charters/secsh-charter.html</uri>,
    Secure Shell Working Group, IETF, Internet Society
  </li>
  <li>
    <uri link="http://www.linuxsecurity.com/">http://www.linuxsecurity.com/</uri>,
    Guardian Digital
  </li>
  <li>
    <uri link="http://www.openpbs.org/">http://www.openpbs.org/</uri>,
    Altair Grid Technologies, LLC.
  </li>
</ul>

</body>
</section>
</chapter>

</guide>