1 <?xml version='1.0' encoding="UTF-8"?>
2 <!-- $Header: /var/cvsroot/gentoo/xml/htdocs/doc/en/hpc-howto.xml,v 1.7 2006/04/17 04:43:56 nightmorph Exp $ -->
3 <!DOCTYPE guide SYSTEM "/dtd/guide.dtd">
4
5 <guide link="/doc/en/hpc-howto.xml">
6 <title>High Performance Computing on Gentoo Linux</title>
7
8 <author title="Author">
9 <mail link="marc@adelielinux.com">Marc St-Pierre</mail>
10 </author>
11 <author title="Author">
12 <mail link="benoit@adelielinux.com">Benoit Morin</mail>
13 </author>
14 <author title="Assistant/Research">
15 <mail link="jean-francois@adelielinux.com">Jean-Francois Richard</mail>
16 </author>
17 <author title="Assistant/Research">
18 <mail link="olivier@adelielinux.com">Olivier Crete</mail>
19 </author>
20 <author title="Reviewer">
21 <mail link="dberkholz@gentoo.org">Donnie Berkholz</mail>
22 </author>
23
24 <!-- No licensing information; this document has been written by a third-party
25 organisation without additional licensing information.
26
27 In other words, this is copyright adelielinux R&D; Gentoo only has
28 permission to distribute this document as-is and update it when appropriate
29 as long as the adelie linux R&D notice stays
30 -->
31
32 <abstract>
33 This document was written by people at the Adelie Linux R&amp;D Center
34 &lt;http://www.adelielinux.com&gt; as a step-by-step guide to turn a Gentoo
35 System into a High Performance Computing (HPC) system.
36 </abstract>
37
38 <version>1.2</version>
39 <date>2003-08-01</date>
40
41 <chapter>
42 <title>Introduction</title>
43 <section>
44 <body>
45
46 <p>
Gentoo Linux is a special flavor of Linux that can be automatically optimized
48 and customized for just about any application or need. Extreme performance,
49 configurability and a top-notch user and developer community are all hallmarks
50 of the Gentoo experience.
51 </p>
52
53 <p>
54 Thanks to a technology called Portage, Gentoo Linux can become an ideal secure
55 server, development workstation, professional desktop, gaming system, embedded
56 solution or... a High Performance Computing system. Because of its
57 near-unlimited adaptability, we call Gentoo Linux a metadistribution.
58 </p>
59
60 <p>
61 This document explains how to turn a Gentoo system into a High Performance
62 Computing system. Step by step, it explains what packages one may want to
63 install and helps configure them.
64 </p>
65
66 <p>
67 Obtain Gentoo Linux from the website <uri>http://www.gentoo.org</uri>, and
68 refer to the <uri link="/doc/en/">documentation</uri> at the same location to
69 install it.
70 </p>
71
72 </body>
73 </section>
74 </chapter>
75
76 <chapter>
77 <title>Configuring Gentoo Linux for Clustering</title>
78 <section>
79 <title>Recommended Optimizations</title>
80 <body>
81
82 <note>
83 We refer to the <uri link="/doc/en/handbook/">Gentoo Linux Handbooks</uri> in
84 this section.
85 </note>
86
87 <p>
88 During the installation process, you will have to set your USE variables in
<path>/etc/make.conf</path>. We recommend that you deactivate all the
90 defaults (see <path>/etc/make.profile/make.defaults</path>) by negating them
in make.conf. However, you may want to keep such USE flags as x86, 3dnow,
92 gpm, mmx, sse, ncurses, pam and tcpd. Refer to the USE documentation for more
93 information.
94 </p>
95
96 <pre caption="USE Flags">
97 USE="-oss 3dnow -apm -arts -avi -berkdb -crypt -cups -encode -gdbm
98 -gif gpm -gtk -imlib -java -jpeg -kde -gnome -libg++ -libwww -mikmod
99 mmx -motif -mpeg ncurses -nls -oggvorbis -opengl pam -pdflib -png
100 -python -qt -qtmt -quicktime -readline -sdl -slang -spell -ssl
101 -svga tcpd -truetype -X -xml2 -xmms -xv -zlib"
102 </pre>
103
104 <p>
105 Or simply:
106 </p>
107
108 <pre caption="USE Flags - simplified version">
109 USE="-* 3dnow gpm mmx ncurses pam sse tcpd"
110 </pre>
111
112 <note>
113 The <e>tcpd</e> USE flag increases security for packages such as xinetd.
114 </note>
115
116 <p>
In step 15 ("Installing the kernel and a System Logger"), for stability
reasons, we recommend vanilla-sources, the official kernel sources
released on <uri>http://www.kernel.org/</uri>, unless you require special
support such as XFS.
121 </p>
122
123 <pre caption="Installing vanilla-sources">
124 # <i>emerge -p syslog-ng vanilla-sources</i>
125 </pre>
126
127 <p>
128 When you install miscellaneous packages, we recommend installing the
129 following:
130 </p>
131
132 <pre caption="Installing necessary packages">
133 # <i>emerge -p nfs-utils portmap tcpdump ssmtp iptables xinetd</i>
134 </pre>
135
136 </body>
137 </section>
138 <section>
139 <title>Communication Layer (TCP/IP Network)</title>
140 <body>
141
142 <p>
143 A cluster requires a communication layer to interconnect the slave nodes to
the master node. Typically, a Fast Ethernet or Gigabit Ethernet LAN is used
since it offers a good price/performance ratio. Other possibilities include
products like <uri link="http://www.myricom.com/">Myrinet</uri>, <uri
link="http://quadrics.com/">QsNet</uri> or others.
148 </p>
149
150 <p>
151 A cluster is composed of two node types: master and slave. Typically, your
152 cluster will have one master node and several slave nodes.
153 </p>
154
155 <p>
156 The master node is the cluster's server. It is responsible for telling the
157 slave nodes what to do. This server will typically run such daemons as dhcpd,
158 nfs, pbs-server, and pbs-sched. Your master node will allow interactive
sessions for users, and accept job submissions.
160 </p>
161
162 <p>
163 The slave nodes listen for instructions (via ssh/rsh perhaps) from the master
164 node. They should be dedicated to crunching results and therefore should not
165 run any unnecessary services.
166 </p>
167
168 <p>
The rest of this documentation assumes a cluster configured as per the
hosts file below. Every node should have such a hosts file
(<path>/etc/hosts</path>) with an entry for each node participating in the
cluster.
173 </p>
174
175 <pre caption="/etc/hosts">
176 # Adelie Linux Research &amp; Development Center
177 # /etc/hosts
178
179 127.0.0.1 localhost
180
181 192.168.1.100 master.adelie master
182
183 192.168.1.1 node01.adelie node01
184 192.168.1.2 node02.adelie node02
185 </pre>
186
187 <p>
To set up your cluster's dedicated LAN, edit your <path>/etc/conf.d/net</path>
189 file on the master node.
190 </p>
191
192 <pre caption="/etc/conf.d/net">
193 # Copyright 1999-2002 Gentoo Technologies, Inc.
194 # Distributed under the terms of the GNU General Public License, v2 or later
195
196 # Global config file for net.* rc-scripts
197
198 # This is basically the ifconfig argument without the ifconfig $iface
199 #
200
201 iface_eth0="192.168.1.100 broadcast 192.168.1.255 netmask 255.255.255.0"
# Network connection to the outside world using dhcp -- configure as required for your network
203 iface_eth1="dhcp"
204 </pre>
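
<p>
If the second interface does not yet have an init script, you may also need to
create one and have both interfaces brought up at boot. A minimal sketch,
assuming the classic baselayout convention of symlinking <path>net.*</path>
init scripts:
</p>

<pre caption="Starting both interfaces at boot (sketch)">
# <i>ln -s /etc/init.d/net.eth0 /etc/init.d/net.eth1</i>
# <i>rc-update add net.eth0 default</i>
# <i>rc-update add net.eth1 default</i>
</pre>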
205
206
207 <p>
Finally, set up a DHCP daemon on the master node to avoid having to maintain a
209 network configuration on each slave node.
210 </p>
211
212 <pre caption="/etc/dhcp/dhcpd.conf">
213 # Adelie Linux Research &amp; Development Center
214 # /etc/dhcp/dhcpd.conf
215
216 log-facility local7;
217 ddns-update-style none;
218 use-host-decl-names on;
219
220 subnet 192.168.1.0 netmask 255.255.255.0 {
221 option domain-name "adelie";
222 range 192.168.1.10 192.168.1.99;
223 option routers 192.168.1.100;
224
225 host node01.adelie {
226 # MAC address of network card on node 01
227 hardware ethernet 00:07:e9:0f:e2:d4;
228 fixed-address 192.168.1.1;
229 }
230 host node02.adelie {
231 # MAC address of network card on node 02
232 hardware ethernet 00:07:e9:0f:e2:6b;
233 fixed-address 192.168.1.2;
234 }
235 }
236 </pre>
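
<p>
The DHCP server itself also has to be installed and started at boot on the
master node. A minimal sketch, assuming the <c>net-misc/dhcp</c> package and
its <c>dhcpd</c> init script:
</p>

<pre caption="Installing and enabling the DHCP server (sketch)">
# <i>emerge -p dhcp</i>
# <i>emerge dhcp</i>
# <i>rc-update add dhcpd default</i>
</pre>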
237
238 </body>
239 </section>
240 <section>
241 <title>NFS/NIS</title>
242 <body>
243
244 <p>
245 The Network File System (NFS) was developed to allow machines to mount a disk
246 partition on a remote machine as if it were on a local hard drive. This allows
247 for fast, seamless sharing of files across a network.
248 </p>
249
250 <p>
251 There are other systems that provide similar functionality to NFS which could
252 be used in a cluster environment. The <uri
253 link="http://www.openafs.org">Andrew File System
254 from IBM</uri>, recently open-sourced, provides a file sharing mechanism with
255 some additional security and performance features. The <uri
256 link="http://www.coda.cs.cmu.edu/">Coda File System</uri> is still in
257 development, but is designed to work well with disconnected clients. Many
258 of the features of the Andrew and Coda file systems are slated for inclusion
259 in the next version of <uri link="http://www.nfsv4.org">NFS (Version 4)</uri>.
260 The advantage of NFS today is that it is mature, standard, well understood,
261 and supported robustly across a variety of platforms.
262 </p>
263
264 <pre caption="Ebuilds for NFS-support">
265 # <i>emerge -p nfs-utils portmap</i>
266 # <i>emerge nfs-utils portmap</i>
267 </pre>
268
269 <p>
270 Configure and install a kernel to support NFS v3 on all nodes:
271 </p>
272
273 <pre caption="Required Kernel Configurations for NFS">
274 CONFIG_NFS_FS=y
275 CONFIG_NFSD=y
276 CONFIG_SUNRPC=y
277 CONFIG_LOCKD=y
278 CONFIG_NFSD_V3=y
279 CONFIG_LOCKD_V4=y
280 </pre>
281
282 <p>
283 On the master node, edit your <path>/etc/hosts.allow</path> file to allow
284 connections from slave nodes. If your cluster LAN is on 192.168.1.0/24,
285 your <path>hosts.allow</path> will look like:
286 </p>
287
288 <pre caption="hosts.allow">
289 portmap:192.168.1.0/255.255.255.0
290 </pre>
291
292 <p>
293 Edit the <path>/etc/exports</path> file of the master node to export a work
294 directory structure (/home is good for this).
295 </p>
296
297 <pre caption="/etc/exports">
298 /home/ *(rw)
299 </pre>
300
301 <p>
302 Add nfs to your master node's default runlevel:
303 </p>
304
305 <pre caption="Adding NFS to the default runlevel">
306 # <i>rc-update add nfs default</i>
307 </pre>
308
309 <p>
To mount the NFS-exported filesystem from the master, you also have to
configure your slave nodes' <path>/etc/fstab</path>. Add a line like this
312 one:
313 </p>
314
315 <pre caption="/etc/fstab">
316 master:/home/ /home nfs rw,exec,noauto,nouser,async 0 0
317 </pre>
318
319 <p>
You'll also need to set up your slave nodes so that they mount the NFS
filesystem at boot by issuing this command:
322 </p>
323
324 <pre caption="Adding nfsmount to the default runlevel">
325 # <i>rc-update add nfsmount default</i>
326 </pre>
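
<p>
Before rebooting, you can verify the export and the mount by hand. A quick
sanity check, assuming <c>showmount</c> from nfs-utils is available:
</p>

<pre caption="Verifying the NFS export (sketch)">
<comment>(On the master: re-read /etc/exports and list what is exported)</comment>
# <i>exportfs -ra</i>
# <i>showmount -e master</i>
<comment>(On a slave node: mount the entry from /etc/fstab manually)</comment>
# <i>mount /home</i>
</pre>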
327
328 </body>
329 </section>
330 <section>
331 <title>RSH/SSH</title>
332 <body>
333
334 <p>
335 SSH is a protocol for secure remote login and other secure network services
336 over an insecure network. OpenSSH uses public key cryptography to provide
secure authentication. The first step in configuring OpenSSH on the cluster is
to generate the key pair: a public key, which is shared with remote systems,
and a private key, which is kept on the local system.
340 </p>
341
342 <p>
343 For transparent cluster usage, private/public keys may be used. This process
344 has two steps:
345 </p>
346
347 <ul>
348 <li>Generate public and private keys</li>
349 <li>Copy public key to slave nodes</li>
350 </ul>
351
352 <p>
For user-based authentication, generate and copy as follows:
354 </p>
355
356 <pre caption="SSH key authentication">
357 # <i>ssh-keygen -t dsa</i>
358 Generating public/private dsa key pair.
359 Enter file in which to save the key (/root/.ssh/id_dsa): /root/.ssh/id_dsa
360 Enter passphrase (empty for no passphrase):
361 Enter same passphrase again:
362 Your identification has been saved in /root/.ssh/id_dsa.
363 Your public key has been saved in /root/.ssh/id_dsa.pub.
364 The key fingerprint is:
365 f1:45:15:40:fd:3c:2d:f7:9f:ea:55:df:76:2f:a4:1f root@master
366
367 <comment>WARNING! If you already have an "authorized_keys" file,
368 please append to it, do not use the following command.</comment>
369
370 # <i>scp /root/.ssh/id_dsa.pub node01:/root/.ssh/authorized_keys</i>
371 root@master's password:
372 id_dsa.pub 100% 234 2.0MB/s 00:00
373
374 # <i>scp /root/.ssh/id_dsa.pub node02:/root/.ssh/authorized_keys</i>
375 root@master's password:
376 id_dsa.pub 100% 234 2.0MB/s 00:00
377 </pre>
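
<p>
If a node already has an <path>authorized_keys</path> file, append the new key
instead of overwriting the file. One way to do this, shown here as a sketch:
</p>

<pre caption="Appending a public key (sketch)">
# <i>cat /root/.ssh/id_dsa.pub | ssh node01 "cat >> /root/.ssh/authorized_keys"</i>
</pre>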
378
379 <note>
380 Host keys must have an empty passphrase. RSA is required for host-based
381 authentication.
382 </note>
383
384 <p>
For host-based authentication, you will also need to edit your
386 <path>/etc/ssh/shosts.equiv</path>.
387 </p>
388
389 <pre caption="/etc/ssh/shosts.equiv">
390 node01.adelie
391 node02.adelie
392 master.adelie
393 </pre>
394
395 <p>
396 And a few modifications to the <path>/etc/ssh/sshd_config</path> file:
397 </p>
398
399 <pre caption="sshd configurations">
400 # $OpenBSD: sshd_config,v 1.42 2001/09/20 20:57:51 mouring Exp $
401 # This sshd was compiled with PATH=/usr/bin:/bin:/usr/sbin:/sbin
402
403 # This is the sshd server system-wide configuration file. See sshd(8)
404 # for more information.
405
406 # HostKeys for protocol version 2
407 HostKey /etc/ssh/ssh_host_rsa_key
408 </pre>
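
<p>
For host-based authentication to actually be used, sshd (and the ssh client)
typically need a few more directives than the one shown above. A sketch of the
relevant options; check <c>man sshd_config</c> and <c>man ssh_config</c> for
your OpenSSH version:
</p>

<pre caption="Host-based authentication options (sketch)">
<comment># In /etc/ssh/sshd_config on every node:</comment>
HostbasedAuthentication yes

<comment># In /etc/ssh/ssh_config on every node:</comment>
HostbasedAuthentication yes
EnableSSHKeysign yes
</pre>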
409
410 <p>
If your applications require RSH communications, you will need to emerge
412 net-misc/netkit-rsh and sys-apps/xinetd.
413 </p>
414
<pre caption="Installing necessary applications">
416 # <i>emerge -p xinetd</i>
417 # <i>emerge xinetd</i>
418 # <i>emerge -p netkit-rsh</i>
419 # <i>emerge netkit-rsh</i>
420 </pre>
421
422 <p>
Then configure the rsh daemon. Edit your <path>/etc/xinetd.d/rsh</path> file.
424 </p>
425
426 <pre caption="rsh">
427 # Adelie Linux Research &amp; Development Center
428 # /etc/xinetd.d/rsh
429
430 service shell
431 {
432 socket_type = stream
433 protocol = tcp
434 wait = no
435 user = root
436 group = tty
437 server = /usr/sbin/in.rshd
438 log_type = FILE /var/log/rsh
439 log_on_success = PID HOST USERID EXIT DURATION
440 log_on_failure = USERID ATTEMPT
441 disable = no
442 }
443 </pre>
444
445 <p>
446 Edit your <path>/etc/hosts.allow</path> to permit rsh connections:
447 </p>
448
449 <pre caption="hosts.allow">
450 # Adelie Linux Research &amp; Development Center
451 # /etc/hosts.allow
452
453 in.rshd:192.168.1.0/255.255.255.0
454 </pre>
455
456 <p>
457 Or you can simply trust your cluster LAN:
458 </p>
459
460 <pre caption="hosts.allow">
461 # Adelie Linux Research &amp; Development Center
462 # /etc/hosts.allow
463
464 ALL:192.168.1.0/255.255.255.0
465 </pre>
466
467 <p>
Finally, configure host authentication in <path>/etc/hosts.equiv</path>.
469 </p>
470
471 <pre caption="hosts.equiv">
472 # Adelie Linux Research &amp; Development Center
473 # /etc/hosts.equiv
474
475 master
476 node01
477 node02
478 </pre>
479
480 <p>
481 And, add xinetd to your default runlevel:
482 </p>
483
484 <pre caption="Adding xinetd to the default runlevel">
485 # <i>rc-update add xinetd default</i>
486 </pre>
487
488 </body>
489 </section>
490 <section>
491 <title>NTP</title>
492 <body>
493
494 <p>
495 The Network Time Protocol (NTP) is used to synchronize the time of a computer
496 client or server to another server or reference time source, such as a radio
497 or satellite receiver or modem. It provides accuracies typically within a
498 millisecond on LANs and up to a few tens of milliseconds on WANs relative to
Coordinated Universal Time (UTC) via a Global Positioning System (GPS)
500 receiver, for example. Typical NTP configurations utilize multiple redundant
501 servers and diverse network paths in order to achieve high accuracy and
502 reliability.
503 </p>
504
505 <p>
Select an NTP server geographically close to you from <uri
507 link="http://www.eecis.udel.edu/~mills/ntp/servers.html">Public NTP Time
508 Servers</uri>, and configure your <path>/etc/conf.d/ntp</path> and
509 <path>/etc/ntp.conf</path> files on the master node.
510 </p>
511
512 <pre caption="Master /etc/conf.d/ntp">
513 # Copyright 1999-2002 Gentoo Technologies, Inc.
514 # Distributed under the terms of the GNU General Public License v2
515 # /etc/conf.d/ntpd
516
517 # NOTES:
518 # - NTPDATE variables below are used if you wish to set your
519 # clock when you start the ntp init.d script
520 # - make sure that the NTPDATE_CMD will close by itself ...
521 # the init.d script will not attempt to kill/stop it
522 # - ntpd will be used to maintain synchronization with a time
523 # server regardless of what NTPDATE is set to
524 # - read each of the comments above each of the variable
525
526 # Comment this out if you dont want the init script to warn
527 # about not having ntpdate setup
528 NTPDATE_WARN="n"
529
530 # Command to run to set the clock initially
531 # Most people should just uncomment this line ...
532 # however, if you know what you're doing, and you
533 # want to use ntpd to set the clock, change this to 'ntpd'
534 NTPDATE_CMD="ntpdate"
535
536 # Options to pass to the above command
537 # Most people should just uncomment this variable and
538 # change 'someserver' to a valid hostname which you
539 # can acquire from the URL's below
540 NTPDATE_OPTS="-b ntp1.cmc.ec.gc.ca"
541
542 ##
543 # A list of available servers is available here:
544 # http://www.eecis.udel.edu/~mills/ntp/servers.html
545 # Please follow the rules of engagement and use a
546 # Stratum 2 server (unless you qualify for Stratum 1)
547 ##
548
549 # Options to pass to the ntpd process that will *always* be run
550 # Most people should not uncomment this line ...
551 # however, if you know what you're doing, feel free to tweak
552 #NTPD_OPTS=""
553
554 </pre>
555
556 <p>
Edit your <path>/etc/ntp.conf</path> file on the master to set up an external
558 synchronization source:
559 </p>
560
561 <pre caption="Master ntp.conf">
562 # Adelie Linux Research &amp; Development Center
563 # /etc/ntp.conf
564
565 # Synchronization source #1
566 server ntp1.cmc.ec.gc.ca
567 restrict ntp1.cmc.ec.gc.ca
568 # Synchronization source #2
569 server ntp2.cmc.ec.gc.ca
570 restrict ntp2.cmc.ec.gc.ca
571 stratum 10
572 driftfile /etc/ntp.drift.server
573 logfile /var/log/ntp
574 broadcast 192.168.1.255
575 restrict default kod
576 restrict 127.0.0.1
577 restrict 192.168.1.0 mask 255.255.255.0
578 </pre>
579
580 <p>
On all your slave nodes, set your synchronization source to be your master
node.
583 </p>
584
585 <pre caption="Node /etc/conf.d/ntp">
586 # Copyright 1999-2002 Gentoo Technologies, Inc.
587 # Distributed under the terms of the GNU General Public License v2
588 # /etc/conf.d/ntpd
589
590 NTPDATE_WARN="n"
591 NTPDATE_CMD="ntpdate"
592 NTPDATE_OPTS="-b master"
593 </pre>
594
595 <pre caption="Node ntp.conf">
596 # Adelie Linux Research &amp; Development Center
597 # /etc/ntp.conf
598
599 # Synchronization source #1
600 server master
601 restrict master
602 stratum 11
603 driftfile /etc/ntp.drift.server
604 logfile /var/log/ntp
605 restrict default kod
606 restrict 127.0.0.1
607 </pre>
608
609 <p>
610 Then add ntpd to the default runlevel of all your nodes:
611 </p>
612
613 <pre caption="Adding ntpd to the default runlevel">
614 # <i>rc-update add ntpd default</i>
615 </pre>
616
617 <note>
618 NTP will not update the local clock if the time difference between your
619 synchronization source and the local clock is too great.
620 </note>
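
<p>
If a node's clock is too far off, you can force an initial synchronization by
hand before starting the daemon. A sketch, run on a slave node:
</p>

<pre caption="Forcing an initial synchronization (sketch)">
# <i>/etc/init.d/ntpd stop</i>
# <i>ntpdate -b master</i>
# <i>/etc/init.d/ntpd start</i>
</pre>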
621
622 </body>
623 </section>
624 <section>
625 <title>IPTABLES</title>
626 <body>
627
628 <p>
To set up a firewall on your cluster, you will need iptables.
630 </p>
631
632 <pre caption="Installing iptables">
633 # <i>emerge -p iptables</i>
634 # <i>emerge iptables</i>
635 </pre>
636
637 <p>
638 Required kernel configuration:
639 </p>
640
641 <pre caption="IPtables kernel configuration">
642 CONFIG_NETFILTER=y
643 CONFIG_IP_NF_CONNTRACK=y
644 CONFIG_IP_NF_IPTABLES=y
645 CONFIG_IP_NF_MATCH_STATE=y
646 CONFIG_IP_NF_FILTER=y
647 CONFIG_IP_NF_TARGET_REJECT=y
648 CONFIG_IP_NF_NAT=y
649 CONFIG_IP_NF_NAT_NEEDED=y
650 CONFIG_IP_NF_TARGET_MASQUERADE=y
651 CONFIG_IP_NF_TARGET_LOG=y
652 </pre>
653
654 <p>
655 And the rules required for this firewall:
656 </p>
657
658 <pre caption="rule-save">
659 # Adelie Linux Research &amp; Development Center
660 # /var/lib/iptables/rule-save
661
662 *filter
663 :INPUT ACCEPT [0:0]
664 :FORWARD ACCEPT [0:0]
665 :OUTPUT ACCEPT [0:0]
666 -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
667 -A INPUT -p tcp -m tcp --dport 22 -j ACCEPT
668 -A INPUT -s 192.168.1.0/255.255.255.0 -i eth1 -j ACCEPT
669 -A INPUT -s 127.0.0.1 -i lo -j ACCEPT
670 -A INPUT -p icmp -j ACCEPT
671 -A INPUT -j LOG
672 -A INPUT -j REJECT --reject-with icmp-port-unreachable
673 COMMIT
674 *nat
675 :PREROUTING ACCEPT [0:0]
676 :POSTROUTING ACCEPT [0:0]
677 :OUTPUT ACCEPT [0:0]
678 -A POSTROUTING -s 192.168.1.0/255.255.255.0 -j MASQUERADE
679 COMMIT
680 </pre>
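
<p>
These rules can be loaded by hand with <c>iptables-restore</c>, or written out
through the init script so they are restored at boot. A sketch, assuming the
Gentoo iptables init script supports a <c>save</c> action:
</p>

<pre caption="Loading and saving the rules (sketch)">
# <i>iptables-restore &lt; /var/lib/iptables/rule-save</i>
# <i>/etc/init.d/iptables save</i>
</pre>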
681
682 <p>
683 Then add iptables to the default runlevel of all your nodes:
684 </p>
685
686 <pre caption="Adding iptables to the default runlevel">
687 # <i>rc-update add iptables default</i>
688 </pre>
689
690 </body>
691 </section>
692 </chapter>
693
694 <chapter>
695 <title>HPC Tools</title>
696 <section>
697 <title>OpenPBS</title>
698 <body>
699
700 <p>
701 The Portable Batch System (PBS) is a flexible batch queueing and workload
702 management system originally developed for NASA. It operates on networked,
703 multi-platform UNIX environments, including heterogeneous clusters of
704 workstations, supercomputers, and massively parallel systems. Development of
705 PBS is provided by Altair Grid Technologies.
706 </p>
707
708 <pre caption="Installing openpbs">
709 # <i>emerge -p openpbs</i>
710 </pre>
711
712 <note>
The OpenPBS ebuild does not currently set proper permissions on the var directories
714 used by OpenPBS.
715 </note>
716
717 <p>
Before you start using OpenPBS, some configuration is required. The files you
will need to personalize for your system are listed below (a sample of some of
them follows the list):
720 </p>
721
722 <ul>
723 <li>/etc/pbs_environment</li>
724 <li>/var/spool/PBS/server_name</li>
725 <li>/var/spool/PBS/server_priv/nodes</li>
726 <li>/var/spool/PBS/mom_priv/config</li>
727 <li>/var/spool/PBS/sched_priv/sched_config</li>
728 </ul>
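
<p>
The exact contents depend on your site, but as a rough sketch for the example
cluster above, <path>server_name</path> simply names the master and
<path>server_priv/nodes</path> lists the compute nodes (the attribute syntax
may differ between OpenPBS versions):
</p>

<pre caption="Sample server_name and nodes files (sketch)">
<comment># /var/spool/PBS/server_name</comment>
master

<comment># /var/spool/PBS/server_priv/nodes</comment>
node01 np=1
node02 np=1
</pre>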
729
730 <p>
Here is a sample queue and server configuration (these directives are normally
fed to the PBS <c>qmgr</c> utility):
732 </p>
733
<pre caption="Queue and server configuration">
735 #
736 # Create queues and set their attributes.
737 #
738 #
739 # Create and define queue upto4nodes
740 #
741 create queue upto4nodes
742 set queue upto4nodes queue_type = Execution
743 set queue upto4nodes Priority = 100
744 set queue upto4nodes resources_max.nodect = 4
745 set queue upto4nodes resources_min.nodect = 1
746 set queue upto4nodes enabled = True
747 set queue upto4nodes started = True
748 #
749 # Create and define queue default
750 #
751 create queue default
752 set queue default queue_type = Route
753 set queue default route_destinations = upto4nodes
754 set queue default enabled = True
755 set queue default started = True
756 #
757 # Set server attributes.
758 #
759 set server scheduling = True
760 set server acl_host_enable = True
761 set server default_queue = default
762 set server log_events = 511
763 set server mail_from = adm
764 set server query_other_jobs = True
765 set server resources_default.neednodes = 1
766 set server resources_default.nodect = 1
767 set server resources_default.nodes = 1
768 set server scheduler_iteration = 60
769 </pre>
770
771 <p>
772 To submit a task to OpenPBS, the command <c>qsub</c> is used with some
773 optional parameters. In the example below, "-l" allows you to specify
the resources required, "-j oe" merges standard error into standard output,
and "-m abe" will e-mail the user at the beginning (b), end (e)
776 and on abort (a) of the job.
777 </p>
778
779 <pre caption="Submitting a task">
780 <comment>(submit and request from OpenPBS that myscript be executed on 2 nodes)</comment>
781 # <i>qsub -l nodes=2 -j oe -m abe myscript</i>
782 </pre>
783
784 <p>
785 Normally jobs submitted to OpenPBS are in the form of scripts. Sometimes, you
786 may want to try a task manually. To request an interactive shell from OpenPBS,
787 use the "-I" parameter.
788 </p>
789
790 <pre caption="Requesting an interactive shell">
791 # <i>qsub -I</i>
792 </pre>
793
794 <p>
795 To check the status of your jobs, use the qstat command:
796 </p>
797
798 <pre caption="Checking the status of the jobs">
799 # <i>qstat</i>
800 Job id Name User Time Use S Queue
801 ------ ---- ---- -------- - -----
2.geist STDIN adelie 0 R upto4nodes
803 </pre>
804
805 </body>
806 </section>
807 <section>
808 <title>MPICH</title>
809 <body>
810
811 <p>
812 Message passing is a paradigm used widely on certain classes of parallel
813 machines, especially those with distributed memory. MPICH is a freely
814 available, portable implementation of MPI, the Standard for message-passing
815 libraries.
816 </p>
817
818 <p>
819 The mpich ebuild provided by Adelie Linux allows for two USE flags:
820 <e>doc</e> and <e>crypt</e>. <e>doc</e> will cause documentation to be
821 installed, while <e>crypt</e> will configure MPICH to use <c>ssh</c> instead
822 of <c>rsh</c>.
823 </p>
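
<p>
If you want either of those flags, enable them before emerging. A sketch,
assuming mpich lives in the sys-cluster category and that your Portage version
supports per-package flags in <path>/etc/portage/package.use</path>:
</p>

<pre caption="Enabling mpich USE flags (sketch)">
# <i>echo "sys-cluster/mpich doc crypt" >> /etc/portage/package.use</i>
</pre>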
824
825 <pre caption="Installing the mpich application">
826 # <i>emerge -p mpich</i>
827 # <i>emerge mpich</i>
828 </pre>
829
830 <p>
You may need to export an MPICH work directory to all your slave nodes in
832 <path>/etc/exports</path>:
833 </p>
834
835 <pre caption="/etc/exports">
836 /home *(rw)
837 </pre>
838
839 <p>
840 Most massively parallel processors (MPPs) provide a way to start a program on
841 a requested number of processors; <c>mpirun</c> makes use of the appropriate
842 command whenever possible. In contrast, workstation clusters require that each
843 process in a parallel job be started individually, though programs to help
844 start these processes exist. Because workstation clusters are not already
845 organized as an MPP, additional information is required to make use of them.
846 Mpich should be installed with a list of participating workstations in the
847 file <path>machines.LINUX</path> in the directory
848 <path>/usr/share/mpich/</path>. This file is used by <c>mpirun</c> to choose
849 processors to run on.
850 </p>
851
852 <p>
Edit this file to reflect your cluster LAN configuration:
854 </p>
855
856 <pre caption="/usr/share/mpich/machines.LINUX">
857 # Change this file to contain the machines that you want to use
858 # to run MPI jobs on. The format is one host name per line, with either
859 # hostname
860 # or
861 # hostname:n
862 # where n is the number of processors in an SMP. The hostname should
863 # be the same as the result from the command "hostname"
864 master
865 node01
866 node02
867 # node03
868 # node04
869 # ...
870 </pre>
871
872 <p>
873 Use the script <c>tstmachines</c> in <path>/usr/sbin/</path> to ensure that
874 you can use all of the machines that you have listed. This script performs
an <c>rsh</c> and a short directory listing; this tests both that you have
876 access to the node and that a program in the current directory is visible on
877 the remote node. If there are any problems, they will be listed. These
878 problems must be fixed before proceeding.
879 </p>
880
881 <p>
882 The only argument to <c>tstmachines</c> is the name of the architecture; this
883 is the same name as the extension on the machines file. For example, the
884 following tests that a program in the current directory can be executed by
885 all of the machines in the LINUX machines list.
886 </p>
887
888 <pre caption="Running a test">
# <i>/usr/sbin/tstmachines LINUX</i>
890 </pre>
891
892 <note>
893 This program is silent if all is well; if you want to see what it is doing,
894 use the -v (for verbose) argument:
895 </note>
896
897 <pre caption="Running a test verbosively">
# <i>/usr/sbin/tstmachines -v LINUX</i>
899 </pre>
900
901 <p>
902 The output from this command might look like:
903 </p>
904
905 <pre caption="Output of the above command">
906 Trying true on host1.uoffoo.edu ...
907 Trying true on host2.uoffoo.edu ...
908 Trying ls on host1.uoffoo.edu ...
909 Trying ls on host2.uoffoo.edu ...
910 Trying user program on host1.uoffoo.edu ...
911 Trying user program on host2.uoffoo.edu ...
912 </pre>
913
914 <p>
915 If <c>tstmachines</c> finds a problem, it will suggest possible reasons and
916 solutions. In brief, there are three tests:
917 </p>
918
919 <ul>
920 <li>
921 <e>Can processes be started on remote machines?</e> tstmachines attempts
to run the shell command true on each machine in the machines file by
923 using the remote shell command.
924 </li>
925 <li>
<e>Is the current working directory available to all machines?</e> This
927 attempts to ls a file that tstmachines creates by running ls using the
928 remote shell command.
929 </li>
930 <li>
931 <e>Can user programs be run on remote systems?</e> This checks that shared
932 libraries and other components have been properly installed on all
933 machines.
934 </li>
935 </ul>
936
937 <p>
938 And the required test for every development tool:
939 </p>
940
941 <pre caption="Testing a development tool">
942 # <i>cd ~</i>
943 # <i>cp /usr/share/mpich/examples1/hello++.c ~</i>
# <i>mpiCC -o hello++ hello++.c</i>
945 # <i>mpirun -machinefile /usr/share/mpich/machines.LINUX -np 1 hello++</i>
946 </pre>
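
<p>
Once the single-process run works, the same binary can be spread across the
nodes listed in <path>machines.LINUX</path>, for example:
</p>

<pre caption="Running on several nodes (sketch)">
# <i>mpirun -machinefile /usr/share/mpich/machines.LINUX -np 3 hello++</i>
</pre>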
947
948 <p>
949 For further information on MPICH, consult the documentation at <uri
950 link="http://www-unix.mcs.anl.gov/mpi/mpich/docs/mpichman-chp4/mpichman-chp4.htm">http://www-unix.mcs.anl.gov/mpi/mpich/docs/mpichman-chp4/mpichman-chp4.htm</uri>.
951 </p>
952
953 </body>
954 </section>
955 <section>
956 <title>LAM</title>
957 <body>
958
959 <p>
960 (Coming Soon!)
961 </p>
962
963 </body>
964 </section>
965 <section>
966 <title>OMNI</title>
967 <body>
968
969 <p>
970 (Coming Soon!)
971 </p>
972
973 </body>
974 </section>
975 </chapter>
976
977 <chapter>
978 <title>Bibliography</title>
979 <section>
980 <body>
981
982 <p>
983 The original document is published at the <uri
984 link="http://www.adelielinux.com">Adelie Linux R&amp;D Centre</uri> web site,
985 and is reproduced here with the permission of the authors and <uri
986 link="http://www.cyberlogic.ca">Cyberlogic</uri>'s Adelie Linux R&amp;D
987 Centre.
988 </p>
989
990 <ul>
991 <li><uri>http://www.gentoo.org</uri>, Gentoo Technologies, Inc.</li>
992 <li>
993 <uri link="http://www.adelielinux.com">http://www.adelielinux.com</uri>,
994 Adelie Linux Research and Development Centre
995 </li>
996 <li>
997 <uri link="http://nfs.sourceforge.net/">http://nfs.sourceforge.net</uri>,
998 Linux NFS Project
999 </li>
1000 <li>
1001 <uri link="http://www-unix.mcs.anl.gov/mpi/mpich/">http://www-unix.mcs.anl.gov/mpi/mpich/</uri>,
1002 Mathematics and Computer Science Division, Argonne National Laboratory
1003 </li>
1004 <li>
1005 <uri link="http://www.ntp.org/">http://ntp.org</uri>
1006 </li>
1007 <li>
1008 <uri link="http://www.eecis.udel.edu/~mills/">http://www.eecis.udel.edu/~mills/</uri>,
1009 David L. Mills, University of Delaware
1010 </li>
1011 <li>
1012 <uri link="http://www.ietf.org/html.charters/secsh-charter.html">http://www.ietf.org/html.charters/secsh-charter.html</uri>,
1013 Secure Shell Working Group, IETF, Internet Society
1014 </li>
1015 <li>
1016 <uri link="http://www.linuxsecurity.com/">http://www.linuxsecurity.com/</uri>,
1017 Guardian Digital
1018 </li>
1019 <li>
1020 <uri link="http://www.openpbs.org/">http://www.openpbs.org/</uri>,
1021 Altair Grid Technologies, LLC.
1022 </li>
1023 </ul>
1024
1025 </body>
1026 </section>
1027 </chapter>
1028
1029 </guide>
