<?xml version='1.0' encoding="UTF-8"?>
<!-- $Header: /var/cvsroot/gentoo/xml/htdocs/doc/en/hpc-howto.xml,v 1.12 2006/11/02 19:13:17 nightmorph Exp $ -->
<!DOCTYPE guide SYSTEM "/dtd/guide.dtd">

<guide link="/doc/en/hpc-howto.xml">
<title>High Performance Computing on Gentoo Linux</title>

<author title="Author">
<mail link="marc@adelielinux.com">Marc St-Pierre</mail>
</author>
<author title="Author">
<mail link="benoit@adelielinux.com">Benoit Morin</mail>
</author>
<author title="Assistant/Research">
<mail link="jean-francois@adelielinux.com">Jean-Francois Richard</mail>
</author>
<author title="Assistant/Research">
<mail link="olivier@adelielinux.com">Olivier Crete</mail>
</author>
<author title="Reviewer">
<mail link="dberkholz@gentoo.org">Donnie Berkholz</mail>
</author>

<!-- No licensing information; this document has been written by a third-party
organisation without additional licensing information.

In other words, this is copyright adelielinux R&D; Gentoo only has
permission to distribute this document as-is and update it when appropriate
as long as the adelie linux R&D notice stays
-->

<abstract>
This document was written by people at the Adelie Linux R&amp;D Center
&lt;http://www.adelielinux.com&gt; as a step-by-step guide to turn a Gentoo
System into a High Performance Computing (HPC) system.
</abstract>

<version>1.6</version>
<date>2006-12-18</date>

<chapter>
<title>Introduction</title>
<section>
<body>

<p>
Gentoo Linux is a special flavor of Linux that can be automatically optimized
and customized for just about any application or need. Extreme performance,
configurability and a top-notch user and developer community are all hallmarks
of the Gentoo experience.
</p>

<p>
Thanks to a technology called Portage, Gentoo Linux can become an ideal secure
server, development workstation, professional desktop, gaming system, embedded
solution or... a High Performance Computing system. Because of its
near-unlimited adaptability, we call Gentoo Linux a metadistribution.
</p>

<p>
This document explains how to turn a Gentoo system into a High Performance
Computing system. Step by step, it explains what packages one may want to
install and helps configure them.
</p>

<p>
Obtain Gentoo Linux from the website <uri>http://www.gentoo.org</uri>, and
refer to the <uri link="/doc/en/">documentation</uri> at the same location to
install it.
</p>

</body>
</section>
</chapter>

<chapter>
<title>Configuring Gentoo Linux for Clustering</title>
<section>
<title>Recommended Optimizations</title>
<body>

<note>
We refer to the <uri link="/doc/en/handbook/">Gentoo Linux Handbooks</uri> in
this section.
</note>

<p>
During the installation process, you will have to set your USE variables in
<path>/etc/make.conf</path>. We recommend that you deactivate all the
defaults (see <path>/etc/make.profile/make.defaults</path>) by negating them in
make.conf. However, you may want to keep such USE variables as x86, 3dnow, gpm,
mmx, nptl, nptlonly, sse, ncurses, pam and tcpd. Refer to the USE documentation
for more information.
</p>

<pre caption="USE Flags">
USE="-oss 3dnow -apm -arts -avi -berkdb -crypt -cups -encode -gdbm -gif gpm -gtk
-imlib -java -jpeg -kde -gnome -libg++ -libwww -mikmod mmx -motif -mpeg ncurses
-nls nptl nptlonly -oggvorbis -opengl pam -pdflib -png -python -qt3 -qt4 -qtmt
-quicktime -readline -sdl -slang -spell -ssl -svga tcpd -truetype -X -xml2 -xv
-zlib"
</pre>

<p>
Or simply:
</p>

<pre caption="USE Flags - simplified version">
USE="-* 3dnow gpm mmx ncurses pam sse tcpd"
</pre>

<note>
The <e>tcpd</e> USE flag increases security for packages such as xinetd.
</note>

<p>
In step 15 ("Installing the kernel and a System Logger"), we recommend the
vanilla-sources for stability reasons; these are the official kernel sources
released on <uri>http://www.kernel.org/</uri>. Use them unless you require
special support such as xfs.
</p>

<pre caption="Installing vanilla-sources">
# <i>emerge -p syslog-ng vanilla-sources</i>
</pre>

<p>
When you install miscellaneous packages, we recommend installing the
following:
</p>

<pre caption="Installing necessary packages">
# <i>emerge -p nfs-utils portmap tcpdump ssmtp iptables xinetd</i>
</pre>

</body>
</section>
<section>
<title>Communication Layer (TCP/IP Network)</title>
<body>

<p>
A cluster requires a communication layer to interconnect the slave nodes to
the master node. Typically, a Fast Ethernet or Gigabit Ethernet LAN is used,
since it offers a good price/performance ratio. Other possibilities include
use of products like <uri link="http://www.myricom.com/">Myrinet</uri>, <uri
link="http://quadrics.com/">QsNet</uri> or others.
</p>

<p>
A cluster is composed of two node types: master and slave. Typically, your
cluster will have one master node and several slave nodes.
</p>

<p>
The master node is the cluster's server. It is responsible for telling the
slave nodes what to do. This server will typically run such daemons as dhcpd,
nfs, pbs-server, and pbs-sched. Your master node will allow interactive
sessions for users, and accept job executions.
</p>

<p>
The slave nodes listen for instructions (via ssh/rsh perhaps) from the master
node. They should be dedicated to crunching results and therefore should not
run any unnecessary services.
</p>

<p>
The rest of this documentation will assume a cluster configuration as per the
hosts file below. You should maintain such a hosts file
(<path>/etc/hosts</path>) on every node, with entries for each node
participating in the cluster.
</p>

<pre caption="/etc/hosts">
# Adelie Linux Research &amp; Development Center
# /etc/hosts

127.0.0.1       localhost

192.168.1.100   master.adelie master

192.168.1.1     node01.adelie node01
192.168.1.2     node02.adelie node02
</pre>
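
<p>
Keeping this file identical on every node is easiest if you edit it once on
the master and copy it out. A minimal sketch, assuming you copy it with
<c>scp</c> to the node names defined above (you will be prompted for each
node's root password until SSH keys are set up later in this guide):
</p>

<pre caption="Copying /etc/hosts to the slave nodes">
# <i>scp /etc/hosts node01:/etc/hosts</i>
# <i>scp /etc/hosts node02:/etc/hosts</i>
</pre>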

<p>
To set up your cluster's dedicated LAN, edit your <path>/etc/conf.d/net</path>
file on the master node.
</p>

<pre caption="/etc/conf.d/net">
# Global config file for net.* rc-scripts

# This is basically the ifconfig argument without the ifconfig $iface
#

iface_eth0="192.168.1.100 broadcast 192.168.1.255 netmask 255.255.255.0"
# Network connection to the outside world using dhcp -- configure as required for your network
iface_eth1="dhcp"
</pre>
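
<p>
The slave nodes obtain their addresses from the master's DHCP server
(configured below), so their network configuration stays trivial. A minimal
sketch of a slave node's <path>/etc/conf.d/net</path>, assuming the cluster
NIC on the nodes is eth0:
</p>

<pre caption="Slave node /etc/conf.d/net">
# Cluster LAN -- address assigned by the master's DHCP server
iface_eth0="dhcp"
</pre>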

<p>
Finally, set up a DHCP daemon on the master node to avoid having to maintain a
network configuration on each slave node.
</p>

<pre caption="/etc/dhcp/dhcpd.conf">
# Adelie Linux Research &amp; Development Center
# /etc/dhcp/dhcpd.conf

log-facility local7;
ddns-update-style none;
use-host-decl-names on;

subnet 192.168.1.0 netmask 255.255.255.0 {
  option domain-name "adelie";
  range 192.168.1.10 192.168.1.99;
  option routers 192.168.1.100;

  host node01.adelie {
    # MAC address of network card on node 01
    hardware ethernet 00:07:e9:0f:e2:d4;
    fixed-address 192.168.1.1;
  }
  host node02.adelie {
    # MAC address of network card on node 02
    hardware ethernet 00:07:e9:0f:e2:6b;
    fixed-address 192.168.1.2;
  }
}
</pre>
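
<p>
Remember to start the DHCP server on the master and have it come back after a
reboot. A sketch, assuming the DHCP server package installs its init script as
<path>/etc/init.d/dhcpd</path> (the script name may differ with other DHCP
packages):
</p>

<pre caption="Starting the DHCP server">
# <i>rc-update add dhcpd default</i>
# <i>/etc/init.d/dhcpd start</i>
</pre>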

</body>
</section>
<section>
<title>NFS/NIS</title>
<body>

<p>
The Network File System (NFS) was developed to allow machines to mount a disk
partition on a remote machine as if it were on a local hard drive. This allows
for fast, seamless sharing of files across a network.
</p>

<p>
There are other systems that provide similar functionality to NFS which could
be used in a cluster environment. The <uri
link="http://www.openafs.org">Andrew File System
from IBM</uri>, recently open-sourced, provides a file sharing mechanism with
some additional security and performance features. The <uri
link="http://www.coda.cs.cmu.edu/">Coda File System</uri> is still in
development, but is designed to work well with disconnected clients. Many
of the features of the Andrew and Coda file systems are slated for inclusion
in the next version of <uri link="http://www.nfsv4.org">NFS (Version 4)</uri>.
The advantage of NFS today is that it is mature, standard, well understood,
and supported robustly across a variety of platforms.
</p>

<pre caption="Ebuilds for NFS-support">
# <i>emerge -p nfs-utils portmap</i>
# <i>emerge nfs-utils portmap</i>
</pre>

<p>
Configure and install a kernel to support NFS v3 on all nodes:
</p>

<pre caption="Required Kernel Configurations for NFS">
CONFIG_NFS_FS=y
CONFIG_NFSD=y
CONFIG_SUNRPC=y
CONFIG_LOCKD=y
CONFIG_NFSD_V3=y
CONFIG_LOCKD_V4=y
</pre>

<p>
On the master node, edit your <path>/etc/hosts.allow</path> file to allow
connections from slave nodes. If your cluster LAN is on 192.168.1.0/24,
your <path>hosts.allow</path> will look like:
</p>

<pre caption="hosts.allow">
portmap:192.168.1.0/255.255.255.0
</pre>

<p>
Edit the <path>/etc/exports</path> file of the master node to export a work
directory structure (/home is good for this).
</p>

<pre caption="/etc/exports">
/home/  *(rw)
</pre>

<p>
Add nfs to your master node's default runlevel:
</p>

<pre caption="Adding NFS to the default runlevel">
# <i>rc-update add nfs default</i>
</pre>
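
<p>
<c>rc-update</c> only arranges for the service to start at boot. To start
serving the export right away, you can also launch the init script by hand:
</p>

<pre caption="Starting NFS immediately">
# <i>/etc/init.d/nfs start</i>
</pre>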

<p>
To mount the nfs exported filesystem from the master, you also have to
configure your slave nodes' <path>/etc/fstab</path>. Add a line like this
one:
</p>

<pre caption="/etc/fstab">
master:/home/  /home  nfs  rw,exec,noauto,nouser,async  0 0
</pre>
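
<p>
You can check the entry by mounting the share manually on a slave node (the
nfsmount init script added below will take care of this at boot):
</p>

<pre caption="Testing the NFS mount on a slave node">
# <i>mount /home</i>
</pre>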

<p>
You'll also need to set up your nodes so that they mount the nfs filesystem by
issuing this command:
</p>

<pre caption="Adding nfsmount to the default runlevel">
# <i>rc-update add nfsmount default</i>
</pre>

</body>
</section>
<section>
<title>RSH/SSH</title>
<body>

<p>
SSH is a protocol for secure remote login and other secure network services
over an insecure network. OpenSSH uses public key cryptography to provide
secure authentication. The first step in configuring OpenSSH on the cluster is
to generate the key pair: the public key, which is shared with remote systems,
and the private key, which is kept on the local system.
</p>

<p>
For transparent cluster usage, private/public keys may be used. This process
has two steps:
</p>

<ul>
<li>Generate public and private keys</li>
<li>Copy public key to slave nodes</li>
</ul>

<p>
For user based authentication, generate and copy as follows:
</p>

<pre caption="SSH key authentication">
# <i>ssh-keygen -t dsa</i>
Generating public/private dsa key pair.
Enter file in which to save the key (/root/.ssh/id_dsa): /root/.ssh/id_dsa
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_dsa.
Your public key has been saved in /root/.ssh/id_dsa.pub.
The key fingerprint is:
f1:45:15:40:fd:3c:2d:f7:9f:ea:55:df:76:2f:a4:1f root@master

<comment>WARNING! If you already have an "authorized_keys" file,
please append to it, do not use the following command.</comment>

# <i>scp /root/.ssh/id_dsa.pub node01:/root/.ssh/authorized_keys</i>
root@master's password:
id_dsa.pub   100%  234   2.0MB/s   00:00

# <i>scp /root/.ssh/id_dsa.pub node02:/root/.ssh/authorized_keys</i>
root@master's password:
id_dsa.pub   100%  234   2.0MB/s   00:00
</pre>
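
<p>
If your OpenSSH installation ships the <c>ssh-copy-id</c> helper script, you
can use it instead; it appends the key to an existing
<path>authorized_keys</path> file rather than overwriting it:
</p>

<pre caption="Copying the key with ssh-copy-id">
# <i>ssh-copy-id -i /root/.ssh/id_dsa.pub node01</i>
# <i>ssh-copy-id -i /root/.ssh/id_dsa.pub node02</i>
</pre>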

<note>
Host keys must have an empty passphrase. RSA is required for host-based
authentication.
</note>

<p>
For host-based authentication, you will also need to edit your
<path>/etc/ssh/shosts.equiv</path>.
</p>

<pre caption="/etc/ssh/shosts.equiv">
node01.adelie
node02.adelie
master.adelie
</pre>

<p>
You will also need to make a few modifications to the
<path>/etc/ssh/sshd_config</path> file:
</p>

<pre caption="sshd configurations">
# $OpenBSD: sshd_config,v 1.42 2001/09/20 20:57:51 mouring Exp $
# This sshd was compiled with PATH=/usr/bin:/bin:/usr/sbin:/sbin

# This is the sshd server system-wide configuration file. See sshd(8)
# for more information.

# HostKeys for protocol version 2
HostKey /etc/ssh/ssh_host_rsa_key
</pre>
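
<p>
Depending on your OpenSSH version, host-based logins may also have to be
switched on explicitly in <path>/etc/ssh/sshd_config</path>. A sketch (not
part of the excerpt above):
</p>

<pre caption="Enabling host-based authentication">
# Allow hosts listed in shosts.equiv to authenticate with their host key
HostbasedAuthentication yes
</pre>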

<p>
If your applications require RSH communications, you will need to emerge
net-misc/netkit-rsh and sys-apps/xinetd.
</p>

<pre caption="Installing necessary applications">
# <i>emerge -p xinetd</i>
# <i>emerge xinetd</i>
# <i>emerge -p netkit-rsh</i>
# <i>emerge netkit-rsh</i>
</pre>

<p>
Then configure the rsh daemon. Edit your <path>/etc/xinetd.d/rsh</path> file.
</p>

<pre caption="rsh">
# Adelie Linux Research &amp; Development Center
# /etc/xinetd.d/rsh

service shell
{
  socket_type     = stream
  protocol        = tcp
  wait            = no
  user            = root
  group           = tty
  server          = /usr/sbin/in.rshd
  log_type        = FILE /var/log/rsh
  log_on_success  = PID HOST USERID EXIT DURATION
  log_on_failure  = USERID ATTEMPT
  disable         = no
}
</pre>

<p>
Edit your <path>/etc/hosts.allow</path> to permit rsh connections:
</p>

<pre caption="hosts.allow">
# Adelie Linux Research &amp; Development Center
# /etc/hosts.allow

in.rshd:192.168.1.0/255.255.255.0
</pre>

<p>
Or you can simply trust your cluster LAN:
</p>

<pre caption="hosts.allow">
# Adelie Linux Research &amp; Development Center
# /etc/hosts.allow

ALL:192.168.1.0/255.255.255.0
</pre>

<p>
Finally, configure host authentication from <path>/etc/hosts.equiv</path>.
</p>

<pre caption="hosts.equiv">
# Adelie Linux Research &amp; Development Center
# /etc/hosts.equiv

master
node01
node02
</pre>

<p>
Then add xinetd to your default runlevel:
</p>

<pre caption="Adding xinetd to the default runlevel">
# <i>rc-update add xinetd default</i>
</pre>

</body>
</section>
<section>
<title>NTP</title>
<body>

<p>
The Network Time Protocol (NTP) is used to synchronize the time of a computer
client or server to another server or reference time source, such as a radio
or satellite receiver or modem. It provides accuracies typically within a
millisecond on LANs and up to a few tens of milliseconds on WANs relative to
Coordinated Universal Time (UTC), using, for example, a Global Positioning
System (GPS) receiver as a reference. Typical NTP configurations utilize
multiple redundant servers and diverse network paths in order to achieve high
accuracy and reliability.
</p>

<p>
Select an NTP server geographically close to you from <uri
link="http://www.eecis.udel.edu/~mills/ntp/servers.html">Public NTP Time
Servers</uri>, and configure your <path>/etc/conf.d/ntp</path> and
<path>/etc/ntp.conf</path> files on the master node.
</p>

<pre caption="Master /etc/conf.d/ntp">
# /etc/conf.d/ntpd

# NOTES:
# - NTPDATE variables below are used if you wish to set your
#   clock when you start the ntp init.d script
# - make sure that the NTPDATE_CMD will close by itself ...
#   the init.d script will not attempt to kill/stop it
# - ntpd will be used to maintain synchronization with a time
#   server regardless of what NTPDATE is set to
# - read each of the comments above each of the variable

# Comment this out if you dont want the init script to warn
# about not having ntpdate setup
NTPDATE_WARN="n"

# Command to run to set the clock initially
# Most people should just uncomment this line ...
# however, if you know what you're doing, and you
# want to use ntpd to set the clock, change this to 'ntpd'
NTPDATE_CMD="ntpdate"

# Options to pass to the above command
# Most people should just uncomment this variable and
# change 'someserver' to a valid hostname which you
# can acquire from the URL's below
NTPDATE_OPTS="-b ntp1.cmc.ec.gc.ca"

##
# A list of available servers is available here:
# http://www.eecis.udel.edu/~mills/ntp/servers.html
# Please follow the rules of engagement and use a
# Stratum 2 server (unless you qualify for Stratum 1)
##

# Options to pass to the ntpd process that will *always* be run
# Most people should not uncomment this line ...
# however, if you know what you're doing, feel free to tweak
#NTPD_OPTS=""

</pre>

<p>
Edit your <path>/etc/ntp.conf</path> file on the master to set up an external
synchronization source:
</p>

<pre caption="Master ntp.conf">
# Adelie Linux Research &amp; Development Center
# /etc/ntp.conf

# Synchronization source #1
server ntp1.cmc.ec.gc.ca
restrict ntp1.cmc.ec.gc.ca
# Synchronization source #2
server ntp2.cmc.ec.gc.ca
restrict ntp2.cmc.ec.gc.ca
stratum 10
driftfile /etc/ntp.drift.server
logfile /var/log/ntp
broadcast 192.168.1.255
restrict default kod
restrict 127.0.0.1
restrict 192.168.1.0 mask 255.255.255.0
</pre>

<p>
On all your slave nodes, set up your master node as the synchronization
source.
</p>

<pre caption="Node /etc/conf.d/ntp">
# /etc/conf.d/ntpd

NTPDATE_WARN="n"
NTPDATE_CMD="ntpdate"
NTPDATE_OPTS="-b master"
</pre>

<pre caption="Node ntp.conf">
# Adelie Linux Research &amp; Development Center
# /etc/ntp.conf

# Synchronization source #1
server master
restrict master
stratum 11
driftfile /etc/ntp.drift.server
logfile /var/log/ntp
restrict default kod
restrict 127.0.0.1
</pre>

<p>
Then add ntpd to the default runlevel of all your nodes:
</p>

<pre caption="Adding ntpd to the default runlevel">
# <i>rc-update add ntpd default</i>
</pre>

<note>
NTP will not update the local clock if the time difference between your
synchronization source and the local clock is too great.
</note>
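
<p>
In that case, you can step the clock once by hand before starting ntpd. For
example, on a slave node:
</p>

<pre caption="Setting the clock manually">
# <i>ntpdate -b master</i>
</pre>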

</body>
</section>
<section>
<title>IPTABLES</title>
<body>

<p>
To set up a firewall on your cluster, you will need iptables.
</p>

<pre caption="Installing iptables">
# <i>emerge -p iptables</i>
# <i>emerge iptables</i>
</pre>

<p>
Required kernel configuration:
</p>

<pre caption="IPtables kernel configuration">
CONFIG_NETFILTER=y
CONFIG_IP_NF_CONNTRACK=y
CONFIG_IP_NF_IPTABLES=y
CONFIG_IP_NF_MATCH_STATE=y
CONFIG_IP_NF_FILTER=y
CONFIG_IP_NF_TARGET_REJECT=y
CONFIG_IP_NF_NAT=y
CONFIG_IP_NF_NAT_NEEDED=y
CONFIG_IP_NF_TARGET_MASQUERADE=y
CONFIG_IP_NF_TARGET_LOG=y
</pre>

<p>
And the rules required for this firewall:
</p>

<pre caption="rule-save">
# Adelie Linux Research &amp; Development Center
# /var/lib/iptables/rule-save

*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT -p tcp -m tcp --dport 22 -j ACCEPT
-A INPUT -s 192.168.1.0/255.255.255.0 -i eth0 -j ACCEPT
-A INPUT -s 127.0.0.1 -i lo -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -j LOG
-A INPUT -j REJECT --reject-with icmp-port-unreachable
COMMIT
*nat
:PREROUTING ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A POSTROUTING -s 192.168.1.0/255.255.255.0 -j MASQUERADE
COMMIT
</pre>
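
<p>
Since the master also masquerades traffic for the slave nodes, IP forwarding
must be enabled on it. A sketch; the second line makes the setting persistent
across reboots via <path>/etc/sysctl.conf</path>:
</p>

<pre caption="Enabling IP forwarding on the master">
# <i>echo 1 > /proc/sys/net/ipv4/ip_forward</i>
# <i>echo "net.ipv4.ip_forward = 1" >> /etc/sysctl.conf</i>
</pre>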

<p>
Then add iptables to the default runlevel of all your nodes:
</p>

<pre caption="Adding iptables to the default runlevel">
# <i>rc-update add iptables default</i>
</pre>

</body>
</section>
</chapter>

<chapter>
<title>HPC Tools</title>
<section>
<title>OpenPBS</title>
<body>

<p>
The Portable Batch System (PBS) is a flexible batch queueing and workload
management system originally developed for NASA. It operates on networked,
multi-platform UNIX environments, including heterogeneous clusters of
workstations, supercomputers, and massively parallel systems. Development of
PBS is provided by Altair Grid Technologies.
</p>

<pre caption="Installing openpbs">
# <i>emerge -p openpbs</i>
</pre>

<note>
The OpenPBS ebuild does not currently set proper permissions on the
var directories used by OpenPBS.
</note>

<p>
Before you start using OpenPBS, some configuration is required. The files
you will need to personalize for your system are:
</p>

<ul>
<li>/etc/pbs_environment</li>
<li>/var/spool/PBS/server_name</li>
<li>/var/spool/PBS/server_priv/nodes</li>
<li>/var/spool/PBS/mom_priv/config</li>
<li>/var/spool/PBS/sched_priv/sched_config</li>
</ul>

<p>
Here is a sample queue and server configuration. These commands are normally
fed to the PBS server with <c>qmgr</c>:
</p>

<pre caption="Sample queue and server configuration">
#
# Create queues and set their attributes.
#
#
# Create and define queue upto4nodes
#
create queue upto4nodes
set queue upto4nodes queue_type = Execution
set queue upto4nodes Priority = 100
set queue upto4nodes resources_max.nodect = 4
set queue upto4nodes resources_min.nodect = 1
set queue upto4nodes enabled = True
set queue upto4nodes started = True
#
# Create and define queue default
#
create queue default
set queue default queue_type = Route
set queue default route_destinations = upto4nodes
set queue default enabled = True
set queue default started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_host_enable = True
set server default_queue = default
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server resources_default.neednodes = 1
set server resources_default.nodect = 1
set server resources_default.nodes = 1
set server scheduler_iteration = 60
</pre>
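
<p>
One way to load such a configuration into a running server is to save the
commands in a file and pipe it into <c>qmgr</c>; the file name below is just
an example:
</p>

<pre caption="Loading the configuration with qmgr">
# <i>qmgr &lt; pbs_queue_config</i>
</pre>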

<p>
To submit a task to OpenPBS, the command <c>qsub</c> is used with some
optional parameters. In the example below, "-l" allows you to specify
the resources required, "-j oe" merges standard output and standard error
into a single stream, and "-m abe" will e-mail the user at the beginning (b),
end (e) and on abort (a) of the job.
</p>

<pre caption="Submitting a task">
<comment>(submit and request from OpenPBS that myscript be executed on 2 nodes)</comment>
# <i>qsub -l nodes=2 -j oe -m abe myscript</i>
</pre>

<p>
Normally jobs submitted to OpenPBS are in the form of scripts. Sometimes, you
may want to try a task manually. To request an interactive shell from OpenPBS,
use the "-I" parameter.
</p>

<pre caption="Requesting an interactive shell">
# <i>qsub -I</i>
</pre>

<p>
To check the status of your jobs, use the qstat command:
</p>

<pre caption="Checking the status of the jobs">
# <i>qstat</i>
Job id  Name   User    Time Use S Queue
------  ----   ----    -------- - -----
2.geist STDIN  adelie  0        R upto1nodes
</pre>
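
<p>
Should you need to remove a job from the queue (or kill a running one), you
can use <c>qdel</c> with the job id reported by <c>qstat</c>, for example:
</p>

<pre caption="Deleting a job">
# <i>qdel 2.geist</i>
</pre>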

</body>
</section>
<section>
<title>MPICH</title>
<body>

<p>
Message passing is a paradigm used widely on certain classes of parallel
machines, especially those with distributed memory. MPICH is a freely
available, portable implementation of MPI, the Standard for message-passing
libraries.
</p>

<p>
The mpich ebuild provided by Adelie Linux allows for two USE flags:
<e>doc</e> and <e>crypt</e>. <e>doc</e> will cause documentation to be
installed, while <e>crypt</e> will configure MPICH to use <c>ssh</c> instead
of <c>rsh</c>.
</p>

<pre caption="Installing the mpich application">
# <i>emerge -p mpich</i>
# <i>emerge mpich</i>
</pre>

<p>
You may need to export an MPICH work directory to all your slave nodes in
<path>/etc/exports</path>:
</p>

<pre caption="/etc/exports">
/home  *(rw)
</pre>

<p>
Most massively parallel processors (MPPs) provide a way to start a program on
a requested number of processors; <c>mpirun</c> makes use of the appropriate
command whenever possible. In contrast, workstation clusters require that each
process in a parallel job be started individually, though programs to help
start these processes exist. Because workstation clusters are not already
organized as an MPP, additional information is required to make use of them.
Mpich should be installed with a list of participating workstations in the
file <path>machines.LINUX</path> in the directory
<path>/usr/share/mpich/</path>. This file is used by <c>mpirun</c> to choose
processors to run on.
</p>

<p>
Edit this file to reflect your cluster LAN configuration:
</p>

<pre caption="/usr/share/mpich/machines.LINUX">
# Change this file to contain the machines that you want to use
# to run MPI jobs on. The format is one host name per line, with either
# hostname
# or
# hostname:n
# where n is the number of processors in an SMP. The hostname should
# be the same as the result from the command "hostname"
master
node01
node02
# node03
# node04
# ...
</pre>

<p>
Use the script <c>tstmachines</c> in <path>/usr/sbin/</path> to ensure that
you can use all of the machines that you have listed. This script performs
an <c>rsh</c> and a short directory listing; this tests both that you have
access to the node and that a program in the current directory is visible on
the remote node. If there are any problems, they will be listed. These
problems must be fixed before proceeding.
</p>

<p>
The only argument to <c>tstmachines</c> is the name of the architecture; this
is the same name as the extension on the machines file. For example, the
following tests that a program in the current directory can be executed by
all of the machines in the LINUX machines list.
</p>

<pre caption="Running a test">
# <i>/usr/sbin/tstmachines LINUX</i>
</pre>

<note>
This program is silent if all is well; if you want to see what it is doing,
use the -v (for verbose) argument:
</note>

<pre caption="Running a test verbosively">
# <i>/usr/sbin/tstmachines -v LINUX</i>
</pre>

<p>
The output from this command might look like:
</p>

<pre caption="Output of the above command">
Trying true on host1.uoffoo.edu ...
Trying true on host2.uoffoo.edu ...
Trying ls on host1.uoffoo.edu ...
Trying ls on host2.uoffoo.edu ...
Trying user program on host1.uoffoo.edu ...
Trying user program on host2.uoffoo.edu ...
</pre>

<p>
If <c>tstmachines</c> finds a problem, it will suggest possible reasons and
solutions. In brief, there are three tests:
</p>

<ul>
  <li>
    <e>Can processes be started on remote machines?</e> tstmachines attempts
    to run the shell command true on each machine in the machines file by
    using the remote shell command.
  </li>
  <li>
    <e>Is the current working directory available to all machines?</e> This
    attempts to ls a file that tstmachines creates by running ls using the
    remote shell command.
  </li>
  <li>
    <e>Can user programs be run on remote systems?</e> This checks that shared
    libraries and other components have been properly installed on all
    machines.
  </li>
</ul>

<p>
And the obligatory first test for every development tool:
</p>

<pre caption="Testing a development tool">
# <i>cd ~</i>
# <i>cp /usr/share/mpich/examples1/hello++.c ~</i>
# <i>make hello++</i>
# <i>mpirun -machinefile /usr/share/mpich/machines.LINUX -np 1 hello++</i>
</pre>
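
<p>
Once the single-process run works, you can spread the job over several of the
hosts listed in <path>machines.LINUX</path> by raising the process count, for
example:
</p>

<pre caption="Running on several nodes">
# <i>mpirun -machinefile /usr/share/mpich/machines.LINUX -np 4 hello++</i>
</pre>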

<p>
For further information on MPICH, consult the documentation at <uri
link="http://www-unix.mcs.anl.gov/mpi/mpich/docs/mpichman-chp4/mpichman-chp4.htm">http://www-unix.mcs.anl.gov/mpi/mpich/docs/mpichman-chp4/mpichman-chp4.htm</uri>.
</p>

</body>
</section>
<section>
<title>LAM</title>
<body>

<p>
(Coming Soon!)
</p>

</body>
</section>
<section>
<title>OMNI</title>
<body>

<p>
(Coming Soon!)
</p>

</body>
</section>
</chapter>

<chapter>
<title>Bibliography</title>
<section>
<body>

<p>
The original document is published at the <uri
link="http://www.adelielinux.com">Adelie Linux R&amp;D Centre</uri> web site,
and is reproduced here with the permission of the authors and <uri
link="http://www.cyberlogic.ca">Cyberlogic</uri>'s Adelie Linux R&amp;D
Centre.
</p>

<ul>
  <li><uri>http://www.gentoo.org</uri>, Gentoo Foundation, Inc.</li>
  <li>
    <uri link="http://www.adelielinux.com">http://www.adelielinux.com</uri>,
    Adelie Linux Research and Development Centre
  </li>
  <li>
    <uri link="http://nfs.sourceforge.net/">http://nfs.sourceforge.net</uri>,
    Linux NFS Project
  </li>
  <li>
    <uri link="http://www-unix.mcs.anl.gov/mpi/mpich/">http://www-unix.mcs.anl.gov/mpi/mpich/</uri>,
    Mathematics and Computer Science Division, Argonne National Laboratory
  </li>
  <li>
    <uri link="http://www.ntp.org/">http://ntp.org</uri>
  </li>
  <li>
    <uri link="http://www.eecis.udel.edu/~mills/">http://www.eecis.udel.edu/~mills/</uri>,
    David L. Mills, University of Delaware
  </li>
  <li>
    <uri link="http://www.ietf.org/html.charters/secsh-charter.html">http://www.ietf.org/html.charters/secsh-charter.html</uri>,
    Secure Shell Working Group, IETF, Internet Society
  </li>
  <li>
    <uri link="http://www.linuxsecurity.com/">http://www.linuxsecurity.com/</uri>,
    Guardian Digital
  </li>
  <li>
    <uri link="http://www.openpbs.org/">http://www.openpbs.org/</uri>,
    Altair Grid Technologies, LLC.
  </li>
</ul>

</body>
</section>
</chapter>

</guide>
