<?xml version='1.0' encoding="UTF-8"?>

<!-- $Header$ -->

<!DOCTYPE guide SYSTEM "/dtd/guide.dtd">
<guide link="hpc-howto.xml">

<title>High Performance Computing on Gentoo Linux</title>

<author title="Author">
<mail link="marc@adelielinux.com">Marc St-Pierre</mail>
</author>
<author title="Author">
<mail link="benoit@adelielinux.com">Benoit Morin</mail>
</author>
<author title="Assistant/Research">
<mail link="jean-francois@adelielinux.com">Jean-Francois Richard</mail>
</author>
<author title="Assistant/Research">
<mail link="olivier@adelielinux.com">Olivier Crete</mail>
</author>
<author title="Reviewer">
<mail link="spyderous@gentoo.org">Donnie Berkholz</mail>
</author>

<!-- No licensing information; this document has been written by a third-party
  organisation without additional licensing information.

  In other words, this is copyright adelielinux R&D; Gentoo only has
  permission to distribute this document as-is and update it when appropriate
  as long as the adelie linux R&D notice stays
-->
<abstract>
This document was written by people at the Adelie Linux R&amp;D Center
&lt;http://www.adelielinux.com&gt; as a step-by-step guide to turning a
Gentoo system into a High Performance Computing (HPC) system.
</abstract>

<version>1.0</version>
<date>August 1, 2003</date>
<chapter>
<title>Introduction</title>
<section>
<body>

<p>
Gentoo Linux is a special flavor of Linux that can be automatically optimized
and customized for just about any application or need. Extreme performance,
configurability and a top-notch user and developer community are all hallmarks
of the Gentoo experience.
</p>

<p>
Thanks to a technology called Portage, Gentoo Linux can become an ideal secure
server, development workstation, professional desktop, gaming system, embedded
solution or... a High Performance Computing system. Because of its
near-unlimited adaptability, we call Gentoo Linux a metadistribution.
</p>

<p>
This document explains how to turn a Gentoo system into a High Performance
Computing system. Step by step, it explains which packages you may want to
install and helps you configure them.
</p>

<p>
Obtain Gentoo Linux from the website <uri
link="http://www.gentoo.org/">www.gentoo.org</uri>, and refer to the <uri
link="http://www.gentoo.org/main/en/docs.xml">documentation</uri> at the same
location to install it.
</p>

</body>
</section>
</chapter>
<chapter>
<title>Configuring Gentoo Linux for Clustering</title>
<section>
<title>Recommended Optimizations</title>
<body>

<note>
We refer to the <uri
link="http://www.gentoo.org/doc/en/handbook">Gentoo Linux Handbooks</uri> in
this section.
</note>

<p>
During the installation process, you will have to set your USE variables in
<path>/etc/make.conf</path>. We recommend that you deactivate all the
defaults (see <path>/etc/make.profile/make.defaults</path>) by negating them
in make.conf. However, you may want to keep USE variables such as x86, 3dnow,
gpm, mmx, sse, ncurses, pam and tcpd. Refer to the USE documentation for more
information.
</p>
<pre caption="USE Flags">
# Copyright 2000-2003 Daniel Robbins, Gentoo Technologies, Inc.
# Contains local system settings for Portage system

# Please review 'man make.conf' for more information.

USE="-oss 3dnow -apm -arts -avi -berkdb -crypt -cups -encode -gdbm
-gif gpm -gtk -imlib -java -jpeg -kde -gnome -libg++ -libwww -mikmod
mmx -motif -mpeg ncurses -nls -oggvorbis -opengl pam -pdflib -png
-python -qt -qtmt -quicktime -readline -sdl -slang -spell -ssl
-svga tcpd -truetype -X -xml2 -xmms -xv -zlib"
</pre>

<p>
Or simply:
</p>

<pre caption="USE Flags - simplified version">
# Copyright 2000-2003 Daniel Robbins, Gentoo Technologies, Inc.
# Contains local system settings for Portage system

# Please review 'man make.conf' for more information.

USE="-* 3dnow gpm mmx ncurses pam sse tcpd"
</pre>

<note>
The <e>tcpd</e> USE flag increases security for packages such as xinetd.
</note>
<p>
In step 15 ("Installing the kernel and a System Logger"), we recommend the
vanilla-sources, the official kernel sources released on
<uri>http://www.kernel.org/</uri>, for stability reasons, unless you require
special support such as xfs.
</p>

<pre caption="Installing vanilla-sources">
# <i>emerge -p syslog-ng vanilla-sources</i>
</pre>

<p>
When you install miscellaneous packages, we recommend installing the
following:
</p>

<pre caption="Installing necessary packages">
# <i>emerge -p nfs-utils portmap tcpdump ssmtp iptables xinetd</i>
</pre>

</body>
</section>
<section>
<title>Communication Layer (TCP/IP Network)</title>
<body>

<p>
A cluster requires a communication layer to interconnect the slave nodes to
the master node. Typically, a Fast Ethernet or Gigabit Ethernet LAN is used,
since these have a good price/performance ratio. Other possibilities include
products such as <uri link="http://www.myricom.com/">Myrinet</uri>, <uri
link="http://quadrics.com/">QsNet</uri> or others.
</p>

<p>
A cluster is composed of two node types: master and slave. Typically, your
cluster will have one master node and several slave nodes.
</p>

<p>
The master node is the cluster's server. It is responsible for telling the
slave nodes what to do. This server will typically run such daemons as dhcpd,
nfs, pbs-server, and pbs-sched. Your master node will allow interactive
sessions for users, and accept job executions.
</p>

<p>
The slave nodes listen for instructions (via ssh/rsh perhaps) from the master
node. They should be dedicated to crunching results and therefore should not
run any unnecessary services.
</p>

<p>
The rest of this documentation assumes a cluster configuration as per the
hosts file below. You should maintain on every node a hosts file
(<path>/etc/hosts</path>) with entries for each node participating in the
cluster.
</p>

<pre caption="/etc/hosts">
# Adelie Linux Research &amp; Development Center
# /etc/hosts

127.0.0.1 localhost

192.168.1.100 master.adelie master

192.168.1.1 node01.adelie node01
192.168.1.2 node02.adelie node02
</pre>
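<p>
One simple way to keep this file identical on every node is to copy it from
the master. The commands below are only a sketch and assume that you have
root SSH access to the nodes named above:
</p>

<pre caption="Copying /etc/hosts to the nodes (example)">
# <i>scp /etc/hosts node01:/etc/hosts</i>
# <i>scp /etc/hosts node02:/etc/hosts</i>
</pre>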
<p>
To set up your cluster's dedicated LAN, edit your <path>/etc/conf.d/net</path>
file on the master node.
</p>

<pre caption="/etc/conf.d/net">
# Copyright 1999-2002 Gentoo Technologies, Inc.
# Distributed under the terms of the GNU General Public License, v2 or later

# Global config file for net.* rc-scripts

# This is basically the ifconfig argument without the ifconfig $iface
#

iface_eth0="192.168.1.100 broadcast 192.168.1.255 netmask 255.255.255.0"
# Network connection to the outside world using dhcp -- configure as required for your network
iface_eth1="dhcp"
</pre>

<p>
Finally, set up a DHCP daemon on the master node to avoid having to maintain a
network configuration on each slave node.
</p>

<pre caption="/etc/dhcp/dhcpd.conf">
# Adelie Linux Research &amp; Development Center
# /etc/dhcp/dhcpd.conf

log-facility local7;
ddns-update-style none;
use-host-decl-names on;

subnet 192.168.1.0 netmask 255.255.255.0 {
  option domain-name "adelie";
  range 192.168.1.10 192.168.1.99;
  option routers 192.168.1.100;

  host node01.adelie {
    # MAC address of network card on node 01
    hardware ethernet 00:07:e9:0f:e2:d4;
    fixed-address 192.168.1.1;
  }
  host node02.adelie {
    # MAC address of network card on node 02
    hardware ethernet 00:07:e9:0f:e2:6b;
    fixed-address 192.168.1.2;
  }
}
</pre>
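<p>
The DHCP server itself is not part of the packages listed earlier. A minimal
sketch, assuming you use the ISC DHCP server from the <c>net-misc/dhcp</c>
package and its <path>/etc/init.d/dhcpd</path> init script:
</p>

<pre caption="Installing and starting the DHCP server (example)">
# <i>emerge -p dhcp</i>
# <i>emerge dhcp</i>
# <i>rc-update add dhcpd default</i>
# <i>/etc/init.d/dhcpd start</i>
</pre>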
</body>
</section>
<section>
<title>NFS/NIS</title>
<body>

<p>
The Network File System (NFS) was developed to allow machines to mount a disk
partition on a remote machine as if it were on a local hard drive. This allows
for fast, seamless sharing of files across a network.
</p>

<p>
There are other systems that provide similar functionality to NFS which could
be used in a cluster environment. The <uri
link="http://www.transarc.com/Product/EFS/AFS/index.html">Andrew File System
from IBM</uri>, recently open-sourced, provides a file sharing mechanism with
some additional security and performance features. The <uri
link="http://www.coda.cs.cmu.edu/">Coda File System</uri> is still in
development, but is designed to work well with disconnected clients. Many
of the features of the Andrew and Coda file systems are slated for inclusion
in the next version of <uri link="http://www.nfsv4.org">NFS (Version 4)</uri>.
The advantage of NFS today is that it is mature, standard, well understood,
and supported robustly across a variety of platforms.
</p>

<pre caption="Ebuilds for NFS-support">
# <i>emerge -p nfs-utils portmap</i>
# <i>emerge nfs-utils portmap</i>
</pre>

<p>
Configure and install a kernel to support NFS v3 on all nodes:
</p>

<pre caption="Required Kernel Configurations for NFS">
CONFIG_NFS_FS=y
CONFIG_NFSD=y
CONFIG_SUNRPC=y
CONFIG_LOCKD=y
CONFIG_NFSD_V3=y
CONFIG_LOCKD_V4=y
</pre>

<p>
On the master node, edit your <path>/etc/hosts.allow</path> file to allow
connections from slave nodes. If your cluster LAN is on 192.168.1.0/24,
your <path>hosts.allow</path> will look like:
</p>

<pre caption="hosts.allow">
portmap:192.168.1.0/255.255.255.0
</pre>

<p>
Edit the <path>/etc/exports</path> file of the master node to export a work
directory structure (/home is good for this).
</p>

<pre caption="/etc/exports">
/home/ *(rw)
</pre>

<p>
Add nfs to your master node's default runlevel:
</p>

<pre caption="Adding NFS to the default runlevel">
# <i>rc-update add nfs default</i>
</pre>

<p>
To mount the nfs exported filesystem from the master, you also have to
configure your slave nodes' <path>/etc/fstab</path>. Add a line like this
one:
</p>

<pre caption="/etc/fstab">
master:/home/ /home nfs rw,exec,noauto,nouser,async 0 0
</pre>

<p>
You'll also need to set up your nodes so that they mount the nfs filesystem at
boot by adding nfsmount to their default runlevel:
</p>

<pre caption="Adding nfsmount to the default runlevel">
# <i>rc-update add nfsmount default</i>
</pre>
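<p>
Once these services are in place, a quick sanity check is worthwhile. The
following is only an example using the host names from this guide; adjust it
to your own cluster:
</p>

<pre caption="Checking the NFS export and mount (example)">
<comment>(On the master node)</comment>
# <i>/etc/init.d/nfs start</i>
# <i>exportfs</i>
<comment>(On a slave node)</comment>
# <i>mount master:/home /home</i>
</pre>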
</body>
</section>
<section>
<title>RSH/SSH</title>
<body>

<p>
SSH is a protocol for secure remote login and other secure network services
over an insecure network. OpenSSH uses public key cryptography to provide
secure authentication. The first step in configuring OpenSSH on the cluster
is generating the public key, which is shared with remote systems, and the
private key, which is kept on the local system.
</p>

<p>
For transparent cluster usage, private/public keys may be used. This process
has two steps:
</p>

<ul>
<li>Generate public and private keys</li>
<li>Copy the public key to the slave nodes</li>
</ul>

<p>
For user-based authentication, generate and copy the keys as follows:
</p>

<pre caption="SSH key authentication">
# <i>ssh-keygen -t dsa</i>
Generating public/private dsa key pair.
Enter file in which to save the key (/root/.ssh/id_dsa): /root/.ssh/id_dsa
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_dsa.
Your public key has been saved in /root/.ssh/id_dsa.pub.
The key fingerprint is:
f1:45:15:40:fd:3c:2d:f7:9f:ea:55:df:76:2f:a4:1f root@master

<comment>WARNING! If you already have an "authorized_keys" file,
please append to it, do not use the following command.</comment>

# <i>scp /root/.ssh/id_dsa.pub node01:/root/.ssh/authorized_keys</i>
root@master's password:
id_dsa.pub 100% 234 2.0MB/s 00:00

# <i>scp /root/.ssh/id_dsa.pub node02:/root/.ssh/authorized_keys</i>
root@master's password:
id_dsa.pub 100% 234 2.0MB/s 00:00
</pre>

<note>
Host keys must have an empty passphrase. RSA is required for host-based
authentication.
</note>

<p>
For host-based authentication, you will also need to edit your
<path>/etc/ssh/shosts.equiv</path>.
</p>

<pre caption="/etc/ssh/shosts.equiv">
node01.adelie
node02.adelie
master.adelie
</pre>

<p>
And make a few modifications to the <path>/etc/ssh/sshd_config</path> file:
</p>

<pre caption="sshd configurations">
# $OpenBSD: sshd_config,v 1.42 2001/09/20 20:57:51 mouring Exp $
# This sshd was compiled with PATH=/usr/bin:/bin:/usr/sbin:/sbin

# This is the sshd server system-wide configuration file. See sshd(8)
# for more information.

# HostKeys for protocol version 2
HostKey /etc/ssh/ssh_host_rsa_key
</pre>
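<p>
Depending on your OpenSSH version and its compile-time defaults, host-based
authentication may also need to be enabled explicitly. The directive below is
only an example; check <c>man sshd_config</c> on your systems before relying
on it:
</p>

<pre caption="Enabling host-based authentication (example)">
<comment>(In /etc/ssh/sshd_config on the nodes that accept logins)</comment>
HostbasedAuthentication yes
</pre>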
<p>
If your applications require RSH communications, you will need to emerge
net-misc/netkit-rsh and sys-apps/xinetd.
</p>

<pre caption="Installing necessary applications">
# <i>emerge -p xinetd</i>
# <i>emerge xinetd</i>
# <i>emerge -p netkit-rsh</i>
# <i>emerge netkit-rsh</i>
</pre>

<p>
Then configure the rsh daemon. Edit your <path>/etc/xinetd.d/rsh</path> file.
</p>

<pre caption="rsh">
# Adelie Linux Research &amp; Development Center
# /etc/xinetd.d/rsh

service shell
{
  socket_type     = stream
  protocol        = tcp
  wait            = no
  user            = root
  group           = tty
  server          = /usr/sbin/in.rshd
  log_type        = FILE /var/log/rsh
  log_on_success  = PID HOST USERID EXIT DURATION
  log_on_failure  = USERID ATTEMPT
  disable         = no
}
</pre>

<p>
Edit your <path>/etc/hosts.allow</path> to permit rsh connections:
</p>

<pre caption="hosts.allow">
# Adelie Linux Research &amp; Development Center
# /etc/hosts.allow

in.rshd:192.168.1.0/255.255.255.0
</pre>

<p>
Or you can simply trust your cluster LAN:
</p>

<pre caption="hosts.allow">
# Adelie Linux Research &amp; Development Center
# /etc/hosts.allow

ALL:192.168.1.0/255.255.255.0
</pre>

<p>
Finally, configure host authentication in <path>/etc/hosts.equiv</path>.
</p>

<pre caption="hosts.equiv">
# Adelie Linux Research &amp; Development Center
# /etc/hosts.equiv

master
node01
node02
</pre>

<p>
And add xinetd to your default runlevel:
</p>

<pre caption="Adding xinetd to the default runlevel">
# <i>rc-update add xinetd default</i>
</pre>
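<p>
Before moving on, it is worth checking that the master can reach each node
non-interactively, since tools such as MPICH rely on this. For example, using
the host names from this guide:
</p>

<pre caption="Testing remote shell access (example)">
# <i>ssh node01 hostname</i>
node01
# <i>rsh node02 hostname</i>
node02
</pre>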
</body>
</section>
<section>
<title>NTP</title>
<body>

<p>
The Network Time Protocol (NTP) is used to synchronize the time of a computer
client or server to another server or reference time source, such as a radio
or satellite receiver or modem. It provides accuracies typically within a
millisecond on LANs and up to a few tens of milliseconds on WANs relative to
Coordinated Universal Time (UTC) via a Global Positioning Service (GPS)
receiver, for example. Typical NTP configurations utilize multiple redundant
servers and diverse network paths in order to achieve high accuracy and
reliability.
</p>

<p>
Select an NTP server geographically close to you from the <uri
link="http://www.eecis.udel.edu/~mills/ntp/servers.html">Public NTP Time
Servers</uri> list, and configure your <path>/etc/conf.d/ntp</path> and
<path>/etc/ntp.conf</path> files on the master node.
</p>
<pre caption="Master /etc/conf.d/ntp">
# Copyright 1999-2002 Gentoo Technologies, Inc.
# Distributed under the terms of the GNU General Public License v2
# /etc/conf.d/ntpd

# NOTES:
# - NTPDATE variables below are used if you wish to set your
# clock when you start the ntp init.d script
# - make sure that the NTPDATE_CMD will close by itself ...
# the init.d script will not attempt to kill/stop it
# - ntpd will be used to maintain synchronization with a time
# server regardless of what NTPDATE is set to
# - read each of the comments above each of the variable

# Comment this out if you dont want the init script to warn
# about not having ntpdate setup
NTPDATE_WARN="n"

# Command to run to set the clock initially
# Most people should just uncomment this line ...
# however, if you know what you're doing, and you
# want to use ntpd to set the clock, change this to 'ntpd'
NTPDATE_CMD="ntpdate"

# Options to pass to the above command
# Most people should just uncomment this variable and
# change 'someserver' to a valid hostname which you
# can aquire from the URL's below
NTPDATE_OPTS="-b ntp1.cmc.ec.gc.ca"

##
# A list of available servers is available here:
# http://www.eecis.udel.edu/~mills/ntp/servers.html
# Please follow the rules of engagement and use a
# Stratum 2 server (unless you qualify for Stratum 1)
##

# Options to pass to the ntpd process that will *always* be run
# Most people should not uncomment this line ...
# however, if you know what you're doing, feel free to tweak
#NTPD_OPTS=""
</pre>
<p>
Edit your <path>/etc/ntp.conf</path> file on the master to set up an external
synchronization source:
</p>

<pre caption="Master ntp.conf">
# Adelie Linux Research &amp; Development Center
# /etc/ntp.conf

# Synchronization source #1
server ntp1.cmc.ec.gc.ca
restrict ntp1.cmc.ec.gc.ca
# Synchronization source #2
server ntp2.cmc.ec.gc.ca
restrict ntp2.cmc.ec.gc.ca
stratum 10
driftfile /etc/ntp.drift.server
logfile /var/log/ntp
broadcast 192.168.1.255
restrict default kod
restrict 127.0.0.1
restrict 192.168.1.0 mask 255.255.255.0
</pre>

<p>
And on all your slave nodes, set up your synchronization source to be your
master node.
</p>

<pre caption="Node /etc/conf.d/ntp">
# Copyright 1999-2002 Gentoo Technologies, Inc.
# Distributed under the terms of the GNU General Public License v2
# /etc/conf.d/ntpd

NTPDATE_WARN="n"
NTPDATE_CMD="ntpdate"
NTPDATE_OPTS="-b master"
</pre>

<pre caption="Node ntp.conf">
# Adelie Linux Research &amp; Development Center
# /etc/ntp.conf

# Synchronization source #1
server master
restrict master
stratum 11
driftfile /etc/ntp.drift.server
logfile /var/log/ntp
restrict default kod
restrict 127.0.0.1
</pre>

<p>
Then add ntpd to the default runlevel of all your nodes:
</p>

<pre caption="Adding ntpd to the default runlevel">
# <i>rc-update add ntpd default</i>
</pre>

<note>
NTP will not update the local clock if the time difference between your
synchronization source and the local clock is too great.
</note>
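<p>
If a node's clock is too far off for ntpd to correct it, you can set it once
by hand and then restart the daemon. For example, on a slave node (on the
master, use your external time server instead of <c>master</c>):
</p>

<pre caption="Setting the clock manually (example)">
# <i>/etc/init.d/ntpd stop</i>
# <i>ntpdate -b master</i>
# <i>/etc/init.d/ntpd start</i>
</pre>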
</body>
</section>
<section>
<title>IPTABLES</title>
<body>

<p>
To set up a firewall on your cluster, you will need iptables.
</p>

<pre caption="Installing iptables">
# <i>emerge -p iptables</i>
# <i>emerge iptables</i>
</pre>

<p>
Required kernel configuration:
</p>

<pre caption="IPtables kernel configuration">
CONFIG_NETFILTER=y
CONFIG_IP_NF_CONNTRACK=y
CONFIG_IP_NF_IPTABLES=y
CONFIG_IP_NF_MATCH_STATE=y
CONFIG_IP_NF_FILTER=y
CONFIG_IP_NF_TARGET_REJECT=y
CONFIG_IP_NF_NAT=y
CONFIG_IP_NF_NAT_NEEDED=y
CONFIG_IP_NF_TARGET_MASQUERADE=y
CONFIG_IP_NF_TARGET_LOG=y
</pre>

<p>
And the rules required for this firewall:
</p>

<pre caption="rule-save">
# Adelie Linux Research &amp; Development Center
# /var/lib/iptables/rule-save

*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT -p tcp -m tcp --dport 22 -j ACCEPT
-A INPUT -s 192.168.1.0/255.255.255.0 -i eth1 -j ACCEPT
-A INPUT -s 127.0.0.1 -i lo -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -j LOG
-A INPUT -j REJECT --reject-with icmp-port-unreachable
COMMIT
*nat
:PREROUTING ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A POSTROUTING -s 192.168.1.0/255.255.255.0 -j MASQUERADE
COMMIT
</pre>
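<p>
The MASQUERADE rule above only has an effect if the master node actually
forwards packets between its two interfaces. IP forwarding is disabled by
default; one way to enable it (add it to your sysctl or local start-up
configuration to make it permanent) is:
</p>

<pre caption="Enabling IP forwarding (example)">
# <i>echo 1 > /proc/sys/net/ipv4/ip_forward</i>
</pre>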
<p>
Then add iptables to the default runlevel of all your nodes:
</p>

<pre caption="Adding iptables to the default runlevel">
# <i>rc-update add iptables default</i>
</pre>
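<p>
The iptables init script loads its rules from the rule-save file shown above.
If you build your rule set interactively with <c>iptables</c> commands
instead, you can write out the active rules with <c>iptables-save</c>; the
exact file location may vary with the version of the init script:
</p>

<pre caption="Saving the active rules (example)">
# <i>iptables-save > /var/lib/iptables/rule-save</i>
</pre>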
</body>
</section>
</chapter>

<chapter>
<title>HPC Tools</title>
<section>
<title>OpenPBS</title>
<body>

<p>
The Portable Batch System (PBS) is a flexible batch queueing and workload
management system originally developed for NASA. It operates on networked,
multi-platform UNIX environments, including heterogeneous clusters of
workstations, supercomputers, and massively parallel systems. Development of
PBS is provided by Altair Grid Technologies.
</p>

<pre caption="Installing openpbs">
# <i>emerge -p openpbs</i>
</pre>

<note>
The OpenPBS ebuild does not currently set proper permissions on the
var-directories used by OpenPBS.
</note>

<p>
Before you start using OpenPBS, some configuration is required. The files
you will need to personalize for your system are:
</p>

<ul>
<li>/etc/pbs_environment</li>
<li>/var/spool/PBS/server_name</li>
<li>/var/spool/PBS/server_priv/nodes</li>
<li>/var/spool/PBS/mom_priv/config</li>
<li>/var/spool/PBS/sched_priv/sched_config</li>
</ul>
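<p>
The exact contents of these files depend on your site. As a rough,
hypothetical example for the two-node cluster used throughout this guide,
<path>server_name</path> simply names the master node and
<path>server_priv/nodes</path> lists the compute nodes:
</p>

<pre caption="Example /var/spool/PBS/server_name">
master
</pre>

<pre caption="Example /var/spool/PBS/server_priv/nodes">
node01 np=1
node02 np=1
</pre>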
<p>
Here is a sample sched_config:
</p>

<pre caption="/var/spool/PBS/sched_priv/sched_config">
#
# Create queues and set their attributes.
#
#
# Create and define queue upto4nodes
#
create queue upto4nodes
set queue upto4nodes queue_type = Execution
set queue upto4nodes Priority = 100
set queue upto4nodes resources_max.nodect = 4
set queue upto4nodes resources_min.nodect = 1
set queue upto4nodes enabled = True
set queue upto4nodes started = True
#
# Create and define queue default
#
create queue default
set queue default queue_type = Route
set queue default route_destinations = upto4nodes
set queue default enabled = True
set queue default started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_host_enable = True
set server default_queue = default
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server resources_default.neednodes = 1
set server resources_default.nodect = 1
set server resources_default.nodes = 1
set server scheduler_iteration = 60
</pre>

<p>
To submit a task to OpenPBS, the command <c>qsub</c> is used with some
optional parameters. In the example below, "-l" allows you to specify
the resources required, "-j" provides for redirection of standard out and
standard error, and the "-m" will e-mail the user at beginning (b), end (e)
and on abort (a) of the job.
</p>
<pre caption="Submitting a task">
<comment>(submit and request from OpenPBS that myscript be executed on 2 nodes)</comment>
# <i>qsub -l nodes=2 -j oe -m abe myscript</i>
</pre>

<p>
Normally jobs submitted to OpenPBS are in the form of scripts. Sometimes, you
may want to try a task manually. To request an interactive shell from OpenPBS,
use the "-I" parameter.
</p>

<pre caption="Requesting an interactive shell">
# <i>qsub -I</i>
</pre>

<p>
To check the status of your jobs, use the qstat command:
</p>

<pre caption="Checking the status of the jobs">
# <i>qstat</i>
Job id   Name   User    Time Use S Queue
------   ----   ----    -------- - -----
2.geist  STDIN  adelie  0        R upto1nodes
</pre>

</body>
</section>
<section>
<title>MPICH</title>
<body>

<p>
Message passing is a paradigm used widely on certain classes of parallel
machines, especially those with distributed memory. MPICH is a freely
available, portable implementation of MPI, the Standard for message-passing
libraries.
</p>

<p>
The mpich ebuild provided by Adelie Linux allows for two USE flags:
<e>doc</e> and <e>crypt</e>. <e>doc</e> will cause documentation to be
installed, while <e>crypt</e> will configure MPICH to use <c>ssh</c> instead
of <c>rsh</c>.
</p>

<pre caption="Installing the mpich application">
# <i>emerge -p mpich</i>
# <i>emerge mpich</i>
</pre>
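<p>
If you want the <e>doc</e> and <e>crypt</e> flags described above, one way is
to enable them just for this emerge; you can also add them to the USE variable
in <path>/etc/make.conf</path>:
</p>

<pre caption="Emerging mpich with USE flags (example)">
# <i>USE="crypt doc" emerge mpich</i>
</pre>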
<p>
You may need to export an MPICH work directory to all your slave nodes in
<path>/etc/exports</path>:
</p>

<pre caption="/etc/exports">
/home *(rw)
</pre>

<p>
Most massively parallel processors (MPPs) provide a way to start a program on
a requested number of processors; <c>mpirun</c> makes use of the appropriate
command whenever possible. In contrast, workstation clusters require that each
process in a parallel job be started individually, though programs to help
start these processes exist. Because workstation clusters are not already
organized as an MPP, additional information is required to make use of them.
Mpich should be installed with a list of participating workstations in the
file <path>machines.LINUX</path> in the directory
<path>/usr/share/mpich/</path>. This file is used by <c>mpirun</c> to choose
processors to run on.
</p>

<p>
Edit this file to reflect your cluster LAN configuration:
</p>

<pre caption="/usr/share/mpich/machines.LINUX">
# Change this file to contain the machines that you want to use
# to run MPI jobs on. The format is one host name per line, with either
# hostname
# or
# hostname:n
# where n is the number of processors in an SMP. The hostname should
# be the same as the result from the command "hostname"
master
node01
node02
# node03
# node04
# ...
</pre>

<p>
Use the script <c>tstmachines</c> in <path>/usr/sbin/</path> to ensure that
you can use all of the machines that you have listed. This script performs
an <c>rsh</c> and a short directory listing; this tests that you both have
access to the node and that a program in the current directory is visible on
the remote node. If there are any problems, they will be listed. These
problems must be fixed before proceeding.
</p>

<p>
The only argument to <c>tstmachines</c> is the name of the architecture; this
is the same name as the extension on the machines file. For example, the
following tests that a program in the current directory can be executed by
all of the machines in the LINUX machines list.
</p>

<pre caption="Running a test">
# <i>/usr/local/mpich/sbin/tstmachines LINUX</i>
</pre>

<note>
This program is silent if all is well; if you want to see what it is doing,
use the -v (for verbose) argument:
</note>

<pre caption="Running a test verbosively">
# <i>/usr/local/mpich/sbin/tstmachines -v LINUX</i>
</pre>
<p>
The output from this command might look like:
</p>

<pre caption="Output of the above command">
Trying true on host1.uoffoo.edu ...
Trying true on host2.uoffoo.edu ...
Trying ls on host1.uoffoo.edu ...
Trying ls on host2.uoffoo.edu ...
Trying user program on host1.uoffoo.edu ...
Trying user program on host2.uoffoo.edu ...
</pre>

<p>
If <c>tstmachines</c> finds a problem, it will suggest possible reasons and
solutions. In brief, there are three tests:
</p>

<ul>
<li>
  <e>Can processes be started on remote machines?</e> tstmachines attempts
  to run the shell command true on each machine in the machines file by
  using the remote shell command.
</li>
<li>
  <e>Is the current working directory available to all machines?</e> This
  attempts to ls a file that tstmachines creates by running ls using the
  remote shell command.
</li>
<li>
  <e>Can user programs be run on remote systems?</e> This checks that shared
  libraries and other components have been properly installed on all
  machines.
</li>
</ul>

<p>
And the required test for every development tool:
</p>

<pre caption="Testing a development tool">
# <i>cd ~</i>
# <i>cp /usr/share/mpich/examples1/hello++.c ~</i>
# <i>make hello++</i>
# <i>mpirun -machinefile /usr/share/mpich/machines.LINUX -np 1 hello++</i>
</pre>

<p>
For further information on MPICH, consult the documentation at <uri
link="http://www-unix.mcs.anl.gov/mpi/mpich/docs/mpichman-chp4/mpichman-chp4.htm">http://www-unix.mcs.anl.gov/mpi/mpich/docs/mpichman-chp4/mpichman-chp4.htm</uri>.
</p>

</body>
</section>
<section>
<title>LAM</title>
<body>

<p>
(Coming Soon!)
</p>

</body>
</section>
<section>
<title>OMNI</title>
<body>

<p>
(Coming Soon!)
</p>

</body>
</section>
</chapter>
<chapter>
<title>Bibliography</title>
<section>
<body>

<p>
The original document is published at the <uri
link="http://www.adelielinux.com">Adelie Linux R&amp;D Centre</uri> web site,
and is reproduced here with the permission of the authors and <uri
link="http://www.cyberlogic.ca">Cyberlogic</uri>'s Adelie Linux R&amp;D
Centre.
</p>

<ul>
<li>
  <uri link="http://www.gentoo.org">http://www.gentoo.org</uri>, Gentoo
  Technologies, Inc.
</li>
<li>
  <uri link="http://www.adelielinux.com">http://www.adelielinux.com</uri>,
  Adelie Linux Research and Development Centre
</li>
<li>
  <uri link="http://nfs.sourceforge.net/">http://nfs.sourceforge.net</uri>,
  Linux NFS Project
</li>
<li>
  <uri link="http://www-unix.mcs.anl.gov/mpi/mpich/">http://www-unix.mcs.anl.gov/mpi/mpich/</uri>,
  Mathematics and Computer Science Division, Argonne National Laboratory
</li>
<li>
  <uri link="http://www.ntp.org/">http://ntp.org</uri>
</li>
<li>
  <uri link="http://www.eecis.udel.edu/~mills/">http://www.eecis.udel.edu/~mills/</uri>,
  David L. Mills, University of Delaware
</li>
<li>
  <uri link="http://www.ietf.org/html.charters/secsh-charter.html">http://www.ietf.org/html.charters/secsh-charter.html</uri>,
  Secure Shell Working Group, IETF, Internet Society
</li>
<li>
  <uri link="http://www.linuxsecurity.com/">http://www.linuxsecurity.com/</uri>,
  Guardian Digital
</li>
<li>
  <uri link="http://www.openpbs.org/">http://www.openpbs.org/</uri>,
  Altair Grid Technologies, LLC.
</li>
</ul>

</body>
</section>
</chapter>

</guide>
