<?xml version='1.0' encoding="UTF-8"?>

<!-- $Header: /var/cvsroot/gentoo/xml/htdocs/doc/en/hpc-howto.xml,v 1.3 2005/05/13 20:15:50 neysx Exp $ -->

<!DOCTYPE guide SYSTEM "/dtd/guide.dtd">
<guide link="/doc/en/hpc-howto.xml">

<title>High Performance Computing on Gentoo Linux</title>

<author title="Author">
  <mail link="marc@adelielinux.com">Marc St-Pierre</mail>
</author>
<author title="Author">
  <mail link="benoit@adelielinux.com">Benoit Morin</mail>
</author>
<author title="Assistant/Research">
  <mail link="jean-francois@adelielinux.com">Jean-Francois Richard</mail>
</author>
<author title="Assistant/Research">
  <mail link="olivier@adelielinux.com">Olivier Crete</mail>
</author>
<author title="Reviewer">
  <mail link="spyderous@gentoo.org">Donnie Berkholz</mail>
</author>

<!-- No licensing information; this document has been written by a third-party
  organisation without additional licensing information.

  In other words, this is copyright adelielinux R&D; Gentoo only has
  permission to distribute this document as-is and update it when appropriate
  as long as the adelie linux R&D notice stays
-->

<abstract>
This document was written by people at the Adelie Linux R&amp;D Center
&lt;http://www.adelielinux.com&gt; as a step-by-step guide to turn a Gentoo
system into a High Performance Computing (HPC) system.
</abstract>

<version>1.1</version>
<date>2003-08-01</date>

<chapter>
<title>Introduction</title>
<section>
<body>

<p>
Gentoo Linux is a special flavor of Linux that can be automatically optimized
and customized for just about any application or need. Extreme performance,
configurability and a top-notch user and developer community are all hallmarks
of the Gentoo experience.
</p>

<p>
Thanks to a technology called Portage, Gentoo Linux can become an ideal secure
server, development workstation, professional desktop, gaming system, embedded
solution or... a High Performance Computing system. Because of its
near-unlimited adaptability, we call Gentoo Linux a metadistribution.
</p>

<p>
This document explains how to turn a Gentoo system into a High Performance
Computing system. Step by step, it explains which packages you may want to
install and helps you configure them.
</p>

<p>
Obtain Gentoo Linux from the website <uri>http://www.gentoo.org</uri>, and
refer to the <uri link="/doc/en/">documentation</uri> at the same location to
install it.
</p>

</body>
</section>
</chapter>

<chapter>
<title>Configuring Gentoo Linux for Clustering</title>
<section>
<title>Recommended Optimizations</title>
<body>

<note>
We refer to the <uri link="/doc/en/handbook/">Gentoo Linux Handbooks</uri> in
this section.
</note>

<p>
During the installation process, you will have to set your USE variables in
<path>/etc/make.conf</path>. We recommend that you deactivate all the
defaults (see <path>/etc/make.profile/make.defaults</path>) by negating them
in make.conf. However, you may want to keep USE flags such as x86, 3dnow,
gpm, mmx, sse, ncurses, pam and tcpd. Refer to the USE documentation for more
information.
</p>

<pre caption="USE Flags">
USE="-oss 3dnow -apm -arts -avi -berkdb -crypt -cups -encode -gdbm
-gif gpm -gtk -imlib -java -jpeg -kde -gnome -libg++ -libwww -mikmod
mmx -motif -mpeg ncurses -nls -oggvorbis -opengl pam -pdflib -png
-python -qt -qtmt -quicktime -readline -sdl -slang -spell -ssl
-svga tcpd -truetype -X -xml2 -xmms -xv -zlib"
</pre>

<p>
Or simply:
</p>

<pre caption="USE Flags - simplified version">
USE="-* 3dnow gpm mmx ncurses pam sse tcpd"
</pre>

<note>
The <e>tcpd</e> USE flag increases security for packages such as xinetd.
</note>

<p>
In step 15 ("Installing the kernel and a System Logger"), for stability
reasons, we recommend the vanilla-sources, the official kernel sources
released on <uri>http://www.kernel.org/</uri>, unless you require special
support such as xfs.
</p>

<pre caption="Installing vanilla-sources">
# <i>emerge -p syslog-ng vanilla-sources</i>
</pre>

<p>
When you install miscellaneous packages, we recommend installing the
following:
</p>

<pre caption="Installing necessary packages">
# <i>emerge -p nfs-utils portmap tcpdump ssmtp iptables xinetd</i>
</pre>

</body>
</section>
<section>
<title>Communication Layer (TCP/IP Network)</title>
<body>

<p>
A cluster requires a communication layer to interconnect the slave nodes to
the master node. Typically, a Fast Ethernet or Gigabit Ethernet LAN can be
used since they have a good price/performance ratio. Other possibilities
include products like <uri link="http://www.myricom.com/">Myrinet</uri>, <uri
link="http://quadrics.com/">QsNet</uri> or others.
</p>

<p>
A cluster is composed of two node types: master and slave. Typically, your
cluster will have one master node and several slave nodes.
</p>

<p>
The master node is the cluster's server. It is responsible for telling the
slave nodes what to do. This server will typically run such daemons as dhcpd,
nfs, pbs-server, and pbs-sched. Your master node will allow interactive
sessions for users, and accept job executions.
</p>

<p>
The slave nodes listen for instructions (via ssh/rsh perhaps) from the master
node. They should be dedicated to crunching results and therefore should not
run any unnecessary services.
</p>

<p>
The rest of this documentation assumes a cluster configured as per the hosts
file below. You should maintain such a hosts file (<path>/etc/hosts</path>)
on every node, with an entry for each node participating in the cluster.
</p>

<pre caption="/etc/hosts">
# Adelie Linux Research &amp; Development Center
# /etc/hosts

127.0.0.1       localhost

192.168.1.100   master.adelie master

192.168.1.1     node01.adelie node01
192.168.1.2     node02.adelie node02
</pre>

<p>
To set up your cluster's dedicated LAN, edit your <path>/etc/conf.d/net</path>
file on the master node.
</p>

<pre caption="/etc/conf.d/net">
# Copyright 1999-2002 Gentoo Technologies, Inc.
# Distributed under the terms of the GNU General Public License, v2 or later

# Global config file for net.* rc-scripts

# This is basically the ifconfig argument without the ifconfig $iface
#

iface_eth0="192.168.1.100 broadcast 192.168.1.255 netmask 255.255.255.0"
# Network connection to the outside world using dhcp -- configure as required
# for your network
iface_eth1="dhcp"
</pre>

<p>
Finally, set up a DHCP daemon on the master node to avoid having to maintain a
network configuration on each slave node.
</p>

<pre caption="/etc/dhcp/dhcpd.conf">
# Adelie Linux Research &amp; Development Center
# /etc/dhcp/dhcpd.conf

log-facility local7;
ddns-update-style none;
use-host-decl-names on;

subnet 192.168.1.0 netmask 255.255.255.0 {
  option domain-name "adelie";
  range 192.168.1.10 192.168.1.99;
  option routers 192.168.1.100;

  host node01.adelie {
    # MAC address of network card on node 01
    hardware ethernet 00:07:e9:0f:e2:d4;
    fixed-address 192.168.1.1;
  }
  host node02.adelie {
    # MAC address of network card on node 02
    hardware ethernet 00:07:e9:0f:e2:6b;
    fixed-address 192.168.1.2;
  }
}
</pre>
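
<p>
Do not forget to start the DHCP daemon on the master node and add it to the
default runlevel. The init script name used below is an assumption and may
differ depending on the version of the net-misc/dhcp ebuild you have
installed; adjust it to whatever script the package puts in
<path>/etc/init.d/</path>.
</p>

<pre caption="Starting the DHCP daemon (init script name may vary)">
# <i>rc-update add dhcpd default</i>
# <i>/etc/init.d/dhcpd start</i>
</pre>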

</body>
</section>
<section>
<title>NFS/NIS</title>
<body>

<p>
The Network File System (NFS) was developed to allow machines to mount a disk
partition on a remote machine as if it were on a local hard drive. This allows
for fast, seamless sharing of files across a network.
</p>

<p>
There are other systems that provide similar functionality to NFS which could
be used in a cluster environment. The <uri
link="http://www.transarc.com/Product/EFS/AFS/index.html">Andrew File System
from IBM</uri>, recently open-sourced, provides a file sharing mechanism with
some additional security and performance features. The <uri
link="http://www.coda.cs.cmu.edu/">Coda File System</uri> is still in
development, but is designed to work well with disconnected clients. Many
of the features of the Andrew and Coda file systems are slated for inclusion
in the next version of <uri link="http://www.nfsv4.org">NFS (Version 4)</uri>.
The advantage of NFS today is that it is mature, standard, well understood,
and supported robustly across a variety of platforms.
</p>

<pre caption="Ebuilds for NFS-support">
# <i>emerge -p nfs-utils portmap</i>
# <i>emerge nfs-utils portmap</i>
</pre>

<p>
Configure and install a kernel to support NFS v3 on all nodes:
</p>

<pre caption="Required Kernel Configurations for NFS">
CONFIG_NFS_FS=y
CONFIG_NFSD=y
CONFIG_SUNRPC=y
CONFIG_LOCKD=y
CONFIG_NFSD_V3=y
CONFIG_LOCKD_V4=y
</pre>

<p>
On the master node, edit your <path>/etc/hosts.allow</path> file to allow
connections from slave nodes. If your cluster LAN is on 192.168.1.0/24,
your <path>hosts.allow</path> will look like:
</p>

<pre caption="hosts.allow">
portmap:192.168.1.0/255.255.255.0
</pre>

<p>
Edit the <path>/etc/exports</path> file of the master node to export a work
directory structure (<path>/home</path> is good for this).
</p>

<pre caption="/etc/exports">
/home/       *(rw)
</pre>

<p>
Add nfs to your master node's default runlevel:
</p>

<pre caption="Adding NFS to the default runlevel">
# <i>rc-update add nfs default</i>
</pre>
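
<p>
Once the nfs service is running on the master node, you can check that the
export is actually visible. This is an optional sanity check, not a required
step; <c>showmount</c> is part of nfs-utils.
</p>

<pre caption="Checking the export (optional)">
# <i>/etc/init.d/nfs start</i>
# <i>showmount -e master</i>
</pre>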

<p>
To mount the nfs-exported filesystem from the master, you also have to
configure the <path>/etc/fstab</path> of your slave nodes. Add a line like
this one:
</p>

<pre caption="/etc/fstab">
master:/home/   /home   nfs     rw,exec,noauto,nouser,async     0 0
</pre>
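
<p>
Before relying on the fstab entry, you may want to test the mount by hand from
a slave node; again this is only meant to confirm that the export and the
network configuration behave as expected.
</p>

<pre caption="Testing the NFS mount manually (optional)">
# <i>mount -t nfs master:/home /home</i>
</pre>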

<p>
You'll also need to set up your slave nodes so that they mount the nfs
filesystem at boot time, by adding nfsmount to their default runlevel:
</p>

<pre caption="Adding nfsmount to the default runlevel">
# <i>rc-update add nfsmount default</i>
</pre>

</body>
</section>
<section>
<title>RSH/SSH</title>
<body>

<p>
SSH is a protocol for secure remote login and other secure network services
over an insecure network. OpenSSH uses public key cryptography to provide
secure authentication. The first step in configuring OpenSSH on the cluster
is to generate the public key, which is shared with remote systems, and the
private key, which is kept on the local system.
</p>

<p>
For transparent cluster usage, private/public keys may be used. This process
has two steps:
</p>

<ul>
  <li>Generate public and private keys</li>
  <li>Copy the public key to the slave nodes</li>
</ul>

<p>
For user-based authentication, generate and copy as follows:
</p>

<pre caption="SSH key authentication">
# <i>ssh-keygen -t dsa</i>
Generating public/private dsa key pair.
Enter file in which to save the key (/root/.ssh/id_dsa): /root/.ssh/id_dsa
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_dsa.
Your public key has been saved in /root/.ssh/id_dsa.pub.
The key fingerprint is:
f1:45:15:40:fd:3c:2d:f7:9f:ea:55:df:76:2f:a4:1f root@master

<comment>WARNING! If you already have an "authorized_keys" file,
please append to it, do not use the following command.</comment>

# <i>scp /root/.ssh/id_dsa.pub node01:/root/.ssh/authorized_keys</i>
root@master's password:
id_dsa.pub   100%  234   2.0MB/s   00:00

# <i>scp /root/.ssh/id_dsa.pub node02:/root/.ssh/authorized_keys</i>
root@master's password:
id_dsa.pub   100%  234   2.0MB/s   00:00
</pre>

<note>
Host keys must have an empty passphrase. RSA is required for host-based
authentication.
</note>

<p>
For host-based authentication, you will also need to edit your
<path>/etc/ssh/shosts.equiv</path>.
</p>

<pre caption="/etc/ssh/shosts.equiv">
node01.adelie
node02.adelie
master.adelie
</pre>

<p>
And make a few modifications to the <path>/etc/ssh/sshd_config</path> file:
</p>

<pre caption="sshd configurations">
# $OpenBSD: sshd_config,v 1.42 2001/09/20 20:57:51 mouring Exp $
# This sshd was compiled with PATH=/usr/bin:/bin:/usr/sbin:/sbin

# This is the sshd server system-wide configuration file. See sshd(8)
# for more information.

# HostKeys for protocol version 2
HostKey /etc/ssh/ssh_host_rsa_key
</pre>
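
<p>
The exact set of directives needed for host-based authentication depends on
your OpenSSH version and is not fully covered by the snippet above. As a
rough sketch, you would typically also enable host-based authentication
explicitly on the server (<path>/etc/ssh/sshd_config</path>) and on the
clients (<path>/etc/ssh/ssh_config</path>); check the sshd_config(5) and
ssh_config(5) man pages before applying this.
</p>

<pre caption="Possible host-based authentication directives (sketch)">
<comment>(In /etc/ssh/sshd_config on the nodes accepting logins)</comment>
HostbasedAuthentication yes

<comment>(In /etc/ssh/ssh_config on the connecting nodes)</comment>
HostbasedAuthentication yes
EnableSSHKeysign yes
</pre>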

<p>
If your application requires RSH communication, you will need to emerge
net-misc/netkit-rsh and sys-apps/xinetd.
</p>

<pre caption="Installing necessary applications">
# <i>emerge -p xinetd</i>
# <i>emerge xinetd</i>
# <i>emerge -p netkit-rsh</i>
# <i>emerge netkit-rsh</i>
</pre>

<p>
Then configure the rsh daemon. Edit your <path>/etc/xinetd.d/rsh</path> file.
</p>

<pre caption="rsh">
# Adelie Linux Research &amp; Development Center
# /etc/xinetd.d/rsh

service shell
{
  socket_type     = stream
  protocol        = tcp
  wait            = no
  user            = root
  group           = tty
  server          = /usr/sbin/in.rshd
  log_type        = FILE /var/log/rsh
  log_on_success  = PID HOST USERID EXIT DURATION
  log_on_failure  = USERID ATTEMPT
  disable         = no
}
</pre>

<p>
Edit your <path>/etc/hosts.allow</path> to permit rsh connections:
</p>

<pre caption="hosts.allow">
# Adelie Linux Research &amp; Development Center
# /etc/hosts.allow

in.rshd:192.168.1.0/255.255.255.0
</pre>

<p>
Or you can simply trust your cluster LAN:
</p>

<pre caption="hosts.allow">
# Adelie Linux Research &amp; Development Center
# /etc/hosts.allow

ALL:192.168.1.0/255.255.255.0
</pre>

<p>
Finally, configure host authentication in <path>/etc/hosts.equiv</path>.
</p>

<pre caption="hosts.equiv">
# Adelie Linux Research &amp; Development Center
# /etc/hosts.equiv

master
node01
node02
</pre>

<p>
And add xinetd to your default runlevel:
</p>

<pre caption="Adding xinetd to the default runlevel">
# <i>rc-update add xinetd default</i>
</pre>

</body>
</section>
<section>
<title>NTP</title>
<body>

<p>
The Network Time Protocol (NTP) is used to synchronize the time of a computer
client or server to another server or reference time source, such as a radio
or satellite receiver or modem. It provides accuracies typically within a
millisecond on LANs and up to a few tens of milliseconds on WANs relative to
Coordinated Universal Time (UTC) via a Global Positioning Service (GPS)
receiver, for example. Typical NTP configurations utilize multiple redundant
servers and diverse network paths in order to achieve high accuracy and
reliability.
</p>

<p>
Select an NTP server geographically close to you from the <uri
link="http://www.eecis.udel.edu/~mills/ntp/servers.html">Public NTP Time
Servers</uri> list, and configure your <path>/etc/conf.d/ntp</path> and
<path>/etc/ntp.conf</path> files on the master node.
</p>

<pre caption="Master /etc/conf.d/ntp">
# Copyright 1999-2002 Gentoo Technologies, Inc.
# Distributed under the terms of the GNU General Public License v2
# /etc/conf.d/ntpd

# NOTES:
# - NTPDATE variables below are used if you wish to set your
#   clock when you start the ntp init.d script
# - make sure that the NTPDATE_CMD will close by itself ...
#   the init.d script will not attempt to kill/stop it
# - ntpd will be used to maintain synchronization with a time
#   server regardless of what NTPDATE is set to
# - read each of the comments above each of the variables

# Comment this out if you don't want the init script to warn
# about not having ntpdate setup
NTPDATE_WARN="n"

# Command to run to set the clock initially
# Most people should just uncomment this line ...
# however, if you know what you're doing, and you
# want to use ntpd to set the clock, change this to 'ntpd'
NTPDATE_CMD="ntpdate"

# Options to pass to the above command
# Most people should just uncomment this variable and
# change 'someserver' to a valid hostname which you
# can acquire from the URL's below
NTPDATE_OPTS="-b ntp1.cmc.ec.gc.ca"

##
# A list of available servers is available here:
# http://www.eecis.udel.edu/~mills/ntp/servers.html
# Please follow the rules of engagement and use a
# Stratum 2 server (unless you qualify for Stratum 1)
##

# Options to pass to the ntpd process that will *always* be run
# Most people should not uncomment this line ...
# however, if you know what you're doing, feel free to tweak
#NTPD_OPTS=""
</pre>

<p>
Edit your <path>/etc/ntp.conf</path> file on the master to set up an external
synchronization source:
</p>

<pre caption="Master ntp.conf">
# Adelie Linux Research &amp; Development Center
# /etc/ntp.conf

# Synchronization source #1
server ntp1.cmc.ec.gc.ca
restrict ntp1.cmc.ec.gc.ca
# Synchronization source #2
server ntp2.cmc.ec.gc.ca
restrict ntp2.cmc.ec.gc.ca
stratum 10
driftfile /etc/ntp.drift.server
logfile /var/log/ntp
broadcast 192.168.1.255
restrict default kod
restrict 127.0.0.1
restrict 192.168.1.0 mask 255.255.255.0
</pre>

<p>
And on all your slave nodes, set up your master node as the synchronization
source.
</p>

<pre caption="Node /etc/conf.d/ntp">
# Copyright 1999-2002 Gentoo Technologies, Inc.
# Distributed under the terms of the GNU General Public License v2
# /etc/conf.d/ntpd

NTPDATE_WARN="n"
NTPDATE_CMD="ntpdate"
NTPDATE_OPTS="-b master"
</pre>

<pre caption="Node ntp.conf">
# Adelie Linux Research &amp; Development Center
# /etc/ntp.conf

# Synchronization source #1
server master
restrict master
stratum 11
driftfile /etc/ntp.drift.server
logfile /var/log/ntp
restrict default kod
restrict 127.0.0.1
</pre>

<p>
Then add ntpd to the default runlevel of all your nodes:
</p>

<pre caption="Adding ntpd to the default runlevel">
# <i>rc-update add ntpd default</i>
</pre>

<note>
NTP will not update the local clock if the time difference between your
synchronization source and the local clock is too great.
</note>

</body>
</section>
<section>
<title>IPTABLES</title>
<body>

<p>
To set up a firewall on your cluster, you will need iptables.
</p>

<pre caption="Installing iptables">
# <i>emerge -p iptables</i>
# <i>emerge iptables</i>
</pre>

<p>
Required kernel configuration:
</p>

<pre caption="IPtables kernel configuration">
CONFIG_NETFILTER=y
CONFIG_IP_NF_CONNTRACK=y
CONFIG_IP_NF_IPTABLES=y
CONFIG_IP_NF_MATCH_STATE=y
CONFIG_IP_NF_FILTER=y
CONFIG_IP_NF_TARGET_REJECT=y
CONFIG_IP_NF_NAT=y
CONFIG_IP_NF_NAT_NEEDED=y
CONFIG_IP_NF_TARGET_MASQUERADE=y
CONFIG_IP_NF_TARGET_LOG=y
</pre>

<p>
And the rules required for this firewall:
</p>

<pre caption="rule-save">
# Adelie Linux Research &amp; Development Center
# /var/lib/iptables/rule-save

*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT -p tcp -m tcp --dport 22 -j ACCEPT
-A INPUT -s 192.168.1.0/255.255.255.0 -i eth1 -j ACCEPT
-A INPUT -s 127.0.0.1 -i lo -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -j LOG
-A INPUT -j REJECT --reject-with icmp-port-unreachable
COMMIT
*nat
:PREROUTING ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A POSTROUTING -s 192.168.1.0/255.255.255.0 -j MASQUERADE
COMMIT
</pre>
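
<p>
Rather than writing this file by hand, you can also build the ruleset with the
<c>iptables</c> command on a running system and then dump it with
<c>iptables-save</c>. The exact location where the init script expects the
saved rules may vary with the version of the iptables ebuild, so treat the
path below as an example only.
</p>

<pre caption="Saving a running ruleset (example)">
# <i>iptables-save > /var/lib/iptables/rule-save</i>
</pre>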

<p>
Then add iptables to the default runlevel of all your nodes:
</p>

<pre caption="Adding iptables to the default runlevel">
# <i>rc-update add iptables default</i>
</pre>

</body>
</section>
</chapter>

<chapter>
<title>HPC Tools</title>
<section>
<title>OpenPBS</title>
<body>

<p>
The Portable Batch System (PBS) is a flexible batch queueing and workload
management system originally developed for NASA. It operates on networked,
multi-platform UNIX environments, including heterogeneous clusters of
workstations, supercomputers, and massively parallel systems. Development of
PBS is provided by Altair Grid Technologies.
</p>

<pre caption="Installing openpbs">
# <i>emerge -p openpbs</i>
</pre>

<note>
The OpenPBS ebuild does not currently set proper permissions on the
var-directories used by OpenPBS.
</note>

<p>
Before you start using OpenPBS, some configuration is required. The files you
will need to personalize for your system are listed below; a rough sketch of
two of them follows the list.
</p>

<ul>
  <li>/etc/pbs_environment</li>
  <li>/var/spool/PBS/server_name</li>
  <li>/var/spool/PBS/server_priv/nodes</li>
  <li>/var/spool/PBS/mom_priv/config</li>
  <li>/var/spool/PBS/sched_priv/sched_config</li>
</ul>
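
<p>
The layout of these files is described in the OpenPBS documentation. As a
rough illustration only (the exact syntax may differ between OpenPBS
versions, so verify it against the documentation shipped with the package),
the server name and node list files for the example cluster used in this
guide could look like this:
</p>

<pre caption="Possible server_name and server_priv/nodes contents (sketch)">
<comment>(/var/spool/PBS/server_name -- the host running pbs_server)</comment>
master

<comment>(/var/spool/PBS/server_priv/nodes -- one line per execution node,
np= being the number of processors on that node)</comment>
node01 np=1
node02 np=1
</pre>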

<p>
Here is a sample sched_config:
</p>

<pre caption="/var/spool/PBS/sched_priv/sched_config">
#
# Create queues and set their attributes.
#
#
# Create and define queue upto4nodes
#
create queue upto4nodes
set queue upto4nodes queue_type = Execution
set queue upto4nodes Priority = 100
set queue upto4nodes resources_max.nodect = 4
set queue upto4nodes resources_min.nodect = 1
set queue upto4nodes enabled = True
set queue upto4nodes started = True
#
# Create and define queue default
#
create queue default
set queue default queue_type = Route
set queue default route_destinations = upto4nodes
set queue default enabled = True
set queue default started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_host_enable = True
set server default_queue = default
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server resources_default.neednodes = 1
set server resources_default.nodect = 1
set server resources_default.nodes = 1
set server scheduler_iteration = 60
</pre>

<p>
To submit a task to OpenPBS, the command <c>qsub</c> is used with some
optional parameters. In the example below, "-l" allows you to specify
the resources required, "-j" provides for redirection of standard out and
standard error, and "-m" will e-mail the user at the beginning (b), end (e)
and on abort (a) of the job.
</p>

<pre caption="Submitting a task">
<comment>(submit and request from OpenPBS that myscript be executed on 2 nodes)</comment>
# <i>qsub -l nodes=2 -j oe -m abe myscript</i>
</pre>
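
<p>
The <path>myscript</path> file itself is an ordinary shell script, and the
same resource requests can be embedded in it as #PBS directives instead of
being passed on the command line. What the script actually runs is up to you;
the body below is only a hypothetical example.
</p>

<pre caption="A hypothetical myscript">
#!/bin/sh
#PBS -l nodes=2
#PBS -j oe
#PBS -m abe

# PBS starts the job in the user's home directory;
# change to the directory the job was submitted from
cd $PBS_O_WORKDIR
./my_program
</pre>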

<p>
Normally jobs submitted to OpenPBS are in the form of scripts. Sometimes, you
may want to try a task manually. To request an interactive shell from OpenPBS,
use the "-I" parameter.
</p>

<pre caption="Requesting an interactive shell">
# <i>qsub -I</i>
</pre>

<p>
To check the status of your jobs, use the qstat command:
</p>

<pre caption="Checking the status of the jobs">
# <i>qstat</i>
Job id   Name    User    Time Use  S  Queue
------   ----    ----    --------  -  -----
2.geist  STDIN   adelie  0         R  upto1nodes
</pre>

</body>
</section>
<section>
<title>MPICH</title>
<body>

<p>
Message passing is a paradigm used widely on certain classes of parallel
machines, especially those with distributed memory. MPICH is a freely
available, portable implementation of MPI, the standard for message-passing
libraries.
</p>

<p>
The mpich ebuild provided by Adelie Linux allows for two USE flags:
<e>doc</e> and <e>crypt</e>. <e>doc</e> will cause documentation to be
installed, while <e>crypt</e> will configure MPICH to use <c>ssh</c> instead
of <c>rsh</c>.
</p>

<pre caption="Installing the mpich application">
# <i>emerge -p mpich</i>
# <i>emerge mpich</i>
</pre>

<p>
You may need to export an mpich work directory to all your slave nodes in
<path>/etc/exports</path>:
</p>

<pre caption="/etc/exports">
/home    *(rw)
</pre>

<p>
Most massively parallel processors (MPPs) provide a way to start a program on
a requested number of processors; <c>mpirun</c> makes use of the appropriate
command whenever possible. In contrast, workstation clusters require that each
process in a parallel job be started individually, though programs to help
start these processes exist. Because workstation clusters are not already
organized as an MPP, additional information is required to make use of them.
Mpich should be installed with a list of participating workstations in the
file <path>machines.LINUX</path> in the directory
<path>/usr/share/mpich/</path>. This file is used by <c>mpirun</c> to choose
processors to run on.
</p>

<p>
Edit this file to reflect your cluster LAN configuration:
</p>

<pre caption="/usr/share/mpich/machines.LINUX">
# Change this file to contain the machines that you want to use
# to run MPI jobs on.  The format is one host name per line, with either
#    hostname
# or
#    hostname:n
# where n is the number of processors in an SMP.  The hostname should
# be the same as the result from the command "hostname"
master
node01
node02
# node03
# node04
# ...
</pre>

<p>
Use the script <c>tstmachines</c> in <path>/usr/sbin/</path> to ensure that
you can use all of the machines that you have listed. This script performs
an <c>rsh</c> and a short directory listing; this tests both that you have
access to the node and that a program in the current directory is visible on
the remote node. If there are any problems, they will be listed. These
problems must be fixed before proceeding.
</p>

<p>
The only argument to <c>tstmachines</c> is the name of the architecture; this
is the same name as the extension on the machines file. For example, the
following tests that a program in the current directory can be executed by
all of the machines in the LINUX machines list.
</p>

<pre caption="Running a test">
# <i>/usr/local/mpich/sbin/tstmachines LINUX</i>
</pre>

<note>
This program is silent if all is well; if you want to see what it is doing,
use the -v (for verbose) argument:
</note>

<pre caption="Running a test verbosely">
# <i>/usr/local/mpich/sbin/tstmachines -v LINUX</i>
</pre>

<p>
The output from this command might look like:
</p>

<pre caption="Output of the above command">
Trying true on host1.uoffoo.edu ...
Trying true on host2.uoffoo.edu ...
Trying ls on host1.uoffoo.edu ...
Trying ls on host2.uoffoo.edu ...
Trying user program on host1.uoffoo.edu ...
Trying user program on host2.uoffoo.edu ...
</pre>

<p>
If <c>tstmachines</c> finds a problem, it will suggest possible reasons and
solutions. In brief, there are three tests:
</p>

<ul>
  <li>
    <e>Can processes be started on remote machines?</e> tstmachines attempts
    to run the shell command true on each machine in the machines files by
    using the remote shell command.
  </li>
  <li>
    <e>Is the current working directory available to all machines?</e> This
    attempts to ls a file that tstmachines creates by running ls using the
    remote shell command.
  </li>
  <li>
    <e>Can user programs be run on remote systems?</e> This checks that shared
    libraries and other components have been properly installed on all
    machines.
  </li>
</ul>

<p>
And the required test for every development tool:
</p>

<pre caption="Testing a development tool">
# <i>cd ~</i>
# <i>cp /usr/share/mpich/examples1/hello++.c ~</i>
# <i>make hello++</i>
# <i>mpirun -machinefile /usr/share/mpich/machines.LINUX -np 1 hello++</i>
</pre>
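
<p>
For reference, a minimal MPI "hello world" written against the C API is shown
below. This is not the hello++.c example shipped with MPICH (whose contents
may differ), just a small self-contained program you could use for the same
kind of smoke test: compile it with <c>mpicc</c> and start it with
<c>mpirun</c> as above.
</p>

<pre caption="A minimal MPI test program (sketch)">
<comment>(hello.c)</comment>
#include &lt;stdio.h&gt;
#include &lt;mpi.h&gt;

int main(int argc, char *argv[])
{
    int rank, size;

    /* Initialize the MPI runtime and find out who we are */
    MPI_Init(&amp;argc, &amp;argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &amp;rank);
    MPI_Comm_size(MPI_COMM_WORLD, &amp;size);

    printf("Hello from process %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}

<comment>(Compile and run on two processes)</comment>
# <i>mpicc -o hello hello.c</i>
# <i>mpirun -machinefile /usr/share/mpich/machines.LINUX -np 2 ./hello</i>
</pre>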

<p>
For further information on MPICH, consult the documentation at <uri
link="http://www-unix.mcs.anl.gov/mpi/mpich/docs/mpichman-chp4/mpichman-chp4.htm">http://www-unix.mcs.anl.gov/mpi/mpich/docs/mpichman-chp4/mpichman-chp4.htm</uri>.
</p>

</body>
</section>
<section>
<title>LAM</title>
<body>

<p>
(Coming Soon!)
</p>

</body>
</section>
<section>
<title>OMNI</title>
<body>

<p>
(Coming Soon!)
</p>

</body>
</section>
</chapter>

<chapter>
<title>Bibliography</title>
<section>
<body>

<p>
The original document is published at the <uri
link="http://www.adelielinux.com">Adelie Linux R&amp;D Centre</uri> web site,
and is reproduced here with the permission of the authors and <uri
link="http://www.cyberlogic.ca">Cyberlogic</uri>'s Adelie Linux R&amp;D
Centre.
</p>

<ul>
  <li><uri>http://www.gentoo.org</uri>, Gentoo Technologies, Inc.</li>
  <li>
    <uri link="http://www.adelielinux.com">http://www.adelielinux.com</uri>,
    Adelie Linux Research and Development Centre
  </li>
  <li>
    <uri link="http://nfs.sourceforge.net/">http://nfs.sourceforge.net</uri>,
    Linux NFS Project
  </li>
  <li>
    <uri link="http://www-unix.mcs.anl.gov/mpi/mpich/">http://www-unix.mcs.anl.gov/mpi/mpich/</uri>,
    Mathematics and Computer Science Division, Argonne National Laboratory
  </li>
  <li>
    <uri link="http://www.ntp.org/">http://www.ntp.org/</uri>
  </li>
  <li>
    <uri link="http://www.eecis.udel.edu/~mills/">http://www.eecis.udel.edu/~mills/</uri>,
    David L. Mills, University of Delaware
  </li>
  <li>
    <uri link="http://www.ietf.org/html.charters/secsh-charter.html">http://www.ietf.org/html.charters/secsh-charter.html</uri>,
    Secure Shell Working Group, IETF, Internet Society
  </li>
  <li>
    <uri link="http://www.linuxsecurity.com/">http://www.linuxsecurity.com/</uri>,
    Guardian Digital
  </li>
  <li>
    <uri link="http://www.openpbs.org/">http://www.openpbs.org/</uri>,
    Altair Grid Technologies, LLC.
  </li>
</ul>

</body>
</section>
</chapter>

</guide>
