<?xml version='1.0' encoding="UTF-8"?>
<!-- $Header: /var/cvsroot/gentoo/xml/htdocs/doc/en/hpc-howto.xml,v 1.7 2006/04/17 04:43:56 nightmorph Exp $ -->
<!DOCTYPE guide SYSTEM "/dtd/guide.dtd">

<guide link="/doc/en/hpc-howto.xml">
<title>High Performance Computing on Gentoo Linux</title>

<author title="Author">
  <mail link="marc@adelielinux.com">Marc St-Pierre</mail>
</author>
<author title="Author">
  <mail link="benoit@adelielinux.com">Benoit Morin</mail>
</author>
<author title="Assistant/Research">
  <mail link="jean-francois@adelielinux.com">Jean-Francois Richard</mail>
</author>
<author title="Assistant/Research">
  <mail link="olivier@adelielinux.com">Olivier Crete</mail>
</author>
<author title="Reviewer">
  <mail link="dberkholz@gentoo.org">Donnie Berkholz</mail>
</author>

<!-- No licensing information; this document has been written by a third-party
     organisation without additional licensing information.

     In other words, this is copyright adelielinux R&D; Gentoo only has
     permission to distribute this document as-is and update it when appropriate
     as long as the adelie linux R&D notice stays
-->

<abstract>
This document was written by people at the Adelie Linux R&amp;D Center
&lt;http://www.adelielinux.com&gt; as a step-by-step guide to turn a Gentoo
system into a High Performance Computing (HPC) system.
</abstract>

<version>1.2</version>
<date>2003-08-01</date>

<chapter>
<title>Introduction</title>
<section>
<body>

<p>
Gentoo Linux is a special flavor of Linux that can be automatically optimized
and customized for just about any application or need. Extreme performance,
configurability and a top-notch user and developer community are all hallmarks
of the Gentoo experience.
</p>

<p>
Thanks to a technology called Portage, Gentoo Linux can become an ideal secure
server, development workstation, professional desktop, gaming system, embedded
solution or... a High Performance Computing system. Because of its
near-unlimited adaptability, we call Gentoo Linux a metadistribution.
</p>

<p>
This document explains how to turn a Gentoo system into a High Performance
Computing system. Step by step, it explains what packages one may want to
install and helps configure them.
</p>

<p>
Obtain Gentoo Linux from the website <uri>http://www.gentoo.org</uri>, and
refer to the <uri link="/doc/en/">documentation</uri> at the same location to
install it.
</p>

</body>
</section>
</chapter>

<chapter>
<title>Configuring Gentoo Linux for Clustering</title>
<section>
<title>Recommended Optimizations</title>
<body>

<note>
We refer to the <uri link="/doc/en/handbook/">Gentoo Linux Handbooks</uri> in
this section.
</note>

<p>
During the installation process, you will have to set your USE variables in
<path>/etc/make.conf</path>. We recommend that you deactivate all the
defaults (see <path>/etc/make.profile/make.defaults</path>) by negating them
in make.conf. However, you may want to keep such USE variables as x86, 3dnow,
gpm, mmx, sse, ncurses, pam and tcpd. Refer to the USE documentation for more
information.
</p>

<pre caption="USE Flags">
USE="-oss 3dnow -apm -arts -avi -berkdb -crypt -cups -encode -gdbm
     -gif gpm -gtk -imlib -java -jpeg -kde -gnome -libg++ -libwww -mikmod
     mmx -motif -mpeg ncurses -nls -oggvorbis -opengl pam -pdflib -png
     -python -qt -qtmt -quicktime -readline -sdl -slang -spell -ssl
     -svga tcpd -truetype -X -xml2 -xmms -xv -zlib"
</pre>

<p>
Or simply:
</p>

<pre caption="USE Flags - simplified version">
USE="-* 3dnow gpm mmx ncurses pam sse tcpd"
</pre>

<note>
The <e>tcpd</e> USE flag increases security for packages such as xinetd.
</note>

<p>
In step 15 ("Installing the kernel and a System Logger"), for stability
reasons, we recommend vanilla-sources, the official kernel sources released
on <uri>http://www.kernel.org/</uri>, unless you require special support such
as xfs.
</p>

<pre caption="Installing vanilla-sources">
# <i>emerge -p syslog-ng vanilla-sources</i>
</pre>
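
<p>
After reviewing the pretended merge, install both packages and make sure the
system logger starts at boot. The commands below are a minimal sketch; they
assume syslog-ng provides a standard init script named <c>syslog-ng</c>.
</p>

<pre caption="Installing and enabling the system logger">
# <i>emerge syslog-ng vanilla-sources</i>
# <i>rc-update add syslog-ng default</i>
</pre>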

<p>
When you install miscellaneous packages, we recommend installing the
following:
</p>

<pre caption="Installing necessary packages">
# <i>emerge -p nfs-utils portmap tcpdump ssmtp iptables xinetd</i>
</pre>

</body>
</section>
<section>
<title>Communication Layer (TCP/IP Network)</title>
<body>

<p>
A cluster requires a communication layer to interconnect the slave nodes to
the master node. Typically, a Fast Ethernet or Gigabit Ethernet LAN can be
used since they have a good price/performance ratio. Other possibilities
include use of products like <uri link="http://www.myricom.com/">Myrinet</uri>,
<uri link="http://quadrics.com/">QsNet</uri> or others.
</p>

<p>
A cluster is composed of two node types: master and slave. Typically, your
cluster will have one master node and several slave nodes.
</p>

<p>
The master node is the cluster's server. It is responsible for telling the
slave nodes what to do. This server will typically run such daemons as dhcpd,
nfs, pbs-server, and pbs-sched. Your master node will allow interactive
sessions for users, and accept job executions.
</p>

<p>
The slave nodes listen for instructions (via ssh/rsh perhaps) from the master
node. They should be dedicated to crunching results and therefore should not
run any unnecessary services.
</p>

<p>
The rest of this documentation assumes a cluster configured as per the hosts
file below. You should maintain such a hosts file (<path>/etc/hosts</path>)
on every node, with entries for each node participating in the cluster.
</p>

<pre caption="/etc/hosts">
# Adelie Linux Research &amp; Development Center
# /etc/hosts

127.0.0.1       localhost

192.168.1.100   master.adelie master

192.168.1.1     node01.adelie node01
192.168.1.2     node02.adelie node02
</pre>

<p>
To set up your cluster's dedicated LAN, edit your <path>/etc/conf.d/net</path>
file on the master node.
</p>

<pre caption="/etc/conf.d/net">
# Copyright 1999-2002 Gentoo Technologies, Inc.
# Distributed under the terms of the GNU General Public License, v2 or later

# Global config file for net.* rc-scripts

# This is basically the ifconfig argument without the ifconfig $iface
#

iface_eth0="192.168.1.100 broadcast 192.168.1.255 netmask 255.255.255.0"
# Network connection to the outside world using dhcp -- configure as required for your network
iface_eth1="dhcp"
</pre>
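
<p>
With both interfaces configured, make sure they are brought up at boot. The
following is a minimal sketch assuming the classic baselayout networking
scripts, where additional interfaces are handled by symlinking the
<path>net.eth0</path> init script.
</p>

<pre caption="Enabling the network interfaces at boot">
# <i>cd /etc/init.d</i>
# <i>ln -s net.eth0 net.eth1</i>
# <i>rc-update add net.eth0 default</i>
# <i>rc-update add net.eth1 default</i>
</pre>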

<p>
Finally, set up a DHCP daemon on the master node to avoid having to maintain a
network configuration on each slave node.
</p>

<pre caption="/etc/dhcp/dhcpd.conf">
# Adelie Linux Research &amp; Development Center
# /etc/dhcp/dhcpd.conf

log-facility local7;
ddns-update-style none;
use-host-decl-names on;

subnet 192.168.1.0 netmask 255.255.255.0 {
    option domain-name "adelie";
    range 192.168.1.10 192.168.1.99;
    option routers 192.168.1.100;

    host node01.adelie {
        # MAC address of network card on node 01
        hardware ethernet 00:07:e9:0f:e2:d4;
        fixed-address 192.168.1.1;
    }
    host node02.adelie {
        # MAC address of network card on node 02
        hardware ethernet 00:07:e9:0f:e2:6b;
        fixed-address 192.168.1.2;
    }
}
</pre>
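
<p>
The DHCP server itself is typically provided by the <c>net-misc/dhcp</c>
package. The commands below are a minimal sketch, assuming the package
installs a standard <c>dhcpd</c> init script:
</p>

<pre caption="Installing and enabling the DHCP daemon">
# <i>emerge -p dhcp</i>
# <i>emerge dhcp</i>
# <i>rc-update add dhcpd default</i>
</pre>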

</body>
</section>
<section>
<title>NFS/NIS</title>
<body>

<p>
The Network File System (NFS) was developed to allow machines to mount a disk
partition on a remote machine as if it were on a local hard drive. This allows
for fast, seamless sharing of files across a network.
</p>

<p>
There are other systems that provide similar functionality to NFS which could
be used in a cluster environment. The <uri
link="http://www.openafs.org">Andrew File System from IBM</uri>, recently
open-sourced, provides a file sharing mechanism with some additional security
and performance features. The <uri
link="http://www.coda.cs.cmu.edu/">Coda File System</uri> is still in
development, but is designed to work well with disconnected clients. Many
of the features of the Andrew and Coda file systems are slated for inclusion
in the next version of <uri link="http://www.nfsv4.org">NFS (Version 4)</uri>.
The advantage of NFS today is that it is mature, standard, well understood,
and supported robustly across a variety of platforms.
</p>

<pre caption="Ebuilds for NFS-support">
# <i>emerge -p nfs-utils portmap</i>
# <i>emerge nfs-utils portmap</i>
</pre>

<p>
Configure and install a kernel to support NFS v3 on all nodes:
</p>

<pre caption="Required Kernel Configurations for NFS">
CONFIG_NFS_FS=y
CONFIG_NFSD=y
CONFIG_SUNRPC=y
CONFIG_LOCKD=y
CONFIG_NFSD_V3=y
CONFIG_LOCKD_V4=y
</pre>

<p>
On the master node, edit your <path>/etc/hosts.allow</path> file to allow
connections from slave nodes. If your cluster LAN is on 192.168.1.0/24,
your <path>hosts.allow</path> will look like:
</p>

<pre caption="hosts.allow">
portmap:192.168.1.0/255.255.255.0
</pre>

<p>
Edit the <path>/etc/exports</path> file of the master node to export a work
directory structure (<path>/home</path> is good for this).
</p>

<pre caption="/etc/exports">
/home/  *(rw)
</pre>
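
<p>
If the NFS server is already running when you change
<path>/etc/exports</path>, the export list can be refreshed without restarting
the service. This is a minimal sketch using the standard <c>exportfs</c>
utility from nfs-utils:
</p>

<pre caption="Re-exporting the shared directories">
# <i>exportfs -ra</i>
</pre>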

<p>
Add nfs to your master node's default runlevel:
</p>

<pre caption="Adding NFS to the default runlevel">
# <i>rc-update add nfs default</i>
</pre>

<p>
To mount the nfs exported filesystem from the master, you also have to
configure your slave nodes' <path>/etc/fstab</path>. Add a line like this
one:
</p>

<pre caption="/etc/fstab">
master:/home/  /home  nfs  rw,exec,noauto,nouser,async  0 0
</pre>

<p>
You'll also need to set up your nodes so that they mount the nfs filesystem by
issuing this command:
</p>

<pre caption="Adding nfsmount to the default runlevel">
# <i>rc-update add nfsmount default</i>
</pre>

</body>
</section>
<section>
<title>RSH/SSH</title>
<body>

<p>
SSH is a protocol for secure remote login and other secure network services
over an insecure network. OpenSSH uses public key cryptography to provide
secure authentication. The first step in configuring OpenSSH on the cluster
is to generate the public key, which is shared with remote systems, and the
private key, which is kept on the local system.
</p>

<p>
For transparent cluster usage, private/public keys may be used. This process
has two steps:
</p>

<ul>
  <li>Generate public and private keys</li>
  <li>Copy public key to slave nodes</li>
</ul>

<p>
For user-based authentication, generate and copy as follows:
</p>

<pre caption="SSH key authentication">
# <i>ssh-keygen -t dsa</i>
Generating public/private dsa key pair.
Enter file in which to save the key (/root/.ssh/id_dsa): /root/.ssh/id_dsa
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_dsa.
Your public key has been saved in /root/.ssh/id_dsa.pub.
The key fingerprint is:
f1:45:15:40:fd:3c:2d:f7:9f:ea:55:df:76:2f:a4:1f root@master

<comment>WARNING! If you already have an "authorized_keys" file,
please append to it, do not use the following command.</comment>

# <i>scp /root/.ssh/id_dsa.pub node01:/root/.ssh/authorized_keys</i>
root@master's password:
id_dsa.pub   100%  234   2.0MB/s   00:00

# <i>scp /root/.ssh/id_dsa.pub node02:/root/.ssh/authorized_keys</i>
root@master's password:
id_dsa.pub   100%  234   2.0MB/s   00:00
</pre>

<note>
Host keys must have an empty passphrase. RSA is required for host-based
authentication.
</note>

<p>
For host-based authentication, you will also need to edit your
<path>/etc/ssh/shosts.equiv</path>.
</p>

<pre caption="/etc/ssh/shosts.equiv">
node01.adelie
node02.adelie
master.adelie
</pre>

<p>
You will also need to make a few modifications to the
<path>/etc/ssh/sshd_config</path> file:
</p>

<pre caption="sshd configurations">
# $OpenBSD: sshd_config,v 1.42 2001/09/20 20:57:51 mouring Exp $
# This sshd was compiled with PATH=/usr/bin:/bin:/usr/sbin:/sbin

# This is the sshd server system-wide configuration file. See sshd(8)
# for more information.

# HostKeys for protocol version 2
HostKey /etc/ssh/ssh_host_rsa_key
</pre>
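
<p>
The snippet above only shows the RSA host key line. For host-based
authentication to actually be accepted, sshd must be told to allow it. The
directive below is a minimal sketch; check <c>sshd_config(5)</c> for the exact
options supported by your OpenSSH version, and restart sshd on every node
afterwards (for example with <c>/etc/init.d/sshd restart</c>).
</p>

<pre caption="Enabling host-based authentication in sshd_config">
# Allow hosts listed in shosts.equiv to authenticate using their host key
HostbasedAuthentication yes
</pre>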

<p>
If your applications require RSH communications, you will need to emerge
net-misc/netkit-rsh and sys-apps/xinetd.
</p>

<pre caption="Installing necessary applications">
# <i>emerge -p xinetd</i>
# <i>emerge xinetd</i>
# <i>emerge -p netkit-rsh</i>
# <i>emerge netkit-rsh</i>
</pre>

<p>
Then configure the rsh daemon. Edit your <path>/etc/xinetd.d/rsh</path> file.
</p>

<pre caption="rsh">
# Adelie Linux Research &amp; Development Center
# /etc/xinetd.d/rsh

service shell
{
    socket_type     = stream
    protocol        = tcp
    wait            = no
    user            = root
    group           = tty
    server          = /usr/sbin/in.rshd
    log_type        = FILE /var/log/rsh
    log_on_success  = PID HOST USERID EXIT DURATION
    log_on_failure  = USERID ATTEMPT
    disable         = no
}
</pre>

<p>
Edit your <path>/etc/hosts.allow</path> to permit rsh connections:
</p>

<pre caption="hosts.allow">
# Adelie Linux Research &amp; Development Center
# /etc/hosts.allow

in.rshd:192.168.1.0/255.255.255.0
</pre>

<p>
Or you can simply trust your cluster LAN:
</p>

<pre caption="hosts.allow">
# Adelie Linux Research &amp; Development Center
# /etc/hosts.allow

ALL:192.168.1.0/255.255.255.0
</pre>

<p>
Finally, configure host authentication in <path>/etc/hosts.equiv</path>.
</p>

<pre caption="hosts.equiv">
# Adelie Linux Research &amp; Development Center
# /etc/hosts.equiv

master
node01
node02
</pre>

<p>
And add xinetd to your default runlevel:
</p>

<pre caption="Adding xinetd to the default runlevel">
# <i>rc-update add xinetd default</i>
</pre>
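
<p>
Before moving on, it is worth verifying that passwordless remote logins work
from the master to every node. Each of the commands below should print the
remote node's hostname without prompting for a password; the hostnames are the
ones defined in the <path>/etc/hosts</path> file above.
</p>

<pre caption="Testing remote logins">
# <i>ssh node01 hostname</i>
# <i>rsh node02 hostname</i>
</pre>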

</body>
</section>
<section>
<title>NTP</title>
<body>

<p>
The Network Time Protocol (NTP) is used to synchronize the time of a computer
client or server to another server or reference time source, such as a radio
or satellite receiver or modem. It provides accuracies typically within a
millisecond on LANs and up to a few tens of milliseconds on WANs relative to
Coordinated Universal Time (UTC) via a Global Positioning System (GPS)
receiver, for example. Typical NTP configurations utilize multiple redundant
servers and diverse network paths in order to achieve high accuracy and
reliability.
</p>

<p>
Select an NTP server geographically close to you from <uri
link="http://www.eecis.udel.edu/~mills/ntp/servers.html">Public NTP Time
Servers</uri>, and configure your <path>/etc/conf.d/ntp</path> and
<path>/etc/ntp.conf</path> files on the master node.
</p>

<pre caption="Master /etc/conf.d/ntp">
# Copyright 1999-2002 Gentoo Technologies, Inc.
# Distributed under the terms of the GNU General Public License v2
# /etc/conf.d/ntpd

# NOTES:
# - NTPDATE variables below are used if you wish to set your
#   clock when you start the ntp init.d script
# - make sure that the NTPDATE_CMD will close by itself ...
#   the init.d script will not attempt to kill/stop it
# - ntpd will be used to maintain synchronization with a time
#   server regardless of what NTPDATE is set to
# - read each of the comments above each of the variables

# Comment this out if you don't want the init script to warn
# about not having ntpdate setup
NTPDATE_WARN="n"

# Command to run to set the clock initially
# Most people should just uncomment this line ...
# however, if you know what you're doing, and you
# want to use ntpd to set the clock, change this to 'ntpd'
NTPDATE_CMD="ntpdate"

# Options to pass to the above command
# Most people should just uncomment this variable and
# change 'someserver' to a valid hostname which you
# can acquire from the URLs below
NTPDATE_OPTS="-b ntp1.cmc.ec.gc.ca"

##
# A list of available servers is available here:
# http://www.eecis.udel.edu/~mills/ntp/servers.html
# Please follow the rules of engagement and use a
# Stratum 2 server (unless you qualify for Stratum 1)
##

# Options to pass to the ntpd process that will *always* be run
# Most people should not uncomment this line ...
# however, if you know what you're doing, feel free to tweak
#NTPD_OPTS=""

</pre>

<p>
Edit your <path>/etc/ntp.conf</path> file on the master to set up an external
synchronization source:
</p>

<pre caption="Master ntp.conf">
# Adelie Linux Research &amp; Development Center
# /etc/ntp.conf

# Synchronization source #1
server ntp1.cmc.ec.gc.ca
restrict ntp1.cmc.ec.gc.ca
# Synchronization source #2
server ntp2.cmc.ec.gc.ca
restrict ntp2.cmc.ec.gc.ca
stratum 10
driftfile /etc/ntp.drift.server
logfile /var/log/ntp
broadcast 192.168.1.255
restrict default kod
restrict 127.0.0.1
restrict 192.168.1.0 mask 255.255.255.0
</pre>

<p>
And on all your slave nodes, set up the master node as your synchronization
source.
</p>

<pre caption="Node /etc/conf.d/ntp">
# Copyright 1999-2002 Gentoo Technologies, Inc.
# Distributed under the terms of the GNU General Public License v2
# /etc/conf.d/ntpd

NTPDATE_WARN="n"
NTPDATE_CMD="ntpdate"
NTPDATE_OPTS="-b master"
</pre>

<pre caption="Node ntp.conf">
# Adelie Linux Research &amp; Development Center
# /etc/ntp.conf

# Synchronization source #1
server master
restrict master
stratum 11
driftfile /etc/ntp.drift.server
logfile /var/log/ntp
restrict default kod
restrict 127.0.0.1
</pre>

<p>
Then add ntpd to the default runlevel of all your nodes:
</p>

<pre caption="Adding ntpd to the default runlevel">
# <i>rc-update add ntpd default</i>
</pre>

<note>
NTP will not update the local clock if the time difference between your
synchronization source and the local clock is too great.
</note>
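
<p>
If a node's clock is too far off for ntpd to correct it, you can set it once
by hand before starting the daemon. A minimal sketch using <c>ntpdate</c>
against the master node defined above:
</p>

<pre caption="Forcing an initial time synchronization on a node">
# <i>ntpdate -b master</i>
</pre>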

</body>
</section>
<section>
<title>IPTABLES</title>
<body>

<p>
To set up a firewall on your cluster, you will need iptables.
</p>

<pre caption="Installing iptables">
# <i>emerge -p iptables</i>
# <i>emerge iptables</i>
</pre>

<p>
Required kernel configuration:
</p>

<pre caption="IPtables kernel configuration">
CONFIG_NETFILTER=y
CONFIG_IP_NF_CONNTRACK=y
CONFIG_IP_NF_IPTABLES=y
CONFIG_IP_NF_MATCH_STATE=y
CONFIG_IP_NF_FILTER=y
CONFIG_IP_NF_TARGET_REJECT=y
CONFIG_IP_NF_NAT=y
CONFIG_IP_NF_NAT_NEEDED=y
CONFIG_IP_NF_TARGET_MASQUERADE=y
CONFIG_IP_NF_TARGET_LOG=y
</pre>

<p>
And the rules required for this firewall:
</p>

<pre caption="rule-save">
# Adelie Linux Research &amp; Development Center
# /var/lib/iptables/rule-save

*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT -p tcp -m tcp --dport 22 -j ACCEPT
-A INPUT -s 192.168.1.0/255.255.255.0 -i eth0 -j ACCEPT
-A INPUT -s 127.0.0.1 -i lo -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -j LOG
-A INPUT -j REJECT --reject-with icmp-port-unreachable
COMMIT
*nat
:PREROUTING ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A POSTROUTING -s 192.168.1.0/255.255.255.0 -j MASQUERADE
COMMIT
</pre>
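
<p>
For the MASQUERADE rule to be useful, the master node must also forward
packets between its two interfaces, and the saved rules must be loaded. The
commands below are a sketch: <c>iptables-restore</c> ships with iptables, and
IP forwarding can be made permanent through your sysctl configuration.
</p>

<pre caption="Enabling forwarding and loading the rules">
# <i>echo 1 > /proc/sys/net/ipv4/ip_forward</i>
# <i>iptables-restore &lt; /var/lib/iptables/rule-save</i>
</pre>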

<p>
Then add iptables to the default runlevel of all your nodes:
</p>

<pre caption="Adding iptables to the default runlevel">
# <i>rc-update add iptables default</i>
</pre>

</body>
</section>
</chapter>

<chapter>
<title>HPC Tools</title>
<section>
<title>OpenPBS</title>
<body>

<p>
The Portable Batch System (PBS) is a flexible batch queueing and workload
management system originally developed for NASA. It operates on networked,
multi-platform UNIX environments, including heterogeneous clusters of
workstations, supercomputers, and massively parallel systems. Development of
PBS is provided by Altair Grid Technologies.
</p>

<pre caption="Installing openpbs">
# <i>emerge -p openpbs</i>
</pre>

<note>
The OpenPBS ebuild does not currently set proper permissions on the
var-directories used by OpenPBS.
</note>

<p>
Before you start using OpenPBS, some configuration is required. The files
you will need to personalize for your system are:
</p>

<ul>
  <li><path>/etc/pbs_environment</path></li>
  <li><path>/var/spool/PBS/server_name</path></li>
  <li><path>/var/spool/PBS/server_priv/nodes</path></li>
  <li><path>/var/spool/PBS/mom_priv/config</path></li>
  <li><path>/var/spool/PBS/sched_priv/sched_config</path></li>
</ul>

<p>
Here is a sample queue and server configuration. Note that these directives
are queue and server attributes, which are loaded through the <c>qmgr</c>
interface rather than placed in <path>sched_config</path>:
</p>

<pre caption="Sample queue and server configuration">
#
# Create queues and set their attributes.
#
#
# Create and define queue upto4nodes
#
create queue upto4nodes
set queue upto4nodes queue_type = Execution
set queue upto4nodes Priority = 100
set queue upto4nodes resources_max.nodect = 4
set queue upto4nodes resources_min.nodect = 1
set queue upto4nodes enabled = True
set queue upto4nodes started = True
#
# Create and define queue default
#
create queue default
set queue default queue_type = Route
set queue default route_destinations = upto4nodes
set queue default enabled = True
set queue default started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_host_enable = True
set server default_queue = default
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server resources_default.neednodes = 1
set server resources_default.nodect = 1
set server resources_default.nodes = 1
set server scheduler_iteration = 60
</pre>
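
<p>
Assuming the directives above are saved in a file, they can be fed to a
running pbs_server through the <c>qmgr</c> utility. A minimal sketch, using a
hypothetical file name <path>cluster-queues.conf</path>:
</p>

<pre caption="Loading the queue configuration with qmgr">
# <i>qmgr &lt; cluster-queues.conf</i>
</pre>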

<p>
To submit a task to OpenPBS, the command <c>qsub</c> is used with some
optional parameters. In the example below, "-l" allows you to specify
the resources required, "-j" provides for redirection of standard out and
standard error, and "-m" will e-mail the user at the beginning (b), end (e)
and on abort (a) of the job.
</p>

<pre caption="Submitting a task">
<comment>(submit and request from OpenPBS that myscript be executed on 2 nodes)</comment>
# <i>qsub -l nodes=2 -j oe -m abe myscript</i>
</pre>
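
<p>
The same options can also be embedded in the job script itself as "#PBS"
directives, so users do not have to repeat them on the command line. A minimal
sketch of such a script follows; the script name and its contents are only an
example.
</p>

<pre caption="myscript with embedded PBS directives">
#!/bin/sh
#PBS -l nodes=2
#PBS -j oe
#PBS -m abe

# The actual work to run on the allocated nodes
echo "Running on the following nodes:"
cat $PBS_NODEFILE
</pre>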

<p>
Normally jobs submitted to OpenPBS are in the form of scripts. Sometimes, you
may want to try a task manually. To request an interactive shell from OpenPBS,
use the "-I" parameter.
</p>

<pre caption="Requesting an interactive shell">
# <i>qsub -I</i>
</pre>

<p>
To check the status of your jobs, use the <c>qstat</c> command:
</p>

<pre caption="Checking the status of the jobs">
# <i>qstat</i>
Job id           Name    User     Time Use  S  Queue
---------------- ------- -------- --------- -- ----------
2.geist          STDIN   adelie   0         R  upto1nodes
</pre>

</body>
</section>
<section>
<title>MPICH</title>
<body>

<p>
Message passing is a paradigm used widely on certain classes of parallel
machines, especially those with distributed memory. MPICH is a freely
available, portable implementation of MPI, the standard for message-passing
libraries.
</p>

<p>
The mpich ebuild provided by Adelie Linux allows for two USE flags:
<e>doc</e> and <e>crypt</e>. <e>doc</e> will cause documentation to be
installed, while <e>crypt</e> will configure MPICH to use <c>ssh</c> instead
of <c>rsh</c>.
</p>

<pre caption="Installing the mpich application">
# <i>emerge -p mpich</i>
# <i>emerge mpich</i>
</pre>

<p>
You may need to export an mpich work directory to all your slave nodes in
<path>/etc/exports</path>:
</p>

<pre caption="/etc/exports">
/home  *(rw)
</pre>

<p>
Most massively parallel processors (MPPs) provide a way to start a program on
a requested number of processors; <c>mpirun</c> makes use of the appropriate
command whenever possible. In contrast, workstation clusters require that each
process in a parallel job be started individually, though programs to help
start these processes exist. Because workstation clusters are not already
organized as an MPP, additional information is required to make use of them.
Mpich should be installed with a list of participating workstations in the
file <path>machines.LINUX</path> in the directory
<path>/usr/share/mpich/</path>. This file is used by <c>mpirun</c> to choose
processors to run on.
</p>

<p>
Edit this file to reflect your cluster LAN configuration:
</p>

<pre caption="/usr/share/mpich/machines.LINUX">
# Change this file to contain the machines that you want to use
# to run MPI jobs on.  The format is one host name per line, with either
#    hostname
# or
#    hostname:n
# where n is the number of processors in an SMP.  The hostname should
# be the same as the result from the command "hostname"
master
node01
node02
# node03
# node04
# ...
</pre>

<p>
Use the script <c>tstmachines</c> in <path>/usr/sbin/</path> to ensure that
you can use all of the machines that you have listed. This script performs
an <c>rsh</c> and a short directory listing; this tests both that you have
access to the node and that a program in the current directory is visible on
the remote node. If there are any problems, they will be listed. These
problems must be fixed before proceeding.
</p>

<p>
The only argument to <c>tstmachines</c> is the name of the architecture; this
is the same name as the extension on the machines file. For example, the
following tests that a program in the current directory can be executed by
all of the machines in the LINUX machines list.
</p>

<pre caption="Running a test">
# <i>/usr/sbin/tstmachines LINUX</i>
</pre>

<note>
This program is silent if all is well; if you want to see what it is doing,
use the -v (for verbose) argument:
</note>

<pre caption="Running a test verbosively">
# <i>/usr/sbin/tstmachines -v LINUX</i>
</pre>

<p>
The output from this command might look like:
</p>

<pre caption="Output of the above command">
Trying true on host1.uoffoo.edu ...
Trying true on host2.uoffoo.edu ...
Trying ls on host1.uoffoo.edu ...
Trying ls on host2.uoffoo.edu ...
Trying user program on host1.uoffoo.edu ...
Trying user program on host2.uoffoo.edu ...
</pre>

<p>
If <c>tstmachines</c> finds a problem, it will suggest possible reasons and
solutions. In brief, there are three tests:
</p>

<ul>
  <li>
    <e>Can processes be started on remote machines?</e> tstmachines attempts
    to run the shell command true on each machine in the machines file by
    using the remote shell command.
  </li>
  <li>
    <e>Is the current working directory available to all machines?</e> This
    attempts to ls a file that tstmachines creates by running ls using the
    remote shell command.
  </li>
  <li>
    <e>Can user programs be run on remote systems?</e> This checks that shared
    libraries and other components have been properly installed on all
    machines.
  </li>
</ul>

<p>
And the required test for every development tool:
</p>

<pre caption="Testing a development tool">
# <i>cd ~</i>
# <i>cp /usr/share/mpich/examples1/hello++.c ~</i>
# <i>mpiCC -o hello++ hello++.c</i>
# <i>mpirun -machinefile /usr/share/mpich/machines.LINUX -np 1 hello++</i>
</pre>

<p>
For further information on MPICH, consult the documentation at <uri
link="http://www-unix.mcs.anl.gov/mpi/mpich/docs/mpichman-chp4/mpichman-chp4.htm">http://www-unix.mcs.anl.gov/mpi/mpich/docs/mpichman-chp4/mpichman-chp4.htm</uri>.
</p>

</body>
</section>
<section>
<title>LAM</title>
<body>

<p>
(Coming Soon!)
</p>

</body>
</section>
<section>
<title>OMNI</title>
<body>

<p>
(Coming Soon!)
</p>

</body>
</section>
</chapter>

<chapter>
<title>Bibliography</title>
<section>
<body>

<p>
The original document is published at the <uri
link="http://www.adelielinux.com">Adelie Linux R&amp;D Centre</uri> web site,
and is reproduced here with the permission of the authors and <uri
link="http://www.cyberlogic.ca">Cyberlogic</uri>'s Adelie Linux R&amp;D
Centre.
</p>

<ul>
  <li><uri>http://www.gentoo.org</uri>, Gentoo Technologies, Inc.</li>
  <li>
    <uri link="http://www.adelielinux.com">http://www.adelielinux.com</uri>,
    Adelie Linux Research and Development Centre
  </li>
  <li>
    <uri link="http://nfs.sourceforge.net/">http://nfs.sourceforge.net</uri>,
    Linux NFS Project
  </li>
  <li>
    <uri link="http://www-unix.mcs.anl.gov/mpi/mpich/">http://www-unix.mcs.anl.gov/mpi/mpich/</uri>,
    Mathematics and Computer Science Division, Argonne National Laboratory
  </li>
  <li>
    <uri link="http://www.ntp.org/">http://ntp.org</uri>
  </li>
  <li>
    <uri link="http://www.eecis.udel.edu/~mills/">http://www.eecis.udel.edu/~mills/</uri>,
    David L. Mills, University of Delaware
  </li>
  <li>
    <uri link="http://www.ietf.org/html.charters/secsh-charter.html">http://www.ietf.org/html.charters/secsh-charter.html</uri>,
    Secure Shell Working Group, IETF, Internet Society
  </li>
  <li>
    <uri link="http://www.linuxsecurity.com/">http://www.linuxsecurity.com/</uri>,
    Guardian Digital
  </li>
  <li>
    <uri link="http://www.openpbs.org/">http://www.openpbs.org/</uri>,
    Altair Grid Technologies, LLC.
  </li>
</ul>

</body>
</section>
</chapter>

</guide>
