1 swift 1.1 <?xml version='1.0' encoding="UTF-8"?>
2 nightmorph 1.15 <!-- $Header: /var/cvsroot/gentoo/xml/htdocs/doc/en/hpc-howto.xml,v 1.14 2008/05/19 20:56:20 swift Exp $ -->
3 rane 1.5 <!DOCTYPE guide SYSTEM "/dtd/guide.dtd">
4 swift 1.1
5 nightmorph 1.15 <guide>
6 swift 1.1 <title>High Performance Computing on Gentoo Linux</title>
7    
8     <author title="Author">
9     <mail link="marc@adelielinux.com">Marc St-Pierre</mail>
10     </author>
11     <author title="Author">
12     <mail link="benoit@adelielinux.com">Benoit Morin</mail>
13     </author>
14     <author title="Assistant/Research">
15     <mail link="jean-francois@adelielinux.com">Jean-Francois Richard</mail>
16     </author>
17     <author title="Assistant/Research">
18     <mail link="olivier@adelielinux.com">Olivier Crete</mail>
19     </author>
20     <author title="Reviewer">
21 neysx 1.8 <mail link="dberkholz@gentoo.org">Donnie Berkholz</mail>
22 swift 1.1 </author>
23 nightmorph 1.15 <author title="Editor">
24     <mail link="nightmorph"/>
25     </author>
26 swift 1.1
27     <!-- No licensing information; this document has been written by a third-party
28     organisation without additional licensing information.
29    
30     In other words, this is copyright adelielinux R&D; Gentoo only has
31     permission to distribute this document as-is and update it when appropriate
32     as long as the adelie linux R&D notice stays
33     -->
34 swift 1.14
35 swift 1.1 <abstract>
36     This document was written by people at the Adelie Linux R&amp;D Center
37 neysx 1.2 &lt;http://www.adelielinux.com&gt; as a step-by-step guide to turn a Gentoo
38 nightmorph 1.7 System into a High Performance Computing (HPC) system.
39 swift 1.1 </abstract>
40    
41 nightmorph 1.15 <version>1.7</version>
42     <date>2010-06-07</date>
43 swift 1.1
44     <chapter>
45     <title>Introduction</title>
46     <section>
47     <body>
48    
49     <p>
Gentoo Linux is a special flavor of Linux that can be automatically optimized
and customized for just about any application or need. Extreme performance,
configurability and a top-notch user and developer community are all hallmarks
of the Gentoo experience.
54     </p>
55    
56     <p>
57 swift 1.14 Thanks to a technology called Portage, Gentoo Linux can become an ideal secure
58 swift 1.1 server, development workstation, professional desktop, gaming system, embedded
59 swift 1.14 solution or... a High Performance Computing system. Because of its
60 swift 1.1 near-unlimited adaptability, we call Gentoo Linux a metadistribution.
61     </p>
62    
63     <p>
64 swift 1.14 This document explains how to turn a Gentoo system into a High Performance
65     Computing system. Step by step, it explains what packages one may want to
66 swift 1.1 install and helps configure them.
67     </p>
68    
69     <p>
70 neysx 1.2 Obtain Gentoo Linux from the website <uri>http://www.gentoo.org</uri>, and
71     refer to the <uri link="/doc/en/">documentation</uri> at the same location to
72     install it.
73 swift 1.1 </p>
74    
75     </body>
76     </section>
77     </chapter>
78    
79     <chapter>
80     <title>Configuring Gentoo Linux for Clustering</title>
81     <section>
82     <title>Recommended Optimizations</title>
83     <body>
84    
85     <note>
86 neysx 1.2 We refer to the <uri link="/doc/en/handbook/">Gentoo Linux Handbooks</uri> in
87 swift 1.1 this section.
88     </note>
89    
90     <p>
During the installation process, you will have to set your USE variables in
<path>/etc/make.conf</path>. We recommend that you deactivate all the
defaults (see <path>/etc/make.profile/make.defaults</path>) by negating them in
<path>make.conf</path>. However, you may want to keep USE flags such as 3dnow, gpm,
mmx, nptl, nptlonly, sse, ncurses, pam and tcpd. Refer to the USE documentation
for more information.
97 swift 1.1 </p>
98    
99     <pre caption="USE Flags">
100 nightmorph 1.15 USE="-oss 3dnow -apm -avi -berkdb -crypt -cups -encode -gdbm -gif gpm -gtk
101 nightmorph 1.13 -imlib -java -jpeg -kde -gnome -libg++ -libwww -mikmod mmx -motif -mpeg ncurses
102 nightmorph 1.15 -nls nptl nptlonly -ogg -opengl pam -pdflib -png -python -qt4 -qtmt
103     -quicktime -readline -sdl -slang -spell -ssl -svga tcpd -truetype -vorbis -X
104     -xml2 -xv -zlib"
105 swift 1.1 </pre>
106    
107     <p>
108     Or simply:
109     </p>
110    
111     <pre caption="USE Flags - simplified version">
112     USE="-* 3dnow gpm mmx ncurses pam sse tcpd"
113     </pre>
114    
115     <note>
116     The <e>tcpd</e> USE flag increases security for packages such as xinetd.
117     </note>
118    
119     <p>
In step 15 ("Installing the kernel and a System Logger"), we recommend
vanilla-sources, the official kernel sources released on
<uri>http://www.kernel.org/</uri>, for stability reasons, unless you require
special support such as xfs.
124     </p>
125    
126     <pre caption="Installing vanilla-sources">
127 nightmorph 1.15 # <i>emerge -a syslog-ng vanilla-sources</i>
128 swift 1.1 </pre>
129    
130     <p>
131 swift 1.14 When you install miscellaneous packages, we recommend installing the
132 swift 1.1 following:
133     </p>
134    
135     <pre caption="Installing necessary packages">
136 nightmorph 1.15 # <i>emerge -a nfs-utils portmap tcpdump ssmtp iptables xinetd</i>
137 swift 1.1 </pre>
138    
139     </body>
140     </section>
141     <section>
142     <title>Communication Layer (TCP/IP Network)</title>
143     <body>
144    
145     <p>
146 swift 1.14 A cluster requires a communication layer to interconnect the slave nodes to
the master node. Typically, a Fast Ethernet or Gigabit Ethernet LAN is used
since it has a good price/performance ratio. Other possibilities include
149     use of products like <uri link="http://www.myricom.com/">Myrinet</uri>, <uri
150 swift 1.1 link="http://quadrics.com/">QsNet</uri> or others.
151     </p>
152    
153     <p>
154 swift 1.14 A cluster is composed of two node types: master and slave. Typically, your
155 swift 1.1 cluster will have one master node and several slave nodes.
156     </p>
157    
158     <p>
159 swift 1.14 The master node is the cluster's server. It is responsible for telling the
160     slave nodes what to do. This server will typically run such daemons as dhcpd,
161     nfs, pbs-server, and pbs-sched. Your master node will allow interactive
162 swift 1.1 sessions for users, and accept job executions.
163     </p>
164    
165     <p>
166 swift 1.14 The slave nodes listen for instructions (via ssh/rsh perhaps) from the master
167     node. They should be dedicated to crunching results and therefore should not
168 rane 1.6 run any unnecessary services.
169 swift 1.1 </p>
170    
171     <p>
The rest of this documentation assumes a cluster configuration as per the
hosts file below. You should maintain on every node a hosts file
(<path>/etc/hosts</path>) with entries for each node participating in the
cluster.
176     </p>
177    
178     <pre caption="/etc/hosts">
179     # Adelie Linux Research &amp; Development Center
180     # /etc/hosts
181    
182 rane 1.6 127.0.0.1 localhost
183 swift 1.1
184 rane 1.6 192.168.1.100 master.adelie master
185 swift 1.1
186 rane 1.6 192.168.1.1 node01.adelie node01
187     192.168.1.2 node02.adelie node02
188 swift 1.1 </pre>
189    
190     <p>
To set up your cluster's dedicated LAN, edit your <path>/etc/conf.d/net</path>
192 swift 1.1 file on the master node.
193     </p>
194    
195     <pre caption="/etc/conf.d/net">
196     # Global config file for net.* rc-scripts
197    
198     # This is basically the ifconfig argument without the ifconfig $iface
199     #
200    
201     iface_eth0="192.168.1.100 broadcast 192.168.1.255 netmask 255.255.255.0"
# Network connection to the outside world using dhcp -- configure as required for your network
203     iface_eth1="dhcp"
204     </pre>
205    
206    
207     <p>
Finally, set up a DHCP daemon on the master node to avoid having to maintain a
209 swift 1.1 network configuration on each slave node.
210     </p>
211    
212     <pre caption="/etc/dhcp/dhcpd.conf">
213     # Adelie Linux Research &amp; Development Center
214     # /etc/dhcp/dhcpd.conf
215    
216     log-facility local7;
217     ddns-update-style none;
218     use-host-decl-names on;
219    
220     subnet 192.168.1.0 netmask 255.255.255.0 {
221     option domain-name "adelie";
222     range 192.168.1.10 192.168.1.99;
223     option routers 192.168.1.100;
224    
225     host node01.adelie {
226 rane 1.6 # MAC address of network card on node 01
227 swift 1.1 hardware ethernet 00:07:e9:0f:e2:d4;
228     fixed-address 192.168.1.1;
229     }
230     host node02.adelie {
231 rane 1.6 # MAC address of network card on node 02
232 swift 1.1 hardware ethernet 00:07:e9:0f:e2:6b;
233     fixed-address 192.168.1.2;
234     }
235     }
236     </pre>
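
<p>
If the DHCP server is not already present on the master node, it can be
installed and added to the default runlevel. The following is a minimal
sketch; it assumes the ISC DHCP server from <c>net-misc/dhcp</c>, whose init
script is named <c>dhcpd</c>:
</p>

<pre caption="Installing and starting the DHCP daemon (sketch)">
# <i>emerge -a dhcp</i>
# <i>rc-update add dhcpd default</i>
# <i>/etc/init.d/dhcpd start</i>
</pre>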
237    
238     </body>
239     </section>
240     <section>
241     <title>NFS/NIS</title>
242     <body>
243    
244     <p>
245 swift 1.14 The Network File System (NFS) was developed to allow machines to mount a disk
246 swift 1.1 partition on a remote machine as if it were on a local hard drive. This allows
247     for fast, seamless sharing of files across a network.
248     </p>
249    
250     <p>
251     There are other systems that provide similar functionality to NFS which could
252 swift 1.14 be used in a cluster environment. The <uri
253     link="http://www.openafs.org">Andrew File System
254     from IBM</uri>, recently open-sourced, provides a file sharing mechanism with
255     some additional security and performance features. The <uri
256     link="http://www.coda.cs.cmu.edu/">Coda File System</uri> is still in
257     development, but is designed to work well with disconnected clients. Many
258 swift 1.1 of the features of the Andrew and Coda file systems are slated for inclusion
259     in the next version of <uri link="http://www.nfsv4.org">NFS (Version 4)</uri>.
260 swift 1.14 The advantage of NFS today is that it is mature, standard, well understood,
261 swift 1.1 and supported robustly across a variety of platforms.
262     </p>
263    
264     <pre caption="Ebuilds for NFS-support">
265 nightmorph 1.15 # <i>emerge -a nfs-utils portmap</i>
266 swift 1.1 </pre>
267    
268     <p>
269     Configure and install a kernel to support NFS v3 on all nodes:
270     </p>
271    
272     <pre caption="Required Kernel Configurations for NFS">
273     CONFIG_NFS_FS=y
274     CONFIG_NFSD=y
275     CONFIG_SUNRPC=y
276     CONFIG_LOCKD=y
277     CONFIG_NFSD_V3=y
278     CONFIG_LOCKD_V4=y
279     </pre>
280    
281     <p>
282 swift 1.14 On the master node, edit your <path>/etc/hosts.allow</path> file to allow
283     connections from slave nodes. If your cluster LAN is on 192.168.1.0/24,
284 swift 1.1 your <path>hosts.allow</path> will look like:
285     </p>
286    
287     <pre caption="hosts.allow">
288     portmap:192.168.1.0/255.255.255.0
289     </pre>
290    
291     <p>
292 swift 1.14 Edit the <path>/etc/exports</path> file of the master node to export a work
293 rane 1.6 directory structure (/home is good for this).
294 swift 1.1 </p>
295    
296     <pre caption="/etc/exports">
297 rane 1.6 /home/ *(rw)
298 swift 1.1 </pre>
299    
300     <p>
301     Add nfs to your master node's default runlevel:
302     </p>
303    
304     <pre caption="Adding NFS to the default runlevel">
305     # <i>rc-update add nfs default</i>
306     </pre>
307    
308     <p>
To mount the NFS-exported filesystem from the master, you also have to
configure your slave nodes' <path>/etc/fstab</path>. Add a line like this
one:
312     </p>
313    
314     <pre caption="/etc/fstab">
315 rane 1.6 master:/home/ /home nfs rw,exec,noauto,nouser,async 0 0
316 swift 1.1 </pre>
317    
318     <p>
319 swift 1.14 You'll also need to set up your nodes so that they mount the nfs filesystem by
320 swift 1.1 issuing this command:
321     </p>
322    
323     <pre caption="Adding nfsmount to the default runlevel">
324     # <i>rc-update add nfsmount default</i>
325     </pre>
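
<p>
To verify the setup, you can check the export list from a slave node and mount
the share by hand. This is only a quick sanity check; <c>showmount</c> is
provided by <c>nfs-utils</c>, and the mount uses the <path>/etc/fstab</path>
entry shown above:
</p>

<pre caption="Verifying the NFS export from a slave node (sketch)">
<comment>(List the exports offered by the master)</comment>
# <i>showmount -e master</i>
<comment>(Mount /home using the fstab entry above)</comment>
# <i>mount /home</i>
</pre>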
326    
327     </body>
328     </section>
329     <section>
330     <title>RSH/SSH</title>
331     <body>
332    
333     <p>
SSH is a protocol for secure remote login and other secure network services
over an insecure network. OpenSSH uses public key cryptography to provide
secure authorization. The first step in configuring OpenSSH on the cluster is
to generate the public key, which is shared with remote systems, and the
private key, which is kept on the local system.
339     </p>
340    
341     <p>
342 swift 1.14 For transparent cluster usage, private/public keys may be used. This process
343 swift 1.1 has two steps:
344     </p>
345    
346     <ul>
347     <li>Generate public and private keys</li>
348     <li>Copy public key to slave nodes</li>
349     </ul>
350    
351     <p>
For user-based authentication, generate and copy the keys as follows:
353 swift 1.1 </p>
354    
355     <pre caption="SSH key authentication">
356     # <i>ssh-keygen -t dsa</i>
357     Generating public/private dsa key pair.
358     Enter file in which to save the key (/root/.ssh/id_dsa): /root/.ssh/id_dsa
359     Enter passphrase (empty for no passphrase):
360     Enter same passphrase again:
361     Your identification has been saved in /root/.ssh/id_dsa.
362     Your public key has been saved in /root/.ssh/id_dsa.pub.
363     The key fingerprint is:
364     f1:45:15:40:fd:3c:2d:f7:9f:ea:55:df:76:2f:a4:1f root@master
365    
366     <comment>WARNING! If you already have an "authorized_keys" file,
367     please append to it, do not use the following command.</comment>
368    
369     # <i>scp /root/.ssh/id_dsa.pub node01:/root/.ssh/authorized_keys</i>
370     root@master's password:
371     id_dsa.pub 100% 234 2.0MB/s 00:00
372    
373     # <i>scp /root/.ssh/id_dsa.pub node02:/root/.ssh/authorized_keys</i>
374     root@master's password:
375     id_dsa.pub 100% 234 2.0MB/s 00:00
376     </pre>
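
<p>
Once the public key is in place on a slave node, logging in from the master
should no longer prompt for a password. A quick way to confirm this, assuming
the hostnames from the <path>/etc/hosts</path> file above:
</p>

<pre caption="Verifying key-based logins (sketch)">
<comment>(Should print the remote hostname without asking for a password)</comment>
# <i>ssh node01 hostname</i>
</pre>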
377    
378     <note>
379 swift 1.14 Host keys must have an empty passphrase. RSA is required for host-based
380 rane 1.6 authentication.
381 swift 1.1 </note>
382    
383     <p>
For host-based authentication, you will also need to edit your
385 swift 1.1 <path>/etc/ssh/shosts.equiv</path>.
386     </p>
387    
388     <pre caption="/etc/ssh/shosts.equiv">
389     node01.adelie
390     node02.adelie
391     master.adelie
392     </pre>
393    
394     <p>
395     And a few modifications to the <path>/etc/ssh/sshd_config</path> file:
396     </p>
397    
398     <pre caption="sshd configurations">
399     # $OpenBSD: sshd_config,v 1.42 2001/09/20 20:57:51 mouring Exp $
400     # This sshd was compiled with PATH=/usr/bin:/bin:/usr/sbin:/sbin
401    
402 swift 1.14 # This is the sshd server system-wide configuration file. See sshd(8)
403 swift 1.1 # for more information.
404    
405     # HostKeys for protocol version 2
406     HostKey /etc/ssh/ssh_host_rsa_key
407     </pre>
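
<p>
The stock <path>sshd_config</path> does not enable host-based authentication by
itself. If you want to use it together with the <path>shosts.equiv</path> file
above, the following additions are a minimal sketch: <c>HostbasedAuthentication</c>
on the server side and, on the client side, <c>HostbasedAuthentication</c> plus
<c>EnableSSHKeysign</c> in <path>/etc/ssh/ssh_config</path>:
</p>

<pre caption="Enabling host-based authentication (sketch)">
<comment>(/etc/ssh/sshd_config on every node)</comment>
HostbasedAuthentication yes

<comment>(/etc/ssh/ssh_config on every node)</comment>
HostbasedAuthentication yes
EnableSSHKeysign yes
</pre>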
408    
409     <p>
If your application requires RSH communications, you will need to emerge
411 nightmorph 1.15 <c>net-misc/netkit-rsh</c> and <c>sys-apps/xinetd</c>.
412 swift 1.1 </p>
413    
<pre caption="Installing necessary applications">
415 nightmorph 1.15 # <i>emerge -a xinetd</i>
416     # <i>emerge -a netkit-rsh</i>
417 swift 1.1 </pre>
418    
419     <p>
Then configure the rsh daemon. Edit your <path>/etc/xinetd.d/rsh</path> file.
421 swift 1.1 </p>
422    
423     <pre caption="rsh">
424     # Adelie Linux Research &amp; Development Center
425     # /etc/xinetd.d/rsh
426    
427     service shell
428     {
429     socket_type = stream
430     protocol = tcp
431     wait = no
432     user = root
433     group = tty
434     server = /usr/sbin/in.rshd
435     log_type = FILE /var/log/rsh
436     log_on_success = PID HOST USERID EXIT DURATION
437     log_on_failure = USERID ATTEMPT
438     disable = no
439     }
440     </pre>
441    
442     <p>
443     Edit your <path>/etc/hosts.allow</path> to permit rsh connections:
444     </p>
445    
446     <pre caption="hosts.allow">
447     # Adelie Linux Research &amp; Development Center
448     # /etc/hosts.allow
449    
450     in.rshd:192.168.1.0/255.255.255.0
451     </pre>
452    
453     <p>
454     Or you can simply trust your cluster LAN:
455     </p>
456    
457     <pre caption="hosts.allow">
458     # Adelie Linux Research &amp; Development Center
459 swift 1.14 # /etc/hosts.allow
460 swift 1.1
461     ALL:192.168.1.0/255.255.255.0
462     </pre>
463    
464     <p>
465 rane 1.6 Finally, configure host authentication from <path>/etc/hosts.equiv</path>.
466 swift 1.1 </p>
467    
468     <pre caption="hosts.equiv">
469     # Adelie Linux Research &amp; Development Center
470     # /etc/hosts.equiv
471    
472     master
473     node01
474     node02
475     </pre>
476    
477     <p>
478     And, add xinetd to your default runlevel:
479     </p>
480    
481     <pre caption="Adding xinetd to the default runlevel">
482     # <i>rc-update add xinetd default</i>
483     </pre>
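
<p>
With xinetd running and the rules above in place, a quick test from the master
node should confirm that rsh logins work without a password prompt. This
assumes the hostnames from the <path>/etc/hosts</path> file shown earlier:
</p>

<pre caption="Testing rsh (sketch)">
# <i>/etc/init.d/xinetd start</i>
<comment>(Should print the remote hostname)</comment>
# <i>rsh node01 hostname</i>
</pre>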
484    
485     </body>
486     </section>
487     <section>
488     <title>NTP</title>
489     <body>
490    
491     <p>
492 swift 1.14 The Network Time Protocol (NTP) is used to synchronize the time of a computer
493     client or server to another server or reference time source, such as a radio
494     or satellite receiver or modem. It provides accuracies typically within a
495     millisecond on LANs and up to a few tens of milliseconds on WANs relative to
496     Coordinated Universal Time (UTC) via a Global Positioning Service (GPS)
497 swift 1.1 receiver, for example. Typical NTP configurations utilize multiple redundant
498 swift 1.14 servers and diverse network paths in order to achieve high accuracy and
499 swift 1.1 reliability.
500     </p>
501    
502     <p>
Select an NTP server geographically close to you from <uri
504     link="http://www.eecis.udel.edu/~mills/ntp/servers.html">Public NTP Time
505     Servers</uri>, and configure your <path>/etc/conf.d/ntp</path> and
506 swift 1.1 <path>/etc/ntp.conf</path> files on the master node.
507     </p>
508    
509     <pre caption="Master /etc/conf.d/ntp">
510     # /etc/conf.d/ntpd
511    
512     # NOTES:
513     # - NTPDATE variables below are used if you wish to set your
514     # clock when you start the ntp init.d script
515     # - make sure that the NTPDATE_CMD will close by itself ...
516     # the init.d script will not attempt to kill/stop it
517     # - ntpd will be used to maintain synchronization with a time
518     # server regardless of what NTPDATE is set to
# - read each of the comments above each of the variables
520    
521     # Comment this out if you dont want the init script to warn
522     # about not having ntpdate setup
523     NTPDATE_WARN="n"
524    
525     # Command to run to set the clock initially
526     # Most people should just uncomment this line ...
527     # however, if you know what you're doing, and you
528     # want to use ntpd to set the clock, change this to 'ntpd'
529     NTPDATE_CMD="ntpdate"
530    
531     # Options to pass to the above command
532     # Most people should just uncomment this variable and
533     # change 'someserver' to a valid hostname which you
534 rane 1.6 # can acquire from the URL's below
535 swift 1.1 NTPDATE_OPTS="-b ntp1.cmc.ec.gc.ca"
536    
537     ##
538     # A list of available servers is available here:
539     # http://www.eecis.udel.edu/~mills/ntp/servers.html
540     # Please follow the rules of engagement and use a
541     # Stratum 2 server (unless you qualify for Stratum 1)
542     ##
543    
544     # Options to pass to the ntpd process that will *always* be run
545     # Most people should not uncomment this line ...
546     # however, if you know what you're doing, feel free to tweak
547     #NTPD_OPTS=""
548    
549     </pre>
550    
551     <p>
Edit your <path>/etc/ntp.conf</path> file on the master to set up an external
553 swift 1.1 synchronization source:
554     </p>
555    
556     <pre caption="Master ntp.conf">
557     # Adelie Linux Research &amp; Development Center
558     # /etc/ntp.conf
559    
560     # Synchronization source #1
561     server ntp1.cmc.ec.gc.ca
562     restrict ntp1.cmc.ec.gc.ca
563     # Synchronization source #2
564     server ntp2.cmc.ec.gc.ca
565     restrict ntp2.cmc.ec.gc.ca
566     stratum 10
567     driftfile /etc/ntp.drift.server
568 swift 1.14 logfile /var/log/ntp
569 swift 1.1 broadcast 192.168.1.255
570     restrict default kod
571     restrict 127.0.0.1
572     restrict 192.168.1.0 mask 255.255.255.0
573     </pre>
574    
575     <p>
On all your slave nodes, set up your synchronization source as your master
577 swift 1.1 node.
578     </p>
579    
580     <pre caption="Node /etc/conf.d/ntp">
581     # /etc/conf.d/ntpd
582    
583     NTPDATE_WARN="n"
584     NTPDATE_CMD="ntpdate"
585     NTPDATE_OPTS="-b master"
586     </pre>
587    
588     <pre caption="Node ntp.conf">
589     # Adelie Linux Research &amp; Development Center
590     # /etc/ntp.conf
591    
592     # Synchronization source #1
593     server master
594     restrict master
595     stratum 11
596     driftfile /etc/ntp.drift.server
597 swift 1.14 logfile /var/log/ntp
598 swift 1.1 restrict default kod
599     restrict 127.0.0.1
600     </pre>
601    
602     <p>
603     Then add ntpd to the default runlevel of all your nodes:
604     </p>
605    
606     <pre caption="Adding ntpd to the default runlevel">
607     # <i>rc-update add ntpd default</i>
608     </pre>
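
<p>
After ntpd has been running for a few minutes, you can check that each node is
actually synchronizing. <c>ntpq</c> ships with the ntp package; a peer marked
with an asterisk is the currently selected time source:
</p>

<pre caption="Checking NTP synchronization (sketch)">
# <i>ntpq -p</i>
</pre>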
609    
610     <note>
611 swift 1.14 NTP will not update the local clock if the time difference between your
612 swift 1.1 synchronization source and the local clock is too great.
613     </note>
614    
615     </body>
616     </section>
617     <section>
618     <title>IPTABLES</title>
619     <body>
620    
621     <p>
To set up a firewall on your cluster, you will need iptables.
623     </p>
624    
625     <pre caption="Installing iptables">
626 nightmorph 1.15 # <i>emerge -a iptables</i>
627 swift 1.1 </pre>
628    
629     <p>
630     Required kernel configuration:
631     </p>
632    
633     <pre caption="IPtables kernel configuration">
634     CONFIG_NETFILTER=y
635     CONFIG_IP_NF_CONNTRACK=y
636     CONFIG_IP_NF_IPTABLES=y
637     CONFIG_IP_NF_MATCH_STATE=y
638     CONFIG_IP_NF_FILTER=y
639     CONFIG_IP_NF_TARGET_REJECT=y
640     CONFIG_IP_NF_NAT=y
641     CONFIG_IP_NF_NAT_NEEDED=y
642     CONFIG_IP_NF_TARGET_MASQUERADE=y
643     CONFIG_IP_NF_TARGET_LOG=y
644     </pre>
645    
646     <p>
647     And the rules required for this firewall:
648     </p>
649    
650     <pre caption="rule-save">
651     # Adelie Linux Research &amp; Development Center
652 rane 1.6 # /var/lib/iptables/rule-save
653 swift 1.1
654     *filter
655     :INPUT ACCEPT [0:0]
656     :FORWARD ACCEPT [0:0]
657     :OUTPUT ACCEPT [0:0]
658     -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
659     -A INPUT -p tcp -m tcp --dport 22 -j ACCEPT
660     -A INPUT -s 192.168.1.0/255.255.255.0 -i eth1 -j ACCEPT
661     -A INPUT -s 127.0.0.1 -i lo -j ACCEPT
662     -A INPUT -p icmp -j ACCEPT
663     -A INPUT -j LOG
664     -A INPUT -j REJECT --reject-with icmp-port-unreachable
665     COMMIT
666     *nat
667     :PREROUTING ACCEPT [0:0]
668     :POSTROUTING ACCEPT [0:0]
669     :OUTPUT ACCEPT [0:0]
670     -A POSTROUTING -s 192.168.1.0/255.255.255.0 -j MASQUERADE
671     COMMIT
672     </pre>
673    
674     <p>
675     Then add iptables to the default runlevel of all your nodes:
676     </p>
677    
678     <pre caption="Adding iptables to the default runlevel">
679     # <i>rc-update add iptables default</i>
680     </pre>
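
<p>
On Gentoo, the iptables init script loads its rules from a saved rule file at
boot. One way to populate that file is to enter the rules once (or restore
them from the listing above) and then save them; this is a sketch, and the
exact rule file path may differ between versions of the init script:
</p>

<pre caption="Saving the firewall rules (sketch)">
# <i>iptables-restore &lt; /var/lib/iptables/rule-save</i>
# <i>/etc/init.d/iptables save</i>
# <i>/etc/init.d/iptables start</i>
</pre>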
681    
682     </body>
683     </section>
684     </chapter>
685    
686     <chapter>
687     <title>HPC Tools</title>
688     <section>
689     <title>OpenPBS</title>
690     <body>
691    
692     <p>
693 swift 1.14 The Portable Batch System (PBS) is a flexible batch queueing and workload
694 swift 1.1 management system originally developed for NASA. It operates on networked,
695 swift 1.14 multi-platform UNIX environments, including heterogeneous clusters of
696     workstations, supercomputers, and massively parallel systems. Development of
697 swift 1.1 PBS is provided by Altair Grid Technologies.
698     </p>
699    
700     <pre caption="Installing openpbs">
701 nightmorph 1.15 # <i>emerge -a openpbs</i>
702 swift 1.1 </pre>
703    
704     <note>
The OpenPBS ebuild does not currently set proper permissions on the var directories
706 swift 1.1 used by OpenPBS.
707     </note>
708    
709     <p>
Before you start using OpenPBS, some configuration is required. The files
711 swift 1.1 you will need to personalize for your system are:
712     </p>
713    
714     <ul>
715 rane 1.6 <li>/etc/pbs_environment</li>
716     <li>/var/spool/PBS/server_name</li>
717     <li>/var/spool/PBS/server_priv/nodes</li>
718     <li>/var/spool/PBS/mom_priv/config</li>
719     <li>/var/spool/PBS/sched_priv/sched_config</li>
720 swift 1.1 </ul>
721    
722     <p>
Here is a sample server and queue configuration, which can be loaded with the
<c>qmgr</c> command:
</p>

<pre caption="Server and queue configuration (qmgr)">
727     #
728     # Create queues and set their attributes.
729     #
730     #
731     # Create and define queue upto4nodes
732     #
733     create queue upto4nodes
734     set queue upto4nodes queue_type = Execution
735     set queue upto4nodes Priority = 100
736     set queue upto4nodes resources_max.nodect = 4
737     set queue upto4nodes resources_min.nodect = 1
738     set queue upto4nodes enabled = True
739     set queue upto4nodes started = True
740     #
741     # Create and define queue default
742     #
743     create queue default
744     set queue default queue_type = Route
745     set queue default route_destinations = upto4nodes
746     set queue default enabled = True
747     set queue default started = True
748     #
749     # Set server attributes.
750     #
751     set server scheduling = True
752     set server acl_host_enable = True
753     set server default_queue = default
754     set server log_events = 511
755     set server mail_from = adm
756     set server query_other_jobs = True
757     set server resources_default.neednodes = 1
758     set server resources_default.nodect = 1
759     set server resources_default.nodes = 1
760     set server scheduler_iteration = 60
761     </pre>
762    
763     <p>
764 swift 1.14 To submit a task to OpenPBS, the command <c>qsub</c> is used with some
765     optional parameters. In the example below, "-l" allows you to specify
766 swift 1.1 the resources required, "-j" provides for redirection of standard out and
767 swift 1.14 standard error, and the "-m" will e-mail the user at beginning (b), end (e)
768 swift 1.1 and on abort (a) of the job.
769     </p>
770    
771     <pre caption="Submitting a task">
772     <comment>(submit and request from OpenPBS that myscript be executed on 2 nodes)</comment>
773     # <i>qsub -l nodes=2 -j oe -m abe myscript</i>
774     </pre>
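
<p>
The same options can be embedded in the job script itself using <c>#PBS</c>
directives, so users do not have to repeat them on the command line. A minimal
sketch of such a script (the resource values are only examples):
</p>

<pre caption="Sample PBS job script (sketch)">
#!/bin/sh
#PBS -l nodes=2
#PBS -j oe
#PBS -m abe

cd $PBS_O_WORKDIR
./myscript
</pre>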
775    
776     <p>
777 swift 1.14 Normally jobs submitted to OpenPBS are in the form of scripts. Sometimes, you
778     may want to try a task manually. To request an interactive shell from OpenPBS,
779 swift 1.1 use the "-I" parameter.
780     </p>
781    
782     <pre caption="Requesting an interactive shell">
783     # <i>qsub -I</i>
784     </pre>
785    
786     <p>
787     To check the status of your jobs, use the qstat command:
788     </p>
789    
790     <pre caption="Checking the status of the jobs">
791     # <i>qstat</i>
Job id    Name   User     Time Use  S  Queue
--------  -----  -------  --------  -  ----------
2.geist   STDIN  adelie   0         R  upto4nodes
795     </pre>
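
<p>
A queued or running job can be removed with <c>qdel</c>, using the job id
reported by <c>qstat</c>:
</p>

<pre caption="Deleting a job">
# <i>qdel 2.geist</i>
</pre>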
796    
797     </body>
798     </section>
799     <section>
800     <title>MPICH</title>
801     <body>
802    
803     <p>
804 swift 1.14 Message passing is a paradigm used widely on certain classes of parallel
805     machines, especially those with distributed memory. MPICH is a freely
806     available, portable implementation of MPI, the Standard for message-passing
807 swift 1.1 libraries.
808     </p>
809    
810     <p>
811 swift 1.14 The mpich ebuild provided by Adelie Linux allows for two USE flags:
812     <e>doc</e> and <e>crypt</e>. <e>doc</e> will cause documentation to be
813     installed, while <e>crypt</e> will configure MPICH to use <c>ssh</c> instead
814 swift 1.1 of <c>rsh</c>.
815     </p>
816    
817     <pre caption="Installing the mpich application">
818 nightmorph 1.15 # <i>emerge -a mpich</i>
819 swift 1.1 </pre>
820    
821     <p>
You may need to export an mpich work directory to all your slave nodes in
823 swift 1.1 <path>/etc/exports</path>:
824     </p>
825    
826     <pre caption="/etc/exports">
827 rane 1.6 /home *(rw)
828 swift 1.1 </pre>
829    
830     <p>
831 swift 1.14 Most massively parallel processors (MPPs) provide a way to start a program on
832     a requested number of processors; <c>mpirun</c> makes use of the appropriate
833 swift 1.1 command whenever possible. In contrast, workstation clusters require that each
834 swift 1.14 process in a parallel job be started individually, though programs to help
835     start these processes exist. Because workstation clusters are not already
836     organized as an MPP, additional information is required to make use of them.
837     Mpich should be installed with a list of participating workstations in the
838     file <path>machines.LINUX</path> in the directory
839     <path>/usr/share/mpich/</path>. This file is used by <c>mpirun</c> to choose
840 swift 1.1 processors to run on.
841     </p>
842    
843     <p>
Edit this file to reflect your cluster LAN configuration:
845     </p>
846    
847     <pre caption="/usr/share/mpich/machines.LINUX">
848     # Change this file to contain the machines that you want to use
849 swift 1.14 # to run MPI jobs on. The format is one host name per line, with either
850 swift 1.1 # hostname
851     # or
852     # hostname:n
853 swift 1.14 # where n is the number of processors in an SMP. The hostname should
854 swift 1.1 # be the same as the result from the command "hostname"
855     master
856     node01
857     node02
858     # node03
859     # node04
860     # ...
861     </pre>
862    
863     <p>
864 swift 1.14 Use the script <c>tstmachines</c> in <path>/usr/sbin/</path> to ensure that
865     you can use all of the machines that you have listed. This script performs
an <c>rsh</c> and a short directory listing; this tests both that you have
access to the node and that a program in the current directory is visible on
868     the remote node. If there are any problems, they will be listed. These
869 swift 1.1 problems must be fixed before proceeding.
870     </p>
871    
872     <p>
873 swift 1.14 The only argument to <c>tstmachines</c> is the name of the architecture; this
874     is the same name as the extension on the machines file. For example, the
875     following tests that a program in the current directory can be executed by
876 swift 1.1 all of the machines in the LINUX machines list.
877     </p>
878    
879     <pre caption="Running a test">
880     # <i>/usr/local/mpich/sbin/tstmachines LINUX</i>
881     </pre>
882    
883     <note>
884 swift 1.14 This program is silent if all is well; if you want to see what it is doing,
885 swift 1.1 use the -v (for verbose) argument:
886     </note>
887    
888     <pre caption="Running a test verbosively">
889     # <i>/usr/local/mpich/sbin/tstmachines -v LINUX</i>
890     </pre>
891    
892     <p>
893     The output from this command might look like:
894     </p>
895    
896     <pre caption="Output of the above command">
897     Trying true on host1.uoffoo.edu ...
898     Trying true on host2.uoffoo.edu ...
899     Trying ls on host1.uoffoo.edu ...
900     Trying ls on host2.uoffoo.edu ...
901     Trying user program on host1.uoffoo.edu ...
902     Trying user program on host2.uoffoo.edu ...
903     </pre>
904    
905     <p>
906 swift 1.14 If <c>tstmachines</c> finds a problem, it will suggest possible reasons and
907 swift 1.1 solutions. In brief, there are three tests:
908     </p>
909    
910     <ul>
911 rane 1.6 <li>
912 swift 1.14 <e>Can processes be started on remote machines?</e> tstmachines attempts
to run the shell command true on each machine in the machines file by
914 swift 1.1 using the remote shell command.
915     </li>
916 rane 1.6 <li>
<e>Is the current working directory available to all machines?</e> This
918     attempts to ls a file that tstmachines creates by running ls using the
919 swift 1.1 remote shell command.
920     </li>
921 rane 1.6 <li>
922 swift 1.1 <e>Can user programs be run on remote systems?</e> This checks that shared
923 swift 1.14 libraries and other components have been properly installed on all
924 swift 1.1 machines.
925     </li>
926     </ul>
927    
928     <p>
Finally, the standard test for any development tool, a simple "hello world" program:
930     </p>
931    
932     <pre caption="Testing a development tool">
933     # <i>cd ~</i>
934     # <i>cp /usr/share/mpich/examples1/hello++.c ~</i>
935     # <i>make hello++</i>
936     # <i>mpirun -machinefile /usr/share/mpich/machines.LINUX -np 1 hello++</i>
937     </pre>
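
<p>
Once the single-process run works, the same binary can be started across the
cluster by raising the process count; with the three hosts listed in
<path>machines.LINUX</path> above, for example:
</p>

<pre caption="Running on several nodes (sketch)">
# <i>mpirun -machinefile /usr/share/mpich/machines.LINUX -np 3 hello++</i>
</pre>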
938    
939     <p>
940 swift 1.14 For further information on MPICH, consult the documentation at <uri
941 swift 1.1 link="http://www-unix.mcs.anl.gov/mpi/mpich/docs/mpichman-chp4/mpichman-chp4.htm">http://www-unix.mcs.anl.gov/mpi/mpich/docs/mpichman-chp4/mpichman-chp4.htm</uri>.
942     </p>
943    
944     </body>
945     </section>
946     <section>
947     <title>LAM</title>
948     <body>
949    
950     <p>
951     (Coming Soon!)
952     </p>
953    
954     </body>
955     </section>
956     <section>
957     <title>OMNI</title>
958     <body>
959    
960     <p>
961     (Coming Soon!)
962     </p>
963    
964     </body>
965     </section>
966     </chapter>
967    
968     <chapter>
969     <title>Bibliography</title>
970     <section>
971     <body>
972    
973     <p>
974 swift 1.14 The original document is published at the <uri
975     link="http://www.adelielinux.com">Adelie Linux R&amp;D Centre</uri> web site,
976     and is reproduced here with the permission of the authors and <uri
977     link="http://www.cyberlogic.ca">Cyberlogic</uri>'s Adelie Linux R&amp;D
978 swift 1.1 Centre.
979     </p>
980    
981     <ul>
982 vapier 1.9 <li><uri>http://www.gentoo.org</uri>, Gentoo Foundation, Inc.</li>
983 rane 1.6 <li>
984 swift 1.14 <uri link="http://www.adelielinux.com">http://www.adelielinux.com</uri>,
985 swift 1.1 Adelie Linux Research and Development Centre
986     </li>
987 rane 1.6 <li>
988 swift 1.14 <uri link="http://nfs.sourceforge.net/">http://nfs.sourceforge.net</uri>,
989 swift 1.1 Linux NFS Project
990     </li>
991 rane 1.6 <li>
992 swift 1.14 <uri link="http://www-unix.mcs.anl.gov/mpi/mpich/">http://www-unix.mcs.anl.gov/mpi/mpich/</uri>,
993 swift 1.1 Mathematics and Computer Science Division, Argonne National Laboratory
994     </li>
995 rane 1.6 <li>
996 swift 1.1 <uri link="http://www.ntp.org/">http://ntp.org</uri>
997     </li>
998 rane 1.6 <li>
999 swift 1.14 <uri link="http://www.eecis.udel.edu/~mills/">http://www.eecis.udel.edu/~mills/</uri>,
1000 swift 1.1 David L. Mills, University of Delaware
1001     </li>
1002 rane 1.6 <li>
1003 swift 1.14 <uri link="http://www.ietf.org/html.charters/secsh-charter.html">http://www.ietf.org/html.charters/secsh-charter.html</uri>,
1004 swift 1.1 Secure Shell Working Group, IETF, Internet Society
1005     </li>
1006 rane 1.6 <li>
1007 swift 1.14 <uri link="http://www.linuxsecurity.com/">http://www.linuxsecurity.com/</uri>,
1008 swift 1.1 Guardian Digital
1009     </li>
1010 rane 1.6 <li>
1011 swift 1.14 <uri link="http://www.openpbs.org/">http://www.openpbs.org/</uri>,
1012 swift 1.1 Altair Grid Technologies, LLC.
1013     </li>
1014     </ul>
1015    
1016     </body>
1017     </section>
1018     </chapter>
1019    
1020     </guide>
