<?xml version='1.0' encoding="UTF-8"?>

<!-- $Header: /var/cvsroot/gentoo/xml/htdocs/doc/en/hpc-howto.xml,v 1.2 2005/01/03 13:52:07 neysx Exp $ -->

<!DOCTYPE guide SYSTEM "/dtd/guide.dtd">
<guide link="/doc/en/hpc-howto.xml">

<title>High Performance Computing on Gentoo Linux</title>

<author title="Author">
  <mail link="marc@adelielinux.com">Marc St-Pierre</mail>
</author>
<author title="Author">
  <mail link="benoit@adelielinux.com">Benoit Morin</mail>
</author>
<author title="Assistant/Research">
  <mail link="jean-francois@adelielinux.com">Jean-Francois Richard</mail>
</author>
<author title="Assistant/Research">
  <mail link="olivier@adelielinux.com">Olivier Crete</mail>
</author>
<author title="Reviewer">
  <mail link="spyderous@gentoo.org">Donnie Berkholz</mail>
</author>

<!-- No licensing information; this document has been written by a third-party
  organisation without additional licensing information.

  In other words, this is copyright adelielinux R&D; Gentoo only has
  permission to distribute this document as-is and update it when appropriate
  as long as the adelie linux R&D notice stays
-->

<abstract>
This document was written by people at the Adelie Linux R&amp;D Center
&lt;http://www.adelielinux.com&gt; as a step-by-step guide to turn a Gentoo
system into a High Performance Computing (HPC) system.
</abstract>

<version>1.0</version>
<date>2003-08-01</date>

<chapter>
<title>Introduction</title>
<section>
<body>

<p>
Gentoo Linux is a special flavor of Linux that can be automatically optimized
and customized for just about any application or need. Extreme performance,
configurability and a top-notch user and developer community are all hallmarks
of the Gentoo experience.
</p>

<p>
Thanks to a technology called Portage, Gentoo Linux can become an ideal secure
server, development workstation, professional desktop, gaming system, embedded
solution or... a High Performance Computing system. Because of its
near-unlimited adaptability, we call Gentoo Linux a metadistribution.
</p>

<p>
This document explains how to turn a Gentoo system into a High Performance
Computing system. Step by step, it explains what packages one may want to
install and helps configure them.
</p>

<p>
Obtain Gentoo Linux from the website <uri>http://www.gentoo.org</uri>, and
refer to the <uri link="/doc/en/">documentation</uri> at the same location to
install it.
</p>

</body>
</section>
</chapter>

<chapter>
<title>Configuring Gentoo Linux for Clustering</title>
<section>
<title>Recommended Optimizations</title>
<body>

<note>
We refer to the <uri link="/doc/en/handbook/">Gentoo Linux Handbooks</uri> in
this section.
</note>

<p>
During the installation process, you will have to set your USE variables in
<path>/etc/make.conf</path>. We recommend that you deactivate all the
defaults (see <path>/etc/make.profile/make.defaults</path>) by negating them
in make.conf. However, you may want to keep such USE variables as x86, 3dnow,
gpm, mmx, sse, ncurses, pam and tcpd. Refer to the USE documentation for more
information.
</p>

<pre caption="USE Flags">
# Copyright 2000-2003 Daniel Robbins, Gentoo Technologies, Inc.
# Contains local system settings for Portage system

# Please review 'man make.conf' for more information.

USE="-oss 3dnow -apm -arts -avi -berkdb -crypt -cups -encode -gdbm
-gif gpm -gtk -imlib -java -jpeg -kde -gnome -libg++ -libwww -mikmod
mmx -motif -mpeg ncurses -nls -oggvorbis -opengl pam -pdflib -png
-python -qt -qtmt -quicktime -readline -sdl -slang -spell -ssl
-svga tcpd -truetype -X -xml2 -xmms -xv -zlib"
</pre>

<p>
Or simply:
</p>

<pre caption="USE Flags - simplified version">
# Copyright 2000-2003 Daniel Robbins, Gentoo Technologies, Inc.
# Contains local system settings for Portage system

# Please review 'man make.conf' for more information.

USE="-* 3dnow gpm mmx ncurses pam sse tcpd"
</pre>

<note>
The <e>tcpd</e> USE flag increases security for packages such as xinetd.
</note>


<p>
In step 15 ("Installing the kernel and a System Logger"), for stability
reasons, we recommend vanilla-sources, the official kernel sources
released on <uri>http://www.kernel.org/</uri>, unless you require special
support such as xfs.
</p>

<pre caption="Installing vanilla-sources">
# <i>emerge -p syslog-ng vanilla-sources</i>
</pre>

<p>
When you install miscellaneous packages, we recommend installing the
following:
</p>

<pre caption="Installing necessary packages">
# <i>emerge -p nfs-utils portmap tcpdump ssmtp iptables xinetd</i>
</pre>

</body>
</section>
<section>
<title>Communication Layer (TCP/IP Network)</title>
<body>

<p>
A cluster requires a communication layer to interconnect the slave nodes to
the master node. Typically, a Fast Ethernet or Gigabit Ethernet LAN can be
used since they have a good price/performance ratio. Other possibilities
include use of products like <uri link="http://www.myricom.com/">Myrinet</uri>,
<uri link="http://quadrics.com/">QsNet</uri> or others.
</p>

<p>
A cluster is composed of two node types: master and slave. Typically, your
cluster will have one master node and several slave nodes.
</p>

<p>
The master node is the cluster's server. It is responsible for telling the
slave nodes what to do. This server will typically run such daemons as dhcpd,
nfs, pbs-server, and pbs-sched. Your master node will allow interactive
sessions for users, and accept job executions.
</p>

<p>
The slave nodes listen for instructions (via ssh/rsh perhaps) from the master
node. They should be dedicated to crunching results and therefore should not
run any unnecessary services.
</p>

<p>
The rest of this documentation will assume a cluster configuration as per the
hosts file below. You should maintain on every node such a hosts file
(<path>/etc/hosts</path>) with entries for each node participating in the
cluster.
</p>

<pre caption="/etc/hosts">
# Adelie Linux Research &amp; Development Center
# /etc/hosts

127.0.0.1       localhost

192.168.1.100   master.adelie master

192.168.1.1     node01.adelie node01
192.168.1.2     node02.adelie node02
</pre>

<p>
To setup your cluster dedicated LAN, edit your <path>/etc/conf.d/net</path>
file on the master node.
</p>

<pre caption="/etc/conf.d/net">
# Copyright 1999-2002 Gentoo Technologies, Inc.
# Distributed under the terms of the GNU General Public License, v2 or later

# Global config file for net.* rc-scripts

# This is basically the ifconfig argument without the ifconfig $iface
#

iface_eth0="192.168.1.100 broadcast 192.168.1.255 netmask 255.255.255.0"
# Network connection to the outside world using dhcp -- configure as
# required for your network
iface_eth1="dhcp"
</pre>
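
<p>
For the new settings to take effect, the interfaces have to be (re)started.
The commands below are a sketch, assuming baselayout's usual symlinked init
scripts and the interface names from the file above:
</p>

<pre caption="Restarting the interfaces">
# <i>ln -s /etc/init.d/net.eth0 /etc/init.d/net.eth1</i>
# <i>/etc/init.d/net.eth0 restart</i>
# <i>/etc/init.d/net.eth1 start</i>
# <i>rc-update add net.eth1 default</i>
</pre>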

<p>
Finally, setup a DHCP daemon on the master node to avoid having to maintain a
network configuration on each slave node.
</p>

<pre caption="/etc/dhcp/dhcpd.conf">
# Adelie Linux Research &amp; Development Center
# /etc/dhcp/dhcpd.conf

log-facility local7;
ddns-update-style none;
use-host-decl-names on;

subnet 192.168.1.0 netmask 255.255.255.0 {
  option domain-name "adelie";
  range 192.168.1.10 192.168.1.99;
  option routers 192.168.1.100;

  host node01.adelie {
    # MAC address of network card on node 01
    hardware ethernet 00:07:e9:0f:e2:d4;
    fixed-address 192.168.1.1;
  }
  host node02.adelie {
    # MAC address of network card on node 02
    hardware ethernet 00:07:e9:0f:e2:6b;
    fixed-address 192.168.1.2;
  }
}
</pre>
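
<p>
The DHCP daemon must also run at boot on the master node. A minimal sketch,
assuming the init script installed by <c>net-misc/dhcp</c> is called
<c>dhcp</c> (the name may differ between versions):
</p>

<pre caption="Adding the DHCP daemon to the default runlevel">
# <i>rc-update add dhcp default</i>
</pre>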

</body>
</section>
<section>
<title>NFS/NIS</title>
<body>

<p>
The Network File System (NFS) was developed to allow machines to mount a disk
partition on a remote machine as if it were on a local hard drive. This allows
for fast, seamless sharing of files across a network.
</p>

<p>
There are other systems that provide similar functionality to NFS which could
be used in a cluster environment. The <uri
link="http://www.transarc.com/Product/EFS/AFS/index.html">Andrew File System
from IBM</uri>, recently open-sourced, provides a file sharing mechanism with
some additional security and performance features. The <uri
link="http://www.coda.cs.cmu.edu/">Coda File System</uri> is still in
development, but is designed to work well with disconnected clients. Many
of the features of the Andrew and Coda file systems are slated for inclusion
in the next version of <uri link="http://www.nfsv4.org">NFS (Version 4)</uri>.
The advantage of NFS today is that it is mature, standard, well understood,
and supported robustly across a variety of platforms.
</p>

<pre caption="Ebuilds for NFS-support">
# <i>emerge -p nfs-utils portmap</i>
# <i>emerge nfs-utils portmap</i>
</pre>

<p>
Configure and install a kernel to support NFS v3 on all nodes:
</p>

<pre caption="Required Kernel Configurations for NFS">
CONFIG_NFS_FS=y
CONFIG_NFSD=y
CONFIG_SUNRPC=y
CONFIG_LOCKD=y
CONFIG_NFSD_V3=y
CONFIG_LOCKD_V4=y
</pre>

<p>
On the master node, edit your <path>/etc/hosts.allow</path> file to allow
connections from slave nodes. If your cluster LAN is on 192.168.1.0/24,
your <path>hosts.allow</path> will look like:
</p>

<pre caption="hosts.allow">
portmap:192.168.1.0/255.255.255.0
</pre>

<p>
Edit the <path>/etc/exports</path> file of the master node to export a work
directory structure (/home is good for this).
</p>

<pre caption="/etc/exports">
/home/  *(rw)
</pre>

<p>
Add nfs to your master node's default runlevel:
</p>

<pre caption="Adding NFS to the default runlevel">
# <i>rc-update add nfs default</i>
</pre>

<p>
To mount the nfs exported filesystem from the master, you also have to
configure your slave nodes' <path>/etc/fstab</path>. Add a line like this
one:
</p>

<pre caption="/etc/fstab">
master:/home/  /home  nfs  rw,exec,noauto,nouser,async  0 0
</pre>

<p>
You'll also need to set up your nodes so that they mount the nfs filesystem by
issuing this command:
</p>

<pre caption="Adding nfsmount to the default runlevel">
# <i>rc-update add nfsmount default</i>
</pre>
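
<p>
Once the nfs service is running on the master, you can check the export by
hand from a slave node. A quick sanity test, assuming the hosts file shown
earlier:
</p>

<pre caption="Verifying the NFS export from a slave node">
# <i>showmount -e master</i>
Export list for master:
/home *
# <i>mount master:/home /home</i>
</pre>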

</body>
</section>
<section>
<title>RSH/SSH</title>
<body>

<p>
SSH is a protocol for secure remote login and other secure network services
over an insecure network. OpenSSH uses public key cryptography to provide
secure authorization. The first step in configuring OpenSSH on the cluster is
to generate the public key, which will be shared with remote systems, and the
private key, which will be kept on the local system.
</p>

<p>
For transparent cluster usage, private/public keys may be used. This process
has two steps:
</p>

<ul>
  <li>Generate public and private keys</li>
  <li>Copy public key to slave nodes</li>
</ul>

<p>
For user based authentication, generate and copy as follows:
</p>

<pre caption="SSH key authentication">
# <i>ssh-keygen -t dsa</i>
Generating public/private dsa key pair.
Enter file in which to save the key (/root/.ssh/id_dsa): /root/.ssh/id_dsa
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_dsa.
Your public key has been saved in /root/.ssh/id_dsa.pub.
The key fingerprint is:
f1:45:15:40:fd:3c:2d:f7:9f:ea:55:df:76:2f:a4:1f root@master

<comment>WARNING! If you already have an "authorized_keys" file,
please append to it, do not use the following command.</comment>

# <i>scp /root/.ssh/id_dsa.pub node01:/root/.ssh/authorized_keys</i>
root@master's password:
id_dsa.pub   100%  234   2.0MB/s   00:00

# <i>scp /root/.ssh/id_dsa.pub node02:/root/.ssh/authorized_keys</i>
root@master's password:
id_dsa.pub   100%  234   2.0MB/s   00:00
</pre>
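
<p>
At this point you can verify that key based logins work without a password
prompt; for example:
</p>

<pre caption="Testing SSH key authentication">
# <i>ssh node01 hostname</i>
node01
</pre>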

<note>
Host keys must have an empty passphrase. RSA is required for host-based
authentication.
</note>

<p>
For host based authentication, you will also need to edit your
<path>/etc/ssh/shosts.equiv</path>.
</p>

<pre caption="/etc/ssh/shosts.equiv">
node01.adelie
node02.adelie
master.adelie
</pre>

<p>
And a few modifications to the <path>/etc/ssh/sshd_config</path> file:
</p>

<pre caption="sshd configurations">
# $OpenBSD: sshd_config,v 1.42 2001/09/20 20:57:51 mouring Exp $
# This sshd was compiled with PATH=/usr/bin:/bin:/usr/sbin:/sbin

# This is the sshd server system-wide configuration file. See sshd(8)
# for more information.

# HostKeys for protocol version 2
HostKey /etc/ssh/ssh_host_rsa_key
</pre>
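
<p>
Depending on your OpenSSH version, host based authentication may also need to
be switched on explicitly. The following line is a sketch of the relevant
setting, not a complete <path>sshd_config</path>:
</p>

<pre caption="Enabling host based authentication">
# Allow authentication based on shosts.equiv and host keys
HostbasedAuthentication yes
</pre>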

<p>
If your applications require RSH communications, you will need to emerge
net-misc/netkit-rsh and sys-apps/xinetd.
</p>

<pre caption="Installing necessary applications">
# <i>emerge -p xinetd</i>
# <i>emerge xinetd</i>
# <i>emerge -p netkit-rsh</i>
# <i>emerge netkit-rsh</i>
</pre>

<p>
Then configure the rsh daemon. Edit your <path>/etc/xinetd.d/rsh</path> file.
</p>

<pre caption="rsh">
# Adelie Linux Research &amp; Development Center
# /etc/xinetd.d/rsh

service shell
{
  socket_type     = stream
  protocol        = tcp
  wait            = no
  user            = root
  group           = tty
  server          = /usr/sbin/in.rshd
  log_type        = FILE /var/log/rsh
  log_on_success  = PID HOST USERID EXIT DURATION
  log_on_failure  = USERID ATTEMPT
  disable         = no
}
</pre>

<p>
Edit your <path>/etc/hosts.allow</path> to permit rsh connections:
</p>

<pre caption="hosts.allow">
# Adelie Linux Research &amp; Development Center
# /etc/hosts.allow

in.rshd:192.168.1.0/255.255.255.0
</pre>

<p>
Or you can simply trust your cluster LAN:
</p>

<pre caption="hosts.allow">
# Adelie Linux Research &amp; Development Center
# /etc/hosts.allow

ALL:192.168.1.0/255.255.255.0
</pre>

<p>
Finally, configure host authentication from <path>/etc/hosts.equiv</path>.
</p>

<pre caption="hosts.equiv">
# Adelie Linux Research &amp; Development Center
# /etc/hosts.equiv

master
node01
node02
</pre>

<p>
And, add xinetd to your default runlevel:
</p>

<pre caption="Adding xinetd to the default runlevel">
# <i>rc-update add xinetd default</i>
</pre>
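
<p>
With xinetd running on the slaves and the files above in place, a quick manual
test shows whether password-less rsh works; for example:
</p>

<pre caption="Testing rsh">
# <i>rsh node01 hostname</i>
node01
</pre>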

</body>
</section>
<section>
<title>NTP</title>
<body>

<p>
The Network Time Protocol (NTP) is used to synchronize the time of a computer
client or server to another server or reference time source, such as a radio
or satellite receiver or modem. It provides accuracies typically within a
millisecond on LANs and up to a few tens of milliseconds on WANs relative to
Coordinated Universal Time (UTC) via a Global Positioning System (GPS)
receiver, for example. Typical NTP configurations utilize multiple redundant
servers and diverse network paths in order to achieve high accuracy and
reliability.
</p>

<p>
Select an NTP server geographically close to you from <uri
link="http://www.eecis.udel.edu/~mills/ntp/servers.html">Public NTP Time
Servers</uri>, and configure your <path>/etc/conf.d/ntp</path> and
<path>/etc/ntp.conf</path> files on the master node.
</p>

<pre caption="Master /etc/conf.d/ntp">
# Copyright 1999-2002 Gentoo Technologies, Inc.
# Distributed under the terms of the GNU General Public License v2
# /etc/conf.d/ntpd

# NOTES:
# - NTPDATE variables below are used if you wish to set your
#   clock when you start the ntp init.d script
# - make sure that the NTPDATE_CMD will close by itself ...
#   the init.d script will not attempt to kill/stop it
# - ntpd will be used to maintain synchronization with a time
#   server regardless of what NTPDATE is set to
# - read each of the comments above each of the variables

# Comment this out if you don't want the init script to warn
# about not having ntpdate setup
NTPDATE_WARN="n"

# Command to run to set the clock initially
# Most people should just uncomment this line ...
# however, if you know what you're doing, and you
# want to use ntpd to set the clock, change this to 'ntpd'
NTPDATE_CMD="ntpdate"

# Options to pass to the above command
# Most people should just uncomment this variable and
# change 'someserver' to a valid hostname which you
# can acquire from the URLs below
NTPDATE_OPTS="-b ntp1.cmc.ec.gc.ca"

##
# A list of available servers is available here:
# http://www.eecis.udel.edu/~mills/ntp/servers.html
# Please follow the rules of engagement and use a
# Stratum 2 server (unless you qualify for Stratum 1)
##

# Options to pass to the ntpd process that will *always* be run
# Most people should not uncomment this line ...
# however, if you know what you're doing, feel free to tweak
#NTPD_OPTS=""
</pre>

<p>
Edit your <path>/etc/ntp.conf</path> file on the master to setup an external
synchronization source:
</p>

<pre caption="Master ntp.conf">
# Adelie Linux Research &amp; Development Center
# /etc/ntp.conf

# Synchronization source #1
server ntp1.cmc.ec.gc.ca
restrict ntp1.cmc.ec.gc.ca
# Synchronization source #2
server ntp2.cmc.ec.gc.ca
restrict ntp2.cmc.ec.gc.ca
stratum 10
driftfile /etc/ntp.drift.server
logfile /var/log/ntp
broadcast 192.168.1.255
restrict default kod
restrict 127.0.0.1
restrict 192.168.1.0 mask 255.255.255.0
</pre>

<p>
And on all your slave nodes, setup your synchronization source as your master
node.
</p>

<pre caption="Node /etc/conf.d/ntp">
# Copyright 1999-2002 Gentoo Technologies, Inc.
# Distributed under the terms of the GNU General Public License v2
# /etc/conf.d/ntpd

NTPDATE_WARN="n"
NTPDATE_CMD="ntpdate"
NTPDATE_OPTS="-b master"
</pre>

<pre caption="Node ntp.conf">
# Adelie Linux Research &amp; Development Center
# /etc/ntp.conf

# Synchronization source #1
server master
restrict master
stratum 11
driftfile /etc/ntp.drift.server
logfile /var/log/ntp
restrict default kod
restrict 127.0.0.1
</pre>

<p>
Then add ntpd to the default runlevel of all your nodes:
</p>

<pre caption="Adding ntpd to the default runlevel">
# <i>rc-update add ntpd default</i>
</pre>

<note>
NTP will not update the local clock if the time difference between your
synchronization source and the local clock is too great.
</note>
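
<p>
In that case you can step the clock once by hand with <c>ntpdate</c> (the
same command the init script runs through NTPDATE_CMD), for example on a
slave node:
</p>

<pre caption="Forcing an initial synchronization">
# <i>ntpdate -b master</i>
</pre>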

</body>
</section>
<section>
<title>IPTABLES</title>
<body>

<p>
To setup a firewall on your cluster, you will need iptables.
</p>

<pre caption="Installing iptables">
# <i>emerge -p iptables</i>
# <i>emerge iptables</i>
</pre>

<p>
Required kernel configuration:
</p>

<pre caption="IPtables kernel configuration">
CONFIG_NETFILTER=y
CONFIG_IP_NF_CONNTRACK=y
CONFIG_IP_NF_IPTABLES=y
CONFIG_IP_NF_MATCH_STATE=y
CONFIG_IP_NF_FILTER=y
CONFIG_IP_NF_TARGET_REJECT=y
CONFIG_IP_NF_NAT=y
CONFIG_IP_NF_NAT_NEEDED=y
CONFIG_IP_NF_TARGET_MASQUERADE=y
CONFIG_IP_NF_TARGET_LOG=y
</pre>

<p>
And the rules required for this firewall:
</p>

<pre caption="rules-save">
# Adelie Linux Research &amp; Development Center
# /var/lib/iptables/rules-save

*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT -p tcp -m tcp --dport 22 -j ACCEPT
-A INPUT -s 192.168.1.0/255.255.255.0 -i eth0 -j ACCEPT
-A INPUT -s 127.0.0.1 -i lo -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -j LOG
-A INPUT -j REJECT --reject-with icmp-port-unreachable
COMMIT
*nat
:PREROUTING ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A POSTROUTING -s 192.168.1.0/255.255.255.0 -j MASQUERADE
COMMIT
</pre>

<p>
Then add iptables to the default runlevel of all your nodes:
</p>

<pre caption="Adding iptables to the default runlevel">
# <i>rc-update add iptables default</i>
</pre>
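
<p>
The Gentoo iptables init script restores its rules from the saved ruleset. If
you entered the rules with the <c>iptables</c> command instead of writing the
file by hand, one way to create it is the init script's <c>save</c> target (a
sketch; the exact target may vary with the baselayout version):
</p>

<pre caption="Saving the active ruleset">
# <i>/etc/init.d/iptables save</i>
</pre>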

</body>
</section>
</chapter>

<chapter>
<title>HPC Tools</title>
<section>
<title>OpenPBS</title>
<body>

<p>
The Portable Batch System (PBS) is a flexible batch queueing and workload
management system originally developed for NASA. It operates on networked,
multi-platform UNIX environments, including heterogeneous clusters of
workstations, supercomputers, and massively parallel systems. Development of
PBS is provided by Altair Grid Technologies.
</p>

<pre caption="Installing openpbs">
# <i>emerge -p openpbs</i>
</pre>

<note>
The OpenPBS ebuild does not currently set proper permissions on the
var-directories used by OpenPBS.
</note>

<p>
Before you start using OpenPBS, some configuration is required. The files
you will need to personalize for your system are:
</p>

<ul>
  <li>/etc/pbs_environment</li>
  <li>/var/spool/PBS/server_name</li>
  <li>/var/spool/PBS/server_priv/nodes (a minimal example follows this list)</li>
  <li>/var/spool/PBS/mom_priv/config</li>
  <li>/var/spool/PBS/sched_priv/sched_config</li>
</ul>
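
<p>
As an illustration, a minimal <path>/var/spool/PBS/server_priv/nodes</path>
file for the cluster described in this guide could look like this (the
optional <c>np=</c> attribute gives the number of processors per node):
</p>

<pre caption="/var/spool/PBS/server_priv/nodes (example)">
node01 np=1
node02 np=1
</pre>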

<p>
Here is a sample queue and server configuration; these directives are fed to
the PBS server with the <c>qmgr</c> command:
</p>

<pre caption="Sample queue and server configuration (qmgr input)">
#
# Create queues and set their attributes.
#
#
# Create and define queue upto4nodes
#
create queue upto4nodes
set queue upto4nodes queue_type = Execution
set queue upto4nodes Priority = 100
set queue upto4nodes resources_max.nodect = 4
set queue upto4nodes resources_min.nodect = 1
set queue upto4nodes enabled = True
set queue upto4nodes started = True
#
# Create and define queue default
#
create queue default
set queue default queue_type = Route
set queue default route_destinations = upto4nodes
set queue default enabled = True
set queue default started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_host_enable = True
set server default_queue = default
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server resources_default.neednodes = 1
set server resources_default.nodect = 1
set server resources_default.nodes = 1
set server scheduler_iteration = 60
</pre>
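
<p>
These directives can be loaded into the PBS server by saving them to a file
(here the hypothetical <path>cluster-queues.conf</path>) and feeding it to
<c>qmgr</c>:
</p>

<pre caption="Loading the configuration with qmgr">
# <i>qmgr &lt; cluster-queues.conf</i>
</pre>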

<p>
To submit a task to OpenPBS, the command <c>qsub</c> is used with some
optional parameters. In the example below, "-l" allows you to specify
the resources required, "-j" provides for redirection of standard out and
standard error, and the "-m" will e-mail the user at beginning (b), end (e)
and on abort (a) of the job.
</p>

<pre caption="Submitting a task">
<comment>(submit and request from OpenPBS that myscript be executed on 2 nodes)</comment>
# <i>qsub -l nodes=2 -j oe -m abe myscript</i>
</pre>

<p>
Normally jobs submitted to OpenPBS are in the form of scripts. Sometimes, you
may want to try a task manually. To request an interactive shell from OpenPBS,
use the "-I" parameter.
</p>

<pre caption="Requesting an interactive shell">
# <i>qsub -I</i>
</pre>

<p>
To check the status of your jobs, use the qstat command:
</p>

<pre caption="Checking the status of the jobs">
# <i>qstat</i>
Job id    Name    User     Time Use  S  Queue
------    ----    ----     --------  -  -----
2.geist   STDIN   adelie   0         R  upto4nodes
</pre>

</body>
</section>
<section>
<title>MPICH</title>
<body>

<p>
Message passing is a paradigm used widely on certain classes of parallel
machines, especially those with distributed memory. MPICH is a freely
available, portable implementation of MPI, the Standard for message-passing
libraries.
</p>

<p>
The mpich ebuild provided by Adelie Linux allows for two USE flags:
<e>doc</e> and <e>crypt</e>. <e>doc</e> will cause documentation to be
installed, while <e>crypt</e> will configure MPICH to use <c>ssh</c> instead
of <c>rsh</c>.
</p>

<pre caption="Installing the mpich application">
# <i>emerge -p mpich</i>
# <i>emerge mpich</i>
</pre>

<p>
You may need to export a mpich work directory to all your slave nodes in
<path>/etc/exports</path>:
</p>

<pre caption="/etc/exports">
/home  *(rw)
</pre>

<p>
Most massively parallel processors (MPPs) provide a way to start a program on
a requested number of processors; <c>mpirun</c> makes use of the appropriate
command whenever possible. In contrast, workstation clusters require that each
process in a parallel job be started individually, though programs to help
start these processes exist. Because workstation clusters are not already
organized as an MPP, additional information is required to make use of them.
Mpich should be installed with a list of participating workstations in the
file <path>machines.LINUX</path> in the directory
<path>/usr/share/mpich/</path>. This file is used by <c>mpirun</c> to choose
processors to run on.
</p>

<p>
Edit this file to reflect your cluster LAN configuration:
</p>

<pre caption="/usr/share/mpich/machines.LINUX">
# Change this file to contain the machines that you want to use
# to run MPI jobs on.  The format is one host name per line, with either
#    hostname
# or
#    hostname:n
# where n is the number of processors in an SMP.  The hostname should
# be the same as the result from the command "hostname"
master
node01
node02
# node03
# node04
# ...
</pre>

<p>
Use the script <c>tstmachines</c> in <path>/usr/sbin/</path> to ensure that
you can use all of the machines that you have listed. This script performs
an <c>rsh</c> and a short directory listing; this tests that you both have
access to the node and that a program in the current directory is visible on
the remote node. If there are any problems, they will be listed. These
problems must be fixed before proceeding.
</p>

<p>
The only argument to <c>tstmachines</c> is the name of the architecture; this
is the same name as the extension on the machines file. For example, the
following tests that a program in the current directory can be executed by
all of the machines in the LINUX machines list.
</p>

<pre caption="Running a test">
# <i>/usr/sbin/tstmachines LINUX</i>
</pre>

<note>
This program is silent if all is well; if you want to see what it is doing,
use the -v (for verbose) argument:
</note>

<pre caption="Running a test verbosely">
# <i>/usr/sbin/tstmachines -v LINUX</i>
</pre>

<p>
The output from this command might look like:
</p>

<pre caption="Output of the above command">
Trying true on host1.uoffoo.edu ...
Trying true on host2.uoffoo.edu ...
Trying ls on host1.uoffoo.edu ...
Trying ls on host2.uoffoo.edu ...
Trying user program on host1.uoffoo.edu ...
Trying user program on host2.uoffoo.edu ...
</pre>

<p>
If <c>tstmachines</c> finds a problem, it will suggest possible reasons and
solutions. In brief, there are three tests:
</p>

<ul>
  <li>
    <e>Can processes be started on remote machines?</e> tstmachines attempts
    to run the shell command true on each machine in the machines file by
    using the remote shell command.
  </li>
  <li>
    <e>Is the current working directory available to all machines?</e> This
    attempts to ls a file that tstmachines creates by running ls using the
    remote shell command.
  </li>
  <li>
    <e>Can user programs be run on remote systems?</e> This checks that shared
    libraries and other components have been properly installed on all
    machines.
  </li>
</ul>

<p>
And the required test for every development tool:
</p>

<pre caption="Testing a development tool">
# <i>cd ~</i>
# <i>cp /usr/share/mpich/examples1/hello++.c ~</i>
# <i>make hello++</i>
# <i>mpirun -machinefile /usr/share/mpich/machines.LINUX -np 1 hello++</i>
</pre>
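
<p>
If plain <c>make</c> does not pick a suitable compiler for the C++ example,
the MPICH compiler wrappers can be used instead; a sketch, assuming the
wrappers were installed along with mpich:
</p>

<pre caption="Compiling with the MPICH wrappers">
# <i>mpiCC hello++.c -o hello++</i>
# <i>mpirun -machinefile /usr/share/mpich/machines.LINUX -np 2 hello++</i>
</pre>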

<p>
For further information on MPICH, consult the documentation at <uri
link="http://www-unix.mcs.anl.gov/mpi/mpich/docs/mpichman-chp4/mpichman-chp4.htm">http://www-unix.mcs.anl.gov/mpi/mpich/docs/mpichman-chp4/mpichman-chp4.htm</uri>.
</p>

</body>
</section>
<section>
<title>LAM</title>
<body>

<p>
(Coming Soon!)
</p>

</body>
</section>
<section>
<title>OMNI</title>
<body>

<p>
(Coming Soon!)
</p>

</body>
</section>
</chapter>

<chapter>
<title>Bibliography</title>
<section>
<body>

<p>
The original document is published at the <uri
link="http://www.adelielinux.com">Adelie Linux R&amp;D Centre</uri> web site,
and is reproduced here with the permission of the authors and <uri
link="http://www.cyberlogic.ca">Cyberlogic</uri>'s Adelie Linux R&amp;D
Centre.
</p>

<ul>
  <li><uri>http://www.gentoo.org</uri>, Gentoo Technologies, Inc.</li>
  <li>
    <uri link="http://www.adelielinux.com">http://www.adelielinux.com</uri>,
    Adelie Linux Research and Development Centre
  </li>
  <li>
    <uri link="http://nfs.sourceforge.net/">http://nfs.sourceforge.net</uri>,
    Linux NFS Project
  </li>
  <li>
    <uri link="http://www-unix.mcs.anl.gov/mpi/mpich/">http://www-unix.mcs.anl.gov/mpi/mpich/</uri>,
    Mathematics and Computer Science Division, Argonne National Laboratory
  </li>
  <li>
    <uri link="http://www.ntp.org/">http://ntp.org</uri>
  </li>
  <li>
    <uri link="http://www.eecis.udel.edu/~mills/">http://www.eecis.udel.edu/~mills/</uri>,
    David L. Mills, University of Delaware
  </li>
  <li>
    <uri link="http://www.ietf.org/html.charters/secsh-charter.html">http://www.ietf.org/html.charters/secsh-charter.html</uri>,
    Secure Shell Working Group, IETF, Internet Society
  </li>
  <li>
    <uri link="http://www.linuxsecurity.com/">http://www.linuxsecurity.com/</uri>,
    Guardian Digital
  </li>
  <li>
    <uri link="http://www.openpbs.org/">http://www.openpbs.org/</uri>,
    Altair Grid Technologies, LLC.
  </li>
</ul>

</body>
</section>
</chapter>

</guide>
