1 <?xml version='1.0' encoding="UTF-8"?>
2
3 <!-- $Header$ -->
4
5 <!DOCTYPE guide SYSTEM "/dtd/guide.dtd">
6 <guide link="hpc-howto.xml">
7
8 <title>High Performance Computing on Gentoo Linux</title>
9
10 <author title="Author">
11 <mail link="marc@adelielinux.com">Marc St-Pierre</mail>
12 </author>
13 <author title="Author">
14 <mail link="benoit@adelielinux.com">Benoit Morin</mail>
15 </author>
16 <author title="Assistant/Research">
17 <mail link="jean-francois@adelielinux.com">Jean-Francois Richard</mail>
18 </author>
19 <author title="Assistant/Research">
20 <mail link="olivier@adelielinux.com">Olivier Crete</mail>
21 </author>
22 <author title="Reviewer">
23 <mail link="spyderous@gentoo.org">Donnie Berkholz</mail>
24 </author>
25
26 <!-- No licensing information; this document has been written by a third-party
27 organisation without additional licensing information.
28
29 In other words, this is copyright adelielinux R&D; Gentoo only has
30 permission to distribute this document as-is and update it when appropriate
31 as long as the adelie linux R&D notice stays
32 -->
33
34 <abstract>
35 This document was written by people at the Adelie Linux R&amp;D Center
36 &lt;http://www.adelielinux.com&gt; as a
37 step-by-step guide to turn a Gentoo system into a High Performance Computing
38 (HPC) system.
39 </abstract>
40
41 <version>1.0</version>
42 <date>August 1, 2003</date>
43
44 <chapter>
45 <title>Introduction</title>
46 <section>
47 <body>
48
49 <p>
50 Gentoo Linux is a special flavor of Linux that can be automatically optimized
51 and customized for just about any application or need. Extreme performance,
52 configurability and a top-notch user and developer community are all hallmarks
53 of the Gentoo experience.
54 </p>
55
56 <p>
57 Thanks to a technology called Portage, Gentoo Linux can become an ideal secure
58 server, development workstation, professional desktop, gaming system, embedded
59 solution or... a High Performance Computing system. Because of its
60 near-unlimited adaptability, we call Gentoo Linux a metadistribution.
61 </p>
62
63 <p>
64 This document explains how to turn a Gentoo system into a High Performance
65 Computing system. Step by step, it explains which packages one may want to
66 install and helps you configure them.
67 </p>
68
69 <p>
70 Obtain Gentoo Linux from the website <uri
71 link="http://www.gentoo.org/">www.gentoo.org</uri>, and refer to the <uri
72 link="http://www.gentoo.org/main/en/docs.xml">documentation</uri> at the same
73 location to install it.
74 </p>
75
76 </body>
77 </section>
78 </chapter>
79
80 <chapter>
81 <title>Configuring Gentoo Linux for Clustering</title>
82 <section>
83 <title>Recommended Optimizations</title>
84 <body>
85
86 <note>
87 We refer to the <uri
88 link="http://www.gentoo.org/doc/en/handbook">Gentoo Linux Handbooks</uri> in
89 this section.
90 </note>
91
92 <p>
93 During the installation process, you will have to set your USE variables in
94 <path>/etc/make.conf</path>. We recommend that you deactivate all the
95 defaults (see <path>/etc/make.profile/make.defaults</path>) by negating them
96 in make.conf. However, you may want to keep such USE flags as x86, 3dnow,
97 gpm, mmx, sse, ncurses, pam and tcpd. Refer to the USE documentation for more
98 information.
99 </p>
100
101 <pre caption="USE Flags">
102 # Copyright 2000-2003 Daniel Robbins, Gentoo Technologies, Inc.
103 # Contains local system settings for Portage system
104
105 # Please review 'man make.conf' for more information.
106
107 USE="-oss 3dnow -apm -arts -avi -berkdb -crypt -cups -encode -gdbm
108 -gif gpm -gtk -imlib -java -jpeg -kde -gnome -libg++ -libwww -mikmod
109 mmx -motif -mpeg ncurses -nls -oggvorbis -opengl pam -pdflib -png
110 -python -qt -qtmt -quicktime -readline -sdl -slang -spell -ssl
111 -svga tcpd -truetype -X -xml2 -xmms -xv -zlib"
112 </pre>
113
114 <p>
115 Or simply:
116 </p>
117
118 <pre caption="USE Flags - simplified version">
119 # Copyright 2000-2003 Daniel Robbins, Gentoo Technologies, Inc.
120 # Contains local system settings for Portage system
121
122 # Please review 'man make.conf' for more information.
123
124 USE="-* 3dnow gpm mmx ncurses pam sse tcpd"
125 </pre>
126
127 <note>
128 The <e>tcpd</e> USE flag increases security for packages such as xinetd.
129 </note>
130
131 <p>
132 In step 15 ("Installing the kernel and a System Logger"), for stability
133 reasons, we recommend vanilla-sources, the official kernel sources
134 released on <uri>http://www.kernel.org/</uri>, unless you require special
135 support such as xfs.
136 </p>
137
138 <pre caption="Installing vanilla-sources">
139 # <i>emerge -p syslog-ng vanilla-sources</i>
140 </pre>
141
142 <p>
143 When you install miscellaneous packages, we recommend installing the
144 following:
145 </p>
146
147 <pre caption="Installing necessary packages">
148 # <i>emerge -p nfs-utils portmap tcpdump ssmtp iptables xinetd</i>
149 </pre>
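
<p>
Once the pretended (<c>-p</c>) merge above looks right, run the same command
again without <c>-p</c> to actually install the packages:
</p>

<pre caption="Installing the packages">
# <i>emerge nfs-utils portmap tcpdump ssmtp iptables xinetd</i>
</pre>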
150
151 </body>
152 </section>
153 <section>
154 <title>Communication Layer (TCP/IP Network)</title>
155 <body>
156
157 <p>
158 A cluster requires a communication layer to interconnect the slave nodes to
159 the master node. Typically, a Fast Ethernet or Gigabit Ethernet LAN is used
160 since it offers a good price/performance ratio. Other possibilities include
161 use of products like <uri link="http://www.myricom.com/">Myrinet</uri>, <uri
162 link="http://quadrics.com/">QsNet</uri> or others.
163 </p>
164
165 <p>
166 A cluster is composed of two node types: master and slave. Typically, your
167 cluster will have one master node and several slave nodes.
168 </p>
169
170 <p>
171 The master node is the cluster's server. It is responsible for telling the
172 slave nodes what to do. This server will typically run such daemons as dhcpd,
173 nfs, pbs-server, and pbs-sched. Your master node will allow interactive
174 sessions for users, and accept job submissions.
175 </p>
176
177 <p>
178 The slave nodes listen for instructions (via ssh/rsh perhaps) from the master
179 node. They should be dedicated to crunching results and therefore should not
180 run any unnecessary services.
181 </p>
182
183 <p>
184 The rest of this documentation will assume a cluster configuration as per the
185 hosts file below. You should maintain such a hosts file on every node
186 (<path>/etc/hosts</path>), with an entry for each node participating in the
187 cluster.
188 </p>
189
190 <pre caption="/etc/hosts">
191 # Adelie Linux Research &amp; Development Center
192 # /etc/hosts
193
194 127.0.0.1 localhost
195
196 192.168.1.100 master.adelie master
197
198 192.168.1.1 node01.adelie node01
199 192.168.1.2 node02.adelie node02
200 </pre>
201
202 <p>
203 To set up your cluster's dedicated LAN, edit your <path>/etc/conf.d/net</path>
204 file on the master node.
205 </p>
206
207 <pre caption="/etc/conf.d/net">
208 # Copyright 1999-2002 Gentoo Technologies, Inc.
209 # Distributed under the terms of the GNU General Public License, v2 or later
210
211 # Global config file for net.* rc-scripts
212
213 # This is basically the ifconfig argument without the ifconfig $iface
214 #
215
216 iface_eth0="192.168.1.100 broadcast 192.168.1.255 netmask 255.255.255.0"
217 # Network connection to the outside world using dhcp -- configure as required for your network
218 iface_eth1="dhcp"
219 </pre>
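
<p>
The second interface (<e>eth1</e> above) also needs an init script before it can
come up at boot. A minimal sketch, assuming the usual baselayout convention of
symlinking <path>net.eth1</path> to <path>net.eth0</path>:
</p>

<pre caption="Enabling the second interface (sketch)">
# <i>cd /etc/init.d</i>
# <i>ln -s net.eth0 net.eth1</i>
# <i>rc-update add net.eth1 default</i>
</pre>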
220
221
222 <p>
223 Finally, set up a DHCP daemon on the master node to avoid having to maintain a
224 network configuration on each slave node.
225 </p>
226
227 <pre caption="/etc/dhcp/dhcpd.conf">
228 # Adelie Linux Research &amp; Development Center
229 # /etc/dhcp/dhcpd.conf
230
231 log-facility local7;
232 ddns-update-style none;
233 use-host-decl-names on;
234
235 subnet 192.168.1.0 netmask 255.255.255.0 {
236 option domain-name "adelie";
237 range 192.168.1.10 192.168.1.99;
238 option routers 192.168.1.100;
239
240 host node01.adelie {
241 # MAC address of network card on node 01
242 hardware ethernet 00:07:e9:0f:e2:d4;
243 fixed-address 192.168.1.1;
244 }
245 host node02.adelie {
246 # MAC address of network card on node 02
247 hardware ethernet 00:07:e9:0f:e2:6b;
248 fixed-address 192.168.1.2;
249 }
250 }
251 </pre>
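
<p>
The configuration above assumes a DHCP server is installed and started at boot
on the master node. A possible sketch, assuming the net-misc/dhcp package (the
name of the init script may differ between versions):
</p>

<pre caption="Installing and enabling a DHCP server (sketch)">
# <i>emerge -p dhcp</i>
# <i>emerge dhcp</i>
# <i>rc-update add dhcp default</i>
</pre>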
252
253 </body>
254 </section>
255 <section>
256 <title>NFS/NIS</title>
257 <body>
258
259 <p>
260 The Network File System (NFS) was developed to allow machines to mount a disk
261 partition on a remote machine as if it were on a local hard drive. This allows
262 for fast, seamless sharing of files across a network.
263 </p>
264
265 <p>
266 There are other systems that provide similar functionality to NFS which could
267 be used in a cluster environment. The <uri
268 link="http://www.transarc.com/Product/EFS/AFS/index.html">Andrew File System
269 from IBM</uri>, recently open-sourced, provides a file sharing mechanism with
270 some additional security and performance features. The <uri
271 link="http://www.coda.cs.cmu.edu/">Coda File System</uri> is still in
272 development, but is designed to work well with disconnected clients. Many
273 of the features of the Andrew and Coda file systems are slated for inclusion
274 in the next version of <uri link="http://www.nfsv4.org">NFS (Version 4)</uri>.
275 The advantage of NFS today is that it is mature, standard, well understood,
276 and supported robustly across a variety of platforms.
277 </p>
278
279 <pre caption="Ebuilds for NFS-support">
280 # <i>emerge -p nfs-utils portmap</i>
281 # <i>emerge nfs-utils portmap</i>
282 </pre>
283
284 <p>
285 Configure and install a kernel to support NFS v3 on all nodes:
286 </p>
287
288 <pre caption="Required Kernel Configurations for NFS">
289 CONFIG_NFS_FS=y
290 CONFIG_NFSD=y
291 CONFIG_SUNRPC=y
292 CONFIG_LOCKD=y
293 CONFIG_NFSD_V3=y
294 CONFIG_LOCKD_V4=y
295 </pre>
296
297 <p>
298 On the master node, edit your <path>/etc/hosts.allow</path> file to allow
299 connections from slave nodes. If your cluster LAN is on 192.168.1.0/24,
300 your <path>hosts.allow</path> will look like:
301 </p>
302
303 <pre caption="hosts.allow">
304 portmap:192.168.1.0/255.255.255.0
305 </pre>
306
307 <p>
308 Edit the <path>/etc/exports</path> file of the master node to export a work
309 directory structure (/home is good for this).
310 </p>
311
312 <pre caption="/etc/exports">
313 /home/ *(rw)
314 </pre>
315
316 <p>
317 Add nfs to your master node's default runlevel:
318 </p>
319
320 <pre caption="Adding NFS to the default runlevel">
321 # <i>rc-update add nfs default</i>
322 </pre>
323
324 <p>
325 To mount the NFS-exported filesystem from the master, you also have to
326 configure your slave nodes' <path>/etc/fstab</path>. Add a line like this
327 one:
328 </p>
329
330 <pre caption="/etc/fstab">
331 master:/home/ /home nfs rw,exec,noauto,nouser,async 0 0
332 </pre>
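
<p>
With the export and the fstab entry in place, you can test the mount by hand on
a slave node before relying on the init scripts:
</p>

<pre caption="Testing the NFS mount manually">
# <i>mount /home</i>
# <i>df -h /home</i>
</pre>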
333
334 <p>
335 You'll also need to add nfsmount to your slave nodes' default runlevel so that
336 the NFS filesystem is mounted at boot:
337 </p>
338
339 <pre caption="Adding nfsmount to the default runlevel">
340 # <i>rc-update add nfsmount default</i>
341 </pre>
342
343 </body>
344 </section>
345 <section>
346 <title>RSH/SSH</title>
347 <body>
348
349 <p>
350 SSH is a protocol for secure remote login and other secure network services
351 over an insecure network. OpenSSH uses public key cryptography to provide
352 secure authorization. The first step in configuring OpenSSH on the cluster is
353 to generate the public key, which is shared with remote systems, and the
354 private key, which is kept on the local system.
355 </p>
356
357 <p>
358 For transparent cluster usage, private/public keys may be used. This process
359 has two steps:
360 </p>
361
362 <ul>
363 <li>Generate public and private keys</li>
364 <li>Copy public key to slave nodes</li>
365 </ul>
366
367 <p>
368 For user-based authentication, generate and copy the keys as follows:
369 </p>
370
371 <pre caption="SSH key authentication">
372 # <i>ssh-keygen -t dsa</i>
373 Generating public/private dsa key pair.
374 Enter file in which to save the key (/root/.ssh/id_dsa): /root/.ssh/id_dsa
375 Enter passphrase (empty for no passphrase):
376 Enter same passphrase again:
377 Your identification has been saved in /root/.ssh/id_dsa.
378 Your public key has been saved in /root/.ssh/id_dsa.pub.
379 The key fingerprint is:
380 f1:45:15:40:fd:3c:2d:f7:9f:ea:55:df:76:2f:a4:1f root@master
381
382 <comment>WARNING! If you already have an "authorized_keys" file,
383 please append to it, do not use the following command.</comment>
384
385 # <i>scp /root/.ssh/id_dsa.pub node01:/root/.ssh/authorized_keys</i>
386 root@master's password:
387 id_dsa.pub 100% 234 2.0MB/s 00:00
388
389 # <i>scp /root/.ssh/id_dsa.pub node02:/root/.ssh/authorized_keys</i>
390 root@master's password:
391 id_dsa.pub 100% 234 2.0MB/s 00:00
392 </pre>
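
<p>
Once the public key has been copied, a quick way to check that password-less
logins work is to run a command on a slave node; it should complete without
prompting for a password:
</p>

<pre caption="Verifying key-based logins">
# <i>ssh node01 hostname</i>
</pre>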
393
394 <note>
395 Host keys must have an empty passphrase. RSA is required for host-based
396 authentication.
397 </note>
398
399 <p>
400 For host-based authentication, you will also need to edit your
401 <path>/etc/ssh/shosts.equiv</path>.
402 </p>
403
404 <pre caption="/etc/ssh/shosts.equiv">
405 node01.adelie
406 node02.adelie
407 master.adelie
408 </pre>
409
410 <p>
411 And a few modifications to the <path>/etc/ssh/sshd_config</path> file:
412 </p>
413
414 <pre caption="sshd configurations">
415 # $OpenBSD: sshd_config,v 1.42 2001/09/20 20:57:51 mouring Exp $
416 # This sshd was compiled with PATH=/usr/bin:/bin:/usr/sbin:/sbin
417
418 # This is the sshd server system-wide configuration file. See sshd(8)
419 # for more information.
420
421 # HostKeys for protocol version 2
422 HostKey /etc/ssh/ssh_host_rsa_key
423 </pre>
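
<p>
The <path>shosts.equiv</path> file and the host key are not sufficient by
themselves; host-based authentication must usually also be enabled explicitly.
The directives below are standard OpenSSH options, shown as a sketch rather
than a complete configuration:
</p>

<pre caption="Additional host-based authentication directives (sketch)">
# In /etc/ssh/sshd_config on every node
HostbasedAuthentication yes

# In /etc/ssh/ssh_config on every node (client side)
HostbasedAuthentication yes
EnableSSHKeysign yes
</pre>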
424
425 <p>
426 If your applications require RSH communication, you will need to emerge
427 net-misc/netkit-rsh and sys-apps/xinetd.
428 </p>
429
430 <pre caption="Installing necessary applicaitons">
431 # <i>emerge -p xinetd</i>
432 # <i>emerge xinetd</i>
433 # <i>emerge -p netkit-rsh</i>
434 # <i>emerge netkit-rsh</i>
435 </pre>
436
437 <p>
438 Then configure the rsh daemon. Edit your <path>/etc/xinetd.d/rsh</path> file.
439 </p>
440
441 <pre caption="rsh">
442 # Adelie Linux Research &amp; Development Center
443 # /etc/xinetd.d/rsh
444
445 service shell
446 {
447 socket_type = stream
448 protocol = tcp
449 wait = no
450 user = root
451 group = tty
452 server = /usr/sbin/in.rshd
453 log_type = FILE /var/log/rsh
454 log_on_success = PID HOST USERID EXIT DURATION
455 log_on_failure = USERID ATTEMPT
456 disable = no
457 }
458 </pre>
459
460 <p>
461 Edit your <path>/etc/hosts.allow</path> to permit rsh connections:
462 </p>
463
464 <pre caption="hosts.allow">
465 # Adelie Linux Research &amp; Development Center
466 # /etc/hosts.allow
467
468 in.rshd:192.168.1.0/255.255.255.0
469 </pre>
470
471 <p>
472 Or you can simply trust your cluster LAN:
473 </p>
474
475 <pre caption="hosts.allow">
476 # Adelie Linux Research &amp; Development Center
477 # /etc/hosts.allow
478
479 ALL:192.168.1.0/255.255.255.0
480 </pre>
481
482 <p>
483 Finally, configure host authentication in <path>/etc/hosts.equiv</path>.
484 </p>
485
486 <pre caption="hosts.equiv">
487 # Adelie Linux Research &amp; Development Center
488 # /etc/hosts.equiv
489
490 master
491 node01
492 node02
493 </pre>
494
495 <p>
496 And, add xinetd to your default runlevel:
497 </p>
498
499 <pre caption="Adding xinetd to the default runlevel">
500 # <i>rc-update add xinetd default</i>
501 </pre>
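
<p>
Once xinetd is running on the slave nodes, you can verify the rsh setup from
the master node; the command should print the slave's hostname without asking
for a password:
</p>

<pre caption="Verifying rsh">
# <i>rsh node01 hostname</i>
</pre>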
502
503 </body>
504 </section>
505 <section>
506 <title>NTP</title>
507 <body>
508
509 <p>
510 The Network Time Protocol (NTP) is used to synchronize the time of a computer
511 client or server to another server or reference time source, such as a radio
512 or satellite receiver or modem. It provides accuracies typically within a
513 millisecond on LANs and up to a few tens of milliseconds on WANs relative to
514 Coordinated Universal Time (UTC) via a Global Positioning Service (GPS)
515 receiver, for example. Typical NTP configurations utilize multiple redundant
516 servers and diverse network paths in order to achieve high accuracy and
517 reliability.
518 </p>
519
520 <p>
521 Select an NTP server geographically close to you from <uri
522 link="http://www.eecis.udel.edu/~mills/ntp/servers.html">Public NTP Time
523 Servers</uri>, and configure your <path>/etc/conf.d/ntp</path> and
524 <path>/etc/ntp.conf</path> files on the master node.
525 </p>
526
527 <pre caption="Master /etc/conf.d/ntp">
528 # Copyright 1999-2002 Gentoo Technologies, Inc.
529 # Distributed under the terms of the GNU General Public License v2
530 # /etc/conf.d/ntpd
531
532 # NOTES:
533 # - NTPDATE variables below are used if you wish to set your
534 # clock when you start the ntp init.d script
535 # - make sure that the NTPDATE_CMD will close by itself ...
536 # the init.d script will not attempt to kill/stop it
537 # - ntpd will be used to maintain synchronization with a time
538 # server regardless of what NTPDATE is set to
539 # - read each of the comments above each of the variable
540
541 # Comment this out if you dont want the init script to warn
542 # about not having ntpdate setup
543 NTPDATE_WARN="n"
544
545 # Command to run to set the clock initially
546 # Most people should just uncomment this line ...
547 # however, if you know what you're doing, and you
548 # want to use ntpd to set the clock, change this to 'ntpd'
549 NTPDATE_CMD="ntpdate"
550
551 # Options to pass to the above command
552 # Most people should just uncomment this variable and
553 # change 'someserver' to a valid hostname which you
554 # can aquire from the URL's below
555 NTPDATE_OPTS="-b ntp1.cmc.ec.gc.ca"
556
557 ##
558 # A list of available servers is available here:
559 # http://www.eecis.udel.edu/~mills/ntp/servers.html
560 # Please follow the rules of engagement and use a
561 # Stratum 2 server (unless you qualify for Stratum 1)
562 ##
563
564 # Options to pass to the ntpd process that will *always* be run
565 # Most people should not uncomment this line ...
566 # however, if you know what you're doing, feel free to tweak
567 #NTPD_OPTS=""
568
569 </pre>
570
571 <p>
572 Edit your <path>/etc/ntp.conf</path> file on the master to set up an external
573 synchronization source:
574 </p>
575
576 <pre caption="Master ntp.conf">
577 # Adelie Linux Research &amp; Development Center
578 # /etc/ntp.conf
579
580 # Synchronization source #1
581 server ntp1.cmc.ec.gc.ca
582 restrict ntp1.cmc.ec.gc.ca
583 # Synchronization source #2
584 server ntp2.cmc.ec.gc.ca
585 restrict ntp2.cmc.ec.gc.ca
586 stratum 10
587 driftfile /etc/ntp.drift.server
588 logfile /var/log/ntp
589 broadcast 192.168.1.255
590 restrict default kod
591 restrict 127.0.0.1
592 restrict 192.168.1.0 mask 255.255.255.0
593 </pre>
594
595 <p>
596 On all your slave nodes, set up your master node as the synchronization
597 source.
598 </p>
599
600 <pre caption="Node /etc/conf.d/ntp">
601 # Copyright 1999-2002 Gentoo Technologies, Inc.
602 # Distributed under the terms of the GNU General Public License v2
603 # /etc/conf.d/ntpd
604
605 NTPDATE_WARN="n"
606 NTPDATE_CMD="ntpdate"
607 NTPDATE_OPTS="-b master"
608 </pre>
609
610 <pre caption="Node ntp.conf">
611 # Adelie Linux Research &amp; Development Center
612 # /etc/ntp.conf
613
614 # Synchronization source #1
615 server master
616 restrict master
617 stratum 11
618 driftfile /etc/ntp.drift.server
619 logfile /var/log/ntp
620 restrict default kod
621 restrict 127.0.0.1
622 </pre>
623
624 <p>
625 Then add ntpd to the default runlevel of all your nodes:
626 </p>
627
628 <pre caption="Adding ntpd to the default runlevel">
629 # <i>rc-update add ntpd default</i>
630 </pre>
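
<p>
After ntpd has been running for a few minutes, you can check that a node is
synchronizing with <c>ntpq</c>, which ships with the ntp package; an asterisk
in front of a peer means the local clock is synchronized to it:
</p>

<pre caption="Checking synchronization">
# <i>ntpq -p</i>
</pre>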
631
632 <note>
633 NTP will not update the local clock if the time difference between your
634 synchronization source and the local clock is too great.
635 </note>
636
637 </body>
638 </section>
639 <section>
640 <title>IPTABLES</title>
641 <body>
642
643 <p>
644 To set up a firewall on your cluster, you will need iptables.
645 </p>
646
647 <pre caption="Installing iptables">
648 # <i>emerge -p iptables</i>
649 # <i>emerge iptables</i>
650 </pre>
651
652 <p>
653 Required kernel configuration:
654 </p>
655
656 <pre caption="IPtables kernel configuration">
657 CONFIG_NETFILTER=y
658 CONFIG_IP_NF_CONNTRACK=y
659 CONFIG_IP_NF_IPTABLES=y
660 CONFIG_IP_NF_MATCH_STATE=y
661 CONFIG_IP_NF_FILTER=y
662 CONFIG_IP_NF_TARGET_REJECT=y
663 CONFIG_IP_NF_NAT=y
664 CONFIG_IP_NF_NAT_NEEDED=y
665 CONFIG_IP_NF_TARGET_MASQUERADE=y
666 CONFIG_IP_NF_TARGET_LOG=y
667 </pre>
668
669 <p>
670 And the rules required for this firewall:
671 </p>
672
673 <pre caption="rule-save">
674 # Adelie Linux Research &amp; Development Center
675 # /var/lib/iptables/rule-save
676
677 *filter
678 :INPUT ACCEPT [0:0]
679 :FORWARD ACCEPT [0:0]
680 :OUTPUT ACCEPT [0:0]
681 -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
682 -A INPUT -p tcp -m tcp --dport 22 -j ACCEPT
683 -A INPUT -s 192.168.1.0/255.255.255.0 -i eth1 -j ACCEPT
684 -A INPUT -s 127.0.0.1 -i lo -j ACCEPT
685 -A INPUT -p icmp -j ACCEPT
686 -A INPUT -j LOG
687 -A INPUT -j REJECT --reject-with icmp-port-unreachable
688 COMMIT
689 *nat
690 :PREROUTING ACCEPT [0:0]
691 :POSTROUTING ACCEPT [0:0]
692 :OUTPUT ACCEPT [0:0]
693 -A POSTROUTING -s 192.168.1.0/255.255.255.0 -j MASQUERADE
694 COMMIT
695 </pre>
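
<p>
For the MASQUERADE rule to actually route the slave nodes' traffic to the
outside world, the master node must also have IP forwarding enabled. A minimal
sketch:
</p>

<pre caption="Enabling IP forwarding on the master (sketch)">
# <i>echo 1 > /proc/sys/net/ipv4/ip_forward</i>
<comment>(To make this permanent, set net.ipv4.ip_forward = 1 in /etc/sysctl.conf)</comment>
</pre>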
696
697 <p>
698 Then add iptables to the default runlevel of all your nodes:
699 </p>
700
701 <pre caption="Adding iptables to the default runlevel">
702 # <i>rc-update add iptables default</i>
703 </pre>
704
705 </body>
706 </section>
707 </chapter>
708
709 <chapter>
710 <title>HPC Tools</title>
711 <section>
712 <title>OpenPBS</title>
713 <body>
714
715 <p>
716 The Portable Batch System (PBS) is a flexible batch queueing and workload
717 management system originally developed for NASA. It operates on networked,
718 multi-platform UNIX environments, including heterogeneous clusters of
719 workstations, supercomputers, and massively parallel systems. Development of
720 PBS is provided by Altair Grid Technologies.
721 </p>
722
723 <pre caption="Installing openpbs">
724 # <i>emerge -p openpbs</i>
725 </pre>
726
727 <note>
728 The OpenPBS ebuild does not currently set proper permissions on the var
729 directories used by OpenPBS.
730 </note>
731
732 <p>
733 Before you start using OpenPBS, some configuration is required. The files
734 you will need to personalize for your system are listed below:
735 </p>
736
737 <ul>
738 <li>/etc/pbs_environment</li>
739 <li>/var/spool/PBS/server_name</li>
740 <li>/var/spool/PBS/server_priv/nodes</li>
741 <li>/var/spool/PBS/mom_priv/config</li>
742 <li>/var/spool/PBS/sched_priv/sched_config</li>
743 </ul>
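
<p>
As an illustration, minimal versions of two of these files are sketched below,
using the hostnames from the cluster above; <path>server_name</path> simply
names the master node and the <path>nodes</path> file lists each slave node
with its processor count:
</p>

<pre caption="Sample server_name and nodes files (sketch)">
# /var/spool/PBS/server_name
master

# /var/spool/PBS/server_priv/nodes
node01 np=1
node02 np=1
</pre>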
744
745 <p>
746 Here is a sample configuration defining the queues and server attributes:
747 </p>
748
749 <pre caption="/var/spool/PBS/sched_priv/sched_config">
750 #
751 # Create queues and set their attributes.
752 #
753 #
754 # Create and define queue upto4nodes
755 #
756 create queue upto4nodes
757 set queue upto4nodes queue_type = Execution
758 set queue upto4nodes Priority = 100
759 set queue upto4nodes resources_max.nodect = 4
760 set queue upto4nodes resources_min.nodect = 1
761 set queue upto4nodes enabled = True
762 set queue upto4nodes started = True
763 #
764 # Create and define queue default
765 #
766 create queue default
767 set queue default queue_type = Route
768 set queue default route_destinations = upto4nodes
769 set queue default enabled = True
770 set queue default started = True
771 #
772 # Set server attributes.
773 #
774 set server scheduling = True
775 set server acl_host_enable = True
776 set server default_queue = default
777 set server log_events = 511
778 set server mail_from = adm
779 set server query_other_jobs = True
780 set server resources_default.neednodes = 1
781 set server resources_default.nodect = 1
782 set server resources_default.nodes = 1
783 set server scheduler_iteration = 60
784 </pre>
785
786 <p>
787 To submit a task to OpenPBS, the command <c>qsub</c> is used with some
788 optional parameters. In the example below, "-l" allows you to specify
789 the resources required, "-j" provides for redirection of standard out and
790 standard error, and the "-m" will e-mail the user at the beginning (b), end (e)
791 and on abort (a) of the job.
792 </p>
793
794 <pre caption="Submitting a task">
795 <comment>(submit and request from OpenPBS that myscript be executed on 2 nodes)</comment>
796 # <i>qsub -l nodes=2 -j oe -m abe myscript</i>
797 </pre>
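
<p>
<path>myscript</path> above is an ordinary shell script; the resource requests
can also be embedded in the script itself with #PBS directives. The following
is only a sketch (the program name and resource values are illustrative):
</p>

<pre caption="Sample job script (sketch)">
#!/bin/sh
#PBS -l nodes=2
#PBS -j oe
#PBS -m abe
cd $PBS_O_WORKDIR
mpirun -np 2 ./my_mpi_program
</pre>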
798
799 <p>
800 Normally jobs submitted to OpenPBS are in the form of scripts. Sometimes, you
801 may want to try a task manually. To request an interactive shell from OpenPBS,
802 use the "-I" parameter.
803 </p>
804
805 <pre caption="Requesting an interactive shell">
806 # <i>qsub -I</i>
807 </pre>
808
809 <p>
810 To check the status of your jobs, use the qstat command:
811 </p>
812
813 <pre caption="Checking the status of the jobs">
814 # <i>qstat</i>
815 Job id Name User Time Use S Queue
816 ------ ---- ---- -------- - -----
817 2.geist STDIN adelie 0 R upto4nodes
818 </pre>
819
820 </body>
821 </section>
822 <section>
823 <title>MPICH</title>
824 <body>
825
826 <p>
827 Message passing is a paradigm used widely on certain classes of parallel
828 machines, especially those with distributed memory. MPICH is a freely
829 available, portable implementation of MPI, the Standard for message-passing
830 libraries.
831 </p>
832
833 <p>
834 The mpich ebuild provided by Adelie Linux allows for two USE flags:
835 <e>doc</e> and <e>crypt</e>. <e>doc</e> will cause documentation to be
836 installed, while <e>crypt</e> will configure MPICH to use <c>ssh</c> instead
837 of <c>rsh</c>.
838 </p>
839
840 <pre caption="Installing the mpich application">
841 # <i>emerge -p mpich</i>
842 # <i>emerge mpich</i>
843 </pre>
844
845 <p>
846 You may need to export an MPICH work directory to all your slave nodes in
847 <path>/etc/exports</path>:
848 </p>
849
850 <pre caption="/etc/exports">
851 /home *(rw)
852 </pre>
853
854 <p>
855 Most massively parallel processors (MPPs) provide a way to start a program on
856 a requested number of processors; <c>mpirun</c> makes use of the appropriate
857 command whenever possible. In contrast, workstation clusters require that each
858 process in a parallel job be started individually, though programs to help
859 start these processes exist. Because workstation clusters are not already
860 organized as an MPP, additional information is required to make use of them.
861 MPICH should be installed with a list of participating workstations in the
862 file <path>machines.LINUX</path> in the directory
863 <path>/usr/share/mpich/</path>. This file is used by <c>mpirun</c> to choose
864 processors to run on.
865 </p>
866
867 <p>
868 Edit this file to reflect your cluster LAN configuration:
869 </p>
870
871 <pre caption="/usr/share/mpich/machines.LINUX">
872 # Change this file to contain the machines that you want to use
873 # to run MPI jobs on. The format is one host name per line, with either
874 # hostname
875 # or
876 # hostname:n
877 # where n is the number of processors in an SMP. The hostname should
878 # be the same as the result from the command "hostname"
879 master
880 node01
881 node02
882 # node03
883 # node04
884 # ...
885 </pre>
886
887 <p>
888 Use the script <c>tstmachines</c> in <path>/usr/sbin/</path> to ensure that
889 you can use all of the machines that you have listed. This script performs
890 an <c>rsh</c> and a short directory listing; this tests that you both have
891 access to the node and that a program in the current directory is visible on
892 the remote node. If there are any problems, they will be listed. These
893 problems must be fixed before proceeding.
894 </p>
895
896 <p>
897 The only argument to <c>tstmachines</c> is the name of the architecture; this
898 is the same name as the extension on the machines file. For example, the
899 following tests that a program in the current directory can be executed by
900 all of the machines in the LINUX machines list.
901 </p>
902
903 <pre caption="Running a test">
904 # <i>/usr/local/mpich/sbin/tstmachines LINUX</i>
905 </pre>
906
907 <note>
908 This program is silent if all is well; if you want to see what it is doing,
909 use the -v (for verbose) argument:
910 </note>
911
912 <pre caption="Running a test verbosively">
913 # <i>/usr/local/mpich/sbin/tstmachines -v LINUX</i>
914 </pre>
915
916 <p>
917 The output from this command might look like:
918 </p>
919
920 <pre caption="Output of the above command">
921 Trying true on host1.uoffoo.edu ...
922 Trying true on host2.uoffoo.edu ...
923 Trying ls on host1.uoffoo.edu ...
924 Trying ls on host2.uoffoo.edu ...
925 Trying user program on host1.uoffoo.edu ...
926 Trying user program on host2.uoffoo.edu ...
927 </pre>
928
929 <p>
930 If <c>tstmachines</c> finds a problem, it will suggest possible reasons and
931 solutions. In brief, there are three tests:
932 </p>
933
934 <ul>
935 <li>
936 <e>Can processes be started on remote machines?</e> tstmachines attempts
937 to run the shell command true on each machine in the machines files by
938 using the remote shell command.
939 </li>
940 <li>
941 <e>Is the current working directory available to all machines?</e> This
942 attempts to ls a file that tstmachines creates by running ls using the
943 remote shell command.
944 </li>
945 <li>
946 <e>Can user programs be run on remote systems?</e> This checks that shared
947 libraries and other components have been properly installed on all
948 machines.
949 </li>
950 </ul>
951
952 <p>
953 And the required test for every development tool:
954 </p>
955
956 <pre caption="Testing a development tool">
957 # <i>cd ~</i>
958 # <i>cp /usr/share/mpich/examples1/hello++.c ~</i>
959 # <i>make hello++</i>
960 # <i>mpirun -machinefile /usr/share/mpich/machines.LINUX -np 1 hello++</i>
961 </pre>
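
<p>
If the plain <c>make</c> rule does not pick up the MPI headers and libraries,
you can call the compiler wrapper that MPICH installs directly; <c>mpiCC</c> is
the C++ wrapper, and the machine file and process count are the same as above:
</p>

<pre caption="Compiling with the MPICH wrapper (sketch)">
# <i>mpiCC hello++.c -o hello++</i>
# <i>mpirun -machinefile /usr/share/mpich/machines.LINUX -np 1 hello++</i>
</pre>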
962
963 <p>
964 For further information on MPICH, consult the documentation at <uri
965 link="http://www-unix.mcs.anl.gov/mpi/mpich/docs/mpichman-chp4/mpichman-chp4.htm">http://www-unix.mcs.anl.gov/mpi/mpich/docs/mpichman-chp4/mpichman-chp4.htm</uri>.
966 </p>
967
968 </body>
969 </section>
970 <section>
971 <title>LAM</title>
972 <body>
973
974 <p>
975 (Coming Soon!)
976 </p>
977
978 </body>
979 </section>
980 <section>
981 <title>OMNI</title>
982 <body>
983
984 <p>
985 (Coming Soon!)
986 </p>
987
988 </body>
989 </section>
990 </chapter>
991
992 <chapter>
993 <title>Bibliography</title>
994 <section>
995 <body>
996
997 <p>
998 The original document is published at the <uri
999 link="http://www.adelielinux.com">Adelie Linux R&amp;D Centre</uri> web site,
1000 and is reproduced here with the permission of the authors and <uri
1001 link="http://www.cyberlogic.ca">Cyberlogic</uri>'s Adelie Linux R&amp;D
1002 Centre.
1003 </p>
1004
1005 <ul>
1006 <li>
1007 <uri link="http://www.gentoo.org">http://www.gentoo.org</uri>, Gentoo
1008 Technologies, Inc.
1009 </li>
1010 <li>
1011 <uri link="http://www.adelielinux.com">http://www.adelielinux.com</uri>,
1012 Adelie Linux Research and Development Centre
1013 </li>
1014 <li>
1015 <uri link="http://nfs.sourceforge.net/">http://nfs.sourceforge.net</uri>,
1016 Linux NFS Project
1017 </li>
1018 <li>
1019 <uri link="http://www-unix.mcs.anl.gov/mpi/mpich/">http://www-unix.mcs.anl.gov/mpi/mpich/</uri>,
1020 Mathematics and Computer Science Division, Argonne National Laboratory
1021 </li>
1022 <li>
1023 <uri link="http://www.ntp.org/">http://ntp.org</uri>
1024 </li>
1025 <li>
1026 <uri link="http://www.eecis.udel.edu/~mills/">http://www.eecis.udel.edu/~mills/</uri>,
1027 David L. Mills, University of Delaware
1028 </li>
1029 <li>
1030 <uri link="http://www.ietf.org/html.charters/secsh-charter.html">http://www.ietf.org/html.charters/secsh-charter.html</uri>,
1031 Secure Shell Working Group, IETF, Internet Society
1032 </li>
1033 <li>
1034 <uri link="http://www.linuxsecurity.com/">http://www.linuxsecurity.com/</uri>,
1035 Guardian Digital
1036 </li>
1037 <li>
1038 <uri link="http://www.openpbs.org/">http://www.openpbs.org/</uri>,
1039 Altair Grid Technologies, LLC.
1040 </li>
1041 </ul>
1042
1043 </body>
1044 </section>
1045 </chapter>
1046
1047 </guide>
