1 <?xml version='1.0' encoding="UTF-8"?>
2
3 <!-- $Header: /var/cvsroot/gentoo/xml/htdocs/doc/en/hpc-howto.xml,v 1.1 2005/01/03 10:00:04 swift Exp $ -->
4
5 <!DOCTYPE guide SYSTEM "/dtd/guide.dtd">
6 <guide link="/doc/en/hpc-howto.xml">
7
8 <title>High Performance Computing on Gentoo Linux</title>
9
10 <author title="Author">
11 <mail link="marc@adelielinux.com">Marc St-Pierre</mail>
12 </author>
13 <author title="Author">
14 <mail link="benoit@adelielinux.com">Benoit Morin</mail>
15 </author>
16 <author title="Assistant/Research">
17 <mail link="jean-francois@adelielinux.com">Jean-Francois Richard</mail>
18 </author>
19 <author title="Assistant/Research">
20 <mail link="olivier@adelielinux.com">Olivier Crete</mail>
21 </author>
22 <author title="Reviewer">
23 <mail link="spyderous@gentoo.org">Donnie Berkholz</mail>
24 </author>
25
26 <!-- No licensing information; this document has been written by a third-party
27 organisation without additional licensing information.
28
29 In other words, this is copyright adelielinux R&D; Gentoo only has
30 permission to distribute this document as-is and update it when appropriate
31 as long as the adelie linux R&D notice stays
32 -->
33
34 <abstract>
35 This document was written by people at the Adelie Linux R&amp;D Center
36 &lt;http://www.adelielinux.com&gt; as a step-by-step guide to turn a Gentoo
37 system into a High Performance Computing (HPC) system.
38 </abstract>
39
40 <version>1.0</version>
41 <date>2003-08-01</date>
42
43 <chapter>
44 <title>Introduction</title>
45 <section>
46 <body>
47
48 <p>
49 Gentoo Linux is a special flavor of Linux that can be automatically optimized
50 and customized for just about any application or need. Extreme performance,
51 configurability and a top-notch user and developer community are all hallmarks
52 of the Gentoo experience.
53 </p>
54
55 <p>
56 Thanks to a technology called Portage, Gentoo Linux can become an ideal secure
57 server, development workstation, professional desktop, gaming system, embedded
58 solution or... a High Performance Computing system. Because of its
59 near-unlimited adaptability, we call Gentoo Linux a metadistribution.
60 </p>
61
62 <p>
63 This document explains how to turn a Gentoo system into a High Performance
64 Computing system. Step by step, it explains what packages one may want to
65 install and helps you configure them.
66 </p>
67
68 <p>
69 Obtain Gentoo Linux from the website <uri>http://www.gentoo.org</uri>, and
70 refer to the <uri link="/doc/en/">documentation</uri> at the same location to
71 install it.
72 </p>
73
74 </body>
75 </section>
76 </chapter>
77
78 <chapter>
79 <title>Configuring Gentoo Linux for Clustering</title>
80 <section>
81 <title>Recommended Optimizations</title>
82 <body>
83
84 <note>
85 We refer to the <uri link="/doc/en/handbook/">Gentoo Linux Handbooks</uri> in
86 this section.
87 </note>
88
89 <p>
90 During the installation process, you will have to set your USE variables in
91 <path>/etc/make.conf</path>. We recommend that you deactivate all the
92 defaults (see <path>/etc/make.profile/make.defaults</path>) by negating them
93 in make.conf. However, you may want to keep such USE flags as x86, 3dnow,
94 gpm, mmx, sse, ncurses, pam and tcpd. Refer to the USE documentation for more
95 information.
96 </p>
97
98 <pre caption="USE Flags">
99 # Copyright 2000-2003 Daniel Robbins, Gentoo Technologies, Inc.
100 # Contains local system settings for Portage system
101
102 # Please review 'man make.conf' for more information.
103
104 USE="-oss 3dnow -apm -arts -avi -berkdb -crypt -cups -encode -gdbm
105 -gif gpm -gtk -imlib -java -jpeg -kde -gnome -libg++ -libwww -mikmod
106 mmx -motif -mpeg ncurses -nls -oggvorbis -opengl pam -pdflib -png
107 -python -qt -qtmt -quicktime -readline -sdl -slang -spell -ssl
108 -svga tcpd -truetype -X -xml2 -xmms -xv -zlib"
109 </pre>
110
111 <p>
112 Or simply:
113 </p>
114
115 <pre caption="USE Flags - simplified version">
116 # Copyright 2000-2003 Daniel Robbins, Gentoo Technologies, Inc.
117 # Contains local system settings for Portage system
118
119 # Please review 'man make.conf' for more information.
120
121 USE="-* 3dnow gpm mmx ncurses pam sse tcpd"
122 </pre>
123
124 <note>
125 The <e>tcpd</e> USE flag increases security for packages such as xinetd.
126 </note>
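
<p>
If you want to double-check which flags Portage will actually use after editing
<path>/etc/make.conf</path>, you can ask Portage directly:
</p>

<pre caption="Checking the active USE flags">
# <i>emerge --info | grep ^USE</i>
</pre>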
127
128 <p>
129 In step 15 ("Installing the kernel and a System Logger"), for stability
130 reasons, we recommend the vanilla-sources, the official kernel sources
131 released on <uri>http://www.kernel.org/</uri>, unless you require special
132 support such as xfs.
133 </p>
134
135 <pre caption="Installing vanilla-sources">
136 # <i>emerge -p syslog-ng vanilla-sources</i>
137 </pre>
138
139 <p>
140 When you install miscellaneous packages, we recommend installing the
141 following:
142 </p>
143
144 <pre caption="Installing necessary packages">
145 # <i>emerge -p nfs-utils portmap tcpdump ssmtp iptables xinetd</i>
146 </pre>
147
148 </body>
149 </section>
150 <section>
151 <title>Communication Layer (TCP/IP Network)</title>
152 <body>
153
154 <p>
155 A cluster requires a communication layer to interconnect the slave nodes to
156 the master node. Typically, a Fast Ethernet or Gigabit Ethernet LAN can be used
157 since it offers a good price/performance ratio. Other possibilities include
158 use of products like <uri link="http://www.myricom.com/">Myrinet</uri>, <uri
159 link="http://quadrics.com/">QsNet</uri> or others.
160 </p>
161
162 <p>
163 A cluster is composed of two node types: master and slave. Typically, your
164 cluster will have one master node and several slave nodes.
165 </p>
166
167 <p>
168 The master node is the cluster's server. It is responsible for telling the
169 slave nodes what to do. This server will typically run such daemons as dhcpd,
170 nfs, pbs-server, and pbs-sched. Your master node will allow interactive
171 sessions for users, and accept job executions.
172 </p>
173
174 <p>
175 The slave nodes listen for instructions (via ssh/rsh perhaps) from the master
176 node. They should be dedicated to crunching results and therefore should not
177 run any unnecessary services.
178 </p>
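
<p>
A simple way to keep a slave node lean is to review its default runlevel and
remove services it does not need. The service removed below is only an example;
which scripts you can safely drop depends on your installation:
</p>

<pre caption="Trimming services on a slave node (example)">
# <i>rc-update show</i>
<comment>(Remove anything the slaves do not need, for example:)</comment>
# <i>rc-update del cupsd default</i>
</pre>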
179
180 <p>
181 The rest of this documentation will assume a cluster configuration as per the
182 hosts file below. You should maintain such a hosts file
183 (<path>/etc/hosts</path>) on every node, with entries for each node
184 participating in the cluster.
185 </p>
186
187 <pre caption="/etc/hosts">
188 # Adelie Linux Research &amp; Development Center
189 # /etc/hosts
190
191 127.0.0.1 localhost
192
193 192.168.1.100 master.adelie master
194
195 192.168.1.1 node01.adelie node01
196 192.168.1.2 node02.adelie node02
197 </pre>
198
199 <p>
200 To set up your cluster's dedicated LAN, edit your <path>/etc/conf.d/net</path>
201 file on the master node.
202 </p>
203
204 <pre caption="/etc/conf.d/net">
205 # Copyright 1999-2002 Gentoo Technologies, Inc.
206 # Distributed under the terms of the GNU General Public License, v2 or later
207
208 # Global config file for net.* rc-scripts
209
210 # This is basically the ifconfig argument without the ifconfig $iface
211 #
212
213 iface_eth0="192.168.1.100 broadcast 192.168.1.255 netmask 255.255.255.0"
214 # Network Connection to the outside world using dhcp -- configure as required for your network
215 iface_eth1="dhcp"
216 </pre>
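
<p>
Depending on your baselayout, you may still need an init script for the second
interface; on a typical installation this is simply a symlink to
<path>net.eth0</path>. Add both interfaces to the default runlevel so they come
up at boot:
</p>

<pre caption="Bringing both interfaces up at boot">
# <i>ln -s net.eth0 /etc/init.d/net.eth1</i>
# <i>rc-update add net.eth0 default</i>
# <i>rc-update add net.eth1 default</i>
</pre>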
217
218
219 <p>
220 Finally, set up a DHCP daemon on the master node to avoid having to maintain a
221 network configuration on each slave node.
222 </p>
223
224 <pre caption="/etc/dhcp/dhcpd.conf">
225 # Adelie Linux Research &amp; Development Center
226 # /etc/dhcp/dhcpd.conf
227
228 log-facility local7;
229 ddns-update-style none;
230 use-host-decl-names on;
231
232 subnet 192.168.1.0 netmask 255.255.255.0 {
233 option domain-name "adelie";
234 range 192.168.1.10 192.168.1.99;
235 option routers 192.168.1.100;
236
237 host node01.adelie {
238 # MAC address of network card on node 01
239 hardware ethernet 00:07:e9:0f:e2:d4;
240 fixed-address 192.168.1.1;
241 }
242 host node02.adelie {
243 # MAC address of network card on node 02
244 hardware ethernet 00:07:e9:0f:e2:6b;
245 fixed-address 192.168.1.2;
246 }
247 }
248 </pre>
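
<p>
The DHCP server itself is provided by net-misc/dhcp. Install it on the master
node and add it to the default runlevel; note that the init script may be
called <c>dhcp</c> rather than <c>dhcpd</c> depending on the version of the
ebuild:
</p>

<pre caption="Installing and enabling the DHCP server">
# <i>emerge -p dhcp</i>
# <i>emerge dhcp</i>
# <i>rc-update add dhcpd default</i>
# <i>/etc/init.d/dhcpd start</i>
</pre>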
249
250 </body>
251 </section>
252 <section>
253 <title>NFS/NIS</title>
254 <body>
255
256 <p>
257 The Network File System (NFS) was developed to allow machines to mount a disk
258 partition on a remote machine as if it were on a local hard drive. This allows
259 for fast, seamless sharing of files across a network.
260 </p>
261
262 <p>
263 There are other systems that provide similar functionality to NFS which could
264 be used in a cluster environment. The <uri
265 link="http://www.transarc.com/Product/EFS/AFS/index.html">Andrew File System
266 from IBM</uri>, recently open-sourced, provides a file sharing mechanism with
267 some additional security and performance features. The <uri
268 link="http://www.coda.cs.cmu.edu/">Coda File System</uri> is still in
269 development, but is designed to work well with disconnected clients. Many
270 of the features of the Andrew and Coda file systems are slated for inclusion
271 in the next version of <uri link="http://www.nfsv4.org">NFS (Version 4)</uri>.
272 The advantage of NFS today is that it is mature, standard, well understood,
273 and supported robustly across a variety of platforms.
274 </p>
275
276 <pre caption="Ebuilds for NFS-support">
277 # <i>emerge -p nfs-utils portmap</i>
278 # <i>emerge nfs-utils portmap</i>
279 </pre>
280
281 <p>
282 Configure and install a kernel to support NFS v3 on all nodes:
283 </p>
284
285 <pre caption="Required Kernel Configurations for NFS">
286 CONFIG_NFS_FS=y
287 CONFIG_NFSD=y
288 CONFIG_SUNRPC=y
289 CONFIG_LOCKD=y
290 CONFIG_NFSD_V3=y
291 CONFIG_LOCKD_V4=y
292 </pre>
293
294 <p>
295 On the master node, edit your <path>/etc/hosts.allow</path> file to allow
296 connections from slave nodes. If your cluster LAN is on 192.168.1.0/24,
297 your <path>hosts.allow</path> will look like:
298 </p>
299
300 <pre caption="hosts.allow">
301 portmap:192.168.1.0/255.255.255.0
302 </pre>
303
304 <p>
305 Edit the <path>/etc/exports</path> file of the master node to export a work
306 directory structure (/home is good for this).
307 </p>
308
309 <pre caption="/etc/exports">
310 /home/ *(rw)
311 </pre>
312
313 <p>
314 Add nfs to your master node's default runlevel:
315 </p>
316
317 <pre caption="Adding NFS to the default runlevel">
318 # <i>rc-update add nfs default</i>
319 </pre>
320
321 <p>
322 To mount the NFS-exported filesystem from the master, you also have to
323 configure your slave nodes' <path>/etc/fstab</path>. Add a line like this
324 one:
325 </p>
326
327 <pre caption="/etc/fstab">
328 master:/home/ /home nfs rw,exec,noauto,nouser,async 0 0
329 </pre>
330
331 <p>
332 You'll also need to set up your nodes so that they mount the NFS filesystem by
333 issuing this command:
334 </p>
335
336 <pre caption="Adding nfsmount to the default runlevel">
337 # <i>rc-update add nfsmount default</i>
338 </pre>
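
<p>
Before relying on the init scripts, you can verify the export by hand from one
of the slave nodes. The output below is only indicative:
</p>

<pre caption="Verifying the NFS export from a slave node">
# <i>showmount -e master</i>
Export list for master:
/home *
# <i>mount /home</i>
</pre>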
339
340 </body>
341 </section>
342 <section>
343 <title>RSH/SSH</title>
344 <body>
345
346 <p>
347 SSH is a protocol for secure remote login and other secure network services
348 over an insecure network. OpenSSH uses public key cryptography to provide
349 secure authentication. The first step in configuring OpenSSH on the cluster is
350 to generate the public key, which is shared with remote systems, and the
351 private key, which is kept on the local system.
352 </p>
353
354 <p>
355 For transparent cluster usage, private/public keys may be used. This process
356 has two steps:
357 </p>
358
359 <ul>
360 <li>Generate public and private keys</li>
361 <li>Copy public key to slave nodes</li>
362 </ul>
363
364 <p>
365 For user-based authentication, generate and copy the keys as follows:
366 </p>
367
368 <pre caption="SSH key authentication">
369 # <i>ssh-keygen -t dsa</i>
370 Generating public/private dsa key pair.
371 Enter file in which to save the key (/root/.ssh/id_dsa): /root/.ssh/id_dsa
372 Enter passphrase (empty for no passphrase):
373 Enter same passphrase again:
374 Your identification has been saved in /root/.ssh/id_dsa.
375 Your public key has been saved in /root/.ssh/id_dsa.pub.
376 The key fingerprint is:
377 f1:45:15:40:fd:3c:2d:f7:9f:ea:55:df:76:2f:a4:1f root@master
378
379 <comment>WARNING! If you already have an "authorized_keys" file,
380 please append to it, do not use the following command.</comment>
381
382 # <i>scp /root/.ssh/id_dsa.pub node01:/root/.ssh/authorized_keys</i>
383 root@master's password:
384 id_dsa.pub 100% 234 2.0MB/s 00:00
385
386 # <i>scp /root/.ssh/id_dsa.pub node02:/root/.ssh/authorized_keys</i>
387 root@master's password:
388 id_dsa.pub 100% 234 2.0MB/s 00:00
389 </pre>
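
<p>
Once the keys are in place, logging in to the slave nodes from the master
should no longer prompt for a password (the very first connection may still ask
you to confirm the host key):
</p>

<pre caption="Verifying passwordless logins">
# <i>ssh node01 hostname</i>
node01
# <i>ssh node02 hostname</i>
node02
</pre>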
390
391 <note>
392 Host keys must have an empty passphrase. RSA is required for host-based
393 authentication.
394 </note>
395
396 <p>
397 For host-based authentication, you will also need to edit your
398 <path>/etc/ssh/shosts.equiv</path>.
399 </p>
400
401 <pre caption="/etc/ssh/shosts.equiv">
402 node01.adelie
403 node02.adelie
404 master.adelie
405 </pre>
406
407 <p>
408 And make a few modifications to the <path>/etc/ssh/sshd_config</path> file:
409 </p>
410
411 <pre caption="sshd configurations">
412 # $OpenBSD: sshd_config,v 1.42 2001/09/20 20:57:51 mouring Exp $
413 # This sshd was compiled with PATH=/usr/bin:/bin:/usr/sbin:/sbin
414
415 # This is the sshd server system-wide configuration file. See sshd(8)
416 # for more information.
417
418 # HostKeys for protocol version 2
419 HostKey /etc/ssh/ssh_host_rsa_key
420 </pre>
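
<p>
Host-based authentication is usually disabled by default. In addition to the
HostKey line above, you will most likely need to enable it explicitly on both
the server and the client side; the options below are standard OpenSSH
keywords, but the exact defaults vary between versions:
</p>

<pre caption="Enabling host-based authentication (sketch)">
<comment>(/etc/ssh/sshd_config on every node)</comment>
HostbasedAuthentication yes

<comment>(/etc/ssh/ssh_config on every node, so the client offers its host key)</comment>
HostbasedAuthentication yes
EnableSSHKeysign yes
</pre>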
421
422 <p>
423 If your applications require RSH communications, you will need to emerge
424 net-misc/netkit-rsh and sys-apps/xinetd.
425 </p>
426
427 <pre caption="Installing necessary applications">
428 # <i>emerge -p xinetd</i>
429 # <i>emerge xinetd</i>
430 # <i>emerge -p netkit-rsh</i>
431 # <i>emerge netkit-rsh</i>
432 </pre>
433
434 <p>
435 Then configure the rsh daemon. Edit your <path>/etc/xinetd.d/rsh</path> file.
436 </p>
437
438 <pre caption="rsh">
439 # Adelie Linux Research &amp; Development Center
440 # /etc/xinetd.d/rsh
441
442 service shell
443 {
444 socket_type = stream
445 protocol = tcp
446 wait = no
447 user = root
448 group = tty
449 server = /usr/sbin/in.rshd
450 log_type = FILE /var/log/rsh
451 log_on_success = PID HOST USERID EXIT DURATION
452 log_on_failure = USERID ATTEMPT
453 disable = no
454 }
455 </pre>
456
457 <p>
458 Edit your <path>/etc/hosts.allow</path> to permit rsh connections:
459 </p>
460
461 <pre caption="hosts.allow">
462 # Adelie Linux Research &amp; Development Center
463 # /etc/hosts.allow
464
465 in.rshd:192.168.1.0/255.255.255.0
466 </pre>
467
468 <p>
469 Or you can simply trust your cluster LAN:
470 </p>
471
472 <pre caption="hosts.allow">
473 # Adelie Linux Research &amp; Development Center
474 # /etc/hosts.allow
475
476 ALL:192.168.1.0/255.255.255.0
477 </pre>
478
479 <p>
480 Finally, configure host authentication in <path>/etc/hosts.equiv</path>.
481 </p>
482
483 <pre caption="hosts.equiv">
484 # Adelie Linux Research &amp; Development Center
485 # /etc/hosts.equiv
486
487 master
488 node01
489 node02
490 </pre>
491
492 <p>
493 And, add xinetd to your default runlevel:
494 </p>
495
496 <pre caption="Adding xinetd to the default runlevel">
497 # <i>rc-update add xinetd default</i>
498 </pre>
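
<p>
With xinetd running and the access files in place, an rsh from the master node
should return without prompting for a password:
</p>

<pre caption="Verifying rsh from the master node">
# <i>/etc/init.d/xinetd start</i>
# <i>rsh node01 hostname</i>
node01
</pre>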
499
500 </body>
501 </section>
502 <section>
503 <title>NTP</title>
504 <body>
505
506 <p>
507 The Network Time Protocol (NTP) is used to synchronize the time of a computer
508 client or server to another server or reference time source, such as a radio
509 or satellite receiver or modem. It provides accuracies typically within a
510 millisecond on LANs and up to a few tens of milliseconds on WANs relative to
511 Coordinated Universal Time (UTC) via a Global Positioning Service (GPS)
512 receiver, for example. Typical NTP configurations utilize multiple redundant
513 servers and diverse network paths in order to achieve high accuracy and
514 reliability.
515 </p>
516
517 <p>
518 Select an NTP server geographically close to you from <uri
519 link="http://www.eecis.udel.edu/~mills/ntp/servers.html">Public NTP Time
520 Servers</uri>, and configure your <path>/etc/conf.d/ntp</path> and
521 <path>/etc/ntp.conf</path> files on the master node.
522 </p>
523
524 <pre caption="Master /etc/conf.d/ntp">
525 # Copyright 1999-2002 Gentoo Technologies, Inc.
526 # Distributed under the terms of the GNU General Public License v2
527 # /etc/conf.d/ntpd
528
529 # NOTES:
530 # - NTPDATE variables below are used if you wish to set your
531 # clock when you start the ntp init.d script
532 # - make sure that the NTPDATE_CMD will close by itself ...
533 # the init.d script will not attempt to kill/stop it
534 # - ntpd will be used to maintain synchronization with a time
535 # server regardless of what NTPDATE is set to
536 # - read each of the comments above each of the variable
537
538 # Comment this out if you dont want the init script to warn
539 # about not having ntpdate setup
540 NTPDATE_WARN="n"
541
542 # Command to run to set the clock initially
543 # Most people should just uncomment this line ...
544 # however, if you know what you're doing, and you
545 # want to use ntpd to set the clock, change this to 'ntpd'
546 NTPDATE_CMD="ntpdate"
547
548 # Options to pass to the above command
549 # Most people should just uncomment this variable and
550 # change 'someserver' to a valid hostname which you
551 # can aquire from the URL's below
552 NTPDATE_OPTS="-b ntp1.cmc.ec.gc.ca"
553
554 ##
555 # A list of available servers is available here:
556 # http://www.eecis.udel.edu/~mills/ntp/servers.html
557 # Please follow the rules of engagement and use a
558 # Stratum 2 server (unless you qualify for Stratum 1)
559 ##
560
561 # Options to pass to the ntpd process that will *always* be run
562 # Most people should not uncomment this line ...
563 # however, if you know what you're doing, feel free to tweak
564 #NTPD_OPTS=""
565
566 </pre>
567
568 <p>
569 Edit your <path>/etc/ntp.conf</path> file on the master to set up an external
570 synchronization source:
571 </p>
572
573 <pre caption="Master ntp.conf">
574 # Adelie Linux Research &amp; Development Center
575 # /etc/ntp.conf
576
577 # Synchronization source #1
578 server ntp1.cmc.ec.gc.ca
579 restrict ntp1.cmc.ec.gc.ca
580 # Synchronization source #2
581 server ntp2.cmc.ec.gc.ca
582 restrict ntp2.cmc.ec.gc.ca
583 stratum 10
584 driftfile /etc/ntp.drift.server
585 logfile /var/log/ntp
586 broadcast 192.168.1.255
587 restrict default kod
588 restrict 127.0.0.1
589 restrict 192.168.1.0 mask 255.255.255.0
590 </pre>
591
592 <p>
593 And on all your slave nodes, set up your master node as the synchronization
594 source.
595 </p>
596
597 <pre caption="Node /etc/conf.d/ntp">
598 # Copyright 1999-2002 Gentoo Technologies, Inc.
599 # Distributed under the terms of the GNU General Public License v2
600 # /etc/conf.d/ntpd
601
602 NTPDATE_WARN="n"
603 NTPDATE_CMD="ntpdate"
604 NTPDATE_OPTS="-b master"
605 </pre>
606
607 <pre caption="Node ntp.conf">
608 # Adelie Linux Research &amp; Development Center
609 # /etc/ntp.conf
610
611 # Synchronization source #1
612 server master
613 restrict master
614 stratum 11
615 driftfile /etc/ntp.drift.server
616 logfile /var/log/ntp
617 restrict default kod
618 restrict 127.0.0.1
619 </pre>
620
621 <p>
622 Then add ntpd to the default runlevel of all your nodes:
623 </p>
624
625 <pre caption="Adding ntpd to the default runlevel">
626 # <i>rc-update add ntpd default</i>
627 </pre>
628
629 <note>
630 NTP will not update the local clock if the time difference between your
631 synchronization source and the local clock is too great.
632 </note>
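
<p>
If a node's clock is too far off, you can step it once by hand and then check
that ntpd sees its peers. On the slave nodes the synchronization source is
<c>master</c>; on the master node, use your upstream server instead:
</p>

<pre caption="Stepping the clock and checking peers">
# <i>/etc/init.d/ntpd stop</i>
# <i>ntpdate -b master</i>
# <i>/etc/init.d/ntpd start</i>
# <i>ntpq -p</i>
</pre>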
633
634 </body>
635 </section>
636 <section>
637 <title>IPTABLES</title>
638 <body>
639
640 <p>
641 To set up a firewall on your cluster, you will need iptables.
642 </p>
643
644 <pre caption="Installing iptables">
645 # <i>emerge -p iptables</i>
646 # <i>emerge iptables</i>
647 </pre>
648
649 <p>
650 Required kernel configuration:
651 </p>
652
653 <pre caption="IPtables kernel configuration">
654 CONFIG_NETFILTER=y
655 CONFIG_IP_NF_CONNTRACK=y
656 CONFIG_IP_NF_IPTABLES=y
657 CONFIG_IP_NF_MATCH_STATE=y
658 CONFIG_IP_NF_FILTER=y
659 CONFIG_IP_NF_TARGET_REJECT=y
660 CONFIG_IP_NF_NAT=y
661 CONFIG_IP_NF_NAT_NEEDED=y
662 CONFIG_IP_NF_TARGET_MASQUERADE=y
663 CONFIG_IP_NF_TARGET_LOG=y
664 </pre>
665
666 <p>
667 And the rules required for this firewall:
668 </p>
669
670 <pre caption="rule-save">
671 # Adelie Linux Research &amp; Development Center
672 # /var/lib/iptables/rule-save
673
674 *filter
675 :INPUT ACCEPT [0:0]
676 :FORWARD ACCEPT [0:0]
677 :OUTPUT ACCEPT [0:0]
678 -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
679 -A INPUT -p tcp -m tcp --dport 22 -j ACCEPT
680 -A INPUT -s 192.168.1.0/255.255.255.0 -i eth1 -j ACCEPT
681 -A INPUT -s 127.0.0.1 -i lo -j ACCEPT
682 -A INPUT -p icmp -j ACCEPT
683 -A INPUT -j LOG
684 -A INPUT -j REJECT --reject-with icmp-port-unreachable
685 COMMIT
686 *nat
687 :PREROUTING ACCEPT [0:0]
688 :POSTROUTING ACCEPT [0:0]
689 :OUTPUT ACCEPT [0:0]
690 -A POSTROUTING -s 192.168.1.0/255.255.255.0 -j MASQUERADE
691 COMMIT
692 </pre>
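
<p>
For the MASQUERADE rule to be useful, the master node must also forward packets
between its two interfaces. You can load the rules and enable forwarding by
hand as shown below; the iptables init script (added to the runlevel afterwards)
should then restore the saved rules at boot:
</p>

<pre caption="Loading the rules and enabling forwarding on the master">
# <i>iptables-restore &lt; /var/lib/iptables/rule-save</i>
# <i>echo 1 > /proc/sys/net/ipv4/ip_forward</i>
</pre>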
693
694 <p>
695 Then add iptables to the default runlevel of all your nodes:
696 </p>
697
698 <pre caption="Adding iptables to the default runlevel">
699 # <i>rc-update add iptables default</i>
700 </pre>
701
702 </body>
703 </section>
704 </chapter>
705
706 <chapter>
707 <title>HPC Tools</title>
708 <section>
709 <title>OpenPBS</title>
710 <body>
711
712 <p>
713 The Portable Batch System (PBS) is a flexible batch queueing and workload
714 management system originally developed for NASA. It operates on networked,
715 multi-platform UNIX environments, including heterogeneous clusters of
716 workstations, supercomputers, and massively parallel systems. Development of
717 PBS is provided by Altair Grid Technologies.
718 </p>
719
720 <pre caption="Installing openpbs">
721 # <i>emerge -p openpbs</i>
722 </pre>
723
724 <note>
725 The OpenPBS ebuild does not currently set proper permissions on the var directories
726 used by OpenPBS.
727 </note>
728
729 <p>
730 Before you start using OpenPBS, some configuration is required. The files
731 you will need to personalize for your system are listed below, followed by a short sketch of a few of them:
732 </p>
733
734 <ul>
735 <li>/etc/pbs_environment</li>
736 <li>/var/spool/PBS/server_name</li>
737 <li>/var/spool/PBS/server_priv/nodes</li>
738 <li>/var/spool/PBS/mom_priv/config</li>
739 <li>/var/spool/PBS/sched_priv/sched_config</li>
740 </ul>
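
<p>
As a rough illustration only (paths and exact syntax may differ between OpenPBS
versions), the host-related files can be as simple as the following:
</p>

<pre caption="Minimal OpenPBS configuration files (sketch)">
<comment>(/var/spool/PBS/server_name -- the name of the master node)</comment>
master

<comment>(/var/spool/PBS/server_priv/nodes -- one line per node, np = processors)</comment>
node01 np=1
node02 np=1

<comment>(/var/spool/PBS/mom_priv/config -- minimal pbs_mom configuration)</comment>
$clienthost master
$logevent 0x1ff
</pre>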
741
742 <p>
743 Here is a sample configuration that creates the queues and sets the server attributes (fed to <c>qmgr</c>):
744 </p>
745
746 <pre caption="Sample queue and server configuration (qmgr)">
747 #
748 # Create queues and set their attributes.
749 #
750 #
751 # Create and define queue upto4nodes
752 #
753 create queue upto4nodes
754 set queue upto4nodes queue_type = Execution
755 set queue upto4nodes Priority = 100
756 set queue upto4nodes resources_max.nodect = 4
757 set queue upto4nodes resources_min.nodect = 1
758 set queue upto4nodes enabled = True
759 set queue upto4nodes started = True
760 #
761 # Create and define queue default
762 #
763 create queue default
764 set queue default queue_type = Route
765 set queue default route_destinations = upto4nodes
766 set queue default enabled = True
767 set queue default started = True
768 #
769 # Set server attributes.
770 #
771 set server scheduling = True
772 set server acl_host_enable = True
773 set server default_queue = default
774 set server log_events = 511
775 set server mail_from = adm
776 set server query_other_jobs = True
777 set server resources_default.neednodes = 1
778 set server resources_default.nodect = 1
779 set server resources_default.nodes = 1
780 set server scheduler_iteration = 60
781 </pre>
782
783 <p>
784 To submit a task to OpenPBS, the command <c>qsub</c> is used with some
785 optional parameters. In the example below, "-l" allows you to specify
786 the resources required, "-j" provides for redirection of standard out and
787 standard error, and "-m" will e-mail the user at the beginning (b), end (e)
788 and abort (a) of the job.
789 </p>
790
791 <pre caption="Submitting a task">
792 <comment>(submit and request from OpenPBS that myscript be executed on 2 nodes)</comment>
793 # <i>qsub -l nodes=2 -j oe -m abe myscript</i>
794 </pre>
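
<p>
The same options can also be embedded in the job script itself using #PBS
directives, which keeps the <c>qsub</c> command line short. A small,
hypothetical example:
</p>

<pre caption="Sample PBS job script">
#!/bin/sh
#PBS -l nodes=2
#PBS -j oe
#PBS -m abe
cd $PBS_O_WORKDIR
./myscript
</pre>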
795
796 <p>
797 Normally jobs submitted to OpenPBS are in the form of scripts. Sometimes, you
798 may want to try a task manually. To request an interactive shell from OpenPBS,
799 use the "-I" parameter.
800 </p>
801
802 <pre caption="Requesting an interactive shell">
803 # <i>qsub -I</i>
804 </pre>
805
806 <p>
807 To check the status of your jobs, use the qstat command:
808 </p>
809
810 <pre caption="Checking the status of the jobs">
811 # <i>qstat</i>
812 Job id Name User Time Use S Queue
813 ------ ---- ---- -------- - -----
814 2.geist STDIN adelie 0 R upto4nodes
815 </pre>
816
817 </body>
818 </section>
819 <section>
820 <title>MPICH</title>
821 <body>
822
823 <p>
824 Message passing is a paradigm used widely on certain classes of parallel
825 machines, especially those with distributed memory. MPICH is a freely
826 available, portable implementation of MPI, the Standard for message-passing
827 libraries.
828 </p>
829
830 <p>
831 The mpich ebuild provided by Adelie Linux allows for two USE flags:
832 <e>doc</e> and <e>crypt</e>. <e>doc</e> will cause documentation to be
833 installed, while <e>crypt</e> will configure MPICH to use <c>ssh</c> instead
834 of <c>rsh</c>.
835 </p>
836
837 <pre caption="Installing the mpich application">
838 # <i>emerge -p mpich</i>
839 # <i>emerge mpich</i>
840 </pre>
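
<p>
If you want MPICH to use <c>ssh</c> instead of <c>rsh</c>, enable the
<e>crypt</e> flag either in <path>/etc/make.conf</path> or just for this single
emerge:
</p>

<pre caption="Building mpich with the crypt USE flag">
# <i>USE="crypt" emerge mpich</i>
</pre>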
841
842 <p>
843 You may need to export an mpich work directory to all your slave nodes in
844 <path>/etc/exports</path>:
845 </p>
846
847 <pre caption="/etc/exports">
848 /home *(rw)
849 </pre>
850
851 <p>
852 Most massively parallel processors (MPPs) provide a way to start a program on
853 a requested number of processors; <c>mpirun</c> makes use of the appropriate
854 command whenever possible. In contrast, workstation clusters require that each
855 process in a parallel job be started individually, though programs to help
856 start these processes exist. Because workstation clusters are not already
857 organized as an MPP, additional information is required to make use of them.
858 Mpich should be installed with a list of participating workstations in the
859 file <path>machines.LINUX</path> in the directory
860 <path>/usr/share/mpich/</path>. This file is used by <c>mpirun</c> to choose
861 processors to run on.
862 </p>
863
864 <p>
865 Edit this file to reflect your cluster LAN configuration:
866 </p>
867
868 <pre caption="/usr/share/mpich/machines.LINUX">
869 # Change this file to contain the machines that you want to use
870 # to run MPI jobs on. The format is one host name per line, with either
871 # hostname
872 # or
873 # hostname:n
874 # where n is the number of processors in an SMP. The hostname should
875 # be the same as the result from the command "hostname"
876 master
877 node01
878 node02
879 # node03
880 # node04
881 # ...
882 </pre>
883
884 <p>
885 Use the script <c>tstmachines</c> in <path>/usr/sbin/</path> to ensure that
886 you can use all of the machines that you have listed. This script performs
887 an <c>rsh</c> and a short directory listing; this tests both that you have
888 access to the node and that a program in the current directory is visible on
889 the remote node. If there are any problems, they will be listed. These
890 problems must be fixed before proceeding.
891 </p>
892
893 <p>
894 The only argument to <c>tstmachines</c> is the name of the architecture; this
895 is the same name as the extension on the machines file. For example, the
896 following tests that a program in the current directory can be executed by
897 all of the machines in the LINUX machines list.
898 </p>
899
900 <pre caption="Running a test">
901 # <i>/usr/sbin/tstmachines LINUX</i>
902 </pre>
903
904 <note>
905 This program is silent if all is well; if you want to see what it is doing,
906 use the -v (for verbose) argument:
907 </note>
908
909 <pre caption="Running a test verbosively">
910 # <i>/usr/sbin/tstmachines -v LINUX</i>
911 </pre>
912
913 <p>
914 The output from this command might look like:
915 </p>
916
917 <pre caption="Output of the above command">
918 Trying true on host1.uoffoo.edu ...
919 Trying true on host2.uoffoo.edu ...
920 Trying ls on host1.uoffoo.edu ...
921 Trying ls on host2.uoffoo.edu ...
922 Trying user program on host1.uoffoo.edu ...
923 Trying user program on host2.uoffoo.edu ...
924 </pre>
925
926 <p>
927 If <c>tstmachines</c> finds a problem, it will suggest possible reasons and
928 solutions. In brief, there are three tests:
929 </p>
930
931 <ul>
932 <li>
933 <e>Can processes be started on remote machines?</e> tstmachines attempts
934 to run the shell command true on each machine in the machines file by
935 using the remote shell command.
936 </li>
937 <li>
938 <e>Is the current working directory available to all machines?</e> This
939 attempts to ls a file that tstmachines creates by running ls using the
940 remote shell command.
941 </li>
942 <li>
943 <e>Can user programs be run on remote systems?</e> This checks that shared
944 libraries and other components have been properly installed on all
945 machines.
946 </li>
947 </ul>
948
949 <p>
950 And the required test for every development tool:
951 </p>
952
953 <pre caption="Testing a development tool">
954 # <i>cd ~</i>
955 # <i>cp /usr/share/mpich/examples1/hello++.c ~</i>
956 # <i>make hello++</i>
957 # <i>mpirun -machinefile /usr/share/mpich/machines.LINUX -np 1 hello++</i>
958 </pre>
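
<p>
If the examples' Makefile is not available on your system, the same test can be
compiled directly with the MPICH compiler wrapper (here <c>mpiCC</c>, the C++
wrapper; wrapper names can vary between MPICH builds) and run on more than one
node:
</p>

<pre caption="Compiling and running the example by hand">
# <i>mpiCC -o hello++ hello++.c</i>
# <i>mpirun -machinefile /usr/share/mpich/machines.LINUX -np 2 hello++</i>
</pre>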
959
960 <p>
961 For further information on MPICH, consult the documentation at <uri
962 link="http://www-unix.mcs.anl.gov/mpi/mpich/docs/mpichman-chp4/mpichman-chp4.htm">http://www-unix.mcs.anl.gov/mpi/mpich/docs/mpichman-chp4/mpichman-chp4.htm</uri>.
963 </p>
964
965 </body>
966 </section>
967 <section>
968 <title>LAM</title>
969 <body>
970
971 <p>
972 (Coming Soon!)
973 </p>
974
975 </body>
976 </section>
977 <section>
978 <title>OMNI</title>
979 <body>
980
981 <p>
982 (Coming Soon!)
983 </p>
984
985 </body>
986 </section>
987 </chapter>
988
989 <chapter>
990 <title>Bibliography</title>
991 <section>
992 <body>
993
994 <p>
995 The original document is published at the <uri
996 link="http://www.adelielinux.com">Adelie Linux R&amp;D Centre</uri> web site,
997 and is reproduced here with the permission of the authors and <uri
998 link="http://www.cyberlogic.ca">Cyberlogic</uri>'s Adelie Linux R&amp;D
999 Centre.
1000 </p>
1001
1002 <ul>
1003 <li><uri>http://www.gentoo.org</uri>, Gentoo Technologies, Inc.</li>
1004 <li>
1005 <uri link="http://www.adelielinux.com">http://www.adelielinux.com</uri>,
1006 Adelie Linux Research and Development Centre
1007 </li>
1008 <li>
1009 <uri link="http://nfs.sourceforge.net/">http://nfs.sourceforge.net</uri>,
1010 Linux NFS Project
1011 </li>
1012 <li>
1013 <uri link="http://www-unix.mcs.anl.gov/mpi/mpich/">http://www-unix.mcs.anl.gov/mpi/mpich/</uri>,
1014 Mathematics and Computer Science Division, Argonne National Laboratory
1015 </li>
1016 <li>
1017 <uri link="http://www.ntp.org/">http://ntp.org</uri>
1018 </li>
1019 <li>
1020 <uri link="http://www.eecis.udel.edu/~mills/">http://www.eecis.udel.edu/~mills/</uri>,
1021 David L. Mills, University of Delaware
1022 </li>
1023 <li>
1024 <uri link="http://www.ietf.org/html.charters/secsh-charter.html">http://www.ietf.org/html.charters/secsh-charter.html</uri>,
1025 Secure Shell Working Group, IETF, Internet Society
1026 </li>
1027 <li>
1028 <uri link="http://www.linuxsecurity.com/">http://www.linuxsecurity.com/</uri>,
1029 Guardian Digital
1030 </li>
1031 <li>
1032 <uri link="http://www.openpbs.org/">http://www.openpbs.org/</uri>,
1033 Altair Grid Technologies, LLC.
1034 </li>
1035 </ul>
1036
1037 </body>
1038 </section>
1039 </chapter>
1040
1041 </guide>
